Once and for all: what is the intuition behind warming up the learning rate? I understand why it makes sense to decay the learning rate. But why should it start small and rise?
The intuitions behind warmup, a summary 🧵 I asked what the intuitions behind warm-up are (I had none). I got many answers (and 2 papers) in the cited tweet and thought I'd give something back. Now they are digestible. Thread unroll: twitter.com/LChoshen/statu…
Warm up: The practice of starting with a low-ish learning rate and then increasing it. This is done early in training.
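To make the definition concrete, here is a minimal sketch of one common shape, a linear ramp. The function name `warmup_lr` and the defaults `peak_lr=1e-3` and `warmup_steps=1000` are made up for illustration, not taken from the thread, and real schedules often ramp differently.

```python
def warmup_lr(step: int, peak_lr: float = 1e-3, warmup_steps: int = 1000) -> float:
    """Linear warmup: ramp from a small value up to peak_lr, then hold it there.

    All names and defaults here are illustrative assumptions.
    """
    if warmup_steps <= 0:
        return peak_lr  # no warmup requested
    # Fraction of the warmup completed, capped at 1 once warmup is over.
    progress = min(1.0, (step + 1) / warmup_steps)
    return peak_lr * progress


# Learning rate at a few points in training.
for step in (0, 249, 499, 999, 5000):
    print(step, warmup_lr(step))
```

The first steps get a tiny learning rate, it grows linearly, and from step 999 on it stays at the peak value. Any decay would then start from that peak.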
Do not confuse it with learning rate decay, where the learning rate decreases throughout training. (Weight decay is yet another thing: a regularizer that shrinks the weights, not the learning rate.) How decay helps: at the beginning you move fast and get near the loss pit (the minimum), and then you make smaller steps to avoid overshooting (jumping over the pit).
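The overshooting picture can be seen in a toy example. This is a sketch under strong assumptions: the loss is f(x) = x², the starting step size of 1.0 is deliberately too large, and the schedule is a simple exponential one. The helper `descend` and all numbers are hypothetical, chosen only to make the effect visible.

```python
def descend(x0: float, lr0: float, decay: float, steps: int) -> float:
    """Gradient descent on f(x) = x**2 with step size lr0 * decay**t."""
    x = x0
    for t in range(steps):
        grad = 2.0 * x                # derivative of x**2
        x -= lr0 * decay**t * grad    # the step shrinks over time when decay < 1
    return x


# decay=1.0 means the step size never shrinks; decay=0.9 shrinks it by 10% per step.
constant = descend(1.0, lr0=1.0, decay=1.0, steps=20)
decayed = descend(1.0, lr0=1.0, decay=0.9, steps=20)
print("constant step size:", constant)
print("decayed step size: ", decayed)
```

With the constant step size, every update jumps from one side of the pit to the other (x flips between 1 and -1) and never gets closer. With decay, the first steps are just as large, but the shrinking steps let x settle near the minimum at 0.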
Empirical finding: warmup helps performance and generalization. (This was known in the literature [although it is hard to look up, as it is a side note rather than the papers' main point, help?] but it was also repeated in the thread, e.g. by @DrorSimon.)