Once and for all: what is the intuition behind warming up the learning rate? I understand why it makes sense to decay the learning rate. But why should it start small and rise?

The intuitions behind warmup, a summary 🧵 I asked what are the intuitions behind warm-up (I had none). I got many answers (and 2 papers) in the cited tweet and thought to give something back. Now they are digestible. Thread unroll: twitter.com/LChoshen/statu…

Warm up: The practice of starting with a low-ish learning rate and then increasing it. This is done early in training.
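To make the definition concrete, here is a minimal sketch of one common form, linear warmup. The thread does not say which shape the ramp takes, so the linear ramp, the function name `linear_warmup_lr`, and the default values (`peak_lr`, `warmup_steps`, `start_lr`) are all assumptions chosen for illustration.

```python
def linear_warmup_lr(step, peak_lr=1e-3, warmup_steps=1000, start_lr=0.0):
    """Learning rate at `step` under linear warmup.

    Hypothetical helper: the names and defaults are illustrative only.
    """
    if step >= warmup_steps:
        # Warmup is over: stay at the peak learning rate.
        return peak_lr
    # Fraction of the warmup phase completed so far, in [0, 1).
    frac = step / warmup_steps
    # Interpolate linearly from start_lr up to peak_lr.
    return start_lr + frac * (peak_lr - start_lr)


if __name__ == "__main__":
    for s in (0, 250, 500, 1000, 5000):
        print(s, linear_warmup_lr(s))
```

The learning rate starts at `start_lr`, rises steadily during the first `warmup_steps` updates, and then stays at `peak_lr`. In practice a decay schedule usually takes over from that point.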

Do not confuse it with learning rate decay, where the learning rate decreases throughout training. How decay helps: at the beginning you move fast and get near the loss pit (minimum), and then you make smaller steps to avoid overshooting (jumping over the minimum pit).
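For contrast with warmup, here is a rough sketch of a learning rate decay schedule. The thread does not name a specific decay shape, so cosine decay is just one popular choice picked as an assumption. The function name `cosine_decay_lr` and its defaults are likewise hypothetical.

```python
import math


def cosine_decay_lr(step, total_steps=10_000, peak_lr=1e-3, min_lr=0.0):
    """Learning rate at `step` under cosine decay.

    Hypothetical helper: cosine is one assumed decay shape among many.
    """
    # Progress through training, clamped to [0, 1].
    progress = min(step, total_steps) / total_steps
    # The cosine factor goes from 1 (start) down to 0 (end of training).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    for s in (0, 2_500, 5_000, 10_000):
        print(s, cosine_decay_lr(s))
```

Here the steps are largest at the start and shrink toward `min_lr`, which matches the "move fast, then take smaller steps" intuition. Warmup is the opposite movement, confined to the first few updates.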

Empirical finding: warmup helps performance and generalization. (This was known in the literature, although it is hard to look up because it is usually a side note rather than the papers' main point, help? It was also repeated in the thread, e.g. @DrorSimon )

@LChoshen @DrorSimon @priy2201 arxiv.org/abs/1706.02677 was one of the main papers making the empirical benefit of warm-up widely known.