Once and for all: what is the intuition behind warming up the learning rate? I understand why it makes sense to decay the learning rate, but why should it start small and rise?

The intuitions behind warmup, a summary 🧵 I asked what the intuitions behind warm-up are (I had none). I got many answers (and 2 papers) in the cited tweet and thought to give something back. Now they are digestible. Thread unroll: twitter.com/LChoshen/statu…

Warm-up: the practice of starting with a low-ish learning rate and then increasing it, early in training.
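To make that concrete, here is a minimal sketch of a linear warmup schedule in plain Python (the base rate and step counts are made-up numbers, just for illustration):

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    """Linearly ramp the learning rate from near zero up to base_lr
    over the first `warmup_steps` updates, then hold it constant."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Early steps use a tiny learning rate, later steps the full one.
print(warmup_lr(0))      # ~1e-6
print(warmup_lr(500))    # ~5e-4
print(warmup_lr(2000))   # 1e-3
```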

Do not confuse it with weight decay, where the learning rate decreases throughout training. How decay helps: at the beginning you move fast and get near the loss pit (the minimum), and then you take smaller steps to avoid overshooting (jumping over the minimum).
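For contrast with warmup, a sketch of a simple decay schedule for the learning rate (a step-wise exponential decay here; the constants are arbitrary):

```python
def decayed_lr(step, base_lr=1e-3, decay_rate=0.5, decay_every=10_000):
    """Start at base_lr and halve it every `decay_every` steps:
    big steps early (move fast toward the minimum),
    small steps late (avoid overshooting it)."""
    return base_lr * decay_rate ** (step // decay_every)

print(decayed_lr(0))        # 1e-3
print(decayed_lr(25_000))   # 2.5e-4
```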

@LChoshen I think you typo'd: s/weight decay/learning-rate decay/. Because weight decay is yet another, related thing.

@giffmana Right, so let's fill it in:
Weight decay - the parameters of the network are pushed towards zero with each batch update (similar, but not always equal, to an L2 loss over the weight magnitudes).
Learning-rate decay - the learning rate is decreased over the course of training.
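To make the distinction concrete, a toy SGD update showing both knobs side by side (the function name and constants are mine, purely illustrative):

```python
def sgd_step(w, grad, step, base_lr=0.1, lr_decay=0.99, weight_decay=1e-4):
    """One toy SGD update illustrating the two different 'decays'.

    Learning-rate decay: the step size lr shrinks as training goes on.
    Weight decay: the parameter itself is pulled toward zero each update
    (for plain SGD this matches adding an L2 penalty on w).
    """
    lr = base_lr * (lr_decay ** step)              # learning-rate decay
    return w - lr * grad - lr * weight_decay * w   # last term is weight decay

w = 1.0
print(sgd_step(w, grad=0.5, step=0))     # early in training: large step
print(sgd_step(w, grad=0.5, step=500))   # late in training: tiny step
```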