The intuitions behind warmup, a summary 🧵 I asked what the intuitions behind warm-up are (I had none). I got many answers (and 2 papers) in the cited tweet and thought to give something back. Now they are digestible.
Warm up: The practice of starting with a low-ish learning rate and then increasing it. This is done early in training.
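A minimal sketch of what such a schedule could look like, assuming a linear ramp. The function name `warmup_lr` and the values of `peak_lr` and `warmup_steps` are made up for illustration; real setups use other ramp shapes and lengths too.

```python
def warmup_lr(step, peak_lr=1e-3, warmup_steps=1000):
    """Linear warmup: ramp the learning rate up to peak_lr, then hold it there."""
    if step < warmup_steps:
        # Early in training: a small fraction of the peak, growing with each step.
        return peak_lr * (step + 1) / warmup_steps
    # After warmup: the full learning rate (a decay schedule would usually take over here).
    return peak_lr


for step in (0, 499, 999, 5000):
    print(step, warmup_lr(step))
```

The returned value would be fed to the optimizer at every step, so the first updates are tiny and grow to full size once `warmup_steps` have passed.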
Do not confuse it with weight decay, where the learning rate decreases throughout training. How decay helps: at the beginning you move fast and get near the loss pit (the minimum), and then you take smaller steps to avoid overshooting (jumping over the pit).
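To make the overshooting picture concrete, here is a toy sketch on the one-dimensional loss f(x) = x**2. The helper `descend`, the starting point and the decay factor 0.9 are all assumptions chosen for illustration, not anything from the thread.

```python
def descend(lr_at, steps=20, x=1.0):
    """Gradient descent on f(x) = x**2 (gradient 2*x); lr_at(t) is the step size at step t."""
    for t in range(steps):
        x = x - lr_at(t) * 2 * x
    return x


# Constant step size of 1.0: every update jumps over the minimum to the mirror point.
constant = descend(lambda t: 1.0)
# Same starting step size, shrunk by 10% per step: the jumps get smaller and settle in the pit.
decayed = descend(lambda t: 1.0 * 0.9 ** t)

print(abs(constant), abs(decayed))
```

With the constant step the iterate keeps bouncing between the two sides of the pit at the same distance from the minimum, while the decaying step lets it end up very close to zero.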
@LChoshen I think you typo'd: s/weight decay/learning-rate decay/ Because weight decay is yet another related thing.
@giffmana Right, so let's fill it in: Weight decay: the network's parameters are shrunk towards zero with each batch update (similar, but not always equal, to an L2 loss on the size of the weights). Learning rate decay: the learning rate is decreased as training progresses.
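A small sketch of the "similar but not always equal" part, on a single scalar weight. With plain SGD, adding an L2 penalty to the loss and shrinking the weight directly give the same update. With an update rule that rescales the gradient they differ; the sign-based rule below is only a toy stand-in for adaptive optimizers, and all function names and numbers are illustrative assumptions.

```python
def sign(x):
    return (x > 0) - (x < 0)


def sgd_l2(w, grad, lr, wd):
    # L2 penalty (wd / 2) * w**2 in the loss adds wd * w to the gradient.
    return w - lr * (grad + wd * w)


def sgd_weight_decay(w, grad, lr, wd):
    # Weight decay: shrink the weight towards zero, separately from the gradient step.
    return w - lr * grad - lr * wd * w


def sign_l2(w, grad, lr, wd):
    # Toy rescaling optimizer: only the sign of the (penalized) gradient is used.
    return w - lr * sign(grad + wd * w)


def sign_weight_decay(w, grad, lr, wd):
    # Same toy optimizer, but the shrinkage bypasses the rescaling.
    return w - lr * sign(grad) - lr * wd * w


w, grad, lr, wd = 2.0, 0.5, 0.1, 0.1
print(sgd_l2(w, grad, lr, wd), sgd_weight_decay(w, grad, lr, wd))    # same update
print(sign_l2(w, grad, lr, wd), sign_weight_decay(w, grad, lr, wd))  # different updates
```

In the second pair the L2 term is swallowed by the rescaling, while the direct shrinkage still pulls the weight towards zero, which is one way the two notions can come apart.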