The intuitions behind warmup, a summary 🧵 I asked what are the intuitions behind warm-up (I had none). I got many answers (and 2 papers) in the cited tweet and thought to give something back. Now they are digestible Thread unroll:
Warm-up: the practice of starting with a low-ish learning rate and then increasing it. This is done early in training.
Do not confuse it with learning rate decay, where the learning rate decreases throughout training. How decay helps: at the beginning you move fast and get near the loss pit (minimum), and then make smaller steps to avoid overshooting (jumping over the minimum pit).
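To make the two concrete, here is a minimal sketch of a schedule that does linear warmup first and then linear decay. The function name and parameters (warmup_steps, base_lr) are illustrative, not from any particular library:

```python
def lr_at_step(step, total_steps=1000, warmup_steps=100, base_lr=1e-3):
    """Linear warmup to base_lr, then linear decay toward 0."""
    if step < warmup_steps:
        # Warmup phase: ramp up from near 0 to base_lr
        return base_lr * (step + 1) / warmup_steps
    # Decay phase: shrink from base_lr toward 0 over the remaining steps
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - warmup_steps)
```

Plotting this gives the familiar triangle shape: rising during the first warmup_steps, peaking at base_lr, then falling for the rest of training.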
Empirical finding: warmup helps performance and generalization. (This was known in the literature [although hard to find, as it is usually a side note rather than a paper's main point, help?] but also repeated in the thread, e.g. @DrorSimon )
Next, I present the intuitions that came up. Note that it doesn't have to be just one of them: all could be part of the reason, and some may explain more of the improvement we see than others.
1⃣The first intuition is quite undeniable. Many optimizers approximate statistics by a rolling average, and an average needs a bit of data to be reliable. In the extreme of averaging over 0 batches, we essentially have SGD. Example: Adam's momentum approximates the average update direction from previous steps.
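A tiny sketch of why the first few steps of such a rolling average are unreliable. Adam keeps an exponential moving average of the gradient, initialized at zero, and corrects the early bias by dividing by (1 - beta**t); the numbers below are a toy illustration, not Adam's full update rule:

```python
beta = 0.9
m = 0.0               # EMA of the gradient, initialized at zero as in Adam
grads = [1.0] * 10    # pretend every gradient is exactly 1.0

for t, g in enumerate(grads, start=1):
    m = beta * m + (1 - beta) * g
    m_hat = m / (1 - beta ** t)  # bias-corrected estimate
    # Raw m starts far below the true mean (m = 0.1 after one step),
    # while m_hat recovers the true value 1.0 immediately.
```

Even with bias correction, second-moment-style statistics need several batches before they reflect the loss landscape rather than noise, which is one reason to keep early steps small.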
2⃣Consider an untrained (not pretrained) network to be on the peak of the loss mountain. The first updates would be large and fast down the steep mountain; we don't want to make them too large with our learning rate. Cold.
After a few steps we get closer to a plateau. This is the place where large learning rates help the most: a lot of distance to cover, no fine adjustments needed. Getting warmer.
Right, but why would that be the case? Well, who knows. Regardless of why, this paper suggests it is the case (maybe the authors have more intuitions?) @jmgilmer @_ghorbani Ankush Garg @snehaark @bneyshabur @dcardoza @GeorgeEDahl @zacharynado @orf_bnw arxiv.org/abs/2110.04369
3⃣A repeating intuition focuses on pretrained weights (#NLProc, but recently general #MachineLearning too). We worked hard for a good initialization for most layers. But not all: at the end you add a fresh layer, e.g. an FC head on BERT (unless you use seq2seq like T5).
Since just the last layer is freshly initialized, it needs more tuning than the rest. Supposedly, warmup adapts it slowly over a lot of data (more stable): the top would change, but the bottom less so.
While this may be the case, I find it a bit hard to understand why a gradual change would alter the bottom layers less than a non-gradual one. But even so,