The intuitions behind warmup, a summary 🧵 I asked what are the intuitions behind warm-up (I had none). I got many answers (and 2 papers) in the cited tweet and thought to give something back. Now they are digestible Thread unroll:
Warm-up: the practice of starting with a low-ish learning rate and then increasing it. This is done early in training.
Do not confuse it with learning rate decay, where the learning rate decreases throughout training. How decay helps: at the beginning you move fast and get near the loss pit (minimum), and then make smaller steps to avoid overshooting (jumping over the minimum pit).
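To make the two concrete, here is a minimal sketch of a schedule that does linear warmup first and then linear decay. The function name and parameters (warmup_steps, base_lr) are illustrative, not from any particular library:

```python
def lr_at_step(step, total_steps=1000, warmup_steps=100, base_lr=1e-3):
    """Linear warmup to base_lr, then linear decay toward 0."""
    if step < warmup_steps:
        # Warmup phase: ramp up from near 0 to base_lr
        return base_lr * (step + 1) / warmup_steps
    # Decay phase: shrink from base_lr toward 0 over the remaining steps
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - warmup_steps)
```

Plotting this gives the familiar triangle shape: rising during the first warmup_steps, peaking at base_lr, then falling for the rest of training.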
Empirical finding: warmup helps performance and generalization. (This was known in the literature [although hard to find, as it is usually a side note rather than a paper's main point, help?] but also repeated in the thread, e.g. @DrorSimon )
Next, I present the intuitions that came up. Note that it doesn't have to be just one of them: all could be part of the reason, and some may explain more of the improvement we see than others.
1⃣The first intuition is quite undeniable. Many optimizers approximate statistics by a rolling average, and an average needs a bit of data to be reliable. In the extreme of averaging over 0 batches, we essentially have SGD. Example: Adam's momentum approximates the average update direction from previous steps.
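A tiny sketch of why the first few steps of such a rolling average are unreliable. Adam keeps an exponential moving average of the gradient, initialized at zero, and corrects the early bias by dividing by (1 - beta**t); the numbers below are a toy illustration, not Adam's full update rule:

```python
beta = 0.9
m = 0.0               # EMA of the gradient, initialized at zero as in Adam
grads = [1.0] * 10    # pretend every gradient is exactly 1.0

for t, g in enumerate(grads, start=1):
    m = beta * m + (1 - beta) * g
    m_hat = m / (1 - beta ** t)  # bias-corrected estimate
    # Raw m starts far below the true mean (m = 0.1 after one step),
    # while m_hat recovers the true value 1.0 immediately.
```

Even with bias correction, second-moment-style statistics need several batches before they reflect the loss landscape rather than noise, which is one reason to keep early steps small.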
2⃣Consider an untrained (not pretrained) network to be on the peak of the loss mountain. The first updates would be large and fast down the steep mountain; we don't want to make them too large with our learning rate. Cold.
After a few steps we get closer to a plateau. This is the place where large learning rates help the most: a lot of distance to cover, no fine adjustments needed. Getting warmer.
Right, but why would that be the case? Well, who knows. Regardless of why, this paper suggests it is the case (maybe the authors have more intuitions?) @jmgilmer @_ghorbani Ankush Garg @snehaark @bneyshabur @dcardoza @GeorgeEDahl @zacharynado @orf_bnw arxiv.org/abs/2110.04369
3⃣A repeating intuition focuses on pretrained weights (#NLProc, but recently general #MachineLearning too). We worked hard for a good initialization for most layers. But not all: at the end you add a fresh layer, e.g. an FC head on BERT (unless you use seq2seq like T5).
Since just the last layer is freshly initialized, it needs more tuning than the rest. Supposedly, warmup adapts it slowly over a lot of data (more stable): the top would change, but the bottom less so.
While this may be the case, I find it a bit hard to understand why a gradual change would alter the bottom layers less than a non-gradual one. But even so,