The intuitions behind warmup, a summary 🧵 I asked what the intuitions behind warm-up are (I had none). I got many answers (and 2 papers) in the quoted tweet and thought I'd give something back. Now they are digestible. Thread unroll:
Warm-up: the practice of starting with a low-ish learning rate and gradually increasing it. This is done early in training.
Do not confuse it with learning rate decay, where the learning rate decreases throughout training. How decay helps: at the beginning you move fast and get near the loss pit (minimum), and then you make smaller steps to avoid overshooting (jumping over the minimum pit)
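To make the two concrete, here's a minimal sketch of a common schedule: linear warmup followed by cosine decay. Pure Python; the function name and every hyperparameter here are made up for illustration, not from any specific paper or library.

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-3,
               warmup_steps=500, min_lr=1e-5):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Warmup phase: ramp the learning rate up from ~0 to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Decay phase: shrink steps over time to avoid overshooting the minimum.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Peek at the schedule: rises for 500 steps, then decays.
for s in [0, 100, 499, 500, 5000, 9999]:
    print(s, lr_at_step(s, total_steps=10_000))
```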
Empirical finding: warmup helps performance and generalization. (This was known in the literature [although hard to find, as it's usually a side note rather than the paper's point, help?] but it was also repeated in the thread, e.g. @DrorSimon )
Next, I present the intuitions that came up. Note that it doesn't have to be just one: it could be that all of them are part of the reason, with some explaining more of the improvement we see than others.
1⃣The first intuition is quite undeniable. Many optimizers approximate gradient statistics with a rolling average, and an average needs a bit of data to be reliable. In the extreme of averaging over 0 batches, we essentially have SGD. Example: Adam's momentum approximates the average update direction from previous steps
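A tiny sketch of how shaky that rolling average is at the start. The numbers are synthetic and the helper name is made up; the only thing taken from Adam is its exponential-moving-average formula and bias correction.

```python
import random

def ema_estimates(values, beta=0.9):
    """Exponential moving average, raw and with Adam-style bias correction."""
    m = 0.0
    for t, v in enumerate(values, start=1):
        m = beta * m + (1 - beta) * v
        m_hat = m / (1 - beta ** t)  # Adam's bias correction term
        yield t, m, m_hat

random.seed(0)
# Fake "gradients" with true mean ~1.0 plus noise.
grads = [1.0 + random.gauss(0, 0.1) for _ in range(20)]
for t, m, m_hat in ema_estimates(grads):
    if t in (1, 2, 5, 20):
        # Raw EMA starts near 0 (biased); even corrected, early estimates are noisy.
        print(f"step {t:2d}: raw EMA {m:.3f}, bias-corrected {m_hat:.3f}")
```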
2⃣Consider an untrained network (pre-training) to sit at the peak of the loss mountain. The first updates would be large, fast down the steep mountain. We don't want to make those too large with our learning rate. Cold
After a few steps we get closer to a plateau. This is the place where large learning rates help the most: a lot of distance to cover, no fine-tuning needed. Getting warmer.
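A toy illustration of this intuition on a made-up 1-D loss, f(x) = x^4, which is very steep far from the minimum: the same peak learning rate blows up without warmup but is fine with it. Every number here is invented for the demo, not from a real training run.

```python
def gd(lr_schedule, x0=10.0, steps=200):
    """Gradient descent on f(x) = x**4; its gradient 4x**3 is huge far out."""
    x = x0
    for t in range(steps):
        x -= lr_schedule(t) * 4 * x**3
        if abs(x) > 1e6:
            return f"diverged at step {t}"
    return f"x = {x:.4f}"

peak_lr = 0.01
constant = lambda t: peak_lr                            # full LR from step 0
warmup   = lambda t: peak_lr * min(1.0, (t + 1) / 100)  # 100-step linear warmup

print("constant LR:", gd(constant))  # giant first step off the steep slope -> blow-up
print("with warmup:", gd(warmup))    # small early steps, full LR once it's safe
```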