♻️ Leshem Choshen ♻️ @LChoshen, Twitter Profile

♻️ Leshem Choshen ♻️ @LChoshen

2 years ago

The intuitions behind warm-up, a summary 🧵 I asked what the intuitions behind warm-up are (I had none). I got many answers (and 2 papers) in the cited tweet and thought I should give something back. Here they are in digestible form:

♻️ Leshem Choshen ♻️ @LChoshen

2 years ago

Warm up: The practice of starting with a low-ish learning rate and then increasing it. This is done early in training.
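
To make the definition above concrete, here is a minimal sketch of one common form, a linear warm-up. The function name `warmup_lr` and the values `peak_lr`, `warmup_steps` and `start_lr` are illustrative assumptions, not something from the thread; other ramp shapes are used in practice too.

```python
def warmup_lr(step, peak_lr=1e-3, warmup_steps=1000, start_lr=0.0):
    """Learning rate at a given training step under linear warm-up.

    All names and default values here are hypothetical, chosen for illustration.
    """
    if step >= warmup_steps:
        # Warm-up is over: hold the peak learning rate.
        return peak_lr
    # Early in training: ramp linearly from start_lr up to peak_lr.
    frac = step / warmup_steps
    return start_lr + frac * (peak_lr - start_lr)
```

Calling `warmup_lr(step)` once per optimizer step and using the result as the step size gives the "start low, then increase" behaviour; after `warmup_steps` the rate simply stays at the peak.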

♻️ Leshem Choshen ♻️ @LChoshen

2 years ago

Do not confuse it with weight decay, where the learning rate decreases throughout training. How decay helps: at the beginning you move fast and get near the loss pit (the minimum), and then you take smaller steps to avoid overshooting (jumping over the pit).
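
The overshooting intuition can be seen on a toy problem. The sketch below runs gradient descent on f(x) = x**2, once with a constant step size and once with a step size that shrinks during training. The function names and the constants (a starting rate of 1.0, a decay factor of 0.9) are assumptions picked to make the effect visible, not values from the thread.

```python
def constant_lr(step):
    # Step size that never changes.
    return 1.0


def decayed_lr(step):
    # Step size that shrinks exponentially as training progresses.
    return 1.0 * 0.9 ** step


def descend(lr_schedule, x0=1.0, steps=50):
    """Gradient descent on f(x) = x**2, whose minimum is at x = 0."""
    x = x0
    for step in range(steps):
        grad = 2.0 * x  # derivative of x**2
        x = x - lr_schedule(step) * grad
    return x


x_constant = descend(constant_lr)  # keeps jumping over the pit: x flips sign every step
x_decayed = descend(decayed_lr)    # large early steps, then small ones that settle near 0
```

With the constant rate the iterate bounces between the two sides of the pit and never gets closer; with the decaying rate the first steps are just as big, but the later ones are small enough to land in it.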

Lucas Beyer (bl16) @giffmana

2 years ago

@LChoshen I think you typo'd: s/weight decay/learning-rate decay/ Because weight decay is yet another related thing.

♻️ Leshem Choshen ♻️ @LChoshen

2 years ago

@giffmana Right, so let's fill it in: Weight decay - the parameters of the network are shrunk towards zero with each batch update (similar, but not always equal, to an L2 loss on the size of the weights). Learning rate decay - the learning rate is decreased as training progresses.
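
A small sketch of the "similar but not always equal" remark, for a single scalar weight under plain SGD. The function names and the numbers are hypothetical. Under plain SGD the two updates coincide; with adaptive optimizers such as Adam they generally do not, because the L2 gradient term gets rescaled along with the rest of the gradient.

```python
def sgd_step_l2(w, grad, lr, lam):
    # L2 penalty (lam / 2) * w**2 added to the loss:
    # its gradient, lam * w, is added to the loss gradient before the step.
    return w - lr * (grad + lam * w)


def sgd_step_weight_decay(w, grad, lr, lam):
    # Weight decay applied directly: shrink the weight towards zero,
    # then take the ordinary gradient step.
    return w * (1.0 - lr * lam) - lr * grad


# Illustrative values only.
w, grad, lr, lam = 2.0, 0.5, 0.1, 0.01
w_l2 = sgd_step_l2(w, grad, lr, lam)
w_wd = sgd_step_weight_decay(w, grad, lr, lam)
```

Here `w_l2` and `w_wd` agree up to floating-point rounding, which is the "similar" case. Learning rate decay is a separate knob: it changes `lr` over time and leaves `lam` alone.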
