ML twitter: Has anyone studied what's so special about the noise distribution of small-batch gradients such that they're better than large-batch gradients? Surely it's richer than just large-batch gradients + some Gaussian noise?
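(One concrete way to probe the question, sketched below as a toy PyTorch setup of my own, not anything from the thread: compute per-example gradients for a linear least-squares model and measure the excess kurtosis of the gradient noise along a random direction. If the noise were just isotropic Gaussian, the excess kurtosis would be near zero.)

```python
# Minimal sketch (my own toy setup): is small-batch gradient noise "just Gaussian"?
# For a linear model with squared loss, per-example gradients have a closed form,
# so we can look at the noise distribution directly.
import torch

torch.manual_seed(0)
n, d = 512, 20
X = torch.randn(n, d)
y = torch.randn(n)
w = torch.randn(d)

# Per-example gradient of (x_i . w - y_i)^2 w.r.t. w is 2 * (x_i . w - y_i) * x_i
resid = X @ w - y
per_example_grads = 2 * resid.unsqueeze(1) * X          # shape (n, d)
noise = per_example_grads - per_example_grads.mean(0)   # deviation from full-batch grad

proj = noise @ torch.randn(d)                # noise projected onto a random direction
z = (proj - proj.mean()) / proj.std()
excess_kurtosis = ((z**4).mean() - 3.0).item()
print(f"excess kurtosis of gradient noise: {excess_kurtosis:.2f}")  # > 0: heavier-tailed than Gaussian
```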
@SamuelAinsworth What are you measuring where small batches are better? If you tune the hyperparams properly, batch sizes should scale just fine (arxiv.org/abs/1811.03600, arxiv.org/abs/1907.04164, arxiv.org/abs/2102.06356)
@zacharynado @SamuelAinsworth Maybe I'm missing something, but I think for some problems large batch sizes do lead to worse final accuracy. For instance, Table 2 in arxiv.org/abs/2109.14119 reports that, without extra tricks, full-batch gradient descent reaches accuracy 87.36 (±1.23) versus 95.70 (±0.11) for SGD.
@konstmish @SamuelAinsworth What I should've said in my previous tweet: I think you can scale batches very large by tuning existing regularization methods. I wonder how well GD would do if they just tuned weight decay? It does make sense to me that at extremes like that paper you may need to introduce new reg tricks.
@zacharynado @konstmish I guess an alternate spin on my question is: small batches have some sort of regularizing effect, but what is it formally? E.g. adding Gaussian noise to my gradients corresponds to a Gaussian blur of my objective function. Could we formalize what small-batch regularization corresponds to?
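(One way to make the "blur" claim precise is randomized smoothing: for ε ~ N(0, σ²I), E[∇f(θ + ε)] = ∇(f ∗ N_σ)(θ), i.e. Gaussian noise on the *parameters*, rather than added to the gradient itself, gives an unbiased gradient of the Gaussian-blurred objective. Below is a minimal numerical check of that identity, my own sketch, for f = sin, where the blur has a closed form.)

```python
# Sketch: check E_eps[f'(x + eps)] ≈ d/dx (f * N_sigma)(x) for f = sin.
# The Gaussian blur of sin has the closed form exp(-sigma^2 / 2) * sin(x),
# so its gradient is exp(-sigma^2 / 2) * cos(x).
import numpy as np

rng = np.random.default_rng(0)
x, sigma = 1.3, 0.5
eps = rng.normal(0.0, sigma, size=1_000_000)

mc_grad = np.cos(x + eps).mean()                # Monte Carlo: E[f'(x + eps)]
blur_grad = np.exp(-sigma**2 / 2) * np.cos(x)   # exact gradient of the blurred objective
print(f"{mc_grad:.4f} vs {blur_grad:.4f}")      # should agree to ~3 decimals
```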
@SamuelAinsworth @zacharynado @konstmish I remember reading this take, but don't remember where: as soon as you use BN, *the batch is an example* and the "dataset" is actually the collection of all possible unique batches. From this perspective, there should be a sweet-spot batch size that corresponds to the largest "dataset".
@SamuelAinsworth @zacharynado @konstmish It would be interesting to check empirically whether there's qualitatively different behaviour for BN models vs non-BN models.
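(A minimal sketch of the "batch is an example" mechanics, using a hypothetical toy model of my own: with BN in train mode, the same input produces different outputs depending on which examples share its batch, so each distinct batch composition really does act like a distinct "example".)

```python
# Sketch: with BatchNorm in train mode, an example's output depends on its batchmates.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 10), nn.BatchNorm1d(10))
net.train()  # use batch statistics, not running stats

x = torch.randn(1, 10)                        # the probe example
batch_a = torch.cat([x, torch.randn(7, 10)])  # same probe, different companions
batch_b = torch.cat([x, torch.randn(7, 10)])

out_a = net(batch_a)[0]
out_b = net(batch_b)[0]
print(torch.allclose(out_a, out_b))        # False: same input, different outputs
print((out_a - out_b).abs().max().item())  # nonzero gap driven by batch statistics
```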