ML twitter: Has anyone studied what's so special about the noise distribution of small-batch gradients such that they're better than large-batch gradients? Surely it's richer than just large-batch gradients + some Gaussian noise?
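(One concrete way to probe the question, sketched below as a toy PyTorch setup of my own, not anything from the thread: compute per-example gradients for a linear least-squares model and measure the excess kurtosis of the gradient noise along a random direction. If the noise were just isotropic Gaussian, the excess kurtosis would be near zero.)

```python
# Minimal sketch (my own toy setup): is small-batch gradient noise "just Gaussian"?
# For a linear model with squared loss, per-example gradients have a closed form,
# so we can look at the noise distribution directly.
import torch

torch.manual_seed(0)
n, d = 512, 20
X = torch.randn(n, d)
y = torch.randn(n)
w = torch.randn(d)

# Per-example gradient of (x_i . w - y_i)^2 w.r.t. w is 2 * (x_i . w - y_i) * x_i
resid = X @ w - y
per_example_grads = 2 * resid.unsqueeze(1) * X          # shape (n, d)
noise = per_example_grads - per_example_grads.mean(0)   # deviation from full-batch grad

proj = noise @ torch.randn(d)                # noise projected onto a random direction
z = (proj - proj.mean()) / proj.std()
excess_kurtosis = ((z**4).mean() - 3.0).item()
print(f"excess kurtosis of gradient noise: {excess_kurtosis:.2f}")  # > 0: heavier-tailed than Gaussian
```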
@SamuelAinsworth What are you measuring where small batches are better? If you tune the hyperparams properly, batch sizes should scale just fine (arxiv.org/abs/1811.03600, arxiv.org/abs/1907.04164, arxiv.org/abs/2102.06356)
@zacharynado @SamuelAinsworth Maybe I'm missing something, but I think for some problems large batch sizes do lead to worse final accuracy. For instance, Table 2 in arxiv.org/abs/2109.14119 reports that, without extra tricks, full-batch gradient descent reaches accuracy 87.36 (±1.23) versus 95.70 (±0.11) for SGD.
@konstmish @SamuelAinsworth What I should've said in my previous tweet: I think you can scale batches very large by tuning existing regularization methods. I wonder how well GD would do if they just tuned weight decay? It does make sense to me that at extremes like that paper you may need to introduce new reg tricks.
@zacharynado @konstmish I guess an alternate spin on my question is: small batches have some sort of regularizing effect, but what is it formally? E.g. adding Gaussian noise to my gradients corresponds to a Gaussian blur of my objective function. Could we formalize what small-batch regularization corresponds to?
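(One way to make the "blur" claim precise is randomized smoothing: for ε ~ N(0, σ²I), E[∇f(θ + ε)] = ∇(f ∗ N_σ)(θ), i.e. Gaussian noise on the *parameters*, rather than added to the gradient itself, gives an unbiased gradient of the Gaussian-blurred objective. Below is a minimal numerical check of that identity, my own sketch, for f = sin, where the blur has a closed form.)

```python
# Sketch: check E_eps[f'(x + eps)] ≈ d/dx (f * N_sigma)(x) for f = sin.
# The Gaussian blur of sin has the closed form exp(-sigma^2 / 2) * sin(x),
# so its gradient is exp(-sigma^2 / 2) * cos(x).
import numpy as np

rng = np.random.default_rng(0)
x, sigma = 1.3, 0.5
eps = rng.normal(0.0, sigma, size=1_000_000)

mc_grad = np.cos(x + eps).mean()                # Monte Carlo: E[f'(x + eps)]
blur_grad = np.exp(-sigma**2 / 2) * np.cos(x)   # exact gradient of the blurred objective
print(f"{mc_grad:.4f} vs {blur_grad:.4f}")      # should agree to ~3 decimals
```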
@SamuelAinsworth @zacharynado @konstmish I remember reading this take, but don't remember where: as soon as you use BN, *the batch is an example* and the "dataset" is actually the collection of all possible unique batches. From this perspective, there should be a sweet-spot batch size that corresponds to the largest "dataset".
@SamuelAinsworth @zacharynado @konstmish It would be interesting to check empirically whether there's qualitatively different behaviour for BN models vs non-BN models.
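(A minimal sketch of the "batch is an example" mechanics, using a hypothetical toy model of my own: with BN in train mode, the same input produces different outputs depending on which examples share its batch, so each distinct batch composition really does act like a distinct "example".)

```python
# Sketch: with BatchNorm in train mode, an example's output depends on its batchmates.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 10), nn.BatchNorm1d(10))
net.train()  # use batch statistics, not running stats

x = torch.randn(1, 10)                        # the probe example
batch_a = torch.cat([x, torch.randn(7, 10)])  # same probe, different companions
batch_b = torch.cat([x, torch.randn(7, 10)])

out_a = net(batch_a)[0]
out_b = net(batch_b)[0]
print(torch.allclose(out_a, out_b))        # False: same input, different outputs
print((out_a - out_b).abs().max().item())  # nonzero gap driven by batch statistics
```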