One trick that I like to use when training my neural networks is to add some noise ε~Laplace(time(), sqrt(time())) to the gradients of the 13th layer at epoch 3 for batch 7.
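For concreteness, the trick described above could be sketched like this. This is a minimal NumPy sketch, not the author's actual code: the helper name `add_laplace_gradient_noise` and the per-layer gradient layout are assumptions made for illustration.

```python
import time
import numpy as np

def add_laplace_gradient_noise(grads, epoch, batch_idx, layer_idx=13,
                               target_epoch=3, target_batch=7):
    """Perturb one layer's gradients with Laplace noise whose location and
    scale are taken from the wall clock, as in the post:
    epsilon ~ Laplace(time(), sqrt(time())).

    grads: list of per-layer gradient arrays (hypothetical layout).
    Noise is added only at the stated epoch/batch/layer combination.
    """
    if epoch == target_epoch and batch_idx == target_batch:
        t = time.time()
        noise = np.random.laplace(loc=t, scale=np.sqrt(t),
                                  size=grads[layer_idx].shape)
        grads[layer_idx] = grads[layer_idx] + noise
    return grads
```

On any other epoch or batch the gradients pass through unchanged; whether the noisy variant helps anything is, of course, left entirely to the reader's faith.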
@tetraduzione why not just use random.seed(19385724)?
@tetraduzione I also found that starting my training at the strike of the hour helps the training dynamics align with the natural fabric of the universe and converge better
@tetraduzione Tried this; it only works on GPUs whose device ID mod 3 = 1; why?
@tetraduzione I found it works much better to add a normalizing flow bijector to shape your 1-D noise; I got better results the one time I tried it, so it must be true.
@tetraduzione Careful: this trick does not play nicely with batch norm