@francoisfleuret They have to be non-linear, continuous, differentiable almost everywhere, preferably monotonic, possibly homogeneous (equivariant to scaling), and if possible with zero integral over the relevant domain.
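A minimal numeric sketch of two of the less familiar properties in that list, assuming ReLU as an example of positive homogeneity (ReLU(a·x) = a·ReLU(x) for a > 0) and tanh as an example of zero integral over a symmetric domain; the functions and domain here are illustrative choices, not from the thread:

```python
import numpy as np

xs = np.linspace(-5.0, 5.0, 10001)  # symmetric grid around 0
relu = lambda z: np.maximum(z, 0.0)

# positive homogeneity (equivariance to scaling): relu(a*x) == a*relu(x) for a > 0
a = 3.7
print(np.allclose(relu(a * xs), a * relu(xs)))  # True

# zero integral over a symmetric domain: tanh is odd (tanh(-x) = -tanh(x)),
# so a Riemann sum over [-5, 5] vanishes up to rounding error
dx = xs[1] - xs[0]
print(abs(np.sum(np.tanh(xs)) * dx) < 1e-9)  # True
```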
@ylecun Intuitively, when seen through the expectation of the gradient, monotonicity seems far more important than continuity.
@francoisfleuret Non-continuity may cause divergence with gradient-based algorithms: the gradient information may be inconsistent with the behavior of the function.
@ylecun Is there an analytical example that makes GD with, e.g., standard straight-through do something very bad?
@francoisfleuret Use a slanted sawtooth function that globally decreases but locally increases. Train a 1D linear regression with a single training sample x=1, y=1, initial weight w=0. The weight will keep increasing to infinity, while the output will keep decreasing to minus infinity.
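A minimal sketch of this construction, assuming plain SGD with a squared loss and the sawtooth s(z) = z − 2·floor(z) (local slope +1, a drop of 2 at every integer, so the global trend is −1 per unit); the straight-through gradient treats s as the identity in the backward pass, which here coincides with its almost-everywhere derivative:

```python
import math

def sawtooth(z):
    # slanted sawtooth: local slope +1 wherever it is differentiable,
    # but a drop of 2 at every integer, so the global trend is decreasing
    # (at integers n, sawtooth(n) = -n)
    return z - 2.0 * math.floor(z)

# 1D regression through the sawtooth, single sample x = 1, target y = 1
x, y = 1.0, 1.0
w, lr = 0.0, 0.1

for step in range(120):
    out = sawtooth(w * x)
    # straight-through gradient: d(sawtooth)/dz is taken to be +1,
    # which matches the local slope but not the global decreasing trend
    grad_w = 2.0 * (out - y) * 1.0 * x
    w -= lr * grad_w
    if step % 20 == 0:
        print(f"step {step:3d}  w = {w:14.2f}  output = {out:14.2f}")

# The weight grows without bound while the output drifts toward -infinity:
# the gradient keeps saying "increase w to raise the output", but every
# period of the sawtooth drops the output further below the target.
```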
@ylecun @francoisfleuret This is not even an arbitrary example: when regressing angles and applying mod 360, something like this realistically happens.