@francoisfleuret They have to be non-linear, continuous, differentiable almost everywhere, preferably monotonic, possibly homogeneous (equivariant to scaling), and if possible with zero integral over the relevant domain.
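A minimal numeric sketch of two of the less familiar properties in that list, assuming ReLU as an example of positive homogeneity (ReLU(a·x) = a·ReLU(x) for a > 0) and tanh as an example of zero integral over a symmetric domain; the functions and domain here are illustrative choices, not from the thread:

```python
import numpy as np

xs = np.linspace(-5.0, 5.0, 10001)  # symmetric grid around 0
relu = lambda z: np.maximum(z, 0.0)

# positive homogeneity (equivariance to scaling): relu(a*x) == a*relu(x) for a > 0
a = 3.7
print(np.allclose(relu(a * xs), a * relu(xs)))  # True

# zero integral over a symmetric domain: tanh is odd (tanh(-x) = -tanh(x)),
# so a Riemann sum over [-5, 5] vanishes up to rounding error
dx = xs[1] - xs[0]
print(abs(np.sum(np.tanh(xs)) * dx) < 1e-9)  # True
```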
@ylecun Intuitively, when seen through the expectation of the gradient, monotonicity seems far more important than continuity.
@francoisfleuret Non-continuity may cause divergence with gradient-based algorithms: the gradient information may be inconsistent with the behavior of the function.
@ylecun Is there an analytical example that makes GD with, e.g., standard straight-through do something very bad?
@francoisfleuret Use a slanted sawtooth function that globally decreases but locally increases. Train a 1D linear regression with a single training sample x=1, y=1, initial weight w=0. The weight will keep increasing to infinity, while the output will keep decreasing to minus infinity.
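A minimal sketch of this construction, assuming plain SGD with a squared loss and the sawtooth s(z) = z − 2·floor(z) (local slope +1, a drop of 2 at every integer, so the global trend is −1 per unit); the straight-through gradient treats s as the identity in the backward pass, which here coincides with its almost-everywhere derivative:

```python
import math

def sawtooth(z):
    # slanted sawtooth: local slope +1 wherever it is differentiable,
    # but a drop of 2 at every integer, so the global trend is decreasing
    # (at integers n, sawtooth(n) = -n)
    return z - 2.0 * math.floor(z)

# 1D regression through the sawtooth, single sample x = 1, target y = 1
x, y = 1.0, 1.0
w, lr = 0.0, 0.1

for step in range(120):
    out = sawtooth(w * x)
    # straight-through gradient: d(sawtooth)/dz is taken to be +1,
    # which matches the local slope but not the global decreasing trend
    grad_w = 2.0 * (out - y) * 1.0 * x
    w -= lr * grad_w
    if step % 20 == 0:
        print(f"step {step:3d}  w = {w:14.2f}  output = {out:14.2f}")

# The weight grows without bound while the output drifts toward -infinity:
# the gradient keeps saying "increase w to raise the output", but every
# period of the sawtooth drops the output further below the target.
```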
@ylecun @francoisfleuret This is not even an arbitrary example: when regressing angles and applying mod 360, something like this realistically happens.