We show that a token representation can be viewed as a changing distribution over the output vocabulary, and that FFN layers induce additive updates to that distribution (2/6)
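A toy sketch of why the update is additive, under simplifying assumptions that are not from the thread: a single ReLU FFN with a residual connection, no layer norm, and random matrices `E`, `K`, `V` standing in for the output embedding and the FFN weights. All names and sizes here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, vocab = 8, 16, 20        # toy sizes, chosen only for illustration

E = rng.normal(size=(vocab, d_model))   # stand-in output embedding matrix
K = rng.normal(size=(d_ff, d_model))    # FFN first-layer weights
V = rng.normal(size=(d_ff, d_model))    # FFN second-layer weights (one row per value vector)


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


x = rng.normal(size=d_model)            # token representation before the FFN
m = np.maximum(K @ x, 0.0)              # ReLU activations, one coefficient per value vector
ffn_out = m @ V                         # weighted sum of value vectors
x_new = x + ffn_out                     # residual connection

# Project to the vocabulary before and after the FFN.
logits_before = E @ x
logits_after = E @ x_new

# Because the projection is linear, the FFN's effect is an additive
# term in logit space, and it splits into one term per value vector.
update = E @ ffn_out
per_value_updates = m[:, None] * (V @ E.T)   # shape (d_ff, vocab)

p_before = softmax(logits_before)       # distribution over the vocabulary before the FFN
p_after = softmax(logits_after)         # distribution after the FFN update
```

In this sketch the additivity holds exactly in logit space; the resulting probabilities change through the softmax, so the update is additive before normalization rather than on the probabilities themselves.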