Case in point: arxiv.org/abs/2207.06366. You can do better than a bigger Transformer (stacking layers) by adding this single layer at the bottom of the network. While we already beat baselines on inference latency too, there's still a lot of overhead in there to be sped up!
@_arohan_ Any tips for using this for machine translation? I'm gonna implement it next week!
@robinschmidt_ Look at the cluster assignments when debugging (the first time I tried it, I had an incorrect-axis bug 🤦♂️). The multi-headed aspect is important. Decoder > Encoder (but worth ablating)
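To make the debugging tip concrete: a minimal sketch of computing nearest-centroid cluster assignments and eyeballing their distribution. The function name, shapes, and NumPy implementation here are my assumptions for illustration, not code from the paper; the axis handling in the distance computation is exactly the kind of place an incorrect-axis bug hides.

```python
import numpy as np

def cluster_assignments(x, centroids):
    """Assign each token embedding to its nearest centroid.

    x:         [batch, seq, dim] token embeddings
    centroids: [k, dim] cluster centers
    returns:   [batch, seq] integer cluster ids
    """
    # Broadcast to [batch, seq, k, dim], then reduce over the feature axis.
    # Summing over the wrong axis here still "runs" but gives garbage ids,
    # so sanity-check the output shape and the assignment histogram.
    d = np.sum((x[:, :, None, :] - centroids[None, None, :, :]) ** 2, axis=-1)
    return np.argmin(d, axis=-1)

# Quick sanity check: points near each centroid should map to it,
# and no cluster should swallow everything.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
x = np.array([[[0.1, -0.2], [9.8, 10.3]]])
ids = cluster_assignments(x, centroids)
print(ids)            # expected: [[0 1]]
print(np.bincount(ids.ravel(), minlength=len(centroids)))  # no empty/collapsed clusters
```

A skewed histogram (all tokens landing in one cluster) is a quick signal that either the distance axes are wrong or the clusters have collapsed.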