Case in point: arxiv.org/abs/2207.06366. You can do better than a bigger Transformer (stacking more layers) by adding this single layer to the bottom of the network. While we already beat baselines on inference latency too, there is still a lot of overhead in there to be sped up!
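For context, here is a minimal sketch of what that single layer does as I read the paper: per-head vector quantization of the token embeddings, latent bi-gram ids built from consecutive cluster ids, a lookup into a big n-gram embedding table, then LayerNorm and concatenation with the token embeddings. Everything below (sizes, names, the simple modulo in place of a real hash, codebooks trained by backprop instead of the paper's clustering updates) is an illustrative PyTorch guess, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentNGramLayer(nn.Module):
    """Illustrative latent n-gram layer; hyperparameters are made up."""
    def __init__(self, d_model=512, n_heads=8, n_clusters=256,
                 ngram_vocab=200_000, d_ngram=16):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.n_clusters = n_heads, n_clusters
        self.d_head = d_model // n_heads
        self.ngram_vocab = ngram_vocab
        # Per-head codebooks for quantizing token embeddings. (In the paper the
        # codebooks are learned with clustering-style updates; a plain Parameter
        # is used here only to keep the sketch short.)
        self.centroids = nn.Parameter(torch.randn(n_heads, n_clusters, self.d_head))
        # The "giant table": one bi-gram embedding table per head, stored flat.
        self.ngram_emb = nn.Embedding(n_heads * ngram_vocab, d_ngram)
        self.ln_x = nn.LayerNorm(d_model)
        self.ln_ng = nn.LayerNorm(n_heads * d_ngram)

    def cluster_ids(self, x):
        # x: [batch, seq, d_model] -> ids: [batch, seq, n_heads]
        b, s, _ = x.shape
        xh = x.view(b, s, self.n_heads, self.d_head)
        dists = ((xh.unsqueeze(3) - self.centroids) ** 2).sum(-1)  # [b, s, h, n_clusters]
        return dists.argmin(-1)

    def forward(self, x):
        ids = self.cluster_ids(x)                        # [b, s, h]
        prev = F.pad(ids, (0, 0, 1, 0))[:, :-1]          # cluster ids of the previous token
        # Combine consecutive cluster ids into a latent bi-gram id, then map it
        # into the n-gram vocab (a plain modulo stands in for a real hash here).
        bigram = (ids + prev * self.n_clusters) % self.ngram_vocab
        head_offset = torch.arange(self.n_heads, device=x.device) * self.ngram_vocab
        ng = self.ngram_emb(bigram + head_offset)        # [b, s, h, d_ngram]
        ng = ng.flatten(-2)                              # [b, s, h * d_ngram]
        # LayerNorm both streams and concatenate; output width is d_model + h * d_ngram.
        return torch.cat([self.ln_x(x), self.ln_ng(ng)], dim=-1)
```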
@_arohan_ Any tips for using this for machine translation? I'm gonna implement it next week!
@robinschmidt_ Look at cluster assignments for debugging this (first time I tried I had an incorrect axis bug 🤦♂️). The multiheaded aspect is important. Decoder > Encoder (but worth ablating).
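A sketch of the kind of sanity check on cluster assignments this tip suggests, assuming a PyTorch implementation; the shapes, names, and sizes below are made up for illustration.

```python
import torch

def assign_clusters(x, centroids):
    """x: [batch, seq, heads, d_head], centroids: [heads, n_clusters, d_head]."""
    dists = ((x.unsqueeze(3) - centroids) ** 2).sum(-1)   # [b, s, h, n_clusters]
    return dists.argmin(dim=-1)                            # argmin over clusters, NOT heads

b, s, h, d_head, n_clusters = 2, 16, 8, 64, 256
x = torch.randn(b, s, h, d_head)
centroids = torch.randn(h, n_clusters, d_head)
ids = assign_clusters(x, centroids)

# Shape check: an incorrect-axis bug usually shows up here first.
assert ids.shape == (b, s, h), ids.shape

# Cluster usage per head: with the wrong reduction axis the "cluster ids"
# collapse into a suspiciously small range instead of spreading out.
for head in range(h):
    used = ids[..., head].unique().numel()
    print(f"head {head}: {used} / {n_clusters} clusters used")
```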
@_arohan_ Really helpful! I'm guessing multilinguality is still unexplored?
@robinschmidt_ Yes, I forgot to mention how important LayerNorm is, and switching to AdaGrad with a small initial accumulator/epsilon (it's in the paper; don't use Adam on this giant table, because it's updated much more sparsely).
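An illustrative optimizer setup following that advice, assuming PyTorch: AdaGrad with a small initial accumulator and epsilon on the big n-gram table, a separate optimizer for the dense parameters. The module names and hyperparameter values are guesses; only the "AdaGrad, not Adam, on the sparse table" choice comes from the thread and paper.

```python
import torch
import torch.nn as nn

model = nn.ModuleDict({
    # The giant, sparsely-updated n-gram table (sparse=True so gradients stay sparse).
    "ngram_emb": nn.Embedding(1_600_000, 16, sparse=True),
    # Stand-in for the rest of the Transformer.
    "backbone": nn.Linear(512, 512),
})

# AdaGrad handles sparse gradients and lets us set a small initial
# accumulator and epsilon explicitly (exact values here are illustrative).
table_opt = torch.optim.Adagrad(
    model["ngram_emb"].parameters(),
    lr=0.1,
    initial_accumulator_value=1e-3,
    eps=1e-8,
)
dense_opt = torch.optim.Adam(model["backbone"].parameters(), lr=3e-4)
```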
@_arohan_ Got it, multilingual embeddings introduce a few interesting ablations such as keeping clusters separate or merging subsets for cross-lingual transfer! Let’s see how the experiments turn out =)
@robinschmidt_ That’s awesome, you can inject priors too, instead of or in addition to training the clusters.
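One possible reading of "inject priors", sketched below: warm-start (or freeze) the per-head codebooks from clusters computed offline, e.g. k-means over pretrained token embeddings. Purely illustrative, not from the paper.

```python
import torch
import torch.nn as nn

n_heads, n_clusters, d_head = 8, 256, 64

# Pretend these came from an offline clustering of existing embeddings.
prior_centroids = torch.randn(n_heads, n_clusters, d_head)

# Freeze to use the prior as-is; set requires_grad=True instead to only warm-start
# and let training refine the clusters.
centroids = nn.Parameter(prior_centroids.clone(), requires_grad=False)
```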