Case in point: arxiv.org/abs/2207.06366. You can do better than a bigger Transformer (stacking more layers) by adding this single layer to the bottom of the network. While we already beat baselines on inference latency too, there is still a lot of overhead in there to be sped up!
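For context, here is a minimal sketch of what that single layer does as I read the paper: per-head vector quantization of the token embeddings, latent bi-gram ids built from consecutive cluster ids, a lookup into a big n-gram embedding table, then LayerNorm and concatenation with the token embeddings. Everything below (sizes, names, the simple modulo in place of a real hash, codebooks trained by backprop instead of the paper's clustering updates) is an illustrative PyTorch guess, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentNGramLayer(nn.Module):
    """Illustrative latent n-gram layer; hyperparameters are made up."""
    def __init__(self, d_model=512, n_heads=8, n_clusters=256,
                 ngram_vocab=200_000, d_ngram=16):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.n_clusters = n_heads, n_clusters
        self.d_head = d_model // n_heads
        self.ngram_vocab = ngram_vocab
        # Per-head codebooks for quantizing token embeddings. (In the paper the
        # codebooks are learned with clustering-style updates; a plain Parameter
        # is used here only to keep the sketch short.)
        self.centroids = nn.Parameter(torch.randn(n_heads, n_clusters, self.d_head))
        # The "giant table": one bi-gram embedding table per head, stored flat.
        self.ngram_emb = nn.Embedding(n_heads * ngram_vocab, d_ngram)
        self.ln_x = nn.LayerNorm(d_model)
        self.ln_ng = nn.LayerNorm(n_heads * d_ngram)

    def cluster_ids(self, x):
        # x: [batch, seq, d_model] -> ids: [batch, seq, n_heads]
        b, s, _ = x.shape
        xh = x.view(b, s, self.n_heads, self.d_head)
        dists = ((xh.unsqueeze(3) - self.centroids) ** 2).sum(-1)  # [b, s, h, n_clusters]
        return dists.argmin(-1)

    def forward(self, x):
        ids = self.cluster_ids(x)                        # [b, s, h]
        prev = F.pad(ids, (0, 0, 1, 0))[:, :-1]          # cluster ids of the previous token
        # Combine consecutive cluster ids into a latent bi-gram id, then map it
        # into the n-gram vocab (a plain modulo stands in for a real hash here).
        bigram = (ids + prev * self.n_clusters) % self.ngram_vocab
        head_offset = torch.arange(self.n_heads, device=x.device) * self.ngram_vocab
        ng = self.ngram_emb(bigram + head_offset)        # [b, s, h, d_ngram]
        ng = ng.flatten(-2)                              # [b, s, h * d_ngram]
        # LayerNorm both streams and concatenate; output width is d_model + h * d_ngram.
        return torch.cat([self.ln_x(x), self.ln_ng(ng)], dim=-1)
```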
@_arohan_ Any tips for using this for machine translation? I'm gonna implement it next week!
@robinschmidt_ Look at cluster assignments for debugging this (first time I tried I had an incorrect axis bug 🤦♂️). The multiheaded aspect is important. Decoder > Encoder (but worth ablating).
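A sketch of the kind of sanity check on cluster assignments this tip suggests, assuming a PyTorch implementation; the shapes, names, and sizes below are made up for illustration.

```python
import torch

def assign_clusters(x, centroids):
    """x: [batch, seq, heads, d_head], centroids: [heads, n_clusters, d_head]."""
    dists = ((x.unsqueeze(3) - centroids) ** 2).sum(-1)   # [b, s, h, n_clusters]
    return dists.argmin(dim=-1)                            # argmin over clusters, NOT heads

b, s, h, d_head, n_clusters = 2, 16, 8, 64, 256
x = torch.randn(b, s, h, d_head)
centroids = torch.randn(h, n_clusters, d_head)
ids = assign_clusters(x, centroids)

# Shape check: an incorrect-axis bug usually shows up here first.
assert ids.shape == (b, s, h), ids.shape

# Cluster usage per head: with the wrong reduction axis the "cluster ids"
# collapse into a suspiciously small range instead of spreading out.
for head in range(h):
    used = ids[..., head].unique().numel()
    print(f"head {head}: {used} / {n_clusters} clusters used")
```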
@_arohan_ Really helpful! I'm guessing multilinguality is still unexplored?
@robinschmidt_ Yes, I forgot to mention how important LayerNorm is, and switching to AdaGrad with a small initial accumulator/epsilon (it's in the paper; don't use Adam on this giant table, because it's updated much more sparsely).
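An illustrative optimizer setup following that advice, assuming PyTorch: AdaGrad with a small initial accumulator and epsilon on the big n-gram table, a separate optimizer for the dense parameters. The module names and hyperparameter values are guesses; only the "AdaGrad, not Adam, on the sparse table" choice comes from the thread and paper.

```python
import torch
import torch.nn as nn

model = nn.ModuleDict({
    # The giant, sparsely-updated n-gram table (sparse=True so gradients stay sparse).
    "ngram_emb": nn.Embedding(1_600_000, 16, sparse=True),
    # Stand-in for the rest of the Transformer.
    "backbone": nn.Linear(512, 512),
})

# AdaGrad handles sparse gradients and lets us set a small initial
# accumulator and epsilon explicitly (exact values here are illustrative).
table_opt = torch.optim.Adagrad(
    model["ngram_emb"].parameters(),
    lr=0.1,
    initial_accumulator_value=1e-3,
    eps=1e-8,
)
dense_opt = torch.optim.Adam(model["backbone"].parameters(), lr=3e-4)
```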
@_arohan_ Got it, multilingual embeddings introduce a few interesting ablations such as keeping clusters separate or merging subsets for cross-lingual transfer! Let’s see how the experiments turn out =)
@robinschmidt_ That’s awesome, you can inject priors too, instead of or in addition to training the clusters.
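One possible reading of "inject priors", sketched below: warm-start (or freeze) the per-head codebooks from clusters computed offline, e.g. k-means over pretrained token embeddings. Purely illustrative, not from the paper.

```python
import torch
import torch.nn as nn

n_heads, n_clusters, d_head = 8, 256, 64

# Pretend these came from an offline clustering of existing embeddings.
prior_centroids = torch.randn(n_heads, n_clusters, d_head)

# Freeze to use the prior as-is; set requires_grad=True instead to only warm-start
# and let training refine the clusters.
centroids = nn.Parameter(prior_centroids.clone(), requires_grad=False)
```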