Case in point: arxiv.org/abs/2207.06366. You can do better than a bigger Transformer (stacking layers) by adding this single layer at the bottom of the network. While we already beat baselines on inference latency too, there's still a lot of overhead in there to be sped up!
@_arohan_ Any tips for using this for machine translation? I'm gonna implement it next week!
@robinschmidt_ Look at the cluster assignments when debugging (the first time I tried it, I had an incorrect-axis bug 🤦♂️). The multi-headed aspect is important. Decoder > Encoder (but worth ablating)
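To make the debugging tip concrete: a minimal sketch of computing nearest-centroid cluster assignments and eyeballing their distribution. The function name, shapes, and NumPy implementation here are my assumptions for illustration, not code from the paper; the axis handling in the distance computation is exactly the kind of place an incorrect-axis bug hides.

```python
import numpy as np

def cluster_assignments(x, centroids):
    """Assign each token embedding to its nearest centroid.

    x:         [batch, seq, dim] token embeddings
    centroids: [k, dim] cluster centers
    returns:   [batch, seq] integer cluster ids
    """
    # Broadcast to [batch, seq, k, dim], then reduce over the feature axis.
    # Summing over the wrong axis here still "runs" but gives garbage ids,
    # so sanity-check the output shape and the assignment histogram.
    d = np.sum((x[:, :, None, :] - centroids[None, None, :, :]) ** 2, axis=-1)
    return np.argmin(d, axis=-1)

# Quick sanity check: points near each centroid should map to it,
# and no cluster should swallow everything.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
x = np.array([[[0.1, -0.2], [9.8, 10.3]]])
ids = cluster_assignments(x, centroids)
print(ids)            # expected: [[0 1]]
print(np.bincount(ids.ravel(), minlength=len(centroids)))  # no empty/collapsed clusters
```

A skewed histogram (all tokens landing in one cluster) is a quick signal that either the distance axes are wrong or the clusters have collapsed.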