🤏 Why do small Language Models underperform? We prove empirically and theoretically that the LM head on top of language models can limit performance through the softmax bottleneck phenomenon, especially when the hidden dimension is smaller than 1000. 📄Paper: arxiv.org/pdf/2404.07647… (1/10)
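A toy numerical illustration of the bottleneck (the sizes below are hypothetical, not the paper's setup): the logits of a linear LM head are `H @ W.T`, so their rank can never exceed the hidden dimension `d`, however large the vocabulary is.

```python
import numpy as np

# Toy sizes (assumptions for illustration only).
rng = np.random.default_rng(0)
V, d, n_contexts = 5_000, 64, 512         # vocab size, hidden dim, contexts

W = rng.standard_normal((V, d))           # LM head (unembedding) matrix
H = rng.standard_normal((n_contexts, d))  # last-layer hidden states
logits = H @ W.T                          # (n_contexts, V) logit matrix

# The logit matrix has rank at most d < V, so any target next-token
# log-probability matrix of higher rank cannot be matched exactly --
# the softmax bottleneck.
print(np.linalg.matrix_rank(logits))      # <= d (prints 64 here)
```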
(2/10) When analyzing small-scale LM suites, such as GPT-2 or Pythia, we observe that the smaller models have degenerate last-layer representations, while the larger ones do not. Average pairwise cosine similarity can capture this degeneration phenomenon:
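A rough sketch of how this metric can be computed; the model name, prompt, and helper function below are illustrative assumptions rather than the paper's exact evaluation setup:

```python
import torch
from transformers import AutoModel, AutoTokenizer

def avg_pairwise_cosine(hidden: torch.Tensor) -> float:
    """Average pairwise cosine similarity over token representations of
    shape (n_tokens, d); values close to 1 indicate a degenerate
    (anisotropic) last-layer space."""
    x = torch.nn.functional.normalize(hidden, dim=-1)
    sim = x @ x.T                     # (n, n) cosine-similarity matrix
    n = sim.size(0)
    # Exclude the diagonal (self-similarity is always 1).
    return ((sim.sum() - n) / (n * (n - 1))).item()

# Hypothetical usage on a small Pythia checkpoint:
name = "EleutherAI/pythia-70m"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
last = out.hidden_states[-1].squeeze(0)   # (n_tokens, d) last-layer states
print(avg_pairwise_cosine(last))
```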