🤏 Why do small Language Models underperform? We prove empirically and theoretically that the LM head on top of language models can limit performance through the softmax bottleneck phenomenon, especially when the hidden dimension is smaller than 1000. 📄Paper: arxiv.org/pdf/2404.07647… (1/10)
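A toy numerical illustration of the bottleneck (the sizes below are hypothetical, not the paper's setup): the logits of a linear LM head are `H @ W.T`, so their rank can never exceed the hidden dimension `d`, however large the vocabulary is.

```python
import numpy as np

# Toy sizes (assumptions for illustration only).
rng = np.random.default_rng(0)
V, d, n_contexts = 5_000, 64, 512         # vocab size, hidden dim, contexts

W = rng.standard_normal((V, d))           # LM head (unembedding) matrix
H = rng.standard_normal((n_contexts, d))  # last-layer hidden states
logits = H @ W.T                          # (n_contexts, V) logit matrix

# The logit matrix has rank at most d < V, so any target next-token
# log-probability matrix of higher rank cannot be matched exactly --
# the softmax bottleneck.
print(np.linalg.matrix_rank(logits))      # <= d (prints 64 here)
```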
(2/10) When analyzing small-scale LM suites, such as GPT-2 or Pythia, we observe that the smaller models have degenerate last-layer representations, while the larger ones do not. Average pairwise cosine similarity can capture this degeneration phenomenon:
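A rough sketch of how this metric can be computed; the model name, prompt, and helper function below are illustrative assumptions rather than the paper's exact evaluation setup:

```python
import torch
from transformers import AutoModel, AutoTokenizer

def avg_pairwise_cosine(hidden: torch.Tensor) -> float:
    """Average pairwise cosine similarity over token representations of
    shape (n_tokens, d); values close to 1 indicate a degenerate
    (anisotropic) last-layer space."""
    x = torch.nn.functional.normalize(hidden, dim=-1)
    sim = x @ x.T                     # (n, n) cosine-similarity matrix
    n = sim.size(0)
    # Exclude the diagonal (self-similarity is always 1).
    return ((sim.sum() - n) / (n * (n - 1))).item()

# Hypothetical usage on a small Pythia checkpoint:
name = "EleutherAI/pythia-70m"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
last = out.hidden_states[-1].squeeze(0)   # (n_tokens, d) last-layer states
print(avg_pairwise_cosine(last))
```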