Nikhil Anand @nikhil_anand91

3 months ago

Excited to share this work on understanding low-precision instabilities in model training! See our thread below for more details. Paper: arxiv.org/abs/2506.20752 Blogpost: tinyurl.com/lowprecinstabi…

Chloe H. Su @Huangyu58589918

3 months ago

David Alvarez Melis @elmelis

6 months ago

🚨 New preprint! TL;DR: Backtracking is not the "holy grail" for smarter LLMs. It’s praised for helping models “fix mistakes” and improve reasoning—but is it really the best use of test-time compute? 🤔

Eran Malach @EranMalach

6 months ago

How does RL improve performance on math reasoning? Studying RL from pretrained models is hard, as behavior depends on choice of base model. 🚨 In our new work, we train models *from scratch* to study the effect of the data mix on the behavior of RL. arxiv.org/abs/2504.07912

Nikhil Anand @nikhil_anand91

10 months ago

At NeurIPS? Come discuss loss-to-loss prediction and scaling laws with us!

Kempner Institute at Harvard University @KempnerInst

10 months ago

At NeurIPS? Come discuss loss-to-loss prediction and scaling laws with us!

Nikhil Anand @nikhil_anand91

10 months ago

How do different data distributions interact with scaling laws? And how does training data affect test loss? We find simple shifted power law fits can relate performance across (sometimes very disparate) datasets and losses. See David's thread for more details!

David Brandfonbrener @brandfonbrener

10 months ago

Eran Malach @EranMalach

11 months ago

MoEs increase parameter count but not FLOPs. Do they offer a "free lunch", improving performance without paying in compute? Our answer: for memorization, MoEs give performance gains "for free", but have limited benefit for reasoning! arXiv: arxiv.org/pdf/2410.19034 🦜🦜🦜

Nikhil Anand @nikhil_anand91

2 years ago

Really cool work led by Devin Kwok (McGill/Mila) on making sense of example difficulty. Addresses some key questions, e.g.: How consistent is measured difficulty across inits and for different architectures? Can we fingerprint models using a few key sensitive/hard examples?

fly51fly @fly51fly

2 years ago

Nikhil Anand @nikhil_anand91

2 years ago

Happy to share our EMNLP paper w/ @jtan189 where we apply Variance of Gradients (VoG) – originally developed by @_cagarwal, @mrdanieldsouza, and @sarahookr – for selecting important data in language-based tasks. At EMNLP? Let's connect to discuss data quality and/or LLMs! #EMNLP