Yifan Zhang @yifan_zhang_

PhD student at @Princeton University, focusing on LLMs. Language Modeling and Pretraining, LLM Reasoning and RL. Prev @UCLA, @Tsinghua_IIIS

yifzhang.com

New York Metropolitan Area

Joined October 2022

Tweets: 85

Followers: 392

Following: 514

Likes: 391

Thinking Machines @thinkymachines

17 hours ago

LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.…

62 393 3K 641K 2K

Quanquan Gu @QuanquanGu

13 hours ago

Nice blog post! Essentially, this shows that μP + LoRA, when done right, makes the optimal learning rate transferable and nearly matches full fine-tuning performance. One subtle but important point worth mentioning is that there is an additional dimension of scaling to consider:…

Thinking Machines @thinkymachines

17 hours ago

62 393 3K 641K 2K

4 13 180 19K 115

You Jiacheng @YouJiacheng

23 hours ago

As expected, NSA is not compatible with MLA, so DeepSeek chose another method: use a smaller (d=128) attention (w/o value) as the indexer. Asymptotic cost ratio = 128/576. In addition, indexer uses FP8 while main MLA uses 16-bit, so = 64/576 = 1/9.

11 47 361 44K 129
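The arithmetic in the tweet above can be checked with a short sketch. It uses only the figures quoted there (128, 576, FP8 versus 16-bit); the function name `cost_ratio` is hypothetical, and treating cost as proportional to width times bit width is the tweet's own simplification rather than a measured result.

```python
from fractions import Fraction

# Figures quoted in the tweet; nothing here comes from DeepSeek's code.
INDEXER_DIM = 128  # width of the indexer attention (no value part)
MLA_DIM = 576      # denominator the tweet uses for the main MLA attention


def cost_ratio(indexer_bits: int, mla_bits: int) -> Fraction:
    """Indexer cost relative to main attention, assuming cost ~ width * bits."""
    return Fraction(INDEXER_DIM * indexer_bits, MLA_DIM * mla_bits)


same_precision = cost_ratio(16, 16)  # both paths at 16-bit
fp8_indexer = cost_ratio(8, 16)      # FP8 indexer, 16-bit main MLA

print(same_precision, fp8_indexer)
```

Using `Fraction` keeps the ratios exact, so the reduction from 64/576 to 1/9 is visible without floating-point rounding.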

Nathan Lambert @natolambert

2 days ago

RL research is becoming like pretraining/modeling. This is a huge vibe shift. Most research published on RL isn't using enough compute to make many of these decisions matter as much. This is slowly shifting.

Tanishq Mathew Abraham, Ph.D. @iScienceLuvr

2 days ago

13 85 873 235K 899

9 87 775 90K 530

Grad @Grad62304977

a day ago

Funnily enough, deepseek r1 didn't even use the original GRPO

Jerry Tworek @MillionInt

2 days ago

14 10 429 49K 66

1 1 73 5K 8

Rafael Rafailov @ NeurIPS @rm_rafailov

2 days ago

It’s weird how people still blindly copy it. There was a whole paper about this.

Quanquan Gu @QuanquanGu

2 days ago

7 18 180 91K 232

3 17 282 44K 256

Quanquan Gu @QuanquanGu

2 days ago

@zjasper666 The original GRPO is an off-policy RL algorithm, but its KL regularization isn't done right. Specifically, the k3 estimator for the unnormalized reverse KL is missing the importance weight. The correct formulation should be:

7 18 180 91K 232
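The formula the tweet refers to was attached as an image and is not reproduced here. As a generic illustration only, the sketch below shows why an importance weight matters when the k3 term is averaged over samples from a different (behavior) policy. The three-outcome distributions and all names are hypothetical, and the sketch uses normalized distributions, so it is not the tweet's exact "unnormalized" setting or the authors' formulation.

```python
import math


def k3(ratio: float) -> float:
    """k3 term r - 1 - log(r); non-negative for every r > 0."""
    return ratio - 1.0 - math.log(ratio)


# Hypothetical toy distributions over three outcomes (illustration only).
behavior = [0.7, 0.2, 0.1]   # policy that generated the samples
current = [0.2, 0.3, 0.5]    # policy being optimized
reference = [0.4, 0.4, 0.2]  # reference policy in the KL penalty


def exact_kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def expected_k3(sampler, weighted: bool) -> float:
    """Exact expectation of k3(reference/current) under `sampler`,
    optionally multiplied by the importance weight current/sampler."""
    total = 0.0
    for s, c, r in zip(sampler, current, reference):
        weight = c / s if weighted else 1.0
        total += s * weight * k3(r / c)
    return total


target = exact_kl(current, reference)
unweighted = expected_k3(behavior, weighted=False)
weighted = expected_k3(behavior, weighted=True)

print(target, unweighted, weighted)
```

Expectations are computed exactly rather than by sampling, so the comparison is deterministic: the weighted average recovers KL(current || reference), while the unweighted one generally does not.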

Quanquan Gu @QuanquanGu

2 days ago

@rm_rafailov Actually, this is not from the paper you mentioned. It comes from our earlier work, which predates the one you referred to: arxiv.org/abs/2505.17508

1 4 53 4K 46

Jeremy Bernstein @jxbz

3 days ago

I wrote this blog post that tries to go further toward design principles for neural nets and optimizers. The post presents a visual intro to optimization on normed manifolds and a Muon variant for the manifold of matrices with unit condition number. x.com/thinkymachines…

Thinking Machines @thinkymachines

4 days ago

113 452 3K 1.4M 2K

21 49 459 60K 149

Yifan Zhang @yifan_zhang_

4 weeks ago

The future of something great is now within reach.

1 1 3 504 0

Chuang Gan @gan_chuang

2 weeks ago

The NeurIPS acceptance bar is very high, and papers with a negative score rarely get accepted. Even when a negative review is poorly written—sometimes clearly generated by ChatGPT—it can still strongly influence the final decision, since the AC must keep the acceptance rate.

8 4 98 12K 7

Yifan Zhang @yifan_zhang_

9 months ago

1/ Introducing “Tensor Product Attention Is All You Need” (TPA) and Tensor ProducT ATTenTion Transformer (T6)! 🚀 Ever wondered if there’s a more memory-efficient way to handle long contexts in LLMs? Homepage: tensorgi.github.io/T6

7 63 320 87K 212

fly51fly @fly51fly

8 months ago

[LG] Tensor Product Attention Is All You Need Y Zhang, Y Liu, H Yuan, Z Qin... [Tsinghua University & University of California, Los Angeles] (2025) arxiv.org/abs/2501.06425

1 9 44 3K 26

Zeyuan Allen-Zhu, Sc.D. @ZeyuanAllenZhu

4 weeks ago

Also big congrats on Nemotron-CC-Math! 🎉 NVIDIA is not only leading, but continuing to lead, and setting the pace across multiple subareas of open pretraining data. @KarimiRabeeh and @issanjeev are the leading authors there! arxiv.org/pdf/2508.15096

Rabeeh Karimi @KarimiRabeeh

4 weeks ago

1 1 17 15K 5

3 9 92 15K 32

YIFENG LIU @YIFENGLIU_AI

4 months ago

1/6 We introduce RPG, a principled framework for deriving and analyzing KL-regularized policy gradient methods, unifying GRPO/k3-estimator and REINFORCE++ under this framework and discovering better RL objectives than GRPO: Paper: arxiv.org/abs/2505.17508 Code:…

5 41 206 63K 172

Junyang Lin @JustinLin610

4 weeks ago

I just have a feeling that... it is much smarter. Not reflected by the common benchmarks, but it is just way better than the models before. This gives us much confidence in scaling, either model or data size.