PhD student at @Princeton University, focusing on LLMs. Language Modeling and Pretraining, LLM Reasoning and RL. Prev @UCLA, @Tsinghua_IIIS
yifzhang.com | New York Metropolitan Area | Joined October 2022
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.…
Nice blog post! Essentially, this shows that μP + LoRA, when done right, makes the optimal learning rate transferable and nearly matches full fine-tuning performance. One subtle but important point worth mentioning is that there is an additional dimension of scaling to consider:…
As expected, NSA is not compatible with MLA, so DeepSeek chose another method: use a smaller (d=128) attention (w/o value) as the indexer.
Asymptotic cost ratio = 128/576.
In addition, the indexer uses FP8 while the main MLA uses 16-bit, so the effective ratio = 64/576 = 1/9.
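The two ratios above can be checked with exact fractions. This is a minimal sketch: the constant names are hypothetical, the 128 and 576 figures are taken from the post as given, and it assumes cost scales linearly with element width, so that FP8 counts as half of 16-bit.

```python
from fractions import Fraction

# Figures taken from the post; the names are illustrative only.
INDEXER_DIM = 128   # dimension of the small indexer attention
MLA_DIM = 576       # dimension the post uses for the main MLA attention

# Asymptotic cost ratio of the indexer relative to the main attention.
dim_ratio = Fraction(INDEXER_DIM, MLA_DIM)

# Assumption: cost is proportional to bits per element (FP8 vs 16-bit).
INDEXER_BITS = 8
MLA_BITS = 16
effective_ratio = dim_ratio * Fraction(INDEXER_BITS, MLA_BITS)

print(dim_ratio)        # 2/9
print(effective_ratio)  # 1/9
```

Under that linear-cost assumption, halving the element width is equivalent to halving the indexer dimension, which is where the 64/576 figure comes from.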
RL research is becoming like pretraining/modeling. This is a huge vibe shift.
Most research published on RL isn't using enough compute to make many of these decisions matter as much. This is slowly shifting.
@zjasper666 The original GRPO is an off-policy RL algorithm, but its KL regularization isn't done right. Specifically, the k3 estimator for the unnormalized reverse KL is missing the importance weight. The correct formulation should be:
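The post's own formula is not reproduced here. As a minimal sketch of the general point, the toy example below uses three small normalized distributions (all values and names are hypothetical, and normalization is a simplifying assumption) and computes expectations exactly instead of sampling. It shows that the k3 estimator recovers the reverse KL when averaged under the current policy, is biased when averaged under an older behavior policy, and is corrected by the importance weight pi_theta / pi_old.

```python
import math

def k3(log_ratio):
    # k3(r) = r - 1 - log r, where log_ratio = log(pi_ref(x) / pi_theta(x)).
    return math.exp(log_ratio) - 1.0 - log_ratio

def expectation(probs, values):
    # Exact expectation of `values` under the discrete distribution `probs`.
    return sum(p * v for p, v in zip(probs, values))

# Hypothetical toy distributions over three outcomes.
pi_theta = [0.5, 0.3, 0.2]  # current policy
pi_old = [0.2, 0.3, 0.5]    # behavior policy that generated the samples
pi_ref = [0.4, 0.4, 0.2]    # reference policy

# Reverse KL: KL(pi_theta || pi_ref).
exact_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))

k3_values = [k3(math.log(q / p)) for p, q in zip(pi_theta, pi_ref)]

# On-policy: average k3 under pi_theta.
on_policy = expectation(pi_theta, k3_values)

# Off-policy without correction: average k3 under pi_old.
off_policy_unweighted = expectation(pi_old, k3_values)

# Off-policy with importance weights pi_theta / pi_old.
weights = [p / b for p, b in zip(pi_theta, pi_old)]
off_policy_weighted = expectation(
    pi_old, [w * v for w, v in zip(weights, k3_values)]
)

print(exact_kl, on_policy, off_policy_unweighted, off_policy_weighted)
```

In this toy setting the on-policy and importance-weighted values agree with the exact KL, while the unweighted off-policy value does not.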
@rm_rafailov Actually, this is not from the paper you mentioned. It comes from our earlier work, which predates the one you referred to: arxiv.org/abs/2505.17508
I wrote this blog post that tries to go further toward design principles for neural nets and optimizers
The post presents a visual intro to optimization on normed manifolds and a Muon variant for the manifold of matrices with unit condition number
x.com/thinkymachines…
The NeurIPS acceptance bar is very high, and papers with a negative score rarely get accepted. Even when a negative review is poorly written (sometimes clearly generated by ChatGPT), it can still strongly influence the final decision, since the AC must maintain the acceptance rate.
1/
Introducing “Tensor Product Attention Is All You Need” (TPA) and Tensor ProducT ATTenTion Transformer (T6)! 🚀
Ever wondered if there’s a more memory-efficient way to handle long contexts in LLMs?
Homepage: tensorgi.github.io/T6
[LG] Tensor Product Attention Is All You Need
Y Zhang, Y Liu, H Yuan, Z Qin... [Tsinghua University & University of California, Los Angeles] (2025)
arxiv.org/abs/2501.06425
Also big congrats on Nemotron-CC-Math! 🎉 NVIDIA is not only leading, but continuing to lead, and setting the pace across multiple subareas of open pretraining data. @KarimiRabeeh and @issanjeev are the leading authors there! arxiv.org/pdf/2508.15096
1/6 We introduce RPG, a principled framework for deriving and analyzing KL-regularized policy gradient methods, unifying GRPO/k3-estimator and REINFORCE++ under this framework and discovering better RL objectives than GRPO:
Paper: arxiv.org/abs/2505.17508
Code:…
I just have a feeling that... it is much smarter. Not reflected by the common benchmarks, but it is just way better than the models before. This gives us much confidence in scaling, either model or data size.