Yihao Feng @yihaocs

Research scientist at AIML @Apple; ex-AI Researcher @SFResearch; Ph.D. alumnus, UT Austin @UTCompSci. Reinforcement learning, diffusion models, and LLMs. Palo Alto, CA. Joined September 2013

Tweets

217
Followers

103
Following

478
Likes

3K

elvis @omarsar0

a day ago

Great work showing prompt synthesis as a new scaling axis for reasoning. Good training data is scarce. This work showcases a framework that might make it possible to construct high-quality training problems for reasoning-focused LLMs. Technical details below:

19 64 327 56K 321

Tanishq Mathew Abraham, Ph.D. @iScienceLuvr

5 days ago

Language Models that Think, Chat Better "This paper shows that the RLVR paradigm is effective beyond verifiable domains, and introduces RL with Model-rewarded Thinking (RLMT) for general-purpose chat capabilities." "RLMT consistently outperforms standard RLHF pipelines. This…

9 36 247 17K 162

Tanishq Mathew Abraham, Ph.D. @iScienceLuvr

6 days ago

APRIL: Active Partial Rollouts in Reinforcement Learning to tame long-tail generation "we propose Active Partial Rollouts in Reinforcement Learning (APRIL), which mitigates long-tail inefficiency." "Experiments show that APRIL improves rollout throughput by at most 44% across…

4 19 145 11K 84

Wenhao Yu @wyu_nd

a week ago

Also strongly recommend this paper on diversity reward in RL! The insights line up closely -- well worth reading together. https://arxiv.org/abs/2509.15194 (Tencent) https://arxiv.org/abs/2509.02534 (Meta) Not sure which diversity reward wins out 😀 (embedding vs…

Jason Weston @jaseweston

4 weeks ago

5 87 422 83K 345

5 39 251 23K 192

Wenhao Yu @wyu_nd

2 weeks ago

RL often causes 𝐞𝐧𝐭𝐫𝐨𝐩𝐲 𝐜𝐨𝐥𝐥𝐚𝐩𝐬𝐞: generations become shorter, less diverse, and brittle. A simple fix is a 𝐝𝐢𝐯𝐞𝐫𝐬𝐢𝐭𝐲 reward to boost exploration. I use it in many of my projects, and it is surprisingly effective! Details in our NEW paper: arxiv.org/abs/2509.15194

6 65 379 23K 282

Wenting Zhao @wzhao_nlp

2 weeks ago

What strikes me about this work is that as long as the data recipe is right, everything can just work with RL and generalize super well, even at the 1.7B level. Even though people said it’s hard to improve RL’ed Qwen models, we just did it! Thanks @_akhaliq for featuring my work for the third time…

AK @_akhaliq

3 weeks ago

3 38 248 99K 184

4 39 388 68K 310

Lucas Beyer (bl16) @giffmana

3 weeks ago

Very cool thread about whether LLMs can do multi-hop reasoning without CoT. If you're curious, read the full thread; it's well written and answers the question clearly.

Mikita Balesni 🇺🇦 @balesni

3 weeks ago

2 8 88 50K 82

9 26 214 41K 158

Mikita Balesni 🇺🇦 @balesni

3 weeks ago

The puzzle: * Synthetic + real fact: ✓ works * Synthetic + synthetic: ✗ fails * Synthetic facts in same training document or in-context: ✓ works

2 8 88 50K 82

Vik Paruchuri @VikParuchuri

3 weeks ago

High quality math is the secret sauce for reasoning models. The best math data is in old papers. But OCRing that math is full of insane edge cases. Let's talk about how to solve this, and how you can get better math data than many frontier labs 🧵

19 70 741 91K 669

Kyle Corbitt @corbtt

4 weeks ago

🚨 We’ve just published a recipe to train a frontier-level deep research agent using RL. With just 30 hours on an H200, any developer can now beat Sonnet-4 on DeepResearch Bench using open-source tools. (Thread 🧵)

38 175 1K 211K 2K

Aleksa Gordić (水平问题) @gordic_aleksa

4 weeks ago

New in-depth blog post - "Inside vLLM: Anatomy of a High-Throughput LLM Inference System". Probably the most in depth explanation of how LLM inference engines and vLLM in particular work! Took me a while to get this level of understanding of the codebase and then to write up…

62 404 3K 311K 3K

AK @_akhaliq

a month ago

Microsoft presents rStar2-Agent: Agentic Reasoning Technical Report. rStar2-Agent boosts a pre-trained 14B model to state of the art in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with…

14 68 362 41K 236

Guohao Li 🐫 @guohao_li

a month ago

Sir, we built this. An RL environment for learning reasoning at scale. GitHub: github.com/camel-ai/loong HF dataset: huggingface.co/datasets/camel… We extracted seed datasets from sources like textbooks, code libraries like sympy, networkX, Gurobi (math programming lib), rdkit…

Andrej Karpathy @karpathy

a month ago

288 727 6K 688K 5K

6 75 614 76K 618

Tanishq Mathew Abraham, Ph.D. @iScienceLuvr

a month ago

Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration "We dissect the popular GRPO algorithm and reveal a systematic bias: the cumulative-advantage disproportionately weights samples with medium accuracy, while down-weighting the low-accuracy…

4 46 231 29K 204
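
The bias described in the quoted abstract can be illustrated with a small sketch. It assumes binary rewards, the commonly described GRPO normalization (reward minus group mean, divided by group standard deviation), and one plausible reading of "cumulative advantage" as the sum of absolute advantages within a group. The paper's exact definition is not given in the tweet, and the function names below are hypothetical.

```python
import math

def grpo_advantages(rewards):
    """Group-normalized advantages: (r - mean) / std over one prompt's rollouts."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)  # population std (assumption)
    if std == 0.0:
        return [0.0] * n  # all-correct or all-wrong groups carry no signal
    return [(r - mean) / std for r in rewards]

def cumulative_abs_advantage(num_correct, group_size):
    """Sum of |advantage| for a group with num_correct successful rollouts."""
    rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
    return sum(abs(a) for a in grpo_advantages(rewards))

GROUP_SIZE = 8
profile = {k: cumulative_abs_advantage(k, GROUP_SIZE) for k in range(GROUP_SIZE + 1)}

for k, total in profile.items():
    print(f"accuracy {k / GROUP_SIZE:.3f}: cumulative |advantage| = {total:.3f}")
```

Under these assumptions the total works out to 2G·sqrt(p(1−p)) for group size G and accuracy p, so it peaks at p = 0.5 and shrinks toward zero for very hard or very easy prompts.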

François Chollet @fchollet

2 months ago

We were able to reproduce the strong findings of the HRM paper on ARC-AGI-1. Further, we ran a series of ablation experiments to get to the bottom of what's behind it. Key findings: 1. The HRM model architecture itself (the centerpiece of the paper) is not an important factor.…

46 297 3K 350K 2K

will brown @willccbb

2 months ago

super cool to see this come together, incredible work spearheaded by @brendanh0gan, all-in-all an incredibly detailed recipe of what it takes to craft a specialist model for OOD tasks where frontier models really struggle paper/weights/data/code in brendan’s thread :)

Brendan Hogan @brendanh0gan

2 months ago

21 92 726 122K 672

7 18 142 19K 69

Dimitris Papailiopoulos @DimitrisPapail

2 months ago

A neat observation: Rejection sampling during GRPO allows you to directly factor in properties in your reward, and allows you to go from optimizing max_model Expected Reward(response) to max_model E {Reward(response) * Property(response)} In our case it's "small length", but…

Dimitris Papailiopoulos @DimitrisPapail

2 months ago

19 42 362 96K 273

5 13 127 14K 103
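
The expectation identity in the rejection-sampling observation above can be checked numerically. The sketch below is only an illustration under stated assumptions: a made-up four-response toy policy, a binary "small length" property with an assumed threshold, and no GRPO group normalization. It shows that dropping responses that fail the property, while still dividing by the number of draws, estimates E[Reward * Property] rather than E[Reward].

```python
import random

# Hypothetical toy policy: (probability, reward, response length in tokens).
RESPONSES = [
    (0.4, 1.0, 50),
    (0.3, 0.0, 80),
    (0.2, 1.0, 200),
    (0.1, 1.0, 400),
]
MAX_LEN = 100  # assumed threshold for the "small length" property

def has_property(length):
    """Binary property: 1.0 if the response is short enough, else 0.0."""
    return 1.0 if length <= MAX_LEN else 0.0

def exact_expectations():
    """Return (E[Reward], E[Reward * Property]) under the toy policy."""
    plain = sum(p * r for p, r, _ in RESPONSES)
    gated = sum(p * r * has_property(n) for p, r, n in RESPONSES)
    return plain, gated

def rejection_sampled_reward(num_draws, seed=0):
    """Sample responses, reject those failing the property, average over all draws."""
    rng = random.Random(seed)
    weights = [p for p, _, _ in RESPONSES]
    draws = rng.choices(RESPONSES, weights=weights, k=num_draws)
    kept_reward = sum(r for _, r, n in draws if has_property(n) == 1.0)
    return kept_reward / num_draws

plain, gated = exact_expectations()
estimate = rejection_sampled_reward(100_000)
print(f"E[R] = {plain:.3f}, E[R*P] = {gated:.3f}, rejection-sampled estimate = {estimate:.3f}")
```

Dividing by the number of accepted samples instead would estimate the conditional expectation E[Reward | Property], which is a different objective; which normalization the authors use is not stated in the tweet.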

Mika Senghaas @mikasenghaas

2 months ago

moving from vllm v0 to v1 made our async rl training crash! read how we fixed it we recently migrated from v0 to v1 as part of a larger refactor of prime-rl to make it easier-to-use, more performant and naturally async. we confirmed correct training dynamics on many…

7 33 276 45K 198

❄️Andrew Zhao❄️ @_AndrewZhao

2 months ago

Nice empirical paper investigating all your bag of tricks in reasoning LLMs arxiv.org/abs/2508.08221

4 93 610 51K 686

Zhenyu He @zhenyuhe00

2 months ago

🥳Thrilled to introduce SWE-Swiss! 🚀Our 32B model achieves 60.2% on SWE-bench, matching the performance of much larger models (DeepSeek-R1-0528, Kimi-dev-72B). Better methods, not just bigger models! 📑Notion: pebble-potato-fc6.notion.site/SWE-Swiss-A-Mu… 💻Github: github.com/zhenyuhe00/SWE…