Accepted to #NeurIPS2025!
Big shoutout to our ~120 participants, who graciously allowed me to pester them daily with reminder emails, bug fixes, and troubleshooting queries 😓
DeepSeek v3.1 chat scores 53.8% on SWE-bench verified with mini-SWE-agent. It tends to take more steps to solve problems than other models (performance flattens out after some 125 steps). As a result, effective cost is somewhere near GPT-5 mini. Details in 🧵
What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately. Read more in the SWE-bench blog 🧵
At the end of the day, the SWE-bench leaderboard on swebench dot com is probably the clearest description of current model performance on this benchmark.
No "verified" subset, limited tool use (bash only), most scaffolding is open to see. In this benchmark, the Claude 4 Opus…
We evaluated the new GPT models with a minimal agent on SWE-bench verified. GPT-5 scores 65%, mini 60%, nano 35%. Still behind Opus 4 (68%), on par with Sonnet 4 (65%). But a lot cheaper, especially mini! Complete cost breakdown + details in 🧵
Play with gpt-5 in our minimal agent (guide in the 🧵)! gpt-5 really wants to solve anything in one shot, so some prompting adjustments are needed to have it behave like a proper agent. It still likes to cram a lot into a single step. Full evals tomorrow!
.@_carlosejimenez updated the SWE-bench [Bash only] leaderboard with Qwen3 numbers. Congrats to the team on the great results!
Note that these numbers are about 10% lower than the max numbers achievable by each model since we don't allow tools in this leaderboard. https://t.co/oQOnajNjFw
Releasing mini, a radically simple SWE-agent: 100 lines of code, 0 special tools, and gets 65% on SWE-bench verified!
Made for benchmarking, fine-tuning, RL, or just for use from your terminal.
It’s open source, simple to hack, and compatible with any LM! Link in 🧵
LMs had a really tough time playing real video games from the 90s, so we made a suite of 3 simple games to test specific abilities, including dragging and dropping and navigating a maze using the arrow keys. Even on these *extremely* simple games, most frontier LMs fail. Results-->
SWE-agent is now Multimodal! 😎
We're releasing SWE-agent Multimodal, with image-viewing abilities and a full web browser for debugging front-ends. Evaluate your LMs on SWE-bench Multimodal or use it yourself for front-end dev.
🔗➡️
Join me next week at #ICML25, where I will be presenting my first first-author paper: EnIGMA.
EnIGMA, an LM agent for cybersecurity, uses interactive tools for server connection and debugging, achieving state-of-the-art on 3 CTF benchmarks. youtube.com/watch?v=50zkWJ…