carlos @_carlosejimenez

i like ai, philosophy, and politics
carlosejimenez.com · San Francisco, CA · Joined May 2019

Tweets: 337 · Followers: 1K · Following: 359 · Likes: 7K

Ben Shi @BenShi34

a week ago

Accepted to #NeurIPS2025! Big shoutout to our ~120 participants, who graciously allowed me to pester them daily with reminder emails, bug fixes, and troubleshooting queries 😓

DeepSeek v3.1 chat scores 53.8% on SWE-bench Verified with mini-SWE-agent. It tends to take more steps to solve problems than other models (performance flattens out after about 125 steps). As a result, its effective cost is close to that of GPT-5 mini. Details in 🧵

Kilian Lieret @KLieret

a month ago

What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately. Read more in the SWE-bench blog 🧵

Samuel Miserendino @samuelp1002

a month ago

SemiAnalysis @SemiAnalysis_

2 months ago

At the end of the day, the SWE-bench leaderboard on swebench dot com is probably the most clear description of current model performance on this benchmark. No "verified" subset, limited tool use (bash only), most scaffolding is open to see. In this benchmark, the Claude 4 Opus…

Kilian Lieret @KLieret

2 months ago

We evaluated the new GPT models with a minimal agent on SWE-bench verified. GPT-5 scores 65%, mini 60%, nano 35%. Still behind Opus 4 (68%), on par with Sonnet 4 (65%). But a lot cheaper, especially mini! Complete cost breakdown + details in 🧵

Talor Abramovich @AbramovichTalor

2 months ago

Incredible to see the progress in Offensive Cybersecurity benchmarks!

Kilian Lieret @KLieret

2 months ago

Play with gpt-5 in our minimal agent (guide in the 🧵)! gpt-5 really wants to solve anything in one shot, so some prompting adjustments are needed to have it behave like a proper agent. It still likes to cram a lot into a single step. Full evals tomorrow!

Ofir Press @OfirPress

2 months ago

.@_carlosejimenez updated the SWE-bench [Bash only] leaderboard with Qwen3 numbers. Congrats to the team on the great results! Note that these numbers are about 10% lower than the max numbers achievable by each model since we don't allow tools in this leaderboard.

Qwen @Alibaba_Qwen

2 months ago

Kilian Lieret @KLieret

2 months ago

Releasing mini, a radically simple SWE-agent: 100 lines of code, 0 special tools, and gets 65% on SWE-bench verified! Made for benchmarking, fine-tuning, RL, or just for use from your terminal. It’s open source, simple to hack, and compatible with any LM! Link in 🧵

Ofir Press @OfirPress

2 months ago

AGI

Ofir Press @OfirPress

3 months ago

LMs had a really tough time playing real video games from the 90s, so we made a suite of 3 simple games to test specific abilities, including dragging and dropping and navigating a maze using the arrow keys. Even on these *extremely* simple games, most frontier LMs fail. Results-->

Alex Zhang @a1zhang

3 months ago

SWE-bench @SWEbench

3 months ago

SWE-agent is now Multimodal! 😎 We're releasing SWE-agent Multimodal, with image-viewing abilities and a full web browser for debugging front-ends. Evaluate your LMs on SWE-bench Multimodal or use it yourself for front-end dev. 🔗➡️

Talor Abramovich @AbramovichTalor

3 months ago

Join me next week at #ICML25, where I will be presenting my first first-author paper: EnIGMA. EnIGMA, an LM agent for cybersecurity, uses interactive tools for server connection and debugging, achieving state-of-the-art on 3 CTF benchmarks. youtube.com/watch?v=50zkWJ…