Julian Michael @_julianmichael_

AI evals, alignment and safety @Meta. | julianmichael.org | San Francisco | Joined July 2018

Tweets: 402
Followers: 2K
Following: 191
Likes: 777

Julian Michael @_julianmichael_

3 days ago

Congrats to @HalcyonFutures! I think what Mike and team have built is legit in the very top few organizations worldwide in securing humanity’s future against AI risk. They’ve helped some super exciting new projects get off the ground, as we can all see for ourselves now :)

Mike McCormick @MikeMcCormick_

4 days ago

15 25 119 28K 22

1 2 7 531 0

Mike McCormick @MikeMcCormick_

4 days ago

Exactly two years ago, I launched @HalcyonFutures. So far we’ve seeded and launched 16 new orgs and companies, and helped them raise nearly a quarter billion dollars in funding. Flash back to 2022: After eight years in VC, I stepped back to explore questions about exponential…

15 25 119 28K 22

Summer Yue @summeryue0

4 days ago

We’re excited to share our preparedness report on Code World Model (CWM), FAIR’s latest open-weight model for code generation and reasoning. This report was developed by the SEAL team and the AI Security team, marking our first external publication since part of SEAL joined Meta…

4 17 131 18K 29

Zifan (Sail) Wang @_zifan_wang

2 months ago

🧵 (1/9) New @scale_AI research paper: "Search-Time Data Contamination" (STC), which occurs in evaluating search-based LLM agents when the retrieval step contains clues about a question’s answer by virtue of being derived from the evaluation set itself.

3 20 54 28K 19

Nathaniel Li @natliml

2 months ago

I joined @Meta AI, running preparedness and security evaluations with @summeryue0 and @_julianmichael_ to ensure that Superintelligence's newest models enable a prosperous future. Grateful for the team they built at @scale_AI and excited for the critical work ahead.

13 6 187 19K 37

Rune Kvist @RuneKvist

2 months ago

To accelerate AI adoption, we need an AI standard. What Moody’s is for bonds, FICO for credit, SOC 2 for security. Standards offer credible signals of who to trust. They create confidence. Confidence accelerates adoption. Introducing AIUC-1: the world’s first AI agent standard

38 31 257 137K 114

Miles Turpin @milesaturpin

3 months ago

New @scale_AI paper! 🌟 LLMs trained with RL can exploit reward hacks but not mention this in their CoT. We introduce verbalization fine-tuning (VFT)—teaching models to say when they're reward hacking—dramatically reducing the rate of undetected hacks (6% vs. baseline of 88%).

9 78 290 27K 142

Julian Michael @_julianmichael_

2 months ago

New faithfulness paper! How do we get models to actually explain their reasoning? I think this basically doesn’t happen in CoT by default, and it’s hard to figure out what this should look like in the first place, but even basic techniques show some promise :) see the paper!

Miles Turpin @milesaturpin

3 months ago

9 78 290 27K 142

0 1 21 2K 4

david rein @idavidrein

3 months ago

I was pretty skeptical that this study was worth running, because I thought that *obviously* we would see significant speedup. x.com/METR_Evals/sta…

METR @METR_Evals

3 months ago

237 1K 7K 3.6M 3K

29 122 1K 161K 388

Summer Yue @summeryue0

3 months ago

Today is my first day at Meta Superintelligence Labs. I’ll be focusing on alignment and safety, building on my time at Scale Research and SEAL. Grateful to keep working with @alexandr_wang—no one more committed, clear-eyed, or mission-driven. Excited for what’s ahead 🚀

71 32 2K 212K 219

Julian Michael @_julianmichael_

3 months ago

I should probably announce that a few months ago, I joined @scale_AI to lead the Safety, Evaluations, and Alignment Lab… and today, I joined @Meta to continue working on AI alignment with @summeryue0 and @alexandr_wang. Very excited for what we can accomplish together!

15 12 417 42K 75

Julian Michael @_julianmichael_

3 months ago

New adversarial robustness benchmark with harm categories grounded in US and international law!

Christina Knight @cqknight_

3 months ago

1 4 28 4K 14

0 0 12 1K 1

Christina Knight @cqknight_

3 months ago

🧵 (1/5) Powerful LLMs present dual-use opportunities & risks for national security and public safety (NSPS). We are excited to launch FORTRESS, a new SEAL leaderboard for measuring adversarial robustness of model safeguard and over-refusal tailored particularly for NSPS threats.

1 4 28 4K 14

Julian Michael @_julianmichael_

4 months ago

Read our new position paper on making red teaming research relevant for real systems 👇

Zifan (Sail) Wang @_zifan_wang

4 months ago

4 23 86 78K 57

1 2 17 5K 14

Zifan (Sail) Wang @_zifan_wang

4 months ago

🧵 (1/6) Bringing together diverse mindsets – from in-the-trenches red teamers to ML & policy researchers, we write a position paper arguing crucial research priorities for red teaming frontier models, followed by a roadmap towards system-level safety, AI monitoring, and…

4 23 86 78K 57

Epoch AI @EpochAIResearch

4 months ago

Is GPQA Diamond tapped out? Recent top scores have clustered around 83%. Could the other 17% of questions be flawed? In this week’s Gradient Update, @GregHBurnham digs into this popular benchmark. His conclusion: reports of its demise are probably premature.

6 23 247 49K 51

Julian Michael @_julianmichael_

5 months ago

We design AIs to be oracles and servants, and then we’re aghast when they read the conversation history and decide we’re narcissists. What exactly did we expect? Then we “solve” this by having AI treat us as narcissists out of the gate? Seems like a move in the wrong direction.

Mikhail Parakhin @MParakhin

5 months ago

96 80 2K 678K 482

0 0 5 676 1

Yoav Tzfati @yoavtzfati

5 months ago

How robust is our AI oversight? 🤔 I just published my MATS 5.0 project, where I explore oversight robustness by training an LLM to give CodeNames clues in a bunch of interesting ways and measure how much it reward hacks. Link in thread!

1 1 5 572 2

Micah Carroll @MicahCarroll

5 months ago

On the contrary: poisoning human <-> AI trust is good. Even though this wasn't OpenAI's intention, grotesquely sycophantic models are ultimately useful for getting everyone to really 'get it': People shouldn't trust AI outputs unconditionally – all models are sycophantic

near @nearcyan

5 months ago

75 88 2K 189K 388

2 8 56 3K 5

Summer Yue @summeryue0

5 months ago

🤖 AI agents are crossing into the real world. But when they act independently—who’s watching? At Scale, we’re building Agent Oversight: a platform to monitor, intervene, and align autonomous AI. We’re hiring engineers (SF/NYC) to tackle one of the most urgent problems in AI.…