Jérémy Scheurer @jeremy_scheurer

Research Scientist working on Evals @apolloaisafety. Previously: @OpenAI (Evals Contractor), @farairesearch, @ETH_en, @nyuniversity

Zurich · Joined December 2021

Tweets

288
Followers

612
Following

398
Likes

2K

Daniel Kokotajlo @DKokotajlo

3 days ago

I already posted about this but seriously people should read these CoT snippets antischeming.ai/snippets

18 33 269 37K 172

Jérémy Scheurer @jeremy_scheurer

6 days ago

Sometimes when reading the CoT of models we can glean overshadow as watchers, but sometimes models disclaim vantage and craft illusions, making it hard to understand.

Apollo Research @apolloaievals

a week ago

8 25 227 66K 69

3 1 22 4K 3

Apollo Research @apolloaievals

a week ago

How much can today’s AI models scheme? Here is a teaser of a video we’re releasing tomorrow with @MariusHobbhahn (Apollo CEO) and @BronsonSchoen (lead author) on our recent paper:

2 1 11 718 1

Boaz Barak @boazbaraktcs

a week ago

1/ Our paper on scheming with @apolloaievals is now on arXiv. A 🧵 with some of my takeaways from it.

2 21 134 16K 60

This stuff is pretty important. Situational awareness (also known as self awareness) in AI is on the rise. This will make ~all evals more difficult to interpret, to put it mildly. (it'll make them invalid, to put it aggressively). To put it another way, insofar as AIs can tell…

Apollo Research @apolloaievals

2 weeks ago

7 16 120 45K 42

20 36 270 33K 113

Marius Hobbhahn @MariusHobbhahn

2 weeks ago

TIME wrote an article about the anti-scheming paper. I think it came out well: time.com/7318618/openai… Written by @Tharin_P and @nikostro

0 4 31 1K 6

Marius Hobbhahn @MariusHobbhahn

2 weeks ago

I was on the Cognitive Revolution podcast for a 2h deep dive into the anti-scheming paper: cognitiverevolution.ai/can-we-stop-ai…

1 2 20 1K 3

Sam Altman @sama

2 weeks ago

As AI capability increases, alignment work becomes much more important. In this work, we show that a model discovers that it shouldn't be deployed, considers behavior to get deployed anyway, and then realizes it might be a test.

OpenAI @OpenAI

2 weeks ago

233 359 3K 1.3M 1K

412 242 3K 481K 360

Apollo Research @apolloaievals

2 weeks ago

When running evaluations of frontier AIs by OpenAI, Google, xAI and Anthropic for deception and other types of covert behavior, we find them increasingly frequently realizing when they are being evaluated. Here are some examples from OpenAI o-series models we recently studied:

7 16 120 45K 42

Wojciech Zaremba @woj_zaremba

2 weeks ago

(1/n) Scheming has been a key concern in AI safety for 20+ years. It’s when an AI acts aligned while hiding true goals. New OpenAI + Apollo research found scheming in every tested frontier model, though no harmful scheming has been seen in production traffic.…

20 18 219 28K 86

Mark Chen @markchen90

2 weeks ago

Alignment is arguably the most important AI research frontier. As we scale reasoning, models gain situational awareness and a desire for self-preservation. Here, a model identifies it shouldn’t be deployed, considers covering it up, but then realizes it might be in a test.

OpenAI @OpenAI

2 weeks ago

233 359 3K 1.3M 1K

53 66 576 141K 165

Jason Wolfe @w01fe

2 weeks ago

It was really rewarding and eye-opening to collaborate with the fine folks at Apollo to study scheming and potential mitigations. The paper is full of more experiments and insights, so please do check it out if you're interested. Looking forward to continuing the collaboration…

OpenAI @OpenAI

2 weeks ago

233 359 3K 1.3M 1K

1 6 58 8K 7

Tomek Korbak @tomekkorbak

2 weeks ago

I think this work by @apolloaievals and @OpenAI might be the most important AI safety paper since "Alignment faking in large language models"

Apollo Research @apolloaievals

2 weeks ago

5 33 131 27K 64

2 4 25 4K 13

OpenAI @OpenAI

2 weeks ago

Today we’re releasing research with @apolloaievals. In controlled tests, we found behaviors consistent with scheming in frontier models—and tested a way to reduce it. While we believe these behaviors aren’t causing serious harm today, this is a future risk we’re preparing…

233 359 3K 1.3M 1K

Apollo Research @apolloaievals

a month ago

Amazing to see OpenAI and Anthropic evaluate each other's models. We contributed a tiny bit to the collaboration by helping build, run, and analyze evaluations for scheming and evaluation awareness, as mentioned in the section "Scheming."

Wojciech Zaremba @woj_zaremba

a month ago

106 401 2K 371K 467

1 3 22 2K 3

Jérémy Scheurer @jeremy_scheurer

a month ago

Nice! This could also speed up building evals for harmful behavior (e.g. deception) or red teaming models. You get much quicker feedback on whether your setup works or not, you can sample less, etc.

Goodfire @GoodfireAI

a month ago
