CHAI is a multi-institute research organization based at UC Berkeley that focuses on foundational research for AI technical safety. humancompatible.ai · Berkeley, CA · Joined November 2018
The Global Call for AI Red Lines is live!!
More than 200 former heads of state, Nobel laureates, and other respected thinkers and leaders, together with 70+ organizations, are calling for “do not cross” limits on AI’s most severe #risks
*New AI Alignment Paper*
🚨 Goal misgeneralization occurs when AI agents learn the wrong reward function instead of the human's intended goal.
😇 We show that training with a minimax regret objective provably mitigates it, promoting safer and better-aligned RL policies!
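A minimal sketch of the minimax-regret idea (illustrative only, not the paper's algorithm; `evaluate`, `update`, and the per-environment optimal returns are assumed to be supplied by the training setup):

```python
import numpy as np

def minimax_regret_step(policy, envs, optimal_returns, evaluate, update):
    # Regret on each environment: the gap between the best achievable
    # return there and what the current policy actually achieves.
    regrets = [opt - evaluate(policy, env)
               for env, opt in zip(envs, optimal_returns)]
    # Update against the environment where the policy is worst, so the
    # learned goal must work everywhere, not just on average.
    worst = int(np.argmax(regrets))
    return update(policy, envs[worst])
```

Minimizing worst-case regret rather than average return removes the incentive to latch onto spurious goals that only pay off on a subset of training environments.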
We built an AI assistant that plays Minecraft with you.
Start building a house—it figures out what you’re doing and jumps in to help.
This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵
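A toy sketch of the assistance-game loop (the goal names and likelihood table are invented for illustration): the assistant receives no explicit objective; it keeps a belief over the human's hidden goal, updates it from observed actions, and helps with whichever goal is most probable.

```python
# Hypothetical goals the human might be pursuing.
goals = ["build_house", "mine_ore", "farm"]
belief = {g: 1.0 / len(goals) for g in goals}

def likelihood(action, goal):
    # Made-up model of how likely each human action is under each goal.
    table = {("place_plank", "build_house"): 0.8,
             ("place_plank", "mine_ore"): 0.1,
             ("place_plank", "farm"): 0.1}
    return table.get((action, goal), 1.0 / len(goals))

def observe(action):
    # Bayesian update of the belief over goals from one observed action.
    for g in goals:
        belief[g] *= likelihood(action, g)
    total = sum(belief.values())
    for g in goals:
        belief[g] /= total

observe("place_plank")
assist_goal = max(belief, key=belief.get)  # help with the likeliest goal
```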
(1/7) New paper with @khanhxuannguyen and @thetututrain! Do LLM output probabilities actually relate to the probability of correctness? Or are they channeling this guy: ⬇️
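One standard way to pose that question is a calibration check. A minimal sketch (not necessarily the paper's method): bin the model's stated probabilities and compare each bin's average confidence with its empirical accuracy, i.e. expected calibration error.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Bin predictions by stated confidence; a calibrated model's
    # accuracy within each bin matches its average confidence there.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by the bin's share of data
    return ece
```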
🚨Our new #ICLR2025 paper presents a unified framework for intrinsic motivation and reward shaping: they signal the value of the RL agent’s state🤖=external state🌎+past experience🧠. Rewards based on potentials over the learning agent’s state provably avoid reward hacking!🧵
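For context, the classic special case this framework generalizes is potential-based reward shaping (Ng et al., 1999), where adding γΦ(s′) − Φ(s) to the reward provably leaves optimal policies unchanged. A minimal sketch, with `phi` standing in for any user-chosen potential function:

```python
GAMMA = 0.99  # discount factor

def shaped_reward(reward, phi, state, next_state):
    # Potential-based shaping: the added term telescopes over any
    # trajectory, so it changes learning speed but not which policies
    # are optimal -- the shaping itself cannot be reward-hacked.
    return reward + GAMMA * phi(next_state) - phi(state)
```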
Preference learning typically requires large amounts of pairwise feedback to learn an adequate preference model. However, can we improve the sample-efficiency and alignment ability of preference learning with linguistic feedback? With MAPLE🍁, we can! (AAAI-25 Alignment Track)🧵
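For comparison, standard pairwise preference learning fits a reward model with a Bradley-Terry-style loss over preferred/rejected pairs. This sketch shows that baseline (it is not MAPLE itself, which augments the process with linguistic feedback):

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(reward_model, preferred, rejected):
    # Bradley-Terry loss: push the learned reward of the preferred
    # sample above that of the rejected one.
    r_pos = reward_model(preferred)
    r_neg = reward_model(rejected)
    return -F.logsigmoid(r_pos - r_neg).mean()
```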
(1/5) New paper! Despite concerns about AI catastrophe, there isn’t much work on learning while provably avoiding catastrophe. In fact, nearly all of learning theory assumes all errors are reversible. Stuart Russell, Hanlin Zhu and I fill this gap: arxiv.org/pdf/2402.08062
When RLHFed models engage in “reward hacking” it can lead to unsafe/unwanted behavior. But there isn’t a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵
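An informal diagnostic often used in practice (explicitly not the paper's formal definition): reward hacking is suspected when the proxy reward being optimized keeps rising while a held-out measure of the true objective falls.

```python
def looks_like_reward_hacking(proxy_returns, true_returns):
    # Heuristic check over a training run: proxy reward improves
    # while the true objective degrades.
    proxy_up = proxy_returns[-1] > proxy_returns[0]
    true_down = true_returns[-1] < true_returns[0]
    return proxy_up and true_down
```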
LMs can generalize to implications of facts they are finetuned on. But what mechanisms enable this, and how are these mechanisms learned in pretraining? We develop conceptual and empirical tools for studying these questions. 🧵
Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
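A minimal sketch of the mechanism in PyTorch (hypothetical, not the paper's attack code): a forward hook shifts one layer's activations by a perturbation `delta`; an attack would optimize `delta` to fool a latent-space defense (e.g. a probe) while constraining the model's output logits to stay close to the original behavior.

```python
import torch

def register_activation_edit(layer: torch.nn.Module, delta: torch.Tensor):
    # Returning a value from a forward hook replaces the layer's output,
    # so downstream layers -- and any probe reading this layer -- see
    # the shifted activations instead of the originals.
    def hook(module, inputs, output):
        return output + delta
    return layer.register_forward_hook(hook)
```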
Want to help shape the future of safe AI? CHAI is partnering with Impact Academy to mentor some of this year's Global AI Safety Fellows. Applications are open now through Dec. 31. There's also a reward for referrals if you know someone who'd be a good fit!
209K Followers · 101 Following · The original AI alignment person. Understanding the reasons it's difficult since 2003.
This is my serious low-volume account. Follow @allTheYud for the rest.
34K Followers · 827 Following · Explaining AI Alignment to anyone who'll stand still for long enough, on YouTube and Discord.
Music, movies, microcode, and high-speed pizza delivery
62K Followers · 12K Following · AI policy researcher, wife guy in training, fan of cute animals and sci-fi, Substack writer, stealth-ish non-profit co-founder
18K Followers · 4K Following · AI professor.
Deep Learning, AI alignment, ethics, policy, & safety.
Formerly Cambridge, Mila, Oxford, DeepMind, ElementAI, UK AISI.
AI is a really big deal.
50K Followers · 3K Following · AI alignment + LLMs at Anthropic. On leave from NYU. Views not employers'. No relation to @s8mb. I think you should join @givingwhatwecan.
20K Followers · 9K Following · Programme Director @ARIA_research | accelerate mathematical modelling with AI and categorical systems theory » build safe transformative AI » cancel heat death
18K Followers · 1K Following · Hanging out with Claude, improving its behavior, and building tools to support that @AnthropicAI 😁
prev: @open_phil @googlebrain @openai (@microcovid)
7K Followers · 367 Following · Dedicated to the protection and thriving of sentient beings. PhD in evo bio.🔸
Executive Director of @PauseAIUS. Opinions not necessarily those of the org.
5K Followers · 2K Following · Research Scientist (Frontier Planning) at @GoogleDeepMind.
Research Affiliate @Cambridge_Uni @CSERCambridge & @LeverhulmeCFI.
All views my own.
10K Followers · 797 Following · Thinking about whether AI will destroy the world at https://t.co/pMilDvd4ya. DM or email for media requests. Feedback: https://t.co/zGAm1i7SKH
43 Followers · 194 Following · Building Beauvette: The first system to codify human taste and trust into infrastructure. Mom · Sports fanatic · Niners, Dubs, Giants, Cuse
3K Followers · 2K Following · Resident senior fellow at the Australian Strategic Policy Institute
Convenor of the Sydney Dialogue
https://t.co/WPIJr48Z9f
https://t.co/nhrNN6k2zT
@ASPI_org
38 Followers · 262 Following · CSE PhD student @UCSC. Research interests in Human-Centered #XAI (#hcxai), Neurosymbolic AI & Law (#NeSy). Ex-Software Engineer. Views (and the dog) are my own
228 Followers · 3K Following · "The first to coin the term geocryptography to analyze the crypto ecosystem."⛩️⚔Bushido Bulls member⛩️⚔ Here we use dialectics to analyze.
2K Followers · 170 Following · I do AGI Safety research. https://t.co/CBsX51tA39. Once I was swiss chard for Halloween. Once Bill Clinton elbowed me in the face.
48 Followers · 313 Following · ∆ PhD student in Philosophy/Mathematical Logic at the Centre for Logic, Epistemology, and the History of Science (CLE/Brazil)
∆ Liberty is my guiding star
EN/PT
316 Followers · 2K Following · Christian † - 2 Cor 5:14-15 - All the good in me is by God's grace. And on the way to more... Normal person with a talent for IT (.NET and PHP programmer, etc.)
1.2M Followers · 279 Following · We’re a team of scientists, engineers, ethicists and more, committed to solving intelligence, to advance science and benefit humanity.
494K Followers · 152 Following · Nobel Laureate. Co-Founder & CEO @GoogleDeepMind - working on AGI. Solving disease @IsomorphicLabs. Trying to understand the fundamental nature of reality.
11K Followers · 2K Following · Knowing things is a solved problem. Getting along is not. Working on AI, media, and inter-group conflict @CHAI_Berkeley. Got here from computational journalism.
80K Followers · 278 Following · Student of causal inference, human reasoning, and history of ideas, all viewed through the sharp lens of artificial intelligence.
5K Followers · 476 Following · Researcher at the University of Oxford & UC Berkeley. Author of The Alignment Problem, Algorithms to Live By (w. Tom Griffiths), and The Most Human Human.
303 Followers · 578 Following · Researcher in multi-agent RL and Cooperative AI. Postdoc @FLAIR_Ox. PhD from @safe_trusted_ai. ex intern @CHAI_Berkeley
https://t.co/vqMmK1bSvz
2K Followers · 201 Following · Senior research manager at MATS: https://t.co/Dj9HNhMdoJ
Want to usher in an era of human-friendly superintelligence, don't know how.
5K Followers · 562 Following · Professor, Institute of Technology and Humanity, and Leverhulme Centre for the Future of Intelligence, University of Cambridge
3K Followers · 525 Following · Independent think tank focusing on transforming resilience to extreme AI and biological risks - both in the UK and internationally.
2K Followers · 162 Following · I do AI Alignment Research. Currently at @METR_Evals on leave from my PhD at UC Berkeley’s @CHAI_berkeley. Opinions are my own.
422 Followers · 365 Following · Computer Science student @GATech, excited about building positive & equitable futures, sci-fi, & dogs. Previously @Lyft, health policy, @Bain. All views my own
1.6M Followers · 2K Following · Founder and CEO, O'Reilly Media. Watching the alpha geeks, sharing their stories, helping the future unfold. Didn't pay for a blue check, cannot make it go away
4K Followers · 5K Following · Over at @gretchenkrueger.bsky.social. Research Fellow @BKCHarvard. Previously @openai @ainowinstitute @nycedc. Views are yours, of my tweets. #isagiwhatwewant
2K Followers · 3K Following · Senior Research Fellow at @law_ai_ | Associate Fellow @LeverhulmeCFI | author of 'Architectures of Global AI Governance' (OUP, 2025)
151K Followers · 37 Following · Known as Mad Max for my unorthodox ideas and passion for adventure, my scientific interests range from artificial intelligence to the ultimate nature of reality
5K Followers · 761 Following · AI policy and alignment; integrating law, economics & computer science to build normatively competent AI that knows how to play well with humans
29K Followers · 1K Following · AI, national security, China. Part of the founding team at @CSETGeorgetown (opinions my own). Author of Rising Tide on Substack: https://t.co/LKAoyL00iB
5K Followers · 953 Following · Helping society anticipate and address tomorrow's information security challenges, in order to amplify and extend the upside of the digital revolution.