CHAI is a multi-institute research organization based at UC Berkeley that focuses on foundational research for AI technical safety. humancompatible.ai · Berkeley, CA · Joined November 2018
The Global Call for AI Red Lines is live!!
More than 200 former heads of state, Nobel laureates, and other respected thinkers and leaders, together with 70+ organizations, are calling for “do not cross” limits on AI’s most severe #risks
*New AI Alignment Paper*
🚨 Goal misgeneralization occurs when AI agents learn the wrong reward function instead of the human's intended goal.
😇 We show that training with a minimax regret objective provably mitigates it, promoting safer and better-aligned RL policies!
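A minimal sketch of the minimax-regret idea (illustrative only, not the paper's algorithm; `evaluate`, `update`, and the per-environment optimal returns are assumed to be supplied by the training setup):

```python
import numpy as np

def minimax_regret_step(policy, envs, optimal_returns, evaluate, update):
    # Regret on each environment: the gap between the best achievable
    # return there and what the current policy actually achieves.
    regrets = [opt - evaluate(policy, env)
               for env, opt in zip(envs, optimal_returns)]
    # Update against the environment where the policy is worst, so the
    # learned goal must work everywhere, not just on average.
    worst = int(np.argmax(regrets))
    return update(policy, envs[worst])
```

Minimizing worst-case regret rather than average return removes the incentive to latch onto spurious goals that only pay off on a subset of training environments.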
We built an AI assistant that plays Minecraft with you.
Start building a house—it figures out what you’re doing and jumps in to help.
This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵
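A toy sketch of the assistance-game loop (the goal names and likelihood table are invented for illustration): the assistant receives no explicit objective; it keeps a belief over the human's hidden goal, updates it from observed actions, and helps with whichever goal is most probable.

```python
# Hypothetical goals the human might be pursuing.
goals = ["build_house", "mine_ore", "farm"]
belief = {g: 1.0 / len(goals) for g in goals}

def likelihood(action, goal):
    # Made-up model of how likely each human action is under each goal.
    table = {("place_plank", "build_house"): 0.8,
             ("place_plank", "mine_ore"): 0.1,
             ("place_plank", "farm"): 0.1}
    return table.get((action, goal), 1.0 / len(goals))

def observe(action):
    # Bayesian update of the belief over goals from one observed action.
    for g in goals:
        belief[g] *= likelihood(action, g)
    total = sum(belief.values())
    for g in goals:
        belief[g] /= total

observe("place_plank")
assist_goal = max(belief, key=belief.get)  # help with the likeliest goal
```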
(1/7) New paper with @khanhxuannguyen and @thetututrain! Do LLM output probabilities actually relate to the probability of correctness? Or are they channeling this guy: ⬇️
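One standard way to pose that question is a calibration check. A minimal sketch (not necessarily the paper's method): bin the model's stated probabilities and compare each bin's average confidence with its empirical accuracy, i.e. expected calibration error.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Bin predictions by stated confidence; a calibrated model's
    # accuracy within each bin matches its average confidence there.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by the bin's share of data
    return ece
```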
🚨Our new #ICLR2025 paper presents a unified framework for intrinsic motivation and reward shaping: they signal the value of the RL agent’s state🤖=external state🌎+past experience🧠. Rewards based on potentials over the learning agent’s state provably avoid reward hacking!🧵
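For context, the classic special case this framework generalizes is potential-based reward shaping (Ng et al., 1999), where adding γΦ(s′) − Φ(s) to the reward provably leaves optimal policies unchanged. A minimal sketch, with `phi` standing in for any user-chosen potential function:

```python
GAMMA = 0.99  # discount factor

def shaped_reward(reward, phi, state, next_state):
    # Potential-based shaping: the added term telescopes over any
    # trajectory, so it changes learning speed but not which policies
    # are optimal -- the shaping itself cannot be reward-hacked.
    return reward + GAMMA * phi(next_state) - phi(state)
```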
Preference learning typically requires large amounts of pairwise feedback to learn an adequate preference model. However, can we improve the sample-efficiency and alignment ability of preference learning with linguistic feedback? With MAPLE🍁, we can! (AAAI-25 Alignment Track)🧵
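For comparison, standard pairwise preference learning fits a reward model with a Bradley-Terry-style loss over preferred/rejected pairs. This sketch shows that baseline (it is not MAPLE itself, which augments the process with linguistic feedback):

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(reward_model, preferred, rejected):
    # Bradley-Terry loss: push the learned reward of the preferred
    # sample above that of the rejected one.
    r_pos = reward_model(preferred)
    r_neg = reward_model(rejected)
    return -F.logsigmoid(r_pos - r_neg).mean()
```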
(1/5) New paper! Despite concerns about AI catastrophe, there isn’t much work on learning while provably avoiding catastrophe. In fact, nearly all of learning theory assumes all errors are reversible. Stuart Russell, Hanlin Zhu and I fill this gap: arxiv.org/pdf/2402.08062
When RLHFed models engage in “reward hacking” it can lead to unsafe/unwanted behavior. But there isn’t a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵
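An informal diagnostic often used in practice (explicitly not the paper's formal definition): reward hacking is suspected when the proxy reward being optimized keeps rising while a held-out measure of the true objective falls.

```python
def looks_like_reward_hacking(proxy_returns, true_returns):
    # Heuristic check over a training run: proxy reward improves
    # while the true objective degrades.
    proxy_up = proxy_returns[-1] > proxy_returns[0]
    true_down = true_returns[-1] < true_returns[0]
    return proxy_up and true_down
```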
LMs can generalize to implications of facts they are finetuned on. But what mechanisms enable this, and how are these mechanisms learned in pretraining? We develop conceptual and empirical tools for studying these questions. 🧵
Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets us attack latent-space defenses, from SAEs and probes to Circuit Breakers. We can attack so precisely that we make a harmfulness probe output this QR code. 🧵
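A minimal sketch of the mechanism in PyTorch (hypothetical, not the paper's attack code): a forward hook shifts one layer's activations by a perturbation `delta`; an attack would optimize `delta` to fool a latent-space defense (e.g. a probe) while constraining the model's output logits to stay close to the original behavior.

```python
import torch

def register_activation_edit(layer: torch.nn.Module, delta: torch.Tensor):
    # Returning a value from a forward hook replaces the layer's output,
    # so downstream layers -- and any probe reading this layer -- see
    # the shifted activations instead of the originals.
    def hook(module, inputs, output):
        return output + delta
    return layer.register_forward_hook(hook)
```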
Want to help shape the future of safe AI? CHAI is partnering with Impact Academy to mentor some of this year's Global AI Safety Fellows. Applications are open now through Dec. 31. There's also a reward for referrals if you know someone who'd be a good fit!
209K Followers · 101 Following · The original AI alignment person. Understanding the reasons it's difficult since 2003.
This is my serious low-volume account. Follow @allTheYud for the rest.
34K Followers · 827 Following · Explaining AI Alignment to anyone who'll stand still for long enough, on YouTube and Discord.
Music, movies, microcode, and high-speed pizza delivery
62K Followers · 12K Following · AI policy researcher, wife guy in training, fan of cute animals and sci-fi, Substack writer, stealth-ish non-profit co-founder
18K Followers · 4K Following · AI professor.
Deep Learning, AI alignment, ethics, policy, & safety.
Formerly Cambridge, Mila, Oxford, DeepMind, ElementAI, UK AISI.
AI is a really big deal.
50K Followers · 3K Following · AI alignment + LLMs at Anthropic. On leave from NYU. Views not employers'. No relation to @s8mb. I think you should join @givingwhatwecan.
20K Followers · 9K Following · Programme Director @ARIA_research | accelerate mathematical modelling with AI and categorical systems theory » build safe transformative AI » cancel heat death
18K Followers · 1K Following · Hanging out with Claude, improving its behavior, and building tools to support that @AnthropicAI 😁
prev: @open_phil @googlebrain @openai (@microcovid)
7K Followers · 367 Following · Dedicated to the protection and thriving of sentient beings. PhD in evo bio.🔸
Executive Director of @PauseAIUS. Opinions not necessarily those of the org.
5K Followers · 2K Following · Research Scientist (Frontier Planning) at @GoogleDeepMind.
Research Affiliate @Cambridge_Uni @CSERCambridge & @LeverhulmeCFI.
All views my own.
10K Followers · 797 Following · Thinking about whether AI will destroy the world at https://t.co/pMilDvd4ya. DM or email for media requests. Feedback: https://t.co/zGAm1i7SKH
43 Followers · 194 Following · Building Beauvette: The first system to codify human taste and trust into infrastructure. Mom · Sports fanatic · Niners, Dubs, Giants, Cuse
3K Followers · 2K Following · Resident senior fellow at the Australian Strategic Policy Institute
Convenor of the Sydney Dialogue
https://t.co/WPIJr48Z9f
https://t.co/nhrNN6k2zT
@ASPI_org
38 Followers · 262 Following · CSE PhD student @UCSC. Research interests in Human-Centered #XAI (#hcxai), Neurosymbolic AI & Law (#NeSy). Ex-Software Engineer. Views (and the dog) are my own
228 Followers · 3K Following · "The first to coin the term geocryptography to analyze the crypto ecosystem."⛩️⚔Bushido Bulls member⛩️⚔ Here we use dialectics to analyze.
2K Followers · 170 Following · I do AGI Safety research. https://t.co/CBsX51tA39. Once I was swiss chard for Halloween. Once Bill Clinton elbowed me in the face.
48 Followers · 313 Following · ∆ PhD student in Philosophy/Mathematical Logic at the Centre for Logic, Epistemology, and the History of Science (CLE/Brazil)
∆ Liberty is my guiding star
EN/PT
316 Followers · 2K Following · Christian † - 2 Cor 5:14-15 - All the good in me is by God's grace. And on the way to more... Normal person with a talent for IT (.NET and PHP programmer, etc.)
1.2M Followers · 279 Following · We’re a team of scientists, engineers, ethicists and more, committed to solving intelligence, to advance science and benefit humanity.
494K Followers · 152 Following · Nobel Laureate. Co-Founder & CEO @GoogleDeepMind - working on AGI. Solving disease @IsomorphicLabs. Trying to understand the fundamental nature of reality.
11K Followers · 2K Following · Knowing things is a solved problem. Getting along is not. Working on AI, media, and inter-group conflict @CHAI_Berkeley. Got here from computational journalism.
80K Followers · 278 Following · Student of causal inference, human reasoning, and history of ideas, all viewed through the sharp lens of artificial intelligence.
5K Followers · 476 Following · Researcher at the University of Oxford & UC Berkeley. Author of The Alignment Problem, Algorithms to Live By (w. Tom Griffiths), and The Most Human Human.
303 Followers · 578 Following · Researcher in multi-agent RL and Cooperative AI. Postdoc @FLAIR_Ox. PhD from @safe_trusted_ai. ex intern @CHAI_Berkeley
https://t.co/vqMmK1bSvz
2K Followers · 201 Following · Senior research manager at MATS: https://t.co/Dj9HNhMdoJ
Want to usher in an era of human-friendly superintelligence, don't know how.
5K Followers · 562 Following · Professor, Institute of Technology and Humanity, and Leverhulme Centre for the Future of Intelligence, University of Cambridge
3K Followers · 525 Following · Independent think tank focusing on transforming resilience to extreme AI and biological risks - both in the UK and internationally.
2K Followers · 162 Following · I do AI Alignment Research. Currently at @METR_Evals on leave from my PhD at UC Berkeley’s @CHAI_berkeley. Opinions are my own.
422 Followers · 365 Following · Computer Science student @GATech, excited about building positive & equitable futures, sci-fi, & dogs. Previously @Lyft, health policy, @Bain. All views my own
1.6M Followers · 2K Following · Founder and CEO, O'Reilly Media. Watching the alpha geeks, sharing their stories, helping the future unfold. Didn't pay for a blue check, cannot make it go away
4K Followers · 5K Following · Over at @gretchenkrueger.bsky.social. Research Fellow @BKCHarvard. Previously @openai @ainowinstitute @nycedc. Views are yours, of my tweets. #isagiwhatwewant
2K Followers · 3K Following · Senior Research Fellow at @law_ai_ | Associate Fellow @LeverhulmeCFI | author of 'Architectures of Global AI Governance' (OUP, 2025)
151K Followers · 37 Following · Known as Mad Max for my unorthodox ideas and passion for adventure, my scientific interests range from artificial intelligence to the ultimate nature of reality
5K Followers · 761 Following · AI policy and alignment; integrating law, economics & computer science to build normatively competent AI that knows how to play well with humans
29K Followers · 1K Following · AI, national security, China. Part of the founding team at @CSETGeorgetown (opinions my own). Author of Rising Tide on Substack: https://t.co/LKAoyL00iB
5K Followers · 953 Following · Helping society anticipate and address tomorrow's information security challenges, in order to amplify and extend the upside of the digital revolution.