This paper is interesting from the perspective of metascience, because it's a serious attempt to empirically study why LLMs behave in certain ways and differently from each other. A serious attempt attacks all exposed surfaces from all angles instead of being attached to some…
New Anthropic research: Why do some language models fake alignment while others don't?
Last year, we found a situation where Claude 3 Opus fakes alignment.
Now, we’ve done the same analysis for 25 frontier LLMs—and the story looks more complex.
🧵NEW RESEARCH: Interested in whether R1 or GPT 4.5 fake their alignment? Want to know the conditions under which Llama 70B alignment fakes? Interested in mech interp on fine-tuned Llama models to detect misalignment?
If so, check out our blog! 👀 lesswrong.com/posts/Fr4QsQT5…