Yanchen Liu @_yanchenliu

PhD Student @MIT | Previously: @Harvard, @stanfordnlp, @TU_Muenchen and @LMU_Muenchen

liuyanchen1015.github.io

Joined November 2022

Tweets: 145

Followers: 139

Following: 340

Likes: 337

Mohsen Fayyaz @mohsen_fayyaz

3 weeks ago

🚨 You can bypass ALL safety guardrails of GPT-OSS-120B 🚨❗🤯 How? By detecting behavior-associated experts and switching them on/off. 📄 Steering MoE LLMs via Expert (De)Activation 🔗 arxiv.org/abs/2509.09660 🧵👇

5 23 127 35K 95

Yanzhe Zhang @StevenyzZhang

2 months ago

Soon, AI agents will act for us—collaborating, negotiating, and sharing data. But can they truly protect our privacy? We simulate privacy-critical scenarios, using alternating search to evolve attacks and defenses, uncovering severe vulnerabilities and building protections.

2 27 78 17K 39

Maksym Andriushchenko @maksym_andr

2 months ago

🚨 Incredibly excited to share that I'm starting my research group focusing on AI safety and alignment at the ELLIS Institute Tübingen and Max Planck Institute for Intelligent Systems in September 2025! 🚨 Hiring. I'm looking for multiple PhD students: both those able to start…

74 89 814 98K 294

Will Held @WilliamBarrHeld

2 months ago

Want to talk to an expert on AI x Cyber security? Well, unfortunately @StevenyzZhang isn't here due to visa issues... So instead you'll have to chat with me about his amazing work at poster 311 in Hall X4!

2 16 72 8K 9

Weiyan Shi@ICLR and CHI @shi_weiyan

2 months ago

💥New Paper💥 #LLMs encode harmfulness and refusal separately! 1️⃣We found a harmfulness direction 2️⃣The model internally knows a prompt is harmless, but still refuses it🤯 3️⃣Implication for #AI #safety & #alignment? Let’s analyze the harmfulness direction and use Latent Guard 🛡️

Jiachen Zhao @jcz12856876

2 months ago

6 16 70 27K 36

4 21 149 17K 61

Weiyan Shi@ICLR and CHI @shi_weiyan

2 months ago

We analyzed different #jailbreaking methods. - They suppress the refusal but did NOT change the models' judgements on harmfulness - (except for some cases in our persuasive jailbreaker) 🤯The model **knows** internally that a prompt is harmful, yet still accepts it🤯

Jiachen Zhao @jcz12856876

2 months ago

1 0 1 2K 0

0 1 13 2K 3

Yijia Shao @EchoShao8899

4 months ago

🚨 70 million US workers are about to face their biggest workplace transition due to AI agents. But nobody asks them what they want. While AI races to automate everything, we took a different approach: auditing what workers want vs. what AI can do across the US workforce.🧵

13 137 669 108K 720

Cas (Stephen Casper) @StephenLCasper

5 months ago

🧵There is a lot of conjecture about whether LLMs need to be trained with examples of harmful data in order to be more robust to exhibiting that harmful behavior. I think it probably depends. 🧵

2 2 13 1K 7

Yijia Shao @EchoShao8899

5 months ago

Super excited to participate in @OpenAI Security Research Conference to talk about our PrivacyLens project and some recent exploration. I will be around from 5/1 to 5/2. DMs are open if you want to chat about agents/human-in-the-loop/sandboxing! events.openai.com/oaisecurity

1 6 58 8K 10

rowan @rowankwang

5 months ago

New Anthropic Alignment Science blog post: Modifying LLM Beliefs with Synthetic Document Finetuning We study a technique for systematically modifying what AIs believe. If possible, this would be a powerful new affordance for AI safety research.

19 46 350 73K 283

Xiangyu Qi @xiangyuqi_pton

5 months ago

Thrilled to know that our paper, `Safety Alignment Should be Made More Than Just a Few Tokens Deep`, received the ICLR 2025 Outstanding Paper Award. We sincerely thank the ICLR committee for awarding one of this year's Outstanding Paper Awards to AI Safety / Adversarial ML.…

ICLR 2026 @iclr_conf

5 months ago

4 26 155 107K 88

20 32 353 44K 113

Peter Henderson @PeterHndrsn

5 months ago

Very excited that our work, "Safety Alignment Should be Made More Than Just a Few Tokens Deep" was recognized for an Outstanding Paper Award at #ICLR2025! We hope this is a step forward in improving and understanding robustness of language model alignment. It was great working…

ICLR 2026 @iclr_conf

5 months ago

4 26 155 107K 88

3 11 106 9K 20

Prateek Mittal @prateekmittal_

5 months ago

Delighted to share that two papers from our group @EPrinceton got recognized by the @iclr_conf award committee. Our paper, "Safety Alignment Should be Made More Than Just a Few Tokens Deep", received the ICLR 2025 Outstanding Paper Award. This paper showcases that many AI…

ICLR 2026 @iclr_conf

5 months ago

4 26 155 107K 88

9 8 110 57K 16

Anthropic @AnthropicAI

5 months ago

New Anthropic research: AI values in the wild. We want AI models to have well-aligned values. But how do we know what values they’re expressing in real-life conversations? We studied hundreds of thousands of anonymized conversations to find out.

74 287 2K 275K 877

Kristina Nikolić @NKristina01_

5 months ago

Congrats, your jailbreak bypassed an LLM’s safety by making it pretend to be your grandma! But did the model actually give a useful answer? In our new paper we introduce the jailbreak tax — a metric to measure the utility drop due to jailbreaks.

7 26 203 41K 97

Tom Everitt @tom4everitt

6 months ago

What if LLMs are sometimes capable of doing a task but don't try hard enough to do it? In a new paper, we use subtasks to assess capabilities. Perhaps surprisingly, LLMs often fail to fully employ their capabilities, i.e. they are not fully *goal-directed* 🧵

22 45 236 39K 96

Anthropic @AnthropicAI

6 months ago

New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.

180 1K 9K 1.5M 5K

Furong Huang @furongh

6 months ago

🔒 Can we build LLMs that are truly safe—without falling into an endless cycle of jailbreaks and patches? In this two-part thread, I dive into: 1️⃣ Adaptive defenses that respond in real time 2️⃣ Why some defenses may backfire and create new risks 👇

3 14 55 11K 37

Boaz Barak @boazbaraktcs

8 months ago

Wrote a blog post with some personal thoughts on AI safety. windowsontheory.org/2025/01/24/six…

13 45 263 59K 273

DeepSeek @deepseek_ai

8 months ago

🚀 DeepSeek-R1 is here! ⚡ Performance on par with OpenAI-o1 📖 Fully open-source model & technical report 🏆 MIT licensed: Distill & commercialize freely! 🌐 Website & API are live now! Try DeepThink at chat.deepseek.com today! 🐋 1/n