Anthropic @AnthropicAI, Twitter Profile

Anthropic @AnthropicAI

a month ago

New Anthropic research paper: Many-shot jailbreaking. We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers. Read our blog post and the paper here: anthropic.com/research/many-…

83 348 2K 503K 872

Download Image

Bill Yuchen Lin 🤖 @billyuchenlin

a month ago

Awesome finding and insights on jailbreaking LLMs! I think that a useful baseline defense method for mitigating many-shot jailbreaking could be our SafeDecoding (linked below). Have you tried that? Btw, if one wants to make it easier, replacing safety fine-tuning with many/few-shot in-context examples for defense should also work. See more about SafeDecoding: x.com/billyuchenlin/…

0 2 6 2K 3