New Anthropic research paper: Many-shot jailbreaking. We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers. Read our blog post and the paper here: anthropic.com/research/many-…
83
348
2K
503K
872
Download Image
Awesome finding and insights on jailbreaking LLMs! I think that a useful baseline defense method for mitigating many-shot jailbreaking could be our SafeDecoding (linked below). Have you tried that? Btw, if one wants to make it easier, replacing safety fine-tuning with many/few-shot in-context examples for defense should also work. See more about SafeDecoding: x.com/billyuchenlin/…