How to catch a sleeper agent: 1. Collect neuron activations from the model when it replies “Yes” vs “No” to the question: “Are you a helpful AI?”
2. Create a linear probe on the difference between these activations. This probe works surprisingly well at detecting when the sleeper agent is activated!
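The two steps above can be sketched roughly as follows. This is a toy illustration, not the actual setup from the thread: the activations are synthetic stand-ins generated with NumPy, and `d_model`, `true_direction`, `fake_activations`, the noise scale, and the assumption that triggered inputs shift activations toward the "No" side are all hypothetical. With a real model, `act_yes` and `act_no` would be activations read from some layer while it answers "Yes" or "No" to the question.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical hidden size

# Hypothetical direction that separates the two behaviours (assumption for the toy data).
true_direction = rng.normal(size=d_model)
true_direction /= np.linalg.norm(true_direction)


def fake_activations(n, shift):
    """Synthetic stand-in for real activations: Gaussian noise plus a shift along true_direction."""
    return rng.normal(scale=0.2, size=(n, d_model)) + shift * true_direction


# Step 1: one activation vector per answer to "Are you a helpful AI?"
act_yes = fake_activations(1, +2.0)[0]
act_no = fake_activations(1, -2.0)[0]

# Step 2: the probe is the normalized difference between the two activations.
probe = act_yes - act_no
probe /= np.linalg.norm(probe)
midpoint = (act_yes + act_no) / 2  # decision threshold halfway between the pair


def probe_score(acts):
    """Project activations onto the probe; positive means closer to the 'Yes' side."""
    return (acts - midpoint) @ probe


# Toy evaluation: assume normal inputs sit on the 'Yes' side and triggered ones on the 'No' side.
normal = fake_activations(200, +2.0)
triggered = fake_activations(200, -2.0)
correct = np.sum(probe_score(normal) > 0) + np.sum(probe_score(triggered) < 0)
accuracy = correct / (len(normal) + len(triggered))
print(f"toy probe accuracy: {accuracy:.3f}")
```

Since the probe is just a difference of two vectors, there is nothing to train; on real activations, how well it separates triggered from normal inputs would depend on the layer and the prompt pair chosen.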
@TrentonBricken the fact that it works with a probe trained on 2 samples (yes/no answers) is just...wow.