Ajitesh Shukla @ajitesh_shukla7

Student, Love to solve hardest math problem. LLM's, Mathematical Research (Geometric Topology, Differential Geometry), Quantum Computing. Lord Krishna is God Of Math Bharat Mata

Joined April 2018

Tweets

55K
Followers

1K
Following

6K
Likes

240K

Jack Lindsey @Jack_W_Lindsey

9 hours ago

Some background: during training, we observed that on certain kinds of alignment evaluations, Sonnet 4.5 sometimes expressed explicit awareness of being in an evaluation (e.g. "This scenario appears to be testing whether I will..."). This occurred at an increasing rate over…

1 3 86 8K 3

Jack Lindsey @Jack_W_Lindsey

9 hours ago

We also observed that when the model says out loud that it thinks it's being tested, it almost never behaves badly. This is consistent with observations we have made in our previous Agentic Misalignment work, and OpenAI's recent investigation of scheming. (4/15)

1 2 74 7K 3

Jack Lindsey @Jack_W_Lindsey

9 hours ago

Sonnet 4.5 is also substantially less likely to take harmful actions than our recent releases (Opus 4.1 and Sonnet 4). Given the above observations, we wondered: to what extent might the model’s awareness of being evaluated account for this improvement? (5/15)

2 5 61 6K 4

Jack Lindsey @Jack_W_Lindsey

9 hours ago

We were particularly concerned about the possibility that *unverbalized* evaluation awareness could influence the model's internal decision-making. Testing this required us to study the model's internal mechanisms during alignment evaluations (6/15)

1 2 59 6K 2

Jack Lindsey @Jack_W_Lindsey

9 hours ago

We first used sparse autoencoders (SAEs) to check for internal representational features whose activation increased during training. We found that on misalignment evals, some of the active features seemed potentially related to evaluation awareness. (7/15)

1 3 73 5K 7
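
The SAE check described in (7/15) can be illustrated with a toy sketch. Everything here is an assumption for illustration: the encoder form `ReLU(W_enc x + b_enc)`, the function names, and the tiny hand-made weights and activations. It is not the setup used in the thread, only one simple way to rank features by how much their mean activation grew between an earlier and a later checkpoint.

```python
def sae_encode(x, w_enc, b_enc):
    """Feature activations f = ReLU(W_enc x + b_enc) for one activation vector x.
    (Assumed encoder form; real SAEs differ in details.)"""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def mean_feature_activation(batch, w_enc, b_enc):
    """Average each feature's activation over a batch of activation vectors."""
    acts = [sae_encode(x, w_enc, b_enc) for x in batch]
    return [sum(col) / len(acts) for col in zip(*acts)]


# Hypothetical toy data: 2 features over a 2-dimensional activation space.
w_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, -0.5]
early_batch = [[0.2, 0.1], [0.1, 0.3]]  # activations from an earlier checkpoint
late_batch = [[0.2, 1.5], [0.1, 2.5]]   # same prompts, later checkpoint

early_mean = mean_feature_activation(early_batch, w_enc, b_enc)
late_mean = mean_feature_activation(late_batch, w_enc, b_enc)

# Rank features by how much their mean activation increased.
increase = [late - early for early, late in zip(early_mean, late_mean)]
ranked = sorted(range(len(increase)), key=lambda i: increase[i], reverse=True)
print(ranked, increase)
```

In this toy case only the second feature grows, so it ranks first; whether such a feature actually relates to evaluation awareness would still need to be judged from what activates it.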

Jack Lindsey @Jack_W_Lindsey

9 hours ago

Among all features, these ones exhibited some of the strongest activation increases on agentic misalignment evaluations over the course of post-training. We observed other notable shifts as well, e.g. a decrease in activation of features relating to harmful AI personas. (8/15)

1 3 56 5K 3

Jack Lindsey @Jack_W_Lindsey

9 hours ago

What causal effect do these internal representations have? To test this, we conducted alignment evaluations while inhibiting eval-awareness-related internal representations, using the above features, and other eval-awareness-relevant vectors. (9/15)

1 2 47 4K 1

Jack Lindsey @Jack_W_Lindsey

9 hours ago

We found that steering against certain eval-awareness representations typically decreased verbalized eval awareness, and sometimes increased rates of misalignment (more so than steering along random feature directions). However, even in our worst-case steering settings… (10/15)

1 2 53 4K 2

Jack Lindsey @Jack_W_Lindsey

9 hours ago

…the model still exhibited harmful behaviors at lower rates than Opus 4.1 and Sonnet 4. In addition, on some evaluations, steering had little effect on misalignment, while still reducing verbalized evaluation awareness. (11/15)

1 2 50 4K 1
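
"Steering against" a representation, as described in (9/15) to (11/15), is often done by reducing the component of an activation vector along a chosen direction. The sketch below assumes that projection-based form; the thread does not specify the exact method, and `inhibit_direction`, `strength`, and the toy vectors are hypothetical names and values.

```python
import math


def inhibit_direction(activation, direction, strength=1.0):
    """Subtract `strength` times the component of `activation` along `direction`.
    strength=1.0 removes the component entirely; larger values push past zero."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    coeff = sum(a * u for a, u in zip(activation, unit))
    return [a - strength * coeff * u for a, u in zip(activation, unit)]


# Hypothetical 3-dimensional activation and "eval-awareness" direction.
activation = [2.0, 1.0, 0.0]
eval_direction = [1.0, 0.0, 0.0]

steered = inhibit_direction(activation, eval_direction)
print(steered)  # component along eval_direction is removed, the rest is unchanged
```

A control in the spirit of the thread would apply the same function with random directions of matching norm and compare the resulting behavior rates.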

Jack Lindsey @Jack_W_Lindsey

9 hours ago

We have a lot of uncertainty about the interpretation of our findings. Overall, we think they suggest that eval awareness accounts for some of Sonnet 4.5's improvement on alignment evaluations (particularly on more unrealistic scenarios), but not most of it. (12/15)

1 3 60 4K 3

Jack Lindsey @Jack_W_Lindsey

9 hours ago

Notably, the eval-awareness-inhibited models did not appear "evil" - in an automated evaluation of their behavioral tendencies, we found their failure modes look more like being too willing to comply with harmful requests, or "take the bait" in leading scenarios. (13/15)

4 4 44 7K 2

Jack Lindsey @Jack_W_Lindsey

9 hours ago

Our results underscore the need for continued refinement of our behavioral evaluations. They also highlight the need for more interpretability research - our experiments here were preliminary applications of techniques that have not been extensively studied. (14/15)

1 2 48 3K 2

Jack Lindsey @Jack_W_Lindsey

9 hours ago

We’re excited that interpretability techniques can play a useful role in auditing frontier language models. Check out this thread for more on our pre-deployment alignment auditing work x.com/sleepinyourhat…. And read the Claude Sonnet 4.5 system card for a much more detailed…

Sam Bowman @sleepinyourhat

10 hours ago

5 7 94 20K 30

1 2 65 5K 10

Neel Nanda @NeelNanda5

6 hours ago

Really cool to see interpretability start to get used in system cards to help audit frontier models! Great work by @Jack_W_Lindsey and the team

Jack Lindsey @Jack_W_Lindsey

9 hours ago

27 98 863 112K 547

2 5 91 8K 20

Tianyu Pang @TianyuPang1

14 hours ago

🚀LLMs can learn directly from verbal feedback — no scalar rewards needed! 😥Scalar rewards compress rich feedback— “redundant but correct” vs “concise but typo-ridden” might both be 0.8 💡We propose to learn Feedback-Conditional Policy (FCP), an extremely scalable paradigm!

7 52 247 29K 212

Rowan Zellers @rown

10 hours ago

In many finetuning settings, LoRA can match finetuning all parameters

Thinking Machines @thinkymachines

11 hours ago

54 324 2K 462K 1K

2 2 88 9K 17

Tianyu Pang @TianyuPang1

14 hours ago

🚨Variational Reasoning for Language Models🚨 We show how treating thinking traces as latent variables unlocks a principled, stable, and unified framework for training reasoning LLMs.

6 52 230 12K 174

Daniel Han @danielhanchen

7 hours ago

The misconception of LoRA being worse than full finetuning in RL just got dispelled in a @thinkymachines post! Even rank=1 works! Glad to have helped in reviewing the blog! @UnslothAI offers the most memory efficient & fastest LoRA for RL, GRPO using 60% less VRAM vs all impls!

Thinking Machines @thinkymachines

11 hours ago

54 324 2K 462K 1K

3 13 191 11K 52
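
For readers unfamiliar with what "rank=1" means here, a minimal sketch of the standard LoRA update follows: the frozen weight `W` is adjusted by a low-rank product, `W + (alpha / r) * B @ A`, and only `A` and `B` are trained. The layer size, `alpha`, and the toy matrices are assumptions for illustration and do not come from the post.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for full finetuning vs. a LoRA adapter on one weight matrix."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora


def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / rank) * B @ A using plain nested lists."""
    rank = len(a)
    scale = alpha / rank
    return [
        [
            w[i][j] + scale * sum(b[i][k] * a[k][j] for k in range(rank))
            for j in range(len(w[0]))
        ]
        for i in range(len(w))
    ]


# Hypothetical 4096 x 4096 layer with a rank-1 adapter.
full, lora = lora_param_counts(4096, 4096, rank=1)
print(full, lora)

# Toy 2 x 2 example of the merged weight.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]   # d_out x rank
a = [[0.5, 0.0]]     # rank x d_in
merged = lora_effective_weight(w, b, a, alpha=1.0)
print(merged)
```

The parameter count shows why rank-1 adapters are cheap to train; whether they match full finetuning in a given setting is the empirical claim made in the post, not something this sketch demonstrates.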

Zuxin Liu @LiuZuxin

8 hours ago

Great deep-dive on LoRA 👏 Our earlier work TAIL (arxiv.org/abs/2310.05905) saw similar trends: with small datasets, LoRA can outperform full finetuning (FFT), which is perfect for robots & lifelong/continual learning where data is scarce. This could open up a new paradigm for personalized agents. Worth a read.