📜 Paper on new pretraining paradigm: Synthetic Bootstrapped Pretraining
SBP goes beyond next-token supervision within a single document: it leverages inter-document correlations to synthesize new training data, with no teacher model needed. Validation: a 3B model pretrained from scratch on 1T tokens. 🧵
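Purely as a reading aid, here is a minimal sketch of how such a pipeline could look, assuming the recipe is: pair semantically related documents, finetune the base LM as a synthesizer that maps one document of a pair to the other, then sample new documents to mix into pretraining. `pair_documents`, `lm.train_step`, and `lm.sample` are hypothetical names, not the paper's API.

```python
# Hedged sketch of a Synthetic Bootstrapped Pretraining loop; not the paper's code.
import numpy as np
from typing import List, Tuple

def pair_documents(corpus: List[str], embed, threshold: float = 0.8) -> List[Tuple[str, str]]:
    """Pair documents whose embeddings are similar (hypothetical retrieval step)."""
    vecs = np.stack([embed(d) for d in corpus])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    return [(corpus[i], corpus[j])
            for i in range(len(corpus))
            for j in range(i + 1, len(corpus))
            if sims[i, j] >= threshold]

def train_synthesizer(lm, pairs):
    """Finetune the same base LM to model p(doc_b | doc_a); no external teacher."""
    for doc_a, doc_b in pairs:
        lm.train_step(prompt=doc_a, target=doc_b)  # hypothetical API
    return lm

def synthesize_corpus(lm, corpus: List[str], samples_per_doc: int = 1) -> List[str]:
    """Sample fresh documents conditioned on real ones; these join the pretraining mix."""
    return [lm.sample(prompt=doc)                  # hypothetical API
            for doc in corpus for _ in range(samples_per_doc)]
```

Pretraining then continues on the real corpus plus the synthetic set; the mixing ratio is a tuning knob.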
I had the chance to join the TWIML podcast to talk about my group’s ICML 2025 papers! We dug into the surprising limitations of modern pre-training: where it breaks down, why it matters, and what new directions might help us move past these barriers.
Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that will best leverage ♾ compute
We find simple recipes that improve the asymptote of compute scaling laws, making them roughly 5x more data-efficient and offering better performance with sufficient compute
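To unpack "improving the asymptote": under a saturating scaling law L(D) = E + A * D^(-alpha), the irreducible term E is the asymptote, and lowering it means the same loss is reached with far less data. The constants below are invented purely for illustration; only the shape of the argument matters.

```python
# Toy illustration of "lower asymptote => data efficiency"; constants are invented.

def loss(D, E, A, alpha):
    return E + A * D ** (-alpha)             # saturating data scaling law

baseline = dict(E=1.80, A=50.0, alpha=0.3)   # made-up baseline recipe
improved = dict(E=1.74, A=50.0, alpha=0.3)   # made-up recipe with a lower asymptote

D = 1e9                                      # tokens seen by the baseline
target = loss(D, **baseline)

# Data the improved recipe needs to match the baseline's loss:
# solve E' + A * D'**(-alpha) = target  =>  D' = (A / (target - E'))**(1/alpha)
D_improved = (improved["A"] / (target - improved["E"])) ** (1 / improved["alpha"])
print(f"data-efficiency factor ~ {D / D_improved:.1f}x")  # ~4.8x with these toy numbers
```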
Researchers are working on ways to prevent large language models (LLMs) from simply memorizing information instead of truly learning. They found that removing memorized parts directly can harm the model's ability to learn new things. Their solution, called MemSinks, creates…
I had early sneak peeks into this exciting work on rethinking pretraining—credits to @gaurav_ghosal, my constant buddy through countless late nights at CMU. It’s been a blast building pretraining frameworks and sharing insights. @gaurav_ghosal’s energy is absolutely unmatched!
One thing years of memorization research has made clear: unlearning is fundamentally hard. Neurons are polysemantic & concepts are massively distributed. There’s no clean 'delete'.
We need architectures that are "unlearnable by design".
Introducing Memorization Sinks 🛁⬇️
There’s been a lot of work on unlearning in LLMs, trying to erase memorization without hurting capabilities — but we haven’t seen much success.
❓What if unlearning is actually doomed from the start?
👇This thread explains why and how *memorization sinks* offer a new way forward.
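Not from the thread, but one way to make "unlearnable by design" concrete. This is a rough sketch under my own assumptions, not the authors' code: hidden units split into a shared partition and a sink partition, a hash of the document id picks which sink units fire during training, and all sinks are zeroed at inference so document-specific memorization can be dropped cleanly.

```python
import hashlib
import torch
import torch.nn as nn

class MemSinkMLP(nn.Module):
    """Hedged sketch of a memorization-sink MLP block (illustrative, not the paper's code)."""

    def __init__(self, d_model=512, d_shared=1536, d_sink=512, sinks_per_doc=64):
        super().__init__()
        self.d_shared, self.d_sink = d_shared, d_sink
        self.sinks_per_doc = sinks_per_doc
        self.up = nn.Linear(d_model, d_shared + d_sink)
        self.down = nn.Linear(d_shared + d_sink, d_model)

    def sink_mask(self, doc_id: str) -> torch.Tensor:
        # Deterministically pick this document's sink units from a hash of its id.
        seed = int(hashlib.sha256(doc_id.encode()).hexdigest(), 16) % (2**31)
        g = torch.Generator().manual_seed(seed)
        idx = torch.randperm(self.d_sink, generator=g)[: self.sinks_per_doc]
        mask = torch.zeros(self.d_sink)
        mask[idx] = 1.0
        return mask

    def forward(self, x: torch.Tensor, doc_id: str = None) -> torch.Tensor:
        h = torch.relu(self.up(x))
        shared, sink = h[..., : self.d_shared], h[..., self.d_shared:]
        if self.training and doc_id is not None:
            sink = sink * self.sink_mask(doc_id).to(sink.device)  # doc-specific sinks active
        else:
            sink = torch.zeros_like(sink)                         # sinks dropped at inference
        return self.down(torch.cat([shared, sink], dim=-1))
```

The point of the design: memorization gets a designated, per-document place to live, so "forgetting" a document is masking its sinks rather than hunting for a clean delete among polysemantic, distributed neurons.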
🚨 Super excited to finally share our Safety Pretraining work — along with all the artifacts (safe data, models, code)!
In this thread 🧵, I’ll walk through our journey — the key intermediate observations and lessons, and how they helped shape our final pipeline.
The new OpenAI paper “Why Language Models Hallucinate” is more like PR than research.
The claim that hallucinations arise because training/evaluation reward guessing over abstaining is decades-old (reject option classifiers, selective prediction).
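For readers who haven't met the prior art: a reject-option classifier is exactly this "abstain instead of guess" mechanism. A minimal textbook version (standard technique, not code from either line of work):

```python
import numpy as np

def predict_with_reject(probs: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Classic reject-option rule: answer only when confident, else abstain (-1).

    probs: (n_examples, n_classes) predicted class probabilities.
    """
    conf = probs.max(axis=1)
    preds = probs.argmax(axis=1)
    return np.where(conf >= threshold, preds, -1)

# If wrong answers cost more than abstaining, thresholding is optimal; if an eval
# scores wrong answers and abstentions identically (accuracy-only), guessing wins.
probs = np.array([[0.95, 0.05], [0.55, 0.45]])
print(predict_with_reject(probs))  # [ 0 -1]
```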
1/ Excited to share the first in a series of my research updates on LLM pretraining 🚀.
Our new work shows *distilled pretraining*, increasingly used to train deployable models, has trade-offs (objective sketched after the list below):
✅ Boosts test-time scaling
⚠️ Weakens in-context learning
✨ Needs tailored data curation
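For context, here is the generic distilled-pretraining objective: next-token cross-entropy mixed with a temperature-softened KL term pulling the student toward the teacher's token distribution. This is a sketch of the standard recipe; `alpha` and `T` are my placeholders, not the thread's settings.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=1.0, alpha=0.5):
    """Generic distillation objective (sketch; not the paper's exact loss).

    Shapes: logits (batch, seq, vocab), labels (batch, seq).
    """
    # Usual next-token cross-entropy against the hard labels.
    ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
    # KL from the teacher's softened token distribution to the student's.
    s = F.log_softmax(student_logits / T, dim=-1).flatten(0, 1)
    t = F.log_softmax(teacher_logits / T, dim=-1).flatten(0, 1)
    kl = F.kl_div(s, t, log_target=True, reduction="batchmean") * (T * T)
    return alpha * ce + (1 - alpha) * kl
```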
🤖 Some company just released a new set of open-weight LLMs well-suited for your production environment. However, you suspect that the models might be trained with backdoors or other hidden malicious behaviors. Is it still possible to deploy these models worry-free? (1/7)
@abitha___ will be presenting our work on training language models to predict further into the future beyond the next token and the benefits this objective brings.
x.com/gm8xx8/status/…
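As background on the objective: a common way to train a model to predict beyond the next token is to attach extra prediction heads, one per horizon. A hedged sketch under my assumptions (one linear head per horizon k = 1..K, equal loss weights; not necessarily the paper's exact setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHeads(nn.Module):
    """Sketch of a multi-token prediction objective (illustrative assumptions)."""

    def __init__(self, d_model: int, vocab: int, horizons: int = 4):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(horizons)])

    def loss(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        """hidden: (batch, seq, d_model) from the trunk; tokens: (batch, seq)."""
        total = torch.tensor(0.0, device=hidden.device)
        for k, head in enumerate(self.heads, start=1):
            # Position t predicts token t+k, so trim k positions off each end.
            logits = head(hidden[:, :-k])              # (B, S-k, V)
            target = tokens[:, k:]                     # (B, S-k)
            total = total + F.cross_entropy(logits.flatten(0, 1), target.flatten())
        return total / len(self.heads)
```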
In "Mind Your Step (by Step): Chain‑of‑Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse", we connect human "overthinking" insights to LLM reasoning, offering a new lens on when thinking‑out‑loud backfires.
📄 Read the full paper: arxiv.org/abs/2410.21333…
In "Mind Your Step (by Step): Chain‑of‑Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse", we connect human "overthinking" insights to LLM reasoning, offering a new lens on when thinking‑out‑loud backfires.
📄 Read the full paper: arxiv.org/abs/2410.21333…