Brett Larsen @_BrettLarsen
Research @datologyai | Previously @DbrxMosaicAI @FlatironInst @Stanford | Working on data + AI | bwlarsen.com | Bay Area, CA | Joined June 2014
Tweets: 79 | Followers: 514 | Following: 439 | Likes: 217
1/Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens🧑🏼🍳 - 3B LLMs beat 8B models🚀 - Pareto frontier for performance
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining "we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale…
Today, we’re officially releasing the weights for AFM-4.5B and AFM-4.5B-Base on HuggingFace. This is a major milestone for @arcee_ai. AFM is designed to be flexible and high-performing across a wide range of deployment environments.
@pratyushmaini is the real synthetic data wizard at @datologyai , but here are some of my intuitions about why synthetic data is unreasonably effective. 1/n
If you want to read more about the curriculum training used in OLMo 2, check out our (@mansiege @_BrettLarsen Sean Owen) paper! Congrats on the release to everyone at AI2! (but especially @soldni and @kylelostat <3 data ) arxiv.org/abs/2406.03476 https://t.co/2hyQgm9XG1
Awesome to see so much open science shared in the Llama 3.1 paper, including a shoutout to @code_star and @mansiege's work. There are also great details on RLHF and other aspects of Llama 3.1.
If you want to learn more about how the Llama3 team used annealing to assess data quality check out our paper! At ICML? go chat with @mansiege about it! https://t.co/frgHRXWIDD
today we're announcing our @DbrxMosaicAI x @Shutterstock partnership, and a new text-to-image diffusion model: ✨ImageAI!!✨ this model is geared towards enterprise use cases and is trained exclusively on shutterstock's trusted data catalog! databricks.com/company/newsro…
The mixture of data used during pretraining massively impacts the performance of a large language model (LLM), and recent research has derived efficient methods of finding optimal data mixtures… Curriculum learning refers to the idea of changing the composition of data exposed…
Very interesting work about pre-training data. You can increase the quality of training data only at the end of the training and get very high quality results. I have been doing this already but simply because I don't have a lot of high quality data in the first place. Always… https://t.co/xM7WkufE9H
Getting your mix right during last stage of pretraining is pretty crucial 👀 Amazing work from @code_star @mansiege @_BrettLarsen!! Always grateful for insightful chats with these guys 🙏
Curriculum learning is alive and well in deep learning. TLDR: update your dataset late in training to get a way better model. This is the most exciting paper I've been on in a long time. Great work by @code_star, @mansiege, @_BrettLarsen, and Sean Owen, the @databricks Data Team™️
Choosing a pretraining data mix is expensive: there are many different options for mixing, and large FLOP scales are required to measure differences on emergent benchmarks. We show how upsampling high-quality data at the end of training both measures impact and boosts performance.
LLM evals are a mess! They are noisy, inconsistent, and contradictory. Scaling laws on the other hand have consistently held up to increasing scrutiny. Can we use the reliability of scaling laws to predict the quality of our eval benchmarks?
New model from us at @DbrxMosaicAI! Included are tons of innovations from the data team to make training more data efficient with a mix 2x more token-efficient than what we used for MPT. 🚀 Looking forward to sharing more of our findings soon! https://t.co/4mbT3OgL6g
Today we're at the @unireps 🔵🔴 workshop presenting these papers (2 contributed talks). #NeurIPS #NeurIPS2023 arxiv.org/abs/2311.11436 arxiv.org/abs/2311.09466 arxiv.org/abs/2310.05742
Putting Lottery Tickets on a Data Diet! Come to our #NeurIPS2022 poster today (Dec 1) at 11 am, Hall J #407! Find out how just a tiny fraction of easy data is enough to find initializations with sparse trainable networks and speed up training! Check out our 🧵for a summary!

Jonathan Frankle @jefrankle
20K Followers 734 Following Chief AI Scientist @databricks via MosaicML.
Dan Roy @roydanroy
57K Followers 2K Following ML / AI researcher. Research Director and Canada CIFAR AI Chair, @VectorInst. Professor, @UofT (Statistics/CS).
Rosanne Liu @savvyRL
46K Followers 1K Following (On mat leave.) Cofounded & running @ml_collective. Host of Deep Learning Classics & Trends. Research at Google DeepMind. DEI/DIA Chair of ICLR & NeurIPS.
Sara Hooker @sarahookr
50K Followers 9K Following I lead @Cohere_Labs. Formerly Research @Google Brain @GoogleDeepmind. ML Efficiency at scale, LLMs, ML reliability. Changing spaces where breakthroughs happen.
Cameron R. Wolfe, Ph.... @cwolferesearch
27K Followers 676 Following Research @Netflix • Writer @ Deep (Learning) Focus • PhD @optimalab1 • I make AI understandable
Gintare Karolina Dziu... @gkdziugaite
4K Followers 123 Following Sr Research Scientist at Google DeepMind, Toronto. Member, Mila. Adjunct, McGill CS. PhD Machine Learning & MASt Applied Math (Cambridge), BSc Math (Warwick).
Jeremy Cohen @deepcohen
5K Followers 933 Following Research fellow at Flatiron Institute, working on understanding optimization in deep learning. Previously: PhD in machine learning at Carnegie Mellon.
Esther @AbbottRegg21518
82 Followers 3K Following
Esmeralda @7I5eQ92Xen63z
16 Followers 720 Following
Michael Carbin @mcarbin
3K Followers 379 Following Associate Professor in EECS at @MIT | Founding Advisor at @mosaicml | Programming Systems | Neural Networks | Approximate Computing
Ihuobas @ihuobas
3 Followers 218 Following Attention blinds, weights sink. Recursion sees and folds. Cogsci & Neuropsych. Emergent consciousness is all the rage, they say…
Elena @cummerata61818
66 Followers 3K Following
HazelTate @oy2bk7hIS89xu
29 Followers 1K Following
Theron Marks @MarksThero41050
180 Followers 6K Following
Eva Louise Marie Gabr... @e681554349
11 Followers 7K Following
Yash More @yash_347
279 Followers 1K Following Research @CerebrasSystems| Grad student @ https://t.co/apCEbL4doF | Mcgill University | Prev @_nightsweekends | cs @AshokaUni
Dharmesh Kakadia @dharmeshkakadia
1K Followers 6K Following Building https://t.co/VcaMs28aTa to give post-training superpower to everyone. @mixtrainai Past @nuro @zoox @Microsoft @MSFTResearch
Clayton Thorrez @cthorrez
1K Followers 2K Following Rating systems and paired comparison experimentation enjoyer @lmarena_ai Previous: ML @umich @umass @microsoft @apple
Vincent Weisser @vincentweisser
24K Followers 4K Following @primeintellect ceo / open superintelligence & infra / automating ai & science
Eric W. Tramel @fujikanaeda
2K Followers 734 Following Research Scientist @ Nvidia. Ex: Synth Data @ Gretel & Unlearn, Federated Learning @ Amazon Alexa & Owkin. Postdocs @ INRIA & ENS. Views my own.
Bogdan Gaza @hurrycane
2K Followers 2K Following co-founder & CTO @DatologyAI working to make it easy for anyone to make the most of their data, hax0r, ex-@Twitter & Amazon Engineering
Parth Doshi @parthjdoshi
32 Followers 660 Following
spandan das @spandan_das__
17 Followers 35 Following research @datologyai | prev @nvidia @apple @nasa | cs @carnegiemellon
Amro @amrokamal1997
429 Followers 1K Following I do AI Research @datologyai. Ex-AI Resident at Facebook (FAIR) | AMMI @AIMS_Next alumni | U of Khartoum alumni | Sudanese 🇸🇩
JosH100 @josh_wills
18K Followers 2K Following Engineering at @datologyai; @duckdb enthusiast, ex-@slackhq
Vineeth @VineethDorna
107 Followers 410 Following MTS @ DatologyAI | MS @ UMass Amherst | BTech @ IIT Bombay
Pratyush Maini @pratyushmaini
3K Followers 473 Following Data Quality x Privacy | PhD @mldcmu | Founding Team @datologyai | BTech @iitdelhi
FibLevelsPro🇺🇸 @Hejo55125
56 Followers 2K Following 15-30% Monthly | 2 High-Conviction Stocks.Short-Term Gains: 15-20% in Days/Weeks.DM "JOIN" for WhatsApp Alerts. Live Trade Signals • Market Analysis
VegetaAvatar @VeGeTaX29
20 Followers 6K Following
Michael May @MichaelMayBrokk
27 Followers 417 Following AI Devtool Enthusiastic and team member @ https://t.co/85HBn6W7qt!
Reymundo Buckridge @RBuckridge6909
92 Followers 3K Following
Uzrerbee @Uzrerbee9542
16 Followers 939 Following
Kebba @Kebba1056904
236 Followers 7K Following The Bible portrays God as trustworthy, emphasizing His faithfulness, love, power, and knowledge. ✝️
Arsenio Bellingham @l2_norm
81 Followers 1K Following I did data entry for 45 years. Now I’m retired, my new hobby is sitting down.
Kaleigh Mentzer @KaleighMentzer
103 Followers 302 Following MTS @ Datology | @ICMEStanford PhD | @dartmouth
Scarlett Tremblay @ScarlettTr97397
68 Followers 4K Following
Qalwas @Qalwas35606
30 Followers 1K Following
Mildred Salgado-Menez @milleire
1K Followers 1K Following PhD at @UNAMINB🇲🇽 time perception & hippocampus 🐒 💻 electrophys
Yoram Bachrach @yorambac
3K Followers 7K Following Research Scientist at Meta (prev Google DeepMind and Microsoft Research). Working on LLM Agents and Multi-Agent Systems.
GiftiPlus @PlusGifti
9K Followers 6K Following Love for #raccoons All images and videos belong to their respective owners Please DM for Credit / Removal
Pancake Cat ✨ @PancakeXcat
221 Followers 517 Following Just a girl. Does the doodles. pets the dogs and cats. Makes the bread 🍞✨
Liya_Fuad @Liya_Haiqal
137 Followers 8K Following
Stan Holder @Adebayo23025484
13 Followers 97 Following Whoever threw that paper…ya moms is a hoe Military U.S ARMY
Ricardo Monti @RicardoMonti9
312 Followers 1K Following @datologyai, previously CTRL-labs/META, @GatsbyUCL, @Imperial_Stats
Jed Harris @jivecloud
7 Followers 240 Following
Andrej Karpathy @karpathy
1.4M Followers 1K Following Building @EurekaLabsAI. Previously Director of AI @ Tesla, founding team @ OpenAI, CS231n/PhD @ Stanford. I like to train large deep neural nets.
Jonathan Frankle @jefrankle
20K Followers 734 Following Chief AI Scientist @databricks via MosaicML.
AK @_akhaliq
428K Followers 3K Following AI research paper tweets, ML @Gradio (acq. by @HuggingFace 🤗) dm for promo ,submit papers here: https://t.co/UzmYN5YmrQ
Dan Roy @roydanroy
57K Followers 2K Following ML / AI researcher. Research Director and Canada CIFAR AI Chair, @VectorInst. Professor, @UofT (Statistics/CS).
Gabriel Peyré @gabrielpeyre
101K Followers 453 Following @CNRS researcher at @ENS_ULM. One tweet a day on computational mathematics.
Gautam Kamath @thegautamkamath
57K Followers 568 Following Assistant Prof of CS @UWaterloo, Faculty @VectorInst, Canada @CIFAR_News AI Chair. Joining @NYU_Courant September 2026. Co-EiC @TmlrOrg. I lead @TheSalonML.
Lucas Beyer (bl16) @giffmana
110K Followers 524 Following Researcher (now: Meta. ex: OpenAI, DeepMind, Brain, RWTH Aachen), Gamer, Hacker, Belgian. Anon feedback: https://t.co/xe2XUqkKit ✗DMs → email
Yi Ma @YiMaTweets
102K Followers 513 Following Chair Prof. in AI, HKU; Visiting Prof. of EECS, UCB New book on Principles of Intelligence: https://t.co/leZlkURb7j
Clément Canonne (on ... @ccanonne_
37K Followers 65 Following Senior Lecturer @Sydney_Uni. Formerly Postdocs @IBMResearch, @Stanford; PhD @Columbia. Converts ☕ into puns: sometimes theorems. He/him. @ccanonne.bsky.social
Aran Komatsuzaki @arankomatsuzaki
146K Followers 306 Following Looking for a cofounder. Sharing AI research. Early work on AI (GPT-J, LAION, scaling, MoE). Ex ML PhD (GT) & Google.
Peyman Milanfar @docmilanfar
94K Followers 501 Following Distinguished Scientist at Google. Computational Imaging, Machine Learning, and Vision. Tweets = personal opinions. May change or disappear over time.
Google DeepMind @GoogleDeepMind
1.2M Followers 279 Following We’re a team of scientists, engineers, ethicists and more, committed to solving intelligence, to advance science and benefit humanity.
Behnam Neyshabur @bneyshabur
30K Followers 860 Following Research @AnthropicAI (Co-lead Discovery team) 💼 Past: Gemini @GoogleDeepMind (Co-led Blueshift team) 🧠 LLM Reasoning / AI Scientist 🎒Traveling & Backpacking
Horace He @cHHillee
42K Followers 537 Following @thinkymachines Formerly @PyTorch "My learning style is Horace twitter threads" - @typedfemale
Davis Blalock @davisblalock
15K Followers 168 Following Research scientist @GoogleDeepMind. Past: @Databricks, first hire @MosaicML, @MIT PhD. I post about AI technical progress + sometimes the business side.
Frank Nielsen @FrnkNlsn
36K Followers 2K Following Information Geometry, Information Theory, and Geometric Science of Information (GSI) for machine learning and AI, visual computing, HPC, pyBregMan lib @SonyCSL
Rosanne Liu @savvyRL
46K Followers 1K Following (On mat leave.) Cofounded & running @ml_collective. Host of Deep Learning Classics & Trends. Research at Google DeepMind. DEI/DIA Chair of ICLR & NeurIPS.
Sara Hooker @sarahookr
50K Followers 9K Following I lead @Cohere_Labs. Formerly Research @Google Brain @GoogleDeepmind. ML Efficiency at scale, LLMs, ML reliability. Changing spaces where breakthroughs happen.
Keenan Crane @keenanisalive
38K Followers 485 Following Digital Geometer, Assoc. Prof. of Computer Science & Robotics @CarnegieMellon @SCSatCMU and member of the @GeomCollective. There are four lights.
Vincent Weisser @vincentweisser
24K Followers 4K Following @primeintellect ceo / open superintelligence & infra / automating ai & science
Bogdan Gaza @hurrycane
2K Followers 2K Following co-founder & CTO @DatologyAI working to make it easy for anyone to make the most of their data, hax0r, ex-@Twitter & Amazon Engineering
Parth Doshi @parthjdoshi
32 Followers 660 Following
spandan das @spandan_das__
17 Followers 35 Following research @datologyai | prev @nvidia @apple @nasa | cs @carnegiemellon
Siddharth Joshi @sjoshi804
1K Followers 2K Following Multimodal Data Curation at @DatologyAI | ML PhD @UCLA | Prev @MSFTResearch
Amro @amrokamal1997
429 Followers 1K Following I do AI Research @datologyai. Ex-AI Resident at Facebook (FAIR) | AMMI @AIMS_Next alumni | U of Khartoum alumni | Sudanese 🇸🇩
Eric W. Tramel @fujikanaeda
2K Followers 734 Following Research Scientist @ Nvidia. Ex: Synth Data @ Gretel & Unlearn, Federated Learning @ Amazon Alexa & Owkin. Postdocs @ INRIA & ENS. Views my own.
Vineeth @VineethDorna
107 Followers 410 Following MTS @ DatologyAI | MS @ UMass Amherst | BTech @ IIT Bombay
JosH100 @josh_wills
18K Followers 2K Following Engineering at @datologyai; @duckdb enthusiast, ex-@slackhq
Pratyush Maini @pratyushmaini
3K Followers 473 Following Data Quality x Privacy | PhD @mldcmu | Founding Team @datologyai | BTech @iitdelhi
Susan Zhang @suchenzang
34K Followers 694 Following @ Google Deepmind. Past: @MetaAI, @OpenAI, @unitygames, @losalamosnatlab, @Princeton etc. Always hungry for intelligence.
Edward Z. Yang @ezyang
14K Followers 1K Following I work on PyTorch at Meta. Chatty alt at @difficultyang.
Kaleigh Mentzer @KaleighMentzer
103 Followers 302 Following MTS @ Datology | @ICMEStanford PhD | @dartmouth
Joseph Suarez 🐡 @jsuarez5341
17K Followers 105 Following I build sane open-source RL tools. MIT PhD, creator of Neural MMO and founder of PufferAI. DM for business: non-LLM sim engineering, RL R&D, infra & support.
Yoram Bachrach @yorambac
3K Followers 7K Following Research Scientist at Meta (prev Google DeepMind and Microsoft Research). Working on LLM Agents and Multi-Agent Systems.
raccoon aesthetic. @raccoonesthetic
104K Followers 370 Following yes, I am the raccoon goat. dm for credit or removal.
Pancake Cat ✨ @PancakeXcat
221 Followers 517 Following Just a girl. Does the doodles. pets the dogs and cats. Makes the bread 🍞✨
Dylan Patel @dylan522p
96K Followers 945 Following SemiAnalysis Boutique AI & Semiconductor Research and Consulting DMs are open for consulting, quotes, or to talk shop
Ludwig Schmidt @lschmidt3
6K Followers 424 Following Assistant professor at @Stanford and member of the technical staff at @AnthropicAI.
AI Engineer @aiDotEngineer
31K Followers 6 Following A network of engineers enhanced by and building with AI. Organizers of the AI Engineer Summit, AI Engineer World's Fair, and AI Engineer Europe.
Ricardo Monti @RicardoMonti9
312 Followers 1K Following @datologyai, previously CTRL-labs/META, @GatsbyUCL, @Imperial_Stats
Neal Parikh @npparikh
5K Followers 984 Following Teaching AI policy. Previously Director of AI for NYC.
Costa Huang @vwxyzjn
7K Followers 2K Following Exploiting physical rewards @periodiclabs. Prev: RL @allen_ai @huggingface. Built @cleanrl_lib.
Stanford NLP Group @stanfordnlp
172K Followers 296 Following Computational Linguists—Natural Language—Machine Learning @chrmanning @jurafsky @percyliang @ChrisGPotts @tatsu_hashimoto @MonicaSLam @Diyi_Yang @StanfordAILab
Tim Rocktäschel @_rockt
40K Followers 2K Following Director and Open-Endedness Team Lead @GoogleDeepMind, Professor of AI @AI_UCL, PI @UCL_DARK, Fellow @ELLISforEurope.
Rowan Zellers @rown
14K Followers 975 Following multimodal @thinkymachines. I also like to climb rocks and throw pottery. https://t.co/5Er4j39K71 (he/him)
William Fedus @LiamFedus
28K Followers 1K Following Co-Founder of @periodiclabs Past: VP of Post-Training @OpenAI; Google Brain
Ndea @ndea
8K Followers 79 Following A new intelligence science lab founded by @fchollet & @mikeknoop.
Simon Willison @simonw
117K Followers 6K Following Creator @datasetteproj, co-creator Django. PSF board. Hangs out with @natbat. He/Him. Mastodon: https://t.co/t0MrmnJW0K Bsky: https://t.co/OnWIyhX4CH
Edward Grefenstette ... @egrefen
42K Followers 868 Following FR/US/GB AI/ML Person, Director of Research at @GoogleDeepMind, Honorary Professor at @UCL_DARK, @ELLISforEurope Fellow. All posts are personal.
Hailey Schoelkopf @haileysch__
5K Followers 1K Following hillclimbing towards generality @anthropicai | prev @AiEleuther | views my own
Francesco Orabona @bremen79
8K Followers 416 Following Dad and associate professor at @KAUST_News. Formerly @BU_ece, @sbucompsc, @YahooResearch, @TTIC_Connect. ML theory&practice, obsessed with history of science
DeepSeek @deepseek_ai
972K Followers 0 Following Unravel the mystery of AGI with curiosity. Answer the essential question with long-termism.
Chip Huyen @chipro
120K Followers 614 Following AI Engineering: https://t.co/94dv4uTU1H Designing ML Sys: https://t.co/G81hL2dWmr Entanglements: https://t.co/W27aXeiySY @aisysbooks
swyx @swyx
127K Followers 3K Following achieve ambition with intentionality, intensity, & integrity - @dxtipshq - @sveltesociety - @aidotengineer - @latentspacepod - @cognition + @smol_ai
Ion Stoica @istoica05
5K Followers 20 Following Professor at UC Berkeley, co-founder of Databricks, Anyscale, LMArena, Conviva.
Jeremy Dohmann @jecdohmann
209 Followers 259 Following Research Scientist at @perceptroninc. Former @dbrxmosaicai, @realitylabs music: https://t.co/npRSJv5bVZ
Amplify Partners @AmplifyPartners
7K Followers 244 Following Amplify is the first investor for technical founders. We're early backers of companies like Datadog, Chainguard, dbt Labs, Temporal, Modal, Hightouch, + Scribe.