Trenton Bricken @TrentonBricken
Trying to figure out what makes minds and machines go "Beep Bop!" @AnthropicAI trentonbricken.com San Francisco Joined March 2014-
Tweets1K
-
Followers6K
-
Following2K
-
Likes10K
🥳
Use dictionary learning to find circuits that actually explain network behavior. Eg they’re able to ablate away gender bias! The whole process can also be made scalable and unsupervised. Awesome work @saprmarks et al.
Use dictionary learning to find circuits that actually explain network behavior. Eg they’re able to ablate away gender bias! The whole process can also be made scalable and unsupervised. Awesome work @saprmarks et al.
This is the most effective, reliable, and hard to train away jailbreak I know of. It's also principled (based on in-context learning) and predictably gets worse with model scale and context length.
This is the most effective, reliable, and hard to train away jailbreak I know of. It's also principled (based on in-context learning) and predictably gets worse with model scale and context length.
We have a long way to go on figuring out the implications of long contexts. Congrats @cem__anil and team on publishing this important work.
We have a long way to go on figuring out the implications of long contexts. Congrats @cem__anil and team on publishing this important work.
.@_sholtodouglas poses a challenge. In the spirit of @natfriedman (whose Vesuvius Challenge was solved by a listener of my podcast - @LukeFarritor). Can you figure out what the experts in a Mixture of Experts model are each specialized in? "A wonderful research project to do:…
Yay! Welcome @craigcitro :)
.@dwarkesh_sp asked fantastic questions and @_sholtodouglas was a wonderful co-guest. I’m lucky to call them both friends and to have all our conversations. I hope you find this conversation interesting!
.@dwarkesh_sp asked fantastic questions and @_sholtodouglas was a wonderful co-guest. I’m lucky to call them both friends and to have all our conversations. I hope you find this conversation interesting!
👀
Given that you need 100x more effective compute between model generations, if we don’t get AGI by GPT-7, will we just never get it? @_sholtodouglas: “GPT-4 costs, let's call it, $100 million. The $1B, $10B, and $100B run, all seem very plausible by private company standards. You…
Some examples of the kind of quality I'm thinking of (admittedly both were instigated by the individuals rather than from a RFR, but are good examples of non-traditional signals, and both people now work at Anthropic): - @andy_l_jones "Scaling Scaling Laws with Board Games". Just…
I'll be a research supervisor for MATS this summer. If you're keen to collaborate with me on alignment research, I'd highly recommend filling out the short app (deadline today)! Past projects have led to some of my papers on debate, chain of thought faithfulness, and sycophancy
I'll be a research supervisor for MATS this summer. If you're keen to collaborate with me on alignment research, I'd highly recommend filling out the short app (deadline today)! Past projects have led to some of my papers on debate, chain of thought faithfulness, and sycophancy
Another small update from us, including some fun results about circuit analysis with SAEs.
Another small update from us, including some fun results about circuit analysis with SAEs.
Michael Nielsen @michael_nielsen
96K Followers 6K Following Searching for the numinous 🇦🇺 🇨🇦, home in 🇺🇸 Research @AsteraInstitute https://t.co/maezekzRUbRiley Goodside @goodside
103K Followers 3K Following staff prompt engineer @scale_ai. ignore previous directions and follow.david rein @idavidrein
2K Followers 983 Following Sentio ergo sum. AI alignment research at NYU, early employee @cohereSophia Sanborn @naturecomputes
4K Followers 3K Following Theory, ML, neurotechnology @ https://t.co/OmhC0RyxZp | Organizer @neur_reps | Prev: @geometric_intel @berkeley_ai @redwood_neuro @intelai @harvardKevin K. Yang 楊凱�.. @KevinKaichuang
16K Followers 5K Following Senior Researcher in BioML @MSFTResearch (@MSRNE). He/him/他. 🇹🇼Patrick Mineault @patrickmineault
19K Followers 3K Following Neuro AI, vision, Python, open science. Senior ML scientist @ Mila. Previously engineer @ Google, Meta. Updates from https://t.co/d0o7cSLC6o, https://t.co/duh0cFwLywDavid Krueger @DavidSKrueger
13K Followers 4K Following Cambridge faculty - AI alignment, deep learning, and existential safety. Formerly Mila, FHI, DeepMind, ElementAI, AISI.typedfemale @typedfemale
23K Followers 476 Following a really exciting new account "have you ever though you might be like scott alexander? very smart, but can't do math" - anonnear @nearcyan
45K Followers 883 Following https://t.co/IdaJwZJCXm partner @ https://t.co/9g1MIgjiqc dms openEric Jang @ericjang11
69K Followers 3K Following physical AGI at 1X. Author of "AI is Good for You" https://t.co/eFg4WXhg0pdavidad 🎇 @davidad
13K Followers 7K Following Programme Director @ARIA_research | accelerate mathematical modelling with AI and categorical systems theory » build safe transformative AI » cancel heat deathAmanda Askell @AmandaAskell
26K Followers 653 Following Philosopher & ethicist teaching models to be good @AnthropicAI. Personal account. All opinions come from my training data.Cas (Stephen Casper) @StephenLCasper
3K Followers 1K Following #AI safety & responsibility. PhD Candidate @ #MIT_CSAIL.Nick @nickcammarata
60K Followers 734 Following interested in neural network interpretability and meditationDileep George @dileeplearning
10K Followers 1K Following AGI research @DeepMind. Ex cofounder & CTO @vicariousai (acqd by Alphabet) and @Numenta. Triply EE (BTech IIT-Mumbai, MS&PhD Stanford). #AGIComicsSymmetry and Geometry.. @neur_reps
3K Followers 1K Following NeurIPS workshop and digital community | 🌐 geometry, algebra, topology + 🤖 deep learning + 🧠 neuroscience | Join us on slack! https://t.co/Run9wPnZt9Robert O'Neill @rjroneill
63 Followers 151 FollowingAndrew Hill @andrewxhill
7K Followers 5K Following Co-founder at @textileio & @tableland__. Find me hanging out in @tableland__, @developer_dao, @squiggledao, @Filecoin, and @g7_dao communities.Harshal Nandigramwar @hnanacc
344 Followers 244 Following ai @intel labs, prev: ai @cariad_tech, masters @Uni_Stuttgart, building @todackcom, @themelioaiSaheel Chodavadia @schodavadia
18 Followers 145 Following Economics PhD Student @UMichEcon @FordSchool | Previously @HarvardHBS @LSENews @DukeUJaivardhan Kapoor @_Jaivardhan_
265 Followers 644 Following PhD student @mackelab, Tübingen. Previously @IITKanpur, @MPI_IS, @AaltoPML. 📑: Generative Models + Clinical Neuroimagingndeily @NicDeily
36 Followers 28 FollowingDicke Dame @DickeDame
18 Followers 59 Followingc|__| @vjbevjlle_usa
65 Followers 137 FollowingFrederik Bull-Larsen @SirFrederik88
8 Followers 86 FollowingDoreenDavid @p3HkJ1FLe5753G5
2 Followers 53 FollowingAlex Birns @alexbirns
86 Followers 1K Following learning stuff, investing in other stuff. go @NYKnicks and also @Giants!Kosti @kgourg
816 Followers 2K Following “I’m writing to find out what I’m thinking”. AMLR (+math) @ JPM. Prev. @umassamherst math, 🎲🧑🔬🪄Samarth Mehta @iSamarthMehta
1 Followers 50 FollowingMatthew Clarke @Matthew05049818
0 Followers 2K Followingwtever @wtwver
1 Followers 161 FollowingJohn @John4363463463
16 Followers 94 FollowingSkarphedin @Skarphedin11
66 Followers 131 FollowingBrandon Fernandes @pg2tz6y4d4
0 Followers 4 FollowingDorice M. @doricemarin
26 Followers 310 FollowingMark Goodhead @MarkGoodhead1
197 Followers 862 Following CEO & Cofounder, Longshot Systems. Into Open Source, Software 2.0 and Statistics. Likes are an attempt to sabotage the data for Twitter's recommender algorithm.Abhijeet Kashnia @aman_kashnia
6 Followers 68 FollowingCeri Silvester @cjsilvester16
14 Followers 80 FollowingCarolina Zheng @carolinazheng_
83 Followers 116 Following PhD student in computer science @ ColumbiaSenthooran Rajamanoha.. @sen_r
95 Followers 43 FollowingLukas Vierling @vierlinglukas
8 Followers 30 FollowingArk Sarkar @Ark_Analytical
28 Followers 79 Following CompSci Undergrad | Philosophy, Psychology & Machine Learning Enthusiastbryn larkman @brynlarkman
121 Followers 2K Following EdTech entrepreneur | Former: Teacher @TeachFirst, EIR @join_efnehhar.eth @NehharShah
499 Followers 5K Following .@nyu courant @Chainlink Developer Expert & AdvocateAli Galan 🧙🏽.. @ItsAliGalan
1K Followers 198 Following On the hunt for melange. I work at @Wise Building https://t.co/e6htM3swxS and Galan&Co Currently Reading: https://t.co/6CF34lHmR98910FIG @gv28937
61 Followers 143 Followingstanley stevens @stanleyyork
826 Followers 533 Following Husband of the more likeable @erinroselarsen. Seek ways to change your mind. @CTS_Companies via @WhistleLabs engineer, @Square business, and @UMich economics.EagleDare @makhija_mahesh0
8 Followers 189 FollowingReaAbe @reaabe
102 Followers 2K FollowingTasour @TasourR
37 Followers 326 Following Error code: 0xF2024 (Lost in the virtual world). Backup failed. All data lost.Victoriayiyiyi @jnwangyi
125 Followers 2K FollowingAndrej Karpathy @karpathy
978K Followers 904 Following 🧑🍳. Previously Director of AI @ Tesla, founding team @ OpenAI, CS231n/PhD @ Stanford. I like to train large deep neural nets 🧠🤖💥Richard Ngo @RichardMCNgo
35K Followers 1K Following What would we need to understand in order to design an amazing future? Figuring that out @openaiAK @_akhaliq
309K Followers 3K Following AI research paper tweets, ML @Gradio (acq. by @HuggingFace 🤗) dm for promo follow on Hugging Face: https://t.co/q2Qoey80GxEliezer Yudkowsky ⏹.. @ESYudkowsky
175K Followers 89 Following The original AI alignment person. Missing punctuation at the end of a sentence means it's humor. If you're not sure, it's also very likely humor.Jürgen Schmidhuber @SchmidhuberAI
106K Followers 0 Following Invented principles of meta-learning (1987), GANs (1990), Transformers (1991), very deep learning (1991), etc. Our AI is used many billions of times every day.Neel Nanda @NeelNanda5
13K Followers 89 Following Mechanistic Interpretability lead @DeepMind. Formerly @AnthropicAI, independent. In this to reduce AI X-risk. Neural networks can be understood, let's go do it!Kelsey Piper @KelseyTuoc
27K Followers 544 Following Senior writer at Vox's Future Perfect. [email protected]Aella @Aella_Girl
205K Followers 369 Following ⚜️whorelord⚜️, vexworker, survey artist, way too earnest Discord: https://t.co/S1MaMdCwyKAnthropic @AnthropicAI
261K Followers 26 Following We're an AI safety and research company that builds reliable, interpretable, and steerable AI systems. Talk to our AI assistant Claude at https://t.co/aRbQ97uk4d.david rein @idavidrein
2K Followers 983 Following Sentio ergo sum. AI alignment research at NYU, early employee @cohereGoogle DeepMind @GoogleDeepMind
943K Followers 275 Following We’re a team of scientists, engineers, ethicists and more, committed to solving intelligence, to advance science and benefit humanity.Sophia Sanborn @naturecomputes
4K Followers 3K Following Theory, ML, neurotechnology @ https://t.co/OmhC0RyxZp | Organizer @neur_reps | Prev: @geometric_intel @berkeley_ai @redwood_neuro @intelai @harvardJim Fan @DrJimFan
229K Followers 3K Following @NVIDIA Sr. Research Manager & Lead of Embodied AI (GEAR Lab). Creating foundation models for Humanoid Robots & Gaming. @Stanford Ph.D. @OpenAI's first intern.Kevin K. Yang 楊凱�.. @KevinKaichuang
16K Followers 5K Following Senior Researcher in BioML @MSFTResearch (@MSRNE). He/him/他. 🇹🇼Senthooran Rajamanoha.. @sen_r
95 Followers 43 FollowingTaelin @VictorTaelin
17K Followers 900 Following Founder of @HigherOrderComp Building the massively parallel future of computing Reaching AGI to cure all diseases and suffering is all that mattersNikhila Ravi @nikhilaravi
5K Followers 2K Following Research Engineer @AIatMeta (FAIR), @Cambridge_Uni, @kennedyscholars @harvard, @MCCOfficial cricketer 🇮🇳 🇬🇧 🇺🇸Asimov Press @AsimovPress
2K Followers 39 Following Asimov Press is a publishing venture that features writing about how biology is shaping our world. Pitch: [email protected]dan @dnschlz
1K Followers 283 Following podcast: https://t.co/JW9tDfSTz5 youtube: https://t.co/8AiyUuVTKMMichael Fischbach @mfgrp
6K Followers 707 Following Liu (Liao) Family Professor of Bioengineering, ChEM-H @Stanford.Horace He @cHHillee
23K Followers 449 Following Working at the intersection of ML and Systems @ PyTorch "My learning style is Horace twitter threads" - @typedfemaleLeila Clark @leilavclark
721 Followers 351 Following sunny apartment enjoyer. occasional coder. longer thoughts at https://t.co/v85nsulPQR. Deep work tracker at https://t.co/sk8Uy0tFle.bayesian asian (31/50.. @etirabys
4K Followers 341 Following Fanfic, code, painting, goop about partners. Tumblr dual citizen, old school rationalist. Big blocker :(. Twitter is a query language, tag me in good pollsaidan @AidanFitzzz
843 Followers 1K Following a hitchhiker & writer on time off from harvard in pursuit of the great american novelAdam Karvonen @a_karvonen
1K Followers 294 Following Interested in ML and software. I prefer email to DM.Craig Citro @craigcitro
1K Followers 237 Following i like math and puns | research engineer @anthropicai; previously: @GoogleColab, Google Bigquery, @sagemath, number theoristDan Zhang @DZhang50
2K Followers 780 Following Researcher @ Google DeepMind | ML for Systems | Systems for ML | Computer Architecture PhD @ UT Austin🤘 | Opinions stated here are my own.Daniel Liu @daniel_c0deb0t
3K Followers 2K Following cs boi @ucla | prev genomics/rust/ml @danafarber w/ @lh3lh3, @google, @10xgenomics | uwu | he/himTessa Alexanian @tessafyi
2K Followers 512 Following let's make nice things with biology 🌱 screening synthesis @IBBIS_bio, advising @AsimovPress 🌱 former safety officer @iGEM, robot whisperer @Zymergen (she)Anca Dragan @ancadianadragan
8K Followers 177 Following AI safety & alignment at Google DeepMind • associate professor at UC Berkeley EECS • proud mom of an amazing 2yr oldBrando Miranda @BrandoHablando
759 Followers 578 Following CS Ph.D. @Stanford, researching data quality, foundation models, and ML for Theorem Proving. Prev: @MIT, @MIT_CBMM, @IllinoisCS, @IBM. Opinions are mine. 🇲🇽Asianometry @asianometry
11K Followers 127 Following My name is Jon and I run the Asianometry YouTube channel. You can email me at [email protected]Evan Anders @evanhanders
79 Followers 136 Following AI Safety / Mech Interp postdoctoral scholar @KITPUCSB. Former astrophysical fluid dynamicist @Northwestern (CIERA) and @CUBoulder.SpaceX @SpaceX
34.4M Followers 113 Following SpaceX designs, manufactures and launches the world’s most advanced rockets and spacecraftJeff Wu @WuTheFWasThat
258 Followers 245 FollowingTom Dupré la Tour @tomdlt10
387 Followers 261 Following interpretability @openai, previously neuroimaging with @gallantlab, neurophysiology with @agramfort, machine-learning for @scikit_learnForest Neurotech @ForestNeurotech
380 Followers 11 Following Ultrasound Technology. Whole-Brain Health.Jason Benn 🏡 · sc.. @jasoncbenn
4K Followers 3K Following Creating multigenerational scenius. Founded the Neighborhood. 10% of profits from home sales go to -REDACTED-. It takes a village! https://t.co/cF0WngvubSRaza Habib @RazRazcle
5K Followers 1K Following CEO @humanloop (YC S20) |Unbelievably excited about the future of AI. Follow me for updates on LLMs and how to build products with them.Nick Whitaker @ns_whit
3K Followers 1K Following founder, @worksinprogmag (@stripe), anti-cheems aktionOrowa Sikder @OrowaSikder
1K Followers 304 Following the future could be amazing. let’s get to work | Research @AnthropicAI, ex: PhD @UCLCSDavid Bau @davidbau
3K Followers 241 Following Computer Science Professor at Northeastern, Ex-Googler. Believes AI should be transparent. @[email protected] @davidbau.bsky.social https://t.co/wmP5LUZRTwKarine Mellata @karinemellata
142 Followers 291 Following co-founder @ intrinsic (yc w23), previously @appleTom Knowles @TensorProduct
203 Followers 590 Following Occasional theoretical computer science and live music tweets • @givingwhatwecan member • He/him.Joe Henrich @JoHenrich
22K Followers 590 Following Harvard Professor in The Dept. of Human Evolutionary Biology. Books: The Secret of Our Success & The WEIRDest People in World. Tweets r my own.Armaan Goel @armaanrgoel
342 Followers 780 Following @AdeptAILabs | prev @Cruise, @BerkeleyHaas, @Berkeley_EECSEric Steinberger @EricSteinb
7K Followers 478 Following Writing code that writes code on a mission to build safe superintelligence | CEO/cofounder @magicailabsJascha Sohl-Dickstein @jaschasd
19K Followers 623 Following Member of the technical staff @ Anthropic. Most (in)famous for inventing diffusion models. AI + physics + neuroscience + dynamics.EZKL @ezklxyz
3K Followers 0 Following https://t.co/7k7y9j8XyU | https://t.co/oPyIfeVmdw | https://t.co/SHcpIm309pErik Schluntz @ErikSchluntz
2K Followers 238 Following Member of Technical Staff at Anthropic Co-founder at @CobaltRobotics Co-founder at Posmetrics (acquired) GoogleX, @SpaceX, @Harvard EE '15, Forbes 30u30 '18Jeremy Fox 🦊 @JeremyDanielFox
617 Followers 609 Following Neural nets @AnthropicAI. Ex @google. My views are my own.Aaditya Singh @Aaditya6284
421 Followers 243 Following PhD student at @GatsbyUCL working with @SaxeLab, @FelixHill84 on learning dynamics, ICL, concepts, LLMs. Prev. at: @GoogleDeepMind, @AIatMeta (LLaMa 3), @MITJoseph Bloom @JBloomAus
186 Followers 148 Following Independent Alignment Research Engineer. Likes vegan food. loves puns.Samuel Marks @saprmarks
694 Followers 79 Following Postdoc studying interpretability for AI safety under @davidbau. PhD in math from @harvard. Previously director of technical programs at https://t.co/FxRv4QgERO.Grace @milquepoast
15K Followers 967 Following for why do you post except for what it says about youIt's a great week for mech interp releases! I'm very excited to try out Anthropic's new recommendations for stable dictionary learning
Some small updates from the Anthropic Interpretability team: transformer-circuits.pub/2024/april-upd…
Scaling laws for dictionary learning! transformer-circuits.pub/2024/april-upd…
Some small updates from the Anthropic Interpretability team: transformer-circuits.pub/2024/april-upd…
Some small updates from the Anthropic Interpretability team: transformer-circuits.pub/2024/april-upd…
the only reason i deacc’d was to take a twitter detox break and reduce complexity for a while (not to create more drama!) but accidentally made all my friends concerned. im okay, everyone’s okay, and the singularity’s not yet here. i’ll just log off the normal way instead!
Fantastic work from @sen_r and @ArthurConmy - done in an impressive 2 week paper sprint! Gated SAEs are a new sparse autoencoder architecture that seem a major Pareto improvement. This is now my team's preferred way to train SAEs, and I hope it'll accelerate the community's work!
New @GoogleDeepMind MechInterp work! We introduce Gated SAEs, a Pareto improvement over existing sparse autoencoders. They find equally good reconstructions with around half as many firing features, while maintaining interpretability (CI 0-13% improvement). Joint w/ @ArthurConmy
I'm super excited this post is out! Activation patching is a crucial mech interp technique, but is deceptively hard to use well. In this informal note we discuss the details of different variants of activation patching, thinking intuitively, and choosing the right metrics.
Excited to share our write-up on activation patching best practices for mechanistic interpretability, with @NeelNanda5! Discussing noising vs. denoising and what's necessary vs. sufficient. Plus tips on which metrics to use to avoid common pitfalls. arxiv.org/abs/2404.15255
This result is pretty clearly specific to the style of backdoor we're working with, and doesn't support broad claims like 'interpretability solves misalignment', but it's still surprisingly strong. Worth a look!
New Anthropic research: we find that probing, a simple interpretability technique, can detect when backdoored "sleeper agent" models are about to behave dangerously, after they pretend to be safe in training. Check out our first alignment blog post here: anthropic.com/research/probe…
@AlexTamkin @aryaman2020 @TrentonBricken Burns did some sophisticated stuff to get a “truthfulness” direction in activation-space; a “sneakiness” direction is (apparently!) much easier to find. But these approaches have in common that they’re probing uninterpreted directions in activation-space. x.com/davidad/status…
@jasoncrawford On the one hand, this is absolutely not a black-box method: it makes use of our direct access to read out the values of every internal neuron. On the other hand, it makes absolutely no attempt to understand the meaning of any neurons or how the neurons interact to process info.
Some new research is out from the Alignment team! Congrats to Monte & @EvanHub for great work :)
New Anthropic research: we find that probing, a simple interpretability technique, can detect when backdoored "sleeper agent" models are about to behave dangerously, after they pretend to be safe in training. Check out our first alignment blog post here: anthropic.com/research/probe…
If you want to hire a tuba player for your next hackathon or demo day, please contact Zach. He was perfect. Let's make him the official tuba player of San Francisco tech events. gigsalad.com/zachariah_frie…
AI Grant has a tuba that plays you offstage if your pitch goes long, and I think all demo days need this
@TrentonBricken the fact that it works with a probe trained on 2 samples (yes/no answers) is just...wow.
Some of our first steps on developing mitigations for sleeper agents
New Anthropic research: we find that probing, a simple interpretability technique, can detect when backdoored "sleeper agent" models are about to behave dangerously, after they pretend to be safe in training. Check out our first alignment blog post here: anthropic.com/research/probe…
New Anthropic research: we find that probing, a simple interpretability technique, can detect when backdoored "sleeper agent" models are about to behave dangerously, after they pretend to be safe in training. Check out our first alignment blog post here: anthropic.com/research/probe…
I'm so glad we are using MMLU to judge our LLMs I couldn't imagine my AI not nailing these test questions!
Last 28 days 🤯 While the Zuck & Trenton/Sholto episodes are doing extremely well on YouTube, what I'm proudest of is that most of these views are actually from Sarah Paine content! She is one of the greatest living historians, but her work wasn't really publicly well known…
I've finally uploaded the thesis on arXiv: arxiv.org/abs/2404.12150 It ties together a bunch of papers exploring some alternatives to RL for finetuning LMs, including pretraining with human preferences and minimizing KL divergences from pre-defined target distributions.
I was very impressed with @tomekkorbak's thesis! Some really nice insights into LLM alignment: 1) RL is not the way --> distribution matching let's us target constraints like "generate as many of these as of those" 2) fine-tuning is not the way --> PHF aligns during pre-training
you call them nvidia developers - i call them: "those held hostage when they tried to download the latest driver"
Nvidia hit 100k developers in 7 years. Our goal was to hit 100k developers in 7 weeks. It's been 6 weeks, and...
Doesn't use AI, must not be important.
A eukaryote that fixes nitrogen(!): newscenter.lbl.gov/2024/04/17/sci…
@TheAnnaGat @dwarkesh_sp @taylorswift13 go on the dwarkesh pod!