We’ll be presenting our work on Tamper-Resistant Safeguards for Open-Weight LLMs at #ICLR2025 today (Hall 3 + Hall 2B #311) from 3:30-5pm. Please stop by!
Excited to have released this work! I'm hopeful for future research on utility control methods. That models have utilities isn't necessarily a bad thing; it can even be beneficial if we can rewrite them. Our results suggest that this is indeed possible.
@colin_fraser Hey, first author here. We've known about these ordering effects since the beginning of the project, which is why we average over both orderings. Before explaining further, it's important to note that in most preference comparisons, models pick one of the underlying options with…
We’ve found that as AIs get smarter, they develop their own coherent value systems.
For example, they value lives in Pakistan > India > China > US.
These are not just random biases, but internally consistent values that shape their behavior, with many implications for AI alignment. 🧵
Code for our LLM Reranking paper is out: github.com/gangiswag/llm-…
You can use the trained model (available on HF) for up to 50% faster inference than generation-based LLM reranking
We provide scripts to incorporate both generation and ranking objectives while training LLM Rerankers
Excited to feature Tamper-Resistant Safeguards for Open-Weight LLMs from @lapisrocks!
Introducing the first safeguards for LLMs that resist fine-tuning attacks, showing the power of tamper-resistance to make open-weight LLMs safer.
@rishub_t is here to answer your questions!
How can we prevent LLM safeguards from being simply removed with a few steps of fine-tuning?
We show it's surprisingly possible to make progress on creating safeguards that are tamper-resistant, reducing malicious use risks of open-weight models.
Paper: arxiv.org/abs/2408.00761