Rishub Tamirisa @rishub_t

Spending KL · rishub.ai · Joined March 2015

Tweets: 41 · Followers: 105 · Following: 550 · Likes: 204

Rishub Tamirisa @rishub_t

5 months ago

We’ll be presenting our work on Tamper-Resistant Safeguards for Open-Weight LLMs at #ICLR2025 today (Hall 3 + Hall 2B #311) from 3:30-5pm. Please stop by!

0 2 5 435 0

Excited to have released this work! Am hopeful for future research on utility control methods. That the models have utilities isn't necessarily a bad thing, and can even be beneficial, if we can rewrite them. Our results suggest that this is indeed possible.

Dan Hendrycks @DanHendrycks

8 months ago

719 2K 11K 6.2M 10K

0 0 7 733 0

Mantas Mazeika @MantasMazeika96

7 months ago

@colin_fraser Hey, first author here. We've known about these ordering effects since the beginning of the project, which is why we average over both orderings. Before explaining further, it's important to note that in most preference comparisons, models pick one of the underlying options with…

1 1 7 427 0

Elon Musk @elonmusk

8 months ago

Important thread on AI

Dan Hendrycks @DanHendrycks

8 months ago

719 2K 11K 6.2M 10K

2K 3K 15K 4.1M 5K

Dan Hendrycks @DanHendrycks

8 months ago

We’ve found that as AIs get smarter, they develop their own coherent value systems. For example, they value lives in Pakistan > India > China > US. These are not just random biases, but internally consistent values that shape their behavior, with many implications for AI alignment. 🧵

719 2K 11K 6.2M 10K

Revanth Gangi Reddy @gangi_official

a year ago

Code for our LLM Reranking paper is out: github.com/gangiswag/llm-… You can use the trained model (available on HF) for up to 50% faster inference than generation-based LLM reranking. We provide scripts to incorporate both generation and ranking objectives while training LLM Rerankers.

Revanth Gangi Reddy @gangi_official

a year ago

2 10 45 6K 16

1 3 11 1K 1

alphaXiv @askalphaxiv

a year ago

Excited to feature Tamper-Resistant Safeguards for Open-Weight LLMs from @lapisrocks! Introducing the first safeguards for LLMs that resist fine-tuning attacks, showing the power of tamper-resistance to make open-weight LLMs safer. @rishub_t is here to answer your questions!

1 4 10 1K 3

Dan Hendrycks @DanHendrycks

a year ago

How can we prevent LLM safeguards from being simply removed with a few steps of fine-tuning? We show it's surprisingly possible to make progress on creating safeguards that are tamper-resistant, reducing malicious use risks of open-weight models. Paper: arxiv.org/abs/2408.00761