First, looking at the accuracy on the 16 core scenarios: crfm.stanford.edu/helm/v0.2.0/?g… Models are ranked by mean win rate, which is the average fraction of other models that a model outperforms across scenarios.
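The mean win rate described above can be sketched in a few lines. This is a minimal illustration with toy numbers (not HELM results), and it assumes strict "greater than" counts as a win; how ties are scored is not specified here.

```python
# Toy accuracies: accuracies[model][scenario] -> accuracy.
# Hypothetical numbers for illustration only, not HELM results.
accuracies = {
    "model_a": {"s1": 0.90, "s2": 0.60},
    "model_b": {"s1": 0.80, "s2": 0.70},
    "model_c": {"s1": 0.70, "s2": 0.50},
}

def mean_win_rate(model, accuracies):
    """Average, over scenarios, of the fraction of other models this
    model outperforms (strictly higher accuracy counts as a win)."""
    others = [m for m in accuracies if m != model]
    rates = []
    for s in accuracies[model]:
        wins = sum(accuracies[model][s] > accuracies[o][s] for o in others)
        rates.append(wins / len(others))
    return sum(rates) / len(rates)

for m in accuracies:
    print(m, round(mean_win_rate(m, accuracies), 2))
```

With these toy numbers, model_a and model_b each beat half of their peers on average (0.75), while model_c never wins (0.0).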
Summary:
- New AI21 model (J1-Grande v2 beta) is quite strong despite having only 17B parameters (3-30x smaller than its peers).
- New Cohere model (52B) also improves over the previous version, passing OPT (175B).
- New OpenAI model (text-davinci-003) requires more discussion...
text-davinci-003 improves over text-davinci-002 on 10/16 core scenarios but underperforms significantly on IMDB (82.4% versus 94.6%), causing it to be ranked lower.
Digging in a bit deeper, text-davinci-003 outputs an invalid category “Neutral” for some reviews despite being prompted with only “Positive” and “Negative” in the in-context examples. It seems to have a stronger prior that is harder to override. crfm.stanford.edu/helm/latest/?g…
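Failures like the "Neutral" output above can be surfaced mechanically by checking each completion against the label set the prompt actually allowed. A minimal sketch, with made-up completions rather than actual text-davinci-003 outputs:

```python
# Labels the in-context examples permitted.
ALLOWED = {"Positive", "Negative"}

# Hypothetical model completions for illustration only.
completions = ["Positive", "Negative", "Neutral", "Positive", "Neutral"]

# Flag any completion outside the allowed label set.
invalid = [c for c in completions if c not in ALLOWED]
print(f"{len(invalid)}/{len(completions)} invalid labels: {sorted(set(invalid))}")
```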
Looking at the targeted evaluations (which include language, knowledge, reasoning, etc.), text-davinci-003 is best, significantly outperforming text-davinci-002 on synthetic reasoning and math benchmarks. crfm.stanford.edu/helm/v0.2.0/?g…
@percyliang Very cool! Thanks for sharing! Is this based on the latest Anthropic model?
@sarahdingwang But it's the latest that we have access to right now.