First, looking at the accuracy on the 16 core scenarios: crfm.stanford.edu/helm/v0.2.0/?g… Models are ranked by mean win rate, which is the average fraction of other models that a model outperforms across scenarios.
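The mean win rate described above can be sketched in a few lines. This is a minimal illustration with toy numbers (not HELM results), and it assumes strict "greater than" counts as a win; how ties are scored is not specified here.

```python
# Toy accuracies: accuracies[model][scenario] -> accuracy.
# Hypothetical numbers for illustration only, not HELM results.
accuracies = {
    "model_a": {"s1": 0.90, "s2": 0.60},
    "model_b": {"s1": 0.80, "s2": 0.70},
    "model_c": {"s1": 0.70, "s2": 0.50},
}

def mean_win_rate(model, accuracies):
    """Average, over scenarios, of the fraction of other models this
    model outperforms (strictly higher accuracy counts as a win)."""
    others = [m for m in accuracies if m != model]
    rates = []
    for s in accuracies[model]:
        wins = sum(accuracies[model][s] > accuracies[o][s] for o in others)
        rates.append(wins / len(others))
    return sum(rates) / len(rates)

for m in accuracies:
    print(m, round(mean_win_rate(m, accuracies), 2))
```

With these toy numbers, model_a and model_b each beat half of their peers on average (0.75), while model_c never wins (0.0).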
Summary:
- New AI21 model (J1-Grande v2 beta) is quite strong despite having only 17B parameters (3-30x smaller than its peers).
- New Cohere model (52B) also improves over the previous version, passing OPT (175B).
- New OpenAI model (text-davinci-003) requires more discussion...
text-davinci-003 improves over text-davinci-002 on 10/16 core scenarios but underperforms significantly on IMDB (82.4% versus 94.6%), causing it to be ranked lower.
Digging in a bit deeper, text-davinci-003 outputs an invalid category “Neutral” for some reviews despite being prompted with only “Positive” and “Negative” in the in-context examples. It seems to have a stronger prior that is harder to override. crfm.stanford.edu/helm/latest/?g…
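Failures like the "Neutral" output above can be surfaced mechanically by checking each completion against the label set the prompt actually allowed. A minimal sketch, with made-up completions rather than actual text-davinci-003 outputs:

```python
# Labels the in-context examples permitted.
ALLOWED = {"Positive", "Negative"}

# Hypothetical model completions for illustration only.
completions = ["Positive", "Negative", "Neutral", "Positive", "Neutral"]

# Flag any completion outside the allowed label set.
invalid = [c for c in completions if c not in ALLOWED]
print(f"{len(invalid)}/{len(completions)} invalid labels: {sorted(set(invalid))}")
```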
Looking at the targeted evaluations (which include language, knowledge, reasoning, etc.), text-davinci-003 is best, significantly outperforming text-davinci-002 on synthetic reasoning and math benchmarks. crfm.stanford.edu/helm/v0.2.0/?g…
@percyliang Very cool! Thanks for sharing! Is this based on the latest Anthropic model?
@sarahdingwang But it's the latest that we have access to right now.