Hamel Husain @HamelHusain

2 weeks ago

We need an MKBHD for AI software, because there’s tons of bullshit out there to wade through.

I'm working on something like this right now for AI model benchmarks. It's very tricky, though, for reasons similar to the ones MKBHD faces, and maybe even more challenging. I have found many "anomalies" in various claims, many of them centering on benchmark performance. While it's easy to attribute these to malice, it's more likely that people are genuinely trying to make models that also do well on benchmark-style data. At this point, you're probably making a mistake if you don't train your model to answer factual Q&A and multiple-choice questions. Obviously, if they trained directly on known benchmark data, that's more likely cheating (though not necessarily), but it's very hard to tell what's cheating, what's a response to market pressure, what's incompetence, and what's just chance.