I worry about language models being trained on test sets. Recently, we emailed [email protected] to opt out of having our (test) data be used to improve models. This isn't enough though: others running evals could still inadvertently contribute those test sets to training.
A better solution would be to have all the LM providers agree on a common repository of examples that should be excluded from any training run.
But this might not be enough either: if we want to measure cross-task generalization, we have to ensure that no examples of a task/domain are represented in the training data. This is essentially impossible.
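One way the shared exclusion repository idea could work in practice is for benchmark owners to publish hashes of their test examples and for providers to filter training corpora against them. A minimal sketch, assuming a hypothetical registry file of SHA-256 digests; the function names and normalization are illustrative, not any provider's actual pipeline:

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting changes don't defeat the match."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def example_hash(text: str) -> str:
    """Digest of a normalized example; the shared repository would store only these, not the raw test data."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def load_exclusion_registry(path: str) -> set[str]:
    """Load the (hypothetical) shared registry: one hex digest per line."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def filter_training_corpus(documents, registry: set[str]):
    """Yield only training documents whose hash is absent from the exclusion registry."""
    for doc in documents:
        if example_hash(doc) not in registry:
            yield doc
```

Exact-match hashing only catches verbatim (normalized) copies, not paraphrases or partial overlaps, which is part of why even a shared repository wouldn't fully solve the problem raised above.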
@percyliang I’m worried about AI auto-generating the data that it then consumes to know what’s correct. That’s a recipe for a feedback loop of incorrectness. We need a mechanism for recognising when something on the internet is incorrect / potentially untrue / potentially true / correct.
@percyliang Can you elaborate on, or give an example of, what you mean by cross-task generalization?
@percyliang Test sets measure generalization, but these models are already showing impressive generalization abilities. Instead of trying to hide some test data to “discover the true performance”, I’d use benchmarks that are hard for models.
@percyliang Isn't it unrealistic to assume that no examples of a domain would be represented in the unsupervised pre-training data? You need to exclude it from finetuning, but starting with an "all-domain" LLM should give an honest estimate of real-world generalization to an arbitrary domain.
@percyliang With the release of the OpenAI APIs on Azure, I think the spec mentions that one can keep enterprise data local on an Azure instance and hence not allow GPT to be trained on it. Can't this functionality be used for test sets as well?