Yes, both the 8B and 70B are trained far beyond what's Chinchilla-optimal - but we can eat the training cost to save you inference cost! One of the most interesting things to me was how quickly the 8B was still improving even at 15T tokens.
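For context, here is a back-of-the-envelope comparison of how far past "Chinchilla optimal" that is. This is a minimal sketch assuming the rough ~20 tokens-per-parameter heuristic from the Chinchilla paper and the reported ~15T-token training run; the exact optimal ratio depends on the fitted scaling constants.

```python
# Rough check of how far past the Chinchilla-optimal token budget Llama 3 was trained.
# Assumes the common ~20 tokens-per-parameter rule of thumb (an approximation).

TOKENS_PER_PARAM_OPTIMAL = 20   # approximate Chinchilla-optimal ratio
ACTUAL_TOKENS = 15e12           # both models reportedly trained on ~15T tokens

for name, params in [("Llama 3 8B", 8e9), ("Llama 3 70B", 70e9)]:
    optimal_tokens = params * TOKENS_PER_PARAM_OPTIMAL
    ratio = ACTUAL_TOKENS / optimal_tokens
    print(f"{name}: optimal ~= {optimal_tokens / 1e12:.2f}T tokens, "
          f"actual ~= {ACTUAL_TOKENS / 1e12:.0f}T ({ratio:.0f}x over)")
```

Under those assumptions, the 8B is trained on roughly 90x its Chinchilla-optimal token budget and the 70B on roughly 10x, which is why the extra compute goes toward a cheaper-to-serve model rather than a bigger one.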
@ml_perception If llama 8B was improving even through the 15T-th token, what made the team decide enough was enough? Why go that far / why stop there?
@ml_perception thank you very much for all of this :D
@ml_perception I guess this means we could get very high-performing smaller models if we increased the dataset size beyond 15T tokens and trained for even longer. Maybe we'll see this with llama-4
@ml_perception Can you clarify why the 8B version has a March, 2023 knowledge cutoff instead of December 2023?
@ml_perception Congratulations on the release! Does that mean that if you had 10 times more tokens, you would expect the 8B to get even better?
@ml_perception Mike you (and the team) are officially goated for this one 🙏 miss when you were churning out incredible papers but this model is so worth the wait
@ml_perception Are there studies on token saturation, i.e. the maximum number of tokens an LLM of a given size can reliably learn from? It seems like Llama 3 8B could scale beyond 15T tokens
@ml_perception Someone needs to bite the bullet and find out the limit.
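One way to reason about where that limit might sit is to extrapolate the data term of the Chinchilla parametric loss fit, L(N, D) = E + A/N^alpha + B/D^beta. The sketch below uses the approximate constants reported by Hoffmann et al. (E ~= 1.69, A ~= 406.4, B ~= 410.7, alpha ~= 0.34, beta ~= 0.28); these are illustrative values from that paper's fit, not Llama 3's actual loss curve.

```python
# Sketch: extrapolate the Chinchilla parametric loss L(N, D) = E + A/N**alpha + B/D**beta
# for a fixed 8B-parameter model to see how returns from extra data shrink but never hit zero.
# Constants are the approximate fits from Hoffmann et al. (2022), not Llama 3's own curve.

E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pre-training loss under the Chinchilla parametric fit."""
    return E + A / n_params**alpha + B / n_tokens**beta

N = 8e9  # fixed model size: 8B parameters
for tokens in [0.16e12, 1e12, 15e12, 150e12]:
    print(f"D = {tokens / 1e12:>6.2f}T tokens -> predicted loss {chinchilla_loss(N, tokens):.3f}")
```

Under this fit the loss keeps dropping past 15T tokens, just ever more slowly, which matches the observation in the original post that the 8B was still improving at the end of training.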