Yes, both the 8B and 70B are trained far beyond what's Chinchilla-optimal - but we can eat the training cost to save you inference cost! One of the most interesting things to me was how quickly the 8B was still improving even at 15T tokens.
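For context, here is a back-of-the-envelope comparison of how far past "Chinchilla optimal" that is. This is a minimal sketch assuming the rough ~20 tokens-per-parameter heuristic from the Chinchilla paper and the reported ~15T-token training run; the exact optimal ratio depends on the fitted scaling constants.

```python
# Rough check of how far past the Chinchilla-optimal token budget Llama 3 was trained.
# Assumes the common ~20 tokens-per-parameter rule of thumb (an approximation).

TOKENS_PER_PARAM_OPTIMAL = 20   # approximate Chinchilla-optimal ratio
ACTUAL_TOKENS = 15e12           # both models reportedly trained on ~15T tokens

for name, params in [("Llama 3 8B", 8e9), ("Llama 3 70B", 70e9)]:
    optimal_tokens = params * TOKENS_PER_PARAM_OPTIMAL
    ratio = ACTUAL_TOKENS / optimal_tokens
    print(f"{name}: optimal ~= {optimal_tokens / 1e12:.2f}T tokens, "
          f"actual ~= {ACTUAL_TOKENS / 1e12:.0f}T ({ratio:.0f}x over)")
```

Under those assumptions, the 8B is trained on roughly 90x its Chinchilla-optimal token budget and the 70B on roughly 10x, which is why the extra compute goes toward a cheaper-to-serve model rather than a bigger one.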
@ml_perception If llama 8B was improving even through the 15T-th token, what made the team decide enough was enough? Why go that far / why stop there?
@ml_perception thank you very much for all of this :D
@ml_perception I guess this means we could get very high-performing smaller models if we increased the dataset size beyond 15T tokens and trained for even longer. Maybe we'll see this with llama-4
@ml_perception Can you clarify why the 8B version has a March, 2023 knowledge cutoff instead of December 2023?
@ml_perception Congratulations on the release! Does that mean that if you had 10 times more tokens, you would expect the 8B to get even better?
@ml_perception Mike you (and the team) are officially goated for this one 🙏 miss when you were churning out incredible papers but this model is so worth the wait
@ml_perception Are there studies on token saturation, i.e. the maximum number of tokens an LLM of a given size can reliably learn from? It seems like Llama 3 8B could scale beyond 15T tokens
@ml_perception Someone needs to bite the bullet and find out the limit.
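One way to reason about where that limit might sit is to extrapolate the data term of the Chinchilla parametric loss fit, L(N, D) = E + A/N^alpha + B/D^beta. The sketch below uses the approximate constants reported by Hoffmann et al. (E ~= 1.69, A ~= 406.4, B ~= 410.7, alpha ~= 0.34, beta ~= 0.28); these are illustrative values from that paper's fit, not Llama 3's actual loss curve.

```python
# Sketch: extrapolate the Chinchilla parametric loss L(N, D) = E + A/N**alpha + B/D**beta
# for a fixed 8B-parameter model to see how returns from extra data shrink but never hit zero.
# Constants are the approximate fits from Hoffmann et al. (2022), not Llama 3's own curve.

E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pre-training loss under the Chinchilla parametric fit."""
    return E + A / n_params**alpha + B / n_tokens**beta

N = 8e9  # fixed model size: 8B parameters
for tokens in [0.16e12, 1e12, 15e12, 150e12]:
    print(f"D = {tokens / 1e12:>6.2f}T tokens -> predicted loss {chinchilla_loss(N, tokens):.3f}")
```

Under this fit the loss keeps dropping past 15T tokens, just ever more slowly, which matches the observation in the original post that the 8B was still improving at the end of training.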