15T tokens FineWeb dataset just dropped @huggingface It's a 275GB dataset with cleaned and deduplicated data under an Open Data Commons license. We all saw the difference 15T tokens of pre-training made for LLaMA-3, and now everyone can have it.
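For anyone who wants to poke at it right away, here's a minimal sketch using the `datasets` library (assuming the repo id is `HuggingFaceFW/fineweb` and that records carry a `text` field; streaming mode avoids downloading the full corpus up front):

```python
from datasets import load_dataset

# Stream FineWeb instead of materializing it on disk.
# Repo id and field name are assumptions based on the announcement.
fw = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

# Peek at the first few cleaned web documents.
for i, doc in enumerate(fw):
    print(doc["text"][:200])
    if i == 2:
        break
```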
@rohanpaul_ai @huggingface DeepSeek should train a 50B-parameter model on this, or a 104B.