15T tokens FineWeb dataset just dropped @huggingface It's a 275GB dataset with cleaned and deduplicated data under an Open Data Commons license. We all saw the difference 15T tokens of pre-training made for LLaMA-3, and now everyone can have it.
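For anyone who wants to poke at it right away, here's a minimal sketch using the `datasets` library (assuming the repo id is `HuggingFaceFW/fineweb` and that records carry a `text` field; streaming mode avoids downloading the full corpus up front):

```python
from datasets import load_dataset

# Stream FineWeb instead of materializing it on disk.
# Repo id and field name are assumptions based on the announcement.
fw = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

# Peek at the first few cleaned web documents.
for i, doc in enumerate(fw):
    print(doc["text"][:200])
    if i == 2:
        break
```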
@rohanpaul_ai @huggingface DeepSeek should train a 50B-parameter model on this, or a 104B.