Guilherme Penedo @gui_penedo, Twitter Profile

Guilherme Penedo @gui_penedo

2 weeks ago

We have just released 🍷 FineWeb: 15 trillion tokens of high quality web data. We filtered and deduplicated all CommonCrawl between 2013 and 2024. Models trained on FineWeb outperform RefinedWeb, C4, DolmaV1.6, The Pile and SlimPajama!

Guilherme Penedo @gui_penedo

2 weeks ago

We trained 200+ ablation models to validate our processing decisions, and we share all the code you need to reproduce our setup, along with the checkpoints of our dataset comparison ablation models! Find out all about 🍷 FineWeb on the 🤗 model page: huggingface.co/datasets/Huggi…

Zacchary Hulsman @HulsmanZacchary

2 weeks ago

@gui_penedo Awesome work! When are we going to get a dataset with synthetic coding and math data?

massey branscomb @Memetic_Theory

2 weeks ago

@gui_penedo I'm stupid, how do I understand what's inside? (Like, can I start to segment and search for any parts of the data in any meaningful way? Like a semantic query?)

Jonathan Chang @cccntu

2 weeks ago

@gui_penedo Curious: why do some months seem to be missing?

Manuel Faysse @ManuelFaysse

2 weeks ago

@gui_penedo Do you anneal the learning rate before downstream task evaluation for each of the checkpoints? Or do you just evaluate each checkpoint as is?

Ohad Rubin @OhadRubin

2 weeks ago

@gui_penedo Do you have a token length histogram lying around, maybe? (Compared to C4 and Dolma, maybe?)
