The Synthetic Data Playbook: Generating Trillions of the Finest Tokens
Visualize synthetic-data experiments as an interactive bookshelf
We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science)
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Viewer to explore the finewiki dataset
Explore and download the FineWeb web-scale text dataset
Evaluate multilingual models using FineTasks
Explore and analyze experiment results
Launch an interactive demo interface