Spend 80% of Your LLM Compute on Data, Not Training
Open LLM efforts aim to build the best possible models under a fixed development compute budget. For these efforts, the central question is not whether to train a model, but how to allocate compute to maximize capability.
Current practice often answers this question implicitly: most development compute goes to the final training run, with data work receiving only a small fraction of the budget. The result is a budget spent on producing a single set of weights rather than on improving the data those weights are trained on.
Recent results suggest that heavily investing in data work can match or exceed open frontier performance with substantially less training compute. AI2's OLMo 3 matches Qwen-3 32B while training on approximately 6x fewer tokens. FineWeb-Edu achieves ~8x token efficiency through quality filtering. BeyondWeb reports 7.7x faster training through systematic rephrasing of web documents. These are multiplicative improvements in capability per compute dollar, not marginal gains.
Beyond immediate efficiency, data investments also exhibit a durability that training investments lack: a curated dataset from 2024 can train models in 2024, 2025, and 2026, while a model trained in early 2024 is likely superseded within the same year. Data investments persist across model generations; training investments depreciate within months.
I believe open LLM efforts should invest the large majority of their compute in data, and not in model training. Efficiency multipliers of 6–9x imply that the majority of compute, approximately 80%, can profitably go to data work. This compute funds three modes: selection (annotation and quality filtering), transformation (rephrasing and restructuring), and generation (synthetic data at scale). Training-only scaling faces hard limits when high-quality data is exhausted or repeated; data compute offers a path to continued capability gains by producing higher-utility tokens.
This complements the position of Kandpal & Raffel (ICML 2025), who argue that data should be the most expensive part of an LLM through fair compensation to data creators; I argue it should be the most expensive part from a compute allocation perspective.
Data Compute: What to Invest Compute In
If data-centric allocation produces better models, then labs should invest in data compute: compute spent to increase training signal per FLOP by selecting, transforming, and generating training data. I'll walk through three modes, each illustrated with concrete projects that report costs and measurable gains.
Selection Compute
Selection compute covers model-driven data curation: LLM-based annotation, learned scoring models, and large-scale filtering decisions.
FineWeb-Edu is a good example: 450K documents are labeled with Llama-3 70B to produce supervision for an "educational value" signal, a small encoder model is trained to score documents at web scale, and the 15T-token FineWeb corpus is filtered by thresholding on those scores. This yields roughly an 8x token-efficiency gain, with recent work confirming similar gains for its multilingual counterpart FineWeb-2.
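A toy sketch of this score-then-threshold pattern; the keyword heuristic below is a deliberately crude stand-in for the trained encoder scorer, and all names and thresholds are illustrative:

```python
# Sketch of score-then-threshold filtering (FineWeb-Edu pattern).
# A toy keyword heuristic stands in for the trained quality scorer.

EDU_TERMS = {"theorem", "experiment", "definition", "tutorial", "analysis"}

def edu_score(doc: str) -> float:
    """Toy stand-in for the encoder scorer: fraction of educational terms."""
    words = doc.lower().split()
    if not words:
        return 0.0
    return sum(w.strip(".,") in EDU_TERMS for w in words) / len(words)

def filter_corpus(docs, threshold=0.05):
    """Keep documents whose quality score clears the threshold."""
    return [d for d in docs if edu_score(d) >= threshold]

docs = [
    "A tutorial on the central limit theorem with a worked experiment.",
    "click here to win amazing prizes now",
]
kept = filter_corpus(docs)  # keeps only the educational document
```

The real pipeline replaces `edu_score` with a fine-tuned encoder distilled from Llama-3 70B annotations, but the structure, one score per document applied as a threshold at corpus scale, is the same.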
Scoring and filtering 15 trillion tokens consumed approximately 6,000 H100 GPU-hours; the full project, including all experiments, required roughly 80,000 H100 GPU-hours. That sounds like a lot in isolation, but it's modest relative to frontier model training budgets: 80,000 GPU-hours is approximately 1.6% of a representative 5-million GPU-hour training run. This asymmetry points to systematic underinvestment: if 1.6% of a frontier compute budget can produce multi-fold efficiency gains, then scaling selection compute is among the highest-return uses of additional GPU-hours.
A second example is Nemotron-CC, which uses a classifier ensemble to bucket documents into quality tiers rather than applying a single keep/discard threshold, enabling tier-specific downstream processing.
Despite this potential, current practice remains conservative: both examples rely on small models chosen for throughput, not capability, like FastText classifiers or lightweight encoder models. Yet selection is arguably the most consequential decision in dataset construction: a single score per document, applied at billion-document scale, can shape the entire training distribution.
Larger or more expressive selection models (e.g., billion-parameter LLMs, multi-attribute classifiers, or ensembles) could move beyond binary keep/discard decisions to capture dimensions like reasoning depth or time-sensitivity of documents. Such models would enable fine-grained filtering, curriculum design, and experimentation with selection criteria that remain unexplored at scale. Given the downstream efficiency gains, economizing on selection compute is a false economy; the marginal dollar is better spent on smarter curation than on additional training tokens.
Aggressive filtering carries risks, though: quality thresholds can inadvertently remove valuable diversity. FineWeb found, for instance, that global deduplication yields worse outcomes than per-snapshot deduplication. A single definition of data quality in terms of "educational value" might also miss the full diversity of good training signals. This is why selection compute must fund ablation sweeps to detect harmful heuristics before they're baked into final datasets.
Transformation Compute
Transformation compute covers document rephrasing, format lifting, extraction, and structure imposition. Instead of synthesizing content from scratch, transformation takes a source document as input and produces a cleaner, denser, or more structured version, staying grounded in real sources.
BeyondWeb illustrates what targeted transformation can achieve: 7.7x faster training compared to raw web data. The efficiency gains stem from multiple interacting mechanisms. Rephrasing compresses knowledge into denser tokens. Style transformations (e.g., converting to Q&A or instructional formats) close the gap between web-heavy pretraining distributions and conversational deployment, though gains from style-matching alone saturate quickly.
Diversity across transformation strategies is what sustains efficiency at scale: single-strategy approaches such as Q&A-only plateau early, while a diverse mix of formats, restructurings, and style modifications keeps improving across longer training. Critically, BeyondWeb surpasses the "full data upper bound" that naive data augmentation cannot break, suggesting that strategic restructuring fills distributional gaps that raw web data underrepresents.
Seed quality also matters, though the relationship is nuanced. BeyondWeb finds that rephrasing high-quality source documents outperforms rephrasing lower-quality web text, even when the latter provides more novelty.
Nemotron-CC takes a tier-specific approach, using Mistral NeMo 12B to apply Wikipedia-style rewrites that denoise low-quality pages, while high-quality pages receive more aggressive transforms (QA generation, distillation, knowledge extraction) to produce denser, more structured tokens. The result is a 6.3T-token corpus with 1.9T transformed tokens, demonstrating that transformation compute can be targeted by source quality rather than applied uniformly. This philosophy favors rephrasing over discarding: rather than aggressively filtering away most of the web, transformation can rescue lower-quality documents while enriching high-quality ones.
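A sketch of tier-routed transformation in this spirit; the transform functions are stubs where a real pipeline would call an LLM (Nemotron-CC uses Mistral NeMo 12B), and the 0.7 tier boundary is illustrative:

```python
# Sketch of quality-tiered transformation routing (Nemotron-CC pattern).
# Stub transforms stand in for LLM calls (rewrite vs. QA extraction).

def rewrite_clean(doc: str) -> str:
    """Stub: denoising Wikipedia-style rewrite for low-quality pages."""
    return f"[rewritten] {doc}"

def extract_qa(doc: str) -> str:
    """Stub: aggressive QA/knowledge extraction for high-quality pages."""
    return f"[qa-pairs] {doc}"

def transform(doc: str, quality: float) -> str:
    # Route by quality tier: rescue low-quality pages, enrich high-quality ones.
    return extract_qa(doc) if quality >= 0.7 else rewrite_clean(doc)

out = [transform("noisy forum page", 0.2),
       transform("solid textbook chapter", 0.9)]
```

The design point is that transformation compute is targeted by source quality: every document gets some treatment, but the expensive, generative transforms are reserved for pages worth enriching.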
Nemotron Nano 2, trained on Nemotron-CC-v2, extends this further by treating transformation as a curriculum knob: the share of transformed and SFT-style data increases in later training phases, rising to over 30% by the final phase. Their ablations underscore the value: synthesized multilingual QA pairs outperform curated multilingual crawl data, leading the authors to weight transformed data more heavily in the final mixture.
Like selection, transformation compute is likely underinvested: if rephrasing existing documents yields large training efficiency and model performance gains, scaling transformation pipelines is a high-return use of compute.
Generation Compute
Generation compute produces synthetic corpora from scratch or from minimal seeds, rather than transforming existing documents. Where transformation requires source documents to rewrite, generation can cover domains where good sources are scarce, create novel reasoning chains, or explore distributions that natural data underrepresents.
Synthetic data generation is also a response to the "data wall": as high-quality natural text runs out (estimates suggest the stock of public human-generated text may be exhausted by 2026–2032), synthetic generation offers a path to continued scaling. And aggressive selection significantly reduces corpus size: a dataset filtered for quality may end up being too small to train frontier models, even if each token is highly efficient. Synthetic data can restore scale while preserving the gains from selection, producing a large corpus of high-value tokens rather than forcing a choice between quality and quantity.
Nemotron-CC-v2 uses multiple LLMs as data generators for math, code, and tool-calling corpora, where correctness can often be verified programmatically. They mostly rely on large mixture-of-experts (MoE) models, such as Qwen3-235B and DeepSeek-R1, to produce high-quality tokens at high inference throughput. TOUCAN synthesizes 1.5 million tool-agentic trajectories from real-world MCP servers, using execution-based validation to filter generated data.
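A minimal sketch of execution-based validation for generated code data, assuming each candidate comes paired with a test it should pass (pipelines like TOUCAN run much richer checks; `verify` here is a toy, unsandboxed stand-in):

```python
# Sketch of generate-then-verify for synthetic code data: keep only
# candidates whose accompanying test executes successfully.

def verify(code: str, test: str) -> bool:
    """Run candidate plus test in a fresh namespace; pass = keep.
    NOTE: a real pipeline must sandbox this execution."""
    ns: dict = {}
    try:
        exec(code, ns)
        exec(test, ns)
        return True
    except Exception:
        return False

candidates = [
    ("def add(a, b): return a + b", "assert add(2, 3) == 5"),
    ("def add(a, b): return a - b", "assert add(2, 3) == 5"),  # wrong program
]
kept = [code for code, test in candidates if verify(code, test)]
```

Only the correct implementation survives; the verifier, not the generator, guarantees the quality of the retained tokens.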
Recent fully synthetic pretraining projects push generation compute further. SYNTH, a 75-billion-token synthetic corpus derived from only 50,000 Wikipedia articles, reports allocating approximately 95% of total project compute to data pipeline work (synthetic generation, validation, curation) and only 5% to final training, with the resulting models achieving state-of-the-art performance for their size. This 95/5 split illustrates a broader pattern: as generation scales, the bottleneck shifts from training compute to data-side compute.
Synthetic generation carries risks, though: models trained on synthetic data can lose diversity and degrade, a phenomenon called model collapse. Effective synthetic pipelines need validation infrastructure: diversity constraints, real-data anchors, and drift detection. SYNTH demonstrates that grounding generation in curated seed documents and applying LLM-as-judge curation can maintain diversity even in fully synthetic corpora. Yet even validated synthetic investments remain modest relative to frontier training budgets, and the field has only begun to explore what systematic, large-scale synthetic generation can achieve.
Transformation and generation compute can easily reach millions of GPU-hours, dwarfing current selection compute by two orders of magnitude.
Arcee AI's Trinity models illustrate how transformation and generation compute combine at scale, using over 8T synthetic tokens (~47% of the 17T-token corpus), including ~6.5T synthetic web produced via rephrasing and format transformation, ~1T synthetic multilingual data, and ~0.8T synthetic code. They report that generating this data required clusters peaking at 2,048 H100 GPUs running for approximately one month, totaling over one million GPU-hours for data synthesis alone. The resulting 400B sparse MoE matches peer models on standard benchmarks, demonstrating that synthetic-heavy pretraining can scale to frontier performance when transformation and generation are deployed together.
The Filter-Then-Augment Imperative
If filtering like FineWeb-Edu yields 8x token efficiency, why invest in more compute-expensive approaches like synthetic data generation? The answer lies in a fundamental tension: filtering improves training token efficiency but reduces dataset size.
Both FineWeb-Edu and Nemotron-CC achieve their efficiency gains by discarding the majority of tokens. From original corpora of 15T and 4.5T tokens, aggressive quality filtering produces much smaller high-quality subsets of only 553B and 1.3T tokens, respectively. This creates a ceiling: you cannot filter your way to an arbitrarily large high-quality dataset because filtering is subtractive by nature. At some point, further filtering leaves you with too few tokens to train frontier models, which require trillions of training tokens to reach compute-optimal performance.
Synthetic data generation is the solution to this tension. It is the mechanism for scaling the dataset back up while preserving the efficiency gains from quality filtering. Rather than training on 15T tokens of mixed quality, the data-centric approach is:
- Filter aggressively: Use quality classifiers to identify high-utility tokens, accepting that this drastically reduces dataset size
- Augment systematically: Use transformation and generation (rephrasing, Q&A pair generation, knowledge extraction) to expand the filtered corpus back to frontier scale
- Validate continuously: Use ablation experiments to verify that the resulting tokens maintain the efficiency multiplier
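The three steps can be sketched as a pipeline skeleton; every function here is a placeholder for the real component (quality classifier, LLM rephraser, ablation harness), and the counts are illustrative:

```python
# Filter-then-augment skeleton: subtractive filtering followed by
# generative expansion, with a validation hook. All stages are stubs.

def quality_filter(scored_docs, threshold=0.5):
    """Stub classifier: keep docs scored above a quality threshold."""
    return [doc for doc, score in scored_docs if score >= threshold]

def augment(docs, expansion=3):
    """Stub for rephrasing / QA generation: expand each kept doc."""
    return [f"{doc} (variant {i})" for doc in docs for i in range(expansion)]

def validate(size_before, size_after):
    """Stub ablation check: did augmentation restore corpus scale?"""
    return size_after >= size_before

scored = [("good doc", 0.9), ("spam", 0.1), ("ok doc", 0.6)]
kept = quality_filter(scored)   # shrink: 3 docs -> 2
corpus = augment(kept)          # expand: 2 docs -> 6
assert validate(len(scored), len(corpus))
```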
This filter-then-augment pipeline produces datasets that are both large (many trillions of tokens) and high-quality (maintaining the efficiency multiplier). As illustrated by BeyondWeb and Nemotron-CC above, this approach avoids the false choice between quality and scale.
The implication for compute allocation is clear: all three modes deserve substantial investment. Selection is cheap per token but high-leverage; transformation and generation are expensive per token but necessary for scale. Together, they enable training on trillions of high-utility tokens rather than choosing between quality (small filtered datasets) and quantity (large raw datasets).
Where Reinforcement Learning with Verifiable Rewards Fits In
In late 2024, posttraining LLMs using Reinforcement Learning with Verifiable Rewards (RLVR) gained widespread attention with OpenAI's o1 model. DeepSeek's R1 then demonstrated that RLVR could produce strong reasoning capabilities without supervised fine-tuning on chain-of-thought data, cementing it as a key technique for building reasoning models.
Conceptually, RLVR is data compute in disguise: rollouts generate candidate trajectories, a verifier provides pass/fail signals, and selection retains successful trajectories. These operations mirror two of the three modes of data compute: generation (rollouts produce candidates) and selection (retain winners). To illustrate, DeepSeek-R1-Zero claims a 5x performance gain on AIME 2024 (15.6% to 77.9%) through RL training alone, consuming approximately 100,000 H800 GPU-hours to generate and validate millions of reasoning trajectories.
It's natural to classify all RL compute as "training compute." The label matters less than the distinction in mechanism: traditional supervised training optimizes parameters on a static dataset, while RLVR compute actively generates and validates the learning signal itself. The point is about the economics of leverage: when compute is spent to create higher-utility training signal, whether via offline synthesis or online rollouts, it can shift the capability-per-FLOP frontier.
How Much Compute to Invest in Data?
If data compute yields substantial efficiency gains, how much of a project's total compute budget should go to it? I'll define the efficiency multiplier, present published evidence, and derive a break-even formula.
The Token Efficiency Multiplier
For LLMs, the mechanism underlying data-centric allocation is the token efficiency multiplier: the factor by which improved data reduces the training compute required to reach a certain target capability or benchmark performance.
Let C_base be the training compute required to reach a target capability using a baseline data pipeline, and let C_improved be the compute required using an improved pipeline, holding model and training recipe constant. I define M = C_base / C_improved. Equivalently, assuming training compute scales with tokens processed, M can be interpreted as an effective-token multiplier: an improved pipeline maps N raw tokens to M·N quality-adjusted tokens relative to the baseline. This is analogous to the scaling-factor formulation in quality-aware scaling laws. With this definition, M = 2 means the improved data pipeline halves the training compute required to reach the same capability, or equivalently, 1T improved tokens yield the same capability as 2T baseline tokens.
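As a worked check, the multiplier is just the ratio of baseline to improved requirements at matched capability; plugging in FineWeb-Edu's reported comparison (300B baseline tokens vs 38B curated tokens):

```python
def efficiency_multiplier(baseline_tokens: float, improved_tokens: float) -> float:
    """Token efficiency multiplier M: tokens (or compute) needed by the
    baseline pipeline divided by tokens needed by the improved pipeline,
    at matched capability."""
    return baseline_tokens / improved_tokens

# FineWeb-Edu: matches a 300B-token baseline with 38B curated tokens.
M = efficiency_multiplier(300e9, 38e9)   # ~7.9
```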
The Break-Even Allocation
Given an efficiency multiplier M, how much compute can profitably go to data work? With baseline training compute C_train, and data improvements yielding M-fold training token efficiency, the training compute required to reach the same target becomes C_train / M. One can therefore "spend" up to C_train − C_train/M = C_train(1 − 1/M) on data work while matching baseline total compute. This yields the break-even share:

f_BE = 1 − 1/M,

where f_BE is the maximum share of compute that can be allocated to data work at break-even. For example, M = 2 implies f_BE = 50%; M = 5 implies f_BE = 80%; M = 6 implies f_BE ≈ 83%; M = 9 implies f_BE ≈ 89%.
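As code, the break-even share f_BE = 1 − 1/M evaluates to the figures used throughout this post:

```python
def break_even_share(M: float) -> float:
    """Maximum fraction of compute that can go to data work while
    matching baseline total compute: f_BE = 1 - 1/M."""
    return 1.0 - 1.0 / M

shares = {M: round(break_even_share(M), 2) for M in (2, 5, 6, 9)}
# {2: 0.5, 5: 0.8, 6: 0.83, 9: 0.89}
```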
Token Efficiency Multipliers: Published Evidence
Multiple independent projects report results from which I derive efficiency multipliers in the 6–9x range. They don't always report multipliers directly; I compute M from their token or compute comparisons.
| Project | Metric | M | f_BE |
|---|---|---|---|
| FineWeb-Edu | Tokens | ~7.9x | 87% |
| BeyondWeb | Speed | ~7.7x | 87% |
| OLMo 3 | Tokens | ~6x | 83% |
| FineWeb2-HQ | Speed | ~6x | 83% |
| SYNTH | Tokens | ~10–50x | 90–98% |
FineWeb-Edu. Quality filtering yields ~7.9x token efficiency: matching baseline performance with 38B tokens instead of 300B.
FineWeb2-HQ. Multilingual quality filtering results in ~6x training speedup across 20+ languages.
BeyondWeb. Systematic rephrasing yields 7.7x faster training, with an 8B model matching 180B-token baseline performance in 23.2B tokens.
OLMo 3. Rigorous data curation, including synthetic and rewritten math corpora in midtraining, enables matching Qwen-3 32B performance while training on ~6x fewer tokens.
SYNTH. Fully synthetic pretraining from curated seeds, reporting 10–50x token efficiency, with ~95% of project compute allocated to data pipeline work.
Nemotron-CC. Neither the Nemotron-CC report nor the Nemotron Nano 2 report provides controlled token-sweep comparisons that would let us derive a multiplier M. At a 1T-token training horizon, the full 6.3T dataset roughly matches DCLM on MMLU (53.0 vs 53.4), while a 1.1T high-quality subset achieves a +5.6 MMLU improvement. But without scaling curves for both datasets, we can't determine how many baseline tokens would be needed to reach the same capability.
Recommendation: Allocate 80% to Data
Given the consistency of 6–9x multipliers across independent projects, I propose 80% as a default policy. This is a conservative estimate: M = 6 implies f_BE ≈ 83%, and M = 9 implies f_BE ≈ 89%, so 80% sits 3–9 percentage points below break-even. SYNTH's 95/5 split suggests that even more aggressive allocations may be viable for generation focused on reasoning tasks with constrained seed diversity, though this has so far only been demonstrated at small scale.
Practitioners should follow a three-step process:
- Estimate M via small-scale proxy experiments comparing baseline and curated data pipelines, including derivation of data-quality-dependent scaling laws.
- Apply the break-even formula to determine f_BE = 1 − 1/M.
- Allocate with a safety margin below f_BE to account for uncertainty in multiplier estimates and pipeline costs.
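These steps reduce to a small calculator; the margin value below is an illustrative choice, not a prescription:

```python
def data_allocation(M: float, margin: float = 0.05) -> float:
    """Recommended data-compute share: break-even minus a safety margin,
    floored at zero. The margin absorbs uncertainty in the estimate of M."""
    return max(0.0, (1.0 - 1.0 / M) - margin)

# At the conservative floor M = 6, a ~3-point margin recovers the 80% default.
share = data_allocation(6, margin=0.03)   # ~0.80
```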
The safety margin is prudent because M is measured on proxy tasks and may not transfer perfectly to the full training run. Even at the conservative floor of documented multipliers, 80% is justified.
When multipliers are lower. If proxy experiments yield a smaller M, the break-even formula still applies. Lower multipliers justify proportionally lower data allocations, but the principle remains: allocate up to, but not beyond, break-even. The key point: for any M > 2, the majority of compute should go to data rather than training, so the core thesis holds for any efficiency multiplier above 2x.
Splitting across modes. Within the data compute budget, selection is cheap per token but high-leverage; transformation and generation are expensive but necessary for scale. I won't prescribe fixed ratios because the optimal split heavily depends on the quality and size of available source data. Projects starting from high-quality sources (e.g., curated domain corpora) may invest more heavily in generation; projects starting from noisy web crawls may need substantial selection and transformation first.
Limits of Training-Only Scaling
Training-only scaling faces two fundamental constraints. First, in data-constrained regimes, the marginal value of additional training compute decays toward zero: repeated epochs on fixed data yield diminishing returns, with performance gains vanishing after sufficient repetition. Second, redundant or low-quality data produces diminishing returns: high-density datasets with redundant information lead to sub-scaling, where performance improvements decelerate regardless of additional training compute.
These constraints are increasingly binding as estimates suggest the stock of public human-generated text may be exhausted by 2026–2032. The implication is that the binding constraint on capability is not training compute but high-utility token supply, which is exactly what data compute produces.
Fine Print
To be fair, efficiency multipliers from different studies may not be directly comparable; experimental setups, baseline choices, and evaluation metrics all vary. So M should be treated as conditional on the baseline, target metric, and training recipe, not as a universal constant. That said, the consistency of the 6–9x range across independent projects gives confidence that substantial training token efficiency gains are real. My 80% recommendation is derived from the conservative floor of this range.
Data Investments Compound
So far I've treated data and training compute symmetrically within a single training run. But there's a fundamental asymmetry in their medium- and long-term returns: high-quality datasets remain valuable across multiple model generations, while individual model checkpoints depreciate within months. This asymmetry further strengthens the case for data-centric allocation.
The Durability Asymmetry
Model weights go stale fast. The interval from Llama-1 to Llama-2 was 5 months, followed by Llama-3 nine months later. The Stanford AI Index reports 149 foundation models released in 2023, more than double the 2022 count, indicating that the model frontier advances rapidly. Architectural innovations, improved training recipes, and expanded capabilities (reasoning, tool use, multimodality) compound this effect: older weights become cost-inefficient relative to newer alternatives within a year of release. Models do retain secondary value as teachers for distillation, baselines for regression testing, or cost-efficient solutions for constrained tasks. But these residual uses represent a fraction of the original investment's intended return; the model's value as a frontier capability asset depreciates rapidly.
Datasets, by contrast, persist across model generations. FineWeb, released in mid-2024, is continuously updated and has produced multiple derivatives (FineWeb-Edu, FineWeb 2, FineWeb2-HQ), with multiple generations of models trained on them (Apertus, SmolLM2, SmolLM3, Salamandra). Nemotron 3 is the third generation of models trained on Nemotron-CC. Dolma evolved through three major versions, with Dolma 3 being used to train OLMo 3. In each case, the initial curation investment continues to yield value years after the original expenditure.
Mechanisms of Compounding
Data investments compound in several ways. First, a well-curated dataset can train multiple model generations, including smaller variants and domain-specialized fine-tunes, without repeating the original curation cost. Second, datasets spawn derivatives: quality-filtered subsets, format-transformed versions, and domain-specific extractions each represent new assets built atop prior work. Third, when datasets are released openly, their value multiplies across the ecosystem. FineWeb-Edu has been adopted by numerous open LLM projects, reducing the effective per-project cost of data curation toward zero.
Training compute, by contrast, produces a single artifact, with each new model generation requiring a full training run.
Limits of Dataset Durability
Datasets also decay, though on longer timescales and through different mechanisms than models. Staleness occurs as facts, language patterns, and user behavior evolve. Contamination risk grows when evaluation data leaks into training corpora, reducing diagnostic value. Legal and licensing uncertainties can render datasets unusable if provenance is not carefully tracked.
This reinforces the argument: data-centric allocation should fund not only initial curation but also ongoing maintenance.
Implications for Long-Term Returns
The durability asymmetry implies that the true return on data compute exceeds what single-run break-even analysis suggests. If a dataset serves three model generations, its effective cost per model is one-third of the original investment. If it is shared openly and adopted by ten projects, the per-project cost approaches zero. Dataset maintenance is incremental (adding crawls, updating filters, refreshing classifiers), whereas model "maintenance" requires full retraining.
This asymmetry has strategic implications for compute allocation. A project investing 80% of its budget in data is constructing infrastructure for multiple generations; a project investing 80% in training is producing a single artifact with a 12–18-month shelf life as a frontier system. For open LLM efforts operating under constrained budgets, the compounding returns of data investment provide a durable competitive advantage that training-centric allocation cannot match.
Pushback
There are three credible counterarguments worth addressing.
"Training scale is sufficient."
Labs should allocate the majority of compute to training because scale has been the primary driver of capability gains. Qwen-3 trained on 36T tokens and achieved state-of-the-art results; classic scaling laws successfully predicted performance while holding data quality fixed.
My take: Even apparent brute-force approaches invest heavily in data quality. Qwen-3's training included a 5T-token reasoning stage of high-quality STEM data, annotation of 30T tokens for educational value and domain, and instance-level mixture optimization via ablations. Classic scaling laws answer "how to split compute between model size and tokens given a data distribution"; the claim here is orthogonal: optimizing compute spent to change the distribution shifts the frontier itself. For labs without frontier resources, data efficiency is the only viable path: OLMo 3 achieved 80.5% on GSM8K versus Marin's 69.1%, both 32B models with full transparency.
"Synthetic data cannot exceed its source."
A model trained on synthetic data generated by model X cannot surpass X. Information-theoretically, you cannot extract more signal than the generator contains; distillation research confirms that student models typically underperform their teachers. If synthetic data has a fundamental ceiling, heavy investment in generation compute is wasteful.
My take: The distillation analogy is misleading because distillation transfers a distribution, whereas generate-then-verify selects from candidates. The key distinction is reliability versus coverage: a generator with 5% accuracy on hard problems still contains correct solutions in its output distribution; verification selects those correct outputs even though the generator cannot produce them reliably. This is precisely the mechanism underlying RLVR: rollouts generate candidates, verifiers identify successes, and training amplifies what works. DeepSeek-R1-Zero's 5x capability gain demonstrates that generate-then-verify can exceed what the base model could achieve through supervised learning alone. Ensembling multiple generators with different failure modes increases coverage beyond any single model. Iterative pipelines compound these gains: a model trained on verified outputs becomes a better generator for the next round. The theoretical ceiling, if it exists, depends on whether generators can produce correct outputs at all, not on whether they can do so reliably.
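The reliability-versus-coverage argument is quantifiable. If a generator solves a hard problem with probability p per independent sample, the chance that k verified samples contain at least one correct solution is 1 − (1 − p)^k:

```python
def coverage(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct,
    given per-sample success probability p."""
    return 1.0 - (1.0 - p) ** k

# A 5%-reliable generator, filtered by a verifier over 100 samples:
hit_rate = coverage(0.05, 100)   # ~0.994
```

A generator that is right only one time in twenty still yields a usable training signal over 99% of the time once a verifier picks the winners, which is why generate-then-verify is not bounded by the generator's reliability.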
"Posttraining is where compute should go."
Given base model commoditization, the highest-value compute should go to RL posttraining with verifiable rewards (RLVR), not pretraining data. If posttraining delivers such leverage, pretraining data quality matters less, and one should allocate the majority of compute to posttraining.
My take: I agree RLVR is high-leverage, but view it as supporting this thesis rather than contradicting it. RLVR's leverage comes from investing compute in signal quality (verifier design, task distribution, trajectory selection) rather than pure scaling. Rollouts are generation; retaining winners is selection. However, RLVR requires tasks with verifiable objectives (math, code, formal proofs), whereas pretraining data quality is universally relevant. And weak pretraining caps RLVR gains: you cannot RL-finetune your way out of a poor base model. The underlying principle, that signal quality beats pure scaling, extends from pretraining to posttraining.
Call to Action
Practitioners should rethink compute allocation: estimate the token-efficiency multiplier via proxy experiments, apply the break-even formula, and allocate compute accordingly. The default assumption that training should dominate the budget is outdated and suboptimal. My recommendation of 80% data compute reflects a conservative floor of documented multipliers.
For researchers, this position points to several important directions. Scaling law research should incorporate data pipelines, formalizing how effective tokens depend on data compute investment. Data selection research should explore scaling up annotators and scorers with capable LLMs rather than lightweight classifiers, to curate training corpora with richer quality signals. Synthetic data generation research should study how to produce diverse synthetic tokens at trillion-token scale without distributional collapse.
Conclusion
Open LLM projects face a fundamental choice: allocate compute to training runs that produce rapidly depreciating model weights, or invest in data work that yields compounding returns across model generations. The evidence is clear. Independent projects report efficiency multipliers of 6–9x from data curation, filtering, and synthetic generation. These are not marginal improvements but multiplicative gains that shift the capability-per-FLOP frontier. The break-even analysis is unambiguous: when data work yields M-fold training efficiency, up to f_BE = 1 − 1/M of compute can profitably go to data. For M = 6, this implies 83%; for M = 9, it reaches 89%. My recommended 80% allocation is conservative.
Beyond immediate training token efficiency, data investments possess a durability that training investments lack. A curated dataset serves multiple model generations, spawns derivatives, and, when released openly, multiplies its value across the ecosystem. Model weights, by contrast, depreciate within months as the frontier advances. This asymmetry means that the true return on data compute exceeds what single-run analysis suggests.
The field's default assumption, that training should dominate compute budgets, is a legacy of an era when data was abundant and curation was manual. That era has ended. High-quality natural text is finite; aggressive filtering shrinks corpora below frontier training requirements; and the capability gap between curated and uncurated data continues to widen. Data compute, in the form of selection, transformation, and generation, is the mechanism for escaping these constraints.
I call on open LLM projects to adopt explicitly data-centric compute accounting: estimate efficiency multipliers via proxy experiments, apply the break-even formula, and allocate accordingly. The projects that invest heavily in data infrastructure will not only train better models today but will build durable assets for the models of tomorrow.
Recommended Reads
- Synthetic Pretraining by Pierre-Carl Langlais
