For most of the last decade, compute was the limiting factor in AI. More GPUs meant better models. That's still partially true for frontier pretraining, but the margin is narrowing. The research papers that keep surprising people — models that punch above their compute budget — are almost always doing something interesting on the data side, not the architecture side.
We started calling this the data layer thesis about a year ago. The short version: in a world where compute is increasingly available and architectures are increasingly similar, the quality and curation of training data is becoming the primary differentiator. And the tooling to systematically create high-quality training data is, as of late 2025, still largely missing from the market.
What "Data Quality" Actually Means
The phrase gets thrown around loosely, so let's be specific. Training data quality has at least four dimensions:
- Accuracy: Is the information correct? For instruction-tuning data, does the response actually answer the question correctly? Does it reflect current knowledge?
- Diversity: Does the dataset cover the task space well, or is it concentrated in easy examples? Models trained on skewed distributions develop skewed capabilities.
- Difficulty calibration: Recent work suggests that including a significant fraction of genuinely hard examples, ones where even a strong model struggles, produces substantially better performance on out-of-distribution tasks. Curating for difficulty requires knowing what's hard, which requires running evals (a minimal sketch of that scoring loop follows this list).
- Consistency: For instruction-following data, are the annotations consistent? The same task presented in different forms should get the same treatment. Inconsistency in annotation teaches inconsistent behavior.
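To make the difficulty point concrete, here's a minimal sketch of the scoring-and-curation loop. `generate` and `is_correct` are placeholders for a model call and a task-specific verifier, not any particular API:

```python
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    prompt: str
    reference: str
    pass_rate: float = 1.0  # filled in by score_difficulty

def score_difficulty(
    examples: List[Example],
    generate: Callable[[str], str],          # placeholder: sampled model call
    is_correct: Callable[[str, str], bool],  # placeholder: task-specific verifier
    k: int = 8,
) -> None:
    """Difficulty = empirical pass rate of a strong model over k sampled attempts."""
    for ex in examples:
        passes = sum(is_correct(generate(ex.prompt), ex.reference) for _ in range(k))
        ex.pass_rate = passes / k

def curate(examples: List[Example], target_size: int,
           hard_fraction: float = 0.3, hard_threshold: float = 0.25) -> List[Example]:
    """Reserve a slice of the curated set for genuinely hard examples
    (low pass rate); fill the remainder uniformly so easy cases stay covered."""
    hard = [ex for ex in examples if ex.pass_rate <= hard_threshold]
    rest = [ex for ex in examples if ex.pass_rate > hard_threshold]
    random.shuffle(hard)
    random.shuffle(rest)
    n_hard = min(len(hard), int(hard_fraction * target_size))
    return hard[:n_hard] + rest[: target_size - n_hard]
```

The split keeps a floor of hard examples without throwing away easy coverage; the right `hard_fraction` and threshold are empirical questions that vary by task family.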
Most large-scale datasets fall short on all four dimensions, because they were collected at scale from the web without systematic curation. The result is models that are broad but uneven: very capable on common tasks, unreliable on specific ones.
The Scaling Law Implication
Neural scaling laws describe the relationship between compute, data, and model performance. The canonical version, the Chinchilla result, says that for a fixed compute budget, model size and training-token count should grow together in roughly equal proportion. What that framing misses is data quality. If you hold data quantity fixed and raise data quality, say by filtering out the worst half of a dataset and filling that space with higher-quality examples, the scaling dynamics change.
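For readers who want the formula: the Chinchilla fit models loss as a sum of a parameter term and a data term. Stating the quality argument in those terms is our interpretation, not a measured result:

```latex
% Chinchilla parametric loss (Hoffmann et al., 2022):
%   N = parameters, D = training tokens, E = irreducible loss,
%   A, B, \alpha, \beta = fitted constants.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% The quality argument, in these terms: better curation lowers the
% effective B (or raises \beta), so the same D buys more loss reduction.
```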
We've seen internal ablations from teams in our network where filtering a pretraining dataset from 1T to 400B tokens using quality heuristics produced a model that outperformed the 1T-token baseline on most benchmarks. Less data, better data. The model got more efficient use out of each token. That has direct implications for the economics of training and for who can compete at the frontier.
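The specific heuristics in those ablations aren't public, but the shape of the operation is simple. A hypothetical sketch, with a toy scorer standing in for a trained quality classifier or reference-model perplexity:

```python
import heapq
from typing import Callable, Iterable, List

def toy_quality_score(doc: str) -> float:
    """Toy stand-in for a real quality signal: favor prose-like documents.
    Production filters use trained classifiers or reference-model perplexity."""
    if not doc:
        return 0.0
    alpha = sum(c.isalpha() or c.isspace() for c in doc) / len(doc)
    lines = doc.splitlines() or [doc]
    mean_len = sum(len(line) for line in lines) / len(lines)
    return alpha * min(mean_len / 80.0, 1.0)

def keep_top_fraction(docs: Iterable[str],
                      score: Callable[[str], float] = toy_quality_score,
                      keep_fraction: float = 0.4) -> List[str]:
    """Score every document, keep the top-scoring fraction. A production
    pipeline would stream, dedup, and decontaminate first; this is the core cut."""
    scored = [(score(d), d) for d in docs]
    k = max(1, int(keep_fraction * len(scored)))
    return [d for _, d in heapq.nlargest(k, scored, key=lambda t: t[0])]
```

The `keep_fraction=0.4` default mirrors the 1T-to-400B cut described above; in practice it's a hyperparameter you'd sweep against downstream evals.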
Where the Tooling Gap Is
The data pipeline problem is not just about what data exists — it's about the infrastructure to process, label, evaluate, and maintain it. The specific gaps we see most often:
Synthetic data generation at quality. Generating synthetic training examples is tractable with current models. Generating synthetic examples that are correct, diverse, and well-calibrated in difficulty is much harder. The teams building systematic pipelines for high-quality synthetic data — with verification steps, consistency checks, and automated quality scoring — are solving a real bottleneck.
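As a sketch of what "verification steps" can mean in practice, here's a minimal generate-verify-dedup loop. `generate` and `verify` are placeholders for an LLM call and an independent checker (a unit test, a solver, a second model), not any particular API:

```python
import hashlib
from typing import Callable, List, Set, Tuple

def synth_pipeline(
    seed_tasks: List[str],
    generate: Callable[[str], Tuple[str, str]],  # seed -> (question, answer)
    verify: Callable[[str, str], bool],          # independent correctness check
    n_per_seed: int = 4,
) -> List[Tuple[str, str]]:
    """Generate -> verify -> dedup. Nothing enters the dataset without
    passing an independent correctness check."""
    seen: Set[str] = set()
    kept: List[Tuple[str, str]] = []
    for seed in seed_tasks:
        for _ in range(n_per_seed):
            question, answer = generate(seed)
            key = hashlib.sha256(question.strip().lower().encode("utf-8")).hexdigest()
            if key in seen:
                # exact-duplicate guard; real pipelines also run embedding-based
                # near-dup detection and diversity checks
                continue
            if not verify(question, answer):
                continue  # correctness gate
            seen.add(key)
            kept.append((question, answer))
    return kept
```

The structural point is that generation and verification are separate components: the verifier should fail independently of the generator, otherwise you're just grading your own homework.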
Human-in-the-loop annotation at scale. For tasks requiring expert judgment — medical annotation, legal clause labeling, code correctness — you can't automate annotation entirely. The tooling for managing expert annotators, tracking annotation quality over time, and catching labeler drift is immature. Most organizations doing this at scale built their own infrastructure and wish they hadn't had to.
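One concrete mechanism for catching labeler drift, sketched under the assumption that gold-standard QC items are seeded into the normal annotation queue:

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List

class DriftMonitor:
    """Track each annotator's rolling agreement with gold-standard items and
    flag drift when agreement over the last `window` items falls below `floor`."""

    def __init__(self, window: int = 200, floor: float = 0.9):
        self.window = window
        self.floor = floor
        self._hits: Dict[str, Deque[bool]] = defaultdict(
            lambda: deque(maxlen=window))

    def record(self, annotator: str, label: str, gold: str) -> None:
        # Gold items are pre-labeled QC examples mixed into the regular queue.
        self._hits[annotator].append(label == gold)

    def flagged(self) -> List[str]:
        # Only flag once a full window of QC items has accumulated.
        return [name for name, hits in self._hits.items()
                if len(hits) == self.window and sum(hits) / len(hits) < self.floor]
```

Exact-match agreement is the simplest signal; for subjective tasks you'd swap in a chance-corrected statistic, but the rolling-window structure is the same.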
Data attribution and lineage. If your model has a capability gap or a specific failure mode, tracing it back to the training data is hard. Tools that maintain provenance through the pipeline — from raw source to processed example to training batch — make debugging possible. Without them, improving model quality is partially guesswork.
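A minimal version of lineage is just a per-example record of (stage, input hash, output hash) chained from the raw source. A hypothetical sketch:

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List

def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

@dataclass
class Lineage:
    source: str                               # e.g. crawl URL or dataset shard id
    steps: List[Dict[str, str]] = field(default_factory=list)

    def apply(self, name: str, transform: Callable[[str], str], text: str) -> str:
        """Run one pipeline stage and record (stage name, input hash, output hash),
        so any training example can be traced back to its raw source."""
        out = transform(text)
        self.steps.append({"step": name, "in": digest(text), "out": digest(out)})
        return out

    def to_json(self) -> str:
        return json.dumps({"source": self.source, "steps": self.steps})
```

The final content hash becomes the join key: when an eval failure points at a capability gap, you can walk from training batches back through each transform to the raw documents that produced them.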
The Business Case
The market for training data tooling is large and growing, but the buyer is specific. Organizations training their own models — whether frontier labs, enterprises fine-tuning for specific tasks, or vertical AI companies — all have a data quality problem. The budget for solving it exists. The technical buyer is an ML engineer who understands the problem deeply and can evaluate solutions on technical merit.
We're looking for companies that have built real infrastructure here — not just a labeling interface but a full pipeline with quality measurement, human oversight, and systematic improvement loops. The best ones have a point of view on what makes data good for a specific class of tasks, and they've validated that point of view with customers who saw measurable improvement in their model performance.
Building training data infrastructure, synthetic data pipelines, or annotation tooling? We want to hear from you.