The dominant assumption in AI development for the past decade has been simple: more data makes better models. It has the intuitive appeal of most wrong ideas. If a model learns from patterns in data, then more patterns mean more learning, right? The reality is more complicated, and understanding why tells you something important about how these systems actually work.

This isn’t a fringe concern. It’s a pattern researchers encounter regularly, and it has real consequences for how companies deploy AI systems. A model trained on a carefully curated dataset of one million examples can consistently outperform one trained on a billion examples scraped from the open web. Smaller, focused models frequently beat larger ones on specific benchmarks. Medical AI offers a concrete version of the same lesson: Google’s diabetic retinopathy screening model, for example, performed impressively in lab evaluations but struggled in Thai clinics, where lighting, image quality, and workflow differed from the conditions its training data represented. More data from the wrong distribution didn’t protect it.

What Overfitting Actually Means (and Why More Data Makes It Worse)

Overfitting is the most familiar culprit here, but it’s often explained too narrowly. The textbook version goes like this: a model that memorizes its training data rather than learning generalizable patterns will fail on new examples. True enough. But the more interesting version of the problem is what happens when you keep adding data without improving its quality or relevance.

Imagine you’re training a model to identify fraudulent financial transactions. You have 100,000 labeled examples of fraud. Someone suggests adding another 10 million transaction records from a different financial system with different fraud patterns, different customer demographics, and slightly different labeling standards. Your dataset just grew by 100x. Your model’s performance on the original task almost certainly got worse.

The model isn’t stupid. It’s doing exactly what it was built to do: finding patterns across all the data it was given. The problem is that those patterns now include noise, contradictions, and signals that are specific to a different problem. The model learned something, just not what you wanted it to learn.
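The effect is easy to reproduce in miniature. The sketch below is a stylized simulation, not a real fraud system: one synthetic feature, invented numbers, and a nearest-centroid classifier standing in for a real model. It trains once on System A alone and once on A plus 10x more data from a System B whose fraud pattern differs, then evaluates both on System A.

```python
import random

def make_system(rng, n, fraud_mean, legit_mean, noise=0.5):
    """Generate synthetic (feature, label) pairs for one financial system."""
    out = []
    for _ in range(n):
        y = int(rng.random() < 0.5)
        out.append((rng.gauss(fraud_mean if y else legit_mean, noise), y))
    return out

def nearest_centroid(train):
    """A deliberately simple model: classify by the nearer class centroid."""
    fraud = [x for x, y in train if y == 1]
    legit = [x for x, y in train if y == 0]
    c1, c0 = sum(fraud) / len(fraud), sum(legit) / len(legit)
    return lambda x: 1 if abs(x - c1) < abs(x - c0) else 0

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

rng = random.Random(0)
# System A: fraud shows up as unusually HIGH values of the feature.
train_a = make_system(rng, 1000, fraud_mean=2.0, legit_mean=-2.0)
test_a = make_system(rng, 1000, fraud_mean=2.0, legit_mean=-2.0)
# System B: 10x more records, but fraud presents differently there.
train_b = make_system(rng, 10000, fraud_mean=-1.0, legit_mean=1.0)

focused = nearest_centroid(train_a)
mixed = nearest_centroid(train_a + train_b)

print(f"focused model, accuracy on A: {accuracy(focused, test_a):.2f}")
print(f"mixed model,   accuracy on A: {accuracy(mixed, test_a):.2f}")
```

With these deliberately contrived distributions, System B’s 10,000 records dominate the mixed model’s centroids and its decision rule inverts on System A’s data. Real degradation is usually subtler, but the mechanism is the same: the averaged pattern no longer matches any one system.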

The Label Noise Problem Is Underestimated

When datasets are small, human annotators can carefully label each example. When datasets are large, that process becomes expensive and error-prone. At the scale of billions of documents, labeling is often automated, crowd-sourced under time pressure, or inferred rather than directly observed.

Label noise, where training examples are assigned incorrect or inconsistent labels, doesn’t wash out harmlessly at scale. Research in this area shows that even modest rates of label noise (around 10-20%) can significantly degrade model performance. When noise is systematic rather than random, meaning that certain types of examples are consistently mislabeled in the same direction, the effect compounds. The model learns the bias in the labeling process as if it were signal about the world.
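The degradation is easiest to see with a model that memorizes aggressively. The hedged sketch below uses synthetic data and a 1-nearest-neighbour classifier as the extreme case of memorization: flipping a fraction of training labels at random, then measuring accuracy on a clean test set.

```python
import random

def one_nn(train):
    """1-nearest-neighbour: the extreme case of a model that memorizes."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]

def flip_labels(data, rate, rng):
    """Corrupt a random fraction of the labels."""
    return [(x, 1 - y) if rng.random() < rate else (x, y) for x, y in data]

def sample(n, rng):
    """Two well-separated synthetic classes."""
    out = []
    for _ in range(n):
        y = int(rng.random() < 0.5)
        out.append((rng.gauss(2.0 if y else -2.0, 1.0), y))
    return out

rng = random.Random(1)
train, test = sample(1000, rng), sample(500, rng)

accs = {}
for rate in (0.0, 0.1, 0.2):
    model = one_nn(flip_labels(train, rate, rng))
    accs[rate] = sum(model(x) == y for x, y in test) / len(test)
    print(f"label noise {rate:.0%}: clean-test accuracy {accs[rate]:.2f}")
```

Because a memorizing model trusts whichever training example is nearest, roughly every flipped label translates directly into wrong predictions in its neighbourhood. Models with stronger inductive biases absorb random noise better, which is exactly why systematic noise, which no amount of averaging cancels, is the more damaging kind.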

This is one reason why models trained on carefully curated benchmark datasets often outperform those trained on massive web scrapes on the same benchmark tasks. The curation process, which looks like it’s just removing data, is actually increasing the signal-to-noise ratio in a way that helps the model find real patterns.

[Diagram: broad, diffuse training signal vs. narrow, focused training signal, and their accuracy on a target]
Broad training data doesn't sharpen a model's aim. For specific tasks, it often does the opposite.

Distribution Shift Is a Quiet Killer

Here’s a scenario that plays out repeatedly in production AI systems. A team collects a large dataset to train a model. The dataset spans several years of historical records. The model trains well, performs well in testing, and ships. Then performance degrades in production.

What happened? The world changed, but the training data represents an average across conditions that no longer exist. A model trained on customer service emails from 2018 through 2024 will have learned patterns from when people used different vocabulary, referenced different products, and had different expectations. The 2018 data didn’t help the model understand 2024 users. It diluted the signal from more recent, relevant data.

This is distribution shift, and more data makes it worse when that data comes from periods or contexts that don’t match the deployment environment. Smaller, temporally focused datasets often produce models that perform better in production precisely because they don’t carry the dead weight of outdated patterns.
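A toy version of the customer-service scenario makes the arithmetic concrete. In the sketch below (synthetic data, a hypothetical drift of one unit per year in the feature distribution, a midpoint-threshold rule standing in for a real model), a model fit on all of 2018-2024 learns a boundary calibrated to the average of conditions that no longer exist, while a model fit only on recent years matches deployment.

```python
import random

def yearly_batch(rng, year, n=500):
    """Synthetic data whose feature distribution drifts upward by 1.0/year."""
    drift = float(year - 2018)
    out = []
    for _ in range(n):
        y = int(rng.random() < 0.5)
        out.append((rng.gauss(drift + (1.0 if y else -1.0), 0.8), y))
    return out

def midpoint_threshold(train):
    """Classify as positive above the midpoint of the two class means."""
    pos = [x for x, y in train if y]
    neg = [x for x, y in train if not y]
    t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: int(x > t)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

rng = random.Random(2)
history = {yr: yearly_batch(rng, yr) for yr in range(2018, 2025)}
deploy = yearly_batch(rng, 2024)  # what production traffic looks like now

all_years = midpoint_threshold([ex for b in history.values() for ex in b])
recent = midpoint_threshold(history[2023] + history[2024])

print(f"trained on 2018-2024: {accuracy(all_years, deploy):.2f}")
print(f"trained on 2023-2024: {accuracy(recent, deploy):.2f}")
```

The all-years model isn’t undertrained; it has seven times the data. Its threshold is simply an average over distributions that production traffic no longer resembles.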

When the Task Itself Punishes Breadth

Some tasks require models to be general. A model that helps with everything from coding to poetry needs exposure to a wide range of human knowledge. But most deployed AI models don’t do everything. They do specific things: classify documents, generate product descriptions, answer customer questions about a particular software product.

For narrow tasks, broad training is often actively harmful. The model allocates representational capacity to patterns that are irrelevant to the task. It learns to hedge and generalize when the deployment context rewards precision and specificity. The computational budget used to process irrelevant examples is budget that didn’t go toward reinforcing the patterns that matter.

This is why fine-tuning (taking a general model and training it further on domain-specific data) is such a powerful technique. It lets you start with broad knowledge and then sharpen the model on what actually matters for the task at hand. But it also explains why starting with a massive general dataset isn’t always the right move if you know your deployment context in advance.

The Benchmark Illusion Compounds the Problem

There’s a measurement problem layered on top of all of this. AI models are typically evaluated on benchmark datasets, and those benchmarks have a tendency to become contaminated over time. When a benchmark is widely used, its examples eventually appear in training data scraped from the internet. A model that has seen benchmark questions during training will appear to perform better than it actually does in the wild.

Larger training datasets, particularly those built by scraping the open web, are more likely to contain benchmark contamination. So you end up with a counterintuitive situation: the model trained on more data looks better on the benchmark but performs worse on genuinely novel examples that weren’t in its training data. The model trained on the smaller dataset, which looks worse on paper, may actually generalize better because its cleaner training data didn’t include accidental benchmark exposure.

This is a difficult problem to solve because it requires knowing what’s in your training data at a granular level, which is genuinely hard at scale.
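At small scale, though, a first-pass check is straightforward. The sketch below is a simplified version of the n-gram overlap checks used in practice: it hashes every n-word window of the training corpus, then flags any benchmark item that shares a window. The corpora and the window size of 8 are illustrative; real decontamination pipelines run over billions of documents and use fuzzier matching.

```python
import hashlib

def ngram_fingerprints(text, n=8):
    """Hash every n-word window so overlap checks don't store raw text."""
    words = text.lower().split()
    return {hashlib.sha1(" ".join(words[i:i + n]).encode()).hexdigest()
            for i in range(max(0, len(words) - n + 1))}

def contamination_rate(benchmark_items, train_fingerprints, n=8):
    """Fraction of benchmark items sharing any n-gram with training data."""
    hits = sum(bool(ngram_fingerprints(item, n) & train_fingerprints)
               for item in benchmark_items)
    return hits / len(benchmark_items)

# Toy corpora standing in for a web-scale training set.
train_docs = [
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "financial fraud detection depends on clean consistent labels",
]
train_fp = set().union(*(ngram_fingerprints(d) for d in train_docs))

benchmark = [
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "a completely novel question the model has never seen in any form",
]
print(f"contaminated: {contamination_rate(benchmark, train_fp):.0%}")
```

Hashing the windows rather than storing them keeps the index compact, but the hard part at scale is exactly what the article says: you have to run this against everything you trained on, and most web-scale datasets were never indexed with that question in mind.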

Why the Industry Keeps Making This Mistake Anyway

If the problems with indiscriminate data scaling are known, why does the “more data is better” assumption persist? A few reasons.

First, the relationship between data and performance is real and well-documented at a coarse level. Scaling laws, documented extensively in research from OpenAI and others, show that larger datasets and larger models do tend to produce better general performance on average across many tasks. The problem is that “on average across many tasks” is not the same as “for your specific task in your specific deployment context.” The aggregate finding gets misapplied to specific situations where it doesn’t hold.
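It helps to look at what a scaling law actually says. The compute-optimal scaling analysis in Hoffmann et al. (2022, the “Chinchilla” paper), for example, fits average pretraining loss as

L(N, D) = E + A / N^α + B / D^β

where N is the parameter count, D is the number of training tokens, and E, A, B, α, β are fitted constants. Note what the formula measures: aggregate loss, averaged over the entire training distribution. Nothing in it distinguishes tokens that are relevant to your task from tokens that aren’t, which is precisely the gap between the aggregate finding and any specific deployment.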

Second, more data is a story that’s easy to tell to stakeholders. Data collection and storage are costly but comprehensible. Data curation and quality control are valuable but invisible. Nobody puts “we carefully removed 80% of our training data” in a press release, even when that decision was responsible for the model’s success.

Third, the companies with the most data have strong incentives to promote data quantity as the key variable in model performance. It’s a moat narrative that happens to be partially true, which makes it more effective and more dangerous than a pure fiction.

What This Means

If you’re building or evaluating AI systems, the practical implications are fairly clear.

Data quality is a genuine technical decision, not just a hygiene concern. Investing in labeling quality, removing examples that don’t match your deployment context, and filtering for temporal relevance will often outperform adding more raw data. The examples you remove matter as much as the ones you keep.
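The removal side of that decision can be mundane code rather than a grand research effort. The sketch below is a hypothetical curation pass over a toy record schema (a `text` field, a `year` field, and an invented annotator-agreement table); the thresholds and field names are illustrative, and real pipelines would use fuzzier deduplication than exact string matching.

```python
def curate(records, min_year, label_agreement, min_agreement=0.8):
    """Filter a labeled dataset on recency, duplication, and label quality."""
    seen, kept = set(), []
    for rec in records:
        key = rec["text"].strip().lower()
        if rec["year"] < min_year:
            continue  # temporal relevance: drop stale examples
        if key in seen:
            continue  # exact-match dedup (real pipelines match fuzzier)
        if label_agreement.get(key, 1.0) < min_agreement:
            continue  # annotators disagreed: the label is likely noise
        seen.add(key)
        kept.append(rec)
    return kept

records = [
    {"text": "Refund request", "year": 2024},
    {"text": "refund request ", "year": 2024},  # duplicate after normalizing
    {"text": "Legacy product question", "year": 2019},
    {"text": "Billing dispute", "year": 2024},
]
agreement = {"refund request": 0.95, "billing dispute": 0.9,
             "legacy product question": 0.85}
kept = curate(records, min_year=2022, label_agreement=agreement)
print([r["text"] for r in kept])  # → ['Refund request', 'Billing dispute']
```

Half the records are gone, and the dataset is better for it. Each filter encodes a judgment about the deployment context, which is why curation is a technical decision and not a hygiene chore.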

Benchmark performance is not deployment performance. Any evaluation that relies solely on standard benchmarks should be treated with skepticism, especially for large models trained on web-scale data. Held-out evaluation sets that are carefully isolated from training data give a clearer picture.

Fine-tuning general models on narrow, high-quality datasets is often a better strategy than training specialized models from scratch on large but noisy domain data. You get the benefits of broad knowledge without the costs of a broad training signal.

And finally: the scaling laws that dominate AI research describe averages across many tasks and many models. They are useful research tools. They are not a reliable guide to whether your specific model will benefit from your next data collection effort. The answer to that question requires knowing what’s in your data, how it was labeled, and how closely it matches what your users will actually ask the model to do.