In January 2020, Google published research in Nature showing its AI could detect breast cancer in mammograms better than radiologists. The headline numbers were striking. The model had trained on tens of thousands of de-identified mammograms from two countries and outperformed six radiologists on key metrics. The AI press celebrated. Radiology departments took notice.

Then came the harder question: why did several narrower, domain-specific models trained on far fewer images keep matching or outperforming Google’s system once researchers tested outside the original conditions?

The answer cuts to something the AI industry still hasn’t fully reckoned with. More training data is not a reliable proxy for better performance. Sometimes it actively makes things worse.

The Setup

Google’s mammography model was trained on images from two healthcare systems, one in the UK and one in the US. The diversity felt like a strength. More variation, more generalization, better outcomes. The intuition makes sense if you think of AI training the way you think of human experience: a doctor who has seen patients from many backgrounds should be more adaptable than one who has only ever worked in a single clinic.

But medical imaging doesn’t work quite like that. Mammography machines differ by manufacturer. Imaging protocols differ by institution. Radiologists annotate findings differently depending on their training and the standards of their hospital. When you aggregate data across these differences without accounting for them, you don’t create a more capable model. You create a model that has learned to average across noise.

Smaller research teams, working with curated datasets from single institutions or tightly controlled multi-site studies with standardized protocols, repeatedly produced models that performed comparably within their specific deployment context. Their training sets were sometimes a fraction of the size. What they lacked in volume they made up for in consistency.

[Image: abstract visualization of signal versus noise in a dataset]
More data points don't produce cleaner patterns if the additional data introduces inconsistency. The noise competes with the signal.

What Happened

When independent researchers applied Google’s model to external datasets, performance degraded. This pattern is documented across medical imaging AI broadly, not just in Google’s case. A 2021 analysis published in Nature Machine Intelligence examined hundreds of candidate AI studies on COVID-19 detection and found that the models it reviewed in depth, including several trained on large multi-hospital datasets, consistently failed to generalize reliably to new patient populations. The culprit was almost always the same: training data that looked large and diverse but carried hidden inconsistencies in how images were acquired and labeled.

This phenomenon has a name in machine learning: dataset shift. The model learns the statistical patterns in its training data, which include the noise and the idiosyncrasies specific to how that data was collected. Add more data from more sources with more collection quirks and you give the model more noise to absorb, not more signal to learn from.
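Dataset shift is easy to see in a toy simulation. In the sketch below (NumPy only; the "sites," intensity offsets, and one-feature threshold classifier are invented purely for illustration), two sites' scanners apply different fixed offsets to the same underlying signal. A threshold fit on the pooled data splits the difference between the sites and lands badly for either one, while a threshold fit on site B's own data holds up on fresh site B cases:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_site(n, offset):
    # Each site adds a fixed intensity offset: a stand-in for scanner
    # and protocol differences between institutions.
    healthy = rng.normal(0.0 + offset, 1.0, n)
    disease = rng.normal(2.0 + offset, 1.0, n)
    X = np.concatenate([healthy, disease])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X, y

# Pool training data from two sites with very different offsets.
Xa, ya = make_site(500, 0.0)
Xb, yb = make_site(500, 3.0)
X_pool = np.concatenate([Xa, Xb])
y_pool = np.concatenate([ya, yb])

# The simplest possible learned classifier: a threshold midway
# between the class means of whatever data it was trained on.
def fit_threshold(X, y):
    return (X[y == 0].mean() + X[y == 1].mean()) / 2

def accuracy(X, y, thr):
    return float(((X > thr) == y).mean())

thr_pooled = fit_threshold(X_pool, y_pool)  # averages across the site quirk
thr_site = fit_threshold(Xb, yb)            # fitted to site B alone

# Evaluate both on fresh data from site B.
Xb_test, yb_test = make_site(500, 3.0)
acc_pooled = accuracy(Xb_test, yb_test, thr_pooled)
acc_site = accuracy(Xb_test, yb_test, thr_site)
```

The pooled model is not wrong about its training data. It has faithfully learned a boundary that averages over an acquisition artifact, which is exactly the failure mode described above.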

Google’s team understood this. Their published research was careful and their caveats were honest. The problem was what happened next. Organizations looking to build or buy medical AI systems compared models on aggregate benchmarks, and aggregate benchmarks reward impressive headline numbers without revealing whether a model will hold up in a specific radiology department using specific equipment.

Smaller, focused models that scored modestly on the big benchmark often outperformed larger ones once deployed into a single-institution setting because they had been trained on data that more closely resembled what they would actually see.

Why This Matters Beyond Medicine

You might think this is a problem specific to healthcare, where data collection is messy and labeling is expert-dependent. It isn’t.

OpenAI has written publicly about how adding certain categories of low-quality web data during GPT training degraded performance on specific reasoning tasks, even when the overall dataset got larger. The fix was not to collect more data but to filter more aggressively. The scale of language model training has made this filtering work into one of the most consequential engineering decisions in the field, but it is invisible to most of the people evaluating these models.
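To make the filtering idea concrete, here is a toy illustration of the kind of cheap heuristic screens used on web text before training. Every threshold and rule here is invented for illustration; production pipelines are far more elaborate, but the shape of the work, rejecting documents rather than collecting more, is the point:

```python
def looks_low_quality(doc: str) -> bool:
    """Crude quality heuristics for a text document.
    All thresholds are illustrative, not anyone's production values."""
    words = doc.split()
    if len(words) < 20:  # too short to carry much signal
        return True
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    if alpha_ratio < 0.6:  # mostly symbols, markup, or boilerplate
        return True
    if len(set(words)) / len(words) < 0.3:  # heavy repetition, likely spam
        return True
    return False

corpus = [
    "buy now " * 20,  # repetitive spam
    "Gradient descent updates parameters in the direction that locally "
    "reduces the loss, scaled by a learning rate chosen small enough to "
    "keep the updates stable over many iterations.",
]
kept = [doc for doc in corpus if not looks_low_quality(doc)]
```

Dropping the spam document shrinks the dataset, and that is the trade the filtering argument says is worth making.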

Hugging Face researchers studying smaller fine-tuned models have observed repeatedly that a 7-billion parameter model fine-tuned on a clean, task-specific dataset will routinely outperform a much larger general model on the target task. The larger model has seen more of the world. The smaller model understands your question.

This matters practically if you are making decisions about AI tools for your organization. The instinct is to assume that the model trained by the company with the most resources, the most data, and the biggest parameter count is the safest bet. That instinct is wrong often enough to be worth interrogating.

What You Can Take From This

There are three things worth internalizing here.

First, benchmark performance and deployment performance are different things. When evaluating an AI tool for a specific use case, look for evidence that the model has been tested on data that resembles your context, not just data that resembles the general population. A coding assistant trained heavily on open-source Python will behave differently from one trained on a broader corpus that includes a lot of PHP forum posts from 2009.

Second, if you are building with AI rather than just using it, take dataset curation seriously before you take dataset expansion seriously. The temptation to reach for more training data when a model underperforms is understandable but often counterproductive. Ask first whether the data you already have is clean, consistently labeled, and genuinely representative of the task.
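One cheap curation pass worth running before any expansion is a label-consistency audit. The sketch below (the `audit_labels` helper and the spam-filter examples are hypothetical, written only to show the pattern) flags inputs that appear more than once with conflicting labels, a common symptom of annotators working to different standards:

```python
from collections import defaultdict

def audit_labels(examples):
    """Flag inputs that appear more than once with conflicting labels.
    `examples` is a list of (text, label) pairs."""
    labels_by_input = defaultdict(set)
    for text, label in examples:
        # Normalize lightly so trivial variants collide.
        labels_by_input[text.strip().lower()].add(label)
    return {t: sorted(ls) for t, ls in labels_by_input.items() if len(ls) > 1}

data = [
    ("invoice overdue, please remit payment", "spam"),
    ("Invoice overdue, please remit payment", "not_spam"),  # conflict
    ("quarterly report attached", "not_spam"),
]
conflicts = audit_labels(data)
```

A model trained on the conflicting pair above can only learn to hedge between the two labels; no amount of additional data fixes that.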

Third, smaller and more focused is sometimes the right architecture choice, not a concession. The AI industry has a visibility problem: the models that get attention are the ones with the biggest numbers attached to them. The models that get deployed quietly and work reliably in narrow domains don’t generate the same press. That asymmetry distorts how practitioners think about what good looks like.

Google’s mammography research was not a failure. It was careful science that produced genuine insight. The lesson is in what happened around it: how organizations interpreted the headline results, how the benchmark became a substitute for deployment testing, and how the assumption that more data means better models kept leading buyers and builders toward the wrong conclusions.

The next time someone pitches you an AI solution by leading with the size of their training set, that’s your cue to ask the follow-up question: consistent with what?