When you ask a large language model to summarize something, you are not getting a compressed version of the source. You are getting a statistically plausible description of what a summary of that kind of document tends to look like. That distinction sounds academic until it causes a real problem.

This matters because millions of people now use LLM summarization for consequential tasks: reviewing contracts, distilling research papers, catching up on long email threads. If your mental model of what the model is doing is wrong, your trust calibration is wrong too.

What the model actually sees

Transformer-based language models process text as sequences of tokens, with attention mechanisms that let each token relate to others in the context window. When you paste in a 5,000-word document and ask for a summary, the model doesn’t read it the way you would, top to bottom, building a mental model of the argument. It processes the token relationships simultaneously and produces output token by token, each one selected based on probability distributions conditioned on everything that came before.
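That token-by-token decoding loop can be sketched with a toy stand-in model. Everything below (the bigram table, the vocabulary, the function names) is invented for illustration; only the sample-next-token loop mirrors what a real transformer does at decode time.

```python
import random

# Toy autoregressive decoding: each output token is sampled from a
# probability distribution conditioned on what came before. The bigram
# table is a hypothetical stand-in for a real model's learned weights.
BIGRAMS = {
    "<s>":     {"revenue": 0.6, "the": 0.4},
    "revenue": {"grew": 0.7, "fell": 0.3},
    "grew":    {"strongly": 0.5, "modestly": 0.5},
    "the":     {"quarter": 1.0},
}

def generate(max_tokens=4, seed=0):
    random.seed(seed)
    tokens = ["<s>"]
    for _ in range(max_tokens):
        dist = BIGRAMS.get(tokens[-1])
        if dist is None:  # no continuation learned for this token
            break
        # Sample the next token from the conditional distribution.
        words, probs = zip(*dist.items())
        tokens.append(random.choices(words, weights=probs)[0])
    return tokens[1:]

print(generate())
```

The point of the sketch: nothing in the loop checks the output against the source document. Plausibility under the learned distribution is the only criterion.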

This works remarkably well for common document types. If you feed the model a standard earnings report, it has seen thousands of earnings report summaries in training data and can produce something that looks correct. The structure is familiar. The vocabulary is familiar. The model is, in a meaningful sense, pattern-matching against a learned distribution of “what summaries of this kind of thing look like.”

The problem is that this process and genuine comprehension produce the same surface output until they don’t.

[Figure: split illustration showing a source document with a critical detail highlighted versus a clean summary that omits it. Caption: The model’s output reads the same whether it got it right or missed the one thing that mattered.]

Why key details get lost

Summarization requires judgment about salience: what matters, what doesn’t, what the reader needs to walk away knowing. That judgment depends on understanding the document’s purpose, the reader’s context, and the stakes of being wrong.

LLMs approximate this by weighting tokens that co-occur with importance signals in training data. A sentence with “critical finding” or “primary risk” near it is more likely to survive into the summary. A single anomalous data point buried in paragraph eight, expressed in ordinary language, often won’t make it, even if that point is the most important thing in the document.

Researchers studying this have found consistent patterns: models tend to over-represent information from the beginning and end of documents (a positional bias that mirrors human reading tendencies in training data), and they systematically under-represent information that contradicts the document’s dominant framing. A legal contract with a standard structure and one unusual clause is likely to produce a summary that covers the standard parts well and soft-pedals or omits the unusual one.
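The two biases just described can be made concrete with a toy extractive scorer. The cue phrases, weights, and example sentences below are all invented for illustration; real models weight salience implicitly across billions of parameters, not with an explicit formula like this.

```python
# Toy salience scorer combining two biases described above:
# (1) proximity to "importance signal" phrases, and (2) a positional
# weight favoring the beginning and end of the document.
IMPORTANCE_CUES = ("critical finding", "primary risk", "key takeaway")

def score_sentence(sentence, index, total):
    s = sentence.lower()
    cue_score = 2.0 if any(cue in s for cue in IMPORTANCE_CUES) else 0.0
    # U-shaped positional weight: highest at the ends, lowest mid-document.
    rel = index / max(total - 1, 1)
    position_score = 1.0 - 4 * rel * (1 - rel) * 0.8
    return cue_score + position_score

def toy_summary(sentences, k=2):
    total = len(sentences)
    ranked = sorted(
        range(total),
        key=lambda i: score_sentence(sentences[i], i, total),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]  # keep source order

doc = [
    "The contract follows the standard master services template.",
    "Payment terms are net 30, as is typical.",
    "Either party may assign the agreement without consent.",  # the anomaly
    "Critical finding: the term renews annually.",
    "Governing law is the state of incorporation.",
]
print(toy_summary(doc))
```

Run this and the unusual assignment clause is dropped: it sits mid-document and is phrased in ordinary language, so it scores below the boilerplate at the ends and the sentence flagged with a cue phrase, exactly the failure pattern the research describes.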

This isn’t a bug that will be patched out. It’s a consequence of how the models learn what “important” means.

Confident language, uncertain content

The output arrives in the same confident, clean prose regardless of whether the model got it right or wrong. There is no hedging that scales with accuracy. A summary of a document the model understood well and a summary of a document it partly hallucinated look identical from the outside.

This is the deeper problem. We’ve built good intuitions for human communication, where someone who sounds uncertain probably is, and someone who speaks with authority usually has some basis for it. LLM output violates that mapping completely. The fluency is a function of the training objective, not a signal of correctness. As we’ve covered in the context of reasoning tasks, the model’s confidence and its accuracy are essentially orthogonal.

The counterargument

The obvious pushback is: so what? LLM summaries are still useful. They save time. They get the broad strokes right most of the time. Expecting perfect comprehension from any summarization tool, human or machine, is unrealistic.

This is fair. I use LLM summarization regularly and find it valuable. A rough summary of a 50-page report that gets me 80% of the way there in two minutes is genuinely useful, even if I need to verify the critical parts myself.

But the problem isn’t that the tool is imperfect. The problem is that its failure modes are invisible and counterintuitive. A human summarizer who misses something important tends to produce a summary that feels thin or evasive in that area. An LLM will produce fluent, detailed prose about what it got wrong. Users aren’t calibrating their verification effort against actual risk because they have no signal that risk exists.

Read the output like a draft, not a digest

The practical implication is simple: treat LLM summaries as a first draft from a very fast, very confident junior analyst who hasn’t actually read the document carefully. Use the summary to orient yourself, identify what to read closely, and form hypotheses about the content. Then verify anything that matters.

Compression is genuinely hard, and the fact that LLMs can do a passable version of it across almost any document type is impressive. But passable and reliable are different standards, and for tasks where getting the nuance wrong has real consequences, you need to know which one you’re working with.

You are not getting a compressed version of the truth. You are getting a fluent approximation of what such a compression typically looks like. For many tasks, that’s enough. For the ones where it isn’t, you need to know the difference before you find out the hard way.