Why Your LLM Gets Dumber With More Context

The Simple Version

When you give a large language model more text to work with, it doesn’t always pay equal attention to all of it. Critical instructions buried in the middle of a long prompt frequently get ignored, and the model’s accuracy on tasks can drop measurably as context length grows.

What a Context Window Actually Is

Before getting into why things go wrong, here is a quick grounding in what a context window does. Every LLM has a context window: the total amount of text it can “see” at once during a single interaction. This includes your system prompt, any documents you’ve pasted in, the conversation history, and your current question. Early models had windows of a few thousand tokens (a thousand tokens is roughly 750 English words). Recent versions of models like GPT-4 and Claude offer windows of a hundred thousand tokens or more.
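To make the token arithmetic concrete, here is a minimal sketch that applies the 750-words-per-thousand-tokens rule of thumb to check whether a set of prompt parts fits in a window. The function names are made up for illustration, and a whitespace word count is only an approximation: real tokenizers vary by model and by language.

```python
# Rule of thumb from the text: ~750 English words per 1,000 tokens.
# This is an approximation, not a real tokenizer.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on a whitespace word count."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


def fits_in_window(parts: list[str], window_tokens: int) -> bool:
    """Check whether all prompt parts together stay under a token budget."""
    return sum(estimate_tokens(p) for p in parts) <= window_tokens
```

Everything counts against the same budget: system prompt, pasted documents, history, and the question itself.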

This feels like pure upside. You can drop in an entire codebase, a legal document, a research paper, and ask questions about all of it. The problem is that a large context window and perfect attention across that window are two different things.

Figure: Two funnels contrasting a noisy, over-filled context with a focused, minimal context that leads to a cleaner output. More input doesn’t mean more signal. Often it means less.

The Lost-in-the-Middle Problem

In 2023, researchers at Stanford, UC Berkeley, and Samaya AI published a paper, “Lost in the Middle: How Language Models Use Long Contexts,” studying how well LLMs actually use information depending on where it appears in a long context. The finding was striking: models performed best when the relevant information appeared at the very beginning or very end of the context. When the same information was placed in the middle, accuracy dropped significantly, in some cases by 20 percentage points or more on retrieval tasks.

They called this the “lost-in-the-middle” problem, and it maps onto something intuitive. Think about how you’d read a hundred-page document versus a ten-page one. You’d probably nail the opening argument and remember the conclusion, but the details from page 47 would blur. LLMs show a structurally similar pattern, though for different architectural reasons.
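If you want to check for this effect in your own setup, a position sweep is one way to do it: ask the same question repeatedly while moving the key fact through the context. The sketch below is an illustration, not the paper’s evaluation code. `ask_model` is a hypothetical callable you would wire to whatever model you use, and the substring match is a deliberately crude correctness check.

```python
from typing import Callable, Dict, List


def build_context(fact: str, fillers: List[str], position: int) -> str:
    """Insert the key fact among filler passages at the given index."""
    passages = fillers[:position] + [fact] + fillers[position:]
    return "\n\n".join(passages)


def position_sweep(
    ask_model: Callable[[str], str],  # hypothetical: prompt in, answer text out
    fact: str,
    question: str,
    expected: str,
    fillers: List[str],
) -> Dict[int, bool]:
    """Ask the same question with the fact at each position; record correctness."""
    results = {}
    for pos in range(len(fillers) + 1):
        prompt = build_context(fact, fillers, pos) + "\n\nQuestion: " + question
        answer = ask_model(prompt)
        # Crude check: does the expected string appear in the answer?
        results[pos] = expected.lower() in answer.lower()
    return results
```

Run it over many facts and filler sets and average the results per position. A dip in the middle positions would be the pattern the paper describes.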

The mechanism involves how transformer attention works. Each token in a sequence attends to other tokens, but that attention isn’t uniformly distributed. Long-range dependencies are genuinely harder to maintain. The model isn’t “reading” your document the way you do, sequentially, with a mental bookmark. It’s processing relationships across the entire sequence simultaneously, and that process has known weaknesses when sequences get very long.
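One piece of this is easy to see with a toy calculation. Attention weights come from a softmax, so they always sum to one, which means every extra token competes for a share. The sketch below assumes a single relevant token with a fixed score and identical scores for everything else. Those numbers are invented, and this does not reproduce the positional effect itself: real models have many heads and layers with learned scores. It only shows how a fixed advantage gets diluted as the sequence grows.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weight_on_relevant(n_tokens, relevant_score=2.0, other_score=0.0):
    """Attention weight landing on one relevant token among n_tokens.

    The scores are illustrative assumptions, not values from any real model.
    """
    scores = [relevant_score] + [other_score] * (n_tokens - 1)
    return softmax(scores)[0]


# Print how the relevant token's share shrinks as the sequence gets longer.
for n in (10, 100, 1000, 10000):
    print(n, round(weight_on_relevant(n), 4))
```

In this toy setting the relevant token’s share falls steadily as more tokens are added, even though its own score never changes.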

Why More Context Can Mean More Confusion

There’s a second effect beyond positional bias, and it’s about signal-to-noise ratio. When you stuff a context window with everything potentially relevant, you’re also stuffing it with everything potentially irrelevant. The model has to figure out what matters.

This matters most for complex, multi-step tasks. Say you’re asking a model to analyze a contract and flag unusual clauses. If you paste in the contract alone, the model’s attention is focused. If you also paste in three previous contracts for comparison, plus your notes from a prior review, plus a general summary of standard terms, you’ve added useful context but also added noise. The model may latch onto a clause from the comparison document when answering about the current one, or weight your notes more heavily than the actual contract language.

This isn’t a flaw in any particular model. It’s a fundamental property of how these systems balance competing signals. As the article on what LLMs actually do with your prompt explains, the model is always doing something like a weighted averaging over everything it’s seen, and that averaging process doesn’t come with a “this part is ground truth” flag.

The Compression Paradox

Here’s the counterintuitive takeaway: a shorter, well-structured prompt often outperforms a longer, information-dense one on the same task.

This doesn’t mean you should never use long contexts. For tasks like “summarize this entire document” or “find any mention of liability across these five contracts,” you genuinely need the full text present. But for tasks where you want precise, accurate reasoning, ruthless trimming frequently helps. Give the model what it needs, not everything you have.

A few practical implications follow from this:

Put your most important instructions first or last. If you have a critical constraint (“never recommend a specific product”), don’t bury it after three paragraphs of background. The beginning and end of your context are the highest-attention real estate.

Be skeptical of RAG pipelines that retrieve too much. Retrieval-augmented generation systems (which fetch relevant chunks from a database to include in the prompt) can over-retrieve. Pulling in ten document chunks when three would suffice doesn’t improve accuracy. It often degrades it.

Test with shorter contexts first. If you’re building an application on top of an LLM, start with minimal context and add material only when you can measure that it improves outputs. The instinct to include more is usually wrong.
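The three suggestions above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a recommended pipeline: the function names, the “Instructions”/“Reminder” labels, the default `top_k` of 3, and the `evaluate` callable (which stands in for your own accuracy measurement) are all hypothetical.

```python
from typing import Callable, List, Tuple


def select_chunks(scored_chunks: List[Tuple[float, str]], top_k: int = 3,
                  min_score: float = 0.0) -> List[str]:
    """Keep only the top_k highest-scoring chunks at or above min_score."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    return [text for score, text in ranked[:top_k] if score >= min_score]


def assemble_prompt(constraint: str, chunks: List[str], question: str) -> str:
    """Place the critical constraint first and restate it last."""
    parts = [
        "Instructions: " + constraint,
        *chunks,
        "Question: " + question,
        "Reminder: " + constraint,
    ]
    return "\n\n".join(parts)


def smallest_useful_k(evaluate: Callable[[int], float], candidates: List[int],
                      min_gain: float = 0.01) -> int:
    """Grow context only while measured accuracy improves by at least min_gain.

    evaluate(k) is a hypothetical function returning accuracy with k chunks.
    """
    best_k = candidates[0]
    best_score = evaluate(best_k)
    for k in candidates[1:]:
        score = evaluate(k)
        if score - best_score < min_gain:
            break
        best_k, best_score = k, score
    return best_k
```

The design choice worth noting is the direction of the search: it starts from the smallest context and stops at the first step that fails to pay for itself, rather than starting large and trimming.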

This Is Getting Better, But It’s Not Solved

Model developers are actively working on this. Anthropic has published research on improving long-context performance, and there’s a whole subfield of work on “context compression,” where earlier material in the context is summarized and prioritized before the model does the actual task. There’s also interesting work on architectures that move away from standard attention for long sequences entirely.

The connection to a broader principle in machine learning is worth drawing: constraints often improve performance. Shrinking a neural network can sometimes improve how well it generalizes, because forced compression discards noise and requires the model to prioritize what matters. The same logic applies to your prompts. A model working with a tightly scoped, carefully chosen context has fewer ways to go wrong.

For now, the practical stance is: treat context like expensive real estate. Every sentence you add is competing with every other sentence for the model’s attention. Given how these systems actually work, more isn’t neutral. More is a choice with real costs.