The Setup

In 2023, a mid-sized legal technology company was building a contract review tool on top of GPT-4. The product was straightforward in concept: upload a contract, ask the model questions about risk clauses, get structured output back. Their early demos were impressive enough to win pilot customers. Their production results were not.

The team had written careful, detailed prompts. Instructions for tone, output format, reasoning style, edge case handling. One senior engineer described them as “the most thorough specifications I’ve ever written for any system.” The prompts ran to several hundred tokens before the contract text even started.

The model kept ignoring parts of the instructions. Not randomly. Predictably. The same instructions failed the same way, across different contracts, different users, different days. The output format would drift. Reasoning that should appear in one section appeared in another, or not at all. Instructions near the end of the system prompt had a disturbingly low hit rate.

They spent two months assuming the problem was prompt quality. They rewrote. They refined. They brought in a consultant who specialized in prompt engineering. The outputs improved marginally. The core failure pattern persisted.

What Actually Happened

The problem wasn’t the prompts. It was their mental model of what the model was doing with those prompts.

Here is what they eventually understood: a large language model does not read a prompt the way you read a document. You read sequentially and maintain equal attention across the text (roughly). Transformers do not. The attention mechanism, which determines how much each token influences each other token during inference, distributes unevenly. And critically, position matters.

Researchers have documented this pattern in what’s sometimes called the “lost in the middle” problem. A study from UC Berkeley and other institutions published in 2023 found that language models perform significantly worse at retrieving information placed in the middle of long contexts compared to information at the beginning or end. The effect is real, measurable, and consistent across model families.

The legal tech team had been placing their most critical formatting and reasoning instructions in the middle of a long system prompt, then appending the contract text at the end. The model was, in effect, seeing a degraded version of their carefully constructed instructions every single time.

But position was only part of it. There was a second problem they hadn’t accounted for: the model was also receiving content they hadn’t written.

Like most teams building on a foundation model API, they were using a managed service that added its own system-level instructions before their prompt. Safety guidelines. Content policies. Behavioral defaults set by the provider. These were invisible to the engineering team, but they occupied tokens and, more importantly, they occupied early positions in the context window where attention is strongest. The model was reading those instructions first, with highest fidelity, and reading the team’s custom instructions afterward.

Their prompt wasn’t the beginning of the conversation. It was somewhere in the middle of one they couldn’t see.

Diagram showing attention weight distribution across a context window, with bright edges and a dim middle section
The 'lost in the middle' effect: instructions placed at the edges of a context window receive more reliable attention than those buried in the center.

Why This Matters Beyond One Company

This is not an unusual story. It’s a representative one.

Most engineers building on top of LLM APIs work with a fundamentally incorrect mental model of what the model receives. They think of a prompt as an instruction document that the model reads and follows. The accurate model is closer to: a probability distribution over tokens is shaped by all the text in context, weighted non-uniformly by position, content, and architectural details that the API surface doesn’t expose.

That gap in mental models produces real failures. Teams optimize the wrong things. They rewrite prompts for clarity when the problem is position. They add more instructions when the problem is that existing instructions are being diluted by system-level context they can’t see. They test in playgrounds where the context window is clean and short, then deploy to production where it’s long and cluttered, and wonder why results diverge. (Smarter AI models hallucinate more confidently for related reasons: more capable models don’t necessarily become more transparent about their failure modes.)

The invisible prefix problem is particularly sharp for enterprise deployments. Azure OpenAI, AWS Bedrock, and similar managed services routinely inject content before user prompts. The exact content varies by configuration and isn’t always documented in detail. If you’re building a product where output consistency matters, you’re likely operating with incomplete information about your own system’s inputs.

There’s also the tokenization layer, which most people think about even less. Your prompt text gets converted to tokens before the model touches it. Tokenization is not character-level. It’s not even always word-level. “ChatGPT” might be one token or several depending on context and the specific tokenizer. Unusual formatting, non-English text, code, and structured data like JSON or XML all tokenize in ways that can compress or fragment your instructions in unexpected ways. You write in English. The model processes tokens. These are not the same thing.

What the Team Actually Fixed

Once the legal tech team had an accurate model of what was happening, the fixes were less glamorous than their months of prompt rewriting but considerably more effective.

First, they moved critical instructions to the front and back of their prompt, keeping the middle for content that could tolerate lower attention fidelity (explanatory context, examples). This alone improved format compliance measurably in their evals.

Second, they built a logging layer that captured the full serialized prompt as sent to the API, including anything prepended by the service layer, before it was transmitted. They started reviewing these logs as part of their debugging workflow. The invisible became visible.

Third, they shortened prompts aggressively. Not because brevity is a virtue for its own sake, but because every token you add to a long context competes for attention with every other token. Prompt engineering is just structured thinking applied: say the important things once, clearly, in the right position, rather than saying many things at length and hoping the model tracks all of them.

Fourth, they separated concerns. Instead of one large prompt that tried to handle formatting, reasoning, tone, and edge cases simultaneously, they broke tasks into smaller, focused calls with shorter contexts. More API calls, lower latency variance, more predictable outputs.

What to Take From This

The lesson isn’t that LLM APIs are broken or that building reliable products on them is impossible. It’s that the abstraction is leakier than it appears, and treating it as a black box you can instruct through natural language will eventually cost you.

Four things worth internalizing:

Position is semantics. Where you place an instruction in a prompt affects how the model weights it. Beginning and end positions receive more reliable attention. Treat prompt structure as a design decision, not just an organizational preference.

Your context window has tenants you didn’t invite. When using managed APIs, assume there is system-level content you can’t see. Log the full payload. Know what the model actually receives.

Your prompt is not the text you wrote. It’s a sequence of tokens, some of which may represent your words in ways you don’t expect. For high-stakes applications, inspect token counts and tokenization outputs, especially for structured formats.

Eval environments lie. Clean playground testing doesn’t replicate production context length, injected system content, or the distribution of real user inputs. Build evals that match production conditions before you trust production results.

The engineers on that legal tech team are good engineers. They wrote careful specifications. They iterated thoughtfully. They did the work. What they were missing wasn’t effort. It was an accurate picture of the system they were actually talking to. Once they had that, everything else followed.