The 2017 “Attention Is All You Need” paper didn’t just improve NLP performance. It retired an entire class of architectures (RNNs, LSTMs) and made today’s large language models possible. That’s real. But there’s a version of the story that gets told in conference talks and blog posts where attention was the key that unlocked language understanding, and we just needed to scale it up. That story is too clean.
Attention solved a specific, well-defined problem. It moved others downstream, made some worse, and created a few entirely new ones. Here’s where the debt actually landed.
1. Attention Fixed Long-Range Dependencies, Then Made Context a New Crisis
The core problem with recurrent neural networks was that information had to travel sequentially through time steps. By step 200, what happened at step 1 was essentially a faint whisper. Attention fixed this by letting every token in a sequence look directly at every other token, weighted by relevance. A question word at position 0 could attend strongly to an answer at position 300 without information degrading in between.
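To make that "look directly at every other token" concrete, here's a minimal sketch of scaled dot-product attention in NumPy. The sequence length, dimensions, and random inputs are made up for illustration; a real model adds learned projections, multiple heads, and masking on top of this.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Every query position scores every key position, then takes a
    relevance-weighted average of the values. No recurrence involved."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over key positions
    return weights @ V                                 # weighted sum of values

# The token at position 0 can attend to position 300 as easily as to position 1.
seq_len, d_model = 301, 64
Q = K = V = np.random.randn(seq_len, d_model)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (301, 64)
```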
But this fix carries a cost that scales quadratically. If you double the sequence length, attention’s memory and compute requirements quadruple. Early Transformer models worked with sequences of 512 tokens. That’s about 380 words, roughly one page of a novel. Getting to the 128,000-token context windows in current frontier models required years of engineering work on sparse attention, sliding window attention, and hardware-level optimizations. The problem didn’t disappear. It became an infrastructure problem instead of a modeling problem.
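A back-of-the-envelope calculation shows why. If the attention weight matrix is materialized naively (exactly what the sparse-attention and kernel engineering mentioned above exists to avoid), its size alone grows with the square of the sequence length. The head count and precision below are assumptions for illustration, not any particular model's configuration.

```python
# Memory for raw attention weights: one (seq_len x seq_len) matrix per head.
# Assumed numbers: 16 heads, fp16 (2 bytes per entry) -- illustrative only.
heads, bytes_per_entry = 16, 2

for seq_len in (512, 1024, 2048, 131072):
    matrix_bytes = heads * seq_len * seq_len * bytes_per_entry
    print(f"{seq_len:>7} tokens -> {matrix_bytes / 2**20:,.0f} MiB of attention weights")

# Doubling the sequence length quadruples the memory.
# Going from 512 to 131,072 tokens is a (256)^2 = 65,536x increase.
```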
2. The Data Requirements Became Essentially Unbounded
RNNs trained on millions of examples. Transformers, to realize their potential, need billions. The attention mechanism is extraordinarily good at finding patterns, but it needs an ocean of signal to do it reliably. Early BERT models trained on roughly 3.3 billion words of text. GPT-3 trained on around 300 billion tokens. Models released after it have used more.
This isn’t just a cost problem, though cost is real. It’s a data-quality and data-sourcing problem that the field still hasn’t solved cleanly. Training data encodes biases, factual errors, and outdated information in ways that are difficult to audit after the fact. The attention mechanism is a very powerful pattern-matcher. Give it bad patterns at scale and it learns them very well. The garbage-in problem got much larger when the input required went from millions of examples to hundreds of billions.
3. Attention Doesn’t Understand Anything, It Correlates Everything
This is the one that takes the longest to sit with. Attention computes relationships between tokens based on learned weights. It does not build a symbolic model of what those tokens mean. When a model correctly answers a question about a piece of code or a legal document, it’s because the attention patterns over its training distribution generalize well to the current input. When it confidently produces something wrong, it’s usually because those patterns generalized in a direction that looked plausible but wasn’t grounded.
The original NLP promise, going back decades, was machines that could parse language into structured meaning. Attention got us further on benchmarks than any prior approach, but it did it by building increasingly sophisticated correlation engines, not by cracking semantic understanding. If you’ve spent time reading model outputs carefully, you’ve noticed this. The model can discuss the concept of a null pointer exception fluently and still write code that dereferences one. A closer look at what an LLM actually does with your prompt makes it clear that the gap between fluent output and grounded reasoning is wider than it looks from the outside.
4. The Compute Requirement Centralized the Field
Before Transformers, NLP research was broadly accessible. Training a competitive LSTM model on a reasonable dataset was something a graduate student with a few GPUs could do. Training a competitive Transformer at modern scale requires infrastructure that costs millions of dollars and is available to perhaps a dozen organizations globally.
This isn’t inherent to attention as a mechanism. Small Transformer models are cheap. But the performance gains that made Transformers worth caring about happen at scale, and scale requires capital. The practical effect is that the frontier of NLP research is now defined by a small number of labs, and independent researchers work primarily with models those labs have chosen to release. That’s a structural change in who gets to ask what questions.
5. Positional Encoding Is Still Kind of a Hack
Attention, on its own, is permutation-invariant. It treats a sequence as a set. To actually process language in order, you have to inject positional information separately, and the original Transformer paper used sinusoidal functions to do this. Later work moved to learned positional embeddings, then to relative position encodings, then to rotary position embeddings (RoPE), which is what most current models use.
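For reference, the sinusoidal scheme from the original paper is simple enough to write out in full. This sketch follows the published formula; the sequence length and `d_model` are picked arbitrarily.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Added to the token embeddings so attention can tell positions apart."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=512, d_model=64)
print(pe.shape)  # (512, 64)
```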
Each of these is an attempt to solve a problem that attention created by abstracting away sequence order in the first place. RoPE works well enough that it’s become the default, but the fact that we’re still iterating on positional encoding six years after the original paper suggests this wasn’t a solved problem so much as a deferred one. How a model handles position directly affects how it handles tasks that require tracking order, like reasoning through multi-step problems or understanding narrative sequence. It’s a subtle constraint with non-subtle consequences.
6. Inference Cost Is the New Training Cost
The story of Transformers is mostly told through training. The paper titles, the benchmarks, the parameter counts. But for anyone actually deploying these models in production, the constraint that bites is inference: the cost of running a model on each new input, at whatever query volume your application generates.
Attention is expensive at inference time for long contexts. The KV cache (which stores previously computed key and value representations so you don’t recompute them on every token) grows with context length and becomes a significant memory bottleneck. This is why serving a long-context model is meaningfully more expensive per query than serving a short-context one. The architectural decisions made to solve the training-time long-range dependency problem created a runtime cost that the field is now trying to engineer back out through quantization, speculative decoding, and attention approximations. The bottleneck relocated from model design to deployment infrastructure.
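A rough estimate of the KV cache footprint makes the bottleneck visible. The dimensions below are assumptions loosely in the range of a mid-size open model, not the configuration of any specific one.

```python
# KV cache: 2 tensors (K and V) per layer, each (seq_len, n_kv_heads * head_dim).
# Assumed dimensions -- illustrative, not any particular model's config.
n_layers, n_kv_heads, head_dim = 32, 8, 128
bytes_per_value = 2  # fp16

def kv_cache_bytes(seq_len, batch_size=1):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return batch_size * seq_len * per_token

for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB per sequence")

# Unlike the attention matrix, this grows linearly with context length,
# but it must stay resident in accelerator memory for every concurrent request.
```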
None of this is an argument against attention or Transformers. They are genuinely better than what came before on almost every task that matters. But better is not solved. Each constraint that attention eliminated revealed the constraint behind it. The field is now running into the limits of scale, the costs of inference, the inadequacy of correlation for reasoning, and the concentration of capability in a small number of compute-rich organizations. Those are the actual problems on the table. Attention was the mechanism that let us find them.