When OpenAI released GPT-4 in March 2023, the discourse split almost immediately into two camps. One side insisted the model was on the verge of sentience. The other insisted it was a ‘stochastic parrot,’ a phrase borrowed from a 2021 paper by Emily Bender and colleagues, mindlessly stitching together word sequences with no grasp of meaning. Both camps were wrong, but one of them was wrong in a more interesting way.

The parrot framing stuck because it felt like a useful corrective to the hype. And mechanically, it’s close to accurate. LLMs are trained to predict the next token given the preceding tokens. That’s the job. Predict the next word. Do it billions of times across an enormous corpus of text. Adjust the weights when you get it wrong. Repeat.

But here’s where the framing breaks down: what does it actually take to become very, very good at predicting the next word?

The Setup: What GPT-4 Was Actually Trained to Do

The training objective for a model like GPT-4 is called next-token prediction. Given a sequence of text, the model outputs a probability distribution over every possible next token. It assigns a number to ‘cat,’ a number to ‘the,’ a number to ‘justice,’ and so on. The highest-probability token usually wins, though controlled randomness in the sampling process keeps outputs from being identical every time you ask the same question (a point explored further in ‘AI Models Give Different Answers to the Same Question Every Time and the Reason Is a Feature Not a Bug’).
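To make the sampling step concrete, here is a minimal sketch of how a probability distribution over next tokens gets sampled with a temperature parameter. The logit values for the legal-evidence example are invented for illustration; a real model produces scores over a vocabulary of tens of thousands of tokens, not three.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, seed=None):
    """Turn raw scores (logits) into a probability distribution via
    softmax, then sample one token. Temperature below 1 sharpens the
    distribution toward the top token; above 1 flattens it."""
    scaled = [x / temperature for x in logits.values()]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = {tok: e / total for tok, e in zip(logits, exps)}
    rng = random.Random(seed)
    r = rng.random()
    cumulative = 0.0
    for tok, p in probs.items():
        cumulative += p
        if r < cumulative:
            return tok, probs
    return tok, probs  # fallback for floating-point edge cases

# Hypothetical logits for the token after "The defendant was
# acquitted because the evidence was":
logits = {"insufficient": 6.2, "weak": 4.1, "delicious": -3.0}
token, probs = sample_next_token(logits, temperature=0.7, seed=0)
```

With a fixed seed the draw is reproducible, but remove the seed and repeated calls can return different tokens from the same distribution, which is exactly the controlled randomness described above.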

The numbers inside the model, the weights, encode statistical relationships learned from training data. That much is uncontroversial. The dispute is about what those statistical relationships actually represent.

Consider what it takes to correctly predict that the word following ‘The defendant was acquitted because the evidence was’ is more likely to be ‘insufficient’ than ‘delicious.’ You need to have extracted something about legal proceedings, about what constitutes reasonable grounds for acquittal, about the relationship between evidence quality and legal outcomes. You can call that ‘understanding’ or you can refuse to call it that. The label doesn’t change the functional capability.
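A toy illustration of what ‘statistical relationships learned from training data’ means at the smallest possible scale: a bigram model that simply counts which word followed which in a tiny invented corpus. Real models learn dense vector weights rather than count tables, and the corpus here is fabricated for the example, but the principle is the same: prediction quality reflects regularities in the data, with no rule about law or food ever programmed in.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; real training data is billions of tokens.
corpus = (
    "the evidence was insufficient . "
    "the evidence was weak . "
    "the cake was delicious . "
    "the evidence was insufficient ."
).split()

# Count how often each word follows each preceding word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word_probs(prev):
    """Relative frequencies of the words observed after `prev`."""
    counts = follows[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

probs = next_word_probs("was")
# "insufficient" comes out most likely after "was", purely because
# of how often it appeared there in the corpus.
```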

[Figure: attention weights connecting tokens in a transformer layer, with thicker lines for stronger attention. Attention allows a model to weigh every token against every other token; the weights are learned, not programmed.]

What Happened When the Benchmark Numbers Came In

GPT-4’s performance on standardized professional exams became the concrete case study that forced this conversation into sharper focus. On the Uniform Bar Exam, GPT-4 scored around the 90th percentile of human test-takers. On the Medical College Admission Test, it performed comparably to students who went on to complete medical school. On advanced mathematics benchmarks, it cleared hurdles that previous models had found essentially impassable.

This created a genuine explanatory problem for the ‘it’s just pattern matching’ position. Pattern matching against training data is the right description of the mechanism, but mechanism is not the same as capability. A calculator ‘just’ does arithmetic operations. In both descriptions, the word ‘just’ is doing a lot of work.

The researchers who built these benchmarks largely stopped arguing about whether models ‘understand’ things after seeing the GPT-4 results. The question pivoted. It stopped being ‘does it understand?’ and became ‘what is it actually doing internally, and how does that relate to the kinds of reasoning we care about?’

That’s a better question.

Why the ‘Understanding’ Frame Causes Real Problems

The stakes here are not purely philosophical. The language we reach for shapes the decisions we make.

If you believe LLMs ‘understand’ in the way humans do, you’re likely to trust outputs in domains where the model is confidently wrong, because confident wrongness looks a lot like confident rightness when you’re reading text. The model doesn’t flag its own uncertainty in any reliable internal way. It produces fluent text regardless of whether the underlying computation corresponds to accurate information. This is where hallucinations come from: not from the model ‘lying’ or ‘misunderstanding,’ but from the fundamental fact that the training objective rewards fluent, plausible token sequences, not true ones.

If, on the other hand, you believe LLMs are ‘just’ autocomplete and therefore incapable of anything interesting, you’ll systematically underestimate what they can do. Companies that dismissed GPT-4-class models as expensive search engines have spent the past two years rebuilding products their competitors shipped correctly the first time.

The productive frame is something more mechanical and more honest. These models compress and index vast regularities in human-generated text. When you query them, you’re pulling on those regularities in ways that often produce correct, useful, sometimes genuinely novel outputs, and sometimes produce confidently wrong ones. The trick is knowing which is which, and that requires understanding the mechanism, not anthropomorphizing it.

The Attention Mechanism Is the Actual Story

The specific architectural feature that makes modern LLMs work is the transformer’s attention mechanism, introduced in the 2017 paper ‘Attention Is All You Need’ by Vaswani and colleagues at Google. Attention allows the model to weigh the relevance of every token in the input against every other token when computing its next prediction. This is how a model can correctly use a pronoun that refers to a noun forty words earlier in a sentence, without any explicit rule being programmed.

When a model processes ‘The trophy didn’t fit in the suitcase because it was too big,’ attention heads in the model learn to associate ‘it’ with ‘trophy’ rather than ‘suitcase.’ They do this through learned weights, not through programmed rules. The model has, in some functional sense, extracted the relationship between object sizes and spatial containment, because that relationship is a regularity in the text it trained on.
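A minimal sketch of the scaled dot-product attention computation described in ‘Attention Is All You Need,’ for a single query vector. The 2-dimensional vectors standing in for ‘it,’ ‘trophy,’ and ‘suitcase’ are hand-picked for illustration; in a real transformer the queries, keys, and values come from learned projection matrices applied to token embeddings, and there are many heads running in parallel.

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector: score each
    key against the query, softmax the scores into weights, and
    return the weighted sum of the value vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return weights, context

# Hand-picked 2-d vectors standing in for "it", "trophy", "suitcase".
# A head that has learned to resolve the pronoun would place the
# query for "it" closer to the key for "trophy".
query_it = [1.0, 0.2]
keys = [[0.9, 0.1],   # trophy
        [0.1, 0.9]]   # suitcase
values = keys
weights, context = attention(query_it, keys, values)
# weights[0] (trophy) comes out larger than weights[1] (suitcase)
```

Nothing in the function knows anything about trophies or suitcases; the pronoun resolution falls out of where the learned vectors sit, which is the point of the paragraph above.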

Is that ‘understanding’? The question is genuinely not useful. What’s useful is knowing that this capability exists, that it’s somewhat brittle under distribution shift (novel phrasings or contexts the model hasn’t seen often), and that it doesn’t come with any built-in epistemic humility about its own limits.

What This Means in Practice

The GPT-4 release taught the industry several concrete lessons that are now shaping how serious teams use these models.

First, the competence floor is higher than skeptics assumed. If your product requires reasoning over professional-grade text, these models are a viable starting point, not a novelty.

Second, the failure modes are specific and learnable. Models fail in characteristic ways: they confabulate citations, they struggle with precise numerical reasoning (especially multi-step arithmetic), they can be thrown off by irrelevant information inserted into prompts. These aren’t random failures. They follow from the training objective.

Third, the word ‘understanding’ is doing active harm in product conversations. Teams that build with the mechanical picture in mind, that ask ‘what statistical regularities has this model learned, and does my task require something beyond that?’ tend to deploy more reliably and fail more gracefully.

The model is not intelligent. It’s also not dumb. It’s a very high-dimensional function that maps token sequences to probability distributions over next tokens, trained on enough human text that those distributions capture enormous amounts of structure about how humans think and communicate. That’s remarkable on its own terms. It doesn’t need the ‘understanding’ label to be genuinely useful, and attaching that label makes it harder to think clearly about what it can and can’t do.

Say what the thing is. Then figure out what it’s good for.