The Setup
In 2023, a researcher at a well-regarded AI investment firm was using GPT-4 to assist with a literature review on reinforcement learning from human feedback. The model returned a list of twelve papers, each with author names, journal names, volume numbers, and DOIs. The formatting was immaculate. The citations looked exactly like what you’d find in a Nature Machine Intelligence bibliography.
Six of the twelve papers did not exist.
Not approximately. Not misattributed. The papers were invented: plausible titles, plausible author combinations, plausible venues, fictional DOIs that resolved to nothing. The researcher said later that when she compared the two sets side by side, the fake citations were actually better formatted than the real ones.
This is not a story about a careless user or a novel edge case. This is what happens when a model is capable enough to be convincing but not grounded enough to be correct, and it matters more as the models get smarter.
What Happened
The pattern that showed up in that literature review has since been documented repeatedly across high-stakes professional settings: legal research, medical literature summaries, financial analysis, code documentation. Lawyers at a New York firm submitted a brief citing cases that ChatGPT had fabricated, a story that became public in 2023 when a federal judge sanctioned the attorneys involved. The cases sounded real. They had docket numbers. They had plausible holdings. ChatGPT had assembled them from patterns in what legal citations look like, not from any record of actual cases.
What makes this particularly sharp is the relationship between model capability and hallucination confidence. Older, smaller models tend to either get something right or produce output that’s obviously wrong, garbled, or hedged into uselessness. GPT-2 generating text that started coherently and devolved into word salad was actually more honest in a sense: the failure was visible.
Better models fail more gracefully, which means they fail more dangerously. When GPT-4 fabricates a citation, it does not produce something that looks wrong. It produces something that looks like what a correct citation looks like. The model has learned the surface form of correctness without having reliable access to the underlying facts.
This is sometimes called the fluency-accuracy gap, and it widens as fluency improves faster than factual grounding. A model trained on more data and optimized for coherent, human-preferred output gets better at producing text that feels trustworthy. Whether the content is true is a separate question that the model is not actually answering.
There’s a useful analogy in software here. Imagine a function that returns a result in exactly the right format, with no exceptions thrown, but the computation inside is wrong. A static type checker will accept it. Your tests might pass if they only check structure. The bug only surfaces when you check the actual value against ground truth. That’s what a fluent hallucination is: well-typed garbage.
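To make the analogy concrete, here is a minimal Python sketch. Everything in it is hypothetical and made up for illustration: the Citation type, the lookup_citation stand-in for a fluent generator, the placeholder DOIs, and the KNOWN_DOIS set that plays the role of ground truth. It is not how any real model or citation database works; it only shows a structure check passing where a value check fails.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    title: str
    year: int
    doi: str


def lookup_citation(topic: str) -> Citation:
    """Hypothetical stand-in for a fluent but ungrounded generator."""
    # The shape is always right; the contents are invented from the topic.
    slug = topic.lower().replace(" ", "-")
    return Citation(
        title=f"A Survey of {topic.title()}",
        year=2021,
        doi=f"10.0000/example.{slug}",  # placeholder prefix, not a real DOI
    )


def structural_check(c: Citation) -> bool:
    """Passes if the output merely looks like a citation."""
    return (
        bool(c.title)
        and 1900 <= c.year <= 2100
        and c.doi.startswith("10.")
        and "/" in c.doi
    )


# Assumed ground truth for this sketch: the DOIs that actually exist.
KNOWN_DOIS = frozenset({"10.0000/example.real-paper"})


def value_check(c: Citation) -> bool:
    """Passes only if the citation matches something in ground truth."""
    return c.doi in KNOWN_DOIS


result = lookup_citation("reward modeling")
print(structural_check(result))  # True: well-typed, well-formatted
print(value_check(result))       # False: the paper does not exist
```

A test suite built only from checks like structural_check would report this function as healthy. The failure is visible only to a check that consults something outside the function.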
Why It Matters
The reason this problem is getting more serious rather than less is that deployment patterns have shifted. When GPT-3 was new, teams commonly kept a human reviewing every output before anything went out the door. As models have improved and teams have grown comfortable, the review layer has thinned. Workflows that used to require sign-off now pipe model output directly into downstream systems.
The false confidence extends to the people using the models. There’s solid evidence from cognitive science that fluency cues, meaning prose that reads smoothly and uses correct technical vocabulary, strongly influence how credible people find a claim. Readers slow down and scrutinize text that sounds uncertain or awkward. They tend to accept text that sounds authoritative. A model that has learned to sound authoritative is, structurally, a persuasion machine. Whether it’s right is a property of the content, not the form.
This is compounded by what you might call the upgrade paradox. When a team switches from an older model to a newer, smarter one, they usually see genuine quality improvements across most tasks. Fewer dumb errors, better reasoning on complex prompts, more useful summaries. The appropriate response feels like increased trust. But increased trust applied uniformly, rather than selectively, is exactly the wrong response when the failure mode has shifted from visible errors to invisible ones.
The legal case is instructive here not because lawyers are careless but because they are experts. These were attorneys who knew their field well enough that a real citation would have sounded familiar, but a fabricated one from an adjacent area could slip past. The model did not tailor its confidence to them; it sounded equally sure everywhere. The effect was the same as if it had: the fabrications survived exactly where the attorneys’ knowledge had gaps, which is where you most need accuracy and least have the background to check it.
What We Can Learn
The first lesson is architectural: trust should scale with verifiability, not with model capability. If a model’s output touches something consequential, the question is not “is this a smart model” but “can this specific claim be checked against a source.” Retrieval-augmented generation (RAG), where the model is given actual documents to work from and required to cite them, meaningfully reduces hallucination rates for factual questions because it changes the task. Instead of generating plausible text, the model is selecting and summarizing from provided context. That’s a different and more tractable problem.
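One way to picture "can this specific claim be checked against a source" is a small grounding check run over the model’s answer. The sketch below is an assumption-heavy illustration, not a description of any particular RAG framework: it supposes the model has been asked to return each claim with a source ID and a supporting quote, and the Claim type, the unsupported_claims function, and the sample documents are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str        # what the model asserts
    source_id: str   # which provided document it cites
    quote: str       # the passage it says supports the claim


def normalize(s: str) -> str:
    """Lowercase and collapse whitespace so matching ignores layout."""
    return " ".join(s.lower().split())


def unsupported_claims(claims: list[Claim], sources: dict[str, str]) -> list[Claim]:
    """Return claims whose quote cannot be found in the cited source.

    This only verifies that the quoted passage exists where the model
    says it does. Whether the passage really supports the claim still
    needs a human or a stronger check.
    """
    flagged = []
    for claim in claims:
        doc = sources.get(claim.source_id)
        quote = normalize(claim.quote)
        if doc is None or not quote or quote not in normalize(doc):
            flagged.append(claim)
    return flagged


# Hypothetical retrieved context and model output.
sources = {"doc-1": "The pilot study enrolled 40 participants over six weeks."}
claims = [
    Claim("The study had 40 participants.", "doc-1", "enrolled 40 participants"),
    Claim("The effect lasted a year.", "doc-1", "effects persisted for twelve months"),
    Claim("A follow-up confirmed it.", "doc-9", "confirmed the result"),
]

for claim in unsupported_claims(claims, sources):
    print("needs review:", claim.text)
```

The first claim passes because its quote appears in doc-1. The second cites a real document but a passage that is not in it, and the third cites a document that was never provided, so both are routed to review instead of being trusted on fluency.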
The second lesson is about calibration at the workflow level. The places where AI assistance is most tempting are often the places where human expertise is thinnest, and those are exactly the places where fluent hallucinations are hardest to catch. A senior engineer can often spot when a model’s code suggestion doesn’t make sense. A junior engineer who doesn’t fully understand the codebase might accept the same suggestion because it compiles and looks reasonable. The review process needs to be strongest where subject-matter expertise is weakest, which is the opposite of where it tends to end up in practice.
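As a rough sketch of what that inversion could look like in a workflow, the policy below raises the required review level as the expertise of the person accepting the output goes down. The Expertise and Review levels, the required_review function, and the particular mapping are all invented for illustration; a real team would choose its own categories and thresholds.

```python
from enum import IntEnum


class Expertise(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


class Review(IntEnum):
    SPOT_CHECK = 0
    FULL_REVIEW = 1
    EXPERT_SIGNOFF = 2


def required_review(user_expertise: Expertise, consequential: bool) -> Review:
    """Hypothetical policy: less expertise at the keyboard, more review after it."""
    if not consequential:
        return Review.SPOT_CHECK
    if user_expertise == Expertise.HIGH:
        # An expert can catch many errors on their own, but consequential
        # output still gets a full read.
        return Review.FULL_REVIEW
    # Where the user cannot judge the output, someone who can must sign off.
    return Review.EXPERT_SIGNOFF


print(required_review(Expertise.LOW, consequential=True).name)   # EXPERT_SIGNOFF
print(required_review(Expertise.HIGH, consequential=True).name)  # FULL_REVIEW
```

The point of writing the policy down is that it does not drift: the default in practice is for review to follow seniority, and an explicit rule makes the opposite choice visible.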
The third lesson is harder and more uncomfortable. The experience of using a more capable model feels like using a more reliable tool. The interface is the same, the response time is similar, the output is more polished. Nothing in the user experience signals that the failure mode has changed character. Teams need to deliberately build in the skepticism that the product design does not encourage.
This connects to something worth sitting with: the prompt you write is not the prompt the model reads, and the answer you read is not a fact the model knows. It’s a generation. The model is doing something closer to sophisticated interpolation than retrieval. When interpolation works, it’s extraordinarily useful. When it fails, it fails smoothly.
Smart models are not more honest than dumb ones. They’re better at sounding honest. That distinction is easy to understand and surprisingly hard to remember when the output is sitting in front of you, formatted perfectly, waiting to be copied.