The Setup
In 2022, a large academic medical center in the United States rolled out a pilot program using a large language model to assist clinical staff with patient triage documentation. The goal was straightforward: reduce the time nurses spent writing intake summaries by letting the model draft them from structured intake data. The nurses would review, edit, and sign off. Time saved, cognitive load reduced, everyone wins.
Within three months, the pilot was quietly shelved. The model wasn’t the problem. The prompts were. Or more precisely, the thinking behind the prompts, which is where the real failure lived.
The story has been documented in several health informatics case analyses, and the pattern it represents shows up constantly in AI deployments across industries. It’s worth understanding what actually went wrong, because the lesson isn’t about AI at all.
What Happened
The team that configured the system was technically competent. They understood the model’s capabilities. They iterated on prompt wording. They tested outputs and refined language. They did what most people picture when they hear “prompt engineering.”
What they didn’t do was define the task with the same precision they expected from the model.
The intake summaries nurses wrote served multiple audiences simultaneously. A triage nurse needed urgency signals. A billing coder needed diagnostic language that matched ICD-10 categories. A physician reading the chart twenty minutes later needed clinical context that justified decisions already made. A quality assurance reviewer needed documentation that would survive a retrospective audit.
One document. Four readers. Four different primary needs. The nursing staff navigated this implicitly, having learned through experience which details mattered and to whom. Nobody had ever written it down, because nobody had needed to. The nurses just knew.
The prompts the implementation team wrote reflected their surface-level understanding of the task: “Summarize this patient intake data into a clinical documentation format.” The model produced fluent, professional summaries. They were also consistently wrong in ways that were hard to see immediately. Billing language drifted. Urgency signals were present but soft. The physician context was accurate but incomplete in exactly the ways that would matter during a handoff.
Nurses started editing the AI drafts heavily, then stopped using them. The promised time savings evaporated. The project died.
Why It Matters
The post-mortem analysis that followed identified the core failure clearly: the team had optimized prompt language before they had specified the task. They treated prompt engineering as a writing problem when it was actually a requirements problem.
This distinction matters enormously. A well-crafted prompt cannot compensate for an underspecified task. What you get is fluent output that technically answers the question you asked while missing the question you actually needed answered. The model is doing exactly what you told it to do. The problem is you didn’t know what you needed to tell it.
This isn’t a novel insight about AI. It’s a restatement of something software engineers have known for decades: garbage in, garbage out, and the most dangerous garbage is the kind that looks like good data. A vague requirement produces a working system that solves the wrong problem. A vague prompt produces confident text that answers the wrong question.
The difference is that with traditional software, the gap between specification and output is often visible immediately. A function that returns the wrong type throws an error. An LLM that answers the wrong question returns something plausible and complete, which means you can ship broken outputs for weeks before the damage surfaces.
The Structure That Was Missing
After the pilot failed, the health system brought in a clinical informatics team to run a second attempt. Before writing a single prompt, they spent two weeks doing something that felt frustratingly low-tech: they sat with nurses and watched them write intake summaries, asking them to narrate their decisions aloud.
What emerged was a detailed map of the task. The summaries had a hierarchy of concerns, ordered by reader and urgency. Certain phrases were load-bearing in ways that weren’t obvious to outsiders (the difference between “patient reports” and “patient states” carries legal weight in clinical documentation). The appropriate level of detail varied by presenting complaint category. There were soft conventions around hedging language that the nursing staff had internalized from years of chart review.
Once the team had documented all of this, writing effective prompts became straightforward. Not easy, but straightforward. The hard work had already been done. The prompts were encoding knowledge that now existed explicitly, rather than trying to extract knowledge that had never been articulated.
The second pilot succeeded. Nurses accepted the drafts at a high rate and editing time dropped significantly. The model hadn’t changed. The prompts looked completely different, but the real difference was the thinking that preceded them.
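A minimal sketch of what that shift can look like in code is below. The audience list, phrasing rules, and field names are hypothetical illustrations rather than the health system’s actual configuration; the point is that every constraint in the prompt traces back to something the observation work surfaced.

```python
# Illustrative sketch only: the audiences, phrasing rules, and field names below
# are hypothetical stand-ins for the kind of task knowledge the second team
# documented before writing any prompts.

INTAKE_SUMMARY_SPEC = {
    "audiences": {
        "triage nurse": "lead with urgency signals and time-sensitive findings",
        "billing coder": "use diagnostic terminology that maps to ICD-10 categories",
        "physician": "include the clinical context that justifies triage decisions",
        "QA reviewer": "document enough detail to survive a retrospective audit",
    },
    "phrasing_rules": [
        'Use "patient reports" for symptoms the patient describes.',
        'Use "patient states" only for direct assertions or quotes.',
        "Hedge unconfirmed findings (e.g. 'appears', 'consistent with').",
    ],
    "detail_by_complaint": {
        "chest pain": "high",
        "minor injury": "brief",
    },
}


def build_intake_prompt(intake_data: dict, spec: dict = INTAKE_SUMMARY_SPEC) -> str:
    """Assemble a constrained prompt directly from the explicit task specification."""
    audiences = "\n".join(
        f"- For the {role}: {need}" for role, need in spec["audiences"].items()
    )
    rules = "\n".join(f"- {rule}" for rule in spec["phrasing_rules"])
    detail = spec["detail_by_complaint"].get(
        intake_data.get("complaint_category", ""), "standard"
    )
    return (
        "Draft a clinical intake summary from the structured data below.\n"
        f"The summary serves four readers at once:\n{audiences}\n"
        f"Phrasing rules:\n{rules}\n"
        f"Level of detail for this complaint category: {detail}.\n"
        f"Structured intake data: {intake_data}"
    )
```

Nothing in a prompt like this is clever wording. It is the nurses’ implicit knowledge, written down and passed through.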
As a side note: the team also discovered that shorter, more constrained prompts outperformed longer, more elaborate ones once the underlying task was well-specified. If you’ve noticed this pattern in your own work, there’s a structural reason for it: a well-specified task removes the ambiguity that elaborate prompts exist to paper over, so the extra language stops doing useful work.
What We Can Learn
Prompt engineering has accumulated a lot of mystique it doesn’t deserve. There are genuine techniques worth knowing: context placement, instruction ordering, output format specification, and understanding how models handle ambiguity. These things matter at the margin.
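To take output format specification as one small example of that margin: it usually means stating the exact structure you expect and checking the response against it, rather than hoping the model infers it. The schema below is a hypothetical illustration, not something drawn from the case study.

```python
# Hypothetical illustration of output format specification: request a fixed
# structure, then validate the response before anything downstream relies on it.
import json

EXPECTED_KEYS = {"urgency", "presenting_complaint", "clinical_context", "billing_terms"}

FORMAT_INSTRUCTION = (
    "Return only a JSON object with exactly these keys: "
    + ", ".join(sorted(EXPECTED_KEYS))
    + ". Include no text outside the JSON."
)


def parse_summary(model_output: str) -> dict:
    """Reject any response that does not match the requested structure."""
    data = json.loads(model_output)  # raises ValueError if the output is not valid JSON
    missing = EXPECTED_KEYS - data.keys()
    if missing:
        raise ValueError(f"model output missing keys: {sorted(missing)}")
    return data
```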
But the factor that separates working AI implementations from failed ones, consistently, is whether the people building them did the hard cognitive work of specifying the task before they started iterating on language. That work is not glamorous. It involves talking to the people who do the job, watching how they actually work (which differs from how they describe working), and building an explicit model of what good output looks like and why.
This is structured thinking. It’s what good product managers do when they write requirements. It’s what good technical writers do when they document a system. It’s what good engineers do when they design an API before implementing it. The skill is old. The application is new.
The companies getting the most out of LLMs right now are not the ones with the cleverest prompts. They’re the ones that treat AI integration as a requirements exercise first and a technical exercise second. They’ve learned that the question “what exactly are we asking the model to do, and for whom, and under what constraints” has to be answered before “how do we phrase this.”
If your AI implementation isn’t working the way you expected, the honest first question isn’t “what’s wrong with our prompts.” It’s “did we actually know what we were asking for before we started asking.”
Most of the time, the answer is no. And no amount of prompt iteration fixes that.