The Setup
In late 2023, a mid-sized legal technology company had a problem most teams would envy: their AI-powered contract summarization feature worked beautifully. The prompt they’d written, after weeks of iteration, reliably extracted key clauses, flagged unusual terms, and formatted output in a structured JSON blob their downstream systems could parse without issue. Customers loved it. The team shipped it, documented it, and moved on to other problems.
Then OpenAI updated GPT-4.
No announcement in their inbox flagged that the specific behaviors they’d tuned against had shifted. The model identifier in their code hadn’t changed. But over the following two weeks, support tickets started trickling in. Summaries were longer. The JSON occasionally included extra commentary before the opening brace. A handful of clause extractions came back in a different order than expected, breaking a downstream sort function. Nothing catastrophic, but enough to erode trust with the customers who’d specifically bought the product for its consistency.
The team spent three days debugging before they traced it to the model itself.
What Happened
OpenAI, like most AI providers, reserves the right to update models without guaranteeing behavioral parity. Their documentation says as much. The model you reach through the generic gpt-4 alias is a moving target, and what “GPT-4” means today is not what it meant six months ago. Pinning to a specific snapshot (like gpt-4-0314) buys you stability, but only until that snapshot is deprecated, which OpenAI announces with roughly six months of notice.
The legal tech team had done almost everything right in terms of prompt engineering. They’d written clear instructions, given examples, specified output format, and tested extensively. What they hadn’t done was treat the model version as a dependency with its own versioning contract, the way you’d treat a database driver or an API client library.
This is the core mistake, and it’s an understandable one. When you write a prompt, you’re writing to a behavior, not to a spec. The behavior is an emergent property of a system you don’t control and can’t fully inspect. Unlike a function call where you know exactly what arguments are accepted and what will be returned, a prompt is a natural-language instruction handed to a system that will interpret it differently depending on factors you can’t see: training data composition, RLHF tuning choices, safety filtering layers, and system-level instruction-following preferences that shift with each update.
The team’s prompt relied on several implicit behaviors that the update had subtly changed. The old model would faithfully return only JSON when asked for JSON. The updated model, presumably fine-tuned to be more conversational and helpful, occasionally prefaced structured output with a short explanatory sentence. That’s arguably better behavior in a chat context. In a pipeline, it’s a parsing error.
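To make the failure concrete, here’s a minimal sketch. The response strings are invented for illustration, but the parsing behavior is exactly the kind of thing the team hit:

```python
import json

# What the old model returned: exactly the JSON the prompt asked for.
old_response = '{"clauses": ["indemnification", "termination"]}'
json.loads(old_response)  # parses cleanly

# What the updated model sometimes returned: a friendly preamble first.
new_response = (
    "Sure! Here is the structured summary you asked for:\n"
    '{"clauses": ["indemnification", "termination"]}'
)
try:
    json.loads(new_response)
except json.JSONDecodeError as err:
    print(f"Pipeline breaks here: {err}")
```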
They also relied on the model’s tendency to keep summaries under a certain token count without explicit instruction. When that tendency shifted, summaries ballooned, and they hit token limits in downstream calls they hadn’t anticipated.
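The fix is to state the constraint rather than assume it. A sketch using the OpenAI Python SDK; the word limit and max_tokens ceiling are illustrative values, not the team’s actual numbers:

```python
from openai import OpenAI

client = OpenAI()
contract_text = "Full text of the contract goes here."

response = client.chat.completions.create(
    model="gpt-4-0314",  # pinned snapshot; more on this below
    max_tokens=500,      # hard ceiling, so downstream calls can't overflow
    messages=[
        {"role": "system",
         "content": "Summarize the contract in at most 300 words."},
        {"role": "user", "content": contract_text},
    ],
)
```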
Why It Matters
This story isn’t unique. Teams building on top of large language models are, in many cases, building on top of infrastructure that changes without a changelog they can subscribe to. The failure mode is particularly sneaky because it often isn’t a hard failure. Your system doesn’t crash. It just behaves slightly worse, or differently, in ways that are hard to detect without good evaluation coverage.
The deeper issue is that many teams treat prompt engineering as a one-time investment. You find what works, you ship it, and you expect it to keep working. That expectation is reasonable for most software dependencies. It is not reasonable for a model that a third party is actively training and retraining on an ongoing basis.
There’s a useful parallel here: prompt engineering without model understanding is cargo-culting. If you don’t understand why your prompt works, you won’t know when it has stopped working, or how.
What We Can Learn
The good news is that the legal tech team’s fixes were practical and mostly straightforward. Here’s what they changed, and what you can take from it.
Pin your model version. Always call a specific snapshot, not a generic alias. Yes, you’ll need to update this eventually. That’s the point. You want model updates to be a deliberate decision you make, not something that happens to you. When you’re ready to upgrade, you upgrade intentionally and run your evaluation suite before flipping the switch in production.
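In code, the difference is one string. A sketch with the OpenAI Python SDK:

```python
from openai import OpenAI

client = OpenAI()

# Fragile: "gpt-4" is an alias that silently tracks whatever OpenAI
# currently serves under that name.
# MODEL = "gpt-4"

# Deliberate: a dated snapshot changes only when you change it.
MODEL = "gpt-4-0314"

response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Summarize this clause: ..."}],
)
```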
Write defensively against output format drift. If you need structured output, use every tool available to enforce it. Most providers now offer constrained output modes or JSON schema enforcement. Use them. Don’t rely on the model’s learned tendency to comply with format instructions, because that tendency can shift. Treat format enforcement as infrastructure, not as a polite request.
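With the OpenAI API, for instance, JSON mode constrains the output at the API level, and a small extraction helper (our own, not part of any SDK) catches anything that still slips through:

```python
import json

from openai import OpenAI

client = OpenAI()
contract_text = "Full text of the contract goes here."

response = client.chat.completions.create(
    model="gpt-4-1106-preview",  # JSON mode requires a snapshot that supports it
    response_format={"type": "json_object"},  # enforcement, not a polite request
    messages=[
        {"role": "system",
         "content": "Return the clause summary as a JSON object."},
        {"role": "user", "content": contract_text},
    ],
)

def extract_json(text: str) -> dict:
    """Fallback parser: tolerate commentary before or after the JSON body."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start : end + 1])

summary = extract_json(response.choices[0].message.content)
```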
Build an evaluation suite and run it on every model change. This is the one that most teams skip because it feels like overhead. It isn’t. Even a small set of golden examples, inputs where you know exactly what the correct output looks like, gives you a tripwire. When a model update changes behavior, your eval suite catches it before your customers do. The legal tech team built twenty representative test contracts and ran their summarization pipeline against them after the incident. Those twenty tests would have caught the regression immediately.
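A minimal version of such a suite fits in one file. This sketch assumes a summarize_contract() pipeline function and a tests/golden directory of expected outputs, both hypothetical stand-ins for your own:

```python
import json
from pathlib import Path

from my_pipeline import summarize_contract  # hypothetical pipeline entry point

GOLDEN_DIR = Path("tests/golden")

def test_golden_contracts():
    """Run every golden contract through the pipeline and compare."""
    for case in sorted(GOLDEN_DIR.glob("*.json")):
        golden = json.loads(case.read_text())
        actual = summarize_contract(golden["contract_text"])
        # Exact-match the fields that must stay stable; looser checks
        # (key presence, ordering, length bounds) work for the rest.
        assert actual["clauses"] == golden["expected_clauses"], case.name
        assert len(actual["summary"].split()) <= golden["max_words"], case.name
```

Run it with pytest before every model or prompt change, and wire it into CI so nobody can skip it.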
Log model responses in production. You can’t debug what you can’t see. Sampling a percentage of real production outputs and storing them gives you a corpus to compare against when something goes wrong. It also helps you notice gradual drift, where behavior shifts slowly enough that no single output looks obviously wrong.
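A sampling wrapper can be as small as this sketch; the 5% rate is an arbitrary starting point you’d tune to your traffic:

```python
import json
import logging
import random
import time

logger = logging.getLogger("llm_outputs")

SAMPLE_RATE = 0.05  # keep 5% of production outputs

def log_sampled(prompt_version: str, model: str, raw_output: str) -> None:
    """Store a random sample of raw model outputs for later comparison."""
    if random.random() > SAMPLE_RATE:
        return
    logger.info(json.dumps({
        "ts": time.time(),
        "prompt_version": prompt_version,
        "model": model,
        "output": raw_output,
    }))
```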
Separate prompt logic from application logic. If your prompt is hardcoded in the middle of a function, changing it requires a code deploy. Prompts should live somewhere you can version, update, and roll back independently. A prompt that’s embedded in your application code is the kind of quiet accumulating debt that becomes expensive to address after the fact.
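One lightweight approach is a directory of versioned prompt files loaded at runtime; the layout and naming here are illustrative, not a standard:

```python
from pathlib import Path

PROMPT_DIR = Path("prompts")  # versioned and deployable without a code change

def load_prompt(name: str, version: str) -> str:
    """Fetch a prompt template by name and explicit version,
    e.g. prompts/contract_summary/v3.txt."""
    return (PROMPT_DIR / name / f"{version}.txt").read_text()

# The template carries a {contract_text} placeholder for the document to summarize.
template = load_prompt("contract_summary", "v3")
prompt = template.format(contract_text="Full text of the contract goes here.")
```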
One more thing worth saying plainly: some of these precautions will feel like over-engineering when your current prompt is working fine. That feeling is the exact moment to implement them. The legal tech team’s prompt worked fine right up until it didn’t, and at that point they were already in customer-facing trouble. Instrumenting your AI pipeline for stability is much easier to do before an incident than during one.
The broader lesson is that LLM-dependent features are more fragile than they appear, not because the technology is immature, but because you’re depending on a system that is explicitly designed to keep improving. Improvement is a feature. Unpredictable behavioral change is the side effect. You can manage that side effect with good engineering practices, the same practices that apply to any dependency you don’t fully control.
You built something that works. Now build it to last.