The Setup

In 2023, a mid-sized legal tech company had built something genuinely useful. Their product helped small law firms draft routine documents: NDAs, engagement letters, basic contract templates. The core workflow was straightforward. A lawyer filled out a structured intake form, and the system used a third-party AI API to synthesize the responses into a formatted draft. Lawyers reviewed and edited. Time saved: real. Value delivered: measurable.

The integration was tight. The prompts were tuned over many months to produce output in a specific format that fed directly into the firm’s document management pipeline. The system knew to output certain field delimiters, to avoid footnotes in sections where the downstream parser would choke, to keep clause numbering consistent with a particular style guide. Hundreds of hours of iteration had gone into making the model’s output behave like a reliable component rather than a probabilistic text generator.

Then the vendor announced they were shutting down the API. Ninety days' notice. The model the product depended on was going away.

What Happened

The immediate problem was not finding a replacement model. Models exist. The problem was that every other model produced output in subtly different ways, and “subtly different” was enough to break the downstream pipeline entirely.

Switching to a comparable API from a larger provider sounds trivial until you understand what the engineering team actually faced. The prompt that had worked reliably for months stopped working. Not because the new model was worse, but because it interpreted the same instructions differently. The prompt you write is not the prompt the model reads in any literal sense, and that problem multiplies when you switch between model families entirely.

The delimiter the old model reliably inserted between clause sections? The new model sometimes used it, sometimes did not, depending on phrasing that seemed irrelevant. The document management parser downstream expected exact strings. The mismatch was silent: no exceptions thrown, no error logs, just malformed documents that looked fine until a lawyer actually read them.

Re-tuning took six weeks. The team had to rebuild test cases from scratch because their evaluation suite was implicitly calibrated to the old model’s quirks, not to the actual desired output. They had been testing against the wrong surface without realizing it.

During those six weeks, the product was degraded. Customer churn ticked up. One enterprise client invoked a reliability clause in their contract and negotiated a service credit. The company survived, but the incident cost more than the original integration had.

[Diagram: an abstraction layer normalizes varied AI model outputs into a consistent internal data format. Caption: The normalization layer is the only place your pipeline should know anything about the model's output format.]

Why It Matters

This is not a story about a company making a dumb mistake. The engineers who built this product were competent. The choice to use a third-party AI API was reasonable. The mistake was architectural, and it is one that most teams building on top of AI APIs are making right now.

The core issue is treating the model’s output as a stable interface when it is not. A traditional API has a contract. If you call a REST endpoint and pass a valid payload, you get back a defined response structure. The provider is responsible for maintaining that contract across versions. If they break it, they broke their API.

An AI model has no such contract. The output is probabilistic. The same prompt can return different formatted output across model versions, across providers, and sometimes across individual requests. When you build a pipeline that depends on specific output structure, you are depending on observed behavior, not a specification. That is a fragile dependency even when the vendor stays alive.

Vendor failure makes it catastrophic. The legal tech company’s problem was not just that a vendor shut down. It was that the shutdown exposed a hidden assumption baked into their architecture: that the model’s specific behavioral quirks were stable infrastructure they could build on.

This is the vendor lock-in problem for AI, and it is more severe than the classic version. Classic lock-in (say, being stuck on a specific database or cloud provider) is painful to escape but usually feasible. The data is portable. The logic is yours. You pay a migration cost and move on. AI vendor lock-in can be invisible until the vendor disappears, because the dependency is not in your code, it is in the implicit contract between your prompts and a specific model’s behavior.

What We Can Learn

The engineers who rebuilt this product did something smart in the recovery: they inserted an abstraction layer between the AI call and the rest of the pipeline. Instead of having the document management system consume model output directly, they wrote a normalization step, a small parser that translated model output into their internal format before anything downstream saw it.

This is the correct architecture, and it is worth being specific about what it requires.

First, define your internal format explicitly, in code, before you write a single prompt. The model’s job is to produce content. Your normalization layer’s job is to extract structured data from that content and validate it. If the model fails to produce something parseable, the normalization layer raises an error at the boundary rather than letting malformed data corrupt the pipeline. This is the same principle as input validation on an API endpoint: you validate at the edge, not deep in the business logic.
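
To make that concrete, here is a minimal sketch of what such a boundary might look like in Python. The `Clause` and `DraftDocument` types, the `###CLAUSE###` delimiter, and the parsing rules are illustrative assumptions, not the company's actual format; the point is that the internal types exist in code and that anything unparseable fails loudly at the edge.

```python
from dataclasses import dataclass

# Illustrative internal format: the rest of the pipeline only ever sees these types.
@dataclass
class Clause:
    number: str        # e.g. "3.1", validated against the style guide
    heading: str
    body: str

@dataclass
class DraftDocument:
    title: str
    clauses: list[Clause]

class NormalizationError(ValueError):
    """Raised at the boundary when model output cannot be mapped to the internal format."""

def normalize(raw_output: str, delimiter: str = "###CLAUSE###") -> DraftDocument:
    """Parse raw model output into the internal format, failing loudly on malformed input."""
    title, _, rest = raw_output.partition("\n")
    sections = [s.strip() for s in rest.split(delimiter) if s.strip()]
    if not sections:
        raise NormalizationError("no clause sections found in model output")

    clauses = []
    for section in sections:
        lines = section.splitlines()
        header = lines[0].split(" ", 1)
        if len(header) != 2 or not header[0].replace(".", "").isdigit():
            raise NormalizationError(f"unparseable clause header: {lines[0]!r}")
        clauses.append(Clause(number=header[0],
                              heading=header[1],
                              body="\n".join(lines[1:]).strip()))
    return DraftDocument(title=title.strip(), clauses=clauses)
```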

Second, write your evaluation suite against your internal format, not the model’s raw output. The test question should not be “did the model produce the right text?” It should be “did the normalization layer extract the right structured output?” This decouples your quality measurement from any specific model’s behavior.
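
Continuing the sketch above, a test at this level asserts against `DraftDocument`, never against raw text. The canned response and the `normalization` module name are invented for the example; in a real suite the raw string would come from a live model call behind the adapter.

```python
# test_normalization.py -- evaluation cases target the internal format, not raw model text.
import pytest

from normalization import normalize, NormalizationError  # hypothetical module from the sketch above

# A canned response keeps the example self-contained; a real case would call the model.
CANNED_RESPONSE = (
    "Mutual Non-Disclosure Agreement\n"
    "###CLAUSE### 1 Definitions\nConfidential Information means...\n"
    "###CLAUSE### 2 Obligations\nEach party shall protect Confidential Information...\n"
)

def test_draft_extracts_expected_clauses():
    doc = normalize(CANNED_RESPONSE)
    # Assertions are about structure and content, never about delimiters or formatting.
    assert doc.title == "Mutual Non-Disclosure Agreement"
    assert [c.number for c in doc.clauses] == ["1", "2"]
    assert "confidential information" in doc.clauses[1].body.lower()

def test_malformed_output_fails_at_the_boundary():
    with pytest.raises(NormalizationError):
        normalize("free-form text with no clause sections at all")
```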

Third, periodically test against at least one alternative provider, even if you never switch. This is the equivalent of a disaster recovery drill. If you run your evaluation suite against a second model every quarter and it fails completely, you know your abstraction layer is not thick enough. If it degrades gracefully, you know your switching cost is manageable. This practice also forces you to keep your prompts legible and your normalization logic explicit rather than letting them calcify around a specific model’s quirks.
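
One lightweight way to run such a drill, sketched below, is to execute the same evaluation suite once per configured provider and let an environment variable tell the adapter layer which one to call. The `DOC_AI_PROVIDER` variable, the provider names, and the `tests/eval` path are assumptions for illustration.

```python
# quarterly_drill.py -- run the same evaluation suite once per configured provider.
import os
import subprocess
import sys

PROVIDERS = ["primary_vendor", "fallback_vendor"]   # adapter registry keys (illustrative)

def main() -> int:
    failures = []
    for provider in PROVIDERS:
        # The eval suite reads the provider from an environment variable;
        # only the adapter layer ever looks at it.
        env = {**os.environ, "DOC_AI_PROVIDER": provider}
        result = subprocess.run([sys.executable, "-m", "pytest", "tests/eval"], env=env)
        if result.returncode != 0:
            failures.append(provider)
    print("drill failures:", ", ".join(failures) if failures else "none")
    return 1 if failures else 0

if __name__ == "__main__":
    raise SystemExit(main())
```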

Fourth, read the vendor’s funding situation. This sounds cynical, but it is practical. Many well-funded AI startups are burning cash aggressively on inference costs, betting on future model improvements to bring margins up. Some of them will not make it. A vendor running a specialized model on thin margins with a small customer base is a real dependency risk. This does not mean avoid specialized vendors. It means weigh that risk explicitly when you are deciding how tightly to couple your architecture to their specific output behavior.

The legal tech company eventually rebuilt with a cleaner architecture. They now maintain a thin adapter class for each AI provider they support, two of them currently, with a common interface that the rest of the system calls. Switching providers means swapping one adapter. The prompt tuning is still specific to each model, but the impact of that specificity is contained.
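
In code, that adapter boundary can be as small as a protocol plus one class per provider. A rough sketch, reusing the `normalize` and `DraftDocument` names from the earlier sketch; the injected client and its `complete` method are stand-ins for whatever SDK a given vendor ships.

```python
from typing import Protocol

class DraftModelAdapter(Protocol):
    """Common interface; the rest of the system only ever sees this and DraftDocument."""
    def draft(self, intake: dict) -> "DraftDocument": ...

class ProviderAAdapter:
    """One concrete adapter. Prompt wording and output quirks for this provider stay in here."""
    def __init__(self, client) -> None:
        self._client = client                     # the provider's SDK client, injected

    def draft(self, intake: dict) -> "DraftDocument":
        prompt = self._build_prompt(intake)       # model-specific prompt tuning lives here
        raw = self._client.complete(prompt)       # provider call; method name is illustrative
        return normalize(raw)                     # boundary validation from the earlier sketch

    def _build_prompt(self, intake: dict) -> str:
        fields = "\n".join(f"{k}: {v}" for k, v in intake.items())
        return f"Draft the document described by these intake answers:\n{fields}"
```

Adding or swapping a provider then means writing one more class that satisfies the same interface; nothing downstream changes.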

The incident cost them real money and customer trust. The lesson it forced on them made their product substantially more robust. The goal is to get the lesson without paying the tuition.