Deploying a machine learning model feels like a conclusion. You’ve trained it, evaluated it, maybe agonized over a few percentage points of accuracy, and now it’s live. But deployment is closer to the beginning of a model’s real life than the end of its development. Here’s what’s actually happening once your model hits production.
1. The Model Starts Encountering Data It Was Never Prepared For
Your training set was a snapshot. Real-world data is a river. The moment users start interacting with your model, they bring phrasing, contexts, edge cases, and combinations that your training data never included. This is called distribution shift, and it’s not a corner case. It’s the default.
A content moderation model trained on 2021 social media data will struggle with slang, memes, and references that emerged in 2023. A fraud detection model trained on one region’s transaction patterns will behave erratically when deployed to a market with different purchasing norms. The model doesn’t “know” it’s struggling. It just produces outputs, some of which are quietly wrong.
This is why accuracy on your holdout test set is better read as a ceiling on production performance, not a floor. The test set was drawn from the same distribution as your training data. The world wasn't.
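One concrete defense is to compare incoming feature distributions against a training-time snapshot. Below is a minimal sketch using a two-sample Kolmogorov-Smirnov test; the synthetic data, the `feature_drifted` helper, and the 0.01 threshold are all illustrative, and in practice you'd run a check like this per feature, on a schedule.

```python
# Minimal input-drift check: compare production feature values against a
# training-time snapshot with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values, prod_values, alpha=0.01):
    """Return True if the production distribution has likely shifted."""
    _statistic, p_value = ks_2samp(train_values, prod_values)
    return p_value < alpha

# Illustrative data: the production "river" has moved since training.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training snapshot
prod = rng.normal(loc=0.4, scale=1.2, size=5_000)   # last week's traffic
print(feature_drifted(train, prod))  # True: distribution shift detected
```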
2. Your Model Is Decaying, and Usually Nobody Has Set Up Alerts for It
Model decay (sometimes called model drift) is what happens when the world changes under a frozen model: the input distribution moves (data drift), or the relationship between inputs and correct outputs changes (concept drift). A recommendation model that performed well when user tastes were X will slowly degrade as tastes shift toward Y. This degradation is often gradual enough that no single bad day triggers an incident.
The insidious part is that most monitoring setups watch for infrastructure failures, latency spikes, and error rates. Those things are easy to instrument. Semantic degradation in model outputs is much harder. If your model is classifying things slightly worse each week but never throwing an exception, your dashboards look fine.
The solution is to instrument model-specific metrics: prediction confidence distributions, output class frequencies, and where possible, some form of ground truth feedback loop that lets you compare predictions to eventual outcomes. Without that, you’re flying blind in a way that your SRE dashboards won’t catch.
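To make that concrete, here's a sketch of what instrumenting the first two signals might look like: record each prediction's confidence and class, then compare windowed aggregates against a baseline. The class names, tolerance, and alerting mechanism are placeholders for whatever your stack provides.

```python
# Hedged sketch of model-specific monitoring: track prediction confidence
# and output class frequencies per window, compare against a baseline.
from collections import Counter
import numpy as np

class PredictionMonitor:
    def __init__(self, baseline_mean_confidence, baseline_class_freqs):
        self.baseline_mean = baseline_mean_confidence
        self.baseline_freqs = baseline_class_freqs  # e.g. {"fraud": 0.02, "ok": 0.98}
        self.confidences = []
        self.classes = Counter()

    def record(self, predicted_class, confidence):
        self.confidences.append(confidence)
        self.classes[predicted_class] += 1

    def check_window(self, tolerance=0.05):
        """Return warnings for the current window, then reset it."""
        warnings = []
        if self.confidences:
            mean_conf = float(np.mean(self.confidences))
            if abs(mean_conf - self.baseline_mean) > tolerance:
                warnings.append(f"mean confidence moved to {mean_conf:.3f}")
            total = sum(self.classes.values())
            for cls, baseline in self.baseline_freqs.items():
                observed = self.classes[cls] / total
                if abs(observed - baseline) > tolerance:
                    warnings.append(f"class {cls!r} frequency now {observed:.3f}")
        self.confidences.clear()
        self.classes.clear()
        return warnings
```

A dashboard fed by `check_window` won't tell you the model is wrong, but it will tell you the model is behaving differently, which is the earliest signal you can get without ground truth.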
3. The Model Is Being Interrogated by People You Didn’t Anticipate
You built a tool for use case A. Users will find use cases B through Z, some of which are clever, some of which are harmful, and most of which you never considered. This isn’t hypothetical. When OpenAI released GPT-3 via API, the range of applications that emerged within months included legal document summarization, dungeon-master style games, code generation, and therapeutic chatbots. None of those were the primary design intent.
For your model specifically, this means two things. First, your model will be used as a component in systems you have no visibility into. Someone will pipe its outputs into another automated system. Second, adversarial users will probe it. Not necessarily sophisticated attackers, just curious or motivated people who push on the boundaries. If your model has any surface area that can be exploited (and most do), production is where that gets discovered. The security posture you built in development will be tested seriously for the first time.
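One cheap mitigation is to make the contract at your own boundary explicit, so malformed or unexpected outputs fail loudly instead of flowing silently into systems you can't see. A minimal sketch, assuming a classifier with a fixed label set; `ALLOWED_LABELS` and the score range are hypothetical stand-ins for your actual contract.

```python
# Enforce an explicit output contract at the serving boundary so that
# downstream consumers never receive a malformed prediction silently.
ALLOWED_LABELS = {"approve", "review", "reject"}  # hypothetical label set

def validate_prediction(label: str, score: float) -> dict:
    """Fail loudly rather than pass a bad prediction downstream."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f"model emitted unknown label: {label!r}")
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return {"label": label, "score": round(score, 4)}
```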
4. Serving the Model Becomes Its Own Engineering Problem
A model that achieves great accuracy in a Jupyter notebook can be a nightmare to serve at scale. Inference latency, memory pressure, batching strategies, hardware provisioning: these become first-class concerns once real traffic arrives. A large transformer model that takes 400ms to return a response in a notebook might need to return in under 50ms under a user-facing SLA, which requires quantization (reducing numerical precision to shrink the model), hardware acceleration, or both.
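To make the quantization option concrete, here's a minimal sketch using PyTorch's dynamic quantization, which converts linear layers to int8 at load time. The toy model is a stand-in for whatever you actually serve, and the latency-versus-accuracy trade-off is something you have to measure on your own workload.

```python
# Dynamic quantization sketch: convert Linear layers to int8 for faster,
# smaller CPU inference. The model below is a toy placeholder.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)  # same interface, smaller and faster artifact
```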
The gap between “works in development” and “works under production load” is a recurring theme in software. For ML models, it’s especially pronounced because the model artifact itself is the bottleneck, not just the surrounding code. Teams often discover this late, because load testing an ML model requires realistic traffic simulation, which is hard to do before you have real users.
5. You Now Own an Artifact That’s Hard to Update
Software gets patched. Configs get changed. A model retraining cycle is slower and more expensive than a code deployment. If a production bug is in a function, you fix the function and redeploy in minutes. If a production failure is in model behavior, you might need weeks of new data collection, retraining, and re-evaluation before you can ship a fix.
This asymmetry shapes how teams should think about model versioning and rollback. You need to be able to revert to a previous model version quickly, which means keeping old versions around even when storage costs make that uncomfortable. You also need a deployment strategy (blue-green, canary, shadow mode) that lets you validate a new model version against real traffic before fully committing. Canary deployment for models isn't the same as canary deployment for stateless services, because a model's failure modes can be subtle enough that detecting them takes a statistically meaningful sample, not a single failed health check. Shadow mode, sketched below, is often the safest place to start.
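A minimal shadow-mode sketch: the incumbent serves every request, the candidate runs alongside without ever reaching users, and disagreements are logged for offline review. `incumbent`, `candidate`, and the logging setup are placeholders for your own serving stack.

```python
# Shadow mode: the candidate model runs on live traffic but is never
# user-facing; disagreements are logged for later analysis.
import logging

logger = logging.getLogger("shadow")

def serve(request, incumbent, candidate):
    live = incumbent(request)          # this is what the user sees
    try:
        shadow = candidate(request)    # never returned to the user
        if shadow != live:
            logger.info("disagreement: live=%r shadow=%r", live, shadow)
    except Exception:
        logger.exception("candidate failed in shadow mode")
    return live
```

Only once the disagreement rate and its character look acceptable does the candidate start taking a slice of real traffic.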
6. The Model Encodes a Moment in Time, and That Moment Has Already Passed
Every trained model is a compressed artifact of a specific slice of history. The language model that learned from text written before a major event won’t know about that event. The medical imaging model trained on data from one hospital network will encode that network’s equipment characteristics and labeling conventions. The world moves, and the model stays still.
This isn’t just a capability concern. It’s a values concern. A hiring model trained on historical promotion decisions will encode whatever biases were present in those decisions. The model doesn’t “intend” to be biased. It’s just doing what it was shaped to do, which is reproduce patterns from the past. Deploying such a model and treating its outputs as objective is a choice, and it’s a choice with consequences that tend to compound quietly.
The honest framing is that your mental model is what's broken when you think of a deployed model as a static, correct artifact rather than a time-stamped approximation with a shelf life. Treating deployment as a lifecycle event, rather than a destination, is the shift that separates teams who handle this well from teams who get surprised by it repeatedly.