Every developer has lived this story. Everything passes locally. CI goes green. The deploy looks clean. Then, somewhere between your staging environment and real users, something breaks in a way that makes no sense until suddenly it makes complete sense.
That production-only bug isn’t random. It’s a signal about gaps in how you model the world your software actually runs in. Here’s how to read it.
1. Your Test Environment Is a Polite Fiction
Local and staging environments are optimistic by design. Databases are small, network calls are fast, queues are empty, and nobody is hammering the system at 3am. Real production has none of those courtesies.
The classic version of this is a race condition (two operations competing for the same resource in a way that depends on timing) that only appears under real load. You can’t reproduce it locally because you’re the only user. Your tests serialize everything that production runs in parallel. When a bug only appears above a certain request volume, that’s your test suite telling you it has no concept of concurrency.
The fix isn’t just finding the race condition. It’s adding a test that actually exercises concurrent access, which usually means either load testing as a first-class practice or targeted tests that spin up multiple goroutines, threads, or async tasks and let them collide on purpose.
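As a minimal sketch of such a targeted test, the Python below starts several threads at once and lets them collide on one shared object. The Counter class and the hammer helper are hypothetical names invented for this illustration, not part of any framework.

```python
import threading


class Counter:
    """Hypothetical shared resource used only for this illustration."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment_unsafe(self):
        # Read-modify-write without a lock: another thread can write
        # between the read and the write, and one update is lost.
        current = self.value
        self.value = current + 1

    def increment_safe(self):
        # The lock makes the read-modify-write a single atomic step.
        with self._lock:
            self.value += 1


def hammer(method, threads=8, iterations=10_000):
    """Call `method` from many threads that all start at the same moment."""
    barrier = threading.Barrier(threads)

    def worker():
        barrier.wait()  # line every thread up so they really do overlap
        for _ in range(iterations):
            method()

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()


def test_safe_increment_loses_no_updates():
    counter = Counter()
    hammer(counter.increment_safe, threads=8, iterations=10_000)
    assert counter.value == 8 * 10_000
```

Pointing the same test at increment_unsafe may or may not fail on a given run, because lost updates depend on thread scheduling. That is why tests like this are usually repeated many times or paired with load testing rather than trusted after a single pass.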
2. You’re Testing the Happy Path and Calling It Coverage
Code coverage metrics are one of the more seductive lies in software development. Hitting 80% line coverage feels like rigor. But coverage tells you which lines were executed, not which states were exercised. A function can have 100% line coverage and still have untested behavior when inputs arrive in unexpected orders, shapes, or combinations.
Production-only bugs that trace back to edge-case inputs are almost always coverage theater. The function was “tested” but only with clean, expected data. Real users send you empty strings where you expected values, negative numbers where you expected positive ones, and ISO 8601 timestamps with UTC offsets your parser doesn’t handle.
Property-based testing (a technique where you define the shape of valid inputs and let the framework generate thousands of random examples) is genuinely good at finding these. Libraries like Hypothesis for Python or fast-check for JavaScript will find edge cases you’d never think to write by hand. If your test suite has no property-based tests and you keep seeing input-related production bugs, that’s the gap.
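To show the idea without any third-party dependency, here is a hand-rolled sketch that uses only Python’s standard library. The run-length encoder and the generator are hypothetical examples. With Hypothesis, the generator and loop would be replaced by the @given decorator and a strategy such as strategies.text(), which also shrinks a failing input to a minimal case.

```python
import random


def encode(text):
    """Run-length encode text as a list of (character, count) pairs."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def decode(runs):
    """Inverse of encode: expand each (character, count) pair."""
    return "".join(ch * count for ch, count in runs)


def random_text(rng, max_len=20):
    """Generate short strings from a tiny alphabet so repeated runs,
    whitespace, and the empty string all show up often."""
    alphabet = "ab \n0"
    length = rng.randint(0, max_len)
    return "".join(rng.choice(alphabet) for _ in range(length))


def check_round_trip(examples=1000, seed=0):
    """Property: decoding an encoded string returns the original string."""
    rng = random.Random(seed)  # seeded so a failure can be reproduced
    for _ in range(examples):
        text = random_text(rng)
        assert decode(encode(text)) == text, f"round trip failed for {text!r}"
    return examples
```

The property is stated once ("decode undoes encode") and checked against a thousand generated inputs, instead of against the three or four examples someone remembered to write down.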
3. Your Mocks Are Testing Your Assumptions, Not Your Code
Mocking external dependencies is necessary and good. But mocks encode assumptions about how external systems behave, and those assumptions drift. The third-party API you mocked two years ago now returns an additional field, or returns errors in a different shape, or has added rate limiting your mock silently ignores.
When a production bug traces back to an external system behaving differently than your mock, the lesson isn’t “mocks are bad.” The lesson is that mocks need to be validated against the real thing periodically. Contract testing (where both sides of an integration agree on a formal contract and test against it) is the structured answer here. Pact is a popular framework for this. Less formally, even running your mock-based tests against the real system in a scheduled integration job catches drift early.
The deeper point: a mock is a hypothesis about how something behaves. A production bug tells you your hypothesis was wrong.
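The less formal version, a scheduled job that compares the mock to the real system, can be sketched in a few lines of Python. This is not how Pact works. It is a hypothetical find_drift helper that compares the structure of a mocked response with a response captured from the real API, and the field names in the example are invented.

```python
def find_drift(mock, real, path="$"):
    """Compare two JSON-like values and report structural differences.

    Only dictionary keys and value types are compared, not the values
    themselves. List contents are not inspected in this sketch.
    """
    problems = []
    if isinstance(mock, dict) and isinstance(real, dict):
        for key in sorted(mock.keys() | real.keys()):
            child = f"{path}.{key}"
            if key not in real:
                problems.append(f"{child}: in mock, missing from real response")
            elif key not in mock:
                problems.append(f"{child}: in real response, missing from mock")
            else:
                problems.extend(find_drift(mock[key], real[key], child))
    elif type(mock) is not type(real):
        problems.append(
            f"{path}: mock has {type(mock).__name__}, "
            f"real has {type(real).__name__}"
        )
    return problems


# Hypothetical example: the real API changed its error shape and added a field.
mocked = {"id": 1, "error": "not found"}
captured = {
    "id": 1,
    "error": {"code": 404, "message": "not found"},
    "retry_after": 30,
}

for line in find_drift(mocked, captured):
    print(line)
```

Run on a schedule against a freshly captured response, a check like this fails as soon as the hypothesis in the mock stops matching reality, rather than when a user finds out.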
4. Configuration and Secrets Are Code Too
One category of production-only bug is almost embarrassingly simple: the code is fine, but the environment isn’t. Wrong environment variable, missing feature flag, a secret that exists in production but not staging. These feel like operational failures, not test failures, but that framing lets you off the hook too easily.
If configuration can break your application, configuration needs to be tested. That means tests that validate required environment variables are present at startup, tests that verify feature flag states against expected behavior, and ideally a deployment checklist that’s enforced rather than advisory. A service that fails silently when an environment variable is missing is a service that will eventually have a mysterious production outage.
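A minimal sketch of the startup check, in Python. The variable names DATABASE_URL and PAYMENT_API_KEY are placeholders, and load_config is a hypothetical function. The point is that the service refuses to start and names what is missing, instead of failing silently later.

```python
import os

# Hypothetical variable names; substitute whatever your service requires.
REQUIRED_VARS = ("DATABASE_URL", "PAYMENT_API_KEY")


class ConfigError(RuntimeError):
    """Raised at startup when required configuration is absent."""


def load_config(environ=None):
    """Return the required settings, or fail loudly listing what is missing.

    `environ` defaults to the process environment; tests pass a plain dict.
    """
    environ = os.environ if environ is None else environ
    missing = [
        name for name in REQUIRED_VARS
        if not environ.get(name, "").strip()  # unset or blank both count
    ]
    if missing:
        raise ConfigError(
            "missing required environment variables: " + ", ".join(missing)
        )
    return {name: environ[name] for name in REQUIRED_VARS}


def test_missing_variable_is_reported_by_name():
    try:
        load_config({"DATABASE_URL": "postgres://localhost/app"})
    except ConfigError as exc:
        assert "PAYMENT_API_KEY" in str(exc)
    else:
        raise AssertionError("expected ConfigError")
```

Because the environment is passed in as a dictionary, the same function can be exercised in a unit test and run once per target environment in the deploy pipeline.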
There’s also a subtler version: the code behaves differently in production because the configuration is different, and that difference is intentional but untested. If your production database has a different collation than your test database, string comparison behavior can diverge in ways that are genuinely hard to predict.
5. Time Is More Complicated Than You Think
Few things generate as many production-only bugs as time. Daylight saving transitions, leap years, leap seconds, timezone handling, date arithmetic at month boundaries, system clocks that aren’t synchronized. These almost never appear in local testing because developers tend to run tests at predictable times and don’t think to test at boundary conditions.
A concrete example: code that calculates whether a subscription is active by comparing now() to an expiration timestamp will behave differently depending on whether the comparison happens in UTC or the user’s local time. Tested against your own timezone, it works. Deployed to a server in a different region or serving users across timezones, it starts making wrong decisions around midnight.
The fix is injecting time as a dependency rather than calling now() directly, which makes your code testable at any point in time without waiting for the calendar to cooperate. Write tests that explicitly set the current time to DST transition points, month boundaries, and year rollovers. These bugs are not rare. They’re just rare enough that you only see them in production, where real time passes.
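Here is one way the injection might look in Python, using the subscription example from above. The function name is_subscription_active and the clock parameter are assumptions made for this sketch. The important part is that the test decides what "now" is.

```python
from datetime import datetime, timedelta, timezone


def utc_now():
    """Default clock: the current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)


def is_subscription_active(expires_at, clock=utc_now):
    """Return True while the current time is strictly before expires_at.

    Both datetimes must be timezone-aware, so the comparison is made on
    absolute instants rather than on whatever the server's local time is.
    """
    now = clock()
    if now.tzinfo is None or expires_at.tzinfo is None:
        raise ValueError("naive datetimes are ambiguous; pass aware values")
    return now < expires_at


def test_expiry_at_month_boundary_across_timezones():
    expires = datetime(2024, 3, 1, tzinfo=timezone.utc)

    # One second before expiry, on a leap day.
    just_before = datetime(2024, 2, 29, 23, 59, 59, tzinfo=timezone.utc)
    assert is_subscription_active(expires, clock=lambda: just_before)

    # 20:00 on Feb 29 at UTC-5 is already 01:00 on Mar 1 in UTC.
    utc_minus_5 = timezone(timedelta(hours=-5))
    local_evening = datetime(2024, 2, 29, 20, 0, tzinfo=utc_minus_5)
    assert not is_subscription_active(expires, clock=lambda: local_evening)
```

Production code calls the function with no clock argument and gets the real time. Tests pass a lambda that returns whichever boundary instant they want to examine.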
6. Your Tests Don’t Know What “Done” Looks Like for Async Work
A system that does work asynchronously (message queues, background jobs, event-driven pipelines) is fundamentally harder to test because the interesting behavior happens after the immediate call returns. Tests that don’t account for this either skip the async behavior entirely or use arbitrary sleeps to wait for it, both of which are wrong.
Skipping async testing means your tests prove the job was enqueued but not that it ran correctly. Arbitrary sleeps are just race conditions you’ve added on purpose. The result is a test suite that passes reliably in a fast CI environment but misses bugs in the actual job processing logic, which is exactly the logic that matters in production.
The correct pattern is deterministic waiting: polling for a state change up to a timeout, or using test frameworks that let you drain queues synchronously. In Rails, for example, perform_enqueued_jobs runs queued jobs inline during tests. Most queue frameworks have equivalents. If yours doesn’t, building a thin synchronous adapter for testing is worth the investment.
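For frameworks without such a helper, the polling variant is small enough to write by hand. The Python sketch below uses a hypothetical wait_until function and a background thread standing in for a real job runner. It waits only as long as needed and gives up at a fixed deadline.

```python
import threading
import time


def wait_until(predicate, timeout=2.0, interval=0.01):
    """Poll `predicate` until it returns True or `timeout` seconds pass.

    Returns True as soon as the condition holds, False on timeout.
    """
    deadline = time.monotonic() + timeout  # monotonic: immune to clock changes
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def run_job_in_background(results, delay=0.05):
    """Stand-in for an async job: records its result after a short delay."""
    def job():
        time.sleep(delay)
        results.append("processed")

    worker = threading.Thread(target=job)
    worker.start()
    return worker


def test_job_eventually_processes():
    results = []
    worker = run_job_in_background(results)
    # Assert on the outcome of the job, not merely that it was started.
    assert wait_until(lambda: results == ["processed"], timeout=2.0)
    worker.join()
```

Unlike a fixed sleep, the test finishes as soon as the job does, and a slow CI machine only risks failure if the job takes longer than the full timeout.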
7. Production Has State. Your Tests Start Fresh Every Time.
Databases accumulate years of varied data. Production state is messy, denormalized in places, and contains records that violate invariants you introduced long after those records were created. Test databases start clean with pristine fixtures.
This means tests that pass cleanly on fresh data can fail on real data because the real data has shapes your code doesn’t handle. A migration that adds a NOT NULL column without a default works fine on your empty test database and can fail, or hold a long lock, on the production table with ten million rows.
The practical response is twofold. First, include “dirty” data in your test fixtures: records that represent legacy states, partial migrations, and data that was valid under old rules. Second, treat database migrations as code that needs its own careful testing, including a run against a snapshot of the production data structure (not the data itself, just the schema and representative volumes) before deploying. Production-only bugs that trace back to data shape are almost always preventable with slightly less optimistic test data.
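Both halves of that advice can be combined in one small test. The Python sketch below uses the standard library’s sqlite3 module with an in-memory database. The users table, the migration, and the legacy row are hypothetical. The migration succeeds on pristine fixtures and fails once a single record that predates the rule is included.

```python
import sqlite3


def make_db(rows):
    """Create an in-memory database with a users table and the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany("INSERT INTO users (id, email) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def migrate_require_email(conn):
    """Hypothetical migration: rebuild users so that email is NOT NULL."""
    conn.execute(
        "CREATE TABLE users_new (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"
    )
    # Copying existing rows is the step that legacy data can break.
    conn.execute("INSERT INTO users_new (id, email) SELECT id, email FROM users")
    conn.execute("DROP TABLE users")
    conn.execute("ALTER TABLE users_new RENAME TO users")
    conn.commit()


def migration_succeeds(rows):
    """Run the migration against `rows`; report whether it completed."""
    conn = make_db(rows)
    try:
        migrate_require_email(conn)
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.close()


CLEAN_FIXTURES = [(1, "ada@example.com"), (2, "grace@example.com")]
# A record created before email became mandatory.
LEGACY_FIXTURES = CLEAN_FIXTURES + [(3, None)]

print(migration_succeeds(CLEAN_FIXTURES))
print(migration_succeeds(LEGACY_FIXTURES))
```

SQLite is used here only to keep the sketch self-contained. Constraint and locking behavior differs between database engines, so the real test belongs on the same engine and schema snapshot that production uses.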