The reproducible bug is almost a gift. You can write a test for it, fix it, verify the fix, close the ticket. Clean. The bug that shows up once in a production log at 2am on a Tuesday, then never again? That one gets marked “cannot reproduce” and quietly buried. This is a mistake, and it costs teams more than they admit.
My position is simple: the irreproducible bug deserves more attention than the reproducible one, not less. The reproducible bug is a solved problem waiting for implementation. The ghost bug is a signal that your mental model of the system is wrong somewhere, and every day you ignore it is a day that wrongness compounds.
Irreproducibility Is Information, Not Absence of Information
When a bug only appears in production and not on your laptop, that’s the system telling you something specific: the conditions that trigger this failure exist in production and not in your test environment. That’s actually a very precise statement. It narrows the problem to the delta between those two environments. Maybe it’s load, maybe it’s timing, maybe it’s a configuration value that differs, maybe it’s data that only exists at scale.
Consider a classic case: a race condition in connection pool management. On a developer machine with a single thread and a quiet database, you’ll never see it. Under production load with hundreds of concurrent requests, two threads occasionally try to check out the same connection in the same 50-microsecond window. The bug manifests as a cryptic exception, once a week, in a service that otherwise runs fine. Mark it “cannot reproduce” and move on. Three months later, traffic doubles, and that once-a-week bug becomes once-an-hour.
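To make the mechanism concrete, here's a minimal sketch of that shape of race in Python. The pool class and its names are hypothetical, and real pool implementations are more elaborate, but the check-then-act gap is the same:

```python
class NaivePool:
    """Toy connection pool with a check-then-act race.
    Hypothetical sketch, not any real library's code."""

    def __init__(self, conns):
        self._free = list(conns)

    def checkout(self):
        if not self._free:
            raise RuntimeError("pool exhausted")
        # BUG: the read and the delete below are two separate steps.
        # With free = [A, B]: thread 1 reads B, the scheduler switches,
        # thread 2 reads B and deletes it, then thread 1's delete removes
        # A. Both callers now hold B, and A silently vanishes from the
        # pool. The fix is to hold a threading.Lock across the method.
        conn = self._free[-1]
        del self._free[-1]
        return conn
```

On a quiet laptop the window between the read and the delete is effectively never hit. Under hundreds of concurrent checkouts it eventually is, which is exactly the once-a-week signature.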
The irreproducibility wasn’t telling you the bug was insignificant. It was telling you exactly what conditions make it surface.
The Conditions That Hide Bugs Are the Conditions That Matter
Most serious production failures don’t happen because code is wrong in isolation. They happen because code is wrong under real conditions: real load, real data distributions, real network latency, real user behavior sequences. Your staging environment, no matter how carefully maintained, is a simplified model of reality.
This is why microbenchmarks don’t always predict real application performance. The same principle applies to bug reproduction. A bug that only manifests under load isn’t a minor edge case, it’s a load-dependent failure mode. Those are exactly the bugs that kill systems at the worst possible time, because production traffic tends to spike at the moments when you can least afford an incident.
The bugs you can reproduce reliably in development are, almost by definition, bugs in the part of the system’s behavior space you’ve already thought about. The ones you can’t reproduce live in the parts you haven’t.
Ignoring It Is a Form of Technical Debt with Interest
There’s a useful analogy to compiler warnings. A warning that fires every build becomes invisible noise. A warning you’ve consciously suppressed is a decision you’ve deferred. Both are fine as long as you’re right that the warning is harmless. When you’re wrong, the cost isn’t linear, it’s a cliff. Deferred bugs work the same way.
When a team marks something “cannot reproduce” without genuinely investigating the environmental gap, they’re not eliminating the bug. They’re eliminating the ticket. The bug continues to exist in the running system while the context needed to understand it erodes. The engineer who vaguely remembers seeing it leaves the team. The logs that captured it age out of the retention window. The service gets refactored and the bug migrates to a new home where nobody even has the historical context to connect it.
Deferred understanding is the most dangerous form of technical debt, because it doesn’t show up on any balance sheet until the moment it fails catastrophically.
The Counterargument
The reasonable objection is resource allocation. You have a finite number of engineer-hours, a backlog of known, reproducible, clearly scoped bugs, and a ghost that might be a one-in-a-million cosmic ray bit flip. Spending days chasing it is how you fall behind on everything else.
This is a real tension, and I don’t want to dismiss it. Not every ghost bug warrants a week-long investigation. The triage question isn’t “can I reproduce this?” but “if this is real and I’m wrong about why, what’s the blast radius?” A cryptic log line in a non-critical reporting job is different from a cryptic log line in your payment processing service. The irreproducibility doesn’t change the risk calculus, it raises the uncertainty around it.
What I’m arguing against is the habit of using “cannot reproduce” as a terminal state. It should be a starting point for environmental analysis, not a filing cabinet.
The Practice
In concrete terms: when you can’t reproduce a bug, the first question is what differs between where it happened and where you’re looking. Log more. Get production-like data into your test environment. Write a failing test that captures the conditions you think were present, even if you can’t make it fail yet. The test documents your hypothesis. If it never fires, maybe you were right to deprioritize. If it fires six months later, you have a starting point.
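As a sketch of what that hypothesis test could look like, reusing the NaivePool toy from earlier (the module path and ticket id are hypothetical):

```python
import threading

from myapp.pool import NaivePool  # hypothetical module path


def test_checkout_never_hands_out_duplicates():
    """Hypothesis for ticket GHOST-142 (hypothetical id): the once-a-week
    production exception is a checkout race. This test encodes the theory.
    It will rarely fail today; if it ever does, we have a reproduction."""
    pool = NaivePool(range(8))
    handed_out = []
    record_lock = threading.Lock()

    def worker():
        try:
            conn = pool.checkout()
        except Exception:
            return  # exhaustion is expected here; duplicates are the bug
        with record_lock:
            handed_out.append(conn)

    threads = [threading.Thread(target=worker) for _ in range(16)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    assert len(handed_out) == len(set(handed_out)), (
        f"same connection checked out twice: {handed_out}"
    )
```

Whether this runs on every commit or in a nightly stress job is a judgment call. The point is that the hypothesis now lives in the repository instead of in one engineer’s memory.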
The bugs that expose the deepest flaws in your system’s design are almost always the ones that only appear in the wild. They’re not hiding because they’re unimportant. They’re hiding because you haven’t yet built the environment where truth is visible.
That’s a systems problem worth solving.