Your Tests Are Passing and Your Software Is Broken

A green test suite is not proof that software works. It’s proof that software works the way you imagined it would. Those are two different things, and confusing them is how production incidents happen on a Tuesday morning despite a CI pipeline that showed 847 passing tests.

The testing orthodoxy has convinced engineers that coverage is a proxy for correctness. It isn’t. Coverage measures which lines of code a test touched. It says nothing about whether the test asked a meaningful question, whether the system behaves correctly under realistic conditions, or whether the thing you built actually solves the problem a user has. The green checkmark is a measurement of effort, dressed up as a measurement of quality.

Tests confirm your assumptions, not reality

Every test is a formalization of what the author believed the code should do. When a developer writes both the implementation and the test, the test is almost always a mirror of the implementation’s assumptions, not an independent check on them. This is the core problem. You can have 100% branch coverage on a payment processing module and still ship a bug that charges users twice, if neither the code nor the tests ever modeled the case where a network timeout causes a retry on an already-completed transaction.
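A minimal sketch of that double-charge scenario, in Python. Everything here is hypothetical and exists only for illustration: `FakeGateway`, `pay`, and the three tests are not taken from any real payment library. The tests exercise every branch of `pay` and all pass, yet none of them models a timeout that arrives after the charge has already been committed.

```python
class FakeGateway:
    """Hypothetical in-memory stand-in for a payment gateway."""

    def __init__(self, timeouts=0, commit_before_timeout=False):
        self.charges = []          # every charge that was actually committed
        self.timeouts = timeouts   # how many upcoming calls will time out
        self.commit_before_timeout = commit_before_timeout

    def charge(self, user_id, amount_cents):
        if self.timeouts > 0:
            self.timeouts -= 1
            if self.commit_before_timeout:
                # The charge went through, but the response was lost.
                self.charges.append((user_id, amount_cents))
            raise TimeoutError("no response from gateway")
        self.charges.append((user_id, amount_cents))


def pay(gateway, user_id, amount_cents, retries=1):
    """Charge a user, retrying on timeout.

    Hidden assumption: a timeout means the charge did not happen.
    """
    for _ in range(retries + 1):
        try:
            gateway.charge(user_id, amount_cents)
            return True
        except TimeoutError:
            continue
    return False


def test_success():
    gateway = FakeGateway()
    assert pay(gateway, "u1", 500) is True
    assert len(gateway.charges) == 1


def test_retry_after_timeout():
    gateway = FakeGateway(timeouts=1)
    assert pay(gateway, "u1", 500) is True
    assert len(gateway.charges) == 1


def test_gives_up_when_gateway_is_down():
    gateway = FakeGateway(timeouts=5)
    assert pay(gateway, "u1", 500) is False
    assert gateway.charges == []


# All three tests pass, and together they exercise every branch of pay().
for test in (test_success, test_retry_after_timeout, test_gives_up_when_gateway_is_down):
    test()

# The case nobody modeled: the timeout fires after the charge was committed.
gateway = FakeGateway(timeouts=1, commit_before_timeout=True)
pay(gateway, "u1", 500)
print(len(gateway.charges))  # 2: the user was charged twice
```

The tests and the implementation share the same belief about what a timeout means, so the tests cannot contradict it. One common remedy, not shown here, is an idempotency key that lets the gateway recognize a retried request.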

The Knight Capital incident in 2012 is one of the clearest examples of this at scale. The firm’s automated trading system accumulated positions of roughly $7 billion in about 45 minutes because a deployment error activated old code. Its tests could plausibly all have been passing. The failure wasn’t in logic the tests could see. It was in the interaction between deployment state and live market conditions, a category of failure that unit tests are structurally unable to catch.

[Figure: diagram contrasting the orderly world of test environments with the complexity of production systems. Caption: Tests model a simplified version of reality. Production is the rest of it.]

The gap between the test environment and production

Most test suites run against a sanitized version of the world. Mocked dependencies, fixed timestamps, predictable data volumes, controlled concurrency. Production is none of those things. Production has users doing things in the wrong order. It has third-party APIs that respond slowly, or not at all. It has database rows with values that were valid under a schema from three years ago.

This isn’t an argument against mocking or controlled environments. Those are necessary. The argument is that tests written entirely within that sanitized world can only validate behavior within that sanitized world. The question “does this work in production” is almost entirely unanswered by a passing test suite. Worse, the confidence engineers derive from green status leads them to underinvest in the practices that would actually catch production failures: integration tests against real dependencies, load tests, chaos engineering, and, most importantly, observability that tells you what’s happening after you’ve shipped.
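To make the sanitized-world point concrete, here is a small Python sketch. The function `display_name`, the fixture, and the legacy row are all hypothetical, invented for illustration. The test passes against a fixture shaped like today's schema, while a row of the kind that might survive from an older schema breaks the same code.

```python
def display_name(row):
    """Hypothetical formatter; assumes both name fields are always strings."""
    return f"{row['first_name'].strip()} {row['last_name'].strip()}"


def test_display_name():
    # Fixture written against today's schema, where last_name is never NULL.
    fixture = {"first_name": " Ada ", "last_name": "Lovelace"}
    assert display_name(fixture) == "Ada Lovelace"


test_display_name()  # passes

# A row of the kind that could survive from an older schema that allowed NULL.
legacy_row = {"first_name": "Ada", "last_name": None}
try:
    display_name(legacy_row)
except AttributeError as exc:
    print(f"fails on legacy data: {exc}")
```

The test is not wrong. It simply answers a question about the fixture, not about the data that actually sits in the production database.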

High coverage can actively make things worse

Coverage targets, especially when enforced in CI pipelines, produce a specific pathology: tests that exist to satisfy the counter rather than to find bugs. A test that calls a function and asserts only that it returned without raising an exception can be worse than no test. It consumes time to write and maintain, it creates false confidence, and it crowds out the mental space for tests that ask hard questions.

Teams that chase coverage metrics tend to write shallow tests quickly. Teams focused on failure modes write fewer tests, but tests that encode genuine understanding of where the system could break. The latter approach tends to find more bugs. It also produces a test suite that still means something after six months of feature additions.

The counterargument

The standard response to this argument is that imperfect tests are better than no tests, and that’s true. Testing discipline, even when applied mechanically to coverage targets, builds habits that catch real bugs. The engineer who writes a test for every function is more likely to think carefully about edge cases than the engineer who writes none. Unit tests, done well, do catch logic errors and prevent regressions. The testing orthodoxy exists for reasons.

But the argument here isn’t against testing. It’s against treating test passage as a signal of correctness rather than a signal of internal consistency. These are compatible positions. You can maintain rigorous testing practices while being clear-eyed that those practices are a floor, not a ceiling, and that a green pipeline is not clearance to stop thinking about how the system can fail.

What actually reduces production failures

The organizations with the best reliability records don’t just have good tests. They have good observability. They track error rates, latency distributions, and anomalies in production behavior in real time. They run game days where they deliberately break things to see how systems respond. They do code review focused on failure modes, not just correctness. They treat a production incident as a signal that their mental model of the system was wrong, not just that someone wrote a bad line of code.

The shift is from asking “did the tests pass” to asking “how will I know when this is broken in production.” Those are different questions, and only one of them is actually about the system’s behavior in the world.
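As a rough illustration of what answering that second question can look like, here is a minimal Python sketch of a sliding-window error-rate check. The class `ErrorRateMonitor`, its parameters, and the thresholds are assumptions made for this example. A real system would normally rely on a metrics and alerting stack rather than an in-process counter.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the error rate over the most recent `window` requests."""

    def __init__(self, window=100, threshold=0.05):
        self.outcomes = deque(maxlen=window)  # True = success, False = error
        self.threshold = threshold

    def record(self, ok):
        self.outcomes.append(ok)

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_alert(self):
        # Wait for a full window so a single early failure does not page anyone.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.error_rate() > self.threshold


monitor = ErrorRateMonitor(window=10, threshold=0.2)
for ok in [True] * 7 + [False] * 3:
    monitor.record(ok)

print(monitor.error_rate())    # 0.3
print(monitor.should_alert())  # True
```

Unlike a test, this check makes no assumption about why requests fail. It only observes that they do, which is what lets it catch failures nobody thought to model.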

A passing test suite is evidence. It is not proof. Treating it as proof is how you end up debugging a production outage at midnight while your CI dashboard is entirely green.