Code review is one of the few software engineering practices that has near-universal buy-in. It’s taught in bootcamps, required at most serious companies, and cited as a pillar of code quality. The problem is that the evidence for its bug-catching effectiveness is much weaker than its reputation suggests, and the reasons why are worth understanding.

1. Reviewers Spend Most of Their Attention on the Wrong Things

Studies of code review behavior consistently show that reviewers spend a disproportionate amount of time on surface-level issues: naming conventions, formatting, comment style. A 2013 study by Alberto Bacchelli and Christian Bird at Microsoft found that code reviewers and authors both ranked “finding defects” as the primary purpose of review, but when researchers actually categorized review comments, defect-finding comments were a small minority. Most comments addressed code style, understandability, and refactoring suggestions.

This isn’t because reviewers are bad at their jobs. It’s because style violations are immediately visible and easy to articulate, while logic errors require reconstructing the author’s intent, tracing execution paths, and holding a mental model of the surrounding system simultaneously. The brain takes the path of least resistance. Linters and formatters have largely automated the style problem, which should free reviewers for the harder work, but in practice many review conversations still revolve around whitespace and naming.

2. The Optimal Review Size Makes Most Real Reviews Ineffective

SmartBear’s analysis of data from Cisco’s code review project (published in “Best Kept Secrets of Peer Code Review”) found that reviewer effectiveness drops significantly beyond around 200–400 lines of code. After 400 lines, defect density in review comments falls off sharply, meaning reviewers start missing things at a rate that compounds as the diff grows.

Now look at the pull requests in your own repositories. Many teams routinely merge PRs with 800, 1,200, or 2,000 lines changed. At that scale, reviewers aren’t really reviewing the whole thing. They’re skimming, anchoring on the parts they happen to understand, and rubber-stamping the rest with an “LGTM” because saying “I didn’t really read this” feels professionally embarrassing. The institution of code review is intact, but the actual inspection has quietly stopped happening.
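Some teams respond by enforcing a size budget in CI. A minimal sketch of that idea, operating on the output of `git diff --numstat` (the 400-line threshold and the function names here are assumptions, not a prescription):

```python
# Flag oversized diffs before review, based on `git diff --numstat` output.
# Each numstat line is "<added>\t<deleted>\t<path>"; binary files show "-".

REVIEW_BUDGET = 400  # lines changed; an assumed threshold, tune per team

def total_lines_changed(numstat):
    """Sum added + deleted lines across a numstat report."""
    total = 0
    for line in numstat.strip().splitlines():
        added, deleted, _path = line.split("\t", 2)
        if added != "-":  # skip binary files, which report "-"
            total += int(added) + int(deleted)
    return total

def review_warning(numstat):
    """Return a warning string if the diff exceeds the budget, else None."""
    changed = total_lines_changed(numstat)
    if changed > REVIEW_BUDGET:
        return f"{changed} lines changed exceeds the {REVIEW_BUDGET}-line review budget"
    return None
```

In CI this would typically run against something like `git diff --numstat origin/main...HEAD`; whether the check fails the build or just labels the PR, it forces a conversation about splitting the change rather than skimming it.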

[Figure: funnel diagram showing how code review filters surface issues but allows more serious defects to pass through.]
Code review selects against the bugs that are easiest to see, not necessarily the ones that matter most.

3. Expertise Mismatch Kills Bug Detection

A reviewer can only catch bugs in code they understand well enough to have an opinion about. This is obvious when stated directly, but teams routinely violate it. A frontend engineer reviews a database query optimization. A junior developer reviews a cryptographic implementation. A backend specialist reviews a React component’s state management.

The review looks fine on paper. A qualified engineer approved it. But the approval was essentially a social courtesy, not an inspection. This is a Dunning-Kruger problem: reviewers who don’t know a domain well also don’t know what questions to ask, so they don’t ask them. The bugs that slip through code review most often are precisely the bugs that require specialized knowledge to spot, which is also why they tend to be the most serious ones.

4. Confirmation Bias Runs the Show

When you read code that someone else wrote, you’re not running the code in your head from scratch. You’re reading what’s there and checking whether it matches a plausible version of what should be there. These are very different cognitive tasks, and the second one is much easier to fake.

This is a well-documented phenomenon in proofreading (you read what you expect, not what’s written) and it applies directly to code review. If the code looks roughly like what you’d expect a solution to look like, your brain fills in the gaps and marks it correct. Off-by-one errors, wrong comparison operators, and subtle race conditions are notoriously hard to catch in review because they look almost identical to the correct version. The code pattern matches your expectation, so your brain accepts it.
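A contrived illustration of the pattern-matching failure (both function names are hypothetical). The two versions below differ by a single token, and every line of the buggy one resembles a plausible correct line:

```python
def last_index(items, target):
    """Correct: return the zero-based index of the last match, or -1."""
    for i in range(len(items) - 1, -1, -1):
        if items[i] == target:
            return i
    return -1

def last_index_buggy(items, target):
    """Off by one: mixes a one-based loop with a zero-based return.
    Each line pattern-matches a plausible correct version, so a
    skimming reviewer's brain fills in the gap and approves it."""
    for i in range(len(items), 0, -1):
        if items[i - 1] == target:
            return i  # should be i - 1
    return -1
```

A unit test on the boundary catches this in seconds; a reviewer reading the diff against their expectation of what the loop should look like frequently does not.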

5. Time Pressure Converts Reviews Into Approvals

Code review is typically treated as an interruption cost rather than a primary task. Reviewers are context-switching from their own work to read someone else’s, usually without dedicated time blocks for it. The social pressure to not be a blocker is real and significant. Pull requests that sit open for more than a day generate friction. Team leads check the “open PRs” list. The author pings you on Slack.

Under this pressure, review quality degrades in a predictable way: reviewers look for reasons to approve rather than reasons to reject. They find one substantive comment to make, post it to demonstrate engagement, and approve conditional on that fix. The defects that survive to production are often the ones that would have required the reviewer to push back firmly and ask for significant rework. That requires social capital that most engineers are reluctant to spend on a Tuesday afternoon.

6. Code Review Was Never Designed for Distributed Systems Bugs

The class of bugs that causes the most serious production incidents has shifted. Twenty years ago, a careful line-by-line review caught a meaningful fraction of real bugs because many real bugs were logic errors in self-contained functions. The kind of software most teams build now fails at the boundaries: network partitions, race conditions between services, cache invalidation, distributed transaction failures, subtle API contract violations.

These bugs don’t live in a single diff. They live in the interaction between this PR and a service written six months ago by someone who’s no longer on the team. No amount of staring at the new code will reveal them, because the new code isn’t wrong in isolation. Code review as a practice was developed for a different category of problem. It hasn’t kept pace with where the hard problems now live.
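A minimal sketch of why the diff can’t show the bug (all names here are hypothetical, and a dict stands in for state shared across services). Each function below would pass review on its own; the failure only exists in the interleaving:

```python
# Shared state, standing in for a record split across two services.
balance = {"amount": 100}

def can_withdraw(amount):
    """Service A's check -- correct in isolation."""
    return balance["amount"] >= amount

def apply_withdrawal(amount):
    """Service B's write -- correct in isolation."""
    balance["amount"] -= amount

def withdraw(amount):
    """The composition is a check-then-act race across a service
    boundary: between the check and the write, another caller's
    write can land."""
    if can_withdraw(amount):
        apply_withdrawal(amount)
        return True
    return False

# Deterministically interleave two withdrawals the way a network would:
# both checks pass before either write lands.
ok_a = can_withdraw(80)
ok_b = can_withdraw(80)   # also True -- A's write hasn't landed yet
if ok_a:
    apply_withdrawal(80)
if ok_b:
    apply_withdrawal(80)  # balance is now -60; the invariant is gone
```

Neither `can_withdraw` nor `apply_withdrawal` is wrong, and a PR touching only one of them gives the reviewer nothing to object to. The defect lives in the protocol between them.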

What This Actually Means

None of this argues for eliminating code review. It genuinely does improve code quality, catches some defects, spreads knowledge, and creates shared ownership. But treating it as a primary bug-prevention mechanism is a mistake that creates false confidence.

The practices that actually catch the bugs code review misses are automated testing (especially integration and property-based tests), static analysis, fuzzing for security-critical code, and structured architecture review that looks at component interactions rather than individual diffs. Code review works best when it’s doing what reviewers are actually good at: understanding intent, questioning assumptions, and flagging design problems before they calcify. Developers who write comments documenting their own intent understand this instinctively: intent that is written down is intent a reviewer can actually inspect, and that is where review adds genuine value.
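To make the property-based idea concrete, here is a minimal hand-rolled sketch; a real project would use a library such as Hypothesis, and `dedupe` is a hypothetical function under test. Instead of a few hand-picked cases, it asserts invariants over hundreds of random inputs:

```python
import random

def dedupe(items):
    """Function under test: drop duplicates, keep first-occurrence order."""
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def check_properties(trials=500, seed=42):
    """Assert invariants on many random inputs rather than chosen examples."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        items = [rng.randrange(10) for _ in range(rng.randrange(20))]
        result = dedupe(items)
        assert len(result) == len(set(result))  # no duplicates survive
        assert set(result) == set(items)        # nothing lost, nothing invented
    return True
```

The point is the shape of the check: a reviewer eyeballing `dedupe` verifies one mental execution; the property check verifies five hundred real ones, including the empty and single-element inputs reviewers habitually skip.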

Stop treating the approval stamp as quality assurance. Treat it as one weak signal among many, and build your quality systems accordingly.