In 2023, a mid-sized fintech company (one that processes payments for small businesses, the kind of company with about forty engineers and strong opinions about Postgres) made a decision that seemed obviously correct at the time. They rolled out GitHub Copilot across the entire engineering org. Within two quarters, they were shipping features noticeably faster. Pull request volume climbed. Onboarding time for new engineers dropped. The metrics looked great.

Eighteen months later, their most experienced backend engineer sat facing a conference room whiteboard and could not explain, with confidence, how their own transaction reconciliation service worked.

This is not a story about AI being bad. It’s a story about a subtle and underappreciated cost that doesn’t show up in your velocity metrics.

The Setup

The reconciliation service was critical infrastructure. It matched incoming payment confirmations against ledger entries, flagged discrepancies, and triggered alerts. It had been around for about three years, originally written by two engineers who both left the company in 2022. In the eighteen months after the team adopted Copilot, the service absorbed hundreds of small AI-assisted changes: edge case handling, retry logic tweaks, new fee calculation branches, patches to deal with quirks from specific payment processors.

Each individual change made sense in isolation. The AI suggestions were, by most measures, good. They were syntactically correct, they passed tests, and they solved the immediate problem the engineer was looking at. But nobody was authoring these changes in the way you author code when you have to hold the whole system in your head. Engineers were accepting suggestions, verifying the local behavior, and moving on. The cognitive work shifted from “design this” to “validate this,” which feels like the same thing but absolutely isn’t.

The reconciliation service grew into what the team privately started calling a “black box with tests.” They had reasonable test coverage. They had no mental model.

[Figure: Authoring code builds a mental model. Accepting suggestions borrows one. They are not the same thing.]

What Happened

The reckoning came when a payment processor the company worked with changed their confirmation message format. A field that had previously been a Unix timestamp was now an ISO 8601 string. Small change. The kind of thing you fix in twenty minutes if you understand your system.

The fix took the team four days, and a subset of transactions was reconciled incorrectly during that window. Not because the fix was hard, but because nobody could trace with confidence where timestamps were being parsed, transformed, stored, and compared across the service. They had to archaeologically reconstruct the data flow from code that had been assembled in pieces over eighteen months, by multiple engineers, many of whom were accepting AI suggestions they understood locally but not holistically.
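To make the timestamp problem concrete, here is a minimal Python sketch of the kind of structure that keeps a format change like this a twenty-minute fix: every processor timestamp is normalized in one place, at the edge of the service, and everything downstream only ever sees timezone-aware UTC datetimes. This is not the team's code. The function name is hypothetical, and treating ISO 8601 strings without an offset as UTC is an assumption you would need to confirm with the processor.

```python
import re
from datetime import datetime, timezone
from typing import Union

# Matches plain epoch seconds such as "1704067200" or "1704067200.5".
_EPOCH_PATTERN = re.compile(r"[0-9]+(\.[0-9]+)?")


def normalize_confirmation_timestamp(raw: Union[int, float, str]) -> datetime:
    """Convert a processor timestamp into a timezone-aware UTC datetime.

    Accepts Unix epoch seconds (number or numeric string) or an ISO 8601 string.
    Hypothetical helper: the single place where timestamp parsing happens.
    """
    if isinstance(raw, bool):
        # bool is a subclass of int; reject it explicitly.
        raise TypeError("timestamp must be a number or a string, not bool")

    if isinstance(raw, (int, float)):
        return datetime.fromtimestamp(raw, tz=timezone.utc)

    if isinstance(raw, str):
        text = raw.strip()
        if _EPOCH_PATTERN.fullmatch(text):
            return datetime.fromtimestamp(float(text), tz=timezone.utc)

        # datetime.fromisoformat rejects a trailing "Z" before Python 3.11.
        if text.endswith("Z"):
            text = text[:-1] + "+00:00"
        parsed = datetime.fromisoformat(text)
        if parsed.tzinfo is None:
            # Assumption: strings without an offset are already in UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)

    raise TypeError(f"unsupported timestamp type: {type(raw).__name__}")
```

With a boundary like this, the old and new formats resolve to the same value, so `normalize_confirmation_timestamp(1704067200)` and `normalize_confirmation_timestamp("2024-01-01T00:00:00Z")` compare equal. The point is less the parsing itself than the fact that there is exactly one function to find when the format changes.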

The incident post-mortem was honest. The team wrote, in their own words: “We have lost the thread of this service.”

They weren’t blaming the AI tool. They were describing an organizational failure that the tool had enabled. There’s a meaningful difference.

Why This Happens

Code comprehension is not a passive side effect of writing code. It’s built through the act of authoring, through making decisions about structure, naming, and flow, through the friction of having to type out a solution and feel its awkwardness. When you write a function, you’re also building a mental representation of it. When you accept a suggestion, you’re borrowing someone else’s representation without necessarily internalizing it.

This is not a new problem. Senior developers have watched it happen with Stack Overflow copying for years. But the scale and fluency of modern AI assistance change the dynamic significantly. Copilot and its contemporaries don’t just give you snippets. They draft entire functions, propose architectural patterns, fill in the middle of your logic. The suggestions are coherent and often idiomatic. They’re easy to accept because they look right. And because they look right, your skepticism lowers.

There’s a reasonable argument that this is fine for boilerplate: CRUD endpoints, serialization code, standard middleware setup. That code was never the repository of deep understanding anyway. The problem is that AI assistance doesn’t cleanly stay in the boilerplate lane. It bleeds into business logic, into state management, into the subtle conditional branches that encode actual product decisions.

When those pieces are AI-generated and locally validated rather than deliberately authored, you end up with a codebase that works but isn’t understood. Tests pass. The system runs. But the team’s ability to reason about the system under novel conditions, exactly the ability you need during incidents, degrades.

This connects to something broader about how AI tools change the nature of engineering work. The engineer who writes less code is often worth more precisely because understanding, not output, is the hard part. AI copilots increase output dramatically. The question is what they do to understanding.

What We Can Learn

The fintech team’s response is instructive. They didn’t ban Copilot. They introduced what they called “comprehension reviews,” a layer added to their code review process where the author had to explain, in plain language, the intent and mechanism of any AI-assisted code touching core services. Not the syntax. The reasoning. Why does this branch exist? What invariant does this function protect? What happens when this condition is false?

This sounds like overhead. It is overhead. But it’s overhead that forces engineers to rebuild the mental models that AI assistance short-circuits. The engineers who couldn’t answer the comprehension questions had to go find out, which is exactly the learning that the accept-and-move-on workflow had been skipping.

They also started maintaining what they called “understanding documents” for critical services: not architecture diagrams (those go stale), but narrative explanations of the core decisions and data flows, written by whoever last understood the system, updated whenever that understanding changed significantly. Think of it as a README that a new team member could use to build a mental model before reading a single line of code.

A few practical things that generalize from their experience:

Treat AI assistance differently for different code surfaces. Generated boilerplate in a route handler is low risk. Generated logic in a financial calculation or a state machine is high risk. Your review process should reflect this asymmetry.

Slow down on business logic. When Copilot suggests something in the core domain of your product, the right move is usually to read it, throw it away, and write it yourself. Use the suggestion as a reference, not a draft. The understanding you build by authoring is worth more than the minutes you save.

Make understanding a first-class artifact. Tests verify behavior. Documentation describes behavior. Neither of these is the same as a human being able to reason about behavior under novel conditions. Invest in that explicitly, not as documentation theater but as actual knowledge transfer and retention.
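The first of these points, combined with the comprehension reviews described earlier, can be sketched as a small check. What follows is a rough Python illustration, not the fintech team's tooling: the path patterns, the tier names, and the `Change` record (including how a change gets marked as AI-assisted) are all hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List

# Hypothetical path patterns mapped to risk tiers; the first match wins.
RISK_TIERS = [
    ("services/reconciliation/*", "high"),  # core financial logic
    ("services/*/fees/*", "high"),          # fee calculations
    ("api/routes/*", "low"),                # mostly boilerplate handlers
]
DEFAULT_TIER = "medium"


@dataclass
class Change:
    """One changed file in a pull request (hypothetical record)."""
    path: str
    ai_assisted: bool
    explanation: str = ""  # plain-language intent and mechanism from the author


def risk_tier(path: str) -> str:
    """Return the risk tier for a file path, falling back to DEFAULT_TIER."""
    for pattern, tier in RISK_TIERS:
        if fnmatchcase(path, pattern):
            return tier
    return DEFAULT_TIER


def needs_comprehension_review(change: Change) -> bool:
    """AI-assisted changes to high-risk code require a written explanation."""
    return change.ai_assisted and risk_tier(change.path) == "high"


def blocking_changes(changes: List[Change]) -> List[Change]:
    """Changes that need a comprehension review but have no explanation yet."""
    return [
        c for c in changes
        if needs_comprehension_review(c) and not c.explanation.strip()
    ]
```

A check like this only enforces that an explanation exists. Whether the explanation shows real understanding is still a judgment call for the human reviewer, which is the part that matters.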

The fintech team is doing fine now. Their reconciliation service has been substantially refactored by engineers who actually understand it. The process was painful and took about two months of careful work. They estimate they’ll have to do it again in a couple of years if they’re not deliberate about it.

The AI copilots are getting better, and that’s real. But a tool that can generate code faster than you can comprehend it isn’t purely a productivity multiplier. It’s also a comprehension risk. The teams that will do best with these tools are the ones that recognize the difference between shipping code and owning it.