
It needs to be said: AI can’t fully replace human code review.

Not now, and likely not ever. And this is coming from someone who spends late nights building a leading AI code review tool. Despite the progress, I don't ever see these tools becoming a stand-in for an actual human engineer signing off on a pull request.

There are more code-generation tools out there than I can count on two hands, and as a result, we're seeing a rise in AI-powered code review tools. Just open a new tab, type "AI code review" into Google, and you'll find everything from small experimental side projects to large enterprise solutions promising to "revolutionize how teams ship software."

But the dual explosion of AI code generation, hand in hand with AI code review, has me worried.


Note

Greg spends full workdays writing weekly deep dives on engineering practices and dev-tools. This is made possible because these articles help get the word out about Graphite. If you like this post, try Graphite today, and start shipping 30% faster!


One of my main gripes with AI replacing code review comes from a fundamental difference between code creation and code review. Code creation can be evaluated quickly: if I ask an LLM to generate a function, or spin up a little web app, I can then compile, test, or even run that code. ChatGPT, Claude, and Copilot can generate code and tests, and then you can run them in a tight iteration loop. That cycle is self-reinforcing, so code generation is fast and easy to verify. It’s this “prompt and run” loop which is fueling the rise of vibe-coding.

Code review, however, is a different beast. Go to any GitHub pull request, add .diff at the end of the URL, copy-paste that raw diff into your favorite LLM, and see what it has to say.

For example: https://patch-diff.githubusercontent.com/raw/vercel/next.js/pull/77237.diff

It’ll do an “okay” job. It highlights some good parts, flags a couple of potential issues, and stamps it with some general suggestions like, “You might want to handle edge cases.” But you and I both know: shipping that to production based solely on an AI’s sign-off is a poor practice. In the best scenarios, AI is a decent first pass, kind of like a sophisticated grammar checker for code. It might reduce the back-and-forth with your teammates. But a complete replacement? No way.
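If you'd rather script that experiment than copy-paste by hand, here's a minimal sketch of what it might look like, assuming the `requests` and `openai` Python packages, an `OPENAI_API_KEY` in your environment, and a placeholder model name. It's an illustration of the "paste a diff at an LLM" idea, not a recommended setup.

```python
# Minimal sketch: fetch a PR's raw diff and ask an LLM for a review pass.
# Assumptions: `requests` and `openai` are installed, OPENAI_API_KEY is set,
# and the model name below is a placeholder.
import requests
from openai import OpenAI

DIFF_URL = "https://patch-diff.githubusercontent.com/raw/vercel/next.js/pull/77237.diff"

diff = requests.get(DIFF_URL, timeout=30).text

client = OpenAI()
review = client.chat.completions.create(
    model="gpt-4o",  # placeholder; use whatever model you actually have access to
    messages=[
        {
            "role": "system",
            "content": "You are reviewing a pull request. Flag bugs, risky changes, and style issues in the diff.",
        },
        # Naive truncation so a huge diff doesn't blow past the context window.
        {"role": "user", "content": diff[:100_000]},
    ],
)
print(review.choices[0].message.content)
```

The naive truncation is the tell: real diffs routinely exceed what fits in a context window, which is part of why this is a first pass rather than a verdict.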

I've seen big leaps in AI code review. We've done a bunch of experiments at Graphite over the years showing how improvements in context window size, tool usage, and false-positive calibration can push these AI-based solutions further. Despite all that progress, to let LLMs leave the final stamp is to misunderstand the purpose of code review.

Hopefully this is reassuring. Engineers everywhere are anxious that tools from Anthropic, OpenAI, or whoever will shrink our jobs, handing the "grunt work" of generating code over to machines. The Anthropic CEO saying that 90% of code will be AI-generated isn't calming that worry. But the moment we start relying on AI alone to review that code, we lose an essential layer of accountability and shared understanding.

Engineers are so much more than just code machines. The more code AI writes, the more valuable it'll be to have expert engineers reviewing it, deploying it, and iterating on it.

Let’s start with context. LLMs are only as good as the data and references they’re fed. A well-built AI code reviewer might have:

  • The diff, PR title, and description

  • Access to the full codebase through search

  • Links to historical PRs and comments

  • The ability to rummage through Slack, Notion, and Google Docs for design specs or past decisions

  • Possibly a web search for obscure library docs
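To make that list concrete, here's a purely hypothetical sketch of the context bundle such a reviewer might assemble before commenting. The field names are mine for illustration, not any particular tool's schema.

```python
# Hypothetical sketch of the per-PR context an AI reviewer might assemble.
# These fields mirror the list above; they are illustrative, not a real schema.
from dataclasses import dataclass, field


@dataclass
class ReviewContext:
    diff: str                # the raw patch under review
    pr_title: str
    pr_description: str
    codebase_hits: list[str] = field(default_factory=list)   # files surfaced by codebase search
    related_prs: list[str] = field(default_factory=list)     # historical PRs and review comments
    design_docs: list[str] = field(default_factory=list)     # Slack threads, Notion pages, Google Docs
    web_references: list[str] = field(default_factory=list)  # obscure library docs from web search
```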

That’s already a ton of context, but it’s not everything. It probably doesn’t know how your product roadmap shifted after a big meeting with The Customer. It won’t capture your team’s subjective bias toward composition over inheritance. It can’t weigh personal or strategic factors that might dictate an otherwise suboptimal code pattern. Real code review demands domain expertise and forward-thinking alignment.

Sure, you could theoretically log and index everything so the AI is always up to date (and I'm sure some teams will try). But let's be honest: so much vital knowledge never makes its way into any permanent record. We still rely on that intangible fusion of experience, company culture, personal conversations, and intuition. Human + machine context is always greater than the machine alone.

Consider how the best chess performance in the world comes from computer-assisted humans, not computers alone.

Code review isn’t just about verifying code correctness; it’s a fantastic medium for teaching and learning. New hires get an accelerated education in coding standards, best practices, and cultural norms by reading and reviewing each other’s PRs. Senior folks, even if they’re new to a part of the codebase, pick up the intricacies of how that system was built.

An LLM can give you the official design pattern name or library best practice. But it can't replicate that collaborative push-and-pull where a senior engineer might say, "Hey, we don't do that pattern here because we learned the hard way it doesn't scale." Or, "We found that approach fails under production traffic spikes." AI might highlight an inefficiency, but it won't jump on a video call (or a whiteboard) to hash out an alternative architecture with you for an hour. That kind of discussion is how real alignment forms, and it's only possible with actual humans who care about the codebase over the long haul.

Code review should never be your sole firewall for security issues or major regressions, but it’s still a critical checkpoint. A teammate might notice that your code passes all tests but still opens a subtle back door for data exfiltration. Or perhaps it inadvertently doubles the CPU usage on your main cluster. An LLM might theoretically catch that, but it might not, especially if it’s coaxed into ignoring it or is fed misleading descriptions. A human, on the other hand, will call out suspicious changes or demand more answers in the PR.

Fundamentally, AI cannot accept real accountability. When an incident occurs and your team runs a postmortem, it’s normal to see the person who wrote the code and the person who reviewed and approved it front and center. If your code was partly AI-generated, that just means the real accountability shifts even more heavily onto the human reviewer. You need a flesh-and-blood engineer who’s willing to stand by that code, to say, “Yes, I reviewed it, and I missed the bug.” As soon as you hand the keys over to AI, you lose that chain of responsibility.

As the famous IBM memo put it, a computer can never be held accountable: https://constelisvoss.com/pages/a-computer-can-never-be-held-accountable

So does that mean there's no place for AI in code review? Of course not. But I think we need to reframe how we talk about "AI code review." It's less about letting AI finalize decisions and more about letting AI do the routine scanning that we've historically put under the umbrella of "continuous integration."

Humans used to do code inspections by hand in the 1970s (shout-out to Michael Fagan’s work at IBM for systematically introducing code inspections). The whole team would gather around printouts of code, scanning line by line for hours. This was before thorough automated tests or cheap compute cycles. Over the decades, we shifted some of that tedium to machines. We introduced:

  • Automated linting

  • Unit tests

  • Integration suites

  • Ephemeral deployments, traffic replay, and more advanced techniques

LLMs are the next logical step in that chain. They can do a deeper, fuzzier, more nuanced pass than your average rule-based linter. Call it “AI code review” if you want, but it’s basically advanced, context-aware CI that flags suspicious changes, style mismatches, or glaring bugs. And just like any test suite, it’ll be right 95% of the time, occasionally flake, and still require a real person to evaluate whether the flag is valid.

I find that perspective liberating, because it means we can let AI handle the mundane nits, like minor style infractions, subtle regex missteps, small performance traps. If we can have them pop up within seconds of a commit, while the author is still in the zone, that’s a win-win. The machine does the grunt work, the author fixes or dismisses it right away, and you’ve reduced friction for your teammates who review the final version. That’s the dream of zero-config linting on steroids.
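To sketch what that "seconds after a commit" loop could look like, here's a hypothetical post-commit helper. It assumes the `openai` Python package, an `OPENAI_API_KEY` in your environment, and a placeholder model name, and it deliberately never blocks anything.

```python
# Hypothetical post-commit helper: ask an LLM for nit-level feedback on the
# latest commit, purely advisory. Assumes git, the `openai` package, and
# OPENAI_API_KEY are available; the model name is a placeholder.
import subprocess

from openai import OpenAI

# Grab the diff for the most recent commit.
diff = subprocess.run(
    ["git", "diff", "HEAD~1", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout

if diff.strip():
    client = OpenAI()
    nits = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": "List only nit-level issues in this diff: style, naming, small performance traps. Be terse.",
            },
            # Naive truncation to stay within the context window.
            {"role": "user", "content": diff[:50_000]},
        ],
    )
    print(nits.choices[0].message.content)

# No failure path on purpose: these are suggestions for the author, never a merge gate.
```

Wired into something like `.git/hooks/post-commit`, the suggestions show up while the author is still in the zone, and whether to act on them stays a human decision.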

(If you’re curious about some of my opinions on “nits” and why they’re annoying in code review, here’s a blog post I did on them.)

In the end, though, AI is not a replacement for real human review.

A truly effective code review is about more than just scanning for bugs or missteps; it's about exchanging ideas, shaping architectural decisions, and building a shared understanding of the system. That's the stuff that an LLM, for all its fancy generative abilities, simply cannot replicate.

As code generation becomes more automated and even the “boring” parts of review get covered by these large models, human engineers will have fewer trivial tasks to worry about. That’s actually a good thing. But the final thumbs-up on merging code? The accountability and knowledge sharing that come from one engineer reading and critiquing another engineer’s code? That’s not going anywhere.

Yes, 90% of code might be AI-generated at some point. But if we’re shipping code that’s never actually read or understood by a fellow human, we’re running a huge risk. I, for one, am not willing to hand my entire production deployment pipeline over to an unaccountable robot, no matter how advanced it gets.

Consider the obscene hypothetical where, one day, a Tesla crashes because Anthropic generated the code and OpenAI stamped it. Self-driving car crashes are already a moral puzzle: if it's LLMs all the way down, the real world becomes the test case.

Instead, we can embrace AI where it helps. Let's keep working on those fuzzy linting and automated suggestion tools that can speed us up. But let's also keep the final, human stamp of approval exactly where it belongs: in the hands of actual engineers who know and care about the long-term impact of the code.

I remain bullish on AI overall. I love playing with Cursor, ChatGPT, and the rest, but when someone suggests they're about to replace every aspect of code review, I just shake my head. The future is almost certainly an AI-assisted pipeline with humans firmly at the helm.
