Background gradient

LLMs services are cheaper than ever. GPT4 is smart, and seems to accept more tokens at a cheaper cost every six months. At Graphite I obsess day and night about code review - which makes me constantly wonder, why can’t AI review code well?

On the surface, the idea makes sense. GPT4 appears fantastic at reading code and outputting graceful, boring English. You, the reader, may have even tried feeding GPT4 your own code to review. When I first tried it, I was impressed enough to immediately start fantasizing about how AI could revolutionize this daily developer workflow.

The dream of AI code review has many perks. The average PR takes a few hours to write (sometimes taking much longer), a few minutes to build, a few minutes to test, and a few more minutes to deploy. However, the average code review takes ~18 hours from PR publishing. It's often the slowest step in shipping a change.

GPT4, on the other hand, takes seconds to generate a review, coming out faster than CI results. In a world with accurate AI code reviews, authors could get reviews nearly immediately, allowing them to incorporate feedback before they context switch. If the team still requires human approval, the author at least has a chance at cutting down on review cycles. If the team did trust the AI review however, then PR submission could become the final human interaction on the average passing code change - ushering in an era of fire-and-forget PRs.

With full knowledge of the potential value here, we set out experimenting internally with AI code review. Our product already processes every pull request made within a repo, so it was easy for us to call OpenAI and post reviews on our own internal PRs. We had… mixed results.

To the AI’s credit, it did catch a few useful things. It could fairly consistency highlight things like spelling mistakes and minor logical errors. Much more frequently however, the AI reviewer bot would generate false positives. It would comment incorrectly on aspects of the diff based on either misreading the lines, or failing to account for context outside the lines which changed.

We tried a variety of tactics to improve on our initial technique we liked to call: “feeding GPT4 the code diff.”

First, we tackled the problem that GPT4 doesn’t know what you care about when reviewing. For example, it might flag that a function is too long, even though the length is within your style guidelines. Likewise, you might hate ternaries and want them flagged, but the AI reviewer doesn’t mind them at all.

To help account for this, we added the ability for engineers to specify an “AI review guide” of sorts to augment the GPT4 queries and attempt to create more relevant reviews.

Secondly we noticed that many of the AI review errors stemmed from a lack of broader codebase context. In deciding whether a change is “good” or not, it’s crucial to use the existing code as a baseline, so to help here, we tried adding retrieval-augmented-generation (RAG) to the reviews. Before reviewing each diff, we’d have the AI first perform a vector search for similar code snippets from the rest of the codebase. We’d feed this context into the review query - increasing the token window but allowing the AI to compare the diff to existing coding standards.

Lastly, we attempted to address GPT4’s eager suggestions. Traditional use of GPT4 involves calling its completion API - and like an eager student, GPT4 hates staying silent. If a PR would include an obvious typo, GPT4 would reliably call it out. But if the PR was simple and flawless, the AI would invent a concern to mention. We were able to bring down the false-concern rate by switching to using GPT4’s new function-calling features, which have better default support for shutting up. Rather than asking the AI to complete the review, we’d give it the opportunity to call a function reporting a concern only if the AI felt it was right to call the function. In practice, this heavily cut down hallucinations from about 9:1 to 1:1.

Despite all of these improvements however, the AI reviewer's signal to noise ratio was just not good enough. Additionally, when evaluating the AI review experiment as a whole, issues of a completely different, more philosophical kind came up.

While dogfooding and refining our experimental review bot, we began hitting deeper philosophical problems with AI review. Fundamentally, we saw how reviews are about so much more than the code.

  • Incremental improvement: Code review forces you to review conversations that happen offline as well. These conversations outside the office, about where you hope to evolve the architecture and the product, are important. Google’s code review guidelines argue that a change should be accepted if it makes a codebase incrementally better - but that requires aspirations and subjectivity, something that AI lacks.

  • Author trust: Human reviewers consider how much they trust the author. Does the author often write buggy code, or do they have a track record of shipping smoothly? Are they new to this area of the codebase, or are they the original authors?

  • Reviewer learning: Code review forces engineers to maintain context beyond their own code. They force frequent, small conversations with peers, reminding you what's happening and asking you to weigh in. With an AI bot handling reviews, there would be a huge dip in this crucial knowledge sharing.

  • Accountability: When leaving a review, the review is implicitly taking partial accountability for the repercussions of the change. Post mortems are blameless, but the review might be tagged on the incident. Just like with self-driving cars, AI reviewers lack accountability.

  • Security: Code review exists, in part, to defend a codebase from bad actors. Arbitrary code changes could cause untold intentional damage. With mandatory human code review, slipping in a vulnerability becomes much harder.

  • Risk: To truly assess the impact of a change, the reviewer needs deep context on the state of the system and former incidents. Is it a live website or a pre-production demo? Does the DB table being queried have a thousand records or a billion? Did upgrading that dependency cause a bug last time, or does the upgrade fix a critical bug?

Given these broad problems with AI review, will we ever see LLMs performing code reviews in the future? I speculate that we’ll reach a middle ground. Services may use a combination of RAG and fine-tuning to customize models to the current codebase (GitHub’s Copilot has already announced something similar). They may automatically search over old PRs, comments, Notion docs, Slack comments, and more - surfacing relevant flags to authors and reviewers alike.

In this way, I can imagine AI acting as a super-linter for authors and contextual-lense for reviewers. For the indefinite future, however, I believe the final approval will remain human.