
OpenAI is one of the hottest topics in tech right now. Every other tweet and article seems to tout the revolutionary applications of GPT-3, claiming that it could bring significant changes to everyone’s day-to-day workflow and even eliminate human-centric tasks altogether. After gaining access to the OpenAI API, the Graphite team decided to develop a couple of features to test GPT-3’s potential to disrupt the human-driven element of code review.

In this post, I’ll highlight a few GPT-3 features we’ve implemented and tested over the last few months, which features stuck (and why), our future plans for AI-powered features on Graphite, and our answer to that one burning question: could LLMs one day replace code review entirely?

Right off the bat, the most obvious use-case for GPT-3 was to autofill pull request (PR) summaries. GPT-3’s task here is simple: take a diff as input and generate a PR summary that addresses what changed, the impact of the changes, and a test plan to ensure the changes work as intended.
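As a rough illustration of the shape of that task, here is a minimal sketch of how a diff might be wrapped in a prompt that asks for the three parts of the summary. The function name, section labels, and prompt wording are assumptions made for this example, not Graphite's actual prompt, and the call to the model itself is left out.

```python
# Hypothetical sketch: the prompt wording and names below are assumptions,
# not Graphite's actual implementation.
SECTIONS = ("What changed", "Impact", "Test plan")


def build_summary_prompt(diff: str) -> str:
    """Wrap a diff in instructions asking for a three-part PR summary."""
    headings = "\n".join(f"- {name}" for name in SECTIONS)
    return (
        "Summarize the following diff as a pull request description.\n"
        "Use these sections:\n"
        f"{headings}\n\n"
        "Diff:\n"
        f"{diff.strip()}\n"
    )
```

The returned string would then be sent to whichever completion endpoint is in use; keeping prompt construction separate makes it easy to tweak the section list without touching the API call.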

We found that GPT-3 was great at summarizing text and code when given some context about what changed, but far less successful at answering questions that required a degree of thoughtfulness or reasoning, such as impact and testing.

A PR I put up recently added some additional styling to a table of commands on Graphite’s docs page. This is what GPT-3 had to say about my diff:

  • Added 3px solid var(--light-grey) border to th and td elements

  • Added font-size of 16px to th and td elements

  • Added text-align of left to th and td elements

  • Added color of var(--grey) to th elements

  • Added padding of 0.6rem to th elements

  • Added white-space of nowrap to th elements

  • …so on and so forth

It’s clear that GPT-3 is damn good at going through and itemizing all the changes in a given PR. While there’s virtually no thoughtful analysis involved in this output, it proves to be extremely useful from a code review standpoint. For PRs that are extremely large or complicated, a succinct list of its changes serves as a good refresher and aid for the reviewer, especially if they’re context switching frequently. From the author’s standpoint, they may not need to provide nearly as much context in the PR description, if at all. We at Graphite certainly “forget” to write PR descriptions ourselves.

Impact: The diff increases the styling of the table and its elements, with a focus on code elements.

Test plan: The diff can be tested by rendering the table and inspecting the styling of its elements.

Impact and testing are prompts that require GPT-3 to provide more thoughtful answers about the code it’s been asked to analyze. Although the answers themselves are valid, they’re relatively simple-minded and lack a big-picture view of the implications the PR’s changes could have on the broader codebase.

If you give it a diff, is GPT-3 able to leave nits or compliments on the code? The answer is sometimes. Similar to its ability to describe the impact and test plan of a body of code, GPT-3 leaves comments that sound good, which may be sufficient for something like a compliment, but it often fails to identify actionable nits about things such as code quality. Here’s an example of GPT-3 attempting to complete a response to a question left on my code:

The suggested comment completion sounds great and makes logical sense, but it doesn’t follow the natural cadence of a code review. Rather than offering new information or context, GPT-3 usually ends up regurgitating information it has already digested in slightly different and nuanced ways.

Similar to code summarization, autofilling and auto-completing both PR and commit titles were a slam dunk with GPT-3. After some trial and error with a few different uses of GPT-3, it became clear that it was most accurate and useful for tasks that fell within the bucket of summarization.

After using these features internally for a month or two, we noticed several pros and cons of using GPT-3 to assist code review:

  • Pros: A vast majority of PRs (sadly) have no description whatsoever, so an AI-generated description provides context where there would otherwise be none. Additionally, when PRs on Graphite are merged, their descriptions become the commit messages for the merged PRs on trunk. As a result, you end up with a beautiful and thorough commit history when scrolling through trunk, something no one wants to go through the effort of typing out but is amazing to have on hand.

  • Cons: The OpenAI GPT-3 API is relatively slow (5 to 10 seconds per query) and has a limit on how big the diff input can be (capped at around 150 lines of code). While queries by themselves are relatively cheap at 1-2 cents each, the costs can rack up if there’s a large volume of queries being run daily or per page load.
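To make those two constraints concrete, here is a minimal sketch that caps a diff at roughly 150 lines before it is sent and estimates a daily spend from the 1-2 cent figure. The constant and function names are assumptions for illustration; the numbers come from the figures quoted above, and the real limit is presumably measured in tokens rather than lines.

```python
from typing import Tuple

# Figures taken from the post; the names are hypothetical.
MAX_DIFF_LINES = 150                # approximate input cap
COST_PER_QUERY_USD = (0.01, 0.02)   # "1-2 cents" per query


def truncate_diff(diff: str, max_lines: int = MAX_DIFF_LINES) -> Tuple[str, bool]:
    """Return the diff cut to max_lines, plus a flag saying whether it was cut."""
    lines = diff.splitlines()
    if len(lines) <= max_lines:
        return diff, False
    return "\n".join(lines[:max_lines]), True


def daily_cost_range(queries_per_day: int) -> Tuple[float, float]:
    """Estimate the low and high daily cost in USD for a given query volume."""
    low, high = COST_PER_QUERY_USD
    return queries_per_day * low, queries_per_day * high
```

At 1,000 queries a day, for example, this works out to roughly $10 to $20 per day, which is why running a query on every page load adds up quickly.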

After playing around with and tweaking these features to our liking, we ended up un-shipping all but AI summarize, and we plan to continue exploring more summarization-related use cases for GPT-3. A few ideas we’re considering:

  • Insights is a relatively new feature at Graphite that allows all developers within an org to see quantitative metrics around their team’s contributions to select repositories. If we were to feed the API a collection of PR bodies and titles as input, we could use the high-level summary output to provide more qualitative insights on what users have shipped over a configurable period of time.

  • Graphite notifications currently send digests that list PRs which have been created, edited, merged, etc. within a given timeframe. GPT-3 could be used to make the digests more targeted and personalized, allowing them to read less like a laundry list and more like a written summary of what happened during that timeframe.

  • GPT-3 could be used to identify trends across a certain org or repository, such as what kinds of comments are left, the correlation between comments and merged PRs, and so on. It may even eventually be able to describe a specific user’s style of development, as in what kind of comments they leave, where their dev-style could use improvement, etc.
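For the Insights idea, the input step could look something like the sketch below: select the PRs merged within a configurable window and join their titles and bodies into one block of text to summarize. The `PullRequest` record and the function name are hypothetical and do not reflect Graphite's data model.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class PullRequest:  # hypothetical record, not Graphite's data model
    title: str
    body: str
    merged_on: date


def build_insights_input(prs: List[PullRequest], start: date, end: date) -> str:
    """Join titles and bodies of PRs merged in [start, end], oldest first."""
    selected = [pr for pr in prs if start <= pr.merged_on <= end]
    selected.sort(key=lambda pr: pr.merged_on)
    return "\n\n".join(
        f"Title: {pr.title}\nBody: {pr.body or '(no description)'}"
        for pr in selected
    )
```

The resulting text would still be subject to the input size limit described earlier, so a real version would likely need to batch or pre-summarize PRs for busy repositories.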

While the applications of OpenAI’s GPT-3 in the world of code review offer interesting integrations and improvements on the current review process, it’s safe to say that the human aspect of reviewing code won’t be replaced any time soon. In the meantime, we’re excited to continue to iterate on existing and new GPT-3 features for Graphite that will hopefully make your code review process as easy and enjoyable as possible.

A note to users: we understand that there are significant privacy/security/data concerns when using OpenAI and other artificial intelligence APIs to process data. Rest assured, none of your source code is being used in OpenAI's training sets. Read more in our Privacy & Security docs here.

Check out Graphite’s new AI summarize feature in this week’s changelog and let us know what you think on Twitter or Slack!
