Over the past year at Graphite, we've been running various experiments to explore how AI can enhance and improve engineering workflows. As AI capabilities rapidly advance, we see an opportunity to build features that even just two years ago sounded impossible.
In this post, I'll share the key lessons and insights from our experiments applying AI to the create-review-merge developer workflow.
The potential benefits of AI for engineers are clear - increased productivity, reduced toil, and more time to focus on high-level thinking. Generative AI promises to bring more leverage to building software, empowering small teams to accomplish feats previously only possible with armies of engineers.
However, skepticism and caution are certainly warranted. Applying AI naively to code generation can lead to bugs, security vulnerabilities, wasted time on useless features, and poorly architected codebases.
As CTO, in my initial exploration of AI features, I established a few guiding principles:
Engineers should retain full visibility into which information is human-authored vs generated.
Start with code analysis, not code generation. Assuming generative AI will always create some rate of false information, I’d rather generate bad advice than subtly dangerous code.
Dogfood thoroughly to distinguish what's actually valuable from what just sounds interesting on the surface.
Several experiments showed particular promise in using AI to improve the review process:
One of the earliest and most straightforward experiments we conducted was automating the generation of titles and descriptions for Pull Requests (PRs). Given that Large Language Models (LLMs) excel at summarizing complex sets of data, and that decades of engineering history show us that writing descriptive PRs is often overlooked, this seemed like a natural starting point.
The status quo in many codebases is disheartening—over half of all PRs are submitted with empty descriptions. In an ideal world, an engineer, after completing a feature, would also construct a PR description delving deeply into the rationale for the changes, how they align with broader project goals, and any other relevant context. I’m not arguing here that AI would do better at this task than a competent human engineer with plenty of time. What I am saying, however, is that engineers often DO NOT have plenty of time to sit around writing intensely detailed descriptions, and an auto-generated description is indisputably better than none at all.
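The core of this feature is just summarization: feed the model the diff and the branch's commit messages and ask for a reviewer-facing write-up. A minimal sketch is below; the prompt wording, the truncation budget, and the `complete` callable are all illustrative assumptions, not Graphite's actual implementation.

```python
# Sketch of assembling an LLM prompt for PR description generation.
# The model call is abstracted behind `complete`; the prompt wording and
# MAX_DIFF_CHARS budget are assumptions for illustration.

MAX_DIFF_CHARS = 12_000  # assumed budget to keep the prompt within context limits

def build_pr_description_prompt(diff: str, commit_messages: list[str]) -> str:
    """Combine the diff and commit messages into a summarization prompt."""
    truncated = diff[:MAX_DIFF_CHARS]
    commits = "\n".join(f"- {m}" for m in commit_messages)
    return (
        "Summarize this pull request for reviewers. Describe what changed, "
        "why, and anything risky.\n\n"
        f"Commit messages:\n{commits}\n\n"
        f"Diff (may be truncated):\n{truncated}"
    )

def generate_description(diff: str, commit_messages: list[str], complete) -> str:
    """`complete` is any text-completion callable, e.g. a wrapper around an LLM API."""
    return complete(build_pr_description_prompt(diff, commit_messages))
```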
As we dogfooded this feature within our own team, we quickly discovered an unexpected but highly beneficial outcome: beyond just populating PRs with more comprehensive descriptions, this feature led to a significant improvement in the quality of commit messages on our main branch. GitHub, by default, transforms the PR description into a commit message when changes are merged into the trunk. The result was a rich, well-annotated commit history that made it much easier to review past changes and, if necessary, pinpoint commits for reversion.
The rich commit history it provided, coupled with its frictionless integration into our existing daily workflow, clearly demonstrated the feature’s value.
Each week, our team receives an automated Slack message summarizing all the changes made to the codebase that week. The digest highlights major projects that launched, summarizes themes across PRs, and calls out great examples of work.
This qualitative overview provides useful context beyond classic quantitative measurements like commits per week. The summaries can be customized for each recipient based on the code they touched, focusing on what the recipient cares most about. Because of the velocity at which we work and the volume of features that get shipped, the average engineer on our team reads less than 20% of the code that gets merged — generated digests help illuminate superb work they may have missed and fill in any contextual gaps.
A common pain point in code reviews is imprecision in feedback. Reviewers often leave vague comments like:
// We should avoid allocating extra objects here
Manually translating these kinds of textual comments into specific suggested code changes is tedious for the author and the reviewer.
We built a feature that automatically translates a written comment into a code suggestion:
This saves several review cycles of clarifying intent and avoids ambiguities. Authors immediately see concrete examples of what the change could look like. The original phrasing is preserved to maintain context.
Early testing showed dramatically faster review iteration times. Reviewers were more confident leaving actionable suggestions directly in code since they didn't have to worry about getting the syntax exactly right.
There are still risks of buggy generations leading to incorrect edits, but in practice, the suggestions are short enough to be low risk, and engineers still double-check the translations before posting.
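Mechanically, the final step of this feature maps onto GitHub's suggested-change syntax: the model proposes replacement lines for the hunk the reviewer commented on, and those lines get wrapped in a `suggestion` fence beneath the original comment. Here is a minimal sketch of that formatting step; the function name is hypothetical, and how the replacement lines are generated is elided.

```python
# Sketch of wrapping model-proposed replacement lines in GitHub's
# suggested-change syntax, preserving the reviewer's original phrasing.
# The replacement lines themselves would come from an LLM given the
# comment plus the surrounding code hunk.

def format_suggestion_comment(reviewer_comment: str, proposed_lines: list[str]) -> str:
    """Build a review-comment body containing a GitHub suggestion block."""
    fence = "`" * 3  # constructed to avoid a literal triple backtick in this example
    body = "\n".join(proposed_lines)
    return f"{reviewer_comment}\n\n{fence}suggestion\n{body}\n{fence}"
```

Posting the result as a review comment on the relevant lines lets the author accept the change with one click.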
One of the challenges new team members often face is understanding how to navigate and operate within an existing codebase. New hires frequently ask questions like, “How do I do x in this codebase?” We thought this type of query would be another excellent candidate for AI intervention, so we launched another experiment: a bot that indexes our codebase and provides comprehensive answers to these ramp-up questions.
In our trial, we took an interesting approach: whenever a question arose around ramp-up, we first had a peer engineer draft an answer. Only after receiving the human-generated response did we reveal the answer generated by the AI bot. Over the course of our experiment we found that the AI's answers were not only accurate but also considerably more thorough than their human counterparts. The answers also often included helpful pointers to specific sections of the code, enhancing the utility of the response.
The bot proved to be invaluable in a few key ways. First, it helped new teammates get unblocked more quickly, letting them proceed with their tasks without unnecessary delays. Second, it reduced the hesitancy to ask questions, as team members didn't feel like they were imposing on their peers. The AI bot was also transparent in its operations, citing its sources and posting its answers in Slack. This open approach allowed for a collaborative dynamic, where other teammates could chime in to add historical context or additional explanations when necessary. The transparency also reduced anxiety over accuracy, countering the “false confidence” LLMs sometimes project even when they are incorrect.
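Under the hood, a bot like this is retrieval plus generation: find the chunks of the codebase most relevant to the question, then hand them to an LLM along with the question and cite the file paths. The toy retrieval step below scores chunks by keyword overlap purely for illustration; a real index would use embeddings, and all names here are assumptions.

```python
# Toy retrieval step for a codebase Q&A bot: score indexed file chunks by
# word overlap with the question and return the best-matching paths, which
# would then be included in the LLM prompt and cited in the answer.
# A production system would use embedding similarity instead.

def top_chunks(question: str, chunks: dict[str, str], k: int = 2) -> list[str]:
    """Return the paths of the k chunks sharing the most words with the question."""
    q_words = set(question.lower().split())

    def score(item: tuple[str, str]) -> int:
        _path, text = item
        return len(q_words & set(text.lower().split()))

    ranked = sorted(chunks.items(), key=score, reverse=True)
    return [path for path, _ in ranked[:k]]
```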
By successfully integrating a Q&A bot into our onboarding process, we've managed to streamline knowledge transfer and improve the ramp-up experience for new hires.
When reviewing PRs, it's easy to miss subtle issues like outdated patterns or inconsistent style. Humans lack the systematic rigor to reliably catch these minor violations - and linter advocates would argue that humans shouldn't focus on this genre of review in the first place.
We experimented with an automatic AI review that runs a series of validation checks on new PRs:
Is this code covered by existing unit tests?
Does the new code follow our style guide and existing conventions?
Is it duplicating code that could be abstracted?
Are there related PRs or review discussions worth resurfacing here?
The AI then reports any findings and links to relevant examples from the codebase. Rather than trying to provide an authoritative pass/fail review, it offers a "second pair of eyes" and helps authors catch mistakes before even requesting human review.
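One simple way to orchestrate checks like those listed above is to run each one as its own prompt against the diff and collect every non-trivial answer as a finding. The sketch below assumes a sentinel "OK" response for passing checks; the check wording and the `ask_model` callable are illustrative, not Graphite's actual prompts.

```python
# Sketch of orchestrating per-PR review checks: each check is a prompt run
# against the diff, and any answer other than the "OK" sentinel becomes a
# finding to report on the PR. Check wording is an assumption.

CHECKS = [
    "Is this code covered by existing unit tests? If yes, answer OK.",
    "Does the new code follow the style guide and existing conventions? If yes, answer OK.",
    "Is it duplicating code that could be abstracted? If no, answer OK.",
]

def run_review_checks(diff: str, ask_model) -> list[str]:
    """Return findings: the answer to every check that did not come back 'OK'."""
    findings = []
    for check in CHECKS:
        answer = ask_model(f"{check}\n\nDiff:\n{diff}").strip()
        if answer != "OK":
            findings.append(answer)
    return findings
```

Because each check is independent, they can run in parallel, which matters for keeping the end-to-end latency low.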
While dogfooding, we found one of the most useful aspects was the speed at which the AI review posts to the PR. Generative AI can take tens of seconds, especially if it’s performing multi-step queries. However, 30 seconds is still dramatically faster than CI runs, which often take 5-10 minutes. Then of course there’s human review, which often takes hours from PR open to approval.
By delivering feedback ASAP, engineers were often able to spot and fix small issues quickly, resulting in the eventual human reviewer seeing their second or third draft, rather than their first.
Sometimes the AI review would also flag a concern and the author would reply with a reasoned disagreement. By the time a reviewer got to the PR, they could see the concern had already been considered and were more likely to approve the PR.
All in all, AI review has led to fewer review cycles on internal PRs, and we’re optimistic it could speed up engineering teams as we experiment further.
While we had many wins in our AI exploration, not every experiment led to a valuable outcome. Here are a few areas where our efforts didn't quite hit the mark.
Another area we looked into automating was the documentation process. We tried this by recursively generating summaries of the codebase, starting at the file level. These summaries would then be clustered and re-summarized multiple times, sometimes passing through as many as five layers. From these high-level summaries, we generated an outline and filled it in using agent-style search and generation methods.
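The recursive summarize-and-cluster pass can be sketched as a simple roll-up loop: batch the current level of summaries, summarize each batch, and repeat until one summary remains. The batch size and the pluggable `summarize` callable below are assumptions; our actual pipeline also clustered related summaries before merging them.

```python
# Sketch of recursive summarization: repeatedly merge fixed-size batches of
# summaries and re-summarize, until a single top-level summary remains.
# `summarize` is any text-summarization callable (e.g. an LLM wrapper);
# the batch size of 4 is an illustrative assumption.

def rollup(summaries: list[str], summarize, batch: int = 4) -> str:
    """Collapse a list of summaries into one by repeated batched summarization."""
    while len(summaries) > 1:
        summaries = [
            summarize("\n".join(summaries[i:i + batch]))
            for i in range(0, len(summaries), batch)
        ]
    return summaries[0]
```

With a batch size of 4, a few thousand file-level summaries collapse to a single document in about five passes, which matches the layer count we saw in practice.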
The result was comprehensive documentation that initially seemed of high quality. However, we ran into a significant issue: around 5% of the generated information was incorrect or hallucinated. This lapse in accuracy had a corrosive effect on engineers' trust in the documentation, and even though editing out the mistakes was theoretically possible, doing so would undermine the time-saving benefits of automated documentation.
With an aim to evolve our style guide, we experimented with English sentence-style linting rules. The generative AI identified true positive violations almost all of the time. The catch, however, was the frequency of false positives: they occurred on only about 5% of sentences, but that was more often than genuine violations appeared. As a result, when a violation was flagged, there was roughly a fifty-fifty chance it was a false positive. This of course eroded users' trust in the analysis, rendering the experiment less useful than we had hoped.
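The fifty-fifty figure is just base rates at work: when genuine violations are about as rare as the false-positive rate, true and false flags show up in roughly equal numbers. The rates below are illustrative, chosen to match the rough figures above.

```python
# Why a ~5% false-positive rate can make ~50% of flags wrong: if genuine
# violations appear in roughly 5% of sentences and the model catches nearly
# all of them, false flags on the other 95% of sentences are about as
# numerous as the true flags. Rates here are illustrative.

def flag_precision(base_rate: float, tpr: float, fpr: float) -> float:
    """P(actual violation | flagged), by Bayes' rule."""
    true_flags = base_rate * tpr
    false_flags = (1 - base_rate) * fpr
    return true_flags / (true_flags + false_flags)
```

With a 5% base rate, a perfect true-positive rate, and a 5% false-positive rate, precision comes out to about 0.51, i.e. a coin flip.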
Graphite already has robust Slack notifications that alert users about new PRs requiring their review. We decided to augment these notifications with descriptive adjectives, generated through an analysis of the pull requests. While the AI-generated descriptors were often amusing and even borderline clickbaity, they didn't add any real value. For example, instead of simply stating that a new PR is up for review, the enhanced notifications would whimsically announce that an "exciting" or "groundbreaking" new PR was awaiting attention. At best, these descriptors were more entertaining as a novelty than actually useful and often caused the notification title to overflow the popup.
While not every experiment was a success, several key principles emerged:
Leverage existing implementation context whenever possible. Comparing code changes against previous examples improves result quality tremendously while requiring no configuration from the user.
Favor asynchronous, opt-in features over disruptive notifications or blocking workflows. Proactively querying for help is better UX than unsolicited suggestions - automated CI already pioneered this lesson.
Balance false positives and false negatives carefully. Miss out on some potential gains to avoid annoying engineers with constantly mistaken flags.
After months of experimentation, we believe there is meaningful potential for AI to speed up the code review workflow. However, care must be taken to integrate these capabilities seamlessly into existing practices that engineers already understand and trust.
At Graphite, we're committed to exploring these frontiers carefully and focusing maniacally on solving real problems for real developers. If you would like to try out any of the aforementioned features, please email me at firstname.lastname@example.org or reach out on our community Slack. We are expanding our beta access and are looking for feedback on where AI can have the most impact in your workflow!