
Nearly every customer, candidate, and investor who I’ve talked to over the past 2 years has asked me the same question: how is Graphite still relevant in a world where GitHub Copilot & ChatGPT generate more and more of the world’s code? Why do we care so much about crafting an amazing DevEx for pull requests if the entire paradigm of software development is changing?

The relentless pace of improvement in AI code generation has forced us to think deeply about these questions and to continually refine our perspective on the new developer toolchain. It’s easy to imagine AI rendering pull requests as we know them obsolete, but the more we’ve examined how AI is changing the software development lifecycle, the more convinced we’ve become that it will do the exact opposite: AI will make the entire “outer loop” of code review, testing, and deployment more critical and more challenging than ever.

Ironically, the only people who haven’t been asking me about what AI means for Graphite are IC engineers. So much of the focus on AI in devtools has been on coding assistants that live in the “inner loop” of the software development lifecycle (SDLC), where developers write their changes. However, engineers spend a meaningful fraction of their time (30% or more at many companies) in the “outer loop” of development: testing, reviewing, and deploying the changes they’ve written. Given that the volume of AI-generated code changes will continue to increase dramatically, how will developers keep up with this influx in the coming years? The answer is the missing piece of the new developer toolchain: an “outer loop” review, testing, and deployment platform for the age of AI-generated code.

Before we can think about what this future looks like, we need a bit of context on how AI code generation is changing software development more broadly. The most apparent effect of AI coding assistants so far has been that developers are using them to write a higher volume of code at a faster rate. In early 2023, GitHub estimated that 46% of code on the platform was being written by Copilot, up from 27% in June 2022. Assuming this trend has continued, it's very likely that Copilot now writes the majority of code committed to GitHub. In absolute terms, Copilot is helping developers do more with their time - GitHub estimates that Copilot users complete engineering tasks 55% faster and write 8.7% more pull requests. As impressive as these figures are, they’ll only become more striking as the base models improve and use of assistants becomes more ubiquitous in the coming years.

Currently, 27% of Copilot’s suggestions are accepted, and 92% of developers report using AI to help them write code.

To understand how much better AI code generation will get in the coming years, we first have to look at its current shortcomings. One major limitation is the amount of context on a particular codebase that AI assistants can use when generating new code. Although context windows have expanded exponentially over the past few years, the state-of-the-art base models are still unable to take in enterprise-scale codebases (often 100k-1m+ lines of code, or LoC) in their entirety, limiting the model's ability to write new code using existing methods and conventions. Fine-tuning and retrieval-augmented generation (RAG) can help larger enterprises work around this limitation, but these require investments in training and specialized infrastructure. Fortunately, context windows aren’t likely to be a roadblock for long. With another order-of-magnitude increase in context window size, models will simply take in the entire codebase (and perhaps the entire git history, etc.) before writing new code.

Source: Towards Data Science

A second, less talked-about limitation of today’s models is the output window. Unlike the exponential increases in context windows, output windows have grown more incrementally from GPT-3’s 2k tokens/~250 LoC to GPT-4 Turbo’s 4k tokens/~500 LoC. This is enough for smaller tasks and changes, but it means that larger features, refactors, and migrations will remain out of reach for AI assistants until base model output windows expand.
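As a rough back-of-the-envelope check, the figures above imply about 8 tokens per line of code. The sketch below assumes that ratio holds uniformly, which it won't in practice since it varies by language and coding style, and the helper names are purely illustrative. It shows how many lines fit in a given output window and roughly how many tokens a 100k-1m LoC codebase would occupy.

```python
# Tokens per line of code implied by the figures above (2k tokens ~ 250 LoC).
# This ratio is an assumption for illustration; real ratios vary by language.
TOKENS_PER_LOC = 2_000 / 250  # 8.0


def loc_for_tokens(tokens: int) -> int:
    """Approximate lines of code that fit in a window of `tokens` tokens."""
    return int(tokens / TOKENS_PER_LOC)


def tokens_for_loc(loc: int) -> int:
    """Approximate tokens needed to represent `loc` lines of code."""
    return int(loc * TOKENS_PER_LOC)


if __name__ == "__main__":
    print(loc_for_tokens(4_000))      # 500
    print(tokens_for_loc(100_000))    # 800000
    print(tokens_for_loc(1_000_000))  # 8000000
```

Under this assumption, a 1m-line codebase needs on the order of 8 million tokens of context, while a 4k-token output window covers only about 500 lines of new code.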

A third meaningful frontier in AI code generation is the modality of the assistant. So far we’ve only seen widespread adoption of synchronous tab completion tools that live in your IDE (e.g. Copilot), but much of the promise of LLMs lies in “agents” that can perform more complicated tasks asynchronously. GitHub recently previewed this more advanced interaction in its Copilot Workspace product, and startups such as Cognition Labs, Factory, Magic, and Sweep are all building towards a similar experience.

Between these and other promising development directions, it’s clear that the base models have significant room to improve in the coming years, which will only make AI coding assistants more powerful and in turn more ubiquitous and prolific.

If the age of AI-generated code changes is only just beginning, what then does the future hold for the pull request workflow? Should we still expect a human to be reviewing every PR in 5 years?

While AI coding assistants have improved dramatically in the past few years, they still have significant shortfalls in terms of correctness and security that make a human-in-the-loop necessary when code quality matters. A recent study found that 52% of ChatGPT answers to Stack Overflow coding questions contained bugs or incorrect code. Another study found that Copilot generated code that contained known security vulnerabilities in 40% of controlled scenarios relevant to MITRE’s “Top 25” Common Weakness Enumeration (CWE) list.

Copilot’s docs recommend that one use “Copilot together with testing practices and security tools, as well as your own judgment”.

Even if future models improve greatly in these areas, they’ll still be far from perfect. For the foreseeable future, engineers will still need to review every pull request even if it was AI-generated.

How does the new outer loop developer toolchain differ from today’s tooling? The biggest requirement is that it needs to handle a much higher throughput of pull requests open in parallel as every engineer becomes more efficient. 10-person teams will soon hit scaling challenges previously faced by 50-100-person orgs, and 100-person teams will hit 1000-person team problems, etc. Overcoming this requires changes across the review, testing, and deployment workflow, including:

  • Tooling that helps engineers track, prioritize, and get notified about their in-flight changes

  • “Driver assist” features for reviewers to focus and streamline the code review process

  • Optimized CI pipelines that minimize runtimes/duplicative tests and help engineers resolve failed tests quickly

  • Merge queues that minimize and resolve conflicts between in-flight PRs in the same repo

  • Deployment tools that gracefully shepherd changes out to production and roll them back when regressions happen

If this list sounds familiar, it’s probably because you’ve heard us talk about this same set of features here at Graphite as the foundation for working with stacked PRs. While stacking and AI code generation are orthogonal concepts, their impacts on the development lifecycle look very similar: the same team of engineers produces a higher volume of pull requests in the same span of time. From the beginning of Graphite we’ve been building tools to enable a future of “more changes faster”, for which AI code generation is a powerful accelerant.

Beyond these core workflow considerations, tomorrow’s outer loop devtools also have a familiar ally: generative AI. While AI won’t obviate the need for testing and review, it will certainly help developers perform these tasks more efficiently, summarizing larger changes and focusing their attention on the riskiest and most critical pieces. We’re already seeing this start to play out. Even before the recent advances in LLMs, Google had started using AI models to convert review comments and CI errors into generated code suggestions, which the author can then accept to resolve the feedback instantly. A recent wave of startups (e.g. WhatTheDiff, CodeAnt, CodeRabbit, Ellipsis) is taking this one step further, aiming to deliver a “level 2 self-driving” code review experience where an AI assistant (in the form of a GitHub bot) leaves comments and code suggestions on every PR. However, a combination of base model code understanding and UX limitations makes building a compelling experience difficult: the majority of such generated comments are noise, and even when a suggested change is actionable it still requires another “loop” to commit, test, review, and ship the fix.

To go beyond the current paradigm of noisy GitHub bots requires a far more ambitious end-to-end application of AI to the outer loop. The winning platform will start with deep understanding of the codebase and change history, apply AI to summarize, prioritize, and review each change, and shepherd and self-heal each PR as it passes through CI, merge, and deployment. This is the missing piece of the new developer toolchain, and a necessity for software companies to continue to ship quickly and safely in the age of generative AI.

P.S. if this is something you’re excited about building, come join us at Graphite.

Thanks to Peter Levine, Martin Casado, Jennifer Li, Satish Talluri, & Zeya Yang for reading early drafts of this post.
