Read Anthropic’s case study about Graphite Reviewer

Graydon Hoare, the author of Bors, wrote a fantastically titled blog post “technicalities: ‘not rocket science’ (the story of monotone and bors)” just over ten years ago. In the post, the author writes about an early 2000’s fight to stop merging bugs into their version control:

Thirteen years ago I worked at Cygnus/RedHat on a project with a delightful no-nonsense Australian hacker named Ben Elliston. We were faced with a somewhat nightmarish multi-timezone integration-testing scenario that carried a strong penalty for admitting transient bugs to the revision control repository (then CVS). To try to keep the project under control, Ben, Frank, and possibly a few other folks on the team cooked up a system of cron jobs, rsync, multiple CVS repositories and a small postgres database tracking test results.

The system had a simple job:

automatically maintain a repository of code that always passes all the tests.

It gave us peace of mind (customers only pulled from that repository, so never saw breakage) and had the important secondary benefit that engineers could do their day's work from known-good revisions, without hunting someone else's bug that accidentally slipped in. Because the system was automated, we knew there was no chance anyone would rush to commit something sloppy. You could commit all the sloppy stuff you wanted to the non-automated repository; it would just be declined by the automated process if it broke any tests. Unforgiving, but fair. At the time I felt I'd learned an important engineering lesson that any reasonable shop I went to in the future would probably also adopt. It turns out that if you set out to engineer a system with this as an explicit goal, it's not actually that hard to do. Ben described the idea as "not rocket science". Little did I know how elusive it would be! For reference sake, let me re-state the principle here:

The Not Rocket Science Rule Of Software Engineering:automatically maintain a repository of code that always passes all the tests

Graydon’s rule has stood the test of time. Almost all modern code repositories practice some form of continuous integration tests, where they try to build and test code before merging. In many teams, this is as simple as running GitHub Actions unit tests on every new PR before merging. But, as I’ll explain in this post, keeping trunk green and passing tests is harder than it first looks. Two decades of iterative engineering have culminated in complex merge queues that are finally approaching the optimal solution to this problem.


Note

Greg spends full workdays writing weekly deep dives on engineering practices and dev-tools. This is made possible because these articles help get the word out about Graphite. If you like this post, try Graphite today, and start shipping 30% faster!


Let’s start from first principles. If our goal is to “automatically maintain a repository of code that always passes all the tests,” why not just run all your tests on every PR before merging? This sounds a lot like Continuous Integration, and in small repositories, it works well.

Unfortunately, running all tests has a simple but immensely complex flaw - merge skew. Let’s say trunk is at commit (a). You create a new pull request at commit (b), branched off (a), and kick off a CI run. CI takes 10-20 minutes, so you use the time to get lunch. An hour passes and you return to your computer. CI is passing (yay) and the change is ready to merge. But during your CI run and lunch break, a teammate merged in their PR, moving trunk forward to (a’). When you merge your PR, GitHub will create a merge commit on trunk, or will rebase your change onto the latest trunk change. Either way, all you know is that your change passed tests when built on top of (a) - you no longer have any proof that it still passes all tests when rebased on top of (a’).

How many changes merge in between you running CI and merging your PR is what we call merge skew. A little bit of merge skew is bearable - the odds of any one PR causing a semantic conflict with another is low. But what if you work in a highly coupled codebase or in a codebase with a high rate of change? In an 8-hour work day with 100 engineers merging in one PR a day, 1-5 PRs might merge in between every other PR passing CI. As the volume of PRs increases semantic conflicts will become more and more likely.

What's the cost of a semantic conflict due to merge skew? Simply put, trunk breaks. Gradyon’s rule breaks down - the repository of code no longer always passes all the tests. The code may no longer build, or sometimes worse, niche functionality may appear buggy. Other engineers constantly pull and rebase their changes onto trunk - as long as main is broken they’ll be working on unstable footing. New changes start failing pre-merge CI, and engineers begin wondering if their work has a bug or if the whole codebase is borked.

At some point - often later than desirable - folks realize the trunk is broken. The on-call or build cop gets paged, while other engineers choose this as a great opportunity for a coffee break. Eventually, the bad change gets reverted, but at the cost of broad disruption for the engineers working on the codebase.

Semantic conflicts rarely impact end users because most engineering teams re-execute all CI on trunk after merging—partly to protect against this exact problem. Rather, the victim is team-wide engineering productivity.

To understand how bad of a problem these collisions are, one needs to think about the rate of collisions in their repo. One collision a year probably isn’t a problem worth solving. A collision a month is annoying. A collision a week is a major distraction. And daily collisions can grind large teams to a full halt.

The frequency of these collisions is a probabilistic equation directly proportional to the average merge skew for each PR and the direct or indirect coupling between those code changes. (If two engineers are working on the same files all day, a small merge skew can still result in multiple midair collisions).

Let’s do some math. Consider some number of PRs merging in a typical 14-hour workday window. As the team and the PR rate grow, the number of such in-flight PRs increases, amplifying the probability of a collision occurring.

Suppose the team reaches a point where, on average, 10 PRs merge per day. During the 14-hour window a PR is open, around 7 other PRs might merge (assuming merges are uniformly distributed throughout the day).

The probability of experiencing at least one collision while a PR is waiting to be merged is given by:

Terminal
P(<at least one collision>) = 1 - (1 - <collision probability>) ^ <# of PRs>
P(<at least one collision>) = 1 - (1 - 0.01)^7 = 0.0679

Assuming each collision results in trunk breakage that takes scrambling engineers one hour to notice and resolve, we can estimate the frequency of trunk breakages. As the rate of PR merges increases, so does the frequency of these disruptive events.

Daily PR MergesPRs During 14-Hour WindowProbability of Collision per PRExpected Daily CollisionsDaily Trunk Breakage Time (hrs)53.53.44%0.120.121076.79%0.680.68201413.26%2.652.65302119.03%5.715.71402824.22%9.699.69

The table illustrates how, as the rate of PR merges per day increases, the expected daily collisions—and thus the daily trunk breakage time—also increase. When the development team reaches higher PR rates like 30 or 40 merges per day, the system can expect multiple hours of downtime daily, which significantly hampers productivity.

As you can see from this table, even small repositories with only a handful of merges a day can benefit from a merge queue to avoid collisions and trunk breakages. In the real world, the trunk breakages depend heavily on PR time-to-merge and how coupled each PR is to other changes. But while you can tune the variables in the equation, at a high enough volume of merges, collisions will eventually dominate.

Anecdotally, merge skew is an afterthought for five-person teams. On a twenty-person repository, folks start to see occasional collisions. On a 50+ person repo, the problem often becomes frustrating enough to demand a solution. GitHub’s internal development hit major “conflict” problems when merging around 60 PRs a day.

Trains could also grow large, containing the changes of 15 pull requests. Large trains frequently “derailed” due to a deployment issue, conflicts, or the need for an engineer to remove their change. On painful occasions, developers could wait 8+ hours after joining a train for it to ship, only for it to be removed due to a conflict between two pull requests in the train.

The most brute-force way to cure merge skew is to enable “Require status checks to pass → Require branches to be up to date before merging” on GitHub repository.

By doing so, you force authors to rebase, retest, and babysit PRs before every merge such that every PR has zero merge skew. This setting can work for small teams who don’t merge many PRs, but it immediately becomes frustrating at any significant scale. Engineers can find themselves racing one another to rebase and merge, sometimes getting blocked multiple times in a row. What's worse, the more merge skew your repo has, the more painful it becomes for engineers trying to race one another.

Requiring branches to be up-to-date in a large repo is like creating a queue for a restaurant where only one person can stand in line at a time, you can’t see the queue length from afar, and any time someone beats you in line, you’re punished by having to lap the block for 15 minutes. It’s frustratingly manual, a waste of time, and simply a lottery when several people are trying to enter at the same time.

A better solution would be an automated queue.

The first popular open-source attempt to solve this was Bors, created by the aforementioned Graydon. Bors was a simple GitHub app that offered an alternative to vanilla PR merging.

To use it, you need to stop clicking the big green merge button, and instead leave a comment with this in it on any pull request that looks good to you

When a user tagged a PR with a specific comment, Bors would pick up the PR and queue it to merge. With the queue's enforced order, Bors could run CI one at a time on each PR in the queue before merging—ensuring that every change landed on trunk with zero skew.

In some applications, such as slower open-source projects, Bors worked great. PRs might stand open for days or weeks, and CI only takes minutes. In these types of repos, merge skew is large, and the cost of queuing for an additional CI run is low.

Asking engineers to guess and check rebases is a waste of time. Bors saved Eng time with an automated queue - but it came with a significant new problem: time to merge.

The limitation of Bors for us was throughput
https://news.ycombinator.com/item?id=21586644

Consider a repo merging in 40 code changes a day, with an average CI time of 15 minutes - this isn’t at all unusual for modern startups. By forcing CI to run one at a time in order, a naive Bors-style queue can only merge 4 changes an hour or 32 changes in an 8-hour work day. Not only will the average engineer find themselves waiting hours for their PR to land, but some PRs might not even merge before the day is over! In the case where a queued PR fails CI due to a collision or flakey tests, the author is destined to be ejected and will have to queue at the back of the line all over again.

A simple queue succeeds at ensuring correctness but at the massive cost of engineering velocity—the same problem that collision reduction set out to avoid in the first place.

What we really want is a merge queue with high throughput. As with most topics in dev tools, Google was early in building a high-scale solution.

Google chooses to be a monorepo system company, with 95% of engineers at the company contributing to the same codebase. Those choice comes with a slew of meaningful benefits, but means that Google ran into collision-style disruptions years before Bors was released.

Google’s monorepo is so high velocity that they average changes a second. Their short tests take about 10 minutes, while larger hermetic integration tests can take 30+ minutes. It’s not rocket science to realize that sequentially queuing these changes for merge is instantly off the table.

To achieve the necessary throughput, Google leverages the classic principles of batching and parallelization. First, a Google code change passes a suite of pretests before merging. These tests don’t stress about merge skew, and act as a general sanity check for the code change. After these tests pass, the engineer can merge their change directly into Google’s monorepo trunk.

Unlike most GitHub repos, Google maintains two-pointers to their trunk. One pointer links to the latest commit on trunk head, while the other pointer links to the latest commit on trunk which has been verified to pass all all tests. Googlers can choose to pull and rebase code from either of the pointers, and many default to pulling from “last green.” (This pattern of dual git pointers is often referred to as “green-main” outside of Google.)

Google’s systems treat the gap between these two pointers as an optimistic-but-unverified queue. Their TAP system batches the commits in this queue into related groups, and runs large parallelized tests to verify that everything is healthy and passing. If a batch succeed, the lagging pointer moves up. If a batch fails, TAP automatically begins rolling back the changes and kicks off automated bisects in an attempt to find the specific commit that broke the build.

This system is complicated but fast. The average engineer has the pleasant devEx of being able to merge in a PR after a short battery of tests without needing to think about merge skew. Immediately, teammates have the option of pulling in that change and developing off of it. After merging, all commits get subjected to a longer battery of tests, which guarantee maximum correctness, regardless of how fast changes are merging in. Technically, Google could parallelize large tests on every single commit, but that would become unaffordable and expensive - which is why parallelized batches make more economic sense.

Google doesn’t know for sure that every commit on trunk passes every test - but that’s okay. Only certain checkpoints in the trunk end up getting deployed (because the deploy rate is much slower than the commit rate). And users default to pulling from verified commit checkpoints, meaning engineers aren’t at risk of pulling broken code. It works well enough for Google to have continued using this system for over a decade!

It’s all nice and good that Google has advanced internal monorepos to handle 100,000+ commits a day. They serve as a proof point of what’s practically possible and help give monorepo enthusiasts a leg to stand on. But most of the world’s developers don’t work on Google’s monorepos - they instead mostly work on GitHub repos, and sometimes GitLab. How can the rest of folks orchestrate fast merges with low rates of disruptive collisions?

The early Bors system provided an automated way to ensure zero merge skew - but it was painfully slow for repositories hoping to merge in tens of changes a day with low latency. (Bors eventually updated to support batching which increased throughput at the cost of no longer verifying every commit)

Larger mid-2010 tech companies like Shopify, Uber, and Airbnb invested teams of engineers in developing internal merge queue solutions, such as Deployboard and ShipIt. Each followed in the footsteps of Bors, relying on engineers to stop pressing the native GitHub merge button and instead queue into a merge orchestration server. Each re-implementation layered in the same optimizations like batching and parallelization to deal with the inevitable speed complaints.

In 2020, GitLab launched a “merge trains” feature to address merge skew. The same year, GitHub themselves began developing their own merge queue to address internal developer velocity. By mid-2023, they launched the solution to all GitHub users, acknowledging the universal need for coordination in high-velocity repositories. (You can read about it here in the emoji-riddled press release)

GitHub’s native merge queue incorporated many of the learnings from prior efforts - as of this writing, the feature takes of the traditional green merge button and replaces it with a “Merge when ready” button. The queue can be configured to support both parallelizing CI execution across all eneuqued PRs, as well as batching multiple PRs from the queue. Hacker News comments point out how both GitLab and GitHub roughly copied the pattern laid out by the Bors project:

At the end of the day, merge queues should be a commodity solution to a mathematical problem. Bad merge queues have the right idea but break down with higher throughput. The theoretical best solution to this problem is a merge queue that runs CI in parallel, batches PRs, automatically handles failures, and implements virtual queues based on build dependencies. Google and other mega-tech companies iterated their way to these solutions over a decade ago, and it’s only a matter of time until engineers at small companies normalize using similar tooling.

Built for the world's fastest engineering teams, now available for everyone