What is a merge queue, and does your team need one?

Overview

A few days ago, GitHub made its merge queue generally available, and the comments on this HN post inspired me to write a short piece about the purpose a merge queue serves. Having contributed to the design of both GitHub's and Graphite's merge queues, I want to make sure folks understand how a merge queue works and whether it's right for their team!

The problem: semantic merge conflicts

Our team works within a monorepo, with up to 15 developers pushing changes to our trunk branch (main) continuously every single day. Each of our PRs has to be approved and conflict-free, and it has to pass our CI checks before it can be merged.

As we grew the team, developed faster, and created more PRs, we began hitting cases where my_branch was passing CI and main was passing CI, but merging my_branch into main would break the build.

Consider this example:

main contains a function createUser that takes a name and an optional email parameter
my_branch utilizes the createUser function on a form submission, excluding email from the function call if it isn’t provided
someone_elses_branch makes the email parameter of the createUser function required instead of optional, and is merged into main after my PR becomes mergeable

As far as I know, my_branch is passing tests and is ready to be merged. my_branch will work fine until it is rebased on top of someone_elses_branch, and Git likely won’t report a merge conflict since the two changes might not even touch the same file.

Since my_branch is merged without first being rebased onto someone_elses_branch, this error goes unnoticed and BAM - main is broken! This is known as a semantic merge conflict: the merge is technically possible, but it results in a regression.
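
To make the scenario concrete, here is a minimal Python sketch of it. Only createUser and its name/email parameters come from the example above; the function bodies, the two versions of create_user, and the handle_form_submission helper are hypothetical stand-ins I made up for illustration.

```python
from typing import Callable, Optional


def create_user_on_main(name: str, email: Optional[str] = None) -> dict:
    """createUser as my_branch sees it: email is optional."""
    return {"name": name, "email": email}


def create_user_after_someone_elses_branch(name: str, email: str) -> dict:
    """createUser once someone_elses_branch lands: email is required."""
    return {"name": name, "email": email}


def handle_form_submission(create_user: Callable[..., dict], form: dict) -> dict:
    """my_branch: call createUser, leaving email out when the form has none."""
    if form.get("email"):
        return create_user(form["name"], form["email"])
    return create_user(form["name"])


form = {"name": "Ada"}  # a submission without an email

# Against the version of main that my_branch was tested on, this passes.
result_before = handle_form_submission(create_user_on_main, form)

# Once both branches are in main, the same call fails, even though Git
# never reported a textual conflict between the two changes.
try:
    handle_form_submission(create_user_after_someone_elses_branch, form)
    broke_after_merge = False
except TypeError:
    broke_after_merge = True
```

Each branch is fine on its own; the failure only shows up when the two changes are combined, which is exactly the state that nobody ran CI against.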

Our pre-merge checks take roughly 7-8 minutes. If your team has CI that takes even longer, this problem is exacerbated since potentially conflicting changes could be introduced to main at any point after CI has started on my_branch.

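GitHub's “Require branches to be up to date” branch protection rule

An immediate solution to this problem would be to enable the “Require branches to be up to date” branch protection rule for our main branch. With this rule on, a PR can only be merged once it contains the latest commit on main and has passed CI in that state.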

This quickly became a point of frustration for our team since we’d be prompted to rebase/re-run CI far too often. In the time it would take for our checks to run, main had advanced and I’d have to rebase and re-run CI again before I could merge, creating an endless cycle of rebase hell.

If we were already experiencing these issues at our small team size, I could only imagine what this problem looked like for companies with upwards of 100 or even 1,000 collaborators and/or those with long-running CI checks.

Enter merge queue

Instead of merging PRs directly into the trunk, our developers started submitting their changes to our merge queue. The merge queue is solely responsible for determining the order in which all PRs in the repo should be merged, ensuring the PRs are healthy to merge in that order (watching for both textual and semantic conflicts), and finally, performing the actual merge.

The simplest implementation of a merge queue is first-in, first-out, but a more advanced merge queue like ours can optimize the order in which PRs are merged (more on this below). Ultimately, a merge queue prevents semantic merge conflicts by automating the rebase process during merge, and ensuring that the trunk branch stays “green.”
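
As a rough illustration of the first-in, first-out case, here is a small Python sketch. It is a toy model under my own assumptions, not how GitHub's or Graphite's queue is implemented: trunk is just a list of merged PR names, ci_passes is a hypothetical stand-in for a real CI run, and the PR names are invented.

```python
from collections import deque
from typing import Callable, List, Tuple

# Hypothetical stand-in for CI: takes the candidate trunk, returns pass/fail.
CiCheck = Callable[[List[str]], bool]


def process_fifo_queue(
    trunk: List[str], queue: List[str], ci_passes: CiCheck
) -> Tuple[List[str], List[str]]:
    """Merge PRs in queue order, testing each one on top of the latest trunk."""
    pending = deque(queue)
    trunk = list(trunk)
    rejected: List[str] = []
    while pending:
        pr = pending.popleft()
        candidate = trunk + [pr]  # models rebasing the PR onto current trunk
        if ci_passes(candidate):
            trunk = candidate  # only a tested state ever becomes trunk
        else:
            rejected.append(pr)  # kicked out; trunk stays green
    return trunk, rejected


def ci_passes(candidate: List[str]) -> bool:
    # Assumed failure mode: the form PR breaks once email is required.
    return not ("required_email" in candidate and "signup_form" in candidate)


trunk, rejected = process_fifo_queue(
    ["base"], ["required_email", "signup_form", "docs_fix"], ci_passes
)
```

The key property is that CI always runs against the exact state that would become trunk, so the semantic conflict is caught in the queue instead of on main.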

Our merge queue has a few additional features, including removals, hot-fixes, and pausing/resuming the queue. If you’re curious, you can read more in our merge queue documentation!

Optimizations

Preventing main from breaking is great, but enforcing an ordering on merges can come with the unfortunate side effect of slowing them down. The merge time in a merge queue depends mostly on your CI run time. Thankfully, there are a number of optimizations that the queue can employ to increase throughput as much as possible, even to the point where merges can be faster than they were before adopting the queue!

A good metric for deciding whether you need merge queue optimizations is your merge budget: the number of PRs you can merge when each one waits on its own full CI run. If your checks take 20 minutes, your merge budget is 3 merges per hour, or 72 merges per day if the queue runs around the clock. If your team is merging more changes than that budget allows, or your CI is extremely long, your team would benefit from using a merge queue with optimizations like the ones below.
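
Before getting to those optimizations, the merge budget arithmetic can be sketched in a few lines of Python. The merge_budget function and its hours_per_day parameter are my own naming; the 24-hour default is an assumption that matches the 72-per-day figure.

```python
from typing import Tuple


def merge_budget(ci_minutes: float, hours_per_day: float = 24.0) -> Tuple[float, float]:
    """Return (merges per hour, merges per day) when every merge waits on one CI run."""
    per_hour = 60.0 / ci_minutes
    return per_hour, per_hour * hours_per_day


# The 20-minute example from the text.
per_hour, per_day = merge_budget(20)

# Same checks, but only counting an assumed 8-hour working day.
_, per_working_day = merge_budget(20, hours_per_day=8)
```

If your team mostly merges during working hours, the second number is probably the more honest budget to compare your merge rate against.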

Batching

One way to increase PR throughput is to run CI on PRs in “batches.” For example, if I have 6 PRs in the queue and a batch size of 3, rather than executing 6 CI runs (one per PR), the merge queue only needs to run CI twice in the best case: once on batch 1, which contains trunk + PR1 + PR2 + PR3, and once on batch 2, which contains trunk + PR4 + PR5 + PR6. Since most PRs will not have semantic merge conflicts with each other, the expectation is that both batches will pass CI, and we’ve cut the number of CI runs needed to merge these PRs from 6 to 2, a reduction of roughly 67%!

In the event that a batch does fail, the merge queue will break the batch down into smaller groups and run CI on each group to determine the source of the failure. If the CI run for batch 1 fails, the queue would then run CI on trunk + PR1 + PR2 and on trunk + PR3. If the first group passes CI and the second fails, PR3 is determined to be the source of the failure and will be removed before moving on to batch 2. An engineering team can look at how often this happens to determine its optimal batch size.
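
The batch-then-split idea can be sketched roughly as follows. This is a simplified model under my own assumptions rather than any real queue's algorithm: trunk is a list of PR names, ci_passes is a hypothetical CI stand-in, and, unlike the walkthrough above, the second group is tested on top of whatever survived from the first group so that every merged state has been tested.

```python
from typing import Callable, Dict, List, Tuple

CiCheck = Callable[[List[str]], bool]  # hypothetical stand-in for a CI run


def merge_batch(
    trunk: List[str], batch: List[str], ci_passes: CiCheck, stats: Dict[str, int]
) -> Tuple[List[str], List[str]]:
    """Try the whole batch in one CI run; on failure, split it and retry each group."""
    stats["ci_runs"] += 1
    if ci_passes(trunk + batch):
        return trunk + batch, []
    if len(batch) == 1:
        return trunk, batch  # a single failing PR is the culprit: evict it
    mid = (len(batch) + 1) // 2  # 3 PRs split into groups of 2 and 1
    trunk, bad_left = merge_batch(trunk, batch[:mid], ci_passes, stats)
    trunk, bad_right = merge_batch(trunk, batch[mid:], ci_passes, stats)
    return trunk, bad_left + bad_right


def process_in_batches(
    trunk: List[str], queue: List[str], batch_size: int, ci_passes: CiCheck
) -> Tuple[List[str], List[str], int]:
    """Process the queue batch by batch, returning trunk, rejected PRs, and CI run count."""
    stats = {"ci_runs": 0}
    rejected: List[str] = []
    for start in range(0, len(queue), batch_size):
        batch = queue[start:start + batch_size]
        trunk, bad = merge_batch(trunk, batch, ci_passes, stats)
        rejected += bad
    return trunk, rejected, stats["ci_runs"]


queue = ["PR1", "PR2", "PR3", "PR4", "PR5", "PR6"]

# Best case: nothing fails, so 6 PRs need only 2 CI runs.
_, _, best_case_runs = process_in_batches(["trunk"], queue, 3, lambda c: True)

# Assumed failure: any candidate containing PR3 fails CI.
final_trunk, rejected, runs_with_failure = process_in_batches(
    ["trunk"], queue, 3, lambda c: "PR3" not in c
)
```

In the failing case the queue still needs fewer runs than testing all 6 PRs one by one, but the gap shrinks as failures become more common, which is why the failure rate should drive the batch size.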

If you’re interested in reading more about merge queue optimization, check out the paper “Keeping master green at scale” by the backend engineering team at Uber.

Fast-forward merge (for Graphite users!)

Merge queues enforce linear git history. Since a stack of changes also naturally has a linear history between the PRs in the stack, stacks of changes work really well in a stack-aware merge queue (like Graphite’s). The Graphite Merge Queue is able to rebase the entire stack onto trunk together, process CI on the PRs in parallel, and then quickly “fast-forward” the trunk branch to the top of the stack, effectively merging the stack atomically.
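
Here is a loose Python sketch of that flow. It is a toy model and not Graphite's actual implementation: trunk and the stack are lists of names, ci_passes is a hypothetical CI stand-in, and leaving trunk untouched when any PR fails is my own assumption about the failure case.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

CiCheck = Callable[[List[str]], bool]  # hypothetical stand-in for a CI run


def merge_stack(
    trunk: List[str], stack: List[str], ci_passes: CiCheck
) -> Tuple[List[str], bool]:
    """Test every PR of a rebased stack in parallel, then fast-forward trunk."""
    if not stack:
        return trunk, True
    # After rebasing onto trunk, PR i sits on top of trunk plus PRs 0..i-1.
    candidates = [trunk + stack[: i + 1] for i in range(len(stack))]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(ci_passes, candidates))
    if all(results):
        return candidates[-1], True  # fast-forward trunk to the top of the stack
    return trunk, False  # assumption: nothing merges if any PR fails


stack = ["PR_a", "PR_b", "PR_c"]

# Every PR in the stack passes, so trunk moves to the top in one step.
new_trunk, merged = merge_stack(["main"], stack, lambda c: True)

# Assumed failure: any state containing PR_b fails, so trunk does not move.
same_trunk, merged_with_failure = merge_stack(
    ["main"], stack, lambda c: "PR_b" not in c
)
```

Because the CI runs happen side by side, the wall-clock cost of merging the whole stack is closer to one CI run than to one run per PR.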

So do I need a merge queue?

Merge queues aren’t right for every team and whether or not your team needs a merge queue depends on a number of factors. If you’re experiencing any of the issues below, enabling a merge queue could increase your team’s productivity:

Your trunk branch is frequently in a broken state
Developers spend so much time rebasing their changes that PR throughput is affected
Your team combines long-running checks with a high PR merge rate
Your team is growing so quickly that you’ll soon run into the above issues

If your team is stacking PRs with Graphite, using a stack-aware merge queue like Graphite’s is a no-brainer whether or not you're experiencing the issues above, thanks to the optimizations that fast-forward merge provides for stacks of PRs. The Graphite Merge Queue is offered on the Graphite Team/Enterprise plans, and you can learn more about it here.