
Graphite's engineering team has a culture of moving extremely fast. This is very much a stylistic choice - some engineering teams move more carefully, but we like to iteratively ship small, fast changes out to users as quickly as possible. That culture is reflected in the nature of the Graphite product itself, which aims to accelerate code changes - and I’d argue this is a somewhat Conway’s Law-esque result.

The downside to our organization’s need for speed is that I’m also personally responsible for keeping the site online. For years, I was the first to get paged, and oh boy was I paged a lot during those early days.

In its first year, Graphite was plagued with regressions. Our small engineering team and our blistering pace meant that we’d frequently ship a regression to production and then have to scramble to roll back or fix forward. The approach was functional, but we were relying on users who hit the paper cuts to report them to us. Without slowing down our velocity (culturally unattractive), or getting ten times better at unit testing (tricky because of our tight integration with GitHub’s API), there was little we could do to catch regressions before they impacted users.


Note

Greg spends full workdays writing weekly deep dives on engineering practices and dev-tools. This is made possible because these articles help get the word out about Graphite. If you like this post, try Graphite today, and start shipping 30% faster!


Something needed to change. We needed a reliable way to catch unknown unknowns without imposing extra work on developers. Around the end of 2022, after much debate, I chose to build a Graphite staging environment where we’d bake all deployments automatically, pre-production.

While I could have simply duplicated our application servers and created a second Postgres cluster, I decided to go whole-hog and create an entirely separate AWS account, complete with duplicate load balancers, VPCs, S3 buckets, ECS containers, and more. I used a third “tooling” AWS account to house what little was shared - in particular, an AWS CodePipeline pipeline that deployed changes to both staging and production.

After creating a staging account, I tried dogfooding Graphite on the staging servers for a week. Initially, I was blocked by a few hardcoded URL bugs - but eventually, I got the experience workable. Because Graphite used GitHub production as our shared database, I was still able to interface with teammates’ PRs, even if we were working across two separate Postgres clusters.

With me happily doing all my work on the staging cluster, I next added a banner visible only to employees on prod, reminding them that they should instead be using the staging version of the site to help us dogfood the stream of daily changes. This nudging banner proved to be a perfect balance of driving employees to use staging for daily work while still making production easily accessible when necessary.

At this point, I had two duplicate environments, with employees on one and all other Graphite users on the other. Each new release, however, would deploy to both environments simultaneously, meaning that unexpected regressions would hit external users at the same time as employees.

💡 How did I do all this without breaking deployments for our team, you may ask? I did indeed break deployments - but I tactically did it over a long weekend. While my partner was out of town on a work trip, I was left alone re-watching Scott Pilgrim vs. The World on one screen while running hundreds of trial-and-error deployments on another. The stress of knowing I had to reassemble our deployments before the work week pushed me to land the first pass all at once. After a few sleepless, Scott Pilgrim-filled nights, I eventually got things into a stable state.

With everything in place, the final step was to reap the actual benefits of the project. With begrudging team buy-in, I updated our CodePipeline pipeline to sequence staging deployments, followed by a one-hour wait stage and, finally, an automatic promotion to production. We now had a buffer before production deployments.
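The ordering of those three stages can be modeled in a few lines. This is a minimal sketch, not Graphite's actual pipeline configuration: the `Pipeline` class, its fields, and the `promotion_paused` flag are hypothetical names, and time is passed in as plain seconds to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Optional

BAKE_SECONDS = 60 * 60  # the one-hour wait stage described above


@dataclass
class Pipeline:
    """Toy model: deploy to staging, bake, then promote to production."""

    promotion_paused: bool = False
    staging_sha: Optional[str] = None
    prod_sha: Optional[str] = None
    staged_at: Optional[float] = None

    def deploy_to_staging(self, sha: str, now: float) -> None:
        # Staging always takes the newest build, even while promotion is paused.
        # Simplification: a new staging build restarts the bake timer.
        self.staging_sha = sha
        self.staged_at = now

    def maybe_promote(self, now: float) -> bool:
        """Promote the staged build once the bake has elapsed and nothing is paused."""
        if self.promotion_paused or self.staging_sha is None:
            return False
        if self.prod_sha == self.staging_sha:
            return False  # already promoted
        if now - self.staged_at < BAKE_SECONDS:
            return False  # still baking
        self.prod_sha = self.staging_sha
        return True
```

The property that matters is that `deploy_to_staging` never checks `promotion_paused`: staging keeps receiving builds while production is held back.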

I didn’t need to wait long to test the new capability. A few days after adding the staging bake, we released a regression. A bug broke our diffing algorithm, preventing the site from loading pull requests. Half of our site was inaccessible - Nick on our team noticed within 60 seconds of deploying. Previously, this would have taken us half an hour to roll back or longer to fix forward, meaning real external engineers would be blocked from reviewing or merging code. What before would have been a post-mortem-worthy incident was now a minor distraction.

We calmly navigated to AWS and paused our code pipeline, disabling automated promotion from staging to production. With deployments paused, we could take all the time we needed to debug, fix, and re-deploy to staging without stressing that external users would be having a bad time. Folks immediately felt the benefit.
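Pausing doesn't have to happen in the AWS console. CodePipeline's API has `DisableStageTransition` and `EnableStageTransition` operations, which boto3 exposes as `disable_stage_transition` and `enable_stage_transition`. The sketch below is one way to wrap them, not necessarily what we run: the pipeline name `deploy` and stage name `Production` are hypothetical, and the client is passed in so the functions can be exercised without AWS credentials.

```python
def pause_promotion(client, pipeline_name: str, stage_name: str, reason: str) -> None:
    """Block new executions from entering `stage_name` (e.g. the prod stage).

    `client` is expected to be boto3.client("codepipeline") or a stand-in
    with the same methods.
    """
    client.disable_stage_transition(
        pipelineName=pipeline_name,
        stageName=stage_name,
        transitionType="Inbound",  # stop executions from entering the stage
        reason=reason,
    )


def resume_promotion(client, pipeline_name: str, stage_name: str) -> None:
    """Re-enable the transition so baked builds can flow into `stage_name` again."""
    client.enable_stage_transition(
        pipelineName=pipeline_name,
        stageName=stage_name,
        transitionType="Inbound",
    )


# Example usage (hypothetical pipeline and stage names; requires boto3 and credentials):
#   import boto3
#   codepipeline = boto3.client("codepipeline")
#   pause_promotion(codepipeline, "deploy", "Production", "diffing regression on staging")
#   ...debug, fix, re-deploy to staging...
#   resume_promotion(codepipeline, "deploy", "Production")
```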

Subsequently, we found ourselves pausing the deployment pipeline for accidental regressions about once or twice a week. Our rate of production regressions dropped by three-quarters, and engineers were able to continue iterating just as fast as before.

We are heavy users of the pause-deployments capability…

I’d be lying if I said everything was perfect off the bat.

Teammates initially complained about the extra hour they needed to wait to see their changes out in the wild. Their fears were slowly assuaged as folks shifted their focus to the staging environment, which still received new builds just as quickly as before.

Folks were also initially annoyed at needing to maintain two environments. In particular, Postgres migrations became a spot of possible drift - someone might apply a DB schema migration to staging but forget to apply it to prod, or vice versa. Migrating both was also more work than before. We considered building some auto-migration system, but felt there were too many edge cases in our DB to ever guarantee safety there. In fact, I’ve come to see running migrations twice as more of a feature than a bug. While doing so indeed takes more work, it gives engineers a chance to test out applying a risky migration to the site before repeating it in production. It only took a few scary DB operations before the sentiment on the team shifted to appreciating having an additional DB to stage migrations on.
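One lightweight guard against that kind of drift is to compare which migrations each environment has recorded. This is a sketch under assumptions rather than something described above: it assumes each database can report a list of applied migration names (for example from a migrations table), and the names in the usage example are made up.

```python
def migration_drift(staging: list[str], prod: list[str]) -> dict[str, list[str]]:
    """Return migrations applied in one environment but not the other."""
    staging_set, prod_set = set(staging), set(prod)
    return {
        "missing_in_prod": sorted(staging_set - prod_set),
        "missing_in_staging": sorted(prod_set - staging_set),
    }


# Hypothetical usage: staging is one migration ahead of prod.
drift = migration_drift(
    staging=["001_init", "002_add_index"],
    prod=["001_init"],
)
```

A non-empty `missing_in_prod` right after a staging migration is expected; one that lingers is the drift worth flagging.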

Being able to pause the deployment pipeline isn’t a perfect catch-all. While it buys time to fix a regression, it also blocks folks across the team from shipping new code. Because pipelines are linear, no one can get new changes out while an issue is being triaged, meaning it’s somewhat unhealthy to lock for more than 12 hours. For one, you block unrelated bug fixes from deploying, which makes it difficult to handle two different incidents at the same time. Secondly, locking deployments for too long risks unleashing a tidal wave of changes onto production. Any regressions beyond this point become hard to correlate back to a specific PR as the accumulated changeset grows to be unwieldy.

Lastly, we were initially concerned that seeding test data into our staging environment would be a tricky maintenance burden. In practice, this proved to be no issue at all. Because Graphite as a service does not share data across organizational boundaries, we could simply shift all our internal usage to the staging environment and populate it with an organic, long-lived dataset for just our org.

Over the following year, the team improved upon my initial setup. First, we added an “emergency deployment pipeline” that could be manually triggered. This pipeline would build the latest code artifact but skip both staging and bake stages, instead deploying straight to production as fast as possible. We only trigger this pipeline on rare occasions, but it’s nice to have a break-glass way of fixing forward as quickly as possible.

Secondly, we added a manual “skip-bake” command that would call AWS and automatically finish whatever bake period was running. This command has proven useful in times when an engineer is confident that the staging environment is healthy and simply wants to roll their change out to all users faster than would otherwise be possible.

Thirdly, we added code to programmatically lock staging-to-prod promotions outside of weekdays, 9-6 working hours. This small change has saved more on-call sleep than anything else we’ve done. Best of all, deployments continue releasing to staging on weekends or nights, which helps off-hour engineers feel like they can keep releasing code without endangering users.
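The check itself is tiny. A minimal sketch, assuming "9-6" means 09:00 up to but not including 18:00 in the team's local time; the function name is hypothetical and holidays are ignored:

```python
from datetime import datetime


def promotion_allowed(now: datetime) -> bool:
    """True only on weekdays between 09:00 and 18:00 (local time of `now`)."""
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    in_working_hours = 9 <= now.hour < 18
    return is_weekday and in_working_hours
```

A scheduled job could call this and lock or unlock the staging-to-prod promotion accordingly, while staging deployments carry on untouched.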

Lastly, we updated our employee-only page within Graphite to include GUI-based deployment controls. We added the ability to see which SHA each environment was running and a button to lock deployments during an incident. These changes made on-call noticeably more pleasant: locking promotion became muscle memory each time we suspected a regression had deployed.

I take no credit for being the first to invent a staging bake period, nor even self-discovering the idea. Rather, everything I’ve built at Graphite has been inspired by my time working on Airbnb’s Continuous Delivery team. There, I worked to help migrate thousands of microservices onto the open-source Spinnaker and create a default pattern for all deployments to release with automated canary analysis. While not every service used this pattern, the majority adopted it, and I witnessed firsthand the major reduction in incidents. You can read more about the amazing work my old teammates completed here.

That being said, not everything has been the same between Airbnb and Graphite. While Airbnb had a cross-service staging environment, there was no continuous human traffic on it. That made test data a real struggle while I was there. We tried various processes of automatically generating data and mirroring production traffic, but in practice, everything we tried at Airbnb paled in comparison to the steady stream of dogfood activity I see today at Graphite.

While Graphite’s deployments have Airbnb beat when it comes to staging traffic, Airbnb had a more sophisticated blue-green prod rollout. Without realistic staging traffic, gradual canary releases to production became important. Each production deployment would gently ramp traffic up to the new release before cutting over, and Spinnaker would call Datadog’s API to monitor for any spikes in errors. If a pre-specified metric regressed, Spinnaker would automatically halt the production rollout and cut traffic back to the previously safe service version. We don’t have this sophistication at Graphite (yet) - if we don’t spot a regression through our organic staging usage within an hour, we still promote the bug to prod and wreak havoc on unwitting users.
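The shape of that canary loop can be sketched in a few lines. This is not Spinnaker's implementation or Airbnb's configuration: `get_error_rate` stands in for a metrics query (such as a call to Datadog's API), and the traffic steps and error threshold are made-up values.

```python
from typing import Callable, Sequence


def canary_rollout(
    get_error_rate: Callable[[int], float],
    steps: Sequence[int] = (5, 25, 50, 100),
    max_error_rate: float = 0.01,
) -> int:
    """Ramp traffic to the new release and return the final traffic percentage.

    Returns 100 after a full cutover, or 0 if the metric regressed at any
    step and traffic was cut back to the previous version.
    """
    for percent in steps:
        # A real system would shift `percent` of traffic here and wait before
        # sampling; this sketch only models the decision.
        if get_error_rate(percent) > max_error_rate:
            return 0  # halt the rollout: all traffic returns to the old version
    return 100
```

For example, `canary_rollout(lambda percent: 0.05 if percent >= 25 else 0.001)` halts at the 25% step and returns `0`.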

If you’ve enjoyed my story about how we implemented a staging bake at Graphite and are interested in doing something similar on your team, here’s what you’ll need:

First off, you’ll need some kind of deployment pipeline. Automatically deploying the top of your main branch to your servers is not enough. You need a system that allows you to build an artifact and progress it through a sequence of stages. You need that system to support manual pausing, waiting, and resuming, as well as the ability to continue deploying to staging even if the prod-promotion transition is paused. You could hand-code this system or self-host Spinnaker - though both options are too time-intensive for a smaller startup. In my experience, the only pre-built tool I’ve found that fits my needs is AWS CodePipeline, though there might be something better out there. If you know of a good one, please let me know!

Secondly, you need a way to populate your staging environment with realistic data. There are three options I know of:

  1. Write custom logic to generate fake data in the environment

  2. Clone and sanitize production data

  3. Have some set of real people living on staging creating the data.

In my opinion, the third option is the best, followed by option two. I’d recommend against the first option - creating test data is a Sisyphean maintenance burden that never quite lives up to the real thing. Back in 2018, I created a system for generating fake test data at Airbnb, and I was never quite satisfied with the outcome.

Thirdly, you need activity on the staging database. Having data there is not enough; you need something to trigger user flows and peck at APIs. At Graphite, this is our own engineering team using our application to create all our own PRs, reviews, and merges. At Airbnb, it was a mixture of API traffic replayed against specific services, coupled with begging engineers to “poke staging” and make sure their most recent merge looked functional. In someone else’s application, it might be QA testers or even a free tier of users. If possible, I’d strongly recommend finding a way for your traffic to be real-time humans.

Lastly, you need a way to detect and alert on regressions. At Airbnb, this was mostly done through Datadog monitors and automated server monitoring for statistical regressions. At Graphite, this is an engineer on our team noticing that the comment button no longer works and reporting it in our internal Slack. We also have the added benefit that the person spotting the regression is the same person who’s qualified to pause the pipeline and debug the issue - though I respect that not all products benefit from being so thoroughly dogfooded by the engineering team creating them.

If you can think of an answer to each of these requirements, I’d strongly recommend you consider implementing a staging bake. The upfront cost of setting one up will pay back tenfold, and anecdotally, I can assert that the maintenance cost is lower than any other approach I know to catch regressions.

Choosing to build a staging bake process at Graphite took me longer than it should have. I was more afraid of drift than I should have been. I was worried that the cost of maintaining dual Postgres instances, each needing the same schema migrations, would become cripplingly annoying. I wondered if engineers on the team would ever become comfortable with the added wait time. In the end, however, none of these fears came true, and I’d say every member of our engineering team appreciates the staging bake’s existence.
