99.9% Uptime
One of the most impressive parts of Amazon Web Services is its commitment to uptime. For each of their offerings, AWS commits to a “Monthly Uptime Percentage” and if they miss the mark, a service credit is granted.
Take, for example, the SLA for Amazon Simple Storage Service: if S3 falls below 99.9% uptime — a budget of roughly 43 minutes of downtime per month — AWS starts issuing partial refunds.
Scanning through the SLAs, 99.9% is most common, but in a few cases AWS goes even higher. At the top? The DynamoDB Global Tables SLA and the AWS Key Management Service SLA, which each commit to five nines, 99.999% — about 26 seconds of downtime allowed per month.
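For concreteness, the math behind those budgets is quick to check (assuming a 30-day month):

```python
# Downtime budget implied by a monthly uptime SLA, assuming a 30-day month.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

for sla in (99.9, 99.99, 99.999):
    budget_minutes = MINUTES_PER_MONTH * (1 - sla / 100)
    print(f"{sla}% uptime -> {budget_minutes:.1f} minutes "
          f"({budget_minutes * 60:.0f} seconds) of downtime per month")
```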
To appreciate how impressive this is, consider your team’s current deployment infrastructure. How long would it take you to completely roll back a bad change? How long does it typically take you to detect a bad release? Would you be able to do both of these consistently, spending no more than 43 minutes cumulatively over the course of the month?
Automation to the rescue
It’s easy to think of incident management as manual heroics: an oncall jumps in, debugs some cryptic error, and uncovers some deeply hidden behavior in the system.
The reality of AWS’s tight SLAs, however, means there’s little room for manual action at all: when the monthly downtime budget is measured in minutes (and, for the five-nines services, in seconds), everything must be automated.
In a 2021 interview, Clare Liguori, a Principal Engineer at AWS, mentioned: “when an alarm goes off and the on-call gets engaged, usually if it's a problem caused by a deployment, the pipeline has already [started] rolling back that change before the on-call engineer is even logged in and started looking at what's going on.”
Liguori’s earlier 2020 blog post, “Automating safe, hands-off deployments”, explains how much of this automation works.
The post has inspired us at Graphite to think more deeply about our own CodePipeline instance — and in the spirit of learning, I want to highlight some of the key principles here.
Idea 1: Automated rollbacks
At the core of AWS’s automated deployment pipeline is the ability to conduct automated rollbacks.
This removes the toil of having an engineer watch each and every deploy go out, avoids the risk of an engineer fumbling the rollback steps, and ensures it all happens with near-immediate speed.
At multiple steps, as the pipeline rolls out changes to its pre-production environment, a new Availability Zone, or a new Region, the pipeline monitors a set of metrics (some custom to the service, others core vitals like CPU and memory usage). If any of these regress, the pipeline halts the deployment, automatically rolls back the change, and notifies the oncall. The post gives an example of the composite alarm that gates one such rollout:
ALARM("FrontEndApiService_High_Fault_Rate") ORALARM("FrontEndApiService_High_P50_Latency") ORALARM("FrontEndApiService_High_P90_Latency") ORALARM("FrontEndApiService_High_P99_Latency") ORALARM("FrontEndApiService_High_Cpu_Usage") ORALARM("FrontEndApiService_High_Memory_Usage") ORALARM("FrontEndApiService_High_Disk_Usage") ORALARM("FrontEndApiService_High_Errors_In_Logs") ORALARM("FrontEndApiService_High_Failing_Health_Checks")
One other interesting idea mentioned in the post: the deployment pipeline monitors not just the metrics of the service under deployment, but those of its upstream and downstream dependencies as well — the idea being that the freshly deployed service may be placing extra load on its dependencies, causing their metrics to dip.
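Composite alarms make this straightforward to express: the rollback rule can simply OR in alarms owned by the service’s dependencies. A hypothetical sketch using boto3 (the dependency alarm names are invented, and the underlying alarms would need to already exist):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical rollback rule: the service's own alarms plus alarms on an
# upstream and a downstream dependency. Any one of them halts the rollout.
rollback_rule = " OR ".join([
    'ALARM("FrontEndApiService_High_Fault_Rate")',
    'ALARM("FrontEndApiService_High_P99_Latency")',
    'ALARM("UpstreamLoadBalancer_High_5xx_Rate")',      # invented dependency alarm
    'ALARM("DownstreamOrderService_High_Fault_Rate")',  # invented dependency alarm
])

cloudwatch.put_composite_alarm(
    AlarmName="FrontEndApiService_Rollback_Alarm",
    AlarmRule=rollback_rule,
)
```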
Idea 2: Progressive rollouts
One-box
At each stage where a change is pushed (pre-production stages, production), the deployment starts with the “one-box” stage where changes are deployed to just one box, “a single virtual machine, single container, or a small percentage of Lambda function invocations”.
The deployment pipeline monitors the metrics of the box specifically; as earlier, if the metrics regress, the deployment is automatically halted and rolled back.
In addition to protecting top-line service health, this one-box approach helps guarantee backwards and forwards compatibility: for part of the deploy, the service runs both the old and the new (one-box) versions of the code side by side. As an additional layer of compatibility checking, the service under deployment also connects to other pre-production resources at some stages and to production versions at others.
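In spirit, the one-box stage is just the first, smallest slice of the fleet. Here’s a toy sketch of what carving a fleet into a one-box stage plus the rest might look like; the structure (and the Stage helper) is ours for illustration, not AWS’s actual deployment system:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    hosts: list[str]

def plan_rollout(hosts: list[str]) -> list[Stage]:
    """Deploy to a single box first, then to the rest of the fleet."""
    one_box, rest = hosts[:1], hosts[1:]
    return [Stage("one-box", one_box), Stage("fleet", rest)]

for stage in plan_rollout([f"host-{i}" for i in range(12)]):
    print(stage.name, "->", len(stage.hosts), "hosts")
    # After each stage: watch the rollback alarm (scoped to just these hosts)
    # before continuing, and roll back automatically if it fires.
```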
Waves
After one-box, deployments continue to roll out slowly in isolated portions of the service, either single Availability Zones or individual “cells” (a “service’s individual internal shards”).
The key idea here is to prevent system-wide degradation by limiting deployments to units the service can withstand losing. “All services are scaled to withstand losing an Availability Zone in the Region, so we know that the service can still serve production load at this capacity.” By deploying to only one Availability Zone in a region at a time, even a bad deploy’s impact is contained.
In order to balance safety with speed, as confidence builds in a change through successive successful deployments, the “waves” in which it gets deployed become successively larger (e.g. deploying to more regions at once).
It’s also worth noting that multiple waves can be in flight at once, each carrying a different version: a change from earlier in the day may be rolling out in a large, later-stage wave while a change from later in the day is just beginning its initial waves.
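To picture the shape of these waves, here’s an illustrative sketch; the Region list and wave sizes below are invented, not AWS’s real schedule:

```python
# Illustrative wave plan: one Region at a time at first,
# then progressively larger groups as confidence builds.
REGIONS = [
    "us-east-1", "us-west-2", "eu-west-1", "eu-central-1",
    "ap-southeast-1", "ap-northeast-1", "sa-east-1",
]
WAVE_SIZES = [1, 2, 4]  # invented sizes; real pipelines tune these per service

def plan_waves(regions: list[str], wave_sizes: list[int]) -> list[list[str]]:
    waves, index = [], 0
    for size in wave_sizes:
        waves.append(regions[index:index + size])
        index += size
    if index < len(regions):
        waves.append(regions[index:])  # everything left goes in a final wave
    return waves

for number, wave in enumerate(plan_waves(REGIONS, WAVE_SIZES), start=1):
    print(f"Wave {number}: {wave}")
```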
Idea 3: Slow rollouts (baking)
As an extra layer of safety, AWS waits between waves (longer after the earlier waves, shorter after the later ones) to give each wave extended time to catch regressions surfaced by real user traffic.
The “bake” period is also smart enough to require a certain volume of traffic; it “includes requirements to wait for a specific number of data points in the team’s metrics (for example, "wait for at least 100 requests to the Create API") to ensure that enough requests have occurred to make it likely that the new code has been fully exercised.”
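Both requirements (a minimum wait and a minimum volume of observed traffic) amount to a simple gate. A rough sketch, with placeholder thresholds and a placeholder request-count callback of our own:

```python
import time
from typing import Callable

def bake(min_seconds: int, min_requests: int, request_count: Callable[[], int]) -> None:
    """Hold the pipeline at this stage until BOTH conditions hold:
    enough wall-clock time has passed AND enough traffic has been observed."""
    start = time.time()
    while True:
        waited_long_enough = time.time() - start >= min_seconds
        seen_enough_traffic = request_count() >= min_requests
        if waited_long_enough and seen_enough_traffic:
            return
        time.sleep(60)

# Trivially runnable example; in the post's terms this would be something like
# "wait at least an hour after the one-box stage and at least 100 Create API requests".
bake(min_seconds=0, min_requests=100, request_count=lambda: 250)
print("bake complete, promoting to the next wave")
```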
In practice, this can add a significant amount of time to deploys: “A typical pipeline waits at least one hour after each one-box stage, at least 12 hours after the first regional wave, and at least two to four hours after each of the rest of the regional waves, with additional bake time for individual Regions, Availability Zones, and cells within each wave.”
In total this can mean “four or five business days” to deploy a typical service, and even more for services that warrant extra-careful deploys.
Idea 4: Rollouts, not just for code changes
It’s not just the obvious changes — e.g. application code — that go through pipelines; items like feature flags and configuration changes do too.
Each type of change lives in its own pipeline, which adds safety (each pipeline hooks into the same safety infrastructure, e.g. a bad feature flag change can be automatically detected and rolled back) and also helps with velocity: for example, “Application code changes that fail integration tests and block the application pipeline don’t affect other pipelines.”
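Because every pipeline hooks into the same safety machinery, the gate itself can stay generic. A minimal sketch of that idea (the flag name and callbacks here are invented for illustration):

```python
from typing import Callable

def release(change: Callable[[], None],
            rollback: Callable[[], None],
            healthy: Callable[[], bool]) -> None:
    """The same safety gate, reused by every pipeline regardless of change type."""
    change()
    if not healthy():
        # A bad flag flip or config change gets detected and reverted
        # the same way a bad code deploy does.
        rollback()

# Hypothetical feature-flag pipeline reusing the gate:
release(
    change=lambda: print("enable flag: new_checkout_flow"),
    rollback=lambda: print("disable flag: new_checkout_flow"),
    healthy=lambda: True,  # in practice: check the composite rollback alarm
)
```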
Idea 5: Deployment blockers
Sometimes the safest deployment is no deployment at all.
AWS’s pipeline can block deployments when there are active incidents or during particular time windows (e.g. outside of core business hours).
This is custom per team; some “prefer to release software when there are plenty of people who can quickly respond and mitigate an issue caused by a deployment” whereas others “prefer to release software when there is low customer traffic.”
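The blocker itself is a simple predicate. Here’s a minimal sketch of the idea; the hours, days, and incident flag are illustrative rather than any team’s actual policy:

```python
from datetime import datetime, time as time_of_day

BUSINESS_HOURS = (time_of_day(9, 0), time_of_day(17, 0))  # illustrative window
BUSINESS_DAYS = {0, 1, 2, 3, 4}  # Monday through Friday

def deploys_allowed(now: datetime, active_incident: bool) -> bool:
    """Block deploys during active incidents and outside core business hours."""
    if active_incident:
        return False
    start, end = BUSINESS_HOURS
    return now.weekday() in BUSINESS_DAYS and start <= now.time() <= end

print(deploys_allowed(datetime.now(), active_incident=False))
```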
Takeaways
At Graphite, we’ve been inspired to adopt some of these ideas: we block our deployment pipeline outside of core business hours, and we now bake changes for a set period of time in a pre-production environment before releasing them to production.
Other bits of AWS’s deployment process have been harder to adopt as-is without AWS’s scale and maturity. As a startup, we count speed as one of our strongest assets; we can’t afford to roll changes out to production over four or five days the way AWS does. And for us, the key ingredient of automated rollouts — having high-signal health metrics — is an area that is slowly improving but still has a ways to go.
The nice part is that these ideas don’t have to be adopted in an all-or-nothing fashion. We’re now thinking about adopting one-box deployments, even if we don’t then deploy in waves across multiple days. And the core principles behind AWS’s specific implementations have led us to reflect more deeply on things we’ve historically just accepted about our incident response process — namely, the opportunities for automation, even if only as a debugging aid and not yet as a tool with the authority to perform full rollbacks.
We might not yet be at five nines, but we dream.