If you haven’t seen the global headlines yet: Graphite had an outage.
In the interest of full transparency and accountability, I, Greg Foster, as the CTO and responding on-call engineer, want to give a full report of the incident, and reaffirm our commitment to our community and customers.
While Graphite was down for ~2 minutes, and you'll never get those valuable moments of your life back, in this article I will provide you with ~6 minutes of content. This nets out to a GAIN of 4 minutes, putting you back on top. Let's dive in.
Time and Duration: On December 6, 2023, Graphite’s web application experienced an unexpected downtime lasting precisely 124 seconds, beginning at approximately 2:12:35 PM EST and detected at 2:13:26 PM EST.
Impact: Nearly a dozen users encountered a 404 error page when attempting to load new pages, particularly affecting the pull request functionality. This interruption temporarily hindered user code reviews while encouraging premature coffee breaks.
Luckily, the issue was isolated to new page loads of the web application. Other key components of the Graphite ecosystem, such as asynchronous jobs, merge queues, the CLI tool, the VS Code extension, and the system tray application, remained fully operational during this time. This maintained continuity in several backend and integrated services, thereby limiting the overall disruption to our users.
The problem was first identified thousands of milliseconds into the incident, at precisely 2:13:26 PM EST by myself, Greg Foster, following a user report in the Graphite Community Slack channel.
2:13:35 PM: Adhering to my on-call duties, I promptly instructed Alyssa, another key software dev on our team, to update the Graphite status page to communicate the service disruption.
2:13:37 PM: Concurrent with the status page update, belated PagerDuty alarms began firing, indicating a significant uptick in errors and triggering sympathetic trauma responses across the room. The office was literally “abuzz” with vibrating phones and Apple Watches warning of imminent cardiac arrest (see above).
2:13:40 PM: Brendan, another Graphite engineer, began an exploration in the AWS console to identify and address the underlying issue.
2:13:45 PM: Suspecting an anomaly with the S3 bucket serving the web application, I started battling through several of AWS’s “best-in-industry” login flows to access the AWS console. After only 16 redirects through different authentication portals, I was in.
2:13:50 PM: Graphite’s internal Slack channels experienced a surge in activity, with employees reporting the 404 error and users clamoring for answers in the community Slack. ➕ emoji reacts on error reports and dumpster-fire GIFs were frantically rolling in. It was now clear that this was a... widespread issue.
2:14:30 PM: Meanwhile, as I handled the Slack meltdown, Brendan successfully identified a misconfiguration in the S3 bucket settings and reverted the changes, effectively restoring site functionality.
2:14:40 PM: Following the resolution, site performance returned to normal, and blood pressure readings finally fell below the “fatal hypertension” level. Benchmarks across the board were looking good.
2:14:50 PM: Brendan reminded me of what I said in our 1:1 this morning: “Wow, it’s been so long since we had any kind of outage!” Thank you Brendan.
2:15:40 PM: I updated the community on Slack regarding the resolution and shared a brief explanation of the incident.
Precursor event: My presentation on the preceding day at 4:50 PM EST, highlighting the exceptional uptime in November, inadvertently angered the SRE gods and cursed the team with subsequent downtime.
Immediate cause: The downtime was initiated by an unintended change to the S3 production bucket configuration at approximately 2:12:35 PM EST, causing CloudFront to serve 404 errors.
The issue was promptly addressed by reverting the changes to the S3 bucket configuration.
Graphite will introduce a new internal policy requiring that configuration changes be tested in a staging environment prior to production deployment.
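As a minimal sketch of what such a pre-deployment check might look like: the function below validates a dictionary shaped like the response of S3's GetBucketWebsite API, catching the class of misconfiguration where a single-page app's error document is unset and unknown routes fall through to a bare 404. The function name and expected values are hypothetical illustrations, not Graphite's actual setup.

```python
def validate_website_config(config: dict) -> list[str]:
    """Return a list of problems found in an S3 static-website configuration.

    `config` mirrors the shape of S3's GetBucketWebsite response, e.g.:
    {"IndexDocument": {"Suffix": "index.html"},
     "ErrorDocument": {"Key": "index.html"}}
    """
    problems = []

    index = config.get("IndexDocument", {}).get("Suffix")
    if not index:
        problems.append("IndexDocument.Suffix is missing")

    # Single-page apps typically route unknown paths back to the app shell;
    # serving a bare 404 instead is exactly the failure mode described above.
    error_key = config.get("ErrorDocument", {}).get("Key")
    if error_key != index:
        problems.append(
            f"ErrorDocument.Key is {error_key!r}, expected {index!r} "
            "so client-side routes don't 404"
        )

    return problems


good = {"IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "index.html"}}
bad = {"IndexDocument": {"Suffix": "index.html"}}  # no error document

assert validate_website_config(good) == []
assert validate_website_config(bad)  # flags the missing ErrorDocument
```

Run against the staging bucket's configuration in CI, a check like this fails the pipeline before a bad config ever reaches the production bucket.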
Add a linter warning to caution against making verbal jinxing statements about system reliability.
First, I want to extend my sincere thanks to our engineering team for their rapid and effective response to this incident.
I also deeply appreciate our community members on Slack who quickly reported the issue and showed tremendous support throughout the incident.
And thank you, the reader, for making it to the end of the report, hopefully learning from our experience along the way.