Over the past 6+ weeks, I’ve spent most of my time improving the performance of Graphite’s application server (affectionately nicknamed “subwoofer”). While this effort is still fresh in my mind, I wanted to share 3 key lessons that helped us find meaningful performance gains and, in doing so, greatly improve the Graphite user experience.
Context
We started to investigate our app server’s performance in earnest after noticing a few concerning trends in the data over the summer:
At the time, we would semi-frequently experience large, but unexplained regressions in server endpoint latency.
Every three or four weeks, our team would get paged because of a large regression in endpoint response time (think something like the P90 drifting up to 8 seconds for a few minutes). In practice, these regressions would render the Graphite dashboard unusably slow for reviewing or merging PRs.
Despite day-long investigations into these regressions each time they cropped up, the cause of these incidents remained a mystery.
As a result, we began to settle into an uneasy rhythm where we’d notice a large performance regression, scale up the number of server pods, dig in (usually finding nothing), and just move on, repeating the exercise again a few weeks later.
We also had limited visibility into smaller site performance changes.
Between the aforementioned regressions and the outlier users we noticed adding disproportionate load, our site performance graphs were extremely spiky.
When teammates and/or users in our Slack community remarked that our site had gotten slower over some past window of time, it was difficult to authoritatively verify or disprove this — let alone pinpoint the cause.
There was clear user pain to be addressed — if only we could get a handle on the system.
The hope for this investigation was twofold:
Learn where these large regressions were coming from, and
Remove as many of them as possible, reducing these bad user experiences and making it easier to understand app server performance trends over time.
Let’s dive in.
Investigating large regressions
Throughout this investigation, I used our existing P95 server endpoint metric as my guide; whenever I saw a spike in the metric, I would dig in. In past incidents, we’d investigate, things would return to normal, and we’d move on to other work; this time, I didn’t rest until I had answers. We had hypotheses, of course, but the results were actually surprising.
A few themes emerged:
Lesson #1: Verify the underlying impact; P95 latency can be too blunt.
This lesson is key for any application that integrates with a third-party API, in our case GitHub’s. With such an integration, an incident at the upstream provider can spike your app’s P95 response time without any real degradation on your side.
During the 6 weeks that I was working on performance, a large customer onboarded onto Graphite. A few days after this customer’s onboarding session, we noticed a surge in server latency:
Looking at a rollup of the top user-endpoint combinations, we noticed that the surge in requests was coming from a specific set of users. When one of these users tried to load the insights page on the Graphite dashboard, our app server would grab profile info for every GitHub user in their monorepo, including their GitHub avatar URLs, which needed to be proxied through our server. That meant a separate network request for each avatar URL, and thousands of requests within a span of seconds. Not ideal, but easily remedied.
We quickly stemmed the bleeding for this customer and, based on what we knew about queries per second and the cost of each request, assumed that this spike in overall server latency meant we had taken a site-wide performance hit.
However, digging in showed a different picture: our server actually processed the thousands of avatar URL proxies with no trouble.
Instead, it was the customer’s GitHub Enterprise instance that had become overwhelmed, leading to high endpoint latencies for this particular organization — but perfectly normal behavior for all other Graphite users.
The lesson here: sometimes P95 latency can be too blunt and shift as a result of a few outliers (even though the outliers may be deserving of a smaller incident in their own right).
We’ve since seen other instances in which the P95 latency was pushed up by a regression in one particular endpoint. While I still use P95 latency as the simplest leading indicator, it’s crucial to pair it with a chart of P95s grouped by endpoint:
For example, in the chart above you can spot that the significant jump in P95 just before 6am is actually due to a spike in response time for the yellow endpoint.
On the other hand, here’s a clear example from Thursday, July 24, of a true site-wide regression: all the lines spike, with each color representing a different endpoint.
Lesson #2: Watch out for event loop blockers and, in particular, big queries.
With the per-endpoint graphs in place, I started homing in on the cases where we’d see a site-wide regression, with latencies surging across many endpoints simultaneously.
Some of these regressions were the result of issues we were already familiar with —
High database CPU
An outlier number of requests
Servers running out of memory
— but for other regressions, all of our variables seemed to hold at healthy levels despite the jump in latency. In some of these cases, we’d just see a quiet period: everything would look healthy, there would be few or no error logs, and yet latencies would still be sky high.
Because of the lack of concerning external variables, our hypothesis was that there must be something occurring within the Node process itself — enter Node event loop blockers.
The simple mental model here is that Node is single-threaded, so when you do heavy, synchronous compute, you block the process, and thus the server, from handling any of the other in-flight requests.
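To make that concrete, here’s a minimal sketch using a generic Express-style server purely for illustration (not our actual code): while the synchronous loop in the first handler runs, Node can’t service anything else, so even a trivial health check stalls behind it.

```typescript
import express from "express"; // generic example server, not Graphite's actual stack

const app = express();

// Heavy synchronous compute: this loop runs to completion on the one and only
// JavaScript thread, so no other request is handled until it finishes.
app.get("/blocking", (_req, res) => {
  let total = 0;
  for (let i = 0; i < 1_000_000_000; i++) total += i;
  res.json({ total });
});

// While /blocking is running, even this trivial endpoint sits in the queue.
app.get("/healthz", (_req, res) => {
  res.send("ok");
});

app.listen(8000);
```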
Datadog provides a basic out-of-the-box dashboard here, though it does have a notable weakness: once you get down to the service level, you can’t slice on a per-pod basis. Even though each pod runs a separate Node process with its own event loop, you can’t examine pods individually, which means a large event loop blocker on one pod can get hidden and averaged out if the other 9 server pods are doing fine at the time.
Instead, as suggested by Ashby in an extremely helpful blog post, I found logging to be a more fruitful approach.
As a side note, we also tried profiling our CPU using the Linux perf tools, as Netflix suggests, but we wound up preferring the Ashby approach for a couple of reasons:
AWS doesn’t expose the underlying settings needed to use Linux perf tools on ECS containers.
The Ashby approach allows us to identify the specific request that blocked the event loop, making it easy to replay said request with the exact data to re-create the blocking conditions. This wouldn’t be possible with a plain flame chart.
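The detection half of this logging approach can be as simple as a repeating timer that measures how late it actually fires. The sketch below is a hypothetical, minimal version (the interval, threshold, and plain console.warn are placeholders); the real logging also needs to carry the trace/request id so the offending request can be found and replayed.

```typescript
// Minimal event-loop-lag detector based on timer drift. The interval and
// threshold here are illustrative values, not our production configuration.
const INTERVAL_MS = 100;
const BLOCK_THRESHOLD_MS = 200;

let lastTick = process.hrtime.bigint();

setInterval(() => {
  const now = process.hrtime.bigint();
  const elapsedMs = Number(now - lastTick) / 1e6;
  lastTick = now;

  // If the timer fired much later than scheduled, something held the event loop.
  const lagMs = elapsedMs - INTERVAL_MS;
  if (lagMs > BLOCK_THRESHOLD_MS) {
    // In production this would be a structured log picked up by Datadog,
    // tagged with whatever request was active at the time.
    console.warn(`Event loop blocked for ~${Math.round(lagMs)} ms`);
  }
}, INTERVAL_MS);
```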
Here’s what the end result of implementing this logging approach looks like:
When the event loop gets blocked, we log to Datadog:
In Datadog, we can view the APM trace associated with the log:
The APM trace includes our “Request start” log, which lets us replay the request locally. A manage command spits out an associated curl command that replays the request against a local server for debugging:
$ AWS_PROFILE=production yarn manage-prod replay --requestId SOME_REQUEST_ID

curl -v --cookie 'auth={"auth": "<redacted>", "exp": 1695163851636}' "https://localhost:8000/api/v1/graphite/section/prs"
After debugging lots of requests in this manner, there were a few common themes:
Half of the event loop blockers were from intensive synchronous work that we were doing.
In these cases, we moved the computation off of the main event loop and into worker threads.
The other half came from database queries — specifically database queries that return a significant amount of data.
When TypeORM receives results back from the database, it parses these raw results and transforms them into the typed entities that we then interact with in our codebase.
When these raw results are large (3 mb or larger in our particular case), this parsing can be extremely expensive. I verified this by cloning TypeORM locally, plugging it into our server, and adding custom debug logs.
A very satisfying correlation I found between payloads larger than 3 mb and event loop blockers greater than 200 ms.
The results may be large because we’re querying for a large number of rows, or because of the joined data in the query. TypeORM maps relations loaded through its entity manager to left joins, which can significantly inflate the number of rows returned when the queried entities have lots of relations; a single entity with 50 rows in one joined relation and 20 in another, for example, comes back as 1,000 joined rows. This meant that a few seemingly innocuous queries turned out to be timing out at Postgres serialization time (not event loop blocker related, though the concept is the same) because of all of the joined data.
The solutions here were to either remove the joins, reduce the number of rows that the query returns, or, in the worst case, simply break the query up into batches. Unfortunately, TypeORM doesn’t expose any escape hatches here.
There were some other interesting but niche fixes as well. For example, our pattern of mapping over a list of objects to generate promises could itself block the event loop when 1) there were lots of objects in that list or 2) creating each promise took non-trivial synchronous work (e.g. constructing a TypeORM query builder).
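As a sketch of that last pattern, the fix was to stop creating every promise in a single synchronous burst and instead walk the list in chunks, yielding back to the event loop between chunks. The helper below is hypothetical (the name, types, and chunk size are mine, not our exact code):

```typescript
import { setImmediate as yieldToEventLoop } from "node:timers/promises";

// Hypothetical helper: `buildAndRun` does non-trivial synchronous work (e.g.
// constructing a query builder) before returning a promise, so calling it for
// every item in a single tick can block the event loop all by itself.
async function mapInChunks<T, R>(
  items: T[],
  buildAndRun: (item: T) => Promise<R>,
  chunkSize = 50,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    const chunk = items.slice(i, i + chunkSize);
    // Only pay the synchronous setup cost for one chunk per tick...
    results.push(...(await Promise.all(chunk.map(buildAndRun))));
    // ...then yield so other in-flight requests get a turn.
    await yieldToEventLoop();
  }
  return results;
}
```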
The takeaways from this:
This isn’t a call to prematurely optimize. I wouldn’t worry about synchronous, CPU-intensive work off the bat; I’d rather monitor it as it’s rolled out, knowing that it might later have to move to a worker thread. There were also cases, however, where logic I thought would be quite CPU-intensive turned out not to be. This is why proper logging and monitoring are crucial in web applications at scale.
Be very wary of large volumes of data returned from the database. It turns out that this is not only expensive at the database level and in sending the data down to the client, but can also affect every other in-flight request on the server.
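On the database side, the shape of the fix usually looks something like the sketch below: select only the columns you need, cap the number of rows, and load relations with a second bounded query instead of stacking left joins onto one. The entities and fields here are hypothetical, not Graphite’s real schema.

```typescript
import { DataSource, In } from "typeorm";
// Hypothetical entities, purely for illustration.
import { PullRequest, Comment } from "./entities";

async function loadRecentPrs(db: DataSource, repoId: string) {
  // Bounded parent query: explicit columns and a hard cap on rows.
  const prs = await db.getRepository(PullRequest).find({
    where: { repoId },
    select: ["id", "title", "state"],
    order: { updatedAt: "DESC" },
    take: 100,
  });

  // Load the relation with a second bounded query instead of a left join, so
  // the row count stays bounded instead of being inflated by joined relations.
  const comments = await db.getRepository(Comment).find({
    where: { pullRequestId: In(prs.map((pr) => pr.id)) },
  });

  return { prs, comments };
}
```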
From this investigation we were able to burn down a number of these event loop blockers; however, we’ll need to keep an eye on our performance as we inevitably introduce new blockers.
Lesson #3: Everything has a cost — but only the most expensive things have an impact.
One of the first things I dug into when looking at performance was the background fetching that our app did.
We had noticed in Datadog metrics that our server spent a lot of time serving these requests (by server wall time, this was the third-highest endpoint), but because the requests happened in the background rather than in a user-facing flow, we weren’t sure whether this was worth optimizing.
At the time, we decided to experiment with unshipping the feature that relied on this background fetching, curious to see if impact showed up anywhere.
After unshipping, while server CPU and memory stayed the same, we saw a dramatic drop-off in the event loop delay:
The key takeaway for me here, as with much of this performance work, was that nothing comes for free, even non-user-facing latency.
That being said, an important caveat is that while everything has a cost, not everything affects our top-line site performance, and not everything is worth optimizing.
For a while, we’ve theorized that some of our large, unpaginated endpoints are costly. However, we haven’t seen this show up in our top-level site performance metrics; these queries don’t seem to be using a disproportionate amount of CPU or memory or blocking the event loop more than other endpoints. And so while we could staff a large effort to overhaul these endpoints and iron out our pagination scheme, I’m not yet confident that the tradeoff would be worth it.
Over the course of these six weeks, I took some brief side-quests to also reduce the volume of returned data from some of our worst TypeORM queries:
And while some of these had a discernible impact, namely in eliminating event loop blockers, many had no effect at all, which makes it hard to draw a firm conclusion about our larger queries that return under 3 mb of data. This, of course, applies to our particular tech stack, and it doesn’t mean those queries have no cost; it just means they aren’t the right thing to improve if we want to drive top-line performance gains. It’s yet another instance where it’s important to validate gut-instinct hypotheses against what the data is actually indicating and let it guide you to the most impactful work, which might lead you to some surprising places: we thought for sure that returning more than a megabyte of data failed the smell test.
Capturing performance trends
After fixing many of the outliers, our latency graphs are much smoother and more consistent week-over-week. For example, consider the following time series graphs that show the number of requests that took >5 seconds, >2 seconds, and >1 second. The blue line shows the data from today and the orange tracking line shows data at the same point in time from the week before. Aside from one small spike at around 4:15 pm, the graphs look nearly identical.
Now when we see performance regressions, they’re easy to identify concretely. In the graphs below, for example, a regression around midnight immediately jumps out.
Previously we were using these graphs to monitor latency across our entire application, but the hope is that we can now use this approach to track performance trends on key endpoints.
The work continues
Through this work, we have significantly improved our ability to:
Understand where large performance regressions on the Graphite app are coming from and how to combat them
Understand how our overall site performance is tracking over time
While most of this work has centered on understanding the most extreme outliers, we’re hard at work continuing to drive down endpoint latencies across the board and, more importantly, locking in our wins.
All in all, this investigation and the resulting fixes have greatly improved Graphite’s app server performance over the past 6 weeks, and I’m confident that these learnings will only help us continue to get better going forward.