Background gradient

As traffic and data grow, queries that were fast last week can suddenly cripple your app this week. Pinpointing exactly where these slow queries originate can be tricky, but achieving "call site attribution", identifying the exact point in the source code from where the SQL query is being executed, allows developers to find and fix performance bottlenecks quickly.

In this post, I'll walk through how we added call site attribution to our Node.js web app in anticipation of scaling our hundreds of millions of SQL queries a day, into billions. For this article, I interviewed Aryaman Naik, one of our engineers leading this database project, to do a deeper dive into our call site attribution process. Here are the key takeaways:

With our current user growth trends, we anticipate traffic could grow 10X from current levels. This is a good problem to have, but still requires preparation: mainly ensuring we prioritize infrastructure tasks, like database scalability, and make sure this work doesn’t slip through to the icebox. The last few months, our engineering team has been heads down building new features, and in that interim our DB usage crept up to nearly 40% capacity of AWS’ 4x large database. The largest machine AWS provides is a 16x large, meaning without optimizations, at 10x growth, we would hit the limits of the machines AWS can provide, and have to start looking into more complex problems like sharding Postgres writes or worst case, moving off Postgres entirely.

At this point, we realized: if traffic does grow 10X (as we’re predicting), we'll quickly hit a ceiling on our database utilization, grinding application performance to a halt. All this during a crucial period when an influx of new users are just evaluating Graphite for the first time.

In order to handle this expected surge in traffic and mitigate performance concerns, we needed to reduce SQL query load substantially. The first step in this process was identifying which queries were hogging resources. From there, we could trace those queries back to the originating code and optimize the worst offenders.

Easier said than done.

Our initial thought was to use Datadog to trace queries originating from HTTP requests, which we already had in place for general monitoring. But unfortunately, most of our load comes from background workers not directly tied to these requests, preventing Datadog from giving us any visibility into these queries.

We then explored a couple of other options:

  • Expand Datadog usage: Wrap every SQL query in a Datadog span, to provide tracing and attribution. While this would work in theory, it would also raise our Datadog spend by a factor of 100x, so this solution was quickly canned.

  • Monkey patching: Programmatically insert the call stack into each SQL query as a comment. The downsides here however were legibility and limitations with how to fetch the call stack, so this would not work either.

Ultimately, neither solution provided the desired end-to-end tracing from SQL to code, blocked by both cost and functionality.

The solution we ultimately implemented was to require all SQL queries to be tagged with a user-defined string indicating the call site. For example:

sql
SELECT * FROM users WHERE id = 1 -- Callsite=app/users.ts:getUser

Now when we view slow queries, the call site tag instantly tells us where the query originated in our codebase.

Some key benefits of this approach:

  • Legibility: The tag is readable at a glance when viewing SQL queries.

  • Traceability: We can instantly jump to the tagged file in our codebase.

  • Grouping: Similar queries from the same part of the codebase can be grouped together, allowing us to identify problem areas more holistically.

  • Cost: No additional cost, unlike with Datadog tracing.

Once we settled on this solution however, there were some additional implementation challenges to work through.

One of the first challenges we ran into was the ORM we use in our Node.js/Postgres stack, TypeORM, only supports query tagging in some of its APIs. For example, the .create() and .find() methods can add tags, but not .update() or .delete().

Under the hood however, TypeORM uses a query builder to construct SQL queries, so we wrapped all TypeORM calls in the query builder API, then added tagging there. This took some work to replicate TypeORM's custom logic on top of the query builder, but eventually we were able to ensure every query was tagged with its corresponding call site.

We also added linting rules to prohibit use of any TypeORM APIs that couldn't support tagging, ensuring the use of our wrappers, and preventing our custom tagging logic from being bypassed.

Overall we were able to achieve nearly 100% coverage of all queries with call site tags. From here, we put together a dashboard and can now tell from a glance which parts of our codebase need optimization.

Here are my key conclusions:

  • Don't shy away from ORM libraries like TypeORM: ORM’s provide immense productivity gains, however, don’t be afraid to examine multiple different offerings. Evaluate them based on your specific needs, as some are easier to implement out of the box, but are much more rigid in their functionality.

  • Add call site tagging from the beginning: Eventually as you scale, you’re going to need it when optimizing your database performance. Having visibility into performance early on helps you optimize along the way and prevents you from accumulating potential tech debt.

  • Ensure your ORM supports query tagging: Not all of them do, and this metadata is crucial.

Call site tagging got us 95% of what we needed without a major overhaul of our data infrastructure. However, there are still opportunities for improvement:

  • Support tagging across all APIs called by TypeORM.

  • Include the line number along with the file name in each tag.

  • Build internal tooling preventing engineers from creating untagged queries, ensuring we maintain visibility into our database performance.

As a result of the increased visibility that call site optimization gave us, we were able to quickly identify and correct problem areas in our code base. As a result of these optimizations, we finally feel like we’re in a place where we can handle our predicted traffic increase. Of course, the work is never over, and we’ll continue to optimize and implement performance improvements as we scale our userbase from 10x, to 100x, to 1,000x, making blog posts along the way.