Searching a codebase is usually a pretty easy problem. You want to know what files use your MAX_EMAILS_TO_SEND constant? One command gets you there in milliseconds:
grep -r MAX_EMAILS_TO_SEND .
Modern tools like ripgrep make it even faster. But this simplicity depends on a couple of things being true:
Your files all reside on a (hopefully) fast disk.
There aren't too many of them.
When building out a code search tool for the agentic Graphite Chat, we immediately ran into both of these limitations. We needed to support searches across hundreds of thousands (or millions) of files, at any commit, without maintaining a whole traditional VM+disk for each one. Could grep even support that kind of use case?
Let’s find out!
Default-branch-only search isn’t enough
You might read this and think: What's supposed to be novel here? We've had fast code search via API for big repositories for years now. Sourcegraph made a whole company out of the idea, and GitHub even gives you something similar for free.
True, but only on the default branch. GitHub’s API docs are explicit:
Only the default branch is considered. In most cases, this will be the main branch.
And while Sourcegraph is closed-source now, there's still an archive of its documentation and code as it stood in 2023, which details the same thing:
Sourcegraph also has a fast search path for code that isn't indexed yet, or for code that will never be indexed (for example: code that is not on a default branch). Indexing every branch of every repository isn't a pragmatic use of resources for most customers, so this decision balances optimizing the common case (searching all default branches) with space savings (not indexing everything).
…
Provides on-demand unindexed search for repositories. It scans through a git archive fetched from gitserver to find results, similar in nature to git grep.
The searches our model makes almost never target main; they're against arbitrary commits. For us, fast search at any commit was the requirement, and that turned out to be a much harder problem.
What didn’t work
In the spirit of KISS (or “Choose Boring Technology”, or “The Grug Brained Developer”, feel free to pick your favorite decade’s spin on the idea), we figured our first attempt at a solution should also be the most straightforward: a plain old git grep on the repository. If that worked, then we could save ourselves a lot of time.
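Conveniently, git grep can already search at an arbitrary commit if you pass it a tree-ish, which is exactly the shape of query we benchmark below:
$ git grep interactionManager 8f1ae53    # search the blobs reachable from commit 8f1ae53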
We spent a week running experiments on several block-based storage solutions on AWS:
On-demand AWS Lambdas to execute searches via one shared EFS volume.
An ECS cluster where tasks dynamically mount per-repo EFS volumes at query time.
Persistent EC2 instances which use (faster) EBS with multi-attach for horizontal scaling.
We saw some interesting results here; for example, we measured the EBS-mounted volumes consistently performing I/O-intensive operations (like searches) about 3x as fast as the EFS-mounted ones. From a test we ran searching for a string in the react-native repository:
# EBS
Performance counter stats for 'git grep interactionManager 8f1ae53':

        17,541      minor-faults      #    0.062 M/sec
           878      major-faults      #    0.003 M/sec
        282.34 msec task-clock        #    0.184 CPUs utilized

   1.535785602 seconds time elapsed

   0.208365000 seconds user
   0.075011000 seconds sys
# EFS
Performance counter stats for 'git grep interactionManager 8f1ae53':

        17,620      minor-faults      #    0.050 M/sec
            53      major-faults      #    0.149 K/sec
        354.59 msec task-clock        #    0.078 CPUs utilized

   4.560184652 seconds time elapsed

   0.212312000 seconds user
   0.129052000 seconds sys
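These listings have the shape of perf stat output; assuming that's the tool in play, counters like these come from an invocation along these lines:
$ perf stat -e minor-faults,major-faults,task-clock git grep interactionManager 8f1ae53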
However, in spite of the individual advantages each had over the others, they all buckled in the same place: large repositories.
While react-native is a decently-sized repo by most measures, it only has a few thousand files:
$ git ls-files | wc -l
7019
We need to support repositories hundreds of times this size, and in cases like that, we actually end up at the mercy of the Linux page cache. The major-faults rows in the snippets above represent times the grep process needed to actually reach out to the disk for blocks, and for the large sample codebases we tested, searches only returned “fast enough” (under 10 seconds) when major-faults was 0 (for instance, when the same search is run twice in a row). In other words, searches are fast only when the entire repository is already in the page cache.
At this point, the grep operation becomes mostly CPU-bound, and perhaps unsurprisingly, this is also where EFS and EBS see their performance disparity vanish:
Performance counter stats for 'git grep interactionManager 8f1ae53':

        17,667      minor-faults      #    0.071 M/sec
             0      major-faults      #    0.000 K/sec
        248.55 msec task-clock        #    0.957 CPUs utilized

   0.259805935 seconds time elapsed

   0.208592000 seconds user
   0.040113000 seconds sys
Of course, the only way to ensure that every block a search needs is already in the page cache when the query comes in is to have a crystal ball, which we can’t afford.
So, we have to look into some more earthly solutions!
What somewhat worked
If we can't brute-force our way through a big repository with grep, then it looks like we'll need to do some indexing.
We tested loading the repository’s files into Elasticsearch, which tokenizes documents and executes keyword queries quickly. The results were promising:
$ cat search.json
{
  "query": {
    "wildcard": {
      "content": {
        "value": "*interactionManager*"
      }
    }
  }
}

$ https GET <elastic-endpoint>/blobs/_search?size=100 < search.json
{
  "hits": {
    "hits": [...],
    "max_score": 1.0,
    "total": {
      "relation": "eq",
      "value": 83
    }
  },
  "timed_out": false,
  "took": 79
}
Even using a tiny node type (i4g.large.search), we saw searches over huge repositories taking well under 500ms. And that's with a cluster of just one node; we could easily split the dataset into shards (ES manages this on its own) such that 10 different nodes each handle 1/10 of the files, and our searches would be even faster.
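For what it's worth, that split is just an index setting. As a sketch, using the same httpie-style call and the blobs index name from the example above:
$ https PUT <elastic-endpoint>/blobs settings:='{"index": {"number_of_shards": 10}}'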
Problem solved, right? Not quite. What we just tried with Elasticsearch worked great for one commit, but our requirement was fast search at any commit. The direct approach would be to index the repo’s state at each commit, for easy lookups, but — multiply thousands of repos by thousands of files by thousands of commits, and suddenly you’re managing tens of billions of documents. Running a cluster of that size would be prohibitively complex and expensive.
So, is the “document database” idea dead-on-arrival?
Back to basics
When looking at the problem through the above lens (the challenge of efficiently managing thousands of files, each with potentially thousands of versions), there is one database that's relevant to the discussion: namely, Git itself!
A Git repository appears to store a full copy of your code at every commit of a potentially decades-long history, but does so without terabytes of disk space. This led us to wonder, "how does it do that, and can we somehow adapt that approach to work with a document search database instead of a filesystem?"
For the "how" question, it's honestly very clever if you haven't read about it before. The short version is, Git splits the data it stores into several different kinds of objects:
Blobs → raw file contents, content-addressed.
Trees → mappings from file paths to blob IDs at a given commit.
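You can poke at both object types yourself with git cat-file; the IDs and paths below are just illustrative:
$ git cat-file -p 'HEAD^{tree}'   # the tree: paths mapped to object IDs
100644 blob 9ae6a1b2...    README.md
040000 tree 3f2c8d4e...    src
$ git cat-file -p 9ae6a1b2        # a blob: the raw contents of that version of README.md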
And as for the "can we" question, it turns out, yes absolutely!
The life of a search query
The key change was that, like Git, we needed to start storing two kinds of objects instead of just one: blobs and trees. Then, when it comes time to perform a search, we execute two queries:
Get the tree object for the commit SHA specified in the query.
Get all blobs that match the search term specified in the query.
And then, in memory, we filter the results from (2) based on the blob IDs in the result of (1), such that we only return files which are actually in-scope for the request. Easy!
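Here's a minimal sketch of that flow in TypeScript. The document shapes and the getTree/searchBlobs helpers are hypothetical stand-ins for the real index queries, not our actual client code:

// Hypothetical document shapes; the real schema lives in our index.
interface TreeDoc { commitSha: string; blobIds: string[] }
interface BlobDoc { blobId: string; path: string; content: string }

// Placeholder queries so the sketch stands alone; the real versions hit the index.
async function getTree(commitSha: string): Promise<TreeDoc> {
  return { commitSha, blobIds: ["blob-a", "blob-b"] };
}
async function searchBlobs(term: string): Promise<BlobDoc[]> {
  return [
    { blobId: "blob-a", path: "src/a.ts", content: `current version using ${term}` },
    { blobId: "blob-z", path: "src/z.ts", content: `stale version using ${term}` },
  ];
}

async function searchAtCommit(commitSha: string, term: string): Promise<BlobDoc[]> {
  const tree = await getTree(commitSha);   // (1) the tree for this commit
  const blobs = await searchBlobs(term);   // (2) every blob version matching the term
  // Keep only blobs that the commit's tree actually points at.
  const inScope = new Set(tree.blobIds);
  return blobs.filter((blob) => inScope.has(blob.blobId));
}

Run against the placeholders above, only src/a.ts comes back, since blob-z isn't referenced by the commit's tree.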
Both of these queries are fast. The first simply retrieves a single document by its ID. While the second does need to consider every version of every file in the repo (which might sound like a lot), our tests show it’s actually not nearly so bad. Some files in the repository have dozens of unique versions, but the vast majority are rarely—if ever—updated. So the total number of documents is generally a small multiple (think ~3x) of the number of files you see in the working tree.
Plus, we’ve made a few optimizations to help things move even faster. Notice that the two queries above are totally independent—you don’t need the result from (1) in order to perform (2), so we do them in parallel. And because a commit’s tree (which just specifies a set of blob IDs) is often much smaller than the result set of blobs, we’ve implemented the in-memory filter as a stream transformer, so that we can start responding to the client the moment we receive our first in-scope match.
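As a rough sketch of those two tweaks (again with hypothetical helpers standing in for the real index calls), the tree fetch and the blob search kick off together, and the filter is a transform stream that forwards each match as soon as it's known to be in scope:

import { Readable, Transform } from "node:stream";

interface BlobDoc { blobId: string; path: string; content: string }

// Hypothetical index calls: one resolves to the commit's blob-ID set,
// the other to a stream of matching blob documents.
async function getTreeBlobIds(commitSha: string): Promise<Set<string>> {
  return new Set(["blob-a", "blob-b"]);
}
async function searchBlobsAsStream(term: string): Promise<Readable> {
  const docs: BlobDoc[] = [
    { blobId: "blob-a", path: "src/a.ts", content: `uses ${term}` },
    { blobId: "blob-z", path: "src/z.ts", content: `stale version of ${term}` },
  ];
  return Readable.from(docs);
}

async function streamSearchAtCommit(commitSha: string, term: string): Promise<Readable> {
  // The two queries are independent, so start them in parallel.
  const [inScope, blobStream] = await Promise.all([
    getTreeBlobIds(commitSha),
    searchBlobsAsStream(term),
  ]);

  // Forward a blob the moment it turns out to be in scope, so the first
  // result can reach the client before the blob query has finished.
  const filter = new Transform({
    objectMode: true,
    transform(blob: BlobDoc, _encoding, callback) {
      if (inScope.has(blob.blobId)) this.push(blob);
      callback();
    },
  });

  return blobStream.pipe(filter);
}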
The future
We have a few other ideas for optimizations, some a bit more experimental. For example, if we only ever use the tree for a set-contains operation, do we even need to fetch the entire tree at all? Maybe we can pre-construct the set of blob IDs and only fetch that at query time, or even pre-construct something that behaves like that set most of the time but is only 10% the size?
And since we’re using Turbopuffer as the document database backing the Graphite code index, adding embeddings is on the table too, which opens the door to semantic search. Maybe we will soon be able to ask PR Chat to check if a pattern you tried is consistent with examples from similar files.
The now
Today, the system is live, and we’ve already indexed tens of millions of source files across thousands of repositories. For several weeks, Graphite Chat has been using this index to power LLM tool calls, where it’s been a big upgrade from our previous strategy of calling the GitHub API.
We can now fetch as many files as we need (no arbitrary rate limit), our searches accurately target the PR’s branch (instead of only being able to search main), and best of all — the index has now served thousands of queries with a median latency consistently under 100 milliseconds.
What started as a “just grep it” idea turned into a deep dive through storage tradeoffs, indexing engines, and Git’s internals. By rethinking search in terms of blobs and trees, we’ve unlocked fast search at any commit—something we couldn’t get off the shelf. And we’ve hardly touched on the ingestion part! It turns out there are ways to have Git shrink a 5GB repo clone down to 20MB, and you'd be surprised how useful an anemic clone like that can still be.
Which is to say, check back soon for a part two!