Searching a codebase is usually a pretty easy problem. You want to know what files use your MAX_EMAILS_TO_SEND constant? One command gets you there in milliseconds:
grep -r MAX_EMAILS_TO_SEND .
Modern tools like ripgrep make it even faster. But this simplicity depends on a couple of things being true:
Your files all reside on a (hopefully) fast disk.
There aren't too many of them.
When building out a code search tool for the agentic Graphite Chat, we immediately ran into both of these limitations. We needed to support searches across hundreds of thousands (or millions) of files, at any commit, without maintaining a whole traditional VM+disk for each one. Could grep even support that kind of use case?
Let’s find out!
Default-branch-only search isn’t enough
You might read this and think: What's supposed to be novel here? We've had fast code search via API for big repositories for years now. Sourcegraph made a whole company out of the idea, and GitHub even gives you something similar for free.
True, but only on the default branch. GitHub’s API docs are explicit:
Only the default branch is considered. In most cases, this will be the main branch.
And while Sourcegraph is closed-source now, there's still an archive of its documentation and code as it stood in 2023, which details the same thing:
Sourcegraph also has a fast search path for code that isn't indexed yet, or for code that will never be indexed (for example: code that is not on a default branch). Indexing every branch of every repository isn't a pragmatic use of resources for most customers, so this decision balances optimizing the common case (searching all default branches) with space savings (not indexing everything).
…
Provides on-demand unindexed search for repositories. It scans through a git archive fetched from gitserver to find results, similar in nature to git grep.
The searches our model makes almost never target main; they're against arbitrary commits. For us, fast search at any commit was the requirement, and that turned out to be a much harder problem.
What didn’t work
In the spirit of KISS (or “Choose Boring Technology”, or “The Grug Brained Developer”, feel free to pick your favorite decade’s spin on the idea), we figured our first attempt at a solution should also be the most straightforward: a plain old git grep on the repository. If that worked, then we could save ourselves a lot of time.
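Conveniently, git grep can already search at an arbitrary commit if you pass it a tree-ish, which is exactly the shape of query we benchmark below:
$ git grep interactionManager 8f1ae53    # search the blobs reachable from commit 8f1ae53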
We spent a week running experiments on several block-based storage solutions on AWS:
On-demand AWS Lambdas to execute searches via one shared EFS volume.
An ECS cluster where tasks dynamically mount per-repo EFS volumes at query time.
Persistent EC2 instances which use (faster) EBS with multi-attach for horizontal scaling.
We saw some interesting results here; for example, we measured the EBS-mounted volumes consistently performing I/O-intensive operations (like searches) about 3x as fast as the EFS-mounted ones. From a test we ran searching for a string in the react-native repository:
# EBS
Performance counter stats for 'git grep interactionManager 8f1ae53':

        17,541      minor-faults      #    0.062 M/sec
           878      major-faults      #    0.003 M/sec
        282.34 msec task-clock        #    0.184 CPUs utilized

   1.535785602 seconds time elapsed

   0.208365000 seconds user
   0.075011000 seconds sys
# EFS
Performance counter stats for 'git grep interactionManager 8f1ae53':

        17,620      minor-faults      #    0.050 M/sec
            53      major-faults      #    0.149 K/sec
        354.59 msec task-clock        #    0.078 CPUs utilized

   4.560184652 seconds time elapsed

   0.212312000 seconds user
   0.129052000 seconds sys
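These listings have the shape of perf stat output; assuming that's the tool in play, counters like these come from an invocation along these lines:
$ perf stat -e minor-faults,major-faults,task-clock git grep interactionManager 8f1ae53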
However, in spite of the individual advantages each had over the others, they all buckled in the same place: large repositories.
While react-native is a decently-sized repo by most measures, it only has a few thousand files:
$ git ls-files | wc -l
7019
We need to support repositories hundreds of times this size, and in cases like that, we actually end up at the mercy of the Linux page cache. The major-faults rows in the snippets above represent times the grep process needed to actually reach out to the disk for blocks, and for the large sample codebases we tested, searches only returned “fast enough” (under 10 seconds) when major-faults was 0 (for instance, when the same search is run twice in a row). In other words, searches are fast only when the entire repository is already in the page cache.
At this point, the grep operation becomes mostly CPU-bound, and perhaps unsurprisingly, this is also where EFS and EBS see their performance disparity vanish:
Performance counter stats for 'git grep interactionManager 8f1ae53':

        17,667      minor-faults      #    0.071 M/sec
             0      major-faults      #    0.000 K/sec
        248.55 msec task-clock        #    0.957 CPUs utilized

   0.259805935 seconds time elapsed

   0.208592000 seconds user
   0.040113000 seconds sys
Of course, the only way to ensure that every block a search needs is already in the page cache when the query comes in is to have a crystal ball, which we can’t afford.
So, we have to look into some more earthly solutions!
What somewhat worked
If we can't brute-force our way through a big repository with grep, then it looks like we'll need to do some indexing.
We tested loading the repository’s files into Elasticsearch, which tokenizes documents and executes keyword queries quickly. The results were promising:
$ cat search.json
{
  "query": {
    "wildcard": {
      "content": {
        "value": "*interactionManager*"
      }
    }
  }
}

$ https GET <elastic-endpoint>/blobs/_search?size=100 < search.json
{
  "hits": {
    "hits": [...],
    "max_score": 1.0,
    "total": {
      "relation": "eq",
      "value": 83
    }
  },
  "timed_out": false,
  "took": 79
}
Even using a tiny node type (i4g.large.search), we saw searches over huge repositories taking well under 500ms. And that's with a cluster of just one node; we could easily split the dataset into shards (ES manages this on its own) such that 10 different nodes each handle 1/10 of the files, and our searches would be even faster.
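For what it's worth, that split is just an index setting. As a sketch, using the same httpie-style call and the blobs index name from the example above:
$ https PUT <elastic-endpoint>/blobs settings:='{"index": {"number_of_shards": 10}}'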
Problem solved, right? Not quite. What we just tried with Elasticsearch worked great for one commit, but our requirement was fast search at any commit. The direct approach would be to index the repo’s state at each commit, for easy lookups, but — multiply thousands of repos by thousands of files by thousands of commits, and suddenly you’re managing tens of billions of documents. Running a cluster of that size would be prohibitively complex and expensive.
So, is the “document database” idea dead-on-arrival?
Back to basics
When looking at the problem through the above lens (the challenge of efficiently managing thousands of files, each with potentially thousands of versions), there is one database that's relevant to the discussion: namely, Git itself!
A Git repository appears to store a full copy of your code at every commit of a potentially decades-long history, but does so without terabytes of disk space. This led us to wonder, "how does it do that, and can we somehow adapt that approach to work with a document search database instead of a filesystem?"
For the "how" question, it's honestly very clever if you haven't read about it before. The short version is, Git splits the data it stores into several different kinds of objects:
Blobs → raw file contents, content-addressed.
Trees → mappings from file paths to blob IDs at a given commit.
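You can poke at both object types yourself with git cat-file; the IDs and paths below are just illustrative:
$ git cat-file -p 'HEAD^{tree}'   # the tree: paths mapped to object IDs
100644 blob 9ae6a1b2...    README.md
040000 tree 3f2c8d4e...    src
$ git cat-file -p 9ae6a1b2        # a blob: the raw contents of that version of README.md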
And as for the "can we" question, it turns out, yes absolutely!
The life of a search query
The key change was that, like Git, we needed to start storing two kinds of objects instead of just one: blobs and trees. Then, when it comes time to perform a search, we execute two queries:
Get the tree object for the commit SHA specified in the query.
Get all blobs that match the search term specified in the query.
And then, in memory, we filter the results from (2) based on the blob IDs in the result of (1), such that we only return files which are actually in-scope for the request. Easy!
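Here's a minimal sketch of that flow in TypeScript. The document shapes and the getTree/searchBlobs helpers are hypothetical stand-ins for the real index queries, not our actual client code:

// Hypothetical document shapes; the real schema lives in our index.
interface TreeDoc { commitSha: string; blobIds: string[] }
interface BlobDoc { blobId: string; path: string; content: string }

// Placeholder queries so the sketch stands alone; the real versions hit the index.
async function getTree(commitSha: string): Promise<TreeDoc> {
  return { commitSha, blobIds: ["blob-a", "blob-b"] };
}
async function searchBlobs(term: string): Promise<BlobDoc[]> {
  return [
    { blobId: "blob-a", path: "src/a.ts", content: `current version using ${term}` },
    { blobId: "blob-z", path: "src/z.ts", content: `stale version using ${term}` },
  ];
}

async function searchAtCommit(commitSha: string, term: string): Promise<BlobDoc[]> {
  const tree = await getTree(commitSha);   // (1) the tree for this commit
  const blobs = await searchBlobs(term);   // (2) every blob version matching the term
  // Keep only blobs that the commit's tree actually points at.
  const inScope = new Set(tree.blobIds);
  return blobs.filter((blob) => inScope.has(blob.blobId));
}

Run against the placeholders above, only src/a.ts comes back, since blob-z isn't referenced by the commit's tree.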
Both of these queries are fast. The first simply retrieves a single document by its ID. While the second does need to consider every version of every file in the repo (which might sound like a lot), our tests show it’s actually not nearly so bad. Some files in the repository have dozens of unique versions, but the vast majority are rarely—if ever—updated. So the total number of documents is generally a small multiple (think ~3x) of the number of files you see in the working tree.
Plus, we’ve made a few optimizations to help things move even faster. Notice that the two queries above are totally independent—you don’t need the result from (1) in order to perform (2), so we do them in parallel. And because a commit’s tree (which just specifies a set of blob IDs) is often much smaller than the result set of blobs, we’ve implemented the in-memory filter as a stream transformer, so that we can start responding to the client the moment we receive our first in-scope match.
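As a rough sketch of those two tweaks (again with hypothetical helpers standing in for the real index calls), the tree fetch and the blob search kick off together, and the filter is a transform stream that forwards each match as soon as it's known to be in scope:

import { Readable, Transform } from "node:stream";

interface BlobDoc { blobId: string; path: string; content: string }

// Hypothetical index calls: one resolves to the commit's blob-ID set,
// the other to a stream of matching blob documents.
async function getTreeBlobIds(commitSha: string): Promise<Set<string>> {
  return new Set(["blob-a", "blob-b"]);
}
async function searchBlobsAsStream(term: string): Promise<Readable> {
  const docs: BlobDoc[] = [
    { blobId: "blob-a", path: "src/a.ts", content: `uses ${term}` },
    { blobId: "blob-z", path: "src/z.ts", content: `stale version of ${term}` },
  ];
  return Readable.from(docs);
}

async function streamSearchAtCommit(commitSha: string, term: string): Promise<Readable> {
  // The two queries are independent, so start them in parallel.
  const [inScope, blobStream] = await Promise.all([
    getTreeBlobIds(commitSha),
    searchBlobsAsStream(term),
  ]);

  // Forward a blob the moment it turns out to be in scope, so the first
  // result can reach the client before the blob query has finished.
  const filter = new Transform({
    objectMode: true,
    transform(blob: BlobDoc, _encoding, callback) {
      if (inScope.has(blob.blobId)) this.push(blob);
      callback();
    },
  });

  return blobStream.pipe(filter);
}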
The future
We have a few other ideas for optimizations, some a bit more experimental. For example, if we only ever use the tree for a set-contains operation, do we even need to fetch the entire tree at all? Maybe we can pre-construct the set of blob IDs and only fetch that at query time, or even pre-construct something that behaves like that set most of the time but is only 10% the size?
And since we’re using Turbopuffer as the document database backing the Graphite code index, adding embeddings is on the table too, which opens the door to semantic search. Maybe we will soon be able to ask PR Chat to check if a pattern you tried is consistent with examples from similar files.
The now
Today, the system is live, and we’ve already indexed tens of millions of source files across thousands of repositories. For several weeks, Graphite Chat has been using this index to power LLM tool calls, where it’s been a big upgrade from our previous strategy of calling the GitHub API.
We can now fetch as many files as we need (no arbitrary rate limit), our searches accurately target the PR’s branch (instead of only being able to search main), and best of all — the index has now served thousands of queries with a median latency consistently under 100 milliseconds.
What started as a “just grep it” idea turned into a deep dive through storage tradeoffs, indexing engines, and Git’s internals. By rethinking search in terms of blobs and trees, we’ve unlocked fast search at any commit—something we couldn’t get off the shelf. And we’ve hardly touched on the ingestion part! It turns out there are ways to have Git shrink a 5GB repo clone down to 20MB, and you'd be surprised how useful an anemic clone like that can still be.
Which is to say, check back soon for a part two!