Last week, Anthropic released a statement showing that Claude Code usage has increased by 300% in the last 3 months. The appeal of Claude Code is obvious: Every team has a laundry list of tasks that are nice-to-have but never make it over the cut line. You know the drill:

  • Refactor that brittle module I’ve been side-eyeing since early 2024.

  • Clean up some interfaces that scream “we rushed this out to hit the deadline” and that we’ve been scared to touch ever since.

  • Write tests.

Proponents claim that Claude Code can now automate all of this work. If this were true, it would change a lot of things. So, I figured we had to try it out for ourselves.

I started with something small: identifying the cause of a bug that was triggering off-by-one errors under very specific edge cases on the Graphite dashboard. After reticulating for a minute or so, Claude impressively returned a concise summary of the problem, identified the offending change that regressed the behavior (spoiler: it was me), and staged a fix.

Emboldened by this success, I moved on to having Claude perform a refactor in some well tested code that I’d been putting off. It confidently responded with a few thousand lines of code—an unreadable mess that would take hours to carefully review. Worse, Claude had modified the test files, too. Though the PR was “green,” I was no longer sure it was safe to land.

It turns out this was a theme. Trying to vibe code even medium-complexity features was proving difficult, not because writing the code was hard, but because it was too easy: too easy to generate thousands of lines of code that even you, the author, don’t fully understand. When that code makes it to review, your only option is to “LGTM” and pray, which isn’t really an option in any serious engineering org. For many engineers on our team, this tradeoff meant they were either using Claude Code for point fixes or not at all.

Large PRs are not a novel problem. Studies show that only 24% of large PRs (>1000 lines) receive any review comments. AI only exacerbates this by producing far more code than humans do. And let’s face it, that code often fails to capture the intent of the person who prompted it.

The solution to this problem, for humans, is something we at Graphite know a thing or two about—stacked diffs. Instead of a single, large PR that implements an entire feature, engineers can put up a sequence of smaller PRs in a “stack” that keeps each change focused and easily reviewable.
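As a rough sketch of the workflow (the exact `gt` flags may differ between versions, so treat this as illustrative rather than definitive), a three-PR stack might be built like this:

```shell
# Start from trunk and create the first branch in the stack
gt create -am "Define shared interfaces"

# Each subsequent `gt create` stacks a new branch on top of the previous one
gt create -am "Implement server logic"
gt create -am "Add frontend UX"

# Open the whole stack as a chain of linked PRs
gt submit --stack
```

Each PR in the chain is small enough to review on its own, while Graphite tracks the dependencies between them.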

This got me wondering, “Could we teach Claude to stack?”

I started by adding instructions that explained stacking and how to use the gt CLI to our monorepo’s CLAUDE.md file:
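The post doesn’t reproduce the exact instructions, but as an illustration, a CLAUDE.md addition along these lines would cover the basics (the wording here is my own, not Graphite’s actual prompt):

```markdown
## Stacking with Graphite
- Break large changes into a stack of small, focused PRs instead of one big PR.
- Use `gt create -m "<message>"` to start each new branch on top of the previous one.
- Use `gt submit --stack` to open the whole stack as linked PRs.
- Keep every PR in the stack independently buildable and CI-passing.
```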

With this new prompt, Claude started using gt to create stacks of PRs, but the boundaries between them were messy. Tests were added before their dependencies existed, and a refactor was split across multiple PRs, leaving none able to pass CI.

I realized that stacking is about more than just writing “small PRs.” It forces you to think critically about the work you intend to do, so you can build a clear plan and share it meaningfully with reviewers. The PRs were small, but they didn’t build and were hard to understand. Turns out, stacking like a senior engineer takes planning, which Claude wasn’t doing on its own, so we needed to teach it that, too.

Before I create a stack myself, I like to think about where the feature is going and the building blocks that will be needed to get there. For example: I might start by defining interfaces, then the server logic, and finally the frontend UX.

This maps cleanly to the concept of “Todos” in Claude Code, which is the mechanism that the agent uses internally to plan out complex tasks. I added this to Claude’s instructions:
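Again, the exact wording isn’t reproduced in the post; a hypothetical instruction tying Todos to stacked PRs might look something like this:

```markdown
## Planning stacks with Todos
- Before writing any code, produce a Todo list for the task.
- Order Todos so each one depends only on the Todos before it.
- Map each Todo to exactly one PR in the stack, created with `gt create`.
- Verify that each PR builds and passes tests before starting the next Todo.
```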

With a few more tweaks to the instructions, Claude Code started organizing its “Todos” as a stack—one pull request for each task. Check out this example, where I asked Claude to build a Wordle app:

And voilà! Claude Code built Wordle, as a stack of pull requests:
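The original post shows the resulting stack; the PR titles below are hypothetical, but a Wordle stack in this spirit might look like:

```
◉  4: Add keyboard input and share-your-result UX
◯  3: Render the guess grid and tile coloring
◯  2: Implement game logic and word validation
◯  1: Define core types and the word list
main
```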

Each PR focuses on a different area of the project, and can be independently reviewed.

Now with stacking in its repertoire, Claude Code is my daily driver for building complex features and completing a long list of refactors and cleanup that I thought would otherwise never happen. Plus, with GT MCP, Claude can plan and break down its thinking so that it’s easier for me to review and ship. GT MCP aligns well with how agents internally plan their work, turning their “Todo” lists into real, reviewable PRs: each one atomic, CI-passing, and logically scoped.

With GT MCP, other engineers at Graphite and I have also:

  • Built entire features across our web app and backend.

  • Investigated and resolved tricky bugs and performance regressions.

  • Added tests all across our codebase (there’s no excuse to ship code without them).

Our community’s been seeing positive gains from it too. One engineer says:

"I Claude Coded about 200 lines of JS and then really wanted PRs for two things and a draft for the third. I one-shotted it all with gt mcp, and it saved me about 30-45 minutes."

Jordan Scales, Engineer at Notion

We’re still exploring how to best enable coding agents to stack pull requests and collaborate with human engineers. Our learnings so far are packaged up in the GT MCP, which is now available in the beta CLI (v1.6.7), so anyone can teach their agents to stack. Learn how to set it up here.

Built for the world's fastest engineering teams, now available for everyone