How I got Claude to write code I could actually ship

Black glass sphere with 3D wireframe cubes on dark background.

Last week, Anthropic released a statement showing that Claude Code usage has increased by 300% in the last 3 months. The appeal of Claude Code is obvious; every team has a laundry list of tasks that are nice-to-have but never make it over the cut line. You know the drill:

Refactor that brittle module I’ve been side-eyeing since early-2024.
Clean up some interfaces that scream “we rushed this out to hit the deadline, and have been scared to touch it ever since.”
Write tests.

Proponents claim that Claude Code can now automate all of this work. If this were true, it would change a lot of things. So, I figured we had to try it out for ourselves.

Testing Claude Code’s abilities

I started with something small: identifying the cause of a bug that was triggering off-by-one errors under very specific edge cases on the Graphite dashboard. After reticulating for a minute or so Claude, impressively, returned a concise summary of the problem, the offending change that regressed this behavior (spoiler: it was me) and staged a fix.

Emboldened by this success, I moved on to having Claude perform a refactor in some well tested code that I’d been putting off. It confidently responded with a few thousand lines of code—an unreadable mess that would take hours to carefully review. Worse, Claude had modified the test files, too. Though the PR was “green,” I was no longer sure it was safe to land.

It turns out this was a theme. Trying to vibe code even medium‑complexity features was proving difficult—not because writing the code was hard, but because it was too easy. It was too easy to create thousands of lines of code where even you as the author have no idea how it works. When that code makes it to review, your only option is to “LGTM” and pray, which isn't really an option in any serious engineering org. For many engineers on our team, this tradeoff meant that they were either using Claude Code for point fixes or not at all.

AI-generated code needs stacked PRs

Large PRs are not a novel problem. Studies show that only 24% of large PRs (>1000 lines) receive any review comments. AI only exacerbates this by producing far more code than humans do. And let’s face it, that code often fails to capture the intent of the person who prompted it.

The solution to this problem, for humans, is something we at Graphite know a thing or two about—stacked diffs. Instead of a single, large PR that implements an entire feature, engineers can put up a sequence of smaller PRs in a “stack” that keeps each change focused and easily reviewable.

This got me wondering, “Could we teach Claude to stack?”

Teaching Claude to stack

I started by adding instructions that explained stacking and how to use the gt CLI to our monorepo’s CLAUDE.md file:

With this new prompt, Claude started using gt to create stacks of PRs, but the boundaries between them were messy. Tests were added before their dependencies existed, and a refactor was split across multiple PRs, leaving none able to pass CI.

I realized that stacking is about more than just writing “small PRs.” It forces you to think critically about the work you intend to do, so you can build a clear plan and share it meaningfully with reviewers. The PRs were small, but they didn’t build and were hard to understand. Turns out, stacking like a senior engineer takes planning, which Claude wasn’t doing on its own, so we need to teach it that too.

Before I create a stack myself, I like to think about where the feature is going and the building blocks that will be needed to get there. For example: I might start by defining interfaces, then the server logic, and finally the frontend UX.

This maps cleanly to the concept of “Todos” in Claude Code, which is the mechanism that the agent uses internally to plan out complex tasks. I added this to Claude’s instructions:

With a few more tweaks to the instructions, Claude Code started organizing its “Todos” as a stack—one pull request for each task. Check out this example, where I asked Claude to build a Wordle app:

And voilà! Claude Code built Wordle, as a stack of pull requests:

#1 feat: add basic HTML structure and styling for Wordle game +268/-0
#2 feat: implement core game logic and state management +129/-2
#3 feat: add keyboard input handling +29/-1
#4 feat: add animations and visual feedback +33/-4

Each PR focuses on a different area of the project, and can be independently reviewed.

Building with Claude Code and GT MCP

Claude Code has become my go-to tool for getting real engineering work done—building features, cleaning up legacy code, and finally tackling the long list of refactors I’d been putting off. It works well because it plans the way engineers do: breaking things down into smaller, reviewable steps. We built the GT MCP to capture this workflow and make it so any agent can follow stacking best practices.

The GT MCP allows AI agents to automatically generate stacked PRs and transform large, AI-generated diffs into a sequence of smaller, focused pull requests. This helps Claude plan its work and produce PRs that are scoped, testable, and easy to review. With the MCP connected, any agent will stack like a senior engineer, without any back-and-forth prompting.

With GT MCP, myself and other engineers at Graphite have also:

Built entire features across our web app and backend.
Investigated and resolved tricky bugs and performance regressions.
Added tests all across our codebase (there’s no excuse to ship code without them)

Our community’s been seeing positive gains from it too. One engineer says:

"I Claude Coded about 200 lines of JS and then really wanted PRs for two things and a draft for the third. I one-shotted it all with gt mcp, and it saved me about 30-45 minutes."
Jordan Scales, SWE

Agents 🤝 stacking

We’re still exploring how to best enable coding agents to stack pull requests and collaborate with human engineers. Our learnings so far are packaged up in the GT MCP, which is now available in the CLI (v1.6.7), so anyone can teach their agents to stack. Learn how to set it up here.