How much context do AI code reviews need?

Greg Foster
Graphite software engineer


When applying AI to code review, one of the central design decisions is: how much of the existing codebase should the model “see” or reference when evaluating a proposed change? At one extreme, you feed in (or index) the entire repository; at the other, you show only the diff (the lines changed plus perhaps some local context). The choice has major implications for accuracy, latency, cost, and utility.

In this guide, we'll:

  • Define the different levels of context (diff-only, module/feature slice, entire repo).
  • Examine pros and cons of each in practice.
  • Discuss how AI models handle context limits.
  • Explore how Graphite's Diamond approaches this problem.
  • Conclude with recommendations and future directions.

First, let's categorize possible scopes of context that an AI reviewer might use:

| Context scope | What's typically available | Use cases / benefits | Limitations / risks |
| --- | --- | --- | --- |
| Diff-only | The changed lines, plus a few lines of surrounding context (e.g. 3–5 lines before/after) | Fast feedback, minimal data ingestion, lightweight compute | Missing global invariants, hidden dependencies, blind to cross-module impacts |
| Module / feature slice | The files or modules likely impacted (e.g. same package, imported code, neighbor files) | Better heuristics, catch inter-file issues within a module | Might still miss subtle interactions or cross-cutting concerns |
| Full repo / indexed code graph | The entire codebase, possibly represented via precomputed embeddings, dependency graphs, or indexes | Maximal awareness of overall architecture, potential to reason across modules | High cost (storage, compute), more latency, risk of information overload or hallucination |

These scopes are not rigid categories; many systems use hybrid strategies (e.g. “diff + relevant slices + embeddings from full repo”).
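
To make the diff-only scope from the table above concrete, here is a minimal sketch of how a review pipeline might gather the changed lines plus a few lines of surrounding context. Using plain `git diff` with a configurable unified-context size, the base branch name, and the repository path are illustrative assumptions rather than a description of any particular tool.

```python
import subprocess

def get_diff_with_context(repo_path: str, base: str = "main", context_lines: int = 5) -> str:
    """Collect the diff-only scope: changed lines plus `context_lines` of
    surrounding context around each hunk, relative to the base branch."""
    result = subprocess.run(
        ["git", "diff", f"--unified={context_lines}", base],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

if __name__ == "__main__":
    # Illustrative usage: this is the entire input a diff-only reviewer would see.
    diff_text = get_diff_with_context(".", base="main", context_lines=5)
    print(f"Diff-only context: {len(diff_text.splitlines())} lines")
```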

Here are the key trade-off dimensions to consider when choosing how much context your AI code review should use:

  1. Accuracy and depth of review

    • Diff-only reviewers can catch many local issues (typos, style violations, simple logic changes), but may miss violations of global invariants, API misuse, or architectural consistency problems.
    • Larger context (module or full repo) enables detection of cross-file issues (e.g. a change in one module that breaks usage elsewhere), consistency across coding patterns, and semantic drift.
    • More context also increases the chance of noise: irrelevant parts may distract the model or lead to hallucinated suggestions if the model is not well guided.
  2. Latency and speed

    • Diff-only is the fastest: small input size, few tokens processed.
    • A module or slice strategy slows things down, but is often acceptable when changes are moderate.
    • Full-repo or heavy graph indexing can add significant latency, especially for large monorepos; careful caching or precomputation is required.
  3. Cost and resource usage

    • More context demands more compute, memory, and storage (e.g. embeddings, indices), and models may need to chunk or filter what they ingest.
    • For large teams or a high volume of PRs, costs scale nonlinearly.
    • You may need to balance "always full context" against "full context only when necessary."
  4. Memory and continuity

    • With full-repo context or embeddings, you can maintain memory across reviews: the system "remembers" decisions, style, and patterns over time.
    • Diff-only models generally lack memory, so context resets with each review.
  5. Security and privacy

    • Giving full repo access to third-party AI services raises security and IP concerns.
    • Diff-only or module-only access reduces exposure.

Because models (especially LLMs) have token limits, you can’t feed arbitrarily large context into them naively. Some strategies:

  1. Context pruning / relevance ranking: Select only the most relevant files or functions to include (e.g. those imported, callers, definitions). Use static analysis or code graphs to pick them.

  2. Embeddings / indexing + retrieval: Precompute vector embeddings (e.g. per file or per function) for the full repo offline. Then, at review time, retrieve the top-k relevant embeddings (code chunks) to include in the prompt. This gives a “soft” full-repo context without sending everything (see the sketch after this list).

  3. Hierarchical prompting / chunking: Break the codebase into chunks and use a multi-stage reasoning approach (first coarse-grained, then drill down). E.g. ask “does this change interact with module X?”; if it does, fetch module X and re-evaluate.

  4. Memory / stateful agents: Maintain state or cache between reviews; reuse prior context so you don’t have to reprocess everything every time.

  5. External rule engines and static analysis: Use conventional static analysis or symbolic tools to flag certain interactions; the AI acts on top of that. This reduces the need for full-repo reasoning in every pass.

These techniques help bridge the gap between minimal context and full-repo awareness in a scalable fashion.
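
As a concrete illustration of the second strategy (embeddings / indexing + retrieval), here is a minimal sketch of an offline index plus review-time retrieval. The bag-of-tokens `embed` function is a deliberately crude stand-in for a real embedding model, and the file-level chunking and cosine ranking are simplifying assumptions made to keep the example self-contained.

```python
import math
from collections import Counter
from pathlib import Path

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-tokens vector. A real system would call an
    embedding model; this stand-in just keeps the sketch runnable."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(repo_path: str) -> dict[str, Counter]:
    """Offline step: embed every source file once and cache the vectors."""
    return {
        str(path): embed(path.read_text(errors="ignore"))
        for path in Path(repo_path).rglob("*.py")
    }

def retrieve_context(index: dict[str, Counter], diff_text: str, k: int = 5) -> list[str]:
    """Review-time step: rank indexed files by similarity to the diff and keep the top k."""
    query = embed(diff_text)
    ranked = sorted(index, key=lambda path: cosine(index[path], query), reverse=True)
    return ranked[:k]
```

The returned paths (or finer-grained chunks in a real system) would then be appended to the review prompt alongside the diff, approximating full-repo awareness without sending everything.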

Now let's dig into Graphite (and its AI reviewer product, Diamond) and see how it handles context, what design choices it makes, and how that relates to the general trade-offs above.

  • Graphite is a platform for managing pull requests using stacked diffs / stacked PRs (breaking a large change into small, reviewable PRs) and offering a dashboard / CLI to manage them.
  • Graphite launched Diamond, an AI-based reviewer that integrates into PR workflows.
  • Diamond emphasizes "codebase-aware" feedback (i.e. not just analyzing the diff) and customizable rules.
  • Graphite insists that AI will not fully replace human review—they view the AI as assistive, not authoritative.

From the available documentation, here's what we know (and what we can reasonably infer) about how Graphite handles context in its AI code review:

  • Diamond is "codebase-aware" — meaning it doesn't only look at the diff; it considers the surrounding code and possibly other modules in the repo to make suggestions.
  • When Graphite's AI is invoked on a pull request, the underlying system clones or fetches code (or at least the diff plus nearby context) to feed into its models.
  • Graphite's infrastructure likely maintains indexes or caching so that parts of the codebase have precomputed context (e.g. ASTs, embeddings) to speed up review. While not explicitly documented, this is a common pattern in AI code tools.
  • Graphite's focus on stacked diffs helps control the size of changes, making it more feasible for AI review: by breaking down changes, the AI rarely has to reason over enormous diffs, which mitigates context explosion.
  • Diamond allows for custom rules (e.g. enforce team preferences) which suggests that part of its reasoning has a rule-based overlay rather than purely end-to-end learned models.

To decide when diff-only review is acceptable and when broader context is needed, consider:

  1. Size and complexity of the change

    • Small bugfixes, style changes, or isolated logic patches sometimes don't require full context
    • Feature-level changes, refactorings, API shifts often demand more context
  2. Inter-module coupling

    • If your modules are highly decoupled (clean architecture, clear interfaces), diff-only reviews work better
    • Tightly coupled or pervasive shared state (global variables, shared config) demands awareness of cross-cutting effects
  3. Domain risk

    • In critical systems (security, performance, compliance), you want more context to avoid blind spots
    • Lower-risk or prototype code might accept lighter reviews
  4. Team scale & latency tolerance

    • Large teams, high volume of PRs: diff-only gives speed
    • You may adopt an approach where you start with diff-only AI review and escalate to full-context review for bigger PRs (see the sketch after this list)
  5. Tooling and infrastructure capabilities

    • If your tool supports embeddings, caching, filtered retrieval, you can get "near full context" without sending everything
    • If not, full context may be impractical
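
One way to operationalize these factors is a small escalation policy that picks a context scope per pull request, as mentioned under team scale above. The sketch below is a toy heuristic: the line and file thresholds, the `high_risk_paths` flag, and the three-way scope split are illustrative assumptions, not recommendations.

```python
from enum import Enum

class ContextScope(Enum):
    DIFF_ONLY = "diff-only"
    MODULE_SLICE = "module / feature slice"
    FULL_REPO = "full repo / indexed code graph"

def choose_scope(changed_lines: int, files_touched: int, high_risk_paths: bool) -> ContextScope:
    """Escalate context as change size, spread, and domain risk grow.
    Thresholds are placeholders; tune them to your team's latency budget."""
    if high_risk_paths:
        return ContextScope.FULL_REPO      # security / compliance code: maximum awareness
    if changed_lines <= 50 and files_touched <= 2:
        return ContextScope.DIFF_ONLY      # small, isolated patches
    if files_touched <= 10:
        return ContextScope.MODULE_SLICE   # moderate changes within one area
    return ContextScope.FULL_REPO          # large or cross-cutting changes

# Example: a 30-line change across two files in non-critical code stays diff-only.
print(choose_scope(changed_lines=30, files_touched=2, high_risk_paths=False).value)
```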

Here are practical patterns and guidelines for building or configuring an AI code review system with context tradeoffs:

  • Hybrid strategy ("diff + relevant slices"): Always send the diff plus a ranked list of relevant nearby code (imports, callers, test files). This often captures 80–90% of the needed context without full-repo cost (a sketch of this pattern appears after this list).

  • Precompute embeddings / code graph: Maintain a vector index of all code (per file / function) so that, at review time, you retrieve only the top-k relevant chunks. This gives a "soft context" without flooding the input.

  • Chunking + hierarchical reasoning: Break the repo into logical areas and perform coarse-grained reasoning first; if the change seems to impact module boundaries, fetch more context and rerun.

  • Caching / incremental updates: Avoid reindexing the entire repo each time; use incremental updates as new commits land.

  • Feedback loop & "feedback rejection": Allow developers to mark AI suggestions as irrelevant; use that signal to refine context selection in future runs (reinforcement / pruning).

  • Bounding the context window: Adopt token limits (e.g. 8k, 16k tokens) and truncation rules (e.g. favor newer code, skip rarely touched modules).

  • Combine with static analyzers: Use static analysis tools to mark obvious cross-file dependencies; use the AI to reason about higher-level semantics.

  • Monitor performance / latency vs quality tradeoff: Instrument how often more context leads to better suggestions, and when the overhead isn't justified.
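
Putting the hybrid-strategy and context-window bullets together, here is a minimal sketch of assembling a review prompt under a token budget. The whitespace-based `approx_tokens` estimate and the 8k default budget are simplifying assumptions; a real system would use the model's own tokenizer and limits.

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate; a real system would use the model's tokenizer."""
    return len(text.split())

def assemble_context(diff_text: str, ranked_slices: list[str], budget: int = 8000) -> str:
    """Hybrid "diff + relevant slices" prompt assembly under a token budget.

    The diff always goes in first; ranked code slices (imports, callers,
    tests) are appended in relevance order until the budget is exhausted.
    """
    parts = [diff_text]
    used = approx_tokens(diff_text)
    for code_slice in ranked_slices:
        cost = approx_tokens(code_slice)
        if used + cost > budget:
            break  # truncation rule: drop the least relevant slices first
        parts.append(code_slice)
        used += cost
    return "\n\n".join(parts)
```

The `ranked_slices` input could come from a retrieval index like the one sketched earlier, or from static analysis of imports and callers.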

In summary:

  • The question "full repo or diff only?" isn't binary: most effective systems use hybrid approaches (diff + selective context) to balance speed and depth.
  • Diff-only models remain fast and useful for many small changes, but risk missing global interactions.
  • Full-repo models give maximum awareness but demand careful engineering (indexing, caching, chunking) to scale.
  • Graphite's Diamond is a representative modern tool: it positions itself as "codebase-aware" rather than diff-only, layered over a stacked diff setup to control change size.
  • Evolving trends include better memory / stateful agents (so the AI "remembers" past PRs), more efficient embeddings, and more precise relevance ranking to reduce extraneous context.

Diff-only AI code review analyzes just the changed lines plus a few surrounding lines of context, making it fast but potentially missing global issues. Full repo review gives the AI access to the entire codebase, enabling it to catch cross-file dependencies and architectural consistency issues, but requires more compute and can increase latency. Most modern tools use a hybrid approach, combining diff analysis with selective access to relevant parts of the codebase.

The choice depends on several factors: For small bugfixes and style changes, diff-only is often sufficient. For feature-level changes, refactorings, or critical systems (security, compliance), you'll want broader context. Consider your modules' coupling—if highly decoupled, diff-only works better. For large teams with high PR volume, start with diff-only and escalate to broader context for complex changes. The best approach is often hybrid: diff plus relevant slices of related code.

Graphite's Diamond is "codebase-aware," meaning it uses more than just the diff—it considers surrounding code and potentially other modules in the repository. The exact depth varies, but it's designed to provide nuanced feedback beyond what diff-only tools offer, while leveraging Graphite's stacked PR approach to keep changes manageable. This hybrid strategy helps Diamond catch cross-file issues without the full overhead of indexing entire large repositories.

AI models use several strategies to handle large codebases within token limits: context pruning (selecting only relevant files/functions), embeddings and retrieval (precomputing vector representations and fetching top-k relevant chunks at review time), hierarchical reasoning (coarse-grained analysis first, then drilling down), and caching/incremental updates. Many tools also combine AI with static analysis to identify dependencies, reducing the need for full-repo reasoning on every pass.

AI code reviewers can be effective with limited context, especially for local issues like style violations, simple logic errors, and common bugs. However, they'll miss some classes of issues: violations of global invariants, subtle cross-module interactions, or architectural consistency problems. The key is matching the context level to the change type and risk level, and using hybrid strategies that provide targeted context where needed rather than sending everything or nothing.
