How to write better prompts for AI code generation

Greg Foster
Graphite software engineer


AI models (LLMs, code generation systems) are increasingly capable of producing nontrivial code. But the quality of the output heavily depends on how you prompt them. A well-crafted prompt can make the difference between usable, maintainable code and output full of bugs, inefficiencies, or security flaws. This guide walks you through principles, prompt templates, examples, and how to combine prompt engineering with robust review processes.

Importantly: AI-generated code must always be reviewed (by humans, augmented with tools). Relying blindly on generated code is risky. Later in this guide I'll explain how tools like Graphite Agent can help reduce the burden of reviewing AI output.

Why prompt quality matters:

  • Output variance: Small changes in phrasing can yield drastically different code (correct vs. broken).
  • Guiding the model: You need to steer it toward the right domain, style, and constraints.
  • Reducing hallucination and irrelevant output: Good prompts reduce imagined APIs and errors.
  • Saving your time: A better prompt means less iteration and debugging.
  • Enabling reviewability: You want code that is understandable, testable, and auditable.
Be specific and supply context:

  • Avoid vagueness. Instead of "write a parser," say "write a Python JSON parser that handles nested objects and arrays."
  • Define inputs, outputs, and edge cases.
  • Avoid ambiguous pronouns (“it,” “this”) without a clear referent.
  • Supply required context: e.g. existing module APIs, environment constraints, dependencies.
  • Scope the problem: don’t ask for a full system if you actually want one function or component.
State your constraints explicitly:

  • Performance, memory, and time complexity constraints.
  • Style guidelines (naming conventions, idiomatic style).
  • Error handling and edge conditions.
  • Language version and library versions.
Ask for documentation and verification:

  • Request inline comments, docstrings, and design rationale.
  • Ask for test cases or verification code.
  • Optionally ask for a brief summary of the algorithm or its time complexity.
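
A single prompt that combines these elements might look like this (the file format and required keys here are purely illustrative):

“In Python 3.11, write a function parse_config(path: str) -> dict that reads a TOML file using only the standard library (tomllib), validates that the keys ['host', 'port'] exist, raises ValueError with a clear message if they don't, follows PEP 8 naming, and includes a docstring, type hints, and two pytest test cases.”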

Give the model a skeleton or partial code and ask it to fill in the missing parts.

Example template:

# File: data_processing.py
def transform(input_data):
    # Your code here to transform a dict into required output
    pass

def validate_schema(obj):
    # Your code here
    pass

Prompt: “Given the skeleton above, fill out transform and validate_schema so that transform converts nested input structures to a flat dict, and validate_schema ensures required keys ['a','b','c'] exist, raising ValueError otherwise. Write unit tests too.”
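
A response to that prompt might come back roughly like the sketch below (the dotted-key flattening scheme is an assumption about the spec; the requested tests are omitted here for brevity):

# File: data_processing.py (sketch of a possible response)
def transform(input_data):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}

    def _walk(obj, prefix=""):
        for key, value in obj.items():
            full_key = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict):
                _walk(value, full_key)
            else:
                flat[full_key] = value

    _walk(input_data)
    return flat

def validate_schema(obj):
    """Raise ValueError unless the required keys 'a', 'b', 'c' are present."""
    missing = [key for key in ("a", "b", "c") if key not in obj]
    if missing:
        raise ValueError(f"Missing required keys: {missing}")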

Encourage the model to break the problem down into steps:

Prompt:

“First, outline a plan in numbered steps for how to sort a list of dictionary objects by multiple keys (primary, secondary). Then implement the code in Python. Finally provide tests and a complexity analysis.”
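
The implementation step of that prompt might land on something like this sketch (the function name is illustrative, and ascending order on both keys is assumed):

from operator import itemgetter

def sort_by_keys(records, primary, secondary):
    """Sort dicts by a primary key, breaking ties with a secondary key.

    itemgetter builds a (primary, secondary) tuple key and Python's sort
    is stable, so a single sorted() call is enough; O(n log n) overall.
    """
    return sorted(records, key=itemgetter(primary, secondary))

# Usage
rows = [{"dept": "ops", "name": "Al"}, {"dept": "eng", "name": "Bo"}, {"dept": "eng", "name": "Al"}]
print(sort_by_keys(rows, "dept", "name"))
# [{'dept': 'eng', 'name': 'Al'}, {'dept': 'eng', 'name': 'Bo'}, {'dept': 'ops', 'name': 'Al'}]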

Provide 1–3 example inputs and outputs before asking the model to generalize.

Prompt:

# Examples
Input: [{"name":"Alice","score":10}, {"name":"Bob","score":8}]
Output: ["Alice", "Bob"] # Sorted by descending score
Input: [{"name":"X","score":5}, {"name":"Y","score":5}, {"name":"Z","score":7}]
Output: ["Z","X","Y"]
Now write a function sort_names_by_score(records) that, given a list of dict with keys name and score, returns the names sorted by descending score. Include error handling for missing keys.
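
A reasonable response to this few-shot prompt could look like the sketch below (raising ValueError for missing keys is an assumption about how strict the error handling should be):

def sort_names_by_score(records):
    """Return names sorted by descending score; reject records missing keys."""
    for i, record in enumerate(records):
        if "name" not in record or "score" not in record:
            raise ValueError(f"Record {i} is missing 'name' or 'score': {record!r}")
    # sorted() is stable, so tied scores keep their input order,
    # matching the second example in the prompt.
    return [r["name"] for r in sorted(records, key=lambda r: r["score"], reverse=True)]

assert sort_names_by_score([{"name": "Alice", "score": 10}, {"name": "Bob", "score": 8}]) == ["Alice", "Bob"]
assert sort_names_by_score([{"name": "X", "score": 5}, {"name": "Y", "score": 5}, {"name": "Z", "score": 7}]) == ["Z", "X", "Y"]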

Frame your request so the response includes code, tests, and usage instructions.

Prompt:

"Write a Go function ReverseUnicode(s string) string that correctly reverses a string by Unicode grapheme clusters (not raw codepoints). Include unit tests (using testing package). Include comments explaining each step."

Pitfall | Symptom | Mitigation
Missing edge cases | e.g. None inputs, empty containers | Explicitly mention edge cases in the prompt
Mismatched types or interfaces | Model picks wrong type signatures | Provide stubs / type hints
Overlong outputs | Model writes too much scaffolding or unrelated code | Set “only generate X function(s)” or “no code outside this file”
Hallucinated APIs | Using nonexistent library calls | Restrict libraries (e.g. “only use standard library”)
Forgetting error handling | Output code lacks exceptions or validations | Ask explicitly for “error checking / validation”
Inconsistent style | e.g. mixing snake_case and camelCase | Include style instructions in the prompt

Also note that LLMs have token and context limits: very large prompts or contexts may cause truncation or loss of coherence. Break tasks into smaller units when needed.

A weak prompt:

“Write a sorting function in Python for my custom data structures”

Problems: Vague, no spec, no examples, no constraints.

A stronger prompt:

“In Python 3.10+, write a function sort_by_fields(objs: list[dict], keys: list[str]) -> list[dict] that sorts a list of dictionaries by multiple fields in order. If a key is missing in a dict, treat it as None and push it to the end. Include docstring, error handling, and three example test cases.”

This is precise, gives signature, behavior, constraints, and asks for tests.
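
An answer to that prompt might land close to this sketch (each field contributes an (is_missing, value) pair to the sort key so that missing values sort last without ever comparing None to another type):

def sort_by_fields(objs: list[dict], keys: list[str]) -> list[dict]:
    """Sort dicts by multiple fields in order; missing keys are treated as None and sort last."""
    def sort_key(obj: dict):
        return tuple((obj.get(k) is None, obj.get(k)) for k in keys)

    return sorted(objs, key=sort_key)

# Example test cases
rows = [{"a": 2, "b": 1}, {"a": 1}, {"b": 3}]
assert sort_by_fields(rows, ["a"]) == [{"a": 1}, {"a": 2, "b": 1}, {"b": 3}]
assert sort_by_fields(rows, ["a", "b"])[0] == {"a": 1}
assert sort_by_fields([], ["a"]) == []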

You can also supply existing code as context:

“I have a module user_service with:

class User:
    id: int
    name: str
    is_active: bool

def fetch_users_from_db() -> list[User]:
    pass

def filter_active(users: list[User]) -> list[User]:
    pass

Fill in filter_active to filter users whose is_active==True. Also write a function group_users_by_initial(users: list[User]) -> dict[str, list[User]] grouping by first letter of name (uppercase). Write pytest test cases covering empty list, all inactive, mixed. Comment your design decisions."

This gives context and multiple tasks.
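
For reference, the generated functions could come back along these lines (a sketch using the User class from the prompt; grouping empty names under "" is an assumption the prompt leaves open, and the pytest cases are omitted here):

from collections import defaultdict

def filter_active(users: list[User]) -> list[User]:
    """Return only users whose is_active flag is True."""
    return [u for u in users if u.is_active]

def group_users_by_initial(users: list[User]) -> dict[str, list[User]]:
    """Group users by the uppercased first letter of their name."""
    groups: dict[str, list[User]] = defaultdict(list)
    for user in users:
        initial = user.name[0].upper() if user.name else ""  # assumption: empty names go under ""
        groups[initial].append(user)
    return dict(groups)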

Even with a perfect prompt, AI output is not trustworthy without review. Here’s how to approach reviewing code:

Human review:

  • Every pull request or generated snippet should be eyeballed by a developer.
  • Check semantics, edge cases, performance, style, and security.
  • Compare the code against expectations and test coverage.

Automated checks:

  • Run unit tests and integration tests.
  • Use linters (Flake8, ESLint, golangci-lint).
  • Use static analyzers and type checkers (e.g. MyPy).
  • Use fuzzing or property-based testing where applicable.

These automated checks help catch issues missed by the model or the reviewer.
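
For instance, a property-based test written with the Hypothesis library can probe an invariant the prompt specified; the sketch below assumes the sort_by_fields function from the earlier example is importable:

from hypothesis import given, strategies as st

# Records either have an integer "a" key or omit it entirely.
records = st.lists(
    st.one_of(st.builds(dict), st.builds(dict, a=st.integers())),
    max_size=20,
)

@given(records)
def test_missing_keys_sort_last(objs):
    result = sort_by_fields(objs, ["a"])
    # Property: once a record without "a" appears, no record with "a" may follow.
    seen_missing = False
    for obj in result:
        if "a" not in obj:
            seen_missing = True
        else:
            assert not seen_missing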

Tools like Graphite Agent, part of the Graphite code review platform, provide automated reviews of pull requests using AI. Graphite Agent is context-aware (analyzes the entire codebase) and flags logic bugs, performance issues, style inconsistencies, security vulnerabilities, and more.

Key features:

  • Instant reviews on PRs with suggestions
  • High signal (few false positives) to reduce noise
  • Custom rules and configurations so it conforms to your team’s style
  • Integration with GitHub for in-line comments and one-click fixes

Despite these strengths, AI should augment, not replace, human code review. Human reviewers provide domain knowledge, business context, and judgment that AI lacks.

By using Graphite Agent as a “first pass” reviewer, you can offload much of the low-hanging bug detection, freeing human reviewers to focus on architecture, domain logic, correctness, and subtle tradeoffs.

Here’s a suggested workflow:

  1. Define the requirement / spec for the component you want to generate
  2. Craft a prompt using patterns (scaffold, examples, constraints)
  3. Invoke the model (possibly iteratively)
  4. Capture the generated code (store in a draft PR or branch)
  5. Run automated tests, linters, static analysis
  6. Submit as pull request
  7. Allow Graphite Agent (Graphite AI review) to comment / flag issues
  8. Human reviewer inspects remaining issues, considering context, domain, performance
  9. Iterate with fixes, rerun automated and AI reviews
  10. Merge once confidence is high

Over time, refine your prompt styles and rules (e.g. set up internal prompt templates, or prompt boilerplate in your toolchain).

  • Better prompts = more reliable, maintainable output
  • Always include context, constraints, examples, tests
  • Break large tasks into modular prompts
  • Always review AI output — human + automated + AI review
  • Tools like Graphite Agent can reduce manual review burden, but can't replace human judgment
  • Iterate prompt styles, build reusable prompt templates
  • Monitor acceptance rates of AI suggestions (in your own team) to improve prompts / review rules

Ready to streamline your AI code generation workflow? Try Graphite Agent for automated code review that catches bugs, improves code quality, and integrates seamlessly with your existing GitHub workflow.

How long should a prompt be?

There's no one-size-fits-all answer, but aim for clarity over brevity. Include all necessary context, constraints, and examples. Most effective prompts range from 100 to 500 words, but complex tasks may require longer prompts. The key is ensuring the AI has enough information to produce quality output.

When is it safe to skip reviewing AI-generated code?

Never. AI-generated code should always be reviewed by humans, augmented with automated tools. Even with perfect prompts, AI models can introduce bugs, security vulnerabilities, or logic errors. Use tools like Graphite Agent as a first-pass reviewer, but human judgment remains essential.

How do you deal with hallucinated APIs?

Prevention is key: specify exact libraries, versions, and APIs in your prompts. Use constraints like "only use standard library" or "use version X of library Y." When hallucinations occur, refine your prompt with more specific constraints and examples.

Should you rely on AI-generated code for critical systems?

Use AI as a starting point, not a final solution. For critical systems, apply extra scrutiny: comprehensive testing, security audits, and multiple review cycles. Consider AI-generated code a first draft that requires significant human refinement and validation.
