Table of contents
- Why prompt quality matters
- Core principles of prompt design
- Prompt patterns and templates
- Pitfalls, gotchas, and mitigation
- Examples of good vs poor prompts
- Reviewing AI-generated code: risks, strategies, and tooling
- Sample workflow integrating prompt + review
- Summary and best practices
- Frequently asked questions
AI models (LLMs, code generation systems) are increasingly capable of producing nontrivial code. But the quality of the output heavily depends on how you prompt them. A well-crafted prompt can make the difference between usable, maintainable code and output full of bugs, inefficiencies, or security flaws. This guide walks you through principles, prompt templates, examples, and how to combine prompt engineering with robust review processes.
Importantly: AI-generated code must always be reviewed (by humans, augmented with tools). Relying blindly on generated code is risky. Later in this guide I'll explain how tools like Graphite Agent can help reduce the burden of reviewing AI output.
Why prompt quality matters
- Output variance: Small changes in phrasing may yield drastically different code (correct vs broken).
- Guiding the model: You need to drive it toward the correct domain, style, constraints.
- Reducing hallucination or irrelevant output: Good prompts reduce imagined APIs or errors.
- Saving your time: Better prompt → less iteration and debugging.
- Enabling reviewability: You want code that is understandable, testable, and auditable.
Core principles of prompt design
Clarity and specificity
- Avoid vagueness. Instead of "write a parser," say "write a Python JSON parser that handles nested objects and arrays."
- Define inputs, outputs, edge cases.
- Avoid ambiguous pronouns (“it,” “this”) without clear referent.
Context and scope
- Supply required context: e.g. existing module APIs, environment constraints, dependencies.
- Scope the problem: Don’t ask for a full system if you actually want one function or component.
Constraints and requirements
- Performance, memory, time complexity constraints.
- Style guidelines (naming conventions, idiomatic style).
- Error handling, edge conditions.
- Language version, library versions.
Asking for explanation or commentary
- Request inline comments, docstrings, design rationale.
- Ask for test cases or verification code.
- Optionally ask for a brief summary of algorithm or time complexity.
Prompt patterns and templates
Scaffold + fill in
Give a skeleton or partial code and ask to fill in missing parts.
Example template:
```python
# File: data_processing.py

def transform(input_data):
    # Your code here to transform a dict into required output
    pass

def validate_schema(obj):
    # Your code here
    pass
```
Prompt: “Given the skeleton above, fill out `transform` and `validate_schema` so that `transform` converts nested input structures to a flat dict, and `validate_schema` ensures required keys `['a', 'b', 'c']` exist, raising `ValueError` otherwise. Write unit tests too.”
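For reference, here is a minimal sketch of what a good response to this prompt might look like. The dot-joined key convention for the flattened dict is an assumption the prompt leaves open, and the requested unit tests are omitted for brevity.

```python
# Sketch of a plausible fill-in; the flattening convention (dot-joined keys) is assumed.

def transform(input_data):
    """Flatten nested dicts into a single-level dict with dot-joined keys."""
    flat = {}

    def _walk(obj, prefix):
        if isinstance(obj, dict):
            for key, value in obj.items():
                _walk(value, f"{prefix}{key}.")
        else:
            flat[prefix.rstrip(".")] = obj

    _walk(input_data, "")
    return flat


def validate_schema(obj):
    """Raise ValueError unless the required keys 'a', 'b', 'c' are all present."""
    missing = [key for key in ("a", "b", "c") if key not in obj]
    if missing:
        raise ValueError(f"Missing required keys: {missing}")
```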
Step-by-step reasoning / chain of thought
Encourage the model to break down the problem in steps:
Prompt:
“First, outline a plan in numbered steps for how to sort a list of dictionary objects by multiple keys (primary, secondary). Then implement the code in Python. Finally provide tests and a complexity analysis.”
Few-shot examples
Provide 1–3 example inputs and outputs before asking the model to generalize.
Prompt:
```
# Examples
Input: [{"name":"Alice","score":10}, {"name":"Bob","score":8}]
Output: ["Alice", "Bob"]  # Sorted by descending score

Input: [{"name":"X","score":5}, {"name":"Y","score":5}, {"name":"Z","score":7}]
Output: ["Z", "X", "Y"]

Now write a function sort_names_by_score(records) that, given a list of dicts
with keys "name" and "score", returns the names sorted by descending score.
Include error handling for missing keys.
```
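One solution the model could plausibly return for this prompt (raising `KeyError` for missing keys is an assumption, since the prompt only says “error handling”):

```python
def sort_names_by_score(records):
    """Return names ordered by descending score; ties keep their input order."""
    for record in records:
        if "name" not in record or "score" not in record:
            raise KeyError(f"record missing 'name' or 'score': {record!r}")
    # sorted() is stable, so equal scores preserve input order (matching example 2).
    return [r["name"] for r in sorted(records, key=lambda r: r["score"], reverse=True)]
```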
Code + tests + instructions
Frame your request so it asks for the code, its tests, and usage instructions together.
Prompt:
"Write a Go function
ReverseUnicode(s string) string
that correctly reverses a string by Unicode grapheme clusters (not raw codepoints). Include unit tests (usingtesting
package). Include comments explaining each step."
Pitfalls, gotchas, and mitigation
| Pitfall | Symptom | Mitigation |
|---|---|---|
| Missing edge cases | e.g. None inputs, empty containers | Explicitly mention edge cases in prompt |
| Mismatched types or interfaces | Model picks wrong type signatures | Provide stubs / type hints |
| Overlong outputs | Model writes too much scaffolding or unrelated code | Set “only generate X function(s)” or “no code outside this file” |
| Hallucinated APIs | Using nonexistent library calls | Restrict libraries (e.g. “only use standard library”) |
| Forgetting error handling | Output code lacks exceptions or validations | Ask explicitly for “error checking / validation” |
| Inconsistent style | e.g. mixing snake_case and camelCase | Include style instructions in prompt |
Also note that LLMs have token/context limits: Very large prompts or large contexts may cause truncation or loss of coherence. Break tasks into smaller units when needed.
Examples of good vs poor prompts
Poor prompt
“Write a sorting function in Python for my custom data structures”
Problems: Vague, no spec, no examples, no constraints.
Better prompt
“In Python 3.10+, write a function `sort_by_fields(objs: list[dict], keys: list[str]) -> list[dict]` that sorts a list of dictionaries by multiple fields in order. If a key is missing in a dict, treat it as `None` and push it to the end. Include docstring, error handling, and three example test cases.”
This is precise, gives signature, behavior, constraints, and asks for tests.
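A response that meets that spec might look like the sketch below. Using a per-key tuple to push missing values to the end is one reasonable approach rather than the only one, and the requested docstring and error handling are included while the three test cases are omitted here.

```python
def sort_by_fields(objs: list[dict], keys: list[str]) -> list[dict]:
    """Sort dicts by the given keys in order; dicts missing a key sort last for that key."""
    if not keys:
        raise ValueError("keys must be a non-empty list of field names")

    def sort_key(obj: dict):
        # (0, value) when the key is present, (1, None) when missing, so missing sorts last.
        return tuple((0, obj[k]) if k in obj else (1, None) for k in keys)

    return sorted(objs, key=sort_key)
```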
More advanced prompt
“I have a module `user_service` with:

```python
class User:
    id: int
    name: str
    is_active: bool

def fetch_users_from_db() -> list[User]:
    pass

def filter_active(users: list[User]) -> list[User]:
    pass
```

Fill in `filter_active` to filter users whose `is_active == True`. Also write a function `group_users_by_initial(users: list[User]) -> dict[str, list[User]]` grouping by first letter of name (uppercase). Write pytest test cases covering empty list, all inactive, mixed. Comment your design decisions.”
This gives context and multiple tasks.
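Given that context, the generated module might come back looking something like this. Making `User` a dataclass and skipping empty names are choices the prompt leaves open, and the requested pytest cases are omitted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class User:
    id: int
    name: str
    is_active: bool


def filter_active(users: list[User]) -> list[User]:
    """Keep only users whose is_active flag is True."""
    return [user for user in users if user.is_active]


def group_users_by_initial(users: list[User]) -> dict[str, list[User]]:
    """Group users by the uppercased first letter of their name."""
    groups: dict[str, list[User]] = defaultdict(list)
    for user in users:
        if user.name:  # skip empty names rather than index into ""
            groups[user.name[0].upper()].append(user)
    return dict(groups)
```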
Reviewing AI-generated code: risks, strategies, and tooling
Even with a perfect prompt, AI output is not trustworthy without review. Here’s how to approach reviewing code:
Human in the loop
- Every pull request or generated snippet should be eyeballed by a developer.
- Check semantics, edge cases, performance, style, security.
- Compare with expectations and test coverage.
Automated checks, testing, static analysis
- Run unit tests, integration tests.
- Use linters (Flake8, ESLint, golangci-lint).
- Use static analyzers (e.g. MyPy, type checkers).
- Use fuzzing or property-based testing if applicable.
These checks help catch issues that both the model and a human reviewer might miss.
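To make the last point concrete, here is a property-based test sketch using the hypothesis library. It targets the `transform` function from the scaffold example earlier; the `data_processing` module path is an assumption.

```python
# test_transform_properties.py -- property-based sketch using hypothesis.
from hypothesis import given, strategies as st

from data_processing import transform  # assumed location of the generated code

# Nested dicts with short alphabetic keys and integer leaves.
nested = st.recursive(
    st.integers(),
    lambda children: st.dictionaries(st.text(alphabet="abc", min_size=1), children, max_size=3),
    max_leaves=10,
)


def leaf_values(obj):
    """Collect every non-dict value from an arbitrarily nested dict."""
    if isinstance(obj, dict):
        return [v for child in obj.values() for v in leaf_values(child)]
    return [obj]


@given(st.dictionaries(st.text(alphabet="abc", min_size=1), nested, max_size=3))
def test_flattening_is_flat_and_lossless(data):
    flat = transform(data)
    assert not any(isinstance(v, dict) for v in flat.values())   # output is truly flat
    assert sorted(flat.values()) == sorted(leaf_values(data))    # no leaf lost or invented
```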
AI code review assistants (e.g. Graphite Agent by Graphite)
Tools like Graphite Agent, part of the Graphite code review platform, provide automated reviews of pull requests using AI. Graphite Agent is context-aware (analyzes the entire codebase) and flags logic bugs, performance issues, style inconsistencies, security vulnerabilities, and more.
Key features:
- Instant reviews on PRs with suggestions
- High signal (few false positives) to reduce noise
- Custom rules and configurations so it conforms to your team’s style
- Integration with GitHub for in-line comments and one-click fixes
Despite these strengths, AI should augment, not replace, human code review. Human reviewers provide domain knowledge, business context, and judgment that AI lacks.
By using Graphite Agent as a “first pass” reviewer, you can offload much of the low-hanging bug detection, freeing human reviewers to focus on architecture, domain logic, correctness, and subtle tradeoffs.
Sample workflow integrating prompt + review
Here’s a suggested workflow:
- Define the requirement / spec for the component you want to generate
- Craft a prompt using patterns (scaffold, examples, constraints)
- Invoke the model (possibly iteratively)
- Capture the generated code (store in a draft PR or branch)
- Run automated tests, linters, static analysis
- Submit as pull request
- Allow Graphite Agent (Graphite AI review) to comment / flag issues
- Human reviewer inspects remaining issues, considering context, domain, performance
- Iterate with fixes, rerun automated and AI reviews
- Merge once confidence is high
Over time, refine your prompt styles and rules (e.g. set up internal prompt templates, or prompt boilerplate in your toolchain).
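One lightweight way to do that is to keep the boilerplate as code in your repo. The helper below is a hypothetical sketch of such a template (not a Graphite feature); the function name and prompt wording are illustrative.

```python
# prompt_templates.py -- hypothetical internal helper for reusable prompt boilerplate.

def build_function_prompt(signature: str, behavior: str, constraints: list[str]) -> str:
    """Assemble a code-generation prompt from a signature, a behavior spec, and constraints."""
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    return (
        "Write a Python 3.10+ function with this exact signature:\n"
        f"    {signature}\n\n"
        f"Behavior: {behavior}\n\n"
        f"Constraints:\n{constraint_lines}\n\n"
        "Include a docstring, error handling, and three pytest test cases."
    )


# Example usage:
prompt = build_function_prompt(
    signature="sort_by_fields(objs: list[dict], keys: list[str]) -> list[dict]",
    behavior="Sort a list of dicts by multiple fields; missing keys sort last.",
    constraints=["Standard library only", "Follow PEP 8 naming"],
)
```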
Summary and best practices
- Better prompts = more reliable, maintainable output
- Always include context, constraints, examples, tests
- Break large tasks into modular prompts
- Always review AI output — human + automated + AI review
- Tools like Graphite Agent can reduce manual review burden, but can't replace human judgment
- Iterate prompt styles, build reusable prompt templates
- Monitor acceptance rates of AI suggestions (in your own team) to improve prompts / review rules
Ready to streamline your AI code generation workflow? Try Graphite Agent for automated code review that catches bugs, improves code quality, and integrates seamlessly with your existing GitHub workflow.
Frequently asked questions
How long should my prompts be?
There's no one-size-fits-all answer, but aim for clarity over brevity. Include all necessary context, constraints, and examples. Most effective prompts range from 100-500 words, but complex tasks may require longer prompts. The key is ensuring the AI has enough information to produce quality output.
Can I trust AI-generated code without review?
Never. AI-generated code should always be reviewed by humans, augmented with automated tools. Even with perfect prompts, AI models can introduce bugs, security vulnerabilities, or logic errors. Use tools like Graphite Agent as a first-pass reviewer, but human judgment remains essential.
What's the best way to handle AI hallucinations?
Prevention is key: specify exact libraries, versions, and APIs in your prompts. Use constraints like "only use standard library" or "use version X of library Y." When hallucinations occur, refine your prompt with more specific constraints and examples.
Should I use AI for critical production code?
Use AI as a starting point, not a final solution. For critical systems, apply extra scrutiny: comprehensive testing, security audits, and multiple review cycles. Consider AI-generated code as a first draft that requires significant human refinement and validation.