Table of contents
- Why prompt quality matters
- Core principles of prompt design
- Prompt patterns and templates
- Pitfalls, gotchas, and mitigation
- Examples of good vs poor prompts
- Reviewing AI-generated code: risks, strategies, and tooling
- Sample workflow integrating prompt + review
- Summary and best practices
- Frequently asked questions
AI models (LLMs, code generation systems) are increasingly capable of producing nontrivial code. But the quality of the output heavily depends on how you prompt them. A well-crafted prompt can make the difference between usable, maintainable code and output full of bugs, inefficiencies, or security flaws. This guide walks you through principles, prompt templates, examples, and how to combine prompt engineering with robust review processes.
Importantly: AI-generated code must always be reviewed (by humans, augmented with tools). Relying blindly on generated code is risky. Later in this guide I'll explain how tools like Graphite Agent can help reduce the burden of reviewing AI output.
Why prompt quality matters
- Output variance: Small changes in phrasing may yield drastically different code (correct vs broken).
- Guiding the model: You need to drive it toward the correct domain, style, constraints.
- Reducing hallucination or irrelevant output: Good prompts reduce imagined APIs or errors.
- Saving your time: Better prompt → less iteration and debugging.
- Enabling reviewability: You want code that is understandable, testable, and auditable.
Core principles of prompt design
Clarity and specificity
- Avoid vagueness. Instead of "write a parser," say "write a Python JSON parser that handles nested objects and arrays."
- Define inputs, outputs, edge cases.
- Avoid ambiguous pronouns (“it,” “this”) without clear referent.
Context and scope
- Supply required context: e.g. existing module APIs, environment constraints, dependencies.
- Scope the problem: Don’t ask for a full system if you actually want one function or component.
Constraints and requirements
- Performance, memory, time complexity constraints.
- Style guidelines (naming conventions, idiomatic style).
- Error handling, edge conditions.
- Language version, library versions.
Asking for explanation or commentary
- Request inline comments, docstrings, design rationale.
- Ask for test cases or verification code.
- Optionally ask for a brief summary of algorithm or time complexity.
Prompt patterns and templates
Scaffold + fill in
Give a skeleton or partial code and ask to fill in missing parts.
Example template:
```python
# File: data_processing.py

def transform(input_data):
    # Your code here to transform a dict into required output
    pass

def validate_schema(obj):
    # Your code here
    pass
```
Prompt: “Given the skeleton above, fill out `transform` and `validate_schema` so that `transform` converts nested input structures to a flat dict, and `validate_schema` ensures required keys `['a', 'b', 'c']` exist, raising `ValueError` otherwise. Write unit tests too.”
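For reference, here is a minimal sketch of what a good response to this prompt might look like. The dot-joined key convention for the flattened dict is an assumption the prompt leaves open, and the requested unit tests are omitted for brevity.

```python
# Sketch of a plausible fill-in; the flattening convention (dot-joined keys) is assumed.

def transform(input_data):
    """Flatten nested dicts into a single-level dict with dot-joined keys."""
    flat = {}

    def _walk(obj, prefix):
        if isinstance(obj, dict):
            for key, value in obj.items():
                _walk(value, f"{prefix}{key}.")
        else:
            flat[prefix.rstrip(".")] = obj

    _walk(input_data, "")
    return flat


def validate_schema(obj):
    """Raise ValueError unless the required keys 'a', 'b', 'c' are all present."""
    missing = [key for key in ("a", "b", "c") if key not in obj]
    if missing:
        raise ValueError(f"Missing required keys: {missing}")
```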
Step-by-step reasoning / chain of thought
Encourage the model to break down the problem in steps:
Prompt:
“First, outline a plan in numbered steps for how to sort a list of dictionary objects by multiple keys (primary, secondary). Then implement the code in Python. Finally provide tests and a complexity analysis.”
Few-shot examples
Provide 1–3 example inputs and outputs before asking the model to generalize.
Prompt:
```
# Examples
Input: [{"name":"Alice","score":10}, {"name":"Bob","score":8}]
Output: ["Alice", "Bob"]  # Sorted by descending score

Input: [{"name":"X","score":5}, {"name":"Y","score":5}, {"name":"Z","score":7}]
Output: ["Z", "X", "Y"]

Now write a function sort_names_by_score(records) that, given a list of dicts
with keys "name" and "score", returns the names sorted by descending score.
Include error handling for missing keys.
```
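One solution the model could plausibly return for this prompt (raising `KeyError` for missing keys is an assumption, since the prompt only says “error handling”):

```python
def sort_names_by_score(records):
    """Return names ordered by descending score; ties keep their input order."""
    for record in records:
        if "name" not in record or "score" not in record:
            raise KeyError(f"record missing 'name' or 'score': {record!r}")
    # sorted() is stable, so equal scores preserve input order (matching example 2).
    return [r["name"] for r in sorted(records, key=lambda r: r["score"], reverse=True)]
```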
Code + tests + instructions
Frame your request so it asks for the code, its tests, and usage instructions together.
Prompt:
"Write a Go function
ReverseUnicode(s string) string
that correctly reverses a string by Unicode grapheme clusters (not raw codepoints). Include unit tests (usingtesting
package). Include comments explaining each step."
Pitfalls, gotchas, and mitigation
| Pitfall | Symptom | Mitigation |
|---|---|---|
| Missing edge cases | e.g. None inputs, empty containers | Explicitly mention edge cases in prompt |
| Mismatched types or interfaces | Model picks wrong type signatures | Provide stubs / type hints |
| Overlong outputs | Model writes too much scaffolding or unrelated code | Set “only generate X function(s)” or “no code outside this file” |
| Hallucinated APIs | Using nonexistent library calls | Restrict libraries (e.g. “only use standard library”) |
| Forgetting error handling | Output code lacks exceptions or validations | Ask explicitly for “error checking / validation” |
| Inconsistent style | e.g. mixing snake_case and camelCase | Include style instructions in prompt |
Also note that LLMs have token/context limits: Very large prompts or large contexts may cause truncation or loss of coherence. Break tasks into smaller units when needed.
Examples of good vs poor prompts
Poor prompt
“Write a sorting function in Python for my custom data structures”
Problems: Vague, no spec, no examples, no constraints.
Better prompt
“In Python 3.10+, write a function `sort_by_fields(objs: list[dict], keys: list[str]) -> list[dict]` that sorts a list of dictionaries by multiple fields in order. If a key is missing in a dict, treat it as `None` and push it to the end. Include docstring, error handling, and three example test cases.”
This is precise, gives signature, behavior, constraints, and asks for tests.
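A response that meets that spec might look like the sketch below. Using a per-key tuple to push missing values to the end is one reasonable approach rather than the only one, and the requested docstring and error handling are included while the three test cases are omitted here.

```python
def sort_by_fields(objs: list[dict], keys: list[str]) -> list[dict]:
    """Sort dicts by the given keys in order; dicts missing a key sort last for that key."""
    if not keys:
        raise ValueError("keys must be a non-empty list of field names")

    def sort_key(obj: dict):
        # (0, value) when the key is present, (1, None) when missing, so missing sorts last.
        return tuple((0, obj[k]) if k in obj else (1, None) for k in keys)

    return sorted(objs, key=sort_key)
```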
More advanced prompt
“I have a module `user_service` with:

```python
class User:
    id: int
    name: str
    is_active: bool

def fetch_users_from_db() -> list[User]:
    pass

def filter_active(users: list[User]) -> list[User]:
    pass
```

Fill in `filter_active` to filter users whose `is_active == True`. Also write a function `group_users_by_initial(users: list[User]) -> dict[str, list[User]]` grouping by first letter of name (uppercase). Write pytest test cases covering empty list, all inactive, mixed. Comment your design decisions.”
This gives context and multiple tasks.
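Given that context, the generated module might come back looking something like this. Making `User` a dataclass and skipping empty names are choices the prompt leaves open, and the requested pytest cases are omitted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class User:
    id: int
    name: str
    is_active: bool


def filter_active(users: list[User]) -> list[User]:
    """Keep only users whose is_active flag is True."""
    return [user for user in users if user.is_active]


def group_users_by_initial(users: list[User]) -> dict[str, list[User]]:
    """Group users by the uppercased first letter of their name."""
    groups: dict[str, list[User]] = defaultdict(list)
    for user in users:
        if user.name:  # skip empty names rather than index into ""
            groups[user.name[0].upper()].append(user)
    return dict(groups)
```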
Reviewing AI-generated code: risks, strategies, and tooling
Even with a perfect prompt, AI output is not trustworthy without review. Here’s how to approach reviewing code:
Human in the loop
- Every pull request or generated snippet should be eyeballed by a developer.
- Check semantics, edge cases, performance, style, security.
- Compare with expectations and test coverage.
Automated checks, testing, static analysis
- Run unit tests, integration tests.
- Use linters (Flake8, ESLint, golangci-lint).
- Use static analyzers (e.g. MyPy, type checkers).
- Use fuzzing or property-based testing if applicable.
These checks help catch issues that both the model and a human reviewer might miss.
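To make the last point concrete, here is a property-based test sketch using the hypothesis library. It targets the `transform` function from the scaffold example earlier; the `data_processing` module path is an assumption.

```python
# test_transform_properties.py -- property-based sketch using hypothesis.
from hypothesis import given, strategies as st

from data_processing import transform  # assumed location of the generated code

# Nested dicts with short alphabetic keys and integer leaves.
nested = st.recursive(
    st.integers(),
    lambda children: st.dictionaries(st.text(alphabet="abc", min_size=1), children, max_size=3),
    max_leaves=10,
)


def leaf_values(obj):
    """Collect every non-dict value from an arbitrarily nested dict."""
    if isinstance(obj, dict):
        return [v for child in obj.values() for v in leaf_values(child)]
    return [obj]


@given(st.dictionaries(st.text(alphabet="abc", min_size=1), nested, max_size=3))
def test_flattening_is_flat_and_lossless(data):
    flat = transform(data)
    assert not any(isinstance(v, dict) for v in flat.values())   # output is truly flat
    assert sorted(flat.values()) == sorted(leaf_values(data))    # no leaf lost or invented
```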
AI code review assistants (e.g. Graphite Agent by Graphite)
Tools like Graphite Agent, part of the Graphite code review platform, provide automated reviews of pull requests using AI. Graphite Agent is context-aware (analyzes the entire codebase) and flags logic bugs, performance issues, style inconsistencies, security vulnerabilities, and more.
Key features:
- Instant reviews on PRs with suggestions
- High signal (few false positives) to reduce noise
- Custom rules and configurations so it conforms to your team’s style
- Integration with GitHub for in-line comments and one-click fixes
Despite these strengths, AI should augment, not replace, human code review. Human reviewers provide domain knowledge, business context, and judgment that AI lacks.
By using Graphite Agent as a “first pass” reviewer, you can offload much of the low-hanging bug detection, freeing human reviewers to focus on architecture, domain logic, correctness, and subtle tradeoffs.
Sample workflow integrating prompt + review
Here’s a suggested workflow:
- Define the requirement / spec for the component you want to generate
- Craft a prompt using patterns (scaffold, examples, constraints)
- Invoke the model (possibly iteratively)
- Capture the generated code (store in a draft PR or branch)
- Run automated tests, linters, static analysis
- Submit as pull request
- Allow Graphite Agent (Graphite AI review) to comment / flag issues
- Human reviewer inspects remaining issues, considering context, domain, performance
- Iterate with fixes, rerun automated and AI reviews
- Merge once confidence is high
Over time, refine your prompt styles and rules (e.g. set up internal prompt templates, or prompt boilerplate in your toolchain).
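One lightweight way to do that is to keep the boilerplate as code in your repo. The helper below is a hypothetical sketch of such a template (not a Graphite feature); the function name and prompt wording are illustrative.

```python
# prompt_templates.py -- hypothetical internal helper for reusable prompt boilerplate.

def build_function_prompt(signature: str, behavior: str, constraints: list[str]) -> str:
    """Assemble a code-generation prompt from a signature, a behavior spec, and constraints."""
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    return (
        "Write a Python 3.10+ function with this exact signature:\n"
        f"    {signature}\n\n"
        f"Behavior: {behavior}\n\n"
        f"Constraints:\n{constraint_lines}\n\n"
        "Include a docstring, error handling, and three pytest test cases."
    )


# Example usage:
prompt = build_function_prompt(
    signature="sort_by_fields(objs: list[dict], keys: list[str]) -> list[dict]",
    behavior="Sort a list of dicts by multiple fields; missing keys sort last.",
    constraints=["Standard library only", "Follow PEP 8 naming"],
)
```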
Summary and best practices
- Better prompts = more reliable, maintainable output
- Always include context, constraints, examples, tests
- Break large tasks into modular prompts
- Always review AI output — human + automated + AI review
- Tools like Graphite Agent can reduce manual review burden, but can't replace human judgment
- Iterate prompt styles, build reusable prompt templates
- Monitor acceptance rates of AI suggestions (in your own team) to improve prompts / review rules
Ready to streamline your AI code generation workflow? Try Graphite Agent for automated code review that catches bugs, improves code quality, and integrates seamlessly with your existing GitHub workflow.
Frequently asked questions
How long should my prompts be?
There's no one-size-fits-all answer, but aim for clarity over brevity. Include all necessary context, constraints, and examples. Most effective prompts range from 100-500 words, but complex tasks may require longer prompts. The key is ensuring the AI has enough information to produce quality output.
Can I trust AI-generated code without review?
Never. AI-generated code should always be reviewed by humans, augmented with automated tools. Even with perfect prompts, AI models can introduce bugs, security vulnerabilities, or logic errors. Use tools like Graphite Agent as a first-pass reviewer, but human judgment remains essential.
What's the best way to handle AI hallucinations?
Prevention is key: specify exact libraries, versions, and APIs in your prompts. Use constraints like "only use standard library" or "use version X of library Y." When hallucinations occur, refine your prompt with more specific constraints and examples.
Should I use AI for critical production code?
Use AI as a starting point, not a final solution. For critical systems, apply extra scrutiny: comprehensive testing, security audits, and multiple review cycles. Consider AI-generated code as a first draft that requires significant human refinement and validation.