Code Agent
Also known as: Software Engineering Agent, SWE Agent, Coding Agent
Wires the tool-use loop to a codebase, an editor, a shell, and a test runner.
Claude Code
- Declare in `CLAUDE.md`: the lead agent orchestrates and reviews; implementer subagents hold the edit tools.
- Define the test-gate sequence (`test:unit → lint → typecheck → test:e2e`) in `CLAUDE.md`; the loop terminates on green, not on the agent's assessment.
- Give each issue its own git worktree (`git worktree add`); implementer agents work in isolation and the lead checkout stays clean.
- Scope implementer subagents to file-editing tools only via `settings.json` `permissions.allow`; the lead holds Bash.
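The test-gate sequence above can be sketched as a small runner that executes the steps in order and stops at the first failure. This is a minimal illustration, not part of Claude Code: `runGate`, `Runner`, and `pnpmRunner` are hypothetical names, and the sketch assumes each gate step exists as a script in `package.json`.

```typescript
import { spawnSync } from 'node:child_process'

type StepResult = { step: string; code: number }
// A runner executes one gate step and returns its exit code.
type Runner = (step: string) => number

// Gate order taken from the sequence described above.
const GATE = ['test:unit', 'lint', 'typecheck', 'test:e2e'] as const

// Assumed default runner: shells out to `pnpm run <step>`.
// A null status means the process was killed by a signal; count it as a failure.
const pnpmRunner: Runner = (step) => {
  const r = spawnSync('pnpm', ['run', step], { stdio: 'inherit' })
  return r.status ?? 1
}

// Run the steps in order; the gate is green only if every step exits 0.
function runGate(
  steps: readonly string[],
  run: Runner = pnpmRunner,
): { green: boolean; results: StepResult[] } {
  const results: StepResult[] = []
  for (const step of steps) {
    const code = run(step)
    results.push({ step, code })
    if (code !== 0) return { green: false, results } // stop at the first red step
  }
  return { green: true, results }
}
```

Injecting the runner keeps the ordering logic testable without a real project; in an actual checkout, `runGate(GATE)` would use the default runner.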
Cursor
- Use Agent mode — it ships Read/Edit/Bash equivalents and the read-edit-test loop natively without additional configuration.
- Use Plan mode before implementation to produce an inspectable step list; review it before the agent touches files.
- Add a `.cursor/rules/*.mdc` rule defining the test-gate sequence (`pnpm test:unit`, etc.) as the success criterion the loop targets.
- Launch Cloud Agents for longer-running code tasks; they run in isolated VMs and surface a diff for review on completion.
Decision
| Use when ✓ | Avoid when ✗ |
|---|---|
| Use this when the success criterion is executable (a failing test that must pass, a build that must compile, a benchmark that must hit a target) so each iteration gets unambiguous feedback from the runtime. | When the desired change spans many subsystems and depends on judgement no test encodes, the loop converges on patches that pass tests without solving the problem. |
| Justified where the change is local enough to fit a sandboxed loop: bug fixes scoped by a stack trace, refactors with characterisation tests, ticket-shaped work with clear acceptance steps. | Without a hermetic sandbox and a step budget, the agent will mutate state, exhaust an API quota, or rewrite history; the same loop running on a developer's laptop is a footgun. |
| A good fit when the agent will run dozens of read-edit-test cycles per task and a custom Agent-Computer Interface (line-numbered viewer, structured edit verbs, syntax-checked patches) earns its keep over a raw shell. | When the task is exploratory ("understand this codebase") rather than transformative, a chat-mode read-only assistant is cheaper and less destructive than booting the edit-test loop. |
| Useful when the diff stays reviewable by a human at the end. The loop is a faster path to a candidate change, not a substitute for the merge gate. | |
In the wild
| Source | Claim |
|---|---|
| swe-agent.com | SWE-agent ships the Agent-Computer Interface its paper introduces (a custom file viewer, edit verb, and Python linter wrapped around a sandboxed shell) and reports the interface itself, not the model, accounts for most of the SWE-bench gain. |
| github.com | OpenHands runs the same read-edit-test loop inside a Docker sandbox that exposes a code editor, a bash terminal, and a Jupyter kernel; the open-source platform powers a community of coding agents that share the runtime contract. |
| docs.claude.com | Anthropic documents Claude Code as the loop the user holds open in a terminal: the CLI exposes file, shell, and search tools, parses the model's tool calls, executes them locally, and surfaces every proposed edit before it lands. |
Reader gotcha
When the model emits a search-and-replace block whose context lines drift from the file by even one whitespace character, the patch fails silently and the agent retries on a stale view of the code. Aider documents the failure mode (model-specific edit formats, the "diffs not applying" loop, and the fix of pinning a stricter format with a stronger model) as the most common stall in production.
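One way to keep this failure loud rather than silent is to make the edit tool refuse any search block that does not match the file exactly once, and return the reason to the model. The following is a minimal sketch under that assumption; `applySearchReplace` and `EditResult` are hypothetical names, not Aider's implementation or edit format.

```typescript
// Result type: either the edited text, or an error message the agent can read.
type EditResult = { ok: true; text: string } | { ok: false; error: string }

// Apply one search/replace edit, requiring an exact and unique match.
function applySearchReplace(source: string, search: string, replace: string): EditResult {
  if (search.length === 0) {
    return { ok: false, error: 'empty search block' }
  }
  const first = source.indexOf(search)
  if (first === -1) {
    // Whitespace drift lands here: report it instead of writing nothing.
    return { ok: false, error: 'search block not found; re-read the file before retrying' }
  }
  if (source.indexOf(search, first + 1) !== -1) {
    return { ok: false, error: 'search block matches more than once; add more context lines' }
  }
  const text = source.slice(0, first) + replace + source.slice(first + search.length)
  return { ok: true, text }
}
```

Returning the error as the tool observation gives the model a concrete reason to refresh its view of the file instead of retrying the same stale patch.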
Implementation sketch
```ts
import { generateText, tool } from 'ai'
import { openai } from '@ai-sdk/openai'
import { promisify } from 'node:util'
import { exec } from 'node:child_process'
import { readFile, writeFile } from 'node:fs/promises'
import { z } from 'zod'

const sh = promisify(exec)

const tools = {
  // Line-numbered view so the model can point at exact locations.
  readFile: tool({
    description: 'Read a file from the working tree, with line numbers.',
    parameters: z.object({ path: z.string() }),
    execute: async ({ path }) => {
      const text = await readFile(path, 'utf8')
      return text.split('\n').map((l, i) => `${i + 1}\t${l}`).join('\n')
    },
  }),
  // Full-file replace: the simplest edit verb, at the cost of resending the whole file.
  applyPatch: tool({
    description: 'Overwrite a file with new contents (full-file replace).',
    parameters: z.object({ path: z.string(), contents: z.string() }),
    execute: async ({ path, contents }) => {
      await writeFile(path, contents, 'utf8')
      return `wrote ${contents.length} characters to ${path}`
    },
  }),
  runTests: tool({
    description: 'Run the project test suite and return stdout/stderr.',
    parameters: z.object({}),
    execute: async () => {
      // exec rejects on a non-zero exit and on timeout; catch so the model sees the output.
      const r = await sh('pnpm test:unit', { timeout: 60_000 }).catch((e) => e)
      const failed = r instanceof Error
      // A timed-out run is killed and has a null exit code; report it as a failure, not 0.
      const code = !failed ? 0 : typeof r.code === 'number' ? r.code : 1
      return { code, stdout: r.stdout?.slice(-2000), stderr: r.stderr?.slice(-2000) }
    },
  }),
}

await generateText({
  model: openai('gpt-4o'),
  tools,
  maxSteps: 12, // bounded read-edit-test loop; runtime feeds tool results back
  prompt: 'Make tests/sum.test.ts pass without changing the test file.',
})

export {} // marks the file as a module so top-level await is allowed
```
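The tools in the sketch pass model-supplied paths straight to `readFile` and `writeFile`, so a path such as `../../.ssh/config` would be honoured. A minimal guard, assuming the agent should only touch files under one working-tree root, might look like the following; `resolveInside` is a hypothetical helper, and it does not resolve symlinks, so it is a complement to a real sandbox rather than a replacement for one.

```typescript
import { resolve, relative, isAbsolute, sep } from 'node:path'

// Resolve `requested` against `root` and reject anything that lands outside it.
// Note: symlinks are not followed here; a hermetic sandbox is still needed.
function resolveInside(root: string, requested: string): string {
  const absRoot = resolve(root)
  const target = resolve(absRoot, requested)
  const rel = relative(absRoot, target)
  // `rel` starts with ".." when the target is above the root; it is absolute
  // when the target sits on a different drive (Windows).
  if (rel === '..' || rel.startsWith('..' + sep) || isAbsolute(rel)) {
    throw new Error(`path escapes the working tree: ${requested}`)
  }
  return target
}
```

Each tool's `execute` could call `resolveInside(process.cwd(), path)` before touching the filesystem and return the thrown message to the model as the observation.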
References
- Cursor: Agent (docs)
Code Agent is the Tool Use / ReAct loop wired to a software-engineering toolkit. The agent is given a goal stated against a working tree (fix this failing test, implement this ticket, refactor this module) and a small load-bearing set of actions: list files, open and search a file, edit a file by patch, run a shell command, read the result. Each turn is a thought, one of those actions, and the observation produced by running it against a real or sandboxed checkout. The loop terminates when the agent emits a final answer or, more commonly, when the test suite the user pinned as the success criterion comes back green.
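The turn structure described above (choose an action, run it, record the observation, stop on a final answer or a green suite) can be sketched independently of any SDK. Everything here is an illustrative assumption: `runLoop`, `Policy`, and `Tools` are hypothetical names, the policy is synchronous for brevity where a real one would be an asynchronous model call, and the gate is checked after every action, whereas a real agent would check it only when it runs the tests.

```typescript
type Action =
  | { kind: 'tool'; name: string; input: unknown }
  | { kind: 'final'; answer: string }

type Observation = { action: Action; output: string }

// Stand-in for the model: picks the next action from the goal and history.
type Policy = (goal: string, history: Observation[]) => Action
type Tools = Record<string, (input: unknown) => string>

type Outcome = { status: 'green' | 'final' | 'budget-exhausted'; steps: number }

function runLoop(
  goal: string,
  policy: Policy,
  tools: Tools,
  testsGreen: () => boolean, // the pinned success criterion
  maxSteps: number,          // step budget so the loop always terminates
): Outcome {
  const history: Observation[] = []
  for (let step = 1; step <= maxSteps; step++) {
    const action = policy(goal, history)
    if (action.kind === 'final') return { status: 'final', steps: step }
    const run = tools[action.name]
    // Unknown tools become an observation rather than a crash.
    const output = run ? run(action.input) : `unknown tool: ${action.name}`
    history.push({ action, output })
    if (testsGreen()) return { status: 'green', steps: step }
  }
  return { status: 'budget-exhausted', steps: maxSteps }
}
```

The three outcomes map onto the termination conditions in the text: a final answer, a green test suite, or the budget running out before either.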
Background · context and trade-offs
What separates the pattern from generic Tool Use is that the toolkit is designed for the agent, not borrowed from a human. SWE-agent calls this layer the Agent-Computer Interface and reports that swapping the bare Linux shell for a line-numbered file viewer, structured edit verbs, and a syntax-checked patch tool roughly doubles SWE-bench scores at the same model. OpenHands generalises the move into a sandboxed runtime with an editor, a Jupyter kernel, and a browser hung off the same loop. Aider sits at the opposite end: a thin terminal that pairs a repository map with two narrow edit formats and forces the model to commit through git on every turn.
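To make the idea of an agent-oriented interface concrete, here is a minimal sketch of two such verbs: a windowed, line-numbered viewer and a line-range edit that validates its range. These are illustrative assumptions, not SWE-agent's actual commands; `viewWindow` and `replaceLines` are hypothetical names, and the syntax check that a fuller interface would run before accepting an edit is omitted.

```typescript
// Show `size` lines starting at 1-based `firstLine`, each prefixed with its number.
// A header tells the agent where the window sits in the file.
function viewWindow(text: string, firstLine: number, size: number): string {
  const lines = text.split('\n')
  const start = Math.min(Math.max(1, firstLine), lines.length)
  const end = Math.min(lines.length, start + Math.max(1, size) - 1)
  const body = lines.slice(start - 1, end).map((l, i) => `${start + i}\t${l}`)
  return [`[lines ${start}-${end} of ${lines.length}]`, ...body].join('\n')
}

// Replace the inclusive 1-based range [from, to] with `replacement`.
// An invalid range throws, so a bad edit never reaches the file.
function replaceLines(text: string, from: number, to: number, replacement: string): string {
  const lines = text.split('\n')
  const valid =
    Number.isInteger(from) && Number.isInteger(to) && from >= 1 && to >= from && to <= lines.length
  if (!valid) {
    throw new RangeError(`invalid line range ${from}-${to} for a ${lines.length}-line file`)
  }
  return [...lines.slice(0, from - 1), ...replacement.split('\n'), ...lines.slice(to)].join('\n')
}
```

Because the viewer prints the same line numbers the edit verb accepts, the model can address an edit by range instead of reproducing context lines character for character.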
The pattern earns its keep when the success criterion is executable: the patch compiles or it does not, the test passes or it does not, the refactor preserves behaviour or it does not. On those tasks the agent gets dense feedback from the runtime that no LLM-judged evaluation matches. Where it struggles is the inverse: work judged on style, intent, or downstream user impact. Production deployments add a human diff review step (Cursor, Claude Code, Devin all surface proposed changes before they land) because the test suite is necessary but rarely sufficient.