Layer 1: Topology / Control Flow

Code Agent

Also known as: Software Engineering Agent, SWE Agent, Coding Agent

Wires the tool-use loop to a codebase, an editor, a shell, and a test runner.

The Tool Use loop wired to a software-engineering toolkit: read code, plan an edit, apply a patch, run tests — the loop continues on failure and terminates when the test suite comes back green, then proposes a final diff for human review.

Claude Code

  • Declare in `CLAUDE.md`: the lead agent orchestrates and reviews; implementer subagents hold the edit tools.
  • Define the test-gate sequence (test:unit → lint → typecheck → test:e2e) in CLAUDE.md — the loop terminates on green, not on the agent's assessment.
  • Give each issue its own git worktree (git worktree add); implementer agents work in isolation and the lead checkout stays clean.
  • Scope implementer subagents to file-editing tools only via `settings.json` permissions.allow; the lead holds Bash.

Primitives

  • Edit / Bash / Read tool loop (built-in ACI)
  • Task tool (implementer subagents)
  • Git worktrees (isolation)
  • CLAUDE.md test-gate definition

Cursor

  • Use Agent mode — it ships Read/Edit/Bash equivalents and the read-edit-test loop natively without additional configuration.
  • Use Plan mode before implementation to produce an inspectable step list; review it before the agent touches files.
  • Add a .cursor/rules/*.mdc rule defining the test-gate sequence (pnpm test:unit, etc.) as the success criterion the loop targets.
  • Launch Cloud Agents for longer-running code tasks; they run in isolated VMs and surface a diff for review on completion.

Primitives

  • Agent mode (built-in read-edit-run loop)
  • Plan mode (pre-implementation review)
  • Cloud Agents (isolated VM execution)
  • .cursor/rules/*.mdc (test-gate definition)
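The test-gate sequence named in both tool sections can be sketched as an ordered runner that stops at the first red gate. This is a minimal illustration, not how Claude Code or Cursor implement it: `runGates`, `Runner`, and `GATES` are hypothetical names, the `pnpm lint`, `pnpm typecheck`, and `pnpm test:e2e` script names are assumed by analogy with `pnpm test:unit`, and the command runner is injected so the sketch does not depend on a real shell.

```typescript
// Hypothetical types: one entry per gate that actually ran.
type GateResult = { command: string; ok: boolean; output: string }
// Injected runner; a real one might wrap spawnSync from node:child_process.
type Runner = (command: string) => { code: number; output: string }

// Run the gates in order and stop at the first failure.
function runGates(commands: string[], run: Runner): { green: boolean; results: GateResult[] } {
  const results: GateResult[] = []
  for (const command of commands) {
    const { code, output } = run(command)
    const ok = code === 0
    // Keep only the tail of the output so a noisy failure does not flood the context.
    results.push({ command, ok, output: output.slice(-2000) })
    if (!ok) return { green: false, results } // later gates are skipped
  }
  return { green: true, results }
}

// Assumed script names following the test:unit -> lint -> typecheck -> test:e2e order.
const GATES = ['pnpm test:unit', 'pnpm lint', 'pnpm typecheck', 'pnpm test:e2e']
```

The loop's termination condition then becomes `runGates(GATES, run).green` rather than the agent's own judgement that the work is done.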

Decision

Use when ✓

  • Use this when the success criterion is executable (a failing test that must pass, a build that must compile, a benchmark that must hit a target) so each iteration gets unambiguous feedback from the runtime.
  • Justified where the change is local enough to fit a sandboxed loop: bug fixes scoped by a stack trace, refactors with characterisation tests, ticket-shaped work with clear acceptance steps.
  • A good fit when the agent will run dozens of read-edit-test cycles per task and a custom Agent-Computer Interface (line-numbered viewer, structured edit verbs, syntax-checked patches) earns its keep over a raw shell.
  • Useful when the diff stays reviewable by a human at the end. The loop is a faster path to a candidate change, not a substitute for the merge gate.

Avoid when ✗

  • When the desired change spans many subsystems and depends on judgement no test encodes, the loop converges on patches that pass tests without solving the problem.
  • Without a hermetic sandbox and a step budget, the agent will mutate state, exhaust an API quota, or rewrite history; the same loop running on a developer's laptop is a footgun.
  • When the task is exploratory ("understand this codebase") rather than transformative, a chat-mode read-only assistant is cheaper and less destructive than booting the edit-test loop.

In the wild

  • swe-agent.com: SWE-agent ships the Agent-Computer Interface its paper introduces (a custom file viewer, edit verb, and Python linter wrapped around a sandboxed shell) and reports the interface itself, not the model, accounts for most of the SWE-bench Verified gain.
  • github.com: OpenHands runs the same read-edit-test loop inside a Docker sandbox that exposes a code editor, a bash terminal, and a Jupyter kernel; the open-source platform powers a community of coding agents that share the runtime contract.
  • docs.claude.com: Anthropic documents Claude Code as the loop the user holds open in a terminal: the CLI exposes file, shell, and search tools, parses the model's tool calls, executes them locally, and surfaces every proposed edit before it lands.

Reader gotcha

When the model emits a search-and-replace block whose context lines drift from the file by even one whitespace character, the patch fails silently and the agent retries on a stale view of the code. Aider documents the failure mode (model-specific edit formats, the "diffs not applying" loop, and the fix of pinning a stricter format with a stronger model) as the most common stall in production.
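One way to keep this failure from being silent is to make the edit tool reject any search block that does not match exactly once, and to say why. The sketch below is an illustration under that assumption, not Aider's implementation; `applySearchReplace` and `EditResult` are hypothetical names.

```typescript
type EditResult = { ok: true; text: string } | { ok: false; reason: string }

// Apply one search-and-replace block, failing loudly instead of silently.
function applySearchReplace(source: string, search: string, replace: string): EditResult {
  if (search.length === 0) return { ok: false, reason: 'empty search block' }
  const first = source.indexOf(search)
  if (first === -1) {
    // Tell whitespace drift apart from a genuinely missing block,
    // so the agent knows to re-read the file rather than retry blindly.
    const squash = (s: string) => s.replace(/\s+/g, ' ').trim()
    const drift = squash(source).includes(squash(search))
    return {
      ok: false,
      reason: drift
        ? 'search block matches only after whitespace normalisation; re-read the file'
        : 'search block not found',
    }
  }
  if (source.indexOf(search, first + 1) !== -1) {
    return { ok: false, reason: 'search block is ambiguous (matches more than once)' }
  }
  // Splice by index so "$" sequences in the replacement are never interpreted.
  return { ok: true, text: source.slice(0, first) + replace + source.slice(first + search.length) }
}
```

Returning the reason as the tool observation gives the model something to act on: a drift message is a cue to call the read tool again before emitting another edit.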

Implementation sketch

import { generateText, tool } from 'ai'
import { openai } from '@ai-sdk/openai'
import { promisify } from 'node:util'
import { exec } from 'node:child_process'
import { readFile, writeFile } from 'node:fs/promises'
import { z } from 'zod'

const sh = promisify(exec)

const tools = {
  readFile: tool({
    description: 'Read a file from the working tree, with line numbers.',
    parameters: z.object({ path: z.string() }),
    execute: async ({ path }) => {
      const text = await readFile(path, 'utf8')
      return text.split('\n').map((l, i) => `${i + 1}\t${l}`).join('\n')
    },
  }),
  applyPatch: tool({
    description: 'Overwrite a file with new contents (full-file replace).',
    parameters: z.object({ path: z.string(), contents: z.string() }),
    execute: async ({ path, contents }) => {
      await writeFile(path, contents, 'utf8')
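      // String length counts UTF-16 code units, so measure the encoded size to report bytes.
      return `wrote ${Buffer.byteLength(contents, 'utf8')} bytes to ${path}`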
    },
  }),
  runTests: tool({
    description: 'Run the project test suite and return stdout/stderr.',
    parameters: z.object({}),
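    execute: async () => {
      // exec rejects on a non-zero exit or a timeout; keep the error so the model sees the failure output.
      const r = await sh('pnpm test:unit', { timeout: 60_000 }).catch((e) => e)
      // A timed-out run is killed and carries code null, so never let an error default to 0 (green).
      const code = r instanceof Error ? (typeof r.code === 'number' ? r.code : 1) : 0
      return { code, stdout: r.stdout?.slice(-2000), stderr: r.stderr?.slice(-2000) }
    },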
  }),
}

await generateText({
  model: openai('gpt-4o'),
  tools,
  maxSteps: 12, // bounded read-edit-test loop; runtime feeds tool results back
  prompt: 'Make tests/sum.test.ts pass without changing the test file.',
})

export {}

References

  1. Yang et al.·2024·NeurIPS 2024 · DOI: 10.48550/arXiv.2405.15793

    foundational paper; introduces the Agent-Computer Interface as the lever that doubles SWE-bench scores

  2. Wang et al.·2024·ICLR 2025 · DOI: 10.48550/arXiv.2407.16741

    sandboxed runtime that hangs editor, shell, and browser off the same agent loop

  3. Anthropic·2025·accessed

    production CLI that exposes file, shell, and search tools and surfaces every diff for review

  4. Cursor team·2025·accessed

    IDE-embedded variant; agent edits across files and runs commands inside the editor

  5. Paul Gauthier·2025·accessed

    open-source code agent; canonical source for repository-map context-building and structured edit formats

  6. Cognition·2024

    production code agent (Devin) on the SWE-bench harness — the executable-success-criterion case study

  7. Antonio Gulli·2026·Springer·pp. 120134

Code Agent is the Tool Use / ReAct loop wired to a software-engineering toolkit. The agent is given a goal stated against a working tree (fix this failing test, implement this ticket, refactor this module) and a small load-bearing set of actions: list files, open and search a file, edit a file by patch, run a shell command, read the result. Each turn is a thought, one of those actions, and the observation produced by running it against a real or sandboxed checkout. The loop terminates when the agent emits a final answer or, more commonly, when the test suite the user pinned as the success criterion comes back green.
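The control flow described above can be sketched without any model attached. In this illustration the names `runCodeAgent`, `Action`, `Observation`, and `Outcome` are hypothetical, `policy` stands in for the model call, and `execute` stands in for the toolkit; both are kept synchronous only so the loop is easy to follow, whereas a real model call would be asynchronous.

```typescript
type Action =
  | { kind: 'read'; path: string }
  | { kind: 'edit'; path: string; contents: string }
  | { kind: 'test' }
  | { kind: 'final'; answer: string }

type Observation = { action: Action; output: string; green?: boolean }
type Outcome = { status: 'green' | 'final' | 'budget_exhausted'; steps: number; history: Observation[] }

function runCodeAgent(
  policy: (history: Observation[]) => Action, // stand-in for the model choosing the next action
  execute: (action: Action) => { output: string; green?: boolean }, // stand-in for the toolkit
  maxSteps: number, // step budget so the loop cannot run unbounded
): Outcome {
  const history: Observation[] = []
  for (let step = 1; step <= maxSteps; step++) {
    const action = policy(history)
    // The agent may stop on its own by emitting a final answer.
    if (action.kind === 'final') return { status: 'final', steps: step, history }
    const result = execute(action)
    history.push({ action, ...result })
    // The more common exit: the pinned test suite comes back green.
    if (action.kind === 'test' && result.green) return { status: 'green', steps: step, history }
  }
  return { status: 'budget_exhausted', steps: maxSteps, history }
}
```

Keeping the three exit statuses distinct lets a caller treat `budget_exhausted` as a failure to escalate instead of mistaking it for a finished task.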

Background · context and trade-offs

What separates the pattern from generic Tool Use is that the toolkit is designed for the agent, not borrowed from a human. SWE-agent calls this layer the Agent-Computer Interface and reports that swapping the bare Linux shell for a line-numbered file viewer, structured edit verbs, and a syntax-checked patch tool substantially raises SWE-bench resolve rates with the same model. OpenHands generalises the move into a sandboxed runtime with an editor, a Jupyter kernel, and a browser hung off the same loop. Aider sits at the opposite end: a thin terminal that pairs a repository map with two narrow edit formats and forces the model to commit through git on every turn.
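To make the interface idea concrete, here is a minimal sketch of two such tools: a windowed, line-numbered viewer and a line-range edit verb that refuses edits failing a syntax check. It is not SWE-agent's code; `viewWindow`, `editLines`, `Check`, and `jsonCheck` are hypothetical names, and `JSON.parse` stands in for whatever linter or parser a real interface would call.

```typescript
// Show a window of the file with 1-based line numbers and a position header.
function viewWindow(text: string, startLine: number, size: number): string {
  const lines = text.split('\n')
  const from = Math.min(Math.max(1, startLine), lines.length)
  const to = Math.min(lines.length, from + size - 1)
  const body = lines.slice(from - 1, to).map((l, i) => `${from + i}\t${l}`)
  return [`[lines ${from}-${to} of ${lines.length}]`, ...body].join('\n')
}

// A check returns an error message, or null when the text is valid.
type Check = (text: string) => string | null

// Structured edit verb: replace lines start..end (inclusive), rejecting invalid results.
function editLines(
  text: string,
  start: number,
  end: number,
  replacement: string,
  check: Check,
): { ok: true; text: string } | { ok: false; error: string } {
  const lines = text.split('\n')
  if (start < 1 || end < start || end > lines.length) {
    return { ok: false, error: `invalid range ${start}-${end}` }
  }
  const next = [...lines.slice(0, start - 1), ...replacement.split('\n'), ...lines.slice(end)].join('\n')
  const error = check(next)
  // On failure the caller keeps the original text, so a broken edit never lands.
  return error === null ? { ok: true, text: next } : { ok: false, error }
}

// Stand-in syntax check for JSON files (an assumption for this sketch).
const jsonCheck: Check = (t) => {
  try {
    JSON.parse(t)
    return null
  } catch (e) {
    return e instanceof Error ? e.message : String(e)
  }
}
```

Line numbers in the viewer and line ranges in the edit verb share one addressing scheme, which is what lets the model refer to what it just read without reproducing surrounding context byte for byte.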

The pattern earns its keep when the success criterion is executable: the patch compiles or it does not, the test passes or it does not, the refactor preserves behaviour or it does not. On those tasks the agent gets dense feedback from the runtime that no LLM-judged evaluation matches. Where it struggles is the inverse: work judged on style, intent, or downstream user impact. Production deployments add a human diff review step (Cursor, Claude Code, Devin all surface proposed changes before they land) because the test suite is necessary but rarely sufficient.