Checkpointing
Also known as: Durable Execution, Checkpoint and Rollback, Workflow Persistence
Persists run state at step boundaries so a fresh worker resumes after a crash.
Claude Code
- Write each phase's outputs to deterministic disk paths before the next subagent wave dispatches — the filesystem is the checkpoint store.
- Maintain a `STATUS.md` with one row per phase (status, artifact count, timestamp); a fresh session reads it to find the resume point without replaying transcripts.
- Gate each phase transition on a shell assertion that all expected artifact files exist and are non-empty; fail loudly rather than forward silently.
- Forward a compressed context packet (1–2 K tokens) between phases, not raw transcripts; this keeps successor contexts clean and bounded.
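As a rough illustration of how a fresh session might find its resume point from the status file, the sketch below parses a Markdown table and returns the first phase that is not verifiably complete. The column order (phase, status, artifact count, timestamp), the `done` status value, the sample rows, and the function names are all assumptions made for this example; the pattern does not prescribe a layout.

```typescript
// Hypothetical STATUS.md layout: one Markdown table row per phase, with the
// columns phase | status | artifacts | timestamp. Rows must end with a pipe.
type PhaseRow = { phase: string; status: string; artifacts: number; timestamp: string };

function parseStatus(markdown: string): PhaseRow[] {
  const rows: PhaseRow[] = [];
  for (const line of markdown.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("|")) continue;
    // Drop the empty strings produced by the leading and trailing pipes.
    const cells = trimmed.split("|").slice(1, -1).map((c) => c.trim());
    if (cells.length < 4) continue;
    const [phase = "", status = "", artifacts = "", timestamp = ""] = cells;
    // Skip the header row and the |---|---| separator row.
    if (phase.toLowerCase() === "phase" || /^:?-+:?$/.test(phase)) continue;
    rows.push({ phase, status, artifacts: Number(artifacts), timestamp });
  }
  return rows;
}

// A phase counts as complete only if it is marked done AND produced artifacts,
// mirroring the "fail loudly rather than forward silently" gate.
function findResumePhase(rows: PhaseRow[]): string | null {
  for (const row of rows) {
    const complete = row.status === "done" && row.artifacts > 0;
    if (!complete) return row.phase;
  }
  return null; // every phase finished; nothing to resume
}

// Sample data, invented for the example.
const statusMd = `
| phase | status | artifacts | timestamp |
|---|---|---|---|
| research | done | 4 | 2025-01-01T10:00:00Z |
| draft | done | 0 | 2025-01-01T11:00:00Z |
| review | pending | 0 | |
`;

console.log(findResumePhase(parseStatus(statusMd))); // "draft"
```

The `draft` phase is reported even though it is marked done, because a done phase with zero artifacts is treated as a failed gate rather than a safe resume boundary.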
Cursor
- Commit partial work to a dedicated branch after each milestone; cloud agents already work on isolated branches by default.
- Use a `STATUS.md` at the repo root as the recovery anchor; reference it via `@file` at the start of a resumed session.
- Write task progress to a `.cursor/rules/*.mdc` file with `alwaysApply: true` so context survives Cursor session restarts.
- Reference committed artifacts via `@git` or file references when resuming so the agent reads the actual persisted state, not its in-context assumption.
Decision
| Use when ✓ | Avoid when ✗ |
|---|---|
| Apply when an agent run is long enough that a process restart between steps is likely (multi-minute tool calls, multi-hour batch jobs, sessions that span days). | When the run is single-shot, completes inside one short request, and the cost of restarting from the prompt is lower than the cost of a checkpoint write on every step. |
| Required where steps have side effects you cannot afford to fire twice (charging a card, sending an email, opening a pull request) and need an idempotency anchor that survives crashes. | Without idempotent steps or a journal of completed effects, replay-style durable execution will fire the same side effect a second time on recovery and produce a worse outcome than no checkpointing at all. |
| A good fit when the deployment target is preemptible (spot instances, serverless functions with execution caps, Kubernetes pods that the autoscaler may evict) and the agent must continue on a new worker without losing context. | When the checkpoint backend (Postgres, SQLite, or the journal store) is less reliable than the worker it is supposed to recover, the pattern moves the failure mode rather than removing it. |
| Useful when an operator needs to inspect or rewind a run between steps for debugging, audit, or human review of the trajectory before it continues. | |
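One way to picture the idempotency anchor from the table: derive a key from values that are stable across retries (here the run identifier and step index) and hand it to the downstream service, so a step repeated after a crash is recognised as a duplicate. The key scheme, the `FakePaymentProvider` class, and the receipt format below are hypothetical stand-ins for illustration, not any real payment API.

```typescript
// Hypothetical key scheme: depends only on the run identifier, the step index,
// and the effect name, never on wall-clock time or randomness, so a retry of
// the same step always produces the same key.
function idempotencyKey(runId: string, stepIndex: number, effect: string): string {
  return `${runId}:${stepIndex}:${effect}`;
}

// Stand-in for a downstream service that deduplicates on the key it is given.
class FakePaymentProvider {
  private receipts = new Map<string, string>();
  chargeCount = 0;

  charge(key: string, amountCents: number): string {
    const existing = this.receipts.get(key);
    if (existing !== undefined) return existing; // duplicate request: no second charge
    this.chargeCount += 1;
    const receipt = `receipt-${this.chargeCount}-${amountCents}`;
    this.receipts.set(key, receipt);
    return receipt;
  }
}

const provider = new FakePaymentProvider();
const paymentKey = idempotencyKey("run-42", 3, "charge-card");

const beforeCrash = provider.charge(paymentKey, 1999);
// Suppose the worker dies here, before the checkpoint for step 3 is written.
// The replacement worker repeats step 3 with the same key.
const afterRecovery = provider.charge(paymentKey, 1999);

console.log(beforeCrash === afterRecovery, provider.chargeCount); // true 1
```

Putting the deduplication on the receiving side covers the window between the effect firing and the checkpoint write, which a journal kept only by the orchestrator cannot close on its own.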
In the wild
| Source | Claim |
|---|---|
| docs.langchain.com | LangGraph ships a BaseCheckpointSaver interface with InMemorySaver, SqliteSaver, and PostgresSaver implementations; compiling a graph with a checkpointer writes a state snapshot at every super-step, keyed by thread_id, and a fresh process resumes by loading the latest checkpoint for that thread. |
| temporal.io | Temporal records a complete event history for every workflow execution and recovers from worker crashes by replaying that history on a new worker, short-circuiting Activities (LLM calls, tool invocations) whose results are already journaled. The same primitive backs Vercel's AI SDK durability plugin and makes generateText() crash-safe. |
| code.claude.com | The Anthropic Claude Agent SDK persists every session (prompt, tool calls, tool results, responses) to disk under ~/.claude/projects/ as JSONL and exposes resume, continue, and fork options on query() so a process restart on the same machine picks up the conversation with full context intact. |
Reader gotcha
Replay-based durable execution requires workflow code to be deterministic: the same input history must yield the same decisions on every replay. A stray `Date.now()`, a `Math.random()`, or an unguarded network call inside the workflow function can make the replay diverge from the journaled history, and the runtime then throws a non-determinism error mid-replay. Temporal documents this as the cost of admission: side-effecting work must live inside Activities, not in the workflow body.
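The failure mode can be reproduced with a toy replay harness. This is a minimal sketch under stated assumptions, not Temporal's API: `ReplayContext`, `NonDeterminismError`, and the workflow functions are hypothetical names, and an injected `clock` function stands in for `Date.now()` so the divergence is reproducible.

```typescript
type JournalEntry = { name: string; result: unknown };

class NonDeterminismError extends Error {}

// Replays journaled steps in order; runs and records any step past the end.
class ReplayContext {
  private cursor = 0;
  private readonly journal: JournalEntry[];

  constructor(journal: JournalEntry[]) {
    this.journal = journal;
  }

  step<T>(name: string, fn: () => T): T {
    const recorded = this.journal[this.cursor];
    if (recorded !== undefined) {
      if (recorded.name !== name) {
        throw new NonDeterminismError(
          `journal has "${recorded.name}" at position ${this.cursor}, replay asked for "${name}"`
        );
      }
      this.cursor += 1;
      return recorded.result as T; // short-circuit: do not run fn again
    }
    const result = fn();
    this.journal.push({ name, result });
    this.cursor += 1;
    return result;
  }
}

// Bad: branches on a value read outside any journaled step.
function badWorkflow(ctx: ReplayContext, clock: () => number): string {
  if (clock() % 2 === 0) return ctx.step("sendEmail", () => "email sent");
  return ctx.step("sendSms", () => "sms sent");
}

// Good: the clock read is itself a journaled step, so replay sees the old value.
function goodWorkflow(ctx: ReplayContext, clock: () => number): string {
  const now = ctx.step("readClock", clock);
  if (now % 2 === 0) return ctx.step("sendEmail", () => "email sent");
  return ctx.step("sendSms", () => "sms sent");
}

const badJournal: JournalEntry[] = [];
badWorkflow(new ReplayContext(badJournal), () => 2); // first run journals "sendEmail"
try {
  badWorkflow(new ReplayContext(badJournal), () => 3); // replay reads a different clock
} catch (err) {
  console.log((err as Error).message);
}

const goodJournal: JournalEntry[] = [];
goodWorkflow(new ReplayContext(goodJournal), () => 2);
console.log(goodWorkflow(new ReplayContext(goodJournal), () => 3)); // "email sent"
```

In the second workflow the replay never consults the live clock for its decision; it reuses the journaled reading, which is the role Activities play in the Temporal model described above.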
Implementation sketch
// Pseudocode: TypeScript SDKs (LangGraph TS, Inngest, Temporal) provide
// these primitives directly; this sketch shows the snapshot-checkpoint shape
// the pattern requires regardless of backend.
type AgentState = Record<string, unknown>
type Checkpoint = { runId: string; step: number; state: AgentState; nextStep: string | null }

declare const store: {
  load(runId: string): Promise<Checkpoint | null>
  save(cp: Checkpoint): Promise<void> // atomic; tolerates concurrent writers
}
declare function executeStep(state: AgentState, step: string): Promise<{ next: string | null; state: AgentState }>

async function runWithCheckpoints(runId: string, initialStep: string, initialState: AgentState) {
  const resumed = await store.load(runId)
  let state = resumed?.state ?? initialState
  // Trust the saved pointer whenever a checkpoint exists: a null nextStep means
  // the run already finished, so it must not fall back to initialStep.
  let step: string | null = resumed ? resumed.nextStep : initialStep
  let i = (resumed?.step ?? -1) + 1
  while (step) {
    // A crash before save() re-runs this step on recovery, so it must be idempotent.
    const result = await executeStep(state, step) // LLM call or tool
    state = result.state
    await store.save({ runId, step: i, state, nextStep: result.next }) // commit boundary
    step = result.next // crash here -> new worker resumes from saved checkpoint
    i += 1
  }
  return state
}

export {}
Checkpointing makes an agent run survive the failure of the process executing it. The orchestrator writes the run's state (accumulated messages, completed tool results, the next pending step, any pending writes) to durable storage at well-defined boundaries, keyed by a stable run identifier. When the worker dies mid-loop because a Kubernetes pod was evicted, a serverless function timed out, or an LLM call took fifteen minutes and the connection dropped, a fresh worker reads the last checkpoint, rehydrates the agent's state, and resumes from the boundary rather than from the prompt. The unit of recovery is the boundary, not the run; work already done is not redone.
Background · context and trade-offs
Two implementations dominate. Snapshot checkpointing (what LangGraph, Mastra, and the Anthropic Agent SDK ship) atomically writes the full state object after each step into a backend (in-memory for development, SQLite or Postgres for production) and resumes by loading the latest record for that thread or session. Replay-based durable execution (Temporal, Inngest, Restate) instead records a journal of completed steps and their results, then re-runs the agent code from the top on recovery, short-circuiting any step whose result is already in the journal. Both reach the same place: the LLM does not get called twice for the same prompt, and the tool whose side effect already fired does not fire again.
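The replay-based variant can be illustrated with a small journal keyed by step identifier. This is a simplified sketch, not the API of Temporal, Inngest, or Restate: `journaled`, `workflow`, and `demo` are hypothetical names, and an in-memory `Map` stands in for the durable journal.

```typescript
type Journal = Map<string, unknown>;

// Returns the journaled result when one exists; otherwise runs the step and
// records its result. A step that throws leaves no journal entry.
async function journaled<T>(journal: Journal, id: string, fn: () => Promise<T>): Promise<T> {
  if (journal.has(id)) return journal.get(id) as T; // short-circuit on replay
  const result = await fn();
  journal.set(id, result);
  return result;
}

// `calls` records which step bodies actually executed.
async function workflow(journal: Journal, calls: string[], crashInSummarize: boolean): Promise<string> {
  const plan = await journaled(journal, "plan", async () => {
    calls.push("plan");
    return "outline";
  });
  const draft = await journaled(journal, "draft", async () => {
    calls.push("draft");
    return `${plan} -> draft`;
  });
  return journaled(journal, "summarize", async () => {
    calls.push("summarize");
    if (crashInSummarize) throw new Error("worker crashed");
    return `${draft} -> summary`;
  });
}

async function demo(): Promise<{ result: string; calls: string[] }> {
  const journal: Journal = new Map(); // would be durable storage in practice
  const calls: string[] = [];
  // First worker dies inside the third step.
  await workflow(journal, calls, true).catch(() => undefined);
  // Replacement worker re-runs the workflow from the top against the same journal.
  const result = await workflow(journal, calls, false);
  return { result, calls };
}

demo().then((out) => console.log(out.result, out.calls));
```

After recovery, `calls` contains `plan` and `draft` once each and `summarize` twice: the first two steps are served from the journal, and only the step that never completed runs again.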
The pattern is distinct from Memory Management, which retrieves prior context to inject into the next prompt, and from generic database transactions, which protect a single write. Checkpointing protects the unit-of-work that is the agent run itself. The cost is operational: someone owns the schema as it migrates between releases, the determinism contract if the implementation is replay-based, the retention policy for stale checkpoints, and the question of what happens when a checkpoint is loaded by a worker running a different version of the agent code than the one that wrote it.
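The version-skew question can be made concrete with a versioned checkpoint envelope. The sketch below is one possible approach, not something any of the frameworks above prescribes: the `schemaVersion` field, the two checkpoint shapes, and the `pendingWrites` addition are all hypothetical and exist only to show a loader that upgrades old records and refuses ones it does not understand.

```typescript
type AgentState = Record<string, unknown>;

// Hypothetical shapes: V2 adds a pendingWrites list that V1 never recorded.
type CheckpointV1 = { schemaVersion: 1; runId: string; step: number; state: AgentState };
type CheckpointV2 = {
  schemaVersion: 2;
  runId: string;
  step: number;
  state: AgentState;
  pendingWrites: string[];
};
type AnyCheckpoint = CheckpointV1 | CheckpointV2;

// Upgrade an older record to the shape the current code expects.
function migrate(cp: AnyCheckpoint): CheckpointV2 {
  if (cp.schemaVersion === 2) return cp;
  // V1 had no pending writes, so an empty list preserves its meaning.
  return { schemaVersion: 2, runId: cp.runId, step: cp.step, state: cp.state, pendingWrites: [] };
}

function loadCheckpoint(raw: string): CheckpointV2 {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("checkpoint is not an object");
  }
  const version = (parsed as { schemaVersion?: unknown }).schemaVersion;
  // Refuse anything this build does not know, including records written by
  // a newer release; guessing at an unknown layout risks corrupting the run.
  if (version !== 1 && version !== 2) {
    throw new Error(`unsupported checkpoint schemaVersion: ${String(version)}`);
  }
  return migrate(parsed as AnyCheckpoint);
}

const oldRecord = JSON.stringify({ schemaVersion: 1, runId: "run-42", step: 3, state: { topic: "x" } });
const upgraded = loadCheckpoint(oldRecord);
console.log(upgraded.schemaVersion, upgraded.pendingWrites.length); // 2 0
```

Rejecting unknown versions outright is a conservative choice; a deployment that rolls back releases often might instead keep old workers draining old checkpoints until none remain.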