Layer 3: State & Context

Checkpointing

Also known as: Durable Execution, Checkpoint and Rollback, Workflow Persistence

Persists run state at step boundaries so a fresh worker resumes after a crash.

After each step, run state is written atomically to durable storage, keyed by run ID. When a worker crashes, a fresh worker loads the last checkpoint and resumes from that boundary: completed steps are not redone, and their side effects do not re-fire.

Claude Code

  • Write each phase's outputs to deterministic disk paths before the next subagent wave dispatches — the filesystem is the checkpoint store.
  • Maintain a STATUS.md with one row per phase (status, artifact count, timestamp); a fresh session reads it to find the resume point without replaying transcripts.
  • Gate each phase transition on a shell assertion that all expected artifact files exist and are non-empty — fail loudly rather than forward silently.
  • Forward a compressed context packet (1–2 K tokens) between phases, not raw transcripts; this keeps successor contexts clean and bounded.

Primitives

  • Disk artifact checkpoint protocol
  • STATUS.md recovery anchor
  • Task tool (phase workers)
  • PreToolUse hooks (phase-exit gate)
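
The phase-exit gate described above can be sketched as follows. The source calls for a shell assertion; this is a TypeScript stand-in for the same check, not a Claude Code API. The names `checkArtifacts` and `assertPhaseComplete` and the artifact paths are hypothetical.

```typescript
import { statSync } from "node:fs";

// Hypothetical helper: size of a file in bytes, or null if it cannot be stat'ed
// (typically because it does not exist).
function fileSize(path: string): number | null {
  try {
    return statSync(path).size;
  } catch {
    return null;
  }
}

type GateResult = { ok: boolean; missing: string[]; empty: string[] };

// Checks that every expected artifact exists and is non-empty.
// `sizeOf` is injectable so the gate can be exercised without touching disk.
function checkArtifacts(
  expected: string[],
  sizeOf: (path: string) => number | null = fileSize,
): GateResult {
  const missing: string[] = [];
  const empty: string[] = [];
  for (const path of expected) {
    const size = sizeOf(path);
    if (size === null) missing.push(path);
    else if (size === 0) empty.push(path);
  }
  return { ok: missing.length === 0 && empty.length === 0, missing, empty };
}

// Fail loudly: throw instead of letting the next phase start on partial output.
function assertPhaseComplete(phase: string, expected: string[], sizeOf = fileSize): void {
  const result = checkArtifacts(expected, sizeOf);
  if (!result.ok) {
    throw new Error(
      `Phase "${phase}" incomplete. Missing: [${result.missing.join(", ")}] Empty: [${result.empty.join(", ")}]`,
    );
  }
}
```

Because the gate throws rather than returning a flag, the caller cannot forget to check it before dispatching the next wave.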

Cursor

  • Commit partial work to a dedicated branch after each milestone; cloud agents already work on isolated branches by default.
  • Use a STATUS.md at the repo root as the recovery anchor; reference it via @file at the start of a resumed session.
  • Write task progress to a `.cursor/rules/*.mdc` file with alwaysApply: true so context survives Cursor session restarts.
  • Reference committed artifacts via @git or file references when resuming so the agent reads the actual persisted state, not its in-context assumption.

Primitives

  • Cloud agents (branch isolation)
  • .cursor/rules/*.mdc (persistent state)
  • @file (recovery anchor)
  • Agent mode
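
A resumed session needs to turn STATUS.md into a resume point. The sketch below assumes a layout the source does not specify: a Markdown table with four columns (phase, status, artifact count, timestamp) and the status values `done`, `in_progress`, and `pending`. Both function names are hypothetical.

```typescript
type PhaseRow = { phase: string; status: string; artifacts: number; timestamp: string };

// Parses rows of the assumed form: | phase | status | artifact count | timestamp |
function parseStatus(markdown: string): PhaseRow[] {
  const rows: PhaseRow[] = [];
  for (const line of markdown.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("|")) continue;
    // Drop the empty strings produced by the leading and trailing pipes.
    const cells = trimmed.split("|").slice(1, -1).map((c) => c.trim());
    if (cells.length !== 4) continue;
    const [phase = "", status = "", artifacts = "", timestamp = ""] = cells;
    const count = Number(artifacts);
    // Skips the header row and the |---| separator, whose third cell is not a number.
    if (artifacts === "" || Number.isNaN(count)) continue;
    rows.push({ phase, status, artifacts: count, timestamp });
  }
  return rows;
}

// The resume point is the first phase not marked "done"; null means the run finished.
function findResumePoint(rows: PhaseRow[]): string | null {
  const next = rows.find((r) => r.status !== "done");
  return next ? next.phase : null;
}
```

Reading the resume point from the file, rather than from the agent's recollection of the previous session, keeps recovery tied to persisted state.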

Decision

Use when ✓

  • Apply when an agent run is long enough that a process restart between steps is likely (multi-minute tool calls, multi-hour batch jobs, sessions that span days).
  • Required where steps have side effects you cannot afford to fire twice (charging a card, sending an email, opening a pull request) and need an idempotency anchor that survives crashes.
  • A good fit when the deployment target is preemptible (spot instances, serverless functions with execution caps, Kubernetes pods that the autoscaler may evict) and the agent must continue on a new worker without losing context.
  • Useful when an operator needs to inspect or rewind a run between steps for debugging, audit, or human review of the trajectory before it continues.

Avoid when ✗

  • When the run is single-shot, completes inside one short request, and the cost of restarting from the prompt is lower than the cost of a checkpoint write on every step.
  • When steps are not idempotent and there is no journal of completed effects: replay-style durable execution will re-fire the same side effect on recovery and produce a worse outcome than no checkpointing at all.
  • When the checkpoint backend (Postgres, SQLite, or the journal store) is less reliable than the worker it is supposed to recover: the pattern moves the failure mode rather than removing it.
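
The idempotency anchor mentioned above can be sketched as a journal of completed effects keyed by run and step. This is a minimal illustration under stated assumptions, not any SDK's API: `EffectJournal`, `InMemoryJournal`, `idempotencyKey`, and `runOnce` are hypothetical names, and a real journal would live in durable storage rather than in memory.

```typescript
// Hypothetical journal of completed side effects. In production this would be
// durable (for example, the same database that holds the checkpoints).
interface EffectJournal {
  get(key: string): Promise<string | undefined>;
  put(key: string, result: string): Promise<void>;
}

class InMemoryJournal implements EffectJournal {
  private entries = new Map<string, string>();
  async get(key: string) {
    return this.entries.get(key);
  }
  async put(key: string, result: string) {
    this.entries.set(key, result);
  }
}

// Stable key: the same run, step, and effect name produce the same key on any worker.
function idempotencyKey(runId: string, step: number, effect: string): string {
  return `${runId}:${step}:${effect}`;
}

// Runs `effect` only if the journal has no result for `key`.
// The key is also handed to the effect so a downstream API can deduplicate
// if the worker crashes after the effect fires but before the journal write.
async function runOnce(
  journal: EffectJournal,
  key: string,
  effect: (key: string) => Promise<string>,
): Promise<string> {
  const recorded = await journal.get(key);
  if (recorded !== undefined) return recorded; // already fired: reuse the result
  const result = await effect(key);
  await journal.put(key, result);
  return result;
}
```

The journal alone does not close the window between the effect and the journal write, which is why the key is forwarded to the effect itself.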

In the wild

  • docs.langchain.com: LangGraph ships a BaseCheckpointSaver interface with InMemorySaver, SqliteSaver, and PostgresSaver implementations; compiling a graph with a checkpointer writes a state snapshot at every super-step, keyed by thread_id, and a fresh process resumes by loading the latest checkpoint for that thread.
  • temporal.io: Temporal records a complete event history for every workflow execution and recovers from worker crashes by replaying that history on a new worker, short-circuiting Activities (LLM calls, tool invocations) whose results are already journaled. The same primitive backs Vercel's AI SDK durability plugin and makes generateText() crash-safe.
  • code.claude.com: The Anthropic Claude Agent SDK persists every session (prompt, tool calls, tool results, responses) to disk under ~/.claude/projects/ as JSONL and exposes resume, continue, and fork options on query() so a process restart on the same machine picks up the conversation with full context intact.

Reader gotcha

Replay-based durable execution requires workflow code to be deterministic: the same input history must yield the same decisions on every replay. A stray Date.now(), a Math.random(), or an unguarded network call inside the workflow function will diverge from the journaled history on recovery and the runtime will throw a non-determinism error mid-replay. Temporal documents this as the cost of admission: side-effecting work must live inside Activities, not in the workflow body.
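
The determinism contract is easier to see in a toy model. The sketch below is not Temporal's or Inngest's API; `ReplayContext`, `activity`, and `NonDeterminismError` are hypothetical names for a position-based journal that either replays a recorded result or runs the step and records it.

```typescript
type JournalEntry = { name: string; result: unknown };

class NonDeterminismError extends Error {}

// A toy replay context: each call to `activity` either replays the journaled
// result at the current position or runs the function and appends its result.
class ReplayContext {
  readonly journal: JournalEntry[];
  private cursor = 0;

  constructor(journal: JournalEntry[] = []) {
    this.journal = journal;
  }

  async activity<T>(name: string, fn: () => Promise<T> | T): Promise<T> {
    const recorded = this.journal[this.cursor];
    if (recorded !== undefined) {
      // Replaying: the workflow must ask for the same step in the same order.
      if (recorded.name !== name) {
        throw new NonDeterminismError(
          `step ${this.cursor}: journal has "${recorded.name}", workflow asked for "${name}"`,
        );
      }
      this.cursor += 1;
      return recorded.result as T;
    }
    const result = await fn(); // first execution: run the step and journal it
    this.journal.push({ name, result });
    this.cursor += 1;
    return result;
  }
}

// Deterministic workflow: the clock read is wrapped in an activity, so a
// replay sees the journaled timestamp instead of a fresh one.
async function workflow(ctx: ReplayContext): Promise<string> {
  const startedAt = await ctx.activity("now", () => Date.now());
  const branch = startedAt % 2 === 0 ? "even-path" : "odd-path";
  return ctx.activity(branch, () => `ran ${branch} at ${startedAt}`);
}
```

If `workflow` called `Date.now()` directly, a replay could pick the other branch, request a step name the journal does not hold at that position, and hit the `NonDeterminismError` path.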

Implementation sketch

// Pseudocode. Community TS SDKs (LangGraph TS, Inngest, Temporal) provide
// these primitives directly; this sketch shows the snapshot-checkpoint shape
// the pattern requires regardless of backend.

type AgentState = Record<string, unknown>

// One durable record per run: the state after `step`, plus the step to run next.
type Checkpoint = { runId: string; step: number; state: AgentState; nextStep: string | null }

declare const store: {
  load(runId: string): Promise<Checkpoint | null>   // latest checkpoint, or null for a new run
  save(cp: Checkpoint): Promise<void>               // atomic; tolerates concurrent writers
}
declare function executeStep(state: AgentState, step: string): Promise<{ next: string | null; state: AgentState }>

async function runWithCheckpoints(runId: string, initialStep: string, initialState: AgentState): Promise<AgentState> {
  const resumed = await store.load(runId)
  let state = resumed?.state ?? initialState
  // Use the checkpoint's nextStep even when it is null: a null nextStep means
  // the run already finished, and falling back to initialStep would restart it.
  let step: string | null = resumed ? resumed.nextStep : initialStep
  let i = (resumed?.step ?? -1) + 1

  while (step) {
    // LLM call or tool. A crash before the save below re-runs this step on
    // recovery, so executeStep must be idempotent or journal its side effects.
    const result = await executeStep(state, step)
    state = result.state
    await store.save({ runId, step: i, state, nextStep: result.next })   // commit boundary
    step = result.next                                   // crash here -> new worker resumes from saved checkpoint
    i += 1
  }
  return state
}

export {}   // marks the file as a module so the declarations stay file-scoped
Community TS SDK
  • LangGraph
  • Mastra
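
The sketch above declares `store` without implementing it. One possible file-backed version follows, assuming Node.js; `FileCheckpointStore` is a hypothetical name. It writes to a temporary file and renames it over the target, which gives atomic replacement on POSIX filesystems. It does not fsync and does not arbitrate between concurrent writers, so treat it as an illustration of the commit boundary rather than a production backend.

```typescript
import { mkdir, readFile, rename, writeFile } from "node:fs/promises";
import { join } from "node:path";

type AgentState = Record<string, unknown>;
type Checkpoint = { runId: string; step: number; state: AgentState; nextStep: string | null };

// One JSON file per run. A real deployment would more likely use SQLite or Postgres.
class FileCheckpointStore {
  private readonly dir: string;

  constructor(dir: string) {
    this.dir = dir;
  }

  private pathFor(runId: string): string {
    // encodeURIComponent keeps path separators in a run ID from escaping the directory.
    return join(this.dir, `${encodeURIComponent(runId)}.json`);
  }

  async load(runId: string): Promise<Checkpoint | null> {
    try {
      return JSON.parse(await readFile(this.pathFor(runId), "utf8")) as Checkpoint;
    } catch (err) {
      if ((err as { code?: string }).code === "ENOENT") return null; // no checkpoint yet
      throw err;
    }
  }

  async save(cp: Checkpoint): Promise<void> {
    await mkdir(this.dir, { recursive: true });
    const target = this.pathFor(cp.runId);
    const temp = `${target}.${cp.step}.tmp`;
    await writeFile(temp, JSON.stringify(cp), "utf8");
    // rename() replaces the target in one step on POSIX filesystems, so a reader
    // sees either the old checkpoint or the new one, never a half-written file.
    await rename(temp, target);
  }
}
```

An instance of this class has the `load` and `save` shape the sketch expects, so it could stand in for the declared `store` during local experiments.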

References

  1. Malewicz et al.·2010·SIGMOD 2010 · DOI: 10.1145/1807167.1807184

    foundational super-step checkpointing model that LangGraph's BaseCheckpointSaver inherits

  2. LangChain team·2025·accessed

    BaseCheckpointSaver interface and StateSnapshot semantics

  3. Temporal Technologies·2025·accessed

    event-history replay model and the determinism contract

  4. Inngest team·2025·accessed

    step.run() memoization for resumable durable execution

  5. Restate team·2025

    argues for journal-based durability wrapping existing agent SDKs

  6. Anthropic·2025·accessed

    first-party TS session persistence: continue, resume, fork

  7. Antonio Gulli·2026·Springer·pp. 302–303

    frames Checkpoint and Rollback as the agent analogue of database commit/rollback

Checkpointing makes an agent run survive the failure of the process executing it. The orchestrator writes the run's state (accumulated messages, completed tool results, the next pending step, any pending writes) to durable storage at well-defined boundaries, keyed by a stable run identifier. When the worker dies mid-loop because a Kubernetes pod was evicted, a serverless function timed out, or an LLM call took fifteen minutes and the connection dropped, a fresh worker reads the last checkpoint, rehydrates the agent's state, and resumes from the boundary rather than from the prompt. The unit of recovery is the boundary, not the run; work already done is not redone.

Background · context and trade-offs

Two implementations dominate. Snapshot checkpointing (what LangGraph, Mastra, and the Anthropic Agent SDK ship) atomically writes the full state object after each step into a backend (in-memory for development, SQLite or Postgres for production) and resumes by loading the latest record for that thread or session. Replay-based durable execution (Temporal, Inngest, Restate) instead records a journal of completed steps and their results, then re-runs the agent code from the top on recovery, short-circuiting any step whose result is already in the journal. Both reach the same place: the LLM does not get called twice for the same prompt, and the tool whose side effect already fired does not fire again.

The pattern is distinct from Memory Management, which retrieves prior context to inject into the next prompt, and from generic database transactions, which protect a single write. Checkpointing protects the unit-of-work that is the agent run itself. The cost is operational: someone owns the schema as it migrates between releases, the determinism contract if the implementation is replay-based, the retention policy for stale checkpoints, and the question of what happens when a checkpoint is loaded by a worker running a different version of the agent code than the one that wrote it.
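
The version-mismatch question can be made concrete by stamping each checkpoint with a schema version and upgrading it on load. This is one possible approach, not something the cited SDKs prescribe; `schemaVersion`, `upgradeCheckpoint`, and the migration table are hypothetical.

```typescript
type AgentState = Record<string, unknown>;
type VersionedCheckpoint = { schemaVersion: number; runId: string; state: AgentState };

// migrations[n] upgrades a checkpoint's state from version n to version n + 1.
type Migration = (state: AgentState) => AgentState;

function upgradeCheckpoint(
  cp: VersionedCheckpoint,
  currentVersion: number,
  migrations: Record<number, Migration>,
): VersionedCheckpoint {
  if (cp.schemaVersion > currentVersion) {
    // Written by newer code than this worker runs: refuse rather than guess.
    throw new Error(`checkpoint v${cp.schemaVersion} is newer than worker v${currentVersion}`);
  }
  let state = cp.state;
  for (let v = cp.schemaVersion; v < currentVersion; v += 1) {
    const migrate = migrations[v];
    if (!migrate) throw new Error(`no migration from v${v} to v${v + 1}`);
    state = migrate(state);
  }
  return { ...cp, schemaVersion: currentVersion, state };
}
```

Rejecting checkpoints from a newer version forces a rollback to be an explicit decision instead of a silent misread of state the old code does not understand.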