Checkpointing
Also known as: Durable Execution, Checkpoint and Rollback, Workflow Persistence
Agent persists its run state at safe boundaries so a crash mid-loop loses no work.
Decision
| Use when ✓ | Avoid when ✗ |
|---|---|
| +Apply when an agent run is long enough that a process restart between steps is likely (multi-minute tool calls, multi-hour batch jobs, sessions that span days). | −When the run is single-shot, completes inside one short request, and the cost of restarting from the prompt is lower than the cost of a checkpoint write on every step. |
| +Use where steps have side effects you cannot afford to fire twice (charging a card, sending an email, opening a pull request) and need an idempotency anchor that survives crashes (see the sketch after this table). | −Without idempotent steps or a journal of completed effects, replay-style durable execution will re-fire the same side effect on recovery and produce a worse outcome than no checkpointing at all. |
| +Reach for it when the deployment target is preemptible — spot instances, serverless functions with execution caps, Kubernetes pods that the autoscaler may evict — and the agent must continue on a new worker without losing context. | −When the checkpoint backend (Postgres, SQLite, or the journal store) is less reliable than the worker it is supposed to recover, the pattern moves the failure mode rather than removing it. |
| +Prefer it when an operator needs to inspect or rewind a run between steps for debugging, audit, or human review of the trajectory before it continues. | |
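A minimal shape for the idempotency anchor in the second row, under assumed names: the effect journal (`effects`) and the `sendEmail` provider call are hypothetical stand-ins, not any SDK's API. The key is derived from the run id and step index, so a recovered worker skips an effect that already fired.

// Hypothetical effect journal keyed by an idempotency anchor.
declare const effects: {
  get(key: string): Promise<string | null>
  put(key: string, result: string): Promise<void>
}
declare function sendEmail(to: string, body: string): Promise<string> // returns provider message id

// Fire the effect at most once per (runId, step), surviving crashes around
// the checkpoint write. A crash between send and journal can still re-fire;
// true exactly-once needs the provider to accept the key itself.
async function sendEmailOnce(runId: string, step: number, to: string, body: string): Promise<string> {
  const key = `${runId}:${step}:sendEmail`
  const prior = await effects.get(key)
  if (prior !== null) return prior // recovery path: already fired, skip
  const messageId = await sendEmail(to, body)
  await effects.put(key, messageId)
  return messageId
}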
In the wild
| Source | Claim |
|---|---|
| docs.langchain.com → | LangGraph ships a BaseCheckpointSaver interface with InMemorySaver, SqliteSaver, and PostgresSaver implementations; compiling a graph with a checkpointer writes a state snapshot at every super-step, keyed by thread_id, and a fresh process resumes by loading the latest checkpoint for that thread. |
| temporal.io → | Temporal records a complete event history for every workflow execution and recovers from worker crashes by replaying that history on a new worker, short-circuiting Activities (LLM calls, tool invocations) whose results are already journaled — the same primitive Vercel's AI SDK durability plugin uses to make generateText() crash-safe. |
| code.claude.com → | The Anthropic Claude Agent SDK persists every session — prompt, tool calls, tool results, responses — to disk under ~/.claude/projects/ as JSONL and exposes resume, continue, and fork options on query() so a process restart on the same machine picks up the conversation with full context intact. |
Reader gotcha
Replay-based durable execution requires workflow code to be deterministic — the same input history must yield the same decisions on every replay. A stray Date.now(), a Math.random(), or an unguarded network call inside the workflow function will diverge from the journaled history on recovery and the runtime will throw a non-determinism error mid-replay. Temporal documents this as the cost of admission: side-effecting work must live inside Activities, not in the workflow body.
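In Temporal's TypeScript SDK the split looks like the sketch below. `proxyActivities` is real Temporal API; the activity names and signatures here are hypothetical.

import { proxyActivities } from '@temporalio/workflow'

// Hypothetical activity signatures; implementations live in a separate
// activities file and may do any I/O they like.
type Activities = {
  callLlm(prompt: string): Promise<string>
  sendEmail(body: string): Promise<void>
}

const { callLlm, sendEmail } = proxyActivities<Activities>({
  startToCloseTimeout: '5 minutes',
})

export async function agentRun(prompt: string): Promise<string> {
  // Journaled: runs once, short-circuited from event history on replay.
  const draft = await callLlm(prompt)
  // Allowed only because the effect lives in an Activity; a raw fetch(),
  // fs call, or unguarded clock read in this body would diverge on replay.
  await sendEmail(draft)
  return draft
}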
Implementation sketch
// Pseudocode — TS SDKs (LangGraph TS, Inngest, Temporal) provide these
// primitives directly; this sketch shows the snapshot-checkpoint shape
// the pattern requires regardless of backend.
type AgentState = Record<string, unknown>

type Checkpoint = { runId: string; step: number; state: AgentState; nextStep: string | null }

declare const store: {
  load(runId: string): Promise<Checkpoint | null>
  save(cp: Checkpoint): Promise<void> // atomic; tolerates concurrent writers
}

declare function executeStep(state: AgentState, step: string): Promise<{ next: string | null; state: AgentState }>

async function runWithCheckpoints(runId: string, initialStep: string, initialState: AgentState) {
  const resumed = await store.load(runId)
  let state = resumed?.state ?? initialState
  // A checkpoint with nextStep === null means the run already finished;
  // `resumed?.nextStep ?? initialStep` would wrongly restart it, so branch
  // on the checkpoint itself.
  let step: string | null = resumed ? resumed.nextStep : initialStep
  let i = (resumed?.step ?? -1) + 1
  while (step) {
    const result = await executeStep(state, step) // LLM call or tool
    state = result.state
    await store.save({ runId, step: i, state, nextStep: result.next }) // commit boundary
    step = result.next // crash here -> new worker resumes from saved checkpoint
    i += 1
  }
  return state
}

export {}
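A usage sketch with a hypothetical run id and initial state: the same call serves both the first attempt and every resume, so callers never need to know whether a checkpoint already exists.

await runWithCheckpoints('run-42', 'plan', { goal: 'triage inbox' })
// Worker evicted mid-run? Re-issue the identical call from any process;
// store.load('run-42') returns the last checkpoint and the loop continues.
await runWithCheckpoints('run-42', 'plan', { goal: 'triage inbox' })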
References
- Malewicz et al. · 2010 · SIGMOD · DOI: 10.1145/1807167.1807184
  foundational super-step checkpointing model that LangGraph's BaseCheckpointSaver inherits
- LangChain team · 2025 · accessed
  BaseCheckpointSaver interface and StateSnapshot semantics
- Temporal Technologies · 2025 · accessed
  event-history replay model and the determinism contract
- Inngest team · 2025 · accessed
  step.run() memoization for resumable durable execution
- Restate team · 2025
  argues for journal-based durability wrapping existing agent SDKs
- Anthropic · 2025 · accessed
  first-party TS session persistence: continue, resume, fork
- Antonio Gulli · 2026 · Springer · pp. 302–303
  frames Checkpoint and Rollback as the agent analogue of database commit/rollback
Overview · 1-paragraph mechanism
Checkpointing makes an agent run survive the failure of the process executing it. The orchestrator writes the run's state — accumulated messages, completed tool results, the next pending step, any pending writes — to durable storage at well-defined boundaries, keyed by a stable run identifier. When the worker dies mid-loop because a Kubernetes pod was evicted, a serverless function timed out, or an LLM call took fifteen minutes and the connection dropped, a fresh worker reads the last checkpoint, rehydrates the agent's state, and resumes from the boundary rather than from the prompt. The unit of recovery is the boundary, not the run; work already done is not redone.
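The thread-keyed resume described above looks like this in LangGraph's JS package. `MemorySaver`, `compile({ checkpointer })`, and the `thread_id` config are real `@langchain/langgraph` API; the graph itself is assumed to be assembled elsewhere.

import { MemorySaver } from '@langchain/langgraph'

// A StateGraph assembled elsewhere; only the checkpointer wiring is shown.
declare const builder: {
  compile(opts: { checkpointer: MemorySaver }): {
    invoke(input: unknown, config: { configurable: { thread_id: string } }): Promise<unknown>
  }
}

const app = builder.compile({ checkpointer: new MemorySaver() })

// Each super-step under this thread_id writes a checkpoint; a fresh process
// issuing the same call loads the latest snapshot and resumes mid-run.
await app.invoke(
  { messages: [{ role: 'user', content: 'triage my inbox' }] },
  { configurable: { thread_id: 'run-42' } },
)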
Background · context and trade-offs
Two implementations dominate. Snapshot checkpointing — what LangGraph, Mastra, and the Anthropic Agent SDK ship — atomically writes the full state object after each step into a backend (in-memory for development, SQLite or Postgres for production) and resumes by loading the latest record for that thread or session. Replay-based durable execution — Temporal, Inngest, Restate — instead records a journal of completed steps and their results, then re-runs the agent code from the top on recovery, short-circuiting any step whose result is already in the journal. Both reach the same place: the LLM does not get called twice for the same prompt, and the tool whose side effect already fired does not fire again.
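The replay half of that split reduces to one primitive, sketched below under assumed names: a `step(name, fn)` wrapper that consults a journal before executing. This is the shape of Inngest's step.run() memoization and of Temporal's Activity short-circuiting, not either library's actual code.

// Journal of completed steps for one run, keyed by step name.
type Journal = Map<string, unknown>
declare function loadJournal(runId: string): Promise<Journal>
declare function appendJournal(runId: string, name: string, result: unknown): Promise<void>
declare function callLlm(prompt: string): Promise<string>

// First execution runs the step and journals its result; a re-run from the
// top after a crash returns the journaled result without re-firing anything.
function makeStep(runId: string, journal: Journal) {
  return async function step<T>(name: string, fn: () => Promise<T>): Promise<T> {
    if (journal.has(name)) return journal.get(name) as T // replay short-circuit
    const result = await fn()
    await appendJournal(runId, name, result) // durable before it counts as done
    journal.set(name, result)
    return result
  }
}

async function agent(runId: string) {
  const step = makeStep(runId, await loadJournal(runId))
  // Step names must be stable and unique per call site, or replay will
  // hand the wrong journaled result to the wrong step.
  const plan = await step('plan', () => callLlm('plan the task'))
  return step('execute', () => callLlm(`carry out: ${plan}`))
}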
The pattern is distinct from Memory Management, which retrieves prior context to inject into the next prompt, and from generic database transactions, which protect a single write. Checkpointing protects the unit-of-work that is the agent run itself. The cost is operational: someone owns the schema as it migrates between releases, the determinism contract if the implementation is replay-based, the retention policy for stale checkpoints, and the question of what happens when a checkpoint is loaded by a worker running a different version of the agent code than the one that wrote it.
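The version-skew question in the last sentence is commonly handled by stamping each checkpoint with the code version that wrote it, then migrating or refusing on mismatch. A minimal guard, with the version field and `migrate` helper as assumptions layered onto the Checkpoint type from the implementation sketch:

// Assumed: checkpoints carry the agent-code version that wrote them.
const CODE_VERSION = 3
type VersionedCheckpoint = { version: number; state: Record<string, unknown> }
declare function migrate(cp: VersionedCheckpoint): VersionedCheckpoint // hand-written per release

function rehydrate(cp: VersionedCheckpoint): Record<string, unknown> {
  if (cp.version === CODE_VERSION) return cp.state
  if (cp.version < CODE_VERSION) return migrate(cp).state // older writer: migrate forward
  // Newer writer: a rolled-back worker must not clobber a newer run's state.
  throw new Error(`checkpoint v${cp.version} is newer than code v${CODE_VERSION}; refusing to resume`)
}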