Reflexion
Also known as: Verbal Reinforcement Learning, Self-Reflection
Agent writes self-critiques into memory to improve next attempts.
Decision
| Use when ✓ | Avoid when ✗ |
|---|---|
| +Apply when the agent attempts the same task class repeatedly and you have a place to keep critiques across runs (multi-turn assistants, recurring job types, agent benchmarks). | −When the task is single-shot, the lesson has no future attempt to inform and the critique step pays no rent. |
| +Use where failure is diagnosable from the trajectory — the agent can name the mistake in words an outside reviewer could verify. | −Without external grounding, same-model self-critique tends to approve its own work even when it should not — substitute a different model, a tool-grounded check, or the CRITIC pattern. |
| +Reach for it when fine-tuning is too slow or too expensive but you can afford a second LLM call per failed attempt and a small key-value store for lessons. | −When failures are not visible in the trajectory (stale data the agent could not have known about, hidden environment changes), the critique will hallucinate a cause. |
| +Prefer it when you want behavioral improvements that survive a deploy: the lessons are inspectable text you can read, edit, or evict by hand. | |
In the wild
| Source | Claim |
|---|---|
| cognition.ai → | Cognition documents Devin keeping notes on what worked and what failed across sessions on the same project, then reading those notes back when it picks the work up again. |
| langchain-ai.github.io → | LangGraph ships a runnable Reflection tutorial that wires a generator and reflector around a shared message thread, exactly the loop this pattern describes. |
| arxiv.org → | AgentBench evaluates language agents across eight environments and reports that Reflexion-style verbal feedback measurably improves performance on programming and operating-system tasks where trajectory signal is rich. |
Reader gotcha
A same-model critic prompted with the same context will systematically approve its own output, producing sycophantic agreement that looks like self-correction but adds no signal. Ground the critique externally — a different model, a code interpreter, a search-grounded checker — as CRITIC argues.
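One way to ground the check is to make the evaluator a different model family than the generator. A minimal sketch, assuming the Vercel AI SDK's Anthropic provider and a PASS/FAIL reply convention (both illustrative choices, not part of the pattern); it fills in the `evaluate` hook the implementation sketch below leaves abstract.

```ts
import { generateText } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'

// Grounded evaluator: a critic from a different model family, so the
// generator is not grading its own homework. The model id and the
// PASS/FAIL protocol are illustrative assumptions.
async function evaluate(task: string, output: string): Promise<boolean> {
  const { text } = await generateText({
    model: anthropic('claude-3-5-sonnet-latest'),
    prompt:
      `Task: ${task}\nCandidate answer:\n${output}\n\n` +
      `Does the answer actually satisfy the task? Reply with exactly PASS or FAIL, then one sentence of justification.`,
  })
  return text.trim().toUpperCase().startsWith('PASS')
}

export { evaluate }
```

For code tasks, a code interpreter or test runner is stronger grounding still: execution results cannot flatter the generator.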
Implementation sketch
```ts
import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

type Episode = { task: string; outcome: 'success' | 'failure'; critique: string }

// Episodic memory: a plain array here; swap in a key-value store to keep
// lessons across process restarts.
const memory: Episode[] = []

// External grounding hook: a different model, a test runner, or a
// search-grounded checker (see the gotcha above).
declare function evaluate(task: string, output: string): Promise<boolean>

async function attemptWithReflexion(task: string, maxAttempts = 3): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Retrieve the three most recent failure critiques as verbal lessons.
    const lessons = memory
      .filter((m) => m.outcome === 'failure')
      .slice(-3)
      .map((m) => m.critique)
      .join('\n')

    // Generate, conditioned on prior lessons rather than raw transcripts.
    const { text } = await generateText({
      model: openai('gpt-4o'),
      prompt: `Lessons from prior attempts:\n${lessons}\n\nTask: ${task}`,
    })

    if (await evaluate(task, text)) {
      memory.push({ task, outcome: 'success', critique: '' })
      return text
    }

    // Reflexion step: turn the failed trajectory into a reusable lesson.
    const critique = await generateText({
      model: openai('gpt-4o'),
      prompt: `Task: ${task}\nAttempt: ${text}\nWhy did this fail? Write a one-paragraph lesson for next time.`,
    })
    memory.push({ task, outcome: 'failure', critique: critique.text })
  }
  throw new Error('Max attempts exceeded')
}

export {}
```
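The array above forgets everything when the process exits. For lessons that actually survive a deploy, persist episodes as plain text; a minimal sketch, assuming a local JSON file stands in for the key-value store (the path is illustrative):

```ts
import { readFile, writeFile } from 'node:fs/promises'

type Episode = { task: string; outcome: 'success' | 'failure'; critique: string }

const STORE = './reflexion-memory.json' // hypothetical location

// Load prior lessons at startup; an empty buffer on first run.
async function loadMemory(): Promise<Episode[]> {
  try {
    return JSON.parse(await readFile(STORE, 'utf8')) as Episode[]
  } catch {
    return []
  }
}

// Write the buffer back after each episode. Because the store is plain
// JSON, lessons stay inspectable: read, edit, or evict them by hand.
async function saveMemory(memory: Episode[]): Promise<void> {
  await writeFile(STORE, JSON.stringify(memory, null, 2))
}

export { loadMemory, saveMemory }
```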
See also: LangChain, LangGraph
References
- Shinn et al. · 2023 · NeurIPS 2023 · DOI: 10.48550/arXiv.2303.11366
  foundational paper ("Reflexion: Language Agents with Verbal Reinforcement Learning")
- Madaan et al. · 2023 · NeurIPS 2023 · DOI: 10.48550/arXiv.2303.17651
  closely related single-attempt variant ("Self-Refine: Iterative Refinement with Self-Feedback")
- Gou et al. · 2023 · ICLR 2024 · DOI: 10.48550/arXiv.2305.11738
  tool-grounded variant ("CRITIC"); addresses the same-model sycophancy gotcha
- Anthropic · 2024 · "Building Effective Agents"
  frames the within-attempt cousin as the evaluator-optimizer workflow
- Antonio Gulli · 2026 · Springer · pp. 56–68
- LangChain team · 2024 · LangGraph Reflection tutorial · accessed
Overview · 1-paragraph mechanism
Reflexion teaches a language agent to learn from its own trajectories without touching model weights. After each attempt, the agent inspects the trajectory and any environment feedback, then writes a short verbal critique — a paragraph that names what went wrong and what to try next. That critique is appended to an episodic memory buffer keyed by task or task class. On the next attempt, the agent retrieves the most recent and most relevant critiques and conditions its plan on them, treating prior failures as instructions rather than as silent gradient signal.
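The implementation sketch above retrieves the last few failures globally; keying the buffer by task class, as described here, keeps lessons from one kind of work out of another. A minimal sketch, where the `taskClass` labeling function is an assumption (in practice a job type, route name, or embedding-cluster label):

```ts
type Episode = { task: string; outcome: 'success' | 'failure'; critique: string }

// Episodic memory keyed by task class.
const memoryByClass = new Map<string, Episode[]>()

// Hypothetical classifier: a trivial keyword bucket stands in for a
// real job-type or clustering scheme.
function taskClass(task: string): string {
  return task.toLowerCase().includes('sql') ? 'sql' : 'general'
}

function recordEpisode(ep: Episode): void {
  const key = taskClass(ep.task)
  const bucket = memoryByClass.get(key) ?? []
  bucket.push(ep)
  memoryByClass.set(key, bucket)
}

// The most recent k failure critiques for the same task class.
function relevantLessons(task: string, k = 3): string[] {
  return (memoryByClass.get(taskClass(task)) ?? [])
    .filter((e) => e.outcome === 'failure')
    .slice(-k)
    .map((e) => e.critique)
}

export { recordEpisode, relevantLessons }
```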
Background · context and trade-offs
The pattern straddles two layers. As Topology it shapes the control flow: a generator-execute-evaluate loop that, on failure, branches into a critique step before retrying. As State it maintains durable, retrievable memory whose unit is a natural-language lesson, not an embedding of a previous answer. The mechanism only earns its keep when failures are diagnosable from the trajectory itself — the agent must be able to articulate, in words, what an outside observer could also see. Tasks where failure is invisible (a stale tool, a wrong premise the agent never questioned) defeat the loop, because the critique is grounded in nothing.
Reflexion sits next to a within-attempt generator-critic loop but remains distinct from it. Self-Refine iterates on a single output until a critic stops complaining; Reflexion iterates across attempts, so the lesson outlives the run and the next encounter with the same problem class starts informed. The cost is operational, not algorithmic: someone has to decide what counts as the same task, how many critiques to retrieve, when to compact the buffer, and which model writes the critique. The default, letting the same model judge its own work, is the hazard the pattern is most often deployed with and no one notices.
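Those operational decisions are concrete enough to sketch. One possible compaction policy, assuming the Episode shape from the implementation sketch (the cap and the summarizer model are arbitrary defaults, not part of the pattern):

```ts
import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

type Episode = { task: string; outcome: 'success' | 'failure'; critique: string }

// Arbitrary cap: one of the knobs the text says someone has to choose.
const MAX_EPISODES = 20

// When the buffer grows past the cap, fold the oldest failure critiques
// into a single summarized lesson so the buffer stays small without
// silently dropping what was learned.
async function compact(memory: Episode[]): Promise<Episode[]> {
  if (memory.length <= MAX_EPISODES) return memory
  const old = memory.slice(0, memory.length - MAX_EPISODES)
  const recent = memory.slice(-MAX_EPISODES)
  const { text } = await generateText({
    model: openai('gpt-4o'),
    prompt: `Merge these lessons into one short paragraph, keeping every distinct mistake:\n${old
      .filter((e) => e.outcome === 'failure')
      .map((e) => e.critique)
      .join('\n')}`,
  })
  return [{ task: 'compacted', outcome: 'failure', critique: text }, ...recent]
}

export { compact }
```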