Layer 1 — Topology / Control Flow

Evaluator-Optimizer

Also known as: Generator-Critic Loop, Self-Refine, Iterative Refinement

Generate a draft, score it against a rubric, refine until the critic stops complaining.

A flowchart in which a Task feeds a Generator that drafts an answer, passed to an Evaluator that scores it against a rubric; if the verdict passes or attempts are exhausted the loop returns the final answer, otherwise the critique is appended to the prompt and the Generator runs again.

Decision

Use when ✓	Avoid when ✗
+Apply when the acceptance criteria can be written down and a separate critic can check the draft against them more reliably than the generator can self-correct in one shot.	−When the rubric collapses to "looks good" the critic invents work, the loop never converges, and the bill grows linearly with iteration count without a quality signal to justify it.
+Use where a single failed iteration is cheap and observable — code that has to compile, JSON that has to validate against a schema, translations a back-translation step can grade, prose that must hit a style or length target.	−When the generator and critic share weights and prompt context, the second call tends to ratify the first — pick a different model family, a tool-grounded check, or fall back to LLM-as-Judge where the bias is at least measured.
+Reach for it when the critic has access to a signal the generator does not: a unit-test runner, a type checker, a search-grounded fact check, or a stronger model grading a weaker model's draft.	−Without a hard iteration cap and a stop condition tied to a delta in the verdict, a stuck loop will spend the entire context window oscillating between two near-identical drafts.
+Prefer it when the user is willing to trade two-to-five times the latency and tokens for a measurable lift in correctness on a high-value request, not for low-stakes chat where the first draft is good enough.
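
The hard cap and the delta-based stop condition named in the last "Avoid when" row can be sketched as one small pure function. This is a minimal sketch under assumptions: `LoopStep`, `jaccard`, `shouldStop`, the word-overlap similarity measure, and the 0.9 threshold are all hypothetical choices for illustration, not part of any SDK.

```typescript
// Hypothetical record of one loop iteration (not part of any SDK).
type LoopStep = { draft: string; pass: boolean };
type StopReason = 'pass' | 'cap' | 'stalled' | null;

// Token-level Jaccard similarity: 1 = identical word sets, 0 = no shared words.
function jaccard(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const tb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (ta.size === 0 && tb.size === 0) return 1;
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  return shared / (ta.size + tb.size - shared);
}

// Returns why the loop should stop, or null if another iteration is justified.
function shouldStop(history: LoopStep[], maxAttempts: number, simThreshold = 0.9): StopReason {
  const last = history[history.length - 1];
  if (!last) return null; // nothing generated yet
  if (last.pass) return 'pass';
  if (history.length >= maxAttempts) return 'cap';
  // Stalled: the newest draft is nearly the same as any earlier one,
  // which covers both "no change" and A -> B -> A oscillation.
  const repeats = history
    .slice(0, -1)
    .some((prev) => jaccard(last.draft, prev.draft) >= simThreshold);
  return repeats ? 'stalled' : null;
}
```

Calling `shouldStop` after each critic verdict keeps the exit logic in one place; a production loop would likely want a better similarity measure than word overlap.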

In the wild

Source	Claim
github.com →	Anthropic publishes a runnable evaluator_optimizer notebook in their cookbook that wires a generator and an evaluator around a single task and loops until the evaluator returns PASS. It serves as a reference implementation for the workflow they describe in the effective-agents essay.
blog.langchain.dev →	LangChain documents a Reflection-style agent in their reflection-agents post that pairs a generator with a reflector inside a single LangGraph run, looping until the reflector stops returning critique — the within-attempt shape this pattern names.
ai-sdk.dev →	The Vercel AI SDK Agents primitive ships stopWhen and prepareStep hooks so a step can read a tool result, judge it, and either accept the output or schedule another generation pass, expressing the Evaluator-Optimizer loop as a step controller rather than an outer while.

Reader gotcha

Madaan et al. report that Self-Refine helps little on tasks the base model cannot already nearly solve: on math reasoning, iterative critique yields little or no gain over a single pass, because the model cannot reliably tell a wrong answer from a right one and its feedback tends to approve the draft as it stands. Treat the loop as an amplifier of an existing capability, not a way to create one, and measure base-model accuracy on the eval set before adding iterations.
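
The advice to measure base-model accuracy first can be made concrete with a small comparison harness. This is a sketch under assumptions: `Example`, `Solver`, `accuracy`, `loopIsWorthIt`, exact-match grading, and the 0.02 minimum lift are hypothetical. The solvers are synchronous stubs to keep the example short; real model calls would be asynchronous.

```typescript
// Hypothetical eval-set shape; `Solver` stands in for any single-pass or looped pipeline.
type Example = { input: string; expected: string };
type Solver = (input: string) => string;

// Exact-match accuracy over the eval set (an assumed grading rule).
function accuracy(examples: Example[], solve: Solver): number {
  if (examples.length === 0) return 0;
  const correct = examples.filter(
    (ex) => solve(ex.input).trim() === ex.expected.trim(),
  ).length;
  return correct / examples.length;
}

// Adopt the loop only if it beats the single-pass baseline by at least minLift.
function loopIsWorthIt(
  examples: Example[],
  singlePass: Solver,
  withLoop: Solver,
  minLift = 0.02,
) {
  const base = accuracy(examples, singlePass);
  const looped = accuracy(examples, withLoop);
  const lift = looped - base;
  return { base, looped, lift, adopt: lift >= minLift };
}
```

A very low `base` is the warning sign from the paragraph above: the loop has little existing capability to amplify.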

Implementation sketch

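import { generateText, generateObject } from 'ai'
import { openai } from '@ai-sdk/openai'
import { z } from 'zod'

// Structured verdict the critic must return.
const Verdict = z.object({
  pass: z.boolean(),
  critique: z.string(),
})

export async function refine(task: string, maxAttempts = 4): Promise<string> {
  let draft = ''
  let critiqueLog = ''
  // Hard iteration cap: the loop runs at most maxAttempts times.
  for (let i = 0; i < maxAttempts; i++) {
    // Generator: sees the task, its previous draft, and all critique so far.
    const { text } = await generateText({
      model: openai('gpt-4o'),
      prompt: `Task: ${task}\nPrevious draft:\n${draft || '(none)'}\nPrior critique:\n${critiqueLog || '(none)'}\nRevised draft:`,
    })
    draft = text
    // Critic: a smaller model from the same family, for illustration only.
    // Where self-ratification is a risk, swap in a different model family
    // or a tool-grounded check.
    const { object: verdict } = await generateObject({
      model: openai('gpt-4o-mini'),
      schema: Verdict,
      prompt: `Task: ${task}\nDraft: ${draft}\nReturn pass=true only if the draft fully satisfies the task.`,
    })
    if (verdict.pass) return draft
    critiqueLog += `\nIteration ${i + 1}: ${verdict.critique}`
  }
  // Cap reached: return the last draft even though it never passed.
  return draft
}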
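
The Decision table prefers a tool-grounded check over an LLM critic that may ratify the generator. A minimal sketch of such a critic for JSON output follows, returning the same `{ pass, critique }` shape as the verdict above so it could stand in for the `generateObject` call. The name `jsonCritic` and the required-keys check are hypothetical; this only checks key presence and is not a full schema validator.

```typescript
// Same shape as the verdict in the sketch above, declared here as a plain type.
type Verdict = { pass: boolean; critique: string };

// Deterministic critic: parse the draft as JSON and check that required keys exist.
function jsonCritic(draft: string, requiredKeys: string[]): Verdict {
  let parsed: unknown;
  try {
    parsed = JSON.parse(draft);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { pass: false, critique: `Draft is not valid JSON: ${reason}` };
  }
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    return { pass: false, critique: 'Draft must be a JSON object at the top level.' };
  }
  const obj = parsed as Record<string, unknown>;
  const missing = requiredKeys.filter(
    (k) => !Object.prototype.hasOwnProperty.call(obj, k),
  );
  if (missing.length > 0) {
    return { pass: false, critique: `Missing required keys: ${missing.join(', ')}` };
  }
  return { pass: true, critique: '' };
}
```

Because the critique names the exact defect, the next generator prompt gets a concrete instruction rather than a vague objection.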

First-party TS SDK

LangChain
LangGraph
Vercel AI SDK

References

PAPER · Self-Refine: Iterative Refinement with Self-Feedback
Madaan et al. · 2023 · NeurIPS 2023 · DOI: 10.48550/arXiv.2303.17651
foundational measurement of within-attempt critic loops; documents the base-capability ceiling
PAPER · Reflexion: Language Agents with Verbal Reinforcement Learning
Shinn et al. · 2023 · NeurIPS 2023 · DOI: 10.48550/arXiv.2303.11366
across-attempt cousin; cited here for the contrast that decides which loop a deployment wants
PAPER · CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
Gou et al. · 2023 · ICLR 2024 · DOI: 10.48550/arXiv.2305.11738
tool-grounded critic variant; addresses the same-model self-ratification failure mode
ESSAY · Building Effective Agents
Anthropic · 2024
names the workflow Evaluator-Optimizer and lists the conditions under which it pays its cost
BOOK · Agentic Design Patterns, Chapter 4: Reflection
Antonio Gulli · 2026 · Springer · pp. 56–68
DOCS · Anthropic Cookbook: evaluator_optimizer notebook
Anthropic · 2024 · accessed 2026-05-04
reference implementation of the workflow described in the essay
DOCS · Vercel AI SDK: Agents (multi-step calls with stopWhen)
Vercel · 2025 · accessed 2026-05-04
production wiring of the generator-critic loop as a step controller

Overview · 1-paragraph mechanism

Evaluator-Optimizer wraps a single task in a generator-critic loop. A first model call drafts an answer, a second call scores that draft against a written rubric and emits a structured judgement, and a third call regenerates the answer with the critique appended to the prompt. The loop runs until the judgement clears a threshold or a hard iteration cap fires. The critic is whatever the loop wires in: an LLM prompted with a list of acceptance criteria, a unit-test runner, an external API check, or another LLM tasked with finding a flaw. The optimizer is the same generator the loop started with, prompted again with everything it has produced so far plus the verdict on why that output failed.
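
One way to picture a structured judgement that "clears a threshold" is a weighted rubric score. The sketch below is an assumption, not the method of any cited source: `Criterion`, `Judgement`, `weightedScore`, `decide`, the [0, 1] score range, and the 0.8 threshold are all hypothetical.

```typescript
// Hypothetical rubric: each criterion carries a weight; the critic scores each in [0, 1].
type Criterion = { id: string; description: string; weight: number };
type Judgement = {
  scores: Partial<Record<string, number>>;
  notes: Partial<Record<string, string>>;
};

// Weighted average of the per-criterion scores; unscored criteria count as 0.
function weightedScore(rubric: Criterion[], j: Judgement): number {
  const total = rubric.reduce((sum, c) => sum + c.weight, 0);
  if (total === 0) return 0;
  const earned = rubric.reduce((sum, c) => sum + c.weight * (j.scores[c.id] ?? 0), 0);
  return earned / total;
}

function decide(rubric: Criterion[], j: Judgement, threshold = 0.8) {
  const score = weightedScore(rubric, j);
  // Feed back only the criteria that fell short, so the next prompt lists concrete defects.
  const critique = rubric
    .filter((c) => (j.scores[c.id] ?? 0) < threshold)
    .map((c) => `${c.description}: ${j.notes[c.id] ?? 'no note given'}`)
    .join('\n');
  return { pass: score >= threshold, score, critique };
}
```

The `critique` string is what would be appended to the generator prompt on the next iteration.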

Background · context and trade-offs

The pattern earns its cost only when the rubric lets a critic catch defects that the generator would not fix on its own in a single shot. Translation between languages, code that has to compile, structured outputs that must validate against a schema, and long-form writing with explicit style constraints all expose enough surface for a separate critic to catch what the writer missed. The loop has run its course when the critique from iteration N flips from listing concrete defects to producing diminishing or contradictory feedback; that is the operational signal to stop refining and ship.
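
The stop signal described above can be approximated if each critique is reduced to a list of defect labels. This is one possible reading, sketched under assumptions: the labels, the `Signal` type, and `convergenceSignal` are hypothetical. "Diminishing" is interpreted here as the defect count no longer dropping, and "contradictory" as a previously cleared defect resurfacing.

```typescript
type Signal = 'keep-refining' | 'diminishing' | 'contradictory';

// Input: one array of defect labels per iteration, oldest first (assumed critic output).
function convergenceSignal(defectsPerIteration: string[][]): Signal {
  const n = defectsPerIteration.length;
  if (n < 2) return 'keep-refining';
  const current = defectsPerIteration[n - 1] ?? [];
  const previous = defectsPerIteration[n - 2] ?? [];
  const earlier = defectsPerIteration.slice(0, n - 2);
  // Contradictory: a defect absent last round but raised before has come back.
  const resurfaced = current.some(
    (d) =>
      previous.indexOf(d) === -1 &&
      earlier.some((round) => round.indexOf(d) !== -1),
  );
  if (resurfaced) return 'contradictory';
  // Diminishing: the defect count did not drop between the last two rounds.
  if (current.length >= previous.length) return 'diminishing';
  return 'keep-refining';
}
```

Either non-refining signal is a cue to ship the best draft so far rather than spend another iteration.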

Evaluator-Optimizer is regularly confused with Reflexion, and the difference is durability. Evaluator-Optimizer is the within-attempt loop: one task, the critique lives inside that task, the buffer is discarded when the answer ships. Reflexion is the across-attempt loop: the lesson written after a failed attempt persists into the next encounter with the same task class. Anthropic frames the within-attempt variant as one of five agentic workflows in their effective-agents essay; Self-Refine is the academic measurement of the same shape. Reach for Evaluator-Optimizer when the next call is the same job, not the next instance of that job.
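
The durability difference comes down to where the critique is stored. The sketch below contrasts the two scopes under stated assumptions: `AttemptFn`, `evaluatorOptimizer`, `reflexionStyle`, and the `lessons` map are hypothetical, and the attempt function is a synchronous stub standing in for a generate-then-evaluate step. It does not reproduce the Reflexion paper's algorithm, only the memory scoping.

```typescript
// Hypothetical attempt: generate with the given notes, then evaluate the result.
type AttemptFn = (task: string, notes: string[]) => { answer: string; ok: boolean; critique: string };

// Within-attempt loop: notes live only for the duration of one call.
function evaluatorOptimizer(task: string, attempt: AttemptFn, maxAttempts = 3): string {
  const notes: string[] = []; // discarded when the function returns
  let answer = '';
  for (let i = 0; i < maxAttempts; i++) {
    const result = attempt(task, notes);
    answer = result.answer;
    if (result.ok) break;
    notes.push(result.critique);
  }
  return answer;
}

// Across-attempt memory: lessons are keyed by task class and survive between calls.
const lessons = new Map<string, string[]>();

function reflexionStyle(taskClass: string, task: string, attempt: AttemptFn): string {
  const memory = lessons.get(taskClass) ?? [];
  const result = attempt(task, memory);
  if (!result.ok) lessons.set(taskClass, memory.concat(result.critique));
  return result.answer;
}
```

Calling `evaluatorOptimizer` twice on the same task repeats the same first-draft mistake, while the second `reflexionStyle` call for the same task class starts with the stored lesson.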