Layer 1 — Topology / Control Flow

Evaluator-Optimizer

Also known as: Generator-Critic Loop, Self-Refine, Iterative Refinement

Generate a draft, score it against a rubric, refine until the critic stops complaining.

Figure: a flowchart in which a Task feeds a Generator that drafts an answer; the draft is passed to an Evaluator that scores it against a rubric. If the verdict passes or attempts are exhausted, the loop returns the final answer; otherwise the critique is appended to the prompt and the Generator runs again.

Decision

Use when ✓

  • Apply when the acceptance criteria can be written down and a separate critic can check the draft against them more reliably than the generator can self-correct in one shot.
  • Use where a single failed iteration is cheap and observable — code that has to compile, JSON that has to validate against a schema, translations a back-translation step can grade, prose that must hit a style or length target.
  • Reach for it when the critic has access to a signal the generator does not — a unit-test runner, a type checker, a search-grounded fact check, or a stronger model graded against a weaker one.
  • Prefer it when the user is willing to trade two-to-five times the latency and tokens for a measurable lift in correctness on a high-value request, not for low-stakes chat where the first draft is good enough.

Avoid when ✗

  • When the rubric collapses to "looks good": the critic invents work, the loop never converges, and the bill grows linearly with iteration count without a quality signal to justify it.
  • When the generator and critic share weights and prompt context, the second call tends to ratify the first — pick a different model family, a tool-grounded check, or fall back to LLM-as-Judge where the bias is at least measured.
  • Without a hard iteration cap and a stop condition tied to a delta in the verdict, a stuck loop will spend the entire context window oscillating between two near-identical drafts.
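One way to ground the critic in a signal the generator lacks is to make the evaluator pure code rather than a second model call. The sketch below is a minimal illustration (all names are mine, and the generator is injected as a plain function so the control flow can be exercised without a model or network): the rubric is "parses as JSON and contains every required key", so the critique is exactly the parse error or the missing fields, and the critic cannot invent work.

```typescript
type Verdict = { pass: boolean; critique: string }

// Deterministic critic: the rubric is "valid JSON with every required key".
function evaluateDraft(draft: string, requiredKeys: string[]): Verdict {
  let parsed: unknown
  try {
    parsed = JSON.parse(draft)
  } catch (e) {
    return { pass: false, critique: `not valid JSON: ${(e as Error).message}` }
  }
  const obj = parsed as Record<string, unknown>
  const missing = requiredKeys.filter((k) => !(k in obj))
  return missing.length
    ? { pass: false, critique: `missing keys: ${missing.join(', ')}` }
    : { pass: true, critique: '' }
}

// Generic loop: `generate` would be a model call in production; it receives the
// latest critique and returns a fresh draft.
function refineUntilValid(
  generate: (critique: string) => string,
  requiredKeys: string[],
  maxAttempts = 4,
): { draft: string; attempts: number; pass: boolean } {
  let critique = ''
  let draft = ''
  for (let i = 1; i <= maxAttempts; i++) {
    draft = generate(critique)
    const verdict = evaluateDraft(draft, requiredKeys)
    if (verdict.pass) return { draft, attempts: i, pass: true }
    critique = verdict.critique
  }
  // Cap fired: return the latest draft with pass=false rather than loop forever.
  return { draft, attempts: maxAttempts, pass: false }
}
```

Because the verdict is computed, a stuck generator exhausts the cap and exits with `pass: false` instead of oscillating on vague feedback.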

In the wild

  • github.com · Anthropic publishes a runnable evaluator_optimizer notebook in their cookbook that wires a generator and an evaluator around a single task and loops until the evaluator returns PASS, the canonical reference implementation for the workflow they describe in the effective-agents essay.
  • blog.langchain.dev · LangChain documents a Reflection-style agent in their reflection-agents post that pairs a generator with a reflector inside a single LangGraph run, looping until the reflector stops returning critique — the within-attempt shape this pattern names.
  • ai-sdk.dev · The Vercel AI SDK Agents primitive ships stopWhen and prepareStep hooks so a step can read a tool result, judge it, and either accept the output or schedule another generation pass, expressing the Evaluator-Optimizer loop as a step controller rather than an outer while.
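The step-controller shape can be sketched without the SDK. The fragment below is a hand-rolled stand-in, not the actual `stopWhen`/`prepareStep` API; it only mirrors their roles: one hook decides whether the run is done by inspecting prior steps, the other derives the next prompt from them.

```typescript
type Step = { output: string }

type Controller = {
  // mirrors a stop predicate: true once the run should end
  stopWhen: (steps: Step[]) => boolean
  // mirrors a prepare hook: derive the next prompt from prior steps
  prepareStep: (steps: Step[]) => string
}

// Drive an injected `runStep` (a model call in production) under the controller,
// with a hard cap so a never-satisfied predicate cannot run forever.
function runLoop(
  runStep: (prompt: string) => Step,
  controller: Controller,
  maxSteps = 5,
): Step[] {
  const steps: Step[] = []
  while (steps.length < maxSteps && !controller.stopWhen(steps)) {
    steps.push(runStep(controller.prepareStep(steps)))
  }
  return steps
}
```

A controller whose stop predicate accepts any step beginning with "PASS" and whose prepare hook folds the last critique into the next prompt is the evaluator-optimizer loop expressed as step control rather than an outer while.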

Reader gotcha

Madaan et al. report that Self-Refine fails on tasks the base model cannot already nearly solve: when GPT-3.5 is the generator, iterative critique on math reasoning underperforms a single zero-shot pass because the model cannot reliably tell a wrong answer from a right one. Treat the loop as an amplifier of an existing capability, not a way to create one — and measure base-model accuracy on the eval set before adding iterations (source).
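The "measure first" advice reduces to one comparison: pass rate of the zero-shot run versus pass rate of the looped run over the same eval set. A toy harness (names illustrative; the boolean arrays would come from grading real model outputs):

```typescript
// Fraction of eval items that passed.
function passRate(results: boolean[]): number {
  return results.length ? results.filter(Boolean).length / results.length : 0
}

// Positive lift means iteration is amplifying an existing capability; a zero or
// negative delta is the Self-Refine failure mode and the loop should be cut.
function refinementLift(zeroShot: boolean[], refined: boolean[]): number {
  return passRate(refined) - passRate(zeroShot)
}
```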

Implementation sketch

import { generateText, generateObject } from 'ai'
import { openai } from '@ai-sdk/openai'
import { z } from 'zod'

// Structured verdict the evaluator must emit: a hard pass/fail plus the
// critique that is fed back into the next generation pass.
const Verdict = z.object({
  pass: z.boolean(),
  critique: z.string(),
})

export async function refine(task: string, maxAttempts = 4): Promise<string> {
  let draft = ''
  let critiqueLog = ''
  for (let i = 0; i < maxAttempts; i++) {
    // Generator: redraft with every prior critique in context.
    const { text } = await generateText({
      model: openai('gpt-4o'),
      prompt: `Task: ${task}\nPrior critique:\n${critiqueLog || '(none)'}\nDraft:`,
    })
    draft = text
    // Evaluator: a cheaper model scores the draft against the rubric.
    const { object: verdict } = await generateObject({
      model: openai('gpt-4o-mini'),
      schema: Verdict,
      prompt: `Task: ${task}\nDraft: ${draft}\nReturn pass=true only if the draft fully satisfies the task.`,
    })
    if (verdict.pass) return draft
    critiqueLog += `\nIteration ${i + 1}: ${verdict.critique}`
  }
  // Iteration cap fired: ship the latest draft rather than loop forever.
  return draft
}
First-party TS SDKs: LangChain · LangGraph · Vercel AI SDK

References

  1. Madaan et al. · 2023 · NeurIPS 2023 · DOI: 10.48550/arXiv.2303.17651

     foundational measurement of within-attempt critic loops; documents the base-capability ceiling

  2. Shinn et al. · 2023 · NeurIPS 2023 · DOI: 10.48550/arXiv.2303.11366

     across-attempt cousin; cited here for the contrast that decides which loop a deployment wants

  3. Gou et al. · 2023 · ICLR 2024 · DOI: 10.48550/arXiv.2305.11738

     tool-grounded critic variant; addresses the same-model self-ratification failure mode

  4. Anthropic · 2024

     names the workflow Evaluator-Optimizer and lists the conditions under which it pays its cost

  5. Antonio Gulli · 2026 · Springer · pp. 5668

  6. Anthropic · 2024 · accessed

     reference implementation of the workflow described in the essay

  7. Vercel · 2025 · accessed

     production wiring of the generator-critic loop as a step controller

Overview · 1-paragraph mechanism

Evaluator-Optimizer wraps a single task in a generator-critic loop. A first model call drafts an answer, a second call scores that draft against a written rubric and emits a structured judgement, and a third call regenerates the answer with the critique appended to the prompt. The loop runs until the judgement clears a threshold or a hard iteration cap fires. The critic is what the prompt makes it: a list of acceptance criteria, a unit test runner, an external API check, or another LLM tasked with finding a flaw. The optimizer is the same generator the loop started with, prompted again with everything it produced so far plus the verdict on why that output failed.

Background · context and trade-offs

The pattern only earns its cost when the rubric is sharper than what the generator can self-correct in a single shot. Translation between languages, code that has to compile, structured outputs that must validate against a schema, and long-form writing with explicit style constraints all expose enough surface for a separate critic to catch what the writer missed. The loop converges when the critique from iteration N flips from listing concrete defects to producing diminishing or contradictory feedback — the operational signal to stop refining and ship.
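The convergence signal described above can be made mechanical: stop when the critique from iteration N mostly repeats iteration N−1, since further passes are oscillation, not progress. A minimal sketch (my own stand-in heuristic, using word-overlap Jaccard similarity as the delta; production code might compare rubric scores instead):

```typescript
// Jaccard similarity over lowercase word sets; 1.0 means identical critiques.
function critiqueSimilarity(a: string, b: string): number {
  const wa = new Set(a.toLowerCase().split(/\s+/).filter(Boolean))
  const wb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean))
  if (wa.size === 0 && wb.size === 0) return 1
  let shared = 0
  for (const w of wa) if (wb.has(w)) shared++
  return shared / (wa.size + wb.size - shared)
}

// Stop refining once two consecutive critiques mostly repeat each other.
function shouldStop(critiques: string[], threshold = 0.8): boolean {
  if (critiques.length < 2) return false
  const [prev, last] = critiques.slice(-2)
  return critiqueSimilarity(prev, last) >= threshold
}
```

Wired into the loop, this check sits alongside the hard iteration cap: either one firing ends the run and ships the latest draft.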

Evaluator-Optimizer is regularly confused with Reflexion, and the difference is durability. Evaluator-Optimizer is the within-attempt loop: one task, the critique lives inside that task, the buffer is discarded when the answer ships. Reflexion is the across-attempt loop: the lesson written after a failed attempt persists into the next encounter with the same task class. Anthropic frames the within-attempt variant as one of five agentic workflows in their effective-agents essay; Self-Refine is the academic measurement of the same shape. Reach for Evaluator-Optimizer when the next call is the same job, not the next instance of that job.