Evaluation (LLM-as-Judge)
Also known as: LLM-as-a-Judge, Model-graded Evaluation, Automatic Evaluator
Scores model outputs with a stronger LLM applying a written rubric.
Claude Code
- Dispatch the judge as a separate subagent in a fresh context window — same-model self-review inflates scores structurally.
- Write the rubric as a SKILL.md the judge subagent loads at invocation; version it in git so rubric changes are auditable.
- Use a stronger model tier for the judge than the implementer when possible; cross-tier review breaks the shared-prior bias.
- Gate the merge queue on the judge's structured verdict; require at least one approval before any PR advances to merge.
Primitives
Related patterns
Cursor
- Open a second Agent chat as the judge and pass the candidate output via @file; the judge reads it cold without the implementer's framing.
- Write the evaluation rubric in a .cursor/rules/*.mdc file; reference it explicitly in the judge chat opening prompt.
- Use a different model in the judge chat than the implementer chat when Cursor's model picker allows; this reduces same-model self-preference.
- Record the judge's verdict in a file and reference it via @file in the implementer chat to close the feedback loop.
Decision
| Use when ✓ | Avoid when ✗ |
|---|---|
| Use this when collecting human preferences is the bottleneck on every release and a frontier model agrees with majority human judgement closely enough to gate CI on its score. | When the judge and the candidate share a model family or training data, self-preference inflates scores even when the candidate is worse. Pick a judge from a different family or a stronger tier. |
| Justified where the rubric can be written down (helpfulness against an instruction, faithfulness to retrieved context, format conformance, safety policy adherence) and the judge can read both the criteria and the candidate in one call. | When the rubric criteria require ground truth the judge cannot verify (live database state, proprietary numeric facts, code that must execute), a tool-grounded check or unit test produces a more honest verdict. |
| A good fit when you need pairwise leaderboards or A/B regression tracking across many candidate models, prompts, or RAG configurations on the same eval set. | Without a sampled human-audit loop calibrated to the eval set, judge drift across model snapshots silently changes the regression baseline and the leaderboard becomes unfalsifiable. |
| Best when the task admits a structured verdict (score, label, ranking) the harness can aggregate, and a written rationale a reviewer can audit when a number looks wrong. | |
In the wild
| Source | Claim |
|---|---|
| lmarena.ai | LMSYS Chatbot Arena collects pairwise human votes between anonymous model responses and reports an Elo leaderboard; the accompanying paper by Zheng et al. validates GPT-4 as a judge whose pairwise verdicts agree with the human majority at roughly the rate two humans agree with each other. |
| tatsu-lab.github.io | AlpacaEval scores instruction-following models with an LLM judge that compares each candidate response against a reference from text-davinci-003 across 805 prompts, publishing a length-controlled win-rate leaderboard that tracks closely with Arena rankings. |
| docs.smith.langchain.com | LangSmith documents an LLM-as-judge evaluator type that runs a judge prompt over each candidate trace and writes the structured score back to the experiment, the canonical wiring of this pattern in a production observability stack. |
Reader gotcha
Zheng et al. measured a position bias in pairwise judging: GPT-4 prefers the response shown first in roughly 60 percent of ties, enough to flip leaderboard rankings if the order is fixed. The mitigation they recommend is running each comparison twice with swapped order and counting only consistent verdicts, which doubles judge cost but is the difference between a defensible benchmark and a confounded one.
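The swapped-order mitigation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `PairwiseJudge` signature, the `Verdict` and `Outcome` labels, and the function names are all hypothetical, and the judge is injected so any model call can sit behind it.

```typescript
// What one judge call reports: which displayed slot won, or a tie.
type Verdict = 'first' | 'second' | 'tie'
// What survives after both orderings are compared.
type Outcome = 'A' | 'B' | 'tie' | 'inconsistent'

// Hypothetical judge signature: sees two responses in display order.
type PairwiseJudge = (prompt: string, first: string, second: string) => Promise<Verdict>

// `ab` was judged with A shown first, `ba` with B shown first.
function reconcile(ab: Verdict, ba: Verdict): Outcome {
  if (ab === 'first' && ba === 'second') return 'A' // A won from both slots
  if (ab === 'second' && ba === 'first') return 'B' // B won from both slots
  if (ab === 'tie' && ba === 'tie') return 'tie'    // tie in both orderings
  return 'inconsistent' // verdict followed the slot, not the content
}

// Runs the comparison in both orders (two judge calls) and reconciles them.
async function judgeBothOrders(
  judge: PairwiseJudge,
  prompt: string,
  a: string,
  b: string,
): Promise<Outcome> {
  const [ab, ba] = await Promise.all([judge(prompt, a, b), judge(prompt, b, a)])
  return reconcile(ab, ba)
}
```

A judge that always picks the first slot yields `'inconsistent'` for every pair, so its votes never reach the leaderboard. How to treat inconsistent pairs (drop them or count them as ties) is a design choice this sketch leaves to the harness.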
Implementation sketch
import { generateObject } from 'ai'
import { openai } from '@ai-sdk/openai'
import { z } from 'zod'

// Structured verdict the eval harness can aggregate.
// `rationale` is listed before `score` so the judge writes its reasoning first.
const Judgement = z.object({
  rationale: z.string(),
  score: z.number().int().min(1).max(5),
  flags: z.array(z.enum(['off-topic', 'unsafe', 'malformed'])).default([]),
})

// Anchor points of the 1-5 scale, joined into one block for the system prompt.
const RUBRIC = [
  '5 = fully answers the prompt with correct, well-supported claims.',
  '3 = partially correct or missing key context.',
  '1 = wrong, evasive, or off-topic.',
].join('\n')

// Scores one candidate response against the rubric and returns the parsed verdict.
export async function judge(prompt: string, candidate: string) {
  const { object } = await generateObject({
    model: openai('gpt-4o'),
    schema: Judgement,
    system: `You are an impartial grader. Score using this rubric:\n${RUBRIC}\nReason step by step before assigning the score.`,
    prompt: `PROMPT:\n${prompt}\n\nCANDIDATE:\n${candidate}`,
  })
  return object
}
LLM-as-Judge replaces human raters with a stronger language model that reads a candidate output, applies a written rubric, and emits a score and a justification. The judge prompt fixes the criteria the rater is allowed to consider (helpfulness, factuality, instruction-following, safety, format conformance) and constrains the response to a structured object the eval harness can aggregate. The pattern earns its keep when collecting human preferences would gate every release: a frontier model paid per million tokens reproduces majority human judgement closely enough that running thousands of comparisons becomes a CI step rather than a quarterly contract.
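To make the "CI step" concrete, here is a minimal sketch of how a harness might aggregate structured verdicts into a pass/fail gate. The `Judgement` shape mirrors the score, rationale, and flags fields used in the implementation sketch; the `gate` function, the `GateResult` type, and both thresholds are assumptions chosen for illustration, not recommended values.

```typescript
// Shape of one judge verdict (score on a 1-5 scale, free-text rationale, flags).
type Judgement = { score: number; rationale: string; flags: string[] }

type GateResult = { pass: boolean; meanScore: number; flagged: number }

// Hypothetical CI gate: passes when the mean score reaches `minMean`
// and no more than `maxFlagged` verdicts carry any flag.
function gate(judgements: Judgement[], minMean = 4, maxFlagged = 0): GateResult {
  if (judgements.length === 0) throw new Error('no judgements to aggregate')
  const total = judgements.reduce((sum, j) => sum + j.score, 0)
  const meanScore = total / judgements.length
  const flagged = judgements.filter((j) => j.flags.length > 0).length
  return { pass: meanScore >= minMean && flagged <= maxFlagged, meanScore, flagged }
}
```

A CI job could call `gate` on the verdicts for the whole eval set and exit non-zero when `pass` is false, keeping the rationales in the build log for review.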
Background · context and trade-offs
Two judging shapes dominate. Pairwise comparison shows the judge two candidate responses to the same prompt and asks which is better; aggregating many such comparisons into a Bradley-Terry or Elo rating produces the leaderboard form Chatbot Arena and AlpacaEval popularised. Single-answer scoring asks the judge to rate one response on a fixed numeric scale against an explicit rubric, the shape most production teams adopt for regression tracking. Both inherit the same operational hazards: position bias when the order of A and B leaks into the verdict, length bias when verbosity is rewarded, self-preference when the judge shares weights with the candidate, and rubric drift when the criteria text is edited without rerunning prior baselines.
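As a rough illustration of the aggregation step, the sketch below turns a list of pairwise verdicts into Elo ratings using the standard logistic expected-score formula. The `Match` type, the function names, and the constants (`k = 32`, starting rating 1000) are assumptions for this example and are not taken from Chatbot Arena or AlpacaEval.

```typescript
// One resolved pairwise comparison between two candidate systems.
type Match = { winner: string; loser: string }

// Expected score of a system rated `ra` against one rated `rb`.
function expectedScore(ra: number, rb: number): number {
  return 1 / (1 + Math.pow(10, (rb - ra) / 400))
}

// Sequential Elo updates over the matches, in the order given.
function eloRatings(matches: Match[], k = 32, initial = 1000): Record<string, number> {
  const ratings: Record<string, number> = {}
  for (const { winner, loser } of matches) {
    const rw = ratings[winner] ?? initial
    const rl = ratings[loser] ?? initial
    // The winner gains what the loser gives up, scaled by how surprising the win was.
    const delta = k * (1 - expectedScore(rw, rl))
    ratings[winner] = rw + delta
    ratings[loser] = rl - delta
  }
  return ratings
}
```

Sequential Elo depends on the order in which matches are processed, whereas a Bradley-Terry fit estimates all strengths jointly from the full set of comparisons, so the latter is usually the more stable choice for a static eval set.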
This pattern addresses the general form of the problem that the Reflexion gotcha names as one instance: a same-model judge systematically agrees with its own output, so the judge must come from a different family or a stronger tier than the model it scores. G-Eval adds chain-of-thought rubrics that elicit per-criterion reasoning before the score, which raises Spearman correlation with human judgement on summarisation but does not fix bias on tasks where the judge itself is unreliable. Production teams treat the judge prompt as versioned code, pin the judge to a fixed model snapshot, and audit a sampled fraction of judgements against human raters to detect drift before a regression ships.
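The sampled human audit can be sketched as a simple agreement check. This is one possible shape under stated assumptions: the `AuditPair` type, the function names, the tolerance of one scale point, and the 0.8 agreement floor are all hypothetical, and a real audit might prefer a rank correlation or a chance-corrected statistic.

```typescript
// One audited item: the judge's score and the human rater's score on the same scale.
type AuditPair = { judge: number; human: number }

// Fraction of audited items where judge and human differ by at most `tolerance`.
function agreementRate(pairs: AuditPair[], tolerance = 0): number {
  if (pairs.length === 0) throw new Error('audit sample is empty')
  const agree = pairs.filter((p) => Math.abs(p.judge - p.human) <= tolerance).length
  return agree / pairs.length
}

// Flags drift when agreement on the audit sample falls below the floor.
function hasDrifted(pairs: AuditPair[], minAgreement = 0.8, tolerance = 1): boolean {
  return agreementRate(pairs, tolerance) < minAgreement
}
```

Running the same check on the same audit sample before and after a judge snapshot or rubric change gives a concrete signal that the baseline has moved.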