Multi-Agent Debate
Also known as: LLM Debate, Multiagent Debate, Society of Minds
Runs N agents on the same prompt, exposes them to each other's answers, and votes.
Claude Code
- Dispatch N debater subagents in a single message — they produce independent first-round answers without seeing each other.
- Write each debater's first-round output to a named disk file; the second round reads all files and produces a revised answer.
- Assign distinct roles or temperatures to debaters via their `SKILL.md` or brief; debaters given identical prompts tend to converge rather than debate.
- Run the final aggregation as a separate judge subagent that reads all revised answers cold, without the debate transcript in context.
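The last step above, a judge that reads the revised answers cold, can be sketched as a small prompt builder. This is a minimal illustration, not a Claude Code API: the `RevisedAnswer` shape and the `buildJudgeBrief` helper are hypothetical names, and dropping debater names from the brief is an added assumption meant to keep the judge from anchoring on a persona.

```typescript
// Hypothetical shape for one debater's final output; not part of any tool's API.
interface RevisedAnswer {
  debater: string
  finalAnswer: string
  critiqueLog?: string // round-by-round transcript, deliberately withheld from the judge
}

// Build the judge's prompt from the revised answers only.
// Neither the critique log nor the debater names are included.
function buildJudgeBrief(question: string, revised: RevisedAnswer[]): string {
  const candidates = revised
    .map((r, i) => `Candidate ${i + 1}:\n${r.finalAnswer.trim()}`)
    .join('\n\n')
  return [
    `Question: ${question}`,
    candidates,
    'Pick the single best candidate. Reply with its number and a one-sentence reason.',
  ].join('\n\n')
}
```

The resulting string would be handed to the judge subagent as its entire brief, so nothing from the earlier rounds reaches its context.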
Cursor
- Open N independent Agent chats on the same question — each produces a first-round answer without seeing the others.
- Paste all first-round answers into a new Agent chat and ask for critique and revision; reference each via `@file` if saved to disk.
- Assign different perspectives or constraints to each debater chat via the opening prompt to break the degeneration-of-thought pattern.
- Use a final Agent session as the judge — provide only the closing statements, not the full debate transcript, to keep the verdict unbiased.
Decision
| Use when ✓ | Avoid when ✗ |
|---|---|
| +Best for cases where factual accuracy or reasoning depth matters more than latency, and a single rollout silently produces a confident wrong answer that peer disagreement would expose. | −When latency or cost dominates and an N-agent K-round debate inflates per-query spend by an order of magnitude over a self-consistency sample for a marginal accuracy gain. |
| +Justified where the task admits a single committable verdict (a numeric answer, a multiple-choice label, a yes/no decision) that a deterministic aggregator can collapse the round to. | −Without a deterministic aggregation rule (vote, judge, or longest-justification heuristic), the debate produces N divergent finals and the system has no contract for what to commit to. |
| +A good fit on questions a stronger model cannot reach but a panel of weaker models can, by triangulating between candidate explanations rather than averaging logits. | −When debaters share a single model snapshot and a single system prompt, errors stay correlated and the loop converges on the same wrong answer it started from rather than escaping it. |
| +Useful inside a scalable-oversight pipeline where opposing debaters surface contrasting evidence a less-capable judge can adjudicate more reliably than scoring one answer alone. | |
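The table leans on a deterministic aggregation rule. A minimal sketch of the simplest one, a majority vote over the debaters' final answers, might look like the following. The `normalise` convention and the tie-break (the earliest answer wins) are assumptions made for illustration; the source does not specify either.

```typescript
// Normalise so that "42", " 42 " and "42." count as the same vote (an assumed convention).
const normalise = (answer: string): string =>
  answer.trim().toLowerCase().replace(/[.\s]+$/, '')

// Return the most common normalised answer; on a tie, the one that appeared first wins.
function majorityVote(finalAnswers: string[]): string {
  if (finalAnswers.length === 0) throw new Error('no answers to aggregate')
  const counts: Record<string, number> = {}
  for (const a of finalAnswers) {
    const key = normalise(a)
    counts[key] = (counts[key] ?? 0) + 1
  }
  let best = ''
  let bestCount = 0
  // Walk the answers in their original order so that ties resolve deterministically.
  for (const a of finalAnswers) {
    const key = normalise(a)
    const n = counts[key] ?? 0
    if (n > bestCount) {
      best = key
      bestCount = n
    }
  }
  return best
}
```

A vote like this only works when the task has a short committable answer field; free-form answers would need a judge instead.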
In the wild
| Source | Claim |
|---|---|
| composable-models.github.io → | Du and colleagues publish a runnable reference implementation that wires three GPT-3.5 agents through two debate rounds on MMLU, GSM8K, and biographies, and report accuracy gains over chain-of-thought and self-consistency on the same prompts at the same compute envelope. |
| microsoft.github.io → | AutoGen ships a Teams primitive in its AgentChat layer that composes multiple AssistantAgents into a round-robin or selector group chat, the canonical framework wiring of the debate loop where each agent reads the conversation transcript before its next turn. |
| anthropic.com → | Anthropic's scalable-oversight programme runs debate as a research line in which two LLM debaters argue opposing positions on hard QA prompts and a weaker judge picks the winner, with measured judge accuracy rising as debater persuasiveness rises. |
Reader gotcha
Liang and colleagues observe that when every debater is the same model with the same system prompt, the agents capitulate to the most confident-sounding peer within a round or two and the panel converges on whichever answer happened to be phrased most assertively, regardless of correctness. They label the failure mode Degeneration-of-Thought and recommend assigning explicit antagonist roles or temperatures so the agents argue rather than agree. source
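One way to act on that recommendation is to generate a distinct configuration per debater rather than reusing one prompt. The sketch below is illustrative only: `DebaterConfig`, `makeDebaters`, the three persona strings, and the temperature range are all assumptions, not values taken from Liang and colleagues.

```typescript
interface DebaterConfig {
  id: number
  system: string
  temperature: number
}

// Illustrative personas; the source recommends distinct roles but does not prescribe these.
const PERSONAS: string[] = [
  'Argue for the answer you find most defensible and show your reasoning step by step.',
  'Act as a sceptic: find the weakest step in every peer argument and attack it.',
  'Act as a checker: recompute any figures or facts from scratch before agreeing with anyone.',
]

// Spread temperatures evenly across [minTemp, maxTemp] and cycle through the personas.
function makeDebaters(n: number, minTemp = 0.2, maxTemp = 1.0): DebaterConfig[] {
  return Array.from({ length: n }, (_, i) => ({
    id: i + 1,
    system: PERSONAS[i % PERSONAS.length],
    temperature: n === 1 ? minTemp : minTemp + ((maxTemp - minTemp) * i) / (n - 1),
  }))
}
```

Each config's `system` and `temperature` could then be passed alongside the prompt when calling the model, so that no two debaters start from the same prompt and sampling settings.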
Implementation sketch
```ts
import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

const model = openai('gpt-4o')

// One LLM call: send a prompt, resolve to the text of the reply.
const ask = (prompt: string) => generateText({ model, prompt }).then((r) => r.text)

// `rounds` counts the independent first round, so agents = 4 and rounds = 2
// cost eight debater calls plus one judge call.
export async function debate(question: string, agents = 3, rounds = 2): Promise<string> {
  // Round 1: independent answers, generated in parallel.
  let answers = await Promise.all(
    Array.from({ length: agents }, () => ask(`Answer concisely. Question: ${question}`)),
  )
  // Rounds 2..rounds: every agent reads the full set of answers and revises its own.
  for (let r = 1; r < rounds; r++) {
    const transcript = answers.map((a, i) => `Agent ${i + 1}: ${a}`).join('\n\n')
    answers = await Promise.all(
      answers.map((_, i) =>
        ask(`You are Agent ${i + 1}. Peers:\n${transcript}\n\nQuestion: ${question}\nCritique the peers, then revise your answer.`),
      ),
    )
  }
  // Judge call: sees only the final candidates, not the debate transcript.
  return ask(`Question: ${question}\nFinal candidates:\n${answers.join('\n---\n')}\nReturn the single best answer.`)
}
```
- AutoGen
- CrewAI
- LangGraph
Multi-Agent Debate runs N language-model instances on the same prompt in parallel, then exposes each agent to the others' answers and asks it to revise. The first round produces independent candidates that have not yet seen one another. The second round feeds every agent the full set of peer responses with an instruction to critique, defend, or update; agents may copy a peer's argument, contradict it, or merge fragments of several. The loop repeats for a fixed number of rounds, after which a deterministic aggregation (majority vote on the final answer field, or a separate judge over the closing statements) picks the verdict the system commits to.
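The second-round step described above can be illustrated with a small prompt builder. This is a sketch under assumptions: `buildRevisionPrompt` is a hypothetical helper, the choice to show an agent its own previous answer separately from its peers is one possible variant, and the `Final answer:` line is an assumed convention for the "final answer field" that a majority vote would later read.

```typescript
// Build the round-two prompt for one agent: its own prior answer, then every peer's answer.
// `self` is the zero-based index of the agent being prompted.
function buildRevisionPrompt(question: string, answers: string[], self: number): string {
  const peers = answers
    .map((a, i) => ({ a, i }))
    .filter(({ i }) => i !== self)
    .map(({ a, i }) => `Agent ${i + 1}: ${a}`)
    .join('\n\n')
  return [
    `You are Agent ${self + 1}. Your previous answer:\n${answers[self]}`,
    `Peer answers:\n${peers}`,
    `Question: ${question}`,
    'Critique the peer answers, then defend or update your own. End with a line "Final answer: <answer>".',
  ].join('\n\n')
}
```

Asking for a fixed final line keeps the free-form critique separate from the field the aggregator parses.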
Background · context and trade-offs
The mechanism earns its rent on tasks where errors are not correlated across independent samples but a single agent will not catch its own mistake. Du and colleagues report that debate raises factual accuracy on multi-hop arithmetic and trivia by margins self-consistency sampling does not match, because peer disagreement surfaces specific contradictions a same-prompt rollout would silently agree with. Liang and colleagues frame the same loop as a way out of Degeneration-of-Thought: the failure where a model commits to a confident-but-wrong answer and self-refinement only deepens the commitment. Cost is linear in agents and rounds: a four-agent, two-round debate makes eight LLM calls plus one judge call, against a single call for the baseline.
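The call-count arithmetic can be written down directly. The helper below is a hypothetical illustration (`debateCalls` is not from any library) and counts calls only, treating the independent first round as round one.

```typescript
// Number of LLM calls for a debate: every agent speaks once per round,
// plus one optional judge call at the end.
function debateCalls(agents: number, rounds: number, judge = true): number {
  return agents * rounds + (judge ? 1 : 0)
}

// Four agents over two rounds: eight debater calls plus one judge call.
const fourByTwo = debateCalls(4, 2)
```

Call count understates token cost somewhat, since every prompt after the first round also carries the other agents' answers.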
The pattern sits next to Orchestrator-Workers and Evaluation (LLM-as-Judge) but is distinct from both. Orchestrator-Workers fans a fixed plan out to specialists chosen for their competence; debate runs symmetric peers and lets the disagreement do the work. LLM-as-Judge scores a fixed slate of candidates against a rubric; debate dynamically generates and refines the candidates before any verdict is taken. Khan and colleagues separately show that when candidates argue opposing positions and a weaker judge picks the winner, judge accuracy on hard reading-comprehension rises with debater capability: evidence the pattern composes with scalable-oversight pipelines, not just self-consistency.