Multi-Agent Debate
Also known as: LLM Debate, Multiagent Debate, Society of Minds
Several agents argue, critique each other, and converge on a single answer.
Decision
| Use when ✓ | Avoid when ✗ |
|---|---|
| +Apply when factual accuracy or reasoning depth matters more than latency, and a single rollout silently produces a confident wrong answer that peer disagreement would expose. | −When latency or cost dominates and an N-agent K-round debate inflates per-query spend by an order of magnitude over a self-consistency sample for a marginal accuracy gain. |
| +Use where the task admits a single committable verdict — a numeric answer, a multiple-choice label, a yes/no decision — that a deterministic aggregator can collapse the round to. | −Without a deterministic aggregation rule — vote, judge, or longest-justification heuristic — the debate produces N divergent finals and the system has no contract for what to commit to. |
| +Reach for it on questions a stronger model cannot reach but a panel of weaker models can, by triangulating between candidate explanations rather than averaging logits. | −When debaters share a single model snapshot and a single system prompt, errors stay correlated and the loop converges on the same wrong answer it started from rather than escaping it. |
| +Prefer it inside a scalable-oversight pipeline where opposing debaters surface contrasting evidence a less-capable judge can adjudicate more reliably than scoring one answer alone. | |
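The deterministic-aggregator requirement in the left column can be as simple as a normalized majority vote over the agents' final answer fields. A minimal sketch — the `normalize` rule and tie-breaking choice are illustrative assumptions, not part of any cited implementation:

```typescript
// Majority vote over final answers: normalize each candidate, count
// occurrences, and commit to the most frequent form. Ties fall back to
// the first candidate's form, so the rule stays deterministic.
function majorityVote(finals: string[]): string {
  const normalize = (s: string) => s.trim().toLowerCase()
  const counts = new Map<string, number>()
  for (const f of finals) {
    const key = normalize(f)
    counts.set(key, (counts.get(key) ?? 0) + 1)
  }
  let best = normalize(finals[0])
  for (const [key, n] of counts) {
    if (n > (counts.get(best) ?? 0)) best = key
  }
  return best
}
```

This only works when the task admits a single committable verdict (a number, a label, a yes/no); free-form answers need a judge call instead.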
In the wild
| Source | Claim |
|---|---|
| composable-models.github.io → | Du and colleagues publish a runnable reference implementation that wires three GPT-3.5 agents through two debate rounds on MMLU, GSM8K, and biographies, and report accuracy gains over chain-of-thought and self-consistency on the same prompts at the same compute envelope. |
| microsoft.github.io → | AutoGen ships a Teams primitive in its AgentChat layer that composes multiple AssistantAgents into a round-robin or selector group chat, the canonical framework wiring of the debate loop where each agent reads the conversation transcript before its next turn. |
| anthropic.com → | Anthropic's scalable-oversight programme runs debate as a research line in which two LLM debaters argue opposing positions on hard QA prompts and a weaker judge picks the winner, with measured judge accuracy rising as debater persuasiveness rises. |
Reader gotcha
Liang and colleagues observe that when every debater is the same model with the same system prompt, the agents capitulate to the most confident-sounding peer within a round or two, and the panel converges on whichever answer was phrased most assertively, regardless of correctness. They label this failure mode Degeneration-of-Thought and recommend assigning explicit antagonist roles or distinct temperatures so the agents argue rather than agree.
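One way to apply the role mitigation in the implementation sketch below is to wrap each agent's prompt with an explicit stance so same-snapshot debaters stop echoing each other; raising per-agent temperature spread is the other lever. A sketch — the stance texts are illustrative assumptions, not Liang et al.'s exact prompts:

```typescript
// Decorrelate debaters: each agent argues from an explicit stance, so
// same-snapshot agents stop capitulating to the most assertive peer.
const stances = [
  'You are the Proposer. Commit to the most likely answer and defend it.',
  'You are the Skeptic. Attack the strongest peer answer and argue for an alternative.',
  'You are the Arbiter. Weigh both sides and keep only claims backed by evidence.',
]

// Prefix agent i's prompt with its stance; in the implementation sketch
// this would replace the shared "Answer concisely" preamble.
function withStance(agentIndex: number, prompt: string): string {
  return `${stances[agentIndex % stances.length]}\n\n${prompt}`
}
```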
Implementation sketch
```typescript
import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

const model = openai('gpt-4o')
const ask = (prompt: string) => generateText({ model, prompt }).then((r) => r.text)

export async function debate(question: string, agents = 3, rounds = 2): Promise<string> {
  // Round zero: independent candidates that have not yet seen one another.
  let answers = await Promise.all(
    Array.from({ length: agents }, () => ask(`Answer concisely. Question: ${question}`)),
  )
  // Critique rounds: every agent reads the full transcript, then revises.
  for (let r = 0; r < rounds; r++) {
    const transcript = answers.map((a, i) => `Agent ${i + 1}: ${a}`).join('\n\n')
    answers = await Promise.all(
      answers.map((_, i) =>
        ask(
          `You are Agent ${i + 1}. Peers:\n${transcript}\n\nQuestion: ${question}\nCritique the peers, then revise your answer.`,
        ),
      ),
    )
  }
  // Deterministic commit: a judge call collapses the candidates to one verdict.
  return ask(
    `Question: ${question}\nFinal candidates:\n${answers.join('\n---\n')}\nReturn the single best answer.`,
  )
}
```
Frameworks
- AutoGen
- CrewAI
- LangGraph
References
- Du et al.·2023·ICML 2024 · DOI: 10.48550/arXiv.2305.14325
foundational paper; introduces the parallel-agents-then-critique-round loop on MMLU and GSM8K
- Liang et al.·2023·EMNLP 2024 · DOI: 10.48550/arXiv.2305.19118
documents the Degeneration-of-Thought failure mode and the antagonist-role mitigation
- Khan et al.·2024·ICML 2024 · DOI: 10.48550/arXiv.2402.06782
scalable-oversight result: weaker judges score debaters more accurately as debater capability rises
- Anthropic·2022
frames debate as a scalable-oversight protocol where weaker judges supervise stronger debaters
- Microsoft AutoGen team·accessed 2025
production wiring of the debate loop as a round-robin or selector group chat
- Antonio Gulli·2026·Springer·pp. 102–119
Overview · 1-paragraph mechanism
Multi-Agent Debate runs N language-model instances on the same prompt in parallel, then exposes each agent to the others' answers and asks it to revise. The first round produces independent candidates that have not yet seen one another. The second round feeds every agent the full set of peer responses with an instruction to critique, defend, or update; agents may copy a peer's argument, contradict it, or merge fragments of several. The loop repeats for a fixed number of rounds, after which a deterministic aggregation — majority vote on the final answer field, or a separate judge over the closing statements — picks the verdict the system commits to.
Background · context and trade-offs
The mechanism earns its rent on tasks where errors are not correlated across independent samples but a single agent will not catch its own mistake. Du and colleagues report that debate raises factual accuracy on multi-hop arithmetic and trivia by margins self-consistency sampling does not match, because peer disagreement surfaces specific contradictions a same-prompt rollout would silently agree with. Liang and colleagues frame the same loop as a way out of Degeneration-of-Thought — the failure where a model commits to a confident-but-wrong answer and self-refinement only deepens the commitment. Cost is linear in agents and rounds: counting the initial parallel answers as the first round, a four-agent two-round debate is eight LLM calls plus the judge, against a single call for a one-shot baseline.
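The call-count arithmetic is worth making explicit. Counting the initial parallel answers as the first round, N agents over R rounds cost N·R generation calls plus one judge call:

```typescript
// Total LLM calls for an N-agent, R-round debate with a single judge,
// counting the initial parallel answers as round one.
const debateCalls = (agents: number, rounds: number): number => agents * rounds + 1

// Four agents over two rounds: eight debate turns plus the judge.
debateCalls(4, 2) // → 9
```

Note the counting convention matters when budgeting: if "rounds" instead means critique rounds after the initial answers, the total is N·(R+1)+1.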
The pattern sits next to Orchestrator-Workers and Evaluation (LLM-as-Judge) but is distinct from both. Orchestrator-Workers fans a fixed plan out to specialists chosen for their competence; debate runs symmetric peers and lets the disagreement do the work. LLM-as-Judge scores a fixed slate of candidates against a rubric; debate dynamically generates and refines the candidates before any verdict is taken. Khan and colleagues separately show that when candidates argue opposing positions and a weaker judge picks the winner, judge accuracy on hard reading-comprehension rises with debater capability — evidence the pattern composes with scalable-oversight pipelines, not just self-consistency.