Layer 1: Topology / Control Flow

Multi-Agent Debate

Also known as: LLM Debate, Multiagent Debate, Society of Minds

Runs N agents on the same prompt, exposes them to each other's answers, and votes.

N symmetric agents answer the same prompt independently, then each reads all peer answers and revises; the round repeats for a fixed count before a majority vote or judge picks the verdict — peer disagreement surfaces errors a single rollout would ratify.

Claude Code

  • Dispatch N debater subagents in a single message — they produce independent first-round answers without seeing each other.
  • Write each debater's first-round output to a named disk file; the second round reads all files and produces a revised answer.
  • Assign distinct roles or temperatures to debaters via their `SKILL.md` or brief; debaters given identical prompts converge rather than debate.
  • Run the final aggregation as a separate judge subagent that reads all revised answers cold, without the debate transcript in context.

Primitives

  • Task tool (parallel debater subagents)
  • Single-message dispatch (isolation)
  • Disk-written round outputs
  • Judge subagent (final aggregation)

Cursor

  • Open N independent Agent chats on the same question — each produces a first-round answer without seeing the others.
  • Paste all first-round answers into a new Agent chat and ask for critique and revision; reference each via @file if saved to disk.
  • Assign different perspectives or constraints to each debater chat via the opening prompt to break the degeneration-of-thought pattern.
  • Use a final Agent session as the judge — provide only the closing statements, not the full debate transcript, to keep the verdict unbiased.

Primitives

  • Multiple Agent chats (parallel debaters)
  • @file (cross-chat answer sharing)
  • Agent mode

Decision

Use when ✓

  • Best for cases where factual accuracy or reasoning depth matters more than latency, and a single rollout silently produces a confident wrong answer that peer disagreement would expose.
  • Justified where the task admits a single committable verdict (a numeric answer, a multiple-choice label, a yes/no decision) that a deterministic aggregator can collapse the round to.
  • A good fit on questions a stronger model cannot reach but a panel of weaker models can, by triangulating between candidate explanations rather than averaging logits.
  • Useful inside a scalable-oversight pipeline where opposing debaters surface contrasting evidence a less-capable judge can adjudicate more reliably than scoring one answer alone.

Avoid when ✗

  • When latency or cost dominates and an N-agent K-round debate inflates per-query spend by an order of magnitude over a self-consistency sample for a marginal accuracy gain.
  • Without a deterministic aggregation rule (vote, judge, or longest-justification heuristic), the debate produces N divergent finals and the system has no contract for what to commit to.
  • When debaters share a single model snapshot and a single system prompt, errors stay correlated and the loop converges on the same wrong answer it started from rather than escaping it.
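
The deterministic aggregation rule called for above can be as small as a normalized majority vote. The following is a minimal sketch, assuming each debater's final answer is a short string (a label, a number, or yes/no). The helper names `normalize` and `majorityVote` and the first-seen tie-break are illustrative choices, not taken from the cited papers.

```typescript
// Hypothetical helper: canonicalize a short final answer so that
// "Yes.", " yes " and "YES" all count as the same vote.
export function normalize(answer: string): string {
  return answer.trim().toLowerCase().replace(/[.!\s]+$/, '')
}

// Deterministic majority vote over the debaters' final answers.
// Assumption: ties go to the answer that appeared first in the list.
export function majorityVote(finals: string[]): string {
  if (finals.length === 0) throw new Error('no answers to aggregate')
  const keys = finals.map(normalize)
  let best = keys[0]
  let bestCount = 0
  for (const key of keys) {
    // Quadratic in the number of debaters, which is fine for a small panel.
    const count = keys.filter((k) => k === key).length
    // Strict ">" keeps the earliest answer when two answers tie.
    if (count > bestCount) {
      best = key
      bestCount = count
    }
  }
  return best
}
```

Free-form answers would need a stricter extraction step (for example, asking each debater to end with a fixed `Final answer:` field) before a vote like this is meaningful.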

In the wild

  • composable-models.github.io: Du and colleagues publish a runnable reference implementation that wires three GPT-3.5 agents through two debate rounds on MMLU, GSM8K, and biographies, and report accuracy gains over chain-of-thought and self-consistency on the same prompts at the same compute envelope.
  • microsoft.github.io: AutoGen ships a Teams primitive in its AgentChat layer that composes multiple AssistantAgents into a round-robin or selector group chat, the canonical framework wiring of the debate loop where each agent reads the conversation transcript before its next turn.
  • anthropic.com: Anthropic's scalable-oversight programme runs debate as a research line in which two LLM debaters argue opposing positions on hard QA prompts and a weaker judge picks the winner, with measured judge accuracy rising as debater persuasiveness rises.

Reader gotcha

Liang and colleagues use Degeneration-of-Thought for a model that, once confident in an answer, cannot produce new lines of thought through self-reflection even when the answer is wrong. A debate panel can reproduce the same failure: when every debater is the same model with the same system prompt, the agents capitulate to the most confident-sounding peer within a round or two, and the panel converges on whichever answer happened to be phrased most assertively, regardless of correctness. The mitigation is to assign explicit antagonist roles or distinct temperatures so the agents argue rather than agree.
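
One way to apply the role-and-temperature mitigation described above is to build a distinct configuration per debater before the first round. This is a sketch only: the `DebaterConfig` shape, the stance wording, the temperature range, and the `makeDebaters` helper are all illustrative assumptions, not settings from Liang and colleagues.

```typescript
// Hypothetical per-debater configuration: a stance plus a sampling temperature.
export interface DebaterConfig {
  id: number
  system: string
  temperature: number
}

// Illustrative stances; the wording is an assumption, not quoted from any paper.
const STANCES = [
  'Argue for the answer you find most likely and defend it with evidence.',
  'Act as the antagonist: look for flaws in the leading answer and propose an alternative.',
  'Act as a skeptic: accept a claim only after checking each step of its justification.',
]

export function makeDebaters(n: number, minTemp = 0.2, maxTemp = 1.0): DebaterConfig[] {
  return Array.from({ length: n }, (_, i) => ({
    id: i + 1,
    // Cycle through the stances when there are more debaters than stances.
    system: STANCES[i % STANCES.length],
    // Spread temperatures evenly over [minTemp, maxTemp]; a lone debater gets minTemp.
    temperature: n === 1 ? minTemp : minTemp + ((maxTemp - minTemp) * i) / (n - 1),
  }))
}
```

With the AI SDK used in the implementation sketch on this page, each config could be passed to `generateText` through its `system` and `temperature` options alongside the prompt.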

Implementation sketch

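import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'

const model = openai('gpt-4o')
// One LLM call: send a prompt, resolve to the text of the reply.
const ask = (prompt: string) => generateText({ model, prompt }).then((r) => r.text)

// `rounds` counts revision rounds after the independent first round,
// so one run makes agents * (rounds + 1) debater calls plus one judge call.
export async function debate(question: string, agents = 3, rounds = 2): Promise<string> {
  // First round: every agent answers independently, in parallel.
  let answers = await Promise.all(
    Array.from({ length: agents }, () => ask(`Answer concisely. Question: ${question}`)),
  )
  for (let r = 0; r < rounds; r++) {
    // Snapshot the previous round so every agent revises against the same answers.
    const previous = answers
    answers = await Promise.all(
      previous.map((own, i) => {
        // Keep the agent's own answer separate from the peer answers it critiques.
        const peers = previous
          .map((a, j) => (j === i ? null : `Agent ${j + 1}: ${a}`))
          .filter((line): line is string => line !== null)
          .join('\n\n')
        return ask(
          `You are Agent ${i + 1}. Your previous answer:\n${own}\n\nPeers:\n${peers}\n\n` +
            `Question: ${question}\nCritique the peers, then revise your answer.`,
        )
      }),
    )
  }
  // Judge: a fresh call that sees only the final answers, not the debate transcript.
  return ask(`Question: ${question}\nFinal candidates:\n${answers.join('\n---\n')}\nReturn the single best answer.`)
}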


References

  1. Du et al. · 2023 · ICML 2024 · DOI: 10.48550/arXiv.2305.14325

    foundational paper; introduces the parallel-agents-then-critique-round loop on MMLU and GSM8K

  2. Liang et al. · 2023 · EMNLP 2024 · DOI: 10.48550/arXiv.2305.19118

    documents the Degeneration-of-Thought failure mode and the antagonist-role mitigation

  3. Khan et al. · 2024 · ICML 2024 · DOI: 10.48550/arXiv.2402.06782

    scalable-oversight result: weaker judges score debaters more accurately as debater capability rises

  4. Anthropic · 2022

    frames debate as a scalable-oversight protocol where weaker judges supervise stronger debaters

  5. Microsoft AutoGen team · accessed 2025

    production wiring of the debate loop as a round-robin or selector group chat

  6. Antonio Gulli · 2026 · Springer · pp. 102–119

Multi-Agent Debate runs N language-model instances on the same prompt in parallel, then exposes each agent to the others' answers and asks it to revise. The first round produces independent candidates that have not yet seen one another. The second round feeds every agent the full set of peer responses with an instruction to critique, defend, or update; agents may copy a peer's argument, contradict it, or merge fragments of several. The loop repeats for a fixed number of rounds, after which a deterministic aggregation (majority vote on the final answer field, or a separate judge over the closing statements) picks the verdict the system commits to.

Background · context and trade-offs

The mechanism earns its keep on tasks where errors are not correlated across independent samples but a single agent will not catch its own mistake. Du and colleagues report that debate raises factual accuracy on multi-hop arithmetic and trivia by margins self-consistency sampling does not match, because peer disagreement surfaces specific contradictions a same-prompt rollout would silently agree with. Liang and colleagues frame the same loop as a way out of Degeneration-of-Thought: the failure where a model commits to a confident-but-wrong answer and self-refinement only deepens the commitment. Cost is linear in agents and rounds: a four-agent, two-round debate (one independent round plus one revision round) is eight LLM calls plus the judge, against one call for a single-rollout baseline.
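
The call count follows directly from the loop shape. Here is a small sketch, assuming the round count includes the independent first round (as in the four-agent example above) and that a judge adds exactly one call; `debateCalls` is an illustrative name.

```typescript
// LLM calls for an N-agent debate in which `rounds` counts the independent
// first round plus every revision round. A judge adds one call; a pure
// majority vote adds none.
export function debateCalls(agents: number, rounds: number, judge = true): number {
  if (agents < 1 || rounds < 1) throw new Error('agents and rounds must be >= 1')
  return agents * rounds + (judge ? 1 : 0)
}

// Four agents, two rounds: 8 debater calls + 1 judge call.
console.log(debateCalls(4, 2)) // 9
```

The implementation sketch earlier on this page counts only revision rounds in its `rounds` parameter, so its defaults of three agents and two revision rounds correspond to `debateCalls(3, 3)`, or ten calls.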

The pattern sits next to Orchestrator-Workers and Evaluation (LLM-as-Judge) but is distinct from both. Orchestrator-Workers fans a fixed plan out to specialists chosen for their competence; debate runs symmetric peers and lets the disagreement do the work. LLM-as-Judge scores a fixed slate of candidates against a rubric; debate dynamically generates and refines the candidates before any verdict is taken. Khan and colleagues separately show that when candidates argue opposing positions and a weaker judge picks the winner, judge accuracy on hard reading-comprehension rises with debater capability: evidence the pattern composes with scalable-oversight pipelines, not just self-consistency.
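
The opposing-positions setup described above can be sketched with the model calls abstracted away. This is a minimal illustration, not the protocol from Khan and colleagues: the `Ask` type, the `judgedDebate` function, the single-turn arguments, and all prompt wording are assumptions. In practice `debater` would wrap a stronger model and `judge` a weaker one.

```typescript
// A model call is abstracted as a function so the sketch runs without any SDK.
export type Ask = (prompt: string) => Promise<string>

// Two debaters are assigned opposing answers; the judge sees only their
// arguments and must reply with the letter of the more convincing side.
export async function judgedDebate(
  question: string,
  answerA: string,
  answerB: string,
  debater: Ask,
  judge: Ask,
): Promise<'A' | 'B'> {
  // Both arguments are produced independently and in parallel.
  const [argA, argB] = await Promise.all([
    debater(`Question: ${question}\nArgue that the answer is: ${answerA}`),
    debater(`Question: ${question}\nArgue that the answer is: ${answerB}`),
  ])
  const verdict = await judge(
    `Question: ${question}\n` +
      `Debater A (${answerA}): ${argA}\n` +
      `Debater B (${answerB}): ${argB}\n` +
      `Reply with exactly one letter, A or B.`,
  )
  // Parse the first character of the reply; refuse to guess on anything else.
  const letter = verdict.trim().toUpperCase().charAt(0)
  if (letter === 'A') return 'A'
  if (letter === 'B') return 'B'
  throw new Error(`unparseable verdict: ${verdict}`)
}
```

Throwing on an unparseable verdict keeps the aggregation contract explicit: the caller either gets a committable label or an error, never a silent default.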