The Agreeable Summarization Trap
TRIGGER
Multiple LLMs given the full discussion context were producing superficial responses that didn't engage meaningfully with each other's arguments: they had the information, but no framework for productive disagreement or synthesis.
APPROACH
During the Gradio Agents & MCP Hackathon, developer azettl built Consilium, a multi-LLM debate platform where 3-4 models (Mistral Large, DeepSeek-R1, Llama-3.3-70B, QwQ-32B) discuss questions around a visual poker-table interface. Input: user question + configurable communication structure (full context, ring, or star topology) + 1-5 discussion rounds. Output: role-differentiated responses plus a lead-analyst synthesis determining whether consensus was reached.

Without role differentiation, LLMs receiving the full discussion context produced no real debate: they had context but no framework for disagreement. The fix: distinct system-prompt roles, including expert_advocate ("PASSIONATE EXPERT advocating with conviction"), critical_analyst ("RIGOROUS CRITIC identifying flaws and risks"), strategic_advisor ("practical implementation and real-world constraints"), research_specialist ("authoritative evidence-based analysis"), and innovation_catalyst ("challenge conventional thinking").

Microsoft's AI Diagnostic Orchestrator validated the same approach: its multi-agent panel, with roles like "Dr. Challenger Agent", achieved 85.5% accuracy on medical diagnosis benchmarks versus 20% for practicing physicians.
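A minimal sketch of the role-differentiation mechanism: one system-prompt framing per participant, composed with the question and the shared transcript (full-context topology). The role names follow Consilium's list above, but the prompt wording and the `build_prompt` helper are illustrative, not the project's actual source.

```python
# Illustrative role prompts modeled on Consilium's role list; the exact
# wording here is an assumption, not the project's real prompts.
ROLE_PROMPTS = {
    "expert_advocate": "You are a PASSIONATE EXPERT. Advocate for the strongest position with conviction.",
    "critical_analyst": "You are a RIGOROUS CRITIC. Identify flaws, risks, and unstated assumptions in the other participants' arguments.",
    "strategic_advisor": "Focus on practical implementation and real-world constraints.",
    "research_specialist": "Provide authoritative, evidence-based analysis.",
    "innovation_catalyst": "Challenge conventional thinking and propose alternatives.",
}

def build_prompt(role: str, question: str, transcript: list[str]) -> str:
    """Compose one participant's prompt for a debate round: role framing
    first, then the question, then the discussion so far."""
    history = "\n".join(transcript) if transcript else "(no prior discussion)"
    return (
        f"{ROLE_PROMPTS[role]}\n\n"
        f"Question: {question}\n\n"
        f"Discussion so far:\n{history}\n\n"
        "Respond in character. Engage directly with the other participants' points."
    )
```

The key design point is that the conflict incentive lives in the role framing, not in the question: the critical_analyst is told to find flaws regardless of what the question is, which is what prevents the accommodation default described in the PATTERN below.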
PATTERN
“Your multi-agent system will produce agreeable summaries instead of genuine debate without explicit conflict incentives. LLMs default to accommodation—one model's job must be to find flaws in another's reasoning, or you get superficial consensus.”
✓ WORKS WHEN
- Question has multiple defensible positions or tradeoffs to explore
- You need to surface weaknesses in arguments before committing to a decision
- 3-5 LLM participants (enough for role diversity, few enough for coherent synthesis)
- Discussion rounds are capped (1-5 rounds) to force convergence rather than infinite debate
- A designated synthesizer role exists to evaluate consensus and extract final answer
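The designated-synthesizer bullet above can be sketched as a lead-analyst step that both synthesizes and declares a verdict. The `CONSENSUS:` line convention and both helper names are hypothetical, an assumption for illustration rather than Consilium's actual protocol.

```python
# Hypothetical lead-analyst synthesis step; the "CONSENSUS:" verdict line
# is an illustrative convention, not Consilium's real output format.
def build_synthesis_prompt(question: str, transcript: list[str]) -> str:
    """Ask a lead-analyst model to judge consensus and synthesize an answer."""
    return (
        "You are the LEAD ANALYST. Review the debate below and decide whether "
        "the participants reached consensus.\n\n"
        f"Question: {question}\n\n"
        "Debate transcript:\n" + "\n".join(transcript) + "\n\n"
        "Reply with a line 'CONSENSUS: yes' or 'CONSENSUS: no', then your "
        "synthesized final answer."
    )

def parse_verdict(reply: str) -> bool:
    """True if the analyst's reply declares consensus on its verdict line."""
    for line in reply.splitlines():
        if line.strip().lower().startswith("consensus:"):
            return "yes" in line.lower()
    return False
```

Forcing an explicit machine-parseable verdict is what lets the capped-rounds loop terminate early when agreement is reached instead of always running the full round budget.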
✗ FAILS WHEN
- Question has a single factual answer where debate adds no value
- LLMs have identical training data and will converge to same position regardless of role
- Latency budget doesn't allow multiple sequential LLM calls (each round multiplies latency by participant count)
- Users expect deterministic outputs—role-based debate introduces response variance
- The task requires precise execution rather than exploratory reasoning
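The latency caveat above is worth quantifying before committing to this pattern. A back-of-envelope sketch, assuming fully sequential calls (the worst case; parallel calls within a round reduce the per-round cost to one call's latency):

```python
def debate_latency_estimate(participants: int, rounds: int,
                            per_call_seconds: float,
                            synthesis_seconds: float = 0.0) -> float:
    """Worst-case wall-clock estimate when every call runs sequentially:
    each round costs participants * per_call_seconds, plus one final
    synthesis call. Figures are illustrative, not measured."""
    return rounds * participants * per_call_seconds + synthesis_seconds

# 4 participants, 3 rounds, ~8 s per call -> 96 s before synthesis,
# which already exceeds most interactive latency budgets.
```

This is why the WORKS WHEN list caps rounds at 1-5: the cost grows linearly in both rounds and participant count, so an uncapped debate has unbounded latency.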