
Grounded in peer-reviewed and emerging multi-agent AI research, FiveBots runs a structured analysis across five frontier AI models, with live web retrieval and cross-critique between them, to improve factuality, strengthen reasoning, and reduce hallucinations.
The problem
Ask any single frontier model to help you decide between options and you’ll get a fluent, assertive recommendation — often sounding more certain than it has any right to be. The things that would make a thoughtful advisor pause (the counter-argument you haven’t heard, the case for the option you almost dismissed) rarely make it into the answer.
What you actually want when you’re stuck is a room: several advisors, arguing with each other, arriving at something you can act on.
The research answer
Multi-agent debate research [1] has shown, repeatedly, that the single biggest quality gain comes from an underused mechanic: let several diverse models answer independently first, then let each read the others’ reasoning and revise. Five agents that start from genuinely different priors [2] catch errors that any single model would defend.
More recent work [7] adds a second finding: once you’ve got five positions on the table, the highest-quality aggregation is not to let one model judge the rest. It is to pin the final recommendation to where the five actually land after reconsidering, and to surface the minority view rather than bury it behind a single winning answer.
Debate mode runs exactly this protocol. Five frontier models give an independent opening take, read each other’s reasoning, reconsider, and the final recommendation is pinned to the revised positions. If one lab holds out, we tell you what it’s worried about. And we tell you the single counter-fact or preference shift that would change the answer.
What we do
When you bring a dilemma into Debate mode, the protocol described above runs underneath. What you hear (five voices arguing in real time, a verdict at the end) is a dramatisation of that protocol’s outputs. The substance is real; the theatre just makes it listenable.
Frame the dilemma and gather context
The dilemma is parsed for the real options on the table. If the decision depends on facts where the panel’s knowledge may be stale (prices, dates, news, rules), a live web search pulls relevant context before anyone answers, so the debate is argued over the same grounded picture, not five different assumptions about the world.
Five independent opening takes
Your dilemma goes to five frontier models — one each from OpenAI (ChatGPT), Anthropic (Claude), Google DeepMind (Gemini), xAI (Grok), and DeepSeek — in parallel, with no visibility of each other. Each returns a position, a confidence, and the single thing that would most change their view.
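As a rough illustration of this fan-out, the sketch below sends one dilemma to the five labs concurrently and keeps every call blind to the others. It is a minimal sketch under stated assumptions, not the FiveBots implementation: `ask_lab`, the `Opening` field names, the 0.0 to 1.0 confidence scale, and the placeholder values are all hypothetical stand-ins for real provider API calls.

```python
import asyncio
from dataclasses import dataclass

# Lab names come from the text above; everything else here is an assumption.
LABS = ["OpenAI", "Anthropic", "Google DeepMind", "xAI", "DeepSeek"]


@dataclass
class Opening:
    lab: str
    position: str           # the option the model favours
    confidence: float       # self-reported, 0.0 to 1.0 (assumed scale)
    would_change_view: str  # the single thing that would most change its view


async def ask_lab(lab: str, dilemma: str) -> Opening:
    """Hypothetical stand-in for one provider API call."""
    await asyncio.sleep(0)  # yields control, as a real network call would
    return Opening(lab, position="undecided", confidence=0.5,
                   would_change_view="placeholder")


async def gather_openings(dilemma: str) -> list[Opening]:
    # Each coroutine receives only the dilemma, so no lab sees another's answer.
    # asyncio.gather returns results in the same order as LABS.
    return await asyncio.gather(*(ask_lab(lab, dilemma) for lab in LABS))


if __name__ == "__main__":
    for opening in asyncio.run(gather_openings("Rent or buy?")):
        print(opening.lab, opening.position, opening.confidence)
```

Running the calls concurrently rather than in sequence is what makes the independence cheap: no lab has to wait for another, and none can be influenced by one.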
Read and reconsider
Each model now reads the other four openings and returns a revised position. This is where real cross-model persuasion happens — a lab might dig in, partly shift, or flip entirely when it sees a case it hadn’t weighed. The rethinks are what the final verdict is pinned to.
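One way to picture the reconsider round is as a prompt-building step: each lab is shown its own opening plus the other four, and nothing else. The sketch below is illustrative only. The `Opening` fields, the prompt wording, and the sample positions are made up for the example and are not the prompts FiveBots actually sends.

```python
from dataclasses import dataclass


@dataclass
class Opening:
    lab: str
    position: str
    reasoning: str


def build_rethink_prompt(lab: str, dilemma: str, openings: list[Opening]) -> str:
    """Assemble the text one lab sees in the reconsider round (hypothetical format)."""
    own = next(o for o in openings if o.lab == lab)
    peers = [o for o in openings if o.lab != lab]  # the other four openings
    lines = [
        f"Dilemma: {dilemma}",
        f"Your opening position: {own.position}",
        "Openings from the other panellists:",
    ]
    lines += [f"- {p.lab}: {p.position}. Reasoning: {p.reasoning}" for p in peers]
    lines.append("Reconsider and return your revised position.")
    return "\n".join(lines)


# Made-up sample data, for illustration only.
sample_openings = [
    Opening("OpenAI", "buy", "prices may keep rising"),
    Opening("Anthropic", "buy", "long time horizon"),
    Opening("Google DeepMind", "rent", "job location is uncertain"),
    Opening("xAI", "rent", "deposit would drain savings"),
    Opening("DeepSeek", "buy", "rent is close to a mortgage payment"),
]

if __name__ == "__main__":
    print(build_rethink_prompt("xAI", "Rent or buy?", sample_openings))
```

Excluding a lab’s own opening from the peer list keeps the comparison clean: the model is asked to weigh four outside cases against its own, not to re-read itself as if it were a fifth opinion.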
Live debate — dramatised, from the real material
While the rethinks are landing, a scriptwriter model turns the openings and early disagreement into a voiced, streaming debate. You hear the five advisors argue in real time. This part is dramatisation: real substance, dramatised delivery.
Verdict, tally, and minority report
The final recommendation is written against the rethinks — not against anything the live debate invented. You see the final vote tally across options, the per-lab ballots, a minority report if any lab dissents, and the single fact or preference shift that would flip the answer.
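A deterministic read of the ballots could look something like the sketch below. The plurality rule, the tie handling, and the function and key names are assumptions made for illustration; the text above only says that the recommendation is pinned to the revised positions and that dissent is reported.

```python
from collections import Counter


def read_ballots(ballots: dict[str, str]) -> dict:
    """Summarise revised positions deterministically (plurality rule is an assumption)."""
    if not ballots:
        raise ValueError("at least one ballot is required")
    counts = Counter(ballots.values())
    ranked = counts.most_common()  # highest vote count first
    top_option, top_votes = ranked[0]
    # With five voters and three or more options, a 2-2-1 split is possible.
    tied = len(ranked) > 1 and ranked[1][1] == top_votes
    recommendation = None if tied else top_option
    # Minority report: every lab whose final ballot differs from the recommendation.
    minority = {} if tied else {
        lab: option for lab, option in ballots.items() if option != top_option
    }
    return {"recommendation": recommendation, "tally": dict(counts), "minority": minority}


if __name__ == "__main__":
    # Made-up ballots, for illustration only.
    result = read_ballots({
        "OpenAI": "buy",
        "Anthropic": "buy",
        "Google DeepMind": "buy",
        "xAI": "rent",
        "DeepSeek": "buy",
    })
    print(result)
```

Because the function is a pure count over the ballots, the same five rethinks always produce the same recommendation, which is the property that keeps the verdict independent of anything the dramatised debate says.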
Every part of the run is inspectable after the verdict lands: the five openings, the five rethinks, the full turn-by-turn debate transcript, and the ballots each lab cast.
What’s real, what’s theatre
We want to be honest about where the theatre starts and stops. The openings, the rethinks, the ballots, and the final recommendation are all real outputs from five different labs — or, for the recommendation, a deterministic read of those outputs. The live voiced debate between them is written by a single scriptwriter model from that real material.
We could hide the debate step and just show you the pinned vote. It would be faster and cheaper. We include it because hearing five perspectives argue out loud is genuinely useful when you’re stuck, even when you know the voices are dramatised. And because the recommendation is pinned to the rethinks, it is never at risk of being whatever the dramatist felt made for a satisfying scene.
Meet the panel
Every run uses the latest flagship model from each of five frontier labs: OpenAI (ChatGPT), Anthropic (Claude), Google DeepMind (Gemini), xAI (Grok), and DeepSeek. The panel is intentionally heterogeneous. Voices are randomised per run so the same lab doesn’t always sound the same. The recommendation is pinned to the reconsidered positions of those five labs, with the minority view surfaced and the single fact that would flip the answer named explicitly.
Why it matters
The best decision advice doesn’t just tell you what to do — it tells you what it’s assuming, where the panel disagrees, and what single fact or preference shift would change the recommendation. That’s what research-aligned aggregation gets you: an answer you can actually evaluate.
Debate mode is for the dilemmas where you’d rather hear the disagreement out loud than be told a confident wrong thing.
References
Peer-reviewed here means accepted at ICML / ICLR / ACL / EMNLP / NeurIPS — not “on arXiv.” arXiv preprints that haven’t cleared a conference are labelled emerging.
ICML 2024 · peer-reviewed
Improving Factuality and Reasoning in Language Models through Multiagent Debate
Du, Li, Torralba, Tenenbaum & Mordatch
The foundational result: multiple models reading each other’s reasoning catch errors a single model defends. Shows debate performance can improve as the number of participating agents increases.
ACL 2024 · peer-reviewed
ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs
Chen, Saha & Bansal
Shows consensus quality improves when agents are drawn from different model families rather than from repeated instances of the same model, and that a transcript-level judge outperforms majority voting.
EMNLP 2024 · peer-reviewed
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
Liang et al.
Motivates debate as a way to counter the Degeneration-of-Thought problem that emerges when a single model becomes locked into its initial reasoning path.
arXiv 2026 · emerging
Demystifying Multi-Agent Debate
Zhu et al.
Studies a five-agent, five-turn debate setting and shows performance improves when the initial debate pool is made more diverse and when agents communicate calibrated confidence during revision.
arXiv 2026 · emerging
Heterogeneous Debate Engine: Identity-Grounded Cognitive Architecture for Resilient LLM-Based Ethical Tutoring
HDE paper
Argues that architectural heterogeneity — models from different labs — prevents “consensus collapse”, where homogeneous panels share the same training biases and confidently converge on the same wrong answer.
arXiv 2025 · emerging
Enhancing Multi-Agent Debate System Performance via Confidence Expression
Wu et al.
Finds that when debating agents see each other’s confidence scores the panel drifts toward over-confidence and loses signal. Informs the FiveBots design choice that cross-critique turns on reasoning and disconfirmation conditions, not assertiveness.
arXiv 2026 · emerging
Auditing Multi-Agent LLM Reasoning Trees Outperforms Majority Vote and LLM-as-Judge
AgentAuditor paper
Shows that adjudicating at divergence points — by comparing localised branch evidence — beats both majority vote and generic LLM-as-judge, recovering correct minority answers where voting loses them entirely. Supports the FiveBots design of a blinded synthesis pass over the full debate transcript.
arXiv 2025 · emerging
Retrieval-Augmented Generation with Conflicting Evidence (MADAM-RAG)
Wang, Prasad, Stengel-Eskin & Bansal
Assigns each agent a different subset of the retrieved evidence, then lets them debate. Reports factuality gains of 11–16 percentage points on benchmarks with ambiguous or conflicting documents. Basis for the FiveBots per-analyst evidence partitioning: agreement reached by analysts reading different sources is much stronger evidence than agreement when all five read the same article.
Powered by ChatGPT, Claude, Gemini, Grok, and DeepSeek.
Your questions are never shared. Your answers are private to you.