Three-way AI model debate as a pre-commit gate: receipts from several months of use

Multi-model debate sounds like a clever idea. Running it as a pre-commit gate, repeatedly over the past few months, has been more useful and more boring than the idea suggests. Most of what I assumed before I built it turned out to be wrong: the value is not where I expected, and the failure modes are not what I worried about.

The harness is a small Python script. I run it on architectural reviews, plan approvals, and on every important write to my personal AI infrastructure. If you are considering ensembling models for review, here is what to expect, with three dated incidents where the protocol caught something single-model review would have missed.

What the protocol does

The orchestrator (me, or a Claude Opus session acting as scheduler) writes the decision to disk as context: the design document, change plan, or pull request diff with enough surrounding code to stand alone.

Round 0 is blind. All three models see the context; none sees any other’s response. Claude writes its position to a file; the harness calls Gemini and GPT-5.4 in parallel and writes theirs. All three files are revealed at once, and each position must end with a verdict: APPROVE, CHANGES REQUESTED, or BLOCK.
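The verdict-at-the-end rule is what makes three free-text positions comparable by a script. A minimal sketch of extracting it, assuming the verdict appears on the last non-empty line of each position; `parse_verdict` and its fail-loudly behaviour are illustrative choices, not the harness's actual code:

```python
import re

# The three verdict strings come from the protocol description.
VERDICTS = ("CHANGES REQUESTED", "APPROVE", "BLOCK")


def parse_verdict(position: str) -> str:
    """Return the verdict found on the last non-empty line of a position."""
    lines = [ln.strip() for ln in position.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty position")
    last = lines[-1].upper()
    for verdict in VERDICTS:
        # Word boundaries stop "APPROVED" or "DISAPPROVE" from matching.
        if re.search(rf"\b{verdict}\b", last):
            return verdict
    # Assumption: a missing verdict is an error, not something to guess at.
    raise ValueError(f"no verdict on final line: {lines[-1]!r}")
```

Raising on a missing verdict keeps a rambling position from being silently counted as agreement.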

If all three return the same verdict, the debate is over, and the result carries weight: it happened before any model anchored on any other. If two agree, the dissenter’s reasoning gets flagged. If all three differ, round 1 is mandatory.
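The three round-zero outcomes above can be sketched as a small classifier. This assumes verdicts have already been extracted into a dict keyed by model name; the function name and the returned dict shape are hypothetical:

```python
from collections import Counter


def classify_round(verdicts: dict[str, str]) -> dict:
    """Classify a set of verdicts as unanimous, majority, or split."""
    counts = Counter(verdicts.values())
    top_verdict, top_count = counts.most_common(1)[0]
    if top_count == len(verdicts):
        return {"outcome": "unanimous", "verdict": top_verdict, "dissenters": []}
    if top_count > len(verdicts) / 2:
        # Two agree: name the dissenter so its reasoning can be flagged.
        dissenters = sorted(m for m, v in verdicts.items() if v != top_verdict)
        return {"outcome": "majority", "verdict": top_verdict, "dissenters": dissenters}
    # All three differ: round 1 is mandatory.
    return {"outcome": "split", "verdict": None, "dissenters": sorted(verdicts)}
```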

From round 1 onwards each model sees all three positions from the previous round and must engage explicitly: “Where do you agree? Where do you disagree? If either reviewer raised a valid point you missed, update your verdict honestly. If your position is still correct, defend it with new evidence.” A rotating blast-radius seat is assigned from round 1: that model walks the dependency graph and identifies which consumers of the change have not been checked, while the other two continue design critique. I rotate the seat so no single model’s blind spots dominate.
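One way to rotate the blast-radius seat, as a sketch. The rotation scheme (offset by round number and by a per-debate counter) is an assumption, as are the names `blast_radius_seat`, `roles_for_round`, and `debate_index`:

```python
from typing import Optional

MODELS = ("claude", "gemini", "gpt")


def blast_radius_seat(round_n: int, debate_index: int = 0) -> Optional[str]:
    """Return the model holding the blast-radius seat, or None in round 0."""
    if round_n < 1:
        return None  # round 0 is blind and has no special seat
    # Assumption: shift by debate_index so each debate starts on a different model.
    return MODELS[(debate_index + round_n - 1) % len(MODELS)]


def roles_for_round(round_n: int, debate_index: int = 0) -> dict:
    """Map every model to its role for this round."""
    seat = blast_radius_seat(round_n, debate_index)
    return {m: ("blast-radius" if m == seat else "design-critique") for m in MODELS}
```

With three models and a four-round cap, every model holds the seat once if a debate runs to its final round.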

Rounds continue until unanimous, majority with acknowledged dissent, or max rounds (default 4). If max rounds is hit without convergence, all three final positions are surfaced and I decide.
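The stopping rule can be sketched as a loop around a round function. Here `play_round` is a hypothetical callable that returns the three verdicts and whether the dissent was acknowledged; rounds are assumed to be numbered from 0, so the default cap of 4 means rounds 0 to 3:

```python
def run_debate(play_round, max_rounds=4):
    """Run rounds until unanimous, majority with acknowledged dissent, or the cap.

    play_round(round_n, history) -> (verdicts: dict, dissent_acknowledged: bool)
    Returns (stop_reason, history of verdict dicts).
    """
    history = []
    for round_n in range(max_rounds):
        verdicts, dissent_acknowledged = play_round(round_n, history)
        history.append(verdicts)
        distinct = set(verdicts.values())
        if len(distinct) == 1:
            return "unanimous", history
        # With exactly three models, two distinct verdicts means a 2-1 majority.
        if len(distinct) == 2 and dissent_acknowledged:
            return "majority-with-dissent", history
    # No convergence: surface all final positions for a human decision.
    return "max-rounds", history
```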

The core of the Python runner:

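from concurrent.futures import ThreadPoolExecutor

# BLIND_PROMPT, INFORMED_PROMPT, call_gemini and call_gpt are defined elsewhere
# in the harness. INFORMED_PROMPT is assumed to use the placeholders
# {context}, {own_prior}, {claude_prior} and {other_prior}.

def run_round(round_n, context, prior_positions):
    """Build prompts for Gemini and GPT, call both in parallel, return their positions."""
    if round_n == 0:
        # Blind round: both models get the same context-only prompt.
        gemini_prompt = BLIND_PROMPT.format(context=context)
        gpt_prompt = BLIND_PROMPT.format(context=context)
    else:
        # Informed rounds: each model sees its own prior position plus the
        # other two. Both calls pass the same keyword names, so one template
        # serves either model without a KeyError or a silently dropped position.
        gemini_prompt = INFORMED_PROMPT.format(
            context=context,
            own_prior=prior_positions["gemini"],
            claude_prior=prior_positions["claude"],
            other_prior=prior_positions["gpt"],
        )
        gpt_prompt = INFORMED_PROMPT.format(
            context=context,
            own_prior=prior_positions["gpt"],
            claude_prior=prior_positions["claude"],
            other_prior=prior_positions["gemini"],
        )
    # Call Gemini and GPT in parallel; Claude writes its position directly to file.
    with ThreadPoolExecutor(max_workers=2) as pool:
        gem_future = pool.submit(call_gemini, gemini_prompt)
        gpt_future = pool.submit(call_gpt, gpt_prompt)
        return {"gemini": gem_future.result(), "gpt": gpt_future.result()}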

Cost per debate is modest at full tier (Gemini Pro plus GPT-5.4 full) over four rounds. Claude on the orchestrating session costs nothing extra because the Opus subscription already covers it. The mini tier (Gemini Flash plus GPT-5.4-mini) costs roughly a third as much.

What I expected before running it

Going in, I had four expectations:

The value would come from ensembling. Three imperfect judgements, majority verdict more reliable than any single one.
The cost would be round count. More rounds, more spend, more wallclock.
The failure mode would be collusion. Shared training data, anchoring on the same mistakes.
The protocol would suit safety-critical decisions and be less useful for small design choices.

Most of these were wrong.

What I actually found

The value is in framing diversity, not verdict ensembling. Three models from three labs do not produce three versions of the same answer; they produce three framings. Claude defaults to systems-level risk analysis. Gemini defaults to “what is missing” and “who has not been consulted.” GPT-5.4 defaults to first-principles re-derivation. I learn more from the three-way comparison than from any single one, even when they converge.

Convergence is dangerous. When all three agree on round 0, it is often because they share a training-data bias. The first time I ran the protocol on a “should I decompose this monolith?” decision, all three returned APPROVE with almost identical reasoning, so I shipped. Some time after shipping, I realised all three had missed the same question: did any downstream consumer depend on the monolith’s internal call graph in a way that module splits would break? The answer was yes. Unanimous round-zero agreements are now flagged as suspicious.

Principled disagreement is the real signal. The most valuable debates are the ones where two models converge and one holds out with a specific technical objection, and that objection is almost always worth a third look. Round-one dissent after round-zero agreement has changed my decision often enough that it has a much higher signal rate than round-zero unanimity.

The blast-radius seat is the most important rule in the protocol, and it was not my idea. It came from a failed review in April 2026: I ran the protocol on the plan to rebuild my controlplane infrastructure, got unanimous APPROVE on round 0, shipped, and produced four red flags on my iOS app within hours. All three had reviewed the design without its operational dependencies; they had not asked “what consumes this? what is the staleness horizon of every consumer?” because they had not been asked to. In debrief, all three converged: one seat per round must be responsible for operational dependency analysis rather than design critique. Most debates since have surfaced at least one unchecked consumer.

Specific incidents where the protocol caught things

The research-plan debate (20 April 2026). I asked the protocol to review my plan for a global research review paper on Claude Code best practices. Round 0 was blind, and all three returned CHANGES REQUESTED, unanimous on verdict though not on specifics. Claude: “the plan is biased toward addition; add a simplification vector.” Gemini: “the plan is too broad for the time budget; sharpen the position.” GPT-5.4: “the plan underweights adversarial review of prompt-injection pathways.” Three framings of the same underlying issue: the plan was asking the wrong question. Not “what tools and patterns exist?” but “what is under-measured here that should be measured first?” I rewrote with a measurement-first anchor, and the eventual paper was materially better.

The controlplane rebuild debate (11 April 2026). The failed one, and more useful for it. I asked the protocol to review a seven-phase plan to rebuild my infrastructure under a unified control plane repository. Round 0 came back unanimous APPROVE with substantive commentary, and I shipped. Within hours of the first acceptance-test run, four red flags appeared on my iOS app, all pointing at consumers of the old layout that the new layout had broken without migration. In debrief, all three converged on the same root cause: they had reviewed the design in isolation from downstream operational state. None had asked whether health_check.py’s hardcoded list of LaunchAgents matched the new services.yaml, or whether the iOS app’s expected file paths were in the new tree. The design was correct; the rollout was broken. That is where the mandatory blast-radius seat came from.
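The two questions nobody asked can both be checked mechanically. A minimal sketch, assuming you can list each consumer's expected paths and load both service lists as sets of names; `unchecked_consumers`, `list_drift`, and their inputs are hypothetical, and the real formats of health_check.py and services.yaml are not shown here:

```python
from pathlib import Path


def unchecked_consumers(expected_paths: dict[str, list[str]], new_root: Path) -> dict[str, list[str]]:
    """Map each consumer to the paths it expects that are absent under new_root."""
    broken = {}
    for consumer, rel_paths in expected_paths.items():
        missing = [p for p in rel_paths if not (new_root / p).exists()]
        if missing:
            broken[consumer] = missing
    return broken


def list_drift(hardcoded: set[str], declared: set[str]) -> dict[str, set[str]]:
    """Compare a hardcoded name list against a declared one, in both directions."""
    return {
        "only_hardcoded": hardcoded - declared,  # still checked, no longer declared
        "only_declared": declared - hardcoded,   # declared, but never checked
    }
```

An empty result from both functions is the evidence a blast-radius seat would want before the verdict counts.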

The recommendations-paper debate (this week). I asked the protocol to review a draft recommendations paper. Round 0, unanimous CHANGES REQUESTED. Claude: “prompt and system-message management strategy is missing.” Gemini: “cost observability for API spend attribution is missing.” GPT-5.4: “versioning and pinning of prompts, skills, Model Context Protocol (MCP) server configs, and model IDs as proper release artifacts with rollback by exact digest is missing.” All three were circling the same gap: the configuration surface I am managing is not versioned, measured, or pinned as a coherent artifact. I rolled the three into one high-confidence recommendation.

Where the protocol has not helped

Small decisions. For edits under 50 lines, three-way debate is noise. I use /review (ruff lint, semgrep, a code-reviewer subagent, optionally one external model) below a complexity threshold; /debate is for architecture, plans, anything irreversible, and anything touching the printer.
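The routing rule can be sketched as a single function. Using line count as the complexity measure is an assumption taken from the 50-line figure above; `choose_gate` and its flags are illustrative names:

```python
def choose_gate(lines_changed: int, irreversible: bool = False,
                architectural: bool = False, line_threshold: int = 50) -> str:
    """Pick the review command for a change."""
    # Irreversible or architectural changes always go to debate, however small.
    if irreversible or architectural or lines_changed >= line_threshold:
        return "/debate"
    return "/review"
```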

Time-critical decisions. The protocol takes several minutes per round, so for decisions under time pressure (a live incident, a print heading toward failure) it is the wrong tool.

Decisions all three models are confidently wrong about. This cannot be eliminated by protocol: if all three share a training-data gap, the unanimous verdict will be confidently wrong. I have hit this twice, and in both cases the gap was empirical.

Aesthetic or taste decisions. “Should the tone of this post be more formal?” is a bad debate question. Three models’ aesthetic judgements average to mush.

The one confidence trap I did not predict

Over the first months I caught myself treating unanimous round-zero APPROVE as licence to ship faster than I would have without the protocol. The debate was doing the work of my own due diligence, which cuts the wrong way: when the debate converges early, it reduces my own scrutiny at the very moment the collusion failure mode bites hardest.

The fix was a rule I now follow. Unanimous round-zero APPROVE is not a green light; it is a flag to look again at what the agreement might be masking. For plans involving real-world state I force a round 1 assigned to “what did all three miss?” Round 1 takes a few minutes and often surfaces something interesting. The cost of skipping it on the one occasion I did (the controlplane rebuild) was not trivial.
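The rule is simple enough to encode so that it does not depend on my mood that day. A sketch, where `needs_forced_round` and the `touches_real_world_state` flag are hypothetical names and the prompt wording beyond the quoted question is illustrative:

```python
# Only the question "what did all three miss?" comes from the rule itself.
FORCED_ROUND_PROMPT = (
    "All three reviewers returned APPROVE in round 0. "
    "What did all three miss? Name the consumers and the operational state "
    "that none of the positions checked."
)


def needs_forced_round(round0_verdicts: dict, touches_real_world_state: bool) -> bool:
    """True when a unanimous round-zero APPROVE should still trigger round 1."""
    unanimous_approve = set(round0_verdicts.values()) == {"APPROVE"}
    return unanimous_approve and touches_real_world_state
```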

Should you build this

If you are making a lot of agent-system or infrastructure decisions, yes. The protocol is cheap relative to a single wrong architectural decision.

If you are making one decision a month, probably not. The overhead is not worth it at that volume.

If you already have two humans who will disagree with you in principled ways, you do not need the protocol. Humans are strictly better at this than models. The protocol is a substitute for not having those humans, which is my situation and probably yours if you are running solo.

If you are making aesthetic or taste decisions, no.

The parts that matter most: round 0 is blind, every round requires engaging with the previous positions rather than just asserting, and one seat per round is operational rather than architectural.