The 3-Model Consensus Pattern, Now With Opus 4.7 and GPT-5.5

By Promptster Team · 2026-06-07

We've made the case before that three independent models agreeing is a stronger signal than one expensive model asserting. The 3-judge consensus pattern showed the bias math; three cheap models beating one expensive showed the economics. The method is settled. What changes is the frontier.

So this is a refresh, not a new theory. Same pattern, current lineup: Opus 4.7, GPT-5.5, and Gemini 3.1 Pro as the three voices. The May 2026 model wave makes this the most interesting consensus trio of the year — and the first time we've had three current-generation frontier models from three labs available together on a single day.

Why These Three

The pattern works only if the voices are genuinely independent: different pretraining, different RLHF, different failure modes. A consensus of three OpenAI models is just one model with extra steps. The May 2026 frontier gives us three strong models from three different labs:

Model	Notable strength
Claude Opus 4.7	Strong coding/agentic model; fastest of the three on most structured tasks
GPT-5.5	OpenAI's current release; cleanest instruction-following on formal constraints
Gemini 3.1 Pro	Strong multimodal and reasoning reputation; cheapest of the three on average

Three labs, three architectures, three independent error distributions. That independence is the whole point.
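
To make the value of independence concrete, here is a small illustrative calculation. It is not the published bias math. It assumes an idealized yes/no task where each model errs independently with the same probability p, which real models will only approximate. The function name is ours.

```python
def majority_error_rate(p: float) -> float:
    """Chance that at least 2 of 3 independent voters are wrong on a binary task.

    Assumes every voter has the same error probability p and that errors are
    statistically independent (an idealization, not a measured property).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    # Exactly two wrong (3 ways to pick them) plus all three wrong.
    return 3 * p**2 * (1 - p) + p**3


if __name__ == "__main__":
    for p in (0.05, 0.10, 0.20):
        print(f"single-model error {p:.2f} -> majority error {majority_error_rate(p):.3f}")
```

Under these assumptions a 10% single-model error rate becomes a 2.8% majority error rate. If the three voices share failure modes, their errors are correlated and the gain shrinks toward nothing, which is the case the "three OpenAI models" line warns about.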

The Pattern (Unchanged)

        prompt
          │
   ┌──────┼──────┐
   ▼      ▼      ▼
 Opus   GPT    Gemini
 4.7    5.5    3.1 Pro
   │      │      │
   └──────┼──────┘
          ▼
   aggregate / agreement check
          │
          ▼
   consensus answer + confidence

Two ways to aggregate, depending on the task:

Factual / closed-answer tasks — take the majority answer. 2-of-3 or 3-of-3 agreement is your confidence signal. A 1-1-1 split is a flag to escalate to a human, not a coin flip.
Open-ended / ranked tasks — use the average-rank aggregation from the 3-judge pattern. Each model ranks the anonymized candidates; you average the ranks so each model's self-preference cancels.
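
The closed-answer rule can be sketched in a few lines. This is a minimal sketch, not Promptster's implementation: the function name, the normalization (trim and lowercase), and the confidence labels are our assumptions.

```python
from collections import Counter
from typing import Optional


def majority_vote(answers: list[str]) -> tuple[Optional[str], str]:
    """Return (consensus answer, confidence label) for a closed-answer task.

    A full split returns None so the caller escalates instead of guessing.
    """
    if not answers:
        raise ValueError("need at least one answer")
    # Light normalization so "W" and " w" count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    top, count = Counter(normalized).most_common(1)[0]
    total = len(normalized)
    if count == total:
        return top, f"{count}/{total} unanimous"
    if count > total / 2:
        return top, f"{count}/{total} majority"
    return None, "split: escalate to human review"


if __name__ == "__main__":
    print(majority_vote(["W", "W", "w"]))
    print(majority_vote(["yes", "no", "yes"]))
    print(majority_vote(["2019", "2020", "2021"]))
```

Real answers usually need task-specific normalization (dates, units, punctuation) before they can be compared as strings.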

The math is exactly the one we published. We're not re-deriving it — we're pointing the same machinery at newer models.
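
For the open-ended case, average-rank aggregation might look like the sketch below. The judge keys and candidate labels A/B/C are hypothetical placeholders, and the alphabetical tie-break is our choice, not part of the published pattern.

```python
def average_rank(rankings: dict[str, list[str]]) -> list[tuple[str, float]]:
    """Aggregate per-judge rankings into one ordering by mean rank.

    rankings maps a judge name to the anonymized candidates ordered best
    to worst. Returns (candidate, mean rank) pairs, best first (1 = best).
    """
    if not rankings:
        raise ValueError("need at least one judge")
    orders = list(rankings.values())
    candidates = set(orders[0])
    for order in orders:
        # Every judge must rank the same candidates exactly once.
        if len(order) != len(candidates) or set(order) != candidates:
            raise ValueError("all judges must rank the same candidates once each")
    scores = {
        c: sum(order.index(c) + 1 for order in orders) / len(orders)
        for c in candidates
    }
    # Sort by mean rank, breaking ties alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (kv[1], kv[0]))


if __name__ == "__main__":
    example = {
        "judge_1": ["A", "B", "C"],
        "judge_2": ["B", "A", "C"],
        "judge_3": ["A", "C", "B"],
    }
    for candidate, rank in average_rank(example):
        print(f"{candidate}: {rank:.2f}")
```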

Where Consensus Earns Its Keep

Running three frontier models costs roughly 3x a single call. That's only worth it where a wrong answer is expensive: medical/legal summarization, financial extraction, anything that feeds an automated decision. For those, agreement across independent models is a more useful proxy for correctness than any single model's confidence score, and it's the core idea behind consensus analysis improving quality.

curl -s -X POST "https://www.promptster.dev/v1/prompts/compare" \
  -H "Authorization: Bearer $PROMPTSTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Does this contract clause waive the right to a jury trial? Answer yes/no with the exact phrase.",
    "temperature": 0.0,
    "targets": [
      {"provider": "anthropic", "model": "claude-opus-4-7"},
      {"provider": "openai",    "model": "gpt-5.5"},
      {"provider": "google",    "model": "gemini-3.1-pro-preview"}
    ]
  }'

When all three return the same yes/no and cite the same phrase, you ship it. When they disagree, you route to review.
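
The ship-or-review rule can be sketched as below. The post does not show the response schema of /v1/prompts/compare, so parsing is left out: the sketch assumes each model's reply has already been reduced to a yes/no answer and a cited phrase. The Verdict type, the normalization, and the "ship"/"review" labels are ours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    answer: str  # the model's "yes" or "no"
    phrase: str  # the exact contract phrase the model cited


def _norm(text: str) -> str:
    """Lowercase, collapse whitespace, and drop surrounding quotes and periods."""
    return " ".join(text.lower().split()).strip(" .\"'")


def route(verdicts: list[Verdict]) -> str:
    """Return "ship" only when all three models agree on answer and phrase."""
    answers = {_norm(v.answer) for v in verdicts}
    phrases = {_norm(v.phrase) for v in verdicts}
    if len(verdicts) == 3 and len(answers) == 1 and len(phrases) == 1:
        return "ship"
    return "review"


if __name__ == "__main__":
    agree = [Verdict("Yes", "waives any right to trial by jury")] * 3
    print(route(agree))
    split = agree[:2] + [Verdict("No", "")]
    print(route(split))
```

Requiring the same cited phrase is stricter than requiring the same yes/no: two models can agree on the answer for different reasons, and that case goes to review here.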

The Numbers (Run Before You Trust)

The pattern is theory-backed; the specific agreement rate for your task type is empirical. So we ran four closed-answer prompts at temperature 0.0 — three factual questions and one date-extraction task — through the trio on 2026-05-30.

Here's the result: all three agreed unanimously and correctly on every prompt — W for tungsten's chemical symbol, 36 black keys on an 88-key piano, Jupiter for the fastest-rotating planet, and 2021 for the contract signing year. A clean 3-of-3 across three independent labs' current-generation frontier models, with no need for a Flash substitute this time — all three of Opus 4.7, GPT-5.5, and Gemini 3.1 Pro were available on a single day.

Question	Opus 4.7	GPT-5.5	Gemini 3.1 Pro	Consensus
Chemical symbol for tungsten?	W	W	W	W ✓ (3/3)
Black keys on an 88-key piano?	36	36	36	36 ✓ (3/3)
Fastest-rotating planet?	Jupiter	Jupiter	Jupiter	Jupiter ✓ (3/3)
Year the contract was signed?	2021	2021	2021	2021 ✓ (3/3)

Cost was lopsided in a way worth noticing: Gemini 3.1 Pro's four answers cost a combined $0.000298 (well under 1¢ for the whole task suite), Opus 4.7's cost $0.001535, and GPT-5.5's cost $0.003350. Latency was just as lopsided, but in a different order: Opus answered fastest (730–937 ms across all four), GPT-5.5 landed in the middle (950–1,904 ms), and Gemini 3.1 Pro was slowest (1,790–3,310 ms). The trio's answers cost roughly $0.0052 combined for all four prompts, the cheap end of frontier consensus.
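
The cost figures are easy to check with a few lines of arithmetic. The dollar amounts are the ones reported above. The dictionary keys are display labels, not API model IDs.

```python
# Combined cost of the four prompts per model, in USD, as reported in this post.
costs_usd = {
    "Gemini 3.1 Pro": 0.000298,
    "Opus 4.7": 0.001535,
    "GPT-5.5": 0.003350,
}
NUM_PROMPTS = 4

total = sum(costs_usd.values())
cheapest = min(costs_usd, key=costs_usd.get)

if __name__ == "__main__":
    for model, cost in sorted(costs_usd.items(), key=lambda kv: kv[1]):
        share = cost / total
        multiple = cost / costs_usd[cheapest]
        print(f"{model:15s} ${cost:.6f}  {share:6.1%} of total  {multiple:5.1f}x cheapest")
    print(f"trio total ${total:.6f}, about ${total / NUM_PROMPTS:.6f} per prompt")
```

The total comes to $0.005183, which rounds to the $0.0052 quoted above, and GPT-5.5 accounts for well over half of it.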

With no splits to adjudicate, this is the easy case: closed trivia agrees easily. The hypothesis that still needs a harder dataset: on ambiguous or error-prone tasks, answers backed by 3-of-3 agreement are materially more accurate than any single model's standalone answers, and the split cases are exactly where single models are silently wrong. Run this on your genuinely hard prompts to see whether consensus earns its ~3× cost.
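
One way to test that hypothesis on your own prompts is to compare per-model accuracy with accuracy on the unanimous subset. The sketch below assumes you have gold answers and already-normalized model answers. The function and key names are ours, and the sample data is just the four rows from the table above.

```python
def evaluate(rows: list[dict[str, str]], gold: list[str]) -> dict:
    """Compare single-model accuracy with accuracy on unanimous answers.

    rows[i] maps model name -> that model's answer to prompt i.
    gold[i] is the correct answer to prompt i.
    """
    if not rows or len(rows) != len(gold):
        raise ValueError("rows and gold must be non-empty and the same length")
    models = sorted(rows[0])
    per_model = {
        m: sum(r[m] == g for r, g in zip(rows, gold)) / len(rows) for m in models
    }
    # Keep only prompts where every model gave the same answer.
    unanimous = [(r, g) for r, g in zip(rows, gold) if len(set(r.values())) == 1]
    unanimous_accuracy = (
        sum(next(iter(r.values())) == g for r, g in unanimous) / len(unanimous)
        if unanimous
        else None
    )
    return {
        "per_model_accuracy": per_model,
        "unanimous_rate": len(unanimous) / len(rows),
        "unanimous_accuracy": unanimous_accuracy,
    }


if __name__ == "__main__":
    answers = ["W", "36", "Jupiter", "2021"]
    rows = [{"Opus 4.7": a, "GPT-5.5": a, "Gemini 3.1 Pro": a} for a in answers]
    print(evaluate(rows, gold=answers))
```

On the four trivia prompts every number is 1.0, which is why this run cannot confirm the hypothesis. The interesting result is a dataset where unanimous accuracy stays high while per-model accuracy drops.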

The Real Lesson

Frontier models churn. The consensus pattern doesn't. When the next wave lands — Opus 4.8, GPT-5.6, Gemini 3.2 — you swap the three model IDs and re-run the same aggregation. The method survives the churn because it depends on independence between voices, not on which voice happens to lead this month. Pick three labs, aggregate, and treat disagreement as the signal it is.
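
Swapping the lineup can be a one-list change if the request body is built from data. The sketch below only constructs the JSON body, using the field names from the curl example earlier in this post. The helper name and the three-provider check are our additions, and sending the request is left to whatever HTTP client you already use.

```python
import json

# Current trio; swap these three entries when the next model wave lands.
TRIO = [
    {"provider": "anthropic", "model": "claude-opus-4-7"},
    {"provider": "openai", "model": "gpt-5.5"},
    {"provider": "google", "model": "gemini-3.1-pro-preview"},
]


def build_compare_payload(prompt: str, targets=TRIO, temperature: float = 0.0) -> str:
    """Build the JSON request body for a three-lab consensus run."""
    providers = {t["provider"] for t in targets}
    # The pattern depends on independent voices, so insist on three labs.
    if len(targets) != 3 or len(providers) != 3:
        raise ValueError("expected three targets from three different providers")
    body = {"prompt": prompt, "temperature": temperature, "targets": list(targets)}
    return json.dumps(body, indent=2)


if __name__ == "__main__":
    print(build_compare_payload("What is the chemical symbol for tungsten?"))
```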

Tests run 2026-05-30 via the Promptster /v1/prompts/compare API. Temperature 0.0. Costs computed from the May 2026 pricing.ts (gpt-5.5 $5/$30, opus-4-7 $5/$25, gemini-3.1-pro $2/$12; input/output, per 1M tokens).