Why 3 Cheap Models in Consensus Beat 1 Expensive Model Alone (With the Math)

By Promptster Team · 2026-05-09

The intuition most developers have about AI model pricing is: "you get what you pay for." Premium models are more accurate. Premium answers are better answers. The math behind prompt quality, though, is more interesting than that — and in many real tasks, three cheap models in consensus beat one expensive model alone on both accuracy and cost.

Here's the math, and here's the receipt.

The Math

Assume three independent models, each with individual accuracy p on a factual task. What's the probability that at least 2 of the 3 agree on the correct answer (majority-vote accuracy)?

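Binomial (all three correct, plus the three ways exactly two are correct):

$$P(\text{majority correct}) = p^3 + 3p^2(1 - p) = 3p^2 - 2p^3$$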

Plug in some numbers:

| Individual model accuracy p | Majority-of-3 accuracy | Delta |
|---|---|---|
| 0.70 | 0.784 | +0.084 |
| 0.80 | 0.896 | +0.096 |
| 0.85 | 0.939 | +0.089 |
| 0.90 | 0.972 | +0.072 |
| 0.95 | 0.993 | +0.043 |

Three 80%-accurate models in majority vote reach about 90% (89.6%), which can exceed a single frontier model on the same task for a fraction of the price. The gain peaks near p ≈ 0.79 and shrinks toward both ends: close to 1.0 there is little room left to improve, at p = 0.5 the vote gains nothing, and below 0.5 majority voting amplifies errors.

The same logic extends: 5 models at 70% accuracy reach about 84% (0.837) via majority vote.
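The binomial sum above generalizes to any odd number of voters. The following is a minimal sketch, assuming each model is simply right or wrong with the same probability p and that errors are independent; the function name `majority_accuracy` is ours, chosen for illustration.

```python
from math import comb


def majority_accuracy(p: float, n: int = 3) -> float:
    """Probability that a strict majority of n independent voters is correct."""
    if n % 2 == 0:
        raise ValueError("use an odd n so a strict majority always exists")
    k_min = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# Majority-of-3 accuracy for the individual accuracies discussed above.
for p in (0.70, 0.80, 0.85, 0.90, 0.95):
    print(f"p={p:.2f}  majority-of-3={majority_accuracy(p):.3f}")

# The 5-model case.
print(f"5 models at p=0.70: {majority_accuracy(0.70, 5):.3f}")
```

Real models rarely satisfy the equal-accuracy and independence assumptions exactly, so treat these numbers as an upper bound on what voting can deliver.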

Critical prerequisite: the models must make independent errors. Three hosted copies of Llama 3.3 don't give you independence — they give you one error with three invoices. We documented this in our 11-provider consensus study.
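To see how quickly correlation eats the gain, here is a toy model (our assumption, not a result from the study): with probability `rho` the three models behave as a single model and return the same answer, and otherwise they err independently. At `rho = 1` you are back to one model's accuracy, which is the "three hosted copies" case.

```python
def majority_of_3(p: float) -> float:
    """Majority-vote accuracy of three independent models, each with accuracy p."""
    return 3 * p**2 - 2 * p**3


def majority_with_shared_errors(p: float, rho: float) -> float:
    """Toy mixture: fully correlated with probability rho, independent otherwise."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be between 0 and 1")
    return rho * p + (1 - rho) * majority_of_3(p)


# How the consensus accuracy of 80%-accurate models falls as correlation rises.
for rho in (0.0, 0.5, 1.0):
    print(f"rho={rho:.1f}  consensus accuracy={majority_with_shared_errors(0.8, rho):.3f}")
```

The mixture is deliberately crude; it only illustrates that the voting gain scales with the share of errors that are actually independent.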

The Data

We put this to the test on a factual recall prompt from our earlier 11-provider benchmark: "Name 5 features added to Python 3.12 with their PEP numbers and descriptions." The correct set of PEPs: 669, 684, 688, 692, 695, 698, 701, 709.

Three budget-tier-or-specialized models' scores (out of 5):

The PEPs cited correctly by at least 2 of the 3 models (the majority-agreed set): PEP 684, 688, 695, 701. That is 4 correct PEPs with strong cross-provider agreement. The union of unique correct citations (cited correctly by at least one model): 5+ PEPs.
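The two sets described above (agreed by at least 2 models, and cited by any model) can be computed with a vote counter. This is a sketch only: the reference set of PEPs comes from this article, but the per-model citation sets below are made up to exercise the code and are not the benchmark's data. The model names and the `agreement_sets` helper are likewise hypothetical.

```python
from collections import Counter

# Reference set of correct PEPs, as listed in the article.
CORRECT_PEPS = {669, 684, 688, 692, 695, 698, 701, 709}

# Hypothetical citations per model, for illustration only (NOT the study's data).
citations = {
    "model_a": {684, 695, 701, 657},
    "model_b": {684, 695, 709, 657},
    "model_c": {695, 701, 692},
}


def agreement_sets(citations: dict, correct: set, min_votes: int = 2):
    """Return (PEPs cited correctly by >= min_votes models, PEPs cited correctly by any model)."""
    votes = Counter(pep for cited in citations.values() for pep in cited & correct)
    agreed = {pep for pep, n in votes.items() if n >= min_votes}
    return agreed, set(votes)


agreed, union = agreement_sets(citations, CORRECT_PEPS)
print("agreed by 2+ models:", sorted(agreed))
print("cited by any model: ", sorted(union))
```

Note that this scoring filters against a known reference set. In production you usually have no such set, so an error shared by two models (like the 657 entries in the made-up data) would pass a plain agreement check.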

Compare that to the most accurate single model in the same study: Perplexity Sonar at 5/5 (web search gave it an edge). Among frontier models answering from weights alone, Claude Sonnet, GPT-4o, and Gemini 3.1 Pro all scored 2-3/5 on the same prompt. A consensus of three specialized cheap models outperformed any single frontier model running from weights alone.

Cost Math

Published pricing for the three cheap-tier models on this benchmark:

Total cost of running all three in parallel: $0.000790 per prompt.

Compare to a single Claude Opus 4.1 call on the same prompt: ~$0.003500 (based on $15/M input + $75/M output at typical token counts).

The three-cheap-model consensus was 4.4x cheaper than one Opus call and more accurate on this task.
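The arithmetic behind that ratio, as a small sketch. The two per-prompt costs and the Opus per-token prices are the figures quoted above; the monthly volume and the token counts in the last line are hypothetical values chosen for illustration, since the article does not state them.

```python
def call_cost(input_tokens: int, output_tokens: int,
              usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost in USD of one call, given per-million-token prices."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


# Per-prompt figures quoted in the article.
CONSENSUS_COST = 0.000790  # three cheap models in parallel
OPUS_COST = 0.003500       # one Claude Opus 4.1 call

ratio = OPUS_COST / CONSENSUS_COST

# Hypothetical volume, for illustration only.
PROMPTS_PER_MONTH = 1_000_000
monthly_savings = (OPUS_COST - CONSENSUS_COST) * PROMPTS_PER_MONTH

print(f"consensus is {ratio:.1f}x cheaper")
print(f"savings at {PROMPTS_PER_MONTH:,} prompts/month: ${monthly_savings:,.2f}")

# Hypothetical token counts at the quoted Opus prices ($15/M input, $75/M output).
print(f"example Opus call: ${call_cost(100, 40, 15.0, 75.0):.6f}")
```

Because output tokens cost five times as much as input tokens at these prices, the per-call figure is dominated by response length, so measure your own token counts before relying on the ratio.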

Where This Pattern Wins

The consensus-of-cheap pattern dominates when:

  1. The task has verifiable sub-claims. Factual recall, structured extraction, multiple-choice reasoning. Majority vote only helps when "correct" is checkable.
  2. Model diversity is high. Pick models from different training pipelines. Our earlier analysis showed that three Llama hosts gave you one answer three times — that's not consensus.
  3. Latency budget tolerates parallel fan-out. Three parallel calls are almost as fast as one (limited by the slowest). Serial calls would blow your latency.
  4. Per-request volume is high. At one-off usage, the developer time saved by "just use the best model" may outweigh the savings.
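The latency point in item 3 can be checked with a small simulation. This sketch uses `asyncio.gather` with sleeping stand-ins for real providers; the provider names and latencies are invented, and no actual API is called.

```python
import asyncio
import time


async def fake_provider(name: str, latency_s: float) -> str:
    """Stand-in for a real provider call: sleeps, then returns a canned answer."""
    await asyncio.sleep(latency_s)
    return f"{name}: answer"


async def fan_out(latencies: dict) -> tuple:
    """Call every provider concurrently; return (answers in input order, wall time)."""
    start = time.perf_counter()
    answers = await asyncio.gather(
        *(fake_provider(name, secs) for name, secs in latencies.items())
    )
    return list(answers), time.perf_counter() - start


# Hypothetical per-provider latencies in seconds.
latencies = {"p1": 0.05, "p2": 0.10, "p3": 0.15}
results, elapsed = asyncio.run(fan_out(latencies))

print(results)
print(f"parallel wall time: {elapsed:.2f}s, serial would be {sum(latencies.values()):.2f}s")
```

The wall time tracks the slowest provider rather than the sum, which is why the pattern depends on parallel rather than serial calls.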

Where It Loses

This pattern is not a silver bullet. It fails:

  1. When individual model accuracy is below ~60%. At 60%, majority-of-3 reaches only 64.8%. At 50% it is exactly 50%: no signal. Below 50%, the vote is worse than a single model.
  2. On subjective tasks. "Write the most engaging headline" has no majority-vote answer — three models produce three different headlines, all plausible. Quality is spread, not binary.
  3. When errors are correlated. Shared training data produces shared hallucinations; we showed this with the PEP 657 error (6 of 11 models misattributed it). Majority vote over correlated-error models reinforces the error.
  4. When latency must be sub-second. Fan-out waits on the slowest of the three providers and adds aggregation overhead on top.

How to Build This

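The Promptster comparison endpoint gives you parallel N-provider fan-out in one call. See our LLM router tutorial for the full routing pattern. The consensus-vote layer is a few more lines of Python:

# Pseudocode: compare() and extract_answer() stand in for your fan-out call and answer parser
from collections import Counter

results = compare(prompt, providers=[p1, p2, p3])
answers = [r.extract_answer() for r in results]
answer, votes = Counter(answers).most_common(1)[0]  # most frequent answer and its vote count
consensus = answer if votes >= 2 else None  # require a real majority, not a 1-1-1 plurality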

For non-trivial cases (extraction into structured fields, numerical answers), the "majority vote" step becomes schema-level — majority per field, not per whole response. That's where consensus analysis in Promptster earns its keep: it does the schema-level aggregation for you.
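A field-level vote can be sketched in a few lines. This is an illustration of the idea under our own assumptions, not Promptster's implementation: the `fieldwise_majority` helper, the invoice fields, and the sample records are all hypothetical, and field values must be hashable.

```python
from collections import Counter


def fieldwise_majority(records: list, min_votes: int = 2) -> dict:
    """Vote per field across model outputs; a field with no majority becomes None."""
    fields = {key for record in records for key in record}
    consensus = {}
    for field in sorted(fields):
        values = [record[field] for record in records if field in record]
        value, votes = Counter(values).most_common(1)[0]
        consensus[field] = value if votes >= min_votes else None
    return consensus


# Hypothetical structured extractions from three models.
records = [
    {"vendor": "Acme", "total": 120.50, "currency": "USD", "po": "A1"},
    {"vendor": "Acme", "total": 120.50, "currency": "EUR", "po": "A2"},
    {"vendor": "ACME Corp", "total": 125.00, "currency": "USD", "po": "A3"},
]

print(fieldwise_majority(records))
```

Voting per field recovers a full record even though no two models agree on the whole response, and the `None` entries tell you which fields need a fallback such as a stronger model or human review.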

The Takeaway

"Pick the best model" is a simpler story than "run three cheaper models in parallel and majority-vote the answer." It's also strictly worse for many real workloads. The 4x price spread between a cheap-model consensus and a single frontier call buys you accuracy, not the other way around — when you pick independent cheap models on a verifiable task.

For more on when cheap models work, see our 300x cost-quality analysis. For when they don't, see the task-type decision framework.


Data from 2026-04-18 run. Individual model accuracies based on single-trial scores; production use should validate with multi-trial runs against your own reference data.