We Ran the Same Prompt Across 11 AI Providers. Here's What Consensus Actually Looks Like.
By Promptster Team · 2026-04-26
We ran the same prompt across all 11 AI providers Promptster supports. One got it fully right. Six of the eleven made the same wrong claim with total confidence. One 8B model invented feature ↔ PEP pairings out of thin air.
This is what real multi-provider consensus looks like — not the polished diagram in a vendor pitch deck, but the messy, opinionated, occasionally embarrassing data that falls out when you actually run the test.
The Prompt
We picked a factual prompt with multiple verifiable sub-claims:
Name 5 features added to Python in version 3.12. For each feature, include the PEP number and one sentence describing what it does.
Python 3.12 is a good test case. It's recent enough that not every model has well-compressed knowledge of it, old enough that the answer isn't contested, and each PEP number is a specific, verifiable datapoint. With 5 features × 2 key claims (name + PEP) × 11 models, we got 110 factual datapoints to compare.
Temperature was set to 0.2 for every provider. Max tokens 600. Each request was sent once, no retries, no re-prompts.
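Most of the providers in this test expose OpenAI-compatible chat endpoints, so the fan-out itself is short. Here's a minimal sketch of the setup, assuming the `openai` Python SDK; the base URLs and environment-variable names are illustrative and should be checked against each provider's docs:

```python
import os
from openai import OpenAI

# Illustrative subset of the providers tested. Base URLs and env-var
# names are assumptions -- verify against each provider's documentation.
PROVIDERS = {
    "openai":     ("https://api.openai.com/v1",      "gpt-4o-mini"),
    "groq":       ("https://api.groq.com/openai/v1", "llama-3.3-70b-versatile"),
    "deepseek":   ("https://api.deepseek.com",       "deepseek-chat"),
    "perplexity": ("https://api.perplexity.ai",      "sonar"),
}

PROMPT = (
    "Name 5 features added to Python in version 3.12. For each feature, "
    "include the PEP number and one sentence describing what it does."
)

def run_all(prompt: str) -> dict[str, str]:
    """Send the same prompt to every provider once: temperature 0.2,
    max 600 tokens, no retries -- matching the test methodology."""
    results = {}
    for name, (base_url, model) in PROVIDERS.items():
        client = OpenAI(base_url=base_url,
                        api_key=os.environ[f"{name.upper()}_API_KEY"])
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
            max_tokens=600,
        )
        results[name] = resp.choices[0].message.content
    return results
```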
The Providers
| Provider | Model |
|---|---|
| OpenAI | gpt-4o-mini |
| Anthropic | claude-haiku-4-5-20251001 |
| Google | gemini-2.5-flash-lite |
| DeepSeek | deepseek-chat |
| Groq | llama-3.3-70b-versatile |
| Mistral | mistral-large-latest |
| xAI | grok-3 |
| Perplexity | sonar |
| Together AI | meta-llama/Llama-3.3-70B-Instruct-Turbo |
| Cerebras | llama3.1-8b |
| Fireworks AI | accounts/fireworks/models/llama-v3p3-70b-instruct |
The Results
Accuracy = correct feature ↔ PEP pairing, graded against the official Python 3.12 release notes.
| Provider | Model | Correct | Latency | Cost | Tokens/sec |
|---|---|---|---|---|---|
| Perplexity | sonar | 5 / 5 | 1,989 ms | $0.000304 | 88 |
| DeepSeek | deepseek-chat | 4 / 5 | 7,174 ms | $0.000106 | 23 |
| Mistral | mistral-large-latest | 4 / 5 | 4,310 ms | $0.000380 | 49 |
| xAI | grok-3 | 4 / 5 | 3,176 ms | $0.002844 | 52 |
| Google | gemini-2.5-flash-lite | 3 / 5 | 1,442 ms | $0.000082 | 120 |
| Anthropic | claude-haiku-4-5 | ~2 / 5 | 4,538 ms | $0.001111 | 42 |
| OpenAI | gpt-4o-mini | 0 / 5 | 4,216 ms | $0.000128 | 43 |
| Groq | llama-3.3-70b-versatile | 0 / 5 | 860 ms | $0.000227 | 192 |
| Together | Llama-3.3-70B-Instruct-Turbo | 0 / 5 | 8,108 ms | $0.000288 | 20 |
| Fireworks | llama-v3p3-70b-instruct | 0 / 5 | 1,067 ms | $0.000291 | 150 |
| Cerebras | llama3.1-8b | 0 / 5 | 290 ms | $0.000000 | 524 |
Cost spread: 35x from the cheapest paid answer ($0.000082) to the most expensive ($0.002844). Latency spread: 28x from the fastest (290 ms) to the slowest (8,108 ms). Accuracy spread: 0/5 to 5/5 on the same prompt.
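Grading is mechanical once you have a gold set from the release notes. Here's a minimal scoring sketch, abbreviated to the four consensus PEPs discussed later in this post; the real grading also checked the feature name paired with each PEP, not just the number:

```python
import re

# Gold PEP numbers from the official Python 3.12 release notes,
# abbreviated here to the four consensus PEPs discussed below.
GOLD_PEPS = {684, 688, 695, 701}

def cited_peps(response: str) -> set[int]:
    """Extract every PEP number a model cited, e.g. 'PEP 695'."""
    return {int(m) for m in re.findall(r"PEP\s*(\d{3,4})", response)}

def score(response: str) -> int:
    """Count cited PEPs that are genuinely Python 3.12 features."""
    return len(cited_peps(response) & GOLD_PEPS)
```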
Finding #1: The Same Weights Don't Give You Consensus
Groq, Together AI, and Fireworks AI all ran the same model: Llama 3.3 70B Instruct. All three got 0/5. All three listed Python 3.11 features. All three confidently named the same wrong PEPs: PEP 654 (Exception Groups), PEP 673 (Self Type), PEP 646 (Variadic Generics), PEP 657 (Fine-Grained Error Locations).
If you run your "consensus" check against three providers hosting the same open-weight model, you don't have three opinions. You have one opinion with three invoices. The statistical independence that makes cross-checking work comes from different training data, different architectures, different teams — not from different API endpoints.
For consensus to mean anything, pick providers that run distinct models. OpenAI, Anthropic, Google, DeepSeek, and Mistral all train their own. Running Llama across three Llama hosts gives you speed and price diversity but near-zero epistemic diversity.
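In code, that means deduplicating by underlying model before counting votes. A sketch, assuming you track which weights each provider actually serves:

```python
# Which underlying model each provider served in this test.
MODEL_FAMILY = {
    "groq": "llama-3.3-70b",
    "together": "llama-3.3-70b",
    "fireworks": "llama-3.3-70b",
    "openai": "gpt-4o-mini",
    "anthropic": "claude-haiku-4-5",
    "deepseek": "deepseek-chat",
}

def independent_votes(answers: dict[str, str]) -> dict[str, str]:
    """Keep one answer per model family: three hosts serving the same
    open-weight model contribute one opinion, not three."""
    votes: dict[str, str] = {}
    for provider, answer in answers.items():
        family = MODEL_FAMILY.get(provider, provider)
        votes.setdefault(family, answer)
    return votes
```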
Finding #2: Shared Training Data Produces Shared Hallucinations
Six of eleven models cited PEP 657 as a Python 3.12 feature.
PEP 657 is real. It's called "Include Fine-Grained Error Locations in Tracebacks." It was accepted for Python 3.11 and shipped in October 2022. It's one of the most visible improvements users noticed in 3.11, which is probably why it shows up disproportionately in training data that discusses Python 3.12 ("this builds on the 3.11 improvements from PEP 657...").
When you ask six models the same question and they agree, agreement looks like confidence. But if they all learned from the same mis-attributed blog posts and Stack Overflow answers, that agreement is just an echo. This is why cross-checking models with shared training data can give you false confidence — and why multi-model verification is a starting point, not a stopping point.
Finding #3: Smaller Models Don't Just Make Mistakes — They Make Things Up
Cerebras hosts llama3.1-8b, which replied in 290 milliseconds at 524 tokens/second for $0.00. It listed:
- PEP 634 (Structural Pattern Matching — actually Python 3.10)
- PEP 647 (User-Defined Type Guards — actually Python 3.10)
- PEP 656 (actually "Platform Tag for Linux Distributions Using Musl", a packaging PEP)
- PEP 655 (Required and NotRequired in TypedDict — actually Python 3.11)
- PEP 641 (a rejected proposal that was never implemented)
It confidently paired PEP numbers with feature names it had invented. PEP 656 is not about "Literal Types." PEP 641 is not about enum members. These are hallucinations in the purest sense — plausible-sounding claims with no grounding.
Speed and cost were unbeatable. Usefulness on a factual recall task was zero. If you need to retrieve facts, use a model sized for facts. If you're just transforming text you already have in the prompt, small models are fine. Know which task you're running.
Finding #4: Web-Connected Models Change the Game
Perplexity's sonar was the only 5/5. It's also the only model in the test that performs live web retrieval. It cited sources inline (we stripped the inline [1] citation markers before grading). The other ten models were answering from weights.
This matters for how you interpret "consensus." If one model is reading the docs in real time and ten others are remembering training data from 18 months ago, consensus across the ten is not consensus — it's a memory test. For time-sensitive factual questions, you either need a web-connected model in the mix or you need to provide the authoritative source in the prompt.
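Supplying the source is the cheaper of those two options: paste the relevant section of the release notes into the prompt and the task becomes extraction instead of recall. A minimal sketch (the wrapper format is our own convention, not anything the models require):

```python
def grounded_prompt(question: str, source_text: str) -> str:
    """Turn a recall question into an extraction question by
    including the authoritative text in the prompt itself."""
    return (
        "Answer using ONLY the reference text below. "
        "If the text does not contain the answer, say so.\n\n"
        f"--- REFERENCE ---\n{source_text}\n--- END REFERENCE ---\n\n"
        f"Question: {question}"
    )
```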
Finding #5: Speed and Cost Don't Predict Accuracy
The fastest answer was Cerebras at 290 ms — wrong. The slowest was Together at 8,108 ms — wrong. The cheapest paid answer was Gemini at $0.000082 — 3/5. The most expensive was xAI grok-3 at $0.002844 — 4/5. Perplexity, at $0.000304, cost about 11% of grok-3's price and got more right.
There's a rough correlation between model capability and accuracy, but no reliable correlation with speed or price tier. Your procurement department's instinct to "just use the cheapest" fails about as often as "just use the most expensive."
Where Consensus Actually Emerged
Four PEPs had strong cross-provider agreement (3+ correct citations), and all four are genuinely Python 3.12 features:
- PEP 684 (A Per-Interpreter GIL) — 4 models cited correctly
- PEP 695 (Type Parameter Syntax) — 5 models cited correctly
- PEP 701 (Formalized f-strings) — 6 models cited correctly
- PEP 688 (Buffer protocol accessible in Python) — 4 models cited correctly
This is the useful consensus signal. When you're running a multi-provider check, items that multiple architecturally distinct models converge on are your high-confidence facts. Items where models diverge are your verify-these flags. Items the whole group gets wrong — like PEP 657 here — are the reminder that consensus isn't truth.
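Mechanically, that triage is a tally with a threshold. A sketch that splits cited PEPs into high-confidence and verify-these buckets, with the PEP 657 caveat baked in as a comment:

```python
import re
from collections import Counter

def consensus_buckets(answers: dict[str, str], threshold: int = 3):
    """Tally PEP citations across (deduplicated) model answers and
    split them into high-confidence facts and verify-these flags."""
    counts = Counter(
        int(pep)
        for text in answers.values()
        for pep in re.findall(r"PEP\s*(\d{3,4})", text)
    )
    agreed = {pep for pep, n in counts.items() if n >= threshold}
    flagged = {pep for pep, n in counts.items() if n < threshold}
    # Caveat from Finding #2: six models "agreed" on PEP 657 and were
    # all wrong. Agreement is a signal to weight, not a proof of truth.
    return agreed, flagged
```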
How to Run This Yourself
This entire comparison took about 30 seconds of active testing through Promptster's comparison view. You can replicate it in a few ways:
Via the web app: open the comparison view, add up to 11 providers, paste your prompt, and use consensus analysis to auto-synthesize agreement and divergence.
Via the MCP server: tools like Claude Code, Cursor, and Windsurf can call compare_prompts directly. See MCP server setup for Cursor.
Via the API: POST /v1/prompts/compare with up to 5 configurations. See the API documentation.
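For the API route, a rough sketch of the request shape. Only the path and the five-configuration limit come from the docs above; the base URL, auth header, and payload field names here are assumptions, so check the API documentation for the real schema:

```python
import os
import requests

# Base URL, auth scheme, and field names are illustrative assumptions;
# only the /v1/prompts/compare path and the 5-configuration limit
# come from the documentation.
resp = requests.post(
    "https://api.promptster.example/v1/prompts/compare",
    headers={"Authorization": f"Bearer {os.environ['PROMPTSTER_API_KEY']}"},
    json={
        "prompt": "Name 5 features added to Python in version 3.12. ...",
        "configurations": [
            {"provider": "openai", "model": "gpt-4o-mini"},
            {"provider": "anthropic", "model": "claude-haiku-4-5-20251001"},
            {"provider": "perplexity", "model": "sonar"},
        ],
        "temperature": 0.2,
        "max_tokens": 600,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```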
What This Changes
Single-provider benchmarks are suspicious on principle now. Any claim of the form "model X is the best at Y" deserves the question: compared to what, tested how, on whose data? Multi-provider testing is uncomfortable because it shows you where your favorite model is wrong. That discomfort is the value.
When you ship AI features to real users, you're not publishing a benchmark. You're making a promise about quality. The cheapest way to keep that promise is to check your work against models your users might ask the same question to — and catch the divergences before they catch you.
For a deeper look at the consensus methodology we use internally, see how to use AI consensus analysis to improve output quality. For when consensus goes wrong — as it did here with PEP 657 — see detecting AI hallucinations with multi-model cross-checking.
Test run: 2026-04-18. Temperature 0.2, max tokens 600. Raw response data available on request. Models evolve; results are a snapshot in time.