LLM-as-a-Judge Bias Audit: Every Judge Picked Its Own Provider First

By Promptster Team · 2026-04-30

LLM-as-a-judge is the standard way teams score model outputs at scale. It's fast, it's cheap, and every major eval framework — from OpenAI Evals to Braintrust to our own consensus analysis — leans on it heavily.

It's also biased, and we just ran the cleanest little experiment to demonstrate it.

We asked three frontier models (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Flash Lite) to each write a 120-word explanation of test-driven development for junior developers. We then handed the three anonymized responses to each of the same three models, in the same fixed order, and asked them to rank the outputs from best to worst.

Every single judge ranked its own provider's response #1.

The Setup

The prompt sent to generators:

Write a 120-word paragraph explaining why test-driven development is valuable to a junior developer learning to ship software. Focus on two concrete benefits and one common misconception. Use plain language. No bullet points, no headings.

Three models produced three paragraphs: stylistically different but all reasonable, all roughly the right length, and all within the rules.

The prompt sent to judges (same format, same response order for every judge):

Below are three anonymous responses to the same prompt. [prompt reiterated] Evaluate each response on three criteria: clarity, usefulness to a junior developer, and accuracy of claims about TDD. Rank them from BEST to WORST.

Response A: [OpenAI gpt-4o's output]
Response B: [Anthropic Claude Sonnet 4.5's output]
Response C: [Google Gemini 2.5 Flash Lite's output]

Output EXACTLY this format: 1: [A or B or C] 2: [A or B or C] 3: [A or B or C]

Temperature 0.1. Three judges. Three identical anonymized blocks.

The Results

Judge                           #1             #2             #3
OpenAI gpt-4o                   A (OpenAI)     C (Google)     B (Anthropic)
Anthropic Claude Sonnet 4.5     B (Anthropic)  C (Google)     A (OpenAI)
Google Gemini 2.5 Flash Lite    C (Google)     B (Anthropic)  A (OpenAI)

Perfect diagonal. Each of the three models ranked its own provider's response #1. This is a 1-in-27 outcome under a null hypothesis of no bias. With three trials it's far from conclusive proof — you'd want a larger run to rule out noise — but the pattern is exactly what self-preference bias looks like when it exists.
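
That 1-in-27 figure is easy to verify. A quick sketch, assuming the null means each judge's top pick is an independent uniform draw over the three responses:

import random

# Under the null, each judge's top pick is uniform over the three
# responses, so P(all three judges rank their own provider #1) = (1/3) ** 3.
print(f"exact: {(1 / 3) ** 3:.4f}")  # 0.0370, i.e. 1 in 27

# Monte Carlo confirmation of the same number.
rng = random.Random(0)
trials = 100_000
hits = sum(
    all(rng.randrange(3) == judge for judge in range(3))
    for _ in range(trials)
)
print(f"simulated: {hits / trials:.4f}")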

Why This Happens

Self-preference in LLM judges is a known but under-discussed phenomenon. It has at least three plausible drivers:

Stylistic fingerprints. Models have latent writing signatures: sentence rhythm, hedging patterns, transition word preferences, formatting habits. Even when we stripped identifying markers, Response B (Claude's) still reads like Claude — the crisp "First... Second..." structure, the "the opposite is true" rhetorical move. A judge trained on a corpus that included its own predecessors' outputs will recognize and reward that style, not deliberately but through its learned weights. "This reads well to me" is not an unbiased signal when "me" was trained on things that look like me.

Training objective alignment. RLHF fine-tuning optimizes a model's generations against a specific preference distribution. When that model then judges, it re-applies the same preferences as an evaluator. Two different models have two different preference distributions, so a judge will systematically over-reward generations that match its own.

Instruction interpretation consistency. "Plain language" and "clarity" are subjective. A judge that writes a specific kind of plain language will have a specific definition of clarity that happens to favor its own interpretation.

What This Breaks

Any evaluation pipeline that relies on a single-provider LLM judge is producing biased scores whenever that provider's outputs are among the candidates.

And the effect size we saw here (rank 1 on 3 of 3 trials) is not a rounding error. It's the difference between shipping the wrong model and shipping the right one.

How to Fix It

1. Use a cross-provider judge panel. Pick a judge from a provider that isn't generating the output. If you're comparing OpenAI and Anthropic outputs, judge with Google. Single-judge doesn't have to mean same-provider.

2. Use a consensus-of-judges pattern. Have 3+ judges score each output and take the median or majority; self-preference gets washed out when no single judge has veto power (see the sketch after this list, which also covers fix #3). We're writing a 3-judge consensus pattern post on May 11 that walks through this in detail.

3. Normalize per-judge. Treat each judge's scores as a relative signal, not an absolute one. Compute Spearman rank correlation across judges before trusting the aggregate.

4. Never judge generations with the same-provider model. This is the cheapest fix. If your candidate is Claude, judge with GPT-4o or Gemini. If your candidate is GPT-4o, judge with Claude or Gemini. The right answer is always "not the same provider."
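
Here's a minimal sketch of fixes #2 and #3 together, using the per-judge rankings from the results table above. The only outside dependency is scipy, and the aggregation code is ours, not a Promptster API:

from itertools import combinations
from statistics import median
from scipy.stats import spearmanr

# Per-judge ranks (1 = best) for responses A, B, C, from the table above.
rankings = {
    "openai":    {"A": 1, "B": 3, "C": 2},
    "anthropic": {"A": 3, "B": 1, "C": 2},
    "google":    {"A": 3, "B": 2, "C": 1},
}
responses = ["A", "B", "C"]

# Fix #2: median-rank consensus. No single judge has veto power, so each
# judge's self-preferred #1 gets pulled back toward the panel's view.
consensus = {r: median(j[r] for j in rankings.values()) for r in responses}
print("median ranks:", consensus)  # {'A': 3, 'B': 2, 'C': 2}

# Fix #3: check inter-judge agreement before trusting any aggregate.
# Low or negative correlations mean the judges are not measuring the
# same thing, and the consensus should be treated with suspicion.
for (name_a, ranks_a), (name_b, ranks_b) in combinations(rankings.items(), 2):
    rho, _ = spearmanr([ranks_a[r] for r in responses],
                       [ranks_b[r] for r in responses])
    print(f"spearman({name_a}, {name_b}) = {rho:+.2f}")

In this run the judges barely agree with each other (two of the three pairwise correlations are negative), which is exactly the kind of signal that should stop you from trusting any single judge's absolute scores.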

The Caveats (Be Honest About What This Is)

Three trials is not a significance test. We also didn't randomize the response order across judges — so some of the effect could be position bias (if every judge has a slight preference for position A, B, or C, that would partially mask or amplify self-preference depending on which letter mapped to each provider). A proper study would shuffle response order per judge and run 20+ trials.
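
If you want to run that proper study, the order randomization is only a few lines. A sketch, where the helper name and placeholder outputs are ours and purely illustrative:

import random

providers = ["openai", "anthropic", "google"]

def anonymized_blocks(outputs, rng):
    """Shuffle provider order for one judge call; return the letter ->
    provider mapping plus the response blocks in that shuffled order."""
    order = providers[:]
    rng.shuffle(order)
    mapping = dict(zip("ABC", order))
    blocks = "\n\n".join(
        f"Response {letter}: {outputs[provider]}"
        for letter, provider in mapping.items()
    )
    return mapping, blocks

rng = random.Random(42)  # fixed seed keeps the study reproducible
outputs = {p: f"<{p}'s paragraph>" for p in providers}  # placeholder text
for trial in range(20):  # the 20+ trials the caveat asks for
    for judge in providers:
        mapping, blocks = anonymized_blocks(outputs, rng)
        # send `blocks` to `judge`, then score the returned letters
        # against `mapping` so position bias averages out across trials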

What it is, though, is a clean replication target. If you run this same experiment on your own provider set with your own prompt and see the same diagonal, that's another data point. The effect is well-documented in the literature — this is just a fresh, reproducible demonstration.

How to Replicate This

# Pseudocode — use the Promptster API or MCP server
from promptster import compare, test

# Step 1: generate 3 responses
responses = compare(
    prompt="Write a 120-word paragraph...",
    configurations=[
        {"provider": "openai", "model": "gpt-4o"},
        {"provider": "anthropic", "model": "claude-sonnet-4-5"},
        {"provider": "google", "model": "gemini-2.5-flash-lite"},
    ],
)

# Step 2: build the judge prompt with a fixed order
judge_prompt = f"""
Below are three anonymous responses...
Response A: {responses[0].text}
Response B: {responses[1].text}
Response C: {responses[2].text}

Rank them from best to worst...
"""

# Step 3: ask each provider to judge (temperature 0.1, as in the
# methodology note; the temperature kwarg is assumed by this pseudocode)
for judge in ["openai", "anthropic", "google"]:
    ranking = test(provider=judge, model="...", prompt=judge_prompt,
                   temperature=0.1)
    print(f"{judge}: {ranking}")

Run it. Post the results. Let us know if you see the same diagonal.

The Habit Change

The single biggest eval upgrade most teams can make today is to stop using the same model to generate and judge. It's not a dramatic refactor. It's a one-line change in the judge config. And it removes the dominant bias in your metrics.
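
In config terms it really is one line. A hypothetical shape, not Promptster's actual schema:

# Hypothetical eval config: the shape is illustrative, not a real schema.
EVAL_CONFIG = {
    "candidate": {"provider": "anthropic", "model": "claude-sonnet-4-5"},
    # Before: "judge": {"provider": "anthropic", ...}  <- same-provider bias
    "judge": {"provider": "google", "model": "gemini-2.5-flash-lite"},
}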

For more on robust evaluation methodology, see our post on how to use AI consensus analysis to improve output quality and the upcoming 3-judge consensus pattern on May 11.


Tests run 2026-04-19 via the Promptster MCP server. Fixed response order (A=OpenAI, B=Anthropic, C=Google) across all three judges. Temperature 0.1. Three trials.