How Promptster Scores Models: The 4 Judge Dimensions, Explained

By Promptster Team · 2026-06-15

When Promptster shows you a quality score next to each model's response, that number didn't come from a benchmark or a vibe. It came from another LLM reading the responses and grading them against a fixed rubric. That's powerful — and it's also a place where people deserve to know exactly what's happening.

So here's the full transparency version: how Promptster's scoring actually works, what the four dimensions mean, how the overall number is computed, and which biases we mitigate versus the ones no automated judge fully escapes.

If you've read our LLM-as-a-judge bias audit, you already know the punchline of that genre: judges aren't neutral. This post is about what we do with that fact.

The Four Dimensions

Every scored response gets graded on four dimensions, each on a 1-5 scale where 5 is best. These are the exact dimensions and definitions baked into the judge prompt:

| Dimension | The question it answers |
| --- | --- |
| Relevance | How well does the response address the original prompt? |
| Accuracy | How factually correct is the response? |
| Completeness | How thoroughly does the response cover the topic? |
| Clarity | How well-written and easy to understand is the response? |

These four are deliberately orthogonal-ish. A response can be perfectly relevant and clear while being factually wrong (high relevance + clarity, low accuracy). It can be accurate but truncated (high accuracy, low completeness). Splitting the score into four axes is what lets you see why a model lost, not just that it lost.

How The Overall Score Is Computed

The headline number is a plain mean:

overall = (relevance + accuracy + completeness + clarity) / 4

rounded to one decimal. No hidden weighting, no secret sauce. A response scoring 5/4/5/4 gets an overall of 4.5.
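For readers who want to reproduce the headline number themselves, here is a minimal Python sketch of that formula. The function name `overall_score` and the range check are ours, not Promptster's code. The rounding mode is also an assumption: we use half-up because the illustrative table later in this post shows a mean of 4.25 as 4.3, and the post doesn't otherwise specify one.

```python
from decimal import Decimal, ROUND_HALF_UP

DIMENSIONS = ("relevance", "accuracy", "completeness", "clarity")


def overall_score(scores: dict[str, int]) -> float:
    """Unweighted mean of the four dimension scores, rounded to one decimal."""
    for name in DIMENSIONS:
        value = scores[name]
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    total = sum(scores[name] for name in DIMENSIONS)
    mean = Decimal(total) / Decimal(len(DIMENSIONS))
    # Assumption: half-up rounding. Python's built-in round() sends exact
    # halves to the even digit, so round(4.25, 1) would give 4.2 instead.
    return float(mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP))


print(overall_score({"relevance": 5, "accuracy": 4, "completeness": 5, "clarity": 4}))  # 4.5
```

The rounding detail only matters when a mean lands exactly on a .x5 boundary, which with four integer scores happens whenever the sum leaves a remainder of 1 or 3 when divided by 4.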

We chose an unweighted mean on purpose. Weighting (e.g., "accuracy counts double") sounds smart but bakes our priorities into your eval. You know your task. If accuracy matters more than clarity for your use case, you can read the per-dimension breakdown and weight it yourself — the four numbers are always exposed, never collapsed away.
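If you do want to apply your own weights to the exposed breakdown, a sketch like the following is enough. This is something you would run on your side; `weighted_overall` and the example weights are hypothetical and not part of Promptster.

```python
DIMENSIONS = ("relevance", "accuracy", "completeness", "clarity")


def weighted_overall(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of the four dimension scores using caller-supplied weights."""
    total_weight = sum(weights[name] for name in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(scores[name] * weights[name] for name in DIMENSIONS)
    return weighted_sum / total_weight


scores = {"relevance": 5, "accuracy": 4, "completeness": 5, "clarity": 4}
# Hypothetical weighting: accuracy counts double.
accuracy_heavy = {"relevance": 1, "accuracy": 2, "completeness": 1, "clarity": 1}
print(weighted_overall(scores, accuracy_heavy))  # 4.4
```

With equal weights this reduces to the plain mean (4.5 for the same scores), so the difference between the two numbers shows how much your weighting moves the result.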

What The Judge Actually Sees

The judge model receives the original prompt plus every response, formatted like this (one block per model):

### Response 1 (anthropic / claude-opus-4-6)
<full response text>

---

### Response 2 (openai / gpt-5.2)
<full response text>
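As a rough illustration, the response blocks above could be assembled like this. The `ModelResponse` class and `format_responses` function are hypothetical names of ours. How the original prompt and rubric are attached isn't shown in this post, so the sketch covers only the response blocks.

```python
from dataclasses import dataclass


@dataclass
class ModelResponse:
    provider: str
    model: str
    text: str


def format_responses(responses: list[ModelResponse]) -> str:
    """Render one labeled block per response, separated by a '---' rule."""
    blocks = [
        f"### Response {index} ({r.provider} / {r.model})\n{r.text}"
        for index, r in enumerate(responses, start=1)
    ]
    return "\n\n---\n\n".join(blocks)


print(format_responses([
    ModelResponse("anthropic", "claude-opus-4-6", "<full response text>"),
    ModelResponse("openai", "gpt-5.2", "<full response text>"),
]))
```

A string in this shape, together with the original prompt, is what goes to the judge model.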

The judge returns structured JSON, one entry per response, with the four dimension scores, the computed overall, and a one- or two-sentence justification:

{
  "scores": [
    {
      "provider": "anthropic",
      "model": "claude-opus-4-6",
      "relevance": 5, "accuracy": 4, "completeness": 5, "clarity": 4,
      "overall": 4.5,
      "justification": "Directly addresses the prompt and is well-structured; one factual claim is unsupported."
    }
  ]
}

The justification field is not decoration. It's the audit trail. If a score surprises you, the justification tells you which dimension drove it — and whether the judge's reasoning is one you trust.
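If you consume this JSON yourself, it's worth sanity-checking it before trusting the numbers. Here is a minimal sketch, assuming the field names shown above. The function name, the tolerance, and the specific checks are our own choices for illustration, not a description of Promptster's internal validation.

```python
import json

DIMENSIONS = ("relevance", "accuracy", "completeness", "clarity")


def validate_judge_output(raw: str) -> list[dict]:
    """Parse judge JSON and check scores, overall, and justification per entry."""
    entries = json.loads(raw)["scores"]
    for entry in entries:
        label = f"{entry['provider']} / {entry['model']}"
        for name in DIMENSIONS:
            value = entry[name]
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{label}: {name} must be an integer from 1 to 5")
        mean = sum(entry[name] for name in DIMENSIONS) / len(DIMENSIONS)
        # Assumed tolerance: 0.05 covers rounding to one decimal in either direction.
        if abs(entry["overall"] - mean) > 0.05 + 1e-9:
            raise ValueError(f"{label}: overall {entry['overall']} does not match mean {mean}")
        if not str(entry.get("justification", "")).strip():
            raise ValueError(f"{label}: missing justification")
    return entries
```

A failed check doesn't tell you which score is right, only that the judge's output is internally inconsistent and deserves a second look.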

An Illustrative Example (Hypothetical)

To make the rubric concrete, here's a made-up scoring of three responses to "Explain idempotency in REST APIs in under 100 words." These numbers are illustrative, not from a real run:

| Model | Relevance | Accuracy | Completeness | Clarity | Overall |
| --- | --- | --- | --- | --- | --- |
| Model A | 5 | 5 | 4 | 5 | 4.8 |
| Model B | 5 | 3 | 5 | 4 | 4.3 |
| Model C | 4 | 5 | 3 | 5 | 4.3 |

Notice B and C tie on overall (4.3) but for opposite reasons — B is thorough but slipped on a fact; C is accurate but thin. The mean hides that; the four dimensions reveal it. Always read the breakdown, not just the headline. For a real, measured leaderboard, run your own comparison — illustrative tables like this one prove nothing about actual model quality.
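To make "read the breakdown" concrete, here is a small sketch that finds tied overall scores and lists the dimensions where the tied responses differ. It uses the same hypothetical numbers as the table above, and `explain_ties` is our own illustrative helper, not a Promptster feature.

```python
from itertools import combinations

DIMENSIONS = ("relevance", "accuracy", "completeness", "clarity")

# Hypothetical scores copied from the illustrative table, not from a real run.
rows = {
    "Model A": {"relevance": 5, "accuracy": 5, "completeness": 4, "clarity": 5, "overall": 4.8},
    "Model B": {"relevance": 5, "accuracy": 3, "completeness": 5, "clarity": 4, "overall": 4.3},
    "Model C": {"relevance": 4, "accuracy": 5, "completeness": 3, "clarity": 5, "overall": 4.3},
}


def explain_ties(rows: dict[str, dict]) -> list[str]:
    """For each pair with equal overall scores, list the dimensions that differ."""
    notes = []
    for (name_a, a), (name_b, b) in combinations(rows.items(), 2):
        if a["overall"] != b["overall"]:
            continue
        diffs = [f"{d}: {a[d]} vs {b[d]}" for d in DIMENSIONS if a[d] != b[d]]
        notes.append(f"{name_a} and {name_b} tie at {a['overall']} ({', '.join(diffs)})")
    return notes


for note in explain_ties(rows):
    print(note)
# Model B and Model C tie at 4.3 (relevance: 5 vs 4, accuracy: 3 vs 5, completeness: 5 vs 3, clarity: 4 vs 5)
```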

The Bias Problem (And What We Do About It)

A single LLM judge is biased. We've measured it. The dominant failure mode is self-preference — a judge tends to rank its own provider's output highest. Our bias audit found a clean 3-for-3 self-preference result across major providers.

Here are the mitigations Promptster applies, and their honest limits:

| Bias | Mitigation | Limit |
| --- | --- | --- |
| Self-preference | Default scoring uses a judge from a different family than the responses; multi-judge consensus available | Not eliminated, only diluted |
| Verbosity bias | Rubric scores Completeness and Clarity separately, so "longer" doesn't auto-win | Judges still drift toward longer answers |
| Position bias | Provider/model labels are explicit, reducing reliance on order | We don't randomize order on every run by default |
| Authority bias | Justification field forces the judge to cite a specific reason | Confident-but-wrong text can still fool it |
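The self-preference mitigation is simple to express in code. The sketch below is one way to pick a cross-family judge, not Promptster's actual selection logic: it treats the provider name as the "family", and the function name and the third candidate in the list are placeholders we made up.

```python
def pick_cross_family_judge(
    response_providers: set[str],
    candidate_judges: list[tuple[str, str]],
) -> tuple[str, str]:
    """Return the first (provider, model) judge whose provider wrote none of the responses."""
    for provider, model in candidate_judges:
        if provider not in response_providers:
            return provider, model
    raise LookupError("no candidate judge comes from an unused provider family")


# Candidates in preference order; the last entry is a made-up placeholder.
candidates = [
    ("anthropic", "claude-opus-4-6"),
    ("openai", "gpt-5.2"),
    ("example-provider", "example-judge"),
]
print(pick_cross_family_judge({"anthropic"}, candidates))  # ('openai', 'gpt-5.2')
```

When every candidate's provider also produced a response, there is no cross-family option left, which is one reason a multi-judge panel is the fallback.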

The single biggest upgrade is using more than one judge from different provider families and averaging their rankings. The full math is in our 3-judge consensus pattern post — it's the cheapest, highest-impact debiasing move available.
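As a rough sketch of what "averaging their rankings" can look like, the code below converts each judge's overall scores into ranks and averages the ranks per model. This is a generic mean-rank aggregation written for illustration. The exact method in the consensus post may differ, and the judge names, scores, and tie handling (tied models share the average of their positions) are our assumptions.

```python
from collections import defaultdict


def consensus_ranking(judge_scores: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Average each model's rank across judges; a lower mean rank is better.

    judge_scores maps judge name -> {model name -> overall score}.
    Assumes every judge scored the same set of models.
    """
    rank_sums: dict[str, float] = defaultdict(float)
    for scores in judge_scores.values():
        for model, score in scores.items():
            higher = sum(1 for s in scores.values() if s > score)
            tied = sum(1 for s in scores.values() if s == score)  # includes itself
            # Tied models share the average of the positions they occupy.
            rank_sums[model] += higher + (tied + 1) / 2
    judge_count = len(judge_scores)
    mean_ranks = {model: total / judge_count for model, total in rank_sums.items()}
    return sorted(mean_ranks.items(), key=lambda item: (item[1], item[0]))


# Hypothetical overall scores from three judges in different provider families.
panel = {
    "judge_1": {"A": 4.8, "B": 4.3, "C": 4.0},
    "judge_2": {"A": 4.5, "B": 4.6, "C": 4.0},
    "judge_3": {"A": 4.7, "B": 4.2, "C": 4.4},
}
for model, mean_rank in consensus_ranking(panel):
    print(model, round(mean_rank, 2))
```

In this made-up panel, no single judge's ordering matches the other two exactly, but the averaged ranks still put A first, B second, and C third. One judge's outlier preference gets diluted rather than deciding the result.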

When To Trust The Score (And When Not To)

LLM-as-judge scoring is a comparative signal, not ground truth. Use it for:

Don't use it for:

This is the same discipline we argue for in evals are the new unit tests: the judge is your CI signal, not your proof of correctness.

The Real Lesson

Promptster's score is a four-dimension, 1-5, unweighted-mean signal produced by an LLM judge, and we'd rather you understand its limits than treat it as an oracle. Read the four dimensions, not just the overall. Use a cross-family judge or a consensus panel for anything that matters. And remember: the score ranks; it doesn't certify. A transparent imperfect metric beats an opaque "trust us" one every time.

For the bias data behind these mitigations, see the LLM-as-a-judge bias audit. For the consensus upgrade, see the 3-judge consensus pattern.


Rubric and overall-score formula reflect Promptster's production judge prompt as of 2026-06-15. The example scoring table above is hypothetical and illustrative only — not from a measured run.