How Promptster Scores Models: The 4 Judge Dimensions, Explained
By Promptster Team · 2026-06-15
When Promptster shows you a quality score next to each model's response, that number didn't come from a benchmark or a vibe. It came from another LLM reading the responses and grading them against a fixed rubric. That's powerful — and it's also a place where people deserve to know exactly what's happening.
So here's the full transparency version: how Promptster's scoring actually works, what the four dimensions mean, how the overall number is computed, and which biases we mitigate versus the ones no automated judge fully escapes.
If you've read our LLM-as-a-judge bias audit, you already know the punchline of that genre: judges aren't neutral. This post is about what we do with that fact.
The Four Dimensions
Every scored response gets graded on four dimensions, each on a 1-5 scale where 5 is best. These are the exact dimensions and definitions baked into the judge prompt:
| Dimension | The question it answers |
|---|---|
| Relevance | How well does the response address the original prompt? |
| Accuracy | How factually correct is the response? |
| Completeness | How thoroughly does the response cover the topic? |
| Clarity | How well-written and easy to understand is the response? |
These four are deliberately orthogonal-ish. A response can be perfectly relevant and clear while being factually wrong (high relevance + clarity, low accuracy). It can be accurate but truncated (high accuracy, low completeness). Splitting the score into four axes is what lets you see why a model lost, not just that it lost.
How The Overall Score Is Computed
The headline number is a plain mean:
overall = (relevance + accuracy + completeness + clarity) / 4
The result is rounded to one decimal place. No hidden weighting, no secret sauce. A response scoring 5/4/5/4 gets an overall of 4.5.
We chose an unweighted mean on purpose. Weighting (e.g., "accuracy counts double") sounds smart but bakes our priorities into your eval. You know your task. If accuracy matters more than clarity for your use case, you can read the per-dimension breakdown and weight it yourself — the four numbers are always exposed, never collapsed away.
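As a rough sketch of the arithmetic, the unweighted mean and a do-it-yourself weighting could look like the Python below. The function names are hypothetical, not Promptster's API. The tie rule is also an assumption: the post only says "rounded to one decimal", and half-up rounding is what reproduces 4.25 → 4.3 in the illustrative table, whereas Python's built-in `round()` would give 4.2.

```python
from decimal import Decimal, ROUND_HALF_UP

DIMENSIONS = ("relevance", "accuracy", "completeness", "clarity")


def round_one_decimal(value: Decimal) -> float:
    # Assumption: ties round up (4.25 -> 4.3), matching the example table.
    return float(value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP))


def overall(scores: dict[str, int]) -> float:
    """Unweighted mean of the four dimension scores, rounded to one decimal."""
    total = sum(Decimal(scores[d]) for d in DIMENSIONS)
    return round_one_decimal(total / len(DIMENSIONS))


def weighted_overall(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Caller-chosen weighting; dimensions missing from `weights` count once."""
    factors = {d: Decimal(str(weights.get(d, 1.0))) for d in DIMENSIONS}
    weighted_sum = sum(factors[d] * scores[d] for d in DIMENSIONS)
    return round_one_decimal(weighted_sum / sum(factors.values()))
```

For instance, `weighted_overall(scores, {"accuracy": 2.0})` makes accuracy count double while leaving the other three dimensions at weight 1.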
What The Judge Actually Sees
The judge model receives the original prompt plus every response, formatted like this (one block per model):
### Response 1 (anthropic / claude-opus-4-6)
<full response text>
---
### Response 2 (openai / gpt-5.2)
<full response text>
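A minimal sketch of how blocks in that shape could be assembled is shown below. `ModelResponse` and `format_responses` are hypothetical names, and the sketch covers only the response blocks; how the original prompt and the rubric are attached is not shown in this post, so it is left out.

```python
from dataclasses import dataclass


@dataclass
class ModelResponse:
    provider: str
    model: str
    text: str


def format_responses(responses: list[ModelResponse]) -> str:
    """Render one labelled block per response, separated by '---' lines."""
    blocks = [
        f"### Response {i} ({r.provider} / {r.model})\n{r.text}"
        for i, r in enumerate(responses, start=1)
    ]
    return "\n---\n".join(blocks)
```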
The judge returns structured JSON, one entry per response, with the four dimension scores, the computed overall, and a 1-2 sentence justification:
{
  "scores": [
    {
      "provider": "anthropic",
      "model": "claude-opus-4-6",
      "relevance": 5, "accuracy": 4, "completeness": 5, "clarity": 4,
      "overall": 4.5,
      "justification": "Directly addresses the prompt and is well-structured; one factual claim is unsupported."
    }
  ]
}
The justification field is not decoration. It's the audit trail. If a score surprises you, the justification tells you which dimension drove it — and whether the judge's reasoning is one you trust.
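If you consume this JSON yourself, a small sanity check can catch a judge that drifts off the rubric. The sketch below assumes the field names shown above. The function name is hypothetical, and the half-up tie rule is an assumption inferred from the example numbers in this post, not a documented guarantee.

```python
import json
from decimal import Decimal, ROUND_HALF_UP

DIMENSIONS = ("relevance", "accuracy", "completeness", "clarity")


def check_judge_output(raw: str) -> list[dict]:
    """Parse the judge's JSON and reject entries that break the rubric."""
    entries = json.loads(raw)["scores"]
    for entry in entries:
        for dim in DIMENSIONS:
            score = entry[dim]
            # Each dimension must be a whole number on the 1-5 scale.
            if type(score) is not int or not 1 <= score <= 5:
                raise ValueError(f"{entry['model']}: {dim}={score!r} is outside 1-5")
        mean = sum(Decimal(entry[d]) for d in DIMENSIONS) / len(DIMENSIONS)
        # Assumption: one decimal place, ties rounded up.
        expected = mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
        if Decimal(str(entry["overall"])) != expected:
            raise ValueError(f"{entry['model']}: overall {entry['overall']} != {expected}")
        if not entry.get("justification", "").strip():
            raise ValueError(f"{entry['model']}: missing justification")
    return entries
```

Recomputing the overall from the four dimensions, rather than trusting the judge's own arithmetic, keeps the headline number tied to the published formula.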
An Illustrative Example (Hypothetical)
To make the rubric concrete, here's a made-up scoring of three responses to "Explain idempotency in REST APIs in under 100 words." These numbers are illustrative, not from a real run:
| Model | Relevance | Accuracy | Completeness | Clarity | Overall |
|---|---|---|---|---|---|
| Model A | 5 | 5 | 4 | 5 | 4.8 |
| Model B | 5 | 3 | 5 | 4 | 4.3 |
| Model C | 4 | 5 | 3 | 5 | 4.3 |
Notice B and C tie on overall (4.3) but for opposite reasons — B is thorough but slipped on a fact; C is accurate but thin. The mean hides that; the four dimensions reveal it. Always read the breakdown, not just the headline. For a real, measured leaderboard, run your own comparison — illustrative tables like this one prove nothing about actual model quality.
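One way to read the breakdown programmatically is to pull out each model's weakest dimension. The sketch below reuses the hypothetical numbers from the table above. The helper name is made up for illustration, and breaking ties by rubric order is an arbitrary choice.

```python
DIMENSIONS = ("relevance", "accuracy", "completeness", "clarity")

# Hypothetical scores copied from the illustrative table above.
table = {
    "Model A": {"relevance": 5, "accuracy": 5, "completeness": 4, "clarity": 5},
    "Model B": {"relevance": 5, "accuracy": 3, "completeness": 5, "clarity": 4},
    "Model C": {"relevance": 4, "accuracy": 5, "completeness": 3, "clarity": 5},
}


def weakest_dimension(scores: dict[str, int]) -> str:
    """Return the lowest-scoring dimension; ties go to the first in rubric order."""
    return min(DIMENSIONS, key=lambda d: scores[d])


weakest = {model: weakest_dimension(scores) for model, scores in table.items()}
```

Here Model B's weakest dimension is accuracy and Model C's is completeness, which is the distinction the tied 4.3 overall conceals.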
The Bias Problem (And What We Do About It)
A single LLM judge is biased. We've measured it. The dominant failure mode is self-preference — a judge tends to rank its own provider's output highest. Our bias audit found a clean 3-for-3 self-preference result across major providers.
Here are the mitigations Promptster applies, and their honest limits:
| Bias | Mitigation | Limit |
|---|---|---|
| Self-preference | Default scoring uses a judge from a different family than the responses; multi-judge consensus available | Not eliminated, only diluted |
| Verbosity bias | Rubric scores Completeness and Clarity separately, so "longer" doesn't auto-win | Judges still drift toward longer answers |
| Position bias | Provider/model labels are explicit, reducing reliance on order | We don't randomize order on every run by default |
| Authority bias | Justification field forces the judge to cite a specific reason | Confident-but-wrong text can still fool it |
The single biggest upgrade is using more than one judge from different provider families and averaging their rankings. The full math is in our 3-judge consensus pattern post — it's the cheapest, highest-impact debiasing move available.
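To make "averaging their rankings" concrete, here is a plain mean-rank aggregation in Python. It is a sketch under assumptions, not the method from the consensus post, whose math is not reproduced here: the function names are hypothetical, and giving tied models the average of the positions they span is one common convention among several.

```python
from collections import defaultdict
from statistics import mean


def ranks(overalls: dict[str, float]) -> dict[str, float]:
    """Rank models by overall score (1 = best); tied models share the average rank."""
    ordered = sorted(overalls, key=overalls.get, reverse=True)
    result: dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and overalls[ordered[j + 1]] == overalls[ordered[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2  # average of positions i+1 .. j+1
        for model in ordered[i : j + 1]:
            result[model] = shared
        i = j + 1
    return result


def consensus(per_judge: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Average each model's rank across judges; lowest mean rank comes first."""
    collected: dict[str, list[float]] = defaultdict(list)
    for overalls in per_judge.values():
        for model, rank in ranks(overalls).items():
            collected[model].append(rank)
    return sorted(((m, mean(r)) for m, r in collected.items()), key=lambda pair: pair[1])
```

Averaging ranks rather than raw scores means a judge that grades everything generously cannot dominate the result; only the order it assigns matters.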
When To Trust The Score (And When Not To)
LLM-as-judge scoring is a comparative signal, not ground truth. Use it for:
- Ranking responses against each other in the same run (relative quality is the strong signal).
- Regression detection — same prompt, same judge, over time. A drop is real even if the absolute number is fuzzy.
- Triaging — flagging which responses deserve a human read.
Don't use it for:
- Absolute quality claims ("this model scores 4.8, therefore it's production-ready"). A judge can't certify correctness it can't itself verify.
- High-stakes accuracy where a wrong fact has real cost — pair the score with a human reviewer or a ground-truth check.
This is the same discipline we argue for in evals are the new unit tests: the judge is your CI signal, not your proof of correctness.
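Treated as a CI signal, the regression check from the list above might be as small as the sketch below. The function name and the 0.3 tolerance are placeholders, not Promptster defaults; pick a tolerance that sits above the run-to-run noise you observe with your own judge.

```python
from statistics import mean


def regressed(baseline: list[float], current: list[float], tolerance: float = 0.3) -> bool:
    """True when the current mean overall falls more than `tolerance` below baseline."""
    return mean(baseline) - mean(current) > tolerance
```

Both lists should come from the same prompts scored by the same judge; otherwise the comparison mixes judge differences into the result.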
The Real Lesson
Promptster's score is a four-dimension, 1-5, unweighted-mean signal produced by an LLM judge, and we'd rather you understand its limits than treat it as an oracle. Read the four dimensions, not just the overall. Use a cross-family judge or a consensus panel for anything that matters. And remember that the score ranks; it doesn't certify. A transparent, imperfect metric beats an opaque "trust us" one every time.
For the bias data behind these mitigations, see the LLM-as-a-judge bias audit. For the consensus upgrade, see the 3-judge consensus pattern.
Rubric and overall-score formula reflect Promptster's production judge prompt as of 2026-06-15. The example scoring table above is hypothetical and illustrative only — not from a measured run.