Evals Are a Production Gate Now: Failing a PR When an LLM Judge Says Quality Regressed
By Promptster Team · 2026-06-02
For years, "we have evals" meant a notebook someone ran once before launch and never opened again. That era is over. The 2026 tooling surveys put offline evals at ~52% adoption and LLM-as-judge at ~53% — and crucially, the teams adopting them aren't running them in research. They're running them in CI, as a gate. A PR that drops a prompt's quality score now fails the same way a PR that breaks a unit test fails.
This is the right model. We argued the analogy in evals are the new unit tests; this post is the wiring. By the end you'll have a CI job that runs an LLM judge over a fixed dataset on every prompt change and blocks the merge if the score regresses past a threshold.
Why an LLM Judge, and Why It's Trustworthy Enough
You can't assert output == expected on generative text — there are a thousand correct phrasings. So you score against a rubric with another model acting as judge. That sounds circular until you remember the asymmetry: judging "is this answer relevant and accurate?" is far easier than producing the answer. A judge model doesn't need to know the answer cold; it needs to recognize quality against a rubric, which is a much more reliable task.
Promptster's score_responses runs LLM-as-judge across four dimensions — relevance, accuracy, completeness, clarity — on a 0–1 scale. Auto-scoring is available on Builder+. The dimensions matter: a regression often shows up in one of them (completeness drops while relevance holds), and an aggregate score would mask it.
The one real risk is judge bias — judges tend to favor verbosity and their own family's style. Mitigate it the way we covered in the three-judge consensus pattern: use a judge from a different provider than the model under test, or average multiple judges. Never let GPT judge GPT in a high-stakes gate.
The Pipeline
PR opened ──► fixed eval dataset ──► run prompt across providers
│
▼
score_responses (LLM judge)
│
┌───────────────┴───────────────┐
▼ ▼
score >= baseline? score < baseline?
│ │
▼ ▼
✅ merge OK ❌ fail the check
+ comment the diff
The non-negotiable rule: the dataset is fixed and version-controlled. If the inputs move every run, the gate is theater. Treat the eval set like a golden file.
Step 1 — The Eval Dataset
Commit evals/dataset.json. Keep it small (20–40 cases) and representative of production traffic, not edge cases you invented:
{
"cases": [
{
"id": "summarize-ticket",
"prompt": "Summarize this support ticket in 2 sentences:\n{{ticket}}",
"vars": { "ticket": "Customer can't reset password; reset email never arrives..." },
"dimensions": ["relevance", "accuracy", "completeness"]
},
{
"id": "extract-json",
"prompt": "Return the order as JSON with keys id, total, status:\n{{order}}",
"vars": { "order": "Order #4471, $129.00, shipped" },
"dimensions": ["accuracy", "completeness", "clarity"]
}
],
"thresholds": { "min_mean_score": 0.82, "max_regression": 0.05 }
}
min_mean_score is the absolute floor. max_regression is the relative guard — fail if the new score is more than 5 points below the last merged baseline, even if it clears the floor. The relative guard is what actually catches slow erosion.
Step 2 — The Runner
import json, os, sys, requests
BASE = "https://www.promptster.dev/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['PROMPTSTER_API_KEY']}"}
JUDGE = {"provider": "anthropic", "model": "claude-opus-4-6"} # cross-provider judge
def run_case(case):
# generate the candidate answer
gen = requests.post(f"{BASE}/prompts/test", headers=HEADERS, json={
"prompt": case["prompt"].format(**case.get("vars", {})),
"provider": "openai", "model": "gpt-5.2", # model under test
"temperature": 0.1,
}).json()
# judge it with a DIFFERENT provider to avoid same-family bias
judged = requests.post(f"{BASE}/score_responses", headers=HEADERS, json={
"prompt": case["prompt"].format(**case.get("vars", {})),
"response": gen["output"],
"dimensions": case["dimensions"],
"judge": JUDGE,
}).json()
return judged["scores"] # {relevance: 0.9, accuracy: 0.8, ...}
def main():
spec = json.load(open("evals/dataset.json"))
all_scores = []
for case in spec["cases"]:
scores = run_case(case)
mean = sum(scores.values()) / len(scores)
all_scores.append(mean)
print(f"{case['id']}: {mean:.3f} {scores}")
run_mean = sum(all_scores) / len(all_scores)
floor = spec["thresholds"]["min_mean_score"]
baseline = float(os.environ.get("EVAL_BASELINE", floor))
max_reg = spec["thresholds"]["max_regression"]
print(f"\nrun mean={run_mean:.3f} floor={floor} baseline={baseline}")
if run_mean < floor:
sys.exit(f"FAIL: below floor ({run_mean:.3f} < {floor})")
if run_mean < baseline - max_reg:
sys.exit(f"FAIL: regression of {baseline - run_mean:.3f} > {max_reg}")
print("PASS")
if __name__ == "__main__":
main()
Note the runner is provider-agnostic — swap the model under test to gate a model migration (e.g., GPT-5 → GPT-5.2) instead of a prompt change. Same gate, different variable.
Step 3 — The CI Gate
name: Eval Gate
on:
pull_request:
paths: ["prompts/**", "evals/**"]
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with: { python-version: "3.12" }
- run: pip install requests
- name: Run eval gate
env:
PROMPTSTER_API_KEY: ${{ secrets.PROMPTSTER_API_KEY }}
EVAL_BASELINE: ${{ vars.EVAL_BASELINE }} # last merged mean score
run: python evals/run.py
A non-zero exit fails the check and blocks the merge. Store the merged-main score as a repo variable (EVAL_BASELINE) so each PR is judged against the last known-good state, not a static floor that drifts out of date.
Picking Thresholds Without Crying Wolf
| Workload | min_mean_score | max_regression |
|---|---|---|
| Customer-facing generation | 0.85 | 0.04 |
| Internal tooling | 0.75 | 0.06 |
| Creative / open-ended | 0.68 | 0.08 |
Start lenient and tighten. A gate that false-positives on every PR gets disabled within a week, and a disabled gate protects nothing. Calibrate against ten merged PRs before you trust the numbers.
The Real Lesson
Evals crossed 50% adoption in 2026 not because teams discovered evals — they discovered that an eval you don't enforce is a comment, not a control. The shift is from "we measured quality once" to "quality cannot regress without a human overriding a red check." Wire the judge into CI, use a cross-provider judge to dodge bias, and gate on relative regression, not just an absolute floor.
For the always-on companion to this merge gate, see automating prompt testing for production; for the judging methodology, the three-judge consensus pattern.