Evals Are a Production Gate Now: Failing a PR When an LLM Judge Says Quality Regressed

By Promptster Team · 2026-06-02

For years, "we have evals" meant a notebook someone ran once before launch and never opened again. That era is over. The 2026 tooling surveys put offline evals at ~52% adoption and LLM-as-judge at ~53%, and crucially, the teams adopting them aren't leaving them in research notebooks. They're running them in CI, as a gate. A PR that drops a prompt's quality score now fails the same way a PR that breaks a unit test fails.

This is the right model. We argued the analogy in "evals are the new unit tests"; this post is the wiring. By the end you'll have a CI job that runs an LLM judge over a fixed dataset on every prompt change and blocks the merge if the score regresses past a threshold.

Why an LLM Judge, and Why It's Trustworthy Enough

You can't assert output == expected on generative text — there are a thousand correct phrasings. So you score against a rubric with another model acting as judge. That sounds circular until you remember the asymmetry: judging "is this answer relevant and accurate?" is far easier than producing the answer. A judge model doesn't need to know the answer cold; it needs to recognize quality against a rubric, which is a much more reliable task.

Promptster's score_responses runs LLM-as-judge across four dimensions — relevance, accuracy, completeness, clarity — on a 0–1 scale. Auto-scoring is available on Builder+. The dimensions matter: a regression often shows up in one of them (completeness drops while relevance holds), and an aggregate score would mask it.
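To make the masking effect concrete, here is a minimal sketch that compares each dimension's mean against a stored baseline. The function name `dimension_regressions` and the shape of the baseline dict are assumptions for illustration, not part of Promptster's API.

```python
def dimension_regressions(baseline, current, max_regression=0.05):
    """Return the dimensions whose score fell by more than max_regression.

    Both arguments map dimension name -> mean score on a 0-1 scale.
    """
    dropped = {}
    for dim, base_score in baseline.items():
        if dim not in current:
            continue  # dimension was not scored in this run
        delta = base_score - current[dim]
        if delta > max_regression:
            dropped[dim] = round(delta, 3)
    return dropped


baseline = {"relevance": 0.90, "accuracy": 0.88, "completeness": 0.90, "clarity": 0.88}
current = {"relevance": 0.92, "accuracy": 0.89, "completeness": 0.78, "clarity": 0.89}

# The aggregate mean only moves from 0.89 to 0.87, which a 0.05 tolerance
# would wave through, yet completeness alone fell by 0.12.
print(dimension_regressions(baseline, current))
```

Gating on the per-dimension result in addition to the aggregate is one way to keep a single collapsing dimension from hiding behind three healthy ones.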

The one real risk is judge bias — judges tend to favor verbosity and their own family's style. Mitigate it the way we covered in the three-judge consensus pattern: use a judge from a different provider than the model under test, or average multiple judges. Never let GPT judge GPT in a high-stakes gate.
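Averaging multiple judges could look roughly like the sketch below. It assumes you have already collected one score dict per judge; `consensus_scores` is a hypothetical helper, and a plain mean is only one possible way to combine judges.

```python
from statistics import mean


def consensus_scores(judge_scores):
    """Average per-dimension scores across several judges.

    judge_scores: list of dicts, one per judge, mapping dimension -> score.
    Only dimensions scored by every judge are kept.
    """
    if not judge_scores:
        raise ValueError("need at least one judge")
    shared = set(judge_scores[0])
    for scores in judge_scores[1:]:
        shared &= set(scores)
    return {dim: mean(s[dim] for s in judge_scores) for dim in sorted(shared)}


# Three judges from different providers scoring the same response.
judges = [
    {"relevance": 0.9, "accuracy": 0.8},
    {"relevance": 0.8, "accuracy": 0.9},
    {"relevance": 0.7, "accuracy": 0.7},
]
print(consensus_scores(judges))
```

Averaging dampens any single judge's stylistic preference, at the cost of one extra scoring call per judge per case.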

The Pipeline

PR opened ──► fixed eval dataset ──► run prompt across providers
                                          │
                                          ▼
                              score_responses (LLM judge)
                                          │
                          ┌───────────────┴───────────────┐
                          ▼                                ▼
                  score >= baseline?                 score < baseline?
                          │                                │
                          ▼                                ▼
                    ✅ merge OK                    ❌ fail the check
                                                   + comment the diff

The non-negotiable rule: the dataset is fixed and version-controlled. If the inputs move every run, the gate is theater. Treat the eval set like a golden file.
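One way to enforce the golden-file rule is to fingerprint the cases and store that fingerprint next to the baseline score, so a score is only ever compared against a baseline produced from the same inputs. This is a sketch under that assumption; `dataset_fingerprint` is a hypothetical helper, and it expects the spec to carry its cases under a `cases` key.

```python
import hashlib
import json


def dataset_fingerprint(spec):
    """Hash the eval cases in canonical form so whitespace and key order don't matter."""
    canonical = json.dumps(spec["cases"], sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


spec = {
    "cases": [
        {"id": "summarize-ticket", "prompt": "Summarize this support ticket in 2 sentences:\n{{ticket}}"},
    ]
}

# Record this alongside the baseline; if it changes, re-establish the baseline
# on main before comparing scores.
print(dataset_fingerprint(spec)[:12])
```

Only the cases are hashed here, so thresholds can be tuned without invalidating the baseline.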

Step 1 — The Eval Dataset

Commit evals/dataset.json. Keep it small (20–40 cases) and representative of production traffic, not edge cases you invented:

{
  "cases": [
    {
      "id": "summarize-ticket",
      "prompt": "Summarize this support ticket in 2 sentences:\n{{ticket}}",
      "vars": { "ticket": "Customer can't reset password; reset email never arrives..." },
      "dimensions": ["relevance", "accuracy", "completeness"]
    },
    {
      "id": "extract-json",
      "prompt": "Return the order as JSON with keys id, total, status:\n{{order}}",
      "vars": { "order": "Order #4471, $129.00, shipped" },
      "dimensions": ["accuracy", "completeness", "clarity"]
    }
  ],
  "thresholds": { "min_mean_score": 0.82, "max_regression": 0.05 }
}

min_mean_score is the absolute floor. max_regression is the relative guard, measured against the baseline on the same 0–1 scale: fail if the new mean is more than 0.05 below the last merged baseline, even if it clears the floor. The relative guard is what actually catches slow erosion.
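The interaction between the two thresholds is easiest to see with numbers. The sketch below uses the threshold values from the dataset file; the function name `gate` and the sample scores are made up for illustration.

```python
def gate(run_mean, baseline, min_mean_score=0.82, max_regression=0.05):
    """Return (passed, reason) for one eval run."""
    if run_mean < min_mean_score:
        return False, "below floor"
    if baseline - run_mean > max_regression:
        return False, "regression"
    return True, "ok"


print(gate(0.84, 0.91))  # (False, 'regression'): clears the floor but fell 0.07
print(gate(0.88, 0.91))  # (True, 'ok'): a 0.03 drop is within tolerance
print(gate(0.80, 0.80))  # (False, 'below floor'): no regression, but under 0.82
```

The first case is the one an absolute floor alone would miss.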

Step 2 — The Runner

import json, os, re, sys, requests

BASE = "https://www.promptster.dev/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['PROMPTSTER_API_KEY']}"}
JUDGE = {"provider": "anthropic", "model": "claude-opus-4-6"}  # cross-provider judge

def render(template, variables):
    # Fill {{name}} placeholders. str.format() would treat "{{" as an escaped
    # brace and leave the variables unsubstituted.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(variables[m.group(1)]), template)

def post(path, payload):
    # Fail loudly on HTTP errors or hangs instead of scoring an error body.
    resp = requests.post(f"{BASE}/{path}", headers=HEADERS, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()

def run_case(case):
    prompt = render(case["prompt"], case.get("vars", {}))

    # generate the candidate answer
    gen = post("prompts/test", {
        "prompt": prompt,
        "provider": "openai", "model": "gpt-5.2",  # model under test
        "temperature": 0.1,
    })

    # judge it with a DIFFERENT provider to avoid same-family bias
    judged = post("score_responses", {
        "prompt": prompt,
        "response": gen["output"],
        "dimensions": case["dimensions"],
        "judge": JUDGE,
    })
    return judged["scores"]  # {relevance: 0.9, accuracy: 0.8, ...}

def main():
    with open("evals/dataset.json") as f:
        spec = json.load(f)
    all_scores = []
    for case in spec["cases"]:
        scores = run_case(case)
        mean = sum(scores.values()) / len(scores)
        all_scores.append(mean)
        print(f"{case['id']}: {mean:.3f}  {scores}")

    run_mean = sum(all_scores) / len(all_scores)
    floor = spec["thresholds"]["min_mean_score"]
    # An unset repo variable reaches the job as "", so fall back to the floor.
    baseline = float(os.environ.get("EVAL_BASELINE") or floor)
    max_reg = spec["thresholds"]["max_regression"]

    print(f"\nrun mean={run_mean:.3f}  floor={floor}  baseline={baseline}")
    if run_mean < floor:
        sys.exit(f"FAIL: below floor ({run_mean:.3f} < {floor})")
    if run_mean < baseline - max_reg:
        sys.exit(f"FAIL: regression of {baseline - run_mean:.3f} > {max_reg}")
    print("PASS")

if __name__ == "__main__":
    main()

The runner isn't tied to one provider: change the provider and model under test in run_case to gate a model migration (e.g., GPT-5 → GPT-5.2) instead of a prompt change. Same gate, different variable.
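If you expect to gate migrations regularly, one option is to read the model under test from the environment rather than editing the runner. The variable names `EVAL_PROVIDER` and `EVAL_MODEL` below are hypothetical; the defaults mirror the values hard-coded in the runner.

```python
import os


def model_under_test(env=None):
    """Pick the provider/model to evaluate, falling back to the runner's defaults.

    EVAL_PROVIDER and EVAL_MODEL are assumed names, not a Promptster convention.
    Empty strings (as CI passes for unset variables) also fall back.
    """
    env = os.environ if env is None else env
    return {
        "provider": env.get("EVAL_PROVIDER") or "openai",
        "model": env.get("EVAL_MODEL") or "gpt-5.2",
    }


print(model_under_test({}))
print(model_under_test({"EVAL_PROVIDER": "anthropic", "EVAL_MODEL": "claude-opus-4-6"}))
```

Keep the cross-provider rule in mind when you do this: if the model under test moves to the judge's provider, the judge should move too.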

Step 3 — The CI Gate

name: Eval Gate
on:
  pull_request:
    paths: ["prompts/**", "evals/**"]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install requests
      - name: Run eval gate
        env:
          PROMPTSTER_API_KEY: ${{ secrets.PROMPTSTER_API_KEY }}
          EVAL_BASELINE: ${{ vars.EVAL_BASELINE }}   # last merged mean score
        run: python evals/run.py

A non-zero exit fails the check; mark the check as required in your branch protection rules so a failure actually blocks the merge. Store the merged-main score as a repo variable (EVAL_BASELINE) so each PR is judged against the last known-good state, not a static floor that drifts out of date.

Picking Thresholds Without Crying Wolf

Workload                     min_mean_score   max_regression
Customer-facing generation   0.85             0.04
Internal tooling             0.75             0.06
Creative / open-ended        0.68             0.08

Start lenient and tighten. A gate that false-positives on every PR gets disabled within a week, and a disabled gate protects nothing. Calibrate against ten merged PRs before you trust the numbers.
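Calibration can be as simple as looking at how much the run mean wobbles across PRs that were merged without any quality complaint. The sketch below is one rule of thumb, not a Promptster recommendation: it sets the tolerance to k sample standard deviations of the historical means. The function name, the default k, and the sample scores are all assumptions.

```python
from statistics import stdev


def suggest_max_regression(run_means, k=2.0, minimum=0.02):
    """Suggest a max_regression from the spread of historical run means.

    run_means: mean scores from recent known-good runs (0-1 scale).
    k: how many sample standard deviations of noise to tolerate.
    minimum: lower bound so a very quiet history doesn't yield a hair-trigger gate.
    """
    if len(run_means) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return round(max(minimum, k * stdev(run_means)), 3)


# Hypothetical means from ten merged PRs.
history = [0.86, 0.88, 0.85, 0.87, 0.89, 0.86, 0.84, 0.88, 0.87, 0.86]
print(suggest_max_regression(history))
```

If the suggested tolerance comes out wider than the table's starting values, that points to judge noise, and averaging several judges or adding cases will help more than loosening the gate.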

The Real Lesson

Evals crossed 50% adoption in 2026 not because teams discovered evals — they discovered that an eval you don't enforce is a comment, not a control. The shift is from "we measured quality once" to "quality cannot regress without a human overriding a red check." Wire the judge into CI, use a cross-provider judge to dodge bias, and gate on relative regression, not just an absolute floor.

For the always-on companion to this merge gate, see "automating prompt testing for production"; for the judging methodology, see "the three-judge consensus pattern".