Evals Are the New Unit Tests: A Practical Regression Guide for AI Code
By Promptster Team · 2026-05-14
A unit test fails when your code is wrong. An eval fails when your prompt is wrong — or when the model it runs against silently changed. Both failures should block a deploy. Most teams in 2026 treat the first as non-negotiable and the second as a once-a-quarter spreadsheet exercise.
That asymmetry is about to snap. Here's how to get ahead of it.
What an Eval Actually Is
An eval is a regression test for a prompt. It takes:
- A prompt template (parameterized).
- A reference dataset — inputs and the expected output shape/quality for each.
- A scoring function — checks whether the model's output matches the expected result.
A passing eval means: "for these 50 sample inputs, the prompt produces outputs that meet the quality bar." A failing eval means: "something changed — the prompt, the model, or the provider — and quality dropped."
That's the entirety of it. No ML PhD required.
The Minimal Eval Stack
A working eval pipeline needs four components. You don't need a vendor for this — a 100-line Python script does the job.
1. A prompt registry (versioned). Your prompts live in source control, not in a database dashboard where they can silently mutate. (One possible repo layout is sketched after this list.)
2. A reference dataset (curated). 20–50 representative inputs for each prompt, ideally covering happy paths, edge cases, and a few adversarial inputs. Start small; grow as you see real bug reports.
3. A scoring function. Three flavors:
- Exact match (output equals expected value) — for classification, extraction, code gen.
- Schema match (output parses and satisfies a JSON schema) — for structured outputs.
- LLM-as-a-judge (another model rates the output 1-5) — for subjective quality.
4. A CI hook that runs the above on every PR touching a prompt.
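Concretely, the registry and its reference data can be nothing more than a directory that the CI path filter below watches. A hypothetical layout (only extract_invoice.eval.yaml actually appears later in this post; the .txt templates and the second prompt are illustrative):

prompts/
  extract_invoice.txt          # prompt template, versioned like code
  extract_invoice.eval.yaml    # its reference dataset (see Dataset Format below)
  summarize_ticket.txt
  summarize_ticket.eval.yaml
evals.py                       # the eval runner from the next section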
The Implementation
# evals.py — minimal working eval
import json

import yaml
from promptster import test

def run_eval(prompt_template: str, dataset_path: str,
             provider: str = "openai", model: str = "gpt-4o-mini") -> dict:
    with open(dataset_path) as f:
        dataset = yaml.safe_load(f)
    results = {"passed": 0, "failed": 0, "failures": []}
    for case in dataset:
        # Render the template for this case and run it once at temperature 0,
        # so the eval is as deterministic as the provider allows.
        output = test(
            provider=provider,
            model=model,
            prompt=prompt_template.format(**case["inputs"]),
            temperature=0,
        )["response"]
        if score(output, case["expected"]) >= case.get("threshold", 0.8):
            results["passed"] += 1
        else:
            results["failed"] += 1
            results["failures"].append({
                "case": case["id"],
                "expected": case["expected"],
                "got": output,
            })
    return results

def score(output: str, expected: dict) -> float:
    # Three scoring flavors, keyed off the "type" field on each case's expected
    # block. validate_schema and llm_judge_score are small helpers; sketches of
    # both follow this listing.
    if expected["type"] == "exact":
        return 1.0 if output.strip() == expected["value"] else 0.0
    if expected["type"] == "schema":
        try:
            parsed = json.loads(output)
            return 1.0 if validate_schema(parsed, expected["schema"]) else 0.0
        except json.JSONDecodeError:
            return 0.0
    if expected["type"] == "judge":
        return llm_judge_score(output, expected["criteria"])
    raise ValueError(f"Unknown expected type: {expected['type']}")
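The listing above calls two helpers it never defines. Here is one minimal sketch of each: validate_schema interprets the lightweight required/types/optional format used in the dataset files below, and llm_judge_score reuses promptster's test() call as a single-model judge. Both are assumptions about how you might fill the gap, not part of any library API.

def validate_schema(parsed: dict, schema: dict) -> bool:
    # Checks the lightweight schema format used in the dataset files:
    # every 'required' key must be present, and 'types' entries must match.
    type_map = {"string": str, "number": (int, float), "boolean": bool}
    for key in schema.get("required", []):
        if key not in parsed:
            return False
    for key, type_name in schema.get("types", {}).items():
        # Unknown type names fall back to 'object', i.e. always pass.
        if key in parsed and not isinstance(parsed[key], type_map.get(type_name, object)):
            return False
    return True

def llm_judge_score(output: str, criteria: str) -> float:
    # Minimal LLM-as-a-judge: ask a fixed judge model for a 1-5 rating
    # against the criteria, then normalize to 0-1 so the result is
    # comparable to the exact-match and schema-match scorers.
    rubric = (
        "Rate the following output from 1 to 5 against these criteria:\n"
        f"{criteria}\n\nOutput:\n{output}\n\nReply with a single digit."
    )
    reply = test(
        provider="openai",
        model="gpt-4o-mini",
        prompt=rubric,
        temperature=0,
    )["response"]
    digits = [c for c in reply if c in "12345"]
    return int(digits[0]) / 5 if digits else 0.0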
Dataset Format
# prompts/extract_invoice.eval.yaml
- id: invoice_001
  inputs:
    text: "Invoice #482 — ACME Corp — $1,247.50 due 2026-04-15..."
  expected:
    type: schema
    schema:
      required: [invoice_number, vendor, amount, due_date]
      types: {invoice_number: string, amount: number}
- id: invoice_002_edge_case_missing_date
  inputs:
    text: "Invoice 881 from Widget Co for $2,300"
  expected:
    type: schema
    schema:
      required: [invoice_number, vendor, amount]
      optional: [due_date]
Keep the dataset in version control alongside the prompt. Any prompt change requires updating the reference set, which creates a natural review checkpoint.
The CI Hook
# .github/workflows/prompt-regression.yml
name: Prompt Regression
on:
  pull_request:
    paths:
      - "prompts/**"
      - "evals.py"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: pip install promptster pyyaml
      - env:
          PROMPTSTER_API_KEY: ${{ secrets.PROMPTSTER_API_KEY }}
        run: python evals.py --threshold 0.9
Fail the build if more than 10% of eval cases fail, or if a critical-tagged case fails. The threshold is a judgment call — start lenient, tighten as your reference set stabilizes.
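The workflow passes a --threshold flag that the earlier listing never reads. Here is a sketch of an entry point that makes the build fail; the prompts/*.txt naming and the convention of marking critical cases by putting "critical" in the case id are illustrative assumptions, not something the snippets above define.

# Tail of evals.py: a hypothetical CLI entry point for the workflow to invoke.
import argparse, glob, sys

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--threshold", type=float, default=0.9)
    args = parser.parse_args()

    exit_code = 0
    for dataset_path in glob.glob("prompts/*.eval.yaml"):
        # Assumes each dataset sits next to a .txt prompt template.
        with open(dataset_path.replace(".eval.yaml", ".txt")) as f:
            prompt_template = f.read()
        results = run_eval(prompt_template, dataset_path)
        total = results["passed"] + results["failed"]
        pass_rate = results["passed"] / total if total else 1.0
        # Fail on overall pass rate, or on any failed case whose id marks it critical.
        critical_failed = any("critical" in f["case"] for f in results["failures"])
        if pass_rate < args.threshold or critical_failed:
            print(f"FAIL {dataset_path}: {pass_rate:.0%} pass rate")
            exit_code = 1
        else:
            print(f"ok   {dataset_path}: {pass_rate:.0%} pass rate")
    sys.exit(exit_code)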
Cross-Provider Evals
A single-provider eval tells you whether your prompt works on one model. A cross-provider eval tells you whether your prompt works across your routing table. Both matter. Both should run on PR.
# Same prompt, same dataset, every provider in the routing table.
results = {}
for provider, model in [
    ("openai", "gpt-4o-mini"),
    ("anthropic", "claude-haiku-4-5"),
    ("google", "gemini-2.5-flash-lite"),
]:
    results[f"{provider}/{model}"] = run_eval(prompt_template, dataset_path, provider, model)
When provider divergence appears (passing on OpenAI, failing on Claude), the prompt has a portability bug. Fix it with explicit formatting instructions, or accept the divergence and pin the prompt to one provider. See our post on why your prompts fail on different LLM providers for common causes.
Scheduled Evals (Catching Silent Drift)
PR-time evals catch regressions from code changes. They don't catch regressions from silent model updates — when OpenAI ships a point-release and quality on your prompt drifts overnight. For that, schedule your eval suite to run weekly against production prompts, outside of any PR.
Alert when any provider's eval score drops > 5% week-over-week. You'll see the drift signal before your users do. Our upcoming post on scheduled drift detection walks through this pattern.
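The scheduled run can reuse the same runner on a cron trigger. A sketch, with an arbitrary schedule and file name:

# .github/workflows/prompt-drift.yml
name: Prompt Drift
on:
  schedule:
    - cron: "0 6 * * 1"   # every Monday morning, UTC
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: pip install promptster pyyaml
      - env:
          PROMPTSTER_API_KEY: ${{ secrets.PROMPTSTER_API_KEY }}
        run: python evals.py --threshold 0.9

The week-over-week comparison needs last week's scores to diff against, so persist each run's results somewhere (a build artifact, a metrics table) rather than relying on the absolute threshold alone.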
The Four Objections You'll Hear
"We don't have time to curate reference data." You have time for 20 test cases. That's two hours. The ROI is catching one production bug before it ships.
"LLM-as-a-judge isn't objective." Correct, and it's still better than no judge. Use a 3-judge panel for bias reduction. Calibrate against human labels for a sample.
"Our prompts change too fast." That's the argument for evals, not against. Rapidly-changing code without tests is how production breaks.
"Evals are flaky because LLMs are stochastic." Use temperature 0 for eval runs. Use exact-match or schema-match where possible. Budget for some noise in judge-graded cases by averaging 3 runs.
The Takeaway
Treating prompts as untested configuration is how "it works on my machine" becomes "it works on my model version." Treat prompts as code. Version them, test them on PR, monitor them in production. The tooling is boring and well-understood; the only thing in short supply is the institutional habit.
For the continuous integration side, see automating prompt testing for production. For the enterprise version, see enterprise prompt management: tagging and version control strategies.