Building an Eval Dataset From Production Traffic: Sampling, Labeling, and Avoiding Leakage

By Promptster Team · 2026-06-21

The fastest way to build an eval set that doesn't predict production is to write the prompts yourself. You'll subconsciously write prompts your model already handles well. The result: a green dashboard and angry users.

The real signal lives in your production traffic — the messy, ambiguous, edge-case-laden requests your users actually send. This post is the concrete process for turning that traffic into a golden eval dataset: how to sample it, how to label it, how to build the golden set, and the leakage traps that quietly invalidate the whole thing.

If you haven't internalized why this matters, start with "evals are the new unit tests." This post is the how.

The Five-Step Process

Step 1 — Sample with intent, not at random

Pure random sampling over-represents your easy, high-volume requests and under-represents the edge cases that break things. Stratified sampling is the answer. Bucket your traffic, then sample within buckets.

Stratum	Why include it	Suggested share
High-frequency happy path	Represents the bulk of real load	40%
Long-tail / rare intents	Where quality silently degrades	25%
Known failure modes	User complaints, thumbs-down, retries	20%
Adversarial / injection attempts	Security regression coverage	10%
New-feature traffic	Recently shipped, under-tested	5%

Pull the raw requests from your logging layer. If you proxy through Promptster, the history API already has prompt, provider, model, params, and output per request — that's your sampling pool.
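A minimal sketch of the bucket-then-sample step, assuming each logged request is a dict that has already been tagged with a "stratum" field. The field name, the stratum keys, and the stratified_sample helper are illustrative assumptions, not part of any logging API; the shares mirror the table above.

```python
import random
from collections import defaultdict

# Target share per stratum, taken from the table above (sums to 1.0).
SHARES = {
    "happy_path": 0.40,
    "long_tail": 0.25,
    "known_failure": 0.20,
    "adversarial": 0.10,
    "new_feature": 0.05,
}

def stratified_sample(requests, total, shares=SHARES, seed=0):
    """Sample up to `total` requests, split across strata by `shares`.

    Assumes each request is a dict with a hypothetical "stratum" key.
    A stratum with fewer requests than its quota contributes all it has.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    buckets = defaultdict(list)
    for req in requests:
        buckets[req["stratum"]].append(req)

    sample = []
    for stratum, share in shares.items():
        quota = round(total * share)
        pool = buckets.get(stratum, [])
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample
```

When a rare stratum cannot fill its quota, the sample comes back smaller than requested rather than being topped up from the happy path, which would quietly undo the stratification.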

Step 2 — Strip and minimize PII before it leaves the warehouse

Production traffic contains PII. Your eval set should not. Before a request enters the dataset, run it through a redaction pass such as:

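import re

# Deliberately simple patterns; they will miss some formats.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")   # stops before trailing punctuation
PHONE = re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b")  # 10-digit US-style numbers only
CARD = re.compile(r"\b(?:\d[ -]*?){13,16}\b")         # 13-16 digits, optional spaces/dashes

def redact(text: str) -> str:
    """Replace emails, phone numbers, and card-like digit runs with placeholders."""
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    text = CARD.sub("<CARD>", text)
    return text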

This is a floor, not a ceiling — pair it with a named-entity pass for names and addresses. The point: your golden set should be safe to share with the team and store long-term.

Step 3 — Label the expected output (the hard part)

A golden set needs ground truth. For each sampled input, you need an expected answer or a rubric. Three honest options, starting with the most expensive:

Reference answer — a human writes the ideal output. Highest quality, slowest. Use for your most critical 50–100 cases.
Rubric / assertion — instead of an exact answer, define checks: "must mention X," "must be valid JSON," "must not recommend a competitor." Cheaper, scales better, pairs naturally with LLM-as-judge scoring.
Acceptance set — for open-ended tasks, label a set of acceptable outputs rather than one. Avoids penalizing valid variation.
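The rubric option can be sketched as a set of small named checks applied to a model output. The helper names, the "refund" term, and the "AcmeRival" competitor below are hypothetical; only the three example rules come from the list above.

```python
import json

def mentions(term):
    """Check factory: output must mention `term` (case-insensitive)."""
    return lambda output: term.lower() in output.lower()

def is_valid_json(output):
    """Output must parse as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def avoids(terms):
    """Check factory: output must not mention any of `terms`."""
    return lambda output: not any(t.lower() in output.lower() for t in terms)

def run_rubric(output, checks):
    """Return {check_name: passed} for every named check."""
    return {name: check(output) for name, check in checks.items()}

# Hypothetical rubric attached to one golden case.
RUBRIC = {
    "mentions_refund": mentions("refund"),
    "valid_json": is_valid_json,
    "no_competitor": avoids(["AcmeRival"]),
}
```

Returning a per-check result instead of a single boolean makes it easier to see which rule a regression broke. Substring checks are crude, so subjective rules are better left to the judge model.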

Be honest about labeler agreement. If two humans disagree on the expected output, the case is either genuinely ambiguous (keep it, label it as such) or your task is underspecified (fix the task).
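One common way to put a number on labeler agreement is Cohen's kappa, which corrects raw agreement for chance. The sketch below assumes two labelers assigning categorical labels (for example pass/fail against a rubric) to the same cases. The choice of metric is an assumption here, not something this process prescribes.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two labelers over the same cases."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of cases where both labelers match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each labeler's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both labelers used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

def disagreements(case_ids, labels_a, labels_b):
    """Case IDs where the labelers differ; these need adjudication."""
    return [c for c, a, b in zip(case_ids, labels_a, labels_b) if a != b]
```

The list of disagreeing case IDs is the practical output: each one gets either an "ambiguous" tag or a fix to the task definition.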

Step 4 — Freeze the golden set and version it

Once labeled, the golden set is an artifact. Treat it like code: commit it, version it, review changes to it in PRs. A golden set that drifts silently is worse than no golden set, because you'll attribute score changes to the model when they came from the data.

evals/
  golden-v3.jsonl        # frozen, reviewed, versioned
  CHANGELOG.md           # every case added/removed/relabeled, with reason
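One way to make silent drift detectable is to pin a content hash of the frozen file at review time and fail loudly when it changes. This is a minimal sketch; the fingerprint and assert_frozen helpers, and the idea of recording the pinned hash alongside the file, are assumptions rather than part of the layout above.

```python
import hashlib
import json

def fingerprint(jsonl_text: str) -> str:
    """SHA-256 over canonicalized cases.

    Key order and blank lines do not affect the hash; adding, removing,
    editing, or reordering a case does.
    """
    cases = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    canonical = "\n".join(json.dumps(c, sort_keys=True) for c in cases)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def assert_frozen(jsonl_text: str, pinned_hash: str) -> None:
    """Raise if the golden set no longer matches the hash recorded at review time."""
    actual = fingerprint(jsonl_text)
    if actual != pinned_hash:
        raise RuntimeError(
            f"golden set changed (expected {pinned_hash[:12]}, got {actual[:12]}); "
            "bump the version and update CHANGELOG.md"
        )
```

Recording the fingerprint next to each eval score also lets you confirm that two scores were measured against the same data.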

Step 5 — Run it as a gate, not a report

A golden set that runs nightly and emails a chart is observability. A golden set that blocks a deploy when scores regress is a quality gate. Only the second one prevents regressions from shipping. For the CI wiring, see "evals as a production gate in CI" and "the automated prompt testing pipeline."
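The difference between a report and a gate comes down to an exit code. A minimal sketch, assuming per-case scores in the range 0 to 1 and a stored baseline mean; the 0.02 threshold and the function names are illustrative assumptions.

```python
def gate(scores, baseline_mean, max_drop=0.02):
    """Return (passed, current_mean).

    Fails when the mean score drops more than `max_drop` below the baseline.
    """
    if not scores:
        raise ValueError("no scores to evaluate")
    current = sum(scores) / len(scores)
    return (baseline_mean - current) <= max_drop, current

def exit_code(scores, baseline_mean, max_drop=0.02):
    """Map the gate result to a process exit code for CI."""
    passed, current = gate(scores, baseline_mean, max_drop)
    print(f"baseline={baseline_mean:.3f} current={current:.3f} passed={passed}")
    # A non-zero exit code is what makes the CI job, and so the deploy, fail.
    return 0 if passed else 1

# Hypothetical CI entry point: sys.exit(exit_code(scores, baseline_mean))
```

A mean can hide a critical case that flipped from pass to fail, so a stricter gate might also fail on any regression within the "known failure modes" stratum.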

The Leakage Traps That Quietly Ruin Everything

This is the part most teams get wrong. Leakage means information from your eval set has contaminated the thing you're evaluating, so the score is inflated and meaningless.

Leakage type	How it happens	Fix
Few-shot leakage	An eval case is also used as a few-shot example in the prompt	Keep eval inputs strictly out of the prompt template
Tuning leakage	You iterate on the prompt until the eval set passes	Hold out a blind slice you never tune against
Temporal leakage	Eval includes traffic from after a feature shipped, tested against the old version	Tag each case with capture date; scope evals by version
Judge leakage	The judge is the same model, or a model from the same provider, as the one being evaluated	Use a cross-provider judge panel
Selection leakage	You only kept cases your model already passed	Stratified sampling (Step 1), not cherry-picking
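The few-shot row in the table can be enforced mechanically by checking that no golden-set input also appears among the prompt's few-shot examples. This sketch assumes both are available as plain strings. The normalization rule (lowercase, collapsed whitespace) and the function names are illustrative choices, and an exact-match check like this will miss paraphrased duplicates.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits don't hide a match."""
    return " ".join(text.lower().split())

def few_shot_leaks(eval_inputs, few_shot_examples):
    """Return the eval inputs that also appear among the few-shot examples."""
    shots = {normalize(s) for s in few_shot_examples}
    return [x for x in eval_inputs if normalize(x) in shots]
```

Running this as a pre-step of the eval job and failing on a non-empty result keeps the check from depending on anyone remembering it.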

The tuning-leakage one is the most seductive. The moment you start editing prompts to make the eval set go green, that eval set stops measuring generalization. Always hold back a blind slice — 20% of your golden set that you never look at until a release candidate is locked.
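One way to carve out the blind slice is to assign cases deterministically by hashing a stable case ID, so membership never changes as the set grows and nobody hand-picks it. A sketch only: the "id" field and the hashing scheme are assumptions, and the 20% default mirrors the figure above.

```python
import hashlib

def is_blind(case_id: str, blind_share: float = 0.20) -> bool:
    """Deterministically place roughly `blind_share` of cases in the blind slice."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a roughly uniform value in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < blind_share

def split_golden(cases, blind_share: float = 0.20):
    """Split cases (dicts with a hypothetical "id" key) into (tuning, blind)."""
    tuning, blind = [], []
    for case in cases:
        (blind if is_blind(case["id"], blind_share) else tuning).append(case)
    return tuning, blind
```

The split is approximate, so a small golden set may land a few points off 20%. In return, a case that enters the blind slice stays there across every future version.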

A Realistic Cadence

You don't build this once. You build it continuously:

Weekly: sample new production traffic into a staging pool (stratified, redacted).
Every two weeks: label and promote the best cases into the golden set; log every change in the CHANGELOG.
Per-PR: run the golden set, minus the blind slice, as a gate; block on regression beyond threshold.
Per-release: run the blind slice once, record the score, never tune against it.

The Real Lesson

Your production traffic already knows where your prompts fail; you just have to capture it before the signal evaporates. Sample with intent, redact ruthlessly, label honestly, freeze and version the golden set, and guard a blind slice against your own optimization pressure. A 200-case eval dataset built from real traffic will usually catch more regressions than 2,000 cases you invented at your desk. The synthetic set makes you feel safe; the production-derived set actually keeps you safe.

Process notes compiled from production eval pipelines as of June 2026. Adapt sampling ratios and labeling depth to your risk tolerance and team size.