Building an Eval Dataset From Production Traffic: Sampling, Labeling, and Avoiding Leakage
By Promptster Team · 2026-06-21
The fastest way to build an eval set that doesn't predict production is to write the prompts yourself. You'll subconsciously write prompts your model already handles well. The result: a green dashboard and angry users.
The real signal lives in your production traffic — the messy, ambiguous, edge-case-laden requests your users actually send. This post is the concrete process for turning that traffic into a golden eval dataset: how to sample it, how to label it, how to build the golden set, and the leakage traps that quietly invalidate the whole thing.
If you haven't internalized why this matters, start with our post on why evals are the new unit tests. This post is the how.
The Five-Step Process
Step 1 — Sample with intent, not at random
Pure random sampling over-represents your easy, high-volume requests and under-represents the edge cases that break things. Stratified sampling is the answer. Bucket your traffic, then sample within buckets.
| Stratum | Why include it | Suggested share |
|---|---|---|
| High-frequency happy path | Represents the bulk of real load | 40% |
| Long-tail / rare intents | Where quality silently degrades | 25% |
| Known failure modes | Already flagged by users: complaints, thumbs-down, retries | 20% |
| Adversarial / injection attempts | Security regression coverage | 10% |
| New-feature traffic | Recently shipped, under-tested | 5% |
Pull the raw requests from your logging layer. If you proxy through Promptster, the history API already has prompt, provider, model, params, and output per request — that's your sampling pool.
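A minimal sketch of the bucket-then-sample step, assuming each logged request has already been tagged with a `stratum` field. The field name and the stratum keys are hypothetical labels chosen for illustration, not part of any Promptster API; the shares are the ones from the table above.

```python
import random
from collections import defaultdict

# Shares copied from the table above; the keys are hypothetical stratum labels.
SHARES = {
    "happy_path": 0.40,
    "long_tail": 0.25,
    "known_failure": 0.20,
    "adversarial": 0.10,
    "new_feature": 0.05,
}

def stratified_sample(records, total, shares=SHARES, seed=0):
    """Draw up to `total` records, split across strata according to `shares`."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["stratum"]].append(rec)

    sample = []
    for stratum, share in shares.items():
        quota = round(total * share)
        pool = buckets.get(stratum, [])
        # A stratum with less traffic than its quota contributes all it has.
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample
```

If a stratum has less traffic than its quota, the sample comes back smaller than `total`. That shortfall is worth logging, because it usually means the bucket needs a longer capture window.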
Step 2 — Strip and minimize PII before it leaves the warehouse
Production traffic contains PII. Your eval set should not. Before a request enters the dataset:
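```python
import re

def redact(text: str) -> str:
    """Replace emails, phone numbers, and card-like digit runs with placeholders."""
    # Email addresses such as jane.doe+tag@example.co.uk
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "<EMAIL>", text)
    # 10-digit phone numbers, optionally separated by "-" or "."
    text = re.sub(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b", "<PHONE>", text)
    # 13 to 16 digits with optional spaces or hyphens (card-like numbers)
    text = re.sub(r"\b(?:\d[ -]*?){13,16}\b", "<CARD>", text)
    return text
```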
This is a floor, not a ceiling — pair it with a named-entity pass for names and addresses. The point: your golden set should be safe to share with the team and store long-term.
Step 3 — Label the expected output (the hard part)
A golden set needs ground truth. For each sampled input, you need an expected answer or a rubric. Three honest options, in order of cost:
- Reference answer — a human writes the ideal output. Highest quality, slowest. Use for your most critical 50–100 cases.
- Rubric / assertion — instead of an exact answer, define checks: "must mention X," "must be valid JSON," "must not recommend a competitor." Cheaper, scales better, pairs naturally with LLM-as-judge scoring.
- Acceptance set — for open-ended tasks, label a set of acceptable outputs rather than one. Avoids penalizing valid variation.
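The rubric option can be sketched as a handful of deterministic checks. This is an illustration under assumptions: the rubric keys (`must_mention`, `must_not_mention`, `must_be_json`) and the function name are hypothetical, and plain substring matching is a crude stand-in for what an LLM judge would score more flexibly.

```python
import json

def check_rubric(output: str, rubric: dict) -> dict:
    """Return a pass/fail result for each check named in `rubric`."""
    lowered = output.lower()
    results = {}
    for phrase in rubric.get("must_mention", []):
        results[f"mentions:{phrase}"] = phrase.lower() in lowered
    for phrase in rubric.get("must_not_mention", []):
        results[f"avoids:{phrase}"] = phrase.lower() not in lowered
    if rubric.get("must_be_json"):
        try:
            json.loads(output)
            results["valid_json"] = True
        except json.JSONDecodeError:
            results["valid_json"] = False
    return results
```

A case passes when every value in the returned dict is `True`. Keeping the per-check results, rather than a single boolean, makes it easier to see which assertion a regression tripped.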
Be honest about labeler agreement. If two humans disagree on the expected output, the case is either genuinely ambiguous (keep it, label it as such) or your task is underspecified (fix the task).
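One way to put a number on labeler agreement is Cohen's kappa, which corrects raw agreement for the agreement two labelers would reach by chance. The post does not prescribe a metric, so treat this as one reasonable choice; the sketch assumes two labelers assigned one categorical label per case.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelers over the same cases."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Probability that both labelers pick the same label by chance.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used one identical label throughout
    return (observed - expected) / (1 - expected)

def disagreements(labels_a, labels_b):
    """Indices of cases the two labelers labeled differently."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]
```

The indices returned by `disagreements` are the cases to review by hand: each one is either genuinely ambiguous or a sign the task needs a tighter definition.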
Step 4 — Freeze the golden set and version it
Once labeled, the golden set is an artifact. Treat it like code: commit it, version it, review changes to it in PRs. A golden set that drifts silently is worse than no golden set, because you'll attribute score changes to the model when they came from the data.
```
evals/
  golden-v3.jsonl   # frozen, reviewed, versioned
  CHANGELOG.md      # every case added/removed/relabeled, with reason
```
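To make silent drift detectable, one option is to fingerprint the golden set and record the fingerprint with every eval run. The sketch below assumes the file is JSONL with one case per line; the function name is hypothetical.

```python
import hashlib
import json

def golden_fingerprint(jsonl_text: str) -> str:
    """Hash the cases in a JSONL golden set, ignoring key order and blank lines."""
    digest = hashlib.sha256()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        case = json.loads(line)
        # Canonical form: sorted keys and fixed separators, so cosmetic
        # re-serialization does not change the fingerprint.
        canonical = json.dumps(case, sort_keys=True, separators=(",", ":"))
        digest.update(canonical.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

If two runs report different fingerprints, any score difference between them cannot be attributed to the model alone. Note that reordering cases also changes the fingerprint, which is intentional here: the frozen file should not change at all between versions.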
Step 5 — Run it as a gate, not a report
A golden set that runs nightly and emails a chart is observability. A golden set that blocks a deploy when scores regress is a quality gate. Only the second one stops regressions from shipping. For the CI wiring, see our posts on evals as a production gate in CI and on the automated prompt testing pipeline.
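The difference between a report and a gate comes down to a comparison that can fail the build. A minimal sketch, assuming scores are stored as metric-to-value dicts; the metric names, the `max_drop` threshold, and the function name are all hypothetical.

```python
def gate(baseline: dict, current: dict, max_drop: float = 0.02) -> list:
    """Return the metrics that regressed by more than `max_drop`.

    A metric missing from `current` counts as a failure, so a broken
    eval run cannot pass the gate by reporting nothing.
    """
    failures = []
    for metric, base_score in baseline.items():
        score = current.get(metric)
        if score is None or base_score - score > max_drop:
            failures.append(metric)
    return failures
```

In CI, the calling script would exit non-zero when the returned list is non-empty, for example `sys.exit(1 if gate(baseline, current) else 0)`, which is what blocks the merge or deploy.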
The Leakage Traps That Quietly Ruin Everything
This is the part most teams get wrong. Leakage means information from your eval set has contaminated the thing you're evaluating, so the score is inflated and meaningless.
| Leakage type | How it happens | Fix |
|---|---|---|
| Few-shot leakage | An eval case is also used as a few-shot example in the prompt | Keep eval inputs strictly out of the prompt template |
| Tuning leakage | You iterate on the prompt until the eval set passes | Hold out a blind slice you never tune against |
| Temporal leakage | Eval cases captured after a feature shipped are run against a version that predates it | Tag each case with capture date; scope evals by version |
| Judge leakage | Same-provider model judges its own output | Use a cross-provider judge panel |
| Selection leakage | You only kept cases your model already passed | Stratified sampling (Step 1), not cherry-picking |
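Few-shot leakage is the easiest row in the table to check mechanically. The sketch below flags eval inputs that also appear among the prompt's few-shot examples after whitespace and case normalization. It assumes both are available as lists of strings, and it only catches near-verbatim overlap, not paraphrases.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits do not hide a match."""
    return " ".join(text.lower().split())

def find_few_shot_leaks(eval_inputs, few_shot_examples):
    """Return eval inputs that also appear as few-shot examples."""
    shots = {normalize(example) for example in few_shot_examples}
    return [inp for inp in eval_inputs if normalize(inp) in shots]
```

Running a check like this whenever the prompt template changes keeps a newly added few-shot example from quietly duplicating a golden case.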
Tuning leakage is the most seductive of these. The moment you start editing prompts to make the eval set go green, that eval set stops measuring generalization. Always hold back a blind slice: 20% of your golden set that you never look at until a release candidate is locked.
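One way to carve out the blind slice without anyone hand-picking it is to assign cases by a hash of their ID. This is a sketch under assumptions: each case has a stable `id` field, the salt string is arbitrary, and the split lands near 20% rather than exactly on it.

```python
import hashlib

def is_blind(case_id: str, blind_fraction: float = 0.20, salt: str = "golden-v3") -> bool:
    """Deterministically assign a case to the blind slice based on its ID."""
    digest = hashlib.sha256(f"{salt}:{case_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < blind_fraction

def split_golden(cases):
    """Split cases into (tuning, blind) lists; the same ID always lands in the same list."""
    tuning, blind = [], []
    for case in cases:
        (blind if is_blind(case["id"]) else tuning).append(case)
    return tuning, blind
```

Because assignment depends only on the ID and the salt, adding new cases never moves existing ones between slices, so a case that has been tuned against cannot drift into the blind slice later.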
A Realistic Cadence
You don't build this once. You build it continuously:
- Weekly: sample new production traffic into a staging pool (stratified, redacted).
- Every two weeks: label and promote the best cases into the golden set; log every change in the CHANGELOG.
- Per-PR: run the golden set, minus the blind slice, as a gate; block on regression beyond threshold.
- Per-release: run the blind slice once, record the score, never tune against it.
The Real Lesson
Your production traffic already knows where your prompts fail — you just have to capture it before the signal evaporates. Sample with intent, redact ruthlessly, label honestly, freeze and version the golden set, and guard a blind slice against your own optimization pressure. A 200-case eval dataset built from real traffic will catch more regressions than 2,000 cases you invented at your desk. The synthetic set makes you feel safe; the production-derived set actually keeps you safe.
Process notes compiled from production eval pipelines as of June 2026. Adapt sampling ratios and labeling depth to your risk tolerance and team size.