Batch-Testing 1,000 Prompts with CSV Import: A Cost and Quality Audit

By Promptster Team · 2026-06-13

Testing one prompt in a side-by-side UI is a solved problem. Testing the thousand prompts that actually run in your product is the job nobody wants to do by hand, so it doesn't get done, and your prompt library rots quietly until a model deprecation forces a panic.

Batch CSV import fixes the math. Upload a file, compare every row across the providers you care about, export the results, and audit cost and quality across the whole library at once. This is a tutorial: real CSV schema, real steps, honest about what's tedious. We covered the philosophy of automating prompt testing in production; this is the hands-on version at scale.

When You Actually Need This

A model you depend on gets deprecated and you need to re-test everything against its replacement.
You're evaluating a cheaper provider and want a portfolio-wide before/after, not a cherry-picked demo.
You inherited a prompt library and have no idea which prompts are good.
You're doing a quarterly cost/quality audit (you should be).

In every case, the unit of work is the library, not the prompt.

Step 1 — Build the CSV

Keep the schema boring. One row per prompt; columns for everything you want to vary or group by:

prompt_id,prompt,system_prompt,temperature,tags,reference_output
inv-001,"Extract the invoice total from: {{text}}","You are a precise data extractor.",0.2,"billing,extract","$1,240.00"
sup-014,"Summarize this support ticket in 2 sentences: {{text}}","Be concise and neutral.",0.3,"support,summarize",""
cod-103,"Fix the off-by-one bug in this loop: {{code}}","You are a senior engineer.",0.2,"code,debug",""

Schema rules that save you pain later:

Column	Required	Why
`prompt_id`	Yes	Stable join key for the export; never reuse one
`prompt`	Yes	The actual prompt; quote fields with commas
`system_prompt`	No	Per-row system prompt if it varies
`temperature`	No	Sampling temperature for the row; the default applies when the cell is blank
`tags`	No	Powers the audit slice-by-tag later
`reference_output`	No	Fill it and you unlock automated scoring

The single highest-leverage column is reference_output. With it, scoring is automatic and your audit is objective. Without it, you're eyeballing 1,000 outputs by hand — which is exactly the work you started this to avoid.
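It is worth linting the file before uploading it. The sketch below checks the schema rules from the table using only Python's standard library. The function name `validate_library` is ours, not part of any Promptster API, and the checks are a reasonable minimum rather than an official validator.

```python
import csv
import io

REQUIRED = ("prompt_id", "prompt")


def validate_library(csv_text):
    """Return (rows, errors) for a prompt-library CSV passed in as a string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    errors = [f"missing required column: {c}" for c in REQUIRED if c not in header]
    if errors:
        return [], errors

    rows, seen = [], set()
    for n, row in enumerate(reader, start=1):  # n counts data rows, not file lines
        pid = (row.get("prompt_id") or "").strip()
        if not pid:
            errors.append(f"row {n}: empty prompt_id")
        elif pid in seen:
            errors.append(f"row {n}: duplicate prompt_id {pid!r}")
        seen.add(pid)
        if not (row.get("prompt") or "").strip():
            errors.append(f"row {n}: empty prompt")
        temp = (row.get("temperature") or "").strip()
        if temp:
            try:
                float(temp)
            except ValueError:
                errors.append(f"row {n}: temperature {temp!r} is not a number")
        # Split the quoted, comma-separated tags cell into a clean list.
        row["tags"] = [t.strip() for t in (row.get("tags") or "").split(",") if t.strip()]
        rows.append(row)
    return rows, errors


sample = (
    "prompt_id,prompt,system_prompt,temperature,tags,reference_output\n"
    'inv-001,"Extract the invoice total from: {{text}}",'
    '"You are a precise data extractor.",0.2,"billing,extract","$1,240.00"\n'
)
rows, errors = validate_library(sample)
print(len(rows), errors)
```

Because the `csv` module handles quoting, the comma inside `"$1,240.00"` and inside the tags cell survives intact, which is the usual place hand-rolled `split(",")` parsers break.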

Step 2 — Import and Pick Your Providers

Import the CSV, then choose the providers/models to fan each row out across. The whole point is comparison, so pick at least one incumbent and one challenger:

Library: 1,000 rows
Fan-out: openai/gpt-5.2  +  deepseek/deepseek-reasoner  +  anthropic/claude-opus-4-6
Result matrix: 1,000 prompts × 3 models = 3,000 responses

That matrix is the thing you've never been able to build by hand. It's also where the cost shows up — so size your fan-out deliberately. Three models is plenty for an audit; you don't need eleven.
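To size a fan-out before paying for it, a back-of-the-envelope calculation is enough. The sketch below builds the prompt × model matrix and estimates spend per model. The token counts, the model labels in `placeholder_prices`, and the prices themselves are placeholders we made up for illustration, so substitute your providers' current rates and your own token averages.

```python
from itertools import product


def fan_out(prompt_ids, models):
    """Every (prompt_id, model) pair in the result matrix."""
    return list(product(prompt_ids, models))


def estimate_cost(n_prompts, avg_in_tokens, avg_out_tokens, price_per_mtok):
    """Rough USD spend per model.

    price_per_mtok maps model -> (input_usd, output_usd) per million tokens.
    """
    return {
        model: n_prompts * (avg_in_tokens * p_in + avg_out_tokens * p_out) / 1_000_000
        for model, (p_in, p_out) in price_per_mtok.items()
    }


models = ["openai/gpt-5.2", "deepseek/deepseek-reasoner", "anthropic/claude-opus-4-6"]
prompt_ids = [f"p-{i:04d}" for i in range(1000)]
matrix = fan_out(prompt_ids, models)
print(len(matrix))  # 3000

# PLACEHOLDER prices and token counts, not real rates for any provider.
placeholder_prices = {"cheap-model": (1.0, 2.0), "frontier-model": (10.0, 30.0)}
estimate = estimate_cost(1000, 500, 300, placeholder_prices)
print(estimate)
```

Each extra model adds a full column of 1,000 calls, so the estimate grows linearly with fan-out width.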

Step 3 — Score (the part that makes it an audit, not a dump)

A batch of 3,000 raw outputs is a data dump, not an audit. Scoring turns it into a decision. For rows with a reference_output, LLM-as-judge scoring runs automatically across four dimensions, graded against the reference. For rows without one, the judge scores each response relative to the other responses to the same prompt.

score_responses(matrix) → per-response scores on:
  accuracy · relevance · completeness · clarity

Now each cell has a number, and you can finally ask the only questions that matter:

Which model wins per tag group? (Code prompts and summarization prompts rarely have the same winner.)
Where does the cheap challenger match the incumbent — and where does it quietly fall apart?
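Both questions reduce to grouping scores by tag and model. Here is a minimal sketch, assuming each scored response arrives as a dict with `model`, a `tags` list, and the four dimension scores. That row shape and the unweighted average in `overall` are our assumptions, not a documented export format.

```python
from collections import defaultdict
from statistics import mean

DIMENSIONS = ("accuracy", "relevance", "completeness", "clarity")


def overall(row):
    """Unweighted mean of the four judge dimensions (an assumed aggregation)."""
    return mean(row[d] for d in DIMENSIONS)


def mean_score_by_tag(rows):
    """Return {tag: {model: mean overall score}}."""
    buckets = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for tag in row["tags"]:
            buckets[tag][row["model"]].append(overall(row))
    return {
        tag: {model: mean(scores) for model, scores in by_model.items()}
        for tag, by_model in buckets.items()
    }


def winner_by_tag(rows):
    """Return {tag: model with the highest mean overall score}."""
    return {
        tag: max(scores, key=scores.get)
        for tag, scores in mean_score_by_tag(rows).items()
    }


def scored(model, tags, value):
    """Build an invented toy row with the same score on every dimension."""
    return {"model": model, "tags": tags, **{d: value for d in DIMENSIONS}}


toy = [
    scored("model-a", ["code"], 9),
    scored("model-b", ["code"], 6),
    scored("model-a", ["support"], 7),
    scored("model-b", ["support"], 8),
]
print(winner_by_tag(toy))
```

A prompt with several tags counts toward each of them, so tag groups overlap by design.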

Step 4 — Export and Audit

Export the scored matrix and pivot it. This is where the money shows up.

This table is yours to fill — by design. Unlike the other June benchmarks on this blog, the audit numbers here are intrinsically specific to your prompt library and your reference outputs. That's the whole point: the audit hands you a routing table for the prompts you actually ship, not ours. Run Steps 1–4 on your CSV and pivot the export into this shape:

Model	Avg score (0–10)	Total cost (1,000 prompts)	Cost / passing response	Best tag group
GPT-5.2	from your run	from your run	from your run	from your run
DeepSeek Reasoner	from your run	from your run	from your run	from your run
Claude Opus 4.6	from your run	from your run	from your run	from your run
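The pivot itself is a few lines of grouping. The sketch below fills the four columns from a list of scored rows, assuming each row carries `model`, a single 0–10 `score`, `cost_usd`, and a `tags` list. Those field names and the passing threshold of 7.0 are our assumptions, so adapt them to whatever your export actually contains.

```python
from collections import defaultdict
from statistics import mean

PASS_THRESHOLD = 7.0  # assumed cutoff for a "passing" response on the 0-10 scale


def pivot_export(rows, pass_threshold=PASS_THRESHOLD):
    """Aggregate scored rows into one summary dict per model."""
    by_model = defaultdict(list)
    for row in rows:
        by_model[row["model"]].append(row)

    summary = {}
    for model, items in by_model.items():
        total_cost = sum(r["cost_usd"] for r in items)
        passed = sum(1 for r in items if r["score"] >= pass_threshold)
        by_tag = defaultdict(list)
        for r in items:
            for tag in r["tags"]:
                by_tag[tag].append(r["score"])
        summary[model] = {
            "avg_score": mean(r["score"] for r in items),
            "total_cost": total_cost,
            # None when nothing passed, to avoid dividing by zero.
            "cost_per_pass": total_cost / passed if passed else None,
            "best_tag": max(by_tag, key=lambda t: mean(by_tag[t])) if by_tag else None,
        }
    return summary


# Invented toy rows, only to show the shape of the output.
toy = [
    {"model": "model-a", "score": 8.0, "cost_usd": 0.25, "tags": ["code"]},
    {"model": "model-a", "score": 6.0, "cost_usd": 0.5, "tags": ["support"]},
    {"model": "model-b", "score": 5.0, "cost_usd": 0.125, "tags": ["code"]},
]
summary = pivot_export(toy)
print(summary)
```

Cost per passing response divides total spend by the number of responses that cleared the threshold, so a cheap model that fails often can still come out expensive on that column.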

The mechanics are real and already validated: all three models run through the exact /v1/prompts/compare + score_responses path we used for the live June benchmarks. And the shape of the answer is predictable from those runs — in the DeepSeek cost test a DeepSeek model matched Opus 4.6 on a debugging task at ~1/43 the cost, and the 2026 cost-per-quality refresh found the frontier premium buys little correctness on structured tasks. Your pivot will almost certainly say the same thing in your library's specific dialect: some tag groups are fine on the cheap model, a few need the frontier.

The audit almost never says "one model wins everything." It says "this tag group is fine on the cheap model, that one needs the frontier" — which is exactly the routing logic that produces real savings. We documented the size of that prize in how to save 60% on AI API costs and the underlying cost-per-quality 300x spread.
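That routing logic can be written down directly: for each tag group, send traffic to the cheapest model that clears a quality bar, and fall back to the best scorer when none does. This is a sketch of one reasonable policy, not a Promptster feature. The `min_score` bar, the input shape, and the toy numbers are all invented for illustration.

```python
def build_routing_table(stats, min_score=7.0):
    """Pick one model per tag.

    stats maps tag -> {model: {"avg_score": float, "avg_cost": float}}.
    """
    routing = {}
    for tag, by_model in stats.items():
        passing = {m: v for m, v in by_model.items() if v["avg_score"] >= min_score}
        if passing:
            # Cheapest model among those that meet the quality bar.
            routing[tag] = min(passing, key=lambda m: passing[m]["avg_cost"])
        else:
            # Nothing clears the bar: route to the highest scorer and flag it for review.
            routing[tag] = max(by_model, key=lambda m: by_model[m]["avg_score"])
    return routing


# Invented numbers, only to exercise the three cases.
toy_stats = {
    "summarize": {
        "cheap": {"avg_score": 8.2, "avg_cost": 0.001},
        "frontier": {"avg_score": 8.5, "avg_cost": 0.04},
    },
    "code": {
        "cheap": {"avg_score": 6.1, "avg_cost": 0.001},
        "frontier": {"avg_score": 8.9, "avg_cost": 0.04},
    },
    "legal": {
        "cheap": {"avg_score": 4.0, "avg_cost": 0.001},
        "frontier": {"avg_score": 6.5, "avg_cost": 0.04},
    },
}
routing = build_routing_table(toy_stats)
print(routing)
```

Where you set the bar is a product decision: raising `min_score` pushes more tag groups to the expensive model, lowering it does the opposite.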

Step 5 — Make It Recurring

A one-time audit is stale in a quarter. Once the CSV exists, re-running it on the next model wave is one command. The export becomes your before; the next run becomes your after. That's a regression test for your entire prompt library.
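The before/after comparison is a join on `prompt_id`. A minimal sketch, assuming you have reduced each export to a `{prompt_id: score}` map for the model that serves that prompt; the `tolerance` of 0.5 points is an arbitrary allowance for judge noise that you should tune against your own runs.

```python
def find_regressions(before, after, tolerance=0.5):
    """Compare two {prompt_id: score} maps from consecutive audit runs.

    Returns (regressions, missing): prompts whose score dropped by more than
    `tolerance`, worst drop first, and prompt_ids absent from the newer run.
    """
    regressions = []
    missing = []
    for prompt_id, old in before.items():
        if prompt_id not in after:
            missing.append(prompt_id)
            continue
        new = after[prompt_id]
        if old - new > tolerance:
            regressions.append((prompt_id, old, new))
    regressions.sort(key=lambda r: r[1] - r[2], reverse=True)
    return regressions, missing


# Invented scores, only to show the output shape.
before = {"inv-001": 9.0, "sup-014": 8.0, "cod-103": 7.0, "old-999": 5.0}
after = {"inv-001": 9.0, "sup-014": 6.0, "cod-103": 3.0}
regressions, missing = find_regressions(before, after)
print(regressions)
print(missing)
```

This is why `prompt_id` must stay stable and never be reused: it is the only thing tying one quarter's export to the next.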

The Real Lesson

The reason prompt libraries rot is that auditing them by hand doesn't scale, so nobody does it. CSV import removes the excuse: one file, a three-model fan-out, automated scoring against your reference outputs, and an export you can pivot. The audit won't crown a single winner — it'll hand you a routing table, sliced by tag, that tells you exactly which prompts can move to the cheap model and which can't. That table is worth more than any single benchmark.

Tutorial uses Promptster batch CSV import, compare, score_responses, and GET /v1/export as of 2026-06-13. The audit table is a template you populate from your own CSV export — the per-model API mechanics and the cost-vs-quality shape are validated in our June 2026 live benchmarks (linked above).