Context Quality Beats Token Volume: An A/B Test of Lean vs Dumped Context
By Promptster Team · 2026-06-08
For two years the long-context arms race told a seductive story: bigger windows, just stuff everything in, let the model sort it out. By 2026 that story has collapsed. With million-token windows now table stakes across the frontier, the bottleneck moved. The limiting factor isn't how much context you can fit — it's how good the context is.
This is the practical follow-on to context engineering by example and the broader context vs instructions argument. Here we make it concrete: take one task, build two prompts — lean curated context vs a big undifferentiated dump — and A/B test them.
Why "Just Dump It" Stopped Working
A big context window is not free attention. Three failure modes show up as you scale token volume without scaling quality:
- Lost-in-the-middle. Models reliably weight the start and end of a long context and under-attend the middle. Bury the critical fact at token 40,000 and it may as well not be there.
- Distractor pollution. Irrelevant-but-plausible passages actively pull the answer toward wrong territory. More tokens = more distractors.
- Cost and latency. You pay per input token. A 200K-token dump is real money and real latency on every single call, forever.
Lean curated context     Big dump
┌──────────────────┐     ┌───────────────────────────────┐
│ 3 relevant docs  │     │ 60 docs, 8 relevant, 52 noise │
│ ~4K tokens       │     │ ~180K tokens                  │
│ answer = signal  │     │ answer = signal ÷ distractors │
└──────────────────┘     └───────────────────────────────┘
cheap, fast, sharp       expensive, slow, diluted
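To put rough numbers on the cost bullet, here is a minimal sketch of the per-call arithmetic. The price of $2.00 per million input tokens and the 10,000 calls per day are illustrative assumptions, not any provider's actual rate or real traffic. Only the ~4K and ~180K token counts come from the comparison above.

```python
# Illustrative only: both constants are assumed values, not real pricing or traffic.
PRICE_PER_MILLION_INPUT_USD = 2.00
CALLS_PER_DAY = 10_000


def input_cost_usd(input_tokens: int,
                   price_per_million: float = PRICE_PER_MILLION_INPUT_USD) -> float:
    """Cost of the input side of a single call, in USD."""
    return input_tokens / 1_000_000 * price_per_million


lean_cost = input_cost_usd(4_000)     # ~4K-token curated context
dump_cost = input_cost_usd(180_000)   # ~180K-token dump

print(f"lean: ${lean_cost:.4f} per call, ${lean_cost * CALLS_PER_DAY:,.2f} per day")
print(f"dump: ${dump_cost:.4f} per call, ${dump_cost * CALLS_PER_DAY:,.2f} per day")
print(f"dump / lean input cost ratio: {dump_cost / lean_cost:.0f}x")
```

Whatever the real price is, the ratio depends only on the token counts: a 180K-token prompt costs 45 times the input of a 4K-token one on every call.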
The A/B Setup
Same task, same model, same temperature. The only variable is the context.
- Variant A (lean): the 3 documents a human would actually pull, ~4K tokens, curated.
- Variant B (dump): all 60 candidate documents pasted in, ~180K tokens, no curation.
Wire it up as a comparison so the two variants run identically except for the context payload:
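import os
import requests

BASE = "https://www.promptster.dev/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['PROMPTSTER_API_KEY']}"}
QUESTION = "Which clause governs liability caps, and what is the cap?"

# Placeholders (not in the original snippet): replace with your own document text.
lean_context = "<the 3 curated documents, ~4K tokens>"
dump_context = "<all 60 candidate documents, ~180K tokens>"

def run(context_block, label):
    """Send one variant and print its input tokens, cost, and latency."""
    prompt = f"{context_block}\n\nQuestion: {QUESTION}"
    r = requests.post(f"{BASE}/prompts/test", headers=HEADERS, json={
        "provider": "google",
        "model": "gemini-3.1-pro-preview",  # 1M-token window
        "prompt": prompt,
        "temperature": 0.0,  # same sampling settings for both variants
        "max_tokens": 300,
    }, timeout=120)
    r.raise_for_status()  # fail loudly on HTTP errors instead of parsing a bad body
    d = r.json()
    print(label, "| tokens:", d["metadata"]["input_tokens"],
          "| cost:", d["metadata"]["cost_usd"],
          "| latency:", d["metadata"]["latency_ms"])
    return d

# Identical calls; only the context payload differs.
run(lean_context, "A (lean)")
run(dump_context, "B (dump)")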
Score both answers against the known-correct cap. Then compare not just accuracy but cost and latency: the dump variant pays for every extra input token whether or not it gets the answer right, and at ~180K tokens it will usually pay in latency as well.
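The scoring step can be as simple as a phrase check. The sketch below assumes you have already pulled the answer text out of each response (the field that holds it is not shown in the snippet above, so it is left out here). The ground-truth phrases and sample answers are hypothetical stand-ins, not data from the test.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def is_correct(answer: str, required_phrases: list[str]) -> bool:
    """True only if every required phrase appears somewhere in the answer."""
    norm = normalize(answer)
    return all(normalize(phrase) in norm for phrase in required_phrases)


# Hypothetical ground truth; substitute the real clause and cap for your documents.
REQUIRED = ["section 9.2", "$1,000,000"]

# Hypothetical answers, used only to exercise the checker.
samples = {
    "matching answer": "Liability is capped under Section 9.2 at $1,000,000.",
    "wrong clause": "The cap appears to be set in Section 4.1.",
}

for label, text in samples.items():
    print(label, "| correct:", is_correct(text, REQUIRED))
```

Substring matching is deliberately strict and crude. It works for a single known fact like a clause number and a dollar figure; anything open-ended needs a rubric or a human reviewer.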
The Result
We ran a smaller version of this comparison. On a refund-policy lookup ("was this digital-goods purchase refundable?") we gave each model a lean context (just the policy plus the order facts) and a bloated one (the same fact padded with extra surrounding clauses), then scored both GPT-5.2 and Claude Opus 4.6 on each variant. This is a far lighter test than the 4K-versus-180K setup sketched above: judging by the input costs in the table, the padding was modest, on the order of 10% more input for GPT-5.2.
The honest result is more nuanced than the dramatic accuracy cliff you might expect. Both models answered correctly ("No, digital goods are non-refundable") in both the lean and the bloated context. The fact was findable either way. On this easy, single-needle lookup, bloat did not cause an error. Its only measurable cost was extra input tokens on GPT-5.2; latency did not get worse for either model.
| Variant | Correct? | Input cost | Latency |
|---|---|---|---|
| GPT-5.2 — lean | Yes | $0.000406 | 1192 ms |
| GPT-5.2 — bloated | Yes | $0.000445 | 888 ms |
| Opus 4.6 — lean | Yes | $0.00129 | 2733 ms |
| Opus 4.6 — bloated | Yes | $0.001285 | 2682 ms |
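So the lesson here is not "bloat makes the model wrong." It's narrower and more useful: on an easy lookup, curated context buys you cost savings, not correctness. The bloated GPT-5.2 call cost ~10% more in input tokens than the lean one; the Opus calls were a wash on cost and latency. Latency did not favor the lean prompt in these runs: the bloated GPT-5.2 call actually returned faster (888 ms vs 1192 ms), a gap more plausibly explained by run-to-run variance than by prompt size at this scale. Nobody got the answer wrong, because the needle was big and the haystack was small.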
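The percentages are easy to recompute from the table. This short sketch copies the four rows as they appear above and reports the change going from lean to bloated for each model.

```python
# Figures copied from the results table: (input cost in USD, latency in ms).
results = {
    "GPT-5.2": {"lean": (0.000406, 1192), "bloated": (0.000445, 888)},
    "Opus 4.6": {"lean": (0.00129, 2733), "bloated": (0.001285, 2682)},
}


def pct_change(lean: float, bloated: float) -> float:
    """Percent change going from the lean value to the bloated value."""
    return (bloated - lean) / lean * 100


deltas = {}
for model, runs in results.items():
    (lean_cost, lean_ms), (bloat_cost, bloat_ms) = runs["lean"], runs["bloated"]
    deltas[model] = (pct_change(lean_cost, bloat_cost), pct_change(lean_ms, bloat_ms))
    print(f"{model}: cost {deltas[model][0]:+.1f}%, latency {deltas[model][1]:+.1f}%")
```

With the table's figures this works out to about +9.6% cost and -25.5% latency for GPT-5.2, and about -0.4% cost and -1.9% latency for Opus 4.6. A latency swing that large on a prompt that grew by a tenth is a good reason to repeat each call several times before reading anything into the timing.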
The accuracy danger from bloat is real, but it shows up as tasks get harder and the needle gets buried, not on a one-clause refund question. That's the lost-in-the-middle failure mode: at token 40,000, surrounded by 52 distractors, models can start missing facts they'd nail in a 4K-token prompt. We did not measure that regime here. The small cost difference we did measure is the easy-mode preview of a bill, and eventually an accuracy cliff, that grows as the context does.
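One way to probe the buried-needle regime yourself is to hold the documents fixed and move only the position of the key fact. The sketch below builds those contexts; the needle text, the 52 filler documents, and the `---` separator are hypothetical stand-ins, and real distractors should be plausible passages from your own corpus.

```python
SEPARATOR = "\n\n---\n\n"


def build_context(needle: str, distractors: list[str], depth: float) -> str:
    """Place `needle` among `distractors` at a relative depth in [0.0, 1.0].

    depth=0.0 puts the needle first, 1.0 puts it last, 0.5 in the middle.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(distractors))
    docs = distractors[:position] + [needle] + distractors[position:]
    return SEPARATOR.join(docs)


# Hypothetical stand-in documents.
needle = "Refund policy: digital goods are non-refundable."
distractors = [f"Clause {i}: unrelated filler text number {i}." for i in range(52)]

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    context = build_context(needle, distractors, depth)
    index = context.split(SEPARATOR).index(needle)
    print(f"depth={depth:.2f} -> needle is document {index + 1} of {len(distractors) + 1}")
```

Each context would then go through the same `run` call and the same scoring check, so that accuracy can be plotted against depth rather than inferred from a single placement.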
So Do You Even Need RAG?
This is exactly the question our RAG vs long-context decision framework tackles. Short version: curation is retrieval. Whether you do it with a vector DB, a keyword filter, or a human picking three files, the lesson is the same — the model performs best when you've already done the hard part of deciding what matters.
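To show how little machinery "curation is retrieval" needs at the low end, here is a minimal keyword-filter sketch. It ranks documents by word overlap with the question and keeps the top few. The mini-corpus is hypothetical, and the scoring is deliberately naive (no stop-word removal, no stemming), so treat it as a baseline to beat rather than a recommended retriever.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for real retrieval features."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k_by_overlap(question: str, docs: list[str], k: int = 3) -> list[str]:
    """Rank docs by how many question words they share and keep the top k."""
    q_words = tokenize(question)
    ranked = sorted(docs, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]


# Hypothetical mini-corpus for illustration.
docs = [
    "Clause 12 sets liability caps for both parties.",
    "Office parking rules for visitors.",
    "The liability cap is reviewed annually.",
    "Holiday schedule for the support team.",
]
question = "Which clause governs liability caps, and what is the cap?"

lean_context = "\n\n".join(top_k_by_overlap(question, docs, k=2))
print(lean_context)
```

The two liability documents survive and the parking and holiday notices are dropped. Swapping the overlap score for embedding similarity changes the ranking quality, not the shape of the step.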
The Real Lesson
Context engineering in 2026 is mostly an exercise in deletion, not accumulation. The cheapest, fastest, most accurate prompt is usually the one where you removed everything the model didn't need. A bigger window gives you the option to dump — it doesn't make dumping a good idea. Curate first, then measure both quality and the bill.
Tests run 2026-05-26 via the Promptster /v1/prompts/compare API. Costs are per-call estimates from Promptster's pricing model.