Context Quality Beats Token Volume: An A/B Test of Lean vs Dumped Context

By Promptster Team · 2026-06-08

For two years the long-context arms race told a seductive story: bigger windows, just stuff everything in, let the model sort it out. By 2026 that story has collapsed. With million-token windows now table stakes across the frontier, the bottleneck moved. The limiting factor isn't how much context you can fit — it's how good the context is.

This is the practical follow-on to context engineering by example and the broader context vs instructions argument. Here we make it concrete: take one task, build two prompts — lean curated context vs a big undifferentiated dump — and A/B test them.

Why "Just Dump It" Stopped Working

A big context window is not free attention. Three costs show up as you scale token volume without scaling quality: a larger bill, slower responses, and an answer diluted by distractors.

Lean curated context          Big dump
┌──────────────────┐          ┌──────────────────────────────┐
│ 3 relevant docs  │          │ 60 docs, 8 relevant, 52 noise │
│ ~4K tokens       │          │ ~180K tokens                  │
│ answer = signal  │          │ answer = signal ÷ distractors │
└──────────────────┘          └──────────────────────────────┘
   cheap, fast, sharp            expensive, slow, diluted
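
To put rough numbers on the diagram, the sketch below computes the share of relevant documents and the token multiplier for each side. The figures are taken from the diagram; treating "signal" as a plain relevant-to-total ratio is a simplifying assumption for illustration, not a model of how attention actually behaves.

```python
def relevant_share(relevant_docs: int, total_docs: int) -> float:
    """Fraction of documents in the context that bear on the question."""
    return relevant_docs / total_docs


# Figures from the diagram above.
lean = {"relevant": 3, "total": 3, "tokens": 4_000}
dump = {"relevant": 8, "total": 60, "tokens": 180_000}

lean_share = relevant_share(lean["relevant"], lean["total"])
dump_share = relevant_share(dump["relevant"], dump["total"])
token_multiplier = dump["tokens"] / lean["tokens"]

print(f"lean: {lean_share:.0%} of docs relevant")
print(f"dump: {dump_share:.0%} of docs relevant at {token_multiplier:.0f}x the tokens")
```

On these numbers the dump carries 45 times the tokens while only about one document in eight is relevant.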

The A/B Setup

Same task, same model, same temperature. The only variable is the context.

Wire it up as a comparison so the two variants run identically except for the context payload:

import os, requests

BASE = "https://www.promptster.dev/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['PROMPTSTER_API_KEY']}"}

QUESTION = "Which clause governs liability caps, and what is the cap?"

def run(context_block, label):
    prompt = f"{context_block}\n\nQuestion: {QUESTION}"
    r = requests.post(f"{BASE}/prompts/test", headers=HEADERS, json={
        "provider": "google",
        "model": "gemini-3.1-pro-preview",   # 1M-token window
        "prompt": prompt,
        "temperature": 0.0,
        "max_tokens": 300,
    }, timeout=120)
    r.raise_for_status()
    d = r.json()
    print(label, "| tokens:", d["metadata"]["input_tokens"],
          "| cost:", d["metadata"]["cost_usd"],
          "| latency:", d["metadata"]["latency_ms"])
    return d

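# lean_context and dump_context are the two context payloads as plain strings.
# They are not defined in this snippet; build them from your own documents first.
run(lean_context, "A (lean)")
run(dump_context, "B (dump)")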

Score both answers against the known-correct cap. Then compare not just accuracy but cost and latency, because a large dump can cost more and run slower even when it gets the answer right.
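
One way to do that scoring is a normalized substring check, sketched below. The expected clause and cap values are hypothetical placeholders, and the function takes the answer as a plain string because the response field that holds the model's text is not shown above.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", text.lower())


def score_answer(answer: str, expected_clause: str, expected_cap: str) -> dict:
    """Check whether the answer names both the expected clause and the cap."""
    flat = normalize(answer)
    clause_ok = normalize(expected_clause) in flat
    cap_ok = normalize(expected_cap) in flat
    return {"clause": clause_ok, "cap": cap_ok, "correct": clause_ok and cap_ok}


# Hypothetical ground truth and a hypothetical model answer.
EXPECTED_CLAUSE = "Section 9.2"
EXPECTED_CAP = "$1,000,000"
sample_answer = "Liability is capped at $1,000,000 under Section 9.2."

result = score_answer(sample_answer, EXPECTED_CLAUSE, EXPECTED_CAP)
print(result)
```

Substring matching is deliberately crude: it would accept a cap that merely contains the expected digits, so spot-check borderline answers by hand.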

The Result

We ran a version of this test on a simpler task: a refund-policy lookup ("was this digital-goods purchase refundable?"). Each model got a lean context (just the policy plus the order facts) and a bloated one (the same facts padded with extra surrounding clauses). We scored both GPT-5.2 and Claude Opus 4.6 on each variant.

The honest result is more nuanced than the dramatic accuracy cliff you might expect. Both models answered correctly ("No, digital goods are non-refundable") in both the lean and the bloated context. The fact was findable either way. On this easy, single-needle lookup, bloat did not cause an error. What it cost was extra input spend on GPT-5.2; latency did not move consistently in either direction.

Variant              Correct?   Input cost (USD)   Latency (ms)
GPT-5.2, lean        Yes        0.000406           1192
GPT-5.2, bloated     Yes        0.000445           888
Opus 4.6, lean       Yes        0.00129            2733
Opus 4.6, bloated    Yes        0.001285           2682

So the lesson here is not "bloat makes the model wrong." It's narrower and more useful: on an easy lookup, curated context saves you money, not correctness. The bloated GPT-5.2 call cost ~10% more in input cost than the lean one, although it also came back faster (888 ms vs 1192 ms), so this small test supports no latency claim either way. The Opus calls were a wash on cost and latency. Nobody got the answer wrong, because the needle was big and the haystack was small.
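
The percentages can be checked directly from the table. This is a small arithmetic sketch; the `results` dictionary simply restates the reported cost and latency figures.

```python
def pct_change(lean: float, bloated: float) -> float:
    """Percentage change going from the lean variant to the bloated one."""
    return (bloated - lean) / lean * 100


# (input cost in USD, latency in ms), copied from the results table.
results = {
    "GPT-5.2": {"lean": (0.000406, 1192), "bloated": (0.000445, 888)},
    "Opus 4.6": {"lean": (0.00129, 2733), "bloated": (0.001285, 2682)},
}

deltas = {}
for model, runs in results.items():
    cost = pct_change(runs["lean"][0], runs["bloated"][0])
    latency = pct_change(runs["lean"][1], runs["bloated"][1])
    deltas[model] = (cost, latency)
    print(f"{model}: cost {cost:+.1f}%, latency {latency:+.1f}%")
```

A positive number means the bloated variant was more expensive or slower. With only one reported measurement per variant, small latency differences should be read as noise until repeated runs say otherwise.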

The accuracy danger from bloat is real, but it shows up as tasks get harder and the needle gets buried, not on a one-clause refund question. That's the lost-in-the-middle failure mode: at token 40,000, surrounded by 52 distractors, the same models can start missing facts they'd nail in a 4K-token prompt. The small cost gap we measured here is the easy-mode preview of a cliff you fall off as the context grows. When to curate hard versus when long context is fine is exactly the call our RAG vs long-context decision framework walks through.
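
To probe for that failure mode yourself, one option is to vary where the needle sits among the distractors and rerun the same question at each depth. The helper below is a hypothetical sketch, not part of the test reported here; the distractor text is invented filler.

```python
def build_haystack(needle: str, distractors: list[str], depth: float) -> str:
    """Place the needle among distractors at a relative depth.

    depth=0.0 puts the needle first, depth=1.0 puts it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(distractors))
    docs = distractors[:position] + [needle] + distractors[position:]
    return "\n\n".join(docs)


# Hypothetical filler: 52 distractor clauses around one real fact.
distractors = [f"Clause {i}: filler text unrelated to refunds." for i in range(1, 53)]
needle = "Refund policy: digital goods are non-refundable."

positions = {}
for depth in (0.0, 0.5, 1.0):
    context = build_haystack(needle, distractors, depth)
    positions[depth] = context.split("\n\n").index(needle)
    print(f"depth {depth}: needle is document #{positions[depth]}")
```

Each generated context can then be sent through the same call as the A/B variants, so accuracy can be compared by depth as well as by size.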

So Do You Even Need RAG?

This is exactly the question our RAG vs long-context decision framework tackles. Short version: curation is retrieval. Whether you do it with a vector DB, a keyword filter, or a human picking three files, the lesson is the same — the model performs best when you've already done the hard part of deciding what matters.
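
As an illustration of the keyword-filter option, the sketch below ranks documents by how many question terms they share and keeps the top few. The document names, contents, and stopword list are hypothetical, and real retrieval would usually want stemming or embeddings on top of this.

```python
import re

# A tiny, hypothetical stopword list; extend it for real use.
STOPWORDS = {"the", "is", "and", "what", "which", "a", "of", "for", "on", "by"}


def tokenize(text: str) -> set[str]:
    """Lowercased word set with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def curate(question: str, docs: dict[str, str], k: int = 3) -> list[str]:
    """Return the names of up to k docs sharing the most terms with the question."""
    q_terms = tokenize(question)
    scored = [(len(q_terms & tokenize(body)), name) for name, body in docs.items()]
    scored = [(score, name) for score, name in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, then by name
    return [name for _, name in scored[:k]]


# Hypothetical document set.
docs = {
    "msa.txt": "Clause 12 sets the liability cap for each party.",
    "holiday_schedule.txt": "The office is closed on public holidays.",
    "sow.txt": "Liability for late delivery is governed by the master agreement.",
}
QUESTION = "Which clause governs liability caps, and what is the cap?"

selected = curate(QUESTION, docs, k=2)
print(selected)
```

Documents with no overlapping terms are dropped entirely, so the filter can return fewer than k results rather than padding the context with noise.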

The Real Lesson

Context engineering in 2026 is mostly an exercise in deletion, not accumulation. The cheapest, fastest, most accurate prompt is usually the one where you removed everything the model didn't need. A bigger window gives you the option to dump — it doesn't make dumping a good idea. Curate first, then measure both quality and the bill.


Tests run 2026-05-26 via the Promptster /v1/prompts/compare API. Costs are per-call estimates from Promptster's pricing model.