Drift Report: Did the Spring 2026 Model Wave Break Your Prompts?
By Promptster Team · 2026-06-18
In a short window this spring, the frontier moved under everyone's feet:
| Model | Positioning |
|---|---|
| Claude Opus 4.7 | Anthropic's latest frontier release (new tokenizer) |
| GPT-5.5 | OpenAI's current release |
| Gemini 3.5 Flash | Google's newest cheap-tier model |
| Claude Sonnet 4.6 | Anthropic's mid-tier refresh |
Every one of those is a better model. And every one of them silently changed the output your prompts produce. A better model does not mean identical output on your prompt. A model that scores better on a benchmark can still start wrapping JSON in markdown fences, refusing a borderline request it used to answer, or padding a summary from 80 words to 140.
That's drift. Your prompt is the same; the output isn't. This is exactly the failure mode we built scheduled drift detection for — and it's why we stopped trusting single-provider benchmarks. This post measures it directly: a fixed prompt set, re-run old version vs new version, drift reported.
Why "Silent" Is The Dangerous Word
When a provider deprecates a model, you get an email. When a provider improves a model behind the same alias, you often get nothing. Your code pins gpt-5.4 (or worse, an unversioned alias that now points at a newer version), the requests keep succeeding with 200s, and the shape of the answers shifts. Uptime is green. Quality moved. Nobody alerted you because, from the provider's view, nothing broke.
The drift classes that bite hardest:
- Format drift — fences appear/disappear, key ordering changes, an extra prose preamble shows up before the JSON.
- Verbosity drift — the new version is chattier (or terser), breaking downstream length assumptions.
- Tokenizer drift — same prompt, more input tokens. Anthropic's Opus 4.7 ships a new tokenizer that can use up to ~35% more tokens for the same input. Your input cost rises even when your output looks identical.
- Refusal drift — the safety boundary moved; a prompt that used to answer now declines, or vice versa.
- Tool-calling drift — argument formatting or tool-selection behavior changes.
The Test Design
The method is deliberately boring, because boring is reproducible: hold the prompt fixed, hold temperature fixed, vary only the model version, diff the outputs. We ran two probes — a strict JSON-extraction task (no fences) on the OpenAI and Anthropic Opus pairs, and a "give exactly three tips, numbered list only" formatting task on the Gemini Flash and Anthropic Sonnet pairs — across four real old→new pairs from the May 2026 wave. Here's what came back.
The headline is reassuring: behavioral drift on correctness and format was low across the entire wave. Every model produced the correct JSON with no fences, or a clean three-item numbered list, on both sides of every pair. Nothing regressed on the contract. The visible changes were subtler — newer models shifting verbosity in both directions, Opus 4.7's new tokenizer inflating input-token counts on identical prompts, and across-the-board price-per-call increases for the new versions.
| Old → New pair | Output stable? | Notable drift |
|---|---|---|
| gpt-5.4 → gpt-5.5 | Yes — byte-identical output: {"product":"Widget Pro","price_usd":29.99}, no fences, 49 output tokens both versions |
None on output. Cost ~2× higher ($0.000822 → $0.001645) for the identical answer. Latency edged up (1.25s → 1.63s). |
| opus-4-6 → opus-4-7 | Yes — both returned correct JSON, no fences | Format polish + tokenizer inflation. 4-6 added spaces after colons ({"product": "Widget Pro", ...}); 4-7 dropped them ({"product":"Widget Pro",...}). Same prompt jumped from 40 → 56 input tokens (+40%) — the new tokenizer in action. Output tokens 21 → 25. Cost up modestly ($0.000725 → $0.000905). |
| gemini-3-flash-preview → gemini-3.5-flash | Yes — both gave a valid numbered 3-tip list | Verbosity up. 3 Flash Preview was terse (62 output tokens, 298 chars); 3.5 Flash got chattier (106 tokens, 413 chars) — adds parenthetical examples like (e.g., "Fix bug" instead of "Fixed bug"). Cost ~5× higher ($0.000193 → $0.000977). |
| sonnet-4-5 → sonnet-4-6 | Yes — both gave a clean numbered 3-tip list | Negligible. Near-identical content (74 vs 69 output tokens), same advice, near-identical phrasing. Cost essentially flat ($0.001176 → $0.001101). |
No prompt broke. No fences appeared where they shouldn't, no list lost its structure, no version regressed on correctness. But the pattern is not "newer models are uniformly terser" the way last quarter's wave looked — it's mixed. GPT-5.5 was byte-identical to GPT-5.4. Sonnet 4.6 was effectively a tie with 4.5. Opus 4.7 tightened JSON formatting but inflated input tokens by 40% on this prompt due to the new tokenizer. And Gemini 3.5 Flash got chattier than its preview predecessor, not leaner.
The trap to watch — even when drift is low: the same-output, higher-cost case. GPT-5.4 → GPT-5.5 produced a byte-identical JSON output, but cost roughly twice as much per call. If you upgrade your pin and don't look at your bill, you'll notice in a month. Same for Opus 4.7's tokenizer change: your input cost rose ~40% on this prompt for an output that looks essentially the same.
How To Read Your Own Results
When you run this against your real prompts, sort the outcomes into three buckets:
- Pure improvement, no drift — adopt the new version freely.
- Improvement + drift — adopt, but fix the prompt or the parser for the drift class you saw. This is the most common bucket and the one teams underestimate.
- Regression — the new version scored lower on your task. Pin the old version and file the gap. It happens more than the marketing implies, especially on narrow domain tasks the benchmark didn't cover.
The Defense: Pin, Snapshot, Schedule
You can't stop providers from improving models. You can stop being surprised:
- Pin explicit version strings, never bare aliases. A dated snapshot like
gpt-5.5-<release-date>over a baregpt-5. - Snapshot a reference output set the day you ship a prompt. That snapshot is your drift baseline.
- Schedule the diff. Re-run your reference set on a cadence (and on every announced model update) and alert on score drops, format changes, or cost changes. That's the whole point of scheduled drift detection.
- Watch input-token counts on Anthropic upgrades. Opus 4.7's new tokenizer means your input cost can jump even when your prompt and output don't change a character. Snapshot token counts, not just outputs.
The Real Lesson
Good news from this run: the May 2026 model wave did not break these prompts. Across all four old→new pairs — GPT-5.4 → GPT-5.5, Opus 4.6 → Opus 4.7, Gemini 3 Flash Preview → Gemini 3.5 Flash, and Sonnet 4.5 → Sonnet 4.6 — outputs stayed stable on correctness and format. The JSON came back clean and fence-free on every version, the lists stayed structured at three tips, and nothing regressed on the task.
But the cost story is less rosy. GPT-5.5 produced a byte-identical answer to GPT-5.4 for roughly twice the price. Opus 4.7's new tokenizer inflated input tokens by ~40% on the same prompt, raising per-call cost even with similar output length. Gemini 3.5 Flash got chattier than its preview predecessor — 106 output tokens vs 62 — and cost ~5× more. Only Sonnet 4.6 was essentially a wash with 4.5. The output is the same; the bill is not. That's a real drift class teams routinely miss because their parsers don't break and their tests don't fail.
Low drift on our prompts is not a promise about yours. The verbosity that helped us could break a downstream length assumption elsewhere, and a future version could just as easily reintroduce a fence. So treat a model upgrade like any other dependency bump: pin the version, diff against a snapshot — outputs and token counts and cost — and gate adoption on your own prompt set. Green uptime is not green quality, and a clean drift report is only clean for the prompts you actually tested.
When you decide to move, do it deliberately — the per-diff playbook for the highest-traffic upgrade of this wave is in migrating prompts from GPT-5 to GPT-5.5.
Tests run 2026-05-30 via the Promptster /v1/prompts/compare API. JSON-strict task at temperature 0, format-list task at temperature 0.2. Costs computed from the May 2026 pricing.ts.