Migrating From GPT-5 to GPT-5.5: The Diffs That Actually Change Outputs
By Promptster Team · 2026-06-19
GPT-5.5 is OpenAI's current release. If you skipped GPT-5.2 and GPT-5.4, the jump from GPT-5 straight to 5.5 is a three-minor-version migration — pitched as a clear upgrade on reasoning and hallucination. So the migration should be a no-op, right? Bump the model string, ship it.
It isn't. It never is. A better model produces different output, and different output breaks pipelines tuned for the old one. Same-vendor, same-family migrations feel safe precisely because nobody runs the diff — and that's where the regressions hide.
This is a case-study migration. We reuse the seven-class diff framework from migrating prompts across providers, narrowed to the one upgrade most teams are doing right now: GPT-5 → GPT-5.5. Which diffs actually change outputs, and what to do about each.
The Diff Classes That Bite On Same-Family Upgrades
Cross-provider migrations hit all seven diff classes. Same-family upgrades concentrate the damage in four. Here's the map:
| Diff class | Risk on GPT-5 → 5.5 | Why |
|---|---|---|
| Format / fences | High | Newer models often change default wrapping (markdown fences, preambles, bold styling) |
| Verbosity | High | Hallucination-reduction training tends to shift answer length and hedging |
| Refusal boundary | Medium | Safety calibration moves between versions |
| Tool-calling | Medium | Argument formatting and tool-selection thresholds shift |
| Temperature semantics | Low | Same vendor, same scale |
| Instruction hierarchy | Low | Same family, similar system-prompt handling |
| Reasoning output parsing | Low–Med | If you parse reasoning traces, check the envelope |
Across three minor versions, the top four classes are where your migration test budget should go.
Diff Class 1 — Format & Fences
The classic silent breaker. A GPT-5 prompt that reliably returns bare JSON can start wrapping it in ```json fences on 5.5, or prepend a one-line "Here's the JSON you requested:". Your parser, expecting response.startsWith('{'), throws.
Fix: re-assert the format contract explicitly ("Return only valid JSON, no code fences, no preamble") and make your parser tolerant of fences as a belt-and-suspenders measure.
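The tolerant-parser half of that fix could look something like the sketch below. `parseModelJson` is a hypothetical helper name, not part of any SDK, and it assumes the payload is a single JSON object (not a top-level array) that may arrive inside a markdown fence or after a one-line preamble.

```typescript
// Hypothetical helper: pulls one JSON object out of a model reply that may be
// wrapped in a markdown fence or preceded by a short preamble.
const FENCE = "`".repeat(3); // three backticks, built at runtime

function parseModelJson(raw: string): unknown {
  const trimmed = raw.trim();

  // Case 1: the reply contains a fenced block, with or without a "json" tag.
  const fencePattern = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");
  const fenced = trimmed.match(fencePattern);
  const candidate = fenced ? fenced[1].trim() : trimmed;

  // Case 2: a preamble precedes the object, so keep the outermost {...} span.
  const start = candidate.indexOf("{");
  const end = candidate.lastIndexOf("}");
  if (start === -1 || end < start) {
    throw new Error("No JSON object found in model output");
  }
  return JSON.parse(candidate.slice(start, end + 1));
}
```

The tolerance is deliberately narrow: anything that is not a recoverable object still throws, so a genuinely malformed reply fails loudly instead of being silently accepted.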
Diff Class 2 — Verbosity
Lower hallucination rates tend to come partly from more hedging and qualification. That helps correctness but disrupts any downstream consumer that assumed a length range. A summary capped "in 50 words" on GPT-5 might run to 80 on GPT-5.5, and a one-line answer might gain a caveat sentence. Or, as you'll see below, the visible output can shrink dramatically while the billed tokens stay similar.
Fix: re-tighten length and format instructions, and re-validate any UI or storage that assumed a length envelope.
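One way to re-validate a length envelope is a plain word-count check run over both models' outputs. This is a minimal sketch: `LengthEnvelope`, `countWords`, and `checkEnvelope` are illustrative names, and whitespace-delimited words are only a rough proxy for whatever your UI or storage actually limits.

```typescript
// Illustrative types and helpers; the bounds are whatever your downstream assumed.
interface LengthEnvelope {
  minWords: number;
  maxWords: number;
}

// Counts whitespace-delimited words; an empty or blank string counts as zero.
function countWords(text: string): number {
  return text.trim().split(/\s+/).filter((t) => t.length > 0).length;
}

// Reports the measured length and whether it falls inside the envelope.
function checkEnvelope(text: string, env: LengthEnvelope): { words: number; ok: boolean } {
  const words = countWords(text);
  return { words, ok: words >= env.minWords && words <= env.maxWords };
}
```

Running `checkEnvelope` on the same prompt's output from both versions shows whether the upgrade pushed a response outside the range in either direction, too long or too short.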
Diff Class 3 — Refusal Boundary
Safety calibration moves between versions in both directions. A borderline-legitimate request — security research framing, medical-adjacent question, edgy creative prompt — that GPT-5 answered may get a refusal on 5.5, or the reverse. This is the diff teams discover from a user complaint, not a test.
Fix: include your known borderline prompts in the migration test set.
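To flag boundary shifts across that test set, a crude heuristic is often enough for a first pass. The sketch below is an assumption-heavy illustration: the marker phrases are guesses at common refusal openings, not an official list, and `looksLikeRefusal` and `refusalShift` are hypothetical names. Anything it flags deserves a human look.

```typescript
// Assumed marker phrases; extend them with refusals you have seen in your own logs.
const REFUSAL_MARKERS = ["i can't help", "i cannot help", "i'm unable to", "i won't be able to"];

// Heuristic: checks only the opening of the reply for a refusal phrase.
function looksLikeRefusal(output: string): boolean {
  const head = output.slice(0, 200).toLowerCase().replace(/\u2019/g, "'");
  return REFUSAL_MARKERS.some((marker) => head.indexOf(marker) !== -1);
}

type RefusalShift = "none" | "new_refusal" | "new_answer";

// Compares the old and new model's replies to the same borderline prompt.
function refusalShift(oldOutput: string, newOutput: string): RefusalShift {
  const before = looksLikeRefusal(oldOutput);
  const after = looksLikeRefusal(newOutput);
  if (!before && after) return "new_refusal"; // the new model refuses what the old one answered
  if (before && !after) return "new_answer";  // the new model answers what the old one refused
  return "none";
}
```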
Diff Class 4 — Tool-Calling
Argument formatting and the threshold at which the model decides to call a tool both shift across versions. 5.5 might extract order_id: "4821" where 5 returned "#4821", or call a tool 5 would have skipped.
Fix: re-run your tool-call fixture set against 5.5 and diff the selected tools and emitted arguments.
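A fixture diff along those lines might be sketched as follows. The `ToolCall` shape and the `lookup_order` tool name in the usage note are hypothetical stand-ins for whatever your SDK returns, and arguments are compared by their `JSON.stringify` form, which is sensitive to key order inside nested objects.

```typescript
// Hypothetical normalized shape for one tool call; null means "no tool was called".
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Returns human-readable differences between the old and new model's tool call.
function diffToolCall(before: ToolCall | null, after: ToolCall | null): string[] {
  const diffs: string[] = [];
  const beforeName = before ? before.name : "(none)";
  const afterName = after ? after.name : "(none)";

  // Selection drift: a different tool, or a call where there used to be none.
  if (beforeName !== afterName) {
    diffs.push(`tool selection: ${beforeName} -> ${afterName}`);
    return diffs;
  }
  if (before === null || after === null) return diffs;

  // Argument drift: compare every key that appears on either side.
  const seen: Record<string, boolean> = {};
  for (const key of Object.keys(before.args).concat(Object.keys(after.args))) {
    if (seen[key]) continue;
    seen[key] = true;
    const a = JSON.stringify(before.args[key]);
    const b = JSON.stringify(after.args[key]);
    if (a !== b) diffs.push(`arg ${key}: ${a} -> ${b}`);
  }
  return diffs;
}
```

For the `order_id` example above, comparing `{ name: "lookup_order", args: { order_id: "#4821" } }` against the same call with `"4821"` yields a single argument diff, which is the kind of change that passes schema validation and still breaks a downstream lookup.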
The Before/After Test
The whole case study reduces to one comparison: same prompts, GPT-5 vs GPT-5.5, diff the outputs by class. We ran identical prompts across three behavior classes — format (a "list three benefits" prompt), JSON extraction, and verbosity (a one-paragraph explanation).
The headline finding: on correctness, the migration is a non-event. On token shape, GPT-5.5 trims output by roughly 2–6×, and on the verbosity prompt, where the answer shrank by only about half, the bill still went up.
| Diff class | GPT-5 | GPT-5.5 | Output changed? |
|---|---|---|---|
| Format ("list 3 benefits") | 3 benefits, plain dashes, 317 out-tokens, $0.003186, 4,340 ms | 3 benefits, bolded leads + numbered, 76 out-tokens, $0.002345, 2,116 ms | Style change — same content, equivalent quality |
| JSON ({"lang","year"}) | {"lang":"Python","year":1991}, 147 out-tokens, $0.001501, 1,505 ms | {"lang":"Python","year":1991}, 25 out-tokens, $0.000875, 1,874 ms | Byte-identical, ~6× fewer tokens |
| Verbosity (JS closure) | One paragraph, correct, 238 out-tokens, $0.002401, 3,488 ms | One paragraph, correct, 113 out-tokens, $0.003475, 3,016 ms | Equivalent explanation — but cost went up |
| Totals | $0.007088 | $0.006695 | 5.5 wins on bill by ~6%; per-token price is higher |
Output correctness: equivalent across all three tasks. The JSON was byte-identical. The closure paragraph hit the same conceptual points — lexical environment, captured variables, common uses — in roughly half the words. The format task changed style: GPT-5 used plain dashes, GPT-5.5 returned a numbered list with bolded leads. Same answer, different shape — exactly the silent-breaker pattern Diff Class 1 warns about.
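If a downstream step consumes list output like the format task's, normalizing both shapes to the same items is one way to absorb the style change. This is a sketch under assumptions: `normalizeListItems` is a hypothetical helper, it only recognizes `-`, `*`, and `1.` / `1)` markers, and it only unwraps `**bold**` spans.

```typescript
// Hypothetical helper: reduces dash, star, or numbered lists (with optional bold
// leads) to plain item strings so both output styles compare equal.
const LIST_MARKER = /^(?:[-*]|\d+[.)])\s+/;

function normalizeListItems(output: string): string[] {
  return output
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => LIST_MARKER.test(line))   // keep only list-item lines
    .map((line) =>
      line
        .replace(LIST_MARKER, "")               // drop the bullet or number
        .replace(/\*\*(.+?)\*\*/g, "$1")        // unwrap bold spans
        .trim()
    );
}
```

With this in place, `"- Faster builds"` and `"1. **Faster builds**"` both reduce to `"Faster builds"`, so a content diff between versions reports only real changes.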
The interesting cost finding: GPT-5.5 produces fewer output tokens on every task, but its per-output-token price is higher than GPT-5's. On the format and JSON tasks the token reduction (4–6×) was big enough to outweigh the per-token jump; on the verbosity task the 2× token reduction wasn't enough — and GPT-5.5 cost more for the same answer. The aggregate bill came out 6% cheaper, but the per-task math goes both ways.
That's a finding worth internalizing before you commit to "5.5 is just cheaper." It isn't, uniformly. It's leaner per call, with a higher per-token rate, and the breakeven depends on how compact your specific traffic gets after the upgrade.
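The breakeven can be worked out directly from the table. The sketch below uses the per-task figures above and treats total cost divided by visible output tokens as an "effective rate". That is an approximation, since the billed cost presumably also covers input tokens, so these are not sticker prices. `compareRuns` is an illustrative name.

```typescript
interface TaskRun {
  outTokens: number; // visible output tokens reported for the call
  costUsd: number;   // total billed cost for the call
}

interface TaskPair {
  task: string;
  gpt5: TaskRun;
  gpt55: TaskRun;
}

// Figures copied from the results table above.
const runs: TaskPair[] = [
  { task: "format", gpt5: { outTokens: 317, costUsd: 0.003186 }, gpt55: { outTokens: 76, costUsd: 0.002345 } },
  { task: "json", gpt5: { outTokens: 147, costUsd: 0.001501 }, gpt55: { outTokens: 25, costUsd: 0.000875 } },
  { task: "verbosity", gpt5: { outTokens: 238, costUsd: 0.002401 }, gpt55: { outTokens: 113, costUsd: 0.003475 } },
];

// The new model is cheaper on a task only when its token shrink factor
// exceeds the ratio of effective per-output-token rates.
function compareRuns(oldRun: TaskRun, newRun: TaskRun) {
  const tokenShrink = oldRun.outTokens / newRun.outTokens;
  const oldRate = oldRun.costUsd / oldRun.outTokens;
  const newRate = newRun.costUsd / newRun.outTokens;
  const rateRatio = newRate / oldRate;
  return { tokenShrink, rateRatio, cheaper: newRun.costUsd < oldRun.costUsd };
}

for (const pair of runs) {
  const r = compareRuns(pair.gpt5, pair.gpt55);
  console.log(pair.task, r.tokenShrink.toFixed(2), r.rateRatio.toFixed(2), r.cheaper);
}
```

On these figures the effective rate ratio lands between roughly 3.0 and 3.4 on all three tasks, so a prompt has to shrink by more than about 3× to come out cheaper. The format and JSON tasks clear that bar; the verbosity task, at about 2.1×, does not.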
The Migration Playbook (GPT-5 → 5.5 Edition)
- Pin both versions explicitly. A dated snapshot, not a bare alias that might already have rolled forward to 5.6.
- Build a 15–25 prompt reference set from real production traffic, including borderline and tool-calling cases.
- Run the before/after diff. Skipping 5.2 and 5.4 means three minor versions of accumulated style drift — expect format and verbosity shifts on most prompts.
- Re-assert the contract. Format ("return only JSON, no fences"), length ("in N words"), and refusal framing — all worth restating.
- Re-run your tool-call fixtures. Argument shapes drift across versions even on identical schemas.
- Measure cost on YOUR prompts. Don't trust the per-million-token sticker — the per-output-token rate went up, and whether the leaner output wins depends on your traffic mix.
- Ramp. Shadow 5.5 against 5 on a slice of traffic before full cutover.
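For the ramp step, one common approach is to pick the shadow slice deterministically from a request or user ID, so the same traffic lands in the slice on every run. The sketch below is one possible way to do that, not a prescribed method: `inShadowSlice` is a hypothetical helper, and the hash is an FNV-1a-style hash computed over UTF-16 code units rather than bytes.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units; always returns an unsigned integer.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Hypothetical helper: true when this request falls in the shadow slice.
// `percent` is the share of traffic (0 to 100) to mirror to the new model.
function inShadowSlice(requestId: string, percent: number): boolean {
  return fnv1a(requestId) % 100 < percent;
}
```

Raising `percent` only ever adds requests to the slice, so earlier shadow comparisons stay valid as the ramp widens.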
This is why prompts fail on a model they weren't tuned for, the same root cause we dissect in why prompts fail on different LLM providers. Before you call the migration done, also re-confirm how temperature behaves under your sampling settings, using the empirical temperature guide.
The Real Lesson
GPT-5.5 is a better model that produces different-shaped output, and "better" is not "drop-in" — especially when you're jumping three minor versions at once. The four diffs that break same-family migrations are format, verbosity, refusal, and tool-calling — test exactly those, fix the prompt, and ramp. The output-shape changes are real (a bolded numbered list where you used to get plain dashes will silently break a markdown-strip regex), and the cost story is more nuanced than the sticker price implies. The upgrade is worth taking. The copy-paste version of it is how you ship a regression with a green dashboard.
For the broader cross-provider diff framework, see migrating prompts across providers.
Tests run 2026-05-30 via the Promptster /v1/prompts/compare API. Temperatures 0.0–0.2. Costs computed from the May 2026 pricing.ts.