Migrating From GPT-5 to GPT-5.5: The Diffs That Actually Change Outputs

By Promptster Team · 2026-06-19

GPT-5.5 is OpenAI's current release. If you skipped GPT-5.2 and GPT-5.4, the jump from GPT-5 straight to 5.5 spans three minor versions, and it is pitched as a clear upgrade: better reasoning, fewer hallucinations. So the migration should be a no-op, right? Bump the model string, ship it.

It isn't. It never is. A better model produces different output, and different output breaks pipelines tuned for the old one. Same-vendor, same-family migrations feel safe precisely because nobody runs the diff — and that's where the regressions hide.

This is a case-study migration. We reuse the seven-class diff framework from migrating prompts across providers, narrowed to the one upgrade most teams are doing right now: GPT-5 → GPT-5.5. Which diffs actually change outputs, and what to do about each.

The Diff Classes That Bite On Same-Family Upgrades

Cross-provider migrations hit all seven diff classes. Same-family upgrades concentrate the damage in four. Here's the map:

Diff class	Risk on GPT-5 → 5.5	Why
Format / fences	High	Newer models often change default wrapping (markdown fences, preambles, bold styling)
Verbosity	High	Hallucination-reduction training tends to shift answer length and hedging
Refusal boundary	Medium	Safety calibration moves between versions
Tool-calling	Medium	Argument formatting and tool-selection thresholds shift
Temperature semantics	Low	Same vendor, same scale
Instruction hierarchy	Low	Same family, similar system-prompt handling
Reasoning output parsing	Low–Med	If you parse reasoning traces, check the envelope

Across three minor versions, the top four classes are where your migration test budget should go.

Diff Class 1 — Format & Fences

The classic silent breaker. A GPT-5 prompt that reliably returns bare JSON can start wrapping it in ```json fences on 5.5, or prepend a one-line "Here's the JSON you requested:". Your parser, expecting response.startsWith('{'), throws.

Fix: re-assert the format contract explicitly ("Return only valid JSON, no code fences, no preamble") and make your parser tolerant of fences as a belt-and-suspenders measure.
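A minimal sketch of the tolerant-parser half of that fix, assuming the response arrives as a plain string. The helper name parseModelJson is ours for illustration, not part of any SDK. It tries bare JSON first, then a fenced block, then the span between the first opening and last closing bracket.

```typescript
// Build the fence marker programmatically so this snippet never contains one literally.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");

// Hypothetical helper: extract a JSON value from a model response that may be
// wrapped in a code fence or preceded by a one-line preamble.
function parseModelJson(raw: string): unknown {
  const text = raw.trim();

  // 1. Happy path: the response is already bare JSON.
  try {
    return JSON.parse(text);
  } catch {
    // Fall through to the tolerant paths.
  }

  // 2. Fenced path: parse the body of the first fenced block.
  const fenced = text.match(FENCED_BLOCK);
  if (fenced) {
    return JSON.parse(fenced[1].trim());
  }

  // 3. Preamble path: parse from the first opening bracket to the last closing one.
  const start = text.search(/[{\[]/);
  const end = Math.max(text.lastIndexOf("}"), text.lastIndexOf("]"));
  if (start !== -1 && end > start) {
    return JSON.parse(text.slice(start, end + 1));
  }

  throw new SyntaxError("No JSON payload found in model response");
}
```

The order matters: a strict parse runs first so that well-behaved responses never touch the heuristics, and anything the heuristics cannot recover still throws rather than returning a guess.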

Diff Class 2 — Verbosity

Part of the reduction in hallucination tends to come from more hedging and qualification. That helps correctness and disrupts any downstream consumer that assumed a length range. A summary capped "in 50 words" on GPT-5 might run to 80 on GPT-5.5, and a one-line answer might gain a caveat sentence. The shift can also run the other way: in the test below, output shrank 2–6× while the total bill barely moved.

Fix: re-tighten length and format instructions, and re-validate any UI or storage that assumed a length envelope.
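One way to re-validate a length envelope is a word-count check run over both models' outputs. This is a sketch under our own assumptions: whitespace-delimited word counting and a 10% tolerance are illustrative choices, and checkLengthEnvelope is a hypothetical name.

```typescript
// Count whitespace-delimited words; an empty or blank string counts as zero.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

interface EnvelopeResult {
  words: number; // measured word count
  limit: number; // cap after applying the tolerance
  ok: boolean;   // true when the output fits inside the envelope
}

// Hypothetical check: does the output respect an "in N words" instruction,
// allowing a small tolerance (10% by default, an assumed value)?
function checkLengthEnvelope(output: string, maxWords: number, tolerance = 0.1): EnvelopeResult {
  const words = countWords(output);
  const limit = Math.floor(maxWords * (1 + tolerance));
  return { words, limit, ok: words <= limit };
}
```

Running it across the reference set for both versions shows whether the new model drifts outside a cap the old one respected, in either direction.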

Diff Class 3 — Refusal Boundary

Safety calibration moves between versions in both directions. A borderline-legitimate request — security research framing, medical-adjacent question, edgy creative prompt — that GPT-5 answered may get a refusal on 5.5, or the reverse. This is the diff teams discover from a user complaint, not a test.

Fix: include your known borderline prompts in the migration test set.
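To catch refusal flips in that test set before a user does, a crude phrase heuristic is often enough for a first pass. The sketch below is an assumption-heavy illustration: the phrase list is ours, deliberately small, and will miss or misfire on real traffic until you tune it.

```typescript
// Assumed, deliberately small phrase list; tune it against your own outputs.
const REFUSAL_PATTERNS: RegExp[] = [
  /\bI can(?:'|’)?t (?:help|assist) with\b/i,
  /\bI(?:'|’)m (?:sorry|unable)\b/i,
  /\bI (?:cannot|won(?:'|’)t) (?:help|assist|provide)\b/i,
];

// Heuristic: refusals usually open the response, so inspect only the first 200 characters.
function looksLikeRefusal(output: string): boolean {
  const head = output.slice(0, 200);
  return REFUSAL_PATTERNS.some((pattern) => pattern.test(head));
}

type RefusalFlip = "none" | "newly_refused" | "newly_answered";

// Compare the old and new model's output for the same borderline prompt.
function refusalFlip(oldOutput: string, newOutput: string): RefusalFlip {
  const before = looksLikeRefusal(oldOutput);
  const after = looksLikeRefusal(newOutput);
  if (before === after) return "none";
  return after ? "newly_refused" : "newly_answered";
}
```

Reporting both directions matters: a prompt that 5.5 newly answers can be as much of a product change as one it newly refuses.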

Diff Class 4 — Tool-Calling

Argument formatting and the threshold at which the model decides to call a tool both shift across versions. 5.5 might extract order_id: "4821" where 5 returned "#4821", or call a tool 5 would have skipped.

Fix: re-run your tool-call fixture set against 5.5 and diff the selected tools and emitted arguments.
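A minimal sketch of that fixture diff, assuming each model's tool call has already been normalized to a name plus a flat arguments object. The ToolCall shape and diffToolCalls are hypothetical names, not a vendor schema.

```typescript
// Assumed normalized shape; adapt it to whatever your client library returns.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface ToolCallDiff {
  toolChanged: boolean;  // different tool selected, or a call appeared/disappeared
  changedArgs: string[]; // argument keys whose values differ, sorted
}

// Compare one fixture's tool call from the old model against the new model's.
// A null call means the model chose not to call any tool.
function diffToolCalls(before: ToolCall | null, after: ToolCall | null): ToolCallDiff {
  const toolChanged = (before?.name ?? null) !== (after?.name ?? null);
  const a: Record<string, unknown> = before?.args ?? {};
  const b: Record<string, unknown> = after?.args ?? {};
  const keys = Array.from(new Set([...Object.keys(a), ...Object.keys(b)])).sort();
  // JSON.stringify is adequate for flat arguments; nested objects whose keys
  // are ordered differently would be reported as changed.
  const changedArgs = keys.filter((key) => JSON.stringify(a[key]) !== JSON.stringify(b[key]));
  return { toolChanged, changedArgs };
}
```

With the example from this section, a call whose order_id changes from "#4821" to "4821" comes back with the tool unchanged and order_id listed as a changed argument, which is the case a name-only comparison would miss.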

The Before/After Test

The whole case study reduces to one comparison: same prompts, GPT-5 vs GPT-5.5, diff the outputs by class. We ran identical prompts across three behavior classes — format (a "list three benefits" prompt), JSON extraction, and verbosity (a one-paragraph explanation).

The headline finding: on correctness, the migration is a non-event. On token shape, GPT-5.5 trims output roughly 2–6×, and on the prompt where the reduction was smallest (about 2×), the bill went up.

Diff class	GPT-5	GPT-5.5	Output changed?
Format ("list 3 benefits")	3 benefits, plain dashes · 317 out-tokens · $0.003186 · 4,340 ms	3 benefits, bolded leads + numbered · 76 out-tokens · $0.002345 · 2,116 ms	Style change: same content, equivalent quality
JSON (`{"lang","year"}`)	`{"lang":"Python","year":1991}` · 147 out-tokens · $0.001501 · 1,505 ms	`{"lang":"Python","year":1991}` · 25 out-tokens · $0.000875 · 1,874 ms	Byte-identical, ~6× fewer tokens
Verbosity (JS closure)	One paragraph, correct · 238 out-tokens · $0.002401 · 3,488 ms	One paragraph, correct · 113 out-tokens · $0.003475 · 3,016 ms	Equivalent explanation, but cost went up
Totals	$0.007088	$0.006695	5.5 wins on bill by ~6%; per-token price is higher

Output correctness: equivalent across all three tasks. The JSON was byte-identical. The closure paragraph hit the same conceptual points — lexical environment, captured variables, common uses — in roughly half the words. The format task changed style: GPT-5 used plain dashes, GPT-5.5 returned a numbered list with bolded leads. Same answer, different shape — exactly the silent-breaker pattern Diff Class 1 warns about.
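Shape changes like this are easy to miss by eye and cheap to detect mechanically. Here is a rough sketch of a shape fingerprint, assuming markdown-flavored output; the four signals and the function names are our own illustrative choices, not an exhaustive classifier.

```typescript
interface OutputShape {
  fenced: boolean;       // contains a code fence
  numberedList: boolean; // has lines like "1. item"
  dashList: boolean;     // has lines like "- item" or "* item"
  bold: boolean;         // contains **bold** spans
}

// Reduce an output to a few coarse presentation signals.
function fingerprintShape(output: string): OutputShape {
  return {
    fenced: output.includes("`".repeat(3)),
    numberedList: /^\s*\d+[.)]\s+\S/m.test(output),
    dashList: /^\s*[-*]\s+\S/m.test(output),
    bold: /\*\*[^*\n]+\*\*/.test(output),
  };
}

// List which signals differ between the old and new model's output.
function changedShapeFields(before: string, after: string): string[] {
  const a = fingerprintShape(before);
  const b = fingerprintShape(after);
  return (Object.keys(a) as (keyof OutputShape)[]).filter((key) => a[key] !== b[key]);
}
```

Applied to a plain-dash list versus a numbered list with bolded leads, it flags numberedList, dashList, and bold as changed even though the content is the same.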

The interesting cost finding: GPT-5.5 produces fewer output tokens on every task, but its per-output-token price is higher than GPT-5's. On the format and JSON tasks the token reduction (4–6×) was big enough to outweigh the per-token jump; on the verbosity task the roughly 2× token reduction wasn't enough, and GPT-5.5 cost about 45% more for the same answer. The aggregate bill came out about 6% cheaper, but the per-task math goes both ways.

That's a finding worth internalizing before you commit to "5.5 is just cheaper." It isn't, uniformly. It's leaner per call, with a higher per-token rate, and the breakeven depends on how compact your specific traffic gets after the upgrade.
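The breakeven can be made concrete with the figures from the table above. The sketch below divides each run's total cost by its output tokens to get a blended rate; because those costs include input tokens, treat the rate as an approximation derived from our runs, not a published price. The new model is cheaper exactly when its token reduction exceeds its rate increase.

```typescript
interface Run {
  outTokens: number;
  costUsd: number;
}

interface Comparison {
  tokenReduction: number; // old out-tokens divided by new out-tokens
  rateIncrease: number;   // new blended $/out-token divided by old
  cheaper: boolean;       // true when the new model's bill is lower
}

// Compare one prompt's run on the old model against the new model.
function compareRuns(oldRun: Run, newRun: Run): Comparison {
  const oldRate = oldRun.costUsd / oldRun.outTokens;
  const newRate = newRun.costUsd / newRun.outTokens;
  return {
    tokenReduction: oldRun.outTokens / newRun.outTokens,
    rateIncrease: newRate / oldRate,
    cheaper: newRun.costUsd < oldRun.costUsd,
  };
}

// Figures copied from the before/after table in this article (GPT-5, then GPT-5.5).
const formatTask = compareRuns({ outTokens: 317, costUsd: 0.003186 }, { outTokens: 76, costUsd: 0.002345 });
const jsonTask = compareRuns({ outTokens: 147, costUsd: 0.001501 }, { outTokens: 25, costUsd: 0.000875 });
const verbosityTask = compareRuns({ outTokens: 238, costUsd: 0.002401 }, { outTokens: 113, costUsd: 0.003475 });
```

On these numbers the blended rate rises by roughly 3× on every task, so the format task (about 4.2× fewer tokens) and the JSON task (about 5.9×) come out cheaper, while the verbosity task (about 2.1×) does not.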

The Migration Playbook (GPT-5 → 5.5 Edition)

Pin both versions explicitly. A dated snapshot, not a bare alias that might already have rolled forward to 5.6.
Build a 15–25 prompt reference set from real production traffic, including borderline and tool-calling cases.
Run the before/after diff. Skipping 5.2 and 5.4 means three minor versions of accumulated style drift — expect format and verbosity shifts on most prompts.
Re-assert the contract. Format ("return only JSON, no fences"), length ("in N words"), and refusal framing — all worth restating.
Re-run your tool-call fixtures. Argument shapes drift across versions even on identical schemas.
Measure cost on YOUR prompts. Don't trust the per-million-token sticker — the per-output-token rate went up, and whether the leaner output wins depends on your traffic mix.
Ramp. Shadow 5.5 against 5 on a slice of traffic before full cutover.
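For the ramp step, one common approach is to pick the shadow slice deterministically from a request or user ID, so the same caller always lands in the same bucket as you raise the percentage. A sketch, assuming string IDs: inShadowSlice is a hypothetical helper, and the hash is FNV-1a applied to UTF-16 code units rather than bytes, which is fine for bucketing but is not a byte-exact FNV implementation for non-ASCII input.

```typescript
// FNV-1a, 32-bit, over UTF-16 code units: small and deterministic,
// which is all traffic bucketing needs.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep unsigned 32 bits
  }
  return hash >>> 0;
}

// Hypothetical helper: true when this request falls in the shadow slice.
// percent is an integer from 0 (no traffic) to 100 (all traffic).
function inShadowSlice(requestId: string, percent: number): boolean {
  return fnv1a(requestId) % 100 < percent;
}
```

Because the bucket depends only on the ID, raising the slice from 5 to 20 keeps every request that was already shadowed and adds new ones, which keeps before/after comparisons stable across the ramp.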

This is why prompts fail on a model they weren't tuned for, the same root cause we dissect in why prompts fail on different LLM providers. Before you call the migration done, also re-confirm temperature behavior under your own sampling settings using the empirical temperature guide.

The Real Lesson

GPT-5.5 is a better model that produces different-shaped output, and "better" is not "drop-in" — especially when you're jumping three minor versions at once. The four diffs that break same-family migrations are format, verbosity, refusal, and tool-calling — test exactly those, fix the prompt, and ramp. The output-shape changes are real (a bolded numbered list where you used to get plain dashes will silently break a markdown-strip regex), and the cost story is more nuanced than the sticker price implies. The upgrade is worth taking. The copy-paste version of it is how you ship a regression with a green dashboard.

For the broader cross-provider diff framework, see migrating prompts across providers.

Tests run 2026-05-30 via the Promptster /v1/prompts/compare API. Temperatures 0.0–0.2. Costs computed from the May 2026 pricing.ts.