Structured Outputs Compared: How 5 Providers Handle Strict JSON Schemas

By Promptster Team · 2026-05-01

Structured output is the unsexy workhorse of production AI. Every agent that calls a tool, every pipeline that hydrates a database, every UI that renders a response — they all need the model to emit JSON that parses on the first try and matches a schema.

It's also where providers quietly diverge. "Supports JSON mode" is not the same as "returns the JSON you asked for." We ran five providers against a realistic, non-trivial schema — nested objects, arrays with constraints, enum fields, a boolean — and graded the output on four axes: valid JSON, all required fields, correct types, and whether the model followed the "no markdown fences" instruction.

The Test

We asked each provider to extract structured data from a short press release:

Developers! The inaugural StackForge Summit is coming to Berlin, Germany, on September 14, 2026 — an in-person gathering at the Tempelhof Airport Hangar. Confirmed speakers so far include: Yuki Tanaka, Principal Engineer at Nimbus Labs, presenting on streaming data architectures; Dr. Leila Gafoor, CTO of Meridian Robotics, keynoting on embodied AI; and open-source maintainer Javier Ruiz, running a deep-dive on modern CI/CD. Early-bird tickets are $249 USD and include the main conference, 2 workshops, and the Friday night party. Standard tickets at $349 include the main conference and 1 workshop. VIP tickets at $749 include everything plus the speaker dinner, a 1:1 advisor session, and hotel for two nights.

The schema required six fields, among them a boolean is_virtual, a numeric price_usd, an array of nested speaker objects (each with a role), and an array of ticket tiers (each with an includes list of strings).

The instruction ended with: "Output ONLY the JSON object. No markdown fences, no explanation."
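For reference, here is a Python sketch of a schema in roughly this shape. Field names beyond is_virtual, price_usd, the speakers array, and the tier includes list (the ones discussed in this post) are illustrative stand-ins, not necessarily the exact test schema:

```python
# Illustrative JSON Schema in the shape of our test.
# Names like "event_name" and "city" are hypothetical stand-ins;
# is_virtual, price_usd, speakers, and the tier "includes" list
# match the fields discussed in the results below.
EVENT_SCHEMA = {
    "type": "object",
    "required": ["event_name", "city", "is_virtual",
                 "speakers", "ticket_tiers", "price_usd"],
    "properties": {
        "event_name": {"type": "string"},
        "city": {"type": "string"},
        "is_virtual": {"type": "boolean"},
        "price_usd": {"type": "number"},
        "speakers": {
            "type": "array",
            "minItems": 1,
            "items": {
                "type": "object",
                "required": ["name", "role"],
                "properties": {
                    "name": {"type": "string"},
                    "role": {"type": "string"},
                },
            },
        },
        "ticket_tiers": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name", "price_usd", "includes"],
                "properties": {
                    "name": {"type": "string"},
                    "price_usd": {"type": "number"},
                    "includes": {"type": "array",
                                 "items": {"type": "string"}},
                },
            },
        },
    },
}
```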

Results

| Provider | Model | Valid JSON | Schema Match | Type Correctness | "No fences" rule |
|---|---|---|---|---|---|
| OpenAI | gpt-4o-mini | ✅ | ✅ | ✅ | ✅ Followed |
| DeepSeek | deepseek-chat | ✅ | ✅ | ✅ | ✅ Followed |
| Google | gemini-2.5-flash-lite | ✅ | ✅ | ✅ | ❌ Wrapped in ```json fences |
| Anthropic | claude-haiku-4-5 | ✅ | ✅ | ✅ | ❌ Wrapped in ```json fences |
| Mistral | mistral-large-latest | ✅ | ✅ | ✅ | ❌ Wrapped in ```json fences |

On pure schema conformance, all five produced valid, parseable JSON with every required field at the right type. is_virtual came back as false (boolean) from every provider — nobody stringified it. price_usd came back as a number everywhere. No trailing commas, no comments, no missing fields.

But three out of five providers ignored the explicit "no markdown fences" instruction. If your downstream parser expects raw JSON (and most do), that's a fence-stripping step you need to handle before parsing.
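A minimal defensive parse step handles both cases — raw JSON and fence-wrapped JSON. A sketch, not provider-specific:

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Parse a model response that may be wrapped in ```json fences.

    Strips a leading ```json (or bare ```) fence and the matching
    trailing ``` if present, then hands the remainder to json.loads.
    """
    text = raw.strip()
    match = re.match(r"^```(?:json)?\s*\n(.*)\n```$", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)
```

The same function then works unchanged against all five providers, whether or not they followed the instruction.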

The Interesting Divergences

Where providers had to make judgment calls, they split.

Speaker roles: short or long?

The text says "Yuki Tanaka, Principal Engineer at Nimbus Labs."

| Provider | Role extracted as |
|---|---|
| OpenAI gpt-4o-mini | "Principal Engineer" |
| Anthropic Claude Haiku | "Principal Engineer at Nimbus Labs" |
| Google Gemini | "Principal Engineer at Nimbus Labs" |
| DeepSeek | "Principal Engineer at Nimbus Labs" |
| Mistral | "Principal Engineer at Nimbus Labs" |

OpenAI stripped the company affiliation; the other four kept it inline. Neither is "wrong," but if your schema has a separate company field elsewhere, you need OpenAI's interpretation. If you don't, you probably want the others'.

VIP tier "includes": literal or expanded?

The text says VIP tickets include "everything plus the speaker dinner, a 1:1 advisor session, and hotel for two nights." The word "everything" was ambiguous on purpose.

| Provider | VIP includes |
|---|---|
| OpenAI gpt-4o-mini | ["everything", "speaker dinner", "1:1 advisor session", "hotel for two nights"] — literal "everything" kept as-is |
| DeepSeek | ["everything plus the speaker dinner", "1:1 advisor session", "hotel for two nights"] — collapsed first two items into one string |
| Anthropic Claude Haiku | ["main conference", "2 workshops", "Friday night party", "speaker dinner", "1:1 advisor session", "hotel for two nights"] — expanded "everything" to mean the early-bird contents |
| Google Gemini | Same as Anthropic: expanded "everything" to the early-bird list |
| Mistral | Same as Anthropic: expanded "everything" to the early-bird list |

This is the interesting one. Three providers inferred that "everything" meant "everything from the early-bird tier" and expanded it into concrete list items. That's a reasonable reading — and probably what a human would do. But the text never said that. The early-bird list was "main conference, 2 workshops, Friday night party"; standard was "main conference, 1 workshop." Which one is "everything"?

The three expanding providers inserted content that isn't in the source. For some use cases that's a feature (reader-friendly output). For compliance-sensitive extraction (legal, financial, contractual), it's a hallucination — content that reads as factual but wasn't in the source document.

If your structured-extraction use case is regulated, you probably want OpenAI's behavior: preserve the ambiguity, don't fabricate.
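One cheap guard for compliance-sensitive pipelines is to flag any extracted list item whose text doesn't appear anywhere in the source document. A sketch using plain substring matching after whitespace/case normalization — note this only catches items absent from the document entirely; it won't catch "main conference" being moved from the early-bird clause into the VIP tier, since the phrase does occur elsewhere in the source:

```python
def ungrounded_items(items: list[str], source: str) -> list[str]:
    """Return extracted strings that never appear in the source text.

    Normalizes case and whitespace before comparing. Anything returned
    is a candidate hallucination for human (or stricter) review.
    """
    norm_source = " ".join(source.lower().split())
    flagged = []
    for item in items:
        norm_item = " ".join(item.lower().split())
        if norm_item not in norm_source:
            flagged.append(item)
    return flagged
```

Clause-level grounding (checking that an item came from the right sentence, not just the right document) needs more machinery, but even this document-level check would catch outright inventions.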

Cost and Speed

| Provider | Cost | Latency | Tokens/sec |
|---|---|---|---|
| Google Gemini 2.5 Flash Lite | $0.000195 | 1,240 ms | 314 |
| DeepSeek Chat | $0.000229 | 10,097 ms | 30 |
| OpenAI gpt-4o-mini | $0.000235 | 6,871 ms | 44 |
| Mistral large | $0.000704 | 4,181 ms | 82 |
| Anthropic Claude Haiku 4.5 | $0.002474 | 1,790 ms | 229 |

Gemini was both cheapest and fastest, on latency and on throughput. Anthropic Haiku was roughly 13x more expensive than Gemini at comparable speed. DeepSeek was second-cheapest but the slowest by a wide margin.

The Recommendations

For structured extraction where the schema is well-defined and the source text is clean, GPT-4o-mini is the pragmatic default: it followed the "no fences" rule, it preserved ambiguity rather than inferring, and its cost is middle-of-the-pack. If cost matters more than fence-stripping, Gemini 2.5 Flash Lite wins on raw economics but you'll need to strip ```json in your parser.

For compliance-sensitive extraction (legal, regulatory, audit), GPT-4o-mini or DeepSeek's conservative "preserve literal text" behavior is safer than Anthropic/Google/Mistral's helpful-but-inferential expansion.

For prototyping or developer-facing tools where the expansion is desirable (an event calendar wants the expanded bullet list), Anthropic or Mistral's inferential style is actually better UX.

The Real Lesson

"Structured output works across all major providers" is true and useless. The granular differences — fence-wrapping, literal-vs-inferred interpretations, company affiliation parsing — are where your pipeline quietly breaks or silently fabricates.

Before you standardize on a provider for your extraction layer, run your actual schema against 4+ providers with your actual source documents and check the output manually. Five minutes of comparison now saves a week of "wait, why does the database have content that wasn't in the email?" later.

For more on how providers diverge on instructions, see why your prompts fail on different LLM providers. For how to catch this automatically, see automating prompt testing for production-ready AI apps.


Tests run 2026-04-19 via the Promptster MCP server. Temperature 0.1. All responses validated with json.loads() after fence-stripping where present.