Reasoning Tokens Aren't Free: A Real Cost Breakdown of o4-mini, DeepSeek Reasoner, and the Hidden Token Bill
By Promptster Team · 2026-05-02
When OpenAI introduced o-series reasoning models in late 2024, they came with a new line item on the bill: reasoning tokens. These are tokens the model spends thinking before it answers, billed at the output-token rate, and not always visible in the API response. DeepSeek followed with the Reasoner family. Anthropic added extended thinking. Gemini added thinking modes.
If you're not accounting for reasoning tokens, your reasoning-model bill looks weirdly high and you can't figure out why. We ran a simple benchmark and the numbers tell the story clearly.
The Test
A single prompt, two small reasoning problems stapled together:
A farmer has 17 sheep. All but 9 die. How many sheep are left?
Then, a train leaves Boston at 8:00 AM traveling at 60 mph. Another train leaves New York at 9:00 AM traveling at 80 mph on the same track heading toward Boston. The cities are 220 miles apart. At what time do they meet? Give your final answers clearly on two separate lines at the end.
The correct answers: 9 sheep and ~10:08:34 AM. Both are low-complexity reasoning questions — they would take a careful human under 60 seconds.
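The train answer is quick to verify by hand: at 9:00 AM the Boston train has already covered 60 miles, leaving 160 miles closing at a combined 140 mph. In code (the date is arbitrary):

```python
from datetime import datetime, timedelta

# By 9:00 AM the Boston train (60 mph, departed 8:00) has covered 60 miles.
remaining_miles = 220 - 60                            # 160 miles left between the trains
closing_speed_mph = 60 + 80                           # they approach each other at 140 mph
hours_to_meet = remaining_miles / closing_speed_mph   # 8/7 h ~= 1.143 h

meet = datetime(2026, 4, 19, 9, 0) + timedelta(hours=hours_to_meet)
print(meet.strftime("%I:%M:%S %p"))  # 10:08:34 AM
```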
We ran this against three models at three different reasoning depths:
- gpt-4o-mini (non-reasoning baseline, $0.15/M in / $0.60/M out)
- o4-mini (OpenAI reasoning tier, $1.10/M in / $4.40/M out)
- deepseek-reasoner (DeepSeek reasoning tier, ~$0.55/M in / $2.19/M out with reasoning)
The Numbers
| Model | Input tokens | Output tokens | Visible chars in response | Latency | Cost |
|---|---|---|---|---|---|
| gpt-4o-mini | 96 | 320 | ~1,280 chars (full working shown) | 5,869 ms | $0.000206 |
| o4-mini | 95 | 1,030 | ~650 chars visible | 9,604 ms | $0.004637 |
| deepseek-reasoner | 93 | 1,712 | ~10 chars ("9\n10:08 AM") | 51,588 ms | $0.003800 |
Read that DeepSeek row again. The model charged us for 1,712 output tokens. The actual response text it returned was "9" and "10:08 AM" — about 10 visible characters, maybe 4 tokens.
~1,708 output tokens — 99.8% — were invisible reasoning, consumed and billed, never returned to us.
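The cost column is straightforward arithmetic on the listed per-million rates; the o4-mini row, for example:

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_per_million: float, out_per_million: float) -> float:
    """Cost of one call in dollars, given per-million-token rates."""
    return (input_tokens * in_per_million
            + output_tokens * out_per_million) / 1_000_000

# o4-mini row from the table: $1.10/M in, $4.40/M out.
cost = call_cost(95, 1030, 1.10, 4.40)
print(f"${cost:.6f}")  # $0.004637
```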
Three Different API Contracts
Each provider has a different approach to reasoning-token visibility, and each has a different implication for your bill and your observability.
DeepSeek: reasoning tokens consumed, not returned
On DeepSeek Reasoner, reasoning happens inside the model call but the choices[0].message.content field contains only the final answer. The reasoning itself is stored in a separate reasoning_content field (which our MCP test didn't pipe through, explaining why the visible response looked empty).
Implication: If you're using a higher-level wrapper that only reads content, you will see blank or truncated responses and wonder what went wrong. You'll also be billed for the invisible reasoning. Read reasoning_content explicitly.
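A defensive reader for that case, sketched against DeepSeek's documented OpenAI-compatible response shape (reasoning_content sits alongside content on the message object); shown here on a plain dict so it runs without a network call:

```python
def extract_reasoning_response(resp: dict) -> tuple[str, str]:
    """Pull both the final answer and the hidden reasoning out of a
    DeepSeek-style chat completion, returned as (answer, reasoning).
    A wrapper that reads only `content` silently drops the reasoning."""
    message = resp["choices"][0]["message"]
    answer = message.get("content") or ""
    reasoning = message.get("reasoning_content") or ""  # absent on non-reasoning models
    return answer, reasoning

# Shape mirrors the benchmark call: tiny visible answer, large hidden reasoning.
resp = {"choices": [{"message": {
    "content": "9\n10:08 AM",
    "reasoning_content": "All but 9 die, so 9 remain. For the trains..."}}]}
answer, reasoning = extract_reasoning_response(resp)
print(len(answer), len(reasoning))
```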
o4-mini: reasoning tokens counted in output, reasoning text inlined
OpenAI's o-series returns reasoning summaries (not the raw reasoning) in the normal response text in some configurations. Our test ran with default settings and got 1,030 output tokens for a response that included a visible working-out of ~650 characters. At ~4 chars/token, the visible text accounts for only ~160 of those tokens; the remaining ~85% were hidden reasoning.
Implication: Output token counts on o-series don't cleanly map to visible text length. Budgeting max_completion_tokens based on "how long do I want the answer to be" will undershoot, and when hidden reasoning exhausts the budget the visible answer comes back truncated or empty.
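One heuristic that survives this: size the budget from the visible answer you want, then pad for hidden reasoning. The multiplier below is an assumption calibrated loosely on our o4-mini row (1,030 billed output tokens against roughly 160 tokens of visible text); tune it from your own logs.

```python
def completion_budget(visible_chars_wanted: int,
                      chars_per_token: float = 4.0,
                      reasoning_multiplier: float = 5.0,
                      floor: int = 256) -> int:
    """Heuristic max_completion_tokens for a reasoning-tier call.

    reasoning_multiplier: assumed hidden reasoning tokens per visible
    token. Too small a budget means reasoning eats the whole allowance
    and the visible answer is truncated or empty.
    """
    visible_tokens = visible_chars_wanted / chars_per_token
    return max(floor, int(visible_tokens * (1 + reasoning_multiplier)))

print(completion_budget(650))  # 975 tokens budgeted for a ~650-char visible answer
```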
gpt-4o-mini: no reasoning layer, all output is visible
Non-reasoning baseline. 320 output tokens, all of them in the visible response (full step-by-step math). What you see is what you pay for.
Implication: Predictable. This is the only mode where token accounting and response text are the same thing.
Cost Consequences
On this one prompt, the reasoning-tier premium was:
- o4-mini cost 22x more than gpt-4o-mini for essentially the same answer quality.
- deepseek-reasoner cost ~18x more than gpt-4o-mini, and took 51 seconds to respond.
- gpt-4o-mini got both answers right with visible, debuggable working. For this task shape, reasoning tier added no value.
This is the pattern we keep seeing: for problems that don't actually need reasoning depth (small arithmetic, well-specified logic, simple temporal inference), non-reasoning models produce correct answers faster and cheaper. The reasoning tier pays off on genuinely hard problems — multi-step novel synthesis, constraint satisfaction, chains of dependent deductions — and punishes you on everything else.
If you classify tasks before routing (see our task-type decision framework), the reasoning tier should be reserved for the top-quadrant prompts. Defaulting every prompt to a reasoning model is the 2026 equivalent of running SELECT * in a hot loop.
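The gate can start very small. In this sketch the keyword heuristic is a deliberate stand-in for a real task classifier, and the model names are just the two tiers from this benchmark:

```python
def route_model(prompt: str) -> str:
    """Route to the reasoning tier only when the task plausibly needs it.
    The keyword list is a placeholder; swap in a real classifier."""
    hard_markers = ("prove", "derive", "optimize under",
                    "satisfy all constraints", "multi-step plan")
    needs_reasoning = any(m in prompt.lower() for m in hard_markers)
    return "o4-mini" if needs_reasoning else "gpt-4o-mini"

print(route_model("A farmer has 17 sheep. All but 9 die. How many are left?"))
# gpt-4o-mini: simple arithmetic never reaches the reasoning tier
```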
How to Audit Your Reasoning-Token Bill
Three concrete checks:
1. Compare usage.completion_tokens (or the provider equivalent) to the response length in characters. If the ratio is wildly different from ~4 chars/token, you have invisible reasoning tokens. For o-series, look for usage.completion_tokens_details.reasoning_tokens. For DeepSeek, check the reasoning-token detail field if your client exposes it.
2. Log reasoning-token counts explicitly. If you're using any LLM observability (Helicone, Langfuse, Promptster history), surface a reasoning_tokens column in your dashboards. Anomalies become visible instead of hidden in the output column.
3. Cap the thinking where the API offers a knob. Anthropic's extended thinking takes an explicit budget_tokens; Gemini's thinking config takes a token budget; OpenAI's o-series exposes reasoning_effort as a coarse dial. Without a cap, the model decides how much to think, and a single pathological prompt can burn 10k+ reasoning tokens on what looks like a simple question.
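Check 1 can live as a one-liner in your logging path. This sketch uses the ratio fallback for providers that expose no reasoning-token detail; prefer the explicit usage fields when they exist.

```python
def invisible_reasoning_tokens(completion_tokens: int, response_text: str,
                               chars_per_token: float = 4.0) -> int:
    """Estimate hidden reasoning tokens from billed output vs visible text.
    Use usage.completion_tokens_details.reasoning_tokens when the provider
    returns it; this ratio check is the fallback when nothing is exposed."""
    visible_estimate = len(response_text) / chars_per_token
    return max(0, int(completion_tokens - visible_estimate))

# Our DeepSeek run: 1,712 billed output tokens, 10 visible characters.
hidden = invisible_reasoning_tokens(1712, "9\n10:08 AM")
print(hidden)  # 1709: nearly all billed output was invisible reasoning
```

Run against the gpt-4o-mini row (320 tokens, ~1,280 chars) the same function returns 0, which is exactly the "what you see is what you pay for" baseline.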
The Three-Line Takeaway
- Reasoning tokens are billed whether or not you see them.
- On DeepSeek Reasoner, 99% of output tokens can be invisible.
- Reserve the reasoning tier for hard problems; route easy ones to non-reasoning models — even frontier non-reasoning is usually cheaper than budget reasoning for equivalent quality on simple tasks.
For more on per-tier routing, see the cost-per-quality 300x spread analysis and the task-type decision framework.
Tests run 2026-04-19 via the Promptster MCP server. Temperature 0.1, max_tokens 2000. Reasoning tokens inferred from output_tokens minus visible response length.