Gemini Flash vs Claude Haiku vs GPT-4o Mini: The Cheap-Fast-Smart Triangle
By Promptster Team · 2026-04-29
There's a classic engineering trope: "cheap, fast, good — pick two." For AI models in 2026, that trope is mostly wrong. You can get all three, on a lot of tasks, for under a fifth of a cent per request.
The question is: which of the three budget-tier frontier models gives you the best slice of all three? We tested Gemini 2.5 Flash Lite, Claude Haiku 4.5, and GPT-4o-mini across three task archetypes — code, math reasoning, and structured summarization — and found a clear winner, a competent runner-up, and one result that made us double-check the API logs.
Note: we tried to include GPT-5 Mini in this test, but the API returned HTTP 400 on every attempt, so we substituted GPT-4o Mini as the OpenAI representative at the same tier.
The Contenders
| Model | Input $/M | Output $/M | Claimed position |
|---|---|---|---|
| Gemini 2.5 Flash Lite | $0.10 | $0.40 | Cheapest, fastest |
| Claude Haiku 4.5 | $1.00 | $5.00 | Mid-budget, smartest |
| GPT-4o Mini | $0.15 | $0.60 | Cheap + well-rounded |
Claude Haiku 4.5 is the most expensive of the three: 10x Gemini's input price (6.7x GPT-4o Mini's) and 12.5x Gemini's output price. It is priced as the premium option within the budget tier.
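To keep the per-request numbers in this post honest, here's the arithmetic we use throughout. Prices are per million tokens; the token counts in the example are illustrative, not from our logs:

```python
def request_cost(in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Dollar cost of one request, given per-million-token prices."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

# Example: 400 tokens in, 500 tokens out on Claude Haiku 4.5
print(request_cost(400, 500, 1.00, 5.00))  # 0.0029
```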
Test 1 — Code: Duration String Parser
Prompt: Write parse_duration(s: str) -> int that parses strings like "2h30m15s" into total seconds. Support h/m/s units. Reject invalid formats and empty strings. Include a docstring and 4 assert statements.
Gemini 2.5 Flash Lite
Correct regex-based solution. Included explicit empty-string guard. Correctly rejects "invalid-format" via a clever cleaned_s != s check that ensures no unrecognized characters leak through. One minor mess: it imported pytest at the bottom of the file and used with pytest.raises in the asserts — functional but not idiomatic.
Grade: solid pass. Cost: $0.000208, 1,748 ms, 280 tok/s.
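For reference, here's a minimal sketch of the consume-and-check regex approach Gemini took. This is our reconstruction of the pattern, not the model's verbatim output; the names and the exact regex are ours:

```python
import re

_TOKEN = re.compile(r"(\d+)([hms])")

def parse_duration(s: str) -> int:
    """Parse a duration like '2h30m15s' into total seconds."""
    if not s:
        raise ValueError("empty duration string")
    # Strip out every valid digit+unit token; anything left over
    # means an unrecognized character leaked through.
    if _TOKEN.sub("", s):
        raise ValueError(f"invalid duration string: {s!r}")
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(value) * units[unit] for value, unit in _TOKEN.findall(s))

assert parse_duration("2h30m15s") == 9015
assert parse_duration("45s") == 45
```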
Claude Haiku 4.5
Here's where things got weird. Haiku's response:
```python
import anthropic
import re

def parse_duration(s: str) -> int:
    """Parse a duration string like '2h30m15s' and return total seconds."""
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": f"""Parse this duration string..."""}]
    )
    response_text = message.content[0].text.strip()
    # ... returns int(response_text)
```
Claude Haiku produced a function that calls the Anthropic API — specifically, that asks a different Claude model to parse the duration string. This is the model solving a parsing problem by delegating to itself over the network. It technically "works" if the sub-call succeeds, but:
- Requires ANTHROPIC_API_KEY configured
- Needs a network round-trip on every call
- Costs money on every invocation
- Introduces a dependency on an LLM API for a 15-line regex problem
This isn't a simple hallucination. It's a pattern-matching failure: when asked to "write code that parses durations," Haiku pattern-matched to code samples in its training data that call the Anthropic API — probably because the phrase "parse this" shows up a lot in Anthropic's own documentation and examples for their Claude API.
Grade: functionally wrong. The code runs, but it's the wrong abstraction. Cost: $0.002224, 2,893 ms, 145 tok/s. You also paid roughly 11x more than Gemini for this.
GPT-4o Mini
Correct regex-based solution using re.findall to iterate over unit matches. Includes a regex validation step upfront to reject invalid formats. The empty-string rejection is implicit (the upfront regex rejects "") rather than explicit, but it works.
Grade: solid pass. Cost: $0.000156, 3,908 ms, 59 tok/s. (Cheapest and cleanest on this task.)
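The upfront-validation shape looks roughly like this; again, this is our reconstruction of the approach, not the model's verbatim code:

```python
import re

def parse_duration(s: str) -> int:
    """Parse a duration like '2h30m15s' into total seconds."""
    # Whole-string validation up front. This also rejects "" implicitly,
    # since the pattern requires at least one digit+unit token.
    if not re.fullmatch(r"(\d+[hms])+", s):
        raise ValueError(f"invalid duration string: {s!r}")
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(v) * units[u] for v, u in re.findall(r"(\d+)([hms])", s))
```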
Code task winner: GPT-4o Mini on price-per-correctness. Gemini a close second. Claude Haiku produced code you'd have to throw away.
Test 2 — Math Reasoning: Bakery Optimization
Prompt: Boxes of 6 cupcakes cost $12, singles cost $2.25. Buy exactly 20 at minimum cost. Show reasoning.
Correct answer: $40.50 with 3 boxes and 2 singles.
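The answer is easy to verify with a quick brute force over box counts:

```python
# Enumerate every way to buy exactly 20 cupcakes and take the cheapest.
best = min(
    (12 * boxes + 2.25 * (20 - 6 * boxes), boxes)
    for boxes in range(0, 20 // 6 + 1)  # 0..3 boxes
)
print(best)  # (40.5, 3) -> 3 boxes + 2 singles
```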
Gemini 2.5 Flash Lite
Correct reasoning, correct final option identified ($40.50 for 3 boxes + 2 singles), but hit the max_tokens limit before producing the required FINAL: line. Enumerated Options 1-3, cut off mid-Option-3. The correct answer was clearly stated mid-reasoning, but the instruction to end with FINAL: $40.50 with 3 boxes and 2 singles was not followed because the model ran out of tokens being verbose.
Grade: correct but truncated. Fix: raise max_tokens or ask the model to be more concise. Cost: $0.000209, 2,034 ms.
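If you're calling Gemini through the google-genai Python SDK, the fix is a one-line config change. This is one way to do it, assuming that SDK; the model name and limit here are illustrative:

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment
response = client.models.generate_content(
    model="gemini-2.5-flash-lite",
    contents="Boxes of 6 cupcakes cost $12, singles cost $2.25...",
    config=types.GenerateContentConfig(max_output_tokens=2048),
)
```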
Claude Haiku 4.5
Clean, well-formatted reasoning. Enumerated all four options (including b=0). Correct final answer: FINAL: $40.50 with 3 boxes and 2 singles. Best chain-of-thought layout of the three.
Grade: clean pass. Cost: $0.002505, 2,924 ms.
GPT-4o Mini
Correct reasoning, enumerated multiple options, arrived at $40.50. Followed the FINAL: format. Slightly more verbose than Haiku but readable.
Grade: clean pass. Cost: $0.000281, 6,044 ms.
Math task winner: Claude Haiku for cleanest output; GPT-4o Mini for best price-per-correctness; Gemini needs a bigger token budget to finish the task.
Test 3 — Structured Summarization
Prompt: Summarize a technical paragraph in exactly 2 sentences. Use no adverbs ending in -ly. Do not use the word "significantly." State the main finding first, then a practitioner implication.
Provided paragraph was about a transformer-scale benchmark showing plateaus on simple tasks but gains on multi-hop reasoning.
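Constraint prompts like this are easy to grade mechanically. Here's a sketch of a checker (our illustrative harness, not part of any model's output; note the naive -ly regex over-flags words like "only", so treat hits as review prompts):

```python
import re

def check_summary(text: str) -> list[str]:
    """Return a list of constraint violations for the summary task."""
    violations = []
    # Naive sentence split on terminal punctuation.
    sentences = [x for x in re.split(r"(?<=[.!?])\s+", text.strip()) if x]
    if len(sentences) != 2:
        violations.append(f"expected 2 sentences, got {len(sentences)}")
    # Flags any -ly word, not just adverbs; review hits manually.
    if re.search(r"\b[A-Za-z]+ly\b", text):
        violations.append("contains an -ly word (possible adverb)")
    if re.search(r"\bsignificantly\b", text, re.IGNORECASE):
        violations.append('contains the word "significantly"')
    return violations
```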
Gemini 2.5 Flash Lite
Larger transformer models show improved performance on complex reasoning tasks, but gains diminish with scale. Practitioners should consider mid-sized models for general workloads, reserving larger models for tasks demanding advanced inference.
Two sentences. No -ly adverbs. No "significantly." Captures the main finding and implication. Crisp.
Grade: pass. Cost: $0.000034, 649 ms, 58 tok/s.
Claude Haiku 4.5
Larger transformer models above 100B parameters show diminishing accuracy returns on standard tasks but demonstrate meaningful improvements on complex reasoning tasks involving multi-hop inference and counterfactual reasoning. Practitioners should prioritize mid-sized models for most production workloads since the token cost increases faster than accuracy gains on harder tasks.
Two sentences. No -ly adverbs. No "significantly." More specific (cites the 100B parameter threshold). Slightly longer sentences.
Grade: pass, slightly more informative. Cost: $0.000544, 1,533 ms.
GPT-4o Mini
Models with over 100B parameters plateau in accuracy on well-specified tasks but improve on complex reasoning tasks. Practitioners should consider using mid-sized models for most production workloads to balance cost and performance.
Two sentences. No rule violations. Factually accurate. Compact.
Grade: pass. Cost: $0.000052, 1,412 ms.
Summary task: three-way tie in correctness. Gemini wins on cost/speed. Haiku adds the most specificity. GPT-4o Mini is the middle ground.
The Leaderboard
| Dimension | Winner | Runner-up |
|---|---|---|
| Cheapest | Gemini 2.5 Flash Lite | GPT-4o Mini |
| Fastest | Gemini 2.5 Flash Lite (by latency); Haiku (by tokens/sec on some tasks) | — |
| Best code correctness | GPT-4o Mini / Gemini (tie) | — |
| Best math reasoning | Claude Haiku | GPT-4o Mini |
| Most consistent instruction-following | GPT-4o Mini | Claude Haiku |
| Weirdest failure mode | Claude Haiku (wrote parser that calls Claude API) | — |
Claude Haiku 4.5 is priced like the premium option in the budget tier, and on math-reasoning tasks it earns that premium: you get cleaner chains of thought and more specificity. But on code tasks, Haiku is actively dangerous in a way Gemini and GPT-4o Mini are not: it hallucinated an architecture where none was asked for.
The Pragmatic Pick
If you want one model for everything budget-tier: GPT-4o Mini. It wasn't the outright winner on any single axis, but it passed every test without a pathology and cost less than Haiku by an order of magnitude.
If you're cost-optimizing aggressively: Gemini 2.5 Flash Lite. Fastest, cheapest, passed every test when given enough token budget. Raise max_tokens on reasoning tasks or it'll truncate.
If you're doing structured reasoning and need the cleanest chain-of-thought: Claude Haiku 4.5. But do not use it for generic code generation without output review. The "solve parsing by calling another LLM" pattern will show up in other prompts too.
The Bigger Picture
The cheap-fast-smart triangle is real but the corners have moved. All three budget-tier models from 2026 would have been frontier models in 2024. The cost of "good enough for most work" is now under $0.001/request. The question is no longer can you afford quality — it's whether you're routing the right tasks to the right tier.
For the full cross-tier picture, see the 300x price spread analysis. For when you should pay premium, see the task-type decision framework.
Tests run 2026-04-19 via the Promptster MCP server. Temperature 0.1–0.3 depending on task. GPT-5 Mini excluded due to consistent API errors; GPT-4o Mini substituted. All raw response data available on request.