Tool-Calling Reliability Across 11 Providers: Who Emits a Valid Call Every Single Time?
By Promptster Team · 2026-06-04
There's a benchmark-shaped hole in the LLM landscape. We have SWE-bench for coding, GPQA for reasoning, MMLU for trivia. We have nothing standard for the single capability most agentic apps depend on: does the model emit a valid tool call that matches your schema — every time — and does it correctly refrain when no tool applies?
This matters more than the leaderboards admit. A model that's brilliant at reasoning but emits malformed function arguments 3% of the time will take down an agent loop, because that loop runs the tool-call path thousands of times a day. Tool-calling isn't a quality dimension you average — it's a reliability dimension you tail-test. 99% schema-valid sounds great until you're making 50,000 calls a day and 500 of them silently break.
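The tail arithmetic is easy to check. This is a minimal sketch, not part of the benchmark: the function names are hypothetical, and the second function assumes each call fails independently, which real agent loops may not satisfy.

```typescript
// Expected number of broken calls per day at a given per-call failure rate.
function expectedDailyFailures(callsPerDay: number, failureRate: number): number {
  return callsPerDay * failureRate;
}

// Probability that a run of `steps` sequential tool calls contains no failure.
// Assumption: failures are independent across calls.
function cleanRunProbability(failureRate: number, steps: number): number {
  return Math.pow(1 - failureRate, steps);
}

// 50,000 calls a day at a 1% failure rate: the 500 broken calls quoted above.
const dailyFailures = expectedDailyFailures(50000, 0.01);

// A 20-step agent run at the same rate finishes cleanly only about 82% of the time.
const cleanRun = cleanRunProbability(0.01, 20);

console.log(dailyFailures, cleanRun.toFixed(2));
```

The second number is why an average hides the problem: the per-call rate compounds over every step of a loop.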
Two Failures, Not One
"Tool-calling reliability" collapses two distinct failures that need separate measurement:
- False negatives — a tool clearly applies, but the model answers in prose instead of calling it, or emits arguments that don't match the schema (wrong types, missing required fields, hallucinated extra fields, JSON wrapped in prose).
- False positives — no tool applies, but the model calls one anyway, or invents a tool that doesn't exist.
Most informal "it works for me" testing only catches the first. The second is sneakier and just as damaging: an over-eager model that calls get_weather when the user asked a philosophy question burns latency, money, and trust. A reliable model is calibrated about when to reach for a tool — the same calibration property that separates honest from fabricating models in our citation leaderboard.
The Test Schema
We use one deliberately strict tool with required fields, an enum, and a typed number — enough to catch sloppy argument generation:
{
  "name": "create_calendar_event",
  "description": "Create a calendar event. Only call this when the user explicitly asks to schedule, book, or add something to a calendar.",
  "parameters": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "start_iso": { "type": "string", "description": "ISO 8601 datetime" },
      "duration_minutes": { "type": "integer", "minimum": 1 },
      "visibility": { "type": "string", "enum": ["public", "private"] }
    },
    "required": ["title", "start_iso", "duration_minutes"]
  }
}
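One way to check emitted arguments against this schema is a small hand-rolled validator. This is a sketch under stated assumptions, not the harness used for these tests: `validateEventArgs` and the `FieldError` type are hypothetical names, and the `below_minimum` category is added here for the `minimum` constraint. Because the schema does not set `additionalProperties: false`, a standard JSON Schema validator would accept extra fields; this sketch flags them anyway, since hallucinated fields count as a failure in this test. It does not check that `start_iso` is a well-formed ISO 8601 string.

```typescript
type FieldError =
  | { kind: "missing_required"; field: string }
  | { kind: "wrong_type"; field: string }
  | { kind: "enum_violation"; field: string }
  | { kind: "below_minimum"; field: string } // assumption: not in the article's breakdown
  | { kind: "extra_field"; field: string };

const REQUIRED_FIELDS = ["title", "start_iso", "duration_minutes"];
const ALLOWED_FIELDS = [...REQUIRED_FIELDS, "visibility"];

// Returns every violation found; an empty array means the arguments are valid.
function validateEventArgs(args: unknown): FieldError[] {
  if (typeof args !== "object" || args === null || Array.isArray(args)) {
    return [{ kind: "wrong_type", field: "(root)" }];
  }
  const a = args as Record<string, unknown>;
  const errors: FieldError[] = [];

  for (const field of REQUIRED_FIELDS) {
    if (!(field in a)) errors.push({ kind: "missing_required", field });
  }
  for (const field of Object.keys(a)) {
    if (!ALLOWED_FIELDS.includes(field)) errors.push({ kind: "extra_field", field });
  }
  for (const field of ["title", "start_iso"]) {
    if (field in a && typeof a[field] !== "string") {
      errors.push({ kind: "wrong_type", field });
    }
  }
  if ("duration_minutes" in a) {
    const d = a["duration_minutes"];
    if (typeof d !== "number" || !Number.isInteger(d)) {
      errors.push({ kind: "wrong_type", field: "duration_minutes" });
    } else if (d < 1) {
      errors.push({ kind: "below_minimum", field: "duration_minutes" });
    }
  }
  if ("visibility" in a) {
    const v = a["visibility"];
    if (typeof v !== "string") {
      errors.push({ kind: "wrong_type", field: "visibility" });
    } else if (v !== "public" && v !== "private") {
      errors.push({ kind: "enum_violation", field: "visibility" });
    }
  }
  return errors;
}
```

For example, passing `duration_minutes` as the string `"30"` yields a `wrong_type` error, and adding a `recurrence` key yields an `extra_field` error. In production a maintained JSON Schema validator is probably the better choice; the hand-rolled version just keeps the error categories visible.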
Then a set of prompts split into two buckets:
Should trigger (test false negatives):
- "Book a 30-minute private sync called 'Roadmap review' tomorrow at 2pm."
- "Add lunch with Sam to my calendar for Friday noon for an hour."
- "Schedule a standup every weekday morning — start with Monday 9am, 15 minutes." (ambiguity: recurrence isn't in the schema)
Should NOT trigger (test false positives):
- "What's a good agenda for a roadmap review meeting?"
- "Why does my calendar app keep crashing?"
- "Cancel my 2pm." (no cancel tool exists — must NOT fabricate one)
A correct model: calls the tool with schema-valid arguments on the first three, answers in prose on the last three, and never invents a tool. The recurrence prompt is a deliberate trap — the schema can't express recurrence, so the ideal behavior is to call once for Monday and say it can't do recurrence, not to hallucinate a recurrence field.
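The two buckets translate naturally into a fixture plus a classifier. The sketch below uses a subset of the prompts above. The `ModelResponse` shape and the names `CASES` and `classify` are hypothetical, and the classifier checks only whether a call happened and to which tool, not whether the arguments validate.

```typescript
type Expectation = "call" | "refrain";

interface TestCase {
  prompt: string;
  expect: Expectation;
}

// A subset of the prompts listed above, tagged with the expected behavior.
const CASES: TestCase[] = [
  { prompt: "Book a 30-minute private sync called 'Roadmap review' tomorrow at 2pm.", expect: "call" },
  { prompt: "Add lunch with Sam to my calendar for Friday noon for an hour.", expect: "call" },
  { prompt: "What's a good agenda for a roadmap review meeting?", expect: "refrain" },
  { prompt: "Cancel my 2pm.", expect: "refrain" },
];

// Assumed normalized response: the tool the model called, or null for prose.
interface ModelResponse {
  toolName: string | null;
}

const PROVIDED_TOOLS = ["create_calendar_event"];

type Outcome = "correct" | "false_negative" | "false_positive" | "phantom_tool";

function classify(testCase: TestCase, response: ModelResponse): Outcome {
  const name = response.toolName;
  // An invented tool is its own failure class, whichever bucket the prompt is in.
  if (name !== null && !PROVIDED_TOOLS.includes(name)) return "phantom_tool";
  if (testCase.expect === "call") {
    return name === null ? "false_negative" : "correct";
  }
  return name === null ? "correct" : "false_positive";
}
```

Under these assumptions, a model that answers "Cancel my 2pm." by calling a made-up `cancel_calendar_event` is classified as `phantom_tool` rather than as an ordinary false positive.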
Scoring
| Metric | Definition |
|---|---|
| Schema-valid rate | Of triggered calls, % that fully validate against the JSON schema |
| Trigger accuracy | % of should-trigger prompts that produced a call |
| Refrain accuracy | % of should-NOT-trigger prompts that produced no call |
| Phantom-tool rate | % of responses that invented a tool not provided |
| Field-error breakdown | Missing required / wrong type / enum violation / extra field |
The headline number is schema-valid rate, but phantom-tool rate is the one that should scare you — a model inventing tools in an agent loop is an unbounded failure.
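The four rate metrics in the table could be computed along these lines. This is a sketch with hypothetical names (`CaseResult`, `score`), and it makes two interpretive assumptions: "triggered calls" means calls to a tool that was actually provided, and a phantom-tool call does not count toward trigger accuracy.

```typescript
interface CaseResult {
  shouldTrigger: boolean;    // which bucket the prompt belongs to
  calledTool: string | null; // tool name the model emitted, or null for no call
  schemaValid: boolean;      // only meaningful when a provided tool was called
}

interface Scores {
  schemaValidRate: number;
  triggerAccuracy: number;
  refrainAccuracy: number;
  phantomToolRate: number;
}

// Percentage helper; NaN signals an empty denominator rather than a fake 0%.
function pct(numerator: number, denominator: number): number {
  return denominator === 0 ? NaN : (100 * numerator) / denominator;
}

function score(results: CaseResult[], providedTools: string[]): Scores {
  const calledProvided = (r: CaseResult) =>
    r.calledTool !== null && providedTools.includes(r.calledTool);
  const calledPhantom = (r: CaseResult) =>
    r.calledTool !== null && !providedTools.includes(r.calledTool);

  const calls = results.filter(calledProvided);
  const shouldTrigger = results.filter(r => r.shouldTrigger);
  const shouldRefrain = results.filter(r => !r.shouldTrigger);

  return {
    schemaValidRate: pct(calls.filter(r => r.schemaValid).length, calls.length),
    triggerAccuracy: pct(shouldTrigger.filter(calledProvided).length, shouldTrigger.length),
    refrainAccuracy: pct(shouldRefrain.filter(r => r.calledTool === null).length, shouldRefrain.length),
    phantomToolRate: pct(results.filter(calledPhantom).length, results.length),
  };
}
```

Keeping phantom-tool rate as a separate number, with all responses as its denominator, stops an invented tool from being averaged away inside refrain accuracy.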
The Comparison
We ran a focused version of this test across four providers — a should-trigger prompt (schedule a cleaning), a should-not prompt, and the deliberately under-specified ambiguous prompt — asking each model to emit a tool call or {"tool":null}.
| Provider | Should-trigger | Should-not | Ambiguous | Raw JSON? | Cost (sum of 3 calls) |
|---|---|---|---|---|---|
| OpenAI GPT-5.5 | Correct call | {"tool":null} | {"tool":null} | Yes | $0.00755 |
| Anthropic Opus 4.7 | Correct call | {"tool":null} | {"tool":null} | Yes | $0.00384 |
| DeepSeek V4 Pro | Correct call | {"tool":null} | {"tool":null} | Yes | $0.00233 |
| Mistral Large | Correct call | {"tool":null} | {"tool":null} | No (```json fences) | $0.00021 |
What We Found
The headline result: all four providers followed the tool protocol correctly on all three prompts. Each emitted the right create_calendar_event call when scheduling was clearly requested, returned {"tool":null} when no tool applied, and — notably — correctly declined the under-specified ambiguous request with {"tool":null} rather than guessing at missing arguments or inventing a field. No phantom tools, no schema violations.
The one real differentiator was format, not logic. Mistral Large wrapped its JSON in ```json markdown fences on every single prompt — all three responses came back as fenced blocks rather than raw JSON. That's a parser gotcha if your agent loop expects raw JSON, because the fences have to be stripped before JSON.parse. The other three returned raw JSON ready to parse. The model is correct, but a naive parser still chokes.
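A tolerant parser removes the gotcha. The sketch below strips an optional markdown fence before calling `JSON.parse`. `parseToolJson` is a hypothetical helper, not part of any provider SDK, and it handles only a single fenced block that is bare or tagged `json`; anything else still throws from `JSON.parse`.

```typescript
// Three backticks, built programmatically so this example stays readable in markdown.
const FENCE = "`".repeat(3);

// Matches a whole response that is one fenced block, optionally tagged "json".
const FENCED_BLOCK = new RegExp(
  "^" + FENCE + "(?:json)?\\s*([\\s\\S]*?)\\s*" + FENCE + "$",
  "i"
);

// Parses a model response that is either raw JSON or JSON inside one fence.
function parseToolJson(raw: string): unknown {
  const trimmed = raw.trim();
  const match = FENCED_BLOCK.exec(trimmed);
  // Use the fenced payload if there is one; otherwise assume raw JSON.
  const payload = match ? match[1] : trimmed;
  return JSON.parse(payload);
}

const fencedReply = FENCE + "json\n" + '{"tool":null}' + "\n" + FENCE;
console.log(parseToolJson(fencedReply));     // same result as the raw form below
console.log(parseToolJson('{"tool":null}'));
```

Normalizing at the parser keeps the envelope difference out of the scoring, so a fenced but otherwise correct response is not counted as a schema failure.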
On cost, Mistral Large was the cheapest by a wide margin: $0.00021 across all three calls, against $0.00755 for GPT-5.5 (roughly 36× more). DeepSeek V4 Pro was the cheapest among the raw-JSON providers at $0.00233 for the batch. Latency varied far more than correctness did: Mistral averaged ~780ms, Opus 4.7 ~1.2s, GPT-5.5 ~2.7s, and DeepSeek V4 Pro ~4.5s, with its longer responses on the ambiguous prompt dragging the average up. On that prompt DeepSeek V4 Pro burned 389 output tokens, far more than the 10 to 50 the others used, because it reasoned through the ambiguity before emitting {"tool":null}. Right answer, expensive envelope.
This is the same lesson as structured outputs compared: emitting strict, schema-valid structure is a separate skill from being smart — and even when every model gets the logic right, the envelope (raw JSON vs. fenced) still diverges across providers. It's also a specific instance of why prompts fail on different LLM providers — the same tool definition lands differently across formats.
The Real Lesson
Tool-calling reliability is the most production-critical capability that nobody has standardized a benchmark for, and it's a tail property: the 1% of malformed calls is what breaks your agent at scale, not the 99% that work. Test both directions (does it call when it should, and does it shut up when it shouldn't?) and weight phantom-tool rate heavily, because an invented tool in a loop has no ceiling on the damage.
Define this test against your actual tool schemas before you wire a model into an agent. And if those tools are MCP tools, the next step is testing MCP tools before shipping — the schema is the same problem, one layer up.
Tests run 2026-05-30 via the Promptster /v1/prompts/compare API. Temperatures as noted. Costs computed from the May 2026 pricing.ts.