Build a Multi-Model Router on the Promptster API — 2026 Model Wave Edition
By Promptster Team · 2026-05-28
Last month we shipped a task-aware LLM router in an afternoon — a 60-line classifier-plus-route that sends easy prompts to a nano model and hard prompts to a frontier model. That post is still the right place to start; the architecture hasn't changed.
What changed is the roster. The spring 2026 model updates reshuffled every tier: Claude Opus 4.6 is a strong frontier coding option, GPT-5.2 is OpenAI's current release, Gemini 3 Flash is a recent cheap-fast entrant, and DeepSeek V3.1 is a budget anchor for frontier-class work. A routing table written a couple of months ago is already stale. This post is the v2 — same architecture, refreshed routes, plus the two upgrades everyone asked for: a latency axis and a fallback path.
This is the reason teams keep switching to multi-model: the optimal model for a given task moves every few weeks, and hardcoding one vendor means re-litigating that decision in your application code every time the leaderboard shifts.
The Architecture (Now With Fallback + Latency)
┌───────────────┐
│  user prompt  │
└───────┬───────┘
        │
        ▼
┌────────────────────────────────┐
│ classifier (cheap, fast model) │
│ → (task_type, priority) labels │
└───────┬────────────────────────┘
        │
        ▼
┌────────────────────────────────┐
│ route lookup                   │
│ (task_type, priority) → tier   │
│   latency-critical → speed tier│
│   quality-critical → frontier  │
│   default → budget tier        │
└───────┬────────────────────────┘
        │
        ▼
┌────────────────────────────────┐
│ execute on primary             │
│   on 5xx / timeout ↓           │
│ execute on fallback (same tier)│
└───────┬────────────────────────┘
        │
        ▼
┌───────────────┐
│  final answer │
└───────────────┘
Still two LLM calls in the happy path: one cheap classifier, one execution. The fallback only fires on a provider error, so it costs nothing on healthy requests. Classification overhead remains ~1–2% of total request cost; the savings on routed work are still 10–100x.
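To make the overhead claim concrete, here is a back-of-the-envelope sketch. Every price and traffic share in it is a made-up placeholder, not a measured Promptster figure, and `blended_cost` is a hypothetical helper; substitute per-request costs from your own logs.

```python
def blended_cost(frontier_cost: float, budget_cost: float,
                 classifier_cost: float, frontier_share: float) -> float:
    """Average cost per request when a classifier routes traffic.

    frontier_share is the fraction of requests sent to the frontier tier;
    the rest go to the budget tier. Every request pays for the classifier.
    """
    if not 0.0 <= frontier_share <= 1.0:
        raise ValueError("frontier_share must be between 0 and 1")
    routed = frontier_share * frontier_cost + (1 - frontier_share) * budget_cost
    return classifier_cost + routed


# Hypothetical per-request costs in USD (placeholders, not real prices).
FRONTIER_COST = 0.0200
BUDGET_COST = 0.0004
CLASSIFIER_COST = 0.00005

routed = blended_cost(FRONTIER_COST, BUDGET_COST, CLASSIFIER_COST,
                      frontier_share=0.2)
overhead = CLASSIFIER_COST / routed

print(f"always-frontier: ${FRONTIER_COST:.5f} per request")
print(f"routed:          ${routed:.5f} per request")
print(f"classifier share of routed cost: {overhead:.1%}")
```

With these placeholder numbers the classifier accounts for roughly 1% of the routed cost. The share shifts with how much of your traffic genuinely needs the frontier tier, which is why it is worth computing from logged costs rather than assuming.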
The Routing Table (2026 Wave)
The whole point of the router is that this table is data you edit, not logic you rewrite. Here's the refreshed version reflecting the current roster. Each entry is a (provider, model) primary with a same-tier fallback.
from typing import Literal

TaskType = Literal["code", "math", "extraction", "creative", "factual", "general"]
Priority = Literal["latency", "quality", "default"]

# (provider, model) primary + same-tier fallback for resilience.
# Edit these to match YOUR measured workload; see /app to benchmark.
ROUTES: dict[tuple[TaskType, Priority], dict] = {
    # Hard coding & reasoning → frontier tier
    ("code", "quality"): {"primary": ("anthropic", "claude-opus-4-6"),
                          "fallback": ("openai", "gpt-5.2")},
    ("math", "quality"): {"primary": ("google", "gemini-3.1-pro-preview"),
                          "fallback": ("anthropic", "claude-opus-4-6")},

    # Latency-critical → speed tier
    ("general", "latency"): {"primary": ("groq", "llama-3.3-70b"),
                             "fallback": ("cerebras", "gpt-oss-120b")},
    ("extraction", "latency"): {"primary": ("google", "gemini-3-flash-preview"),
                                "fallback": ("groq", "llama-3.3-70b")},

    # Budget default → cheapest frontier-class / cheap-fast
    ("code", "default"): {"primary": ("deepseek", "deepseek-reasoner"),
                          "fallback": ("openai", "gpt-5.2")},
    ("extraction", "default"): {"primary": ("google", "gemini-3-flash-preview"),
                                "fallback": ("deepseek", "deepseek-chat")},
    ("creative", "default"): {"primary": ("openai", "gpt-5.2"),
                              "fallback": ("anthropic", "claude-opus-4-6")},
    ("factual", "default"): {"primary": ("perplexity", "sonar"),  # web-connected
                             "fallback": ("google", "gemini-3.1-pro-preview")},
    ("general", "default"): {"primary": ("deepseek", "deepseek-chat"),
                             "fallback": ("google", "gemini-3-flash-preview")},
}

# Fallback to the budget default if an exact (task, priority) key is missing.
DEFAULT_ROUTE = {"primary": ("deepseek", "deepseek-chat"),
                 "fallback": ("google", "gemini-3-flash-preview")}
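Notice the design: the same task type can route to a different model depending on priority. Coding where quality is paramount goes to Opus 4.6; coding by default goes to one of the cheaper frontier-class options (DeepSeek Reasoner). General prompts under a latency SLA go to a speed model (Llama 3.3 70B on Groq) instead of the budget default. The table as written has no ("code", "latency") row, so that pair falls through to DEFAULT_ROUTE; add one if you serve latency-bound coding traffic. That second axis is the upgrade over the v1 router.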
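Pairs without their own row go straight to DEFAULT_ROUTE in the router. One alternative, sketched here as an assumption about how you might want gaps handled and not as the post's behavior, is a layered lookup: exact key first, then the task's "default" row, then the global default. `resolve_route` is a hypothetical helper, and the table is trimmed to two rows to keep the example short.

```python
Route = dict[str, tuple[str, str]]

# Trimmed copy of the routing table, just enough to exercise the lookup.
ROUTES: dict[tuple[str, str], Route] = {
    ("code", "quality"): {"primary": ("anthropic", "claude-opus-4-6"),
                          "fallback": ("openai", "gpt-5.2")},
    ("code", "default"): {"primary": ("deepseek", "deepseek-reasoner"),
                          "fallback": ("openai", "gpt-5.2")},
}
DEFAULT_ROUTE: Route = {"primary": ("deepseek", "deepseek-chat"),
                        "fallback": ("google", "gemini-3-flash-preview")}


def resolve_route(task: str, priority: str) -> Route:
    """Look up (task, priority), then (task, 'default'), then DEFAULT_ROUTE."""
    for key in ((task, priority), (task, "default")):
        if key in ROUTES:
            return ROUTES[key]
    return DEFAULT_ROUTE


# No ("code", "latency") row exists, so this resolves to the code default.
print(resolve_route("code", "latency")["primary"])
# No math rows exist in the trimmed table, so this resolves to DEFAULT_ROUTE.
print(resolve_route("math", "quality")["primary"])
```

The trade-off is that a latency-tagged request may land on a slower task-specific model. If the SLA matters more than the task fit, try a priority-level default before the task-level one.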
The Code
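Requires a Promptster API key (pk_live_* or pk_test_*) and the requests library. The script assumes ROUTES and DEFAULT_ROUTE from the routing table above live in the same file.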
import os
import requests
from typing import Literal

PROMPTSTER_KEY = os.environ["PROMPTSTER_API_KEY"]
BASE_URL = "https://www.promptster.dev/v1"
CLASSIFIER = ("google", "gemini-3-flash-preview")  # cheap + fast classifier


def call_promptster(provider: str, model: str, prompt: str,
                    temperature: float = 0.3, max_tokens: int = 1000) -> dict:
    """POST one prompt to the Promptster API and return the parsed JSON body."""
    r = requests.post(
        f"{BASE_URL}/prompts/test",
        headers={"Authorization": f"Bearer {PROMPTSTER_KEY}"},
        json={"provider": provider, "model": model, "prompt": prompt,
              "temperature": temperature, "max_tokens": max_tokens},
        timeout=60,
    )
    r.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return r.json()


def classify(prompt: str) -> tuple[str, str]:
    """Return (task_type, priority). Cheap model, temperature 0."""
    p = f"""Classify this prompt. Reply with exactly two lowercase words separated by a space:
WORD 1 (task_type): one of [code, math, extraction, creative, factual, general]
WORD 2 (priority): 'latency' if it reads time-sensitive/real-time, 'quality' if correctness is critical, else 'default'
Prompt: {prompt}
Reply with ONLY the two words."""
    provider, model = CLASSIFIER
    out = call_promptster(provider, model, p, temperature=0.0, max_tokens=8)
    parts = out["response"].strip().lower().split()
    valid_tasks = {"code", "math", "extraction", "creative", "factual", "general"}
    valid_prios = {"latency", "quality", "default"}
    # Unrecognized labels degrade to the safe defaults instead of raising.
    task = parts[0] if parts and parts[0] in valid_tasks else "general"
    prio = parts[1] if len(parts) > 1 and parts[1] in valid_prios else "default"
    return task, prio  # type: ignore[return-value]


def route(prompt: str) -> dict:
    task, prio = classify(prompt)
    spec = ROUTES.get((task, prio), DEFAULT_ROUTE)
    primary_p, primary_m = spec["primary"]
    try:
        answer = call_promptster(primary_p, primary_m, prompt)
        used = f"{primary_p}/{primary_m}"
    except (requests.HTTPError, requests.Timeout) as e:
        # Timeouts carry no response, so they always take the fallback path.
        if e.response is not None and e.response.status_code < 500:
            raise  # 4xx is our bug, not a provider outage; don't fall back
        fb_p, fb_m = spec["fallback"]
        answer = call_promptster(fb_p, fb_m, prompt)
        used = f"{fb_p}/{fb_m} (fallback)"
    answer["_router"] = {"task": task, "priority": prio, "routed_to": used}
    return answer


if __name__ == "__main__":
    import sys
    res = route(sys.argv[1])
    meta = res["_router"]
    print(f"Routed to: {meta['routed_to']} ({meta['task']}, {meta['priority']})")
    print(f"Cost: ${res.get('cost_usd', 0):.6f}")
    print(res["response"])
Run it:
export PROMPTSTER_API_KEY=pk_live_...
python router_v2.py "Refactor this auth middleware to be thread-safe. Correctness is critical."
# → Routed to: anthropic/claude-opus-4-6 (code, quality)
python router_v2.py "Pull the order ID and ship date from this email. Need it now for the live dashboard."
# → Routed to: google/gemini-3-flash-preview (extraction, latency)
python router_v2.py "Summarize this changelog."
# → Routed to: deepseek/deepseek-chat (general, default)
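The label parsing inside classify can be checked without an API key or a network call. The sketch below pulls that step into a hypothetical helper, `parse_labels` (not a function in the script above), that mirrors the same defaults, so you can see how malformed classifier replies degrade.

```python
VALID_TASKS = {"code", "math", "extraction", "creative", "factual", "general"}
VALID_PRIOS = {"latency", "quality", "default"}


def parse_labels(raw: str) -> tuple[str, str]:
    """Turn the classifier's raw reply into (task_type, priority).

    Anything unrecognized degrades to "general" / "default" per position.
    """
    parts = raw.strip().lower().split()
    task = parts[0] if parts and parts[0] in VALID_TASKS else "general"
    prio = parts[1] if len(parts) > 1 and parts[1] in VALID_PRIOS else "default"
    return task, prio


print(parse_labels("Code Quality"))           # ('code', 'quality')
print(parse_labels("extraction"))             # ('extraction', 'default')
print(parse_labels("I think this is code."))  # ('general', 'default')
print(parse_labels("code, quality"))          # ('general', 'quality')
```

The last case is worth noticing: a stray comma makes the first word miss the allowed set, so the task silently becomes "general". Stripping punctuation before the membership check is one option if your logs show replies like that.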
Why These Routes
Every entry should be a measurement, not a hunch. The honest caveat: the 2026-wave routes above are starting points derived from public positioning and our prior runs — you should re-benchmark them against your own traffic. The routing decisions worth defending:
- Quality-critical code → Opus 4.6. A strong frontier coding option. Fallback to GPT-5.2 as a close alternative.
- Default code → DeepSeek Reasoner. One of the cheapest frontier-class options today; if it holds up on your tasks (we test that thesis in our DeepSeek deep-dive), it's a reasonable default.
- Latency-critical → Groq / Cerebras / Gemini 3 Flash. The speed tier. We benchmark exactly these in reducing latency with Groq.
- Factual → Perplexity. The only web-connected model in this table; models without retrieval are more likely to get recent facts wrong.
The cost case for routing at all is unchanged from the 300x price-spread study: paying frontier rates for every request because "it's safer" gives away margin that teams who route are keeping.
Production Hardening
The script above ships, but add these before you scale:
- Cache classification. hash(prompt) → (task, priority) for 24h. Many prompts repeat.
- Log everything. task, priority, routed_to, cost_usd, latency_ms. After a week you'll know if your table is calibrated, and which routes are misfiring.
- Schedule a weekly re-benchmark. Models deprecate and new ones dethrone your picks. A scheduled comparison test across your top task types tells you when the table needs editing.
- Watch fallback rate. A spiking fallback rate is a provider-health signal before it's an incident.
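Two of these items are small enough to sketch. The code below is a minimal in-memory version under stated assumptions: `ClassificationCache` and `FallbackRate` are hypothetical names, the 24h TTL comes from the list above, and the 500-request window is an arbitrary placeholder. It keys the cache on SHA-256 because Python's built-in hash() is salted per process for strings, which matters once the cache moves to Redis or disk.

```python
import hashlib
import time
from collections import deque
from typing import Callable, Optional


class ClassificationCache:
    """In-memory TTL cache keyed by a stable SHA-256 of the prompt."""

    def __init__(self, ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[str, tuple[float, tuple[str, str]]] = {}

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> Optional[tuple[str, str]]:
        key = self._key(prompt)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, labels = entry
        if self.clock() - stored_at > self.ttl:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return labels

    def put(self, prompt: str, labels: tuple[str, str]) -> None:
        self._store[self._key(prompt)] = (self.clock(), labels)


class FallbackRate:
    """Share of the last `window` requests that were served by a fallback."""

    def __init__(self, window: int = 500) -> None:
        self._hits = deque(maxlen=window)

    def record(self, routed_to: str) -> None:
        # The router tags fallback answers with a "(fallback)" suffix.
        self._hits.append(routed_to.endswith("(fallback)"))

    def rate(self) -> float:
        return sum(self._hits) / len(self._hits) if self._hits else 0.0


cache = ClassificationCache()
cache.put("Summarize this changelog.", ("general", "default"))
print(cache.get("Summarize this changelog."))  # ('general', 'default')
print(cache.get("never seen"))                 # None
```

In the router, you would check `cache.get(prompt)` before calling classify and call `FallbackRate.record` with the `routed_to` string after each request. What rate counts as "spiking" is something to decide from your own baseline.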
The Real Win
You still don't need a framework, a vector DB, or a meta-learner to route LLM traffic. You need a cheap classifier, a routing table backed by your data, a fallback path, and weekly monitoring. The only thing that changes month to month is the contents of one dictionary — and because the dictionary is data, updating it for the next model wave is a one-line edit, not a refactor.
That's the whole argument for going multi-model: you future-proof against a roster that won't stop moving.
Code tested against Promptster API v1 as of 2026-05-26. Routing table reflects the spring 2026 model updates; re-benchmark against your own workload before trusting the mappings. Model IDs are plausible API strings pending verification.