Rate Limits Are the New Outage: Build Multi-Provider Failover in Python
By Promptster Team · 2026-06-09
The old reliability question was "what if my provider goes down?" In 2026 that's the wrong question. Frontier providers rarely go down — they hit capacity ceilings. Demand spikes, you blow past your per-minute quota or the provider's regional capacity, and you get a wall of HTTP 429s. Your service is degraded just as surely as if the API were offline, except nothing is broken and there's no incident page to point at.
Rate-limit errors are the new outage. And the fix is the same shape as outage handling: don't depend on one provider. This post builds same-tier failover — detect a 429, fall back to an equivalent model on a different provider, keep serving.
Why Single-Provider Retry Isn't Enough
The instinct is to back off and retry the same provider. That works for a transient blip. It does not work for a sustained capacity ceiling — you're just politely waiting in a line that isn't moving while your users time out.
      429 on Provider A
              │
         ┌────┴────────────────────────┐
         ▼                             ▼
retry SAME provider           fail over to
(good for blips)              SAME-TIER alt provider
                              (good for capacity ceilings)
The robust answer is both: a short backoff-retry on the primary, then a fallthrough to an equivalent provider. This is the reliability argument behind why developers are switching to multi-model — diversification isn't only about quality, it's about not having a single capacity dependency.
Define Same-Tier Fallback Chains
The key idea: maintain a chain of equivalent models per tier, so a fallback doesn't tank quality. A frontier-tier request falls over to another frontier model, not to a nano model.
| Tier | Primary | Fallback 1 | Fallback 2 |
|---|---|---|---|
| Frontier | anthropic/claude-opus-4-6 | openai/gpt-5.2 | google/gemini-3.1-pro-preview |
| Fast | groq/llama-3.3-70b | cerebras/llama3.1-8b | together/... |
| Cheap frontier | deepseek/deepseek-reasoner | openai/gpt-5.2-mini | (none) |
For latency-sensitive paths, lead the fast tier with Groq — see reducing latency with Groq for why it sits first in the chain.
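A chain table like the one above is easy to get subtly wrong, for example by listing the same provider twice in one tier, which defeats the point of failing over. The sketch below is one possible way to express the table as data and sanity-check it. The tier keys and the `validate_chains` helper are hypothetical names introduced for illustration, the model IDs are copied from the table, and the elided third "Fast" link is left out rather than guessed.

```python
# Hypothetical chain table mirroring the tiers above. The third "Fast" link
# is elided in the table ("together/..."), so it is omitted here.
CHAINS = {
    "frontier": [
        ("anthropic", "claude-opus-4-6"),
        ("openai", "gpt-5.2"),
        ("google", "gemini-3.1-pro-preview"),
    ],
    "fast": [
        ("groq", "llama-3.3-70b"),
        ("cerebras", "llama3.1-8b"),
    ],
    "cheap_frontier": [
        ("deepseek", "deepseek-reasoner"),
        ("openai", "gpt-5.2-mini"),
    ],
}


def validate_chains(chains):
    """Return a list of problems; an empty list means the table looks sane."""
    problems = []
    for tier, links in chains.items():
        # A chain with a single link has nothing to fail over to.
        if len(links) < 2:
            problems.append(f"{tier}: needs at least one fallback")
        # Failover only helps if each link is a different provider.
        providers = [provider for provider, _model in links]
        if len(set(providers)) != len(providers):
            problems.append(f"{tier}: a provider appears more than once")
    return problems
```

Running `validate_chains(CHAINS)` at import time or in a unit test is a cheap way to catch a bad edit to the table before it reaches production.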
The Failover Loop
One function: try each link in the chain, back off on 429 within a link, move to the next link when a link is exhausted.
import os, time, requests

BASE = "https://www.promptster.dev/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['PROMPTSTER_API_KEY']}"}

CHAINS = {
    "frontier": [
        ("anthropic", "claude-opus-4-6"),
        ("openai", "gpt-5.2"),
        ("google", "gemini-3.1-pro-preview"),
    ],
}

def generate(prompt, tier="frontier", per_link_tries=2):
    last_err = None
    for provider, model in CHAINS[tier]:
        for attempt in range(per_link_tries):
            r = requests.post(f"{BASE}/prompts/test", headers=HEADERS, json={
                "provider": provider, "model": model,
                "prompt": prompt, "temperature": 0.2, "max_tokens": 500,
            }, timeout=60)
            if r.status_code == 429:
                # Capacity ceiling on THIS provider. Back off briefly,
                # but be ready to move to the next link.
                last_err = "429 on %s/%s" % (provider, model)
                if attempt + 1 < per_link_tries:  # no sleep before advancing
                    try:
                        wait = float(r.headers.get("Retry-After", 2 ** attempt))
                    except ValueError:  # Retry-After may be an HTTP date
                        wait = 2 ** attempt
                    time.sleep(min(wait, 4))
                continue
            if r.status_code >= 500:
                last_err = "%d on %s/%s" % (r.status_code, provider, model)
                break  # provider hiccup: skip to the next link immediately
            r.raise_for_status()  # any other 4xx is a caller error; surface it
            data = r.json()
            # Record which link actually served the request.
            data.setdefault("metadata", {})["served_by"] = f"{provider}/{model}"
            return data
    raise RuntimeError(f"all providers exhausted; last error: {last_err}")
Two deliberate choices worth calling out:
- 429 → short backoff, then advance. We give the primary one quick retry, but a sustained ceiling means moving on beats waiting.
- 5xx → advance immediately. A provider returning 500 won't fix itself in two seconds; don't waste the retry budget on it.
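The two choices above can also be pulled out of the request loop as pure functions, which makes them testable without any network traffic. The following is a minimal sketch, not part of the original loop: `backoff_seconds` and `next_action` are hypothetical helper names, and the 4-second cap and the `2 ** attempt` fallback are taken from the loop above. One detail the sketch handles explicitly is that HTTP allows `Retry-After` to be either a number of seconds or an HTTP date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def backoff_seconds(retry_after, attempt, cap=4.0, now=None):
    """Seconds to sleep after a 429, never more than `cap`.

    `retry_after` is the raw Retry-After header value, or None if absent.
    """
    fallback = min(float(2 ** attempt), cap)
    if retry_after is None:
        return fallback
    try:
        wait = float(retry_after)  # delay-seconds form, e.g. "2"
    except ValueError:
        try:
            when = parsedate_to_datetime(retry_after)  # HTTP-date form
        except (TypeError, ValueError):
            return fallback  # unparseable header: use exponential fallback
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        wait = (when - now).total_seconds()
    return max(0.0, min(wait, cap))


def next_action(status, attempt, per_link_tries=2):
    """Map one response status to 'return', 'retry', 'advance', or 'raise'."""
    if status == 429:
        # Retry the same link while attempts remain, then move on.
        return "retry" if attempt + 1 < per_link_tries else "advance"
    if status >= 500:
        return "advance"  # server-side failure: go straight to the next link
    if status >= 400:
        return "raise"  # mirrors raise_for_status() in the loop
    return "return"
```

With the defaults, a 429 on the first attempt yields `"retry"` and a 429 on the second yields `"advance"`, which matches the "one quick retry" behavior described above.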
Don't Forget Your Own Rate Limit
Failover handles the provider's ceiling. You also have Promptster's per-minute limit, which depends on your tier:
| Tier | Requests / minute |
|---|---|
| Free | 5 |
| Builder | 30 |
| Scale | 120 |
| Enterprise | 500 |
If you're sustaining enough volume to trip provider ceilings, you're probably near your own limit too. The same backoff branch that catches a provider 429 also catches the API's 429, so the loop above already handles both, with one caveat: switching providers does not relieve a limit on your own key, so a 429 from Promptster itself will likely follow you down the chain. Just make sure your concurrency isn't so aggressive that you're self-inflicting limits.
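One way to avoid self-inflicted 429s is to throttle on the client before a request is ever sent. The sketch below is a simple sliding-window limiter under stated assumptions: `MinuteLimiter` is a hypothetical class, not part of any Promptster SDK. It is single-threaded as written (a lock around `acquire` would be needed for concurrent callers), and it assumes the server counts requests over a rolling 60-second window, which may not match how the limit is actually enforced.

```python
import time
from collections import deque


class MinuteLimiter:
    """Sliding-window limiter: at most `limit` calls in any `window` seconds."""

    def __init__(self, limit, window=60.0, clock=time.monotonic, sleep=time.sleep):
        self.limit = limit
        self.window = window
        self.clock = clock    # injectable so tests can use a fake clock
        self.sleep = sleep
        self.stamps = deque()  # start times of recent calls, oldest first

    def acquire(self):
        """Block until one more call fits in the window, then record it."""
        while True:
            now = self.clock()
            # Forget calls that have aged out of the window.
            while self.stamps and now - self.stamps[0] >= self.window:
                self.stamps.popleft()
            if len(self.stamps) < self.limit:
                self.stamps.append(now)
                return
            # Wait until the oldest recorded call leaves the window.
            self.sleep(self.window - (now - self.stamps[0]))
```

For the Builder tier this would be `limiter = MinuteLimiter(30)`, with `limiter.acquire()` called before every POST, including retries. If every attempt counts against the limit, a single `generate` call can issue up to `per_link_tries` times the chain length in requests (2 × 3 = 6 for the frontier chain above), which already exceeds the Free tier's 5 per minute.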
Verify the Chain Before You Trust It
A failover chain you've never exercised is a chain that fails the first time it matters. Before shipping, confirm each link actually returns equivalent-quality output for your task — run the same prompt across all three with compare and eyeball the results. This is also the foundation for promoting the loop into a full multi-model router on the API.
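A small harness can make that exercise repeatable. The sketch below sends one prompt to every link and collects the results side by side for manual review. It deliberately takes the request function as a parameter: `check_chain` and the `call(provider, model, prompt)` callback are hypothetical, and the callback is where a single POST to the endpoint used in the failover loop would go. How the response text is extracted depends on the response schema, which is not shown here, so that part is left to the caller.

```python
def check_chain(links, prompt, call):
    """Send `prompt` to every link once and collect what came back.

    `call(provider, model, prompt)` is supplied by the caller and should
    return the response text or raise. Nothing here touches the network.
    """
    report = []
    for provider, model in links:
        name = f"{provider}/{model}"
        try:
            text = call(provider, model, prompt)
            report.append({"link": name, "ok": True, "output": text})
        except Exception as exc:  # a dead link is reported, not fatal
            report.append({"link": name, "ok": False, "output": repr(exc)})
    return report


def failed_links(report):
    """Names of links that raised instead of answering."""
    return [row["link"] for row in report if not row["ok"]]
```

Because the callback is injected, the same harness can run against a stub in CI and against the live API before a release. It only shows that each link responds; judging whether the outputs are equivalent in quality still needs a human or a task-specific check.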
The Real Lesson
Treat capacity like you treat uptime: assume your primary provider will be full at the worst possible moment, and have a same-tier alternative ready. A 30-line failover loop turns a wall of 429s from a user-facing outage into a logged metadata line that says served_by: openai/gpt-5.2. Diversify the dependency, exercise the chain, and the new outage stops being an outage.
Code tested against Promptster API v1 as of 2026-06-09. Requires a pk_live_* or pk_test_* key from /developer/api-keys. Model IDs are illustrative — confirm current IDs in the docs.