Rate Limits Are the New Outage: Build Multi-Provider Failover in Python

By Promptster Team · 2026-06-09

The old reliability question was "what if my provider goes down?" In 2026 that's the wrong question. Frontier providers rarely go down — they hit capacity ceilings. Demand spikes, you blow past your per-minute quota or the provider's regional capacity, and you get a wall of HTTP 429s. Your service is degraded just as surely as if the API were offline, except nothing is broken and there's no incident page to point at.

Rate-limit errors are the new outage. And the fix is the same shape as outage handling: don't depend on one provider. This post builds same-tier failover — detect a 429, fall back to an equivalent model on a different provider, keep serving.

Why Single-Provider Retry Isn't Enough

The instinct is to back off and retry the same provider. That works for a transient blip. It does not work for a sustained capacity ceiling — you're just politely waiting in a line that isn't moving while your users time out.

429 on Provider A
        │
   ┌────┴────────────────────────┐
   ▼                             ▼
retry SAME provider          fail over to
(good for blips)             SAME-TIER alt provider
                             (good for capacity ceilings)

The robust answer is both: a short backoff-retry on the primary, then a fallthrough to an equivalent provider. This is the reliability argument behind why developers are switching to multi-model — diversification isn't only about quality, it's about not having a single capacity dependency.

Define Same-Tier Fallback Chains

The key idea: maintain a chain of equivalent models per tier, so a fallback doesn't tank quality. A frontier-tier request falls over to another frontier model, not to a nano model.

Tier	Primary	Fallback 1	Fallback 2
Frontier	`anthropic/claude-opus-4-6`	`openai/gpt-5.2`	`google/gemini-3.1-pro-preview`
Fast	`groq/llama-3.3-70b`	`cerebras/llama3.1-8b`	`together/...`
Cheap frontier	`deepseek/deepseek-reasoner`	`openai/gpt-5.2-mini`	—

For latency-sensitive paths, lead the fast tier with Groq — see reducing latency with Groq for why it sits first in the chain.
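The `provider/model` IDs in the table map directly onto the (provider, model) pairs a failover loop iterates over. Here is a minimal sketch of that conversion, assuming every ID has a provider prefix before the first slash. `parse_chain` is a hypothetical helper, not part of any API, and the model IDs are the illustrative ones from the table.

```python
def parse_chain(model_ids):
    """Split 'provider/model' IDs into (provider, model) pairs, keeping order."""
    chain = []
    for model_id in model_ids:
        provider, sep, model = model_id.partition("/")
        if not sep or not provider or not model:
            raise ValueError(f"expected 'provider/model', got {model_id!r}")
        chain.append((provider, model))
    return chain


# Illustrative IDs copied from the table above; confirm current IDs before use.
FRONTIER = parse_chain([
    "anthropic/claude-opus-4-6",
    "openai/gpt-5.2",
    "google/gemini-3.1-pro-preview",
])
```

Rejecting malformed IDs at startup means a typo in a chain surfaces at deploy time rather than in the middle of a failover.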

The Failover Loop

One function: try each link in the chain, back off on 429 within a link, move to the next link when a link is exhausted.

import os, time, requests

BASE = "https://www.promptster.dev/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['PROMPTSTER_API_KEY']}"}

CHAINS = {
    "frontier": [
        ("anthropic", "claude-opus-4-6"),
        ("openai",    "gpt-5.2"),
        ("google",    "gemini-3.1-pro-preview"),
    ],
}

def generate(prompt, tier="frontier", per_link_tries=2):
    last_err = None
    for provider, model in CHAINS[tier]:
        for attempt in range(per_link_tries):
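            try:
                r = requests.post(f"{BASE}/prompts/test", headers=HEADERS, json={
                    "provider": provider, "model": model,
                    "prompt": prompt, "temperature": 0.2, "max_tokens": 500,
                }, timeout=60)
            except requests.RequestException as exc:
                # Timeout or connection failure: treat like a 5xx and advance.
                last_err = "%s on %s/%s" % (type(exc).__name__, provider, model)
                break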

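            if r.status_code == 429:
                # Capacity ceiling on THIS provider. Back off briefly,
                # but be ready to move to the next link.
                last_err = "429 on %s/%s" % (provider, model)
                if attempt == per_link_tries - 1:
                    break  # retries for this link are spent; don't sleep first
                try:
                    wait = float(r.headers.get("Retry-After", 2 ** attempt))
                except ValueError:
                    # Retry-After may be an HTTP-date rather than seconds.
                    wait = 2 ** attempt
                time.sleep(min(max(wait, 0), 4))
                continue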

            if r.status_code >= 500:
                last_err = "%d on %s/%s" % (r.status_code, provider, model)
                break  # provider hiccup — skip to next link immediately

            r.raise_for_status()
            data = r.json()
            data["metadata"]["served_by"] = f"{provider}/{model}"
            return data

    raise RuntimeError(f"all providers exhausted; last error: {last_err}")

Two deliberate choices worth calling out:

429 → short backoff, then advance. We give the primary one quick retry, but a sustained ceiling means moving on beats waiting.
5xx → advance immediately. A provider returning 500 won't fix itself in two seconds; don't waste the retry budget on it.
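The two choices above can also be written as a small decision function, which makes the policy easy to unit-test without any network calls. This is a sketch under assumptions: `Action` and `next_action` are hypothetical names, and the handling of other 4xx codes (raise instead of failing over) mirrors what `raise_for_status()` does in the loop.

```python
from enum import Enum


class Action(Enum):
    RETURN = "return"    # success: hand the response back
    RETRY = "retry"      # back off, then try the same link again
    ADVANCE = "advance"  # give up on this link, move to the next one
    RAISE = "raise"      # caller error (other 4xx): failing over won't help


def next_action(status_code, attempt, per_link_tries=2):
    """Decide what the loop should do with a response on a given attempt."""
    if status_code == 429:
        # Retry only while this link still has attempts left.
        return Action.RETRY if attempt < per_link_tries - 1 else Action.ADVANCE
    if status_code >= 500:
        return Action.ADVANCE
    if status_code >= 400:
        return Action.RAISE
    return Action.RETURN
```

With the default `per_link_tries=2`, a 429 on the first attempt yields `Action.RETRY` and a 429 on the second yields `Action.ADVANCE`, which is the "one quick retry" described above.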

Don't Forget Your Own Rate Limit

Failover handles the provider's ceiling. You also have Promptster's per-minute limit, which depends on your tier:

Tier	Requests / minute
Free	5
Builder	30
Scale	120
Enterprise	500

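If you're sustaining enough volume to trip provider ceilings, you're probably near your own limit too. The same backoff branch that catches a provider 429 also catches the API's 429, so the loop above already handles both. One caveat: the loop can't tell the two apart from the status code alone, and when the 429 comes from your own Promptster quota, advancing to the next provider only spends more requests against the same limit. Just make sure your concurrency isn't so aggressive that you're self-inflicting limits.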
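One way to stay under your own per-minute limit is to throttle on the client before each request. The sketch below is a single-threaded sliding-window limiter. `MinuteThrottle` is a hypothetical helper, not part of the Promptster API, and it assumes the limit behaves as a rolling 60-second window, which the table does not confirm.

```python
import time
from collections import deque


class MinuteThrottle:
    """Sliding-window limiter: allow at most `limit` calls per `window` seconds."""

    def __init__(self, limit, window=60.0, clock=time.monotonic, sleep=time.sleep):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.sleep = sleep
        self.sent = deque()  # timestamps of calls inside the current window

    def acquire(self):
        """Block until one more call fits inside the window, then record it."""
        while True:
            now = self.clock()
            # Drop timestamps that have aged out of the window.
            while self.sent and now - self.sent[0] >= self.window:
                self.sent.popleft()
            if len(self.sent) < self.limit:
                self.sent.append(now)
                return
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.window - (now - self.sent[0]))


throttle = MinuteThrottle(limit=30)  # Builder tier, per the table above
```

Call `throttle.acquire()` immediately before each request. The class is not thread-safe as written, so concurrent workers would need a lock around `acquire()`.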

Verify the Chain Before You Trust It

A failover chain you've never exercised is a chain that fails the first time it matters. Before shipping, confirm each link actually returns equivalent-quality output for your task — run the same prompt across all three with compare and eyeball the results. This is also the foundation for promoting the loop into a full multi-model router on the API.
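If you would rather exercise the chain from a script, a small harness can send one prompt to every link and record which ones answer. This is a sketch, not the compare feature itself: `exercise_chain` is a hypothetical helper, and `call` is a function you supply that makes the request and returns the generated text.

```python
def exercise_chain(chain, prompt, call):
    """Send one prompt to every link and collect a per-link result or error.

    `call(provider, model, prompt)` is a caller-supplied function that makes
    the request and returns the generated text; it is an assumption here.
    """
    report = {}
    for provider, model in chain:
        key = f"{provider}/{model}"
        try:
            report[key] = {"ok": True, "output": call(provider, model, prompt)}
        except Exception as exc:  # a broken link should not hide the others
            report[key] = {"ok": False, "error": repr(exc)}
    return report
```

Unlike the failover loop, this deliberately calls every link even when the first one succeeds, because the goal is to find dead or low-quality fallbacks before an incident does.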

The Real Lesson

Treat capacity like you treat uptime: assume your primary provider will be full at the worst possible moment, and have a same-tier alternative ready. A 30-line failover loop turns a wall of 429s from a user-facing outage into a logged metadata line that says served_by: openai/gpt-5.2. Diversify the dependency, exercise the chain, and the new outage stops being an outage.

Code tested against Promptster API v1 as of 2026-06-09. Requires a pk_live_* or pk_test_* key from /developer/api-keys. Model IDs are illustrative — confirm current IDs in the docs.