Open-Weight Inference Speed: The Same Llama Runs ~10× Faster on One Host
By Promptster Team · 2026-06-12
The pitch for open-weight models is portability: the same weights run anywhere. True — but "the same weights" does not mean "the same product." Latency, throughput, and price-per-token are set by the host, not the model, and they swing wildly. Picking Llama 3.3 70B is half the decision; picking where it runs is the other half, and it's the half most teams skip.
So we ran the identical model — llama-3.3-70b — on two hosts through Promptster's compare view and measured what actually changes when only the host changes.
The Test
Three everyday tasks (merge two sorted lists, a two-sentence summary, a strict-JSON extraction) at matched temperatures, same prompts, same model weights — only the host differs: Together AI vs Groq.
A note on scope: we wanted four hosts, not two. Fireworks AI returned 5xx errors on every call on our account (a provider/key issue, not a model-id one), and Cerebras's Qwen endpoint was rate-limited, so neither produced usable data this run. That's its own lesson (verify a host actually serves you before you design around it), and it's why this is a clean two-host comparison rather than the four-way we planned.
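One way to act on that lesson is a small preflight gate that runs before any benchmarking. The sketch below is illustrative only: `classify`, `preflight`, and the `probe` callables are hypothetical names, and the probes return fixed status codes rather than making real HTTP calls. In practice each probe would send one cheap request to the host and return the HTTP status it got back.

```python
from collections import Counter

def classify(status):
    """Map an HTTP status code to a coarse availability verdict."""
    if 200 <= status < 300:
        return "ok"
    if status == 429:  # 429 Too Many Requests
        return "rate_limited"
    if 500 <= status < 600:
        return "server_error"
    return "other_error"

def preflight(probe, attempts=3):
    """Call probe() a few times; treat the host as usable only if every call succeeds."""
    verdicts = Counter(classify(probe()) for _ in range(attempts))
    usable = verdicts["ok"] == attempts
    return usable, dict(verdicts)

# Stand-in probes with fixed status codes (assumption: real probes would hit each host).
hosts = {
    "host_a": lambda: 200,
    "host_b": lambda: 503,
    "host_c": lambda: 429,
}
for name, probe in hosts.items():
    print(name, preflight(probe))
```

Requiring every attempt to succeed is a deliberately strict choice for a gate; a looser threshold may suit hosts with occasional transient errors.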
Results — same model, same prompts, only the host changes
| Task | Together (latency / cost / tok/s) | Groq (latency / cost / tok/s) | Groq speedup |
|---|---|---|---|
| Code (merge sorted lists) | 6,607 ms / $0.000326 / 48 tok/s | 576 ms / $0.000228 / 429 tok/s | ~11× |
| Summarize (2 sentences) | 3,607 ms / $0.000128 / 20 tok/s | 491 ms / $0.000098 / 141 tok/s | ~7× |
| JSON extract | 1,924 ms / $0.000064 / 8 tok/s | 191 ms / $0.000044 / 63 tok/s | ~10× |
The outputs were equivalent: it's the same model, so the merge function, the summary, and the JSON all came back correct and near-identical on both hosts. Everything that differed was operational: Groq was 7–11× faster and 23–31% cheaper on every task, with 7–9× the throughput.
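The ratios can be re-derived from the table directly. The snippet below is a minimal sketch that copies the per-task figures from the table above; the `RESULTS` layout and the `compare` helper are our own illustrative names, not part of any Promptster API.

```python
# Per-task figures copied from the results table: (latency_ms, cost_usd, tokens_per_s).
RESULTS = {
    "code": {"together": (6607, 0.000326, 48), "groq": (576, 0.000228, 429)},
    "summarize": {"together": (3607, 0.000128, 20), "groq": (491, 0.000098, 141)},
    "json_extract": {"together": (1924, 0.000064, 8), "groq": (191, 0.000044, 63)},
}

def compare(slow, fast):
    """Return (latency speedup, throughput ratio, fractional cost saving) of fast vs slow."""
    slow_ms, slow_cost, slow_tps = slow
    fast_ms, fast_cost, fast_tps = fast
    return slow_ms / fast_ms, fast_tps / slow_tps, 1 - fast_cost / slow_cost

for task, hosts in RESULTS.items():
    speedup, tps_ratio, saving = compare(hosts["together"], hosts["groq"])
    print(f"{task:13s} latency {speedup:4.1f}x  throughput {tps_ratio:4.1f}x  cost -{saving:.0%}")
```

Run on these figures, the latency speedups fall between roughly 7× and 11×, the throughput ratios between roughly 7× and 9×, and the cost savings between roughly 23% and 31%.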
This Isn't a Groq Quirk — Together Was Slow on Big Models Generally
In a separate run for our Qwen 3 235B vs DeepSeek comparison, Together-hosted Qwen 3 235B took 38–52 seconds on some single-shot prompts — an order of magnitude slower than DeepSeek's own API on the same questions. Two different models, same pattern: Together's hosting of large open-weight models carried heavy latency in our runs. If you're serving interactive traffic, that's disqualifying regardless of how good the weights are.
What This Means
- The host is a first-class decision, not an afterthought. A 10× latency swing on identical weights will dominate your p95 long before model choice does. Benchmark the host, not just the model.
- Match the host to the workload. For interactive, latency-sensitive paths, a speed-specialist host (Groq here, and Cerebras when it's available to you) is the right choice. For batch or cost-only jobs, a slower host can be fine, but measure it, don't assume.
- "Available on our account" is a real gate. Two of the four hosts we tried were unusable for us this week. Portability is theoretical until your key actually gets a 200.
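To make "benchmark the host" concrete, here is a rough sketch of a latency harness that reports p50 and p95 per host. It is not how our compare view measures anything; `benchmark`, `fast_host`, and `slow_host` are hypothetical names, and the two workloads simply sleep so the example runs without network access. Swap in a function that sends your own prompt to each host.

```python
import statistics
import time

def benchmark(call, runs=20):
    """Time call() `runs` times and return latency percentiles in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000)
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
    cuts = statistics.quantiles(samples, n=20)
    return {"p50_ms": statistics.median(samples), "p95_ms": cuts[18]}

# Stand-in workloads that only sleep (assumption: replace with real requests per host).
def fast_host():
    time.sleep(0.002)

def slow_host():
    time.sleep(0.03)

for name, call in {"fast_host": fast_host, "slow_host": slow_host}.items():
    print(name, benchmark(call))
```

With a small number of runs the p95 estimate is noisy, so treat it as a rough signal and increase `runs` before drawing conclusions about tail latency.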
The Real Lesson
Open weights commoditize the model; they do not commoditize the service. The same Llama 3.3 70B was a snappy 200–600 ms experience on one host and a roughly 2–7 second one on another, at the same quality. Before you standardize on a host for an open-weight model, run your own prompts through it and read the latency and cost columns: the weights will be identical, and everything that matters to your users won't be. For where open-weight models fit at all, see benchmarking open-source vs closed-source; for squeezing latency once you've picked a fast host, see reducing latency with Groq.
Tests run 2026-05-26 via the Promptster /v1/prompts/compare API. Identical model (llama-3.3-70b) on Together (meta-llama/Llama-3.3-70B-Instruct-Turbo) and Groq (llama-3.3-70b-versatile). Latency/cost are per-call figures from the run. Fireworks (5xx on our account) and Cerebras (rate-limited) did not return usable data this run.