Open-Weight Inference Speed: The Same Llama Runs ~10× Faster on One Host

By Promptster Team · 2026-06-12

The pitch for open-weight models is portability: the same weights run anywhere. True — but "the same weights" does not mean "the same product." Latency, throughput, and price-per-token are set by the host, not the model, and they swing wildly. Picking Llama 3.3 70B is half the decision; picking where it runs is the other half, and it's the half most teams skip.

So we ran the identical model, llama-3.3-70b, on two hosts through Promptster's compare view and measured what actually changes when only the host changes.

The Test

Three everyday tasks (merge two sorted lists, a two-sentence summary, a strict-JSON extraction) at matched temperatures, same prompts, same model weights — only the host differs: Together AI vs Groq.

A note on scope: we wanted four hosts, not two. Fireworks AI returned 5xx errors on every call on our account (a provider/key issue, not a model-id one), and Cerebras's Qwen endpoint was rate-limited, so neither produced usable data this run. That's its own lesson (verify a host actually serves you before you design around it), and it's why this is a clean two-host comparison rather than the four-way we planned.
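A preflight check along these lines is one way to act on that lesson. This is a minimal sketch, not what we ran: the function names, the host labels, and the sample status codes are all hypothetical, and it assumes rate limiting shows up as HTTP 429, which is conventional but not something every provider guarantees.

```python
from collections import Counter


def classify(status: int) -> str:
    """Bucket an HTTP status code returned by a probe call."""
    if 200 <= status < 300:
        return "ok"
    if status == 429:  # assumption: the host signals rate limiting with 429
        return "rate_limited"
    if 500 <= status < 600:
        return "server_error"
    return "other_error"


def preflight(probes: dict[str, list[int]], min_ok_ratio: float = 1.0) -> dict[str, dict]:
    """Summarize probe results per host and flag which hosts are usable.

    `probes` maps a host label to the status codes seen on a few test calls.
    A host with no probe results is never marked usable.
    """
    report = {}
    for host, statuses in probes.items():
        counts = Counter(classify(s) for s in statuses)
        ok_ratio = counts["ok"] / len(statuses) if statuses else 0.0
        report[host] = {
            "ok_ratio": ok_ratio,
            "usable": bool(statuses) and ok_ratio >= min_ok_ratio,
            "counts": dict(counts),
        }
    return report


if __name__ == "__main__":
    # Hypothetical status codes for illustration only, not logs from our run.
    sample = {
        "host_a": [200, 200, 200],
        "host_b": [503, 500, 502],
        "host_c": [429, 429, 200],
    }
    for host, row in preflight(sample).items():
        print(host, row)
```

Running a handful of probe calls per host before building the comparison keeps a dead or throttled endpoint from shaping the test design.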

Results — same model, same prompts, only the host changes

| Task | Together (latency / cost / tok/s) | Groq (latency / cost / tok/s) | Groq speedup (latency) |
| --- | --- | --- | --- |
| Code (merge sorted lists) | 6,607 ms / $0.000326 / 48 tok/s | 576 ms / $0.000228 / 429 tok/s | ~11× |
| Summarize (2 sentences) | 3,607 ms / $0.000128 / 20 tok/s | 491 ms / $0.000098 / 141 tok/s | ~7× |
| JSON extract | 1,924 ms / $0.000064 / 8 tok/s | 191 ms / $0.000044 / 63 tok/s | ~10× |

The outputs were equivalent: it's the same model, so the merge function, the summary, and the JSON all came back correct and near-identical on both hosts. Everything that differed was operational. Groq was 7–11× faster and roughly 23–31% cheaper on every task, with 7–9× the throughput.
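The ratios can be recomputed directly from the per-call figures in the results table. The sketch below does only that arithmetic; the dictionary layout and the `compare` helper are our own illustrative names, and the numbers are copied from the table rather than fetched from any API.

```python
# Per-call figures from the results table: (latency_ms, cost_usd, tokens_per_s)
RESULTS = {
    "code":      {"together": (6607, 0.000326, 48), "groq": (576, 0.000228, 429)},
    "summarize": {"together": (3607, 0.000128, 20), "groq": (491, 0.000098, 141)},
    "json":      {"together": (1924, 0.000064, 8),  "groq": (191, 0.000044, 63)},
}


def compare(slow, fast):
    """Return (latency speedup, throughput ratio, fractional cost saving)."""
    slow_ms, slow_cost, slow_tps = slow
    fast_ms, fast_cost, fast_tps = fast
    return slow_ms / fast_ms, fast_tps / slow_tps, 1 - fast_cost / slow_cost


SUMMARY = {task: compare(h["together"], h["groq"]) for task, h in RESULTS.items()}

for task, (speedup, throughput, saving) in SUMMARY.items():
    print(f"{task:<10} latency {speedup:4.1f}x  throughput {throughput:4.1f}x  cost -{saving:.0%}")
```

Keeping the three ratios separate matters because they do not move together: latency speedup, throughput gain, and cost saving each answer a different question about the host.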

This Isn't a Groq Quirk — Together Was Slow on Big Models Generally

In a separate run for our Qwen 3 235B vs DeepSeek comparison, Together-hosted Qwen 3 235B took 38–52 seconds on some single-shot prompts — an order of magnitude slower than DeepSeek's own API on the same questions. Two different models, same pattern: Together's hosting of large open-weight models carried heavy latency in our runs. If you're serving interactive traffic, that's disqualifying regardless of how good the weights are.

The Real Lesson

Open weights commoditize the model; they do not commoditize the service. The same Llama 3.3 70B was a snappy 200–600 ms experience on one host and a multi-second one on another, at the same quality. Before you standardize on a host for an open-weight model, run your own prompts through it and read the latency and cost columns — the weights will be identical, and everything that matters to your users won't be. For where open-weight models fit at all, see benchmarking open-source vs closed-source; for squeezing latency once you've picked a fast host, see reducing latency with Groq.
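For running your own prompts, a small timing harness is enough to get comparable latency and throughput numbers per host. The sketch below is a hedged illustration, not the Promptster API: `measure` and its `call` parameter are hypothetical, `call` stands in for whatever client function sends a prompt to a host and returns the number of output tokens, and throughput is assumed to mean output tokens divided by total wall-clock time. Taking the median over several runs is our design choice here; the figures in this post are per-call.

```python
import statistics
import time
from typing import Callable


def measure(
    call: Callable[[str], int],
    prompt: str,
    runs: int = 5,
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time `call(prompt)` over several runs.

    `call` is a hypothetical client wrapper that must return the number of
    output tokens. `clock` is injectable so the harness can be tested offline.
    """
    latencies_ms = []
    tokens_per_s = []
    for _ in range(runs):
        start = clock()
        tokens = call(prompt)
        elapsed = clock() - start
        latencies_ms.append(elapsed * 1000)
        # Guard against a zero-length interval on very coarse clocks.
        tokens_per_s.append(tokens / elapsed if elapsed > 0 else float("inf"))
    return {
        "median_ms": statistics.median(latencies_ms),
        "median_tok_s": statistics.median(tokens_per_s),
    }


if __name__ == "__main__":
    def fake_host(prompt: str) -> int:
        # Stand-in for a real request: sleeps briefly, pretends 50 tokens came back.
        time.sleep(0.01)
        return 50

    print(measure(fake_host, "merge two sorted lists", runs=3))
```

Swap `fake_host` for one wrapper per host, keep the prompts and sampling settings identical, and the resulting rows line up the same way the results table does.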


Tests run 2026-05-26 via the Promptster /v1/prompts/compare API. Identical model (llama-3.3-70b) on Together (meta-llama/Llama-3.3-70B-Instruct-Turbo) and Groq (llama-3.3-70b-versatile). Latency/cost are per-call figures from the run. Fireworks (5xx on our account) and Cerebras (rate-limited) did not return usable data this run.