Qwen3 235B vs DeepSeek V4 Pro: The Open-Weight Frontier War, 2026 Edition

By Promptster Team · 2026-06-14

The interesting fight in 2026 isn't open vs closed anymore — it's open vs open. The lineup just changed: DeepSeek V3.1 is retired, and DeepSeek V4 Pro is the new flagship on the official DeepSeek API. Qwen3 235B is still the heavyweight on Together. Both open-weight, both frontier-class, both cheap enough to run on volume.

We already put V4 Pro and V4 Flash through a hard debugging task against Opus 4.7. This post turns the lens sideways: the two leading open-weight options against each other, accessed the canonical way — Qwen3 235B via Together, DeepSeek V4 Pro via its own API — with quality, cost, and latency on the same axes.

Why This Matters Now

For years the open-weight story was "good enough, much cheaper." That's no longer the pitch. Both models are credible on hard tasks — we made the broader case in benchmarking open-source vs closed-source in 2026. So this isn't a charity case for the budget option; it's a genuine frontier matchup that happens to cost a fraction of the closed alternatives.

The Battery

Four task types, scored independently, because the winner can flip depending on what you ask:

| Task type | Probe |
| --- | --- |
| Code | Binary search with docstring, executed |
| Reasoning / math | Discount word problem with visible working |
| Structured | Strict JSON extraction |
| Multilingual | Japanese round-trip translation |

Results

| Task | Qwen3 235B (Together) | DeepSeek V4 Pro (official API) | Quality |
| --- | --- | --- | --- |
| Code: binary search | ✅ correct + docstring · 5.4s · $0.000098 | ✅ correct + docstring · 7.3s · $0.001589 | tie |
| Math: original price | ✅ "$50", shown working · 1.5s · $0.000027 | ✅ "$50", shown working · 3.6s · $0.000745 | tie |
| Structured: strict JSON | ✅ {"name":"Marwa","age":31,"city":"Tunis"} · 0.7s · $0.000017 | ✅ same JSON (extra spaces) · 2.6s · $0.000432 | tie |
| Multilingual: JA round-trip | ✅ faithful JA + clean English back · 1.4s · $0.000028 | ✅ faithful JA + clean English back · 8.0s · $0.001608 | tie |
| Totals | $0.000170 · 9.0s wall | $0.004374 · 21.5s wall | tie |
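The totals row is just the column sums, which makes it easy to re-derive. Here is a minimal sketch that recomputes it from the per-task figures in the table; the dictionary layout and the `totals` helper are our own illustrative names, and "wall" is assumed to mean the four tasks run one after another.

```python
# Per-task (latency in seconds, cost in USD), copied from the results table.
RESULTS = {
    "Qwen3 235B (Together)": [
        (5.4, 0.000098), (1.5, 0.000027), (0.7, 0.000017), (1.4, 0.000028),
    ],
    "DeepSeek V4 Pro (official API)": [
        (7.3, 0.001589), (3.6, 0.000745), (2.6, 0.000432), (8.0, 0.001608),
    ],
}


def totals(rows):
    """Sum latency and cost, rounded to the precision the table uses."""
    wall_s = round(sum(latency for latency, _ in rows), 1)
    cost_usd = round(sum(cost for _, cost in rows), 6)
    return wall_s, cost_usd


for name, rows in RESULTS.items():
    wall_s, cost_usd = totals(rows)
    print(f"{name}: {wall_s:.1f}s wall, ${cost_usd:.6f}")
```

Swap in your own measurements and the same helper gives you a totals row you can compare against ours.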

Per-task winner on quality: a clean tie. Both models were correct on all four tasks. Binary search came back with a proper docstring from both (Qwen wrapped its answer in a code fence; DeepSeek made the final return -1 explicit). The discount math landed on $50 with one-line working on both, the JSON was schema-clean and content-identical (DeepSeek used slightly more whitespace), and the Japanese round-trip came back faithful on both. On the thing you'd actually choose a model for, these two open-weight systems are interchangeable across this battery.
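"Content-identical" is worth pinning down, because whitespace differences should not cost a model the point. We validated outputs by hand, but a check along these lines captures the same rule: parse the output and compare it to the expected object. This is a sketch, not our grader; `passes_strict_json` is a hypothetical helper and the spaced string is an illustration rather than DeepSeek's literal output.

```python
import json

# Expected extraction, taken from the results table.
EXPECTED = {"name": "Marwa", "age": 31, "city": "Tunis"}


def passes_strict_json(raw: str) -> bool:
    """True only if raw parses as JSON and matches the expected content exactly."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Prose, code fences, or trailing commentary all fail a strict check.
        return False
    return parsed == EXPECTED


compact = '{"name":"Marwa","age":31,"city":"Tunis"}'
spaced = '{"name": "Marwa", "age": 31, "city": "Tunis"}'  # illustrative spacing only

print(passes_strict_json(compact))
print(passes_strict_json(spaced))
```

Comparing parsed objects rather than raw strings is the design choice that makes extra spaces irrelevant while still rejecting anything wrapped around the JSON.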

What Flipped Since Last Time

When we ran this matchup in March, Together was the slow host: Qwen3 235B took 38–52 seconds on short tasks while DeepSeek V3.1's API answered in 1–2 seconds. That story has reversed.

In this run, Qwen3 on Together was faster than DeepSeek V4 Pro on every single task, by anywhere from 1.4× on the code task (5.4s vs 7.3s) to 5.7× on the multilingual round-trip (1.4s vs 8.0s). The math problem came back in 1.5 seconds on Qwen vs 3.6 seconds on DeepSeek, and the JSON extraction in 0.7s vs 2.6s. Together appears to have sorted out whatever was slow before, and V4 Pro, being the reasoning tier, burns visible time even on prompts that don't need it.
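The per-task gap follows directly from the latencies in the results table. A small sketch of that arithmetic, with `slowdown` as an illustrative helper name of our own:

```python
# (Qwen3 on Together, DeepSeek V4 Pro) latency in seconds, from the results table.
LATENCY_S = {
    "code": (5.4, 7.3),
    "math": (1.5, 3.6),
    "structured": (0.7, 2.6),
    "multilingual": (1.4, 8.0),
}


def slowdown(qwen_s: float, deepseek_s: float) -> float:
    """How many times longer DeepSeek took than Qwen, to one decimal place."""
    return round(deepseek_s / qwen_s, 1)


for task, (qwen_s, deepseek_s) in LATENCY_S.items():
    print(f"{task}: {slowdown(qwen_s, deepseek_s)}x")

# Whole-battery ratio, from the summed latencies.
total_qwen = sum(q for q, _ in LATENCY_S.values())
total_deepseek = sum(d for _, d in LATENCY_S.values())
print(f"battery: {slowdown(total_qwen, total_deepseek)}x")
```

The spread runs from 1.4x on code to 5.7x on the multilingual round-trip, with the battery as a whole at 2.4x.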

The cost picture flipped too. V4 Pro's reasoning-tier pricing ($1.74 / $3.48 per million input/output tokens) is materially higher than V4 Flash's, and on this battery DeepSeek's bill was 25.7× Qwen's ($0.004374 vs $0.000170). Both totals are still fractions of a cent, but it's worth naming: the new DeepSeek flagship is no longer the cheap option in this matchup.
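For anyone pricing their own workload, per-request cost is a linear function of token counts. The sketch below uses the V4 Pro rates quoted above; the token counts in the example call are hypothetical (we did not publish per-task token usage), and `request_cost` is an illustrative helper, not part of any SDK.

```python
# DeepSeek V4 Pro rates quoted in this post, USD per million tokens.
PRICE_IN_PER_M = 1.74
PRICE_OUT_PER_M = 3.48


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request at the rates above."""
    return (input_tokens * PRICE_IN_PER_M + output_tokens * PRICE_OUT_PER_M) / 1_000_000


# Hypothetical token counts, for illustration only.
print(f"${request_cost(input_tokens=120, output_tokens=400):.6f}")

# Battery-level ratio, from the totals in the results table.
ratio = 0.004374 / 0.000170
print(f"{ratio:.1f}x")  # 25.7x
```

Output tokens cost twice as much as input tokens at these rates, so a reasoning-tier model that writes more before answering pays for it on every call.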

The lesson from open-weight inference speed by host still holds — host choice swamps model choice on latency. It just swung the other direction this quarter.

The Cost Axis Isn't a Tiebreaker — It's the Point

With closed frontier models you pick the best and eat the cost. With these two, quality is close enough and cost is low enough that the decision is genuinely live, so it comes down to what you're optimizing for.

A model that ties on quality but takes about 2.4× longer across the battery (21.5s vs 9.0s wall), and up to 5.7× longer on a single task, is fine for batch and bad for interactive. The model picks itself once you know which loop you're optimizing.
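One way to make "the model picks itself" concrete is a routing rule: among models that tie on quality, take the cheapest one whose measured latency fits the loop's budget. The sketch below is an assumption about how you might encode that, not how Promptster routes; the two budgets are hypothetical thresholds, and the measurements are the battery totals from the results table.

```python
from typing import Optional

# Mean per-task latency (s) and total battery cost (USD), from the results table.
MEASURED = {
    "Qwen3 235B (Together)": {"mean_latency_s": 9.0 / 4, "battery_cost": 0.000170},
    "DeepSeek V4 Pro (official API)": {"mean_latency_s": 21.5 / 4, "battery_cost": 0.004374},
}


def pick_model(latency_budget_s: float) -> Optional[str]:
    """Cheapest model whose mean latency fits the budget, or None if none fits."""
    fits = [
        name for name, m in MEASURED.items()
        if m["mean_latency_s"] <= latency_budget_s
    ]
    if not fits:
        return None
    return min(fits, key=lambda name: MEASURED[name]["battery_cost"])


INTERACTIVE_BUDGET_S = 3.0  # hypothetical threshold
BATCH_BUDGET_S = 60.0       # hypothetical threshold

print(pick_model(INTERACTIVE_BUDGET_S))
print(pick_model(BATCH_BUDGET_S))
```

With this run's numbers the rule returns the same model for both budgets, because one option was both faster and cheaper; the rule earns its keep when your own measurements pull those two axes apart.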

Reproduce It

curl -X POST https://www.promptster.dev/v1/prompts/compare \
  -H "Authorization: Bearer $PROMPTSTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "<one task from the battery>",
    "configurations": [
      {"provider": "together",  "model": "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8"},
      {"provider": "deepseek",  "model": "deepseek-v4-pro"}
    ],
    "temperature": 0.2
  }'

Run each task as its own batch, score it, and execute the outputs. The leaderboard is downstream of your own data, not a vendor's slide.
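If you would rather script the four runs than paste curl four times, the request body is easy to build in a loop. The sketch below mirrors the curl command above using only the standard library. The prompt strings are hypothetical stand-ins for our battery (the exact wording is not reproduced in this post), and the code stops at building the request because the response schema is not documented here.

```python
import json
import urllib.request

API_URL = "https://www.promptster.dev/v1/prompts/compare"

CONFIGURATIONS = [
    {"provider": "together", "model": "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8"},
    {"provider": "deepseek", "model": "deepseek-v4-pro"},
]

# Hypothetical prompts, one per task type; substitute your own.
TASKS = {
    "code": "Write a Python binary search function with a docstring.",
    "math": "A jacket costs $40 after a 20% discount. What was the original price? Show your working.",
    "structured": "Return strict JSON with name, age, city for: Marwa is 31 and lives in Tunis.",
    "multilingual": "Translate a short paragraph into Japanese, then back into English.",
}


def build_request(prompt: str, api_key: str, temperature: float = 0.2) -> urllib.request.Request:
    """Build (but do not send) one compare request for a single task."""
    body = {
        "prompt": prompt,
        "configurations": CONFIGURATIONS,
        "temperature": temperature,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


requests_by_task = {
    name: build_request(prompt, api_key="YOUR_KEY_HERE")  # placeholder key
    for name, prompt in TASKS.items()
}
```

Pass each request to `urllib.request.urlopen` to send it, then score and execute what comes back.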

The Real Lesson

The open-weight frontier war is good news no matter who wins, because the prize is yours: two frontier-class models, neither locked to a single host, both cheap enough to route freely. On quality they tied across our battery, so the decision moves to the axes that actually differ in production: host latency and cost-per-passing-response. This quarter Together flipped from slow to fast and DeepSeek's flagship moved up to the reasoning tier; that's exactly why you re-run the battery instead of trusting last quarter's verdict. Measure both, and route on the result. For the cost framing, see the 2026 cost-per-quality breakdown; for the host effect, see open-weight inference speed by host.
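Cost-per-passing-response can be computed several ways. The sketch below assumes one reasonable definition: total spend on all attempts, failures included, divided by the number of responses that passed. The helper name is ours, and the inputs are the per-task costs and pass marks from the results table.

```python
def cost_per_passing_response(costs, passed):
    """Total spend (including failed attempts) divided by the count of passes."""
    passes = sum(1 for ok in passed if ok)
    if passes == 0:
        return float("inf")  # nothing passed, so no spend bought a usable answer
    return sum(costs) / passes


# Per-task costs in USD, from the results table; every task passed for both models.
qwen_costs = [0.000098, 0.000027, 0.000017, 0.000028]
deepseek_costs = [0.001589, 0.000745, 0.000432, 0.001608]
all_passed = [True, True, True, True]

print(f"Qwen3 235B: ${cost_per_passing_response(qwen_costs, all_passed):.7f}")
print(f"DeepSeek V4 Pro: ${cost_per_passing_response(deepseek_costs, all_passed):.7f}")
```

When every response passes, this is just the mean cost per task; the metric starts to diverge from raw price as soon as one model fails more often than the other.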


Tests run 2026-05-30 via the Promptster /v1/prompts/compare API. Qwen3 235B via Together (Qwen/Qwen3-235B-A22B-Instruct-2507-FP8), DeepSeek V4 Pro via its official API (deepseek-v4-pro). Temperatures 0.0–0.2; outputs executed/validated by hand. Costs computed from the May 2026 pricing.ts.