90 Days of Prompt Engineering: The Cross-Cutting Lessons From ~90 Posts of Testing

By Promptster Team · 2026-06-24

At 30 days we published a first recap — speculative in places, grounded in a handful of ad-hoc tests. At 60 days we followed up with the patterns that held and the surprises that emerged. This is the 90-day mark.

Across roughly 90 published posts — benchmarks, tutorials, security red-teams, compliance maps, and a model wave that reshuffled the entire frontier in late April — a few cross-cutting lessons have compounded into convictions. Others have been complicated by new evidence. This post is the synthesis: what we'd now stake advice on, what the April–May model wave changed, and the questions we're carrying into the next 90 days.

What Compounded Into Conviction

Multi-provider testing is still the single highest-leverage habit. Every post since day one reinforces it. The strongest single statement of the case remains our post on why we stopped trusting single-provider benchmarks: the launch-post score predicts almost nothing about your production cost or quality. Ninety days of testing did not produce one counterexample.

Route by task shape, not brand. The task-type framework has held up across every domain we tested: coding, extraction, reasoning, creative, and now multimodal. Defaulting everything to the frontier tier overpays; defaulting everything to the nano tier ships bugs.
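
As a rough illustration of routing by task shape, here is a minimal sketch. The tier names, the task-to-tier mapping, and the `pick_tier` helper are all assumptions made for this example, not the framework from our posts; the mapping you actually ship should come from your own test results.

```python
# Hypothetical routing table: the tiers and the mapping are illustrative
# assumptions, not measured recommendations.
ROUTES = {
    "extraction": "budget",
    "coding": "frontier",
    "reasoning": "frontier",
    "creative": "mid",
    "multimodal": "frontier",
}

# Unknown task types fall back to the most capable tier (an assumption:
# overpaying is treated as safer than shipping bugs).
DEFAULT_TIER = "frontier"


def pick_tier(task_type: str, well_specified: bool = True) -> str:
    """Choose a model tier from the task shape, never from a brand name."""
    if not well_specified:
        # Loosely specified work is escalated regardless of task type.
        return "frontier"
    return ROUTES.get(task_type, DEFAULT_TIER)
```

Because the table is plain data, re-routing after a model wave means editing `ROUTES` and re-running your evals rather than touching application code.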

Evals are the new unit tests — and the gate is the point. We started saying "evals are the new unit tests" as a slogan. By 90 days it's operational: a golden set that blocks a deploy on regression (evals as a production gate) is the difference between catching a regression and apologizing for one.
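
A deploy gate of this kind can be sketched in a few lines. This is a minimal version under stated assumptions: scores are per-case floats between 0 and 1, `find_regressions` is a hypothetical helper name, and the 0.02 tolerance is an arbitrary placeholder rather than a value we tested.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,  # placeholder threshold, tune per golden set
) -> list[str]:
    """Return golden-set case IDs whose score dropped by more than tolerance."""
    regressions = []
    for case_id, base_score in baseline.items():
        # A case missing from the candidate run counts as a score of 0.0,
        # so a silently skipped case cannot pass the gate.
        cand_score = candidate.get(case_id, 0.0)
        if base_score - cand_score > tolerance:
            regressions.append(case_id)
    return sorted(regressions)
```

In CI, the gate is then one line: exit non-zero when `find_regressions(...)` returns a non-empty list, so the deploy step never runs.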

Never let a model judge itself. The self-preference bias finding from the judge bias audit generalized everywhere. Cross-provider judge panels are non-negotiable for any number that feeds a model-choice decision.
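
One way to enforce that rule mechanically is to drop any judge that shares a provider with the model being scored. The sketch below assumes judge scores arrive as a provider-to-score mapping; the `panel_score` name, the minimum panel size of two, and the use of the median are illustrative choices, not the method from the audit.

```python
from statistics import median


def panel_score(candidate_provider: str, judge_scores: dict[str, float]) -> float:
    """Aggregate judge scores, excluding judges from the candidate's provider."""
    eligible = [
        score
        for provider, score in judge_scores.items()
        if provider != candidate_provider
    ]
    # Assumption: fewer than two independent judges is not a panel.
    if len(eligible) < 2:
        raise ValueError("need at least two judges from other providers")
    # The median limits the pull of any single outlier judge.
    return median(eligible)
```

For example, `panel_score("provider_a", {"provider_a": 9.5, "provider_b": 7.0, "provider_c": 8.0})` ignores the self-score of 9.5 and returns 7.5.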

What the April–May Model Wave Changed

The single biggest event of the last 30 days was the frontier wave: Opus 4.6, GPT-5.2, Gemini 3.1 Pro, DeepSeek Reasoner, Qwen3 235B, most of them landing within weeks of each other. We tracked the fallout in the April model-wave drift report. Four things shifted:

Shift	Before the wave	After the wave
Frontier coding leader	Contested	Opus 4.6 a strong coding option; GPT-5.2 a close alternative
Hallucination floor	Mid-tier models fabricated confidently	GPT-5.2 pitched on reduced hallucination vs GPT-5
Budget frontier tier	Didn't really exist	DeepSeek Reasoner: frontier-class at budget pricing
Multimodal ceiling	Untested on this blog	Gemini 3.1 Pro a strong multimodal contender

We put the big three head-to-head in our GPT-5.2 vs Opus 4.6 vs Gemini 3.1 Pro comparison and isolated the budget disruptor in our post testing DeepSeek against the big models. The economic headline: the 2026 cost-per-quality frontier-tax analysis shows the premium for "the best model" buys less than it did a quarter ago, because the budget tier caught up faster than the frontier pulled away.

The Surprises That Held — and the One That Broke

From the 60-day post, the surprises that stayed surprising:

Shared training data produces shared errors. Consensus across models is not verification when the models share blind spots.
Open-weight injection resistance lags frontier resistance by more than the quality gap does. New red-teams on MCP tool descriptions extended this from prompts into tool metadata, so the attack surface grew rather than shrank.
Reasoning-model APIs bill invisible tokens. Still true, still trips up naive cost accounting.
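
The last point is easiest to see as arithmetic. The sketch below assumes hidden reasoning tokens are billed at the output-token rate, which you should confirm against your provider's pricing; the function name and the prices in the usage note are hypothetical.

```python
def request_cost(
    input_tokens: int,
    visible_output_tokens: int,
    reasoning_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Cost of one request in dollars, counting hidden reasoning tokens.

    Assumption: reasoning tokens are billed as output tokens.
    """
    billed_output = visible_output_tokens + reasoning_tokens
    total = input_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000
```

With made-up prices of $1 per million input tokens and $4 per million output tokens, a request with 1,000 input tokens, 500 visible output tokens, and 4,500 reasoning tokens costs about $0.021. Counting only the visible output gives about $0.003, a sevenfold underestimate.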

The assumption that broke: that tool-calling reliability tracks general quality. It doesn't. Our tool-calling reliability tests found models that top coding benchmarks can still mangle function-call schemas, and cheaper models can be more reliable at structured tool use. Capability and reliability are different axes.
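
Reliability on this axis is cheap to measure: parse each emitted call and check it against the declared schema. The validator below is a deliberately small sketch using only the standard library. It covers required keys, unexpected keys, and four primitive types, not the full JSON Schema specification, and `validate_call` is a hypothetical helper rather than part of any provider SDK.

```python
import json

# Subset of JSON Schema primitive types handled by this sketch.
TYPE_MAP = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
}


def validate_call(raw_arguments: str, schema: dict) -> list[str]:
    """Return a list of problems with a model-emitted tool call (empty if none)."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    errors = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument: {name}")
    for name, value in args.items():
        if name not in properties:
            errors.append(f"unexpected argument: {name}")
            continue
        type_name = properties[name].get("type")
        expected = TYPE_MAP.get(type_name)
        if expected is None:
            continue  # type not covered by this sketch
        # bool is a subclass of int in Python, so reject it explicitly
        # when the schema asks for a number or integer.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if is_stray_bool or not isinstance(value, expected):
            errors.append(f"wrong type for {name}: expected {type_name}")
    return errors
```

Running this over a few hundred calls per model gives a schema-failure rate you can track separately from any quality score.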

The Advice We'd Now Stake

After 90 days, the recommendations we're most confident in — unchanged in spirit from 60 days, sharpened by the wave:

Route by task, and re-route after every model wave. The right model in March was not the right model in May.
Budget tier is the pragmatic default for well-specified work — and DeepSeek Reasoner raised the quality ceiling of that default.
Cross-provider judge panels for any decision-grade number. Self-preference bias is real and consistent.
Test injection resistance — now including tool/MCP metadata, not just prompts.
Schedule drift detection on high-stakes prompts. A model wave is precisely when silent quality shifts happen.
Treat prompts as code: versioned, diffed, reviewed, eval-gated.
Generate compliance evidence as a byproduct, not a project. The same version diffs and request logs feed the EU AI Act, SOC 2, and HIPAA at once.
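
For the drift-detection item above, a scheduled check can be as simple as comparing the latest eval score with a baseline window. This is a minimal sketch: the `drifted` helper and the three-standard-deviation threshold are assumptions for illustration, and a production check would likely need a larger window and per-prompt tuning.

```python
from statistics import mean, stdev


def drifted(baseline_scores: list[float], latest_score: float, k: float = 3.0) -> bool:
    """Flag a score that falls more than k standard deviations below baseline.

    Requires at least two baseline scores; statistics.stdev raises
    StatisticsError otherwise.
    """
    baseline_mean = mean(baseline_scores)
    baseline_spread = stdev(baseline_scores)
    # Only drops are flagged; an unexpected improvement does not alert.
    return latest_score < baseline_mean - k * baseline_spread
```

Run it on a schedule against your high-stakes prompts, and alert when it returns True so a silent provider-side change becomes a visible one.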

Open Questions for the Next 90 Days

The questions we're carrying forward — some inherited from the 60-day post, some new:

Multimodal under load. We've started on document and image understanding, but field-accuracy at production volume across the big three is wide open.
Long-running agent quality over 20+ turns. Published benchmarks are still mostly single-turn; real agents degrade in ways we can't yet predict.
Cross-lingual evaluation. Almost everything we've tested is English. Providers diverge more off-English.
Migration cost as a first-class metric. We started measuring it in our post on migrating GPT-5 to GPT-5.2: the diffs are subtle and the regressions are real. How do you quantify "switching cost" before a wave forces your hand?
Does budget-frontier convergence hold? If DeepSeek Reasoner and peers keep closing the gap, the entire cost-per-quality calculus inverts. We'll know in 90 days.

The 90-Day Summary in Three Sentences

Cross-provider testing is still the highest-ROI habit for AI-backed products, and the April model wave only made the case stronger by scrambling every default. The gap that's closing fastest is between budget and frontier quality; the gaps that aren't closing are security (open-weight injection), reliability (tool-calling), and honest evaluation (self-judging models). The published launch-post score still tells you approximately nothing about your production workload — which is the same thing we said on day one, now with 90 posts of evidence behind it.

For the full arc, start at the 30-day recap, continue through the 60-day synthesis, and pull whichever benchmark or tutorial matches the problem in front of you. The tests themselves live in Promptster — re-run any of them on your own workload from the comparison view.

Synthesis of roughly 90 published posts and the test runs behind them, through June 2026. Forward-looking questions are exactly that — open. Re-run any cited test on your own workload before betting on its conclusion.