Observability Is Now Table Stakes: Monitoring Prompts in Production with Scheduled Tests
By Promptster Team · 2026-05-31
A year ago, "LLM observability" meant grepping your logs after a customer complained. In 2026 it's table stakes. The most-cited number from this year's tooling surveys: roughly 89% of teams running LLMs in production now do some form of observability — and that's ahead of the share doing formal evals. Teams figured out the cheap, obvious thing first: watch what's actually coming back from the model.
That's the right order. Evals tell you whether a prompt is good in a lab. Observability tells you whether it's still good at 3am on a Tuesday when the provider quietly shipped a new checkpoint. You need both, but if you only have time for one, watching production wins.
Observability vs Evals vs Monitoring
These words get used interchangeably. They aren't the same thing.
| Practice | Question it answers | When it runs | Trigger |
|---|---|---|---|
| Evals | Is this prompt good enough to ship? | Pre-merge / CI | Code change |
| Observability | What is production actually returning right now? | Continuously | Always-on |
| Monitoring (drift detection) | Did quality silently degrade since last week? | On a schedule | Cadence |
The trap most teams fall into: they build evals, gate their CI, and call it done. But CI only fires when you change something. The most dangerous LLM regressions come from changes you didn't make — a silent model refresh, a tightened safety classifier, a tokenizer bump. Your commit history is clean. The output is different. (We dug into exactly this failure mode in scheduled drift detection.)
The Three Signals Worth Watching
You don't need a 40-panel dashboard. You need three signals on your highest-stakes prompts:
- Quality — an eval score (relevance, accuracy, completeness, clarity) against a reference input set.
- Latency — p50 and p95 round-trip time. A provider degrading from 1.2s to 4s p95 is a real incident even if quality holds.
- Cost — tokens and dollars per call. A prompt that started emitting 3x the reasoning tokens after a model update is a budget leak.
If a signal moves outside its band, you want a notification — not a quarterly review.
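As a rough illustration of how the three signals could be reduced to one record per run, here is a minimal Python sketch. The `RunSummary` shape and the function names are assumptions made for this example, not part of Promptster; it only relies on the standard library.

```python
from dataclasses import dataclass
from statistics import mean, quantiles


@dataclass
class RunSummary:
    quality: float            # mean eval score across the reference inputs
    p50_ms: float             # median round-trip latency
    p95_ms: float             # tail round-trip latency
    cost_usd_per_call: float  # mean dollars per call


def latency_percentiles(samples_ms):
    """Return (p50, p95) for round-trip times in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


def summarize(scores, latencies_ms, costs_usd):
    """Collapse one run's per-input measurements into the three signals."""
    p50, p95 = latency_percentiles(latencies_ms)
    return RunSummary(mean(scores), p50, p95, mean(costs_usd))
```

Tracking p95 next to p50 matters because a tail regression can hide behind a healthy median.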
Setting Up Continuous Monitoring with Promptster
Promptster's Scheduled Tests run a saved test on a cron cadence, store every run, and fire SLA alerts when a threshold trips. Here's the end-to-end workflow.
Step 1 — Build a reference test
In the app, run your production prompt against a representative input set across the providers you actually serve. Save it. This saved test is your canary — keep it small (20–50 inputs) and stable. Don't edit it casually; a moving baseline is a useless baseline.
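One way to keep the baseline from moving unnoticed is to fingerprint the reference inputs and compare on every run. This is a sketch on your own side of the fence, assuming you keep a copy of the inputs as JSON-serializable data; `fingerprint` and `check_reference_set` are hypothetical helpers, not Promptster features.

```python
import hashlib
import json


def fingerprint(inputs):
    """Stable SHA-256 of the reference inputs, independent of dict key order."""
    canonical = json.dumps(
        inputs, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_reference_set(inputs, recorded_fingerprint, min_size=20, max_size=50):
    """Return a list of problems; an empty list means the canary is intact."""
    problems = []
    if not min_size <= len(inputs) <= max_size:
        problems.append(f"size {len(inputs)} outside {min_size}-{max_size}")
    if fingerprint(inputs) != recorded_fingerprint:
        problems.append("inputs changed since the baseline was recorded")
    return problems
```

Record the fingerprint when you save the test. If it changes later, treat the next run as a new baseline rather than comparing it with old scores.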
Step 2 — Schedule it
Create a schedule pointing at the saved test. Pick a cadence proportional to stakes:
0 9 * * * # daily 09:00 UTC — mission-critical prompts
0 9 * * MON # weekly Monday — important but not on-fire
0 */6 * * * # every 6h — anything customer-facing at scale
Via the API the same thing looks like:
curl -X POST https://www.promptster.dev/v1/schedules \
  -H "Authorization: Bearer pk_live_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "saved_test_id": "your-saved-test-id",
    "cron": "0 9 * * *",
    "alerts": {
      "min_quality_score": 0.82,
      "max_latency_ms": 4000,
      "notify": ["email", "webhook"]
    }
  }'
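If you would rather create the schedule from a script, the same request can be sketched in Python with only the standard library. The URL, headers, and field names are copied from the curl call above; the function name is ours, and the response format is not shown in this post, so the sketch stops at building the request.

```python
import json
import urllib.request

API_URL = "https://www.promptster.dev/v1/schedules"  # endpoint from the curl example


def build_schedule_request(api_key, saved_test_id, cron,
                           min_quality=0.82, max_latency_ms=4000,
                           notify=("email", "webhook")):
    """Build (but do not send) the POST request that creates a schedule."""
    payload = {
        "saved_test_id": saved_test_id,
        "cron": cron,
        "alerts": {
            "min_quality_score": min_quality,
            "max_latency_ms": max_latency_ms,
            "notify": list(notify),
        },
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sending is left to the caller, for example:
# req = build_schedule_request("pk_live_your_key", "your-saved-test-id", "0 9 * * *")
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.status)
```

Keeping request construction separate from sending makes the payload easy to check in a unit test without touching the network.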
Step 3 — Configure SLA alerts
Set the bands you actually care about. Reasonable starting points:
| Signal | Customer-facing | Internal tooling |
|---|---|---|
| Min quality score | 0.82 | 0.72 |
| Max p95 latency | 4000 ms | 10000 ms |
| Max cost / call delta | +25% vs baseline | +50% vs baseline |
Start lenient. A noisy alert that fires every other day gets muted, and a muted alert is worse than no alert.
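To make the bands concrete, here is a small Python sketch that checks one run against the starting points in the table. The `Bands` class and `breaches` function are illustrative names we made up; only the numbers come from the table, and the cost delta is computed against whatever baseline you recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bands:
    min_quality: float
    max_p95_ms: float
    max_cost_delta: float  # fraction over baseline: 0.25 means +25%


CUSTOMER_FACING = Bands(min_quality=0.82, max_p95_ms=4000, max_cost_delta=0.25)
INTERNAL_TOOLING = Bands(min_quality=0.72, max_p95_ms=10000, max_cost_delta=0.50)


def breaches(bands, quality, p95_ms, cost_per_call, baseline_cost_per_call):
    """Return the names of the signals that fall outside their band."""
    if baseline_cost_per_call <= 0:
        raise ValueError("baseline cost must be positive")
    out = []
    if quality < bands.min_quality:
        out.append("quality")
    if p95_ms > bands.max_p95_ms:
        out.append("latency")
    cost_delta = (cost_per_call - baseline_cost_per_call) / baseline_cost_per_call
    if cost_delta > bands.max_cost_delta:
        out.append("cost")
    return out
```

The same run can be fine for internal tooling and an incident for a customer-facing prompt, which is why the bands live in data rather than in the check itself.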
Step 4 — Wire alerts somewhere a human sees them
Email is fine to start. A webhook into Slack or PagerDuty is better — observability only works if the signal reaches someone who can act. Promptster posts the run delta in the payload, so your handler can route by severity.
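Routing by severity might look something like the Python sketch below. The payload schema is not documented in this post, so the `quality_delta` field, the cutoffs, and the destination names are all assumptions; swap in whatever the real payload carries.

```python
import json

# Hypothetical cutoffs: tune them to your own alert bands.
PAGE_DROP = 0.10    # quality drop that should wake someone up
NOTIFY_DROP = 0.03  # quality drop worth a channel message


def route(payload):
    """Pick a destination from the run delta (assumed field: quality_delta)."""
    drop = -float(payload.get("quality_delta", 0.0))  # negative delta = quality fell
    if drop >= PAGE_DROP:
        return "pagerduty"
    if drop >= NOTIFY_DROP:
        return "slack"
    return "log"


def handle_webhook(body):
    """Parse the raw webhook body and return where the alert should go."""
    return route(json.loads(body))
```

A handler like this keeps small wobbles out of the pager while still leaving a trail in the logs.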
What the Dashboard Actually Tells You
Every scheduled run stores its score, latency, and cost. The trend line is the product. Three patterns to recognize on sight:
quality
0.90 ┤●●●●●●●●●           ← stable: ignore
0.85 ┤         ●●●●●      ← slow leak: investigate this week
0.80 ┤              ●●●●
0.75 ┤                  ● ← cliff: page someone now
     └──────────────────────
week   1     2     3     4
A slow leak usually means a silent minor-version rollout, so decide whether to accept the new baseline or pin a version. A cliff usually means a safety-filter change or a major refresh. Investigate immediately; you'll often need to rewrite the prompt around a new refusal pattern.
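If you export the stored scores, the three patterns can be told apart with a few lines of Python. This is a sketch under assumed thresholds (`cliff_drop` and `leak_drop` are made-up defaults, and the baseline is simply the mean of the first few runs), not how the dashboard itself classifies anything.

```python
from statistics import mean


def classify_trend(scores, baseline_runs=5, cliff_drop=0.08, leak_drop=0.03):
    """Label a score history (oldest first) as 'stable', 'slow leak', or 'cliff'."""
    if len(scores) <= baseline_runs:
        raise ValueError("need more runs than the baseline window")
    baseline = mean(scores[:baseline_runs])
    previous, latest = scores[-2], scores[-1]
    # Cliff: a large drop between two consecutive runs.
    if previous - latest >= cliff_drop:
        return "cliff"
    # Slow leak: no single big step, but the latest run sits well below baseline.
    if baseline - latest >= leak_drop:
        return "slow leak"
    return "stable"
```

Checking the run-to-run step before the drift from baseline means a sudden collapse is never mislabeled as a gradual one.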
Don't Monitor Everything
The fastest way to make observability useless is to point it at every prompt you have. Focus on:
- High-volume prompts (>1,000 calls/day)
- High-stakes prompts (money, health, legal, anything user-facing)
- Prompts feeding downstream parsers (a format shift breaks them silently)
Skip internal debugging prompts and anything a human reviews anyway. Start with 3–5 canaries.
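The selection rules above are simple enough to encode. Here is a hedged Python sketch that assumes you keep an inventory of prompts with a few flags; `PromptInfo` and its fields are hypothetical, and the ranking (more matched criteria first, then volume) is one reasonable choice among several.

```python
from dataclasses import dataclass


@dataclass
class PromptInfo:
    name: str
    calls_per_day: int
    high_stakes: bool = False     # money, health, legal, user-facing
    feeds_parser: bool = False    # output is consumed by downstream code
    human_reviewed: bool = False  # a person checks every output anyway


def reasons(prompt, min_calls=1000):
    """List which of the three criteria a prompt meets."""
    out = []
    if prompt.calls_per_day > min_calls:
        out.append("high-volume")
    if prompt.high_stakes:
        out.append("high-stakes")
    if prompt.feeds_parser:
        out.append("feeds a parser")
    return out


def pick_canaries(prompts, limit=5):
    """Keep unreviewed prompts that meet at least one criterion, best first."""
    eligible = [p for p in prompts if not p.human_reviewed and reasons(p)]
    eligible.sort(key=lambda p: (len(reasons(p)), p.calls_per_day), reverse=True)
    return eligible[:limit]
```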
The Real Lesson
Observability led the 2026 adoption curve because it's the cheapest insurance in the stack: a 20-input reference set run daily across three providers is about 1,800 calls a month, which costs cents on small models, and it converts "a user told us the bot got worse" into "an alert told us before any user noticed." Evals prove a prompt is good once. Observability proves it's still good, and that's the part you don't control.
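To sanity-check the cost claim against your own setup, a back-of-the-envelope estimate in Python is enough. The token counts and per-million-token prices below are placeholder assumptions, not quotes from any provider, and the sketch ignores any extra calls spent on scoring the outputs.

```python
def monthly_canary_cost(inputs=20, providers=3, runs_per_day=1, days=30,
                        input_tokens=300, output_tokens=200,
                        usd_per_m_input=0.10, usd_per_m_output=0.40):
    """Return (calls per month, estimated USD per month) for one canary."""
    calls = inputs * providers * runs_per_day * days
    # Price of a single call from token counts and per-million-token rates.
    usd_per_call = (
        input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    ) / 1_000_000
    return calls, calls * usd_per_call
```

Plug in your real token counts and rates; the call count scales linearly with cadence, so moving from daily to every six hours multiplies the bill by four.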
Wire the canary first, then layer evals into CI. For the CI side, see "automating prompt testing for production" and "evals are the new unit tests."