Evals are a product surface, not a test suite.

Here is a pattern we see constantly. A team builds an AI feature. It demos beautifully. It ships. And then, slowly, nobody can answer a simple question: is it getting better or worse? Someone tweaks a prompt on Tuesday, swaps a model on Thursday, and by the following week quality has drifted in a direction no one can name, let alone measure.

The reflex is to treat evaluation like testing — a gate you build once, run in CI, and forget. That framing is the problem. Evaluation isn’t a safety net under the product. For anything probabilistic, evaluation is the product surface. It’s the only place where “good” stops being an opinion and becomes a number you can move.

Why “it works” stops being true

A deterministic function either returns the right value or it doesn’t. A model samples from a distribution of plausible answers, and the boundary of “acceptable” lives in human judgment. That boundary moves as your inputs shift, as the model is updated upstream, and as your users discover edge cases you never imagined. Without a measurement that travels alongside the system, you have no way to know which way you’re drifting until a customer tells you.

An eval suite is the difference between knowing your system is good and hoping it still is.

What a real eval system looks like

The version that works is less glamorous than a leaderboard and far more useful. It has a few properties we insist on:

A graded golden set. A curated collection of representative examples, each with a known-good outcome, spanning the task families that actually matter to the business — including the awkward ones.
Mixed graders. Exact-match where the answer is crisp, model-graded where it’s fuzzy, and human spot-checks sampled continuously to keep the automatic graders honest.
Wired into CI. Every change — prompt, model, retrieval, parameters — runs against the golden set with a diff against the last known-good baseline. A regression blocks the merge, the same as a failing unit test.
Owned, not abandoned. The set grows as the product does. Every production incident becomes a new graded example, so a repeat of the same failure is caught before it ships again.
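To make those properties concrete, here is a minimal sketch of a golden set with mixed graders, scored per task family. Everything named in it is a hypothetical stand-in: `GoldenExample`, `keyword_rubric`, `toy_system`, and the three examples are invented for illustration, and `keyword_rubric` is only a placeholder for where a model-graded check would go.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

# A grader takes (model_output, expected) and returns a score in [0, 1].
Grader = Callable[[str, str], float]


def exact_match(output: str, expected: str) -> float:
    """Crisp grader: 1.0 only when the normalized strings are identical."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def keyword_rubric(output: str, expected: str) -> float:
    """Placeholder for a fuzzy grader: fraction of expected keywords present.

    Assumption: a real pipeline would call a grading model here instead.
    """
    keywords = [k.strip().lower() for k in expected.split(",") if k.strip()]
    if not keywords:
        return 0.0
    hits = sum(1 for k in keywords if k in output.lower())
    return hits / len(keywords)


@dataclass(frozen=True)
class GoldenExample:
    example_id: str
    family: str      # task family, e.g. "extraction"
    prompt: str
    expected: str    # known-good outcome, or rubric keywords
    grader: Grader   # which grader is appropriate for this example


def run_suite(examples: List[GoldenExample],
              system: Callable[[str], str]) -> Dict[str, float]:
    """Run every example through the system and return the mean score per family."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for ex in examples:
        score = ex.grader(system(ex.prompt), ex.expected)
        totals[ex.family] = totals.get(ex.family, 0.0) + score
        counts[ex.family] = counts.get(ex.family, 0) + 1
    return {family: totals[family] / counts[family] for family in totals}


def sample_for_human_review(examples: List[GoldenExample],
                            k: int, seed: int = 0) -> List[GoldenExample]:
    """Pick a reproducible random subset for a human spot-check."""
    return random.Random(seed).sample(examples, min(k, len(examples)))


# Hypothetical golden set; a real one would be curated and far larger.
GOLDEN_SET = [
    GoldenExample("ext-001", "extraction", "Invoice total for order 17?",
                  "42.50", exact_match),
    GoldenExample("ext-002", "extraction", "Currency of order 17?",
                  "EUR", exact_match),
    GoldenExample("sum-001", "summarization", "Summarize the refund policy.",
                  "30 days, receipt", keyword_rubric),
]


def toy_system(prompt: str) -> str:
    """Stand-in for the AI feature under test: returns canned answers."""
    canned = {
        "Invoice total for order 17?": "42.50",
        "Currency of order 17?": "eur ",
        "Summarize the refund policy.": "Refunds are accepted within 30 days.",
    }
    return canned.get(prompt, "")


if __name__ == "__main__":
    print(run_suite(GOLDEN_SET, toy_system))
```

In this shape, turning a production incident into a graded example is one more `GoldenExample` appended to the set, and the per-family scores are what a CI job would compare against the last known-good baseline.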

The day it pays for itself

On one engagement, an eval harness caught a silent regression two days before a launch — a model update upstream had quietly degraded one task family while leaving the others untouched. No human would have noticed it in a demo. The golden-set diff noticed it in three minutes. That single catch paid for the entire harness, several times over.
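A per-family diff is what makes this kind of catch possible, because an aggregate score can stay nearly flat while one family falls. The sketch below illustrates the idea under stated assumptions: the family names, scores, and the `tolerance` threshold are invented for illustration and are not figures from the engagement described above.

```python
from typing import Dict, List


def mean(scores: Dict[str, float]) -> float:
    """Unweighted mean across task families."""
    return sum(scores.values()) / len(scores)


def find_regressions(baseline: Dict[str, float],
                     current: Dict[str, float],
                     tolerance: float = 0.02) -> List[str]:
    """Return families whose score fell more than `tolerance` below baseline.

    A family present in the baseline but missing from the current run is
    treated as a regression, since it can no longer be vouched for.
    """
    regressed = []
    for family, base_score in sorted(baseline.items()):
        cur_score = current.get(family)
        if cur_score is None or base_score - cur_score > tolerance:
            regressed.append(family)
    return regressed


def gate(baseline: Dict[str, float], current: Dict[str, float],
         tolerance: float = 0.02) -> int:
    """Return a process exit code: 0 if clean, 1 if any family regressed."""
    return 1 if find_regressions(baseline, current, tolerance) else 0


# Hypothetical scores: three families improve slightly, one degrades.
BASELINE = {"extraction": 0.96, "summarization": 0.91, "routing": 0.98, "qa": 0.93}
CURRENT = {"extraction": 0.97, "summarization": 0.92, "routing": 0.99, "qa": 0.84}

if __name__ == "__main__":
    print(f"aggregate: {mean(BASELINE):.3f} -> {mean(CURRENT):.3f}")
    for family in find_regressions(BASELINE, CURRENT):
        print(f"REGRESSION {family}: {BASELINE[family]:.2f} -> {CURRENT[family]:.2f}")
    print("exit code:", gate(BASELINE, CURRENT))
```

With these made-up numbers the aggregate moves by less than the tolerance, so a single headline score would pass, while the per-family check flags `qa`. A CI step could pass the return value of `gate` to `sys.exit` so that the regression blocks the merge.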

Cost note: a good eval pipeline adds minutes to CI, not hours. The expensive thing is not running evals. The expensive thing is finding out in production that you should have been.

Build the gauge before you build the engine

The teams that ship trustworthy AI aren’t the ones with the cleverest prompts. They’re the ones who decided, early, that quality would be measured rather than felt. They built the gauge before they built the engine — so that every subsequent decision had a number attached.

If you’re shipping a model into anything that matters, the first question isn’t “which model.” It’s “how will we know it’s working tomorrow.” Answer that, and everything downstream gets easier. Skip it, and you’re not shipping a product. You’re shipping a vibe with a latency budget.
