The most important file in any serious AI project is not the prompt. It is the eval set. Without one, you are guessing whether each change made the system better or worse. With one, you have something close to traditional engineering: a feedback loop that tells you the truth.
Why “looks fine” is not a test
Most early AI work is judged by vibes. Someone tries the prompt with a few inputs, the answers look reasonable, and it ships. Two weeks later a user reports something weird, the team tweaks the prompt, and now nobody knows whether the original cases still work. This is not a process; it is a slow regression machine.
Evals fix this by replacing vibes with a measurable score. You change the prompt, the model, or the tool definitions; you run the evals; you see exactly what improved and what broke.
What an eval set actually contains
A useful eval set has three layers.
Golden cases. A handful of carefully chosen inputs with known-good outputs. These should cover the happy path, the obvious edge cases, and the inputs that have already failed in real life. Twenty to fifty good cases beat a thousand sloppy ones.
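As a rough sketch, a golden set can be nothing more than structured data checked into version control next to the prompts, plus a loop that scores the system against it. The field names and the `system` callable below are illustrative, not a prescribed format:

```python
# A minimal golden set: plain data that lives in the repo beside the prompts.
# Field names are placeholders for whatever the project actually checks.
GOLDEN_CASES = [
    {
        "id": "refund-happy-path",
        "input": "I was charged twice for my March invoice.",
        "expected": {"intent": "billing_dispute", "escalate": False},
    },
    {
        "id": "empty-message",          # an obvious edge case
        "input": "",
        "expected": {"intent": "unknown", "escalate": True},
    },
    {
        "id": "prod-incident-2024-05",  # an input that already failed in real life
        "input": "Cancel my account but keep the data export running.",
        "expected": {"intent": "cancellation", "escalate": True},
    },
]

def run_golden_cases(system) -> float:
    """Run every golden case through the system under test and return the pass rate."""
    passed = 0
    for case in GOLDEN_CASES:
        output = system(case["input"])  # the prompt/model/tools under test
        if all(output.get(k) == v for k, v in case["expected"].items()):
            passed += 1
        else:
            print(f"FAIL {case['id']}: got {output}")
    return passed / len(GOLDEN_CASES)
```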
Rubrics. For open-ended outputs (drafts, summaries, plans) there is rarely a single correct answer. We define a rubric - “did the response cite the right doc?”, “did it preserve the customer’s tone?”, “did it avoid promising things outside policy?” - and use a model to grade against the rubric. Done well, this gets you 80% of the value of human review at a fraction of the cost.
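One way to implement this is to ask a judge model each rubric question separately and force a yes/no answer. In the sketch below, `call_model` stands in for whatever LLM client the project already uses; it is an assumed function, not a real library API:

```python
# A sketch of a rubric judge. `call_model(prompt) -> str` is assumed to exist.
RUBRIC = [
    "Does the response cite the correct source document?",
    "Does it preserve the customer's tone?",
    "Does it avoid promising anything outside policy?",
]

def grade_with_rubric(draft: str, context: str, call_model) -> dict:
    """Ask a judge model to answer each rubric question with YES or NO."""
    scores = {}
    for question in RUBRIC:
        prompt = (
            "You are grading an AI-written draft against a rubric.\n"
            f"Context:\n{context}\n\nDraft:\n{draft}\n\n"
            f"Question: {question}\nAnswer with exactly YES or NO."
        )
        answer = call_model(prompt).strip().upper()
        scores[question] = answer.startswith("YES")
    return scores
```

Splitting the rubric into separate yes/no questions keeps each judgement cheap to audit: when a case fails, you can see which criterion it failed, not just an overall grade.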
Trajectory checks. For agents, the final answer is not enough. We also check that the agent took a sensible path: which tools it called, in what order, with what arguments. A correct answer reached through three unnecessary tool calls is still a problem.
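A trajectory check can be as simple as asserting over the recorded tool calls. The sketch below assumes the agent framework can hand back a list of (tool name, arguments) pairs; the names and limits are illustrative:

```python
# A sketch of a trajectory check over an agent's recorded tool calls.
def check_trajectory(tool_calls, expected_tools, max_calls=5):
    """Verify the agent called the expected tools, in order, without excess calls."""
    names = [name for name, _args in tool_calls]
    problems = []
    if len(names) > max_calls:
        problems.append(f"too many tool calls: {len(names)} > {max_calls}")
    # Check that the expected tools appear in order (other calls may be interleaved).
    it = iter(names)
    if not all(tool in it for tool in expected_tools):
        problems.append(f"expected {expected_tools} in order, got {names}")
    return problems

# Example: the agent should look up the order before issuing a refund,
# and should not need more than three calls to do it.
issues = check_trajectory(
    tool_calls=[("lookup_order", {"id": 42}), ("issue_refund", {"id": 42})],
    expected_tools=["lookup_order", "issue_refund"],
    max_calls=3,
)
```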
Run them automatically, on every change
The point of evals is the loop, not the score. We run the suite on every prompt change, every model upgrade, and every tool addition. We track the pass rate over time, and we add new cases whenever something fails in production. Within a few months, the eval set becomes the most valuable asset in the project - more valuable than the prompts themselves.
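One way to make that loop automatic is to expose the golden set as ordinary tests, so every pull request runs the suite through the same CI gate as code. The module layout and names below are assumptions for the sketch, not a fixed structure:

```python
# A sketch of wiring the eval suite into CI with pytest.
# `evals.golden.GOLDEN_CASES` and `agent.run_agent` are assumed project modules.
import pytest

from evals.golden import GOLDEN_CASES
from agent import run_agent  # the system under test

@pytest.mark.parametrize("case", GOLDEN_CASES, ids=lambda c: c["id"])
def test_golden_case(case):
    output = run_agent(case["input"])
    for key, expected in case["expected"].items():
        assert output.get(key) == expected, f"{case['id']}: {key} mismatch"
```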
What evals do not catch
Evals are necessary but not sufficient. They will not catch latency regressions, cost blow-ups, or behavioural drift in long conversations. For those you need observability in production - logging, sampling, human review of a slice of real traffic. Evals tell you the system can do the job; production telemetry tells you it actually is.
How we help
When we ship an AI agent, the eval suite ships with it. We build the golden set and the rubric judge during the first iteration, wire them into CI so every change is measured, and hand over a process - not just a system - the team can run after we leave. If you are about to deploy an agent and “we tested it” means a few manual prompts in a notebook, we should talk before that goes live.