An Operator's Framework For Evaluating LLM and Emerging-AI Tools
Most operator-facing AI evaluations are vibes. A working framework forces you to test against your own data, your own latency budget, and your own cost ceiling.
I get pitched five new AI tools a week. So does every operator I know. The vendor decks are gorgeous, the demos are crisp, the leaderboards are flattering, and somewhere in the second meeting the sales engineer mentions that yes, of course, you would want to fine-tune on your data for the best results.
Two months later you have a Slack channel full of half-shipped pilots, three contracts you cannot get out of, and a real question: what actually works for your workload, on your data, at your latency and cost ceiling. The tools that won the demo rarely win the pilot. The tools that win the pilot rarely win production. The gap is real and it is structural.
Here is the framework I use, and that I make every operator I work with use. It takes about a week to run and saves about six months of pilot drift.
The wrong way
Three patterns kill more AI evaluations than any other.
The vendor demo. The vendor controls the data, the prompts, the questions, and the camera angle. The demo is the best the product has ever performed. Anything you decide on the basis of a vendor demo is a decision you made on someone else's evidence.
The public leaderboard. MMLU, HumanEval, MT-Bench, and the rest are useful for model researchers and meaningless for operators. Public benchmarks measure performance on data that has likely leaked into training sets. They measure capability on tasks that are nothing like yours. A model that scores ten points higher on a public leaderboard can perform identically or worse on your workload.
The vendor pitch with a tailored example. The vendor sends a "custom" demo built on data they sampled from your industry. It looks great. It tells you nothing about your tail cases, your edge cases, or your idiosyncratic schema. It is a marketing artifact dressed as evidence.
The right way is to evaluate every tool on the same fixed harness, using your data, against your real production constraints. The harness is more important than the tool. Build it once and it pays back forever.
Step 1: Define the failure modes you actually care about
Before you write any test, write down what failure looks like. Not in the abstract — concretely, in your domain.
For a customer support workflow, failure might be: hallucinated policy details that contradict the actual policy; tone mismatch with the brand; failure to escalate when the customer indicates frustration; missed factual claims in the customer's message. Each of those is a different failure type with a different cost.
Rank them. Some failures are catastrophic. Some are annoying. The eval has to weight them accordingly. A model that has fewer total errors but more catastrophic ones can be worse than a model with more total errors but only minor ones. Most leaderboards average everything; you cannot.
If you cannot list at least five specific failure modes for the workflow you are evaluating, you do not understand the workflow well enough to evaluate any tool against it. Stop and figure that out first.
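If it helps to make the ranking concrete, the failure modes can live as data the harness weights directly, so a catastrophic failure costs more than a cosmetic one in every score you report. A minimal sketch in Python; the names and weights below are illustrative placeholders, not recommendations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FailureMode:
    name: str          # short identifier used when labeling eval outputs
    description: str   # what the failure looks like in your domain
    weight: float      # relative cost; catastrophic failures dominate the score

# Illustrative taxonomy for a support workflow -- replace with your own.
FAILURE_MODES = [
    FailureMode("hallucinated_policy", "States policy details that contradict the actual policy", 10.0),
    FailureMode("missed_escalation",   "Fails to escalate a visibly frustrated customer",          8.0),
    FailureMode("missed_claim",        "Ignores a factual claim in the customer's message",        3.0),
    FailureMode("tone_mismatch",       "Reply is off-brand in tone",                               1.0),
]

def weighted_error_score(counts: dict[str, int]) -> float:
    """Sum of failure counts weighted by their cost. Lower is better."""
    weights = {m.name: m.weight for m in FAILURE_MODES}
    return sum(weights.get(name, 0.0) * n for name, n in counts.items())
```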
Step 2: Build a tiny golden eval set on YOUR data
The most underrated artifact in all of AI engineering. A golden set is a small, curated collection of real examples drawn from your production traffic, hand-labeled with the correct or expected output by people who actually understand the domain.
How small? Smaller than you think. A golden set of 100 to 300 examples is usually enough to discriminate between candidate tools meaningfully. The marginal value drops off fast above 500. The constraint is quality, not size. One hundred examples that genuinely cover your failure modes are worth ten thousand examples that don't.
How to build it. Sample real production data. Stratify the sample so it covers your major use cases and your known edge cases. Have a domain expert label each example with the expected output and a notes field for what would be acceptable variation. Lock the set. Version it. Treat it like a regression test suite, because that is what it is.
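Locking the set does not need tooling beyond a versioned file and a checksum that screams when someone edits it. A minimal sketch, assuming a golden_set.jsonl with one labeled example per line; the file names and fields are illustrative.

```python
import hashlib
import json
from pathlib import Path

GOLDEN_PATH = Path("golden_set.jsonl")   # one labeled example per line
LOCK_PATH = Path("golden_set.sha256")    # checksum written when the set is locked

# Each line looks roughly like:
# {"id": "ex-0042", "input": "...", "expected": "...", "notes": "acceptable variation", "stratum": "refund_edge_case"}

def load_golden_set() -> list[dict]:
    return [json.loads(line) for line in GOLDEN_PATH.read_text().splitlines() if line.strip()]

def lock_golden_set() -> str:
    """Record a checksum of the current file; commit both files together."""
    digest = hashlib.sha256(GOLDEN_PATH.read_bytes()).hexdigest()
    LOCK_PATH.write_text(digest + "\n")
    return digest

def verify_golden_set() -> None:
    """Fail loudly if the set changed since it was locked."""
    digest = hashlib.sha256(GOLDEN_PATH.read_bytes()).hexdigest()
    if digest != LOCK_PATH.read_text().strip():
        raise RuntimeError("golden_set.jsonl changed since it was locked; re-label and re-version deliberately")
```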
Run every candidate tool against the same locked golden set. Score the outputs against the labels using a combination of automated metrics and a human review for the things automated metrics cannot catch. The scores from the golden set are the only number you should trust about a tool's performance on your workload.
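The harness itself can stay small: run each candidate over the locked set, apply whatever automated metric fits the task, and queue a stratified slice for human review. A sketch, assuming each candidate is wrapped in a simple generate(prompt) callable; the contains-check metric is a stand-in for whatever real metric your task needs.

```python
import random

def exact_or_contains(output: str, expected: str) -> bool:
    # Stand-in metric; swap in whatever automated check fits the task.
    return expected.strip().lower() in output.strip().lower()

def run_candidate(name: str, generate, examples: list[dict], review_fraction: float = 0.2) -> dict:
    results = []
    for ex in examples:
        output = generate(ex["input"])
        results.append({
            "id": ex["id"],
            "output": output,
            "auto_pass": exact_or_contains(output, ex["expected"]),
            "stratum": ex.get("stratum", "default"),
        })

    # Stratified slice for human review: sample within each stratum.
    by_stratum: dict[str, list[dict]] = {}
    for r in results:
        by_stratum.setdefault(r["stratum"], []).append(r)
    review_queue = [r for rs in by_stratum.values()
                    for r in random.sample(rs, max(1, int(len(rs) * review_fraction)))]

    auto_score = sum(r["auto_pass"] for r in results) / len(results)
    return {"candidate": name, "auto_score": auto_score, "review_queue": review_queue}
```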
Step 3: Test against latency and cost ceilings
Accuracy is necessary but not sufficient. A tool that produces beautiful output at 4 seconds per call is useless for an interactive product. A tool that costs $0.40 per call is useless for any high-volume workflow.
Define your ceilings before you start. For interactive use, p95 latency under 800ms. For batch use, throughput per dollar. Cost per task at expected volume, projected over twelve months. Then run the candidates under realistic load — not single-call benchmarks, but concurrent calls at your expected production concurrency. Many tools that look fast in isolation collapse under concurrency.
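A concurrency test does not need infrastructure either. A sketch using asyncio; call_candidate is an assumed wrapper around whatever client you are testing, and the ceiling and concurrency numbers are examples, not recommendations.

```python
import asyncio
import time

P95_CEILING_MS = 800   # example interactive ceiling
CONCURRENCY = 32       # set to your expected production concurrency

async def call_candidate(prompt: str) -> str:
    # Assumed wrapper around the tool under test; replace with a real client call.
    await asyncio.sleep(0.1)
    return "stub"

async def load_test(prompts: list[str]) -> float:
    sem = asyncio.Semaphore(CONCURRENCY)
    latencies_ms: list[float] = []

    async def one(prompt: str) -> None:
        async with sem:
            start = time.perf_counter()
            await call_candidate(prompt)
            latencies_ms.append((time.perf_counter() - start) * 1000)

    await asyncio.gather(*(one(p) for p in prompts))
    latencies_ms.sort()
    p95 = latencies_ms[int(0.95 * (len(latencies_ms) - 1))]
    verdict = "PASS" if p95 <= P95_CEILING_MS else "FAIL"
    print(f"p95={p95:.0f}ms  ceiling={P95_CEILING_MS}ms  {verdict}")
    return p95

# asyncio.run(load_test(prompts))  # run against real prompts sampled from production
```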
I have seen pilots that picked the most accurate tool, deployed it, and discovered three weeks in that the per-call cost made the unit economics untenable. The accuracy advantage was real. It was also irrelevant given the cost. The decision should have been made up front, against the ceiling, with both numbers on the table.
Step 4: Test the deployment path
Most evaluations stop at "it produces good output." Production has more questions.
Does the vendor expose the streaming API your product needs, or only batch? Are there rate limits that will bite you at peak? What is the actual SLA? Read it; do not take the salesperson's word. Where does data go, and is it retained or used for training? Can you bring your own keys for downstream services? Does the API support function calling, structured outputs, JSON mode, the specific features you depend on? Is there a way to roll back to a previous version of the model?
I killed a deal once because the vendor's "real-time" streaming API had a 1.2 second time-to-first-token at p95, and our product spec required 350ms. The accuracy was excellent. The deployment path did not work. We caught it in a load test before signing.
Build a small integration with each finalist before signing anything beyond a pilot agreement. Even a four-hour spike will surface integration issues that the sales call hid.
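Time-to-first-token is one of the things a short spike can measure directly: start a timer, open the stream, record the gap to the first chunk. A sketch; stream_completion is a placeholder for whichever streaming client the vendor actually ships, not a real SDK call.

```python
import time

def measure_ttft(stream_completion, prompt: str, runs: int = 50) -> float:
    """Return p95 time-to-first-token in milliseconds over `runs` calls.

    `stream_completion(prompt)` is assumed to return an iterator of chunks;
    adapt it to the vendor's actual streaming client.
    """
    ttfts = []
    for _ in range(runs):
        start = time.perf_counter()
        for _chunk in stream_completion(prompt):
            ttfts.append((time.perf_counter() - start) * 1000)
            break  # only the first chunk matters for TTFT
    ttfts.sort()
    return ttfts[int(0.95 * (len(ttfts) - 1))]
```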
Step 5: The kill criteria
Define, in writing, what would make you kill the tool before you start. Not aspirational benchmarks — specific thresholds.
Below this golden-set score, kill it. Above this p95 latency, kill it. Above this per-call cost, kill it. Above this rate of catastrophic failures, kill it. If these production metrics are not met after eight weeks, kill it.
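Thresholds written as a checked config are harder to quietly renegotiate than thresholds written in a doc. A minimal sketch; every number below is an example you would replace with your own.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KillCriteria:
    min_golden_score: float        # below this, kill it
    max_p95_latency_ms: float      # above this, kill it
    max_cost_per_call_usd: float   # above this, kill it
    max_catastrophic_rate: float   # above this, kill it
    deadline_weeks: int            # metrics not met by then, kill it

# Example thresholds -- agree on these in writing before the pilot starts.
CRITERIA = KillCriteria(
    min_golden_score=0.85,
    max_p95_latency_ms=800,
    max_cost_per_call_usd=0.02,
    max_catastrophic_rate=0.005,
    deadline_weeks=8,
)

def violates(golden_score: float, p95_ms: float, cost_per_call: float, catastrophic_rate: float) -> bool:
    """True if any threshold is breached; run this check at the agreed deadline."""
    return (
        golden_score < CRITERIA.min_golden_score
        or p95_ms > CRITERIA.max_p95_latency_ms
        or cost_per_call > CRITERIA.max_cost_per_call_usd
        or catastrophic_rate > CRITERIA.max_catastrophic_rate
    )
```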
Kill criteria written in advance survive the sunk-cost trap. Kill criteria written after a pilot has been running for two months become rationalization. The number one reason bad AI tools stay in production is that nobody wrote down what failure looks like before the pilot started, so when the pilot underperforms, the team negotiates with itself instead of cutting it.
If you cannot write down the conditions under which you would kill the pilot, you are not evaluating. You are subscribing.
A 10-line checklist
This is the checklist I keep on my desk and walk every operator through.
- List the five failure modes that actually matter, ranked by cost.
- Build a golden set of 100 to 300 hand-labeled examples from your production data.
- Lock the golden set. Version it. Treat it as a regression suite.
- Define p95 latency, per-call cost, and throughput ceilings before you test anything.
- Run every candidate against the same harness with the same prompts and the same retry policy.
- Score with automated metrics plus a human review pass on a stratified sample.
- Load-test at expected production concurrency, not single-call.
- Audit the deployment path: SLA, rate limits, data retention, streaming, JSON mode, rollback.
- Write kill criteria with specific thresholds and a deadline before signing.
- Re-run the harness quarterly. Models drift. Vendors update silently. Your decision can age out.
The work is unglamorous. It is also the difference between a stack of working AI features and a graveyard of pilots. Operators who run this discipline ship one or two carefully chosen tools that drive real outcomes. Operators who skip it ship five vendor logos and no measurable improvement.
The framework is not specific to LLMs. It works for any emerging-AI tool — vector databases, voice systems, agent frameworks, eval platforms. The tools change. The discipline does not. Test against your data. Measure against your ceilings. Define your kill criteria. Then decide, and own the decision.
What this saves you
Running this framework on a typical evaluation costs about a week of focused work and produces a written decision that holds up to scrutiny. Skipping it costs months. I have watched teams pay six-figure annual contracts for tools that, when finally tested against a real golden set, performed worse than a frontier API at a tenth of the price. I have watched teams roll out a vendor in production and discover only after a quarter that the per-call cost made every workflow that used it gross-margin negative. None of that requires sophistication to prevent. It requires discipline.
The biggest second-order benefit is institutional. Once you have a harness, every future evaluation is faster. The golden set grows. The eval scoring stabilizes. New models and tools get tested against the same yardstick the team already trusts, and you stop relitigating the basics every time a new vendor walks in. The harness becomes an asset that compounds the way good infrastructure always does.
One more thing. The framework forces you to be honest about what you actually need. Half the operators I run this with discover, somewhere around step two, that their workflow does not need an LLM at all — a deterministic script with a small classifier hits ninety percent of the value at one percent of the cost. The framework is not just a buy guide. It is a forcing function for clarity. The cheapest tool is the one you didn’t need in the first place.
Ajit Samuel is a New York City-based founder and operator. He architects, ships, and operates production AI, agentic systems, real-time data platforms, advertising technology, and growth infrastructure. ajitsamuel.com.