I Built 20 Autonomous AI Agents. Then I Learned Why Evaluation Has to Come First.

It cost me 14 failed tasks and a system I had to rebuild from scratch. Here's what changed.

The spec said “renders should look professional.”

That was the only acceptance criterion for a visual design system I built: an AI-powered pipeline that generates branded social media graphics across multiple brands and visual formats. The specification was architecturally complete. It defined inputs, outputs, component relationships, and deployment targets. It was also evaluatively empty. There was no instrument to measure whether a change improved or degraded the output.

I shipped it anyway.

Fourteen dispatch tasks. Three manual fixes in the IDE. Multiple CSS adjustment cycles. And a final output that still didn’t meet the bar. Every change was a guess. “Does this look better?” is not a measurable question when you’re reviewing your fifteenth variation at 2 AM.

I built the same system twice.

The second time started differently. Before any code was written, I built three things.

First, a regression test suite — 120 render combinations spanning every brand, every visual type, every theme mode. Not synthetic test data. Real content units pulled from the production queue, the same artifacts the system would process in daily operation.
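One way to picture a suite like this is as a cross product of brands, visual types, and theme modes. The sketch below is illustrative only: the article does not list the actual brands or formats, so every name and the 5 x 12 x 2 split are assumptions chosen simply because they total 120.

```python
from itertools import product

# Hypothetical axes: the real brand, type, and mode names are not given here.
BRANDS = [f"brand_{i}" for i in range(1, 6)]        # 5 assumed brands
VISUAL_TYPES = [f"type_{i}" for i in range(1, 13)]  # 12 assumed visual types
THEME_MODES = ["light", "dark"]                     # 2 assumed theme modes


def build_render_matrix():
    """Return every (brand, visual_type, theme_mode) combination to render."""
    return list(product(BRANDS, VISUAL_TYPES, THEME_MODES))


matrix = build_render_matrix()
print(len(matrix))
```

Enumerating the matrix up front means a missing combination shows up as a missing test rather than as a surprise in production.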

Second, a measurable acceptance rubric. Dimensions within ±0 pixels of target. File sizes under 2MB. Eight named visual acceptance tests covering typography scale, color accuracy, logo placement, and safe-zone compliance. Not “looks good.” Passes or fails.
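The mechanical parts of a rubric like this reduce to a function that returns failures instead of opinions. This is a minimal sketch, not the actual rubric: `RenderResult`, `check_render`, and the reading of "2MB" as 2 MiB are all assumptions, and the eight named visual tests are omitted because their definitions are not given.

```python
from dataclasses import dataclass

MAX_FILE_BYTES = 2 * 1024 * 1024  # "under 2MB", read here as 2 MiB (assumption)


@dataclass(frozen=True)
class RenderResult:
    width: int
    height: int
    file_bytes: int


def check_render(result, target_width, target_height):
    """Return a list of failure messages; an empty list means the render passes."""
    failures = []
    # "Within ±0 pixels" means an exact match, so no tolerance is applied.
    if (result.width, result.height) != (target_width, target_height):
        failures.append(
            f"dimensions {result.width}x{result.height} != "
            f"{target_width}x{target_height}"
        )
    if result.file_bytes >= MAX_FILE_BYTES:
        failures.append(f"file size {result.file_bytes} bytes is not under 2MB")
    return failures
```

Returning the full list of failures, rather than stopping at the first one, keeps each regression run informative when several criteria break at once.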

Third, a dependency-graph decomposition. Each build phase had explicit input/output contracts with prerequisite checks. Phase 3 couldn’t start until Phase 2’s outputs passed their acceptance gate. No phase could begin by assuming the previous one worked. Every phase had to prove it.
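The gating idea can be sketched in a few lines. To stay short, this version simplifies the dependency graph to a linear chain, and `run_phases` with its `(name, build, gate)` tuple shape is a hypothetical interface rather than the one used in the rebuild.

```python
def run_phases(phases):
    """Run phases in order; each phase's gate must pass before the next starts.

    `phases` is a list of (name, build, gate) tuples. `build` takes the
    previous phase's output and returns this phase's output; `gate` returns
    True only if that output is acceptable.
    """
    output = None
    completed = []
    for name, build, gate in phases:
        output = build(output)
        if not gate(output):
            # Halt immediately: later phases never see unverified input.
            raise RuntimeError(f"{name} failed its acceptance gate; halting")
        completed.append(name)
    return completed, output
```

The point of the hard stop is that a later phase never gets to assume an earlier one worked.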

The rebuild completed in four dispatch tasks. Forty-three minutes of execution time. Zero rework iterations. A regression pass rate of 120 out of 120. The output was assessed as presentation-quality, suitable for a conference panel deck.

Same system. Same tools. Same team of one. The difference was building the instrument first.

I wish I could tell you I figured this out on my own. I didn’t. The lesson came from a different project entirely, and it came the hard way.

I was building a notification processing system for an enterprise B2B platform — scheduled emails delivering order status updates to thousands of customers on daily, weekly, and monthly cadences. The core processing pipeline was straightforward: timer trigger, database query, template building, email delivery. The evaluation mechanism was “deploy and wait 24 hours to see if emails arrived.”

Every change to the template data structure, the email recipient logic, or the schedule calculation required a full deploy-and-wait cycle. A template change that looked correct in code review would produce a blank email field in production, and I wouldn’t know until the next day’s timer fired. If the fix was wrong, that was another 24-hour cycle.

Debugging was brutal. Not because the problems were complex — most were simple data-mapping errors. The brutality was the feedback loop. Twenty-four hours between “I think this works” and “it doesn’t.”

The Notification Pipeline

The turning point was building a test notification endpoint — an HTTP-triggered function that processed notifications on demand, bypassing the timer schedule. It accepted specific notification IDs, processed them immediately, and returned structured results showing exactly what was processed. Critically, it separated the “process and send” concern from the “update the schedule” concern. Test runs didn’t corrupt production schedule state.
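The core of such an endpoint, stripped of any HTTP framework, might look like the sketch below. Everything here is hypothetical: the `store` layout, the `send` callable, and the `update_schedule` flag stand in for whatever the real system used. What matters is the shape: explicit IDs in, structured results out, and schedule updates switched off unless the timer path asks for them.

```python
def process_notifications(notification_ids, store, send, update_schedule=False):
    """Process the given notifications immediately and report what happened.

    store: dict mapping notification ID -> {"recipient", "body", "runs"}.
    send: callable(recipient, body) that delivers one email.
    update_schedule: only the timer path passes True; test runs leave
    schedule state untouched.
    """
    results = []
    for nid in notification_ids:
        record = store.get(nid)
        if record is None:
            results.append({"id": nid, "status": "not_found"})
            continue
        send(record["recipient"], record["body"])
        if update_schedule:
            record["runs"] += 1  # stand-in for advancing the next-run time
        results.append(
            {"id": nid, "status": "sent", "recipient": record["recipient"]}
        )
    return results
```

Because `send` is injected, the same function can be exercised in a test session with a fake sender and called from the timer with the real one.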

That single endpoint compressed the feedback loop from 24 hours to seconds. Template changes could be verified in the same session they were made.

The team resisted building it. “We can just wait for the timer.” Two weeks of 24-hour debug cycles made the case that the evaluation instrument should have been built before the processing pipeline, not after.

These two experiences — the visual design system I built twice, and the notification pipeline I couldn’t debug — taught me the same lesson from opposite angles. One showed me what happens when you start with the evaluation system. The other showed me what happens when you don’t.

The pattern kept repeating. A content publishing pipeline shipped features for three weeks without formal specifications. Tasks were submitted as natural-language descriptions. AI agents interpreted intent, wrote code, and committed. Most of it worked. Some of it didn’t — and when it didn’t, there was no artifact to compare the output against. “Is this what was asked for?” was unanswerable because “what was asked for” existed only in a chat message that the executing agent had already forgotten.

The breaking point was a scheduling enforcement feature. The enforcement code existed. The schedule data existed. They referenced different file paths. No specification had ever defined the canonical path. The code was what I started calling “dark code” — code that worked, that was tested, that was committed, but that no human had ever comprehended as a system because no specification linked the components.

That experience forced the creation of a formal specification protocol. Every feature now requires a spec before the first task is dispatched. The spec defines inputs, outputs, file paths, acceptance criteria, and failure modes. If the output doesn’t match the spec, either the spec needs revision or the output needs rework. There is no third option where “it works but differently than specified” is acceptable.
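A spec in this sense can be a small data structure that code is checked against. The sketch below is an assumption about what that could look like, not the protocol itself: `FeatureSpec`, `path_mismatches`, and the example paths in the usage note are all hypothetical. It targets the specific failure described above, where two components referenced different paths for the same file.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSpec:
    name: str
    inputs: list
    outputs: list
    file_paths: dict  # logical name -> canonical path
    acceptance_criteria: list
    failure_modes: list = field(default_factory=list)


def path_mismatches(spec, component_paths):
    """Return components whose path for a logical file differs from the spec.

    component_paths maps component name -> {logical name: path it actually uses}.
    Each mismatch is (component, logical_name, used_path, canonical_path).
    """
    mismatches = []
    for component, used in sorted(component_paths.items()):
        for logical_name, path in sorted(used.items()):
            canonical = spec.file_paths.get(logical_name)
            if path != canonical:
                mismatches.append((component, logical_name, path, canonical))
    return mismatches
```

With a spec declaring `{"schedule": "data/schedule.json"}`, an enforcement component reading `config/schedule.json` would be reported before any task is dispatched, instead of being discovered after the feature silently does nothing.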

The visual design rebuild shows what a real specification buys: the same system went from 14 tasks with rework to 4 tasks with zero rework. The tool didn’t change. The team didn’t change. The specification changed.

When I built the Comprehension Audit — a diagnostic tool that scores how well someone understands their own AI project — the evaluation architecture was the first thing I designed. Not the frontend. Not the API. Not the scoring algorithm. The evaluation system.

The tool asks four questions. Each response is assessed across eight weighted dimensions by a dual-run LLM judge operating at temperature 0.0, which minimizes sampling variation and creative drift without guaranteeing identical outputs. Both runs must converge within ±1 point on every dimension or the system flags a divergence. The convergence threshold isn’t a nice-to-have. It’s the instrument that tells me whether the judge itself is stable.
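The convergence check itself is simple to express. This is a sketch under assumptions: each judge run is represented as a dict of dimension name to integer score, and `find_divergences` is a hypothetical name, not the tool's actual function.

```python
def find_divergences(run_a, run_b, tolerance=1):
    """Return {dimension: (score_a, score_b)} for dimensions where the two
    judge runs differ by more than `tolerance` points."""
    if run_a.keys() != run_b.keys():
        raise ValueError("both runs must score the same dimensions")
    return {
        dim: (run_a[dim], run_b[dim])
        for dim in run_a
        if abs(run_a[dim] - run_b[dim]) > tolerance
    }
```

An empty result means the runs converged; anything else is a signal about the judge, not about the person being assessed.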

The scoring weights sum to 1.15, not 1.0. That’s intentional. The extra 0.15 is distributed across the two dimensions that matter most for real-world AI project success — failure mode awareness and architectural intentionality. Those dimensions carry more weight because, across every project I’ve built and every production system I’ve shipped, those are the two areas where comprehension gaps cause the most expensive failures. The weights aren’t intuitive. They’re calibrated against scar tissue.
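To make the arithmetic concrete, here is one weight layout that is consistent with those numbers. Only the two emphasized dimensions and the 1.15 total come from the text. The equal base weight of 1/8, the even split of the extra 0.15, the placeholder dimension names, and the choice to divide by the total weight are all assumptions for illustration.

```python
import math

# Hypothetical layout: equal base weights plus an evenly split 0.15 boost.
BASE = 1.0 / 8
BOOST = 0.15 / 2
WEIGHTS = {
    "failure_mode_awareness": BASE + BOOST,
    "architectural_intentionality": BASE + BOOST,
    **{f"dimension_{i}": BASE for i in range(3, 9)},  # placeholder names
}


def weighted_score(scores):
    """Weighted mean of per-dimension scores, on the same scale as the inputs."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / total_weight


# Compare floats with a tolerance rather than ==.
assert math.isclose(sum(WEIGHTS.values()), 1.15)
```

Dividing by the total weight keeps the result on the original 1 to 5 scale, so weights that sum to 1.15 change the relative emphasis between dimensions without inflating the final score.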

I published 24 calibration examples — three per dimension, at score levels 2, 3, and 4. Not 1 and 5. The floor and the ceiling are easy to spot. The distinction between a 2 and a 3, between “surface understanding” and “working comprehension” — that’s where evaluators drift. That’s where I needed the instrument to be sharpest.

Before the calibration examples existed, the LLM judge scored 60% of test responses at Level 4 or higher. Manual assessment of those same responses put the real number closer to 15%. The judge wasn’t broken. It was doing what language models do: drifting toward generous, agreeable scores and producing assessments that feel right without being precise. The calibration data didn’t change the model. It changed what the model was comparing against.

The Discipline

This is the discipline I keep re-learning: the evaluation system is not a phase that comes after the build. It is the structure that makes the build possible.

“Does this look better?” is a conversation. “Does this pass 120 regression tests?” is an instrument. “We can just wait for the timer” is a 24-hour feedback loop that turns a 10-minute fix into a week-long debugging exercise. “Renders should look professional” is a spec that produces 14 tasks and rework. “Dimensions within ±0px, 8 named visual tests, 120/120 pass rate” is a spec that produces 4 tasks and zero rework.

The difference isn’t better tools or more time. It’s starting with the measurement system. When the spec includes what “good” looks like in terms a machine can verify, every downstream decision has a feedback loop. Without it, you’re guessing — and the most dangerous kind of guessing is the kind where every guess looks plausible.

Better specifications don’t just reduce rework. They eliminate the category of failure where you can’t tell whether you’re improving or not.

The Comprehension Audit is an open-source diagnostic tool that applies evaluation-first architecture to AI project assessment. Try it at wilfredmorgan.com/comprehension-audit, or explore the evaluation architecture in the open-source repository.

Wilfred Morgan

AI Systems Architect · Agentic AI Implementation
