MODULE 04 · 12 min read · June 2026

How do you trust agent output without reviewing everything manually?

The harness is the infrastructure that makes autonomous agent tiers safe. Without it, you have two bad choices: review everything (which creates a review backlog) or review nothing (which breaks things).

There's a hard constraint at the center of AI-native engineering: you cannot manually review everything agents produce and still capture the velocity gains. But you also cannot blindly merge agent output and maintain delivery stability.

The harness is what resolves this. It's the verification and governance infrastructure that sits between agent execution and merge: automated checks that establish whether agent output is correct, safe, and compliant before a human needs to evaluate it. When the harness works, human review becomes an exception-handling layer rather than the primary quality gate.

Most engineering orgs have half a harness: they have tests. Tests are necessary but not sufficient. A full harness includes deterministic checks, behavioral scoring, governance enforcement, and audit logging. The gap between "we have tests" and "we have a harness" is where most AI engineering governance failures happen.

The hard constraint

Why agents can't verify their own work

One of the most common naive approaches to agent governance is to ask the agent to review its own output before submitting. It's tempting. Agents can read code and understand requirements, so why not have them self-check?

The problem is systematic. Models that generate code are optimized for generating plausible-looking output. Evaluating output for correctness requires a different capability: detecting flaws, edge cases, and spec violations. Research from Anthropic's harness design work shows that models systematically miss their own generation errors. A 44% safety violation miss rate has been documented when agents self-evaluate. The same pattern that produces errors also produces blind spots in evaluation.

Matt Pocock's critique of self-improving agent systems captures it: when an agent can modify its own context or evaluation criteria, you get a positive feedback loop that optimizes for appearing correct rather than being correct. Verification must be external. The judge and the defendant cannot be the same system.

“The stop condition is the product. An agent without a well-defined termination condition is not a product. It's a process that runs until it runs out of tokens or produces something the engineer accepts by exhaustion.”
Deep Feed, 47-run experiment on agent loop completion

Architecture

The two-layer harness model

Every harness needs two distinct layers. Conflating them produces gaps. Building only one produces the other's failure mode.

Layer 1

Eval harness

Verifies that agent output correctly implements the intent. Answers: 'Did the agent build what we asked for?'

Components

Deterministic checks: types, lint, tests, security scans
LLM-as-judge scoring: behavioral review against spec acceptance criteria
Regression baselines: does this change anything that worked before?
Interface compliance: do public APIs match their contracts?

Failure mode without this layer

Agents produce code that passes tests but fails behavioral intent. Review debt accumulates because reviewers can't trust that even the basics have been checked.

Layer 2

Governance harness

Enforces organizational policy on agent behavior. Answers: 'Is this change allowed to proceed at this stage?'

Components

Runtime policy enforcement: is this change in the right tier?
Escalation routing: route high-risk changes to appropriate humans
Audit logging: immutable record of what the agent did, what gates it passed
Override tracking: when humans override governance, why?

Failure mode without this layer

Governance exists on paper but not in infrastructure. Individual engineers make ad-hoc decisions about what needs review. Compliance risk accumulates silently.
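One way to approximate the "immutable record" in the governance layer is a hash-chained, append-only log. The sketch below is illustrative only: the `AuditLog` class and the event fields are hypothetical names, and hash chaining makes tampering detectable rather than impossible, so a real deployment would likely pair it with write-once storage.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(event: dict, prev_hash: str) -> str:
    # Hash the event together with the previous entry's hash.
    body = {"event": event, "prev_hash": prev_hash}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


@dataclass
class AuditLog:
    """Append-only log; each entry commits to the one before it (hypothetical design)."""
    entries: list = field(default_factory=list)

    def append(self, event: dict) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS_HASH
        entry = {"event": event, "prev_hash": prev_hash,
                 "hash": _digest(event, prev_hash)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        # Recompute every hash; any edited or reordered entry breaks the chain.
        prev_hash = GENESIS_HASH
        for entry in self.entries:
            if entry["prev_hash"] != prev_hash:
                return False
            if entry["hash"] != _digest(entry["event"], entry["prev_hash"]):
                return False
            prev_hash = entry["hash"]
        return True
```

Gate results and overrides can be appended as ordinary events (for example, a dict carrying the agent session, the gate name, and an override reason), and `verify()` can run as part of compliance reporting.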

Building the eval harness

Eval gates in sequence

Eval gates should be ordered by cost and speed. Cheap, fast checks run first. Expensive, slow checks run only if cheap checks pass.
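The ordering rule can be sketched as a small runner that sorts gates by cost and stops at the first failure. This is a minimal illustration, not a CI integration: `Gate`, `cost_rank`, and `run_gates` are hypothetical names, and each `check` stands in for whatever command or service the gate actually calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Gate:
    name: str
    cost_rank: int              # lower = cheaper and faster (assumed ordering key)
    check: Callable[[], bool]   # returns True when the gate passes


def run_gates(gates: list[Gate]) -> tuple[bool, list[tuple[str, bool]]]:
    """Run gates cheapest-first and short-circuit on the first failure."""
    results: list[tuple[str, bool]] = []
    for gate in sorted(gates, key=lambda g: g.cost_rank):
        passed = gate.check()
        results.append((gate.name, passed))
        if not passed:
            # Skip the remaining, more expensive gates.
            return False, results
    return True, results
```

The returned list records only the gates that actually ran, which is the information an audit record would need about what passed and where the pipeline stopped.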

Gate 1: Deterministic fast checks (seconds)

Lint, type checking, formatting, import validation. These catch the most common agent errors (wrong types, style violations, import conflicts) in under 30 seconds. If these fail, there's no point running expensive checks. Every agent workflow must pass these before entering the review pipeline.

Gate 2: Test suite (minutes)

All existing tests must pass. The AI-specific gate: the change introduces no new test failures. Track the test coverage delta, and if coverage decreases, flag the change for review. This is table stakes, not a harness, but it's the foundation everything else builds on.
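A sketch of how this gate's rules might be evaluated, assuming the test runner's output has already been parsed into a set of failing test names and a coverage percentage. `SuiteResult` and `evaluate_test_gate` are hypothetical names, not part of any test framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteResult:
    failed: frozenset      # names of failing tests
    coverage: float        # line coverage, in percent


def evaluate_test_gate(base: SuiteResult, head: SuiteResult) -> dict:
    """Compare the agent's branch (head) against the target branch (base)."""
    new_failures = sorted(head.failed - base.failed)
    coverage_delta = round(head.coverage - base.coverage, 2)
    return {
        "passed": not head.failed,              # all tests must pass
        "new_failures": new_failures,           # failures the change introduced
        "coverage_delta": coverage_delta,
        "flag_for_review": coverage_delta < 0,  # a coverage drop is flagged, not failed
    }
```

Keeping `flag_for_review` separate from `passed` mirrors the text: a coverage decrease routes the change to a human rather than blocking it outright.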

Gate 3: Security and dependency scan (minutes)

Automated security scan (SAST), dependency vulnerability check, secrets detection. AI-generated code has a 1.5–2x higher security vulnerability rate than human-written code, so this gate specifically addresses that. Don't skip it for 'minor' changes.
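To make the secrets-detection part concrete, here is a deliberately small sketch that scans only the lines a diff adds. It assumes a unified diff as input and uses two illustrative patterns. The function name and rule names are hypothetical, and this is not a substitute for a dedicated scanner, which ships many more rules.

```python
import re

# Illustrative patterns only (assumption: two common secret shapes).
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def scan_added_lines(diff_text: str) -> list[tuple[int, str]]:
    """Return (diff line number, rule name) for each secret found in added lines."""
    findings = []
    for lineno, line in enumerate(diff_text.splitlines(), start=1):
        # Only inspect added lines; skip the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, rule))
    return findings
```

An empty result maps to a pass and any finding to a fail, which keeps the gate's outcome binary.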

Gate 4: Behavioral spec check (if spec exists)

If the change has a spec with testable acceptance criteria, run them. This is where LLM-as-judge scoring applies: an evaluator model reads the spec, reads the diff, and scores compliance. Anthropic's harness design work provides the pattern. The evaluator model must be different from the generator model.
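The separation between generator and evaluator can be enforced in code rather than by convention. The sketch below assumes the judge is injected as a plain callable returning a score between 0.0 and 1.0. The `Judge` signature, the model-name strings, and the 0.8 threshold are all hypothetical choices, not taken from any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical judge signature: (spec, diff) -> compliance score in [0.0, 1.0].
Judge = Callable[[str, str], float]


@dataclass(frozen=True)
class SpecVerdict:
    score: float
    passed: bool


def spec_gate(spec: str, diff: str, *, generator_model: str, evaluator_model: str,
              judge: Judge, threshold: float = 0.8) -> SpecVerdict:
    """Score a diff against its spec and collapse the score to pass/fail."""
    if evaluator_model == generator_model:
        # The judge and the defendant cannot be the same system.
        raise ValueError("evaluator model must differ from the generator model")
    score = judge(spec, diff)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"judge returned an out-of-range score: {score}")
    return SpecVerdict(score=score, passed=score >= threshold)
```

Collapsing the score to a boolean at a fixed threshold gives the governance layer a clear result to act on, while the raw score stays available for the audit record.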

Gate 5: Regression baseline comparison

For changes to critical paths, run the behavioral regression suite. Did this change anything that was working? This is especially important for AI-generated refactors, which can pass unit tests while breaking integration behaviors.
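A baseline comparison can be as simple as diffing recorded behavior against the current run. The sketch below assumes each critical-path scenario has a name and a comparable recorded output. `compare_to_baseline` and the scenario names in the usage note are hypothetical.

```python
def compare_to_baseline(baseline: dict, current: dict) -> dict:
    """Fail if any baselined scenario changed or disappeared.

    Scenarios that exist only in `current` are treated as new behavior
    and ignored here (a design assumption for this sketch).
    """
    changed = sorted(k for k in baseline
                     if k in current and current[k] != baseline[k])
    missing = sorted(k for k in baseline if k not in current)
    return {
        "passed": not changed and not missing,
        "changed": changed,
        "missing": missing,
    }
```

For example, with a baseline of `{"checkout_total": 42.0, "refund_status": "ok"}`, a run that returns only `{"checkout_total": 41.0}` fails with one changed and one missing scenario, even if every unit test passed.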

LEADER TAKEAWAY

The key harness design principle: each gate must be able to produce a clear pass/fail result that the governance layer can act on. A gate that produces "probably fine, but hard to say" is not a gate. It's a soft flag that gets ignored. Design your gates so the result is unambiguous.

Governance layer

Who owns what in AI governance

Governance without clear ownership produces the governance mirage: policy that exists on paper but isn't enforced. Use this as a starting point for your governance RACI.

| Activity | CTO/VP Eng | Platform Lead | Eng Lead | Engineer |
| Define governance policy (decision tiers) | | | | |
| Build and maintain eval harness | | | | |
| Build governance enforcement in CI | | | | |
| Review and approve governance overrides | | | | |
| Maintain audit log and compliance reporting | | | | |
| Update spec acceptance criteria | | | | |
| Review escalated Tier 3 changes | | | | |

R = Responsible, A = Accountable, C = Consulted, I = Informed

What changes

From tests to harness

| We have tests | We have a harness |
| Tests verify code correctness. Reviewers verify behavioral intent. | Harness verifies both. Human review handles judgment calls only. |
| Every PR requires the same manual review regardless of risk. | Tier 1 changes auto-merge. Tier 2 gets routed to the right reviewer. Tier 3 escalates automatically. |
| Governance policy exists in a Confluence doc. Engineers apply it inconsistently. | Governance policy is enforced in CI. Violations can't merge without explicit override (which is logged). |
| When something breaks, unclear whether it was AI-generated and which gates it passed. | Full audit trail: agent session, gates passed/failed, reviewer, override reason if applicable. |
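The tier routing and logged-override behavior described above could be enforced with a small decision function in CI. This is a sketch under stated assumptions: the `Router` class and `Action` names are hypothetical, and the rule that an overridden change never auto-merges is an illustrative choice rather than something the playbook specifies.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Action(Enum):
    AUTO_MERGE = "auto_merge"
    ROUTE_TO_REVIEWER = "route_to_reviewer"
    ESCALATE = "escalate"
    BLOCK = "block"


@dataclass
class Router:
    override_log: list = field(default_factory=list)

    def decide(self, tier: int, gates_passed: bool,
               override_reason: Optional[str] = None) -> Action:
        if tier not in (1, 2, 3):
            raise ValueError(f"unknown tier: {tier}")
        if not gates_passed:
            if not override_reason:
                return Action.BLOCK  # violations cannot merge silently
            # Every override is recorded together with its reason.
            self.override_log.append({"tier": tier, "reason": override_reason})
            # Assumption: an overridden change always goes to a human.
            return Action.ESCALATE if tier == 3 else Action.ROUTE_TO_REVIEWER
        return {1: Action.AUTO_MERGE,
                2: Action.ROUTE_TO_REVIEWER,
                3: Action.ESCALATE}[tier]
```

Because the override path is the only way past a failed gate and it always writes a log entry, the question "when humans override governance, why?" has a recorded answer.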
