How do you trust agent output without reviewing everything manually?
The harness is the infrastructure that makes autonomous agent tiers safe. Without it, you have two bad choices: review everything (and the review queue backs up) or review nothing (and things break).
There's a hard constraint at the center of AI-native engineering: you cannot manually review everything agents produce and still capture the velocity gains. But you also cannot blindly merge agent output and maintain delivery stability.
The harness is what resolves this. It's the verification and governance infrastructure that sits between agent execution and merge: automated checks that establish whether agent output is correct, safe, and compliant before a human needs to evaluate it. When the harness works, human review becomes an exception-handling layer rather than the primary quality gate.
Most engineering orgs have half a harness: they have tests. Tests are necessary but not sufficient. A full harness includes deterministic checks, behavioral scoring, governance enforcement, and audit logging. The gap between "we have tests" and "we have a harness" is where most AI engineering governance failures happen.
The hard constraint
Why agents can't verify their own work
One of the most common naive approaches to agent governance is to ask the agent to review its own output before submitting. It's tempting: agents can read code and understand requirements, so why not have them self-check?
The problem is systematic. Models that generate code are optimized for generating plausible-looking output. Evaluating output for correctness requires a different capability: detecting flaws, edge cases, and spec violations. Research from Anthropic's harness design work shows that models systematically miss their own generation errors. A 44% safety violation miss rate has been documented when agents self-evaluate. The same pattern that produces errors also produces blind spots in evaluation.
Matt Pocock's critique of self-improving agent systems captures it: when an agent can modify its own context or evaluation criteria, you get a positive feedback loop that optimizes for appearing correct rather than being correct. Verification must be external. The judge and the defendant cannot be the same system.
“The stop condition is the product. An agent without a well-defined termination condition is not a product. It's a process that runs until it runs out of tokens or produces something the engineer accepts by exhaustion.”
Deep Feed, 47-run experiment on agent loop completion
Architecture
The two-layer harness model
Every harness needs two distinct layers. Conflating them produces gaps. Building only one leaves you with the failure mode of the layer you skipped.
Eval harness
Verifies that agent output correctly implements the intent. Answers: 'Did the agent build what we asked for?'
Components
- Deterministic checks: types, lint, tests, security scans
- LLM-as-judge scoring: behavioral review against spec acceptance criteria
- Regression baselines: does this change anything that worked before?
- Interface compliance: do public APIs match their contracts?
Failure mode without this layer
Agents produce code that passes tests but fails behavioral intent. Review debt accumulates because reviewers can't trust that even the basics have been checked.
Governance harness
Enforces organizational policy on agent behavior. Answers: 'Is this change allowed to proceed at this stage?'
Components
- Runtime policy enforcement: is this change in the right tier?
- Escalation routing: route high-risk changes to appropriate humans
- Audit logging: immutable record of what the agent did, what gates it passed
- Override tracking: when humans override governance, why?
Failure mode without this layer
Governance exists on paper but not in infrastructure. Individual engineers make ad-hoc decisions about what needs review. Compliance risk accumulates silently.
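As a rough illustration of runtime policy enforcement, escalation routing, and override tracking, here is a minimal Python sketch. Everything in it is an assumption made for the example: the path prefixes in `TIER_RULES`, the rule that unknown paths default to the highest tier, and the names `classify_tier`, `decide`, and `record_override`. The three actions follow the tier behavior this playbook describes (Tier 1 auto-merges, Tier 2 goes to a reviewer, Tier 3 escalates); how your organization assigns tiers will differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Action(Enum):
    AUTO_MERGE = "auto_merge"
    ROUTE_TO_REVIEWER = "route_to_reviewer"
    ESCALATE = "escalate"


# Hypothetical path-prefix rules; real tier criteria are organization-specific.
TIER_RULES = [
    ("payments/", 3),
    ("auth/", 3),
    ("src/", 2),
    ("docs/", 1),
]


def classify_tier(changed_paths: List[str]) -> int:
    """Return the highest tier touched; unknown paths default to tier 3 (assumption)."""
    tier = 1
    for path in changed_paths:
        matched = next((t for prefix, t in TIER_RULES if path.startswith(prefix)), 3)
        tier = max(tier, matched)
    return tier


def decide(tier: int) -> Action:
    """Map a tier to the action the governance layer takes."""
    if tier == 1:
        return Action.AUTO_MERGE
    if tier == 2:
        return Action.ROUTE_TO_REVIEWER
    return Action.ESCALATE


@dataclass(frozen=True)
class Override:
    approver: str
    reason: str


def record_override(approver: str, reason: str) -> Override:
    """Refuse overrides without a stated reason, so the 'why' is always captured."""
    if not reason.strip():
        raise ValueError("an override needs a stated reason")
    return Override(approver, reason.strip())


# Example: a docs-only change versus one that also touches application code.
docs_only = decide(classify_tier(["docs/readme.md"]))
mixed = decide(classify_tier(["docs/readme.md", "src/api.py"]))
```

Defaulting unknown paths to the highest tier is a fail-closed choice: a change the policy does not recognize gets more scrutiny, not less.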
Building the eval harness
Eval gates in sequence
Eval gates should be ordered by cost and speed. Cheap, fast checks run first. Expensive, slow checks run only if cheap checks pass.
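The ordering rule can be sketched as a small runner that sorts gates by cost and stops at the first failure. This is a minimal illustration, not a prescribed implementation: the `Gate` and `GateResult` types, the integer `cost` rank, and the stand-in lambda checks are all hypothetical. In a real pipeline each `check` would invoke your actual tooling.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Gate:
    name: str
    cost: int  # relative cost rank (assumption): lower runs first
    check: Callable[[], bool]  # returns True when the gate passes


@dataclass
class GateResult:
    name: str
    passed: bool


def run_gates(gates: List[Gate]) -> List[GateResult]:
    """Run gates cheapest-first and stop at the first failure."""
    results: List[GateResult] = []
    for gate in sorted(gates, key=lambda g: g.cost):
        passed = gate.check()
        results.append(GateResult(gate.name, passed))
        if not passed:
            break  # more expensive gates are skipped
    return results


# Example with stand-in checks, deliberately listed out of order.
gates = [
    Gate("security-scan", cost=3, check=lambda: True),
    Gate("lint-and-types", cost=1, check=lambda: True),
    Gate("test-suite", cost=2, check=lambda: False),
]
results = run_gates(gates)
```

In this example the security scan never runs, because the test suite fails before the runner reaches it.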
Gate 1: Deterministic fast checks (seconds)
Lint, type checking, formatting, import validation. These catch the most common agent errors (wrong types, style violations, import conflicts) in under 30 seconds. If these fail, there's no point running expensive checks. Every agent workflow must pass these before entering the review pipeline.
Gate 2: Test suite (minutes)
All existing tests must pass. AI-specific gate: the change must introduce no new test failures. Track the test coverage delta; if coverage decreases, flag the change for review. This gate is table stakes, not a harness, but it is the foundation everything else builds on.
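One way to express this gate is as a comparison between a baseline run and the candidate run. The sketch below assumes test results are already available as name-to-pass/fail dictionaries and coverage as a percentage; `SuiteReport` and `compare_test_runs` are hypothetical names. Flagging removed tests for review is an added assumption, not something this playbook specifies.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SuiteReport:
    new_failures: List[str]
    removed_tests: List[str]
    coverage_delta: float
    blocked: bool
    needs_review: bool


def compare_test_runs(
    baseline: Dict[str, bool],
    candidate: Dict[str, bool],
    baseline_coverage: float,
    candidate_coverage: float,
) -> SuiteReport:
    """Compare a candidate test run against the baseline run."""
    # A new failure: fails now, and either passed before or did not exist before.
    new_failures = sorted(
        name for name, ok in candidate.items() if not ok and baseline.get(name, True)
    )
    # Assumption: tests that disappeared deserve a human look.
    removed = sorted(name for name in baseline if name not in candidate)
    delta = round(candidate_coverage - baseline_coverage, 2)
    return SuiteReport(
        new_failures=new_failures,
        removed_tests=removed,
        coverage_delta=delta,
        blocked=bool(new_failures),
        needs_review=delta < 0 or bool(removed),
    )


baseline = {"test_login": True, "test_logout": True, "test_refund": True}
candidate = {"test_login": True, "test_logout": False, "test_new": True}
report = compare_test_runs(baseline, candidate, 81.5, 80.0)
```

Here `test_logout` is a new failure, so the change is blocked. The dropped `test_refund` and the negative coverage delta would each flag it for review on their own.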
Gate 3: Security and dependency scan (minutes)
Automated security scan (SAST), dependency vulnerability check, secrets detection. AI-generated code has a 1.5–2x higher security vulnerability rate than human-written code, which is the risk this gate addresses. Don't skip it for 'minor' changes.
Gate 4: Behavioral spec check (if spec exists)
If the change has a spec with testable acceptance criteria, run them. This is where LLM-as-judge scoring applies: an evaluator model reads the spec, reads the diff, and scores compliance. Anthropic's harness design work provides the pattern. The evaluator model must be different from the generator model.
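A minimal sketch of the scoring step follows, with the evaluator passed in as a plain callable so that no particular model API is assumed. The `Judge` signature, the 0.8 pass threshold, the model names, and the stub judge are all hypothetical. The one rule taken directly from the text above is that the evaluator must not be the generator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical judge signature: (criterion, diff) -> score in [0.0, 1.0].
Judge = Callable[[str, str], float]


@dataclass
class SpecVerdict:
    scores: Dict[str, float]
    passed: bool


def score_against_spec(
    criteria: List[str],
    diff: str,
    judge: Judge,
    generator_model: str,
    evaluator_model: str,
    threshold: float = 0.8,  # assumed pass mark, tune for your own specs
) -> SpecVerdict:
    """Score a diff against each acceptance criterion with an external evaluator."""
    if generator_model == evaluator_model:
        raise ValueError("evaluator model must differ from the generator model")
    scores = {criterion: judge(criterion, diff) for criterion in criteria}
    return SpecVerdict(scores, passed=all(s >= threshold for s in scores.values()))


# Stub judge with canned scores, standing in for a real evaluator model call.
stub_scores = {"retries failed requests": 0.9, "logs each attempt": 0.6}


def stub_judge(criterion: str, diff: str) -> float:
    return stub_scores[criterion]


verdict = score_against_spec(
    list(stub_scores), "example diff text", stub_judge, "model-a", "model-b"
)
```

Requiring every criterion to clear the threshold, rather than averaging, keeps one strong score from masking an unmet criterion.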
Gate 5: Regression baseline comparison
For changes to critical paths: run behavioral regression suite. Did this change anything that was working? This is especially important for AI-generated refactors, which can pass unit tests while breaking integration behaviors.
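At its simplest, a baseline comparison replays recorded cases and reports any whose output has changed. The sketch below assumes behaviors can be captured as named cases with comparable outputs; `find_regressions`, the case names, and the values are illustrative only.

```python
from typing import Any, Callable, Dict, List


def find_regressions(baseline: Dict[str, Any], run_case: Callable[[str], Any]) -> List[str]:
    """Return names of cases whose current output differs from the recorded baseline."""
    return sorted(
        name for name, expected in baseline.items() if run_case(name) != expected
    )


# Recorded outputs from before the change (hypothetical cases).
baseline_outputs = {"checkout_total": 108.0, "empty_cart": 0.0}

# Stand-in for running the refactored code against the same cases.
current_outputs = {"checkout_total": 108.0, "empty_cart": None}

regressions = find_regressions(baseline_outputs, lambda name: current_outputs[name])
```

In this example the refactor still computes the checkout total correctly but changes what an empty cart returns, which is the kind of integration-level drift unit tests can miss.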
Governance layer
Who owns what in AI governance
Governance without clear ownership produces the governance mirage: policy that exists on paper but isn't enforced. Use this as a starting point for your governance RACI.
R = Responsible, A = Accountable, C = Consulted, I = Informed
What changes
From tests to harness
Before: Tests verify code correctness. Reviewers verify behavioral intent.
After: Harness verifies both. Human review handles judgment calls only.
Before: Every PR requires the same manual review regardless of risk.
After: Tier 1 changes auto-merge. Tier 2 gets routed to the right reviewer. Tier 3 escalates automatically.
Before: Governance policy exists in a Confluence doc. Engineers apply it inconsistently.
After: Governance policy is enforced in CI. Violations can't merge without explicit override (which is logged).
Before: When something breaks, unclear whether it was AI-generated and which gates it passed.
After: Full audit trail: agent session, gates passed/failed, reviewer, override reason if applicable.
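An audit trail with those fields can be sketched as an append-only log in which each entry carries a hash of the previous one. Hash chaining is a technique swapped in here for illustration: it makes tampering detectable rather than making the record truly immutable, and a production system would also need durable, access-controlled storage. The field names and the `append_entry` and `verify_chain` helpers are hypothetical.

```python
import hashlib
import json
from typing import List, Optional

GENESIS = "0" * 64  # placeholder hash for the first entry


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(
    log: List[dict],
    session_id: str,
    gates: dict,
    reviewer: Optional[str] = None,
    override_reason: Optional[str] = None,
) -> dict:
    """Append an audit entry linked to the previous entry by hash."""
    body = {
        "session_id": session_id,
        "gates": gates,  # gate name -> passed (True/False)
        "reviewer": reviewer,
        "override_reason": override_reason,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify_chain(log: List[dict]) -> bool:
    """Check that no entry was altered or reordered after being written."""
    prev_hash = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev_hash"] != prev_hash or _digest(body) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True


audit_log: List[dict] = []
append_entry(audit_log, "session-001", {"lint": True, "tests": True})
append_entry(
    audit_log,
    "session-002",
    {"lint": True, "tests": False},
    reviewer="alice",
    override_reason="known flaky test",
)
chain_ok = verify_chain(audit_log)
```

Editing any earlier entry afterwards, for example flipping a failed gate to passed, causes `verify_chain` to return `False`.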