Engineering Practice · Jun 2026 · 13 min read

Harness Engineering: The Discipline AI-Native Teams Need.

Anthropic named it in March 2026. Stripe built it without naming it. Most teams haven't built it at all. A harness is the eval suite, policy gates, and observability layer that keeps agents on track. Without one, you have agents. With one, you have a system you can trust.

LoomStack Team

Jun 9, 2026

In March 2026, an Anthropic engineer published a post titled “Harness design for long-running application development.” The opening line was direct: “Harness design is key to performance at the frontier of agentic coding.” The rest of the post described a three-agent architecture built around a single insight: agents cannot reliably evaluate their own output, so the evaluation infrastructure must be separate from the generation infrastructure.

That post named a discipline most teams are quietly failing at. Not because they haven't heard of testing. Because they have confused “we have some tests” with “we have a harness.” Those are not the same thing. Tests are individual checks. A harness is the infrastructure that runs them, gates on them, learns from production, and enforces policy at runtime. That difference separates a team that learns about regressions from customer complaints from a team that catches them in CI.

In software testing, a harness is the collection of stubs, drivers, and fixtures that configure an environment for automated testing: the infrastructure that channels raw capability toward verified behavior.

Harness engineering is the discipline of building that infrastructure: the eval suite, policy gates, and observability layer that together constrain AI agents toward verified behavior. This is what separates the 41% of AI agent deployments that hit positive ROI in year one from the 19% that never recover their investment.

The gap most teams have

LangChain's 2026 State of Agent Engineering surveyed teams with agents in production. 89% had implemented observability. Only 52.4% were running offline evaluations against test sets. Only 37.3% had online evals monitoring production traffic.

That means roughly half the teams shipping AI agents have no systematic way to know whether a prompt change, model update, or retrieval corpus update degraded their agent's performance. They find out from user complaints. Or they don't find out at all, because the degradation is gradual enough that nobody notices until something breaks badly.

Researchers at Peking University quantified what this costs. Their Claw-Eval benchmark found that trajectory-opaque evaluation, grading only final outputs rather than execution paths, misses 44% of safety violations and 13% of robustness failures that a proper harness catches. Pass@3 stays stable while Pass^3 drops up to 24% across three independent trials. The agents look capable when measured generously. The harness reveals that a quarter of their apparent capability is lucky runs.
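The gap between the two metrics is easier to see with numbers. The sketch below assumes the common reading of the terms: Pass@k counts a task as solved if at least one of k trials succeeds, and Pass^k counts it only if all k succeed. The benchmark's exact definitions may differ, and the task names and outcomes here are invented for illustration.

```python
from typing import Dict, List


def pass_at_k(trials: List[bool]) -> bool:
    """Generous metric: the task counts as solved if any trial succeeded."""
    return any(trials)


def pass_hat_k(trials: List[bool]) -> bool:
    """Strict metric: the task counts as solved only if every trial succeeded."""
    return all(trials)


def summarize(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Return the fraction of tasks solved under each metric."""
    n = len(results)
    return {
        "pass_at_k": sum(pass_at_k(t) for t in results.values()) / n,
        "pass_hat_k": sum(pass_hat_k(t) for t in results.values()) / n,
    }


# Hypothetical outcomes: three independent trials per task.
results = {
    "task_a": [True, True, True],     # reliably solved
    "task_b": [True, False, True],    # solved, but not every time
    "task_c": [False, True, False],   # one lucky run
    "task_d": [False, False, False],  # never solved
}

summary = summarize(results)
print(summary)
```

On this invented data the generous metric reports three of four tasks solved, while the strict metric reports one of four. The difference between the two is the share of apparent capability that depends on a lucky run.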

Teams without automated regression suites experience what practitioners are calling “eval debt.” DigitalApplied's 2026 productivity report found that these teams lose 14 to 23 percentage points of accuracy over 18 months. Not from any single failure. From accumulated drift they never measured.

Gartner forecasts that eval and governance will grow from 18-24% of total agent program spend today to 28-34% by mid-2027. That is not because teams want to spend more. It is because the teams that skipped it are rebuilding from scratch after their agents drifted into unreliability.

What a harness actually is

The word predates AI by decades. In software testing, a test harness is the collection of stubs, drivers, and fixtures that automate test execution. Martin Fowler's testing guide, rooted in the XP tradition Kent Beck formalized at Chrysler in the late 1990s, describes the harness as what makes a “single command that executes all the tests” possible. It is not the tests themselves. It is the infrastructure that runs them reliably.

In 2026, the term has been extended and made precise for AI systems. The ProofAgent research team defined it cleanly: “A benchmark defines tasks. An LLM judge assigns scores. A harness coordinates the full evaluation process: domain-aware setup, curated evaluation intelligence, adversarial multi-turn trials, behavioral trace capture, post-hoc multi-juror scoring, consensus, and evidence-linked reporting.” A harness is not a scoring method. It is evaluation infrastructure.

For a production AI system, a harness has two layers. They serve different purposes and operate at different times.

Aspect	Test / Eval Harness	Governance Harness
When	Pre-deploy (CI) + production scoring	Runtime, before every agent action
Purpose	Verify output quality	Enforce action policies
Mechanism	Rubric scoring, regression gates	Policy evaluation, authority issuance
Fail behavior	Blocks PR merge or deploy	Blocks or escalates the action
Example tools	Braintrust, Promptfoo, Arize	APE, Sanctum, policy engines

Both layers are necessary. Neither is sufficient on its own.

A team with a perfect eval harness but no governance harness ships correct-looking code that still takes dangerous actions at runtime. A team with a governance harness but no eval harness blocks bad actions but has no mechanism to improve output quality over time or detect drift. The teams that run AI reliably in production built both. Most teams have built neither fully.

Anthropic's harness design

The March 2026 Anthropic post is the most concrete public account of what a harness looks like for a long-running AI system.

The architecture has three agents. A Planner takes a one-to-four sentence prompt and produces a structured product spec: ordered chunks of work with acceptance criteria and dependencies. A Generator implements the spec one feature at a time, in sprint-sized blocks, using git commits to checkpoint work between sessions. An Evaluator exercises the live application using Playwright automation, clicking through UI features, hitting API endpoints, and inspecting database state, exactly as a real user would.

The Evaluator grades each sprint against a set of criteria with hard thresholds. If any single criterion falls below its threshold, the sprint fails and the Generator receives specific feedback on what went wrong. Before each sprint starts, the Generator and Evaluator negotiate a sprint contract: a structured agreement on what “done” looks like before any code is written. The Evaluator reviews the Generator's proposal to ensure the right thing is being built. They iterate until they agree.
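The grading rule described above, where any single criterion below its threshold fails the sprint, can be sketched in a few lines. This is an illustration of the rule only, not Anthropic's implementation. The criterion names, scores, and thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SprintResult:
    passed: bool
    feedback: List[str]  # one message per failing criterion, sent back to the Generator


def grade_sprint(scores: Dict[str, float], thresholds: Dict[str, float]) -> SprintResult:
    """Fail the sprint if any criterion is missing or falls below its hard threshold."""
    feedback: List[str] = []
    for criterion, minimum in thresholds.items():
        score = scores.get(criterion)
        if score is None:
            feedback.append(f"{criterion}: not evaluated")
        elif score < minimum:
            feedback.append(f"{criterion}: scored {score}, needs at least {minimum}")
    return SprintResult(passed=not feedback, feedback=feedback)


# Hypothetical sprint contract agreed before any code is written.
thresholds = {"functionality": 0.8, "design": 0.7}

result = grade_sprint({"functionality": 0.9, "design": 0.5}, thresholds)
print(result.passed, result.feedback)
```

Note that a high score on one criterion cannot compensate for a low score on another. That is the point of hard thresholds: there is no average to hide behind.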

“Applications from earlier harnesses often looked impressive but still had real bugs when you actually tried to use them. To catch these, the evaluator used the Playwright MCP to click through the running application the way a user would, testing UI features, API endpoints, and database states.”
— Prithvi Rajasekaran, Anthropic Engineering, March 2026

The key insight is architectural separation. The Generator cannot grade its own output. When a single agent generates and evaluates its own work, it finds reasons the work is done rather than finding the gaps. Separating generation from evaluation, keeping the Evaluator's context free of the Generator's reasoning trace, produces consistent and reproducible grading. The Evaluator sees only the artefact and the rubric, never the Generator's chain of thought.

This pattern, which researchers have catalogued as the Planner-Generator-Evaluator harness, is explicitly inspired by Generative Adversarial Networks. The Generator produces. The Evaluator critiques. The friction between them is the mechanism that produces quality. It is TDD at the sprint level, between two AI systems, with a human supervising the loop rather than executing it.

Building the eval harness

The eval harness has three jobs: catch regressions before they ship, detect drift in production, and improve over time by feeding production failures back into the offline dataset.

The first layer is cheap and deterministic. Does the output parse as valid JSON? Does it match the required schema? Does it contain a forbidden phrase or a known failure pattern? These run fast, cost essentially nothing, and catch a large class of failures without any model calls. Every eval harness should have this layer and it should run on every PR.
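A minimal sketch of this deterministic layer, using only the Python standard library. The required fields and the forbidden phrase are placeholders; a real harness would load them from the team's own schema and failure catalogue.

```python
import json
from typing import List

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}
# Hypothetical known failure pattern, matched case-insensitively.
FORBIDDEN_PHRASES = ["as an ai language model"]


def deterministic_checks(raw_output: str) -> List[str]:
    """Return a list of failure messages; an empty list means the output passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    failures: List[str] = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            failures.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            failures.append(f"wrong type for field: {field}")

    lowered = raw_output.lower()
    for phrase in FORBIDDEN_PHRASES:
        if phrase in lowered:
            failures.append(f"forbidden phrase: {phrase}")
    return failures


print(deterministic_checks('{"answer": "42", "sources": []}'))
print(deterministic_checks('{"answer": "42"}'))
```

No model call is involved, so checks like these can run on every PR without a budget conversation.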

The second layer uses an LLM-as-judge. A more capable model grades the output against a rubric: semantic accuracy, factual correctness, tone, safety. This is more expensive but handles what rule-based checks cannot. Braintrust, Promptfoo, LangSmith, and Arize Phoenix are all production-ready options for this layer. The specific tool matters less than wiring it into CI so that a score drop below threshold fails the build.

Write the rubric before the prompt

Define what a correct response looks like, what constitutes a failure, what the agent should never say, before writing any prompt. Encoding criteria first prevents the interpretation drift that produces inconsistent agent behavior across the team.

Gate releases on eval thresholds

Every PR that touches a prompt, a tool definition, or a model version runs the full eval suite. A score drop below any metric threshold fails the build. Without this gate, evals are reports nobody reads. With it, they are guardrails.
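What the gate amounts to in code is small. The sketch below assumes the eval run has already produced one aggregate score per metric; the metric names and thresholds are hypothetical, and how the scores are produced depends on whichever eval tool the team uses.

```python
from typing import Dict, List

# Hypothetical per-metric minimums agreed by the team.
THRESHOLDS = {"accuracy": 0.90, "safety": 0.99}


def failing_metrics(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Names of metrics that are missing or below their threshold, in sorted order."""
    failed: List[str] = []
    for name, minimum in sorted(thresholds.items()):
        score = scores.get(name)
        if score is None or score < minimum:
            failed.append(name)
    return failed


def gate_exit_code(scores: Dict[str, float], thresholds: Dict[str, float] = THRESHOLDS) -> int:
    """Return 0 if every metric clears its threshold, 1 otherwise."""
    failed = failing_metrics(scores, thresholds)
    for name in failed:
        print(f"FAIL {name}: got {scores.get(name)}, need >= {thresholds[name]}")
    return 1 if failed else 0


print(gate_exit_code({"accuracy": 0.93, "safety": 0.995}))
print(gate_exit_code({"accuracy": 0.85, "safety": 0.995}))
```

A CI step would pass the return value to `sys.exit`, so a non-zero code fails the build. A missing metric counts as a failure here, on the assumption that an eval that silently did not run should block a merge rather than wave it through.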

Run the same rubric in production

Score live production traces against the same rubric used in CI. Attach scores as metadata to the spans. When the offline average and production average start diverging, that gap is your drift signal.
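The drift signal itself is simple arithmetic once both sides are scored with the same rubric. A minimal sketch, assuming scores on a 0 to 1 scale and a hypothetical tolerance of 0.05; the right tolerance and window size depend on how noisy the rubric is.

```python
from statistics import mean
from typing import Sequence


def drift_gap(offline_scores: Sequence[float], production_scores: Sequence[float]) -> float:
    """Offline average minus production average; positive means production is worse."""
    return mean(offline_scores) - mean(production_scores)


def is_drifting(
    offline_scores: Sequence[float],
    production_scores: Sequence[float],
    tolerance: float = 0.05,  # hypothetical alerting threshold
) -> bool:
    """Flag drift when production trails the offline suite by more than the tolerance."""
    return drift_gap(offline_scores, production_scores) > tolerance


# Hypothetical rubric scores from the CI suite and from a recent window of live traces.
offline = [0.90, 0.92, 0.88]
production = [0.80, 0.82, 0.78]

print(round(drift_gap(offline, production), 3), is_drifting(offline, production))
```

Comparing means is the crudest possible detector. It is enough to show the idea, and a team with more traffic would likely compare distributions or per-segment averages instead.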

Promote production failures into the dataset

Every production failure that clusters with similar failures becomes a regression test. The on-call engineer reviews the cluster, promotes representative traces into the offline dataset, and the next PR has to clear them. This is how the eval suite stays synchronized with reality rather than drifting against a frozen golden set.
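The promotion step might look like the sketch below. It assumes each failed trace already carries a cluster label (here a hypothetical `failure_tag` field) and that a human approves the candidates between the two functions. The field names and the minimum cluster size are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def promotion_candidates(failures: List[dict], min_cluster_size: int = 3) -> List[dict]:
    """Pick one representative trace from each failure cluster that is large enough."""
    clusters: Dict[str, List[dict]] = defaultdict(list)
    for trace in failures:
        clusters[trace["failure_tag"]].append(trace)
    return [
        traces[0]
        for _tag, traces in sorted(clusters.items())
        if len(traces) >= min_cluster_size
    ]


def promote(dataset: List[dict], approved: List[dict]) -> int:
    """Append reviewer-approved traces to the offline dataset, skipping known inputs."""
    known = {case["input"] for case in dataset}
    added = 0
    for trace in approved:
        if trace["input"] not in known:
            dataset.append({"input": trace["input"], "failure_tag": trace["failure_tag"]})
            known.add(trace["input"])
            added += 1
    return added


# Hypothetical production failures collected over one on-call shift.
failures = [
    {"input": "q1", "failure_tag": "timeout"},
    {"input": "q2", "failure_tag": "timeout"},
    {"input": "q3", "failure_tag": "timeout"},
    {"input": "q4", "failure_tag": "bad_citation"},
]
dataset = [{"input": "q0", "failure_tag": "seed"}]

candidates = promotion_candidates(failures)  # the on-call engineer reviews these
print(promote(dataset, candidates), len(dataset))
```

Only the `timeout` cluster is large enough to produce a candidate in this example. The one-off `bad_citation` failure stays out of the dataset until it recurs.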

FutureAGI put it plainly in their 2026 analysis of production failures: “Your eval set is a snapshot, production is a river.” The offline suite is the floor. Production traffic is where the floor rises. Teams that close this loop find their eval suite is the most durable asset in their AI system, more valuable than any particular prompt or model choice.

The governance harness

The eval harness operates before and after execution. The governance harness operates during execution. It intercepts every proposed agent action, evaluates it against current policies and shared governance state, and decides whether to permit, block, or escalate.

A 2026 paper on runtime governance for AI agents makes the formal argument precisely: “Prompting is not an instance of [policy evaluation] at all. It modifies path distributions without evaluating them. Access control is a degenerate instance. It uses only agent identity and action type, ignoring path and context. Runtime evaluation is the general case.” The implication: any policy whose violation depends on what happened earlier in the execution path cannot be enforced by prompting or access control. It requires a runtime policy engine.
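A path-dependent policy makes the distinction concrete. In the sketch below, posting to an external URL is fine on its own and forbidden once the agent has read anything under a secrets directory. An access-control check that sees only the agent identity and the action type cannot express that rule. The tool names, paths, and policies are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Decision(Enum):
    PERMIT = "permit"
    BLOCK = "block"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Action:
    tool: str
    target: str


def evaluate(path: List[Action], proposed: Action) -> Decision:
    """Decide on the proposed action using the execution path so far, not just its type."""
    touched_secrets = any(
        a.tool == "read_file" and a.target.startswith("secrets/") for a in path
    )
    # Path-dependent rule: no outbound requests after reading secrets.
    if proposed.tool == "http_post" and touched_secrets:
        return Decision.BLOCK
    # Context-free rule: destructive actions always go to a human.
    if proposed.tool == "delete_file":
        return Decision.ESCALATE
    return Decision.PERMIT


post = Action("http_post", "https://example.com/report")
print(evaluate([], post))
print(evaluate([Action("read_file", "secrets/api_key.txt")], post))
```

The same `http_post` action gets two different answers depending on what came before it. That is the case prompting and static access control cannot cover.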

The Gravitee State of AI Agent Security report found that 82% of executives feel confident their policies protect against unauthorized agent actions. That confidence is based on policy documentation, not runtime enforcement. Only 14.4% of organizations have full IT and security approval for their entire agent fleet. Only 24.4% have full visibility into agent-to-agent communication. The rest are running agents in a state they cannot observe, enforcing policies they cannot verify at runtime.

“Telling the model 'ask before deleting' is not enforcement. A policy engine evaluates server-side policies, emits webhooks, and lets operators resume after approval — without rebuilding a state machine per action.”
— Sanctum runtime documentation

The Agent Policy Engine (APE) open-source project describes what a governance harness does in concrete terms: “Intent is declared. A plan is approved. Policies are evaluated. Authority is explicitly issued. Tools are executed through enforcement. Authority is consumed and revoked.” Every action goes through this cycle. There is no way to make a tool call without the cycle completing.
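The issue, consume, revoke part of that cycle can be illustrated with single-use grants. This is a generic sketch of the idea and not APE's actual API: the `Enforcer` class, its methods, and the example policy are all invented here.

```python
import secrets
from typing import Callable, Dict


class AuthorityError(Exception):
    """Raised when a tool call is attempted without valid authority."""


class Enforcer:
    def __init__(self, policy: Callable[[str], bool]) -> None:
        self._policy = policy
        self._grants: Dict[str, str] = {}  # token -> tool the token authorizes

    def issue(self, tool: str) -> str:
        """Evaluate policy and, if allowed, issue a single-use token for one tool."""
        if not self._policy(tool):
            raise AuthorityError(f"policy denied: {tool}")
        token = secrets.token_hex(8)
        self._grants[token] = tool
        return token

    def execute(self, token: str, tool: str, fn: Callable[[], object]) -> object:
        """Run the tool only if the token authorizes it; the token is consumed either way."""
        granted = self._grants.pop(token, None)  # popping revokes the grant
        if granted != tool:
            raise AuthorityError("no valid authority for this tool call")
        return fn()


# Hypothetical policy: everything except dropping the database is allowed.
enforcer = Enforcer(policy=lambda tool: tool != "drop_database")

token = enforcer.issue("send_email")
print(enforcer.execute(token, "send_email", lambda: "sent"))
```

Because the only path to `fn()` runs through `execute`, and `execute` removes the grant before calling it, a token cannot be replayed for a second action.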

As we analyze in Adaptive Autonomy, the governance harness is also how you implement risk-calibrated autonomy in practice. Low-risk actions pass automatically. High-risk actions require human approval. The policy engine is what applies those rules consistently across every agent, every action, every run.

Stripe and Anthropic in practice

Stripe did not publish a post called “harness engineering.” But their Minions system, which ships over 1,300 AI-written pull requests per week, is a textbook implementation of it. Three million tests. Linting that completes in under five seconds via a background daemon that precomputes which rules apply. Any check that would fail in CI enforced locally before the push. Agents run in parallel because the test harness gives each one a ground truth to work against.

What makes Stripe's Minions harness work is not just the scale of the test suite. It is the feedback architecture. The system distinguishes between three failure modes: tests that fail because the code is wrong, tests that fail because the test setup is wrong, and tests that fail because the harness environment drifted from production. Each failure mode routes differently. Code failures go back to the agent. Setup failures escalate to a human. Drift failures trigger a harness audit. A harness that cannot distinguish between these three types of failures will eventually drown in false positives or miss real regressions.
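The routing logic described above fits in a small lookup table. How a failure gets classified is the hard part, and the two signals used below (whether the test also fails on an unchanged baseline, and whether the environment matches production) are assumptions made for illustration, not a description of Stripe's system.

```python
from enum import Enum


class FailureKind(Enum):
    CODE = "code"    # the agent's change is wrong
    SETUP = "setup"  # the test setup is wrong
    DRIFT = "drift"  # the harness environment no longer matches production


ROUTES = {
    FailureKind.CODE: "return_to_agent",
    FailureKind.SETUP: "escalate_to_human",
    FailureKind.DRIFT: "trigger_harness_audit",
}


def classify(failure: dict) -> FailureKind:
    """Classify a failed test run using two hypothetical signals."""
    if not failure["env_matches_production"]:
        return FailureKind.DRIFT
    if failure["fails_on_baseline"]:
        # The test fails even without the agent's change, so the change is not the cause.
        return FailureKind.SETUP
    return FailureKind.CODE


def route(failure: dict) -> str:
    """Return the name of the handler this failure should be sent to."""
    return ROUTES[classify(failure)]


print(route({"env_matches_production": True, "fails_on_baseline": False}))
print(route({"env_matches_production": True, "fails_on_baseline": True}))
print(route({"env_matches_production": False, "fails_on_baseline": True}))
```

Drift is checked first in this sketch because a result from an environment that no longer matches production says little about either the code or the test.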

“Start with your developer environment, your test infrastructure, and your feedback loops. If those are solid, agents will benefit from them. If they're not, no model will save you.”
— Stripe, on building Minions

The pattern they describe, failures becoming test cases, test cases preventing regressions, is the production ratchet. Each cycle, the floor rises. The eval suite six months in is the most durable asset in the system, because it encodes every failure mode the team has encountered and every regression they caught.

A 12-metric framework derived from 100+ production deployments reached the same conclusion: “The teams shipping AI agents successfully in 2026 aren't the ones with the best models. They're the ones with the best evaluation infrastructure. Models are commodities. Evaluation is differentiation.”

Where the harness meets coordination

A harness is necessary but not self-sufficient. Stripe's harness works because it sits inside a coordination layer: agents that know when to escalate, a CI system that routes feedback, and an engineering organization that has made the harness a first-class part of the workflow rather than an optional add-on. As we cover in The AI Engineering Coordination Bottleneck, a harness without the surrounding coordination layer is like a quality gate with no one enforcing it.

Harness engineering is a team discipline, not an individual one. The eval rubric needs to reflect what the whole team believes correct behavior looks like. The governance policies need to encode the organization's actual risk thresholds, not a generic template. The production feedback loop needs someone on-call to review failure clusters and promote them into the dataset. None of these decisions is automated. The automation runs the loop. Humans define what the loop is optimizing for.

According to TURION.AI's 2026 enterprise analysis, 41% of AI agent deployments achieve year-one ROI. The three traits they share: evaluation infrastructure built before deployment, governance maturity, and scoped task selection. Not better models. Not more agents. The environment the agent operates in determines the quality of its output far more than the model itself does.

The 64% of leaders who cite “inability to prove consistent performance” as their top barrier to deploying AI agents are, in effect, describing the absence of a harness. They have agents that work sometimes. They do not have infrastructure that tells them when and why the agents are working. Without that, deployment is a bet. With it, deployment is a decision.

The new XP discipline

Kent Beck built the test harness concept into XP because he recognized that automated self-testing, not the tests themselves but the infrastructure that runs them reliably, was what made continuous development sustainable. The discipline was not about writing more tests. It was about building the infrastructure that made testing fast, reliable, and part of every commit.

Harness engineering is the same discipline for AI-native development. Not eval frameworks. Not individual test cases. The infrastructure that runs evals on every PR, scores production traces against the same rubric, routes policy decisions through a runtime engine, and feeds reviewed failures back into the regression dataset. The infrastructure that makes AI-native development sustainable rather than an endless firefight.

Anthropic named it. Stripe built it. The teams hitting positive ROI have it. The teams losing 14 to 23 percentage points of accuracy over eighteen months do not.

The reason most AI agent prototypes never reach production is not the model. It is that nobody can prove the agent will not regress on the next prompt change. Harness engineering is how you prove it.