LoomStack
CONCEPT

Harness Engineering

The discipline of building evaluation and governance infrastructure that constrains AI agents to produce reliable, policy-compliant output.

5/5 steps100% automated1 human step
Code Agent
sonnet-4
QA Eval GateFAIL
claude-sonnet-4

3/14 FAIL — idempotency not enforced

eval threshold not met → escalate
Policy CheckGATE
policy-engine

touches payments-service · security-critical

HumanSecurity Review
Aditi S.

Approve or reject — PR blocked until resolved

Deploy
sonnet-4

Definition

What is harness engineering?

In software testing, a harness is the collection of stubs, drivers, and fixtures that configure an environment for automated testing. Not the tests themselves, but the infrastructure that runs them reliably. Harness engineering extends this concept to AI systems.

Anthropic named it in March 2026. Their engineer wrote: “Harness design is key to performance at the frontier of agentic coding.” Agents cannot reliably evaluate their own output, so evaluation infrastructure must be separate from generation infrastructure.

Most teams confuse “we have some tests” with “we have a harness.” The difference is what separates the 41% of AI agent deployments that hit positive ROI in year one from the 19% that never recover their investment.

Framework

The two-layer model

Quality verification and policy enforcement are separate problems. Both need dedicated infrastructure.

Layer A

Eval Harness

Runs pre-deploy and in production. Verifies output quality through rubric scoring and regression gates. Blocks merge or deploy when thresholds are not met.

BenchmarksLLM judgesRegression gatesProduction scoring
Layer B

Governance Harness

Runs at runtime, before every agent action. Enforces action policies. Blocks or escalates actions that violate organizational rules, regardless of output quality.

Policy evaluationAuthority issuanceEscalation routingAudit logging

Platform

How LoomStack implements harness engineering

Evaluation gates, policy enforcement, production scoring, and audit trails as first-class workflow primitives.

Gate 01

Eval Gates

Multi-layer evaluation in CI: deterministic checks, LLM-as-judge scoring against rubrics, and regression detection against baseline datasets.

Gate 02

Policy Engine

Runtime governance that evaluates every agent action against organizational policy before execution. Violations are blocked or escalated, not discovered in review.

Gate 03

Production Scoring

Online evaluation monitoring production traffic. Detects accuracy drift, behavioral regression, and safety violations in real time.

Gate 04

Audit Trail

Immutable, evidence-linked record of every eval result, policy decision, and governance override. Full traceability for compliance and post-incident analysis.

Principles

Key principles

01

Tests are individual checks. A harness is the infrastructure that runs them, gates on them, and learns from production.

02

Agents cannot reliably evaluate their own output. Generation and evaluation must be architecturally separate.

03

Trajectory-opaque evaluation misses 44% of safety violations. The harness must observe execution paths.

04

Two layers are both necessary: an eval harness that verifies quality, and a governance harness that enforces policy at runtime.

05

Eval debt compounds silently. Teams without regression suites lose 14-23 percentage points of accuracy over 18 months.

Build your eval and governance harness

LoomStack provides the evaluation gates, policy engine, and production scoring infrastructure that turns agent output into trusted, governed results.