MODULE 08 · 11 min read · June 2026

How do you measure whether this is actually working?

Individual productivity metrics tell you that engineers are faster. They don't tell you whether the org is delivering better. These are different questions requiring different measurements.

The board wants to know if the AI investment is working. Your AI tool vendor sent you a slide showing 40% faster task completion across your engineering team. But deployment frequency is flat. Change failure rate is up slightly. Two P1 incidents in the last quarter were traced back to AI-generated code that passed review.

You can't use the vendor's metric to answer the board's question. Individual productivity and organizational delivery are different things, and conflating them is how AI programs get defended with cherry-picked numbers while the underlying delivery problems compound.

This module builds the measurement framework that answers the actual question: is AI-native engineering producing better outcomes for the organization, and where specifically is it working or not?

The measurement problem

Why traditional metrics mislead in AI-native orgs

Standard engineering metrics weren't designed to distinguish human-authored from AI-generated work. When you measure them in an AI-native context, you get misleading signals.

Lines of code / commit size

⚠

AI commits are larger and generated faster. This metric goes up dramatically while quality may go down. It's now a leading indicator of review overhead, not productivity.

PR count / merge rate

⚠

LinearB found 98% more PRs merged with AI adoption. This number looks great. It says nothing about whether the features shipped actually worked or solved the problem.

Deployment frequency

⚠

DORA 2025: increased AI adoption correlated with decreased deployment frequency for the majority of teams. If you're measuring deployment frequency without segmenting by source, you may be missing that AI-generated work deploys slower despite being written faster.

Story points / velocity

⚠

AI completes tickets faster than their estimates assumed. But those estimates were calibrated for human work. You're now comparing different units, which makes velocity comparisons across periods or teams meaningless.

KEY INSIGHT

The fix is not to stop using these metrics. It's to segment them by source. "PR acceptance rate" becomes "PR acceptance rate, AI-generated vs. human-authored." "Deployment frequency" becomes "deployment frequency by change source." Segmentation is what converts misleading aggregate metrics into useful diagnostic signals.
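To make the segmentation idea concrete, here is a minimal sketch. It assumes each PR record already carries a `source` field and an `accepted_first_pass` flag; both field names and the sample data are hypothetical, not something your tooling provides out of the box.

```python
from collections import defaultdict

# Hypothetical PR records; "source" and "accepted_first_pass" are assumed fields.
prs = [
    {"id": 1, "source": "ai", "accepted_first_pass": True},
    {"id": 2, "source": "ai", "accepted_first_pass": False},
    {"id": 3, "source": "human", "accepted_first_pass": True},
    {"id": 4, "source": "human", "accepted_first_pass": True},
]


def acceptance_rate_by_source(records):
    """Return {source: fraction of PRs accepted on first pass}."""
    totals = defaultdict(int)
    accepted = defaultdict(int)
    for pr in records:
        totals[pr["source"]] += 1
        accepted[pr["source"]] += int(pr["accepted_first_pass"])
    return {src: accepted[src] / totals[src] for src in totals}


# The aggregate hides the gap that the segmented view exposes.
aggregate = sum(pr["accepted_first_pass"] for pr in prs) / len(prs)
by_source = acceptance_rate_by_source(prs)

print(aggregate)  # 0.75
print(by_source)  # {'ai': 0.5, 'human': 1.0}
```

In this toy data the aggregate looks healthy, while the segmented view shows half of the AI-generated PRs bouncing on first review.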

The attribution problem

How to segment by AI vs. human source

Before you can segment metrics, you need attribution. You need to know which changes were AI-generated, which were human-authored, and which were human-reviewed AI work. This is the observability foundation that Module 01's maturity assessment covers.

Minimum viable attribution: label AI-generated PRs at creation time. This can be a PR label applied by the agent workflow, a commit trailer (`Co-authored-by: claude-code`), or a CI tag. Once you have the label, you can segment most standard metrics.
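One way the label-or-trailer check could look, as a rough sketch. The `ai-generated` label name is an assumption, the trailer marker mirrors the example above, and the function only separates two categories; distinguishing human-reviewed AI work would need review data as well.

```python
AI_LABELS = {"ai-generated"}           # hypothetical PR label name
AI_TRAILER_MARKERS = ("claude-code",)  # marker taken from the trailer example


def classify_source(labels, commit_messages):
    """Return 'ai' if a PR label or a Co-authored-by trailer marks the change
    as AI-generated, otherwise 'human'."""
    if AI_LABELS & {label.lower() for label in labels}:
        return "ai"
    for message in commit_messages:
        for line in message.splitlines():
            key, _, value = line.partition(":")
            if key.strip().lower() == "co-authored-by" and any(
                marker in value.lower() for marker in AI_TRAILER_MARKERS
            ):
                return "ai"
    return "human"


print(classify_source([], ["Fix retry logic\n\nCo-authored-by: claude-code"]))
print(classify_source(["bug"], ["Fix typo in README"]))
```

Whatever rule you choose, apply it at PR creation time and store the result, so later metrics do not depend on re-parsing history.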

Better attribution: trace from agent session through PR to deployment and production. This requires instrumentation at each step but enables full lifecycle measurement. You can ask "which production incidents were caused by AI-generated changes in the last 90 days?" That's the question that justifies or challenges your governance investments.
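Once deployments carry a source label and incidents link back to a deployment, the 90-day question reduces to a join and a date filter. The sketch below assumes that linkage already exists; the record shapes, IDs, and dates are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical records: deployments carry the source label,
# incidents reference the deployment they were traced to.
deployments = {
    "d1": {"source": "ai"},
    "d2": {"source": "human"},
    "d3": {"source": "ai"},
}
incidents = [
    {"id": "INC-1", "deployment": "d1", "opened": date(2026, 5, 20)},
    {"id": "INC-2", "deployment": "d2", "opened": date(2026, 4, 2)},
    {"id": "INC-3", "deployment": "d3", "opened": date(2025, 12, 1)},
]


def ai_incidents(incidents, deployments, today, window_days=90):
    """IDs of incidents opened within the window and traced to AI-generated changes."""
    cutoff = today - timedelta(days=window_days)
    return [
        incident["id"]
        for incident in incidents
        if incident["opened"] >= cutoff
        and deployments[incident["deployment"]]["source"] == "ai"
    ]


result = ai_incidents(incidents, deployments, today=date(2026, 6, 1))
print(result)  # ['INC-1']
```

The hard part in practice is not this query but producing the incident-to-deployment link reliably, which is why the instrumentation has to exist at each step.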

The framework

AI-attributed metrics: leading vs. lagging

Leading indicators

Tell you what's coming before it shows up in delivery

AI PR first-pass acceptance rate

Falling rate = agents working without sufficient context. Rising rate = context layer improving.

PR pickup time (AI-generated)

If this is growing, your review capacity isn't scaling with your generation capacity. Coordination problem building.

Review concentration ratio

What % of AI-generated PRs are reviewed by the top 10% of reviewers? High concentration = bottleneck risk.

Eval gate pass rate on first submission

Low rate = spec quality or context quality issue. Rising rate = harness improving agent output.

Autonomous-to-escalation ratio

What % of Tier 1 changes stay autonomous vs. getting manually escalated? Low ratio = governance over-triggering.
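Of the leading indicators above, the review concentration ratio is the least obvious to compute. A minimal sketch, assuming one primary reviewer per AI-generated PR and treating "top 10%" as the most active tenth of reviewers, rounded up to at least one person (both are simplifying assumptions):

```python
import math
from collections import Counter


def review_concentration(reviewer_per_pr, top_fraction=0.10):
    """Share of AI-generated PRs reviewed by the most active `top_fraction`
    of reviewers. `reviewer_per_pr` holds one reviewer name per PR."""
    if not reviewer_per_pr:
        return 0.0
    counts = Counter(reviewer_per_pr)
    top_n = max(1, math.ceil(len(counts) * top_fraction))
    top_reviews = sum(n for _, n in counts.most_common(top_n))
    return top_reviews / len(reviewer_per_pr)


# Hypothetical quarter: 10 reviewers, one of whom handled 11 of 20 AI PRs.
reviews = ["r0"] * 11 + [f"r{i}" for i in range(1, 10)]
print(review_concentration(reviews))  # 0.55
```

Here one reviewer out of ten carries 55% of the AI review load, which is the bottleneck pattern the metric is meant to surface.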

Lagging indicators

Tell you whether it actually worked, confirmed after the fact

Change failure rate by source

The most important metric. Does AI-generated code cause more production incidents than human-written code? If yes, by how much, and in which categories?

AI-attributed incident rate

Of P1/P2 incidents in the last quarter, what % were traced to AI-generated changes? This is the governance health metric that matters to the board.

Code churn ratio (AI vs. human)

AI-generated code that gets rewritten within 2 weeks indicates context or spec quality problems. Should trend down as context layer improves.

Deployment frequency (AI-generated features)

Are features implemented by agents actually deploying faster than human-implemented features? If not, the coordination overhead is consuming the gains.

MTTR segmented by change source

If AI-generated changes take longer to recover from when they fail, that's a harness quality signal. Your eval gates aren't catching subtle behavioral issues.
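As an illustration of one lagging indicator, code churn by source could be computed roughly as follows. The sketch assumes you can attribute rewritten lines back to the change that introduced them (for example via blame data); the record layout and numbers are hypothetical.

```python
from datetime import date

# Hypothetical change records. "rewrites" lists (date, lines) pairs for lines
# from this change that were later rewritten.
changes = [
    {
        "source": "ai",
        "lines_added": 200,
        "merged": date(2026, 5, 1),
        "rewrites": [(date(2026, 5, 8), 60), (date(2026, 6, 1), 40)],
    },
    {
        "source": "human",
        "lines_added": 100,
        "merged": date(2026, 5, 1),
        "rewrites": [(date(2026, 5, 10), 10)],
    },
]


def churn_ratio(changes, source, window_days=14):
    """Lines rewritten within `window_days` of merge, divided by lines added."""
    added = churned = 0
    for change in changes:
        if change["source"] != source:
            continue
        added += change["lines_added"]
        churned += sum(
            lines
            for when, lines in change["rewrites"]
            if (when - change["merged"]).days <= window_days
        )
    return churned / added if added else 0.0


print(churn_ratio(changes, "ai"))     # 60 of 200 lines within two weeks
print(churn_ratio(changes, "human"))  # 10 of 100 lines within two weeks
```

The rewrite on 1 June falls outside the two-week window and is excluded, so only early churn counts against the change.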

Leader artifact

The 4-metric CTO dashboard

These four metrics tell the story you need to tell the board. They're based on DORA's 7-capability model, segmented for AI-native orgs.

Delivery throughput delta

Formula

Deployment frequency (current quarter) vs. deployment frequency (same quarter 12 months ago)

Target direction

Should be increasing. If flat despite AI adoption, you have a coordination problem.

Note

Ideally broken out: AI-generated features vs. human-implemented features

AI-attributed change failure rate

Formula

Incidents caused by AI-generated changes / Total AI-generated deployments

Target direction

Should be at parity with or below human-authored change failure rate. Above means governance gaps.

Note

Critical: must be segmented from overall CFR
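A sketch of the segmented calculation, assuming each deployment record has a `source` label and a `caused_incident` flag set by your incident review process. Both field names and the sample counts are hypothetical.

```python
def change_failure_rate(deployments, source):
    """Incidents caused by `source` changes / total `source` deployments.
    Returns None when there are no deployments for that source."""
    relevant = [d for d in deployments if d["source"] == source]
    if not relevant:
        return None
    return sum(d["caused_incident"] for d in relevant) / len(relevant)


# Hypothetical quarter: 40 AI-generated deployments with 4 failures,
# 50 human-authored deployments with 3 failures.
deployments = (
    [{"source": "ai", "caused_incident": i < 4} for i in range(40)]
    + [{"source": "human", "caused_incident": i < 3} for i in range(50)]
)

ai_cfr = change_failure_rate(deployments, "ai")
human_cfr = change_failure_rate(deployments, "human")
above_parity = ai_cfr > human_cfr

print(ai_cfr, human_cfr, above_parity)
```

With counts this small the gap can easily be noise, so treat a single quarter as a prompt for investigation rather than a verdict.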

Review leverage ratio

Formula

AI-generated PRs merged autonomously / Total AI-generated PRs

Target direction

Should be increasing as harness matures and context layer improves. Stagnation means coordination overhead isn't reducing.

Note

Track trend over 90 days, not point-in-time
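To track the trend rather than a single point, one option is to split the 90 days into three 30-day windows and compute the ratio for each. The window size, the `autonomous` flag, and the sample PRs below are assumptions for illustration.

```python
from datetime import date, timedelta


def leverage_by_window(prs, end, windows=3, window_days=30):
    """Review leverage ratio per window, oldest window first.
    A window with no AI-generated PRs yields None."""
    ratios = []
    for k in range(windows, 0, -1):
        start = end - timedelta(days=k * window_days)
        stop = end - timedelta(days=(k - 1) * window_days)
        in_window = [p for p in prs if start < p["merged"] <= stop]
        if in_window:
            ratios.append(sum(p["autonomous"] for p in in_window) / len(in_window))
        else:
            ratios.append(None)
    return ratios


# Hypothetical AI-generated PRs; "autonomous" means merged without manual review.
ai_prs = [
    {"merged": date(2026, 4, 10), "autonomous": False},
    {"merged": date(2026, 4, 20), "autonomous": False},
    {"merged": date(2026, 4, 25), "autonomous": False},
    {"merged": date(2026, 4, 28), "autonomous": True},
    {"merged": date(2026, 5, 5), "autonomous": True},
    {"merged": date(2026, 5, 15), "autonomous": False},
    {"merged": date(2026, 6, 5), "autonomous": True},
    {"merged": date(2026, 6, 10), "autonomous": True},
    {"merged": date(2026, 6, 20), "autonomous": True},
    {"merged": date(2026, 6, 25), "autonomous": False},
]

trend = leverage_by_window(ai_prs, end=date(2026, 6, 30))
print(trend)  # [0.25, 0.5, 0.75]
```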

Coordination overhead index

Formula

Review time (AI PRs) / Review time (human PRs)

Target direction

Ideally approaching 1.0 as context and harness improve. Above 2.0 is a problem. Below 1.0 means your harness is working.

Note

Track by team. High ratios indicate teams with poor context infrastructure
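A per-team version of the index might look like the sketch below. The formula above does not say how to summarize review time, so using the median is an assumption (it is less sensitive to a few stalled PRs than the mean); team names and hours are hypothetical.

```python
from statistics import median


def coordination_overhead(review_hours):
    """Median review time of AI PRs divided by median review time of human PRs.
    `review_hours` is {"ai": [...], "human": [...]} in hours."""
    return median(review_hours["ai"]) / median(review_hours["human"])


# Hypothetical review times in hours, per team.
teams = {
    "payments": {"ai": [6.0, 8.0, 10.0], "human": [3.0, 4.0, 5.0]},
    "search": {"ai": [2.0, 3.0, 4.0], "human": [3.0, 4.0, 5.0]},
    "platform": {"ai": [9.0, 10.0, 11.0], "human": [3.0, 4.0, 5.0]},
}

index = {team: coordination_overhead(hours) for team, hours in teams.items()}
flagged = sorted(team for team, value in index.items() if value > 2.0)

print(index)    # {'payments': 2.0, 'search': 0.75, 'platform': 2.5}
print(flagged)  # ['platform']
```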

LEADER TAKEAWAY

If you can't compute all four of these today, prioritize the hardest one to instrument: #2 (AI-attributed change failure rate). It requires tracing AI-generated changes through to production and linking them to incidents. Start instrumenting this now. It takes 3–6 months to accumulate enough data to be meaningful, and it's the metric that will matter most in 12 months.

Board reporting

Reporting AI engineering ROI honestly

The board wants ROI. The honest answer at most orgs right now: "Individual productivity is up. Organizational delivery improvement is lagging by 6–12 months as we build the coordination infrastructure. Here's the evidence that the infrastructure investment is on track."

The key move: separate the argument into two parts. Part one is the individual productivity gains (these are real, and vendor data is fine here). Part two is the progress on infrastructure (your coordination overhead index, your AI-attributed CFR, your review leverage ratio). The board question is whether you'll eventually capture the full value. Your infrastructure metrics are the evidence.

What you should not do: report individual productivity metrics as organizational ROI. The DORA data is public. If the board has a sophisticated technical advisor, they'll ask the question you don't want to answer: "If developers are 40% faster, why isn't your deployment frequency up?" Have the answer ready, or have the metrics that show you're building toward it.
