How do you measure whether this is actually working?
Individual productivity metrics tell you that engineers are faster. They don't tell you whether the org is delivering better. These are different questions requiring different measurements.
The board wants to know if the AI investment is working. Your AI tool vendor sent you a slide showing 40% faster task completion across your engineering team. But deployment frequency is flat. Change failure rate is up slightly. Two P1 incidents in the last quarter were traced back to AI-generated code that passed review.
You can't use the vendor's metric to answer the board's question. Individual productivity and organizational delivery are different things, and conflating them is how AI programs get defended with cherry-picked numbers while the underlying delivery problems compound.
This module builds the measurement framework that answers the actual question: is AI-native engineering producing better outcomes for the organization, and where specifically is it working or not?
The measurement problem
Why traditional metrics mislead in AI-native orgs
Standard engineering metrics weren't designed to distinguish human-authored from AI-generated work. When you measure them in an AI-native context, you get misleading signals.
Lines of code / commit size
AI commits are larger and generated faster. This metric goes up dramatically while quality may go down. It's now a leading indicator of review overhead, not productivity.
PR count / merge rate
LinearB found 98% more PRs merged with AI adoption. This number looks great. It says nothing about whether the features shipped actually worked or solved the problem.
Deployment frequency
DORA 2025: increased AI adoption correlated with decreased deployment frequency for the majority of teams. If you're measuring deployment frequency without segmenting by source, you may be missing that AI-generated work deploys slower despite being written faster.
Story points / velocity
AI completes tickets faster than their estimates assumed. But those estimates were calibrated for human work. You're now comparing different units, which makes velocity comparisons across periods meaningless.
The attribution problem
How to segment by AI vs. human source
Before you can segment metrics, you need attribution. You need to know which changes were AI-generated, which were human-authored, and which were human-reviewed AI work. This is the observability foundation that Module 01's maturity assessment covers.
Minimum viable attribution: label AI-generated PRs at creation time. This can be a PR label applied by the agent workflow, a commit trailer (`Co-authored-by: claude-code`), or a CI tag. Once you have the label, you can segment most standard metrics.
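As a minimal sketch of trailer-based labeling, the Python below classifies a commit message as AI-generated or human-authored by looking for a `Co-authored-by` trailer. The `AI_COAUTHORS` set and the `classify_commit` helper are hypothetical names for illustration; the only value taken from the text is the `claude-code` trailer, and the two-way "ai"/"human" split is a simplifying assumption.

```python
import re

# Assumption: the agent identities your workflow writes into commit trailers.
# Only "claude-code" comes from the text above; extend this set as needed.
AI_COAUTHORS = {"claude-code"}

# Matches lines such as "Co-authored-by: claude-code" or
# "Co-authored-by: Jane Doe <jane@example.com>" (the address part is optional).
TRAILER_RE = re.compile(
    r"^Co-authored-by:[ \t]*([^<\n]+?)[ \t]*(?:<[^>\n]*>)?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def classify_commit(message: str) -> str:
    """Return "ai" if any Co-authored-by trailer names a known agent, else "human"."""
    for match in TRAILER_RE.finditer(message):
        if match.group(1).strip().lower() in AI_COAUTHORS:
            return "ai"
    return "human"


if __name__ == "__main__":
    print(classify_commit("Add retry logic\n\nCo-authored-by: claude-code"))
    print(classify_commit("Fix typo in README"))
```

A label derived this way is only as reliable as the workflow that writes the trailer, so it works best when the agent tooling adds it automatically rather than relying on engineers to remember.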
Better attribution: trace from agent session through PR to deployment and production. This requires instrumentation at each step but enables full lifecycle measurement. You can ask "which production incidents were caused by AI-generated changes in the last 90 days?" That's the question that justifies or challenges your governance investments.
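Once the source label follows a change through to deployment, the 90-day incident question becomes a simple join. The sketch below assumes two hypothetical record types, `Deployment` and `Incident`, and assumes the post-incident review records which deployment caused each incident; none of these names come from a specific tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Deployment:
    deploy_id: str
    source: str  # "ai" or "human", carried forward from the PR label


@dataclass(frozen=True)
class Incident:
    incident_id: str
    opened_on: date
    caused_by_deploy_id: str  # assumption: filled in during post-incident review


def ai_incidents_in_window(incidents, deployments, today, window_days=90):
    """Return incidents opened in the window whose causing deployment was AI-generated."""
    cutoff = today - timedelta(days=window_days)
    source_by_deploy = {d.deploy_id: d.source for d in deployments}
    return [
        i
        for i in incidents
        if i.opened_on >= cutoff
        and source_by_deploy.get(i.caused_by_deploy_id) == "ai"
    ]
```

Incidents whose causing deployment is unknown are excluded here rather than guessed at; a large unknown bucket is itself a sign that the trace is incomplete.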
The framework
AI-attributed metrics: leading vs. lagging
Leading indicators
Tell you what's coming before it shows up in delivery
AI PR first-pass acceptance rate
Falling rate = agents working without sufficient context. Rising rate = context layer improving.
PR pickup time (AI-generated)
If this is growing, your review capacity isn't scaling with your generation capacity. Coordination problem building.
Review concentration ratio
What % of AI-generated PRs are reviewed by the top 10% of reviewers? High concentration = bottleneck risk.
Eval gate pass rate on first submission
Low rate = spec quality or context quality issue. Rising rate = harness improving agent output.
Autonomous-to-escalation ratio
What % of Tier 1 changes stay autonomous vs. getting manually escalated? Low ratio = governance over-triggering.
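Two of the leading indicators above can be computed from review data alone. The following is a rough sketch, assuming a hypothetical `AiPullRequest` record with one primary reviewer per PR and a count of review rounds; treating "first pass" as "approved in a single round" is an assumption, not a definition from the text.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AiPullRequest:
    reviewer: str       # assumption: one primary reviewer per PR
    review_rounds: int  # assumption: 1 means approved with no changes requested


def first_pass_acceptance_rate(prs):
    """Share of AI-generated PRs approved in a single review round."""
    if not prs:
        return 0.0
    return sum(1 for pr in prs if pr.review_rounds == 1) / len(prs)


def review_concentration_ratio(prs, top_fraction=0.10):
    """Share of AI-generated PRs reviewed by the busiest top_fraction of reviewers."""
    if not prs:
        return 0.0
    counts = Counter(pr.reviewer for pr in prs)
    # Always include at least one reviewer so small teams still get a value.
    top_n = max(1, math.ceil(len(counts) * top_fraction))
    top_reviews = sum(n for _, n in counts.most_common(top_n))
    return top_reviews / len(prs)
```

Both functions take only AI-labeled PRs as input, so they depend on the attribution label being in place first.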
Lagging indicators
Tell you whether it actually worked, confirmed after the fact
Change failure rate by source
The most important metric. Does AI-generated code cause more production incidents than human-written code? If yes, by how much, and in which categories?
AI-attributed incident rate
Of P1/P2 incidents in the last quarter, what % were traced to AI-generated changes? This is the governance health metric that matters to the board.
Code churn ratio (AI vs. human)
AI-generated code that gets rewritten within 2 weeks indicates context or spec quality problems. Should trend down as context layer improves.
Deployment frequency (AI-generated features)
Are features implemented by agents actually deploying faster than human-implemented features? If not, the coordination overhead is consuming the gains.
MTTR segmented by change source
If AI-generated changes take longer to recover from when they fail, that's a harness quality signal. Your eval gates aren't catching subtle behavioral issues.
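The churn comparison is the least obvious of these to compute, so here is one possible sketch. It assumes a hypothetical `Change` record in which your tooling has already counted how many of a change's added lines were modified or deleted within the 2-week window; how that count is produced (for example from blame history) is left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    source: str                 # "ai" or "human", from the attribution label
    lines_added: int
    lines_rewritten_early: int  # assumption: lines modified or deleted within 14 days


def churn_ratio_by_source(changes):
    """Return rewritten-lines / added-lines for each change source."""
    added = defaultdict(int)
    rewritten = defaultdict(int)
    for change in changes:
        added[change.source] += change.lines_added
        rewritten[change.source] += change.lines_rewritten_early
    # Skip sources with no added lines to avoid dividing by zero.
    return {s: rewritten[s] / added[s] for s in added if added[s] > 0}
```

The same group-by-source pattern applies to change failure rate and MTTR: swap the numerator and denominator for incidents and deployments, or for recovery time and failed changes.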
Leader artifact
The 4-metric CTO dashboard
These four metrics tell the story you need to tell the board. They're based on DORA's 7-capability model, segmented for AI-native orgs.
Delivery throughput delta
Formula
Deployment frequency (current quarter) vs. deployment frequency (same quarter, 12 months ago)
Target direction
Should be increasing. If flat despite AI adoption, you have a coordination problem.
Note
Ideally broken out: AI-generated features vs. human-implemented features
AI-attributed change failure rate
Formula
Incidents caused by AI-generated changes / Total AI-generated deployments
Target direction
Should be at parity with or below human-authored change failure rate. Above means governance gaps.
Note
Critical: must be segmented from overall CFR
Review leverage ratio
Formula
AI-generated PRs merged autonomously / Total AI-generated PRs
Target direction
Should be increasing as harness matures and context layer improves. Stagnation means coordination overhead isn't reducing.
Note
Track trend over 90 days, not point-in-time
Coordination overhead index
Formula
Review time (AI PRs) / Review time (human PRs)
Target direction
Should approach 1.0 as context and harness improve. Above 2.0 is a problem. Below 1.0 means AI-generated PRs are cheaper to review than human ones, which is the sign your harness is working.
Note
Track by team. High ratios indicate teams with poor context infrastructure
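The four formulas can be computed together from a handful of quarterly counts. This is a sketch under stated assumptions: the `QuarterSnapshot` fields are hypothetical names, the throughput delta is expressed as a ratio (current over year-ago), and review time is summarized as a median, none of which the dashboard definitions above prescribe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuarterSnapshot:
    deployments_this_quarter: int
    deployments_year_ago_quarter: int
    ai_deployments: int
    ai_caused_incidents: int
    ai_prs_total: int
    ai_prs_merged_autonomously: int
    median_review_hours_ai: float     # assumption: median, not mean
    median_review_hours_human: float


def _ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def cto_dashboard(s):
    """Compute the four dashboard metrics from one quarter's counts."""
    return {
        # Above 1.0 means more deployments than the same quarter a year ago.
        "delivery_throughput_delta": _ratio(
            s.deployments_this_quarter, s.deployments_year_ago_quarter
        ),
        "ai_attributed_cfr": _ratio(s.ai_caused_incidents, s.ai_deployments),
        "review_leverage_ratio": _ratio(
            s.ai_prs_merged_autonomously, s.ai_prs_total
        ),
        "coordination_overhead_index": _ratio(
            s.median_review_hours_ai, s.median_review_hours_human
        ),
    }
```

Running this per team and per quarter gives the trend lines the notes above ask for; a single snapshot says little on its own.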
Board reporting
Reporting AI engineering ROI honestly
The board wants ROI. The honest answer at most orgs right now: "Individual productivity is up. Organizational delivery improvement is lagging by 6–12 months as we build the coordination infrastructure. Here's the evidence that the infrastructure investment is on track."
The key move: separate the argument into two parts. Part one is the individual productivity gains (these are real, and vendor data is fine for this part). Part two is the progress on infrastructure (your coordination overhead index, your AI-attributed CFR, your review leverage ratio). The board's question is whether you'll eventually capture the full value. Your infrastructure metrics are the evidence.
What you should not do: report individual productivity metrics as organizational ROI. The DORA data is public. If the board has a sophisticated technical advisor, they'll ask the question you don't want to answer: "If developers are 40% faster, why isn't your deployment frequency up?" Have the answer ready, or have the metrics that show you're building toward it.