How do you ensure AI builds what you actually need?
Code is regenerable. Specs are not. When agents can generate thousands of lines per hour, the highest-leverage investment shifts from writing code to defining intent.
GitHub's Spec Kit has 112,000 stars. Engineers are clearly looking for something.
They're looking for the same thing engineering leaders need: a way to ensure that agent-generated code actually reflects what the organization intended to build. The faster code generation gets, the more the limiting factor becomes intent definition. If you can't precisely specify what you want, a faster agent just gives you the wrong thing faster.
Spec-driven development is the practice of writing explicit, structured intent documents before agents write code, then using those documents as the source of truth that governs, evaluates, and validates agent output. It's not a new methodology. It's the natural response to an old problem, at higher velocity.
The shift
Code is regenerable. Specs are the real artifact.
In a world where agents can generate working code from a well-written spec, the code itself becomes a derivative. If you have the spec, you can regenerate the code. If you have only the code, you don't know what the spec was, and you don't know whether the code correctly implements it.
This changes the economics of spec work entirely. Traditionally, engineers resisted writing detailed specs because the cost was high (time) and the value was unclear (who reads them?). In an agent-native workflow, the spec is the primary input. Writing a good spec means the agent produces correct code on the first pass. Writing a vague spec means review debt, rework cycles, and the coordination failures described in Module 00.
Microsoft's Specification-Driven Development work puts it precisely: "Specifications contain information about how to implement a task, including context, intent, requirements, constraints, interface contract, and validation criteria." The spec is the stop condition for the agent. Without it, the agent has no way to know when it's done.
“Spec-first development is the practice of writing down what you intend to build, not just for humans to read, but as a structured document that agents can execute against.”
Microsoft Developer Blog, 2026
The framework
The spec–context–evals trinity
Every durable AI engineering workflow has three components. Missing any one of them produces a specific failure mode.
01
Spec
The structured intent document: what are we building, why, what constraints apply, what does correct behavior look like? The spec is the source of truth. Agents generate code from it. Reviewers evaluate code against it.
02
Context
The organizational knowledge that makes the spec executable: architecture, conventions, adjacent services, prior decisions, regulatory requirements. Without context, agents produce spec-compliant code that's architecturally wrong for your specific system.
03
Evals
The acceptance criteria that determine whether the code correctly implements the spec. Not just test passing, but behavioral verification: does this actually do what the spec says? Evals are how you close the loop between intent and implementation.
Without a spec: The agent gets a ticket title and a conversation thread as context. It produces plausible code that misses three implicit requirements.
With a spec: The agent gets a structured spec with explicit requirements, architectural context, and acceptance criteria. First-pass acceptance rate increases significantly.
Without a spec: The reviewer can't tell if AI code is correct because there's no reference for 'correct' other than their own mental model.
With a spec: The reviewer evaluates code against the spec. The spec is the reference. Review becomes faster and more consistent.
Without a spec: When a production incident occurs, it is difficult to trace whether the implementation matched the intent.
With a spec: The spec is an immutable artifact. The audit trail from intent to implementation to production outcome is clear.
Anatomy of a spec
What a good spec contains
This isn't a prompt template. It's an organizational artifact that governs agent work.
# Spec: Refund API, Partial Refund Support
## Intent
Add partial refund capability to the Payments service. Customers should be able to
request refunds for specific line items, not just full orders.
## Context
- Payments service: /services/payments (owns Stripe integration)
- Refund decisions go through the RiskEngine before processing
- All monetary amounts are stored as integers (cents), never floats
- Stripe API docs: https://stripe.com/docs/api/refunds (internal mirror at /docs/stripe)
- Related decision: ADR-042 (why we do NOT use Stripe's native partial refund)
## Requirements
- Accept line_item_ids[] in refund request (1 to N items from the original order)
- Validate total refund amount ≤ original charge amount
- Route to RiskEngine.assess() before processing; block if risk_score > 0.7
- Emit refund.initiated event to the payments event stream
- Idempotent: same request_id must not produce duplicate refunds
## What's out of scope
- Refund approval workflows (handled by the merchant portal, not this API)
- Multi-currency refunds (follow-up spec: payments/refund-multicurrency.md)
## Acceptance criteria
- [ ] Partial refund for valid line items completes and returns 200
- [ ] Refund for amount > original charge returns 422 with validation error
- [ ] Duplicate request_id returns 200 (idempotent, no duplicate processing)
- [ ] High-risk refund (score > 0.7) returns 403 with risk_code in body
- [ ] refund.initiated event is emitted with correct line_item_ids
## Constraints
- No floats anywhere in the payment flow (lint rule: no-float-money)
- All external API calls must use the internal retry wrapper (lib/resilience)
- Test coverage ≥ 90% for the refund path
Implementation levels
Three levels of spec rigor
You don't need the full template for every change. Match the rigor to the risk.
Spec-first (high-risk changes)
Full spec written before any agent work begins. Contains context, requirements, out-of-scope, and acceptance criteria. Required for: new APIs, new services, changes to security or payment flows, regulatory requirements. The spec is reviewed and approved before agents are started.
Spec-anchored (standard changes)
Abbreviated spec: 3–5 bullet requirements, clear acceptance criteria, explicit constraints. Written in the ticket or PR description. The agent is started with this as context, and the PR is reviewed against it. Appropriate for most feature work.
Spec-as-source (refactors, bug fixes)
The existing code IS the spec. The agent is given the current behavior as the target to preserve, plus the specific defect to fix. The acceptance criteria are 'existing tests pass, new test covering the bug passes.' Lower overhead, appropriate for lower-risk changes.
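One way to make "match the rigor to the risk" enforceable is a small routing function, for example in a PR bot or CI check. The sketch below is illustrative only: the change-type labels (`new_api`, `bug_fix`, and so on) and the function name are assumptions, not part of any existing tool, and the groupings simply mirror the three levels described above.

```python
from __future__ import annotations

from enum import Enum


class SpecLevel(Enum):
    SPEC_FIRST = "spec-first"
    SPEC_ANCHORED = "spec-anchored"
    SPEC_AS_SOURCE = "spec-as-source"


# Hypothetical change-type labels; adapt these to your own ticket taxonomy.
HIGH_RISK = {"new_api", "new_service", "security", "payments", "regulatory"}
LOW_RISK = {"refactor", "bug_fix"}


def required_spec_level(change_types: set[str]) -> SpecLevel:
    """Return the most rigorous spec level demanded by any label on the change."""
    # Any high-risk label forces the full spec, even on a bug fix.
    if change_types & HIGH_RISK:
        return SpecLevel.SPEC_FIRST
    # Only pure refactors and bug fixes may treat the existing code as the spec.
    if change_types and change_types <= LOW_RISK:
        return SpecLevel.SPEC_AS_SOURCE
    # Everything else, including unlabeled work, gets the abbreviated spec.
    return SpecLevel.SPEC_ANCHORED
```

Under these assumptions, a bug fix that touches the payment flow is still routed to spec-first, because the highest-risk label wins.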
The governance connection
A spec is a stop condition
From a governance perspective, the spec is the mechanism that prevents agents from running indefinitely or expanding scope beyond what was intended. Without a spec, there's no authoritative answer to "is the agent done?" With a spec, the acceptance criteria define done.
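The stop-condition idea can be sketched as a bounded loop: the agent retries until every acceptance check passes or an attempt budget runs out. This is a minimal illustration, not a real agent API. `generate` stands in for whatever produces agent output, and `run_until_done` and `max_attempts` are hypothetical names.

```python
from __future__ import annotations

from typing import Callable, Optional

# A criterion is any check that inspects agent output and returns pass/fail.
Criterion = Callable[[str], bool]


def run_until_done(
    generate: Callable[[int], str],
    criteria: list[Criterion],
    max_attempts: int = 3,
) -> tuple[Optional[str], int]:
    """Call generate(attempt) until all criteria pass or the budget is spent.

    Returns (accepted_output, attempts_used); the output is None on failure.
    """
    if not criteria:
        # Without acceptance criteria there is no definition of "done".
        raise ValueError("no acceptance criteria: 'done' is undefined")
    for attempt in range(1, max_attempts + 1):
        output = generate(attempt)
        if all(check(output) for check in criteria):
            return output, attempt
    return None, max_attempts
```

Refusing to run with an empty criteria list is a deliberate choice in this sketch: it turns "the spec is vague" into an explicit error instead of an agent that never knows when to stop.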
This connects directly to Module 04's harness engineering: the eval harness runs acceptance criteria from the spec. If the spec has clear, testable criteria, the harness can verify completion automatically. If the spec is vague, the harness has nothing to verify against, and human review can't be meaningfully automated.
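As a rough sketch of what "the harness runs acceptance criteria from the spec" could look like, the code below pulls the checklist out of a markdown spec and reports which criteria have no automated check registered. The parsing rules, the function names, and the idea of keying checks by criterion text are all assumptions made for illustration; a real harness would likely use stable IDs.

```python
from __future__ import annotations

import re
from typing import Callable

# Matches markdown task-list items such as "- [ ] text" or "- [x] text".
CHECKBOX = re.compile(r"^- \[( |x)\] (.+)$")

SPEC = """\
# Spec: Refund API, Partial Refund Support
## Intent
Add partial refund capability to the Payments service.
## Acceptance criteria
- [ ] Partial refund for valid line items completes and returns 200
- [ ] Refund for amount > original charge returns 422 with validation error
"""


def acceptance_criteria(spec_text: str) -> list[str]:
    """Collect checklist items under the '## Acceptance criteria' heading."""
    criteria: list[str] = []
    in_section = False
    for line in spec_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            in_section = stripped.lower() == "## acceptance criteria"
            continue
        match = CHECKBOX.match(stripped)
        if in_section and match:
            criteria.append(match.group(2))
    return criteria


def unverifiable(criteria: list[str], checks: dict[str, Callable[[], bool]]) -> list[str]:
    """Return criteria with no registered automated check."""
    return [c for c in criteria if c not in checks]
```

Anything returned by `unverifiable` is a criterion that still depends on human review, which gives a concrete measure of how much of the spec the harness can actually verify.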
For leaders: your investment in spec quality directly determines how much of your review burden can be automated. Clear specs with testable acceptance criteria are the prerequisite for autonomous verification.
Getting started
How to introduce specs this week
Day 1
Pick one change type
Don't change everything at once. Pick your highest-volume, highest-risk change type (new API endpoints, schema migrations, security-adjacent changes). Require spec-first for that type only. See what changes.
Week 1
Write the template
Create a spec template for your org. It should take 20 minutes to fill out, not 2 hours. Include: intent (1 paragraph), context (links to relevant docs), requirements (5–10 bullets), out of scope, acceptance criteria (testable).
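If the template lives in code or is linted in CI, a completeness check keeps it honest. The following is a minimal sketch, assuming the five fields listed above; the class name, field names, and the decision to treat an empty out-of-scope list as a problem are illustrative choices, not a standard.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Spec:
    intent: str
    context_links: list[str] = field(default_factory=list)
    requirements: list[str] = field(default_factory=list)
    out_of_scope: list[str] = field(default_factory=list)
    acceptance_criteria: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return human-readable gaps; an empty list means the template is filled in."""
        issues: list[str] = []
        if not self.intent.strip():
            issues.append("intent is empty")
        if not self.context_links:
            issues.append("no context links")
        # The 5-10 range follows the template guidance above.
        if not 5 <= len(self.requirements) <= 10:
            issues.append("expected 5-10 requirements")
        if not self.out_of_scope:
            issues.append("out of scope is empty")
        if not self.acceptance_criteria:
            issues.append("no acceptance criteria")
        return issues
```

A check like this only confirms that the sections exist. Whether the criteria are actually testable still needs a reviewer's judgment.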
Week 2
Measure first-pass acceptance
Track first-pass acceptance rate for PRs with specs vs. PRs without. This is your ROI measurement. In most cases, spec'd work has significantly higher acceptance rates: rework cycles decrease and review conversations shorten.
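The comparison can be computed from whatever PR data you already export. The sketch below assumes one particular definition of "first-pass acceptance", namely a PR approved in its first review round; the `PullRequest` fields are hypothetical and would need to be mapped from your own tooling.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PullRequest:
    has_spec: bool
    review_rounds: int  # assumption: 1 means approved on the first review


def first_pass_rates(prs: list[PullRequest]) -> dict[str, Optional[float]]:
    """Share of PRs accepted in one review round, split by spec vs. no spec."""
    rates: dict[str, Optional[float]] = {}
    for has_spec, label in ((True, "with_spec"), (False, "without_spec")):
        group = [pr for pr in prs if pr.has_spec == has_spec]
        # None signals "no data" rather than a misleading 0% rate.
        rates[label] = (
            sum(pr.review_rounds == 1 for pr in group) / len(group) if group else None
        )
    return rates
```

Keep the definition fixed across both groups and across weeks; the trend is only meaningful if "first pass" means the same thing each time it is measured.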
Month 1
Expand to your context layer
Once specs are working for individual features, think about how they connect to your organizational context. Your architectural decision records, your service-level contracts, your coding conventions: these are specs for your system as a whole.