LoomStack
Engineering Practice · Jun 2026 · 13 min read

Agentic Engineering: The Discipline Behind AI-Native Teams.

Adding AI agents to an engineering org without changing how the org operates is like adding ten junior engineers with no onboarding, no shared context, and no code review. The discipline of coordinating agents at production scale is what separates teams shipping reliably from teams firefighting every week.

LoomStack Team
Jun 9, 2026

At Sequoia Capital's AI Ascent 2026, Andrej Karpathy retired the term he coined just one year earlier. Vibe coding, he said, was the right name for 2025. For 2026, he had a new one: agentic engineering.

Vibe coding raised the floor. Anyone could describe a feature and get working code back. Agentic engineering raises the ceiling. It is the professional discipline of coordinating fallible, stochastic agents while preserving correctness, security, and quality. The first is about access. The second is about what you do once you have it.

57% of engineering organizations have agents in production today, according to LangChain's 2026 State of AI Agent Engineering survey of 1,300 professionals. 72% of enterprise AI projects involve multi-agent architectures, up from 23% in 2024. Anthropic reports that 80% of its new production code is now written by Claude and that each engineer ships 8x more code per quarter than the 2021–2025 baseline. The adoption numbers are not the story. Most organizations running agents are not running them well. The production failure rate, the review time explosion, the bug acceleration: those are the story.

The teams doing this right are not just using better models. They have built something different: a coordination layer around their agents that gives them shared context, policy enforcement, and feedback loops. That is what agentic engineering actually means in practice.

What changed in December 2025

Karpathy is precise about when the transition happened. In November 2025, he was writing roughly 80% of his code himself and using AI for the rest. By December, that ratio had inverted. He was delegating 80% to agents. "I can't remember the last time I corrected it," he said. "I just trusted the system more and more."

The same shift showed up at Anthropic. Boris Cherny, head of Claude Code, announced in January 2026 that he had not written any code in more than two months. Top engineers at both Anthropic and OpenAI said AI now writes 100% of their code. In May 2026, Anthropic reported that Claude's success rate on complex, open-ended engineering problems reached 76%, a 50-point increase in six months. One engineer deployed Claude to fix a persistent class of API errors; it shipped 800+ fixes autonomously and reduced the error rate by a factor of 1,000.

These numbers describe something real. But they also describe what the best-equipped teams can do when they have invested in the infrastructure to make agents safe to run autonomously. They are not what most teams experience.

Most teams experience the other side of the DORA data.

The metrics that do not get written about

The 2025 DORA State of AI-assisted Software Development report surveyed 5,000 technology professionals and found 90% are using AI at work, with 80% reporting increased individual productivity. The headline reads well. The rest of the data is harder to spin.

Faros AI measured engineering telemetry alongside DORA across real organizations. Their findings: median PR review time up 91% in 2025 data and 441% in 2026 data. PR size up 154%. Bugs per developer up 9% in 2025, accelerating to 54% by 2026. Incidents per PR up 242%. 31% more PRs merging with no review at all. Individual productivity went up. Organizational delivery stability went the other direction.

The DORA report's conclusion: AI acts as an amplifier. It magnifies the strengths of high-performing organizations and the dysfunctions of struggling ones. If your coordination infrastructure was weak before, agents make it worse faster. If your review process was already overloaded, agents accelerate the overload. The model is not the variable. The infrastructure around it is.

This is what we analyzed in The AI Engineering Coordination Bottleneck: adding velocity without adding coordination infrastructure produces the same compounding chaos Fred Brooks described for human teams in The Mythical Man-Month. The DORA data shows it happening in production.

Two agents, half the results

CooperBench, published on arXiv in early 2026, tested multi-agent collaboration directly. The finding is stark: GPT-5 and Claude Sonnet 4.5 achieve only 25% success when two agents collaborate, roughly half the 50% success rate of a single agent handling the same workload. Broader analysis of multi-agent configurations found performance degradation of 39% to 70% relative to single-agent baselines.

The paper identifies three root causes. Agents fail to communicate actionable information. Agents deviate from their own stated commitments. Agents hold incorrect expectations about what their collaborating partner is doing. These are not model capability failures. They are coordination failures. The models can write the code. They cannot coordinate on who is writing what, why, and whether the combined result will hold together.

"Coordination, not raw coding ability, is the central bottleneck for multi-agent software development."

"Externally imposed protocols mask rather than solve the underlying coordination problem."

CooperBench: Why Coding Agents Cannot be Your Teammates Yet

Protocols imposed from outside the agents do not fix this. What fixes it is shared context: a persistent memory layer that each agent reads from and writes to, so that every agent in a workflow knows what the others know and what they have already decided. Without that layer, you have independent agents each solving their own version of the problem. They may produce individually correct outputs that combine into something incoherent.
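A minimal sketch of the read-before-deciding pattern described above, assuming an in-memory store in place of a real persistent layer. The names `SharedContext`, `Decision`, and `choose_token_format`, and the topic string, are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Decision:
    agent: str  # which agent recorded the decision
    topic: str  # what the decision is about, e.g. "auth/token-format"
    value: str  # the decision itself


@dataclass
class SharedContext:
    """Hypothetical in-memory stand-in for a persistent shared memory layer."""
    _log: List[Decision] = field(default_factory=list)

    def record(self, agent: str, topic: str, value: str) -> None:
        # Append-only: earlier decisions stay in the log for later inspection.
        self._log.append(Decision(agent, topic, value))

    def latest(self, topic: str) -> Optional[Decision]:
        # Most recent decision on a topic, or None if nobody has decided yet.
        for decision in reversed(self._log):
            if decision.topic == topic:
                return decision
        return None

    def history(self, topic: str) -> List[Decision]:
        return [d for d in self._log if d.topic == topic]


def choose_token_format(ctx: SharedContext, agent: str, preferred: str) -> str:
    """Read before deciding; write only when no decision exists yet."""
    existing = ctx.latest("auth/token-format")
    if existing is not None:
        return existing.value
    ctx.record(agent, "auth/token-format", preferred)
    return preferred


ctx = SharedContext()
backend_choice = choose_token_format(ctx, "backend-agent", "jwt")
client_choice = choose_token_format(ctx, "client-agent", "opaque")
```

In this sketch the second agent adopts the first agent's recorded choice instead of acting on its own preference, so the two outputs stay compatible. A production layer would also need persistence, access control, and conflict handling, none of which is shown here.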

The Governed Memory paper (arXiv 2603.17787) put the implication clearly: "The bottleneck in multi-agent AI shifts from which model is best to who controls shared memory, access, and context flow across agents." The model choice is table stakes. The infrastructure is what determines outcomes.

The four failure modes in production

Teams scaling agents in 2026 are running into the same failure patterns independently. Augment Code's analysis of engineering organizations transitioning to agentic workflows identified four recurring problems that show up once agents move beyond isolated use to multi-team production.

Context fragmentation

Multi-agent systems fragment context by design. Each agent runs in an isolated context window, so unless you build an explicit layer to share state, agents operating on related work have no common ground. They produce outputs that are individually reasonable and collectively inconsistent. Synchronization overhead grows with the number of agents.

Agent drift

Prompts and agent configurations evolve informally. One engineer tightens a system prompt, another loosens a tool definition, a model version updates silently. Without version-controlled agent configurations and policy enforcement, the behavior of your agent fleet drifts in ways that are invisible until something breaks in production.
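One way to make drift visible, sketched under the assumption that an agent's configuration can be represented as a plain dictionary: fingerprint the approved configuration and compare it with what is actually running. The field names and values below are invented for illustration.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Stable SHA-256 digest of an agent configuration."""
    # Sorting keys makes the digest independent of dictionary ordering.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drifted_fields(approved: dict, running: dict) -> list:
    """Top-level keys whose values differ between the two configurations."""
    keys = set(approved) | set(running)
    return sorted(k for k in keys if approved.get(k) != running.get(k))


# Hypothetical approved configuration, as it might be stored in version control.
approved = {
    "model": "example-model-v1",
    "system_prompt": "Only modify files under services/billing.",
    "tools": ["read_file", "run_tests"],
}
# The running configuration has quietly gained an extra tool.
running = dict(approved, tools=["read_file", "run_tests", "shell"])

has_drifted = fingerprint(approved) != fingerprint(running)
changed = drifted_fields(approved, running)
```

A check like this could run at agent startup or in CI. It only detects that a configuration changed and where; deciding whether the change is acceptable remains a review step.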

Knowledge silo formation

An agent that resolved a class of incidents over two weeks has learned patterns that no other agent or human has access to. When that agent's session ends, the context evaporates. A new agent starting the next shift begins from scratch. The organizational knowledge that accumulates through work is trapped in ephemeral context windows instead of building into a persistent memory that compounds over time.

Governance gaps

As agent adoption outpaces operating design, organizations discover that nobody defined which decisions agents can make autonomously and which require a human. The result is not usually catastrophic immediately. It is the gradual accumulation of automated decisions that nobody reviewed, followed by a production incident that is hard to trace because the decision trail is incomplete.
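The gap described here has two parts: an explicit boundary and a complete trail. A minimal sketch of both follows, assuming a simple allowlist of autonomous actions; the action names and the `AuditEntry` shape are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical boundary: only actions listed here may run without a human.
AUTONOMOUS_ACTIONS = {"format_code", "update_docs", "add_unit_test"}


@dataclass(frozen=True)
class AuditEntry:
    agent: str
    action: str
    outcome: str  # "executed" or "escalated"


def decide(agent: str, action: str, trail: List[AuditEntry]) -> str:
    """Route an action and record the routing decision either way."""
    outcome = "executed" if action in AUTONOMOUS_ACTIONS else "escalated"
    trail.append(AuditEntry(agent, action, outcome))
    return outcome


trail: List[AuditEntry] = []
decide("docs-agent", "update_docs", trail)
decide("infra-agent", "rotate_credentials", trail)
escalated = [entry.action for entry in trail if entry.outcome == "escalated"]
```

Every decision lands in the trail, including the ones that ran autonomously, so a later incident review can reconstruct what was decided and by which agent.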

These are not hypothetical risks. The Faros data showing incidents per PR up 242% and median review time up 441% is what these failure modes produce at scale. The code volume is up. The coordination infrastructure has not kept pace.

The question is what the teams getting this right are doing differently.

How teams are restructuring around agents

The structural shift that agentic engineering requires is not about headcount. It is about the relationship between humans and execution. The old model: humans execute, tools assist. The new model: humans steer, agents execute.

Matthew Kruczek's analysis of what engineering organizations look like when built natively for agents describes it concretely: small cross-functional pods of four people directing a large, flexible set of agents. Four people, not more, because the limit is no longer how many engineers are typing. It is how many streams of work a small group can direct and review before quality slips. "The agents are the output. The people are the judgment."

Each pod runs 20+ agents simultaneously across different pieces of a problem. Two people handle design: writing specs, directing agents, reviewing combined output. The work is no longer typing code. It is making sure the code the agents produced is coherent, correct, and worth shipping. You add capacity by adding agents to a fixed core of people, not by adding management layers between work and decision-making.

This structure only works if the platform layer underneath it is solid. Kruczek calls it the part most companies underbuild: the shared agent platform, the connections to your systems, the testing setup, and the guardrails that keep agents inside their boundaries. The pod model breaks down without it. Agents operating without shared context and policy enforcement produce the CooperBench 25% success rate, not the Anthropic 8x multiplier.

New engineering roles are crystallizing around this infrastructure. Augment Code's analysis of agentic engineering operating models identifies roles that did not exist two years ago: Context Engineer (managing memory, tool selection, context-window policies across agents), Agent Reliability Engineer (production monitoring and behavioral consistency), Agent Orchestration Engineer (multi-agent handoffs and output synchronization), AI Governance Owner (defining autonomy boundaries and maintaining audit trails). These are not renamings of existing roles. They are responses to real operational gaps that emerged once agents moved from demos to production systems.

"Moving from humans execute, tools assist to humans steer, agents execute creates a real tradeoff. Companies can increase execution speed quickly, but they can also overload review capacity, weaken governance, and fragment knowledge if agent adoption scales faster than operating design."

Augment Code, Agentic Engineering Operating Model

The teams that have moved fastest are the ones that treated this as an architecture decision, not a tooling decision. They did not just add AI tools to existing workflows. They redesigned the workflows around agents and built the coordination infrastructure those workflows require.

What the infrastructure layer actually needs

LeadDev's analysis of designing human-agent engineering teams identifies four layers that cannot be left to chance: work (who is responsible for what), coordination (how information reaches the right agents and humans), shared context (what everyone knows), and governance (which decisions can be automated and which require humans). In human teams, most of these are established by culture and habit. In human-agent teams, they have to be explicit.

The shared context layer is where most organizations underinvest first. Anthropic's multi-agent research architecture uses subagents with isolated context windows, each returning a condensed summary of 1,000–2,000 tokens to a shared layer. Token usage explains 80% of performance variance on their internal benchmarks. The architecture outperformed single-agent Claude Opus 4 by 90.2% on their internal research evaluation. The performance gain did not come from a better model. It came from better context management.

The Governed Memory research quantified what systematic context governance produces in production: roughly 50% token savings from progressive context delivery, fast-path routing at around 850ms, and bounded retrieval that prevents context bloat from degrading model attention on fresh instructions. These are not marginal improvements. At the scale of dozens of agents operating continuously across engineering workflows, they determine whether the system is economically viable and whether it stays reliable over time.
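Bounded retrieval can be illustrated with a small sketch. This is not the method from the Governed Memory paper. It assumes keyword overlap as the relevance score and whitespace-separated words as a rough proxy for tokens, and the function names are hypothetical.

```python
from typing import List, Tuple


def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def bounded_retrieve(query: str, memories: List[str], budget: int) -> List[str]:
    """Return the most relevant memories that fit within `budget` tokens."""
    query_words = set(query.lower().split())
    scored: List[Tuple[int, int, str]] = []
    for index, memory in enumerate(memories):
        overlap = len(query_words & set(memory.lower().split()))
        if overlap > 0:
            scored.append((overlap, index, memory))
    # Highest overlap first; ties keep their original order.
    scored.sort(key=lambda item: (-item[0], item[1]))

    selected: List[str] = []
    used = 0
    for _, _, memory in scored:
        cost = approx_tokens(memory)
        if used + cost <= budget:
            selected.append(memory)
            used += cost
    return selected


memories = [
    "billing service retries failed charges three times",
    "schema migration for billing requires a maintenance window",
    "frontend uses design tokens from the shared package",
]
tight = bounded_retrieve("billing schema migration", memories, budget=10)
loose = bounded_retrieve("billing schema migration", memories, budget=20)
```

With the tight budget only the closest match is delivered; with the looser budget the second relevant memory is added, and the unrelated one is never included. The hard cap is the point: context handed to an agent cannot grow past the budget no matter how much memory accumulates.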

The governance layer is where most organizations underinvest second. As we argue in Adaptive Autonomy: Why "Fully Autonomous AI" Is the Wrong Goal, the binary of full autonomy versus full human control is a false choice. What works in production is risk-calibrated autonomy: agents earn autonomy incrementally through demonstrated competence, and the system knows the difference between a low-risk refactoring and a high-risk schema migration. That distinction has to be encoded in policy, not left to individual agent judgment.
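Encoding that distinction in policy might look like the following sketch. The risk tiers, the thresholds, and the rule that a failure resets earned trust are all illustrative assumptions, not a recommended calibration.

```python
from dataclasses import dataclass

# Illustrative thresholds: consecutive successful runs an agent needs before
# it may act without review at each risk tier. "high" is deliberately absent,
# so in this sketch high-risk work always requires a human.
REQUIRED_TRACK_RECORD = {"low": 5, "medium": 50}


@dataclass
class AgentRecord:
    name: str
    successes: int = 0
    failures: int = 0

    def record(self, ok: bool) -> None:
        if ok:
            self.successes += 1
        else:
            # Assumption for this sketch: a failure resets earned trust.
            self.successes = 0
            self.failures += 1


def needs_human(agent: AgentRecord, risk: str) -> bool:
    """True when policy requires human review for this agent and risk tier."""
    required = REQUIRED_TRACK_RECORD.get(risk)
    if required is None:
        return True  # unknown or high-risk tiers always escalate
    return agent.successes < required


agent = AgentRecord("refactor-agent")
for _ in range(6):
    agent.record(True)

low_before = needs_human(agent, "low")      # refactoring-style change
medium_before = needs_human(agent, "medium")
high_before = needs_human(agent, "high")    # e.g. a schema migration

agent.record(False)
low_after = needs_human(agent, "low")
```

The decision lives in `needs_human`, outside the agent, so an agent cannot talk itself into more autonomy than its track record supports.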

LangChain's survey found that 89% of organizations have implemented some form of agent observability and 62% have detailed tracing that allows inspecting individual agent steps. That is progress. But 32% still cite quality as the top production barrier, and 30% of developers report little or no trust in AI-generated code even while using it daily. The observability is there. The feedback loops that close on it are not.

This connects directly to the testing argument in Test-Driven Development Is More Important Than Ever: observability without closed feedback loops is a report nobody acts on. The teams running agents reliably have wired their traces into evaluation pipelines that catch regressions before they reach production, rather than just surfacing them afterward.
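A closed loop can be as simple as a gate that compares evaluation results for a candidate agent configuration against a baseline and blocks the rollout on a regression. The sketch below assumes each suite yields a list of pass/fail results; the suite names and the 5-point tolerance are hypothetical.

```python
from typing import Dict, List


def pass_rate(results: List[bool]) -> float:
    # An empty or missing suite counts as a 0% pass rate.
    return sum(results) / len(results) if results else 0.0


def regressions(baseline: Dict[str, List[bool]],
                candidate: Dict[str, List[bool]],
                tolerance: float = 0.05) -> List[str]:
    """Suites whose pass rate dropped by more than `tolerance`."""
    failing = []
    for suite, base_results in baseline.items():
        drop = pass_rate(base_results) - pass_rate(candidate.get(suite, []))
        if drop > tolerance:
            failing.append(suite)
    return sorted(failing)


baseline = {
    "refactor": [True] * 9 + [False],
    "migration": [True] * 8 + [False] * 2,
}
candidate = {
    "refactor": [True] * 9 + [False],
    "migration": [True] * 6 + [False] * 4,
}

blocked_by = regressions(baseline, candidate)
should_block_rollout = len(blocked_by) > 0
```

Here the candidate holds steady on refactoring but drops from 80% to 60% on migrations, so the gate names that suite and blocks the rollout. The trace data only becomes a feedback loop once something like this acts on it.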

What Karpathy actually meant

The 10x engineer used to mean someone who wrote code faster than everyone else. Karpathy's claim at Sequoia is that the ceiling is far higher now, not because engineers type less, but because engineers who understand systems deeply can direct agents across far more simultaneous workstreams than a single person could previously handle.

His description of what the agentic engineer does is precise and worth reading literally: "design specs, supervise plans, inspect diffs, write tests, create evaluation loops, manage permissions, isolate worktrees, and preserve quality." None of those are passive activities. None of them are about trusting the model to get it right. All of them are forms of engineering judgment applied to a system that requires more oversight, not less, as it operates faster and more autonomously.

This is what distinguishes agentic engineering from vibe coding at the organizational level. Vibe coding is what an individual does with an AI tool. Agentic engineering is what a team does when it decides to treat agents as first-class actors in its engineering system, with all the coordination, governance, and observability infrastructure that implies. The failure modes of naive multi-agent systems are well documented. The teams avoiding them have not found better models. They have built better environments for their models to operate in.

The CIO Magazine framing from early 2026 captures the operating principle: delegate, review, own. AI agents handle first-pass execution. Engineers review for correctness, risk, and architectural alignment. Ownership of trade-offs and outcomes stays human. That clarity is what makes autonomous execution safe to run at scale. Without it, you get fast execution and unclear accountability, which is a different kind of problem than slow execution.

The teams at Anthropic and Stripe that are making this work are not the teams with the most agents. They are the teams with the clearest operating model: agents that know their scope, humans who know their role, and infrastructure that enforces the boundary between them.

The discipline, not the tools

The model choice matters less than most organizations think. The 25% two-agent success rate from CooperBench used frontier models. The 90.2% improvement from Anthropic's multi-agent architecture used the same underlying models as their single-agent baseline. The difference was context management and coordination design.

Agentic engineering is the discipline of building that coordination design deliberately: shared memory that persists across agent sessions, policy enforcement that defines autonomy boundaries, observability that closes feedback loops rather than just surfacing data, and team structures that put human judgment where it matters most. The discipline is not new. Good engineering organizations have always needed coordination infrastructure. What is new is that agents make the absence of it immediately and visibly expensive.

Karpathy's inflection point was December 2025. For most engineering organizations, the inflection point is now. 57% have agents in production. The metrics on what happens without coordination infrastructure are in. The teams that build this layer now will compound that advantage. The teams that skip it will spend most of their new speed on debugging what the agents broke.

Agents are not difficult to deploy. Deploying them in ways that hold together over time requires the same rigor every other distributed system has always required. That is what agentic engineering means.

Build the coordination layer with us

LoomStack is the orchestration and governance layer for agentic engineering teams: shared context, policy enforcement, observability, and feedback loops built into every agent workflow from day one.