Key takeaways
- Mega-agents fail in production. The right unit is an atomic, declaratively-defined agent that can be tested, audited, and swapped without touching the rest of the mesh.
- The agent harness is fully commoditized — Claude Code and OpenCode won. There is nothing to build or buy at that layer.
- Context, memory, orchestration, and session state are where every serious company is rolling their own. That is where the real infrastructure gap sits.
- Review is not a UI screen — it is a primitive. The contract between an agent that produces work and a human who verifies it is the underbuilt piece of the stack.
- 2026 is the year agents break out of engineering. The platforms built for code agents will not survive contact with GTM, ops, and customer success.
- One human will supervise hundreds of agents, not seven direct reports — but only if the observability and review layers are good enough.
FAQ
What is an atomic agent mesh?
An architecture where enterprise AI is composed of many small, declaratively-defined agents instead of a single mega-agent. Each atomic unit has a defined input, output, and purpose, so it can be tested, audited, and swapped without touching the rest of the system.
Which layers of the agent stack should you build vs. buy?
The harness is commoditized — almost every serious team is on Claude Code or OpenCode, and there is nothing to build there. LLM gateways and external integrations are mostly bought. Context, memory, orchestration, session state, and review are where every company is rolling their own, and that is where the real infrastructure gap sits.
Why is review treated as a primitive rather than a UI?
Review is the contract between an agent that produces work and a human who verifies it, and that contract differs by artifact type — code diffs, documents, spreadsheets, visual changes. Building review as a screen gives you a feature; building it as a primitive that accepts artifact types and returns appropriate verification surfaces gives you something composable across use cases.
The Mega-Agent Fantasy Is Already Falling Apart
There is a persistent fantasy in enterprise AI: deploy one powerful agent that handles everything. One system that ingests every data source, reasons across every domain, executes autonomously. AGI in a box. The pitch is seductive. The reality is a nightmare.
Every organization that has tried to build a mega-agent has hit the same wall. The thing is opaque. No one knows why it made a decision. When it breaks — and it will break — no one can isolate the failure. And when the CEO walks down the hall and says "that's wrong, change it," nobody knows which part of the system to touch.
This is not an AI problem. This is a software architecture problem. We solved it decades ago with the same insight: decompose.
Agents Are Software, Not Prompts
The industry has spent two years treating agents as a new category of thing. They are not. Agents are software, and they benefit from the same engineering principles that have always mattered: modularity, testability, observability, clear interfaces. The fact that there is an LLM call somewhere in the execution path does not change the nature of what you are building.
Once you accept this, the architecture becomes obvious. You do not build one agent. You build many small, atomic agents — each with a defined input, a defined output, and a clear purpose. Each can be tested independently. Each can be swapped, improved, or retired without touching the rest of the system.
This is the agent mesh. Not a theoretical framework. A practical architecture for getting AI into production.
The Declarative Atomic Unit
What makes an atomic agent genuinely atomic? It needs to be defined declaratively — a single specification that captures what it does, what it takes as input, what it produces, and what constraints govern its behavior. Defined this way, agents become composable. A parent agent orchestrates child agents. Child agents run in parallel. The entire graph can be inspected by reading a spec file rather than tracing through opaque code.
This is the same insight that made infrastructure-as-code work. When infrastructure is a YAML file, anyone can read it, audit it, version-control it. When an agent is a declarative spec, the same properties hold. A business analyst can read it. A compliance team can audit it. An engineer can swap the underlying model without changing the spec.
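To make that concrete, here is a minimal sketch of what such a spec might look like once loaded into code. The fields and the lead-qualifier example are illustrative assumptions, not the schema of any particular framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentSpec:
    """One atomic agent: a spec a human can read, audit, and version-control."""
    name: str
    purpose: str                       # one sentence: what this agent is for
    input_schema: dict                 # what it accepts
    output_schema: dict                # what it produces
    constraints: tuple[str, ...] = ()  # guardrails and policies
    model: str = "cheap-default"       # swappable without touching the rest of the mesh

# A hypothetical spec, exactly as it might live in a version-controlled YAML file.
lead_qualifier = AgentSpec(
    name="lead-qualifier",
    purpose="Decide whether a new signup should become a CRM lead.",
    input_schema={"signup": {"email": "str", "company": "str", "plan": "str"}},
    output_schema={"qualified": "bool", "reason": "str"},
    constraints=("never write to the CRM directly", "escalate if confidence < 0.7"),
    model="small-classifier-model",
)
```

The point is that the spec is data. It can be diffed, reviewed, and rolled back like any other piece of config.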
Composability enables model optimization at the atomic level. If one agent's job is to say yes or no, you do not need a frontier model. Test cheaper models against the known-good outputs of the expensive one, validate accuracy, swap. The savings compound across hundreds of agents — but only because each agent is isolated enough to evaluate independently.
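A sketch of that swap test, assuming you have already logged known-good outputs from the expensive model. The run_agent hook and the 95 percent threshold are placeholders, not a prescription.

```python
def evaluate_candidate(candidate_model, golden_cases, run_agent, threshold=0.95):
    """Replay logged cases through a cheaper model and compare to known-good outputs.

    golden_cases: (input, expected_output) pairs captured from the frontier model.
    run_agent(model, case_input): executes one atomic agent with the given model.
    """
    matches = sum(
        1
        for case_input, expected in golden_cases
        if run_agent(candidate_model, case_input) == expected
    )
    accuracy = matches / len(golden_cases)
    # Swap only if the cheaper model reproduces the expensive model's decisions.
    return accuracy, accuracy >= threshold
```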
The Build-vs-Buy Map
I just spent time going deep on how real companies — Ramp, Stripe, Spotify, Coinbase, Earthly — are actually building their agent infrastructure. Not what they say on stage. What they actually deployed.
The picture is striking: the agent stack is almost entirely homegrown. There is very little purchasing happening at the layers that matter.
Lay out the seven layers — harness, tools and file ops, context and memory, orchestration and session state, sandbox and execution, LLM gateway, observability and evals — and a clear map emerges.
The harness is commoditized. With few exceptions, every company is on Claude Code or OpenCode. Spotify migrated from a homegrown loop to Goose to Aider to Claude Code. Bitrise hand-rolled because they are literally a CI/CD platform. Everyone else converged. This layer is done. There is no company to build here. Even the players who raised money to compete — Augment, Poolside — are fighting over a slot that is becoming a commodity feature of existing platforms. The harness is a force multiplier, not a standalone business.
The LLM gateway is bought. Triggers and external integrations are mostly bought. Observability is sometimes bought — Coinbase uses LangSmith — but more often built.
Context, memory, skills, orchestration, and session state? Almost entirely built in-house. Across every company I studied, regardless of size, these are the layers everyone builds themselves. Nobody has a good off-the-shelf answer for how agents accumulate organizational context. Nobody has a good off-the-shelf answer for how agents coordinate multi-step work.
That is the real gap. Not the harness. Not the integrations. The connective tissue that makes agents operationally useful.
The Maturity Gradient
Organizing companies by the maturity of their dev tooling makes the pattern obvious.
Mature engineering orgs — Stripe, Spotify, Coinbase — already had infrastructure to repurpose. They forked an open-source harness, plugged it into existing CI/CD and observability, and started shipping. Their agent infra is mostly reused human dev infra.
Less mature organizations — Ramp is the standout — bought almost everything off the shelf. Modal for sandboxes. Cloudflare for orchestration. GitHub for auth. They purchased every layer because they had no existing platform to extend. That is the right call when you are validating whether the use case works at all. But every one of those decisions has a half-life. When a core part of how you make money depends on infrastructure, you eventually bring it in-house.
Then there are the scrappy teams. Earthly: one engineer, an EC2 instance as the sandbox, no orchestration, building it themselves because at that scale you can. It works until it does not — until the engineer goes on vacation and the load-bearing side project breaks.
The opportunity sits in the gap. The small companies that are scrappy today will need real infrastructure tomorrow. Whether they hire another engineer or pull up a search bar depends entirely on whether a platform exists that meets them where they are.
The Component Problem Is Real
Even when you do everything right inside your own mesh, the components you depend on — external APIs, SaaS tools, vendor systems — are largely not ready for this world. Most commercial software was built for humans clicking through UIs, not for agents making programmatic decisions at speed. Connect your auth system to your CRM so signups become leads, and you discover the CRM API has no sorting, no date filtering, no way to retrieve recent records without pulling the entire dataset. This is the state of most commercial APIs when you try to use them as components in an automated system.
The agent mesh is also a pattern for insulating your system from the fragility of external dependencies. Each atomic agent that wraps an external service is a replaceable unit. When the vendor finally fixes their API, or you switch vendors, you swap one agent. The rest of the mesh does not care.
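Sketched below under those assumptions: the rest of the mesh depends on a stable interface, and the workaround for a hypothetical CRM with no date filtering lives inside one replaceable adapter. Nothing here refers to a real vendor API.

```python
from datetime import datetime, timedelta
from typing import Protocol

class LeadSource(Protocol):
    """The stable contract the rest of the mesh depends on."""
    def recent_leads(self, days: int) -> list[dict]: ...

class LegacyCrmAdapter:
    """Atomic agent wrapping a hypothetical CRM API with no sorting or date filters."""

    def __init__(self, client):
        self.client = client  # whatever SDK or HTTP client the vendor provides

    def recent_leads(self, days: int) -> list[dict]:
        cutoff = datetime.utcnow() - timedelta(days=days)
        # The workaround lives here and only here: pull everything, filter client-side.
        # Assumes created_at is a naive UTC ISO-8601 timestamp.
        return [
            lead
            for lead in self.client.list_all_leads()
            if datetime.fromisoformat(lead["created_at"]) >= cutoff
        ]
```

When the vendor ships date filtering, or you switch vendors, this one adapter gets rewritten. Every agent that consumes LeadSource stays untouched.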
Review Is Not a Screen. It Is a Primitive.
Here is where most teams go wrong. They think about review as a product surface — a page in the UI where someone looks at a diff and clicks approve. That is the dessert. The vegetables are understanding what review actually requires at the infrastructure level.
An agent working on code produces diffs. An agent writing a marketing brief produces a document. An agent updating a spreadsheet produces numerical output. An agent modifying a website produces a visual change. These are all review artifacts, and they are fundamentally different in how they need to be presented, verified, and approved.
Build a review system that only handles code diffs and you have built a feature, not a primitive. Build a review system that accepts an artifact type as input and returns the appropriate verification surface and you have built something composable. Something that scales across use cases without requiring a new product for each one.
This is the API-level thinking that matters. What is the request? What is the response? What is the contract between the agent that produces work and the human who verifies it? Get that right and the UI becomes a composition exercise. Get it wrong and you rebuild from scratch every time you expand to a new use case.
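One way to write that contract down, as a sketch. The artifact types and surface names are assumptions; the point is only the shape of the primitive, artifact in, verification surface out.

```python
from dataclasses import dataclass
from enum import Enum

class ArtifactType(Enum):
    CODE_DIFF = "code_diff"
    DOCUMENT = "document"
    SPREADSHEET = "spreadsheet"
    VISUAL_CHANGE = "visual_change"

@dataclass
class ReviewRequest:
    agent_name: str
    artifact_type: ArtifactType
    payload: dict    # the work product itself
    context: dict    # why the agent did it, for the reviewer

@dataclass
class ReviewSurface:
    """What the human actually sees: a renderer plus an approval requirement."""
    renderer: str            # e.g. "side-by-side-diff", "tracked-changes"
    requires_approval: bool

def open_review(request: ReviewRequest) -> ReviewSurface:
    """The primitive: artifact type in, appropriate verification surface out."""
    surfaces = {
        ArtifactType.CODE_DIFF: ReviewSurface("side-by-side-diff", True),
        ArtifactType.DOCUMENT: ReviewSurface("tracked-changes", True),
        ArtifactType.SPREADSHEET: ReviewSurface("cell-delta-table", True),
        ArtifactType.VISUAL_CHANGE: ReviewSurface("before-after-screenshot", True),
    }
    return surfaces[request.artifact_type]
```

Adding a new use case means adding an entry to that mapping, not building a new product.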
Review is the bottleneck. Not model quality. Not prompt engineering. Not integrations. The thing that separates an impressive demo from a production deployment is whether a human can look at what the agent produced, understand it, and approve it with confidence. And almost nobody is building review as a first-class primitive.
Non-Determinism Demands Human Correction Loops
Agent systems malfunction all the time. Not because they are poorly built, but because they are non-deterministic by nature. New software plus non-deterministic software means you need humans watching it, telling it what is wrong, letting it fix itself.
This is how agent systems get smart — through human correction. Not through better prompts written once and deployed forever. Not through more sophisticated models. Through the grinding, iterative process of a human observing an output, judging it wrong, explaining why, and letting the system incorporate the feedback. Times a thousand.
The atomic mesh makes this tractable. A human correcting the system can identify exactly which agent produced the bad output and provide targeted feedback. In a monolith, the same correction is nearly impossible — you do not know which part of the chain went wrong, so you cannot provide precise feedback, so the system cannot improve precisely.
This has direct implications for staffing. You do not just need engineers to build the agents. You need domain experts, operators, and analysts who can evaluate outputs and provide structured feedback. The mesh does not run itself. It runs under human supervision, and the quality of that supervision determines the quality of the mesh.
Context Is the Hardest Problem Nobody Has Solved
Across every company I studied, context management is the layer most consistently built in-house and least well served by vendor solutions. This is not a coincidence.
Organizational context is not a search problem. It is not a matter of indexing documents and returning relevant chunks. It is the problem of knowing what is true about an organization, surfacing disagreements between humans about what is true, and maintaining that knowledge as it evolves.
When I signed a deal to help a company build agents, the first thing I realized was that the agent we needed before any other was one that understood organizational context — standard operating procedures, what the portal does, who knows what, where the disagreements are between the CEO and the ops team. That kind of context is not a feature of any existing product. Some of it is in SharePoint. Some of it is in Slack threads. Some of it exists only as conflicting assumptions in different people's heads.
This is not a retrieval-augmented generation problem. It is a knowledge management problem, and it gets harder as the organization gets larger. The companies that figure out how to give agents reliable organizational context — without forcing humans to migrate to new tools — will have a durable advantage. Meet people where their data already lives. Integrations are table stakes. The intelligence layer that resolves conflicting information and maintains a coherent picture is where the real value sits.
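As a sketch of the distinction: if organizational knowledge is stored as claims with provenance rather than as indexed chunks, surfacing disagreement becomes a query, not a retrieval call. Everything named here is hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Claim:
    """One assertion about the organization, with its provenance kept."""
    topic: str        # e.g. "refund policy"
    statement: str    # what this source says is true
    source: str       # "SharePoint SOP", "Slack thread", "CEO", "ops team"

def surface_disagreements(claims: list[Claim]) -> dict[str, list[Claim]]:
    """Group claims by topic and return only the topics where sources disagree."""
    by_topic: dict[str, list[Claim]] = defaultdict(list)
    for claim in claims:
        by_topic[claim.topic].append(claim)
    return {
        topic: group
        for topic, group in by_topic.items()
        if len({c.statement for c in group}) > 1  # more than one version of "true"
    }
```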
The Grayscale Between Engineering and Everywhere Else
There is a persistent myth that you need two separate products: one for engineers, one for everyone else. A coding tool and a business tool. Left door or right door. This is wrong. The reality is a grayscale.
Consider a company running both an engineering agent and a non-engineering AI co-worker. The engineering agent required the full stack: sandboxes with VNC and Chromium, Modal compute, complex orchestration. The non-engineering co-worker? Slack channels and pre-commit hooks. Same company, radically different infrastructure needs.
Non-engineering agents are simpler precisely because they do not need environment setup. They do not need to know your stack. They do not need CI/CD or test suites. They can be ephemeral — spin up, do the work, spin down — without maintaining state about a complex dev environment.
But they still need to be defined in software. They still need context. They still need to execute code when an API call is not enough. They still need review infrastructure so a human can verify output before it goes anywhere consequential.
Treat this as binary — Cursor or Zapier — and you end up in an uncanny valley. Too complicated for business users, too simple for engineers. Design composable primitives with flexible artifact types and you serve the entire spectrum without building separate products.
2026: Agents Break Out of Engineering
Here is the bet. 2026 is the year agents break out of engineering organizations. Highly capable agents exist now within engineering orgs but barely exist outside them in any meaningful way. GTM, customer success, operations, finance — these functions are next.
When they come, they will follow the same pattern we have already seen. A motivated person builds a scrappy agent for their team. It works. Other teams want it. The person who built it becomes an accidental platform team of one. Then it breaks.
This is the Airflow pattern all over again. One team runs it for their use case. Another team piggybacks. Then three more. Suddenly you have an internally hosted service with SLAs that nobody signed up to provide. The people running it are users, not platform engineers, and they do not want the job.
The teams that built internal agent platforms for engineering will face a choice: extend their platform to serve the whole organization, or let every team fend for themselves. Extending is hard because the platform was built with engineering assumptions — about environments, deployment, what review looks like. The homegrown engineering platform is not going to scale to non-technical teams.
The company that provides a self-hosted platform for this moment — one that handles context, orchestration, review, and multi-agent coordination without requiring a platform engineering team to run it — will capture the expansion. Not because they built the best agent. Because they built the infrastructure that makes every agent operationally real.
Humans Will Manage Hundreds of Agents
AI is not going to replace humans. It is going to replace human-judgment-based workflows. Someone still has to supervise. Someone has to notice when an agent is drifting. Someone has to approve changes.
The difference is scale. A manager handles about seven direct reports — the limit because humans are emotionally complex and require constant context-switching. Agents are different. Most of them, most of the time, are just cruising along. A human should be able to supervise dozens or hundreds, if the observability and review layers are good enough. The atomic mesh is the only architecture that scales human oversight.
The missing piece is outside stimulus. Humans self-correct through conversations, feedback, social cues. Agents have none of this by default. A supervisor layer — human, automated, or hybrid — has to provide that stimulus deliberately. A higher-level agent reviewing subordinate outputs on a cycle. A human reviewing a dashboard weekly. Automated tests flagging drift. Some feedback loop must exist, because agents without outside stimulus do not learn.
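A deliberately simple sketch of one such loop, assuming each agent exposes its recent outputs and that some cheap drift check exists. Every name and threshold here is a placeholder.

```python
def supervision_cycle(agents, check_drift, open_review, notify_human):
    """One supervision pass over many agents.

    agents: atomic agents, each exposing .name and .recent_outputs()
    check_drift(output) -> bool: a cheap automated test or a higher-level agent
    open_review(agent, output): queue the output on the human's review surface
    notify_human(agent): escalate an agent that keeps drifting
    """
    for agent in agents:
        flagged = [output for output in agent.recent_outputs() if check_drift(output)]
        for output in flagged:
            open_review(agent, output)  # outside stimulus, delivered deliberately
        if len(flagged) > 3:            # arbitrary threshold, for illustration only
            notify_human(agent)         # most agents, most of the time, stay quiet
```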
Where the Real Work Happens
The gap between AI demo and AI deployment is not a technology gap. Not a model gap. Not a harness gap. It is an infrastructure gap. Specifically, it is the absence of composable primitives for the boring parts of operationalization: how agents trigger, how they coordinate, how their work gets reviewed, how humans maintain authority, and how organizational context flows to the right agent at the right time.
The pattern that works in production is straightforward. Context goes in. The agent executes in the background. It produces reviewable output. A human approves before anything ships. That loop sounds simple. Building the infrastructure to support it across diverse use cases, artifact types, and user sophistication levels is where the real engineering happens.
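As a sketch, the whole loop fits in a few lines once the primitives exist; all five callables below are placeholders for whatever your mesh provides.

```python
def run_atomic_agent(spec, gather_context, execute, open_review, ship):
    """One pass of the loop: context in, background execution, review, approval."""
    context = gather_context(spec)           # context goes in
    artifact = execute(spec, context)        # the agent works in the background
    approved = open_review(spec, artifact)   # a human verifies the reviewable output
    if approved:
        ship(artifact)                       # nothing consequential ships unreviewed
    return artifact
```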
If you are building in this space, the lesson is clear. Do not fight over the commoditized layers. The harness is settled. The LLM gateway is settled. The unsolved problems are context, orchestration, and the review primitives that turn an agent from a demo into a production system. That is where every company is building custom, where no vendor has won, and where the need is about to explode as agents move beyond engineering.
Build the mesh one atomic agent at a time. Make every unit declarative, auditable, and replaceable. Treat review as a primitive, not a screen. Solve context for organizations, not just code. The companies that get those four things right will own the next layer of enterprise AI infrastructure.
The gap between AI demo and AI deployment is not intelligence. It is engineering. And the engineering that matters most right now is not the agent itself — it is the mesh that holds them all together.