Key takeaways
- Every serious engineering org is standing up an internal agent factory right now, and the stack looks remarkably similar across organizations — orchestration, sandboxes, context, tools, observability.
- Homegrown systems built by ambitious individuals always hit a ceiling. They go into disrepair when maintainers leave, stay feature-poor, and run on a desktop-grade footing when production needs cloud-grade reliability.
- The speed of AI-assisted development has outrun organizational coordination. Five teams independently build the same Slack bot and nobody finds out until they present to leadership months later.
- Code is no longer the bottleneck. Alignment, deployment pipelines, feature flag pacing, and code review have become the dominant constraints, and agent platforms have to address the full lifecycle.
- The framework cannot be the product. Open-source the harness and sell the indispensable infrastructure piece — observability, sandboxing, orchestration — that gets harder as the ecosystem matures.
- The opportunity is not to compete with what Shopify is building internally. It is to build the system that replaces every homegrown platform when the engineer who built it moves on.
FAQ
Why do homegrown agent platforms tend to fail within a year?
They are typically built by ambitious individuals who have other jobs, so when those engineers get promoted or leave, the platform loses its maintainer. Without a dedicated team, these systems stay feature-poor, run on desktop-grade infrastructure, and accumulate operational debt that production workloads eventually expose.
What is the coordination problem AI tooling has created inside large companies?
AI has democratized building to the point where multiple teams independently spin up the same product without realizing it. Traditional organizational processes were designed for slower build cadences, so duplication now goes undetected for months. A centralized agent platform with cross-team visibility becomes a governance layer, not just a productivity tool.
Why is the framework not the right thing to commercialize?
Frameworks get commoditized, forked, or absorbed into model provider offerings the moment a major player ships built-in orchestration. The durable commercial layer is the indispensable infrastructure piece — observability, sandboxing, orchestration — that gets harder, not easier, as agent ecosystems mature.
Something is happening inside the engineering organizations of the most sophisticated companies in the world, and almost none of them want to talk about it.
They are building internal agent software factories. Not running pilots. Not experimenting. Standing up full platform teams dedicated to developing, deploying, and operating AI agents as production software systems. The work is politically sensitive, strategically differentiating, and moving so fast that nobody wants to be on the record about what they are doing.
I have had this conversation, in some form, with engineering leaders at a dozen companies in the last quarter. The shape is always the same. A Slack bot wired into the data warehouse. A CLI-based coding agent running in ephemeral containers. A vibe-coding platform that lets non-engineers spin up internal apps. A small team of principal engineers — or sometimes one ambitious staff engineer — holding the whole thing together with duct tape and conviction.
This should sound familiar to anyone who lived through the early days of workflow orchestration. When Airflow was gaining traction, every company with a data team had already built something homegrown. Ambitious engineers with free time build an internal tool; it gets adopted organically, and for a while it works. Then it doesn't.
The Stack Looks the Same Everywhere
The thing that struck me as I started comparing notes across companies is how identical the stack is. Different teams, different industries, different starting infrastructure — and yet the architecture converges.
It is always some version of: a coding CLI (Claude Code, Codex, Cursor, or a custom wrapper) running inside sandboxed VMs, hooked into the data warehouse, production logs, and a growing set of internal tools through MCP or homegrown adapters. A Slack interface for kicking off jobs and reviewing output. Some attempt at context engineering — markdown files in a git repo, indexed for retrieval, version-controlled like a codebase. A deployment pipeline, an evaluation harness, a half-built observability layer.
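To make that convergence concrete, here is a minimal sketch of the kind of job spec these platforms end up carrying around. The field names and the submit function are illustrative assumptions, not any particular company's schema.

```python
# A minimal, hypothetical sketch of the job spec internal agent platforms
# converge on. Names are illustrative, not any specific company's schema.
from dataclasses import dataclass, field


@dataclass
class AgentJobSpec:
    """One agent run: which harness, which sandbox, which context and tools."""
    harness: str                       # e.g. "claude-code", "codex", or a custom wrapper
    sandbox_image: str                 # ephemeral VM/container the agent runs inside
    context_repo: str                  # git repo of markdown context, versioned like code
    mcp_servers: list[str] = field(default_factory=list)  # warehouse, logs, internal tools
    slack_channel: str = "#agent-jobs"                     # where output lands for review
    timeout_minutes: int = 30


def submit(spec: AgentJobSpec) -> str:
    """Stand-in for the orchestration layer: validate the spec and hand it off."""
    if not spec.mcp_servers:
        raise ValueError("an agent with no tool access cannot do useful work")
    # A real platform would schedule a sandbox, mount the context repo, start the
    # harness, and stream output back to Slack and the evaluation harness.
    return f"queued {spec.harness} run in {spec.sandbox_image}"


if __name__ == "__main__":
    job = AgentJobSpec(
        harness="claude-code",
        sandbox_image="ephemeral-dev:latest",
        context_repo="git@internal:platform/agent-context.git",
        mcp_servers=["warehouse", "prod-logs"],
    )
    print(submit(job))
```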
That convergence is the productization tell. When fifty different companies independently build the same thing, the thing wants to be a product. It just has not been one yet.
What the companies building these systems are quietly taking for granted is the decade of platform engineering that made it possible. Shopify did not "stand up an agent platform" — they had well-configured sandboxes, sophisticated context plumbing, internal tool registries, and evaluation frameworks built up over years. The agent layer is the new part. Everything underneath is old infrastructure outsiders do not see. Which means the company with thirty engineers and no platform team cannot replicate this. They will try. They will get something working. And it will fall over.
Homegrown Systems Hit a Wall
The problem with homegrown platforms is not that they are bad. They are often quite good — for the first year. The problem is that they are built by people who have other jobs.
The engineer who built your internal agentic coding platform is not going to maintain it forever. They will get promoted, get poached, or move to a more interesting project. When they do, their system becomes a liability. The Shopify Slack agent gained traction organically and is now getting a team built around it. That is the best case. It is still fragile.
The firehose of internal AI tools at these companies is, by one insider's account, "out of control." A lot of these tools are built not because they are the highest-leverage work, but because developers want to practice agentic workflows and build portfolio pieces in case the next round of layoffs comes. The result is a lot of engineering energy producing tools with a bus factor of one.
These platforms go into disrepair. They stay feature-poor because no one has the mandate to invest in them continuously. They are desktop-grade tools when the organization needs a cloud-grade system with autoscaling, proper sandboxing, audit trails, and the reliability production workloads demand.
I have seen this movie before. At Astronomer, we built the commercial Kubernetes-based Airflow platform. Our customers were companies that had outgrown a single-machine Airflow deployment some data engineer stood up as a POC and accidentally promoted to production. The pattern is identical now. The big cloud providers will target the easy 80% with managed offerings. The commercial winners will own the long tail of operational requirements that homegrown systems never get around to building.
Internal AI Tooling Has a Twelve-Month Shelf Life
I had a conversation recently with an engineering leader at one of the fastest-growing infrastructure companies in the world. His team has gone deep on AI adoption — custom skills on top of Claude Code for incident investigation, log analysis, SRE workflows. By any reasonable measure they are ahead of the curve.
What stuck with me was not how much they had built. It was the shelf life of what they had built. Three to four weeks of work produced tooling that matched or exceeded what dedicated AI SRE vendors were offering. That is impressive. It is also a warning sign. If a small team can build something competitive in a month, what happens to that tooling in six months when the person who built it has moved on?
The instinct to build is correct. You cannot stop engineers from doing this, and you should not try. People are building AI tools internally because their careers depend on it. The mistake is the build-forever assumption.
Here is what actually happens. Someone spends a few weeks building a Claude Code skill or a custom automation. It works. Models change, APIs change, the team's needs evolve, and suddenly the tooling feels like it was built in a different era — because in AI terms, it was. The top five percent of engineering orgs can keep up with this pace. Everyone else hits a prioritization wall. You have a business to run. Maintaining bespoke AI tooling is not your core competency, even if building it felt natural.
Worse, automating knowledge work always comes back to writing software. Not configurations. Not prompt templates. Not MCP server setups. Software. Software has properties prompt engineering does not. It can be tested, versioned, maintained by someone other than the original author. It can also rot if it is not actively maintained. The enterprise AI conversation needs to shift from "how do we adopt AI tools" to "how do we build and maintain AI software systems." Those are different questions.
The Coordination Crisis
There is a second-order problem the build-vs-buy framing usually misses. It is not just that homegrown platforms become maintenance burdens. It is that AI tooling has democratized building to the point where organizational coordination collapses.
At Yum Brands — 50,000+ restaurants, roughly 8,000 engineers across Pizza Hut, Taco Bell, KFC, and other brands — the AI team is watching this play out in real time. Because AI tools now allow non-engineering teams to build functional products, teams across all brands are independently spinning up AI-powered projects. The problem is not that the products are bad. The problem is that five teams build the same integration aggregator without knowing it, and nobody discovers the duplication until a month or two later when they present to leadership.
As one lead engineer described it to me: "Things are just going a little too fast. Because now everybody thinks they're an engineer, there's just so much slop. Leadership is having a hard time getting actual deliverables because people keep shifting their objectives because AI is allowing them to finish things so quickly."
This is the coordination tax nobody budgeted for. In the old world, building was slow enough that coordination happened naturally — you could not waste months of engineering time without someone noticing. In the new world, a team can build a million-dollar product in weeks, and the organizational processes designed to prevent duplication were built for a slower cadence. When everyone can build, nobody knows what has already been built.
This is a hundred-million-dollar problem at Yum's scale, and it is a governance problem that tooling created. A centralized agent platform with visibility across teams stops being a developer productivity tool and starts being the coordination layer that prevents millions of dollars of duplicated effort.
Code Is No Longer the Bottleneck
The most consistent observation from engineers at large companies is that code production is no longer the constraint. AI has made writing code dramatically faster. Everything surrounding the code has not.
At Shopify, deploying the main monolith can take hours. A PR merged in the morning might not reach production until the next day's deploy. Feature flags enforce pacing — five hours minimum to get a flag from 0% to 100%. A trivial change can take days. A meaningful change with a couple of fix loops can take a week.
At Yum, the bottleneck has shifted further upstream. Their multi-agent systems — coordinator agents, investigator agents, action agents — handle thousands of incidents per hour. The engineering challenge is not generating code or even orchestrating agents. It is managing the explosion of output. When AI can produce 45 PRs in a single repo in a day, the constraint becomes code review and human verification.
The new bottleneck is taste. Taste does not scale with token throughput.
It is not enough to generate code faster. The platform has to understand the deployment pipeline, the flag rollout, the verification step, the review velocity. Internal teams have a massive advantage here because they understand their own deployment reality intimately. That advantage is also their weakness — they are building for one deployment reality, on top of an already overburdened engineering organization.
The Agent Harness Matters as Much as the Model
One underappreciated dimension: the agent harness — the CLI or framework that wraps the model — matters as much as the model itself. Engineers report meaningfully different results from the same model run through different harnesses. The harness is not a thin wrapper. It is an opinionated layer that shapes agent behavior in ways that matter.
This fragmentation is playing out at every large company. Adoption is split — some engineers use Cursor, some Copilot, some Claude Code, some the open-source CLIs. The Shopify Slack agent was designed from the start to support multiple CLI backends, with code paths for both Pi and Claude even though it is currently only running one. Swapping is a configuration change, not a rewrite. That is the kind of decision that separates systems built by experienced engineers from systems built during a hackathon.
Companies building internal tools are typically locked to one framework because that is what the team that built it knows. A real product has to be agent-agnostic the way good deployment infrastructure is cloud-agnostic. The harness matters. The model matters. Both are moving targets. Bring-your-own-keys and CLI-agnostic design are table stakes for the mid-market enterprise buyer who refuses to be locked in.
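A rough sketch of what CLI-agnostic design looks like in code: the harness is a configuration value behind a thin interface. The backend names and the stubbed run methods are assumptions for illustration, not the Shopify implementation.

```python
# A hedged sketch of harness-agnostic design: the harness is a config value,
# not an architectural commitment. Backend names and interfaces are illustrative.
from typing import Protocol


class HarnessBackend(Protocol):
    def run(self, task: str, workdir: str) -> str:
        """Execute one agent task inside a prepared sandbox and return its output."""
        ...


class ClaudeCodeBackend:
    def run(self, task: str, workdir: str) -> str:
        # A real backend would shell out to the vendor CLI inside the sandbox
        # and capture its transcript; stubbed here to keep the sketch self-contained.
        return f"[claude-code] would run {task!r} in {workdir}"


class CustomWrapperBackend:
    def run(self, task: str, workdir: str) -> str:
        return f"[custom-wrapper] would run {task!r} in {workdir}"


# Swapping harnesses is a one-line configuration change, not a rewrite.
BACKENDS: dict[str, HarnessBackend] = {
    "claude-code": ClaudeCodeBackend(),
    "custom": CustomWrapperBackend(),
}


def run_task(task: str, workdir: str, harness: str = "claude-code") -> str:
    return BACKENDS[harness].run(task, workdir)


if __name__ == "__main__":
    print(run_task("triage flaky test in payments suite", "/sandbox/repo", harness="custom"))
```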
The Observability Gap
I recently built an internal tool that decides whether someone who signs up for our product gets routed as a sales lead. It is a mix of deterministic code and a small LLM classification step. It works. The problem is that no one knows how it works except me.
The scoring formula is logic in a repo. There is no single place where it is described, inspectable, or auditable. That is barely good enough for me, and not good enough for an enterprise. We are building agents at a rapid pace, and the work they do is the tip of the iceberg. The harder question is whether they are doing the right work, and whether they are doing it correctly.
Imagine a directory of AI agents the way you have a directory of employees: every agent has a description of what it does, how often it runs, and what its recent outputs look like. You can inspect its logic. You can judge its performance. When the CEO says "that's wrong, it should work this way," there is a clear path to making the change and verifying it took effect.
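Sketched out, a directory entry does not need to be complicated. The fields below are hypothetical, but they are roughly the minimum needed to answer what an agent does, how often it runs, and whether a human has looked at it lately.

```python
# A hypothetical sketch of one entry in an agent directory: enough metadata
# to answer "what does this agent do, how often, and is it doing it right?"
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class AgentRecord:
    name: str
    owner: str                          # the human team accountable for this agent
    purpose: str                        # plain-language description anyone can read
    schedule: str                       # e.g. "on signup", "hourly", "on incident"
    logic_ref: str                      # where the actual logic lives (repo path, version)
    recent_outputs: list[str] = field(default_factory=list)
    last_reviewed: Optional[datetime] = None  # when a human last judged its output


lead_router = AgentRecord(
    name="signup-lead-router",
    owner="growth-eng",
    purpose="Decide whether a new signup is routed to sales or self-serve.",
    schedule="on signup",
    logic_ref="repo://growth/lead_router@v14",
    recent_outputs=["routed acme.com -> sales", "routed gmail.com -> self-serve"],
)
```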
Humans learn from outside stimulus. Agents have none of that by default. They will keep doing the wrong thing forever unless we build the infrastructure for humans to watch, judge, and course-correct. The agents that work in production will be the ones designed to be supervised, not the ones designed to be autonomous.
The other pattern I keep seeing is that people build agents as monoliths. One big automation that does everything. The better model is atomic agents — small, single-purpose units with clearly defined inputs and outputs, each independently testable, each swappable for a cheaper model when the task is simple enough. You cannot do that when everything is tangled together.
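Here is roughly what the atomic shape looks like in practice, using an insurance-claim severity grader as a made-up example. The names and the llm callable are assumptions; the structure is the point: one narrow job, typed output, a model you can swap or stub.

```python
# A minimal sketch of the atomic-agent shape: one narrow job, typed input and
# output, the model chosen per task so cheap models handle cheap work.
# Names and the llm callable are illustrative, not a specific framework's API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class ClaimSummary:
    claim_id: str
    severity: str        # "low" | "medium" | "high"
    rationale: str


def make_severity_agent(llm: Callable[[str], str]):
    """Build an agent that does exactly one thing: grade a claim's severity."""
    def run(claim_text: str, claim_id: str) -> ClaimSummary:
        label = llm(f"Classify severity as low, medium, or high:\n{claim_text}")
        return ClaimSummary(
            claim_id=claim_id,
            severity=label.strip().lower(),
            rationale="model classification of claim text",
        )
    return run


# Independently testable: swap the model for a stub and assert on the output.
if __name__ == "__main__":
    agent = make_severity_agent(lambda prompt: "high")
    result = agent("Total loss, injuries reported.", claim_id="C-1042")
    assert result.severity == "high"
    print(result)
```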
The Iteration Problem Belongs to Users, Not Developers
Agents are always slightly wrong when first built. They require hundreds or thousands of conversations to get close to reliable. The people who know what is wrong are not the developers who built the agent. They are the end users who interact with it every day.
An insurance adjuster knows when the agent is handling a claim incorrectly. A salesperson knows when the outreach tone is off. If the only way to correct the agent is to file a ticket with engineering and wait for a code change, the iteration cycle is too slow to ever converge on quality. The platforms that win will let users correct agents the way they would correct a direct report — conversationally, immediately, without writing code.
This is also why the engineering agent platform is quietly becoming the enterprise agent platform. The primitives are the same. A sandbox for execution. Tool access via MCP or APIs. A prompt that defines the role. Model routing. A conversational refinement interface. These are what engineering agents need, and exactly what a sales agent or an insurance adjuster agent needs. The difference is cosmetic, not architectural. Companies that constrain their platforms to engineering workflows are creating an opening. The mid-market CTO wants one tool, not five.
Frameworks Are Free. Infrastructure Is the Product.
The uncomfortable question anyone building in this space has to answer: even if you find the right patterns, how do you charge for them?
The framework itself cannot be the product. LangChain learned this. They open-sourced their agent framework and shifted commercial focus to LangSmith, an observability platform. The framework is the adoption vehicle; observability is the revenue model. E2B took a different path to the same conclusion — they sell the sandbox infrastructure where agents actually run.
Both arrived at the same structural insight. Give away the opinionated harness. Sell an indispensable piece of the stack that scales with consumption. The alternative — selling the framework itself — runs into the problem that frameworks get commoditized, forked, or absorbed by the model providers. When Anthropic ships built-in orchestration, framework-level features collapse into the model provider's product overnight.
So the strategic question is: what is the critical piece you sell alongside the free framework? It has to be something that gets harder, not easier, as the ecosystem matures. Observability gets harder because the systems get more complex. Sandboxing gets harder because security requirements escalate. Orchestration gets harder because multi-model, multi-harness workflows create session translation problems most teams underestimate. The defensible move is to pick the hardest infrastructure problem — the one enterprises cannot solve with a weekend hackathon — and own it completely.
The Real Product Is What Replaces Homegrown
The strategic insight is not to compete with what Shopify or Stripe are building internally. You will lose that fight. They have context, data access, and engineers who understand their specific domain better than any vendor ever will.
The opportunity is everyone else. For every Shopify with a polished internal platform, there are a hundred companies where one developer stood up a hacked-together system that now runs in production on a single large VM. Bus factor of one. No autoscaling — they are paying $3,000 a month regardless of usage. A blip of downtime on every deploy. Mid-market enterprises that do not have principal engineers with spare cycles for custom agent infrastructure. Even when they try, they are not going to build something as good as what the big companies built with dedicated teams.
The product strategy is to study what the winners built, aggregate those requirements into a comprehensive platform, and sell it to the companies that cannot or should not build it themselves. This is not novel. It is how every successful infrastructure company has been built. Study the homegrown systems at the top, build the commercial version, sell it to the middle.
We are at maximum entropy right now. Every week brings a new model, a new harness, a new internal tool. It will get more chaotic in six months. The companies that come out ahead will not be the ones with the most internal AI tools. They will be the ones that operationalized agent infrastructure early — shared context, multi-agent execution environments, reviewable output pipelines — and let their people focus on the actual work instead of maintaining bespoke tooling.
The gap between AI demo and AI deployment is called software engineering. The gap between one team's internal bot and enterprise-grade agent infrastructure is called a product. The internal tools era is a phase. What comes next is infrastructure.