Key takeaways
- Automating knowledge work with LLMs is the same activity as writing modern software — there is no procurement shortcut.
- Every obvious agent idea gets commoditized within weeks, so the durable moat is staying in a problem space long enough to make it actually work in production.
- The bottleneck has shifted from generating code to reviewing it; taste and judgment are now the scarcest resources in an engineering organization.
- Background agents that run in sandboxes against the real codebase beat RAG-over-vectors for any non-trivial code generation task.
- The context and memory layer underneath agents is the defensible infrastructure play, not the orchestration framework on top.
- A token-cost reckoning is coming, and the survivors will be the ones who built evaluation and feedback loops into their agent systems from day one.
FAQ
Why can't enterprises just buy an AI agent product off the shelf?
Because the specificity of how each business operates makes every real automation project a custom software project. The integrations, edge cases, and workflows are unique enough that no vendor anticipates them, so even a "buy" decision turns into a software engineering effort.
What makes background agents different from interactive coding assistants?
Background agents pick up tickets from systems like Linear or Jira, clone the repo into a sandbox, and open pull requests on their own. The developer's first interaction is the review, which shifts the bottleneck from writing code to evaluating it. That review-at-scale problem is the next frontier.
Why do sandboxes beat RAG for code generation?
Vector retrieval gives you a similarity approximation of context, while a sandbox gives the agent the real repository and standard tools like grep and find. Compute is cheap enough that sixteen failed searches do not matter, but starting from real context dramatically improves output quality on non-trivial tasks.
Every enterprise CEO is talking about AI on earnings calls. Every VP of Engineering has experiments running. Almost none of them have operationalized a single agent workflow that runs in production, unsupervised, delivering measurable business outcomes.
The reason is simple and uncomfortable: they think they have an AI problem when they actually have a software engineering problem.
The Procurement Trap
Most organizations approach AI automation as a procurement exercise. Pick a vendor. Buy a point solution. Plug it in. Move on to the next line item.
The moment you try to automate real knowledge work — accounts payable timing, sales prioritization, customer support triage, brand management across 500 people in a Fortune 50 — you discover that no commercial product fits what you actually need. The APIs you depend on are underpowered. The Figma MCP is garbage. Half the ad platforms do not have proper APIs at all. Salesforce was not designed for agent interaction. The integrations are janky and the workflows are unique to your business in ways no vendor anticipated.
We recently built a system for our own GTM organization at Tembo — a scoring algorithm that tells us which deals to focus on each day and what actions to take according to our sales methodology. It sounds like something you should be able to buy. You cannot. Not because the technology does not exist, but because the specificity of how your business operates makes every real automation project a custom software project.
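For a sense of the shape of that software, here is a minimal, hypothetical sketch of a daily deal-scoring pass. The fields, weights, and thresholds are placeholders for illustration, not our actual methodology.

```python
from dataclasses import dataclass

@dataclass
class Deal:
    name: str
    amount_usd: float            # expected contract value
    days_since_last_touch: int   # staleness signal
    stage_fit: float             # 0-1: how well the deal matches the methodology's stage criteria

def priority(deal: Deal) -> float:
    """Blend fit, urgency, and size into one daily score. Weights are illustrative."""
    urgency = min(deal.days_since_last_touch / 14, 1.0)    # two weeks untouched = max urgency
    size = min(deal.amount_usd / 100_000, 1.0)
    return 0.5 * deal.stage_fit + 0.3 * urgency + 0.2 * size

deals = [
    Deal("Acme renewal", 80_000, 12, 0.9),
    Deal("Globex pilot", 25_000, 3, 0.6),
]
for d in sorted(deals, key=priority, reverse=True):
    print(f"{d.name}: focus score {priority(d):.2f}")
```

The scoring is trivial; the work is in the integrations that feed it and the feedback loop that tunes it to how your team actually sells.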
That is the insight most executives are missing. Automating knowledge work with LLMs is the exact same thing as writing modern software. If you are trying to avoid touching software, you are going to touch software anyway. Every AI idea you have is either custom software or buying software and integrating it — and the integration alone is a software project.
The gap between an AI demo and an AI deployment is called software engineering. The gap between deployment and operationalization is called organizational design. Both are harder than picking a model.
Every Obvious Idea Gets Commoditized in Weeks
Here is the second uncomfortable truth, this one for founders building agent products: everyone building in AI right now has the same recurring nightmare. You ship something on Monday, customers love it on Tuesday, and by Friday three other companies have pivoted into your space with the same product. A frontier lab drops something adjacent. A YC batch launches five competitors. The thing you thought was your breakthrough turns out to be obvious.
The barrier to building has effectively collapsed. A CEO and a co-founder with no CTO, no engineering team, just vibe coding their way through pull requests — they can ship a real product now. They barely review code. They set up automations that generate screenshots and videos of new features. They move at speeds that were impossible two years ago.
This means every idea with any surface area gets built by multiple teams simultaneously. Not because anyone copied anyone, but because the idea was obvious and the tools to build it are available to everyone. The window between "novel product" and "commoditized feature" has compressed from years to weeks.
If your agent strategy depends on being the only one who thought of something, you have already lost.
The counterintuitive move in this environment is to do the opposite of what every YC company in a competitive space is doing. Instead of pivoting fast, chasing traction, throwing spaghetti at the wall — pick a hard problem and refuse to leave. The founders who built 40 products and shipped none of them to revenue learned a lot, but they did not build companies. The ones who stayed in a space, fought through the commoditization waves, and kept compounding their understanding of the problem — those are the ones who eventually broke through.
Longevity and diligence are underrated when everyone else is optimizing for speed and novelty.
Agents Are Software, Not Prompts
The reason commoditization does not actually finish the job is that prompt-chains are not products. Real enterprise knowledge work automation requires real software underneath.
Consider the Fortune 50 company with 500 people doing brand management. There is no universal model for how that work gets done. There might be a methodology in a book somewhere, but no one follows it perfectly. Every individual has their own opinions, their own workflows, their own edge cases. An agent that tries to impose a single workflow on 500 people will fail. An agent that adapts to each person's context and learns from their feedback has a chance.
The hard part is not wiring an LLM to a Slack channel. The hard part is the context layer, the memory, the feedback loops, the evaluation frameworks, the deterministic guardrails, the integration with whatever idiosyncratic systems your company actually runs on. That is software engineering. Production-grade agents need version control, testing, human review loops, and continuous iteration — the same rigor as any other enterprise software system.
Workflow-based agents that start small and grow into roles will outperform role-based agents that try to do everything from day one. A workflow agent that writes changelogs every week can be tuned to perfection. A role-based agent that tries to implement features, fix bugs, review PRs, and write changelogs will be mediocre at all of them. Start where the volume already is — automations and triggered workflows outnumber human-initiated agent runs roughly 10:1 in production. Then assemble workflows into something that resembles a role.
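To make "workflow, not role" concrete, a scoped agent can be described as little more than a trigger, one instruction, and a short tool allow-list. The structure below is a sketch, not any particular framework's API.

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowAgent:
    name: str
    trigger: str                                     # cron schedule or event that starts the run
    instruction: str                                 # the single job this agent does
    tools: list[str] = field(default_factory=list)   # deliberately short allow-list

changelog_agent = WorkflowAgent(
    name="weekly-changelog",
    trigger="cron: Fri 16:00",
    instruction="Summarize merged PRs since last Friday into a customer-facing changelog.",
    tools=["git_log", "slack_post"],
)

# A role-based agent would need an open-ended instruction and dozens of tools;
# this one produces a single repeatable output that can be evaluated and tuned.
print(changelog_agent)
```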
The Context and Memory Problem Nobody Has Solved
Teams that are technically proficient are already building their own productivity harnesses. They use Claude Code to spin up a standup skill, wire it to five MCP servers, and generate a daily summary. It works — until it does not.
The failure mode is predictable. Nothing is cached, so every run makes massive token calls. It is slow. Accuracy degrades when one data source times out. The agent never references the last time it ran the skill — it starts from scratch every execution. There is no learning, no feedback incorporation, no memory.
This is the infrastructure gap that actually matters. Not another wrapper on top of an LLM. Not another Slack bot. The gap is in the data layer — the context and memory infrastructure that agents need to actually improve over time. It is the kind of annoying, non-mission-aligned plumbing that enterprises will pay someone else to handle rather than build themselves.
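The minimal version of that plumbing is not exotic: persist what the last run saw and produced, so the next run starts from a delta instead of from scratch. Here is a sketch using a local SQLite store; the skill and field names are illustrative.

```python
import json
import sqlite3
import time

class RunMemory:
    """Tiny persistent store so a recurring skill can reference its previous run."""

    def __init__(self, path: str = "agent_memory.db"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS runs (skill TEXT, ts REAL, state TEXT)")

    def last_run(self, skill: str) -> dict | None:
        row = self.db.execute(
            "SELECT state FROM runs WHERE skill = ? ORDER BY ts DESC LIMIT 1", (skill,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def record(self, skill: str, state: dict) -> None:
        self.db.execute(
            "INSERT INTO runs VALUES (?, ?, ?)", (skill, time.time(), json.dumps(state))
        )
        self.db.commit()

memory = RunMemory()
previous = memory.last_run("standup") or {"last_seen_ticket": None}
# ...fetch only tickets newer than previous["last_seen_ticket"], summarize, post...
memory.record("standup", {"last_seen_ticket": "ENG-1423"})
```

Doing this well across an enterprise, with security, caching, and cost controls, is the actual product.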
No one has won this space yet. There is no dominant player in agent memory and context. The frontier labs have not productized anything meaningful here. And the enterprise need for a managed, secure, cost-effective context layer is not going away. The orchestration framework on top will commoditize. The data layer underneath will not.
The Bottleneck Has Moved
Now flip the lens to the engineering org actually consuming all this agent capability. Something strange is happening inside large companies. Teams that used to ship a handful of pull requests per week are generating dozens per day. Non-engineering teams are building their own products. Leadership is discovering, sometimes months too late, that five different teams built the same integration aggregator without talking to each other.
The conventional narrative about AI in software engineering is still stuck on generation. Can the model write the code? How accurate is it? What percentage of keystrokes does it save? These are Phase 1 questions, and most organizations have already moved past them — whether they realize it or not.
There are three phases of AI-assisted engineering, and they map cleanly onto where every organization is on the curve:
- Phase 1: Autocomplete. GitHub Copilot, tab-completion, inline suggestions. Useful, incremental, easy to adopt. Where most enterprises got comfortable.
- Phase 2: Interactive agents. Claude Code, Cursor, Codex CLI — a developer drives, the AI executes. Where most engineering teams are today. The developer is still in the loop for every decision.
- Phase 3: Background agents. A ticket gets created in Linear or Jira. An agent picks it up, clones the repo into a sandbox, searches the codebase, writes the fix, opens a pull request. The developer's first interaction with the work is the review.
Phase 3 is where the paradigm actually shifts. And it is here right now — this is what we are shipping at Tembo.
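Stripped to its skeleton, a background run is a short pipeline. The sketch below is a simplification: `fetch_ticket`, `run_agent`, and `open_pull_request` are placeholder stubs, and the real versions are where the engineering lives.

```python
import subprocess
import tempfile

def fetch_ticket(ticket_id: str) -> dict:
    # Placeholder: in a real system this calls the Linear or Jira API.
    return {"id": ticket_id, "description": "Fix the flaky retry logic in the billing worker."}

def run_agent(workdir: str, task: str) -> None:
    # Placeholder: the coding agent explores the repo with grep/find and edits files here.
    ...

def open_pull_request(branch: str, ticket: dict) -> None:
    # Placeholder: open a PR; the developer's first touch with the work is the review.
    ...

def handle_ticket(ticket_id: str, repo_url: str) -> None:
    """One background run: ticket in, pull request out, human review last."""
    ticket = fetch_ticket(ticket_id)
    with tempfile.TemporaryDirectory() as sandbox:
        subprocess.run(["git", "clone", "--depth=1", repo_url, sandbox], check=True)
        branch = f"agent/{ticket_id.lower()}"
        subprocess.run(["git", "-C", sandbox, "checkout", "-b", branch], check=True)
        run_agent(sandbox, ticket["description"])
        subprocess.run(["git", "-C", sandbox, "commit", "-am", f"{ticket['id']}: agent draft"], check=True)
        subprocess.run(["git", "-C", sandbox, "push", "origin", branch], check=True)
    open_pull_request(branch, ticket)
```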
When your system can produce 45 pull requests in a single day, the constraint is no longer writing code. It is reading code. It is deciding whether the AI chose the right approach, not just whether the code compiles. It is taste.
The math is brutal. If an AI agent gets the implementation right nine times out of ten, that sounds great. But if you are generating dozens of PRs per day, you have a steady stream of work that requires a human with deep context to evaluate. The one-in-ten that chose the wrong approach is not obviously wrong — it compiles, it passes tests, it looks reasonable. Finding it requires someone who understands the system's intent, not just its syntax.
This is the new bottleneck, and almost no one is building for it. The industry is obsessed with making generation faster and cheaper. The organizations that win will be the ones that solve the review problem — that build systems where humans can efficiently exercise judgment over AI-generated work at scale.
Why Sandboxes Beat RAG
There is a technical insight here that matters for anyone building or buying agent infrastructure. Most enterprise AI architectures default to RAG — vectorize your codebase, retrieve relevant chunks, feed them to the model. It works for question-answering. It is a poor fit for code generation.
Background coding agents work differently. They clone the actual repository into an ephemeral sandbox. The agent uses standard tools — grep, find, the file system itself — to locate relevant code. It might take sixteen attempts to find the right file. That is fine. Compute is cheap. What matters is that when it starts writing code, it has real context, not a vector similarity approximation of context.
The sandbox gets destroyed when the task is done. There is no persistent state to manage, no vector index to keep in sync with your codebase. Stateless, disposable, accurate. This is the architecture that actually works for enterprise code generation at scale.
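A minimal sketch of the retrieval side of that pattern, assuming nothing about a specific agent framework: clone, let the model search with ordinary tools as many times as it needs, and let the directory disappear when the block exits.

```python
import subprocess
import tempfile

def search_repo(workdir: str, pattern: str) -> str:
    """Plain grep inside the sandbox; a failed search costs almost nothing."""
    result = subprocess.run(
        ["grep", "-rn", pattern, "."],
        cwd=workdir, capture_output=True, text=True,
    )
    return result.stdout  # empty output simply means "try another query"

def gather_context(repo_url: str, queries: list[str]) -> list[str]:
    """Ephemeral checkout: real files in, real context out.
    The clone is deleted when the with-block exits, so there is no persistent state."""
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(["git", "clone", "--depth=1", repo_url, workdir], check=True)
        # The agent can issue as many searches as it wants; there is no vector index to keep in sync.
        return [search_repo(workdir, q) for q in queries]
```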
The Coordination Problem Technology Made Worse
The technical challenges of background agents are solvable. The organizational challenges are harder.
When AI lowers the cost of building something to near zero, every team builds. Marketing builds an internal tool. Operations builds an integration. Three different brand teams within the same company build the same aggregator. No one finds out until the quarterly review.
This is not a technology problem. This is a coordination problem that technology made worse. When building was expensive, the cost itself was a coordination mechanism — you had to get budget approval, which meant someone asked whether anyone else was already doing this. When building is cheap, that natural check disappears.
The enterprises that figure out AI operationalization will not be the ones with the best models or the most tokens. They will be the ones that build organizational systems — visibility, governance, shared context, a universal log of every agent action — that match the speed of AI-assisted development. The agent mesh is not just a technical architecture. It is an organizational one.
Human-in-the-Loop Is Not a Limitation
There is a temptation to see human review as a temporary constraint — something we will automate away once the models get good enough. This is wrong, and it is dangerous.
Human review is not the bottleneck to be eliminated. It is the quality gate that prevents AI-generated slop from compounding into technical debt that takes years to unwind. The organizations currently shipping AI-built products without engineering review are building on sand. They do not know it yet because the failures have not cascaded.
The pattern that works is: context in, background execution, reviewable output, human approval. The agent does the work. The human exercises judgment. Code does not merge without a human saying yes.
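A sketch of where that gate sits, using a hypothetical merge check rather than any specific platform's API:

```python
from dataclasses import dataclass

@dataclass
class AgentPullRequest:
    number: int
    title: str
    approved_by: str | None = None   # set only when a human explicitly signs off

def can_merge(pr: AgentPullRequest) -> bool:
    """Deterministic guardrail: agent output is reviewable, never self-merging."""
    return pr.approved_by is not None

pr = AgentPullRequest(number=482, title="agent: tighten retry backoff")
assert not can_merge(pr)              # blocked: no human has said yes
pr.approved_by = "reviewer@example.com"
assert can_merge(pr)                  # merges only after explicit approval
```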
This is not a conservative position. It is the only position that scales. The alternative — letting AI-generated code flow into production without review — is how you end up with 18,000 water bottles ordered at a Taco Bell.
The Token Reckoning
One more uncomfortable truth before the close. By the end of 2026, organizations that have run a full year of P&L on their AI agent deployments are going to have a reckoning.
Right now, everyone is in experimentation mode. Engineers are burning ten to fifteen million tokens a day. Powerful models are being used for tasks that do not require them. Standup skills run daily and no one actually reads the output. The spend is justified because everything is new and the financial reward for being hands-on with AI is real.
This cannot be the steady state. The answer is not to cut tokens — it is to evaluate output. Did this agent run actually produce something valuable relative to what it cost? Are people using the output, or is it just noise in a Slack channel?
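A minimal version of that evaluation loop just attaches cost and a usage signal to every run. The price and threshold below are placeholder assumptions, not a recommendation.

```python
from dataclasses import dataclass

USD_PER_MILLION_TOKENS = 5.00      # assumed blended price; adjust for the models actually in use

@dataclass
class AgentRun:
    skill: str
    tokens_used: int
    output_was_used: bool           # did anyone read or act on the output?

    @property
    def cost_usd(self) -> float:
        return self.tokens_used / 1_000_000 * USD_PER_MILLION_TOKENS

def flag_waste(runs: list[AgentRun], min_use_rate: float = 0.3) -> list[str]:
    """Surface skills whose output is rarely used relative to what they cost."""
    by_skill: dict[str, list[AgentRun]] = {}
    for run in runs:
        by_skill.setdefault(run.skill, []).append(run)
    flagged = []
    for skill, skill_runs in by_skill.items():
        use_rate = sum(r.output_was_used for r in skill_runs) / len(skill_runs)
        spend = sum(r.cost_usd for r in skill_runs)
        if use_rate < min_use_rate:
            flagged.append(f"{skill}: ${spend:.2f} spent, {use_rate:.0%} of outputs used")
    return flagged

runs = [
    AgentRun("daily-standup", 2_400_000, False),
    AgentRun("daily-standup", 2_100_000, False),
    AgentRun("pr-triage", 900_000, True),
]
print(flag_waste(runs))   # ['daily-standup: $22.50 spent, 0% of outputs used']
```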
The organizations that build evaluation into their agent infrastructure now will be the ones that can scale their AI deployments. Everyone else hits a wall when the CFO starts asking questions.
The Through-Line
If you are an executive: stop treating this like procurement. The gap between a vendor demo and a working agent in your business is filled with software you have to write or commission. Hire engineers, not consultants.
If you are a founder: pick a hard problem and stay in it. Every obvious idea will be cloned by Friday. The moat is not novelty. The moat is the boring infrastructure underneath — the context layer, the memory, the evaluation, the integration plumbing — and the diligence to keep building it after the first commoditization wave.
If you are an engineering leader: your teams are already using AI coding tools whether you sanctioned it or not. The question is whether you have the infrastructure to absorb the output. Can your review processes handle ten times the PR volume? Do you have visibility into what every team is building? Are your agents running in environments where they have real codebase context, or are they hallucinating against stale vector embeddings?
Background agents are not coming. They are here. The only question is whether your organization is structured to use them, or be buried by them.
— Ry
Related Essays
The Agent Harness Problem
Enterprise agents need layered interfaces, real software skills, and flexible platforms. The harness around the model matters more than the model.
The Atomic Agent Mesh: Architecture, Build-vs-Buy, and the Review Layer
Enterprise AI will not be one mega-agent. It will be a mesh of atomic, auditable units, and the companies that nail review and context will own the next infrastructure layer.
Workflows, Not Roles: Why Scoped Agents Win
Role-based agents promise to replace a job and require infinite context engineering to deliver. Workflow-based agents start small, ship in a week, and compound.