11 min read · By Ry Walker

The Agent Harness Problem

Key takeaways

  • Enterprise agent systems need layered interfaces that serve technical builders and non-technical operators without dumbing anything down.
  • Skills defined as markdown are insufficient — real skills require tools, tests, memory, and deep integration with the systems they operate on.
  • Platforms must support multiple CLIs, multiple models, multiple repos, and self-hosted deployment, because no enterprise wants to be locked into one vendor's harness.
  • The same engineering primitives — sandboxes, tool access, governance, iteration loops — that power coding agents serve every other knowledge worker.
  • The iteration loop is the product: building an agent is easy, getting an agent right is the hard part, and conversational refinement beats consulting engagements.
  • Context engineering, not infrastructure plumbing, is the work that determines whether an agent system actually functions.

FAQ

What is the agent harness?

The harness is everything around the model — the sandbox, tools, system prompts, memory, governance, and iteration loop. It is the layer that turns a capable model into a usable agent inside an enterprise.

Why are markdown skills insufficient for enterprise agents?

Markdown descriptions tell an agent what exists but not how to operate on it reliably. Real skills need executable tools wired to specific endpoints, automated tests, schema validation, and memory of prior runs so the agent does not waste effort rediscovering the environment.

Why must agent platforms support multiple CLIs, models, and repos?

Enterprises already have multiple model contracts, mixed coding CLIs, and complex repository topologies including monorepos and submodules. A platform locked to one harness inherits that harness's bugs forever and cannot serve customers who need self-hosted or EU-resident deployment.

Every enterprise building agents right now is having the same conversation, whether they realize it or not. It goes like this: "We built something that works for engineers. Now we need non-technical people to use it. And we cannot dumb it down."

This is not a UX problem. It is an architecture problem. And most teams are solving it wrong.

The instinct is to slap a chat interface on top of an agent system and call it accessible. Chat is the default interaction pattern for AI right now, and it works well enough for single-player exploration. But the moment you need an agent system that serves a franchise operator in Austin differently than it serves a VP of marketing at headquarters — while both retain full visibility into what the agent is actually doing — chat alone falls apart.

The real requirement is layered interfaces with layered permissions, connected to the same underlying agent mesh. An engineer should be able to push a nine-file PR through their agent. A product manager using the same platform needs the agent to keep changes small, ask before opening PRs, and stay inside a narrow scope. The difference is not a simplified UI. It is a different system prompt, different guardrails, different defaults — but the same primitives underneath.
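To make that concrete, here is a minimal sketch, with hypothetical names, of two role profiles that share the same primitives and differ only in system prompt, guardrails, and defaults.

```typescript
// Sketch only: hypothetical types, not any vendor's actual API.
interface AgentProfile {
  role: string;
  systemPrompt: string;
  maxFilesPerChange: number;        // guardrail, not a UI setting
  requireApprovalBeforePR: boolean; // default behavior, same primitive underneath
  allowedScopes: string[];          // which repos and systems the agent may touch
}

const engineerProfile: AgentProfile = {
  role: "engineer",
  systemPrompt: "You are a senior engineer. Open PRs directly when tests pass.",
  maxFilesPerChange: 50,
  requireApprovalBeforePR: false,
  allowedScopes: ["monorepo/*", "infra/*"],
};

const productManagerProfile: AgentProfile = {
  role: "product-manager",
  systemPrompt: "Keep changes small. Always ask before opening a PR.",
  maxFilesPerChange: 3,
  requireApprovalBeforePR: true,
  allowedScopes: ["docs/*", "config/feature-flags/*"],
};

// Both profiles feed the same primitive; only the envelope around it differs.
declare function runAgent(profile: AgentProfile, task: string): Promise<void>;
```

The product manager's experience comes from a different envelope around the same runAgent call, not from a separate, simplified system.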

That underneath is the harness. And the harness is where every interesting problem lives.

Skills Are Software, Not Markdown

The biggest lie in agent infrastructure right now is that you can describe a skill in a markdown file and call it done.

A skill that says "when the user asks about pipeline metrics, query the analytics database" is not a skill. It is a wish. A real skill has executable tools wired to specific endpoints, automated tests on those tools, validation that the data shape matches what the agent expects, memory of what happened the last hundred times the agent ran this workflow, and integration tailored to the actual data model — not a generic description of the data model.

When an agent needs to query analytics data, a markdown description of the analytics platform is not enough. The skill needs to specify exactly which tables, which event names, which API endpoints. Without that level of detail, the agent burns enormous effort on discovery, trying to figure out where data lives instead of answering the question. You watch this happen in real time and realize the model is fine. The harness is starving.
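As a rough illustration of the difference, here is what a skill might look like as software rather than prose. The endpoint, table, and column names below are invented for the example; the point is the executable tool plus the deterministic shape check that runs before the model ever sees the data.

```typescript
// Hypothetical "pipeline metrics" skill. The endpoint, table, and column
// names are invented for illustration, not a real schema.
interface PipelineMetricsRow {
  stage: string;
  opportunityCount: number;
  totalValueUsd: number;
}

export const pipelineMetricsSkill = {
  name: "query_pipeline_metrics",
  description: "Return open pipeline by stage from the analytics warehouse.",
  // An executable tool wired to one specific endpoint and one specific
  // table, instead of a prose description of "the analytics platform".
  async run(quarter: string): Promise<PipelineMetricsRow[]> {
    const res = await fetch("https://analytics.internal.example.com/v2/query", {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({
        sql: `SELECT stage,
                     count(*)        AS opportunity_count,
                     sum(amount_usd) AS total_value_usd
              FROM fct_opportunities
              WHERE close_quarter = :quarter AND status = 'open'
              GROUP BY stage`,
        params: { quarter },
      }),
    });
    if (!res.ok) throw new Error(`analytics query failed: ${res.status}`);
    const rows = (await res.json()) as Record<string, unknown>[];
    // Deterministic shape check before the model ever sees the data.
    return rows.map((row) => {
      if (
        typeof row.stage !== "string" ||
        typeof row.opportunity_count !== "number" ||
        typeof row.total_value_usd !== "number"
      ) {
        throw new Error(`unexpected row shape: ${JSON.stringify(row)}`);
      }
      return {
        stage: row.stage,
        opportunityCount: row.opportunity_count,
        totalValueUsd: row.total_value_usd,
      };
    });
  },
};
```

A real deployment would also pair the tool with an automated test against a fixture and a memory entry recording what the last runs discovered, so the agent is not rediscovering the warehouse every time.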

This is why agent platforms that ship "skill marketplaces" full of markdown configurations are selling something closer to a wiki than to software. Skills need to be treated like any other software artifact — versioned, tested, reviewable, with deterministic checks that run before the agent uses them and observability for what happened when it did.

The corollary: context engineering is the real work, not the infrastructure plumbing. Standing up an EC2 instance with bash scripts and a YAML config is something a competent engineer can do in a week. Getting the context right — the docs, the deterministic checks, the retrieval logic that puts the right information in front of the model at the right time — is the work that never ends.

The maintenance burden splits in two. The infrastructure layer — runtimes, dev containers, CI/CD integrations, API connections — tends to be a one-and-done effort that rarely breaks once set up. The context layer — the agents.md files, skill definitions, memory journals, deterministic check libraries — needs continuous iteration. Because these are autoregressive systems, longer context documents degrade performance. The feedback loop never closes.
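One way to picture the context layer's job, sketched with invented file names and a simple character budget standing in for tokens:

```typescript
// Sketch only: a character budget as a stand-in for a real token budget.
interface ContextSource {
  name: string;        // e.g. "agents.md", "skills/analytics.md", "memory/journal.md"
  priority: number;    // lower number = loaded first
  load: () => string;
}

function assembleContext(sources: ContextSource[], budgetChars: number): string {
  const parts: string[] = [];
  let used = 0;
  for (const source of [...sources].sort((a, b) => a.priority - b.priority)) {
    const text = source.load();
    if (used + text.length > budgetChars) {
      // This is the loop that never closes: deciding what to trim when the
      // journal and the skill docs no longer fit in the budget together.
      continue;
    }
    parts.push(`# ${source.name}\n${text}`);
    used += text.length;
  }
  return parts.join("\n\n");
}
```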

Flexibility Is Not Optional

Stage one of agentic engineering was tab autocomplete in the editor. Stage two was CLI-based coding agents on a developer's laptop. Stage three is running those CLIs in the cloud, triggered by tickets, alerts, releases — multiplayer, observable, integrated into existing workflows rather than siloed on one machine.

Stage three is where every interesting enterprise question shows up. And the answer to almost every one of those questions is: the platform has to be flexible.

Flexible across CLIs, because the same coding CLI with the same model produces different results depending on the system prompt and harness. A team that has standardized on Claude Code for its main repo and Codex for its data pipelines does not want a platform that supports only one. The harness wraps the CLI; if you only support one harness, you have committed your customer to that one harness's bugs forever.

Flexible across models, because every customer with a real procurement department already has multiple model contracts. Selling tokens is not a business. The orchestration layer that lets teams bring their own keys and route work across providers is the business.

Flexible across repos, because real engineering organizations have monorepos, polyrepos, submodules, vendored libraries, and at least three abandoned repositories that someone insists are still load-bearing. Coding agents that clone a single repo and work within its boundaries fall over the moment they encounter a type definition in a submodule. They invent types that already exist. They reference fields that do not match the actual schema. They grep their way through files instead of using language server protocols to navigate type hierarchies. The fix is not a smarter model. It is a harness that understands repository topology and gives the agent the same tooling — language servers, type checkers, live linter feedback — that human developers already rely on.
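A harness like that needs an explicit picture of the workspace. A hypothetical manifest, with invented repo names and commands, might look like this:

```typescript
// Hypothetical workspace manifest; field names and repos are illustrative.
interface WorkspaceManifest {
  repos: {
    name: string;
    url: string;
    kind: "monorepo" | "polyrepo" | "submodule" | "vendored";
    checkoutPath: string;
  }[];
  // Tooling the harness exposes to the agent, so it navigates type
  // hierarchies through a language server instead of grepping.
  tooling: {
    languageServers: string[];
    typeCheck: string; // command the agent can run for live feedback
    lint: string;
  };
}

const workspace: WorkspaceManifest = {
  repos: [
    { name: "app", url: "git@github.com:acme/app.git", kind: "monorepo", checkoutPath: "app" },
    { name: "shared-types", url: "git@github.com:acme/shared-types.git", kind: "submodule", checkoutPath: "app/vendor/shared-types" },
  ],
  tooling: {
    languageServers: ["typescript-language-server"],
    typeCheck: "pnpm tsc --noEmit",
    lint: "pnpm eslint . --max-warnings 0",
  },
};
```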

Flexible on deployment, because EU data residency and self-hosted operation are not edge-case requirements. They are table stakes for any agent platform selling into regulated industries or European companies. The competitive moat is not the interface or the model. It is the ability to run inside the customer's cloud with no data leaving their environment. Most SaaS-native competitors cannot satisfy that and never will.

The companies that win this market will be the ones that give developers the most freedom and customization in a way that remains sane. The right abstractions matter more than the right features.

The Primitives Are the Same Across Roles

Here is the part most people miss when they look at the market today. The cloud IDE companies are competing for engineering teams. The "agent OS" startups are competing for business operations teams. The RPA incumbents are trying to rebrand as AI-native. The foundation model providers are shipping individual-grade tools that enterprises cannot adopt because there is no governance, no self-hosting, no audit trail.

In twelve months, every one of these categories will be trying to become the others.

Strip away the surface and every enterprise agent needs the same things. A sandbox where it can execute without breaking production. Access to tools — APIs, databases, internal systems. A prompt interface where a human defines the job. A governance layer that produces reviewable, approvable output before anything touches the real world.
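Sketched as interfaces, with illustrative names rather than any particular platform's API, those primitives might look like this:

```typescript
// Sketch of the four primitives; names are illustrative.
interface Sandbox {
  exec(command: string): Promise<{ stdout: string; exitCode: number }>;
  destroy(): Promise<void>;
}

interface Tool {
  name: string;
  description: string;
  run(input: unknown): Promise<unknown>;
}

interface JobSpec {
  requestedBy: string;   // the human defining the job, in plain language
  instructions: string;
}

interface Governance {
  // Nothing reaches the real world without a reviewable artifact.
  propose(change: { summary: string; diff: string }): Promise<{ approvalId: string }>;
  awaitApproval(approvalId: string): Promise<"approved" | "rejected">;
  audit(event: { actor: string; action: string; at: Date }): void;
}
```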

These are engineering primitives. They were built for coding agents. They are not exclusive to coding agents.

An insurance adjusting agent needs a sandbox just as much as a bug-fixing agent does. A sales workflow agent needs tool access just as much as a PR review agent does. A compliance agent needs governance just as much as a release agent does. The difference is cosmetic — the UX layer that the end user sees. The knowledge worker does not want to see npm install commands. They want to describe what they need in plain language and watch it happen. But underneath, it is the same machine.

This is why platforms that over-constrain to engineering workflows — that hardcode "you are a software engineer" into their system prompts — are leaving enormous value on the table. They have built the engine. They just refuse to put a different body on the car.

The strategic tension every agent platform faces: you cannot afford to go wide on day one, because you are too small and going broad kills focus. But you cannot afford to stay narrow forever, because the market is converging. Every company over a hundred employees is going to want one platform where engineering, sales, operations, and compliance can all build and run agents inside their own cloud.

The resolution is not to pick. It is to go deep on primitives that are inherently broad. Build the sandbox so well that it works for any workload. Build the governance layer so well that it satisfies HIPAA and SOC 2 and whatever else the regulated enterprise needs. Build the iteration loop so well that a non-technical user can talk an agent into existence and then talk it into correctness over weeks.

When the market demands breadth, you do not have to rebuild. You open the door.

The Iteration Loop Is the Product

Building an agent is not the hard part. Getting an agent right is the hard part.

Every agent is wrong on day one. It misunderstands edge cases. It formats output incorrectly. It makes assumptions that do not match how the team actually works. The only people who can fix these problems are the people who use the agent every day — not the engineers who built it.

This is why the old model breaks. Hire engineers to build static agents for business teams. The business team discovers problems. They file tickets. The engineers prioritize other work. The agent rots. Six months later, nobody uses it. This is exactly the failure pattern of every RPA project for the last fifteen years, and every AI vendor that copies the RPA delivery model is going to repeat it.

The new model is conversational iteration. The user talks to the agent like a direct report. "You are doing this wrong. Do it this way instead." The agent generates a pull request with the proposed change to its own configuration. A human reviews and approves it. The agent gets better. The change is versioned, audited, and reversible. This is not a chatbot. This is software development happening through conversation, with full governance underneath.
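A rough sketch of that loop, with invented names, where feedback becomes a reviewable, versioned change to the agent's own configuration:

```typescript
// Sketch only: conversational iteration routed through an approval gate.
interface AgentConfigChange {
  file: string;   // e.g. "agents/sales-enrichment/agents.md"
  diff: string;   // the proposed edit, as a unified diff
  reason: string; // the user's feedback, kept verbatim for the audit trail
}

interface ApprovalGate {
  propose(change: { summary: string; diff: string }): Promise<{ approvalId: string }>;
  awaitApproval(approvalId: string): Promise<"approved" | "rejected">;
}

async function handleFeedback(
  feedback: string,
  draftChange: (feedback: string) => Promise<AgentConfigChange>,
  gate: ApprovalGate,
  applyChange: (change: AgentConfigChange) => Promise<void>,
): Promise<void> {
  // "You are doing this wrong. Do it this way instead."
  const change = await draftChange(feedback);
  const { approvalId } = await gate.propose({
    summary: `Agent config update: ${change.reason}`,
    diff: change.diff,
  });
  // Nothing applies until a human approves; the change is reversible
  // because it lives in version control like any other commit.
  if ((await gate.awaitApproval(approvalId)) === "approved") {
    await applyChange(change);
  }
}
```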

It also implies something most platforms ignore: each beneficiary should have control over their own instance. Ten salespeople should have ten separately evolving agents. A manager layer observes divergence across instances and locks in company-wide policies when patterns emerge. Pre-configured agents that try to work out of the box fail the same way one-size-fits-all human consultants fail. A book-smart, experience-poor agent that learns through interactions with its specific user beats a polished agent that nobody can shape.

And every agent task — every PR, every email, every enriched record — should produce two outputs. The deliverable, and a companion context update capturing what the agent learned during execution. Without that secondary output, agents repeat the same mistakes indefinitely, burning tokens and patience on problems they have already encountered. The platforms that build this learning loop into the substrate, rather than bolting it on, will look completely different from the platforms that did not.
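Sketched as a type, with illustrative field names, the contract is small:

```typescript
// Sketch: every task returns the deliverable plus a context update.
interface TaskResult<T> {
  deliverable: T;        // the PR, the email, the enriched record
  contextUpdate: {
    learned: string[];   // e.g. "staging warehouse uses snake_case column names"
    appendTo: string;    // e.g. "memory/journal.md"
  };
}
```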

What Engineering Leaders Should Actually Ask

Stop thinking about categories. Stop asking whether you need a cloud IDE or an agent OS or an RPA replacement. Ask instead whether the platform has the primitives to run any agent, for any team, inside your environment, with the governance your organization requires.

Ask whether your non-engineering teams can use it without seeing code, and whether your engineering team can use it without feeling constrained.

Ask whether the platform produces reviewable output that goes through an approval process before it touches production systems.

Ask whether skills are real software — with tools, tests, integration, and observability — or whether they are markdown files dressed up as features.

Ask whether the platform supports multiple CLIs, multiple models, multiple repos, and self-hosted deployment, or whether it has quietly locked you into one vendor's harness.

Ask whether the iteration loop works without filing an engineering ticket every time something needs to change.

Most importantly, ask what the platform looks like a year from now, when half your sales team and half your operations team are also building and customizing agents. Because that is where this is going. The same primitives that let an engineer push a nine-file PR are the primitives that let an account executive build an enrichment workflow that they actually understand and own.

The companies that get this right will not have better AI tools. They will have given every team in the organization a squad of engineers in their pocket, building and refining the workflows that actually run the business. That is not a feature. That is a structural advantage that compounds every single day.

The model is going to keep improving. The harness is what your team has to live inside. Build that part right.