The prompt is no longer the interesting part of an agent. The harness — orchestration, tool integrations, persistence, the rules governing agent behavior — is where all the complexity lives now.
Teams building agentic systems describe the same evolution. The first version was a structured LLM workflow: deterministic steps, clear inputs, clear outputs. It worked, but it was too rigid for what customers needed. The rewrite is fully agentic. The agent chooses tools, decides on follow-up actions, adapts to context. That autonomy requires a much more sophisticated harness to keep it from going off the rails.
Same dynamic in coding agents. The prompt is cheap. The harness — tool call management, persistence across sessions, multiplayer coordination, policy enforcement, observability — is the product. And the harness is where the context engineering burden concentrates.
The harness surfaces immediately in real deployments
The complexity of the harness is not theoretical. It shows up the moment a real enterprise team tries to deploy agents across their SDLC.
Consider a company consolidating three acquired platforms onto a single engineering org. They have GitLab, Bitbucket, and GitHub. They have Java, Python, and legacy codebases in the same repos. They use Jira, Confluence, Datadog, Snowflake, Microsoft Teams. They need agents that enrich tickets, review PRs, and eventually write code — all while respecting HIPAA constraints and PE-firm governance requirements.
Every one of those requirements is a harness problem, not a prompt problem:
Multi-source-control orchestration. The agent needs to open PRs across repos that live in different source control providers. The harness manages authentication, repo detection, and identity — including whether the agent's contributions appear as a bot account or accidentally impersonate the user who configured it. That identity question alone has real consequences: an SVP's account leaving code review comments on every PR looks like micromanagement, not automation.
Skill routing by context. When the same agent reviews PRs across Python and Java codebases, the harness determines which skills get loaded. Do all skills load every time? Does the agent select the right ones? Does it need explicit instructions to invoke a Python skill on Python code? These are harness configuration questions. The prompt says "review this PR." The harness decides what tools and knowledge are available for that review.
Model selection per task type. Enriching a Jira ticket with context from Confluence and Datadog does not require a frontier model. Reviewing a complex PR or generating code probably does. The harness should let you assign models per agent: Opus for planning and review, lightweight open-source models like Qwen or GLM for enrichment and triage. Teams that get this right report meaningful cost savings without quality degradation on the simpler tasks.
Trigger design and timing. When should an enrichment agent fire? On ticket creation — before the engineer has even finished writing the description? On status change to "ready for development"? The harness controls this, and getting it wrong means the agent enriches tickets with no context, producing noise instead of signal. Getting it right means engineers find relevant duplicate tickets, related code paths, and implementation suggestions waiting for them before they start work.
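Three of the decisions above (skill routing, model selection, and trigger timing) can be written down as plain harness configuration. What follows is a minimal sketch, not any particular product's API: the skill names, the model identifiers, and the exact status string are hypothetical placeholders chosen for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical skill names keyed by file extension (illustrative only).
SKILLS_BY_EXTENSION = {
    ".py": "python-review",
    ".java": "java-review",
}

# Hypothetical model identifiers per task type; substitute real ones.
MODEL_BY_TASK = {
    "enrichment": "small-open-model",
    "triage": "small-open-model",
    "review": "frontier-model",
    "codegen": "frontier-model",
}

# Assumed name of the ticket status that signals the description is complete.
READY_STATUS = "ready for development"


def skills_for_pr(changed_files):
    """Return only the skills relevant to the languages a PR touches."""
    skills = set()
    for path in changed_files:
        skill = SKILLS_BY_EXTENSION.get(PurePosixPath(path).suffix)
        if skill is not None:
            skills.add(skill)
    return sorted(skills)


def model_for_task(task_type):
    """Pick the model assigned to a task type; fail loudly on unknown types."""
    if task_type not in MODEL_BY_TASK:
        raise ValueError(f"no model configured for task type {task_type!r}")
    return MODEL_BY_TASK[task_type]


def should_enrich(old_status, new_status):
    """Fire enrichment on the transition into the ready status, not on creation."""
    return new_status == READY_STATUS and old_status != READY_STATUS
```

Keeping these as data rather than prompt text means they can be versioned and reviewed like any other configuration, and a mixed Python and Java PR loads both skills without anyone asking for them.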
Agents deserve breakpoints
There is a strong temptation to deploy an agent and let it run unsupervised. Resist it. New agents — baby agents, just born — require supervision. In the same way a debugger lets you set breakpoints in code, agent harnesses need breakpoints where a human reviews what happened before the agent proceeds.
The pattern looks like this: an agent drafts a tweet from a meeting transcript. Before it posts, a human sees a preview and chooses approve, reject, or edit. If the human rejects, they should be able to tell the agent why — stop using that phrase, change the tone, restructure the format — and have the agent rerun with that feedback incorporated. Over time, as confidence grows, you remove breakpoints. Eventually the agent earns cruise control. But the default should be supervised, not autonomous.
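The approve, reject, or edit loop can be sketched as a small gate function. This is an illustration under assumptions: `draft_fn` and `review_fn` are hypothetical callables standing in for the agent and the human reviewer, and the cap of three attempts is arbitrary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    action: str                        # "approve", "reject", or "edit"
    feedback: Optional[str] = None     # reason supplied on reject
    edited_text: Optional[str] = None  # replacement text supplied on edit


def run_with_breakpoint(draft_fn, review_fn, max_attempts=3):
    """Draft, pause for human review, and rerun with feedback on rejection.

    draft_fn(feedback_list) -> str  produces a draft given all feedback so far.
    review_fn(draft) -> Decision    stands in for the human at the breakpoint.
    Returns the approved or edited text, or None if nothing was approved.
    """
    feedback = []
    for _ in range(max_attempts):
        draft = draft_fn(feedback)
        decision = review_fn(draft)
        if decision.action == "approve":
            return draft
        if decision.action == "edit":
            return decision.edited_text
        if decision.action == "reject":
            # Carry the reason forward so the next draft can account for it.
            feedback.append(decision.feedback or "")
            continue
        raise ValueError(f"unknown action {decision.action!r}")
    return None
```

Removing a breakpoint later amounts to swapping `review_fn` for one that always approves, which keeps the change explicit and reversible.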
The same pattern applies to code. An agent creates a PR. CI fails. A review bot runs and flags the same failure CI already reported. That is the harness running redundant checks because no one configured the conditional: run the review bot only if CI passes. These are not prompt problems. They are orchestration problems, and the harness is where you solve them.
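The missing conditional is small once it is written down. A sketch, assuming CI reports one of three status strings and that the step names are hypothetical labels the harness understands:

```python
def plan_checks(ci_status):
    """Decide which follow-up steps run for an agent-created PR.

    ci_status is assumed to be "passed", "failed", or "pending".
    """
    if ci_status == "passed":
        return ["review_bot"]
    if ci_status == "failed":
        # CI already reported the failure; hand it back to the agent instead
        # of paying for a review that would flag the same thing.
        return ["return_to_agent"]
    if ci_status == "pending":
        return []  # wait for CI before scheduling anything
    raise ValueError(f"unknown CI status {ci_status!r}")
```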
Supervision is not a limitation. It is how agents mature. The harness must support graduated autonomy natively, not as an afterthought bolted on when something goes wrong in production.
The SDLC as a harness design problem
The most ambitious version of agent deployment is not a single agent doing a single task. It is an end-to-end SDLC where agents handle ticket enrichment, code generation, PR creation, code review, and CI coordination — with humans reviewing at decision points and writing code themselves when the task demands it.
The vision looks like this: a customer request arrives. One agent gathers industry knowledge and enriches the ticket with context from documentation, observability tools, and the codebase. A second agent analyzes the code to identify where a bug lives or how a feature should be implemented. A lead engineer reviews the enriched ticket and decides: is this something an agent can handle end-to-end, or does it need human coding? Either path ends with a PR. Review bots run. CI runs. Humans approve.
Every handoff in that chain is a harness problem. The agent that enriches the ticket needs access to Jira, Confluence, Datadog, and the relevant repos. The agent that writes code needs the right skills loaded for the right language. The review bot needs to know whether CI already passed. The lead engineer needs enough visibility into what the agents did to make a trust decision about the output.
None of this is about the prompt. The prompt for each step might be three sentences. The harness that connects them, manages their tool access, routes their output, and enforces review policies is thousands of lines of configuration and integration code.
Cloud-based agent sessions as harness infrastructure
Running multiple agents in parallel is the emerging expectation. Enterprise teams are telling developers to 4x their output — and the only concrete interpretation of that right now is "run four agent sessions at once." But running four concurrent coding agents locally exhausts memory and CPU on even high-end machines. The harness solves this by moving agent sessions to cloud-based compute.
The model looks like this: each agent session gets its own full environment — a dedicated server with the repo cloned, dependencies installed, databases running, the app bootable. The developer orchestrates from a lightweight client. When a session produces a PR, the harness provides a link not just to the diff but to the entire running environment. A teammate can open that environment, see the app running, fire up their preferred CLI tool — Claude Code, OpenCode, whatever — and resume where the previous agent left off. The harness logs every interaction across every user who touches that session.
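A shared session with a multi-user interaction log might be modeled roughly as follows. The field names and the shape of each event are assumptions made for illustration, not a description of any existing harness.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Session:
    session_id: str
    repo: str
    events: list = field(default_factory=list)

    def record(self, user, tool, action, ts=None):
        """Append one interaction; the log is append-only by convention."""
        self.events.append({
            "ts": time.time() if ts is None else ts,
            "user": user,
            "tool": tool,
            "action": action,
        })

    def participants(self):
        """Users who touched the session, in order of first appearance."""
        seen = []
        for event in self.events:
            if event["user"] not in seen:
                seen.append(event["user"])
        return seen
```

Because every user writes to the same log, a teammate who resumes the session with a different CLI tool still leaves a trace in the same audit trail.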
This changes several things at once. Onboarding friction drops to near zero: no local environment setup, no missing env files, no "I spent the morning configuring my dev environment" standup updates. Preview environments become a byproduct of development rather than a CI/CD feature. And developers working across multiple clients or projects just switch between cloud sessions instead of maintaining parallel local setups.
For regulated industries, the same infrastructure addresses data loss prevention requirements. If the compute environment is hosted in a controlled region and the developer never pulls code to a local machine, you satisfy clean room requirements without the physical clean room. The harness manages the isolation boundary.
The key insight is that these cloud environments are not production infrastructure. They are dev infrastructure that the harness manages — ephemeral, shareable, auditable, and cheap enough to run dozens of them per team.
Measuring what the harness already knows
Enterprise teams are scrambling to measure developer productivity in an agent-augmented world. The dashboards being built internally track prompts sent, lines generated, tab completions, cost per developer — aggregated across whatever tools each person happens to use. The results are messy. Different tools report different metrics. A developer using only a CLI agent shows zero on every metric except cost. The dashboards cannot distinguish between a developer who merged a careful, well-tested PR and one who approved garbage.
This is a harness problem. If agent work flows through orchestration infrastructure, the infrastructure already has the data. Session duration. Number of human interactions per session. Commits produced. CI pass rate on agent-generated PRs. Cost per task. Which tools were used, and for how long. Whether a PR required rework after merge. The harness does not need to instrument anything extra — it just needs to surface what it already logs.
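To make "surface what it already logs" concrete, here is a minimal aggregation sketch. It assumes the harness can export one record per session with the fields shown; the record shape is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SessionRecord:
    developer: str
    duration_s: float
    human_interactions: int
    commits: int
    ci_passed: bool
    cost_usd: float


def summarize(records):
    """Roll session records up into per-developer totals and a CI pass rate."""
    summary = {}
    for r in records:
        s = summary.setdefault(r.developer, {
            "sessions": 0, "duration_s": 0.0, "human_interactions": 0,
            "commits": 0, "ci_passed": 0, "cost_usd": 0.0,
        })
        s["sessions"] += 1
        s["duration_s"] += r.duration_s
        s["human_interactions"] += r.human_interactions
        s["commits"] += r.commits
        s["ci_passed"] += int(r.ci_passed)
        s["cost_usd"] += r.cost_usd
    for s in summary.values():
        # Every entry has at least one session, so the division is safe.
        s["ci_pass_rate"] = s["ci_passed"] / s["sessions"]
    return summary
```

Nothing here requires new instrumentation: each field is something the orchestration layer has to know anyway in order to run the session.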
The measurement question also exposes a fairness problem. Teams that adopted agents early and already operate at high efficiency are being told to 4x on top of that baseline. The harness can help here too: if you can see that a developer is already running multiple parallel sessions, already hitting cost caps, already producing at the top of the distribution, you have evidence that the mandate does not apply equally. Without harness-level visibility, managers are guessing — pitting developers against each other using metrics that do not capture what actually matters.
The controversial version of this is a leaderboard. The less controversial version is personal benchmarking: how does my agent usage this month compare to last month? The harness supports both, and the choice of which to surface is a policy decision — which, like all policy decisions in agent systems, belongs in the harness.
Micro agents and the deployment question
The other dimension of the harness is resource efficiency. A coding agent VM might need two gigabytes of RAM. But a "cloud co-work" style agent — one that triages an inbox, drafts social posts, enriches CRM data — should not require that footprint. If the harness can compile agent definitions down to small, deterministic processes, you get micro agents: 20 MB of RAM, fast startup, cheap to schedule, easy to chain.
This matters because the interesting deployment model is not one agent running on a developer's laptop. It is dozens of micro agents running on a schedule, each with its own maturity level, each with its own breakpoints, all orchestrated by a harness that handles authentication, tool access, and human-in-the-loop coordination.
The distinction between "hard" and "soft" agents matters here too. A soft agent is a JSON definition you can modify and rerun without recompilation — good for iteration, good for early maturity stages. A hard agent is compiled, locked down, validated — good for production, good for agents that have earned trust. The harness should support both, and the transition between them should be a deliberate act, not an accident.
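One way to make the soft-to-hard transition a deliberate act is to validate the JSON definition and pin a hash of its canonical form. This is a sketch of that idea, not a description of how any harness compiles agents: the required fields (`name`, `tools`, `steps`) are a hypothetical schema, and hashing stands in for whatever compilation step a real system performs.

```python
import hashlib
import json

REQUIRED_FIELDS = {"name", "tools", "steps"}  # hypothetical schema


def freeze(definition):
    """Promote a soft (editable) definition to a hard (locked) one.

    Returns the canonical JSON text and its SHA-256 digest. The digest is
    what the harness would pin for production runs.
    """
    missing = REQUIRED_FIELDS - set(definition)
    if missing:
        raise ValueError(f"definition missing fields: {sorted(missing)}")
    # Sorted keys and fixed separators make the text independent of key order.
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return canonical, digest


def verify(canonical, pinned_digest):
    """Return False if a hard agent's definition no longer matches its pin."""
    actual = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return actual == pinned_digest
```

A soft agent is simply the dictionary before `freeze` is called; any later edit changes the digest, so it cannot slip into a production run unnoticed.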
The harness is where users change agents
One underappreciated requirement: end users — not just developers — should be able to change agent behavior. If a daily planner agent is not working the way someone wants, they should be able to chat with it and request changes. The harness routes that feedback to the right place, whether that means updating context files the agent reads, modifying the agent definition, or firing up a coding agent to make a deeper change.
This is how you keep humans valuable. Even if the agent gets it right 99 times out of 100, the one time a human redirects it, and the system records and reacts to that feedback, is what justifies the human in the loop. The harness must make that feedback path frictionless.
I've argued elsewhere that controllability is not optional for enterprise teams. That requirement is enforced inside the harness, not the prompt. The prompt cannot be inspected, versioned, audited, or rolled back across sessions in any meaningful way. The harness can. So the harness is where governance lives, where memory lives, where the integrations with your real systems live.
Token tracking and cost governance
One harness requirement that surfaces quickly in enterprise deployments: token usage visibility. When an organization runs agents across multiple models — frontier models for code generation, open-source models for enrichment — the harness must track consumption per agent, per model, per task type. Without this, teams cannot make informed decisions about model selection or justify costs to leadership.
This is especially acute when the harness supports bring-your-own-keys alongside platform credits. The harness needs to know which key was used, how many tokens were consumed, and whether the cost profile matches what the team expected. An unlimited plan that shows zero usage data is not transparency — it is a governance gap the harness should close.
Enterprise teams are already hitting cost caps — $2,000 per developer per month in some cases — and the highest performers are the ones bumping against them. The harness should surface who is hitting limits and why, so organizations can make informed decisions about raising caps rather than treating cost governance as a blunt tollgate. When enterprise agreements mean the tracked cost is not even real spend, the harness needs to distinguish between nominal and actual cost to avoid misleading dashboards.
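A usage ledger that keeps nominal and actual cost apart could look something like the sketch below. The record fields, the `byok` and `platform` labels, and the choice to compare the cap against nominal spend are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    developer: str
    agent: str
    model: str
    task_type: str
    key_source: str     # "byok" or "platform" (assumed labels)
    tokens: int
    nominal_usd: float  # cost at list price
    actual_usd: float   # cost really billed; may be 0.0 under a flat agreement


def totals_by(records, field):
    """Sum tokens and both cost figures, grouped by any Usage field name."""
    out = defaultdict(lambda: {"tokens": 0, "nominal_usd": 0.0, "actual_usd": 0.0})
    for r in records:
        bucket = out[getattr(r, field)]
        bucket["tokens"] += r.tokens
        bucket["nominal_usd"] += r.nominal_usd
        bucket["actual_usd"] += r.actual_usd
    return dict(out)


def developers_at_cap(records, cap_usd):
    """Developers whose nominal spend has reached the cap, sorted by name."""
    per_dev = totals_by(records, "developer")
    return sorted(d for d, t in per_dev.items() if t["nominal_usd"] >= cap_usd)
```

Grouping by `model`, `agent`, `task_type`, or `key_source` uses the same function, so the per-agent, per-model, and per-task views all come from one set of records.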
If you are evaluating agent products, look at the har
Key takeaways
- Structured LLM workflows broke under real customer needs. Teams are rewriting them as fully agentic systems with adaptive tool use.
- Autonomy requires a much more sophisticated harness — orchestration, multiplayer coordination, policy enforcement, observability.
- The prompt is not where the differentiation lives. The harness is the product, and the harness is where the context engineering burden concentrates.
- Agents should start supervised and earn autonomy. Breakpoints, human review steps, and maturity levels belong in the harness from day one.
- Micro agents — small, deterministic, low-resource — are a compelling deployment model when the harness handles scheduling, auth, and human-in-the-loop coordination.
- Multi-source-control orchestration, skill routing, and model selection per task type are harness problems that surface immediately in real enterprise deployments.
- Cloud-based dev environments that persist agent sessions solve the parallel-agent resource problem and create shareable, auditable workspaces — another harness responsibility.
- Developer productivity measurement is a harness problem. If agent sessions flow through orchestration infrastructure, usage data, session logs, and contribution audit trails come for free.
FAQ
Why are teams rewriting structured workflows as agentic systems?
Because deterministic step graphs were too rigid for what customers actually wanted. The rewrite gives the agent room to choose tools, decide on follow-up actions, and adapt to context — at the cost of needing a much more rigorous harness around it.
What does a sophisticated harness include?
Tool call management, persistence across sessions, multiplayer coordination between agents and humans, policy enforcement, and end-to-end observability. In practice it also includes source control integration across multiple providers, skill routing based on language or domain, model selection per task type, identity management for bot-authored contributions, cloud-based dev environments for parallel agent sessions, and developer activity dashboards built from session logs. None of that lives in the prompt. All of it is engineering work.
Why does human-in-the-loop matter for agent maturity?
New agents are unreliable. They need breakpoints — places where a human reviews, approves, edits, or redirects before the agent continues. Over time you remove breakpoints as confidence grows. The harness must support this graduated autonomy natively.
What are micro agents and why do they matter?
Micro agents are small, deterministic, low-resource agent processes — potentially running in 20 MB of RAM instead of multi-gigabyte VMs. They execute fast, chain together, and are cheap to schedule. The harness orchestrates them; the individual agent stays minimal.
How does model selection fit into the harness?
Different agent tasks have different cost-performance profiles. Ticket enrichment can run on lightweight open-source models; PR review and code generation benefit from frontier models. The harness should let you assign models per agent or task type, track token usage across all of them, and make it easy to experiment and swap.
How do cloud-based agent sessions change the developer workflow?
When agent sessions run in cloud-based dev environments instead of locally, developers can run multiple agents in parallel without exhausting local resources. Each session gets its own full compute environment — database, app server, editor access — and produces a shareable URL. A teammate can resume where the agent left off using their preferred CLI tool, and the harness logs the entire multi-user interaction history. This also solves onboarding friction: no local env setup, no "it works on my machine" problems.
Can the harness help measure developer productivity with agents?
If agent work flows through a harness, the harness already has the data: session duration, number of interactions, commits per session, cost per task, which tools were used. Surfacing this as dashboards gives teams visibility into how agents are being used without relying on gameable proxy metrics. The harness makes measurement a byproduct of orchestration rather than a separate instrumentation effort.