By Ry Walker

Context Engineering Is the Hard Problem

Key takeaways

  • Agent infrastructure stabilizes after the first week. Context engineering never stops requiring work.
  • Most coding agent failures are not model failures. They are environment failures — missing submodules, no LSP, no running app, no map of the codebase.
  • The harness — orchestration, memory, tool integrations, controllability — is the product. The prompt is cheap.
  • Workflow-scoped learning is tractable. General-purpose agent memory is still an unsolved research problem.
  • Per-user agent instances beat shared agents. People work differently, and a shared agent with a single configuration is the worst possible consultant.
  • The vendors that win will look more like Palantir than like SaaS — forward-deployed engineers doing context engineering alongside the customer.

FAQ

What is context engineering?

Context engineering is the work of supplying an AI agent with the structural, navigational, and operational information it needs to act correctly inside a real codebase or organization. It covers everything from agents.md files and LSP integration to feedback loops that turn every failed PR into institutional knowledge.

Why do coding agents fail on enterprise codebases?

Most failures are environment failures, not model failures. The agent runs in a bare container without submodules, language servers, or a running app, so it guesses confidently with incomplete information. The result is plausible-looking PRs that are obviously wrong to anyone who knows the codebase.

Why does the Palantir-style forward-deployed model fit AI agents?

Every customer's context is unique and changes weekly, so a generic self-serve agent ships shelfware. Forward-deployed engineers embed with the customer to do the context engineering, train the agent, and lock in policies. That hands-on loop is currently the only model that reliably delivers value for enterprise agent deployments.

There is a conversation happening inside every engineering organization deploying AI agents. It goes like this: the infrastructure was easier than we expected, and keeping the agent useful is a full-time job.

That sentence is the whole post. The plumbing works. Dev containers, CI/CD hooks, sandboxes, Slack integrations, MCP servers — all of it stabilizes relatively quickly. You wire it up, it rarely breaks, and you move on. The context that makes an agent actually effective never stops requiring work. Nobody is talking about this honestly, because "context engineering" does not demo as well as "autonomous AI software engineer."

I have spent the last few months talking to teams running real agents — startups, mid-stage companies, and engineering orgs at Block and Stripe. The pattern is identical. Models keep improving. Harnesses get more capable. The bottleneck is the same one it was a year ago: the agent does not understand the codebase, does not understand the organization, and cannot be trusted to act without a human re-supplying context every time.

The Demo Worked. The Codebase Did Not Cooperate.

A scene playing out at every engineering org operationalizing agents. A developer creates a ticket. The agent picks it up. The task is straightforward — swap one field for another in an existing type. The agent produces a PR. The PR is wrong in a way no human on the team would ever get wrong.

The agent created a brand new type, bolted the field onto it, and completely ignored the fact that the correct type already lived in a submodule two directories over. It did not check. It did not know to check. It had no idea the submodule was there.

This is not a model intelligence problem. This is a context problem. It is the single biggest reason enterprise agent deployments stall after the first week of excitement.

When your developers work locally, they have the full monorepo cloned, submodules included. They have a language server catching type errors as they type. They have a watcher streaming linter output. They have years of institutional knowledge about where things live. When an agent picks up a ticket, it gets a repo clone in a container. Maybe the submodules came along. Maybe they did not. It has no LSP. It has no watcher. It has no idea your types live in a shared submodule six other projects depend on. So it does what any reasonable actor does with incomplete information: it guesses. Confidently, because that is what language models do.

The result is a PR that looks plausible to someone who has never seen the codebase, and obviously wrong to anyone who has. That gap between demo and deployment is made entirely of context.

Context Has Three Layers, and Most Agents Have Zero

The coding-agent conversation has been dominated by model capabilities. Which model writes better code? Which CLI is faster? Real questions, wrong bottleneck.

The bottleneck is context, and it splits into three layers your developers take for granted.

Structural context. Does the agent know what your repo looks like? Not just the top-level directory, but the submodules, shared libraries, type definitions three levels deep. A well-maintained agents.md — a table of contents for your codebase — is worth more than a model upgrade. So is a repo-level config telling the agent which submodules to clone for which kinds of tasks.
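To make that concrete, here is a minimal sketch of what a declarative, repo-level config could look like, written as plain Python. Everything in it is hypothetical: the ContextConfig name, the task types, and the submodule paths are illustrative, not a real schema. The point is the shape: task type in, required context out.

```python
from dataclasses import dataclass, field

@dataclass
class ContextConfig:
    """Hypothetical repo-level config: what context the agent loads,
    keyed by task type. All names and paths here are illustrative."""
    entry_doc: str = "agents.md"  # the table of contents for the codebase
    submodules: dict[str, list[str]] = field(default_factory=lambda: {
        "api-change": ["shared/types", "shared/protobuf"],
        "ui-change":  ["shared/types", "design/tokens"],
        "infra":      ["deploy/terraform-modules"],
    })

    def context_for(self, task_type: str) -> list[str]:
        """Resolve a ticket's task type to the context worth loading."""
        return [self.entry_doc] + self.submodules.get(task_type, [])

config = ContextConfig()
print(config.context_for("api-change"))
# ['agents.md', 'shared/types', 'shared/protobuf']
```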

Navigational context. When a developer encounters an unfamiliar type, they jump to definition. Agents, by default, use grep. Grep finds strings. LSP finds meaning. The difference between an agent that greps for "paymentId" and one that resolves the actual type definition is the difference between an intern who reads the code and an intern who writes fan fiction about the code.
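The gap is easy to see in code. In the small sketch below, the grep call is the real fallback agents use today, and the second function frames the JSON-RPC textDocument/definition request an LSP client sends: the message that resolves a symbol at a cursor position to its defining location, rather than to every matching string.

```python
import json
import subprocess

def grep_for_symbol(symbol: str, repo: str) -> list[str]:
    """String search: finds every occurrence of the text, understands none."""
    result = subprocess.run(["grep", "-rn", symbol, repo],
                            capture_output=True, text=True)
    return result.stdout.splitlines()

def lsp_definition_request(file_uri: str, line: int, character: int) -> bytes:
    """Frame the JSON-RPC request an LSP client sends to resolve the symbol
    under a cursor position to its actual definition."""
    body = json.dumps({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "textDocument/definition",
        "params": {
            "textDocument": {"uri": file_uri},
            "position": {"line": line, "character": character},
        },
    })
    payload = body.encode("utf-8")
    header = f"Content-Length: {len(payload)}\r\n\r\n".encode("ascii")
    return header + payload
```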

Operational context. Your developers run the app while they code. They see TypeScript errors in real time. They watch tests fail and fix them before committing. An agent in a bare container with no app server, no watcher, and no test runner is coding blind. It is writing code it has never executed. Then it submits the PR and you are the test runner.
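Closing that loop does not require anything exotic. Here is a sketch, assuming a hypothetical harness that runs the same checks a developer's watcher would inside the agent's workspace and hands the failures back before anything becomes a PR. The specific commands are examples, not a prescribed set.

```python
import subprocess

def run_checks(workspace: str, commands: list[list[str]]) -> list[str]:
    """Run the checks a developer gets for free (type checker, linter,
    tests) inside the agent's workspace and collect every failure."""
    failures = []
    for cmd in commands:
        result = subprocess.run(cmd, cwd=workspace,
                                capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(f"$ {' '.join(cmd)}\n{result.stdout}{result.stderr}")
    return failures

# Hypothetical usage: failures go back into the agent's context,
# not into the reviewer's lap.
checks = [["npx", "tsc", "--noEmit"], ["npx", "eslint", "."], ["pytest", "-q"]]
# for failure in run_checks("/workspace/repo", checks):
#     agent_context.append(failure)   # illustrative, not a real API
```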

None of these are model problems. All of them are environment problems — software engineering applied to the agent's own development setup. Unsexy work. Load-bearing work.

The Submodule Problem Is the Whole Problem in Miniature

Monorepos with submodules are a specific pain point, but they illustrate a universal truth: enterprise codebases are not simple. They have shared dependencies, cross-repo type systems, internal packages, custom build tooling, and a hundred undocumented conventions living in the team's collective memory. An agent that cannot handle submodules cannot handle enterprise software.

The fix is not to simplify your codebase for the agent. The fix is to build agent infrastructure that respects the actual complexity of how your team works. Clone submodules by default. Provide LSP and language-aware tooling inside the agent's environment. Run the application in the agent's workspace so it can see its own errors. Give teams explicit, declarative control over what context the agent loads, per repo, per task type.

Not glamorous. Not a new model architecture or a clever prompting technique. The work that separates agents that ship code from agents that ship demos.

The Maintenance Burden Splits in Two

Once you get past the codebase environment problem, a second problem shows up. The maintenance burden of running agents is not one thing. It splits into two categories with opposite cost curves.

Infrastructure maintenance decreases over time. Where the agent runs, how it connects to your existing surfaces, how it gets triggered — one-and-done. You wire up the execution environment, connect it to your issue tracker, and it holds.

Context maintenance never stops. Every team I have spoken with describes the same loop: review agent output, update instruction files, tune the deterministic checks, adjust what the agent fetches and when. One engineer described his entire workflow as pure context engineering — creating docs, building verification steps, ensuring the agent pulled the right check at the right time. Another described it as a continuous feedback loop of reviewing PRs and updating configuration files. Forever.

This is why context engineering is hard to productize. Infrastructure is generic; you ship it once. Context is specific to your codebase, your team, your domain, your customers — and it changes every week. Every organization's context is unique, which means every customer needs a different version of the product.

If you are evaluating agent platforms, the question is not whether the infrastructure works. It does. The question is who owns the context, and how it gets better over time.

The Harness Is Becoming Everything

The prompt is no longer the interesting part. The harness — orchestration, tool integrations, persistence, the rules governing agent behavior — is where all the complexity lives now.

Teams building agentic systems describe the same evolution. The first version was a structured LLM workflow: deterministic steps, clear inputs, clear outputs. It worked, but it was too rigid for what customers needed. The rewrite is fully agentic — the agent chooses tools, decides on follow-up actions, adapts to context. That autonomy requires a much more sophisticated harness to keep it from going off the rails.

Same dynamic in coding agents. The prompt is cheap. The harness — tool call management, persistence across sessions, multiplayer coordination, policy enforcement, observability — is the product. And the harness is where the context engineering burden concentrates.

Memory Is Unsolved, and That Is Not a Product Opportunity Yet

The natural response to the context maintenance problem: make the agent learn. When you correct it, it should remember. Every correction you type as a code review comment is information that could have been loaded into context from the start.

Obviously the right vision. Also, today, an unsolved research problem.

The closest thing shipping is Claude Code's memory feature, an append-only file the agent self-updates. Other teams have tried journals, learning logs, self-updating instruction files. All hit the same wall: autoregressive transformers degrade as context documents grow. Compaction is lossy. The relationship between what an agent remembers and what it needs for a specific task is extraordinarily hard to model.
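A toy version of the append-only approach shows where the wall is. Every name below is hypothetical, and summarize stands in for whatever model call does the summarizing. Appending is trivial; compaction necessarily discards information, and nothing in the sketch can know which dropped detail the next task will need.

```python
from pathlib import Path

MEMORY = Path("agent_memory.md")   # hypothetical self-updated memory file
CONTEXT_BUDGET = 8_000             # hypothetical size limit, in characters

def remember(note: str) -> None:
    """Append-only learning: every correction becomes one more line."""
    with MEMORY.open("a") as f:
        f.write(f"- {note}\n")

def compact(summarize) -> None:
    """Lossy compaction: once the file outgrows the budget, replace it
    with a summary. Whatever the summary drops is gone for good."""
    text = MEMORY.read_text()
    if len(text) > CONTEXT_BUDGET:
        MEMORY.write_text(summarize(text))   # summarize: e.g. an LLM call
```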

At the scale of an entire codebase, this is a genuine research problem. A question as simple as "what is the on-call schedule?" becomes nearly impossible to answer reliably when schedules are scattered across multiple team repos and the answer depends on who is asking and when.

Narrow to a single workflow, though, and it gets tractable. An agent that learns to run one guardrail check better over time is much more solvable than an agent that learns an entire codebase. The real opportunity is not general-purpose memory, but workflow-scoped learning that compounds. Pick the constraint, and the memory problem stops being intractable.

Each Person Needs Their Own Agent

Shared agents with a single configuration are fundamentally broken. The analogy is precise — a shared agent is the worst possible consultant, one who treats every client the same, does not listen, does not read the room, and is perfectly predictable in the worst way.

Each user gets their own agent instance. Not a shared bot with a single prompt, but a personal copy that learns through interactions with its user. A salesperson who wants aggressive double-tap follow-ups trains their agent to do that. A colleague who prefers a slower cadence trains theirs differently. The agent starts as an infant or a fresh college grad — book-smart but untrained on how this particular human works.

Seven out of ten people might coach their agent to get better. Three might coach it to get worse. The math shakes out — but only if there is an organizational layer watching the patterns. A manager or oversight agent should see discrepancies across instances, identify what consistently works, and lock in policies individuals cannot override. Individuals get autonomy within bounds. The organization gets convergence on what matters.

The hard part, again, is the learning mechanism. When a user spars with their agent, that conversation has to compact into something actionable. The agent cannot just append every interaction to an ever-growing context file. It needs to extract the behavioral change and persist it. Same compaction problem as before, applied to a human-AI working relationship.
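One way to picture the mechanism, as a sketch with hypothetical names throughout: each correction is distilled into a single durable rule per user, and the oversight layer's locked policies win wherever they conflict. The extract_rule argument stands in for whatever model call does the distillation.

```python
class PolicyStore:
    """Hypothetical per-user policy store with an organizational override."""

    def __init__(self):
        self.org_locked: dict[str, str] = {}           # set by the oversight layer
        self.personal: dict[str, dict[str, str]] = {}  # learned per user

    def learn(self, user: str, correction: str, extract_rule) -> None:
        """Compact a sparring exchange into one durable behavioral rule
        instead of appending the whole conversation to context."""
        key, rule = extract_rule(correction)  # e.g. an LLM call returning
                                              # ("follow_up", "two touches in 48h")
        self.personal.setdefault(user, {})[key] = rule

    def effective(self, user: str) -> dict[str, str]:
        """Autonomy within bounds: personal rules apply only where the
        organization has not locked a policy."""
        return {**self.personal.get(user, {}), **self.org_locked}
```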

Controllability Is Not Optional

There is a temptation in agent tooling to make everything magical. The agent figures out what it needs. The agent discovers the codebase. The agent learns as it goes.

Enterprise teams do not want magic. They want control.

They want to specify which submodules get loaded for which task types. They want to define the skills and tools available. They want to see what context the agent used and override it when the agent is wrong. They want the agent's long-term memory — whatever form it takes — to be something they can inspect, edit, and version control.

Not because enterprise teams are conservative. Because they have been burned by black boxes before, and they know any tool they cannot control will eventually produce a mess they have to clean up.

The pattern that works: context in, background execution, reviewable output, human approval. The human sets up the context. The agent does the work. The human reviews. The context improves over time, because every correction becomes institutional knowledge the agent carries forward.
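The loop is small enough to write down. In this sketch every name is a stand-in: execute is the background agent run, review is the human, and the correction flowing back into context is what makes next week's run better than this one.

```python
def agent_cycle(task, context: list[str], execute, review):
    """Context in, background execution, reviewable output, human approval.
    execute and review are hypothetical stand-ins for real subsystems."""
    output = execute(task, context)        # agent works unattended
    approved, correction = review(output)  # human reviews the diff
    if correction:
        context.append(correction)         # the correction becomes
                                           # institutional knowledge
    return output if approved else None
```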

The Build-Versus-Buy Tension Is Real

The uncomfortable truth for anyone building agent platforms: an engineer can ship a working agent harness in a week and a half. One CTO I spoke with tried to buy a solution. His engineer said "code is cheap, I'll just build it." He had a working system in days.

The barrier to building is genuinely lower than it has ever been, and engineers building agent tooling are, in some cases, building the thing that justifies their own role's evolution.

So what does a paid product offer that a week of engineering time does not? Two things.

First, scaling. An engineer can build a single-workflow agent in a week. When that team wants ten agents across ten workflows, the context maintenance, orchestration, and observability overhead compounds fast. The solo build breaks down at scale.

Second, the non-builder persona. Not everyone who needs an agent can build one. The CTO who vibe-coded a CRM wants an agent that qualifies sales leads, plugged into Slack and email and deal context. That is a product problem, not an engineering problem — and one the build-it-yourself crowd cannot solve for their non-technical colleagues.

The Forward-Deployed Model Is the Only One That Actually Works

The Palantir model — embedding engineers inside customer organizations to make the product work — keeps coming up. It makes sense. If every agent needs to be personalized, if context engineering is the hard problem, and if the learning loop requires hands-on iteration, then the vendor who shows up and does the work alongside the customer has a structural advantage.

The pitch is not "here is our agent, configure it yourself." It is "we will embed with your team, give you an infant agent, and help you raise it." The forward-deployed engineer is implementer and trainer — showing the customer how to modify the agent, how to monitor it, when to lock in a policy versus when to let individuals experiment.

Expensive. It requires a price point that justifies the human cost, or subsidized services in the early phase to build the product knowledge that eventually makes a self-serve version work. For the initial wave of enterprise agent deployments, it may be the only model that actually delivers value. The alternative — shipping a generic agent and hoping the customer figures out context engineering on their own — is how you get shelfware.

The Map Is Your Job

Every quarter, a new model writes marginally better code on benchmarks. And every quarter, enterprise teams are stuck on the same problems: the agent does not know where the types live, it cannot run the test suite, it ignores the conventions in the README, and nobody knows which agent told a customer the next event was in London.

The teams getting actual value are not the ones with the best model; they are the ones that invested in context engineering. They built the agents.md files. They configured the LSP. They set up the VM environments so the agent can run the app. They created the feedback loops so every failed PR makes the next one better. They picked workflows narrow enough that learning compounds. They gave each user their own instance.

The hard part is not the AI. The hard part is the engineering around the AI — the infrastructure, the context, the controllability, the integration with the systems your team already uses. Agents are software. Like all software, they are only as good as the environment they run in.

The codebase is the territory. The agent needs a map. Your job is to draw it.

— Ry