Why does it matter whether review is a feature or a primitive?

A feature ships once and breaks the next time you expand to a new use case. A primitive composes — you add a new artifact type, register the verification surface, and your existing review infrastructure handles it.

What artifact types should the primitive support?

Code diffs, documents, structured records, spreadsheets, visual changes, and infrastructure changes at minimum. Each has a different verification surface. The contract should be artifact-agnostic.

How should review maturity work for new agents?

New agents — baby agents — should default to full human supervision at every step. Think of it like breakpoints in code. As trust is earned through repeated successful reviews, you remove checkpoints until the agent can run on cruise control. The user should be able to set the maturity level explicitly.

Why is a full environment better than a diff for reviewing agent work?

A diff shows you what changed. A full environment shows you what the change actually does. You can click around the app, run the terminal, verify behavior that no static diff could surface. This is especially important when non-developers — product managers, designers — need to review agent output. They know how to click around an app. They do not know how to read a diff.

How does review infrastructure connect to developer productivity measurement?

When agent sessions flow through a review primitive, every interaction gets logged — prompts, lines generated, cost, touches, commits, and outcomes. This audit trail is exactly the data enterprises need to measure developer output. The review layer becomes the measurement layer. Without it, companies end up with incomplete dashboards that show cost but not value.

Review Is Not a Screen. It Is a Primitive

Here is where most teams go wrong. They think about review as a product surface — a page in the UI where someone looks at a diff and clicks approve. That is the dessert. The vegetables are understanding what review actually requires at the infrastructure level.

An agent working on code produces diffs. An agent writing a marketing brief produces a document. An agent updating a spreadsheet produces numerical output. An agent modifying a website produces a visual change. These are all review artifacts, and they are fundamentally different in how they need to be presented, verified, and approved.

Build a review system that only handles code diffs and you have built a feature, not a primitive. Build a review system that accepts an artifact type as input and returns the appropriate verification surface and you have built something composable. Something that scales across use cases without requiring a new product for each one.

This is the API-level thinking that matters. What is the request? What is the response? What is the contract between the agent that produces work and the human who verifies it? Get that right and the UI becomes a composition exercise. Get it wrong and you rebuild from scratch every time you expand to a new use case.
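One way to picture that contract is a small registry that maps an artifact type to a builder for its verification surface. This is only a sketch under stated assumptions: the names `ReviewRequest`, `VerificationSurface`, `register_surface`, and `request_review` are hypothetical, and the surface kinds are placeholders rather than a real API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ReviewRequest:
    artifact_type: str   # e.g. "code_diff", "document", "spreadsheet"
    payload: str         # opaque reference to the agent's output


@dataclass(frozen=True)
class VerificationSurface:
    kind: str            # how the reviewer inspects the work
    target: str          # what the surface points at


# Hypothetical registry: artifact type -> function that builds a surface.
_SURFACES: Dict[str, Callable[[ReviewRequest], VerificationSurface]] = {}


def register_surface(artifact_type: str):
    """Decorator that registers a surface builder for one artifact type."""
    def decorator(builder):
        _SURFACES[artifact_type] = builder
        return builder
    return decorator


@register_surface("code_diff")
def _code_surface(req: ReviewRequest) -> VerificationSurface:
    # Assumed mapping: code changes are reviewed in a running environment.
    return VerificationSurface(kind="running_environment", target=req.payload)


@register_surface("document")
def _document_surface(req: ReviewRequest) -> VerificationSurface:
    return VerificationSurface(kind="rendered_document", target=req.payload)


def request_review(req: ReviewRequest) -> VerificationSurface:
    """The contract: given an artifact type, return its verification surface."""
    builder = _SURFACES.get(req.artifact_type)
    if builder is None:
        raise ValueError(f"no verification surface registered for {req.artifact_type!r}")
    return builder(req)
```

Supporting a new artifact type then means registering one more builder. Nothing in `request_review` has to change, which is the composability the argument is after.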

The Diff Is Not Enough

The most interesting evolution in review infrastructure is the realization that a diff is the wrong artifact for most reviewers. A diff tells you what lines changed. It does not tell you what the change actually does.

The better primitive is a full environment. After an agent completes a coding task, instead of handing the reviewer a pull request and a log, you hand them a snapshotted computer — a cloud-based dev environment with the PR already applied, the app running, the terminal accessible. The reviewer can click around the application, verify behavior, run commands, inspect anything they want. It is the difference between reading a recipe and tasting the food.

This is not hypothetical. The infrastructure for this exists now. When an agent finishes a task, it produces a PR with a link to a shareable computer — a full VM spun up alongside the session with a desktop view, VS Code, terminal access, the whole application running. The reviewer does not pull a branch down to their local machine. They click a link and get a copy of the computer that was used to make the PR. They can resume it in seconds, fire up whatever CLI tool they prefer — even if the original work was done in a different one — and verify the actual behavior. Every interaction gets logged into an audit trail of the entire multi-user interaction with that computer, that PR, that task.

This matters enormously for multiplayer review. A developer reviews code by reading diffs. A product manager reviews code by clicking around the app. A QA engineer reviews code by running edge cases. All three of those people need to verify the same agent's output, but they need fundamentally different verification surfaces. A snapshotted environment serves all of them. A diff only serves one.

The product manager case is particularly important. Most PMs have a machine that cannot run the full product locally. They know how to click around an app. They do not know how to read a diff. When every PR from an agent comes with a shareable cloud computer, the PM gets access to dev without the pain of maintaining a dev environment. Review stops being gated on technical capability.

There is a practical infrastructure reason this matters too. Enterprise codebases are large. Some require 96 or 128 gigabytes of RAM just to run the full product locally, and most developer laptops cannot do that. When every PR from an agent comes with a suspended cloud computer provisioned with the right resources, environment variables, and dependencies, review stops being gated on local machine capability. The environment is always correct because the whole team shares it rather than each person maintaining their own local setup.

This also solves a scaling problem that hits as soon as developers run multiple agent sessions in parallel. The highest performers are already running four concurrent coding sessions using git worktrees. Run four agents at once on a local machine and you run out of memory. Move those sessions to cloud-based computers and the constraint disappears. Each session gets its own isolated environment with the resources it needs, and each one produces a reviewable, shareable artifact when it completes.

This is where the primitive abstraction pays off. The review contract says: given an artifact type, return the appropriate verification surface. For code changes, the appropriate verification surface is not a diff view. It is a running environment where the reviewer can interact with the actual result.

The Identity Problem in Review

There is a subtle but important problem that surfaces when agents post reviews on behalf of humans. When an automated reviewer leaves a comment on a merge request, it matters whose name appears on that comment. If the agent posts as the user who configured it rather than as a distinct bot identity, the review surface becomes misleading. Engineers see their VP's name on every code review comment and think they are being micromanaged. The signal that this was automated analysis gets lost in the noise of misattributed identity.

This is a review primitive problem, not a cosmetic one. The verification surface needs to clearly distinguish between human-authored review and agent-authored review. When a human approves a PR, that carries the weight of a human decision. When an agent flags unused imports, that carries the weight of automated analysis. Collapsing those two into the same identity undermines both.

The fix is straightforward at the infrastructure level — dedicated bot accounts, service tokens, clear attribution. But it reveals something deeper about the review contract. The primitive does not just need to know the artifact type and the verification surface. It needs to know the reviewer identity and whether that identity is human or agent. That metadata determines how much weight the review carries and how it should be presented to the person making the final decision.
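A minimal sketch of what carrying reviewer identity in the review record might look like. The types and the approval policy here are assumptions made for illustration; the point is only that the human-or-agent distinction lives in the data, so the surface can label and weight each comment.

```python
from dataclasses import dataclass
from enum import Enum


class ReviewerKind(Enum):
    HUMAN = "human"
    AGENT = "agent"


@dataclass(frozen=True)
class Reviewer:
    display_name: str
    kind: ReviewerKind


@dataclass(frozen=True)
class ReviewComment:
    reviewer: Reviewer
    body: str
    approves: bool = False


def render(comment: ReviewComment) -> str:
    """Label every comment so automated analysis is never mistaken for a human's."""
    label = "[automated]" if comment.reviewer.kind is ReviewerKind.AGENT else "[human]"
    return f"{label} {comment.reviewer.display_name}: {comment.body}"


def counts_as_approval(comment: ReviewComment) -> bool:
    """Assumed policy: only a human approval carries the weight of a decision."""
    return comment.approves and comment.reviewer.kind is ReviewerKind.HUMAN
```

Under this policy an agent's approval is still recorded and shown, but it does not satisfy the approval requirement on its own.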

When CI and Review Agents Overlap

A related problem emerges when you have both continuous integration pipelines and review agents running against the same pull request. The CI pipeline runs tests, catches failures, reports them. The review agent analyzes the code, catches the same failures, reports them again. The human reviewer now sees the same problem flagged twice through two different channels — once through the CI system and once through the agent's review comment.

This is a coordination problem that the review primitive needs to solve. If the CI pipeline already caught a test failure and reported it, the review agent should either know about that result and skip redundant analysis, or the review surface should deduplicate the findings. Running the same check twice and presenting both results is not thoroughness. It is noise.

The better architecture is sequential — let CI run first, feed its results to the review agent as context, and let the agent focus on the things CI cannot catch. CI tells you the tests failed. The review agent tells you why the approach is wrong. Those are complementary signals, not redundant ones. But only if the review primitive is designed to compose them rather than run them in parallel and dump both outputs on the reviewer.
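The deduplication half of that design can be sketched in a few lines. This assumes findings from both channels are normalized into a common `Finding` record, and that two findings count as duplicates when they share a location and a category. Both are assumptions of the sketch, not a description of any particular CI system.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Finding:
    source: str      # "ci" or "review_agent"
    location: str    # e.g. a test name or file path
    category: str    # e.g. "test_failure", "design"
    message: str


def merge_findings(ci: Iterable[Finding], agent: Iterable[Finding]) -> List[Finding]:
    """Keep every CI finding; drop agent findings that repeat one of them.

    CI results come first because CI ran first. An agent finding survives
    only if no earlier finding has the same (location, category) pair.
    """
    merged = list(ci)
    seen = {(f.location, f.category) for f in merged}
    for finding in agent:
        key = (finding.location, finding.category)
        if key not in seen:
            seen.add(key)
            merged.append(finding)
    return merged
```

In the sequential version, the same `ci` list would also be passed to the review agent as context, so it can skip the redundant analysis instead of having its output filtered afterwards.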

New Agents Deserve Breakpoints

There is a maturity dimension to review that almost nobody talks about. A brand new agent — a baby agent that just got born — requires full supervision. In the same way a debugger has breakpoints where execution stops for inspection, new agents deserve breakpoints at every step of their workflow.

The default deployment should require the human to check everything the agent does manually. Not because the agent is stupid, but because trust is earned, not assumed. You should not try to deploy a buttoned-up agent on day zero because it is not buttoned up yet. It starts broken.

This means the review primitive needs to understand maturity. A step that has been approved successfully fifty times in a row is a candidate for removing the breakpoint. A step that was just added yesterday gets full supervision. The user should be able to set this explicitly — lock a step to always require review, or mark it as trusted and let it run.
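As a rough sketch, per-step maturity can be tracked as an approval streak plus an explicit user override. The class and field names are hypothetical, and the threshold of 50 is borrowed from the example above rather than being a recommended value. A real system might only suggest removing the breakpoint instead of removing it automatically.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Override(Enum):
    ALWAYS_REVIEW = "always_review"   # user locked the step to require review
    TRUSTED = "trusted"               # user marked the step as trusted


@dataclass
class StepMaturity:
    consecutive_approvals: int = 0
    override: Optional[Override] = None

    def record(self, approved: bool) -> None:
        """An approval extends the streak; a rejection resets it to zero."""
        if approved:
            self.consecutive_approvals += 1
        else:
            self.consecutive_approvals = 0

    def needs_breakpoint(self, threshold: int = 50) -> bool:
        """Explicit user settings win; otherwise the streak decides."""
        if self.override is Override.ALWAYS_REVIEW:
            return True
        if self.override is Override.TRUSTED:
            return False
        return self.consecutive_approvals < threshold
```

A newly added step starts with a streak of zero, so it gets full supervision by default, and a single rejection puts an established step back under review.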

The practical version of this looks like a lead engineer triaging incoming tickets and making a binary decision: is this something I can hand entirely to an agent, or does it need human coding? The tickets that go to the agent still produce PRs that require review. But the review burden is different — you are reviewing output, not guiding implementation. Over time, as the agent earns trust on particular types of changes, even the review step can be relaxed. But you start with full breakpoints and earn your way to cruise control.

Eventually a trusted agent can run on full cruise control. But the path to autonomy runs through review, not around it.

Review as a Feedback Channel

Review is not just approve or reject. The most interesting thing that happens during review is when the human wants to change the agent itself. You are reviewing a draft tweet and the agent used one of those LLM-isms — "the thing nobody is talking about" — and you do not just want to reject this tweet. You want to tell the agent to stop saying that forever.

This means the review surface needs to be a feedback channel back to the agent's definition. Not just thumbs up or thumbs down, but the ability to prompt the agent from the same interface where you review its work. The user should be able to say "stop doing this" and have the system actually change the agent's behavior, not just regenerate the current output.

There are two kinds of updates that flow from review feedback. The first is updating the agent's context — adding new instructions, examples, or constraints that change how it operates within its current architecture. The second is updating the agent's code — when the feedback requires structural changes that the agent cannot handle through context alone. Both should be possible from the review interface without the user having to understand which kind of change is needed.
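The two update paths can be sketched as a single entry point that routes feedback for the reviewer. Everything here is hypothetical, including the `structural` flag. How the system decides that a piece of feedback needs a code change is the hard part, and this sketch simply assumes that decision has been made upstream.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentDefinition:
    instructions: List[str] = field(default_factory=list)          # context the agent reads
    pending_code_changes: List[str] = field(default_factory=list)  # queued structural work


@dataclass(frozen=True)
class ReviewFeedback:
    text: str
    structural: bool = False  # assumed to be set by the system, not by the reviewer


def apply_feedback(agent: AgentDefinition, feedback: ReviewFeedback) -> str:
    """Route feedback to the right kind of update and report which one was used."""
    if feedback.structural:
        # Context alone cannot fix this, so queue a change to the agent's code.
        agent.pending_code_changes.append(feedback.text)
        return "code"
    # Otherwise the feedback becomes a standing instruction, stored once.
    if feedback.text not in agent.instructions:
        agent.instructions.append(feedback.text)
    return "context"
```

The reviewer only ever calls `apply_feedback`. Which list the feedback lands in is the system's concern, which matches the requirement that the user should not need to know which kind of change is required.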

A good example is when a review agent approves a PR that contains an obvious bug introduced by a different coding agent. The review surface should make it trivial to not just reject the PR but to feed that failure back into both agents — the one that wrote the buggy code and the one that failed to catch it. The review primitive becomes the correction mechanism for the entire agent pipeline, not just a gate on individual outputs.

The teams that build review as a one-way approval gate will find their agents plateau. The teams that build review as a bidirectional feedback loop will find their agents actually improve over time. It is the same pattern that shows up in supervising hundreds of agents: the review surface determines how far oversight scales.

Review Data Is the Measurement Layer

There is a problem that every enterprise is hitting right now and almost none of them have solved: how do you measure developer productivity in an agentic world? Companies are telling their engineering organizations to 4x output, but they do not actually have a way to measure the output. Individual engineering managers are scrambling to figure out how to quantify what 4x even means. Adequate tools for this do not exist yet.

The dashboards that do exist are crude. They track number of prompts, lines generated, tab completions, and cost, aggregated across whatever tools the developer happens to use. But different tools report different metrics. A developer using only a CLI tool shows up as zero prompts, zero lines, zero completions, and two thousand dollars in cost. The dashboard makes them look inactive when they might be the highest performer on the team.

This is where the review primitive becomes the measurement primitive. When all agent work flows through a review layer, every session gets logged. You know how many tasks were completed, how many commits were made, how many review cycles each PR required before approval, how much compute was consumed, and — critically — whether the merged code caused problems downstream. Lines of code and PR counts are gameable. Outcomes flowing through a review pipeline are much harder to fake.

The controversial version of this is a leaderboard — how does my AI usage compare to my teammates? The less controversial version is the same data used for self-benchmarking. Either way, the data only exists if the work flows through infrastructure that captures it. If developers are running agents locally with no telemetry, you get nothing. If they are running agents through a review primitive that logs every interaction, you get a full audit trail that answers the questions enterprises are desperately asking.

Every line of code arguably deserves some points. Code that gets merged and stays stable earns more. Code that breaks immediately after merge earns negative points. The review primitive is the natural place to compute this because it already sits at the junction between agent output and human judgment. It knows what was produced, who reviewed it, how many iterations it took, and what happened after it shipped.
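To make the scoring idea concrete, here is a toy version. The weights are invented for illustration and the `ChangeOutcome` record is hypothetical. The only properties taken from the argument above are the ordering (stable merged code scores highest) and the sign (code that breaks after merge scores negative).

```python
from dataclasses import dataclass

# Illustrative weights only; real values would need calibration.
POINTS_PER_LINE = 1
STABLE_MERGE_BONUS = 50
BREAKAGE_PENALTY = 100


@dataclass(frozen=True)
class ChangeOutcome:
    lines_changed: int
    merged: bool
    stable_after_merge: bool


def score(outcome: ChangeOutcome) -> int:
    """Score one change from data the review layer already holds."""
    base = POINTS_PER_LINE * outcome.lines_changed
    if not outcome.merged:
        return base                        # every line earns some points
    if outcome.stable_after_merge:
        return base + STABLE_MERGE_BONUS   # merged and stable earns more
    return -(base + BREAKAGE_PENALTY)      # broke after merge: always negative
```

Because the penalty grows with the size of the change, padding a PR with extra lines makes a breakage cost more, which is one small guard against gaming the line count.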

The teams that treat review as just an approval gate will keep building separate dashboards to measure productivity. The teams that treat review as infrastructure will realize the measurement data was always there — they just need to surface it.

The Universal Inbox Problem

Once you have multiple agents producing work that requires human review, you have an inbox problem. The agent that drafts tweets needs approval. The agent that triages email needs confirmation. The agent that generates reports needs sign-off. Each of these is a different artifact type flowing into the same human's attention.

If each agent has its own review interface, the human is now managing ten different tools — which is exactly the problem agents were supposed to solve. The review primitive needs to compose into a singular surface where all pending human decisions queue up, regardless of which agent produced them or what artifact type they contain.

This is the abstraction that matters. Not a review screen per agent, but a review layer across all agents. Priority based on context — a message from a high-value prospect should surface above a routine content approval. Enrichment from connected systems so the human has the information they need to make a decision without leaving the review surface.

The real-world version of this is a consolidated view that pulls pending reviews across every source control system, every project management tool, every agent — and presents them in one place. When an engineering leader has repos in GitLab, Bitbucket, and GitHub simultaneously, the review primitive cannot be scoped to a single source control provider. It needs to compose across all of them, showing merge requests from GitLab next to pull requests from GitHub next to Bitbucket PRs, all in the same queue. The alternative is three different review surfaces for the same human, which defeats the purpose.
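A minimal sketch of such a queue, assuming every source can be normalized into one `PendingReview` record with a numeric priority. The record shape and the class name are hypothetical, and how the priority is computed from context is deliberately left out.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PendingReview:
    source: str         # e.g. "GitLab", "GitHub", "Bitbucket", or an agent name
    artifact_type: str  # e.g. "code_diff", "document"
    title: str
    priority: int       # lower number surfaces first


class ReviewInbox:
    """One queue for every pending human decision, whatever produced it."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, PendingReview]] = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def add(self, item: PendingReview) -> None:
        heapq.heappush(self._heap, (item.priority, next(self._counter), item))

    def next_item(self) -> PendingReview:
        """Pop the most urgent item; raises IndexError if the inbox is empty."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

The queue never inspects `source` or `artifact_type` when ordering, so a GitLab merge request, a GitHub pull request, and a draft tweet all compete on priority alone.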

Review Determines What Ships

Review is the bottleneck. Not model quality. Not prompt engineering. Not integrations. The thing that separates an impressive demo from a production deployment is whether a human can look at what the agent produced, understand it, and approve it with confidence. And almost nobody is building review as a first-class primitive.

If you are building agent infrastructure, treat review as a layer in your architecture, not a screen in your app. Define the artifact types you support. Define the verification contract for each. Build in maturity tracking so new agents start supervised and earn autonomy. Make the review surface a feedback channel that improves the agent, not just a gate that approves its output. Compose those pieces into product surfaces later. The teams that get review right will be the ones whose agents make it past the demo.
