
OpenAI Harness Engineering

OpenAI's Harness team shipped ~1M lines of code with zero lines manually written: ~1,500 PRs merged by 3-7 engineers using Codex for 6+ hour autonomous tasks.

Key takeaways

  • ~1M lines of code shipped with 0 manually written over 5 months
  • 3.5 PRs per engineer per day, with each engineer operating at 3-10x capacity
  • Agent-to-agent code review eliminates human review bottleneck

FAQ

What is OpenAI Harness Engineering?

OpenAI's internal engineering team that built a product with ~1M lines of code and zero manually written code. Codex agents run autonomously for 6+ hours per task, and agent-to-agent code review replaces human review.

How does Harness manage agent context?

Harness uses AGENTS.md as a table of contents pointing to a structured docs/ directory — not an encyclopedia file. This gives agents navigable, structured knowledge bases instead of monolithic context dumps.

What is the team's throughput?

Starting with 3 engineers growing to 7, the team merged ~1,500 PRs at 3.5 PRs per engineer per day. Each engineer operates at 3-10x capacity through autonomous agent delegation.

Executive Summary

OpenAI's Harness Engineering team represents the most extreme publicly documented case of agent-driven development. Over 5 months starting August 2025, a team of 3 engineers (growing to 7) shipped approximately 1 million lines of code with zero manually written code and ~1,500 merged PRs. Codex agents run autonomously for 6+ hours per task, and agent-to-agent code review eliminates the human review bottleneck entirely.

  • Company: OpenAI
  • Type: Internal methodology
  • Agent runtime: Codex
  • Public documentation: March 2026
  • Headquarters: San Francisco, CA

Product Overview

Harness is not a product — it's a methodology and team at OpenAI that treats "no manually-written code" as a core philosophy. Engineers act as orchestrators, delegating all implementation to Codex agents. The key innovation is not just using agents for coding, but building an entire engineering practice around the assumption that humans never write code directly.

Key Capabilities

  • Zero manual code: All code written by Codex agents, no exceptions
  • 6+ hour autonomy: Agents run for extended periods without human intervention
  • Agent-to-agent review: Code review performed by agents, not humans
  • Structured knowledge base: AGENTS.md as table of contents, docs/ directory as encyclopedia
  • UI verification: Chrome DevTools Protocol wired into the agent runtime
  • Observability access: LogQL and PromQL exposed directly to agents

Technical Architecture

Context Management

The Harness team's key insight on context management: AGENTS.md should be a table of contents, not an encyclopedia. Rather than stuffing all project knowledge into a single file, they maintain a structured docs/ directory with AGENTS.md serving as a navigable map.

AGENTS.md (table of contents)
    ↓
docs/
  ├── architecture.md
  ├── conventions.md
  ├── api-reference.md
  └── ...

This pattern gives agents structured, discoverable knowledge without overwhelming their context windows.
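As a hypothetical illustration of the pattern (the file names and wording below are assumptions, not Harness's actual documents), an AGENTS.md in this style is a short pointer file rather than an encyclopedia:

```
# AGENTS.md — table of contents

- Architecture overview: docs/architecture.md
- Coding conventions: docs/conventions.md
- API reference: docs/api-reference.md

Read only the documents relevant to your current task;
do not load the entire docs/ tree into context.
```

The agent reads this file first, then fetches only the docs it needs, keeping the context window small.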

Agent-to-Agent Code Review

Harness eliminates the human review bottleneck by having agents review each other's code. This is more radical than StrongDM's approach (which eliminates review entirely via behavioral validation) — Harness maintains the code review practice but removes humans from it.

UI Verification

The Chrome DevTools Protocol is wired directly into the agent runtime, allowing Codex agents to:

  • Render and inspect UI components
  • Verify visual correctness
  • Test interactive behavior
  • Debug rendering issues
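To make the wiring concrete, here is a minimal sketch of the shape of CDP traffic such a runtime might send. The method names (`Page.enable`, `Page.navigate`, `Page.captureScreenshot`) are real CDP methods; everything else — the helper, the URL, the command sequence — is an illustrative assumption, not Harness's implementation.

```python
import json

def cdp_message(msg_id, method, params=None):
    """Serialize a CDP command as the JSON frame sent over the
    browser's WebSocket debugging endpoint."""
    return json.dumps({"id": msg_id, "method": method, "params": params or {}})

# Hypothetical sequence: navigate to the app under test,
# then capture a screenshot for visual verification.
commands = [
    cdp_message(1, "Page.enable"),
    cdp_message(2, "Page.navigate", {"url": "http://localhost:3000"}),
    cdp_message(3, "Page.captureScreenshot", {"format": "png"}),
]

for frame in commands:
    print(frame)
```

An agent runtime would send these frames over a WebSocket to the browser and inspect the responses (for example, the base64-encoded screenshot) to verify visual correctness.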

Observability

Agents have direct access to production observability tools:

  • LogQL — Query logs in real time
  • PromQL — Query metrics and alerting data

This means agents can not only write code but verify its behavior in production.
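For concreteness, here is the kind of query an agent with this access could run. The label names (`service="checkout"`) are hypothetical; the query syntax is standard LogQL and PromQL.

```
# LogQL: recent error lines for a service
{service="checkout"} |= "error"

# PromQL: per-second request rate over the last 5 minutes
rate(http_requests_total{service="checkout"}[5m])
```

With queries like these, an agent that just shipped a change can check whether error logs or request rates moved afterward.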


Results

Key Metrics

  • Lines of code: ~1,000,000
  • Manually written code: 0
  • Time period: 5 months (Aug 2025 start)
  • PRs merged: ~1,500
  • Starting team size: 3 engineers
  • Final team size: 7 engineers
  • PRs per engineer per day: 3.5
  • Engineer multiplier: 3-10x per person
  • Agent autonomy per task: 6+ hours

Throughput Analysis

At 3.5 PRs per engineer per day with 7 engineers, the team sustains approximately 24.5 PRs per day — or roughly 500 PRs per month. Each engineer effectively operates at 3-10x capacity, meaning the 7-person team produces output equivalent to a 21-70 person team.
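The arithmetic above can be checked back-of-the-envelope, assuming ~21 working days per month:

```python
# Reproduce the throughput figures from the reported metrics.
engineers = 7
prs_per_engineer_per_day = 3.5
multiplier = (3, 10)  # reported 3-10x per-engineer capacity

prs_per_day = engineers * prs_per_engineer_per_day
prs_per_month = prs_per_day * 21  # assuming ~21 working days/month
team_equivalent = (engineers * multiplier[0], engineers * multiplier[1])

print(prs_per_day)      # 24.5
print(prs_per_month)    # 514.5, i.e. roughly 500 PRs per month
print(team_equivalent)  # (21, 70)
```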


Key Insights

1. AGENTS.md as Table of Contents

The most transferable insight: structure your knowledge base as a navigable directory, not a monolithic file. AGENTS.md points to relevant docs, and agents can drill into what they need.

2. No Manual Code as Philosophy

This isn't "use agents when convenient" — it's "never write code manually, period." This constraint forces the team to invest in agent infrastructure, context management, and workflow design.

3. Agent-to-Agent Review Works

By removing humans from code review, the team eliminates what is typically the biggest bottleneck in agent-assisted development. The quality bar is maintained through agent review rather than human inspection.

4. Extended Autonomy is Viable

6+ hour autonomous agent sessions demonstrate that modern agents can handle complex, multi-step tasks without human intervention. This is significantly longer than most reported agent session lengths.


Strengths

  • Unprecedented scale — ~1M LOC with zero manual code is the most extreme case documented
  • Proven throughput — 3.5 PRs/engineer/day sustained over months
  • Full autonomy — 6+ hour sessions, agent-to-agent review, no human bottlenecks
  • Transferable insights — AGENTS.md pattern, docs/ structure, observability access are universally applicable
  • Officially documented — Published by OpenAI with analysis by Martin Fowler
  • Dogfooding — OpenAI eating their own cooking with Codex validates the product

Cautions

  • OpenAI advantage — Team has privileged access to Codex capabilities and can directly influence product direction
  • New product context — Building greenfield is easier for agents than modifying legacy code
  • Small team — 3-7 engineers may not represent patterns that scale to larger organizations
  • Codex-specific — Architecture and workflow designed around Codex's specific capabilities
  • Survivorship bias — We see the successful project, not the failed attempts or rejected approaches

Competitive Positioning

vs. Other In-House Agents

  • Stripe Minions: Minions require human review; Harness uses agent-to-agent review
  • StrongDM Factory: StrongDM eliminates review entirely; Harness replaces human review with agent review
  • Ramp Inspect: Inspect augments human engineers; Harness replaces manual coding entirely

Unique Position

Harness represents the most aggressive position on the "agent autonomy spectrum":

  • Conservative: Agents write code, humans review (Stripe, Ramp)
  • Moderate: Agents write code, behavioral validation replaces review (StrongDM)
  • Radical: Agents write code, agents review code, humans orchestrate (Harness)

Bottom Line

OpenAI's Harness Engineering is a proof-of-concept for fully agent-driven software development. The "no manual code" philosophy, agent-to-agent review, and 6+ hour autonomy sessions represent the frontier of what's possible today.

Key metrics: ~1M LOC, 0 manual, ~1,500 PRs, 3.5 PRs/engineer/day, 3-10x multiplier.

Architecture pattern: AGENTS.md as TOC → structured docs/ → Codex for all implementation → agent-to-agent review → Chrome DevTools for UI verification → LogQL/PromQL for observability.

Recommended study for: Engineering leaders interested in the upper bound of agent-driven development. The AGENTS.md-as-TOC pattern is immediately applicable regardless of scale.

Not recommended for: Teams expecting to replicate this without OpenAI-level access to frontier models and infrastructure.

Outlook: If Harness-style development becomes viable outside OpenAI, the economics of software engineering change fundamentally. The constraint is model capability — as frontier models improve, this approach becomes more accessible.


Research by Ry Walker Research

Disclosure: Author is CEO of Tembo, which offers agent orchestration as an alternative to building in-house.