
In-House Coding Agents

Analysis of 12 in-house coding agents: how Stripe, Google, Meta, OpenAI, Uber, Spotify, Shopify, Coinbase, and other tech companies are building custom agents instead of buying off-the-shelf.

Key takeaways

  • Stripe Minions ship 1,000+ merged PRs per week — fully unattended; humans review the code but write none of it
  • Shopify open-sourced 'Roast' — a Ruby DSL for structured AI workflows; their philosophy: non-determinism is the enemy of reliability
  • Coinbase's "Cloudbot" hit 5% of merged PRs and cut PR cycle time from 150h to 15h — multi-model, Slack-native, with Skills and MCPs
  • Block open-sourced Goose, which Stripe forked for Minions — proving the value of shared infrastructure
  • Uber saved 21,000 developer hours with LangGraph-powered Validator and Autocover agents — IDE-embedded, hybrid LLM + deterministic
  • Abnormal AI hit 13% of merged PRs — among the highest reported percentages, serving devs, security analysts, and MLEs
  • Google's Agent Smith became so popular internally that access had to be restricted — reportedly >25% of new production code
  • Meta REA delivered 2x model accuracy and 5x engineering output — 3 engineers covering 8 models via autonomous multi-week ML workflows
  • OpenAI Harness shipped ~1M lines with 0 manually written code — agent-to-agent review eliminates human bottleneck, 3.5 PRs/engineer/day
  • StrongDM goes further: no human code, no human review — code treated as opaque weights validated by behavior

FAQ

Why do companies build their own coding agents instead of buying?

Unique codebases, proprietary frameworks, compliance requirements, and the need for tight integration with internal tools. Vendor lock-in concerns also drive build decisions.

What percentage of PRs come from coding agents at top companies?

Ramp Inspect now handles >50% of merged PRs, Abnormal AI reports 13%, Coinbase 5%, and Stripe ships 1,000+ merged PRs per week. OpenAI Harness shipped ~1M LOC with zero manually written code.

What's the ROI threshold for building in-house coding agents?

Generally requires 1,000+ engineers, dedicated devex teams, and codebases exceeding 10M LOC. Smaller teams should buy off-the-shelf tools like Tembo, Codex, or Claude Code.

What architecture do in-house coding agents share?

Slack invocation, isolated sandbox execution, CI/CD integration, and PR-ready output. Human review policies vary from required (Stripe) to eliminated (StrongDM).

How do companies validate code written by AI agents?

Two approaches: traditional CI + human review (conservative) or behavioral validation against 'digital twin' test environments without human review (radical).

What is the emerging metric for AI-native engineering teams?

Percentage of PRs merged from background agents. Leaders now report 13-50%+, with OpenAI's Harness team reaching 100% agent-written code — suggesting this will become a standard engineering productivity metric.

Executive Summary

A new pattern is emerging at elite engineering organizations: instead of adopting off-the-shelf coding agents (Codex, Claude Code, Cursor), companies like Stripe, Google, Meta, OpenAI, Spotify, Shopify, Coinbase, Uber, and Block are building custom in-house systems tailored to their unique codebases and workflows. With 12 confirmed implementations now documented, this is no longer experimental — it's an industry shift.

Key Findings:

  • Stripe Minions produce 1,000+ merged PRs per week with zero human-written code
  • Google's Agent Smith became so popular internally that access had to be restricted — reportedly >25% of new production code [1]
  • Meta REA delivered 2x model accuracy, 5x engineering output — 3 engineers covering 8 models autonomously [2]
  • OpenAI Harness shipped ~1M LOC with zero manually written code — agent-to-agent review, 3.5 PRs/engineer/day [3]
  • Ramp Inspect now handles >50% of merged PRs (up from 30%), with 80% of Inspect itself written by Inspect [4]
  • Spotify reports its best developers haven't written code since December — shipped 50+ features via the internal "Honk" system powered by Claude Code [5]
  • Shopify open-sourced "Roast" — a Ruby DSL for structured AI workflows with the insight: "non-determinism is the enemy of reliability"
  • Coinbase's "Cloudbot" hit 5% of all merged PRs and cut PR cycle time 10x — multi-model, Slack-native, with Skills and MCPs [6][7]
  • Block open-sourced Goose, which Stripe forked for Minions — proving shared agent infrastructure has value
  • Abnormal AI hit 13% of merged PRs from background agents [8]
  • StrongDM has eliminated human code review entirely, treating code as opaque weights validated by behavior
  • Common architecture: Slack invocation → isolated sandbox → CI loop → PR-ready output

Strategic Planning Assumptions:

  • By 2027, 30% of enterprises with 1000+ engineers will operate internal coding agent systems
  • By 2028, "Digital Twin" testing infrastructure (cloned third-party APIs) will be standard for agent validation
  • Human code review will become optional at organizations with mature validation infrastructure

Market Definition

In-house coding agents are custom-built systems that enable AI agents to write, test, and ship code within a company's specific codebase and toolchain — as opposed to off-the-shelf products like Codex, Claude Code, or Cursor.

Inclusion Criteria:

  • Publicly documented implementation details
  • Production deployment (not experimental)
  • Custom-built for company-specific constraints
  • Agent writes code end-to-end (not just autocomplete)

Exclusion Criteria:

  • Commercial products available for purchase
  • Internal tools without public documentation
  • Autocomplete/copilot-style assistance only

Comparison Matrix

| Vendor | Human Review | Agent Foundation | Invocation | Validation Approach | Documented Scale |
| --- | --- | --- | --- | --- | --- |
| Stripe Minions | Required | Goose fork | Slack, CLI, Web | CI + MCP tools | 1,000+ PRs/week |
| Google Agent Smith | Not disclosed | Antigravity | Internal chat, Mobile | Not disclosed | >25% of new code (unverified) |
| Meta REA | Not disclosed | Confucius | Autonomous | Three-phase planning | 2x accuracy, 5x output |
| OpenAI Harness | Agent-to-agent | Codex | Orchestration | DevTools + LogQL/PromQL | ~1M LOC, 0 manual |
| StrongDM Factory | Eliminated | Cursor YOLO | Spec-driven | Digital Twin Universe | "$1K/day/engineer" |
| Ramp Inspect | Required | OpenCode | Slack, Web, Chrome ext, Voice, Mobile | Sandbox + CI | >50% of merged PRs |
| Bitrise | Required | Custom (Go) | Internal eval | Programmatic checkpoints | Not disclosed |
| Shopify Roast | Required | Workflow DSL | CLI | Cog-based validation | Not disclosed |
| Coinbase | Context-dependent | Cloudbot (multi-model) | Slack, Linear | Skills + MCPs | 5% of merged PRs, 10x cycle time reduction |
| Abnormal AI | Not disclosed | Custom | Not disclosed | Not disclosed | 13% of merged PRs |
| Uber | Not disclosed | LangGraph | IDE, Workflow | Hybrid (LLM + deterministic) | 21,000 hours saved |
| Spotify Honk | Not disclosed | Claude Code | Slack | Not disclosed | 50+ features shipped |

Company Implementations

Stripe Minions

The most detailed public case study of in-house coding agents.[9]

Overview

Stripe's "Minions" are fully unattended coding agents that produce over 1,000 merged pull requests per week. While humans review the code, they write none of it — Minions handle everything from start to finish. Built on a fork of Block's Goose agent[10], Minions integrate deeply with Stripe's existing developer infrastructure.

Strengths

  • Proven scale: 1,000+ PRs merged weekly, validated in production
  • Deep integration: MCP server ("Toolshed") with 400+ internal tools
  • Fast iteration: Isolated devboxes spin up in 10 seconds
  • Leverage existing infra: Uses same environments as human engineers
  • Parallel execution: Engineers run multiple Minions simultaneously

Cautions

  • Requires massive codebase investment: Stripe has hundreds of millions of LOC and years of devex tooling
  • Human review bottleneck: Agents don't merge; humans must still review
  • Ruby/Sorbet specific: Some patterns may not transfer to other stacks
  • Dedicated team required: "Leverage team" maintains the system

Architecture Details

Invocation: Slack (primary), CLI, web, internal tool integrations

Execution: Pre-warmed "devboxes" isolated from production and internet

Context: Central MCP server with docs, tickets, build status, Sourcegraph search

CI Loop: Local lint (under 5 seconds) → at most 2 CI rounds → auto-apply fixes
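This lint-then-iterate loop can be sketched as a small control flow. The sketch below is illustrative, not Stripe's internal API — `lint`, `ci`, and `apply_fix` are hypothetical injected callables — written in Ruby to match the Roast example later in this report:

```ruby
MAX_CI_ROUNDS = 2

def run_ci_loop(change, lint:, ci:, apply_fix:)
  # Fast local lint gate first (Stripe targets under 5 seconds)
  return :lint_failed unless lint.call(change)

  MAX_CI_ROUNDS.times do
    result = ci.call(change)
    return :green if result[:passed]
    # Auto-apply suggested fixes before the next CI round
    change = apply_fix.call(change, result[:failures])
  end
  :needs_human # escalate to a human after the second failed round
end
```

Capping CI rounds keeps a misbehaving agent from burning compute indefinitely; after two failed rounds the change is handed back rather than retried forever.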

Key Stats

| Metric | Value |
| --- | --- |
| PRs merged/week | 1,000+ |
| Devbox spin-up | 10 seconds |
| MCP tools | 400+ |
| CI rounds (max) | 2 |

StrongDM Software Factory

The most radical approach: no human code, no human review.[11]

Overview

StrongDM's "Software Factory" takes in-house coding agents to their logical extreme. While Stripe still requires human code review, StrongDM has eliminated it entirely:

  • Code must not be written by humans
  • Code must not be reviewed by humans

Founded in July 2025 by Justin McCarthy (co-founder/CTO), Jay Taylor, and Navan Chauhan, the team started after observing that Claude 3.5's October 2024 revision enabled "compounding correctness" in long-horizon agentic workflows.

Strengths

  • No review bottleneck: Code ships without human inspection
  • Infinite testing scale: Digital Twin Universe enables volume testing
  • ML-inspired validation: Scenarios act as holdout sets, preventing reward hacking
  • Third-party API coverage: Behavioral clones of Okta, Jira, Slack, Google Docs
  • Clear success metric: "$1,000/day in tokens per engineer"

Cautions

  • Requires validation investment: DTU took significant engineering effort
  • Domain-specific fit: Works well for integration-heavy software; unclear for other domains
  • Opaque code: Teams must accept not reading/understanding generated code
  • Novel approach: Less battle-tested than human-reviewed workflows
  • Small team documented: 3-person AI team; scalability unclear

Architecture Details

Philosophy: Code treated as opaque weights — correctness inferred from behavior, not inspection

Validation Loop:

  1. Seed — Initial spec (PRD, sentences, screenshot, existing code)
  2. Validation — End-to-end harness against DTU and scenarios
  3. Feedback — Output samples fed back for self-correction

Digital Twin Universe (DTU):

  • Behavioral clones of third-party services
  • Test at volumes exceeding production limits
  • No rate limits, API costs, or abuse detection

Scenarios vs Tests:

  • Stored outside codebase (like ML holdout sets)
  • LLM-validated, not boolean pass/fail
  • "Satisfaction" metric: fraction of trajectories that satisfy user

Key Stats

| Metric | Value |
| --- | --- |
| Human code | 0% |
| Human review | 0% |
| Target token spend | $1,000/day/engineer |
| DTU services | 6 (Okta, Jira, Slack, Docs, Drive, Sheets) |

Bitrise

Vendor lock-in concerns drove custom agent development.[12]

Overview

Bitrise, the mobile CI/CD platform, built their own AI coding agent after extensively benchmarking Claude Code, Codex, Gemini, and OpenCode. While Claude Code performed best, its closed-source nature and Anthropic-only API lock-in posed unacceptable long-term risks. Their solution: a custom Go-based agent using Anthropic APIs but with full ownership.

Strengths

  • No vendor lock-in: Provider-agnostic architecture allows model switching mid-conversation
  • Programmatic checkpoints: Verification embedded directly in agent workflow (not bolted on)
  • Custom eval framework: Go-based benchmark system runs tests in parallel across agents
  • Multi-agent coordination: Sub-agents dynamically constructed and orchestrated in Go
  • Central logging: LLM messages stored in provider-agnostic format

Cautions

  • Maintenance overhead: Custom agent requires ongoing development investment
  • Anthropic-dependent: Still uses Anthropic APIs despite architectural flexibility
  • Scale undisclosed: No public metrics on throughput or adoption
  • Mobile CI focus: Agent optimized for their specific domain (build failures, PR reviews)

Architecture Details

Language: Go (matching Bitrise's core stack)

Eval Framework:

  • Declarative test case definition
  • Docker containers for isolated execution
  • Parallel agent execution (~10 min runs)
  • LLM Judges for subjective evaluation
  • Results to SQL database + Metabase dashboard
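The declarative pattern — test cases as data, run against an agent, then scored — can be sketched briefly. Bitrise's real framework is in Go; this Ruby sketch uses hypothetical names, and the `check` lambda stands in for both programmatic assertions and LLM judges:

```ruby
# A test case is plain data: a name, a prompt, and a scoring check.
EvalCase = Struct.new(:name, :prompt, :check, keyword_init: true)

def run_evals(cases, agent:)
  cases.map do |c|
    output = agent.call(c.prompt) # would run in an isolated Docker container
    { name: c.name, passed: c.check.call(output) }
  end
end
```

Keeping cases declarative is what makes the parallel, cross-agent comparison possible: the same case definitions run unchanged against Claude Code, Codex, Gemini, or a custom agent.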

Agent Design:

  • Multiple sub-agents with injected tools/dependencies
  • Coordinated flow with programmatic result collection
  • Custom system prompts per use case
  • Provider-agnostic message storage

Benchmarking Findings

| Agent | Verdict |
| --- | --- |
| Claude Code | Best performance, but closed-source + Anthropic-only |
| Codex | Fast but lost chain-of-thought, mid-transition issues |
| Gemini | 10-min response times without reserved resources |
| OpenCode | 2x slower than Claude Code, TUI-coupled |

Shopify Roast

Structured workflows for AI coding agents — open-sourced as Ruby gem.[13]

Overview

Shopify's "Roast" is a Ruby-based DSL for creating structured AI workflows. Rather than building a coding agent from scratch, Roast provides orchestration primitives ("cogs") that chain AI steps together, including a dedicated agent cog that runs local coding agents like Claude Code with filesystem access. Open-sourced in 2025 and now at v1.0 preview.

Strengths

  • Open source: Ruby gem (roast-ai) freely available
  • Multi-agent orchestration: Chain LLM calls, coding agents, shell commands, custom Ruby
  • Serial and parallel execution: Map operations with configurable parallelism
  • Composable: Modular scopes with parameters enable reuse
  • Claude Code integration: agent cog runs Claude Code CLI with full filesystem access

Cautions

  • Ruby-specific: Ecosystem limited to Ruby shops
  • Requires Claude Code: agent cog depends on external CLI installation
  • Workflow complexity: DSL learning curve for non-trivial pipelines
  • v1.0 transition: Breaking changes from v0.x YAML syntax

Architecture Details

Core Cogs:

  • chat — Cloud LLM calls (OpenAI, Anthropic, Gemini)
  • agent — Local coding agents with filesystem access
  • ruby — Custom Ruby code execution
  • cmd — Shell commands with output capture
  • map — Collection processing (serial/parallel)
  • repeat — Iteration until conditions met
  • call — Reusable workflow invocation

Example Workflow:

execute do
  # Shell out to git to list files changed in the last five commits
  cmd(:recent_changes) { "git diff --name-only HEAD~5..HEAD" }
  # Hand the file list to a local coding agent (e.g. Claude Code) for review
  agent(:review) do
    files = cmd!(:recent_changes).lines
    "Review these files: #{files.join("\n")}"
  end
  # Summarize the agent's review with a cloud LLM call
  chat(:summary) { "Summarize: #{agent!(:review).response}" }
end

Philosophy: Shopify's insight is that non-determinism is the enemy of reliability for AI agents. Roast provides structure to keep agents on track rather than letting them run unconstrained.


Ramp Inspect

Background coding agent for async task completion.[14]

Overview

Ramp built an internal tool called "Inspect" that runs coding agents in the background, now responsible for >50% of merged pull requests (up from 30% in early 2026).[4] Notably, 80% of Inspect itself was written by Inspect — the agent building itself. Ramp published one of the most detailed public architecture specs for an in-house coding agent.[14] The approach has also been validated by an open-source reimplementation.[15]

Architecture: OpenCode serves as the agent runtime, running inside Modal sandboxes. State management uses Cloudflare Durable Objects (each session gets its own SQLite DB) with the Cloudflare Agents SDK providing WebSocket hibernation for streaming. GPT 5.2 handles repo classification for routing tasks.

Key capabilities: Child session spawning (agents spawn sub-agents for research or decomposition), 30-minute image rebuild cadence (sandboxes start from snapshots), warm sandbox preloading (starts spinning up when user begins typing), voice input, and mobile access (resume sessions from phone/couch). PRs are opened on behalf of the user's GitHub token, preventing self-approval.
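The routing and warm-preloading steps compose naturally. The Ruby sketch below is hypothetical (Ramp's stack is OpenCode, Modal, and Cloudflare; `TaskRouter`, `classifier`, and `sandbox_pool` are illustrative names): a classifier — an LLM in Ramp's system — maps the task to a repo, and a sandbox is acquired the moment the user starts typing rather than when they hit submit:

```ruby
class TaskRouter
  def initialize(classifier:, sandbox_pool:)
    @classifier = classifier
    @pool = sandbox_pool
  end

  # Begin warming a sandbox as soon as the user starts typing,
  # so it is ready by the time the task is actually submitted.
  def on_typing_started
    @warm = @pool.acquire
  end

  def route(task)
    repo = @classifier.call(task) # LLM-based repo classification in Ramp's case
    { repo: repo, sandbox: @warm || @pool.acquire }
  end
end
```

The preload trick trades a little wasted sandbox time for zero perceived startup latency — the same motivation as Stripe's pre-warmed devboxes.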

Strengths

  • >50% of merged PRs: Highest reported adoption rate, achieved through organic engineer uptake (up from 30%)
  • Self-building: 80% of Inspect itself was written by Inspect
  • Background execution: Agents work while developers focus elsewhere
  • Browser in sandbox: Agents can run browser automation within isolated environments
  • Multi-repo support: Can work across multiple repositories in a single session
  • Multi-surface access: Slack, web, Chrome extension, voice input, mobile
  • Voice input: Engineers can talk to sessions
  • PR attribution via GitHub auth: PRs opened on behalf of user token, not app — prevents self-approval
  • Proven pattern: Inspired open-source clone with 500+ GitHub stars

Cautions

  • Human review required: Still has review bottleneck
  • Single-tenant design: Trust boundaries needed for multi-tenant deployment

Open-Source Validation

Cole Murray's "Background Agents" project implements Ramp's architecture:

  • Control plane: Cloudflare Workers + Durable Objects
  • Data plane: Modal cloud sandboxes
  • Agent runtime: OpenCode
  • Features: Multiplayer sessions, commit attribution

Coinbase Cloudbot

5% of merged PRs, 10x PR cycle time reduction, multi-model architecture with Skills and MCPs.[6][7]

Overview

Coinbase's Engineering VP Chintan Turakhia built "Cloudbot" — a multi-model coding agent serving 1,000+ engineers. Originally built by two engineers, it now integrates deeply with Coinbase's toolchain via Skills and MCPs (Datadog, Sentry, Amplitude, Snowflake). The name "Cloudbot" is intentional — "it's actually using all sorts of underlying models. It's not something that is specific to Claude."[7] Chintan also created the "Super Builder" role — dedicated hires to drive AI adoption across the org.

Strengths

  • Proven adoption: 5% of all merged PRs — significant production impact
  • 10x cycle time reduction: PR cycle time dropped from ~150 hours to ~15 hours[7]
  • Multi-model: Explicitly model-agnostic, uses "all sorts of underlying models"
  • Skills + MCPs: Connects to Datadog, Sentry, Amplitude, Snowflake databases
  • Multi-repo: Works across multiple codebases
  • Plan mode: Creates plans in Linear tickets before executing code
  • Explain mode: Debugging and investigation capability
  • PR workflow: Creates PR → Cursor deep link to branch → QR code for one-off mobile build
  • Live feedback capture: Audio/video → LLM → bug identification → Linear ticket → PR, fully automated
  • Company-wide speedrun: 800 engineers generated 300-400 PRs in 30 minutes[7]
  • Surge events: Teams ship 3-4x more PR volume during intensive coding sessions
  • Familiar UX: Slack-native, tag agent in any thread to invoke
  • Viral adoption: Slack channels key to driving adoption organically
  • Risk-based review: Low-risk changes appear to ship with minimal review — Chintan pushed a fix to production during a 30-minute user call ("just reload, it's fixed") and fires off 200+ agent PRs overnight. The drop in average PR cycle time from ~150h to ~15h implies compressed or optional review for low-risk fixes (copy changes, simple bugs), while higher-risk changes still get full review[7]

Cautions

  • Not open-source: Internal tooling only, limited external documentation
  • Crypto-specific patterns: May include domain-specific integrations not transferable

Architecture Details

Invocation: Slack (tag agent in any thread), Linear tickets

Context: Skills + MCPs (Datadog, Sentry, Amplitude, Snowflake), Linear for task context

Modes: Plan (creates plan in Linear before coding), Explain (debugging/investigation)

Output: PR with Cursor deep link + QR code for mobile testing

Multi-repo: Works across multiple codebases

Product context: Team rebuilding Coinbase Wallet into consumer social app ("Base app") using React Native[7]

Tooling: Company uses Cursor extensively; Chintan personally drove adoption starting early 2025

Origins: Started as "Cloudbot" — two-engineer project that scaled to 1,000+ engineers

Key Stats

| Metric | Value |
| --- | --- |
| PR cycle time (before) | ~150 hours |
| PR cycle time (after) | ~15 hours |
| Improvement | 10x reduction |
| Engineers served | 1,000+ |
| Company-wide speedrun | 800 engineers, 300-400 PRs in 30 min |
| Surge event multiplier | 3-4x PR volume |
| Merged PR share | 5% |

Abnormal AI

13% of PRs now come from background agents — among the highest publicly reported percentages.[8]

Overview

Abnormal AI (cybersecurity company) reported in February 2026 that 13% of their PRs now come from in-house background agents — among the highest percentages publicly disclosed. The system serves full-stack engineers, security analysts, and MLEs, with the team actively migrating from a GitHub Actions-powered backend to Modal for execution.

Strengths

  • Strong reported adoption: 13% of PRs from background agents (vs. Coinbase's 5%; only Ramp's >50% is higher)
  • Broad use cases: Full-stack features, security/infra patches, agent tooling
  • Self-improving: The agent/dev tools build themselves
  • Multi-persona: Serves engineers, security analysts, and MLEs

Cautions

  • Limited technical detail: Architecture not fully disclosed (blog post WIP)
  • Infrastructure in flux: Currently migrating from GHA to Modal

Uber

Saved 21,000 developer hours with LangGraph-powered AI agents.[16]

Overview

Uber's Developer Platform Team presented at LangChain's Interrupt event detailing how they've deployed agentic tools across an engineering organization supporting 5,000 developers and a codebase with hundreds of millions of lines. Using LangGraph for orchestration, they've built reusable, domain-specific agents for testing, validation, and workflow assistance.

Key Tools

Validator: IDE-embedded agent that flags security vulnerabilities and best-practice violations in real time. Proposes fixes that can be accepted with one click or routed to an agentic assistant for context-aware resolution. Uses a hybrid architecture: LLM for complex issues, deterministic tools (static linters) for common patterns.

Autocover: Generative test-authoring tool that scaffolds, generates, executes, and mutates test cases. It can run up to 100 tests concurrently, delivering 2-3x the throughput of other AI coding tools, and increased test coverage by 10%, saving an estimated 21,000 developer hours.

Picasso: Workflow platform with conversational AI agents integrated with organizational knowledge.
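Validator's hybrid dispatch — deterministic tools for known patterns, LLM only for the rest — can be sketched as a partition over findings. This is an illustrative Ruby sketch with hypothetical names (`auto_fixable`, the `llm` callable), not Uber's internal API:

```ruby
# Deterministic linters run first; anything they can fix is fixed cheaply
# and reproducibly. Only findings they cannot resolve are escalated to an
# LLM for context-aware resolution.
def validate(file, linters:, llm:)
  findings = linters.flat_map { |l| l.call(file) }
  deterministic, complex = findings.partition { |f| f[:auto_fixable] }
  deterministic.map { |f| f[:fix] } + complex.map { |f| llm.call(f) }
end
```

This ordering is the point of Uber's third lesson below: the expensive, non-deterministic model is a fallback, not the front line.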

Strengths

  • Massive scale: 5,000 developers, hundreds of millions of LOC
  • Proven ROI: 21,000 developer hours saved, 10% test coverage increase
  • Hybrid architecture: LLM + deterministic tools for best of both worlds
  • Reusable primitives: Security team can contribute rules without deep LangGraph knowledge
  • Domain expertise: Specialized agents outperform generic AI coding tools

Cautions

  • Infrastructure investment: Requires dedicated platform team to maintain
  • LangGraph dependency: Tightly coupled to LangChain ecosystem
  • Enterprise context: Patterns may not transfer to smaller teams

Key Lessons (from Uber's team)

  1. Encapsulation enables reuse — Clear interfaces let teams extend without central coordination
  2. Domain expert agents outperform generic tools — Specialized context beats general-purpose AI
  3. Determinism still matters — Linters and build tools work better deterministically, orchestrated by agents
  4. Solve narrow problems first — Tightly scoped solutions get reused in broader workflows

Spotify Honk

Spotify's best developers haven't written a line of code since December — AI-first development via the internal "Honk" system.[5]

Overview

During Spotify's Q4 2025 earnings call (February 2026), co-CEO Gustav Söderström revealed that the company's best developers "have not written a single line of code since December" thanks to an internal AI coding system called Honk. The system, powered by Claude Code, enables remote real-time code deployment via Slack — engineers can fix bugs and ship features from their phones during their morning commute.

Key Results

  • 50+ features shipped throughout 2025 using AI-assisted development
  • Record earnings with stock jumping 14.7% on the announcement
  • Recent AI-powered launches: Prompted Playlists, Page Match (audiobooks), About This Song

How Honk Works

"An engineer at Spotify on their morning commute from Slack on their cell phone can tell Claude to fix a bug or add a new feature to the iOS app. And once Claude finishes that work, the engineer then gets a new version of the app, pushed to them on Slack on their phone, so that he can then merge it to production, all before they even arrive at the office."

— Gustav Söderström, Spotify co-CEO

Strengths

  • Mobile-first workflow: Ship from Slack on your phone, review and merge before arriving at office
  • Claude Code integration: Built on proven agentic coding infrastructure
  • Quantified impact: 50+ features, direct contribution to record earnings
  • Leadership buy-in: Announced publicly by co-CEO during earnings call

Cautions

  • Limited technical details: Architecture and tooling not publicly documented (unlike Stripe, Shopify)
  • "Best developers" qualifier: May not represent all engineering workflows
  • Consumer vs. enterprise: Spotify's codebase is primarily consumer-facing, different constraints than B2B

Key Lessons

  1. Slack is the universal agent interface — Matches pattern at Stripe, Coinbase, Ramp
  2. Executives publicly quantifying AI impact — "50+ features" and record earnings tied to AI adoption
  3. Mobile-native development workflows — Engineers reviewing/merging code from their phones
  4. "Best developers" redefining what developers do — Senior engineers as AI orchestrators, not code writers

Google Agent Smith

Internal agentic coding platform so popular that access had to be restricted.[1]

Overview

Google's Agent Smith is built on the company's existing Antigravity agentic coding platform, alongside related tools like Jetski and Cider. The system works asynchronously — engineers give instructions, the agent works in the background, and they check in from their phones. Agent Smith became so popular internally that Google had to restrict access to manage demand.

At a March 2026 town hall, Sergey Brin told employees that agents are a "big focus" this year. One external analysis claims more than 25% of new code shipped to production is generated by Agent Smith, though this figure is unverified. The Pragmatic Engineer survey (February 2026) noted Google's internal tools are causing a drop in external AI coding tool usage at companies with 10,000+ employees.[17]

Strengths

  • Deep infrastructure integration: Built on existing Antigravity platform with years of internal investment
  • Asynchronous mobile-friendly workflow: Engineers check progress from their phones
  • Rich context: Access to internal chat, employee profiles, and documents
  • Executive sponsorship: Sergey Brin publicly identifying agents as major focus
  • Proven demand: Popularity required access throttling

Cautions

  • No official documentation: Everything known comes from reporting, not Google publications
  • Unverified metrics: The >25% production code claim lacks official confirmation
  • Closed ecosystem: No open-source components or transferable architecture details

Key Stats

| Metric | Value |
| --- | --- |
| Production code share | >25% (unverified) |
| Internal demand | Access restricted due to popularity |
| Executive sponsor | Sergey Brin (March 2026) |
| Related tools | Antigravity, Jetski, Cider |

Meta REA

Autonomous ML experimentation: 2x accuracy, 5x engineering output, multi-week workflows.[2]

Overview

Meta's Ranking Engineer Agent (REA) is a specialized autonomous system for ML experimentation on ads ranking models, built on the internal Confucius framework. On its first production rollout, REA delivered 2x model accuracy improvement and 5x engineering output — 3 engineers delivered improvements for 8 models where historically each model required 2 dedicated engineers.

REA uses a hibernate-and-wake mechanism for multi-week autonomous workflows, a dual-source hypothesis engine (historical insights DB + ML research agent), and three-phase planning (Validation → Combination → Exploitation). Meta is also running company-wide "AI Transformation Week" pushing Claude Code and internal tools, with CTO Andrew Bosworth taking over the "AI for Work" initiative.
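The hibernate-and-wake mechanism can be pictured as a small state machine. A hedged Ruby sketch under assumed names (`ExperimentAgent`, `submit`, `wake` are all illustrative — Meta's Confucius framework is not public): the agent submits a long-running experiment, checkpoints, and consumes no compute until results arrive:

```ruby
class ExperimentAgent
  attr_reader :state, :findings

  def initialize
    @state = :idle
    @findings = []
  end

  # Submit a long-running experiment (ML training can take weeks),
  # then checkpoint and hibernate rather than poll.
  def submit(experiment)
    @pending = experiment
    @state = :hibernating
  end

  # Woken externally when results land; records the finding and
  # returns to :idle, ready to plan the next phase.
  def wake(result)
    @findings << { experiment: @pending, result: result }
    @state = :idle
  end
end
```

The same pattern is what makes multi-week autonomy tractable: state lives in the checkpoint, not in a long-lived process.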

Strengths

  • Quantified production results: 2x accuracy, 5x output are concrete metrics
  • Multi-week autonomy: Hibernate-and-wake handles ML's long feedback loops
  • Extraordinary leverage: 3 engineers doing work of 16
  • Domain-specialized: Purpose-built for ML experimentation, not generic coding
  • Officially documented: Published on Meta's engineering blog

Cautions

  • Domain-specific: Purpose-built for ML ranking, not general-purpose coding
  • Meta-scale infrastructure required: Massive compute and data dependencies
  • Not open-source: Confucius framework referenced in arXiv but REA is internal

Key Stats

| Metric | Value |
| --- | --- |
| Model accuracy improvement | 2x |
| Engineering output multiplier | 5x |
| Engineers required | 3 (down from ~16 equivalent) |
| Models improved | 8 |
| Workflow duration | Multi-week (hibernate-and-wake) |

OpenAI Harness

~1M lines of code, 0 manually written, agent-to-agent review — the most extreme case documented.[3]

Overview

OpenAI's Harness Engineering team shipped approximately 1 million lines of code over 5 months (starting August 2025) with zero manually written code. A team that grew from 3 to 7 engineers merged ~1,500 PRs at 3.5 PRs per engineer per day, with each engineer operating at 3-10x capacity. Codex agents run autonomously for 6+ hours per task, and agent-to-agent code review eliminates the human review bottleneck entirely.

The team's key insight on context management: AGENTS.md should be a table of contents, not an encyclopedia — pointing to a structured docs/ directory that agents can navigate. They also wired Chrome DevTools Protocol into the agent runtime for UI verification and exposed LogQL and PromQL directly to agents for observability.

Strengths

  • Most extreme documented scale: ~1M LOC with zero manual code
  • Sustained throughput: 3.5 PRs/engineer/day over months
  • Agent-to-agent review: Eliminates human review bottleneck entirely
  • Extended autonomy: 6+ hour Codex sessions per task
  • Transferable patterns: AGENTS.md-as-TOC, structured docs/, observability access
  • Third-party validation: Martin Fowler published detailed analysis

Cautions

  • OpenAI advantage: Privileged access to Codex capabilities
  • Greenfield project: New product is easier for agents than legacy code
  • Small team: 3-7 engineers may not represent larger org patterns
  • Codex-specific: Workflow designed around Codex's specific capabilities

Key Stats

| Metric | Value |
| --- | --- |
| Lines of code | ~1,000,000 |
| Manually written code | 0 |
| PRs merged | ~1,500 |
| Team size | 3 → 7 engineers |
| PRs per engineer per day | 3.5 |
| Engineer multiplier | 3-10x |
| Agent autonomy per task | 6+ hours |

Other Suspected Implementations

| Company | Evidence | Likely Approach | Confidence |
| --- | --- | --- | --- |
| Amazon | Q Developer (external), internal ML platform | Internal agents likely predate Q Developer | Medium |
| Netflix | Heavy automation culture | Integrated with deployment platform | Low |

Note: Google, Meta, Coinbase, and Uber have been moved to confirmed company implementations (see sections above).


Common Architecture Patterns

Invocation Surface

| Pattern | Adoption | Notes |
| --- | --- | --- |
| Slack | Universal | Primary interface for all documented systems |
| CLI | Common | Secondary for power users |
| Web UI | Common | Visibility and management (Stripe, Ramp) |
| Chrome extension | Ramp, Stripe | Visual editing and in-page invocation |
| Voice | Ramp | Talk to sessions hands-free |
| Mobile | Ramp, Spotify | Resume/monitor sessions from phone |
| Linear tickets | Coinbase | Agent triggered from tickets, plan mode |
| Internal tools | Stripe only | Deep integration (ticketing, docs, feature flags) |

Execution Environment

  • Isolated sandboxes — Pre-warmed for fast spin-up (Stripe: 10 seconds)
  • Same as human dev environment — Reduces agent-specific edge cases
  • Network isolation — No production access, no internet (security)
  • Parallelization — Multiple agents without git worktree conflicts
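The shared invocation → sandbox → CI → PR pipeline these sections describe reduces to a short orchestration function. A hypothetical Ruby sketch with injected collaborators (none of these names come from any vendor's actual API):

```ruby
# One task, end to end: acquire an isolated environment, let the agent
# produce a change, gate it through CI, and emit a PR for human review.
def handle_slack_task(task, sandbox:, agent:, ci:, open_pr:)
  env = sandbox.call(task[:repo])        # isolated, pre-warmed environment
  diff = agent.call(task[:prompt], env)  # agent writes the change end-to-end
  return :ci_failed unless ci.call(diff)
  open_pr.call(diff)                     # PR-ready output for human review
  :pr_opened
end
```

Every system in the matrix varies the pieces — which sandbox, which agent runtime, whether review is human or agent-to-agent — but the control flow is remarkably consistent.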

Framework Codification

These patterns are now being codified into reusable frameworks. LangChain's Open SWE is an open-source framework specifically designed for building internal coding agents, validating that the architecture patterns documented here — sandboxed execution, context management, CI integration — are mature enough to standardize.[18]

Validation Spectrum

| Approach | Human Review | Test Strategy | Example |
| --- | --- | --- | --- |
| Conservative | Required | Traditional CI + lint | Stripe, Ramp |
| Radical | Eliminated | Behavioral validation + DTU | StrongDM |

Strategic Recommendations

For Engineering Leaders

Build when:

  • Codebase exceeds 10M LOC with proprietary frameworks
  • Existing devex team can be redirected (3+ engineers)
  • Compliance requires full control over code flow
  • Organization has 1,000+ engineers (ROI threshold)

Buy when:

  • Standard tech stack (Python, TypeScript, common frameworks)
  • Team under 100 engineers
  • Need agents in weeks, not quarters
  • Enterprise integrations required (Jira, signed commits, BYOK)

Wait when:

  • Unclear ROI or codebase fit
  • No dedicated devex resources
  • Vendor market still maturing

For Devex Teams

  1. Start with invocation surface — Slack integration provides immediate value
  2. Invest in sandboxing — Pre-warmed environments are universal pattern
  3. Consider MCP adoption — Emerging standard for tool integration
  4. Evaluate validation requirements — Decide on human review early

For Vendors

  • Reference architecture exists — Elite companies have defined the pattern
  • Middle market opportunity — Stripe-like capabilities without Stripe-level investment
  • Validation innovation — Digital Twin approach may become differentiator

Market Outlook

Near-Term (2026-2027)

  • More companies will publish in-house implementations
  • Open-source clones will mature (Background Agents, etc.)
  • Vendor solutions will close gap with in-house systems

Medium-Term (2027-2028)

  • Digital Twin testing will become standard practice
  • Human code review will become optional at mature orgs
  • "Token spend per engineer" will emerge as productivity metric

Long-Term (2028+)

  • In-house vs. vendor distinction may blur
  • Validation infrastructure becomes the moat, not the agent
  • "Grown software" philosophy spreads beyond early adopters

Bottom Line

A spectrum of approaches is now documented:

Stripe Minions: 1,000+ merged PRs per week with no human-written code, but human review required. The pattern — Slack invocation, isolated sandboxes, MCP context, CI integration, PR output — is well-documented and replicable.

Bitrise: Built custom Go agent to avoid vendor lock-in with Claude Code. Key insight: programmatic checkpoints embedded in agent workflow beat bolted-on validation. Shows that even mid-size companies can build if the fit is right.

Shopify Roast: Rather than a full agent, open-sourced orchestration primitives. Philosophy: non-determinism is the enemy of reliability — structure keeps agents on track. Useful for companies wanting custom workflows without building agents from scratch.

StrongDM Software Factory: The radical end — no human code, no human review. Code treated as opaque weights validated purely by behavior. Digital Twin Universe enables testing at scale against cloned third-party services.

All require investment, but the threshold varies. Shopify's approach (DSL + existing agents) is lighter than Bitrise's (full custom agent), which is lighter than Stripe's (agents + massive devex infra). Most companies should still buy, not build.

The interesting middle ground: platforms like Tembo that provide Stripe-like orchestration without Stripe-level investment.


Research by Ry Walker Research • methodology

Disclosure: Author is CEO of Tembo, which offers agent orchestration as an alternative to building in-house.