
In-House Coding Agents

Analysis of 12 in-house coding agents: how Stripe, Google, Meta, OpenAI, Uber, Spotify, Shopify, Coinbase, and other tech companies are building custom agents instead of buying off-the-shelf.

Key takeaways

  • Stripe Minions ship 1,000+ merged PRs per week — fully unattended; humans review the code but write none of it
  • Shopify open-sourced 'Roast' — a Ruby DSL for structured AI workflows; their philosophy: non-determinism is the enemy of reliability
  • Coinbase's "Cloudbot" hit 5% of merged PRs and cut PR cycle time from 150h to 15h — multi-model, Slack-native, with Skills and MCPs
  • Block open-sourced Goose, which Stripe forked for Minions — proving the value of shared infrastructure
  • Uber saved 21,000 developer hours with LangGraph-powered Validator and Autocover agents — IDE-embedded, hybrid LLM + deterministic
  • Abnormal AI hit 13% of merged PRs — among the highest reported percentages, serving devs, security analysts, and MLEs
  • Google's Agent Smith became so popular internally that access had to be restricted — reportedly >25% of new production code
  • Meta REA delivered 2x model accuracy and 5x engineering output — 3 engineers covering 8 models via autonomous multi-week ML workflows
  • OpenAI Harness shipped ~1M lines with 0 manually written code — agent-to-agent review eliminates human bottleneck, 3.5 PRs/engineer/day
  • StrongDM goes further: no human code, no human review — code treated as opaque weights validated by behavior

FAQ

Why do companies build their own coding agents instead of buying?

Unique codebases, proprietary frameworks, compliance requirements, and the need for tight integration with internal tools. Vendor lock-in concerns also drive build decisions.

What percentage of PRs come from coding agents at top companies?

Ramp Inspect now handles >50% of merged PRs, Abnormal AI reports 13%, Coinbase 5%, and Stripe ships 1,000+ merged PRs per week. OpenAI Harness shipped ~1M LOC with zero manually written code.

What's the ROI threshold for building in-house coding agents?

Generally requires 1,000+ engineers, dedicated devex teams, and codebases exceeding 10M LOC. Smaller teams should buy off-the-shelf tools like Tembo, Codex, or Claude Code.

What architecture do in-house coding agents share?

Slack invocation, isolated sandbox execution, CI/CD integration, and PR-ready output. Human review policies vary from required (Stripe) to eliminated (StrongDM).

How do companies validate code written by AI agents?

Two approaches: traditional CI + human review (conservative) or behavioral validation against 'digital twin' test environments without human review (radical).

What is the emerging metric for AI-native engineering teams?

Percentage of PRs merged from background agents. Leaders now report 13-50%+, with OpenAI's Harness team reaching 100% agent-written code — suggesting this will become a standard engineering productivity metric.

Executive Summary

A new pattern is emerging at elite engineering organizations: instead of adopting off-the-shelf coding agents (Codex, Claude Code, Cursor), companies like Stripe, Google, Meta, OpenAI, Spotify, Shopify, Coinbase, Uber, and Block are building custom in-house systems tailored to their unique codebases and workflows. With 12 confirmed implementations now documented, this is no longer experimental — it's an industry shift.

Key Findings:

  • Stripe Minions produce 1,000+ merged PRs per week with zero human-written code
  • Google's Agent Smith became so popular internally that access had to be restricted — reportedly >25% of new production code [1]
  • Meta REA delivered 2x model accuracy, 5x engineering output — 3 engineers covering 8 models autonomously [2]
  • OpenAI Harness shipped ~1M LOC with zero manually written code — agent-to-agent review, 3.5 PRs/engineer/day [3]
  • Ramp Inspect now handles >50% of merged PRs (up from 30%), with 80% of Inspect itself written by Inspect [4]
  • Spotify reports its best developers haven't written code since December — shipped 50+ features via the internal "Honk" system powered by Claude Code [5]
  • Shopify open-sourced "Roast" — a Ruby DSL for structured AI workflows with the insight: "non-determinism is the enemy of reliability"
  • Coinbase's "Cloudbot" hit 5% of all merged PRs and cut PR cycle time 10x — multi-model, Slack-native, with Skills and MCPs [6][7]
  • Block open-sourced Goose, which Stripe forked for Minions — proving shared agent infrastructure has value
  • Abnormal AI hit 13% of merged PRs from background agents [8]
  • StrongDM has eliminated human code review entirely, treating code as opaque weights validated by behavior
  • Common architecture: Slack invocation → isolated sandbox → CI loop → PR-ready output

Strategic Planning Assumptions:

  • By 2027, 30% of enterprises with 1000+ engineers will operate internal coding agent systems
  • By 2028, "Digital Twin" testing infrastructure (cloned third-party APIs) will be standard for agent validation
  • Human code review will become optional at organizations with mature validation infrastructure

Market Definition

In-house coding agents are custom-built systems that enable AI agents to write, test, and ship code within a company's specific codebase and toolchain — as opposed to off-the-shelf products like Codex, Claude Code, or Cursor.

Inclusion Criteria:

  • Publicly documented implementation details
  • Production deployment (not experimental)
  • Custom-built for company-specific constraints
  • Agent writes code end-to-end (not just autocomplete)

Exclusion Criteria:

  • Commercial products available for purchase
  • Internal tools without public documentation
  • Autocomplete/copilot-style assistance only

Comparison Matrix

| Vendor | Human Review | Agent Foundation | Invocation | Validation Approach | Documented Scale |
| --- | --- | --- | --- | --- | --- |
| Stripe Minions | Required | Goose fork | Slack, CLI, Web | CI + MCP tools | 1,000+ PRs/week |
| Google Agent Smith | Not disclosed | Antigravity | Internal chat, Mobile | Not disclosed | >25% of new code (unverified) |
| Meta REA | Not disclosed | Confucius | Autonomous | Three-phase planning | 2x accuracy, 5x output |
| OpenAI Harness | Agent-to-agent | Codex | Orchestration | DevTools + LogQL/PromQL | ~1M LOC, 0 manual |
| StrongDM Factory | Eliminated | Cursor YOLO | Spec-driven | Digital Twin Universe | "$1K/day/engineer" |
| Ramp Inspect | Required | OpenCode | Slack, Web, Chrome ext, Voice, Mobile | Sandbox + CI | >50% of merged PRs |
| Bitrise | Required | Custom (Go) | Internal eval | Programmatic checkpoints | Not disclosed |
| Shopify Roast | Required | Workflow DSL | CLI | Cog-based validation | Not disclosed |
| Coinbase | Context-dependent | Cloudbot (multi-model) | Slack, Linear | Skills + MCPs | 5% of merged PRs, 10x cycle time reduction |
| Abnormal AI | Not disclosed | Custom | Not disclosed | Not disclosed | 13% of merged PRs |
| Uber | Not disclosed | LangGraph | IDE, Workflow | Hybrid (LLM + deterministic) | 21,000 hours saved |
| Spotify Honk | Not disclosed | Claude Code | Slack | Not disclosed | 50+ features shipped |

Company Implementations

Stripe Minions

The most detailed public case study of in-house coding agents.[9]

Overview

Stripe's "Minions" are fully unattended coding agents that produce over 1,000 merged pull requests per week. While humans review the code, they write none of it — Minions handle everything from start to finish. Built on a fork of Block's Goose agent[10], Minions integrate deeply with Stripe's existing developer infrastructure.

Strengths

  • Proven scale: 1,000+ PRs merged weekly, validated in production
  • Deep integration: MCP server ("Toolshed") with 400+ internal tools
  • Fast iteration: Isolated devboxes spin up in 10 seconds
  • Leverage existing infra: Uses same environments as human engineers
  • Parallel execution: Engineers run multiple Minions simultaneously

Cautions

  • Requires massive codebase investment: Stripe has hundreds of millions of LOC and years of devex tooling
  • Human review bottleneck: Agents don't merge; humans must still review
  • Ruby/Sorbet specific: Some patterns may not transfer to other stacks
  • Dedicated team required: "Leverage team" maintains the system

Architecture Details

Invocation: Slack (primary), CLI, web, internal tool integrations

Execution: Pre-warmed "devboxes" isolated from production and internet

Context: Central MCP server with docs, tickets, build status, Sourcegraph search

CI Loop: Local lint (under 5 seconds) → at most 2 CI rounds → auto-apply fixes
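This lint-then-iterate loop can be sketched as a small control flow. The sketch below is illustrative, not Stripe's internal API — `lint`, `ci`, and `apply_fix` are hypothetical injected callables — written in Ruby to match the Roast example later in this report:

```ruby
MAX_CI_ROUNDS = 2

def run_ci_loop(change, lint:, ci:, apply_fix:)
  # Fast local lint gate first (Stripe targets under 5 seconds)
  return :lint_failed unless lint.call(change)

  MAX_CI_ROUNDS.times do
    result = ci.call(change)
    return :green if result[:passed]
    # Auto-apply suggested fixes before the next CI round
    change = apply_fix.call(change, result[:failures])
  end
  :needs_human # escalate to a human after the second failed round
end
```

Capping CI rounds keeps a misbehaving agent from burning compute indefinitely; after two failed rounds the change is handed back rather than retried forever.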

Key Stats

| Metric | Value |
| --- | --- |
| PRs merged/week | 1,000+ |
| Devbox spin-up | 10 seconds |
| MCP tools | 400+ |
| CI rounds (max) | 2 |

StrongDM Software Factory

The most radical approach: no human code, no human review.[11]

Overview

StrongDM's "Software Factory" takes in-house coding agents to their logical extreme. While Stripe still requires human code review, StrongDM has eliminated it entirely:

  • Code must not be written by humans
  • Code must not be reviewed by humans

Founded in July 2025 by Justin McCarthy (co-founder/CTO), Jay Taylor, and Navan Chauhan, the team started after observing that Claude 3.5's October 2024 revision enabled "compounding correctness" in long-horizon agentic workflows.

Strengths

  • No review bottleneck: Code ships without human inspection
  • Infinite testing scale: Digital Twin Universe enables volume testing
  • ML-inspired validation: Scenarios act as holdout sets, preventing reward hacking
  • Third-party API coverage: Behavioral clones of Okta, Jira, Slack, Google Docs
  • Clear success metric: "$1,000/day in tokens per engineer"

Cautions

  • Requires validation investment: DTU took significant engineering effort
  • Domain-specific fit: Works well for integration-heavy software; unclear for other domains
  • Opaque code: Teams must accept not reading/understanding generated code
  • Novel approach: Less battle-tested than human-reviewed workflows
  • Small team documented: 3-person AI team; scalability unclear

Architecture Details

Philosophy: Code treated as opaque weights — correctness inferred from behavior, not inspection

Validation Loop:

  1. Seed — Initial spec (PRD, sentences, screenshot, existing code)
  2. Validation — End-to-end harness against DTU and scenarios
  3. Feedback — Output samples fed back for self-correction

Digital Twin Universe (DTU):

  • Behavioral clones of third-party services
  • Test at volumes exceeding production limits
  • No rate limits, API costs, or abuse detection

Scenarios vs Tests:

  • Stored outside codebase (like ML holdout sets)
  • LLM-validated, not boolean pass/fail
  • "Satisfaction" metric: fraction of trajectories that satisfy user

Key Stats

| Metric | Value |
| --- | --- |
| Human code | 0% |
| Human review | 0% |
| Target token spend | $1,000/day/engineer |
| DTU services | 6 (Okta, Jira, Slack, Docs, Drive, Sheets) |

Bitrise

Vendor lock-in concerns drove custom agent development.[12]

Overview

Bitrise, the mobile CI/CD platform, built their own AI coding agent after extensively benchmarking Claude Code, Codex, Gemini, and OpenCode. While Claude Code performed best, its closed-source nature and Anthropic-only API lock-in posed unacceptable long-term risks. Their solution: a custom Go-based agent using Anthropic APIs but with full ownership.

Strengths

  • No vendor lock-in: Provider-agnostic architecture allows model switching mid-conversation
  • Programmatic checkpoints: Verification embedded directly in agent workflow (not bolted on)
  • Custom eval framework: Go-based benchmark system runs tests in parallel across agents
  • Multi-agent coordination: Sub-agents dynamically constructed and orchestrated in Go
  • Central logging: LLM messages stored in provider-agnostic format

Cautions

  • Maintenance overhead: Custom agent requires ongoing development investment
  • Anthropic-dependent: Still uses Anthropic APIs despite architectural flexibility
  • Scale undisclosed: No public metrics on throughput or adoption
  • Mobile CI focus: Agent optimized for their specific domain (build failures, PR reviews)

Architecture Details

Language: Go (matching Bitrise's core stack)

Eval Framework:

  • Declarative test case definition
  • Docker containers for isolated execution
  • Parallel agent execution (~10 min runs)
  • LLM Judges for subjective evaluation
  • Results to SQL database + Metabase dashboard
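The declarative pattern — test cases as data, run against an agent, then scored — can be sketched briefly. Bitrise's real framework is in Go; this Ruby sketch uses hypothetical names, and the `check` lambda stands in for both programmatic assertions and LLM judges:

```ruby
# A test case is plain data: a name, a prompt, and a scoring check.
EvalCase = Struct.new(:name, :prompt, :check, keyword_init: true)

def run_evals(cases, agent:)
  cases.map do |c|
    output = agent.call(c.prompt) # would run in an isolated Docker container
    { name: c.name, passed: c.check.call(output) }
  end
end
```

Keeping cases declarative is what makes the parallel, cross-agent comparison possible: the same case definitions run unchanged against Claude Code, Codex, Gemini, or a custom agent.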

Agent Design:

  • Multiple sub-agents with injected tools/dependencies
  • Coordinated flow with programmatic result collection
  • Custom system prompts per use case
  • Provider-agnostic message storage

Benchmarking Findings

| Agent | Verdict |
| --- | --- |
| Claude Code | Best performance, but closed-source + Anthropic-only |
| Codex | Fast but lost chain-of-thought, mid-transition issues |
| Gemini | 10-min response times without reserved resources |
| OpenCode | 2x slower than Claude Code, TUI-coupled |

Shopify Roast

Structured workflows for AI coding agents — open-sourced as Ruby gem.[13]

Overview

Shopify's "Roast" is a Ruby-based DSL for creating structured AI workflows. Rather than building a coding agent from scratch, Roast provides orchestration primitives ("cogs") that chain AI steps together, including a dedicated agent cog that runs local coding agents like Claude Code with filesystem access. Open-sourced in 2025 and now at v1.0 preview.

Strengths

  • Open source: Ruby gem (roast-ai) freely available
  • Multi-agent orchestration: Chain LLM calls, coding agents, shell commands, custom Ruby
  • Serial and parallel execution: Map operations with configurable parallelism
  • Composable: Modular scopes with parameters enable reuse
  • Claude Code integration: agent cog runs Claude Code CLI with full filesystem access

Cautions

  • Ruby-specific: Ecosystem limited to Ruby shops
  • Requires Claude Code: agent cog depends on external CLI installation
  • Workflow complexity: DSL learning curve for non-trivial pipelines
  • v1.0 transition: Breaking changes from v0.x YAML syntax

Architecture Details

Core Cogs:

  • chat — Cloud LLM calls (OpenAI, Anthropic, Gemini)
  • agent — Local coding agents with filesystem access
  • ruby — Custom Ruby code execution
  • cmd — Shell commands with output capture
  • map — Collection processing (serial/parallel)
  • repeat — Iteration until conditions met
  • call — Reusable workflow invocation

Example Workflow:

execute do
  # Shell out to git to list files changed in the last five commits
  cmd(:recent_changes) { "git diff --name-only HEAD~5..HEAD" }
  # Hand the file list to a local coding agent (e.g. Claude Code) for review
  agent(:review) do
    files = cmd!(:recent_changes).lines
    "Review these files: #{files.join("\n")}"
  end
  # Summarize the agent's review with a cloud LLM call
  chat(:summary) { "Summarize: #{agent!(:review).response}" }
end

Philosophy: Shopify's insight is that non-determinism is the enemy of reliability for AI agents. Roast provides structure to keep agents on track rather than letting them run unconstrained.


Ramp Inspect

Background coding agent for async task completion.[14]

Overview

Ramp built an internal tool called "Inspect" that runs coding agents in the background, now responsible for >50% of merged pull requests (up from 30% in early 2026).[4] Notably, 80% of Inspect itself was written by Inspect — the agent building itself. Ramp published one of the most detailed public architecture specs for an in-house coding agent.[14] The approach has also been validated by an open-source reimplementation.[15]

Architecture: OpenCode serves as the agent runtime, running inside Modal sandboxes. State management uses Cloudflare Durable Objects (each session gets its own SQLite DB) with the Cloudflare Agents SDK providing WebSocket hibernation for streaming. GPT 5.2 handles repo classification for routing tasks.

Key capabilities: Child session spawning (agents spawn sub-agents for research or decomposition), 30-minute image rebuild cadence (sandboxes start from snapshots), warm sandbox preloading (starts spinning up when user begins typing), voice input, and mobile access (resume sessions from phone/couch). PRs are opened on behalf of the user's GitHub token, preventing self-approval.
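The routing and warm-preloading steps compose naturally. The Ruby sketch below is hypothetical (Ramp's stack is OpenCode, Modal, and Cloudflare; `TaskRouter`, `classifier`, and `sandbox_pool` are illustrative names): a classifier — an LLM in Ramp's system — maps the task to a repo, and a sandbox is acquired the moment the user starts typing rather than when they hit submit:

```ruby
class TaskRouter
  def initialize(classifier:, sandbox_pool:)
    @classifier = classifier
    @pool = sandbox_pool
  end

  # Begin warming a sandbox as soon as the user starts typing,
  # so it is ready by the time the task is actually submitted.
  def on_typing_started
    @warm = @pool.acquire
  end

  def route(task)
    repo = @classifier.call(task) # LLM-based repo classification in Ramp's case
    { repo: repo, sandbox: @warm || @pool.acquire }
  end
end
```

The preload trick trades a little wasted sandbox time for zero perceived startup latency — the same motivation as Stripe's pre-warmed devboxes.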

Strengths

  • >50% of merged PRs: Highest reported adoption rate, achieved through organic engineer uptake (up from 30%)
  • Self-building: 80% of Inspect itself was written by Inspect
  • Background execution: Agents work while developers focus elsewhere
  • Browser in sandbox: Agents can run browser automation within isolated environments
  • Multi-repo support: Can work across multiple repositories in a single session
  • Multi-surface access: Slack, web, Chrome extension, voice input, mobile
  • Voice input: Engineers can talk to sessions
  • PR attribution via GitHub auth: PRs opened on behalf of user token, not app — prevents self-approval
  • Proven pattern: Inspired open-source clone with 500+ GitHub stars

Cautions

  • Human review required: Still has review bottleneck
  • Single-tenant design: Trust boundaries needed for multi-tenant deployment

Open-Source Validation

Cole Murray's "Background Agents" project implements Ramp's architecture:

  • Control plane: Cloudflare Workers + Durable Objects
  • Data plane: Modal cloud sandboxes
  • Agent runtime: OpenCode
  • Features: Multiplayer sessions, commit attribution

Coinbase Cloudbot

5% of merged PRs, 10x PR cycle time reduction, multi-model architecture with Skills and MCPs.[6][7]

Overview

Coinbase's Engineering VP Chintan Turakhia built "Cloudbot" — a multi-model coding agent serving 1,000+ engineers. Originally built by two engineers, it now integrates deeply with Coinbase's toolchain via Skills and MCPs (Datadog, Sentry, Amplitude, Snowflake). The name "Cloudbot" is intentional — "it's actually using all sorts of underlying models. It's not something that is specific to Claude."[7] Chintan also created the "Super Builder" role — dedicated hires to drive AI adoption across the org.

Strengths

  • Proven adoption: 5% of all merged PRs — significant production impact
  • 10x cycle time reduction: PR cycle time dropped from ~150 hours to ~15 hours[7]
  • Multi-model: Explicitly model-agnostic, uses "all sorts of underlying models"
  • Skills + MCPs: Connects to Datadog, Sentry, Amplitude, Snowflake databases
  • Multi-repo: Works across multiple codebases
  • Plan mode: Creates plans in Linear tickets before executing code
  • Explain mode: Debugging and investigation capability
  • PR workflow: Creates PR → Cursor deep link to branch → QR code for one-off mobile build
  • Live feedback capture: Audio/video → LLM → bug identification → Linear ticket → PR, fully automated
  • Company-wide speedrun: 800 engineers generated 300-400 PRs in 30 minutes[7]
  • Surge events: Teams ship 3-4x more PR volume during intensive coding sessions
  • Familiar UX: Slack-native, tag agent in any thread to invoke
  • Viral adoption: Slack channels key to driving adoption organically
  • Risk-based review: Low-risk changes appear to ship with minimal review — Chintan pushed a fix to production during a 30-minute user call ("just reload, it's fixed") and fires off 200+ agent PRs overnight. The drop in average PR cycle time from ~150h to ~15h implies compressed or optional review for low-risk fixes (copy changes, simple bugs), while higher-risk changes still get full review[7]

Cautions

  • Not open-source: Internal tooling only, limited external documentation
  • Crypto-specific patterns: May include domain-specific integrations not transferable

Architecture Details

Invocation: Slack (tag agent in any thread), Linear tickets

Context: Skills + MCPs (Datadog, Sentry, Amplitude, Snowflake), Linear for task context

Modes: Plan (creates plan in Linear before coding), Explain (debugging/investigation)

Output: PR with Cursor deep link + QR code for mobile testing

Multi-repo: Works across multiple codebases

Product context: Team rebuilding Coinbase Wallet into consumer social app ("Base app") using React Native[7]

Tooling: Company uses Cursor extensively; Chintan personally drove adoption starting early 2025

Origins: Started as "Cloudbot" — two-engineer project that scaled to 1,000+ engineers

Key Stats

| Metric | Value |
| --- | --- |
| PR cycle time (before) | ~150 hours |
| PR cycle time (after) | ~15 hours |
| Improvement | 10x reduction |
| Engineers served | 1,000+ |
| Company-wide speedrun | 800 engineers, 300-400 PRs in 30 min |
| Surge event multiplier | 3-4x PR volume |
| Merged PR share | 5% |

Abnormal AI

13% of PRs now come from background agents — among the highest publicly reported percentages.[8]

Overview

Abnormal AI (cybersecurity company) reported in February 2026 that 13% of their PRs now come from in-house background agents — among the highest percentages publicly disclosed. The system serves full-stack engineers, security analysts, and MLEs, with the team actively migrating from a GitHub Actions-powered backend to Modal for execution.

Strengths

  • Strong reported adoption: 13% of PRs from background agents (vs. Coinbase's 5%; only Ramp's >50% is higher)
  • Broad use cases: Full-stack features, security/infra patches, agent tooling
  • Self-improving: The agent/dev tools build themselves
  • Multi-persona: Serves engineers, security analysts, and MLEs

Cautions

  • Limited technical detail: Architecture not fully disclosed (blog post WIP)
  • Infrastructure in flux: Currently migrating from GHA to Modal

Uber

Saved 21,000 developer hours with LangGraph-powered AI agents.[16]

Overview

Uber's Developer Platform Team presented at LangChain's Interrupt event detailing how they've deployed agentic tools across an engineering organization supporting 5,000 developers and a codebase with hundreds of millions of lines. Using LangGraph for orchestration, they've built reusable, domain-specific agents for testing, validation, and workflow assistance.

Key Tools

Validator: IDE-embedded agent that flags security vulnerabilities and best-practice violations in real time. Proposes fixes that can be accepted with one click or routed to an agentic assistant for context-aware resolution. Uses a hybrid architecture: LLM for complex issues, deterministic tools (static linters) for common patterns.

Autocover: Generative test-authoring tool that scaffolds, generates, executes, and mutates test cases. It can run up to 100 tests concurrently, delivering 2-3x the throughput of other AI coding tools, and increased test coverage by 10%, saving an estimated 21,000 developer hours.

Picasso: Workflow platform with conversational AI agents integrated with organizational knowledge.
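Validator's hybrid dispatch — deterministic tools for known patterns, LLM only for the rest — can be sketched as a partition over findings. This is an illustrative Ruby sketch with hypothetical names (`auto_fixable`, the `llm` callable), not Uber's internal API:

```ruby
# Deterministic linters run first; anything they can fix is fixed cheaply
# and reproducibly. Only findings they cannot resolve are escalated to an
# LLM for context-aware resolution.
def validate(file, linters:, llm:)
  findings = linters.flat_map { |l| l.call(file) }
  deterministic, complex = findings.partition { |f| f[:auto_fixable] }
  deterministic.map { |f| f[:fix] } + complex.map { |f| llm.call(f) }
end
```

This ordering is the point of Uber's third lesson below: the expensive, non-deterministic model is a fallback, not the front line.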

Strengths

  • Massive scale: 5,000 developers, hundreds of millions of LOC
  • Proven ROI: 21,000 developer hours saved, 10% test coverage increase
  • Hybrid architecture: LLM + deterministic tools for best of both worlds
  • Reusable primitives: Security team can contribute rules without deep LangGraph knowledge
  • Domain expertise: Specialized agents outperform generic AI coding tools

Cautions

  • Infrastructure investment: Requires dedicated platform team to maintain
  • LangGraph dependency: Tightly coupled to LangChain ecosystem
  • Enterprise context: Patterns may not transfer to smaller teams

Key Lessons (from Uber's team)

  1. Encapsulation enables reuse — Clear interfaces let teams extend without central coordination
  2. Domain expert agents outperform generic tools — Specialized context beats general-purpose AI
  3. Determinism still matters — Linters and build tools work better deterministically, orchestrated by agents
  4. Solve narrow problems first — Tightly scoped solutions get reused in broader workflows

Spotify Honk

Spotify's best developers haven't written a line of code since December — AI-first development via the internal "Honk" system.[5]

Overview

During Spotify's Q4 2025 earnings call (February 2026), co-CEO Gustav Söderström revealed that the company's best developers "have not written a single line of code since December" thanks to an internal AI coding system called Honk. The system, powered by Claude Code, enables remote real-time code deployment via Slack — engineers can fix bugs and ship features from their phones during their morning commute.

Key Results

  • 50+ features shipped throughout 2025 using AI-assisted development
  • Record earnings with stock jumping 14.7% on the announcement
  • Recent AI-powered launches: Prompted Playlists, Page Match (audiobooks), About This Song

How Honk Works

"An engineer at Spotify on their morning commute from Slack on their cell phone can tell Claude to fix a bug or add a new feature to the iOS app. And once Claude finishes that work, the engineer then gets a new version of the app, pushed to them on Slack on their phone, so that he can then merge it to production, all before they even arrive at the office."

— Gustav Söderström, Spotify co-CEO

Strengths

  • Mobile-first workflow: Ship from Slack on your phone, review and merge before arriving at office
  • Claude Code integration: Built on proven agentic coding infrastructure
  • Quantified impact: 50+ features, direct contribution to record earnings
  • Leadership buy-in: Announced publicly by co-CEO during earnings call

Cautions

  • Limited technical details: Architecture and tooling not publicly documented (unlike Stripe, Shopify)
  • "Best developers" qualifier: May not represent all engineering workflows
  • Consumer vs. enterprise: Spotify's codebase is primarily consumer-facing, different constraints than B2B

Key Lessons

  1. Slack is the universal agent interface — Matches pattern at Stripe, Coinbase, Ramp
  2. Executives publicly quantifying AI impact — "50+ features" and record earnings tied to AI adoption
  3. Mobile-native development workflows — Engineers reviewing/merging code from their phones
  4. "Best developers" redefining what developers do — Senior engineers as AI orchestrators, not code writers

Google Agent Smith

Internal agentic coding platform so popular that access had to be restricted.[1]

Overview

Google's Agent Smith is built on the company's existing Antigravity agentic coding platform, alongside related tools like Jetski and Cider. The system works asynchronously — engineers give instructions, the agent works in the background, and they check in from their phones. Agent Smith became so popular internally that Google had to restrict access to manage demand.

At a March 2026 town hall, Sergey Brin told employees that agents are a "big focus" this year. One external analysis claims more than 25% of new code shipped to production is generated by Agent Smith, though this figure is unverified. The Pragmatic Engineer survey (February 2026) noted Google's internal tools are causing a drop in external AI coding tool usage at companies with 10,000+ employees.[17]

Strengths

  • Deep infrastructure integration: Built on existing Antigravity platform with years of internal investment
  • Asynchronous mobile-friendly workflow: Engineers check progress from their phones
  • Rich context: Access to internal chat, employee profiles, and documents
  • Executive sponsorship: Sergey Brin publicly identifying agents as major focus
  • Proven demand: Popularity required access throttling

Cautions

  • No official documentation: Everything known comes from reporting, not Google publications
  • Unverified metrics: The >25% production code claim lacks official confirmation
  • Closed ecosystem: No open-source components or transferable architecture details

Key Stats

| Metric | Value |
| --- | --- |
| Production code share | >25% (unverified) |
| Internal demand | Access restricted due to popularity |
| Executive sponsor | Sergey Brin (March 2026) |
| Related tools | Antigravity, Jetski, Cider |

Meta REA

Autonomous ML experimentation: 2x accuracy, 5x engineering output, multi-week workflows.[2]

Overview

Meta's Ranking Engineer Agent (REA) is a specialized autonomous system for ML experimentation on ads ranking models, built on the internal Confucius framework. On its first production rollout, REA delivered 2x model accuracy improvement and 5x engineering output — 3 engineers delivered improvements for 8 models where historically each model required 2 dedicated engineers.

REA uses a hibernate-and-wake mechanism for multi-week autonomous workflows, a dual-source hypothesis engine (historical insights DB + ML research agent), and three-phase planning (Validation → Combination → Exploitation). Meta is also running company-wide "AI Transformation Week" pushing Claude Code and internal tools, with CTO Andrew Bosworth taking over the "AI for Work" initiative.
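The hibernate-and-wake mechanism can be pictured as a small state machine. A hedged Ruby sketch under assumed names (`ExperimentAgent`, `submit`, `wake` are all illustrative — Meta's Confucius framework is not public): the agent submits a long-running experiment, checkpoints, and consumes no compute until results arrive:

```ruby
class ExperimentAgent
  attr_reader :state, :findings

  def initialize
    @state = :idle
    @findings = []
  end

  # Submit a long-running experiment (ML training can take weeks),
  # then checkpoint and hibernate rather than poll.
  def submit(experiment)
    @pending = experiment
    @state = :hibernating
  end

  # Woken externally when results land; records the finding and
  # returns to :idle, ready to plan the next phase.
  def wake(result)
    @findings << { experiment: @pending, result: result }
    @state = :idle
  end
end
```

The same pattern is what makes multi-week autonomy tractable: state lives in the checkpoint, not in a long-lived process.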

Strengths

  • Quantified production results: 2x accuracy, 5x output are concrete metrics
  • Multi-week autonomy: Hibernate-and-wake handles ML's long feedback loops
  • Extraordinary leverage: 3 engineers doing work of 16
  • Domain-specialized: Purpose-built for ML experimentation, not generic coding
  • Officially documented: Published on Meta's engineering blog

Cautions

  • Domain-specific: Purpose-built for ML ranking, not general-purpose coding
  • Meta-scale infrastructure required: Massive compute and data dependencies
  • Not open-source: Confucius framework referenced in arXiv but REA is internal

Key Stats

| Metric | Value |
| --- | --- |
| Model accuracy improvement | 2x |
| Engineering output multiplier | 5x |
| Engineers required | 3 (down from ~16 equivalent) |
| Models improved | 8 |
| Workflow duration | Multi-week (hibernate-and-wake) |

OpenAI Harness

~1M lines of code, 0 manually written, agent-to-agent review — the most extreme case documented.[3]

Overview

OpenAI's Harness Engineering team shipped approximately 1 million lines of code over 5 months (starting August 2025) with zero manually written code. A team that grew from 3 to 7 engineers merged ~1,500 PRs at 3.5 PRs per engineer per day, with each engineer operating at 3-10x capacity. Codex agents run autonomously for 6+ hours per task, and agent-to-agent code review eliminates the human review bottleneck entirely.

The team's key insight on context management: AGENTS.md should be a table of contents, not an encyclopedia — pointing to a structured docs/ directory that agents can navigate. They also wired Chrome DevTools Protocol into the agent runtime for UI verification and exposed LogQL and PromQL directly to agents for observability.

Strengths

  • Most extreme documented scale: ~1M LOC with zero manual code
  • Sustained throughput: 3.5 PRs/engineer/day over months
  • Agent-to-agent review: Eliminates human review bottleneck entirely
  • Extended autonomy: 6+ hour Codex sessions per task
  • Transferable patterns: AGENTS.md-as-TOC, structured docs/, observability access
  • Third-party validation: Martin Fowler published detailed analysis

Cautions

  • OpenAI advantage: Privileged access to Codex capabilities
  • Greenfield project: New product is easier for agents than legacy code
  • Small team: 3-7 engineers may not represent larger org patterns
  • Codex-specific: Workflow designed around Codex's specific capabilities

Key Stats

| Metric | Value |
| --- | --- |
| Lines of code | ~1,000,000 |
| Manually written code | 0 |
| PRs merged | ~1,500 |
| Team size | 3 → 7 engineers |
| PRs per engineer per day | 3.5 |
| Engineer multiplier | 3-10x |
| Agent autonomy per task | 6+ hours |

Other Suspected Implementations

| Company | Evidence | Likely Approach | Confidence |
| --- | --- | --- | --- |
| Amazon | Q Developer (external), internal ML platform | Internal agents likely predate Q Developer | Medium |
| Netflix | Heavy automation culture | Integrated with deployment platform | Low |

Note: Google, Meta, Coinbase, and Uber have been moved to confirmed company implementations (see sections above).


Common Architecture Patterns

Invocation Surface

| Pattern | Adoption | Notes |
| --- | --- | --- |
| Slack | Universal | Primary interface for all documented systems |
| CLI | Common | Secondary for power users |
| Web UI | Common | Visibility and management (Stripe, Ramp) |
| Chrome extension | Ramp, Stripe | Visual editing and in-page invocation |
| Voice | Ramp | Talk to sessions hands-free |
| Mobile | Ramp, Spotify | Resume/monitor sessions from phone |
| Linear tickets | Coinbase | Agent triggered from tickets, plan mode |
| Internal tools | Stripe only | Deep integration (ticketing, docs, feature flags) |

Execution Environment

  • Isolated sandboxes — Pre-warmed for fast spin-up (Stripe: 10 seconds)
  • Same as human dev environment — Reduces agent-specific edge cases
  • Network isolation — No production access, no internet (security)
  • Parallelization — Multiple agents without git worktree conflicts
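The shared invocation → sandbox → CI → PR pipeline these sections describe reduces to a short orchestration function. A hypothetical Ruby sketch with injected collaborators (none of these names come from any vendor's actual API):

```ruby
# One task, end to end: acquire an isolated environment, let the agent
# produce a change, gate it through CI, and emit a PR for human review.
def handle_slack_task(task, sandbox:, agent:, ci:, open_pr:)
  env = sandbox.call(task[:repo])        # isolated, pre-warmed environment
  diff = agent.call(task[:prompt], env)  # agent writes the change end-to-end
  return :ci_failed unless ci.call(diff)
  open_pr.call(diff)                     # PR-ready output for human review
  :pr_opened
end
```

Every system in the matrix varies the pieces — which sandbox, which agent runtime, whether review is human or agent-to-agent — but the control flow is remarkably consistent.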

Framework Codification

These patterns are now being codified into reusable frameworks. LangChain's Open SWE is an open-source framework specifically designed for building internal coding agents, validating that the architecture patterns documented here — sandboxed execution, context management, CI integration — are mature enough to standardize.[18]

Validation Spectrum

| Approach | Human Review | Test Strategy | Example |
| --- | --- | --- | --- |
| Conservative | Required | Traditional CI + lint | Stripe, Ramp |
| Radical | Eliminated | Behavioral validation + DTU | StrongDM |

Strategic Recommendations

For Engineering Leaders

Build when:

  • Codebase exceeds 10M LOC with proprietary frameworks
  • Existing devex team can be redirected (3+ engineers)
  • Compliance requires full control over code flow
  • Organization has 1,000+ engineers (ROI threshold)

Buy when:

  • Standard tech stack (Python, TypeScript, common frameworks)
  • Team under 100 engineers
  • Need agents in weeks, not quarters
  • Enterprise integrations required (Jira, signed commits, BYOK)

Wait when:

  • Unclear ROI or codebase fit
  • No dedicated devex resources
  • Vendor market still maturing

For Devex Teams

  1. Start with invocation surface — Slack integration provides immediate value
  2. Invest in sandboxing — Pre-warmed environments are universal pattern
  3. Consider MCP adoption — Emerging standard for tool integration
  4. Evaluate validation requirements — Decide on human review early

For Vendors

  • Reference architecture exists — Elite companies have defined the pattern
  • Middle market opportunity — Stripe-like capabilities without Stripe-level investment
  • Validation innovation — Digital Twin approach may become differentiator

Market Outlook

Near-Term (2026-2027)

  • More companies will publish in-house implementations
  • Open-source clones will mature (Background Agents, etc.)
  • Vendor solutions will close gap with in-house systems

Medium-Term (2027-2028)

  • Digital Twin testing will become standard practice
  • Human code review will become optional at mature orgs
  • "Token spend per engineer" will emerge as productivity metric

Long-Term (2028+)

  • In-house vs. vendor distinction may blur
  • Validation infrastructure becomes the moat, not the agent
  • "Grown software" philosophy spreads beyond early adopters

Bottom Line

A spectrum of approaches is now documented:

Stripe Minions: 1,000+ merged PRs per week with no human-written code, but human review required. The pattern — Slack invocation, isolated sandboxes, MCP context, CI integration, PR output — is well-documented and replicable.

Bitrise: Built custom Go agent to avoid vendor lock-in with Claude Code. Key insight: programmatic checkpoints embedded in agent workflow beat bolted-on validation. Shows that even mid-size companies can build if the fit is right.

Shopify Roast: Rather than a full agent, open-sourced orchestration primitives. Philosophy: non-determinism is the enemy of reliability — structure keeps agents on track. Useful for companies wanting custom workflows without building agents from scratch.

StrongDM Software Factory: The radical end — no human code, no human review. Code treated as opaque weights validated purely by behavior. Digital Twin Universe enables testing at scale against cloned third-party services.

All require investment, but the threshold varies. Shopify's approach (DSL + existing agents) is lighter than Bitrise's (full custom agent), which is lighter than Stripe's (agents + massive devex infra). Most companies should still buy, not build.

The interesting middle ground: platforms like Tembo that provide Stripe-like orchestration without Stripe-level investment.


Research by Ry Walker Research • methodology

Disclosure: Author is CEO of Tembo, which offers agent orchestration as an alternative to building in-house.