
In-House Coding Agents

How Stripe, Uber, Spotify, Shopify, Coinbase, and other tech companies are building custom in-house coding agents instead of buying off-the-shelf tools.

Key takeaways

  • Stripe Minions ship 1,000+ merged PRs per week, fully unattended: humans review every change but write none of the code
  • Shopify open-sourced 'Roast' — a Ruby DSL for structured AI workflows; their philosophy: non-determinism is the enemy of reliability
  • Coinbase hit 5% of all merged PRs from in-house agents — Slack-native, built by two engineers, started as 'Claudebot'
  • Block open-sourced Goose, which Stripe forked for Minions — proving the value of shared infrastructure
  • Uber saved 21,000 developer hours with LangGraph-powered Validator and Autocover agents — IDE-embedded, hybrid LLM + deterministic
  • Abnormal AI hit 13% of merged PRs — highest reported percentage, serving devs, security analysts, and MLEs
  • StrongDM goes further: no human code, no human review — code treated as opaque weights validated by behavior

FAQ

Why do companies build their own coding agents instead of buying?

Unique codebases, proprietary frameworks, compliance requirements, and the need for tight integration with internal tools. Vendor lock-in concerns also drive build decisions.

What percentage of PRs come from coding agents at top companies?

Abnormal AI reports 13%, Coinbase 5%, and Stripe ships 1,000+ merged PRs per week. These are the highest publicly reported figures as of early 2026.

What's the ROI threshold for building in-house coding agents?

Generally requires 1,000+ engineers, dedicated devex teams, and codebases exceeding 10M LOC. Smaller teams should buy off-the-shelf tools like Tembo, Codex, or Claude Code.

What architecture do in-house coding agents share?

Slack invocation, isolated sandbox execution, CI/CD integration, and PR-ready output. Human review policies vary from required (Stripe) to eliminated (StrongDM).

How do companies validate code written by AI agents?

Two approaches: traditional CI + human review (conservative) or behavioral validation against 'digital twin' test environments without human review (radical).

What is the emerging metric for AI-native engineering teams?

Percentage of PRs merged from background agents. Leaders publicly report 5-13%, suggesting this will become a standard engineering productivity metric.

Executive Summary

A new pattern is emerging at elite engineering organizations: instead of adopting off-the-shelf coding agents (Codex, Claude Code, Cursor), companies like Stripe, Spotify, Shopify, Coinbase, Uber, and Block are building custom in-house systems tailored to their unique codebases and workflows. Google, Meta, and Amazon are suspected to have similar internal systems, though details remain undisclosed.

Key Findings:

  • Stripe Minions produce 1,000+ merged PRs per week with zero human-written code
  • Spotify reports its best developers haven't written code since December and have shipped 50+ features via the internal "Honk" system, powered by Claude Code [1]
  • Shopify open-sourced "Roast" — a Ruby DSL for structured AI workflows with the insight: "non-determinism is the enemy of reliability"
  • Coinbase hit 5% of all merged PRs from in-house agents — Slack-native framework built by two engineers [2]
  • Block open-sourced Goose, which Stripe forked for Minions — proving shared agent infrastructure has value
  • Abnormal AI hit 13% of merged PRs from background agents — highest reported percentage [3]
  • StrongDM has eliminated human code review entirely, treating code as opaque weights validated by behavior
  • Common architecture: Slack invocation → isolated sandbox → CI loop → PR-ready output

Strategic Planning Assumptions:

  • By 2027, 30% of enterprises with 1000+ engineers will operate internal coding agent systems
  • By 2028, "Digital Twin" testing infrastructure (cloned third-party APIs) will be standard for agent validation
  • Human code review will become optional at organizations with mature validation infrastructure

Market Definition

In-house coding agents are custom-built systems that enable AI agents to write, test, and ship code within a company's specific codebase and toolchain — as opposed to off-the-shelf products like Codex, Claude Code, or Cursor.

Inclusion Criteria:

  • Publicly documented implementation details
  • Production deployment (not experimental)
  • Custom-built for company-specific constraints
  • Agent writes code end-to-end (not just autocomplete)

Exclusion Criteria:

  • Commercial products available for purchase
  • Internal tools without public documentation
  • Autocomplete/copilot-style assistance only

Comparison Matrix

| Vendor | Human Review | Agent Foundation | Invocation | Validation Approach | Documented Scale |
|---|---|---|---|---|---|
| Stripe Minions | Required | Goose fork | Slack, CLI, Web | CI + MCP tools | 1,000+ PRs/week |
| StrongDM Factory | Eliminated | Cursor YOLO | Spec-driven | Digital Twin Universe | "$1K/day/engineer" |
| Ramp Inspect | Required | Unknown | Slack | Sandbox + CI | Not disclosed |
| Bitrise | Required | Custom (Go) | Internal eval | Programmatic checkpoints | Not disclosed |
| Shopify Roast | Required | Workflow DSL | CLI | Cog-based validation | Not disclosed |
| Coinbase | Required | "Claudebot" | Slack | Not disclosed | 5% of merged PRs |
| Abnormal AI | Not disclosed | Custom | Not disclosed | Not disclosed | 13% of merged PRs |
| Uber | Not disclosed | LangGraph | IDE, Workflow | Hybrid (LLM + deterministic) | 21,000 hours saved |
| Spotify Honk | Not disclosed | Claude Code | Slack | Not disclosed | 50+ features shipped |

Company Implementations

Stripe Minions

The most detailed public case study of in-house coding agents.[4]

Overview

Stripe's "Minions" are fully unattended coding agents that produce over 1,000 merged pull requests per week. While humans review the code, they write none of it — Minions handle everything from start to finish. Built on a fork of Block's Goose agent[5], Minions integrate deeply with Stripe's existing developer infrastructure.

Strengths

  • Proven scale: 1,000+ PRs merged weekly, validated in production
  • Deep integration: MCP server ("Toolshed") with 400+ internal tools
  • Fast iteration: Isolated devboxes spin up in 10 seconds
  • Leverage existing infra: Uses same environments as human engineers
  • Parallel execution: Engineers run multiple Minions simultaneously

Cautions

  • Requires massive codebase investment: Stripe has hundreds of millions of LOC and years of devex tooling
  • Human review bottleneck: Agents don't merge; humans must still review
  • Ruby/Sorbet specific: Some patterns may not transfer to other stacks
  • Dedicated team required: "Leverage team" maintains the system

Architecture Details

Invocation: Slack (primary), CLI, web, internal tool integrations

Execution: Pre-warmed "devboxes" isolated from production and internet

Context: Central MCP server with docs, tickets, build status, Sourcegraph search

CI Loop: Local lint (under 5 seconds) → at most 2 CI rounds → auto-apply fixes
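
To make the loop concrete, here is a minimal Python sketch of the lint-then-CI cycle described above. All names (devbox, agent, run_lint, run_ci, apply_fixes) are hypothetical stand-ins, not Stripe's actual tooling.

MAX_CI_ROUNDS = 2  # Stripe caps Minions at two CI rounds per change

def finalize_change(agent, devbox) -> bool:
    """Return True when the change is ready to open as a PR."""
    # Cheap local check first (Stripe reports this runs in under 5 seconds).
    lint = devbox.run_lint()
    if not lint.ok:
        agent.apply_fixes(lint.errors)

    # Bounded CI loop: apply fixes after each failing round, then stop.
    for _ in range(MAX_CI_ROUNDS):
        ci = devbox.run_ci()
        if ci.ok:
            return True
        agent.apply_fixes(ci.failures)

    # Still red after the budget is spent: hand off to a human instead.
    return False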

Key Stats

| Metric | Value |
|---|---|
| PRs merged/week | 1,000+ |
| Devbox spin-up | 10 seconds |
| MCP tools | 400+ |
| CI rounds (max) | 2 |

StrongDM Software Factory

The most radical approach: no human code, no human review.[6]

Overview

StrongDM's "Software Factory" takes in-house coding agents to their logical extreme. While Stripe still requires human code review, StrongDM has eliminated it entirely:

  • Code must not be written by humans
  • Code must not be reviewed by humans

The Software Factory effort began in July 2025, led by Justin McCarthy (StrongDM co-founder/CTO), Jay Taylor, and Navan Chauhan, after the team observed that Claude 3.5's October 2024 revision enabled "compounding correctness" in long-horizon agentic workflows.

Strengths

  • No review bottleneck: Code ships without human inspection
  • Infinite testing scale: Digital Twin Universe enables volume testing
  • ML-inspired validation: Scenarios act as holdout sets, preventing reward hacking
  • Third-party API coverage: Behavioral clones of Okta, Jira, Slack, Google Docs
  • Clear success metric: "$1,000/day in tokens per engineer"

Cautions

  • Requires validation investment: DTU took significant engineering effort
  • Domain-specific fit: Works well for integration-heavy software; unclear for other domains
  • Opaque code: Teams must accept not reading/understanding generated code
  • Novel approach: Less battle-tested than human-reviewed workflows
  • Small team documented: 3-person AI team; scalability unclear

Architecture Details

Philosophy: Code treated as opaque weights — correctness inferred from behavior, not inspection

Validation Loop:

  1. Seed — Initial spec (PRD, sentences, screenshot, existing code)
  2. Validation — End-to-end harness against DTU and scenarios
  3. Feedback — Output samples fed back for self-correction

Digital Twin Universe (DTU):

  • Behavioral clones of third-party services
  • Test at volumes exceeding production limits
  • No rate limits, API costs, or abuse detection

Scenarios vs Tests:

  • Stored outside codebase (like ML holdout sets)
  • LLM-validated, not boolean pass/fail
  • "Satisfaction" metric: fraction of trajectories that satisfy user

Key Stats

| Metric | Value |
|---|---|
| Human code | 0% |
| Human review | 0% |
| Target token spend | $1,000/day/engineer |
| DTU services | 6 (Okta, Jira, Slack, Docs, Drive, Sheets) |

Bitrise

Vendor lock-in concerns drove custom agent development.[7]

Overview

Bitrise, the mobile CI/CD platform, built their own AI coding agent after extensively benchmarking Claude Code, Codex, Gemini, and OpenCode. While Claude Code performed best, its closed-source nature and Anthropic-only API lock-in posed unacceptable long-term risks. Their solution: a custom Go-based agent using Anthropic APIs but with full ownership.

Strengths

  • No vendor lock-in: Provider-agnostic architecture allows model switching mid-conversation
  • Programmatic checkpoints: Verification embedded directly in agent workflow (not bolted on)
  • Custom eval framework: Go-based benchmark system runs tests in parallel across agents
  • Multi-agent coordination: Sub-agents dynamically constructed and orchestrated in Go
  • Central logging: LLM messages stored in provider-agnostic format

Cautions

  • Maintenance overhead: Custom agent requires ongoing development investment
  • Anthropic-dependent: Still uses Anthropic APIs despite architectural flexibility
  • Scale undisclosed: No public metrics on throughput or adoption
  • Mobile CI focus: Agent optimized for their specific domain (build failures, PR reviews)

Architecture Details

Language: Go (matching Bitrise's core stack)

Eval Framework:

  • Declarative test case definition
  • Docker containers for isolated execution
  • Parallel agent execution (~10 min runs)
  • LLM Judges for subjective evaluation
  • Results to SQL database + Metabase dashboard
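
The shape of such a declarative eval case might look like the Python sketch below. Bitrise's actual framework is written in Go; EvalCase, run_in_docker, and llm_judge are illustrative names, not Bitrise APIs.

from dataclasses import dataclass

@dataclass
class EvalCase:
    name: str
    repo: str          # fixture repository mounted into the container
    prompt: str        # task given to the agent under test
    judge_rubric: str  # what the LLM judge should check in the output

def run_case(case: EvalCase, agent_image: str, run_in_docker, llm_judge) -> dict:
    # Isolated Docker run with a ~10 minute budget, matching the parallel runs above.
    result = run_in_docker(agent_image, case.repo, case.prompt, timeout_minutes=10)
    # An LLM judge scores the subjective parts of the result.
    verdict = llm_judge(case.judge_rubric, result.diff, result.logs)
    # Rows like this land in a SQL database behind a Metabase dashboard.
    return {"case": case.name, "passed": verdict.passed, "notes": verdict.notes}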

Agent Design:

  • Multiple sub-agents with injected tools/dependencies
  • Coordinated flow with programmatic result collection
  • Custom system prompts per use case
  • Provider-agnostic message storage

Benchmarking Findings

| Agent | Verdict |
|---|---|
| Claude Code | Best performance, but closed-source + Anthropic-only |
| Codex | Fast but lost chain-of-thought, mid-transition issues |
| Gemini | 10-min response times without reserved resources |
| OpenCode | 2x slower than Claude Code, TUI-coupled |

Shopify Roast

Structured workflows for AI coding agents — open-sourced as Ruby gem.[8]

Overview

Shopify's "Roast" is a Ruby-based DSL for creating structured AI workflows. Rather than building a coding agent from scratch, Roast provides orchestration primitives ("cogs") that chain AI steps together, including a dedicated agent cog that runs local coding agents like Claude Code with filesystem access. Open-sourced in 2025 and now at v1.0 preview.

Strengths

  • Open source: Ruby gem (roast-ai) freely available
  • Multi-agent orchestration: Chain LLM calls, coding agents, shell commands, custom Ruby
  • Serial and parallel execution: Map operations with configurable parallelism
  • Composable: Modular scopes with parameters enable reuse
  • Claude Code integration: agent cog runs Claude Code CLI with full filesystem access

Cautions

  • Ruby-specific: Ecosystem limited to Ruby shops
  • Requires Claude Code: agent cog depends on external CLI installation
  • Workflow complexity: DSL learning curve for non-trivial pipelines
  • v1.0 transition: Breaking changes from v0.x YAML syntax

Architecture Details

Core Cogs:

  • chat — Cloud LLM calls (OpenAI, Anthropic, Gemini)
  • agent — Local coding agents with filesystem access
  • ruby — Custom Ruby code execution
  • cmd — Shell commands with output capture
  • map — Collection processing (serial/parallel)
  • repeat — Iteration until conditions met
  • call — Reusable workflow invocation

Example Workflow:

execute do
  # Shell cog: list the files changed in the last five commits
  cmd(:recent_changes) { "git diff --name-only HEAD~5..HEAD" }
  # Agent cog: hand that file list to a local coding agent (e.g. Claude Code)
  agent(:review) do
    files = cmd!(:recent_changes).lines
    "Review these files: #{files.join("\n")}"
  end
  # Chat cog: a cloud LLM call that condenses the agent's review
  chat(:summary) { "Summarize: #{agent!(:review).response}" }
end

Philosophy: Shopify's insight is that non-determinism is the enemy of reliability for AI agents. Roast provides structure to keep agents on track rather than letting them run unconstrained.


Ramp Inspect

Background coding agent for async task completion.[9]

Overview

Ramp built an internal tool called "Inspect" that runs coding agents in the background. Less public detail is available than for Stripe or StrongDM, but the approach has been validated by an open-source reimplementation.[10]

Strengths

  • Background execution: Agents work while developers focus elsewhere
  • Browser in sandbox: Agents can run browser automation within isolated environments
  • Multi-repo support: Can work across multiple repositories in a single session
  • Proven pattern: Inspired open-source clone with 500+ GitHub stars
  • Familiar UX: Slack-based invocation meets developers where they are
  • PR-ready output: Delivers completed pull requests for review

Cautions

  • Limited public detail: Architecture largely inferred from open-source clone
  • Human review required: Still has review bottleneck
  • Single-tenant design: Trust boundaries needed for multi-tenant deployment

Open-Source Validation

Cole Murray's "Background Agents" project implements Ramp's architecture:

  • Control plane: Cloudflare Workers + Durable Objects
  • Data plane: Modal cloud sandboxes
  • Agent runtime: OpenCode
  • Features: Multiplayer sessions, commit attribution

Coinbase

5% of merged PRs now from in-house background agents.

Overview

Coinbase's Engineering VP Chintan Turakhia announced in February 2026 that the company had reached a significant milestone: 5% of all merged PRs now come from in-house background agents. Built the previous year by two engineers, the system uses a Slack-native framework with the same tools and context as human engineers.

Strengths

  • Proven adoption: 5% of all merged PRs — significant production impact
  • Familiar UX: Slack-native, tag agent in any thread to invoke
  • Lightweight build: Two engineers built the initial version
  • Full workflow: Agent plans, debugs, and ships PR end-to-end

Cautions

  • Limited public detail: Architecture and validation approach not disclosed
  • Scale context unclear: the significance of 5% depends on Coinbase's total PR volume
  • Evolution unknown: Original "Claudebot" name suggests Anthropic dependency

Architecture Details

Invocation: Slack (tag agent in any thread)

Context: Same tools and context as human engineers

Output: PR-ready (plans, debugs, ships)

Origins: Started as "Claudebot" — two-engineer project that scaled to production
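
The Slack-native invocation pattern (tag the agent in any thread) can be sketched with Slack's Bolt for Python SDK. The start_agent_run helper below is a hypothetical hook for kicking off a background agent; this shows the general shape, not Coinbase's implementation.

import os
from slack_bolt import App

app = App(token=os.environ["SLACK_BOT_TOKEN"], signing_secret=os.environ["SLACK_SIGNING_SECRET"])

def start_agent_run(task: str, slack_thread: str):
    """Hypothetical hook: enqueue the task for a sandboxed background agent."""
    ...

@app.event("app_mention")
def handle_mention(event, say):
    # The thread the agent was tagged in becomes the task context.
    thread_ts = event.get("thread_ts", event["ts"])
    start_agent_run(task=event["text"], slack_thread=thread_ts)
    say(text="On it. I'll reply in this thread with a PR link.", thread_ts=thread_ts)

if __name__ == "__main__":
    app.start(port=3000)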


Abnormal AI

13% of PRs now from background agents — highest publicly reported percentage. [3]

Overview

Abnormal AI (cybersecurity company) reported in February 2026 that 13% of their PRs now come from in-house background agents — the highest percentage publicly disclosed. The system serves full-stack engineers, security analysts, and MLEs, with the team actively migrating from a GitHub Actions-powered backend to Modal for execution.

Strengths

  • Highest reported adoption: 13% of PRs (vs. 5% at Coinbase; Stripe reports volume rather than a percentage)
  • Broad use cases: Full-stack features, security/infra patches, agent tooling
  • Self-improving: The agent/dev tools build themselves
  • Multi-persona: Serves engineers, security analysts, and MLEs

Cautions

  • Limited technical detail: Architecture not fully disclosed (blog post WIP)
  • Infrastructure in flux: Currently migrating from GHA to Modal

Uber

Saved 21,000 developer hours with LangGraph-powered AI agents. [11]

Overview

Uber's Developer Platform Team presented at LangChain's Interrupt event detailing how they've deployed agentic tools across an engineering organization supporting 5,000 developers and a codebase with hundreds of millions of lines. Using LangGraph for orchestration, they've built reusable, domain-specific agents for testing, validation, and workflow assistance.

Key Tools

Validator: IDE-embedded agent that flags security vulnerabilities and best-practice violations in real time. Proposes fixes that can be accepted with one click or routed to an agentic assistant for context-aware resolution. Uses a hybrid architecture: LLM for complex issues, deterministic tools (static linters) for common patterns.
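
The hybrid routing idea (deterministic linters first, an LLM pass for what they cannot catch) can be expressed as a small LangGraph graph. The node functions below are placeholders for illustration, not Uber's Validator.

from typing import TypedDict
from langgraph.graph import StateGraph, END

class ReviewState(TypedDict):
    diff: str
    findings: list[str]

def run_linters(state: ReviewState) -> dict:
    # Deterministic pass: static linters and security rules (cheap and repeatable).
    hits = [f"lint: TODO left in diff at line {i}"
            for i, line in enumerate(state["diff"].splitlines(), 1) if "TODO" in line]
    return {"findings": state["findings"] + hits}

def llm_review(state: ReviewState) -> dict:
    # LLM pass for context-dependent issues; a real node would call a model here.
    return {"findings": state["findings"] + ["llm: (model-generated findings would be appended here)"]}

builder = StateGraph(ReviewState)
builder.add_node("linters", run_linters)
builder.add_node("llm_review", llm_review)
builder.set_entry_point("linters")
builder.add_edge("linters", "llm_review")
builder.add_edge("llm_review", END)
validator = builder.compile()
# validator.invoke({"diff": "...", "findings": []}) returns the merged findings.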

Autocover: Generative test-authoring tool that scaffolds, generates, executes, and mutates test cases. It can execute up to 100 tests concurrently, delivering 2-3x the throughput of other AI coding tools, and increased test coverage by 10%, saving an estimated 21,000 developer hours.

Picasso: Workflow platform with conversational AI agents integrated with organizational knowledge.

Strengths

  • Massive scale: 5,000 developers, hundreds of millions of LOC
  • Proven ROI: 21,000 developer hours saved, 10% test coverage increase
  • Hybrid architecture: LLM + deterministic tools for best of both worlds
  • Reusable primitives: Security team can contribute rules without deep LangGraph knowledge
  • Domain expertise: Specialized agents outperform generic AI coding tools

Cautions

  • Infrastructure investment: Requires dedicated platform team to maintain
  • LangGraph dependency: Tightly coupled to LangChain ecosystem
  • Enterprise context: Patterns may not transfer to smaller teams

Key Lessons (from Uber's team)

  1. Encapsulation enables reuse — Clear interfaces let teams extend without central coordination
  2. Domain expert agents outperform generic tools — Specialized context beats general-purpose AI
  3. Determinism still matters — Linters and build tools work better deterministically, orchestrated by agents
  4. Solve narrow problems first — Tightly scoped solutions get reused in broader workflows

Spotify Honk

Best developers haven't written a line of code since December — AI-first development via internal "Honk" system. [1]

Overview

During Spotify's Q4 2025 earnings call (February 2026), co-CEO Gustav Söderström revealed that the company's best developers "have not written a single line of code since December" thanks to an internal AI coding system called Honk. The system, powered by Claude Code, enables remote real-time code deployment via Slack — engineers can fix bugs and ship features from their phones during their morning commute.

Key Results

  • 50+ features shipped throughout 2025 using AI-assisted development
  • Record earnings with stock jumping 14.7% on the announcement
  • Recent AI-powered launches: Prompted Playlists, Page Match (audiobooks), About This Song

How Honk Works

"An engineer at Spotify on their morning commute from Slack on their cell phone can tell Claude to fix a bug or add a new feature to the iOS app. And once Claude finishes that work, the engineer then gets a new version of the app, pushed to them on Slack on their phone, so that he can then merge it to production, all before they even arrive at the office."

— Gustav Söderström, Spotify co-CEO

Strengths

  • Mobile-first workflow: Ship from Slack on your phone, review and merge before arriving at office
  • Claude Code integration: Built on proven agentic coding infrastructure
  • Quantified impact: 50+ features, direct contribution to record earnings
  • Leadership buy-in: Announced publicly by co-CEO during earnings call

Cautions

  • Limited technical details: Architecture and tooling not publicly documented (unlike Stripe, Shopify)
  • "Best developers" qualifier: May not represent all engineering workflows
  • Consumer vs. enterprise: Spotify's codebase is primarily consumer-facing, different constraints than B2B

Key Lessons

  1. Slack is the universal agent interface — Matches pattern at Stripe, Coinbase, Ramp
  2. Executives publicly quantifying AI impact — "50+ features" and record earnings tied to AI adoption
  3. Mobile-native development workflows — Engineers reviewing/merging code from their phones
  4. "Best developers" redefining what developers do — Senior engineers as AI orchestrators, not code writers

Other Suspected Implementations

| Company | Evidence | Likely Approach | Confidence |
|---|---|---|---|
| Google | Massive internal tooling, AI research leadership | Integrated with Piper monorepo, internal LLMs | High |
| Meta | Code Llama, internal AI infra | Code Llama fine-tuned on internal patterns | High |
| Amazon | Q Developer (external), internal ML platform | Internal agents likely predate Q Developer | Medium |
| Netflix | Heavy automation culture | Integrated with deployment platform | Low |

Note: Coinbase and Uber appear among the confirmed company implementations above rather than in this table.


Common Architecture Patterns

Invocation Surface

| Pattern | Adoption | Notes |
|---|---|---|
| Slack | Universal | Primary interface for all documented systems |
| CLI | Common | Secondary for power users |
| Web UI | Common | Visibility and management |
| Internal tools | Stripe only | Deep integration (ticketing, docs, feature flags) |

Execution Environment

  • Isolated sandboxes — Pre-warmed for fast spin-up (Stripe: 10 seconds)
  • Same as human dev environment — Reduces agent-specific edge cases
  • Network isolation — No production access, no internet (security)
  • Parallelization — Multiple agents without git worktree conflicts

Validation Spectrum

| Approach | Human Review | Test Strategy | Example |
|---|---|---|---|
| Conservative | Required | Traditional CI + lint | Stripe, Ramp |
| Radical | Eliminated | Behavioral validation + DTU | StrongDM |
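
Taken together, the invocation, execution, and validation layers compose into a pipeline roughly like the Python sketch below. Every name here (sandbox_pool, agent_factory, open_pull_request) is illustrative rather than any particular company's internals.

def handle_request(task: str, sandbox_pool, agent_factory, open_pull_request, max_ci_rounds: int = 2):
    sandbox = sandbox_pool.acquire()      # pre-warmed, no production or internet access
    agent = agent_factory(sandbox)
    try:
        agent.plan_and_edit(task)         # the agent writes the change inside the sandbox
        for _ in range(max_ci_rounds):    # bounded CI loop, mirroring Stripe's two-round cap
            result = sandbox.run_ci()
            if result.ok:
                return open_pull_request(diff=sandbox.diff(), description=task)
            agent.apply_fixes(result.failures)
        return None                       # escalate to a human if CI never goes green
    finally:
        sandbox_pool.release(sandbox)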

Strategic Recommendations

For Engineering Leaders

Build when:

  • Codebase exceeds 10M LOC with proprietary frameworks
  • Existing devex team can be redirected (3+ engineers)
  • Compliance requires full control over code flow
  • Organization has 1,000+ engineers (ROI threshold)

Buy when:

  • Standard tech stack (Python, TypeScript, common frameworks)
  • Team under 100 engineers
  • Need agents in weeks, not quarters
  • Enterprise integrations required (Jira, signed commits, BYOK)

Wait when:

  • Unclear ROI or codebase fit
  • No dedicated devex resources
  • Vendor market still maturing

For Devex Teams

  1. Start with invocation surface — Slack integration provides immediate value
  2. Invest in sandboxing — Pre-warmed environments are universal pattern
  3. Consider MCP adoption — Emerging standard for tool integration (see the sketch after this list)
  4. Evaluate validation requirements — Decide on human review early
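
For the MCP suggestion in item 3, a toy internal-tool server using the official MCP Python SDK looks like this. The feature-flag lookup is invented for illustration; a real deployment would expose internal docs, tickets, and build status the way Stripe's "Toolshed" does.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-tools")

# Pretend feature-flag store; a real server would query the actual flag service.
_FLAGS = {"new-checkout": True, "dark-mode": False}

@mcp.tool()
def get_feature_flag(name: str) -> str:
    """Return the current state of an internal feature flag."""
    if name not in _FLAGS:
        return f"unknown flag: {name}"
    return f"{name} is {'enabled' if _FLAGS[name] else 'disabled'}"

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; coding agents connect over MCP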

For Vendors

  • Reference architecture exists — Elite companies have defined the pattern
  • Middle market opportunity — Stripe-like capabilities without Stripe-level investment
  • Validation innovation — Digital Twin approach may become differentiator

Market Outlook

Near-Term (2026-2027)

  • More companies will publish in-house implementations
  • Open-source clones will mature (Background Agents, etc.)
  • Vendor solutions will close gap with in-house systems

Medium-Term (2027-2028)

  • Digital Twin testing will become standard practice
  • Human code review will become optional at mature orgs
  • "Token spend per engineer" will emerge as productivity metric

Long-Term (2028+)

  • In-house vs. vendor distinction may blur
  • Validation infrastructure becomes the moat, not the agent
  • "Grown software" philosophy spreads beyond early adopters

Bottom Line

A spectrum of approaches is now documented:

Stripe Minions: 1,000+ merged PRs per week with no human-written code, but human review required. The pattern — Slack invocation, isolated sandboxes, MCP context, CI integration, PR output — is well-documented and replicable.

Bitrise: Built custom Go agent to avoid vendor lock-in with Claude Code. Key insight: programmatic checkpoints embedded in agent workflow beat bolted-on validation. Shows that even mid-size companies can build if the fit is right.

Shopify Roast: Rather than a full agent, open-sourced orchestration primitives. Philosophy: non-determinism is the enemy of reliability — structure keeps agents on track. Useful for companies wanting custom workflows without building agents from scratch.

StrongDM Software Factory: The radical end — no human code, no human review. Code treated as opaque weights validated purely by behavior. Digital Twin Universe enables testing at scale against cloned third-party services.

All require investment, but the threshold varies. Shopify's approach (DSL + existing agents) is lighter than Bitrise's (full custom agent), which is lighter than Stripe's (agents + massive devex infra). Most companies should still buy, not build.

The interesting middle ground: platforms like Tembo that provide Stripe-like orchestration without Stripe-level investment.


Research by Ry Walker Research

Disclosure: Author is CEO of Tembo, which offers agent orchestration as an alternative to building in-house.