Key takeaways
- Stripe Minions ship 1,000+ merged PRs per week: agents write all of the code unattended, while humans still review before merge
- Shopify open-sourced 'Roast' — a Ruby DSL for structured AI workflows; their philosophy: non-determinism is the enemy of reliability
- Coinbase hit 5% of all merged PRs from in-house agents — Slack-native, built by two engineers, started as 'Claudebot'
- Block open-sourced Goose, which Stripe forked for Minions — proving the value of shared infrastructure
- Uber saved 21,000 developer hours with LangGraph-powered Validator and Autocover agents — IDE-embedded, hybrid LLM + deterministic
- Abnormal AI hit 13% of merged PRs — highest reported percentage, serving devs, security analysts, and MLEs
- StrongDM goes further: no human code, no human review — code treated as opaque weights validated by behavior
FAQ
Why do companies build their own coding agents instead of buying?
Unique codebases, proprietary frameworks, compliance requirements, and the need for tight integration with internal tools. Vendor lock-in concerns also drive build decisions.
What percentage of PRs come from coding agents at top companies?
Abnormal AI reports 13%, Coinbase 5%, and Stripe ships 1,000+ merged PRs per week. These are the highest publicly reported figures as of early 2026.
What's the ROI threshold for building in-house coding agents?
Generally requires 1,000+ engineers, dedicated devex teams, and codebases exceeding 10M LOC. Smaller teams should buy off-the-shelf tools like Tembo, Codex, or Claude Code.
What architecture do in-house coding agents share?
Slack invocation, isolated sandbox execution, CI/CD integration, and PR-ready output. Human review policies vary from required (Stripe) to eliminated (StrongDM).
How do companies validate code written by AI agents?
Two approaches: traditional CI + human review (conservative) or behavioral validation against 'digital twin' test environments without human review (radical).
What is the emerging metric for AI-native engineering teams?
Percentage of PRs merged from background agents. Leaders publicly report 5-13%, suggesting this will become a standard engineering productivity metric.
Executive Summary
A new pattern is emerging at elite engineering organizations: instead of adopting off-the-shelf coding agents (Codex, Claude Code, Cursor), companies like Stripe, Spotify, Shopify, Coinbase, Uber, and Block are building custom in-house systems tailored to their unique codebases and workflows. Google, Meta, and Amazon are suspected to have similar internal systems, though details remain undisclosed.
Key Findings:
- Stripe Minions produce 1,000+ merged PRs per week with zero human-written code
- Spotify reports its best developers haven't written code since December and has shipped 50+ features via the internal "Honk" system, powered by Claude Code [1]
- Shopify open-sourced "Roast" — a Ruby DSL for structured AI workflows with the insight: "non-determinism is the enemy of reliability"
- Coinbase hit 5% of all merged PRs from in-house agents — Slack-native framework built by two engineers [2]
- Block open-sourced Goose, which Stripe forked for Minions — proving shared agent infrastructure has value
- Abnormal AI hit 13% of merged PRs from background agents — highest reported percentage [3]
- StrongDM has eliminated human code review entirely, treating code as opaque weights validated by behavior
- Common architecture: Slack invocation → isolated sandbox → CI loop → PR-ready output
Strategic Planning Assumptions:
- By 2027, 30% of enterprises with 1,000+ engineers will operate internal coding agent systems
- By 2028, "Digital Twin" testing infrastructure (cloned third-party APIs) will be standard for agent validation
- Human code review will become optional at organizations with mature validation infrastructure
Market Definition
In-house coding agents are custom-built systems that enable AI agents to write, test, and ship code within a company's specific codebase and toolchain — as opposed to off-the-shelf products like Codex, Claude Code, or Cursor.
Inclusion Criteria:
- Publicly documented implementation details
- Production deployment (not experimental)
- Custom-built for company-specific constraints
- Agent writes code end-to-end (not just autocomplete)
Exclusion Criteria:
- Commercial products available for purchase
- Internal tools without public documentation
- Autocomplete/copilot-style assistance only
Comparison Matrix
| Vendor | Human Review | Agent Foundation | Invocation | Validation Approach | Documented Scale |
|---|---|---|---|---|---|
| Stripe Minions | Required | Goose fork | Slack, CLI, Web | CI + MCP tools | 1,000+ PRs/week |
| StrongDM Factory | Eliminated | Cursor YOLO | Spec-driven | Digital Twin Universe | "$1K/day/engineer" |
| Ramp Inspect | Required | Unknown | Slack | Sandbox + CI | Not disclosed |
| Bitrise | Required | Custom (Go) | Internal eval | Programmatic checkpoints | Not disclosed |
| Shopify Roast | Required | Workflow DSL | CLI | Cog-based validation | Not disclosed |
| Coinbase | Required | "Claudebot" | Slack | Not disclosed | 5% of merged PRs |
| Abnormal AI | Not disclosed | Custom | Not disclosed | Not disclosed | 13% of merged PRs |
| Uber | Not disclosed | LangGraph | IDE, Workflow | Hybrid (LLM + deterministic) | 21,000 hours saved |
| Spotify Honk | Not disclosed | Claude Code | Slack | Not disclosed | 50+ features shipped |
Company Implementations
Stripe Minions
The most detailed public case study of in-house coding agents.[4]
Overview
Stripe's "Minions" are fully unattended coding agents that produce over 1,000 merged pull requests per week. While humans review the code, they write none of it — Minions handle everything from start to finish. Built on a fork of Block's Goose agent[5], Minions integrate deeply with Stripe's existing developer infrastructure.
Strengths
- Proven scale: 1,000+ PRs merged weekly, validated in production
- Deep integration: MCP server ("Toolshed") with 400+ internal tools
- Fast iteration: Isolated devboxes spin up in 10 seconds
- Leverage existing infra: Uses same environments as human engineers
- Parallel execution: Engineers run multiple Minions simultaneously
Cautions
- Requires massive codebase investment: Stripe has hundreds of millions of LOC and years of devex tooling
- Human review bottleneck: Agents don't merge; humans must still review
- Ruby/Sorbet specific: Some patterns may not transfer to other stacks
- Dedicated team required: "Leverage team" maintains the system
Architecture Details
Invocation: Slack (primary), CLI, web, internal tool integrations
Execution: Pre-warmed "devboxes" isolated from production and internet
Context: Central MCP server with docs, tickets, build status, Sourcegraph search
CI Loop: Local lint (under 5 seconds) → at most 2 CI rounds → auto-apply fixes
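A minimal sketch of that loop, assuming hypothetical `run_lint`, `run_ci`, `agent_fix!`, `open_pull_request`, and `escalate_to_human` helpers (Stripe's actual tooling is not public); only the control flow mirrors the description above.

```ruby
# Capped CI loop for an agent-authored change: fast local lint gate,
# then at most two CI rounds with agent-applied fixes in between.
MAX_CI_ROUNDS = 2

def land_change(workspace)
  raise "lint failed" unless run_lint(workspace)   # fast local gate (under 5 seconds at Stripe)

  MAX_CI_ROUNDS.times do
    result = run_ci(workspace)                     # full CI run inside the isolated devbox
    return open_pull_request(workspace) if result.passed?

    agent_fix!(workspace, result.failures)         # agent applies fixes based on CI feedback
  end

  escalate_to_human(workspace)                     # give up after the round cap
end
```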
Key Stats
| Metric | Value |
|---|---|
| PRs merged/week | 1,000+ |
| Devbox spin-up | 10 seconds |
| MCP tools | 400+ |
| CI rounds (max) | 2 |
StrongDM Software Factory
The most radical approach: no human code, no human review.[6]
Overview
StrongDM's "Software Factory" takes in-house coding agents to their logical extreme. While Stripe still requires human code review, StrongDM has eliminated it entirely:
- Code must not be written by humans
- Code must not be reviewed by humans
The Software Factory team was formed in July 2025 by Justin McCarthy (StrongDM co-founder/CTO), Jay Taylor, and Navan Chauhan, after they observed that Claude 3.5's October 2024 revision enabled "compounding correctness" in long-horizon agentic workflows.
Strengths
- No review bottleneck: Code ships without human inspection
- Infinite testing scale: Digital Twin Universe enables volume testing
- ML-inspired validation: Scenarios act as holdout sets, preventing reward hacking
- Third-party API coverage: Behavioral clones of Okta, Jira, Slack, Google Docs
- Clear success metric: "$1,000/day in tokens per engineer"
Cautions
- Requires validation investment: DTU took significant engineering effort
- Domain-specific fit: Works well for integration-heavy software; unclear for other domains
- Opaque code: Teams must accept not reading/understanding generated code
- Novel approach: Less battle-tested than human-reviewed workflows
- Small team documented: 3-person AI team; scalability unclear
Architecture Details
Philosophy: Code treated as opaque weights — correctness inferred from behavior, not inspection
Validation Loop:
- Seed — Initial spec (PRD, sentences, screenshot, existing code)
- Validation — End-to-end harness against DTU and scenarios
- Feedback — Output samples fed back for self-correction
Digital Twin Universe (DTU):
- Behavioral clones of third-party services
- Test at volumes exceeding production limits
- No rate limits, API costs, or abuse detection
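To make the DTU idea concrete, here is a minimal sketch of a behavioral clone of a third-party issue tracker. The `FakeIssueTracker` class and its methods are illustrative only; StrongDM's actual clones of Okta, Jira, Slack, and Google services are not public.

```ruby
# In-process behavioral clone of a third-party API, used only for agent validation.
# It mirrors the real service's observable behavior but has no rate limits,
# no API costs, and no abuse detection, so tests can run at any volume.
class FakeIssueTracker
  def initialize
    @issues = {}
    @next_id = 1
  end

  def create_issue(title:, description: "")
    id = @next_id
    @next_id += 1
    @issues[id] = { title: title, description: description, status: "open" }
    id
  end

  def transition(id, status)
    raise KeyError, "unknown issue #{id}" unless @issues.key?(id)
    @issues[id][:status] = status
  end
end
```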
Scenarios vs Tests:
- Stored outside codebase (like ML holdout sets)
- LLM-validated, not boolean pass/fail
- "Satisfaction" metric: fraction of trajectories that satisfy user
Key Stats
| Metric | Value |
|---|---|
| Human code | 0% |
| Human review | 0% |
| Target token spend | $1,000/day/engineer |
| DTU services | 6 (Okta, Jira, Slack, Docs, Drive, Sheets) |
Bitrise
Vendor lock-in concerns drove custom agent development.[7]
Overview
Bitrise, the mobile CI/CD platform, built their own AI coding agent after extensively benchmarking Claude Code, Codex, Gemini, and OpenCode. While Claude Code performed best, its closed-source nature and Anthropic-only API lock-in posed unacceptable long-term risks. Their solution: a custom Go-based agent using Anthropic APIs but with full ownership.
Strengths
- No vendor lock-in: Provider-agnostic architecture allows model switching mid-conversation
- Programmatic checkpoints: Verification embedded directly in agent workflow (not bolted on)
- Custom eval framework: Go-based benchmark system runs tests in parallel across agents
- Multi-agent coordination: Sub-agents dynamically constructed and orchestrated in Go
- Central logging: LLM messages stored in provider-agnostic format
Cautions
- Maintenance overhead: Custom agent requires ongoing development investment
- Anthropic-dependent: Still uses Anthropic APIs despite architectural flexibility
- Scale undisclosed: No public metrics on throughput or adoption
- Mobile CI focus: Agent optimized for their specific domain (build failures, PR reviews)
Architecture Details
Language: Go (matching Bitrise's core stack)
Eval Framework:
- Declarative test case definition
- Docker containers for isolated execution
- Parallel agent execution (~10 min runs)
- LLM Judges for subjective evaluation
- Results to SQL database + Metabase dashboard
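Bitrise's framework is written in Go and its internals are not public, so the following Ruby sketch only illustrates the general shape of a declarative eval case that mixes an objective command check with an LLM judge; every name is hypothetical.

```ruby
# Declarative eval case: pins a task, a sandbox image, and how to grade the output.
EVAL_CASES = [
  {
    name: "fix_failing_ios_build",
    image: "eval-sandbox:xcode15",            # Docker image for isolated execution
    prompt: "The build fails with error X; fix it.",
    checks: [
      { type: :command, run: "make test" },   # objective pass/fail
      { type: :llm_judge,                     # subjective grading
        rubric: "Did the agent explain the root cause?" }
    ]
  }
]

# Cases run in parallel (one container each) and results land in SQL for dashboards.
# results = EVAL_CASES.map { |c| Thread.new { run_eval(c) } }.map(&:value)
```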
Agent Design:
- Multiple sub-agents with injected tools/dependencies
- Coordinated flow with programmatic result collection
- Custom system prompts per use case
- Provider-agnostic message storage
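A sketch of that coordination pattern, with hypothetical `make_agent`, tool, and `summarize` helpers standing in for Bitrise's Go implementation:

```ruby
# Sub-agents are constructed with injected tools and prompts, run independently,
# and their results are collected programmatically rather than by another LLM.
def triage_build_failure(build)
  log_agent  = make_agent(prompt: LOG_ANALYSIS_PROMPT, tools: [fetch_logs_tool])
  diff_agent = make_agent(prompt: DIFF_REVIEW_PROMPT,  tools: [git_diff_tool])

  findings = [log_agent, diff_agent].map { |agent| agent.run(build) }  # independent sub-agents
  summarize(findings)                                                  # merged in code, not by a model
end
```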
Benchmarking Findings
| Agent | Verdict |
|---|---|
| Claude Code | Best performance, but closed-source + Anthropic-only |
| Codex | Fast but lost chain-of-thought, mid-transition issues |
| Gemini | 10-min response times without reserved resources |
| OpenCode | 2x slower than Claude Code, TUI-coupled |
Shopify Roast
Structured workflows for AI coding agents — open-sourced as Ruby gem.[8]
Overview
Shopify's "Roast" is a Ruby-based DSL for creating structured AI workflows. Rather than building a coding agent from scratch, Roast provides orchestration primitives ("cogs") that chain AI steps together, including a dedicated agent cog that runs local coding agents like Claude Code with filesystem access. Open-sourced in 2025 and now at v1.0 preview.
Strengths
- Open source: Ruby gem (`roast-ai`) freely available
- Multi-agent orchestration: Chain LLM calls, coding agents, shell commands, custom Ruby
- Serial and parallel execution: Map operations with configurable parallelism
- Composable: Modular scopes with parameters enable reuse
- Claude Code integration: `agent` cog runs Claude Code CLI with full filesystem access
Cautions
- Ruby-specific: Ecosystem limited to Ruby shops
- Requires Claude Code: `agent` cog depends on an external CLI installation
- Workflow complexity: DSL learning curve for non-trivial pipelines
- v1.0 transition: Breaking changes from v0.x YAML syntax
Architecture Details
Core Cogs:
- `chat` — Cloud LLM calls (OpenAI, Anthropic, Gemini)
- `agent` — Local coding agents with filesystem access
- `ruby` — Custom Ruby code execution
- `cmd` — Shell commands with output capture
- `map` — Collection processing (serial/parallel)
- `repeat` — Iteration until conditions met
- `call` — Reusable workflow invocation
Example Workflow:
```ruby
execute do
  cmd(:recent_changes) { "git diff --name-only HEAD~5..HEAD" }
  agent(:review) do
    files = cmd!(:recent_changes).lines
    "Review these files: #{files.join("\n")}"
  end
  chat(:summary) { "Summarize: #{agent!(:review).response}" }
end
```
Philosophy: Shopify's insight is that non-determinism is the enemy of reliability for AI agents. Roast provides structure to keep agents on track rather than letting them run unconstrained.
Ramp Inspect
Background coding agent for async task completion.[9]
Overview
Ramp built an internal tool called "Inspect" that runs coding agents in the background. Less public detail is available than for Stripe or StrongDM, but the approach has been validated by an open-source reimplementation.[10]
Strengths
- Background execution: Agents work while developers focus elsewhere
- Browser in sandbox: Agents can run browser automation within isolated environments
- Multi-repo support: Can work across multiple repositories in a single session
- Proven pattern: Inspired open-source clone with 500+ GitHub stars
- Familiar UX: Slack-based invocation meets developers where they are
- PR-ready output: Delivers completed pull requests for review
Cautions
- Limited public detail: Architecture largely inferred from open-source clone
- Human review required: Still has review bottleneck
- Single-tenant design: Trust boundaries needed for multi-tenant deployment
Open-Source Validation
Cole Murray's "Background Agents" project implements Ramp's architecture:
- Control plane: Cloudflare Workers + Durable Objects
- Data plane: Modal cloud sandboxes
- Agent runtime: OpenCode
- Features: Multiplayer sessions, commit attribution
Coinbase
5% of merged PRs now from in-house background agents.
Overview
Coinbase's Engineering VP Chintan Turakhia announced in February 2026 that the company had reached a significant milestone: 5% of all merged PRs now come from in-house background agents. Built the previous year by two engineers, the system uses a Slack-native framework with the same tools and context as human engineers.
Strengths
- Proven adoption: 5% of all merged PRs — significant production impact
- Familiar UX: Slack-native, tag agent in any thread to invoke
- Lightweight build: Two engineers built the initial version
- Full workflow: Agent plans, debugs, and ships PR end-to-end
Cautions
- Limited public detail: Architecture and validation approach not disclosed
- Scale context unclear: the significance of 5% depends on Coinbase's total PR volume, which is not disclosed
- Evolution unknown: Original "Claudebot" name suggests Anthropic dependency
Architecture Details
Invocation: Slack (tag agent in any thread)
Context: Same tools and context as human engineers
Output: PR-ready (plans, debugs, ships)
Origins: Started as "Claudebot" — two-engineer project that scaled to production
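A minimal sketch of this Slack-native invocation pattern, assuming hypothetical `slack_events`, `enqueue_agent_run`, and `post_reply` helpers rather than Coinbase's actual internals:

```ruby
# Tagging the bot in any thread enqueues a background agent run; the result
# posted back to the thread is a PR link for human review.
slack_events.on(:app_mention) do |event|
  task = event.text.sub(/<@\w+>/, "").strip        # strip the bot mention, keep the request
  run  = enqueue_agent_run(
    prompt: task,
    channel: event.channel,
    thread_ts: event.thread_ts || event.ts         # reply in the same thread
  )
  post_reply(event, "On it. I'll post a PR link here when done. (run #{run.id})")
end
```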
Abnormal AI
13% of PRs now from background agents — highest publicly reported percentage. [3]
Overview
Abnormal AI (cybersecurity company) reported in February 2026 that 13% of their PRs now come from in-house background agents — the highest percentage publicly disclosed. The system serves full-stack engineers, security analysts, and MLEs, with the team actively migrating from a GitHub Actions-powered backend to Modal for execution.
Strengths
- Highest reported adoption: 13% of PRs (vs. Coinbase 5%, Stripe volume-based)
- Broad use cases: Full-stack features, security/infra patches, agent tooling
- Self-improving: The agent/dev tools build themselves
- Multi-persona: Serves engineers, security analysts, and MLEs
Cautions
- Limited technical detail: Architecture not fully disclosed (blog post WIP)
- Infrastructure in flux: Currently migrating from GHA to Modal
Uber
Saved 21,000 developer hours with LangGraph-powered AI agents. [11]
Overview
Uber's Developer Platform Team presented at LangChain's Interrupt event detailing how they've deployed agentic tools across an engineering organization supporting 5,000 developers and a codebase with hundreds of millions of lines. Using LangGraph for orchestration, they've built reusable, domain-specific agents for testing, validation, and workflow assistance.
Key Tools
Validator: IDE-embedded agent that flags security vulnerabilities and best-practice violations in real time. Proposes fixes that can be accepted with one click or routed to an agentic assistant for context-aware resolution. Uses a hybrid architecture: LLM for complex issues, deterministic tools (static linters) for common patterns.
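A sketch of that hybrid routing, with hypothetical `run_static_linters`, `llm_security_review`, and `propose_fix` helpers standing in for Uber's internals:

```ruby
# Deterministic linters cover the common patterns; an LLM pass handles complex or
# semantic issues the linters can't express; every finding carries a proposed fix.
def validate(changed_files)
  lint_findings = changed_files.flat_map { |f| run_static_linters(f) }   # deterministic, repeatable
  llm_findings  = changed_files.flat_map { |f| llm_security_review(f) }  # complex issues only
  (lint_findings + llm_findings).map { |finding| finding.merge(fix: propose_fix(finding)) }
end
```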
Autocover: Generative test-authoring tool that scaffolds, generates, executes, and mutates test cases. Can execute up to 100 tests concurrently, delivering 2-3x the throughput of other AI coding tools. Increased test coverage by 10%, saving an estimated 21,000 developer hours.
Picasso: Workflow platform with conversational AI agents integrated with organizational knowledge.
Strengths
- Massive scale: 5,000 developers, hundreds of millions of LOC
- Proven ROI: 21,000 developer hours saved, 10% test coverage increase
- Hybrid architecture: LLM + deterministic tools for best of both worlds
- Reusable primitives: Security team can contribute rules without deep LangGraph knowledge
- Domain expertise: Specialized agents outperform generic AI coding tools
Cautions
- Infrastructure investment: Requires dedicated platform team to maintain
- LangGraph dependency: Tightly coupled to LangChain ecosystem
- Enterprise context: Patterns may not transfer to smaller teams
Key Lessons (from Uber's team)
- Encapsulation enables reuse — Clear interfaces let teams extend without central coordination
- Domain expert agents outperform generic tools — Specialized context beats general-purpose AI
- Determinism still matters — Linters and build tools work better deterministically, orchestrated by agents
- Solve narrow problems first — Tightly scoped solutions get reused in broader workflows
Spotify Honk
Best developers haven't written a line of code since December — AI-first development via internal "Honk" system. [1]
Overview
During Spotify's Q4 2025 earnings call (February 2026), co-CEO Gustav Söderström revealed that the company's best developers "have not written a single line of code since December" thanks to an internal AI coding system called Honk. The system, powered by Claude Code, enables remote real-time code deployment via Slack — engineers can fix bugs and ship features from their phones during their morning commute.
Key Results
- 50+ features shipped throughout 2025 using AI-assisted development
- Record earnings with stock jumping 14.7% on the announcement
- Recent AI-powered launches: Prompted Playlists, Page Match (audiobooks), About This Song
How Honk Works
"An engineer at Spotify on their morning commute from Slack on their cell phone can tell Claude to fix a bug or add a new feature to the iOS app. And once Claude finishes that work, the engineer then gets a new version of the app, pushed to them on Slack on their phone, so that he can then merge it to production, all before they even arrive at the office."
— Gustav Söderström, Spotify co-CEO
Strengths
- Mobile-first workflow: Ship from Slack on your phone, review and merge before arriving at office
- Claude Code integration: Built on proven agentic coding infrastructure
- Quantified impact: 50+ features, direct contribution to record earnings
- Leadership buy-in: Announced publicly by co-CEO during earnings call
Cautions
- Limited technical details: Architecture and tooling not publicly documented (unlike Stripe, Shopify)
- "Best developers" qualifier: May not represent all engineering workflows
- Consumer vs. enterprise: Spotify's codebase is primarily consumer-facing, different constraints than B2B
Key Lessons
- Slack is the universal agent interface — Matches pattern at Stripe, Coinbase, Ramp
- Executives publicly quantifying AI impact — "50+ features" and record earnings tied to AI adoption
- Mobile-native development workflows — Engineers reviewing/merging code from their phones
- "Best developers" redefining what developers do — Senior engineers as AI orchestrators, not code writers
Other Suspected Implementations
| Company | Evidence | Likely Approach | Confidence |
|---|---|---|---|
| Google | Massive internal tooling, AI research leadership | Integrated with Piper monorepo, internal LLMs | High |
| Meta | Code Llama, internal AI infra | Code Llama fine-tuned on internal patterns | High |
| Amazon | Q Developer (external), internal ML platform | Internal agents likely predate Q Developer | Medium |
| Netflix | Heavy automation culture | Integrated with deployment platform | Low |
Note: Coinbase and Uber have been moved to confirmed company implementations (see sections above).
Common Architecture Patterns
Invocation Surface
| Pattern | Adoption | Notes |
|---|---|---|
| Slack | Universal | Primary interface for all documented systems |
| CLI | Common | Secondary for power users |
| Web UI | Common | Visibility and management |
| Internal tools | Stripe only | Deep integration (ticketing, docs, feature flags) |
Execution Environment
- Isolated sandboxes — Pre-warmed for fast spin-up (Stripe: 10 seconds)
- Same as human dev environment — Reduces agent-specific edge cases
- Network isolation — No production access, no internet (security)
- Parallelization — Multiple agents without git worktree conflicts
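A sketch of the pre-warmed sandbox pattern the list above describes, assuming a hypothetical `provision_sandbox` helper; no specific vendor's API is implied.

```ruby
# Keep a pool of ready, isolated environments so an agent run starts in seconds
# instead of waiting on provisioning.
class SandboxPool
  def initialize(size: 10)
    @queue = Queue.new
    size.times { @queue << provision_sandbox }     # warm ahead of demand
  end

  def checkout
    sandbox = @queue.pop                           # near-instant when the pool is warm
    Thread.new { @queue << provision_sandbox }     # refill asynchronously
    sandbox
  end
end
```

Checking out a fresh sandbox per run is also what makes parallelization safe: agents never share a working tree, so there are no git worktree conflicts.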
Validation Spectrum
| Approach | Human Review | Test Strategy | Example |
|---|---|---|---|
| Conservative | Required | Traditional CI + lint | Stripe, Ramp |
| Radical | Eliminated | Behavioral validation + DTU | StrongDM |
Strategic Recommendations
For Engineering Leaders
Build when:
- Codebase exceeds 10M LOC with proprietary frameworks
- Existing devex team can be redirected (3+ engineers)
- Compliance requires full control over code flow
- Organization has 1,000+ engineers (ROI threshold)
Buy when:
- Standard tech stack (Python, TypeScript, common frameworks)
- Team under 100 engineers
- Need agents in weeks, not quarters
- Enterprise integrations required (Jira, signed commits, BYOK)
Wait when:
- Unclear ROI or codebase fit
- No dedicated devex resources
- Vendor market still maturing
For Devex Teams
- Start with invocation surface — Slack integration provides immediate value
- Invest in sandboxing — Pre-warmed environments are universal pattern
- Consider MCP adoption — Emerging standard for tool integration
- Evaluate validation requirements — Decide on human review early
For Vendors
- Reference architecture exists — Elite companies have defined the pattern
- Middle market opportunity — Stripe-like capabilities without Stripe-level investment
- Validation innovation — Digital Twin approach may become differentiator
Market Outlook
Near-Term (2026-2027)
- More companies will publish in-house implementations
- Open-source clones will mature (Background Agents, etc.)
- Vendor solutions will close gap with in-house systems
Medium-Term (2027-2028)
- Digital Twin testing will become standard practice
- Human code review will become optional at mature orgs
- "Token spend per engineer" will emerge as productivity metric
Long-Term (2028+)
- In-house vs. vendor distinction may blur
- Validation infrastructure becomes the moat, not the agent
- "Grown software" philosophy spreads beyond early adopters
Bottom Line
A spectrum of approaches is now documented:
Stripe Minions: 1,000+ merged PRs per week with no human-written code, but human review required. The pattern — Slack invocation, isolated sandboxes, MCP context, CI integration, PR output — is well-documented and replicable.
Bitrise: Built custom Go agent to avoid vendor lock-in with Claude Code. Key insight: programmatic checkpoints embedded in agent workflow beat bolted-on validation. Shows that even mid-size companies can build if the fit is right.
Shopify Roast: Rather than a full agent, open-sourced orchestration primitives. Philosophy: non-determinism is the enemy of reliability — structure keeps agents on track. Useful for companies wanting custom workflows without building agents from scratch.
StrongDM Software Factory: The radical end — no human code, no human review. Code treated as opaque weights validated purely by behavior. Digital Twin Universe enables testing at scale against cloned third-party services.
All require investment, but the threshold varies. Shopify's approach (DSL + existing agents) is lighter than Bitrise's (full custom agent), which is lighter than Stripe's (agents + massive devex infra). Most companies should still buy, not build.
The interesting middle ground: platforms like Tembo that provide Stripe-like orchestration without Stripe-level investment.
Research by Ry Walker Research
Disclosure: Author is CEO of Tembo, which offers agent orchestration as an alternative to building in-house.
Sources
- [1] Spotify Honk AI System (TechCrunch)
- [2] Coinbase Internal Agent Announcement
- [3] Abnormal AI Background Agents (13% PRs)
- [4] Stripe Minions Blog Post
- [5] Block Goose Agent
- [6] StrongDM Software Factory
- [7] Bitrise AI Coding Agent Blog Series
- [8] Shopify Roast (GitHub)
- [9] Ramp Background Agent Blog Post
- [10] Background Agents (Open Source Ramp Clone)
- [11] Uber AI Agents Save 21,000 Developer Hours