Key takeaways
- Built a custom Go agent to avoid vendor lock-in, even though Claude Code benchmarked best
- Programmatic checkpoints embed verification directly in agent workflow
- Provider-agnostic architecture allows model switching mid-conversation
FAQ
Why did Bitrise build their own coding agent?
Despite Claude Code performing best in benchmarks, its closed-source nature and Anthropic-only API posed unacceptable long-term vendor lock-in risks for Bitrise.
What makes Bitrise's agent different?
Programmatic checkpoints embed verification directly in agent workflows, and the provider-agnostic architecture allows switching LLM providers mid-conversation.
What agents did Bitrise evaluate before building their own?
Claude Code (best performance but closed-source), Codex (fast but inconsistent), Gemini (slow response times), and OpenCode (open-source but 2x slower).
Executive Summary
Bitrise, the mobile CI/CD platform, built a custom Go-based AI coding agent after extensively benchmarking Claude Code, Codex, Gemini, and OpenCode. Despite Claude Code performing best, its closed-source nature and Anthropic-only API posed unacceptable vendor lock-in risks. The result: a provider-agnostic agent with programmatic checkpoints that embed verification directly into workflows.
| Attribute | Value |
|---|---|
| Company | Bitrise |
| Language | Go |
| Foundation | Custom (uses Anthropic APIs) |
| Public Documentation | November 2025 |
| Headquarters | Budapest, Hungary |
Product Overview
Bitrise's AI coding agent is designed for their specific domain: mobile CI/CD. The agent handles build failures, PR reviews, and automated fixes. Its key innovation is the programmatic checkpoint system: verification and validation embedded directly in the agent workflow, not bolted on after the fact.
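Bitrise has not published the agent's source, but the checkpoint idea is concrete enough to sketch. The Go snippet below is a minimal illustration of a verification step that runs inside the agent loop rather than after it; the Checkpoint interface, GoTestCheckpoint, and runWithCheckpoints names are assumptions for illustration, not Bitrise's actual code.

```go
// Hypothetical sketch: a checkpoint that runs inside the agent loop,
// not as a separate post-processing step. Names are illustrative.
package agent

import (
	"context"
	"fmt"
	"os/exec"
)

// Checkpoint verifies the workspace state before the agent is allowed
// to continue (e.g. open a PR or mark a build as fixed).
type Checkpoint interface {
	Name() string
	Verify(ctx context.Context, workdir string) error
}

// GoTestCheckpoint is a programmatic check: run `go test ./...` in the
// repository the agent just modified.
type GoTestCheckpoint struct{}

func (GoTestCheckpoint) Name() string { return "go-test" }

func (GoTestCheckpoint) Verify(ctx context.Context, workdir string) error {
	cmd := exec.CommandContext(ctx, "go", "test", "./...")
	cmd.Dir = workdir
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("go test failed: %w\n%s", err, out)
	}
	return nil
}

// runWithCheckpoints interleaves agent steps with verification instead of
// validating only at the end of the run.
func runWithCheckpoints(ctx context.Context, steps []func(context.Context) error, cps []Checkpoint, workdir string) error {
	for _, step := range steps {
		if err := step(ctx); err != nil {
			return err
		}
		for _, cp := range cps {
			if err := cp.Verify(ctx, workdir); err != nil {
				return fmt.Errorf("checkpoint %s: %w", cp.Name(), err)
			}
		}
	}
	return nil
}
```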
Key Capabilities
| Capability | Description |
|---|---|
| Provider-agnostic | Switch LLM providers mid-conversation if needed |
| Programmatic checkpoints | Verification embedded in workflow, not separate |
| Central logging | LLM messages stored in provider-agnostic format |
| Multi-agent coordination | Sub-agents dynamically constructed and orchestrated |
| Custom eval framework | Go-based benchmarking runs tests in parallel |
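For the provider-agnostic and central-logging capabilities, a rough sketch shows why mid-conversation switching is possible: if the history is stored in a neutral format, any backend that implements a common interface can continue it. The Provider, Message, and Conversation types below are illustrative assumptions, not Bitrise's published API.

```go
// Hypothetical sketch of a provider-agnostic completion interface.
// Bitrise has not published its agent code; these names are assumed.
package agent

import "context"

// Message is stored in a neutral format so a conversation started with
// one provider can be continued with another.
type Message struct {
	Role    string // "system", "user", "assistant", "tool"
	Content string
}

// Provider abstracts an LLM backend (Anthropic today, others later).
type Provider interface {
	Complete(ctx context.Context, msgs []Message) (Message, error)
}

// Conversation keeps the full history in the neutral format and lets the
// caller swap the provider between turns.
type Conversation struct {
	History  []Message
	Provider Provider
}

func (c *Conversation) Turn(ctx context.Context, user string) (Message, error) {
	c.History = append(c.History, Message{Role: "user", Content: user})
	reply, err := c.Provider.Complete(ctx, c.History)
	if err != nil {
		return Message{}, err
	}
	c.History = append(c.History, reply)
	return reply, nil
}

// SwitchProvider changes the backend mid-conversation; the history stays
// valid because it never contains provider-specific payloads.
func (c *Conversation) SwitchProvider(p Provider) { c.Provider = p }
```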
Use Cases
| Use Case | Description |
|---|---|
| PR review | AI-generated review comments on pull requests |
| Build fixer | Automatic resolution of failing CI builds |
| Log analysis | Summarize failed build logs and suggest fixes |
| Dependency updates | Automated dependency management |
Technical Architecture
Bitrise first built an evaluation framework in Go to benchmark existing agents before committing to building their own.
Eval Framework Flow
```
Agents (declarative list)
    ↓
Test Cases (declarative definition)
    ↓
Docker Containers (parallel execution)
    ├── Install AI coding agent
    ├── Clone source Git repository
    ├── Apply patches, install dependencies
    └── Run agent (~10 min)
    ↓
Verification
    ├── Programmatic checks (go test ./...)
    └── LLM Judges (subjective evaluation)
    ↓
SQL Database → Metabase Dashboard
```
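The blog series describes this flow but does not publish the framework itself. The sketch below shows one plausible shape for the declarative agent and test-case definitions plus a parallel runner; every type and field name is an assumption made for illustration.

```go
// Hypothetical sketch of the declarative shapes an eval framework like
// Bitrise's might use. Not taken from their code.
package evals

// Agent describes one coding agent under test (e.g. Claude Code, Codex).
type Agent struct {
	Name       string
	InstallCmd string // how to install the agent inside the container
	RunCmd     string // how to invoke it against the workspace
}

// TestCase describes one benchmark scenario applied to a cloned repository.
type TestCase struct {
	Name     string
	RepoURL  string
	Patches  []string // patches that set up the failure to fix
	Prompt   string   // task given to the agent
	Checks   []string // programmatic checks, e.g. "go test ./..."
	UseJudge bool     // whether an LLM judge also scores the result
}

// Result is the row written to the SQL database behind the dashboard.
type Result struct {
	Agent    string
	TestCase string
	Passed   bool
	Duration float64 // seconds; runs were capped at roughly ten minutes
}

// RunAll executes every (agent, test case) pair concurrently; in the real
// framework each run would launch its own Docker container.
func RunAll(agents []Agent, cases []TestCase, run func(Agent, TestCase) Result) []Result {
	results := make(chan Result)
	var pending int
	for _, a := range agents {
		for _, tc := range cases {
			pending++
			go func(a Agent, tc TestCase) { results <- run(a, tc) }(a, tc)
		}
	}
	out := make([]Result, 0, pending)
	for i := 0; i < pending; i++ {
		out = append(out, <-results)
	}
	return out
}
```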
Agent Architecture
```
Multiple Sub-Agents
    ├── Dynamically constructed in Go
    ├── Injected tools/dependencies
    └── Custom system prompts per use case
    ↓
Coordinated Flow
    ├── Programmatic result collection
    └── Provider-agnostic message storage
    ↓
Output (PR comments, fixes, etc.)
```
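As a rough illustration of dynamically constructed sub-agents, the sketch below wires a per-use-case system prompt and injected tools into a sub-agent and collects results programmatically. It reuses the hypothetical Provider and Message types from the earlier sketch; none of these names come from Bitrise.

```go
// Hypothetical sketch of sub-agent construction and coordination.
package agent

import "context"

// Tool is a capability injected into a sub-agent (read a file, run a
// command, post a comment, ...).
type Tool interface {
	Name() string
	Call(ctx context.Context, input string) (string, error)
}

// SubAgent bundles a system prompt, its injected tools, and the provider
// it talks to.
type SubAgent struct {
	SystemPrompt string
	Tools        []Tool
	Provider     Provider
}

// NewPRReviewAgent constructs a sub-agent for the PR-review use case with
// only the tools that use case needs.
func NewPRReviewAgent(p Provider, diffTool, commentTool Tool) *SubAgent {
	return &SubAgent{
		SystemPrompt: "You review pull requests for mobile CI/CD pipelines.",
		Tools:        []Tool{diffTool, commentTool},
		Provider:     p,
	}
}

// Coordinator runs its sub-agents and collects their results
// programmatically, feeding the final output (PR comments, fixes, etc.).
type Coordinator struct {
	Agents []*SubAgent
}

func (c *Coordinator) Run(ctx context.Context, task string) ([]Message, error) {
	var out []Message
	for _, a := range c.Agents {
		msg, err := a.Provider.Complete(ctx, []Message{
			{Role: "system", Content: a.SystemPrompt},
			{Role: "user", Content: task},
		})
		if err != nil {
			return nil, err
		}
		out = append(out, msg)
	}
	return out, nil
}
```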
Key Technical Details
| Aspect | Detail |
|---|---|
| Language | Go (matching Bitrise core stack) |
| LLM Provider | Anthropic APIs (but provider-agnostic design) |
| Eval Runtime | ~10 minutes per agent in parallel Docker containers |
| Verification | Programmatic checks + LLM Judges |
| Storage | Provider-agnostic format for model switching |
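A provider-agnostic storage record might look like the struct below. The schema is an assumption: Bitrise describes the property (messages logged centrally in a format that survives provider changes) but not the format itself.

```go
// Hypothetical sketch of a centrally logged, provider-agnostic record.
package agentlog

import "time"

// StoredMessage is what the central log might keep for every LLM exchange.
// Because it carries no provider-specific payload, a conversation replayed
// from the log can be resumed against a different provider.
type StoredMessage struct {
	ConversationID string    `json:"conversation_id"`
	Role           string    `json:"role"`     // "system", "user", "assistant", "tool"
	Content        string    `json:"content"`  // plain text, no provider envelope
	Provider       string    `json:"provider"` // which backend produced it (analysis only)
	Model          string    `json:"model"`
	CreatedAt      time.Time `json:"created_at"`
}
```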
Benchmarking Findings
Bitrise's evaluation revealed significant differences across agents:
| Agent | Performance | Issues |
|---|---|---|
| Claude Code | Best overall | Closed-source, Anthropic-only API |
| Codex | Fast responses | Lost chain-of-thought, mid-transition issues |
| Gemini | Variable | 10-minute response times without reserved resources |
| OpenCode | Open-source | 2x slower than Claude Code, TUI-coupled |
Post-Benchmark Updates
Since initial benchmarking:
- Sonnet 4.5 — Better context handling, but limited performance gains
- Haiku 4.5 — Comparable to Sonnet at lower cost
- GPT-5-Codex — Promising but couldn't outperform Anthropic models
- OpenCode — Archived; successor "Crush" in development by Charm team
Strengths
- No vendor lock-in — Provider-agnostic architecture allows model switching mid-conversation
- Programmatic checkpoints — Verification embedded in workflow, not bolted on (essential for production-grade AI features)
- Custom eval framework — Go-based benchmark system enables systematic comparison
- Multi-agent coordination — Sub-agents dynamically constructed with injected tools/dependencies
- Central logging — LLM messages stored in format that survives provider changes
Cautions
- Maintenance overhead — Custom agent requires ongoing development investment
- Anthropic-dependent — Still uses Anthropic APIs despite architectural flexibility
- Scale undisclosed — No public metrics on throughput or adoption percentage
- Mobile CI focus — Agent optimized for specific domain (build failures, PR reviews); may not generalize
- Not for sale — Internal tooling, not a product (Bitrise AI features are separate)
Competitive Positioning
vs. Other In-House Agents
| System | Differentiation |
|---|---|
| Stripe Minions | Stripe uses Goose fork; Bitrise built from scratch |
| Ramp Inspect | Ramp uses Modal; Bitrise uses Docker containers |
| Commercial agents | Bitrise accepts maintenance cost for flexibility |
Build vs. Buy Decision
Bitrise explicitly chose build over buy because:
- Closed-source risk (Claude Code)
- API lock-in risk (Anthropic-only)
- Need for programmatic checkpoints
- Domain-specific customization (mobile CI)
Ideal Customer Profile
This is internal tooling, not a product for sale. However, the approach is worth studying for teams weighing a similar build.
Good fit for a similar build:
- Specific domain with unique workflows (mobile, CI/CD, etc.)
- Strong Go engineering team
- Long-term concerns about vendor lock-in
- Need for programmatic checkpoints in agent workflows
Poor fit:
- General-purpose coding needs
- Limited engineering resources for maintenance
- Short-term project with disposable code
- No specific vendor lock-in concerns
Viability Assessment
| Factor | Assessment |
|---|---|
| Documentation Quality | Good (4-part blog series) |
| Replicability | Medium (requires Go expertise) |
| Benchmarking Rigor | High (systematic comparison) |
| Architecture Maturity | Medium (newer than Stripe/Ramp) |
| Domain Specificity | High (mobile CI/CD focused) |
Bitrise's detailed blog series provides valuable insights for teams evaluating build vs. buy decisions, particularly around vendor lock-in concerns.
Bottom Line
Bitrise demonstrates that even mid-size companies can build custom coding agents when domain fit and vendor lock-in concerns justify the investment. The key insight: programmatic checkpoints embedded in agent workflows beat bolted-on validation.
Key decision factors: Closed-source risk, API lock-in, need for deep integration with proprietary workflows.
Recommended study for: Teams evaluating build vs. buy, organizations with specific domain workflows, engineers concerned about vendor lock-in.
Not recommended for: General-purpose coding needs, teams without Go expertise, organizations comfortable with Anthropic lock-in.
Outlook: Bitrise's approach validates that the "build" option remains viable for companies with specific domain needs and technical capability, even as commercial agents improve.
Research by Ry Walker Research
Disclosure: Author is CEO of Tembo, which offers agent orchestration as an alternative to building in-house.