Key takeaways
- Built a custom Go agent to avoid vendor lock-in, even though Claude Code benchmarked best
- Programmatic checkpoints embed verification directly in agent workflow
- Provider-agnostic architecture allows model switching mid-conversation
FAQ
Why did Bitrise build their own coding agent?
Despite Claude Code performing best in benchmarks, its closed-source nature and Anthropic-only API posed unacceptable long-term vendor lock-in risks for Bitrise.
What makes Bitrise's agent different?
Programmatic checkpoints embed verification directly in agent workflows, and the provider-agnostic architecture allows switching LLM providers mid-conversation.
What agents did Bitrise evaluate before building their own?
Claude Code (best performance but closed-source), Codex (fast but inconsistent), Gemini (slow response times), and OpenCode (open-source but 2x slower).
Executive Summary
Bitrise, the mobile CI/CD platform, built a custom Go-based AI coding agent after extensively benchmarking Claude Code, Codex, Gemini, and OpenCode. Despite Claude Code performing best, its closed-source nature and Anthropic-only API posed unacceptable vendor lock-in risks. The result: a provider-agnostic agent with programmatic checkpoints that embed verification directly into workflows.
| Attribute | Value |
|---|---|
| Company | Bitrise |
| Language | Go |
| Foundation | Custom (uses Anthropic APIs) |
| Public Documentation | November 2025 |
| Headquarters | Budapest, Hungary |
Product Overview
Bitrise's AI coding agent is designed for their specific domain: mobile CI/CD. The agent handles build failures, PR reviews, and automated fixes. Its key innovation is the programmatic checkpoint system: verification and validation embedded directly in the agent workflow, not bolted on after the fact.
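Bitrise has not published the agent's source, but the checkpoint idea is concrete enough to sketch. The Go snippet below is a minimal illustration of a verification step that runs inside the agent loop rather than after it; the Checkpoint interface, GoTestCheckpoint, and runWithCheckpoints names are assumptions for illustration, not Bitrise's actual code.

```go
// Hypothetical sketch: a checkpoint that runs inside the agent loop,
// not as a separate post-processing step. Names are illustrative.
package agent

import (
	"context"
	"fmt"
	"os/exec"
)

// Checkpoint verifies the workspace state before the agent is allowed
// to continue (e.g. open a PR or mark a build as fixed).
type Checkpoint interface {
	Name() string
	Verify(ctx context.Context, workdir string) error
}

// GoTestCheckpoint is a programmatic check: run `go test ./...` in the
// repository the agent just modified.
type GoTestCheckpoint struct{}

func (GoTestCheckpoint) Name() string { return "go-test" }

func (GoTestCheckpoint) Verify(ctx context.Context, workdir string) error {
	cmd := exec.CommandContext(ctx, "go", "test", "./...")
	cmd.Dir = workdir
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("go test failed: %w\n%s", err, out)
	}
	return nil
}

// runWithCheckpoints interleaves agent steps with verification instead of
// validating only at the end of the run.
func runWithCheckpoints(ctx context.Context, steps []func(context.Context) error, cps []Checkpoint, workdir string) error {
	for _, step := range steps {
		if err := step(ctx); err != nil {
			return err
		}
		for _, cp := range cps {
			if err := cp.Verify(ctx, workdir); err != nil {
				return fmt.Errorf("checkpoint %s: %w", cp.Name(), err)
			}
		}
	}
	return nil
}
```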
Key Capabilities
| Capability | Description |
|---|---|
| Provider-agnostic | Switch LLM providers mid-conversation if needed |
| Programmatic checkpoints | Verification embedded in workflow, not separate |
| Central logging | LLM messages stored in provider-agnostic format |
| Multi-agent coordination | Sub-agents dynamically constructed and orchestrated |
| Custom eval framework | Go-based benchmarking runs tests in parallel |
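For the provider-agnostic and central-logging capabilities, a rough sketch shows why mid-conversation switching is possible: if the history is stored in a neutral format, any backend that implements a common interface can continue it. The Provider, Message, and Conversation types below are illustrative assumptions, not Bitrise's published API.

```go
// Hypothetical sketch of a provider-agnostic completion interface.
// Bitrise has not published its agent code; these names are assumed.
package agent

import "context"

// Message is stored in a neutral format so a conversation started with
// one provider can be continued with another.
type Message struct {
	Role    string // "system", "user", "assistant", "tool"
	Content string
}

// Provider abstracts an LLM backend (Anthropic today, others later).
type Provider interface {
	Complete(ctx context.Context, msgs []Message) (Message, error)
}

// Conversation keeps the full history in the neutral format and lets the
// caller swap the provider between turns.
type Conversation struct {
	History  []Message
	Provider Provider
}

func (c *Conversation) Turn(ctx context.Context, user string) (Message, error) {
	c.History = append(c.History, Message{Role: "user", Content: user})
	reply, err := c.Provider.Complete(ctx, c.History)
	if err != nil {
		return Message{}, err
	}
	c.History = append(c.History, reply)
	return reply, nil
}

// SwitchProvider changes the backend mid-conversation; the history stays
// valid because it never contains provider-specific payloads.
func (c *Conversation) SwitchProvider(p Provider) { c.Provider = p }
```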
Use Cases
| Use Case | Description |
|---|---|
| PR review | AI-generated review comments on pull requests |
| Build fixer | Automatic resolution of failing CI builds |
| Log analysis | Summarize failed build logs and suggest fixes |
| Dependency updates | Automated dependency management |
Technical Architecture
Bitrise first built an evaluation framework in Go to benchmark existing agents before committing to building their own.
Eval Framework Flow
```
Agents (declarative list)
    ↓
Test Cases (declarative definition)
    ↓
Docker Containers (parallel execution)
    ├── Install AI coding agent
    ├── Clone source Git repository
    ├── Apply patches, install dependencies
    └── Run agent (~10 min)
    ↓
Verification
    ├── Programmatic checks (go test ./...)
    └── LLM Judges (subjective evaluation)
    ↓
SQL Database → Metabase Dashboard
```
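The blog series describes this flow but does not publish the framework itself. The sketch below shows one plausible shape for the declarative agent and test-case definitions plus a parallel runner; every type and field name is an assumption made for illustration.

```go
// Hypothetical sketch of the declarative shapes an eval framework like
// Bitrise's might use. Not taken from their code.
package evals

// Agent describes one coding agent under test (e.g. Claude Code, Codex).
type Agent struct {
	Name       string
	InstallCmd string // how to install the agent inside the container
	RunCmd     string // how to invoke it against the workspace
}

// TestCase describes one benchmark scenario applied to a cloned repository.
type TestCase struct {
	Name     string
	RepoURL  string
	Patches  []string // patches that set up the failure to fix
	Prompt   string   // task given to the agent
	Checks   []string // programmatic checks, e.g. "go test ./..."
	UseJudge bool     // whether an LLM judge also scores the result
}

// Result is the row written to the SQL database behind the dashboard.
type Result struct {
	Agent    string
	TestCase string
	Passed   bool
	Duration float64 // seconds; runs were capped at roughly ten minutes
}

// RunAll executes every (agent, test case) pair concurrently; in the real
// framework each run would launch its own Docker container.
func RunAll(agents []Agent, cases []TestCase, run func(Agent, TestCase) Result) []Result {
	results := make(chan Result)
	var pending int
	for _, a := range agents {
		for _, tc := range cases {
			pending++
			go func(a Agent, tc TestCase) { results <- run(a, tc) }(a, tc)
		}
	}
	out := make([]Result, 0, pending)
	for i := 0; i < pending; i++ {
		out = append(out, <-results)
	}
	return out
}
```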
Agent Architecture
```
Multiple Sub-Agents
    ├── Dynamically constructed in Go
    ├── Injected tools/dependencies
    └── Custom system prompts per use case
    ↓
Coordinated Flow
    ├── Programmatic result collection
    └── Provider-agnostic message storage
    ↓
Output (PR comments, fixes, etc.)
```
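As a rough illustration of dynamically constructed sub-agents, the sketch below wires a per-use-case system prompt and injected tools into a sub-agent and collects results programmatically. It reuses the hypothetical Provider and Message types from the earlier sketch; none of these names come from Bitrise.

```go
// Hypothetical sketch of sub-agent construction and coordination.
package agent

import "context"

// Tool is a capability injected into a sub-agent (read a file, run a
// command, post a comment, ...).
type Tool interface {
	Name() string
	Call(ctx context.Context, input string) (string, error)
}

// SubAgent bundles a system prompt, its injected tools, and the provider
// it talks to.
type SubAgent struct {
	SystemPrompt string
	Tools        []Tool
	Provider     Provider
}

// NewPRReviewAgent constructs a sub-agent for the PR-review use case with
// only the tools that use case needs.
func NewPRReviewAgent(p Provider, diffTool, commentTool Tool) *SubAgent {
	return &SubAgent{
		SystemPrompt: "You review pull requests for mobile CI/CD pipelines.",
		Tools:        []Tool{diffTool, commentTool},
		Provider:     p,
	}
}

// Coordinator runs its sub-agents and collects their results
// programmatically, feeding the final output (PR comments, fixes, etc.).
type Coordinator struct {
	Agents []*SubAgent
}

func (c *Coordinator) Run(ctx context.Context, task string) ([]Message, error) {
	var out []Message
	for _, a := range c.Agents {
		msg, err := a.Provider.Complete(ctx, []Message{
			{Role: "system", Content: a.SystemPrompt},
			{Role: "user", Content: task},
		})
		if err != nil {
			return nil, err
		}
		out = append(out, msg)
	}
	return out, nil
}
```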
Key Technical Details
| Aspect | Detail |
|---|---|
| Language | Go (matching Bitrise core stack) |
| LLM Provider | Anthropic APIs (but provider-agnostic design) |
| Eval Runtime | ~10 minutes per agent in parallel Docker containers |
| Verification | Programmatic checks + LLM Judges |
| Storage | Provider-agnostic format for model switching |
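A provider-agnostic storage record might look like the struct below. The schema is an assumption: Bitrise describes the property (messages logged centrally in a format that survives provider changes) but not the format itself.

```go
// Hypothetical sketch of a centrally logged, provider-agnostic record.
package agentlog

import "time"

// StoredMessage is what the central log might keep for every LLM exchange.
// Because it carries no provider-specific payload, a conversation replayed
// from the log can be resumed against a different provider.
type StoredMessage struct {
	ConversationID string    `json:"conversation_id"`
	Role           string    `json:"role"`     // "system", "user", "assistant", "tool"
	Content        string    `json:"content"`  // plain text, no provider envelope
	Provider       string    `json:"provider"` // which backend produced it (analysis only)
	Model          string    `json:"model"`
	CreatedAt      time.Time `json:"created_at"`
}
```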
Benchmarking Findings
Bitrise's evaluation revealed significant differences across agents:
| Agent | Performance | Issues |
|---|---|---|
| Claude Code | Best overall | Closed-source, Anthropic-only API |
| Codex | Fast responses | Lost chain-of-thought, mid-transition issues |
| Gemini | Variable | 10-minute response times without reserved resources |
| OpenCode | Open-source | 2x slower than Claude Code, TUI-coupled |
Post-Benchmark Updates
Since initial benchmarking:
- Sonnet 4.5 — Better context handling, but limited performance gains
- Haiku 4.5 — Comparable to Sonnet at lower cost
- GPT-5-Codex — Promising but couldn't outperform Anthropic models
- OpenCode — Archived; successor "Crush" in development by Charm team
Strengths
- No vendor lock-in — Provider-agnostic architecture allows model switching mid-conversation
- Programmatic checkpoints — Verification embedded in workflow, not bolted on (essential for production-grade AI features)
- Custom eval framework — Go-based benchmark system enables systematic comparison
- Multi-agent coordination — Sub-agents dynamically constructed with injected tools/dependencies
- Central logging — LLM messages stored in format that survives provider changes
Cautions
- Maintenance overhead — Custom agent requires ongoing development investment
- Anthropic-dependent — Still uses Anthropic APIs despite architectural flexibility
- Scale undisclosed — No public metrics on throughput or adoption percentage
- Mobile CI focus — Agent optimized for specific domain (build failures, PR reviews); may not generalize
- Not for sale — Internal tooling, not a product (Bitrise AI features are separate)
Competitive Positioning
vs. Other In-House Agents
| System | Differentiation |
|---|---|
| Stripe Minions | Stripe uses Goose fork; Bitrise built from scratch |
| Ramp Inspect | Ramp uses Modal; Bitrise uses Docker containers |
| Commercial agents | Bitrise accepts maintenance cost for flexibility |
Build vs. Buy Decision
Bitrise explicitly chose build over buy because:
- Closed-source risk (Claude Code)
- API lock-in risk (Anthropic-only)
- Need for programmatic checkpoints
- Domain-specific customization (mobile CI)
Ideal Customer Profile
This is internal tooling, not a product for sale. However, the approach is worth studying for teams weighing a similar build.
Good fit for a similar build:
- Specific domain with unique workflows (mobile, CI/CD, etc.)
- Strong Go engineering team
- Long-term concerns about vendor lock-in
- Need for programmatic checkpoints in agent workflows
Poor fit:
- General-purpose coding needs
- Limited engineering resources for maintenance
- Short-term project with disposable code
- No specific vendor lock-in concerns
Viability Assessment
| Factor | Assessment |
|---|---|
| Documentation Quality | Good (4-part blog series) |
| Replicability | Medium (requires Go expertise) |
| Benchmarking Rigor | High (systematic comparison) |
| Architecture Maturity | Medium (newer than Stripe/Ramp) |
| Domain Specificity | High (mobile CI/CD focused) |
Bitrise's detailed blog series provides valuable insights for teams evaluating build vs. buy decisions, particularly around vendor lock-in concerns.
Bottom Line
Bitrise demonstrates that even mid-size companies can build custom coding agents when domain fit and vendor lock-in concerns justify the investment. The key insight: programmatic checkpoints embedded in agent workflows beat bolted-on validation.
Key decision factors: Closed-source risk, API lock-in, need for deep integration with proprietary workflows.
Recommended study for: Teams evaluating build vs. buy, organizations with specific domain workflows, engineers concerned about vendor lock-in.
Not recommended for: General-purpose coding needs, teams without Go expertise, organizations comfortable with Anthropic lock-in.
Outlook: Bitrise's approach validates that the "build" option remains viable for companies with specific domain needs and technical capability, even as commercial agents improve.
Research by Ry Walker Research
Disclosure: Author is CEO of Tembo, which offers agent orchestration as an alternative to building in-house.