Key takeaways
- Built a custom Go agent to avoid vendor lock-in, even though Claude Code performed best
- Programmatic checkpoints embed verification directly in agent workflow
- Provider-agnostic architecture allows model switching mid-conversation
FAQ
Why did Bitrise build their own coding agent?
Despite Claude Code performing best in benchmarks, its closed-source nature and Anthropic-only API posed unacceptable long-term vendor lock-in risks for Bitrise.
What makes Bitrise's agent different?
Programmatic checkpoints embed verification directly in agent workflows, and the provider-agnostic architecture allows switching LLM providers mid-conversation.
What agents did Bitrise evaluate before building their own?
Claude Code (best performance but closed-source), Codex (fast but inconsistent), Gemini (slow response times), and OpenCode (open-source but 2x slower).
Executive Summary
Bitrise, the mobile CI/CD platform, built a custom Go-based AI coding agent after extensively benchmarking Claude Code, Codex, Gemini, and OpenCode. Despite Claude Code performing best, its closed-source nature and Anthropic-only API posed unacceptable vendor lock-in risks. The result: a provider-agnostic agent with programmatic checkpoints that embed verification directly into workflows. [1] The four-part engineering series concluded in February 2026 with the agent absorbed into a unified internal AI Platform, and a stripped-down version of the agent framework was open-sourced as bitrise-ai-core (MIT, "released for educational purposes"). [2] [3]
| Attribute | Value |
|---|---|
| Company | Bitrise |
| Language | Go |
| Foundation | Custom (uses Anthropic APIs primarily; OpenAI and Google also supported) |
| Public Documentation | November 2025 – February 2026 (four-part series) |
| Open Source | Partial — bitrise-ai-core, stripped-down educational release (MIT) |
| Headquarters | Budapest, Hungary |
Product Overview
Bitrise's AI coding agent is designed for the company's own domain, mobile CI/CD. It handles build failures, PR reviews, and automated fixes. The key innovation is the programmatic checkpoint system: verification and validation are embedded directly in the agent workflow rather than bolted on after the fact. [1]
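The checkpoint idea can be illustrated with a minimal Go sketch. It is not Bitrise's implementation: the `Step`, `Checkpoint`, and `RunWithCheckpoints` names and the retry policy are assumptions made for illustration. Each agent step is followed by programmatic checks, and the step is retried when a check fails.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrCheckFailed is a hypothetical sentinel returned by a failing checkpoint.
var ErrCheckFailed = errors.New("checkpoint failed")

// Checkpoint is a programmatic check run right after an agent step,
// for example a wrapper around a test command (assumed, not from the source).
type Checkpoint func(workdir string) error

// Step is one unit of agent work plus the checks that must pass afterwards.
type Step struct {
	Name        string
	Run         func(workdir string) error
	Checkpoints []Checkpoint
}

// RunWithCheckpoints executes each step, runs its checkpoints, and retries
// the step up to maxAttempts times when the step or a checkpoint fails.
func RunWithCheckpoints(workdir string, steps []Step, maxAttempts int) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	for _, s := range steps {
		var lastErr error
		passed := false
		for attempt := 1; attempt <= maxAttempts && !passed; attempt++ {
			if lastErr = s.Run(workdir); lastErr != nil {
				continue
			}
			lastErr = runAll(workdir, s.Checkpoints)
			passed = lastErr == nil
		}
		if !passed {
			return fmt.Errorf("step %q failed after %d attempts: %w", s.Name, maxAttempts, lastErr)
		}
	}
	return nil
}

// runAll returns the first checkpoint error, or nil if every check passes.
func runAll(workdir string, cps []Checkpoint) error {
	for _, cp := range cps {
		if err := cp(workdir); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	attempts := 0
	steps := []Step{{
		Name: "apply-fix",
		Run:  func(string) error { attempts++; return nil },
		Checkpoints: []Checkpoint{func(string) error {
			if attempts < 2 { // simulate tests that pass only on the second try
				return ErrCheckFailed
			}
			return nil
		}},
	}}
	err := RunWithCheckpoints("/tmp/repo", steps, 3)
	fmt.Println("error:", err, "attempts:", attempts)
}
```

Because the checks live inside the loop, a failed verification feeds straight back into the next attempt instead of being discovered after the agent has finished.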
In production, the agent runs sandboxed on ephemeral virtual machines in the Bitrise Build Hub, pre-installed with developer tools, where it can access customer source code and run commands like npm test or xcodebuild test. Instead of an interactive approval flow, it operates autonomously against a pre-defined allowlist of tools. [4]
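As a rough illustration of allowlist-based tool gating, the Go sketch below checks a command line against a table of permitted executables and subcommands. The `Allowlist` type and its matching rules are assumptions, not Bitrise's actual policy format, and a production check would also need to handle shell metacharacters and argument validation.

```go
package main

import (
	"fmt"
	"strings"
)

// Allowlist maps a permitted executable to the subcommands it may run.
// An empty slice means any arguments are accepted for that executable.
// The shape of this type is hypothetical.
type Allowlist map[string][]string

// Permits reports whether the command line is covered by the allowlist.
// It only inspects the executable and its first argument.
func (a Allowlist) Permits(commandLine string) bool {
	fields := strings.Fields(commandLine)
	if len(fields) == 0 {
		return false
	}
	subs, ok := a[fields[0]]
	if !ok {
		return false
	}
	if len(subs) == 0 {
		return true
	}
	if len(fields) < 2 {
		return false
	}
	for _, s := range subs {
		if fields[1] == s {
			return true
		}
	}
	return false
}

func main() {
	allow := Allowlist{"npm": {"test", "ci"}, "xcodebuild": {"test"}}
	for _, cmd := range []string{"npm test", "xcodebuild test", "npm publish", "rm -rf /"} {
		fmt.Printf("%-16s allowed=%v\n", cmd, allow.Permits(cmd))
	}
}
```

With no human in the loop to approve each command, a deny-by-default table like this is what stands between the model and the VM, which is why anything not listed is rejected.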
Key Capabilities
| Capability | Description |
|---|---|
| Provider-agnostic | Switch LLM providers mid-conversation if needed |
| Programmatic checkpoints | Verification embedded in workflow, not separate |
| Central logging | LLM messages stored in provider-agnostic format |
| Multi-agent coordination | Sub-agents dynamically constructed and orchestrated |
| Custom eval framework | Go-based benchmarking runs tests in parallel |
| Sandboxed execution | Ephemeral VMs with tool allowlists, no interactive approval |
| Prompt caching | Implemented across Anthropic, OpenAI, and Gemini providers |
Use Cases
| Use Case | Description |
|---|---|
| PR review | AI-generated review comments on pull requests |
| Build fixer | Automatic resolution of failing CI builds |
| Log analysis | Summarize failed build logs and suggest fixes |
| Dependency updates | Automated dependency management |
Technical Architecture
Before building their own agent, Bitrise built an evaluation framework in Go to benchmark the existing ones.
Eval Framework Flow
Agents (declarative list)
↓
Test Cases (declarative definition)
↓
Docker Containers (parallel execution)
├── Install AI coding agent
├── Clone source Git repository
├── Apply patches, install dependencies
└── Run agent (~10 min)
↓
Verification
├── Programmatic checks (go test ./...)
└── LLM Judges (subjective evaluation)
↓
SQL Database → Metabase Dashboard
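The flow above can be sketched in Go as an agent-by-test-case matrix executed concurrently. This is only a sketch under stated assumptions: `TestCase`, `Result`, `RunAgent`, and `RunMatrix` are hypothetical names, the container work is replaced by a stub function, and the LLM-judge and database steps are omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// TestCase is a declarative test definition (hypothetical shape).
type TestCase struct {
	Name   string
	Verify func(output string) bool // programmatic check on the agent's output
}

// Result records one agent/test-case outcome.
type Result struct {
	Agent  string
	Case   string
	Passed bool
}

// RunAgent stands in for "start a container, install the agent, run it".
type RunAgent func(agent string, tc TestCase) string

// RunMatrix runs every agent against every test case concurrently and
// returns one Result per pair, in a stable agent-major order.
func RunMatrix(agents []string, cases []TestCase, run RunAgent) []Result {
	results := make([]Result, len(agents)*len(cases))
	var wg sync.WaitGroup
	for i, agent := range agents {
		for j, tc := range cases {
			wg.Add(1)
			// Each goroutine writes to its own slot, so no mutex is needed.
			go func(idx int, agent string, tc TestCase) {
				defer wg.Done()
				out := run(agent, tc)
				results[idx] = Result{Agent: agent, Case: tc.Name, Passed: tc.Verify(out)}
			}(i*len(cases)+j, agent, tc)
		}
	}
	wg.Wait()
	return results
}

func main() {
	cases := []TestCase{{
		Name:   "fix-build",
		Verify: func(out string) bool { return out == "ok" },
	}}
	fake := func(agent string, tc TestCase) string {
		if agent == "agent-a" {
			return "ok"
		}
		return "fail"
	}
	for _, r := range RunMatrix([]string{"agent-a", "agent-b"}, cases, fake) {
		fmt.Printf("%s/%s passed=%v\n", r.Agent, r.Case, r.Passed)
	}
}
```

In a real harness the stub would shell out to a container runtime, and the results slice would be written to the SQL database that feeds the dashboard.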
Agent Architecture
Multiple Sub-Agents
├── Dynamically constructed in Go
├── Injected tools/dependencies
└── Custom system prompts per use case
↓
Coordinated Flow
├── Programmatic result collection
└── Provider-agnostic message storage
↓
Output (PR comments, fixes, etc.)
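One way to read "dynamically constructed in Go" with "injected tools" is plain constructor-based dependency injection. The sketch below assumes that reading; `Tool`, `NewTool`, `SubAgent`, `NewSubAgent`, and `Orchestrate` are all hypothetical names, and the model call is replaced by a caller-supplied function.

```go
package main

import (
	"fmt"
	"sort"
)

// Tool is a capability injected into a sub-agent (hypothetical interface).
type Tool interface {
	Name() string
	Call(input string) (string, error)
}

type funcTool struct {
	name string
	fn   func(string) (string, error)
}

func (t funcTool) Name() string                   { return t.name }
func (t funcTool) Call(in string) (string, error) { return t.fn(in) }

// NewTool wraps a plain function as a Tool.
func NewTool(name string, fn func(string) (string, error)) Tool {
	return funcTool{name: name, fn: fn}
}

// SubAgent bundles a use-case-specific system prompt with its tools.
type SubAgent struct {
	UseCase      string
	SystemPrompt string
	tools        map[string]Tool
}

// NewSubAgent builds a sub-agent at runtime from a prompt and injected tools.
func NewSubAgent(useCase, systemPrompt string, tools ...Tool) *SubAgent {
	a := &SubAgent{UseCase: useCase, SystemPrompt: systemPrompt, tools: make(map[string]Tool, len(tools))}
	for _, t := range tools {
		a.tools[t.Name()] = t
	}
	return a
}

// ToolNames returns the injected tool names in sorted order.
func (a *SubAgent) ToolNames() []string {
	names := make([]string, 0, len(a.tools))
	for n := range a.tools {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// Orchestrate runs each sub-agent through run and collects outputs by use case.
func Orchestrate(agents []*SubAgent, run func(*SubAgent) string) map[string]string {
	out := make(map[string]string, len(agents))
	for _, a := range agents {
		out[a.UseCase] = run(a)
	}
	return out
}

func main() {
	noop := func(in string) (string, error) { return in, nil }
	reviewer := NewSubAgent("pr-review", "Review the diff.", NewTool("read_diff", noop), NewTool("post_comment", noop))
	fixer := NewSubAgent("build-fixer", "Fix the failing build.", NewTool("run_tests", noop))
	results := Orchestrate([]*SubAgent{reviewer, fixer}, func(a *SubAgent) string {
		return fmt.Sprintf("%d tools available", len(a.ToolNames()))
	})
	fmt.Println(results)
}
```

Injecting tools per use case keeps each sub-agent's surface area small: a PR reviewer never receives a tool that can modify the build.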
Key Technical Details
| Aspect | Detail |
|---|---|
| Language | Go (matching Bitrise core stack) |
| LLM Provider | Anthropic APIs (but provider-agnostic design) |
| Eval Runtime | ~10 minutes per agent in parallel Docker containers |
| Verification | Programmatic checks + LLM Judges |
| Storage | Provider-agnostic format for model switching |
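To show why a provider-agnostic message format makes mid-conversation switching possible, here is a minimal Go sketch. The `Message` fields, the `Provider` interface, and the stub `echoProvider` are assumptions; real adapters would translate the neutral history into each vendor's request format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Message is a neutral record of one conversation turn (hypothetical schema).
type Message struct {
	Role    string `json:"role"` // "user", "assistant", or "tool"
	Content string `json:"content"`
}

// Provider adapts the neutral history to one vendor's API.
type Provider interface {
	Name() string
	Complete(history []Message) (Message, error)
}

// echoProvider is a stand-in that reports how much history it received.
type echoProvider struct{ name string }

func (p echoProvider) Name() string { return p.name }

func (p echoProvider) Complete(h []Message) (Message, error) {
	return Message{Role: "assistant", Content: fmt.Sprintf("%s saw %d messages", p.name, len(h))}, nil
}

// Conversation owns the history, so the provider can change between turns.
type Conversation struct {
	History []Message
}

// Send appends the user turn, asks the given provider, and stores the reply.
func (c *Conversation) Send(p Provider, text string) (Message, error) {
	c.History = append(c.History, Message{Role: "user", Content: text})
	reply, err := p.Complete(c.History)
	if err != nil {
		return Message{}, err
	}
	c.History = append(c.History, reply)
	return reply, nil
}

func main() {
	conv := &Conversation{}
	if _, err := conv.Send(echoProvider{"provider-a"}, "why did the build fail?"); err != nil {
		panic(err)
	}
	// Switch providers for the next turn; the full history travels with it.
	reply, err := conv.Send(echoProvider{"provider-b"}, "suggest a fix")
	if err != nil {
		panic(err)
	}
	fmt.Println(reply.Content)

	stored, err := json.Marshal(conv.History)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(stored)) // what a central log might persist
}
```

The design choice is that the conversation, not the provider client, owns the history, so the same stored log serves both replay and a switch to another model.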
AI Platform Consolidation (February 2026)
The final post in the series (published February 2, 2026) describes how the coding agent evolved from a standalone feature into a foundational component of a unified internal AI Platform: [2]
| Component | Detail |
|---|---|
| Custom LLM proxy | Routes traffic between customer virtual keys and provider APIs; tracks token usage, enforces budgets |
| Two agent types | Sandboxed agents in VMs/containers with code access; central agents on Kubernetes for instant responses |
| Observability layer | Metrics on requests, token usage, costs, error rates |
| Testing framework | E2E statistical testing with baseline tracking and regression detection |
| Pricing model | Average feature cost calculated internally; no variable token billing exposed to customers |
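The proxy's budget enforcement can be approximated with a per-key token ledger. The Go sketch below is an assumption-laden illustration, not the Bitrise proxy: `Budget`, `Charge`, and `ErrBudgetExceeded` are hypothetical, and it charges a known token count up front, whereas a real proxy typically learns exact usage only from the provider's response.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrBudgetExceeded is returned when a charge would pass the key's limit.
var ErrBudgetExceeded = errors.New("token budget exceeded")

// Budget tracks token usage per virtual key (hypothetical structure).
type Budget struct {
	mu    sync.Mutex
	limit map[string]int
	used  map[string]int
}

// NewBudget creates a ledger with a fixed token limit per virtual key.
func NewBudget(limits map[string]int) *Budget {
	b := &Budget{limit: make(map[string]int, len(limits)), used: make(map[string]int, len(limits))}
	for k, v := range limits {
		b.limit[k] = v
	}
	return b
}

// Charge records tokens against key, or rejects the request if it would
// exceed the limit. Rejected charges leave the ledger unchanged.
func (b *Budget) Charge(key string, tokens int) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	limit, ok := b.limit[key]
	if !ok {
		return fmt.Errorf("unknown virtual key %q", key)
	}
	if b.used[key]+tokens > limit {
		return ErrBudgetExceeded
	}
	b.used[key] += tokens
	return nil
}

// Used returns the tokens consumed so far by key.
func (b *Budget) Used(key string) int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.used[key]
}

func main() {
	budget := NewBudget(map[string]int{"customer-1": 100})
	fmt.Println(budget.Charge("customer-1", 60)) // accepted
	fmt.Println(budget.Charge("customer-1", 50)) // rejected: would reach 110
	fmt.Println(budget.Used("customer-1"))
}
```

Keeping the ledger behind the proxy means usage is tracked in one place regardless of which upstream provider served the request.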
A stripped-down version of the internal agent framework is published on GitHub as bitrise-ai-core (Go, MIT license), explicitly "released for educational purposes." [3]
Benchmarking Findings
Bitrise's evaluation revealed significant differences across agents: [1]
| Agent | Performance | Issues |
|---|---|---|
| Claude Code | Best overall | Closed-source, Anthropic-only API |
| Codex | Fast responses | Lost chain-of-thought, mid-transition issues |
| Gemini | Variable | 10-minute response times without reserved resources |
| OpenCode | Open-source | 2x slower than Claude Code, TUI-coupled |
Post-Benchmark Updates
Since initial benchmarking (per the November 2025 post): [1]
- Sonnet 4.5 — Better context-handling, but limited performance gains
- Haiku 4.5 — Comparable to Sonnet at lower cost
- GPT-5-Codex — Promising but couldn't outperform Anthropic models
- OpenCode — Archived; successor "Crush" in development by Charm team
One benchmarking gotcha worth noting: prompt caching introduced unexpected determinism into eval runs, requiring a --cache-bust flag for accurate performance measurement. [4]
Strengths
- No vendor lock-in — Provider-agnostic architecture allows model switching mid-conversation
- Programmatic checkpoints — Verification embedded in workflow, not bolted on (essential for production-grade AI features)
- Custom eval framework — Go-based benchmark system enables systematic comparison
- Multi-agent coordination — Sub-agents dynamically constructed with injected tools/dependencies
- Central logging — LLM messages stored in format that survives provider changes
Cautions
- Maintenance overhead — Custom agent requires ongoing development investment
- Anthropic-dependent — Anthropic models remain primary despite architectural flexibility (OpenAI and Google managed APIs also supported) [2]
- Scale undisclosed — No public metrics on throughput or adoption percentage as of June 2026; the February 2026 platform post disclosed none
- Mobile CI focus — Agent optimized for specific domain (build failures, PR reviews); may not generalize
- Not for sale — Internal tooling, not a product; the open-sourced bitrise-ai-core is a stripped-down educational release, not the production agent [3] (customer-facing Bitrise AI features are separate [5] [6])
What Developers Say
No substantive third-party practitioner commentary on Bitrise's agent was found as of June 11, 2026. The Hacker News submission of the announcement post drew minimal engagement (2 points, no comments), and we found no notable practitioner threads on X discussing hands-on experience. [7] The public record remains essentially first-party: Bitrise's own four-part engineering series. This section will be updated if practitioner discussion emerges.
Competitive Positioning
vs. Other In-House Agents
| System | Differentiation |
|---|---|
| Stripe Minions | Stripe uses a Goose fork; Bitrise built from scratch |
| Ramp Inspect | Ramp uses Modal; Bitrise uses Docker containers |
| Commercial agents | Bitrise accepts maintenance cost for flexibility |
Build vs. Buy Decision
Bitrise explicitly chose build over buy because:
- Closed-source risk (Claude Code)
- API lock-in risk (Anthropic-only)
- Need for programmatic checkpoints
- Domain-specific customization (mobile CI)
Ideal Customer Profile
This is internal tooling, not a product for sale. The approach is still worth studying for teams weighing a similar build:
Good fit for similar build:
- Specific domain with unique workflows (mobile, CI/CD, etc.)
- Strong Go engineering team
- Long-term concerns about vendor lock-in
- Need for programmatic checkpoints in agent workflows
Poor fit:
- General-purpose coding needs
- Limited engineering resources for maintenance
- Short-term project with disposable code
- No specific vendor lock-in concerns
Viability Assessment
| Factor | Assessment |
|---|---|
| Documentation Quality | Good (4-part blog series, completed February 2026) |
| Replicability | Medium-High (requires Go expertise; bitrise-ai-core provides an educational starting point) |
| Benchmarking Rigor | High (systematic comparison, statistical e2e testing with regression detection) |
| Architecture Maturity | Medium-High (consolidated into a unified internal AI Platform in early 2026) |
| Domain Specificity | High (mobile CI/CD focused) |
Bitrise's detailed blog series — covering the coding agent, a browser-integrated AI copilot, the Build Hub sandboxed agent, and the unifying AI Platform — provides valuable insights for teams evaluating build vs. buy decisions, particularly around vendor lock-in concerns. [8] [2]
Bottom Line
Bitrise demonstrates that even mid-size companies can build custom coding agents when domain fit and vendor lock-in concerns justify the investment. The key insight: programmatic checkpoints embedded in agent workflows beat bolted-on validation.
Key decision factors: Closed-source risk, API lock-in, need for deep integration with proprietary workflows.
Recommended study for: Teams evaluating build vs. buy, organizations with specific domain workflows, engineers concerned about vendor lock-in.
Not recommended for: General-purpose coding needs, teams without Go expertise, organizations comfortable with Anthropic lock-in.
Outlook: Bitrise's approach validates that the "build" option remains viable for companies with specific domain needs and technical capability, even as commercial agents improve. The early-2026 consolidation into a shared internal AI Platform — LLM proxy, budget enforcement, two agent runtimes, statistical regression testing — shows the in-house bet maturing from a single agent into infrastructure, and the educational bitrise-ai-core release lowers the bar for teams who want to study the pattern. [2] [3]
Research by Ry Walker Research • methodology
Disclosure: Author is CEO of Tembo, which offers agent orchestration as an alternative to building in-house.
Sources
- [1] Why we ditched frontier AI agents and built our own (Bitrise Blog, Nov 2025)
- [2] Building Bitrise's AI platform: Scaling AI features across teams (Bitrise Blog, Feb 2026)
- [3] bitrise-ai-core (GitHub)
- [4] How we brought AI to the Bitrise Build Hub (Bitrise Blog, Nov 2025)
- [5] Bitrise AI Platform
- [6] Bitrise Pricing (AI Features)
- [7] Choosing the best AI coding agent for Bitrise (Hacker News)
- [8] Building Bitrise's context-aware, browser-integrated AI copilot (Bitrise Blog)
- [9] Bitrise - Google Cloud AI Agent Finder