
Bitrise AI Coding Agent

Bitrise built a custom Go-based coding agent after benchmarking Claude Code, Codex, and Gemini — vendor lock-in concerns drove the decision.

Key takeaways

  • Built custom Go agent to avoid vendor lock-in despite Claude Code performing best
  • Programmatic checkpoints embed verification directly in agent workflow
  • Provider-agnostic architecture allows model switching mid-conversation

FAQ

Why did Bitrise build their own coding agent?

Despite Claude Code performing best in benchmarks, its closed-source nature and Anthropic-only API posed unacceptable long-term vendor lock-in risks for Bitrise.

What makes Bitrise's agent different?

Programmatic checkpoints embed verification directly in agent workflows, and the provider-agnostic architecture allows switching LLM providers mid-conversation.

What agents did Bitrise evaluate before building their own?

Claude Code (best performance but closed-source), Codex (fast but inconsistent), Gemini (slow response times), and OpenCode (open-source but 2x slower).

Executive Summary

Bitrise, the mobile CI/CD platform, built a custom Go-based AI coding agent after extensively benchmarking Claude Code, Codex, Gemini, and OpenCode. Despite Claude Code performing best, its closed-source nature and Anthropic-only API posed unacceptable vendor lock-in risks. The result: a provider-agnostic agent with programmatic checkpoints that embed verification directly into workflows.

| Attribute | Value |
| --- | --- |
| Company | Bitrise |
| Language | Go |
| Foundation | Custom (uses Anthropic APIs) |
| Public Documentation | November 2025 |
| Headquarters | Budapest, Hungary |

Product Overview

Bitrise's AI coding agent is designed for their specific domain: mobile CI/CD. The agent handles build failures, PR reviews, and automated fixes. The key innovation is the programmatic checkpoint system: verification and validation embedded directly in the agent workflow, not bolted on after the fact.
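
Bitrise has not published the agent's source, so the following is only a minimal Go sketch of what a programmatic checkpoint can look like; the names (Checkpoint, goTestCheckpoint, runStep) are illustrative, and the `go test ./...` command mirrors the programmatic check mentioned in their eval framework.

```go
package agent

import (
	"context"
	"fmt"
	"os/exec"
)

// Checkpoint is a verification step that runs inside the agent workflow.
type Checkpoint func(ctx context.Context, repoDir string) error

// goTestCheckpoint runs the repository's test suite; the workflow only
// advances when this (and any other registered checkpoints) pass.
func goTestCheckpoint(ctx context.Context, repoDir string) error {
	cmd := exec.CommandContext(ctx, "go", "test", "./...")
	cmd.Dir = repoDir
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("go test failed: %w\n%s", err, out)
	}
	return nil
}

// runStep executes one agent step and then every checkpoint, so verification
// is part of the loop rather than a separate post-hoc review.
func runStep(ctx context.Context, repoDir string, step func(context.Context) error, checks []Checkpoint) error {
	if err := step(ctx); err != nil {
		return err
	}
	for _, check := range checks {
		if err := check(ctx, repoDir); err != nil {
			return fmt.Errorf("checkpoint rejected step: %w", err)
		}
	}
	return nil
}
```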

Key Capabilities

| Capability | Description |
| --- | --- |
| Provider-agnostic | Switch LLM providers mid-conversation if needed |
| Programmatic checkpoints | Verification embedded in workflow, not separate |
| Central logging | LLM messages stored in provider-agnostic format |
| Multi-agent coordination | Sub-agents dynamically constructed and orchestrated |
| Custom eval framework | Go-based benchmarking runs tests in parallel |
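
The provider-agnostic capability in the table above can be sketched as a thin interface plus a neutral message format; the Provider interface, Message struct, and SwitchProvider method below are hypothetical names, not Bitrise's actual API, but they show why a vendor-neutral history makes mid-conversation switching possible.

```go
package agent

import "context"

// Message is stored in a provider-neutral shape so the conversation history
// can be replayed against any backend.
type Message struct {
	Role    string // "system", "user", "assistant", "tool"
	Content string
}

// Provider abstracts an LLM backend; the agent only depends on this interface.
type Provider interface {
	Complete(ctx context.Context, msgs []Message) (Message, error)
}

// Conversation keeps the neutral history and can swap providers between turns.
type Conversation struct {
	provider Provider
	history  []Message
}

// SwitchProvider replaces the backend mid-conversation; because the history
// does not mirror any vendor's wire format, no messages need to be dropped.
func (c *Conversation) SwitchProvider(p Provider) { c.provider = p }

// Next sends the accumulated history to whichever provider is active.
func (c *Conversation) Next(ctx context.Context, user string) (Message, error) {
	c.history = append(c.history, Message{Role: "user", Content: user})
	reply, err := c.provider.Complete(ctx, c.history)
	if err != nil {
		return Message{}, err
	}
	c.history = append(c.history, reply)
	return reply, nil
}
```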

Use Cases

| Use Case | Description |
| --- | --- |
| PR review | AI-generated review comments on pull requests |
| Build fixer | Automatic resolution of failing CI builds |
| Log analysis | Summarize failed build logs and suggest fixes |
| Dependency updates | Automated dependency management |

Technical Architecture

Bitrise built an evaluation framework in Go to benchmark existing agents before committing to building their own.

Eval Framework Flow

Agents (declarative list)
    ↓
Test Cases (declarative definition)
    ↓
Docker Containers (parallel execution)
├── Install AI coding agent
├── Clone source Git repository
├── Apply patches, install dependencies
└── Run agent (~10 min)
    ↓
Verification
├── Programmatic checks (go test ./...)
└── LLM Judges (subjective evaluation)
    ↓
SQL Database → Metabase Dashboard
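
The flow above can be approximated with a small amount of Go. The TestCase and Result types and the Run function below are illustrative only, and they assume a runOne callback that handles the per-container work (install the agent, clone the repo, apply patches, run for roughly ten minutes) before verification results are written to the database.

```go
package evals

import (
	"context"
	"sync"
)

// TestCase is a declarative definition: which repo to clone, which patches to
// apply, and how to verify the agent's result.
type TestCase struct {
	Name        string
	RepoURL     string
	Patches     []string
	Checks      []string // shell commands, e.g. "go test ./..."
	JudgePrompt string   // rubric for the LLM judge
}

// Result captures both objective and subjective verification.
type Result struct {
	Case       string
	Agent      string
	ChecksPass bool
	JudgeScore float64
}

// Run executes every (agent, test case) pair in parallel containers and
// collects results for the SQL database / dashboard.
func Run(ctx context.Context, agents []string, cases []TestCase,
	runOne func(context.Context, string, TestCase) Result) []Result {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results []Result
	)
	for _, agent := range agents {
		for _, tc := range cases {
			wg.Add(1)
			go func(agent string, tc TestCase) {
				defer wg.Done()
				// One Docker container per run: install agent, clone, patch, execute.
				r := runOne(ctx, agent, tc)
				mu.Lock()
				results = append(results, r)
				mu.Unlock()
			}(agent, tc)
		}
	}
	wg.Wait()
	return results
}
```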

Agent Architecture

Multiple Sub-Agents
├── Dynamically constructed in Go
├── Injected tools/dependencies
└── Custom system prompts per use case
    ↓
Coordinated Flow
├── Programmatic result collection
└── Provider-agnostic message storage
    ↓
Output (PR comments, fixes, etc.)
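
A rough Go sketch of the architecture above: sub-agents constructed dynamically with injected tools and a per-use-case system prompt, coordinated by ordinary code that collects results programmatically. The Tool interface, SubAgent struct, and coordinate function are hypothetical stand-ins for whatever Bitrise's internal types look like.

```go
package agent

import "context"

// Tool is a capability injected into a sub-agent: read a file, run a build,
// post a PR comment, and so on.
type Tool interface {
	Name() string
	Call(ctx context.Context, input string) (string, error)
}

// SubAgent is assembled dynamically per use case, with its own system prompt
// and only the tools that use case needs.
type SubAgent struct {
	Name         string
	SystemPrompt string
	Tools        []Tool
	Run          func(ctx context.Context, task string) (string, error)
}

// coordinate runs sub-agents in order and collects their results
// programmatically, so the outer workflow (not free-form hand-off text)
// decides what happens next.
func coordinate(ctx context.Context, task string, agents []*SubAgent) (map[string]string, error) {
	results := make(map[string]string, len(agents))
	for _, a := range agents {
		out, err := a.Run(ctx, task)
		if err != nil {
			return nil, err
		}
		results[a.Name] = out
	}
	return results, nil
}
```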

Key Technical Details

| Aspect | Detail |
| --- | --- |
| Language | Go (matching Bitrise core stack) |
| LLM Provider | Anthropic APIs (but provider-agnostic design) |
| Eval Runtime | ~10 minutes per agent in parallel Docker containers |
| Verification | Programmatic checks + LLM judges |
| Storage | Provider-agnostic format for model switching |
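
One plausible shape for the provider-agnostic storage row above, sketched in Go under the assumption of a SQL backend; the LoggedMessage struct, the llm_messages table, and the column names are invented for illustration, not taken from Bitrise's schema.

```go
package agent

import (
	"database/sql"
	"encoding/json"
	"time"
)

// LoggedMessage is a provider-neutral record for central logging; because it
// does not mirror any vendor's wire format, the same rows stay usable after a
// model or provider switch.
type LoggedMessage struct {
	ConversationID string    `json:"conversation_id"`
	Role           string    `json:"role"`
	Content        string    `json:"content"`
	Model          string    `json:"model"` // which backend produced it, for later analysis
	CreatedAt      time.Time `json:"created_at"`
}

// logMessage persists one turn; the payload is stored as JSON so schema
// changes do not require rewriting history.
func logMessage(db *sql.DB, m LoggedMessage) error {
	payload, err := json.Marshal(m)
	if err != nil {
		return err
	}
	_, err = db.Exec(
		"INSERT INTO llm_messages (conversation_id, payload, created_at) VALUES ($1, $2, $3)",
		m.ConversationID, payload, m.CreatedAt,
	)
	return err
}
```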

Benchmarking Findings

Bitrise's evaluation revealed significant differences across agents:

| Agent | Performance | Issues |
| --- | --- | --- |
| Claude Code | Best overall | Closed-source, Anthropic-only API |
| Codex | Fast responses | Lost chain-of-thought, mid-transition issues |
| Gemini | Variable | 10-minute response times without reserved resources |
| OpenCode | Open-source | 2x slower than Claude Code, TUI-coupled |

Post-Benchmark Updates

Since initial benchmarking:

  • Sonnet 4.5 — Better context-handling, but limited performance gains
  • Haiku 4.5 — Comparable to Sonnet at lower cost
  • GPT-5-Codex — Promising but couldn't outperform Anthropic models
  • OpenCode — Archived; successor "Crush" in development by Charm team

Strengths

  • No vendor lock-in — Provider-agnostic architecture allows model switching mid-conversation
  • Programmatic checkpoints — Verification embedded in workflow, not bolted on (essential for production-grade AI features)
  • Custom eval framework — Go-based benchmark system enables systematic comparison
  • Multi-agent coordination — Sub-agents dynamically constructed with injected tools/dependencies
  • Central logging — LLM messages stored in format that survives provider changes

Cautions

  • Maintenance overhead — Custom agent requires ongoing development investment
  • Anthropic-dependent — Still uses Anthropic APIs despite architectural flexibility
  • Scale undisclosed — No public metrics on throughput or adoption percentage
  • Mobile CI focus — Agent optimized for specific domain (build failures, PR reviews); may not generalize
  • Not for sale — Internal tooling, not a product (Bitrise AI features are separate)

Competitive Positioning

vs. Other In-House Agents

| System | Differentiation |
| --- | --- |
| Stripe Minions | Stripe uses a Goose fork; Bitrise built from scratch |
| Ramp Inspect | Ramp uses Modal; Bitrise uses Docker containers |
| Commercial agents | Bitrise accepts the maintenance cost in exchange for flexibility |

Build vs. Buy Decision

Bitrise explicitly chose build over buy because:

  1. Closed-source risk (Claude Code)
  2. API lock-in risk (Anthropic-only)
  3. Need for programmatic checkpoints
  4. Domain-specific customization (mobile CI)

Ideal Customer Profile

This is internal tooling, not a product for sale. However, the approach is worth studying for teams weighing a similar build.

Good fit for a similar build:

  • Specific domain with unique workflows (mobile, CI/CD, etc.)
  • Strong Go engineering team
  • Long-term concerns about vendor lock-in
  • Need for programmatic checkpoints in agent workflows

Poor fit:

  • General-purpose coding needs
  • Limited engineering resources for maintenance
  • Short-term project with disposable code
  • No specific vendor lock-in concerns

Viability Assessment

| Factor | Assessment |
| --- | --- |
| Documentation Quality | Good (4-part blog series) |
| Replicability | Medium (requires Go expertise) |
| Benchmarking Rigor | High (systematic comparison) |
| Architecture Maturity | Medium (newer than Stripe/Ramp) |
| Domain Specificity | High (mobile CI/CD focused) |

Bitrise's detailed blog series provides valuable insights for teams evaluating build vs. buy decisions, particularly around vendor lock-in concerns.


Bottom Line

Bitrise demonstrates that even mid-size companies can build custom coding agents when domain fit and vendor lock-in concerns justify the investment. The key insight: programmatic checkpoints embedded in agent workflows beat bolted-on validation.

Key decision factors: Closed-source risk, API lock-in, need for deep integration with proprietary workflows.

Recommended study for: Teams evaluating build vs. buy, organizations with specific domain workflows, engineers concerned about vendor lock-in.

Not recommended for: General-purpose coding needs, teams without Go expertise, organizations comfortable with Anthropic lock-in.

Outlook: Bitrise's approach validates that the "build" option remains viable for companies with specific domain needs and technical capability, even as commercial agents improve.


Research by Ry Walker Research • methodology

Disclosure: Author is CEO of Tembo, which offers agent orchestration as an alternative to building in-house.