
Bitrise AI Coding Agent

Bitrise built a custom Go-based coding agent after benchmarking Claude Code, Codex, Gemini, and OpenCode. Vendor lock-in concerns drove the decision.

Key takeaways

  • Bitrise built a custom Go agent to avoid vendor lock-in, even though Claude Code performed best in its benchmarks
  • Programmatic checkpoints embed verification directly in the agent workflow
  • A provider-agnostic architecture allows switching models mid-conversation

FAQ

Why did Bitrise build their own coding agent?

Despite Claude Code performing best in benchmarks, its closed-source nature and Anthropic-only API posed unacceptable long-term vendor lock-in risks for Bitrise.

What makes Bitrise's agent different?

Programmatic checkpoints embed verification directly in agent workflows, and the provider-agnostic architecture allows switching LLM providers mid-conversation.

What agents did Bitrise evaluate before building their own?

Claude Code (best performance but closed-source), Codex (fast but inconsistent), Gemini (slow response times), and OpenCode (open-source but 2x slower).

Executive Summary

Bitrise, the mobile CI/CD platform, built a custom Go-based AI coding agent after extensively benchmarking Claude Code, Codex, Gemini, and OpenCode. Despite Claude Code performing best, its closed-source nature and Anthropic-only API posed unacceptable vendor lock-in risks. The result: a provider-agnostic agent with programmatic checkpoints that embed verification directly into workflows. [1] The four-part engineering series concluded in February 2026 with the agent absorbed into a unified internal AI Platform, and a stripped-down version of the agent framework was open-sourced as bitrise-ai-core (MIT, "released for educational purposes"). [2] [3]

| Attribute | Value |
| --- | --- |
| Company | Bitrise |
| Language | Go |
| Foundation | Custom (uses Anthropic APIs primarily; OpenAI and Google also supported) |
| Public Documentation | November 2025 to February 2026 (four-part series) |
| Open Source | Partial: bitrise-ai-core, stripped-down educational release (MIT) |
| Headquarters | Budapest, Hungary |

Product Overview

Bitrise's AI coding agent is designed for the company's own domain: mobile CI/CD. The agent handles build failures, PR reviews, and automated fixes. The key innovation is the programmatic checkpoint system: verification and validation are embedded directly in the agent workflow rather than bolted on after the fact. [1]
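
To make the checkpoint idea concrete, here is a minimal sketch in Go. It is not Bitrise's implementation: the `Step`, `Checkpoint`, and `RunWorkflow` names are hypothetical, and the LLM-driven edit and the test run are replaced by stand-in functions. The point is only that verification sits inside the workflow loop, so a later step never runs on top of an unverified earlier one.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTestsFailed is a stand-in for the error a real test run would produce.
var ErrTestsFailed = errors.New("tests failed")

// Checkpoint is a hypothetical verification that must pass before the
// workflow may continue.
type Checkpoint struct {
	Name  string
	Check func(workdir string) error
}

// Step is one unit of agent work plus the checkpoints that guard it.
type Step struct {
	Name        string
	Run         func(workdir string) error
	Checkpoints []Checkpoint
}

// RunWorkflow executes the steps in order and stops at the first step or
// checkpoint that returns an error.
func RunWorkflow(workdir string, steps []Step) error {
	for _, s := range steps {
		if err := s.Run(workdir); err != nil {
			return fmt.Errorf("step %q: %w", s.Name, err)
		}
		for _, c := range s.Checkpoints {
			if err := c.Check(workdir); err != nil {
				return fmt.Errorf("checkpoint %q after step %q: %w", c.Name, s.Name, err)
			}
		}
	}
	return nil
}

func main() {
	steps := []Step{{
		Name: "apply-fix",
		Run:  func(string) error { return nil }, // stand-in for an LLM-driven edit
		Checkpoints: []Checkpoint{{
			Name:  "tests-pass",
			Check: func(string) error { return ErrTestsFailed }, // stand-in for running the test suite
		}},
	}}
	if err := RunWorkflow("/tmp/repo", steps); err != nil {
		fmt.Println("workflow stopped:", err)
	}
}
```

In a real agent the failed checkpoint would more likely be fed back to the model as context for another attempt than simply returned, but the control flow is the same.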

In production, the agent runs sandboxed on ephemeral virtual machines in the Bitrise Build Hub, pre-installed with developer tools, where it can access customer source code and run commands like npm test or xcodebuild test. Instead of an interactive approval flow, it operates autonomously against a pre-defined allowlist of tools. [4]
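
A tool allowlist of this kind can be sketched as a lookup performed before any command is executed. The structure below is an assumption for illustration; the source does not describe how Bitrise represents its allowlist, and the `Allowlist` type and its entries are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Allowlist maps a permitted executable to the subcommands it may run.
// The entries used in main are illustrative, not Bitrise's actual list.
type Allowlist map[string][]string

// Permits reports whether the command line names an allowed executable
// followed by one of its allowed subcommands.
func (a Allowlist) Permits(cmdline string) bool {
	fields := strings.Fields(cmdline)
	if len(fields) < 2 {
		return false
	}
	subs, ok := a[fields[0]]
	if !ok {
		return false
	}
	for _, s := range subs {
		if fields[1] == s {
			return true
		}
	}
	return false
}

func main() {
	allow := Allowlist{"npm": {"test"}, "xcodebuild": {"test"}}
	for _, c := range []string{"npm test", "xcodebuild test", "npm publish", "curl example.com"} {
		fmt.Printf("%-18s allowed=%v\n", c, allow.Permits(c))
	}
}
```

A production sandbox would check the parsed argument vector it is about to execute rather than a shell string, so that quoting or chaining tricks cannot slip past the check.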

Key Capabilities

| Capability | Description |
| --- | --- |
| Provider-agnostic | Switch LLM providers mid-conversation if needed |
| Programmatic checkpoints | Verification embedded in workflow, not separate |
| Central logging | LLM messages stored in provider-agnostic format |
| Multi-agent coordination | Sub-agents dynamically constructed and orchestrated |
| Custom eval framework | Go-based benchmarking runs tests in parallel |
| Sandboxed execution | Ephemeral VMs with tool allowlists, no interactive approval |
| Prompt caching | Implemented across Anthropic, OpenAI, and Gemini providers |

Use Cases

| Use Case | Description |
| --- | --- |
| PR review | AI-generated review comments on pull requests |
| Build fixer | Automatic resolution of failing CI builds |
| Log analysis | Summarize failed build logs and suggest fixes |
| Dependency updates | Automated dependency management |

Technical Architecture

Before writing its own agent, Bitrise built an evaluation framework in Go to benchmark the existing ones.

Eval Framework Flow

Agents (declarative list)
    ↓
Test Cases (declarative definition)
    ↓
Docker Containers (parallel execution)
├── Install AI coding agent
├── Clone source Git repository
├── Apply patches, install dependencies
└── Run agent (~10 min)
    ↓
Verification
├── Programmatic checks (go test ./...)
└── LLM Judges (subjective evaluation)
    ↓
SQL Database → Metabase Dashboard
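
The flow above can be approximated in a few lines of Go. This is a sketch under stated assumptions, not the Bitrise framework: `TestCase`, `Result`, `Runner`, and `RunMatrix` are hypothetical names, and the Docker container run is replaced by a plain function so the example stays self-contained.

```go
package main

import (
	"fmt"
	"sync"
)

// TestCase is a hypothetical declarative eval definition.
type TestCase struct {
	Name   string
	Repo   string
	Prompt string
	Verify func(output string) bool // programmatic check, e.g. wrapping `go test ./...`
}

// Result records the outcome of one agent on one test case.
type Result struct {
	Agent, Case string
	Passed      bool
}

// Runner abstracts "run this agent on this case". In the flow above that
// is a Docker container; here it is any function.
type Runner func(agent string, tc TestCase) string

// RunMatrix runs every agent against every test case concurrently.
// Each goroutine writes to its own slice index, so no lock is needed.
func RunMatrix(agents []string, cases []TestCase, run Runner) []Result {
	results := make([]Result, len(agents)*len(cases))
	var wg sync.WaitGroup
	for i, a := range agents {
		for j, tc := range cases {
			wg.Add(1)
			go func(idx int, agent string, tc TestCase) {
				defer wg.Done()
				out := run(agent, tc)
				results[idx] = Result{Agent: agent, Case: tc.Name, Passed: tc.Verify(out)}
			}(i*len(cases)+j, a, tc)
		}
	}
	wg.Wait()
	return results
}

func main() {
	cases := []TestCase{{
		Name:   "fix-failing-test",
		Verify: func(out string) bool { return out == "PASS" },
	}}
	// Stand-in runner: pretend only one agent produces a passing result.
	run := func(agent string, tc TestCase) string {
		if agent == "agent-a" {
			return "PASS"
		}
		return "FAIL"
	}
	for _, r := range RunMatrix([]string{"agent-a", "agent-b"}, cases, run) {
		fmt.Printf("%s on %s: passed=%v\n", r.Agent, r.Case, r.Passed)
	}
}
```

The LLM-judge step and the SQL storage are omitted; both would consume the `Result` slice after `RunMatrix` returns.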

Agent Architecture

Multiple Sub-Agents
├── Dynamically constructed in Go
├── Injected tools/dependencies
└── Custom system prompts per use case
    ↓
Coordinated Flow
├── Programmatic result collection
└── Provider-agnostic message storage
    ↓
Output (PR comments, fixes, etc.)
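
One common Go idiom for "dynamically constructed sub-agents with injected tools" is functional options. The sketch below assumes that idiom for illustration; the source does not say which construction pattern Bitrise uses, and `Tool`, `SubAgent`, `WithTool`, and `WithSystemPrompt` are hypothetical names.

```go
package main

import "fmt"

// Tool is a capability injected into a sub-agent.
type Tool interface {
	Name() string
	Call(input string) (string, error)
}

// SubAgent pairs a system prompt with the tools it is allowed to use.
type SubAgent struct {
	Name         string
	SystemPrompt string
	Tools        map[string]Tool
}

// Option configures a SubAgent at construction time.
type Option func(*SubAgent)

// WithTool injects a tool, keyed by its name.
func WithTool(t Tool) Option {
	return func(a *SubAgent) { a.Tools[t.Name()] = t }
}

// WithSystemPrompt sets the use-case-specific system prompt.
func WithSystemPrompt(p string) Option {
	return func(a *SubAgent) { a.SystemPrompt = p }
}

// NewSubAgent builds a sub-agent from the given options.
func NewSubAgent(name string, opts ...Option) *SubAgent {
	a := &SubAgent{Name: name, Tools: map[string]Tool{}}
	for _, o := range opts {
		o(a)
	}
	return a
}

// readLog is a stand-in tool used only for this example.
type readLog struct{}

func (readLog) Name() string { return "read_log" }

func (readLog) Call(input string) (string, error) {
	return "log for " + input, nil
}

func main() {
	reviewer := NewSubAgent("pr-reviewer", WithSystemPrompt("Review the diff."))
	fixer := NewSubAgent("build-fixer",
		WithSystemPrompt("Fix the failing build."),
		WithTool(readLog{}),
	)
	fmt.Println(reviewer.Name, "tools:", len(reviewer.Tools))
	fmt.Println(fixer.Name, "tools:", len(fixer.Tools))
}
```

Because tools arrive through the constructor, a test can hand a sub-agent fake tools, and a coordinator can give each sub-agent only the capabilities its use case needs.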

Key Technical Details

| Aspect | Detail |
| --- | --- |
| Language | Go (matching Bitrise core stack) |
| LLM Provider | Anthropic APIs (but provider-agnostic design) |
| Eval Runtime | ~10 minutes per agent in parallel Docker containers |
| Verification | Programmatic checks + LLM Judges |
| Storage | Provider-agnostic format for model switching |
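
Storing messages in a provider-agnostic format is what makes mid-conversation switching possible: the conversation owns the history, and each provider only translates it. The sketch below illustrates that separation under assumed names (`Message`, `Provider`, `Conversation`); the fake provider stands in for real Anthropic, OpenAI, or Google adapters, whose request formats are not shown.

```go
package main

import "fmt"

// Message is a hypothetical neutral record, independent of any vendor's
// wire format.
type Message struct {
	Role    string // "system", "user", "assistant", or "tool"
	Content string
}

// Provider turns the neutral history into a vendor request and returns
// the reply in neutral form.
type Provider interface {
	Name() string
	Complete(history []Message) (Message, error)
}

// Conversation owns the history, so any provider can continue it.
type Conversation struct {
	History []Message
}

// Ask appends the prompt, sends the full history to p, and records the reply.
func (c *Conversation) Ask(p Provider, prompt string) (Message, error) {
	c.History = append(c.History, Message{Role: "user", Content: prompt})
	reply, err := p.Complete(c.History)
	if err != nil {
		return Message{}, err
	}
	c.History = append(c.History, reply)
	return reply, nil
}

// fakeProvider is a stand-in that reports how much history it received.
type fakeProvider struct{ name string }

func (f fakeProvider) Name() string { return f.name }

func (f fakeProvider) Complete(h []Message) (Message, error) {
	return Message{
		Role:    "assistant",
		Content: fmt.Sprintf("%s saw %d messages", f.name, len(h)),
	}, nil
}

func main() {
	conv := &Conversation{}
	first, _ := conv.Ask(fakeProvider{"provider-a"}, "Why did the build fail?")
	second, _ := conv.Ask(fakeProvider{"provider-b"}, "Propose a fix.")
	fmt.Println(first.Content)
	fmt.Println(second.Content)
}
```

The second provider receives the turns produced by the first, which is the property the table row describes. Real adapters would also have to map tool calls and tool results, which differ more between vendors than plain text does.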

AI Platform Consolidation (February 2026)

The final post in the series (published February 2, 2026) describes how the coding agent evolved from a standalone feature into a foundational component of a unified internal AI Platform: [2]

| Component | Detail |
| --- | --- |
| Custom LLM proxy | Routes traffic between customer virtual keys and provider APIs; tracks token usage, enforces budgets |
| Two agent types | Sandboxed agents in VMs/containers with code access; central agents on Kubernetes for instant responses |
| Observability layer | Metrics on requests, token usage, costs, error rates |
| Testing framework | E2E statistical testing with baseline tracking and regression detection |
| Pricing model | Average feature cost calculated internally; no variable token billing exposed to customers |
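
The budget-enforcement role of the proxy can be illustrated with a small in-memory ledger keyed by virtual key. This is a sketch only: the source does not describe the proxy's internals, and `Ledger`, `Charge`, and `ErrBudgetExceeded` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrBudgetExceeded is returned when a request would push a key past its limit.
var ErrBudgetExceeded = errors.New("token budget exceeded")

// Ledger tracks token usage per virtual key. It is safe for concurrent use.
type Ledger struct {
	mu     sync.Mutex
	budget map[string]int
	used   map[string]int
}

// NewLedger creates a ledger with a fixed token budget per virtual key.
func NewLedger(budgets map[string]int) *Ledger {
	return &Ledger{budget: budgets, used: map[string]int{}}
}

// Charge records usage for key and rejects the request if it would
// exceed the key's budget. Rejected requests are not recorded.
func (l *Ledger) Charge(key string, tokens int) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	limit, ok := l.budget[key]
	if !ok {
		return fmt.Errorf("unknown virtual key %q", key)
	}
	if l.used[key]+tokens > limit {
		return ErrBudgetExceeded
	}
	l.used[key] += tokens
	return nil
}

// Used returns the tokens recorded so far for key.
func (l *Ledger) Used(key string) int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.used[key]
}

func main() {
	ledger := NewLedger(map[string]int{"customer-1": 1000})
	fmt.Println(ledger.Charge("customer-1", 600)) // within budget
	fmt.Println(ledger.Charge("customer-1", 600)) // would exceed the budget
	fmt.Println("used:", ledger.Used("customer-1"))
}
```

An actual proxy only learns the exact token count once the provider responds, so it would typically reserve an estimate up front and reconcile afterwards; the ledger above skips that step.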

A stripped-down version of the internal agent framework is published on GitHub as bitrise-ai-core (Go, MIT license), explicitly "released for educational purposes." [3]


Benchmarking Findings

Bitrise's evaluation revealed significant differences across agents: [1]

| Agent | Performance | Issues |
| --- | --- | --- |
| Claude Code | Best overall | Closed-source, Anthropic-only API |
| Codex | Fast responses | Lost chain-of-thought, mid-transition issues |
| Gemini | Variable | 10-minute response times without reserved resources |
| OpenCode | Open-source | 2x slower than Claude Code, TUI-coupled |

Post-Benchmark Updates

Since initial benchmarking (per the November 2025 post): [1]

  • Sonnet 4.5 — Better context-handling, but limited performance gains
  • Haiku 4.5 — Comparable to Sonnet at lower cost
  • GPT-5-Codex — Promising but couldn't outperform Anthropic models
  • OpenCode — Archived; successor "Crush" in development by Charm team

One benchmarking gotcha worth noting: prompt caching introduced unexpected determinism into eval runs, requiring a --cache-bust flag for accurate performance measurement. [4]
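
Provider-side prompt caches generally match on the prompt prefix, so one way to defeat them during a benchmark is to make the prefix unique per run. The source does not say how Bitrise's --cache-bust flag works; the sketch below shows that one possible approach, with `BuildSystemPrompt` as a hypothetical helper.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// BuildSystemPrompt returns the system prompt to send. When cacheBust is
// true it prepends a random marker so the prompt prefix differs on every
// run and cannot match a previously cached prefix.
func BuildSystemPrompt(base string, cacheBust bool) (string, error) {
	if !cacheBust {
		return base, nil
	}
	nonce := make([]byte, 8)
	if _, err := rand.Read(nonce); err != nil {
		return "", err
	}
	return fmt.Sprintf("[run:%s]\n%s", hex.EncodeToString(nonce), base), nil
}

func main() {
	for _, bust := range []bool{false, true} {
		prompt, err := BuildSystemPrompt("You are a build-fixing agent.", bust)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("cacheBust=%v\n%s\n\n", bust, prompt)
	}
}
```

The marker goes at the start on purpose: appending it to the end would leave the shared prefix intact, and the cache would still be hit.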


Strengths

  • No vendor lock-in — Provider-agnostic architecture allows model switching mid-conversation
  • Programmatic checkpoints — Verification embedded in workflow, not bolted on (essential for production-grade AI features)
  • Custom eval framework — Go-based benchmark system enables systematic comparison
  • Multi-agent coordination — Sub-agents dynamically constructed with injected tools/dependencies
  • Central logging — LLM messages stored in format that survives provider changes

Cautions

  • Maintenance overhead — Custom agent requires ongoing development investment
  • Anthropic-dependent — Anthropic models remain primary despite architectural flexibility (OpenAI and Google managed APIs also supported) [2]
  • Scale undisclosed — No public metrics on throughput or adoption percentage as of June 2026; the February 2026 platform post disclosed none
  • Mobile CI focus — Agent optimized for specific domain (build failures, PR reviews); may not generalize
  • Not for sale — Internal tooling, not a product; the open-sourced bitrise-ai-core is a stripped-down educational release, not the production agent [3] (customer-facing Bitrise AI features are separate [5] [6])

What Developers Say

No substantive third-party practitioner commentary on Bitrise's agent was found as of June 11, 2026. The Hacker News submission of the announcement post drew minimal engagement (2 points, no comments), and we found no notable practitioner threads on X discussing hands-on experience. [7] The public record remains essentially first-party: Bitrise's own four-part engineering series. This section will be updated if practitioner discussion emerges.


Competitive Positioning

vs. Other In-House Agents

| System | Differentiation |
| --- | --- |
| Stripe Minions | Stripe uses a Goose fork; Bitrise built from scratch |
| Ramp Inspect | Ramp uses Modal; Bitrise uses Docker containers |
| Commercial agents | Bitrise accepts maintenance cost for flexibility |

Build vs. Buy Decision

Bitrise explicitly chose build over buy because:

  1. Closed-source risk (Claude Code)
  2. API lock-in risk (Anthropic-only)
  3. Need for programmatic checkpoints
  4. Domain-specific customization (mobile CI)

Ideal Customer Profile

This is internal tooling, not a product for sale. The approach is still worth studying for teams weighing a similar build.

Good fit for a similar build:

  • Specific domain with unique workflows (mobile, CI/CD, etc.)
  • Strong Go engineering team
  • Long-term concerns about vendor lock-in
  • Need for programmatic checkpoints in agent workflows

Poor fit:

  • General-purpose coding needs
  • Limited engineering resources for maintenance
  • Short-term project with disposable code
  • No specific vendor lock-in concerns

Viability Assessment

| Factor | Assessment |
| --- | --- |
| Documentation Quality | Good (4-part blog series, completed February 2026) |
| Replicability | Medium-High (requires Go expertise; bitrise-ai-core provides an educational starting point) |
| Benchmarking Rigor | High (systematic comparison, statistical e2e testing with regression detection) |
| Architecture Maturity | Medium-High (consolidated into a unified internal AI Platform in early 2026) |
| Domain Specificity | High (mobile CI/CD focused) |

Bitrise's detailed blog series — covering the coding agent, a browser-integrated AI copilot, the Build Hub sandboxed agent, and the unifying AI Platform — provides valuable insights for teams evaluating build vs. buy decisions, particularly around vendor lock-in concerns. [8] [2]


Bottom Line

Bitrise demonstrates that even mid-size companies can build custom coding agents when domain fit and vendor lock-in concerns justify the investment. The key insight: programmatic checkpoints embedded in agent workflows beat bolted-on validation.

Key decision factors: Closed-source risk, API lock-in, need for deep integration with proprietary workflows.

Recommended study for: Teams evaluating build vs. buy, organizations with specific domain workflows, engineers concerned about vendor lock-in.

Not recommended for: General-purpose coding needs, teams without Go expertise, organizations comfortable with Anthropic lock-in.

Outlook: Bitrise's approach validates that the "build" option remains viable for companies with specific domain needs and technical capability, even as commercial agents improve. The early-2026 consolidation into a shared internal AI Platform — LLM proxy, budget enforcement, two agent runtimes, statistical regression testing — shows the in-house bet maturing from a single agent into infrastructure, and the educational bitrise-ai-core release lowers the bar for teams who want to study the pattern. [2] [3]


Research by Ry Walker Research • methodology

Disclosure: Author is CEO of Tembo, which offers agent orchestration as an alternative to building in-house.