Cloudflare Internal Agents

Cloudflare's internal AI engineering stack serves 3,683 internal users (93% of R&D) and handled 47.95M AI requests in 30 days. The flagship in-house build is a multi-agent AI Code Reviewer (up to 7 specialized reviewers plus a supervisor) that ran 131,246 reviews across 48,095 merge requests at a $1.19 average cost per review, covering 100% of repos on the standard internal GitLab CI pipeline.

Key takeaways

  • 3,683 active internal users of AI coding tools — 93% of R&D and 60% of the whole company — generating 47.95M AI requests and 241B tokens through AI Gateway in a 30-day window, eleven months after the rollout began
  • The in-house flagship is the AI Code Reviewer: up to 7 specialized reviewer agents (security, performance, code quality, docs, release, Codex compliance, AGENTS.md validation) coordinated by a supervisor agent — 131,246 review runs on 48,095 GitLab MRs in 30 days, median 3m39s and $1.19 average per review, with only 0.6% "break glass" manual overrides
  • Honest framing: the engineer-facing assistant layer is third-party (Windsurf and the open-source OpenCode). Cloudflare's strategy is "build the harness, not the agent" — the proprietary work is the review orchestration, MCP Server Portal (13 servers, 182+ tools), auto-generated AGENTS.md across ~3,900 repos, and AI Gateway routing
  • Cost discipline at scale: 85.7% prompt cache hit rate across ~120B review tokens, risk-tiered reviews (trivial MRs get 2 agents, full reviews get all 7+), and dynamic model failback chains controlled by a Workers KV config plane

FAQ

What are Cloudflare's internal AI agents?

An internal AI engineering stack built on Cloudflare's own shipping products (AI Gateway, Workers AI, Agents SDK, Sandbox SDK), plus a multi-agent AI Code Reviewer that runs on every merge request in Cloudflare's internal GitLab. The assistant layer engineers type into is third-party (Windsurf, OpenCode); the review system and surrounding harness are built in-house.

How does Cloudflare's AI Code Reviewer work?

Each merge request is risk-tiered (trivial, lite, full) and reviewed by up to 7 specialized agents — security, performance, code quality, documentation, release management, Engineering Codex compliance, and AGENTS.md validation. A supervisor coordinator agent deduplicates findings, filters false positives, and posts one structured review. It ships as a one-line GitLab CI component include.

What does Cloudflare's AI code review cost?

Median $0.98 and average $1.19 per review (P99 $4.45) over 131,246 runs in a 30-day window, kept low by an 85.7% prompt cache hit rate and routing lightweight review tasks to cheaper models like Kimi K2.5 on Workers AI.

What models does Cloudflare use internally?

Multi-model by design: Claude Opus 4.7 and GPT-5.4 for the review coordinator, Claude Sonnet 4.6 and GPT-5.3 Codex for security, quality, and performance reviewers, and Kimi K2.5 on Workers AI for documentation and other lightweight passes. Frontier labs handle 91% of AI Gateway requests; models are hot-swappable via a Workers KV control plane.

How is Cloudflare's approach different from Coinbase Forge or Ramp Inspect?

Coinbase, Ramp, and Stripe built in-house agents that write code; Cloudflare deliberately bought the code-writing layer (Windsurf, OpenCode) and built the enforcement and context layer instead — review orchestration, MCP portal, AGENTS.md generation, and gateway routing. It is the clearest public example of the "build the harness, not the agent" strategy.

Executive Summary

In April 2026, Cloudflare published two companion engineering posts detailing how it runs AI-assisted development for its entire R&D organization: an internal AI engineering stack built on its own shipping products, and a multi-agent AI Code Reviewer that gates every merge request in its internal GitLab.[1][2] In a 30-day window, 3,683 internal users (93% of R&D, 60% of the whole company) generated 47.95 million AI requests and 241.37 billion tokens through AI Gateway — eleven months after a tiger team called iMARS started the rollout.[1]

The honest framing matters: Cloudflare did not build its own coding agent. Engineers write code with third-party tools — Windsurf (434.9K messages/month) and the open-source OpenCode (27.08M messages/month).[1] What Cloudflare built in-house is everything around the agent: the AI Code Reviewer (up to 7 specialized reviewer agents plus a supervisor, 131,246 review runs across 48,095 MRs in 30 days at $1.19 average per review), an MCP Server Portal exposing 13 servers and 182+ tools, auto-generated AGENTS.md files across ~3,900 repositories, and gateway-level model routing.[2][1] It is the clearest public case study of the "build the harness, not the agent" strategy.

Attribute | Value
Company | Cloudflare
Type | Internal stack + multi-agent code reviewer
Disclosed | April 20, 2026 (two engineering blog posts)
Adoption | 3,683 users; 93% of R&D, 60% company-wide
Assistant layer | Third-party: Windsurf, OpenCode
In-house layer | AI Code Reviewer, MCP Portal, AGENTS.md generator, AI Gateway routing
VCS | Internal GitLab (CI component)
Scale | 47.95M AI requests / 241B tokens per 30 days
Review cost | $0.98 median, $1.19 average per review

The Stack: What's Bought vs. What's Built

Cloudflare dogfoods its own platform — AI Gateway, Workers AI, Agents SDK, Durable Objects, Sandbox SDK, and Workflows all sit underneath the internal tooling — but the layer engineers actually type into is third-party.[1]

Layer | Component | Build or Buy
Assistant | Windsurf, OpenCode | Buy / adopt OSS
Routing | AI Gateway via a single proxy Worker (per-user attribution, model catalog, permissions) | Build
Context | MCP Server Portal: 13 servers, 182+ tools (Backstage, GitLab, Jira, Sentry, Elasticsearch, Prometheus, Google Workspace) | Build
Context | AGENTS.md auto-generated across ~3,900 repos from the Backstage catalog (16K+ entities) | Build
Enforcement | AI Code Reviewer on 100% of standard CI pipelines | Build
Enforcement | Engineering Codex standards exposed as agent skills | Build

Two architecture decisions the post calls out: routing everything through a single proxy Worker from day one ("One thing we got right early"), and collapsing 34+ MCP tool schemas into two portal-level Code Mode tools, cutting ~15,000 tokens of context overhead per session.[1]

The velocity claim: the 4-week rolling average of merge requests climbed from ~5,600/week to over 8,700, with a peak week of 10,952 — nearly double the Q4 baseline.[1] Third-party coverage framed the disclosure as roughly 3,700 engineers running on Cloudflare's own stack.[3]


The AI Code Reviewer

The reviewer is the flagship in-house build, written on top of OpenCode and shipped as a GitLab CI component teams enable with a one-line include.[2] Cloudflare evaluated commercial review tools first; the recurring theme was that "they just didn't offer enough flexibility and customisation for an organisation the size of Cloudflare."[2]

Multi-Agent Architecture

Up to 7 specialized reviewers run per merge request, supervised by a coordinator:[2]

Agent | Role | Findings (30 days)
Code Quality | Logic errors, general issues (most prolific) | 74,898
Documentation | Completeness and clarity | 26,432
Performance | Regressions, optimization opportunities | 14,615
Security | Injection, auth bypass, secrets, crypto | 11,985
Compliance (Codex) | Internal Engineering RFC adherence | 9,654
AGENTS.md Validator | Keeps repo AI instructions current | 6,878
Release Management | Release-related changes | 745
Review Coordinator | Deduplicates, filters false positives, decides approval | n/a
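
The coordinator's deduplication step can be pictured with a small sketch. The source does not describe how duplicates are detected, so everything here is an assumption for illustration: the `Finding` shape, the `dedupeFindings` helper, and the rule that two findings are duplicates when they point at the same file and line, with the more severe one kept.

```typescript
// Hypothetical finding shape; the real schema is not described in the source.
type Severity = "info" | "warning" | "critical";

interface Finding {
  agent: string; // which reviewer produced the finding
  file: string;
  line: number;
  message: string;
  severity: Severity;
}

const SEVERITY_RANK: Record<Severity, number> = { info: 0, warning: 1, critical: 2 };

// Assumption: findings that share a file and line are treated as duplicates.
function dedupeFindings(findings: Finding[]): Finding[] {
  const byLocation = new Map<string, Finding>();
  for (const f of findings) {
    const key = `${f.file}:${f.line}`;
    const existing = byLocation.get(key);
    // Keep the most severe finding per location; the first one wins ties.
    if (!existing || SEVERITY_RANK[f.severity] > SEVERITY_RANK[existing.severity]) {
      byLocation.set(key, f);
    }
  }
  return Array.from(byLocation.values());
}

// Two reviewers flag the same line; only the more severe finding survives.
const merged = dedupeFindings([
  { agent: "code-quality", file: "src/auth.ts", line: 42, message: "Unchecked token", severity: "warning" },
  { agent: "security", file: "src/auth.ts", line: 42, message: "Auth bypass risk", severity: "critical" },
  { agent: "documentation", file: "README.md", line: 3, message: "Missing setup step", severity: "info" },
]);
console.log(merged.map((f) => `${f.agent} ${f.file}:${f.line}`));
```

A production coordinator would likely also compare the content of findings, since the post describes false-positive filtering as a separate judgment made by the supervising model rather than a fixed rule.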

Merge requests are risk-tiered: trivial (≤10 lines) gets 2 agents and a downgraded coordinator, lite (≤100 lines) gets 4, and full reviews launch all 7+.[2] Model selection is tiered the same way — Claude Opus 4.7 / GPT-5.4 for the coordinator, Claude Sonnet 4.6 / GPT-5.3 Codex for the heavy reviewers, Kimi K2.5 on Workers AI for documentation and release passes — with circuit-breaker failback chains and a Workers KV control plane for instant model swaps without deploys.[2]
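
The tiering and failback described above can be sketched as follows. The line thresholds and agent counts come from the post; the rest is assumed for illustration, including the `pickTier` and `resolveModel` helpers, the `ModelChain` config shape (a stand-in for whatever the Workers KV control plane actually stores), the model identifier strings, and the preference order within a chain.

```typescript
type Tier = "trivial" | "lite" | "full";

// Thresholds as reported; counting total changed lines is an assumption.
function pickTier(changedLines: number): Tier {
  if (changedLines <= 10) return "trivial";
  if (changedLines <= 100) return "lite";
  return "full";
}

// Agent counts per tier as reported. Full reviews launch "7+", so 7 is a floor here.
const AGENTS_PER_TIER: Record<Tier, number> = { trivial: 2, lite: 4, full: 7 };

// Hypothetical config entry: one ordered failback chain per agent role.
interface ModelChain {
  role: string;
  models: string[]; // ordered by preference
}

// Return the first model in the chain whose circuit breaker is not open.
function resolveModel(chain: ModelChain, openCircuits: Set<string>): string | undefined {
  return chain.models.find((m) => !openCircuits.has(m));
}

// Illustrative identifiers only; the real model IDs are not given in the source.
const coordinator: ModelChain = {
  role: "coordinator",
  models: ["claude-opus-4.7", "gpt-5.4"],
};

const tier = pickTier(64);
const model = resolveModel(coordinator, new Set<string>(["claude-opus-4.7"]));
console.log(tier, AGENTS_PER_TIER[tier], model);
```

Keeping the chain in configuration rather than code is what would make a model swap a data change instead of a deploy, which matches the post's description of instant swaps.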

Production Metrics (March 10 – April 9, 2026)

Metric | Value
Review runs | 131,246
Merge requests reviewed | 48,095 (across 5,169 repos)
CI coverage | 100% of repos on standard pipeline
Median review duration | 3m 39s (P95: 7m 29s)
Cost per review | $0.98 median / $1.19 average / $4.45 P99
Total findings | 159,103 (1.2 avg per review)
Tokens processed | ~120B
Prompt cache hit rate | 85.7%
"Break glass" manual overrides | 288 (0.6% of MRs)

All figures from the Cloudflare engineering post.[2]
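
A few ratios follow directly from the published counts, and a short calculation makes them easy to check. The inputs are the figures in the table above; the total-spend line is a derived estimate (average cost multiplied by run count), not a number Cloudflare published.

```typescript
// Published figures for March 10 to April 9, 2026.
const reviewRuns = 131246;
const mergeRequests = 48095;
const totalFindings = 159103;
const breakGlassOverrides = 288;
const avgCostUsd = 1.19;

// Ratios derived from the published counts.
const runsPerMr = reviewRuns / mergeRequests; // about 2.7 runs per merge request
const findingsPerRun = totalFindings / reviewRuns; // about 1.2, matching the table
const overrideRate = breakGlassOverrides / mergeRequests; // about 0.6% of MRs

// Derived estimate, not a published figure: roughly $156K for the window.
const estimatedSpendUsd = reviewRuns * avgCostUsd;

console.log({
  runsPerMr: runsPerMr.toFixed(2),
  findingsPerRun: findingsPerRun.toFixed(2),
  overridePct: (overrideRate * 100).toFixed(2),
  estimatedSpendUsd: Math.round(estimatedSpendUsd),
});
```

The figure of roughly 2.7 runs per merge request would be consistent with re-reviews triggered by new commits, though the post does not break the run count down that way.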

Lessons Cloudflare Reports

  • Negative prompting is the work — "telling an LLM what not to do is where the actual prompt engineering value resides." The security reviewer's prompt explicitly lists non-flags: defense-in-depth suggestions when primary defenses are adequate, issues in unchanged code, "consider using library X" advice.[2]
  • Re-reviews are stateful — new commits trigger incremental reviews that auto-resolve fixed findings, re-emit unfixed ones, and respect developer "won't fix" replies.[2]
  • Prompt injection is treated as a real threat — boundary tags are stripped from user-controlled content and MR descriptions are sanitized before reaching the agents.[2]
  • Acknowledged limits — the system struggles with architectural intent, cross-system impact, and timing-dependent concurrency bugs. Cloudflare's own words: "This isn't a replacement for human code review, at least not yet with today's models."[2]
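
The stateful re-review behavior in the list above can be sketched as a reconciliation between the previous review's findings and the current one. This is a minimal illustration under stated assumptions: the `PriorFinding` and `ReReviewPlan` shapes, the `planReReview` helper, and the idea that each finding carries a stable fingerprint `id` are all hypothetical, since the source does not describe the data model.

```typescript
// Hypothetical shapes; the real data model is not described in the source.
interface PriorFinding {
  id: string; // assumed stable fingerprint for the finding
  wontFix: boolean; // the developer replied "won't fix"
}

interface ReReviewPlan {
  resolve: string[]; // no longer detected: auto-resolve the thread
  reEmit: string[]; // still detected and not waived: raise again
  suppressed: string[]; // still detected but waived by the developer
  fresh: string[]; // first seen in this review
}

function planReReview(prior: PriorFinding[], currentIds: string[]): ReReviewPlan {
  const current = new Set<string>(currentIds);
  const priorIds = new Set<string>(prior.map((p) => p.id));
  const plan: ReReviewPlan = { resolve: [], reEmit: [], suppressed: [], fresh: [] };

  for (const p of prior) {
    if (!current.has(p.id)) plan.resolve.push(p.id);
    else if (p.wontFix) plan.suppressed.push(p.id);
    else plan.reEmit.push(p.id);
  }
  for (const id of currentIds) {
    if (!priorIds.has(id)) plan.fresh.push(id);
  }
  return plan;
}

// Illustrative finding IDs only.
const plan = planReReview(
  [
    { id: "sql-injection@db.ts", wontFix: false },
    { id: "missing-docs@api.ts", wontFix: true },
    { id: "n-plus-one@orders.ts", wontFix: false },
  ],
  ["missing-docs@api.ts", "n-plus-one@orders.ts", "unused-var@util.ts"],
);
console.log(plan);
```

The hard part in practice is the fingerprint itself, because line numbers shift between commits; this sketch sidesteps that by assuming stable IDs already exist.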

Strengths

  • Scale with receipts — 131,246 review runs, 48,095 MRs, 100% CI coverage, and per-review cost published to the cent. Few companies disclose internal agent economics this precisely.[2]
  • Honest buy-vs-build split — adopting Windsurf/OpenCode for the assistant layer while building the review and context harness avoids competing with fast-moving agent vendors on their own turf.[1]
  • Cost engineering — risk tiers, 85.7% cache hit rate, shared context files, and cheap-model routing keep average review cost at $1.19 despite up to 7 frontier-model agents per MR.[2]
  • Trust calibration — the 0.6% break-glass override rate is a rare quantified signal that engineers mostly accept the AI gate; overrides are tracked in telemetry rather than forbidden.[2]
  • Context as infrastructure — Backstage's 16K-entity graph, auto-generated AGENTS.md in ~3,900 repos, and the MCP portal mean agents start with organizational context instead of cold prompts.[1]

Cautions

  • Not a product — none of this is purchasable as a unit; it is a case study, though most building blocks (AI Gateway, Workers AI, Agents SDK, OpenCode) are publicly available.[1]
  • Self-promotional substrate — the stack post doubles as marketing for Cloudflare's developer platform; metrics are self-reported and unaudited.[1]
  • Cost scales with diff size — Cloudflare concedes 500-file refactors through seven frontier models are expensive, and HN commenters questioned spending ~20% of review runs on sub-10-line MRs.[2][4]
  • No code-writing agent disclosed — unlike Coinbase Forge or Stripe Minions, there is no in-house background agent producing PRs; velocity gains route through third-party assistants.[1]
  • GitLab-first — the reviewer ships as a GitLab CI component; the plugin system claims VCS portability but only GitLab is in production.[2]

What Developers Say

The code review post drew a 145-point, 56-comment Hacker News thread, with debate centered on cost-effectiveness and signal-to-noise:[4]

"Trivial reviews cost 20 cents on average... the labour cost of having an intern spend ~30-60s is likely close to $0.20" — OtherShrezzing, HN[4]

"An intern would take 2-3 minutes per PR. At ~$100K salary, that's $1.6 per review, about 10X your estimate" — alain94040, HN, in reply[4]

"Built a similar system with Copilot and GitHub Actions. Team loves it. ROI is so high, just use strongest models available" — plmpsu, HN[4]

"Most people experienced poor signal-to-noise ratios, making AI review a burden rather than help" — afro88, HN[4]

Trade coverage was more uniformly positive, framing the system as evidence that disciplined orchestration — not bigger single models — is what makes AI review work at enterprise scale.[5]


Competitive Positioning

Direct Comparisons (In-House Coding Agents)

Company | System | What's in-house | Key metric
Cloudflare | AI Code Reviewer + stack | Review orchestration, MCP portal, AGENTS.md; assistant layer bought | 131K reviews/30d, $1.19/review, 0.6% override
Coinbase | Forge + Mux | Full coding agent + orchestration | 5% of merged PRs
Ramp | Inspect | Full coding agent | 30% of PRs (disclosed)
Stripe | Minions | Full coding agent | 1,000+ PRs/week
Abnormal AI | Internal agents | Full coding agent | 13% of PRs
Browserbase | bb | Generalized Slack agent | 100% feature-request coverage

When the Cloudflare Pattern Fits

  • You want AI leverage on every change without betting on one assistant vendor — enforce at CI, stay agnostic at the editor
  • Your org is large enough that review consistency and standards compliance are the bottleneck, not raw code generation
  • You already maintain a service catalog (Backstage or similar) that agents can mine for context

Viability Assessment

Dimension | Assessment
Financial Health | Backed by Cloudflare (NYSE: NET); internal tooling, not a revenue line
Market Position | Most-documented public example of the review-side in-house pattern
Innovation Pace | 11 months from kickoff to 93% R&D adoption; next phase is background agents on Durable Objects + Sandbox containers[1]
Community/Ecosystem | Built on open-source OpenCode; same review agents runnable locally via a /fullreview TUI command[2]
Long-term Outlook | Durable: doubles as dogfooding for products Cloudflare sells, so investment is self-reinforcing

Bottom Line

Cloudflare's disclosure is the strongest public evidence yet for the "build the harness, not the agent" thesis: buy or adopt the code-writing layer, and concentrate in-house engineering on review orchestration, context infrastructure, and model routing, the parts that encode your standards and your leverage. The published economics ($1.19 average per review, 85.7% cache hit rate, 0.6% override rate at 100% CI coverage) give other enterprises a concrete cost-and-trust baseline of the kind vendor marketing rarely provides.[2]

Recommended for: platform and DevEx teams designing CI-native AI review, or anyone deciding the buy-vs-build boundary for internal agents.

Not recommended for: teams looking for a purchasable product — this is a pattern to study, assembled from Cloudflare's platform plus OpenCode.

Outlook: Cloudflare says background coding agents (Durable Objects + Sandbox SDK) are the next phase — if shipped, the in-house surface expands from reviewing code to writing it.[1]


Research by Ry Walker Research