Cloudflare Internal Agents | Ry Walker Research

Key takeaways

3,683 active internal users of AI coding tools — 93% of R&D and 60% of the whole company — generating 47.95M AI requests and 241B tokens through AI Gateway in a 30-day window, eleven months after the rollout began
The in-house flagship is the AI Code Reviewer: up to 7 specialized reviewer agents (security, performance, code quality, docs, release, Codex compliance, AGENTS.md validation) coordinated by a supervisor agent — 131,246 review runs on 48,095 GitLab MRs in 30 days, median 3m39s and $1.19 average per review, with only 0.6% "break glass" manual overrides
Honest framing: the engineer-facing assistant layer is third-party (Windsurf and the open-source OpenCode). Cloudflare's strategy is "build the harness, not the agent" — the proprietary work is the review orchestration, MCP Server Portal (13 servers, 182+ tools), auto-generated AGENTS.md across ~3,900 repos, and AI Gateway routing
Cost discipline at scale: 85.7% prompt cache hit rate across ~120B review tokens, risk-tiered reviews (trivial MRs get 2 agents, full reviews get all 7+), and dynamic model failback chains controlled by a Workers KV config plane

FAQ

What are Cloudflare's internal AI agents?

An internal AI engineering stack built on Cloudflare's own shipping products (AI Gateway, Workers AI, Agents SDK, Sandbox SDK), plus a multi-agent AI Code Reviewer that runs on every merge request in Cloudflare's internal GitLab. The assistant layer engineers type into is third-party (Windsurf, OpenCode); the review system and surrounding harness are built in-house.

How does Cloudflare's AI Code Reviewer work?

Each merge request is risk-tiered (trivial, lite, full) and reviewed by up to 7 specialized agents — security, performance, code quality, documentation, release management, Engineering Codex compliance, and AGENTS.md validation. A supervisor coordinator agent deduplicates findings, filters false positives, and posts one structured review. It ships as a one-line GitLab CI component include.

What does Cloudflare's AI code review cost?

Median $0.98 and average $1.19 per review (P99 $4.45) over 131,246 runs in a 30-day window, kept low by an 85.7% prompt cache hit rate and routing lightweight review tasks to cheaper models like Kimi K2.5 on Workers AI.

What models does Cloudflare use internally?

Multi-model by design: Claude Opus 4.7 and GPT-5.4 for the review coordinator, Claude Sonnet 4.6 and GPT-5.3 Codex for security, quality, and performance reviewers, and Kimi K2.5 on Workers AI for documentation and other lightweight passes. Frontier labs handle 91% of AI Gateway requests; models are hot-swappable via a Workers KV control plane.
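As a rough illustration of hot-swappable models with fallback, the sketch below uses a plain dictionary as a stand-in for the Workers KV control plane. Everything here is an assumption for illustration: the model identifier strings, the `FAILURE_THRESHOLD` value, the function names, and the idea that the second coordinator model acts as the fallback for the first. It also omits the reset timer a production circuit breaker would need.

```python
from typing import Callable, Optional

# Stand-in for a Workers KV namespace: role -> ordered model chain.
# Model identifiers are illustrative strings, not real API model IDs.
MODEL_CHAINS = {
    "coordinator": ["claude-opus-4.7", "gpt-5.4"],
    "documentation": ["kimi-k2.5"],
}

FAILURE_THRESHOLD = 3  # hypothetical: consecutive failures before a model is skipped
_failures: dict[str, int] = {}


def call_with_fallback(
    role: str, prompt: str, call_model: Callable[[str, str], str]
) -> str:
    """Try each model configured for `role` in order, skipping open circuits."""
    last_error: Optional[Exception] = None
    # The chain is read at call time, so a config edit applies to the next call.
    for model in MODEL_CHAINS[role]:
        if _failures.get(model, 0) >= FAILURE_THRESHOLD:
            continue  # circuit is open for this model
        try:
            result = call_model(model, prompt)
        except Exception as exc:
            _failures[model] = _failures.get(model, 0) + 1
            last_error = exc
            continue
        _failures[model] = 0  # a success closes the circuit again
        return result
    raise RuntimeError(f"no model available for role {role!r}") from last_error
```

Because the chain is looked up on every call, replacing `MODEL_CHAINS["coordinator"]` changes routing without redeploying the caller, which is the property the article attributes to the KV control plane.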

How is Cloudflare's approach different from Coinbase Forge or Ramp Inspect?

Coinbase, Ramp, and Stripe built in-house agents that write code; Cloudflare deliberately bought the code-writing layer (Windsurf, OpenCode) and built the enforcement and context layer instead — review orchestration, MCP portal, AGENTS.md generation, and gateway routing. It is the clearest public example of the "build the harness, not the agent" strategy.

Executive Summary

In April 2026, Cloudflare published two companion engineering posts detailing how it runs AI-assisted development for its entire R&D organization: an internal AI engineering stack built on its own shipping products, and a multi-agent AI Code Reviewer that gates every merge request in its internal GitLab.^[1]^[2] In a 30-day window, 3,683 internal users (93% of R&D, 60% of the whole company) generated 47.95 million AI requests and 241.37 billion tokens through AI Gateway — eleven months after a tiger team called iMARS started the rollout.^[1]

The honest framing matters: Cloudflare did not build its own coding agent. Engineers write code with third-party tools — Windsurf (434.9K messages/month) and the open-source OpenCode (27.08M messages/month).^[1] What Cloudflare built in-house is everything around the agent: the AI Code Reviewer (up to 7 specialized reviewer agents plus a supervisor, 131,246 review runs across 48,095 MRs in 30 days at $1.19 average per review), an MCP Server Portal exposing 13 servers and 182+ tools, auto-generated AGENTS.md files across ~3,900 repositories, and gateway-level model routing.^[2]^[1] It is the clearest public case study of the "build the harness, not the agent" strategy.

Attribute	Value
Company	Cloudflare
Type	Internal stack + multi-agent code reviewer
Disclosed	April 20, 2026 (two engineering blog posts)
Adoption	3,683 users; 93% of R&D, 60% company-wide
Assistant layer	Third-party: Windsurf, OpenCode
In-house layer	AI Code Reviewer, MCP Portal, AGENTS.md generator, AI Gateway routing
VCS	Internal GitLab (CI component)
Scale	47.95M AI requests / 241B tokens per 30 days
Review cost	$0.98 median, $1.19 average per review

The Stack: What's Bought vs. What's Built

Cloudflare dogfoods its own platform — AI Gateway, Workers AI, Agents SDK, Durable Objects, Sandbox SDK, and Workflows all sit underneath the internal tooling — but the layer engineers actually type into is third-party.^[1]

Layer	Component	Build or Buy
Assistant	Windsurf, OpenCode	Buy / adopt OSS
Routing	AI Gateway via a single proxy Worker (per-user attribution, model catalog, permissions)	Build
Context	MCP Server Portal: 13 servers, 182+ tools (Backstage, GitLab, Jira, Sentry, Elasticsearch, Prometheus, Google Workspace)	Build
Context	AGENTS.md auto-generated across ~3,900 repos from the Backstage catalog (16K+ entities)	Build
Enforcement	AI Code Reviewer on 100% of standard CI pipelines	Build
Enforcement	Engineering Codex standards exposed as agent skills	Build

Two architecture decisions the post calls out: routing everything through a single proxy Worker from day one ("One thing we got right early"), and collapsing 34+ MCP tool schemas into two portal-level Code Mode tools, cutting ~15,000 tokens of context overhead per session.^[1]

The velocity claim: the 4-week rolling average of merge requests climbed from ~5,600 per week to over 8,700, and the peak week reached 10,952, nearly double the Q4 baseline.^[1] Third-party coverage framed the disclosure as roughly 3,700 engineers running on Cloudflare's own stack.^[3]

The AI Code Reviewer

The reviewer is the flagship in-house build, written on top of OpenCode and shipped as a GitLab CI component teams enable with a one-line include.^[2] Cloudflare evaluated commercial review tools first; the recurring theme was that "they just didn't offer enough flexibility and customisation for an organisation the size of Cloudflare."^[2]

Multi-Agent Architecture

Up to 7 specialized reviewers run per merge request, supervised by a coordinator:^[2]

Agent	Role	Findings (30 days)
Code Quality	Logic errors, general issues (most prolific)	74,898
Documentation	Completeness and clarity	26,432
Performance	Regressions, optimization opportunities	14,615
Security	Injection, auth bypass, secrets, crypto	11,985
Compliance (Codex)	Internal Engineering RFC adherence	9,654
AGENTS.md Validator	Keeps repo AI instructions current	6,878
Release Management	Release-related changes	745
Review Coordinator	Deduplicates, filters false positives, decides approval	—
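The coordinator's deduplication step can be pictured with a small deterministic stand-in. The article describes the coordinator as an agent, so the real merging is presumably model-driven. The `Finding` fields, the severity scale, and the rule of keeping the most severe finding per file and line are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    agent: str      # which reviewer produced it, e.g. "security"
    file: str
    line: int
    severity: int   # hypothetical scale: higher means more severe
    message: str


def deduplicate(findings: list[Finding]) -> list[Finding]:
    """Keep one finding per (file, line), preferring the highest severity."""
    best: dict[tuple[str, int], Finding] = {}
    for finding in findings:
        key = (finding.file, finding.line)
        if key not in best or finding.severity > best[key].severity:
            best[key] = finding
    # Stable ordering so the posted review reads top to bottom per file.
    return sorted(best.values(), key=lambda f: (f.file, f.line))
```

If the security and code quality reviewers both flag the same line, only the more severe finding survives, which is the effect a single structured review needs.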

Merge requests are risk-tiered: trivial (≤10 lines) gets 2 agents and a downgraded coordinator, lite (≤100 lines) gets 4, and full reviews launch all 7+.^[2] Model selection is tiered the same way — Claude Opus 4.7 / GPT-5.4 for the coordinator, Claude Sonnet 4.6 / GPT-5.3 Codex for the heavy reviewers, Kimi K2.5 on Workers AI for documentation and release passes — with circuit-breaker failback chains and a Workers KV control plane for instant model swaps without deploys.^[2]
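The tiering rule above can be sketched as a small function. The thresholds and agent counts come from the article; the names `ReviewPlan` and `plan_review` are hypothetical, and the sketch assumes diff size is the only input, which may not match the real classifier. Whether lite reviews also use a downgraded coordinator is not stated, so the sketch assumes they do not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewPlan:
    tier: str
    agent_count: int
    downgraded_coordinator: bool


def plan_review(changed_lines: int) -> ReviewPlan:
    """Map a merge request's changed-line count to a review tier."""
    if changed_lines < 0:
        raise ValueError("changed_lines must be non-negative")
    if changed_lines <= 10:
        # Trivial: 2 agents and a cheaper coordinator model.
        return ReviewPlan("trivial", 2, True)
    if changed_lines <= 100:
        # Lite: 4 agents; coordinator downgrade is an assumption (not stated).
        return ReviewPlan("lite", 4, False)
    # Full: the article says "7+", so 7 is used here as the minimum.
    return ReviewPlan("full", 7, False)
```

Putting the cheapest check first means the large share of small merge requests never touches the full set of frontier-model reviewers.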

Production Metrics (March 10 – April 9, 2026)

Metric	Value
Review runs	131,246
Merge requests reviewed	48,095 (across 5,169 repos)
CI coverage	100% of repos on standard pipeline
Median review duration	3m 39s (P95: 7m 29s)
Cost per review	$0.98 median / $1.19 average / $4.45 P99
Total findings	159,103 (1.2 avg per review)
Tokens processed	~120B
Prompt cache hit rate	85.7%
"Break glass" manual overrides	288 (0.6% of MRs)

All figures from the Cloudflare engineering post.^[2]
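A quick consistency check on the published figures, using only numbers from the two tables in this section. The variable names are illustrative, and the implied total spend is a derived estimate (rounded average cost times runs), not a figure Cloudflare reports.

```python
# Figures copied from the production metrics table.
review_runs = 131_246
merge_requests = 48_095
total_findings = 159_103
overrides = 288
average_cost_usd = 1.19

# Per-agent findings copied from the multi-agent architecture table.
findings_by_agent = {
    "code_quality": 74_898,
    "documentation": 26_432,
    "performance": 14_615,
    "security": 11_985,
    "compliance": 9_654,
    "agents_md": 6_878,
    "release": 745,
}

findings_per_run = total_findings / review_runs          # about 1.21
runs_per_mr = review_runs / merge_requests               # about 2.73
override_rate_pct = 100 * overrides / merge_requests     # about 0.60
implied_spend_usd = average_cost_usd * review_runs       # about 156,000

# The per-agent rows sum to 145,207, below the reported 159,103 total.
# The article does not say what accounts for the difference.
per_agent_total = sum(findings_by_agent.values())
```

The ratios agree with the table's "1.2 avg per review" and "0.6% of MRs", and they show each merge request was reviewed between two and three times on average, consistent with incremental re-reviews on new commits.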

Lessons Cloudflare Reports

Negative prompting is the work — "telling an LLM what not to do is where the actual prompt engineering value resides." The security reviewer's prompt explicitly lists non-flags: defense-in-depth suggestions when primary defenses are adequate, issues in unchanged code, "consider using library X" advice.^[2]
Re-reviews are stateful — new commits trigger incremental reviews that auto-resolve fixed findings, re-emit unfixed ones, and respect developer "won't fix" replies.^[2]
Prompt injection is treated as a real threat — boundary tags are stripped from user-controlled content and MR descriptions are sanitized before reaching the agents.^[2]
Acknowledged limits — the system struggles with architectural intent, cross-system impact, and timing-dependent concurrency bugs. Cloudflare's own words: "This isn't a replacement for human code review, at least not yet with today's models."^[2]
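The prompt-injection lesson above can be illustrated with a minimal sanitizer. The tag names in `BOUNDARY_TAGS` are hypothetical, since the article does not list the real boundary tags, and the wrapping format is likewise an assumption rather than Cloudflare's implementation.

```python
import re

# Hypothetical boundary tag names; the real ones are not disclosed.
BOUNDARY_TAGS = ("system", "instructions", "untrusted_input")

_TAG_RE = re.compile(
    r"</?\s*(?:" + "|".join(map(re.escape, BOUNDARY_TAGS)) + r")\b[^>]*>",
    re.IGNORECASE,
)


def strip_boundary_tags(text: str) -> str:
    """Remove opening and closing boundary tags from user-controlled text."""
    previous = None
    # Repeat until stable so a fragment like "<sys<system>tem>" cannot
    # reassemble into a tag after one pass.
    while previous != text:
        previous = text
        text = _TAG_RE.sub("", text)
    return text


def wrap_untrusted(text: str) -> str:
    """Sanitize text, then fence it so the prompt can mark it as data."""
    return "<untrusted_input>\n" + strip_boundary_tags(text) + "\n</untrusted_input>"
```

An MR description containing `</untrusted_input>` followed by new instructions would otherwise close the fence early; stripping the tag first keeps the whole description inside the data region.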

Strengths

Scale with receipts — 131,246 review runs, 48,095 MRs, 100% CI coverage, and per-review cost published to the cent. Few companies disclose internal agent economics this precisely.^[2]
Honest buy-vs-build split — adopting Windsurf/OpenCode for the assistant layer while building the review and context harness avoids competing with fast-moving agent vendors on their own turf.^[1]
Cost engineering — risk tiers, 85.7% cache hit rate, shared context files, and cheap-model routing keep average review cost at $1.19 despite up to 7 frontier-model agents per MR.^[2]
Trust calibration — the 0.6% break-glass override rate is a rare quantified signal that engineers mostly accept the AI gate; overrides are tracked in telemetry rather than forbidden.^[2]
Context as infrastructure — Backstage's 16K-entity graph, auto-generated AGENTS.md in ~3,900 repos, and the MCP portal mean agents start with organizational context instead of cold prompts.^[1]

Cautions

Not a product — none of this is purchasable as a unit; it is a case study, though most building blocks (AI Gateway, Workers AI, Agents SDK, OpenCode) are publicly available.^[1]
Self-promotional substrate — the stack post doubles as marketing for Cloudflare's developer platform; metrics are self-reported and unaudited.^[1]
Cost scales with diff size — Cloudflare concedes 500-file refactors through seven frontier models are expensive, and HN commenters questioned spending ~20% of review runs on sub-10-line MRs.^[2]^[4]
No code-writing agent disclosed — unlike Coinbase Forge or Stripe Minions, there is no in-house background agent producing PRs; velocity gains route through third-party assistants.^[1]
GitLab-first — the reviewer ships as a GitLab CI component; the plugin system claims VCS portability but only GitLab is in production.^[2]

What Developers Say

The code review post drew a 145-point, 56-comment Hacker News thread, with debate centered on cost-effectiveness and signal-to-noise:^[4]

"Trivial reviews cost 20 cents on average... the labour cost of having an intern spend ~30-60s is likely close to $0.20" — OtherShrezzing, HN^[4]

"An intern would take 2-3 minutes per PR. At ~$100K salary, that's $1.6 per review, about 10X your estimate" — alain94040, HN, in reply^[4]

"Built a similar system with Copilot and GitHub Actions. Team loves it. ROI is so high, just use strongest models available" — plmpsu, HN^[4]

"Most people experienced poor signal-to-noise ratios, making AI review a burden rather than help" — afro88, HN^[4]

Trade coverage was more uniformly positive, framing the system as evidence that disciplined orchestration — not bigger single models — is what makes AI review work at enterprise scale.^[5]
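The cost disagreement in the Hacker News thread comes down to two inputs: how long a human takes and what that time costs. The sketch below reproduces the arithmetic under one labeled assumption, a 2,080-hour work year with salary only and no overhead. The function names are illustrative.

```python
WORK_HOURS_PER_YEAR = 2080  # assumption: 40 hours x 52 weeks, salary only


def labor_cost(annual_salary: float, seconds: float) -> float:
    """Salary cost in dollars of `seconds` of work at `annual_salary`."""
    return annual_salary / WORK_HOURS_PER_YEAR / 3600 * seconds


def implied_salary(cost: float, seconds: float) -> float:
    """Annual salary at which `seconds` of work costs `cost` dollars."""
    return cost / seconds * 3600 * WORK_HOURS_PER_YEAR


two_minutes_at_100k = labor_cost(100_000, 120)
three_minutes_at_100k = labor_cost(100_000, 180)
salary_if_30s_costs_20c = implied_salary(0.20, 30)
salary_if_60s_costs_20c = implied_salary(0.20, 60)
```

Under that assumption, two to three minutes at a $100K salary costs roughly $1.60 to $2.40, in line with the reply, while $0.20 for 30 to 60 seconds of work implies an annual salary of roughly $25K to $50K.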

Competitive Positioning

Direct Comparisons (In-House Coding Agents)

Company	System	What's in-house	Key metric
Cloudflare	AI Code Reviewer + stack	Review orchestration, MCP portal, AGENTS.md; assistant layer bought	131K reviews/30d, $1.19/review, 0.6% override
Coinbase	Forge + Mux	Full coding agent + orchestration	5% of merged PRs
Ramp	Inspect	Full coding agent	30% of PRs (disclosed)
Stripe	Minions	Full coding agent	1,000+ PRs/week
Abnormal AI	Internal agents	Full coding agent	13% of PRs
Browserbase	bb	Generalized Slack agent	100% feature-request coverage

When the Cloudflare Pattern Fits

You want AI leverage on every change without betting on one assistant vendor — enforce at CI, stay agnostic at the editor
Your org is large enough that review consistency and standards compliance are the bottleneck, not raw code generation
You already maintain a service catalog (Backstage or similar) that agents can mine for context

Viability Assessment

Dimension	Assessment
Financial Health	Backed by Cloudflare (NYSE: NET); internal tooling, not a revenue line
Market Position	Most-documented public example of the review-side in-house pattern
Innovation Pace	11 months from kickoff to 93% R&D adoption; next phase is background agents on Durable Objects + Sandbox containers^[1]
Community/Ecosystem	Built on open-source OpenCode; same review agents runnable locally via a `/fullreview` TUI command^[2]
Long-term Outlook	Durable — doubles as dogfooding for products Cloudflare sells, so investment is self-reinforcing

Bottom Line

Cloudflare's disclosure is the strongest public evidence yet for the "build the harness, not the agent" thesis: buy or adopt the code-writing layer, and concentrate in-house engineering on review orchestration, context infrastructure, and model routing, the parts that encode your standards and your leverage. The published economics ($1.19 average per review, 85.7% cache hit rate, 0.6% override rate at 100% CI coverage) give other enterprises a concrete cost-and-trust baseline of a kind that vendor marketing does not supply.^[2]

Recommended for: platform and DevEx teams designing CI-native AI review, or anyone deciding the buy-vs-build boundary for internal agents.

Not recommended for: teams looking for a purchasable product — this is a pattern to study, assembled from Cloudflare's platform plus OpenCode.

Outlook: Cloudflare says background coding agents (Durable Objects + Sandbox SDK) are the next phase — if shipped, the in-house surface expands from reviewing code to writing it.^[1]

Research by Ry Walker Research

Sources