Key takeaways
- 3,683 active internal users of AI coding tools — 93% of R&D and 60% of the whole company — generating 47.95M AI requests and 241B tokens through AI Gateway in a 30-day window, eleven months after the rollout began
- The in-house flagship is the AI Code Reviewer: up to 7 specialized reviewer agents (security, performance, code quality, docs, release, Codex compliance, AGENTS.md validation) coordinated by a supervisor agent. It logged 131,246 review runs on 48,095 GitLab MRs in 30 days, with a median duration of 3m 39s, an average cost of $1.19 per review, and only 0.6% "break glass" manual overrides
- Honest framing: the engineer-facing assistant layer is third-party (Windsurf and the open-source OpenCode). Cloudflare's strategy is "build the harness, not the agent" — the proprietary work is the review orchestration, MCP Server Portal (13 servers, 182+ tools), auto-generated AGENTS.md across ~3,900 repos, and AI Gateway routing
- Cost discipline at scale: 85.7% prompt cache hit rate across ~120B review tokens, risk-tiered reviews (trivial MRs get 2 agents, full reviews get all 7+), and dynamic model failback chains controlled by a Workers KV config plane
FAQ
What are Cloudflare's internal AI agents?
An internal AI engineering stack built on Cloudflare's own shipping products (AI Gateway, Workers AI, Agents SDK, Sandbox SDK), plus a multi-agent AI Code Reviewer that runs on every merge request in Cloudflare's internal GitLab. The assistant layer engineers type into is third-party (Windsurf, OpenCode); the review system and surrounding harness are built in-house.
How does Cloudflare's AI Code Reviewer work?
Each merge request is risk-tiered (trivial, lite, full) and reviewed by up to 7 specialized agents: security, performance, code quality, documentation, release management, Engineering Codex compliance, and AGENTS.md validation. A coordinator agent supervises the run: it deduplicates findings, filters false positives, and posts one structured review. The reviewer ships as a one-line GitLab CI component include.
What does Cloudflare's AI code review cost?
Median $0.98 and average $1.19 per review (P99 $4.45) over 131,246 runs in a 30-day window, kept low by an 85.7% prompt cache hit rate and routing lightweight review tasks to cheaper models like Kimi K2.5 on Workers AI.
What models does Cloudflare use internally?
Multi-model by design: Claude Opus 4.7 and GPT-5.4 for the review coordinator, Claude Sonnet 4.6 and GPT-5.3 Codex for security, quality, and performance reviewers, and Kimi K2.5 on Workers AI for documentation and other lightweight passes. Frontier labs handle 91% of AI Gateway requests; models are hot-swappable via a Workers KV control plane.
How is Cloudflare's approach different from Coinbase Forge or Ramp Inspect?
Coinbase, Ramp, and Stripe built in-house agents that write code; Cloudflare deliberately bought the code-writing layer (Windsurf, OpenCode) and built the enforcement and context layer instead — review orchestration, MCP portal, AGENTS.md generation, and gateway routing. It is the clearest public example of the "build the harness, not the agent" strategy.
Executive Summary
In April 2026, Cloudflare published two companion engineering posts detailing how it runs AI-assisted development for its entire R&D organization: an internal AI engineering stack built on its own shipping products, and a multi-agent AI Code Reviewer that gates every merge request in its internal GitLab.[1][2] In a 30-day window, 3,683 internal users (93% of R&D, 60% of the whole company) generated 47.95 million AI requests and 241.37 billion tokens through AI Gateway — eleven months after a tiger team called iMARS started the rollout.[1]
The honest framing matters: Cloudflare did not build its own coding agent. Engineers write code with third-party tools — Windsurf (434.9K messages/month) and the open-source OpenCode (27.08M messages/month).[1] What Cloudflare built in-house is everything around the agent: the AI Code Reviewer (up to 7 specialized reviewer agents plus a supervisor, 131,246 review runs across 48,095 MRs in 30 days at $1.19 average per review), an MCP Server Portal exposing 13 servers and 182+ tools, auto-generated AGENTS.md files across ~3,900 repositories, and gateway-level model routing.[2][1] It is the clearest public case study of the "build the harness, not the agent" strategy.
| Attribute | Value |
|---|---|
| Company | Cloudflare |
| Type | Internal stack + multi-agent code reviewer |
| Disclosed | April 20, 2026 (two engineering blog posts) |
| Adoption | 3,683 users; 93% of R&D, 60% company-wide |
| Assistant layer | Third-party: Windsurf, OpenCode |
| In-house layer | AI Code Reviewer, MCP Portal, AGENTS.md generator, AI Gateway routing |
| VCS | Internal GitLab (CI component) |
| Scale | 47.95M AI requests / 241B tokens per 30 days |
| Review cost | $0.98 median, $1.19 average per review |
The Stack: What's Bought vs. What's Built
Cloudflare dogfoods its own platform — AI Gateway, Workers AI, Agents SDK, Durable Objects, Sandbox SDK, and Workflows all sit underneath the internal tooling — but the layer engineers actually type into is third-party.[1]
| Layer | Component | Build or Buy |
|---|---|---|
| Assistant | Windsurf, OpenCode | Buy / adopt OSS |
| Routing | AI Gateway via a single proxy Worker (per-user attribution, model catalog, permissions) | Build |
| Context | MCP Server Portal: 13 servers, 182+ tools (Backstage, GitLab, Jira, Sentry, Elasticsearch, Prometheus, Google Workspace) | Build |
| Context | AGENTS.md auto-generated across ~3,900 repos from the Backstage catalog (16K+ entities) | Build |
| Enforcement | AI Code Reviewer on 100% of standard CI pipelines | Build |
| Enforcement | Engineering Codex standards exposed as agent skills | Build |
Two architecture decisions the post calls out: routing everything through a single proxy Worker from day one ("One thing we got right early"), and collapsing 34+ MCP tool schemas into two portal-level Code Mode tools, cutting ~15,000 tokens of context overhead per session.[1]
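To illustrate why a single proxy makes attribution, a model catalog, and permissions easy to enforce, here is a minimal sketch of one choke-point function. It is not Cloudflare's Worker code: the catalog contents, the group names, the `route_request` function, and the shape of the returned request are all hypothetical, chosen only to show the pattern.

```python
# Hypothetical model catalog: model name -> groups allowed to call it.
# Neither the model names nor the groups come from the Cloudflare post.
MODEL_CATALOG = {
    "frontier-large": {"engineering"},
    "workers-ai-small": {"engineering", "support"},
}


def route_request(user, groups, model, prompt):
    """Single choke point: validate the model, check permissions, and tag
    the outgoing request with the caller so usage can be attributed."""
    allowed_groups = MODEL_CATALOG.get(model)
    if allowed_groups is None:
        raise KeyError(f"model {model!r} is not in the catalog")
    if not allowed_groups & set(groups):
        raise PermissionError(f"{user} may not call {model!r}")
    return {
        "model": model,
        "prompt": prompt,
        "metadata": {"user": user},  # per-user attribution travels with the request
    }


request = route_request("alice", ["engineering"], "frontier-large", "Explain this diff")
print(request["metadata"])
```

Because every client goes through the same function, adding a model or revoking access is a catalog change rather than a change to each editor integration.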
The velocity claim: the 4-week rolling average of merge requests climbed from ~5,600/week to over 8,700, with a peak week of 10,952 — nearly double the Q4 baseline.[1] Third-party coverage framed the disclosure as roughly 3,700 engineers running on Cloudflare's own stack.[3]
The AI Code Reviewer
The reviewer is the flagship in-house build, written on top of OpenCode and shipped as a GitLab CI component teams enable with a one-line include.[2] Cloudflare evaluated commercial review tools first; the recurring theme was that "they just didn't offer enough flexibility and customisation for an organisation the size of Cloudflare."[2]
Multi-Agent Architecture
Up to 7 specialized reviewers run per merge request, supervised by a coordinator:[2]
| Agent | Role | Findings (30 days) |
|---|---|---|
| Code Quality | Logic errors, general issues (most prolific) | 74,898 |
| Documentation | Completeness and clarity | 26,432 |
| Performance | Regressions, optimization opportunities | 14,615 |
| Security | Injection, auth bypass, secrets, crypto | 11,985 |
| Compliance (Codex) | Internal Engineering RFC adherence | 9,654 |
| AGENTS.md Validator | Keeps repo AI instructions current | 6,878 |
| Release Management | Release-related changes | 745 |
| Review Coordinator | Deduplicates, filters false positives, decides approval | — |
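The coordinator's role in the table above (deduplicate, filter false positives, produce one review) can be sketched as a plain merge step. This is an illustration under stated assumptions, not the production logic: the `Finding` fields, the `confidence` score, the threshold, and deduplicating by file and line are all hypothetical simplifications of what is presumably a model-driven judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    agent: str         # reviewer that raised it, e.g. "security"
    path: str          # file the finding refers to
    line: int          # line number within the diff
    message: str       # human-readable description
    confidence: float  # hypothetical 0..1 score used for filtering


def coordinate(findings, min_confidence=0.5):
    """Drop low-confidence findings, then keep one finding per (path, line),
    preferring the highest-confidence report for that location."""
    best = {}
    for f in findings:
        if f.confidence < min_confidence:
            continue  # crude stand-in for false-positive filtering
        key = (f.path, f.line)
        if key not in best or f.confidence > best[key].confidence:
            best[key] = f
    # Sort so the single posted review has a stable order
    return sorted(best.values(), key=lambda f: (f.path, f.line))


raw = [
    Finding("security", "api/auth.py", 42, "Token compared with ==", 0.9),
    Finding("code_quality", "api/auth.py", 42, "Non-constant-time comparison", 0.7),
    Finding("performance", "db/query.py", 10, "Possible N+1 query", 0.3),
    Finding("documentation", "README.md", 5, "Missing setup step", 0.8),
]
review = coordinate(raw)
for f in review:
    print(f"{f.path}:{f.line} [{f.agent}] {f.message}")
```

In this sample the two reports on the same line collapse into one and the low-confidence performance finding is dropped, so four raw findings become two posted ones.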
Merge requests are risk-tiered: trivial (≤10 lines) gets 2 agents and a downgraded coordinator, lite (≤100 lines) gets 4, and full reviews launch all 7+.[2] Model selection is tiered the same way — Claude Opus 4.7 / GPT-5.4 for the coordinator, Claude Sonnet 4.6 / GPT-5.3 Codex for the heavy reviewers, Kimi K2.5 on Workers AI for documentation and release passes — with circuit-breaker failback chains and a Workers KV control plane for instant model swaps without deploys.[2]
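The tiering and failback behaviour described above can be sketched as follows. The line thresholds and agent counts are the ones quoted from the post; everything else is an assumption made for illustration: that "lines" means changed lines, that the second model named for each role is the failback for the first, the model identifier strings, the in-memory `MODEL_CHAINS` dict standing in for the Workers KV config, and the simple consecutive-failure breaker (which has no reset timer). The downgraded coordinator for trivial reviews is not modeled.

```python
def risk_tier(changed_lines):
    """Map diff size to the tiers quoted above."""
    if changed_lines <= 10:
        return "trivial"
    if changed_lines <= 100:
        return "lite"
    return "full"


# Agent counts per tier as reported; which agents run in each tier is not
# stated in this article, so only the counts are modeled.
AGENTS_PER_TIER = {"trivial": 2, "lite": 4, "full": 7}

# Hypothetical stand-in for the KV config plane: role -> ordered model chain.
MODEL_CHAINS = {
    "coordinator": ["claude-opus-4.7", "gpt-5.4"],
    "heavy_reviewer": ["claude-sonnet-4.6", "gpt-5.3-codex"],
    "light_reviewer": ["kimi-k2.5"],
}


class CircuitBreaker:
    """Skips a model once it has failed `threshold` times in a row."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = {}

    def is_open(self, model):
        return self.failures.get(model, 0) >= self.threshold

    def record(self, model, ok):
        self.failures[model] = 0 if ok else self.failures.get(model, 0) + 1


def call_with_failback(role, prompt, call_model, breaker):
    """Try each model in the role's chain in order, skipping tripped ones."""
    for model in MODEL_CHAINS[role]:
        if breaker.is_open(model):
            continue
        try:
            result = call_model(model, prompt)
        except RuntimeError:
            breaker.record(model, ok=False)
            continue
        breaker.record(model, ok=True)
        return model, result
    raise RuntimeError(f"no healthy model available for role {role!r}")


def flaky_call(model, prompt):
    # Simulated provider: the first coordinator model is down.
    if model == "claude-opus-4.7":
        raise RuntimeError("simulated outage")
    return f"{model} reviewed: {prompt}"


breaker = CircuitBreaker(threshold=1)
tier = risk_tier(250)
used, _ = call_with_failback("coordinator", "example diff", flaky_call, breaker)
print(tier, AGENTS_PER_TIER[tier], used)
```

Keeping the chains in data rather than code is what makes the "swap models without a deploy" property possible: changing the list changes routing on the next request.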
Production Metrics (March 10 – April 9, 2026)
| Metric | Value |
|---|---|
| Review runs | 131,246 |
| Merge requests reviewed | 48,095 (across 5,169 repos) |
| CI coverage | 100% of repos on standard pipeline |
| Median review duration | 3m 39s (P95: 7m 29s) |
| Cost per review | $0.98 median / $1.19 average / $4.45 P99 |
| Total findings | 159,103 (1.2 avg per review) |
| Tokens processed | ~120B |
| Prompt cache hit rate | 85.7% |
| "Break glass" manual overrides | 288 (0.6% of MRs) |
All figures from the Cloudflare engineering post.[2]
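As a quick consistency check, the ratios implied by the table can be recomputed from the raw counts. The derived values below (runs per MR, implied total spend) are back-of-the-envelope estimates from the published figures, not numbers Cloudflare reported, and the spend estimate inherits any rounding in the $1.19 average.

```python
# Published figures from the tables above
review_runs = 131_246
merge_requests = 48_095
total_findings = 159_103
overrides = 288
avg_cost_usd = 1.19

per_agent_findings = {
    "code_quality": 74_898,
    "documentation": 26_432,
    "performance": 14_615,
    "security": 11_985,
    "compliance": 9_654,
    "agents_md": 6_878,
    "release": 745,
}

findings_per_run = total_findings / review_runs   # matches the reported ~1.2
runs_per_mr = review_runs / merge_requests        # derived: reviews re-run on new commits
override_rate = overrides / merge_requests        # matches the reported 0.6%
implied_spend = review_runs * avg_cost_usd        # derived estimate, not a reported figure

# The per-agent rows sum to less than the reported total; the article does
# not say where the remaining findings are counted.
per_agent_total = sum(per_agent_findings.values())
unattributed = total_findings - per_agent_total

print(f"findings per run: {findings_per_run:.2f}")
print(f"runs per MR:      {runs_per_mr:.2f}")
print(f"override rate:    {override_rate:.2%}")
print(f"implied spend:    ${implied_spend:,.0f}")
print(f"per-agent total:  {per_agent_total:,} (gap {unattributed:,})")
```

The reported ratios hold up: about 1.2 findings per run and a 0.6% override rate. Roughly 2.7 runs per merge request is consistent with incremental re-reviews on new commits.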
Lessons Cloudflare Reports
- Negative prompting is the work — "telling an LLM what not to do is where the actual prompt engineering value resides." The security reviewer's prompt explicitly lists non-flags: defense-in-depth suggestions when primary defenses are adequate, issues in unchanged code, "consider using library X" advice.[2]
- Re-reviews are stateful — new commits trigger incremental reviews that auto-resolve fixed findings, re-emit unfixed ones, and respect developer "won't fix" replies.[2]
- Prompt injection is treated as a real threat — boundary tags are stripped from user-controlled content and MR descriptions are sanitized before reaching the agents.[2]
- Acknowledged limits — the system struggles with architectural intent, cross-system impact, and timing-dependent concurrency bugs. Cloudflare's own words: "This isn't a replacement for human code review, at least not yet with today's models."[2]
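The stateful re-review behaviour in the list above (auto-resolve fixed findings, re-emit unfixed ones, respect "won't fix" replies) reduces to set arithmetic once each finding has a stable identifier. The sketch below assumes such identifiers exist; the `reconcile` function, the ID format, and the four output buckets are hypothetical and only illustrate the bookkeeping.

```python
def reconcile(previous, current, wont_fix):
    """Compare the prior review with a fresh one after new commits.

    previous: finding ids posted on the last review
    current:  finding ids detected on the new commit
    wont_fix: finding ids the developer replied "won't fix" to
    """
    previous, current, wont_fix = set(previous), set(current), set(wont_fix)
    return {
        # No longer detected: treat as fixed and resolve the thread
        "resolve": sorted(previous - current),
        # Still detected and not waived: post again
        "re_emit": sorted((current & previous) - wont_fix),
        # First seen on this commit
        "new": sorted(current - previous - wont_fix),
        # Still detected but waived by the developer: stay quiet
        "suppressed": sorted(current & wont_fix),
    }


result = reconcile(
    previous={"sec-1", "perf-2", "doc-3"},
    current={"perf-2", "doc-3", "qual-4"},
    wont_fix={"doc-3"},
)
for bucket, ids in result.items():
    print(bucket, ids)
```

The hard part in practice is the identifier itself: it has to survive line shifts between commits, which this sketch takes as given.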
Strengths
- Scale with receipts — 131,246 review runs, 48,095 MRs, 100% CI coverage, and per-review cost published to the cent. Few companies disclose internal agent economics this precisely.[2]
- Honest buy-vs-build split — adopting Windsurf/OpenCode for the assistant layer while building the review and context harness avoids competing with fast-moving agent vendors on their own turf.[1]
- Cost engineering — risk tiers, 85.7% cache hit rate, shared context files, and cheap-model routing keep average review cost at $1.19 despite up to 7 frontier-model agents per MR.[2]
- Trust calibration — the 0.6% break-glass override rate is a rare quantified signal that engineers mostly accept the AI gate; overrides are tracked in telemetry rather than forbidden.[2]
- Context as infrastructure — Backstage's 16K-entity graph, auto-generated AGENTS.md in ~3,900 repos, and the MCP portal mean agents start with organizational context instead of cold prompts.[1]
Cautions
- Not a product — none of this is purchasable as a unit; it is a case study, though most building blocks (AI Gateway, Workers AI, Agents SDK, OpenCode) are publicly available.[1]
- Self-promotional substrate — the stack post doubles as marketing for Cloudflare's developer platform; metrics are self-reported and unaudited.[1]
- Cost scales with diff size — Cloudflare concedes 500-file refactors through seven frontier models are expensive, and HN commenters questioned spending ~20% of review runs on sub-10-line MRs.[2][4]
- No code-writing agent disclosed — unlike Coinbase Forge or Stripe Minions, there is no in-house background agent producing PRs; velocity gains route through third-party assistants.[1]
- GitLab-first — the reviewer ships as a GitLab CI component; the plugin system claims VCS portability but only GitLab is in production.[2]
What Developers Say
The code review post drew a 145-point, 56-comment Hacker News thread, with debate centered on cost-effectiveness and signal-to-noise:[4]
"Trivial reviews cost 20 cents on average... the labour cost of having an intern spend ~30-60s is likely close to $0.20" — OtherShrezzing, HN[4]
"An intern would take 2-3 minutes per PR. At ~$100K salary, that's $1.6 per review, about 10X your estimate" — alain94040, HN, in reply[4]
"Built a similar system with Copilot and GitHub Actions. Team loves it. ROI is so high, just use strongest models available" — plmpsu, HN[4]
"Most people experienced poor signal-to-noise ratios, making AI review a burden rather than help" — afro88, HN[4]
Trade coverage was more uniformly positive, framing the system as evidence that disciplined orchestration — not bigger single models — is what makes AI review work at enterprise scale.[5]
Competitive Positioning
Direct Comparisons (In-House Coding Agents)
| Company | System | What's in-house | Key metric |
|---|---|---|---|
| Cloudflare | AI Code Reviewer + stack | Review orchestration, MCP portal, AGENTS.md; assistant layer bought | 131K reviews/30d, $1.19/review, 0.6% override |
| Coinbase | Forge + Mux | Full coding agent + orchestration | 5% of merged PRs |
| Ramp | Inspect | Full coding agent | 30% of PRs (disclosed) |
| Stripe | Minions | Full coding agent | 1,000+ PRs/week |
| Abnormal AI | Internal agents | Full coding agent | 13% of PRs |
| Browserbase | bb | Generalized Slack agent | 100% feature-request coverage |
When the Cloudflare Pattern Fits
- You want AI leverage on every change without betting on one assistant vendor — enforce at CI, stay agnostic at the editor
- Your org is large enough that review consistency and standards compliance are the bottleneck, not raw code generation
- You already maintain a service catalog (Backstage or similar) that agents can mine for context
Viability Assessment
| Dimension | Assessment |
|---|---|
| Financial Health | Backed by Cloudflare (NYSE: NET); internal tooling, not a revenue line |
| Market Position | Most-documented public example of the review-side in-house pattern |
| Innovation Pace | 11 months from kickoff to 93% R&D adoption; next phase is background agents on Durable Objects + Sandbox containers[1] |
| Community/Ecosystem | Built on open-source OpenCode; same review agents runnable locally via a /fullreview TUI command[2] |
| Long-term Outlook | Durable — doubles as dogfooding for products Cloudflare sells, so investment is self-reinforcing |
Bottom Line
Cloudflare's disclosure is the strongest public evidence yet for the "build the harness, not the agent" thesis: buy or adopt the code-writing layer, and concentrate in-house engineering on review orchestration, context infrastructure, and model routing, the parts that encode your standards and your leverage. The published economics ($1.19 average per review, 85.7% cache hit rate, 0.6% override rate at 100% CI coverage) give other enterprises a concrete cost-and-trust baseline of a kind that vendor marketing rarely provides.[2]
Recommended for: platform and DevEx teams designing CI-native AI review, or anyone deciding the buy-vs-build boundary for internal agents.
Not recommended for: teams looking for a purchasable product — this is a pattern to study, assembled from Cloudflare's platform plus OpenCode.
Outlook: Cloudflare says background coding agents (Durable Objects + Sandbox SDK) are the next phase — if shipped, the in-house surface expands from reviewing code to writing it.[1]
Research by Ry Walker Research
Sources
- [1] The AI engineering stack we built internally — on the platform we ship (Cloudflare Blog, April 2026)
- [2] Orchestrating AI Code Review at scale (Cloudflare Blog, April 2026)
- [3] Cloudflare's 3,700 Engineers Now Run on Their Own AI Stack (Agent Wars)
- [4] Orchestrating AI code review at scale (Hacker News discussion)
- [5] Cloudflare's AI Code Review Overhaul (StartupHub.ai)