
Uber AI Coding Agents

Uber's LangGraph-powered Validator and Autocover agents saved an estimated 21,000 developer hours and increased test coverage by 10%.

Key takeaways

  • 21,000 developer hours saved with Validator and Autocover agents
  • 10% increase in test coverage from Autocover's generative test authoring
  • Hybrid architecture: LLM for complex issues, deterministic tools for common patterns

FAQ

How much time did Uber save with AI coding agents?

Uber saved an estimated 21,000 developer hours using their Validator and Autocover AI agents, primarily through automated test generation and code validation.

What AI framework does Uber use for coding agents?

Uber uses LangGraph (from LangChain) to orchestrate reusable, domain-specific agents for testing, validation, and workflow assistance.

What is Uber Autocover?

A generative test-authoring tool that scaffolds, generates, executes, and mutates test cases — running up to 100 tests concurrently and increasing coverage by 10%.

Executive Summary

Uber's Developer Platform Team presented at LangChain's Interrupt conference, detailing how they deployed agentic tools across 5,000 developers and a codebase with hundreds of millions of lines of code. Using LangGraph for orchestration, they built Validator (IDE-embedded code review) and Autocover (generative test authoring), together saving an estimated 21,000 developer hours and increasing test coverage by 10%.

Attribute    Value
Company      Uber
Scale        5,000 developers
Codebase     Hundreds of millions of LOC
Framework    LangGraph (LangChain)
Key Metric   21,000 hours saved

Product Overview

Uber's approach differs from background PR agents: they built domain-specific agents for testing and validation, embedded directly in developer workflows. The architecture is hybrid — an LLM handles complex issues, while deterministic tools (static linters) handle common patterns.
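
To make the hybrid pattern concrete, here is a minimal LangGraph sketch. This is not Uber's code: the node names, rule list, and state fields are illustrative. It routes a review to a deterministic linter node when a known pattern matches and to an LLM node otherwise:

```python
# Hypothetical hybrid routing sketch (not Uber's code): deterministic
# linters handle known patterns, an LLM handles everything else.
from typing import TypedDict

from langgraph.graph import END, START, StateGraph

class ReviewState(TypedDict):
    code: str
    findings: list[str]

# Toy stand-in for the patterns a static linter can catch cheaply.
KNOWN_PATTERNS = ("unused import", "hardcoded secret")

def route(state: ReviewState) -> str:
    # Deterministic check first; escalate to the LLM only when needed.
    return "linter" if any(p in state["code"] for p in KNOWN_PATTERNS) else "llm"

def run_linter(state: ReviewState) -> dict:
    hits = [p for p in KNOWN_PATTERNS if p in state["code"]]
    return {"findings": [f"lint: {h}" for h in hits]}

def run_llm(state: ReviewState) -> dict:
    # Stand-in for a model call that reasons about complex issues.
    return {"findings": ["llm: possible race condition in request handler"]}

graph = StateGraph(ReviewState)
graph.add_node("linter", run_linter)
graph.add_node("llm", run_llm)
graph.add_conditional_edges(START, route)  # route() returns a node name
graph.add_edge("linter", END)
graph.add_edge("llm", END)
validator = graph.compile()

print(validator.invoke({"code": "key = 'abc'  # hardcoded secret", "findings": []}))
```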

Key Tools

Tool        Description                                  Impact
Validator   IDE-embedded security/best-practice agent    Real-time vulnerability detection
Autocover   Generative test-authoring tool               10% coverage increase, 21,000 hours saved
Picasso     Workflow platform with conversational AI     Organizational knowledge access

Validator Details

Validator is an IDE-embedded agent that:

  • Flags security vulnerabilities in real time
  • Detects best-practice violations
  • Proposes fixes (one-click acceptance or agent-routed resolution; see the sketch after this list)
  • Uses a hybrid architecture: an LLM for complex issues, deterministic linters for common patterns
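
The fix-proposal flow suggests a simple data model. A hypothetical sketch, with types and fields that are assumptions rather than Uber's published interface: a finding either carries a ready-made patch, enabling one-click acceptance, or is routed to an agent for resolution.

```python
# Hypothetical data model for Validator-style fix proposals (not Uber's
# code): a finding with a ready patch offers a one-click fix; otherwise
# it is routed to a resolution agent.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Finding:
    rule: str                    # e.g. "hardcoded-secret"
    location: str                # file:line
    patch: Optional[str] = None  # unified diff, if a safe fix is known

def resolve(finding: Finding) -> str:
    if finding.patch is not None:
        return f"offer one-click fix at {finding.location}"
    return f"route {finding.rule} to resolution agent"

print(resolve(Finding("unused-import", "api.py:3", patch="--- a/api.py ...")))
print(resolve(Finding("race-condition", "worker.py:88")))
```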

Autocover Details

Autocover is a generative test-authoring tool that:

  • Scaffolds, generates, executes, and mutates test cases
  • Runs up to 100 tests concurrently (see the concurrency sketch after this list)
  • Runs 2-3x faster than other AI coding tools
  • Increased test coverage by 10%
  • Saved an estimated 21,000 developer hours
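
The 100-test concurrency figure implies bounded fan-out. A generic Python sketch, not Uber's implementation, that caps in-flight test runs with a semaphore:

```python
# Generic bounded-concurrency sketch (not Uber's code): run generated
# test cases with at most 100 in flight at once.
import asyncio

MAX_CONCURRENT = 100

async def run_test(case: str, sem: asyncio.Semaphore) -> tuple[str, bool]:
    async with sem:
        # Stand-in for invoking the real test runner (e.g. a subprocess).
        await asyncio.sleep(0.01)
        return case, True

async def run_all(cases: list[str]) -> list[tuple[str, bool]]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(*(run_test(c, sem) for c in cases))

results = asyncio.run(run_all([f"test_case_{i}" for i in range(250)]))
print(sum(ok for _, ok in results), "passed")
```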

Technical Architecture

Uber uses LangGraph to orchestrate reusable, domain-specific agents with clear encapsulation.

Architecture Layers

Picasso (Workflow Platform)
├── Conversational AI agents
└── Organizational knowledge integration
    ↓
Domain-Specific Agents
├── Validator (IDE-embedded)
├── Autocover (test generation)
└── Custom agents (team-specific)
    ↓
Reusable Primitives
├── Build system agent (cross-product)
├── Security rules (team-contributed)
└── LangGraph orchestration
    ↓
Hybrid Execution
├── LLM (complex issues)
└── Deterministic tools (common patterns)
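
One way to realize this layering in LangGraph (a hypothetical sketch; Uber has not published its graph structure) is to compile each reusable primitive as its own graph and mount it as a node in a parent workflow, so domain agents compose primitives without seeing their internals:

```python
# Hypothetical layering sketch (not Uber's code): a compiled LangGraph
# subgraph mounted as a node in a parent workflow, mirroring the
# "reusable primitives" layer above.
from typing import TypedDict

from langgraph.graph import END, START, StateGraph

class State(TypedDict):
    target: str
    report: str

def build_step(state: State) -> dict:
    # Stand-in for a shared build-system primitive.
    return {"report": f"built {state['target']}"}

# Reusable primitive packaged as its own graph.
build_graph = StateGraph(State)
build_graph.add_node("build", build_step)
build_graph.add_edge(START, "build")
build_graph.add_edge("build", END)
build_agent = build_graph.compile()

def validate_step(state: State) -> dict:
    return {"report": state["report"] + "; validated"}

# Domain agent composes the primitive without knowing its internals.
parent = StateGraph(State)
parent.add_node("build_agent", build_agent)  # compiled graph as a node
parent.add_node("validate", validate_step)
parent.add_edge(START, "build_agent")
parent.add_edge("build_agent", "validate")
parent.add_edge("validate", END)
workflow = parent.compile()

print(workflow.invoke({"target": "service-a", "report": ""}))
```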

Key Technical Details

Aspect        Detail
Framework     LangGraph (LangChain ecosystem)
Surfaces      IDE (Validator), workflow (Autocover, Picasso)
Execution     Hybrid LLM + deterministic
Concurrency   Up to 100 tests simultaneously
Throughput    2-3x faster than alternatives

Key Learnings from Uber

Uber shared organizational lessons from deploying agents at scale:

1. Encapsulation Enables Reuse

Clear interfaces let teams extend the platform without central coordination. The security team can contribute rules without deep LangGraph knowledge.
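
A minimal sketch of what such an interface could look like (hypothetical; Uber has not published theirs): contributors register plain predicate functions in a registry, and the agent layer consumes the registry, so rule authors never touch LangGraph:

```python
# Hypothetical contribution interface (not Uber's code): security
# engineers register plain rule functions; the agent layer consumes
# the registry, so contributors never touch LangGraph directly.
from typing import Callable

RULES: dict[str, Callable[[str], bool]] = {}

def rule(name: str):
    def register(fn: Callable[[str], bool]) -> Callable[[str], bool]:
        RULES[name] = fn
        return fn
    return register

@rule("no-hardcoded-secret")
def no_hardcoded_secret(code: str) -> bool:
    return "AKIA" not in code  # toy AWS-key heuristic

@rule("no-eval")
def no_eval(code: str) -> bool:
    return "eval(" not in code

def check(code: str) -> list[str]:
    # Return the names of every rule the code violates.
    return [name for name, fn in RULES.items() if not fn(code)]

print(check("result = eval(user_input)"))  # -> ['no-eval']
```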

2. Domain Expert Agents Outperform Generic Tools

Specialized context beats general-purpose AI. A test-generation agent with Uber-specific knowledge outperforms generic coding assistants.

3. Determinism Still Matters

Linters and build tools work better as deterministic steps orchestrated by agents. Not everything should be LLM-driven.

4. Solve Narrow Problems First

Tightly scoped solutions get reused in broader workflows. Start specific, then generalize.


Strengths

  • Massive scale validation — 5,000 developers and hundreds of millions of LOC prove the approach works
  • Proven ROI — 21,000 developer hours saved is concrete, measurable impact
  • Hybrid architecture — LLM + deterministic tools captures best of both worlds
  • Reusable primitives — Security team can contribute rules without framework expertise
  • Domain expertise encoded — Specialized agents outperform generic AI coding tools

Cautions

  • Infrastructure investment — Requires dedicated platform team to maintain LangGraph infrastructure
  • LangGraph dependency — Tightly coupled to the LangChain ecosystem; migrating away would be a significant effort
  • Enterprise context — Patterns optimized for 5,000+ developer organizations may not transfer to smaller teams
  • Not PR-focused — Validator and Autocover are developer tools, not background PR agents
  • Not for sale — Internal tooling only

Competitive Positioning

vs. Other In-House Agents

System             Differentiation
Stripe Minions     Stripe focuses on PR generation; Uber focuses on testing/validation
Ramp Inspect       Ramp is a background PR agent; Uber is IDE- and workflow-embedded
StrongDM Factory   StrongDM eliminates review; Uber enhances the review workflow

Approach Comparison

Approach       Uber                  Stripe/Ramp
Primary goal   Testing/validation    PR generation
Interface      IDE + workflow        Slack/CLI
Output         Fixes + tests         Pull requests
Human role     Accepts suggestions   Reviews PRs

Ideal Customer Profile

This is internal tooling, not a product for sale, but the approach is worth studying.

Good fit for a similar approach:

  • Large engineering organization (1,000+ developers)
  • Existing LangChain/LangGraph investment or interest
  • Test coverage is a key metric
  • IDE-embedded tools preferred over background agents
  • Security and best-practices enforcement priority

Poor fit:

  • Small team (ROI threshold not met)
  • Prefer background PR generation over IDE tools
  • No LangGraph expertise available
  • Simple CI/CD without sophisticated testing needs

Viability Assessment

Factor                 Assessment
Public Documentation   Good (conference talk, multiple articles)
Adoption Metrics       Strong (21,000 hours, 10% coverage)
Architecture Detail    Good (LangGraph patterns documented)
Scale Validation       Excellent (5,000 developers)
External Validation    Strong (LangChain conference, ZenML coverage)

Uber's presentation at LangChain Interrupt provides a valuable reference architecture for enterprise LangGraph adoption.


Bottom Line

Uber's AI coding agents represent a different approach than Stripe/Ramp: IDE-embedded and workflow-integrated tools rather than background PR generators. The hybrid architecture — LLM for complex reasoning, deterministic tools for common patterns — reflects mature thinking about where AI adds value.

Key metrics: 21,000 hours saved, 10% test coverage increase, up to 100 concurrent test executions.

Key insight: Domain-expert agents outperform generic tools. Determinism still matters.

Recommended study for: Large engineering organizations, teams building testing infrastructure, LangGraph adopters.

Not recommended for: Small teams, organizations wanting background PR agents, teams without LangGraph expertise.

Outlook: Uber's approach suggests AI coding agents will specialize by use case (testing vs. PR generation vs. security) rather than converging on a single pattern.


Research by Ry Walker Research

Disclosure: Author is CEO of Tembo, which offers agent orchestration as an alternative to building in-house.