Key takeaways
- 21,000 developer hours saved with Validator and Autocover agents
- 10% increase in test coverage from Autocover's generative test authoring
- Hybrid architecture: LLM for complex issues, deterministic tools for common patterns
FAQ
How much time did Uber save with AI coding agents?
Uber saved an estimated 21,000 developer hours using their Validator and Autocover AI agents, primarily through automated test generation and code validation.
What AI framework does Uber use for coding agents?
Uber uses LangGraph (from LangChain) to orchestrate reusable, domain-specific agents for testing, validation, and workflow assistance.
What is Uber Autocover?
A generative test-authoring tool that scaffolds, generates, executes, and mutates test cases — running up to 100 tests concurrently and increasing coverage by 10%.
Executive Summary
Uber's Developer Platform team presented at LangChain's Interrupt conference, detailing how they deployed agentic tools across 5,000 developers and a codebase of hundreds of millions of lines. Using LangGraph for orchestration, they built Validator (IDE-embedded code review) and Autocover (generative test authoring), saving an estimated 21,000 developer hours and increasing test coverage by 10%.
| Attribute | Value |
|---|---|
| Company | Uber |
| Scale | 5,000 developers |
| Codebase | Hundreds of millions of LOC |
| Framework | LangGraph (LangChain) |
| Key Metric | 21,000 hours saved |
Product Overview
Uber's approach differs from background PR agents: rather than generating pull requests, they built domain-specific agents for testing and validation, embedded directly in developer workflows. The architecture is hybrid, pairing an LLM for complex issues with deterministic tools (static linters) for common patterns.
Key Tools
| Tool | Description | Impact |
|---|---|---|
| Validator | IDE-embedded security/best-practice agent | Real-time vulnerability detection |
| Autocover | Generative test-authoring tool | 10% coverage increase, 21,000 hours saved |
| Picasso | Workflow platform with conversational AI | Organizational knowledge access |
Validator Details
An IDE-embedded agent that:
- Flags security vulnerabilities in real time
- Detects best-practice violations
- Proposes fixes, accepted with one click or routed to an agent for resolution
- Uses the hybrid architecture, with an LLM for complex issues and deterministic linters for common patterns (see the sketch after this list)
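Uber hasn't published Validator's implementation, so the following is only a minimal sketch of the fix-routing policy described above, with hypothetical rule IDs and type names: findings backed by deterministic linter rules become one-click fixes, and everything else is routed to the LLM agent.

```python
# Sketch of Validator-style fix routing (hypothetical; Uber has not published this code).
from dataclasses import dataclass
from enum import Enum, auto

class Resolution(Enum):
    ONE_CLICK = auto()  # deterministic, templated fix the developer accepts in the IDE
    AGENT = auto()      # routed to an LLM agent that proposes a patch

@dataclass
class Finding:
    rule_id: str
    file: str
    line: int
    message: str
    resolution: Resolution

# Rules that have deterministic fixers (hypothetical IDs).
DETERMINISTIC_RULES = {"unused-import", "missing-error-check"}

def classify(rule_id: str) -> Resolution:
    # Hybrid policy: linter-backed rules get one-click fixes; complex issues go to the agent.
    return Resolution.ONE_CLICK if rule_id in DETERMINISTIC_RULES else Resolution.AGENT

finding = Finding(
    rule_id="sql-string-concat",
    file="handler.py",
    line=42,
    message="possible SQL injection via string concatenation",
    resolution=classify("sql-string-concat"),
)
print(finding.resolution)  # Resolution.AGENT: a complex issue, handed to the LLM
```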
Autocover Details
A generative test-authoring tool that:
- Scaffolds, generates, executes, and mutates test cases
- Runs up to 100 tests concurrently (see the concurrency sketch below)
- Runs 2-3x faster than other AI coding tools
- Has increased test coverage by 10%
- Has saved an estimated 21,000 developer hours
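Autocover's source is also unpublished; as a rough illustration of the 100-test concurrency claim, the sketch below bounds parallel test executions with an asyncio semaphore. The runner invocation and generated test names are assumptions made for the example.

```python
# Sketch of bounded-concurrency test execution (illustrative only, not Autocover's code).
import asyncio

MAX_CONCURRENT = 100  # Autocover reportedly runs up to 100 tests at once

async def run_test(name: str, sem: asyncio.Semaphore) -> tuple[str, int]:
    async with sem:  # at most MAX_CONCURRENT tests hold the semaphore at any moment
        # Hypothetical runner invocation; substitute the real test command here.
        proc = await asyncio.create_subprocess_exec(
            "pytest", "-q", "-k", name,
            stdout=asyncio.subprocess.DEVNULL,
            stderr=asyncio.subprocess.DEVNULL,
        )
        return name, await proc.wait()

async def main(test_names: list[str]) -> None:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    results = await asyncio.gather(*(run_test(n, sem) for n in test_names))
    for name, code in results:
        print(f"{name}: {'pass' if code == 0 else 'fail'}")

if __name__ == "__main__":
    asyncio.run(main([f"test_generated_{i}" for i in range(250)]))
```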
Technical Architecture
Uber uses LangGraph to orchestrate reusable, domain-specific agents with clear encapsulation.
Architecture Layers
```
Picasso (Workflow Platform)
├── Conversational AI agents
└── Organizational knowledge integration
        ↓
Domain-Specific Agents
├── Validator (IDE-embedded)
├── Autocover (test generation)
└── Custom agents (team-specific)
        ↓
Reusable Primitives
├── Build system agent (cross-product)
├── Security rules (team-contributed)
└── LangGraph orchestration
        ↓
Hybrid Execution
├── LLM (complex issues)
└── Deterministic tools (common patterns)
```
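To make the bottom two layers concrete, here is a minimal LangGraph sketch of the hybrid routing pattern: a triage node sends findings matching known patterns to a deterministic fixer and everything else to an LLM node. This is an illustration under assumptions, not Uber's code; the state schema, node names, and rule IDs are hypothetical, and the LLM node is stubbed.

```python
# Minimal hybrid-execution sketch with LangGraph (pip install langgraph).
from typing import TypedDict

from langgraph.graph import END, START, StateGraph

KNOWN_PATTERNS = {"unused-import", "missing-error-check"}  # hypothetical rule IDs

class ReviewState(TypedDict):
    rule_id: str
    snippet: str
    fix: str

def deterministic_fix(state: ReviewState) -> dict:
    # Common patterns: apply a templated linter fix, no model call needed.
    return {"fix": f"lint auto-fix for {state['rule_id']}"}

def llm_fix(state: ReviewState) -> dict:
    # Complex issues: a real LLM call would go here; stubbed to keep the sketch runnable.
    return {"fix": f"LLM-proposed fix for: {state['snippet']}"}

def route(state: ReviewState) -> str:
    # Hybrid policy: deterministic tools for common patterns, LLM for the rest.
    return "deterministic" if state["rule_id"] in KNOWN_PATTERNS else "llm"

graph = StateGraph(ReviewState)
graph.add_node("triage", lambda state: {})  # no-op entry point; routing happens on its edges
graph.add_node("deterministic", deterministic_fix)
graph.add_node("llm", llm_fix)
graph.add_edge(START, "triage")
graph.add_conditional_edges("triage", route, {"deterministic": "deterministic", "llm": "llm"})
graph.add_edge("deterministic", END)
graph.add_edge("llm", END)

app = graph.compile()
print(app.invoke({"rule_id": "unused-import", "snippet": "import os", "fix": ""}))
```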
Key Technical Details
| Aspect | Detail |
|---|---|
| Framework | LangGraph (LangChain ecosystem) |
| Surfaces | IDE (Validator), Workflow (Autocover, Picasso) |
| Execution | Hybrid LLM + deterministic |
| Concurrency | Up to 100 tests simultaneously |
| Throughput | 2-3x faster than alternatives |
Key Learnings from Uber
Uber shared organizational lessons from deploying agents at scale:
1. Encapsulation Enables Reuse
Clear interfaces let teams extend the platform without central coordination: the security team can contribute rules without deep LangGraph knowledge (see the registry sketch after this list).
2. Domain Expert Agents Outperform Generic Tools
Specialized context beats general-purpose AI. A test-generation agent with Uber-specific knowledge outperforms generic coding assistants.
3. Determinism Still Matters
Linters and build tools work better deterministically, orchestrated by agents. Not everything should be LLM-driven.
4. Solve Narrow Problems First
Tightly scoped solutions get reused in broader workflows. Start specific, then generalize.
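Returning to the first learning: the talk doesn't specify the contribution interface, but encapsulation-driven reuse could look like the hypothetical registry below, where a security engineer contributes a plain predicate function and never touches the orchestration layer.

```python
# Sketch of a team-contributed rule registry (hypothetical interface, not Uber's code).
from typing import Callable, Optional

RuleFn = Callable[[str], Optional[str]]  # source text -> finding message, or None
RULES: dict[str, RuleFn] = {}

def rule(rule_id: str) -> Callable[[RuleFn], RuleFn]:
    """Register a check; contributors never see the LangGraph layer behind this."""
    def register(fn: RuleFn) -> RuleFn:
        RULES[rule_id] = fn
        return fn
    return register

# A security engineer contributes a rule as plain Python, no framework knowledge required.
@rule("hardcoded-secret")
def hardcoded_secret(source: str) -> Optional[str]:
    if "AWS_SECRET" in source:
        return "possible hardcoded credential"
    return None

def scan(source: str) -> list[tuple[str, str]]:
    # The platform's agents call this; contributors only wrote the predicates above.
    return [(rid, msg) for rid, fn in RULES.items() if (msg := fn(source))]

print(scan('key = "AWS_SECRET=abc123"'))  # [('hardcoded-secret', 'possible hardcoded credential')]
```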
Strengths
- Massive scale validation — 5,000 developers, hundreds of millions of LOC proves the approach works
- Proven ROI — 21,000 developer hours saved is concrete, measurable impact
- Hybrid architecture — LLM + deterministic tools captures best of both worlds
- Reusable primitives — Security team can contribute rules without framework expertise
- Domain expertise encoded — Specialized agents outperform generic AI coding tools
Cautions
- Infrastructure investment — Requires dedicated platform team to maintain LangGraph infrastructure
- LangGraph dependency — Tightly coupled to LangChain ecosystem; migration would be significant
- Enterprise context — Patterns optimized for 5,000+ developer organizations may not transfer to smaller teams
- Not PR-focused — Validator and Autocover are developer tools, not background PR agents
- Not for sale — Internal tooling only
Competitive Positioning
vs. Other In-House Agents
| System | Differentiation |
|---|---|
| Stripe Minions | Stripe focuses on PR generation; Uber on testing/validation |
| Ramp Inspect | Ramp is background PR agent; Uber is IDE + workflow embedded |
| StrongDM Factory | StrongDM eliminates review; Uber enhances review workflow |
Approach Comparison
| Approach | Uber | Stripe/Ramp |
|---|---|---|
| Primary goal | Testing/validation | PR generation |
| Interface | IDE + workflow | Slack/CLI |
| Output | Fixes + tests | Pull requests |
| Human role | Accepts suggestions | Reviews PRs |
Ideal Customer Profile
This is internal tooling, not a product for sale, but the approach is worth studying for organizations weighing a similar build:
Good fit for similar approach:
- Large engineering organization (1,000+ developers)
- Existing LangChain/LangGraph investment or interest
- Test coverage is a key metric
- IDE-embedded tools preferred over background agents
- Security and best-practices enforcement priority
Poor fit:
- Small team (ROI threshold not met)
- Prefer background PR generation over IDE tools
- No LangGraph expertise available
- Simple CI/CD without sophisticated testing needs
Viability Assessment
| Factor | Assessment |
|---|---|
| Public Documentation | Good (conference talk, multiple articles) |
| Adoption Metrics | Strong (21,000 hours, 10% coverage) |
| Architecture Detail | Good (LangGraph patterns documented) |
| Scale Validation | Excellent (5,000 developers) |
| External Validation | Strong (LangChain conference, ZenML coverage) |
Uber's presentation at LangChain Interrupt provides valuable reference architecture for enterprise LangGraph adoption.
Bottom Line
Uber's AI coding agents represent a different approach than Stripe/Ramp: IDE-embedded and workflow-integrated tools rather than background PR generators. The hybrid architecture — LLM for complex reasoning, deterministic tools for common patterns — reflects mature thinking about where AI adds value.
Key metrics: 21,000 hours saved, a 10% increase in test coverage, and up to 100 concurrent test executions.
Key insight: Domain-expert agents outperform generic tools. Determinism still matters.
Recommended study for: Large engineering organizations, teams building testing infrastructure, LangGraph adopters.
Not recommended for: Small teams, organizations wanting background PR agents, teams without LangGraph expertise.
Outlook: Uber's approach suggests AI coding agents will specialize by use case (testing vs. PR generation vs. security) rather than converging on a single pattern.
Research by Ry Walker Research
Disclosure: Author is CEO of Tembo, which offers agent orchestration as an alternative to building in-house.