
AutoAgent

AutoAgent applies the autoresearch experiment loop pattern to agent harness engineering — give it a task, let it build and iterate on an agent harness autonomously overnight. From Third Layer, the team behind Dex.

Key takeaways

  • AutoAgent applies the autoresearch loop to agent engineering itself — the meta-agent modifies system prompts, tools, config, and orchestration, then benchmarks the result
  • Built on Harbor benchmark framework, making results comparable across agent implementations
  • From Third Layer (thirdlayer.inc), a funded startup building self-configuring agent infrastructure

FAQ

What is AutoAgent?

AutoAgent is an open-source tool that autonomously improves an agent harness. A meta-agent modifies the system prompt, tools, agent configuration, and orchestration, runs a benchmark, checks the score, keeps or discards the change, and repeats.

How is AutoAgent different from Karpathy's autoresearch?

Karpathy's autoresearch optimizes ML training code. AutoAgent optimizes agent harnesses — the system prompts, tool definitions, routing, and orchestration that define how an agent operates. Same loop pattern, different target.

What benchmark framework does AutoAgent use?

AutoAgent uses Harbor, a framework from the creators of Terminal-Bench for evaluating agents like Claude Code, OpenHands, and Codex CLI. Tasks run in Docker containers with deterministic or LLM-as-judge scoring.

Overview

AutoAgent takes the autoresearch experiment loop pattern and applies it to agent harness engineering itself. Instead of optimizing ML training code or GPU kernels, AutoAgent optimizes the system prompt, tool definitions, agent configuration, and orchestration that define how an agent operates.[1]

The pitch: give an AI agent a task, let it build and iterate on an agent harness autonomously overnight. The meta-agent modifies agent.py, runs a benchmark via the Harbor framework, checks the score, keeps or discards the change, and repeats — hill-climbing toward better agent performance.

Built by Kevin Gu at Third Layer (thirdlayer.inc), a startup building self-configuring agent infrastructure backed by advisors from MIT Media Lab, Stanford SAIL, Berkeley BAIR, Sakana AI, DeepMind, and ElevenLabs.[2]

How It Works

The architecture follows the same core pattern as Karpathy's autoresearch, with a key twist: the optimization target is the agent itself.

Key Files

| File | Purpose |
| --- | --- |
| program.md | Instructions for the meta-agent plus the directive (what kind of agent to build). Edited by the human. |
| agent.py | The entire harness under test in a single file: config, tool definitions, agent registry, routing/orchestration, and the Harbor adapter. The meta-agent edits everything except the fixed adapter section. |
| tasks/ | Evaluation tasks in Harbor format, with Docker containers, instruction prompts, and test suites. |
| .agent/ | Optional workspace for reusable instructions, notes, prompts, or skills. |

The Loop

  1. Meta-agent reads program.md for directive and constraints
  2. Inspects current agent.py harness
  3. Modifies the editable section (prompts, tools, routing, config)
  4. Runs benchmark via Harbor (harbor run -p tasks/)
  5. Score produced (0.0-1.0) by task test suites
  6. Keep if score improved, revert if not
  7. Repeat
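The seven steps above can be sketched as a hill-climbing loop. This is a minimal illustration, not AutoAgent's actual code: `run_benchmark` and `propose_edit` are mocked stand-ins for what the real system does by shelling out to harbor run -p tasks/ and asking an LLM to edit agent.py. Only the keep-or-revert control flow is the point.

```python
import random

def run_benchmark(harness: str) -> float:
    """Stand-in for `harbor run -p tasks/`: returns a 0.0-1.0 score.
    Mocked here so the sketch is runnable without Harbor or Docker."""
    random.seed(len(harness))  # deterministic per-harness score
    return min(1.0, len(harness) / 100 + random.random() * 0.1)

def propose_edit(harness: str) -> str:
    """Stand-in for the meta-agent's LLM edit of the harness."""
    return harness + "\n# tweak a prompt, tool, or config value"

def hill_climb(harness: str, iterations: int = 5) -> tuple[str, float]:
    best_score = run_benchmark(harness)          # baseline score
    for _ in range(iterations):
        candidate = propose_edit(harness)        # step 3: modify editable section
        score = run_benchmark(candidate)         # steps 4-5: benchmark, get score
        if score > best_score:                   # step 6: keep if improved...
            harness, best_score = candidate, score
        # ...otherwise discard the candidate (revert)
    return harness, best_score
```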

Design Principles

  • Program the meta-agent, not the harness. The human steers through program.md, the meta-agent edits agent.py
  • Single-file, registry-driven harness. Everything in one file for simplicity, but with structured registration for clean evolution
  • Docker isolation. Agent runs in containers — can't damage the host
  • Score-driven. Numeric score from benchmarks. Keep if better, discard if not
  • Harbor-compatible tasks. Same format as Harbor benchmarks, making results portable across agent implementations[3]
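A registry-driven, single-file harness might look like the sketch below. The decorator pattern and names here are illustrative assumptions, not AutoAgent's actual API; the point is that structured registration gives the meta-agent one well-defined place to add, remove, or rewrite tools.

```python
# Illustrative sketch of a registry-driven harness in one file.
TOOL_REGISTRY: dict[str, dict] = {}

def tool(name: str, description: str):
    """Register a callable so the orchestrator (and the meta-agent)
    can discover every tool by inspecting a single structure."""
    def decorator(fn):
        TOOL_REGISTRY[name] = {"description": description, "fn": fn}
        return fn
    return decorator

# --- editable section: the meta-agent adds/removes tools here ---
@tool("read_file", "Read a file from the task workspace")
def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

@tool("list_tools", "List the names of all registered tools")
def list_tools() -> list[str]:
    return sorted(TOOL_REGISTRY)
# --- end editable section; a fixed Harbor adapter would follow ---
```

Because registration is declarative, a failed experiment is reverted by restoring the previous agent.py, with no external state to clean up.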

What Makes It Different

AutoAgent is notable for being meta-autoresearch — applying the experiment loop to agent engineering rather than a downstream task:

| Tool | Optimizes | Target File |
| --- | --- | --- |
| karpathy/autoresearch | ML training | train.py |
| pi-autoresearch | Any metric | configurable |
| AutoKernel | GPU kernels | kernel code |
| AutoAgent | Agent harnesses | agent.py (prompts, tools, routing, orchestration) |

It is among the first tools to close the loop on agent self-improvement through benchmark-driven iteration: the agent engineers better agents.

Harbor Integration

AutoAgent uses the Harbor framework (from the creators of Terminal-Bench) as its evaluation backbone. Harbor provides:[3]

  • Docker-isolated task environments
  • Standardized task format (instruction.md + test suites)
  • Support for evaluating Claude Code, OpenHands, Codex CLI, Aider, and custom agents
  • Deterministic and LLM-as-judge scoring
  • Registry of reusable benchmark tasks
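The article describes Harbor tasks as an instruction prompt plus a test suite run in a Docker container. A toy task in that shape can be laid out as below; the file names and layout beyond instruction.md are illustrative assumptions, so consult Harbor's documentation for the real schema.

```python
from pathlib import Path

def write_minimal_task(root: str) -> Path:
    """Create a toy benchmark task: an instruction prompt and a
    deterministic test suite. Layout is illustrative, not Harbor's
    authoritative schema."""
    task = Path(root) / "tasks" / "hello-task"
    task.mkdir(parents=True, exist_ok=True)
    # The prompt the agent under test receives.
    (task / "instruction.md").write_text(
        "Create greeting.txt containing the single word: hello\n"
    )
    # A deterministic check that produces the pass/fail signal.
    (task / "tests").mkdir(exist_ok=True)
    (task / "tests" / "test_outputs.py").write_text(
        "def test_greeting():\n"
        "    assert open('greeting.txt').read().strip() == 'hello'\n"
    )
    return task
```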

This means AutoAgent results are comparable with any other Harbor-evaluated agent.

Third Layer (Company)

Third Layer (thirdlayer.inc) describes their mission as solving "one of the hardest problems in deploying agents: AI models are generic, but people's work and processes are specific."[2]

Their generally available product Dex is a browser extension that learns organizational workflows by observing real work. AutoAgent appears to be the open-source research component of their broader self-configuring agent platform.

The team lists advisors and collaborators from the labs named above, is hiring engineers, and maintains a waitlist for its self-configuring agent product.

Strengths

  • Meta-level optimization — Optimizes the agent itself, not just downstream tasks
  • Harbor compatibility — Results are portable and comparable across agent implementations
  • Clean architecture — Single-file harness with clear edit boundaries keeps scope manageable
  • Docker isolation — Safe experimentation without host damage
  • Backed by a real company — Third Layer has advisors from top labs and a product roadmap

Cautions

  • Very new — Just open-sourced (April 2026), limited community adoption
  • Ships without tasks — Users must bring their own Harbor-format benchmarks
  • Single-file constraint — Real-world agent harnesses are multi-file; unclear how this scales
  • Company product risk — Open-source component may evolve to serve Third Layer's commercial interests

Key Stats

| Detail | Value |
| --- | --- |
| GitHub | kevinrgu/autoagent |
| License | MIT |
| Language | Python |
| Requirements | Docker, Python 3.10+, uv |
| Benchmark | Harbor framework |
| Company | Third Layer (thirdlayer.inc) |
| Created | April 2026 |

See also: Autoresearch Tools Compared for the full category analysis.

Research by Ry Walker Research.