
AutoAgent

AutoAgent applies the autoresearch experiment loop pattern to agent harness engineering — give it a task, let it build and iterate on an agent harness autonomously overnight. From Third Layer, the team behind Dex.

Key takeaways

  • AutoAgent applies the autoresearch loop to agent engineering itself — the meta-agent modifies system prompts, tools, config, and orchestration, then benchmarks the result
  • Built on Harbor benchmark framework, making results comparable across agent implementations
  • From Third Layer (thirdlayer.inc), a funded startup building self-configuring agent infrastructure

FAQ

What is AutoAgent?

AutoAgent is an open-source tool that autonomously improves an agent harness. A meta-agent modifies the system prompt, tools, agent configuration, and orchestration, runs a benchmark, checks the score, keeps or discards the change, and repeats.

How is AutoAgent different from Karpathy's autoresearch?

Karpathy's autoresearch optimizes ML training code. AutoAgent optimizes agent harnesses — the system prompts, tool definitions, routing, and orchestration that define how an agent operates. Same loop pattern, different target.

What benchmark framework does AutoAgent use?

AutoAgent uses Harbor, a framework from the creators of Terminal-Bench for evaluating agents like Claude Code, OpenHands, and Codex CLI. Tasks run in Docker containers with deterministic or LLM-as-judge scoring.

Overview

AutoAgent takes the autoresearch experiment loop pattern and applies it to agent harness engineering itself. Instead of optimizing ML training code or GPU kernels, AutoAgent optimizes the system prompt, tool definitions, agent configuration, and orchestration that define how an agent operates.[1]

The pitch: give an AI agent a task, let it build and iterate on an agent harness autonomously overnight. The meta-agent modifies agent.py, runs a benchmark via the Harbor framework, checks the score, keeps or discards the change, and repeats — hill-climbing toward better agent performance.

Built by Kevin Gu at Third Layer (thirdlayer.inc), a startup building self-configuring agent infrastructure backed by advisors from MIT Media Lab, Stanford SAIL, Berkeley BAIR, Sakana AI, DeepMind, and ElevenLabs.[2]

How It Works

The architecture follows the same core pattern as Karpathy's autoresearch, with a key twist: the optimization target is the agent itself.

Key Files

| File | Purpose |
| --- | --- |
| program.md | Instructions for the meta-agent plus the directive (what kind of agent to build). Edited by the human. |
| agent.py | The entire harness under test in a single file: config, tool definitions, agent registry, routing/orchestration, and the Harbor adapter. The meta-agent edits everything except the fixed adapter section. |
| tasks/ | Evaluation tasks in Harbor format, with Docker containers, instruction prompts, and test suites. |
| .agent/ | Optional workspace for reusable instructions, notes, prompts, or skills. |

The Loop

  1. Meta-agent reads program.md for directive and constraints
  2. Inspects current agent.py harness
  3. Modifies the editable section (prompts, tools, routing, config)
  4. Runs benchmark via Harbor (harbor run -p tasks/)
  5. Score produced (0.0-1.0) by task test suites
  6. Keep if score improved, revert if not
  7. Repeat
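The seven steps above can be sketched as a hill-climbing loop. This is a minimal illustration, not AutoAgent's actual code: `run_benchmark` and `propose_edit` are mocked stand-ins for what the real system does by shelling out to harbor run -p tasks/ and asking an LLM to edit agent.py. Only the keep-or-revert control flow is the point.

```python
import random

def run_benchmark(harness: str) -> float:
    """Stand-in for `harbor run -p tasks/`: returns a 0.0-1.0 score.
    Mocked here so the sketch is runnable without Harbor or Docker."""
    random.seed(len(harness))  # deterministic per-harness score
    return min(1.0, len(harness) / 100 + random.random() * 0.1)

def propose_edit(harness: str) -> str:
    """Stand-in for the meta-agent's LLM edit of the harness."""
    return harness + "\n# tweak a prompt, tool, or config value"

def hill_climb(harness: str, iterations: int = 5) -> tuple[str, float]:
    best_score = run_benchmark(harness)          # baseline score
    for _ in range(iterations):
        candidate = propose_edit(harness)        # step 3: modify editable section
        score = run_benchmark(candidate)         # steps 4-5: benchmark, get score
        if score > best_score:                   # step 6: keep if improved...
            harness, best_score = candidate, score
        # ...otherwise discard the candidate (revert)
    return harness, best_score
```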

Design Principles

  • Program the meta-agent, not the harness. The human steers through program.md, the meta-agent edits agent.py
  • Single-file, registry-driven harness. Everything in one file for simplicity, but with structured registration for clean evolution
  • Docker isolation. Agent runs in containers — can't damage the host
  • Score-driven. Numeric score from benchmarks. Keep if better, discard if not
  • Harbor-compatible tasks. Same format as Harbor benchmarks, making results portable across agent implementations[3]
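A registry-driven, single-file harness might look like the sketch below. The decorator pattern and names here are illustrative assumptions, not AutoAgent's actual API; the point is that structured registration gives the meta-agent one well-defined place to add, remove, or rewrite tools.

```python
# Illustrative sketch of a registry-driven harness in one file.
TOOL_REGISTRY: dict[str, dict] = {}

def tool(name: str, description: str):
    """Register a callable so the orchestrator (and the meta-agent)
    can discover every tool by inspecting a single structure."""
    def decorator(fn):
        TOOL_REGISTRY[name] = {"description": description, "fn": fn}
        return fn
    return decorator

# --- editable section: the meta-agent adds/removes tools here ---
@tool("read_file", "Read a file from the task workspace")
def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

@tool("list_tools", "List the names of all registered tools")
def list_tools() -> list[str]:
    return sorted(TOOL_REGISTRY)
# --- end editable section; a fixed Harbor adapter would follow ---
```

Because registration is declarative, a failed experiment is reverted by restoring the previous agent.py, with no external state to clean up.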

What Makes It Different

AutoAgent is notable for being meta-autoresearch — applying the experiment loop to agent engineering rather than a downstream task:

| Tool | Optimizes | Target File |
| --- | --- | --- |
| karpathy/autoresearch | ML training | train.py |
| pi-autoresearch | Any metric | configurable |
| AutoKernel | GPU kernels | kernel code |
| AutoAgent | Agent harnesses | agent.py (prompts, tools, routing, orchestration) |

It is among the first tools to close the loop on agent self-improvement through benchmark-driven iteration: the agent engineers better agents.

Harbor Integration

AutoAgent uses the Harbor framework (from the creators of Terminal-Bench) as its evaluation backbone. Harbor provides:[3]

  • Docker-isolated task environments
  • Standardized task format (instruction.md + test suites)
  • Support for evaluating Claude Code, OpenHands, Codex CLI, Aider, and custom agents
  • Deterministic and LLM-as-judge scoring
  • Registry of reusable benchmark tasks
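The article describes Harbor tasks as an instruction prompt plus a test suite run in a Docker container. A toy task in that shape can be laid out as below; the file names and layout beyond instruction.md are illustrative assumptions, so consult Harbor's documentation for the real schema.

```python
from pathlib import Path

def write_minimal_task(root: str) -> Path:
    """Create a toy benchmark task: an instruction prompt and a
    deterministic test suite. Layout is illustrative, not Harbor's
    authoritative schema."""
    task = Path(root) / "tasks" / "hello-task"
    task.mkdir(parents=True, exist_ok=True)
    # The prompt the agent under test receives.
    (task / "instruction.md").write_text(
        "Create greeting.txt containing the single word: hello\n"
    )
    # A deterministic check that produces the pass/fail signal.
    (task / "tests").mkdir(exist_ok=True)
    (task / "tests" / "test_outputs.py").write_text(
        "def test_greeting():\n"
        "    assert open('greeting.txt').read().strip() == 'hello'\n"
    )
    return task
```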

This means AutoAgent results are comparable with any other Harbor-evaluated agent.

Third Layer (Company)

Third Layer (thirdlayer.inc) describes their mission as solving "one of the hardest problems in deploying agents: AI models are generic, but people's work and processes are specific."[2]

Their generally available product Dex is a browser extension that learns organizational workflows by observing real work. AutoAgent appears to be the open-source research component of their broader self-configuring agent platform.

The team lists advisors and collaborators from the labs named above, is hiring engineers, and maintains a waitlist for its self-configuring agent product.

Strengths

  • Meta-level optimization — Optimizes the agent itself, not just downstream tasks
  • Harbor compatibility — Results are portable and comparable across agent implementations
  • Clean architecture — Single-file harness with clear edit boundaries keeps scope manageable
  • Docker isolation — Safe experimentation without host damage
  • Backed by a real company — Third Layer has advisors from top labs and a product roadmap

Cautions

  • Very new — Just open-sourced (April 2026), limited community adoption
  • Ships without tasks — Users must bring their own Harbor-format benchmarks
  • Single-file constraint — Real-world agent harnesses are multi-file; unclear how this scales
  • Company product risk — Open-source component may evolve to serve Third Layer's commercial interests

Key Stats

| Detail | Value |
| --- | --- |
| GitHub | kevinrgu/autoagent |
| License | MIT |
| Language | Python |
| Requirements | Docker, Python 3.10+, uv |
| Benchmark | Harbor framework |
| Company | Third Layer (thirdlayer.inc) |
| Created | April 2026 |

See also: Autoresearch Tools Compared for the full category analysis.

Research by Ry Walker Research.