
Autoresearch

Karpathy's autoresearch — autonomous AI agents that run LLM training experiments overnight. 30k stars in 7 days. The repo that launched a category.

Key takeaways

  • 630 lines of Python that let an AI agent autonomously run ML experiments overnight on a single GPU — 12 experiments/hour, ~100 while you sleep
  • The real innovation is program.md — natural language specs that define agent behavior, replacing traditional code. You're programming the agent's behavior, not the model's weights.
  • Spawned an entire ecosystem in 7 days: MLX port (Mac), Windows port, GPU kernel variant, distributed SETI@home variant, generic-metric variant
  • MIT licensed, 30k stars in one week. The simplicity (3 files that matter) is what makes it reproducible and extensible.

FAQ

What is Karpathy's autoresearch?

An autonomous AI research agent for LLM training. You give an agent a training setup, point it at program.md instructions, and it experiments overnight — modifying train.py, training for 5 minutes, keeping improvements, reverting failures, repeating.

What hardware do I need?

A single NVIDIA GPU (tested on H100), Python 3.10+, and the uv package manager. Any coding agent (Claude, Codex, etc.) serves as the AI researcher.

Can autoresearch be used for things other than ML?

The original repo is LLM training-specific, but the pattern generalizes. pi-autoresearch already extends it to test speed, bundle size, and Lighthouse scores. AutoKernel applies it to GPU kernel optimization.
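The generalization hinges on one small abstraction: any scalar metric works as long as the loop knows which direction is "better." A minimal sketch of that comparator, assuming a hypothetical helper name (the upstream repo itself only minimizes val_bpb):

```python
def improved(new, best, minimize=True):
    """Return True if `new` beats `best` under the chosen direction.

    minimize=True fits metrics like val_bpb or bundle size;
    minimize=False fits metrics like a Lighthouse score.
    (Illustrative helper, not part of the original repo.)
    """
    if best is None:  # first experiment always establishes a baseline
        return True
    return new < best if minimize else new > best
```

Swapping the metric function is the only change a variant like pi-autoresearch needs; the keep-or-revert loop stays identical.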

Overview

Autoresearch is Andrej Karpathy's open-source framework for autonomous AI research on LLM training. Released March 6, 2026, it hit 30,307 stars in one week — making it one of the fastest-growing repos in GitHub history.

The concept: give an AI agent a small but real LLM training setup (simplified nanochat), point it at program.md, and let it experiment overnight. The agent modifies train.py, trains for exactly 5 minutes, checks if validation performance improved, keeps or reverts, and repeats.

~12 experiments/hour. ~100 experiments overnight. You wake up to a log of what worked and what didn't — and hopefully a better model.
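The loop described above is simple enough to sketch in a few lines. This is a hedged reconstruction, not the repo's actual code: `propose_edit`, `train_and_eval`, and the git keep/revert mechanics are illustrative names for the roles the article describes.

```python
import json
import subprocess

def experiment_loop(propose_edit, train_and_eval, n_experiments=100,
                    budget_s=300, log_path="autoresearch.jsonl",
                    run=subprocess.run):
    """One overnight session: mutate, train, measure, keep or revert.

    propose_edit() lets the agent modify train.py and returns a short
    description; train_and_eval(budget_s) trains for the wall-clock
    budget and returns val_bpb (lower is better).
    """
    best = None
    for _ in range(n_experiments):
        desc = propose_edit()
        score = train_and_eval(budget_s)
        kept = best is None or score < best
        if kept:
            best = score
            run(["git", "commit", "-am", desc], check=True)  # keep the win
        else:
            run(["git", "checkout", "--", "train.py"], check=True)  # revert
        # append-only log survives restarts and context resets
        with open(log_path, "a") as f:
            f.write(json.dumps({"desc": desc, "val_bpb": score,
                                "kept": kept}) + "\n")
    return best
```

At ~5 minutes per experiment, `n_experiments=100` is roughly one night; the JSONL log is the record you read over coffee.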

Architecture

Three files that matter:

File        Role                                                          Modified by
prepare.py  Fixed. One-time data prep, dataloader, evaluation utilities   Nobody
train.py    Full GPT model, optimizer (Muon + AdamW), training loop.      Agent
            Everything fair game
program.md  "Research org code." Instructions for the agent               Human

Key Design Choices

  • Single file constraint — Agent only touches train.py. Keeps scope manageable, diffs reviewable.
  • Fixed 5-minute time budget — Wall clock, excluding startup/compilation. Makes experiments comparable across hardware.
  • val_bpb metric — Validation bits per byte. Lower is better. Vocab-size-independent so architectural changes are fairly compared.
  • Append-only log — autoresearch.jsonl survives restarts and context resets.
  • Branch-aware — Each experiment session runs on its own branch.
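The val_bpb choice deserves a concrete formula. Assuming the standard bits-per-byte definition (mean cross-entropy in nats per token, converted to bits, normalized by raw byte count rather than token count), a sketch; this mirrors the common definition, not necessarily the repo's exact code:

```python
import math

def val_bpb(mean_loss_nats, n_tokens, n_bytes):
    """Convert mean cross-entropy (nats/token) into bits per byte.

    Normalizing by raw bytes instead of tokens makes the metric
    independent of vocab size, so a tokenizer or architecture change
    can't game the number by shrinking the token count.
    """
    return mean_loss_nats * n_tokens / (n_bytes * math.log(2))
```

A model that spends exactly ln(2) nats per token on text averaging one token per byte scores exactly 1.0 bpb, regardless of vocabulary.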

The Big Idea

"One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of 'group meeting'. That era is long gone." — Karpathy, March 2026

The paradigm shift: you're not writing code, you're writing the markdown that tells the agent how to write code. The program.md file is essentially a lightweight "skill" — a natural language specification that defines agent behavior. The human is the meta-researcher.

Ecosystem

Within 7 days of release:

Fork/Variant          What it does
autoresearch-mlx      Apple Silicon port (MLX, no PyTorch)
autoresearch-win-rtx  Windows/RTX port
autokernel            GPU kernel optimization
autoresearch-at-home  Distributed SETI@home-style coordination
pi-autoresearch       Generic metric support for the pi editor

Strengths & Limitations

Strengths:

  • Elegant simplicity — 630 lines of Python, MIT license
  • The program.md framing is genuinely novel and immediately reproducible
  • Battle-tested by Karpathy himself
  • Fastest ecosystem spawn in recent memory

Limitations:

  • LLM training only (nanochat specifically)
  • Single GPU only
  • No built-in multi-agent coordination
  • Requires human to set up initial training code

Bottom Line

Autoresearch isn't just a tool — it's a category definition. The pattern (natural language specs → autonomous experiment loop → measurable improvement) applies far beyond ML training. The 30k-star week proves the idea resonates. The question isn't whether this pattern matters — it's who builds the platform layer.


Research by Ry Walker Research • methodology