Key takeaways
- 630 lines of Python that let an AI agent autonomously run ML experiments overnight on a single GPU — 12 experiments/hour, ~100 while you sleep
- The real innovation is program.md — natural language specs that define agent behavior, replacing traditional code. You're programming the program, not the model.
- 86,192 stars and 12,492 forks as of June 2026 — but the repo is dormant: no commits since March 26, 2026, with 185 open issues piling up
- Not formally licensed: the README references MIT, but no LICENSE file exists and GitHub reports "no license" — community PRs to fix it remain unmerged
FAQ
What is Karpathy's autoresearch?
An autonomous AI research agent for LLM training. You give an agent a training setup, point it at program.md instructions, and it experiments overnight — modifying train.py, training for 5 minutes, keeping improvements, reverting failures, repeating.
What hardware do I need?
A single NVIDIA GPU (tested on H100), Python 3.10+, and the uv package manager. Any coding agent (Claude, Codex, etc.) serves as the AI researcher.
Can autoresearch be used for things other than ML?
The original repo is LLM training-specific, but the pattern generalizes. pi-autoresearch already extends it to test speed, bundle size, and Lighthouse scores. AutoKernel applies it to GPU kernel optimization. PostHog applied the loop to its ClickHouse query engine.
Is autoresearch still maintained?
Effectively no. The last commit landed March 26, 2026, and as of June 2026 the repo has 185 open issues and unmerged PRs with no maintainer activity. The pattern lives on in forks and ports, but the original repo is dormant.
Overview
Status (June 11, 2026): Dormant. The repo's last push was March 26, 2026 — no commits in over two months — while 185 open issues and unmerged PRs accumulate. The project keeps gaining stars (86,192, with 12,492 forks) but has no active maintainer.
Autoresearch is Andrej Karpathy's open-source framework for autonomous AI research on LLM training. Released March 6, 2026, it hit 30,307 stars in one week — one of the fastest-growing repos in GitHub history — and stands at 86,192 stars as of June 2026.
The concept: give an AI agent a small but real LLM training setup (simplified nanochat), point it at program.md, and let it experiment overnight. The agent modifies train.py, trains for exactly 5 minutes, checks if validation performance improved, keeps or reverts, and repeats.
~12 experiments/hour. ~100 experiments overnight. You wake up to a log of what worked and what didn't — and hopefully a better model.
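The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the repo's code: `Experiment`, `research_loop`, `propose_change`, and `run_experiment` are hypothetical names, and the 5-minute training run is replaced by a callable that returns a val_bpb score so the sketch runs without a GPU. In a real setup the agent would edit train.py and a revert would be a version-control operation.

```python
from dataclasses import dataclass
from typing import Callable, List

TIME_BUDGET_MIN = 5  # fixed training budget per experiment, as stated in the text


@dataclass
class Experiment:
    description: str
    val_bpb: float
    kept: bool


def experiments_per_hour(budget_min: float = TIME_BUDGET_MIN) -> int:
    """Upper bound on throughput, ignoring startup and agent thinking time."""
    return int(60 // budget_min)


def research_loop(
    baseline_bpb: float,
    propose_change: Callable[[int], str],
    run_experiment: Callable[[str], float],
    n_experiments: int,
) -> List[Experiment]:
    """Keep a change only if it lowers val_bpb; otherwise fall back to the best state."""
    best = baseline_bpb
    log: List[Experiment] = []
    for i in range(n_experiments):
        change = propose_change(i)      # hypothetical: the agent edits train.py
        score = run_experiment(change)  # hypothetical: one fixed-budget training run
        kept = score < best             # lower val_bpb is better
        if kept:
            best = score                # keep: this becomes the new baseline
        # otherwise the change is reverted and the previous best stays in place
        log.append(Experiment(change, score, kept))
    return log


if __name__ == "__main__":
    # Stand-in scores so the sketch runs anywhere; these are not real results.
    fake_scores = {"wider mlp": 0.99, "higher lr": 1.02, "longer warmup": 0.98}
    names = list(fake_scores)
    history = research_loop(1.00, lambda i: names[i], fake_scores.__getitem__, len(names))
    for exp in history:
        print(f"{exp.description:15s} val_bpb={exp.val_bpb:.2f} kept={exp.kept}")
    print(experiments_per_hour(), "experiments/hour at most")
```

With a 5-minute budget the ceiling is 60 / 5 = 12 runs per hour, so an 8-hour night allows at most 96 runs, which is consistent with the ~100 figure.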
Architecture
The whole framework is ~630 lines of Python. Three files that matter:
| File | Role | Modified by |
|---|---|---|
| prepare.py | Fixed. One-time data prep, dataloader, evaluation utilities | Nobody |
| train.py | Full GPT model, optimizer (Muon + AdamW), training loop. Everything fair game | Agent |
| program.md | "Research org code." Instructions for the agent | Human |
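The "Modified by" column amounts to a simple ownership rule. The source states the rule but not how, or whether, it is enforced in code, so the following is only a rough sketch of how a harness could check it. `AGENT_WRITABLE` and `violations` are hypothetical names, and the list of changed paths is assumed to come from whatever diff tooling is in use.

```python
from typing import Iterable, List

# Ownership per the table: only train.py is open to the agent.
AGENT_WRITABLE = frozenset({"train.py"})


def violations(changed_paths: Iterable[str]) -> List[str]:
    """Return changed paths the agent was not supposed to touch, sorted."""
    return sorted(p for p in set(changed_paths) if p not in AGENT_WRITABLE)


if __name__ == "__main__":
    print(violations(["train.py"]))                # []
    print(violations(["train.py", "prepare.py"]))  # ['prepare.py']
```

A non-empty result would be a reason to reject the experiment before spending a training run on it.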
Key Design Choices
- Single file constraint: the agent only touches train.py, which keeps scope manageable and diffs reviewable.
- Fixed 5-minute time budget: wall clock, excluding startup/compilation. Every experiment gets the same budget, so runs on the same machine are directly comparable whatever the agent changes; because the budget is wall-clock, absolute results still depend on the hardware.
- val_bpb metric: validation bits per byte. Lower is better. It is vocab-size-independent, so architectural changes are compared fairly.
- Append-only log: autoresearch.jsonl survives restarts and context resets.
- Branch-aware: each experiment session runs on its own branch.
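Two of these choices are concrete enough to sketch. The first function shows the usual conversion from summed cross-entropy (in nats) to bits per byte, which is why the metric does not depend on how a tokenizer splits the text. The other two show an append-only JSON Lines log. The function names and record fields are assumptions for illustration; the source names only the metric and the autoresearch.jsonl file, not its schema.

```python
import json
import math
import tempfile
from pathlib import Path


def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert cross-entropy summed over a text (in nats) into bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_loss_nats / (math.log(2) * total_bytes)


def append_record(log_path: Path, record: dict) -> None:
    """Append one JSON object per line; earlier lines are never rewritten."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_records(log_path: Path) -> list:
    """Reload the full history, for example after a restart or context reset."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    # Same 4000-byte text under two hypothetical tokenizers: per-token loss
    # differs, but the total loss per byte (and so bits per byte) is identical.
    coarse = bits_per_byte(total_loss_nats=4.0 * 500, total_bytes=4000)
    fine = bits_per_byte(total_loss_nats=2.0 * 1000, total_bytes=4000)
    print(f"coarse tokenizer: {coarse:.4f}  fine tokenizer: {fine:.4f}")

    with tempfile.TemporaryDirectory() as tmp:
        log_path = Path(tmp) / "autoresearch.jsonl"
        append_record(log_path, {"run": 1, "val_bpb": coarse, "kept": True})
        append_record(log_path, {"run": 2, "val_bpb": fine, "kept": False})
        print(len(read_records(log_path)), "records recovered")
```

Because each run is a single appended line, a crash mid-session loses at most the run in progress, and a fresh agent context can rebuild its view of what has been tried by rereading the file.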
Pricing & License
Free — it's a public GitHub repo with no paid tier. But the licensing is murkier than the early "MIT licensed" coverage suggested: the README references MIT, yet the repo ships no LICENSE file, and GitHub's API reports "no license." Community attempts to fix this (issue #210, filed within the first week, and PR #575 from May 2026) remain unmerged. Strictly read, code without a license grant is all-rights-reserved — a real consideration for commercial use until a LICENSE file lands.
The Big Idea
"One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of 'group meeting'. That era is long gone." — Karpathy, March 2026
The paradigm shift: you're not writing code, you're writing the markdown that tells the agent how to write code. The program.md file is essentially a lightweight "skill" — a natural language specification that defines agent behavior. The human is the meta-researcher.
Ecosystem
Forks and variants that appeared within 7 days of release:
| Fork/Variant | What it does |
|---|---|
| autoresearch-mlx | Apple Silicon port (MLX, no PyTorch) |
| autoresearch-win-rtx | Windows/RTX port |
| autokernel | GPU kernel optimization |
| autoresearch-at-home | Distributed SETI@home-style coordination |
| pi-autoresearch | Generic metric support for the pi editor |
Adoption (as of June 2026)
The repo sits at 86,192 stars and 12,492 forks, and the "Karpathy loop" has spread well beyond nanochat training. Notable real-world applications:
- PostHog ran the loop against its ClickHouse query engine and surfaced a three-year-old timestamp-filter bug, with a fix that cut granules scanned by 62% on the benchmark query (June 2026).
- Shopify engineers ran 120 experiments against the Liquid templating engine, cutting parse-plus-render time from 7,469 to 3,534 microseconds (53%) — but the PR remains unmerged and reviewers flagged it as overfit to the benchmark.
- Vector Institute demonstrated a 9x wall-clock advantage by parallelizing the loop across GPUs.
Adoption is paradoxical: usage of the pattern is accelerating while the repo itself is unmaintained. Karpathy publicly sketched the next step — asynchronous, massively collaborative agents, SETI@home style — in mid-March, but never shipped it in this repo.
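The Shopify percentage can be sanity-checked from the two timings quoted above. The helper names below are hypothetical; the only inputs are the 7,469 and 3,534 microsecond figures from the text.

```python
def percent_reduction(before: float, after: float) -> float:
    """Share of the original time that was removed, in percent."""
    return 100.0 * (before - after) / before


def speedup(before: float, after: float) -> float:
    """How many times faster the new version is."""
    return before / after


if __name__ == "__main__":
    before_us, after_us = 7469, 3534  # parse-plus-render time, from the text
    print(f"time reduction: {percent_reduction(before_us, after_us):.1f}%")
    print(f"speedup: {speedup(before_us, after_us):.2f}x")
```

The reduction works out to roughly 52.7%, which rounds to the quoted 53%, and corresponds to a speedup of about 2.1x on that benchmark.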
What Developers Say
- PostHog engineer Robbie Coomber, after an overnight run on their query engine: "for almost three years, every query with a timestamp filter had not been using ClickHouse's primary key correctly" — the resulting fix "cut the number of granules ClickHouse had to scan by 62% on the benchmark query, and made the query itself meaningfully faster." (June 2026)
- On the much-publicized Shopify Liquid result: independent reviewer Josh Moody called the agent-generated code quality "just bad," and the PR's own author conceded the 53% gain was "probably somewhat overfit." (May 2026)
- Karpathy, on where it should go next: "The goal is not to emulate a single PhD student, it's to emulate a research community of them." (March 2026)
Changes Since March 2026 (profile date)
- Stars: 30,307 (first week) → 86,192; forks now at 12,492.
- Development stopped. Last push March 26, 2026; 185 open issues with no maintainer triage since.
- License gap confirmed. Early coverage (and this profile's first edition) called it MIT-licensed; in fact no LICENSE file exists and PR #575 to add one is still open.
- Real-world wins and warnings. PostHog's bug find validated the pattern outside ML training; Shopify's overfit benchmark showed its failure mode.
Strengths & Limitations
Strengths:
- Elegant simplicity — 630 lines of Python
- The program.md framing is genuinely novel and immediately reproducible
- Battle-tested by Karpathy himself, then validated in production codebases (PostHog)
- Fastest ecosystem spawn in recent memory
Limitations:
- Unmaintained — no commits since March 26, 2026; 185 open issues
- No LICENSE file despite README's MIT reference — legally ambiguous for commercial use
- LLM training only (nanochat specifically)
- Single GPU only
- No built-in multi-agent coordination
- Optimizes whatever metric you freeze — overfitting to the benchmark is the documented failure mode
Bottom Line
Not recommended as a dependency: the repo is dormant (no commits since March 26, 2026, 185 open issues) and ships no license file, so you can't safely build on the code itself. Recommended as a reference implementation: the pattern — natural language specs → autonomous experiment loop → measurable improvement — has been validated far beyond ML training, finding a three-year-old production bug at PostHog while also demonstrating its overfitting failure mode at Shopify. Outlook: Karpathy got his 86k-star category definition and moved on; the platform layer he gestured at (massively collaborative agent research) remains unbuilt, and the forks are where the living code is.
Research by Ry Walker Research
Sources
- [1] karpathy/autoresearch GitHub
- [2] Karpathy's tweet on autoresearch vision
- [3] VentureBeat coverage
- [4] PR #575: Add MIT license file matching README (unmerged)
- [5] TechTimes: Shopify's 53% speed claim still unmerged, flagged as overfit
- [6] PostHog: Autoresearch found a 3-year-old bug in our query engine
- [7] Karpathy on the SETI@home-style next step for autoresearch