Key takeaways
- 630 lines of Python that let an AI agent autonomously run ML experiments overnight on a single GPU — ~12 experiments/hour, ~100 while you sleep
- The real innovation is program.md — natural language specs that define agent behavior, replacing traditional code. You're programming the program, not the model.
- Spawned an entire ecosystem in 7 days: MLX port (Mac), Windows port, GPU kernel variant, distributed SETI@home variant, generic-metric variant
- MIT licensed, 30k stars in one week. The simplicity (3 files that matter) is what makes it reproducible and extensible.
FAQ
What is Karpathy's autoresearch?
An autonomous AI research agent for LLM training. You give an agent a training setup, point it at program.md instructions, and it experiments overnight — modifying train.py, training for 5 minutes, keeping improvements, reverting failures, repeating.
What hardware do I need?
A single NVIDIA GPU (tested on H100), Python 3.10+, and the uv package manager. Any coding agent (Claude, Codex, etc.) serves as the AI researcher.
Can autoresearch be used for things other than ML?
The original repo is LLM training-specific, but the pattern generalizes. pi-autoresearch already extends it to test speed, bundle size, and Lighthouse scores. AutoKernel applies it to GPU kernel optimization.
Overview
Autoresearch is Andrej Karpathy's open-source framework for autonomous AI research on LLM training. Released March 6, 2026, it hit 30,307 stars in one week — making it one of the fastest-growing repos in GitHub history.
The concept: give an AI agent a small but real LLM training setup (simplified nanochat), point it at program.md, and let it experiment overnight. The agent modifies train.py, trains for exactly 5 minutes, checks if validation performance improved, keeps or reverts, and repeats.
~12 experiments/hour. ~100 experiments overnight. You wake up to a log of what worked and what didn't — and hopefully a better model.
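The overnight loop can be sketched in a few lines. This is an illustrative sketch, not the repo's actual code: `propose_edit` and `train_and_eval` are hypothetical stand-ins for the coding agent's edit step and the fixed 5-minute training run, and the git commit/revert is shown only as a comment.

```python
import json
import random
from pathlib import Path

LOG = Path("autoresearch.jsonl")

# Hypothetical stand-ins: in the real repo the coding agent edits
# train.py, and val_bpb comes from the fixed-budget training run.
def propose_edit(path: str) -> None:
    pass  # the agent would rewrite train.py here

def train_and_eval(budget_s: int = 300) -> float:
    return random.uniform(0.8, 1.2)  # stand-in val_bpb from a short run

def step(best_bpb: float) -> float:
    """One keep-or-revert iteration of the overnight loop."""
    propose_edit("train.py")
    val_bpb = train_and_eval(budget_s=300)
    kept = val_bpb < best_bpb  # lower val_bpb is better
    # real loop: commit train.py on improvement,
    # `git checkout -- train.py` on failure
    with LOG.open("a") as f:  # append-only log survives restarts
        f.write(json.dumps({"val_bpb": val_bpb, "kept": kept}) + "\n")
    return val_bpb if kept else best_bpb
```

Running `step` in a loop for ~8 hours at ~5 minutes per iteration yields the ~100 overnight experiments described above.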
Architecture
Three files that matter:
| File | Role | Modified by |
|---|---|---|
| prepare.py | Fixed. One-time data prep, dataloader, evaluation utilities | Nobody |
| train.py | Full GPT model, optimizer (Muon + AdamW), training loop. Everything fair game | Agent |
| program.md | "Research org code." Instructions for the agent | Human |
Key Design Choices
- Single-file constraint — The agent only touches train.py. Keeps scope manageable and diffs reviewable.
- Fixed 5-minute time budget — Wall clock, excluding startup/compilation. Makes experiments comparable across hardware.
- val_bpb metric — Validation bits per byte. Lower is better. Vocab-size-independent, so architectural changes are fairly compared.
- Append-only log — autoresearch.jsonl survives restarts and context resets.
- Branch-aware — Each experiment session runs on its own branch.
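Why bits per byte rather than per-token loss? Cross-entropy per token depends on the tokenizer's vocabulary, but normalizing by the raw byte count of the validation text does not. A minimal sketch of the conversion (assumed formula, not the repo's exact code):

```python
import math

def val_bpb(mean_nll_nats: float, tokens: int, n_bytes: int) -> float:
    """Convert mean next-token cross-entropy (nats/token) to bits per byte.

    Dividing total bits by raw byte count makes the metric independent
    of vocabulary size, so runs with different tokenizers compare fairly.
    """
    total_bits = mean_nll_nats * tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# e.g. a loss of 1.0 nat/token with ~4 bytes per token:
# val_bpb(1.0, tokens=1000, n_bytes=4000) ≈ 0.3607
```

A change that shrinks the vocabulary lowers per-token loss almost for free; val_bpb is immune to that trick, which is what makes keep-or-revert decisions trustworthy.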
The Big Idea
"One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of 'group meeting'. That era is long gone." — Karpathy, March 2026
The paradigm shift: you're not writing code, you're writing the markdown that tells the agent how to write code. The program.md file is essentially a lightweight "skill" — a natural language specification that defines agent behavior. The human is the meta-researcher.
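To make the idea concrete, a program.md in this spirit might read as follows. This is a hypothetical sketch, not the repo's actual file:

```markdown
# Research program

Goal: minimize val_bpb on the held-out validation set.

Rules:
- Only modify train.py; never touch prepare.py.
- Each experiment trains for exactly 5 minutes of wall clock.
- After each run, append the result to autoresearch.jsonl.
- Keep the change if val_bpb improved; otherwise revert train.py.
- Prefer small, reviewable diffs; one idea per experiment.
```

The human edits this spec between overnight runs; the agent executes it. That is the "programming the program" loop.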
Ecosystem
Within 7 days of release:
| Fork/Variant | What it does |
|---|---|
| autoresearch-mlx | Apple Silicon port (MLX, no PyTorch) |
| autoresearch-win-rtx | Windows/RTX port |
| autokernel | GPU kernel optimization |
| autoresearch-at-home | Distributed SETI@home-style coordination |
| pi-autoresearch | Generic metric support for pi editor |
Strengths & Limitations
Strengths:
- Elegant simplicity — 630 lines of Python, MIT license
- The program.md framing is genuinely novel and immediately reproducible
- Battle-tested by Karpathy himself
- Fastest ecosystem spawn in recent memory
Limitations:
- LLM training only (nanochat specifically)
- Single GPU only
- No built-in multi-agent coordination
- Requires human to set up initial training code
Bottom Line
Autoresearch isn't just a tool — it's a category definition. The pattern (natural language specs → autonomous experiment loop → measurable improvement) applies far beyond ML training. The 30k-star week proves the idea resonates. The question isn't whether this pattern matters — it's who builds the platform layer.
Research by Ry Walker Research • methodology