
Autoresearch

Karpathy's autoresearch — autonomous AI agents that run LLM training experiments overnight. 30k stars in 7 days. The repo that launched a category.

Key takeaways

  • 630 lines of Python that let an AI agent autonomously run ML experiments overnight on a single GPU — 12 experiments/hour, ~100 while you sleep
  • The real innovation is program.md — natural language specs that define agent behavior, replacing traditional code. You're programming the agent's behavior, not the model's weights.
  • Spawned an entire ecosystem in 7 days: MLX port (Mac), Windows port, GPU kernel variant, distributed SETI@home variant, generic-metric variant
  • MIT licensed, 30k stars in one week. The simplicity (3 files that matter) is what makes it reproducible and extensible.

FAQ

What is Karpathy's autoresearch?

An autonomous AI research agent for LLM training. You give an agent a training setup, point it at program.md instructions, and it experiments overnight — modifying train.py, training for 5 minutes, keeping improvements, reverting failures, repeating.

What hardware do I need?

A single NVIDIA GPU (tested on H100), Python 3.10+, and the uv package manager. Any coding agent (Claude, Codex, etc.) serves as the AI researcher.

Can autoresearch be used for things other than ML?

The original repo is LLM training-specific, but the pattern generalizes. pi-autoresearch already extends it to test speed, bundle size, and Lighthouse scores. AutoKernel applies it to GPU kernel optimization.
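The generalization hinges on one small abstraction: any scalar metric works as long as the loop knows which direction is "better." A minimal sketch of that comparator, assuming a hypothetical helper name (the upstream repo itself only minimizes val_bpb):

```python
def improved(new, best, minimize=True):
    """Return True if `new` beats `best` under the chosen direction.

    minimize=True fits metrics like val_bpb or bundle size;
    minimize=False fits metrics like a Lighthouse score.
    (Illustrative helper, not part of the original repo.)
    """
    if best is None:  # first experiment always establishes a baseline
        return True
    return new < best if minimize else new > best
```

Swapping the metric function is the only change a variant like pi-autoresearch needs; the keep-or-revert loop stays identical.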

Overview

Autoresearch is Andrej Karpathy's open-source framework for autonomous AI research on LLM training. Released March 6, 2026, it hit 30,307 stars in one week — making it one of the fastest-growing repos in GitHub history.

The concept: give an AI agent a small but real LLM training setup (simplified nanochat), point it at program.md, and let it experiment overnight. The agent modifies train.py, trains for exactly 5 minutes, checks if validation performance improved, keeps or reverts, and repeats.

~12 experiments/hour. ~100 experiments overnight. You wake up to a log of what worked and what didn't — and hopefully a better model.
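The loop described above is simple enough to sketch in a few lines. This is a hedged reconstruction, not the repo's actual code: `propose_edit`, `train_and_eval`, and the git keep/revert mechanics are illustrative names for the roles the article describes.

```python
import json
import subprocess

def experiment_loop(propose_edit, train_and_eval, n_experiments=100,
                    budget_s=300, log_path="autoresearch.jsonl",
                    run=subprocess.run):
    """One overnight session: mutate, train, measure, keep or revert.

    propose_edit() lets the agent modify train.py and returns a short
    description; train_and_eval(budget_s) trains for the wall-clock
    budget and returns val_bpb (lower is better).
    """
    best = None
    for _ in range(n_experiments):
        desc = propose_edit()
        score = train_and_eval(budget_s)
        kept = best is None or score < best
        if kept:
            best = score
            run(["git", "commit", "-am", desc], check=True)  # keep the win
        else:
            run(["git", "checkout", "--", "train.py"], check=True)  # revert
        # append-only log survives restarts and context resets
        with open(log_path, "a") as f:
            f.write(json.dumps({"desc": desc, "val_bpb": score,
                                "kept": kept}) + "\n")
    return best
```

At ~5 minutes per experiment, `n_experiments=100` is roughly one night; the JSONL log is the record you read over coffee.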

Architecture

Three files that matter:

File        Role                                                          Modified by
prepare.py  Fixed. One-time data prep, dataloader, evaluation utilities   Nobody
train.py    Full GPT model, optimizer (Muon + AdamW), training loop.      Agent
            Everything fair game
program.md  "Research org code." Instructions for the agent               Human

Key Design Choices

  • Single file constraint — Agent only touches train.py. Keeps scope manageable, diffs reviewable.
  • Fixed 5-minute time budget — Wall clock, excluding startup/compilation. Makes experiments comparable across hardware.
  • val_bpb metric — Validation bits per byte. Lower is better. Vocab-size-independent so architectural changes are fairly compared.
  • Append-only log — autoresearch.jsonl survives restarts and context resets.
  • Branch-aware — Each experiment session runs on its own branch.
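The val_bpb choice deserves a concrete formula. Assuming the standard bits-per-byte definition (mean cross-entropy in nats per token, converted to bits, normalized by raw byte count rather than token count), a sketch; this mirrors the common definition, not necessarily the repo's exact code:

```python
import math

def val_bpb(mean_loss_nats, n_tokens, n_bytes):
    """Convert mean cross-entropy (nats/token) into bits per byte.

    Normalizing by raw bytes instead of tokens makes the metric
    independent of vocab size, so a tokenizer or architecture change
    can't game the number by shrinking the token count.
    """
    return mean_loss_nats * n_tokens / (n_bytes * math.log(2))
```

A model that spends exactly ln(2) nats per token on text averaging one token per byte scores exactly 1.0 bpb, regardless of vocabulary.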

The Big Idea

"One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of 'group meeting'. That era is long gone." — Karpathy, March 2026

The paradigm shift: you're not writing code, you're writing the markdown that tells the agent how to write code. The program.md file is essentially a lightweight "skill" — a natural language specification that defines agent behavior. The human is the meta-researcher.

Ecosystem

Within 7 days of release:

Fork/Variant          What it does
autoresearch-mlx      Apple Silicon port (MLX, no PyTorch)
autoresearch-win-rtx  Windows/RTX port
autokernel            GPU kernel optimization
autoresearch-at-home  Distributed SETI@home-style coordination
pi-autoresearch       Generic metric support for the pi editor

Strengths & Limitations

Strengths:

  • Elegant simplicity — 630 lines of Python, MIT license
  • The program.md framing is genuinely novel and immediately reproducible
  • Battle-tested by Karpathy himself
  • Fastest ecosystem spawn in recent memory

Limitations:

  • LLM training only (nanochat specifically)
  • Single GPU only
  • No built-in multi-agent coordination
  • Requires human to set up initial training code

Bottom Line

Autoresearch isn't just a tool — it's a category definition. The pattern (natural language specs → autonomous experiment loop → measurable improvement) applies far beyond ML training. The 30k-star week proves the idea resonates. The question isn't whether this pattern matters — it's who builds the platform layer.


Research by Ry Walker Research • methodology