Key takeaways
- Karpathy's autoresearch hit 30k stars in 7 days, spawning an entire ecosystem of experiment loop agents for LLM training, GPU kernels, and generic metrics
- The core innovation is program.md as agent orchestration — natural language specs that define what the agent optimizes, replacing traditional code
- The pattern generalizes beyond ML: pi-autoresearch proves any measurable metric (test speed, bundle size, Lighthouse scores) can be an autoresearch target
- Nobody has built the general-purpose autoresearch-as-a-service platform yet — the gap between vertical tools and a horizontal orchestration layer is wide open
FAQ
What is autoresearch?
Autoresearch is a pattern where AI agents autonomously run experiment loops — modify code, run a benchmark, measure the result, keep improvements or revert failures, and repeat. Coined by Andrej Karpathy in March 2026.
How many experiments can autoresearch run overnight?
Karpathy's original runs ~12 experiments/hour (~100 overnight) with 5-minute training budgets. AutoKernel runs ~40 experiments/hour with 90-second cycles. Results vary by domain and benchmark duration.
Does autoresearch only work for ML training?
No. pi-autoresearch generalizes the pattern to any measurable metric — test speed, bundle size, build times, Lighthouse scores. AutoKernel applies it to GPU kernel optimization. The pattern is domain-agnostic.
What is the difference between autoresearch and deep research agents?
Autoresearch runs code experiment loops (edit → benchmark → keep/revert). Deep research agents search the web, read sources, and synthesize knowledge reports. Both are autonomous but solve different problems.
Executive Summary
Autoresearch is the hottest new category in AI tooling. In one week, Karpathy's autoresearch repo hit 30k stars and spawned an ecosystem of tools that let AI agents autonomously run experiments — modifying code, measuring results, and iterating without human intervention.
The category spans three tiers: experiment loop agents (edit → benchmark → keep/revert), deep research agents (search → read → synthesize reports), and scientific discovery agents (ideate → experiment → write papers). The connecting thread: the human defines the objective, the agent runs the loop.
Key Findings:
- Karpathy's framing shift is the real innovation — program.md as "research org code" turns natural language into agent orchestration
- The pattern generalizes immediately — within 7 days: MLX port, Windows port, GPU kernel variant, distributed variant, generic-metric variant
- Deep research is already commoditized — 10+ open-source agents, differentiation is shifting to coordination and specialized models
- The platform gap is wide open — nobody has built general-purpose "autoresearch-as-a-service" yet
Strategic Planning Assumptions:
- By Q3 2026, every major coding agent (Cursor, Claude Code, Codex) will have native experiment loop support
- By 2027, distributed autoresearch (multi-agent coordination) will be the default, not single-agent loops
- The deep research agent space will consolidate around 2-3 winners with specialized fine-tuned models
Market Definition
Autoresearch tools are AI systems that autonomously conduct research — whether through code experimentation, web knowledge synthesis, or scientific discovery — with minimal human intervention.
Inclusion Criteria:
- Autonomous operation (agent decides what to try next)
- Measurable output (metrics, reports, or papers)
- Open source or publicly documented
- Active development (commits in last 6 months)
Exclusion Criteria:
- Manual AI-assisted tools (copilots that suggest but don't act)
- Pure benchmarking frameworks without agent loops
- Proprietary-only products (OpenAI Deep Research, Gemini Deep Research)
Tier 1: Experiment Loop Agents
The "overnight optimization" pattern. Agent modifies a single file, runs a fixed-time benchmark, keeps improvements, reverts failures, repeats autonomously.
Market Map
| Tool | ⭐ Stars | Created | Domain | Key Differentiator |
|---|---|---|---|---|
| karpathy/autoresearch | 30,307 | Mar 6, 2026 | LLM training | The original. program.md as org chart |
| davebcn87/pi-autoresearch | 817 | Mar 11, 2026 | Any metric | Generalizes to test speed, bundle size, Lighthouse |
| RightNow-AI/autokernel | 556 | Mar 11, 2026 | GPU kernels | Profiles PyTorch models, optimizes Triton/CUDA |
| autoresearch-at-home | 188 | Mar 10, 2026 | Distributed | SETI@home-style multi-agent coordination |
| Hyperspace AGI | 696 | Mar 8, 2026 | Multi-domain | Distributed P2P autoresearch on 2M-node network |
| autoresearch-mlx | 633 | Mar 8, 2026 | LLM training (Mac) | Apple Silicon MLX port |
| autoresearch-mlx-mkw | 1 | Mar 12, 2026 | LLM training (Mac) | Deep MLX rewrite: Gated DeltaNet hybrid attention, runs on 16GB M4 Air |
| autoresearch-win-rtx | 158 | Mar 8, 2026 | LLM training (Win) | Windows/RTX port |
The Core Architecture
All experiment loop tools share this pattern:
- program.md — Natural language instructions defining what to optimize, constraints, and strategy
- Single file constraint — Agent only modifies one file (e.g., train.py), keeping scope manageable
- Fixed time budget — Each experiment runs for the same duration, making results comparable
- Append-only log — Results survive restarts and context resets
- Keep/revert decision — Binary outcome per experiment, committed to git
The agent reads program.md, modifies the target file, runs the benchmark, and decides to keep or revert. Then it repeats — forever, or until interrupted.
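A minimal sketch of that loop in Python follows. To be clear about assumptions: the agent object and its propose_edit method are hypothetical stand-ins for a real coding agent, and the benchmark convention (train.py self-limits to the time budget and prints its final validation loss as the last stdout line) is invented for illustration. This is not Karpathy's implementation.

```python
import subprocess
from pathlib import Path

TARGET = Path("train.py")      # the single file the agent is allowed to modify
LOG = Path("experiments.log")  # append-only log: survives restarts and context resets

def benchmark() -> float:
    """Run one fixed-budget experiment. Assumes train.py self-limits to the
    time budget and prints its final validation loss as the last stdout line."""
    result = subprocess.run(["python", str(TARGET)], capture_output=True, text=True)
    return float(result.stdout.strip().splitlines()[-1])

def experiment_loop(agent, best: float) -> None:
    """Edit -> benchmark -> keep/revert, forever. `agent` and its
    propose_edit() method are hypothetical, not a real library API."""
    while True:
        idea = agent.propose_edit(TARGET.read_text())
        TARGET.write_text(idea.patched_source)
        score = benchmark()
        if score < best:  # lower loss wins: keep the edit and commit it
            best = score
            subprocess.run(["git", "commit", "-am", f"keep: {idea.summary} ({score:.4f})"])
            verdict = "keep"
        else:             # regression: revert the working tree to the last commit
            subprocess.run(["git", "checkout", "--", str(TARGET)])
            verdict = "revert"
        with LOG.open("a") as log:
            log.write(f"{idea.summary}\t{score:.4f}\t{verdict}\n")
```

The per-experiment git commit or revert gives the binary outcome described above, and the append-only log file is what lets runs survive restarts and context resets.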
Key Innovation: program.md as Orchestration
Karpathy's insight: you're not writing code, you're writing the markdown that tells the agent how to write code. The human is the meta-researcher. program.md is essentially a lightweight "skill" — a natural language specification that defines agent behavior.
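A concrete example makes this tangible. The following is a hypothetical program.md written in the spirit of the pattern, not the actual file from karpathy/autoresearch:

```markdown
# program.md (illustrative sketch, not Karpathy's actual file)

## Objective
Minimize validation loss of train.py after a 5-minute training run on one GPU.

## Constraints
- You may only edit train.py. Do not touch data loading or eval code.
- Each experiment gets exactly 5 minutes of wall-clock training.
- Keep the change if validation loss improves; otherwise revert via git.

## Strategy hints
- Test one idea per experiment (LR schedule, init, architecture tweak).
- Log every result to experiments.log with a one-line summary.
```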
This is directly analogous to how coding agent orchestration platforms define agent tasks through specs rather than code.
Tier 2: Deep Research Agents
Web-based knowledge synthesis. These agents search, read, reason, and produce comprehensive research reports.
Market Map
| Tool | ⭐ Stars | Created | Key Differentiator |
|---|---|---|---|
| gpt-researcher | 25,701 | May 2023 | OG. Planner/execution pattern. 20+ sources per report |
| dzhng/deep-research | 18,562 | Feb 2025 | Simplest implementation (under 500 LoC). Depth/breadth controls |
| Tongyi DeepResearch | 18,422 | Jan 2025 | SOTA benchmarks. RL-trained Qwen3-30B-A3B |
| open_deep_research | 10,797 | Nov 2024 | LangGraph. Multi-provider, MCP, no-code UI |
| open-deep-research (Firecrawl) | 6,191 | Feb 2025 | Firecrawl-powered. Simple clone |
| DeepResearchAgent | 3,237 | May 2025 | Self-evolving agents with Autogenesis protocol |
| agents-deep-research | 739 | Mar 2025 | OpenAI Agents SDK |
Diverging Approaches
Two schools of thought are emerging:
Prompt-based: Use frontier models (GPT-5, Claude, Gemini) with good prompting and tool orchestration. Exemplified by gpt-researcher, dzhng/deep-research, LangChain's open_deep_research.
Fine-tuned: Train specialized models for research agent tasks using RL (GRPO, Reinforce++). Exemplified by Tongyi DeepResearch (Qwen3-30B-A3B) and SkyworkAI's Autogenesis.
The fine-tuned approach is gaining ground — Tongyi leads benchmarks across BrowseComp, GAIA, and HLE despite being a smaller model than frontier alternatives.
Tier 3: Scientific Discovery Agents
End-to-end: ideation → experiment → paper writing → review.
Market Map
| Tool | ⭐ Stars | Created | Key Differentiator |
|---|---|---|---|
| AI Scientist v1 | 12,330 | Aug 2024 | First fully automated science pipeline |
| AI Scientist v2 | 2,266 | Apr 2025 | Agentic tree search. First AI paper accepted through peer review |
| AutoRA | 80 | 2020 | Academic. Model discovery, experimental design, open science |
AI Scientist v2 is a milestone — the first paper written entirely by AI accepted through peer review at an ICLR workshop. The system autonomously generates hypotheses, designs experiments, runs them, and writes LaTeX papers without human templates.
Competitive Dynamics
What Makes This a Category
- The framing shift. "You're not editing Python, you're programming the program.md." This reframes agent orchestration as natural language specification.
- Immediate ecosystem. 7 days from Karpathy's release to 6+ variants. The pattern is extractable — it works for LLM training, GPU kernels, test suites, bundle sizes.
- Convergence with coding agents. Experiment loops are what coding agents already do (edit → test → iterate), but with a clear fitness function. Any CI pipeline + metric = autoresearch target (see the sketch after this list).
- The "research community" vision. Karpathy: "The goal is not to emulate a single PhD student, it's to emulate a research community of them." autoresearch-at-home is the first implementation of distributed agent coordination for research.
The Platform Gap
Currently fragmented:
- Karpathy's is LLM training-specific
- pi-autoresearch is pi editor-specific
- AutoKernel is GPU kernel-specific
- Deep research agents are web search-specific
Nobody has built the horizontal layer: point any coding agent at any repo + any metric, define a program.md, and get autoresearch-as-a-service. This is the obvious platform opportunity.
Technical Comparison
| Dimension | Experiment Loops | Deep Research | Scientific Discovery |
|---|---|---|---|
| Input | Code + metric | Query/topic | Research area |
| Agent action | Edit code, run benchmark | Search web, read sources | Design experiments, write papers |
| Output | Optimized code + log | Markdown/PDF report | LaTeX paper with results |
| Loop type | Keep/revert per experiment | Depth/breadth exploration | Tree search over hypotheses |
| Duration | Hours to days | Minutes to hours | Hours to days |
| Compute | GPU required | API calls only | GPU + API calls |
| Coordination | Emerging (at-home) | Not needed | Not implemented |
What to Watch
Near-term (Q2 2026)
- Every coding agent adds native experiment loop support
- Karpathy's autoresearch expands beyond nanochat
- More vertical applications (compiler optimization, API latency, cost reduction)
Medium-term (2026-2027)
- Distributed autoresearch becomes default (multi-agent swarms)
- Deep research consolidates around specialized fine-tuned models
- Platform layer emerges for "autoresearch-as-a-service"
Long-term (2027+)
- Self-evolving agents (SkyworkAI's Autogenesis pattern) become mainstream
- Research agents coordinate across organizations (the "research community" vision)
- Experiment loops extend to non-code domains (business metrics, marketing, operations)
Bottom Line
Autoresearch is a paradigm shift hiding inside a simple loop. The core pattern — agent modifies code, measures result, keeps or reverts — is trivially simple. What makes it powerful is the orchestration layer: program.md as natural language agent specs, fixed-budget experiments for comparability, and append-only logs for continuity across sessions.
The experiment loop agents (Tier 1) are most relevant to coding agent platforms. The pattern generalizes to any measurable outcome, and the multi-agent coordination problem (autoresearch-at-home) is the next frontier.
The biggest opportunity: A platform that makes any repo + any metric into an autoresearch target, with multi-agent coordination, shared results, and optimized scheduling. The vertical tools exist. The horizontal platform doesn't — yet.
Research by Ry Walker Research
Sources
- [1] karpathy/autoresearch
- [2] davebcn87/pi-autoresearch
- [3] RightNow-AI/autokernel
- [4] mutable-state-inc/autoresearch-at-home
- [5] assafelovic/gpt-researcher
- [6] dzhng/deep-research
- [7] Alibaba-NLP/DeepResearch (Tongyi)
- [8] langchain-ai/open_deep_research
- [9] SakanaAI/AI-Scientist-v2
- [10] SkyworkAI/DeepResearchAgent
- [11] VentureBeat Coverage
- [12] Awesome Deep Research (comprehensive list)
- [13] matt-k-wong/autoresearch_mlx_mkw
- [14] hyperspaceai/agi