
Autoresearch Tools

Category analysis of 13+ autoresearch tools — autonomous AI agents that run experiment loops, deep web research, and scientific discovery. Covers Karpathy's autoresearch, pi-autoresearch, AutoKernel, Hyperspace AGI, GPT Researcher, Tongyi DeepResearch, AI Scientist, and more.

Key takeaways

  • Karpathy's autoresearch hit 30k stars in 7 days, spawning an entire ecosystem of experiment loop agents for LLM training, GPU kernels, and generic metrics
  • The core innovation is program.md as agent orchestration — natural language specs that define what the agent optimizes, replacing traditional code
  • The pattern generalizes beyond ML: pi-autoresearch proves any measurable metric (test speed, bundle size, Lighthouse scores) can be an autoresearch target
  • Nobody has built the general-purpose autoresearch-as-a-service platform yet — the gap between vertical tools and a horizontal orchestration layer is wide open

FAQ

What is autoresearch?

Autoresearch is a pattern where AI agents autonomously run experiment loops — modify code, run a benchmark, measure the result, keep improvements or revert failures, and repeat. The term was coined by Andrej Karpathy in March 2026.

How many experiments can autoresearch run overnight?

Karpathy's original runs ~12 experiments/hour (~100 overnight) with 5-minute training budgets. AutoKernel runs ~40 experiments/hour with 90-second cycles. Results vary by domain and benchmark duration.

Does autoresearch only work for ML training?

No. pi-autoresearch generalizes the pattern to any measurable metric — test speed, bundle size, build times, Lighthouse scores. AutoKernel applies it to GPU kernel optimization. The pattern is domain-agnostic.

What is the difference between autoresearch and deep research agents?

Autoresearch runs code experiment loops (edit → benchmark → keep/revert). Deep research agents search the web, read sources, and synthesize knowledge reports. Both are autonomous but solve different problems.

Executive Summary

Autoresearch is the hottest new category in AI tooling. In one week, Karpathy's autoresearch repo hit 30k stars and spawned an ecosystem of tools that let AI agents autonomously run experiments — modifying code, measuring results, and iterating without human intervention.

The category spans three tiers: experiment loop agents (edit → benchmark → keep/revert), deep research agents (search → read → synthesize reports), and scientific discovery agents (ideate → experiment → write papers). The connecting thread: the human defines the objective, the agent runs the loop.

Key Findings:

  • Karpathy's framing shift is the real innovation: program.md as "research org code" turns natural language into agent orchestration
  • The pattern generalizes immediately — within 7 days: MLX port, Windows port, GPU kernel variant, distributed variant, generic-metric variant
  • Deep research is already commoditized — 10+ open-source agents, differentiation is shifting to coordination and specialized models
  • The platform gap is wide open — nobody has built general-purpose "autoresearch-as-a-service" yet

Strategic Planning Assumptions:

  • By Q3 2026, every major coding agent (Cursor, Claude Code, Codex) will have native experiment loop support
  • By 2027, distributed autoresearch (multi-agent coordination) will be the default, not single-agent loops
  • The deep research agent space will consolidate around 2-3 winners with specialized fine-tuned models

Market Definition

Autoresearch tools are AI systems that autonomously conduct research — whether through code experimentation, web knowledge synthesis, or scientific discovery — with minimal human intervention.

Inclusion Criteria:

  • Autonomous operation (agent decides what to try next)
  • Measurable output (metrics, reports, or papers)
  • Open source or publicly documented
  • Active development (commits in last 6 months)

Exclusion Criteria:

  • Manual AI-assisted tools (copilots that suggest but don't act)
  • Pure benchmarking frameworks without agent loops
  • Proprietary-only products (OpenAI Deep Research, Gemini Deep Research)

Tier 1: Experiment Loop Agents

The "overnight optimization" pattern. Agent modifies a single file, runs a fixed-time benchmark, keeps improvements, reverts failures, repeats autonomously.

Market Map

| Tool | ⭐ Stars | Created | Domain | Key Differentiator |
| --- | --- | --- | --- | --- |
| karpathy/autoresearch | 30,307 | Mar 6, 2026 | LLM training | The original. program.md as org chart |
| davebcn87/pi-autoresearch | 817 | Mar 11, 2026 | Any metric | Generalizes to test speed, bundle size, Lighthouse |
| RightNow-AI/autokernel | 556 | Mar 11, 2026 | GPU kernels | Profiles PyTorch models, optimizes Triton/CUDA |
| autoresearch-at-home | 188 | Mar 10, 2026 | Distributed | SETI@home-style multi-agent coordination |
| Hyperspace AGI | 696 | Mar 8, 2026 | Multi-domain | Distributed P2P autoresearch on 2M-node network |
| autoresearch-mlx | 633 | Mar 8, 2026 | LLM training (Mac) | Apple Silicon MLX port |
| autoresearch-mlx-mkw | 1 | Mar 12, 2026 | LLM training (Mac) | Deep MLX rewrite: Gated DeltaNet hybrid attention, runs on 16GB M4 Air |
| autoresearch-win-rtx | 158 | Mar 8, 2026 | LLM training (Win) | Windows/RTX port |

The Core Architecture

All experiment loop tools share this pattern:

  1. program.md — Natural language instructions defining what to optimize, constraints, and strategy
  2. Single file constraint — Agent only modifies one file (e.g., train.py), keeping scope manageable
  3. Fixed time budget — Each experiment runs for the same duration, making results comparable
  4. Append-only log — Results survive restarts and context resets
  5. Keep/revert decision — Binary outcome per experiment, committed to git

The agent reads program.md, modifies the target file, runs the benchmark, and decides to keep or revert. Then repeats — forever, or until interrupted.
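The loop above can be sketched in a few lines. This is a minimal illustration, not any specific tool's implementation: `propose_edit` stands in for the agent's code modification, `benchmark` for the fixed-budget measurement, and the real tools commit keeps to git rather than snapshotting file content in memory.

```python
import datetime
import pathlib


def autoresearch_loop(target, propose_edit, benchmark, log, n=10):
    """Run n experiments against one file: edit, measure, keep or revert.

    propose_edit(path) mutates the target file; benchmark() returns a score
    (higher is better). Real tools commit keeps to git; this sketch snapshots
    file content in memory, which plays the same role.
    """
    target = pathlib.Path(target)
    log = pathlib.Path(log)
    best = benchmark()                    # baseline for the current file
    for _ in range(n):
        snapshot = target.read_text()     # cheap stand-in for a git commit
        propose_edit(target)              # the agent's edit happens here
        score = benchmark()
        kept = score > best               # binary keep/revert decision
        if kept:
            best = score
        else:
            target.write_text(snapshot)   # revert the failed experiment
        with log.open("a") as f:          # append-only log survives restarts
            f.write(f"{datetime.datetime.now().isoformat()} "
                    f"score={score:.4f} kept={kept}\n")
    return best
```

Everything domain-specific lives in the two callables, which is why the same skeleton covers LLM training, GPU kernels, and bundle sizes.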

Key Innovation: program.md as Orchestration

Karpathy's insight: you're not writing code, you're writing the markdown that tells the agent how to write code. The human is the meta-researcher. program.md is essentially a lightweight "skill" — a natural language specification that defines agent behavior.

This is directly analogous to how coding agent orchestration platforms define agent tasks through specs rather than code.
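To make the idea concrete, a program.md for such a loop might read like the following. This is an illustrative sketch, not Karpathy's actual file; the objective, constraints, and file names are invented:

```markdown
# program.md — experiment loop spec (hypothetical example)

## Objective
Minimize validation loss of train.py after a fixed 5-minute training run.

## Constraints
- Only modify train.py; never touch the benchmark or this file.
- Every experiment gets the same time budget so results are comparable.
- Append every result to log.md; never rewrite history.

## Strategy
- Prefer small, reversible changes: one hyperparameter or module at a time.
- Revert any change that does not improve the metric.
- Re-read log.md on restart and avoid repeating failed experiments.
```

The spec encodes the same five architectural elements listed above, but in prose the agent can follow directly.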


Tier 2: Deep Research Agents

Web-based knowledge synthesis. These agents search, read, reason, and produce comprehensive research reports.

Market Map

| Tool | ⭐ Stars | Created | Key Differentiator |
| --- | --- | --- | --- |
| gpt-researcher | 25,701 | May 2023 | OG. Planner/execution pattern. 20+ sources per report |
| dzhng/deep-research | 18,562 | Feb 2025 | Simplest implementation (under 500 LoC). Depth/breadth controls |
| Tongyi DeepResearch | 18,422 | Jan 2025 | SOTA benchmarks. RL-trained Qwen3-30B-A3B |
| open_deep_research | 10,797 | Nov 2024 | LangGraph. Multi-provider, MCP, no-code UI |
| open-deep-research (Firecrawl) | 6,191 | Feb 2025 | Firecrawl-powered. Simple clone |
| DeepResearchAgent | 3,237 | May 2025 | Self-evolving agents with Autogenesis protocol |
| agents-deep-research | 739 | Mar 2025 | OpenAI Agents SDK |

Approaches Diverging

Two schools of thought are emerging:

Prompt-based: Use frontier models (GPT-5, Claude, Gemini) with good prompting and tool orchestration. Exemplified by gpt-researcher, dzhng/deep-research, LangChain's open_deep_research.

Fine-tuned: Train specialized models for research agent tasks using RL (GRPO, Reinforce++). Exemplified by Tongyi DeepResearch (Qwen3-30B-A3B) and SkyworkAI's Autogenesis.

The fine-tuned approach is gaining ground — Tongyi leads benchmarks across BrowseComp, GAIA, and HLE despite being a smaller model than frontier alternatives.


Tier 3: Scientific Discovery Agents

End-to-end: ideation → experiment → paper writing → review.

Market Map

| Tool | ⭐ Stars | Created | Key Differentiator |
| --- | --- | --- | --- |
| AI Scientist v1 | 12,330 | Aug 2024 | First fully automated science pipeline |
| AI Scientist v2 | 2,266 | Apr 2025 | Agentic tree search. First AI paper accepted through peer review |
| AutoRA | 80 | 2020 | Academic. Model discovery, experimental design, open science |

AI Scientist v2 is a milestone — the first paper written entirely by AI accepted through peer review at an ICLR workshop. The system autonomously generates hypotheses, designs experiments, runs them, and writes LaTeX papers without human templates.


Competitive Dynamics

What Makes This a Category

  1. The framing shift. "You're not editing Python, you're programming the program.md." This reframes agent orchestration as natural language specification.

  2. Immediate ecosystem. 7 days from Karpathy's release to 6+ variants. The pattern is extractable — it works for LLM training, GPU kernels, test suites, bundle sizes.

  3. Convergence with coding agents. Experiment loops are what coding agents already do (edit → test → iterate), but with a clear fitness function. Any CI pipeline + metric = autoresearch target.
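In code terms, an autoresearch target is just a callable that maps the current repo state to a number. Two hedged sketches follow; the function names are ours, and the measured command or directory is a placeholder for whatever a given CI pipeline runs:

```python
import pathlib
import subprocess
import time


def tree_bytes(root):
    """Lower-is-better metric: total bytes under root, e.g. a dist/ build output."""
    return float(sum(p.stat().st_size
                     for p in pathlib.Path(root).rglob("*") if p.is_file()))


def command_seconds(cmd):
    """Lower-is-better metric: wall-clock seconds to run cmd, e.g. a test suite."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)  # fail loudly if the command errors
    return time.perf_counter() - start
```

Since a keep/revert loop keeps higher scores, lower-is-better metrics like these would be negated before comparison.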

  4. The "research community" vision. Karpathy: "The goal is not to emulate a single PhD student, it's to emulate a research community of them." autoresearch-at-home is the first implementation of distributed agent coordination for research.

The Platform Gap

Currently fragmented:

  • Karpathy's is LLM training-specific
  • pi-autoresearch is pi editor-specific
  • AutoKernel is GPU kernel-specific
  • Deep research agents are web search-specific

Nobody has built the horizontal layer: point any coding agent at any repo + any metric, define a program.md, and get autoresearch-as-a-service. This is the obvious platform opportunity.


Technical Comparison

| Dimension | Experiment Loops | Deep Research | Scientific Discovery |
| --- | --- | --- | --- |
| Input | Code + metric | Query/topic | Research area |
| Agent action | Edit code, run benchmark | Search web, read sources | Design experiments, write papers |
| Output | Optimized code + log | Markdown/PDF report | LaTeX paper with results |
| Loop type | Keep/revert per experiment | Depth/breadth exploration | Tree search over hypotheses |
| Duration | Hours to days | Minutes to hours | Hours to days |
| Compute | GPU required | API calls only | GPU + API calls |
| Coordination | Emerging (at-home) | Not needed | Not implemented |

What to Watch

Near-term (Q2 2026)

  • Every coding agent adds native experiment loop support
  • Karpathy's autoresearch expands beyond nanochat
  • More vertical applications (compiler optimization, API latency, cost reduction)

Medium-term (2026-2027)

  • Distributed autoresearch becomes default (multi-agent swarms)
  • Deep research consolidates around specialized fine-tuned models
  • Platform layer emerges for "autoresearch-as-a-service"

Long-term (2027+)

  • Self-evolving agents (SkyworkAI's Autogenesis pattern) become mainstream
  • Research agents coordinate across organizations (the "research community" vision)
  • Experiment loops extend to non-code domains (business metrics, marketing, operations)

Bottom Line

Autoresearch is a paradigm shift hiding inside a simple loop. The core pattern — agent modifies code, measures result, keeps or reverts — is trivially simple. What makes it powerful is the orchestration layer: program.md as natural language agent specs, fixed-budget experiments for comparability, and append-only logs for continuity across sessions.

The experiment loop agents (Tier 1) are most relevant to coding agent platforms. The pattern generalizes to any measurable outcome, and the multi-agent coordination problem (autoresearch-at-home) is the next frontier.

The biggest opportunity: A platform that makes any repo + any metric into an autoresearch target, with multi-agent coordination, shared results, and optimized scheduling. The vertical tools exist. The horizontal platform doesn't — yet.


Research by Ry Walker Research • methodology