
Autoresearch Tools

Category analysis of 16 autoresearch tools: autonomous AI agents that run experiment loops, conduct deep web research, and pursue scientific discovery. Covers Karpathy's autoresearch, Kosmos, Autoscience, Google Co-Scientist, AutoResearchClaw, GPT Researcher, Tongyi DeepResearch, AI Scientist, and more.

Key takeaways

  • The hobbyist/experimental tier froze: Karpathy's autoresearch is dormant at 86K stars (no commits since March 26, 2026, no LICENSE file), and AutoAgent, AutoKernel, and autoresearch-at-home all stalled within days of launch
  • Commercial science arrived in its place — Kosmos ($70M seed, $200/run, embedded in Incyte's R&D), Autoscience ($14M led by General Catalyst), and Google Co-Scientist (GA via Gemini for Science at I/O May 2026, deployed to all 17 DOE national labs)
  • Nature published the first peer-reviewed methodology for a fully automated AI research system — Sakana's AI Scientist, March 26, 2026 — credentialing the category even as the open-source repos went quiet
  • Recursive Superintelligence's $650M raise signals "automated AI research" is now a venture category — the platform gap the hobbyist wave exposed is being filled by funded companies, not weekend repos

FAQ

What is autoresearch?

Autoresearch is a pattern in which AI agents autonomously run experiment loops: modify code, run a benchmark, measure the result, keep improvements or revert failures, and repeat. Andrej Karpathy coined the term in March 2026.

How many experiments can autoresearch run overnight?

Karpathy's original runs ~12 experiments/hour (~100 overnight) with 5-minute training budgets. AutoKernel runs ~40 experiments/hour with 90-second cycles. Results vary by domain and benchmark duration.
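The throughput figures above follow from the cycle length alone. A minimal sketch of the arithmetic, assuming an 8-hour night and no setup time between runs (both are assumptions, not figures from either project):

```python
def experiments_per_hour(cycle_seconds: float) -> float:
    """Upper bound on experiments per hour, ignoring setup time between runs."""
    return 3600 / cycle_seconds


def experiments_overnight(cycle_seconds: float, hours: float = 8.0) -> int:
    """Whole experiments that fit in one night; the 8-hour default is an assumption."""
    return int(experiments_per_hour(cycle_seconds) * hours)


print(experiments_per_hour(300))   # 5-minute training budget -> 12.0
print(experiments_per_hour(90))    # 90-second cycle -> 40.0
print(experiments_overnight(300))  # 8-hour night -> 96, close to the ~100 quoted
```

Real runs will land below these bounds once agent thinking time and benchmark startup are counted.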

Does autoresearch only work for ML training?

No. pi-autoresearch generalizes the pattern to any measurable metric — test speed, bundle size, build times, Lighthouse scores. AutoKernel applies it to GPU kernel optimization. The pattern is domain-agnostic.

What is the difference between autoresearch and deep research agents?

Autoresearch runs code experiment loops (edit → benchmark → keep/revert). Deep research agents search the web, read sources, and synthesize knowledge reports. Both are autonomous but solve different problems.

Is Karpathy's autoresearch still maintained?

No. The repo has had no commits since March 26, 2026, with 185 open issues and no LICENSE file despite the README's MIT reference. The pattern lives on in forks and ports — pi-autoresearch is the most active — but the original repo is dormant.

What are the commercial AI scientist platforms?

Kosmos (Edison Scientific) runs 12-hour autonomous discovery loops at $200 per run; Autoscience deploys its Mira agent into customers' production ML models; Google Co-Scientist generates and ranks research hypotheses via Gemini for Science. All three are closed, funded, and enterprise-focused.

Executive Summary

Three months after Karpathy's autoresearch lit up GitHub, the category has split in two. The hobbyist/experimental tier froze: Karpathy's repo is dormant at 86,192 stars (no commits since March 26, 2026, no LICENSE file), and the launch-week derivatives — AutoAgent, AutoKernel, autoresearch-at-home — all stalled within days of shipping.

Meanwhile, commercial science arrived. Kosmos (Edison Scientific) raised a $70M seed, charges $200 per 12-hour discovery run, and is embedded across Incyte's R&D. Autoscience raised $14M led by General Catalyst for its Carl and Mira agents. Google took Co-Scientist GA via Gemini for Science at I/O May 2026 and deployed it to all 17 DOE national labs. And on the same day Karpathy's repo went quiet, Nature published the first peer-reviewed methodology for a fully automated AI research system — Sakana's AI Scientist.

Key Findings:

  • The experiment-loop tier is a graveyard with one survivor — only pi-autoresearch still ships releases; everything else froze within weeks of launch
  • The money moved to scientific discovery — Kosmos, Autoscience, and Google Co-Scientist are funded, closed, enterprise products, not repos
  • Credentialing replaced code — Nature publication (AI Scientist), arXiv papers (AutoKernel, AutoResearchClaw), and peer-review milestones now define the leaderboard
  • Recursive Superintelligence's $650M raise confirms "automated AI research" as a venture category — the platform gap is being filled by companies, not open source

Strategic Planning Assumptions:

  • By 2027, the dominant autoresearch products will be closed commercial platforms (Kosmos-style per-run pricing or Mira-style embedded agents), with open source relegated to reference implementations
  • The experiment-loop pattern will survive as a feature inside coding agents (pi-autoresearch's ports to Claude Code and Cursor are the template), not as standalone tools
  • Verification — citation integrity, result registries, accuracy audits — becomes the primary differentiator as venues crack down on AI-generated papers

Market Definition

Autoresearch tools are AI systems that autonomously conduct research — whether through code experimentation, web knowledge synthesis, or scientific discovery — with minimal human intervention.

Inclusion Criteria:

  • Autonomous operation (agent decides what to try next)
  • Measurable output (metrics, reports, papers, or deployed improvements)
  • Open source, or commercially available with publicly documented capabilities

Exclusion Criteria:

  • Manual AI-assisted tools (copilots that suggest but don't act)
  • Pure benchmarking frameworks without agent loops

Note: the original April 2026 edition excluded proprietary products and required active development. Both criteria are gone — the most consequential entrants are now closed commercial platforms, and half the open-source field is dormant. Dormancy is now a status flag, not a disqualifier.


Status Check: The Dormant Tier (June 2026)

The defining fact of this refresh — most of the open-source wave stopped moving:

| Tool | Stars | Last activity | Status |
| --- | --- | --- | --- |
| karpathy/autoresearch | 86,192 | Mar 26, 2026 | Dormant. 185 open issues, no LICENSE file despite README's MIT reference |
| AutoAgent | ~4,500 | Apr 3, 2026 | Frozen since launch week. Zero commits, zero releases since one day after creation |
| AutoKernel | 1,404 | Mar 19, 2026 | Stalled six days after launch; companion arXiv paper is the lasting artifact |
| autoresearch-at-home | 487 | Mar 13, 2026 | Dormant three days after creation; one documented coordinated run |
| dzhng/deep-research | 19,099 | mid-2025 | Dormant. Only commit since Sep 2025 is an April 2026 README attribution change (Aomni → Duet) |
| AI Scientist | 13.9k (v1) / 6.5k (v2) | Dec 2025 | Repo quiet since the custom-license change; 2026 milestone was the Nature paper, not code |
| Tongyi DeepResearch | 19,360 | Feb 27, 2026 | Cooling. Sept 2025 weights remain the only release |

The paradox: stars keep climbing across the board while commits stop. Usage of the patterns is accelerating — PostHog and Shopify ran the Karpathy loop against production codebases — but the repos themselves are unmaintained.


Tier 1: Experiment Loop Agents

The "overnight optimization" pattern: the agent modifies a single file, runs a fixed-time benchmark, keeps improvements, reverts failures, and repeats autonomously.

Market Map

| Tool | ⭐ Stars | Created | Domain | Status (June 2026) |
| --- | --- | --- | --- | --- |
| karpathy/autoresearch | 86,192 | Mar 6, 2026 | LLM training | Dormant since Mar 26; no license |
| davebcn87/pi-autoresearch | ~7,000 | Mar 11, 2026 | Any metric | Active: v1.6.0 (Jun 8), Earendil stewardship |
| kevinrgu/autoagent | ~4,500 | Apr 2026 | Agent harnesses | Frozen since Apr 3 (launch week) |
| Hyperspace AGI | 1,923 | Mar 8, 2026 | Multi-domain | Agent branches alive; pivoting to crypto (A1 blockchain, airdrop trackers) |
| RightNow-AI/autokernel | 1,404 | Mar 11, 2026 | GPU kernels | Stalled Mar 19; arXiv paper published |
| autoresearch-at-home | 487 | Mar 10, 2026 | Distributed | Dormant since Mar 13 |

The launch-week ports (MLX, Windows/RTX, autoresearch-mlx-mkw) remain small curiosities.

pi-autoresearch is the only survivor — five releases since March (v1.2.0–v1.6.0), confidence scoring to separate real gains from benchmark jitter, and an npm-scope migration to @earendil-works tracking Pi's move under Earendil stewardship. Its architecture became the template for community ports to Claude Code and Cursor.
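To illustrate why confidence scoring matters, here is a minimal sketch of one way to separate a real gain from benchmark jitter: compare the mean improvement against the run-to-run spread of repeated measurements. This is not pi-autoresearch's actual method, which the source does not describe; the function names, the sample timings, and the threshold of 2 noise units are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def gain_in_noise_units(baseline: list[float], candidate: list[float]) -> float:
    """Mean improvement divided by run-to-run spread (lower metric is better)."""
    gain = mean(baseline) - mean(candidate)
    noise = max(stdev(baseline), stdev(candidate))
    if noise == 0:
        return float("inf") if gain > 0 else 0.0
    return gain / noise


def is_real_gain(baseline: list[float], candidate: list[float], threshold: float = 2.0) -> bool:
    """Treat a change as real only if it clears `threshold` noise units (an arbitrary cutoff)."""
    return gain_in_noise_units(baseline, candidate) >= threshold


baseline = [10.0, 10.2, 9.8]        # hypothetical repeated timings, in seconds
within_jitter = [9.9, 10.1, 9.7]    # 0.1 s faster on average, inside the spread
clear_win = [8.0, 8.2, 7.8]         # 2 s faster on average

print(is_real_gain(baseline, within_jitter))  # False
print(is_real_gain(baseline, clear_win))      # True
```

Without a check of this kind, a keep/revert loop will happily commit changes that only looked faster on one lucky run.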

The Core Architecture

All experiment loop tools share this pattern:

  1. program.md — Natural language instructions defining what to optimize, constraints, and strategy
  2. Single file constraint — Agent only modifies one file (e.g., train.py), keeping scope manageable
  3. Fixed time budget — Each experiment runs for the same duration, making results comparable
  4. Append-only log — Results survive restarts and context resets
  5. Keep/revert decision — Binary outcome per experiment, committed to git
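The five elements above can be sketched as a single loop. This is a simplified illustration, not any tool's actual implementation: an in-memory `config` dict stands in for the single editable file, keeping or discarding that dict stands in for git commit and revert, a pure function stands in for the fixed-time benchmark, and `propose` stands in for the agent following program.md. All names and the toy objective are hypothetical.

```python
import json
import random
import tempfile
from pathlib import Path
from typing import Callable

Config = dict[str, float]


def run_loop(
    config: Config,
    propose: Callable[[Config, random.Random], Config],
    benchmark: Callable[[Config], float],
    log_path: Path,
    n_experiments: int,
    seed: int = 0,
) -> tuple[Config, float]:
    """Run keep/revert experiments; a lower benchmark score counts as better."""
    rng = random.Random(seed)
    best_score = benchmark(config)
    for i in range(n_experiments):
        trial = propose(config, rng)      # the agent's edit to the single file
        score = benchmark(trial)          # stands in for the fixed-time run
        kept = score < best_score         # binary keep/revert decision
        if kept:
            config, best_score = trial, score   # keep (a git commit in the real tools)
        # Reverting is implicit: `config` is simply left unchanged.
        with log_path.open("a") as log:   # append-only, so it survives restarts
            log.write(json.dumps({"experiment": i, "score": score, "kept": kept}) + "\n")
    return config, best_score


def demo() -> tuple[float, float, int]:
    """Toy run: tune one made-up parameter toward a made-up optimum of 0.3."""
    def benchmark(cfg: Config) -> float:
        return (cfg["lr"] - 0.3) ** 2

    def propose(cfg: Config, rng: random.Random) -> Config:
        return {"lr": cfg["lr"] + rng.uniform(-0.1, 0.1)}

    start = {"lr": 1.0}
    with tempfile.TemporaryDirectory() as tmp:
        log_path = Path(tmp) / "experiments.jsonl"
        _, best = run_loop(start, propose, benchmark, log_path, n_experiments=50)
        n_logged = len(log_path.read_text().splitlines())
    return benchmark(start), best, n_logged
```

Calling `demo()` returns the starting score, the best score found, and the number of logged experiments. Because only strict improvements are kept, the best score can never be worse than the starting one, and every experiment is logged whether it was kept or reverted.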

Karpathy's insight — you're not writing code, you're writing the markdown that tells the agent how to write code — survives the repo's dormancy. The pattern's real-world record is now mixed and instructive: it found a three-year-old query-engine bug at PostHog, and produced an overfit 53% "speedup" at Shopify that reviewers rejected.

AutoAgent applied the loop one level up, optimizing the agent harness itself (prompts, tools, routing) via Harbor benchmarks, before freezing at the proof-of-concept stage.


Tier 2: Deep Research Agents

Web-based knowledge synthesis. These agents search, read, reason, and produce comprehensive research reports. The tier has matured into a two-speed market: one actively shipping incumbent, four slowing or static alternatives.

Market Map

| Tool | ⭐ Stars | Created | Status (June 2026) |
| --- | --- | --- | --- |
| gpt-researcher | 27,643 | May 2023 | Active: v3.5.0 (May 28, 2026), steady release cadence since 2023 |
| Tongyi DeepResearch | 19,360 | Jan 2025 | Cooling: repo quiet since Feb 2026; Sept 2025 weights still SOTA-class, now on OpenRouter with a free tier |
| dzhng/deep-research | 19,099 | Feb 2025 | Dormant since mid-2025; attribution moved to Duet. Still the best ~500-LoC teaching artifact |
| open_deep_research | 11,671 | Nov 2024 | Maintenance mode: dependency bumps only since the July 2025 supervisor rewrite |
| DeepResearchAgent | 3,449 | May 2025 | Slowing: last push May 4, 2026; v2.0.0 "self evolving" remains the only major release |

Approaches Diverging

The two schools of thought from April still hold, with a status update:

Prompt-based: Use frontier models with good prompting and tool orchestration. GPT Researcher is the durable winner here — three years of releases, ~$0.10 per research run, provider-agnostic. dzhng/deep-research and LangChain's open_deep_research persist as reference architectures rather than evolving products.

Fine-tuned: Train specialized models via RL. Tongyi DeepResearch proved a 30B MoE (3.3B active params) can lead BrowseComp, GAIA, and HLE — and it runs in production inside Alibaba (Amap, Tongyi FaRui) — but nine months without a new checkpoint is eroding the bet. The promised "next generation of agentic models" hasn't shipped.

SkyworkAI's Autogenesis protocol (self-evolving agents that version their own tools and prompts) remains the most architecturally ambitious design in the tier, with near-zero independent validation.


Tier 3: Scientific Discovery Agents

End-to-end: ideation → experiment → paper or report. This is where the category's center of gravity moved — and where the money went.

Market Map

| Tool | Model | Created | Key Differentiator |
| --- | --- | --- | --- |
| Kosmos (Edison Scientific) | Closed SaaS, $200/run | 2025 | 12-hour runs: ~200 rollouts, ~42K lines of code, ~1,500 papers read; every statement cited; $70M seed; embedded in Incyte's R&D |
| Google Co-Scientist | Closed (Gemini) | Feb 2025 | Multi-agent Gemini research partner; Gemini for Science GA at I/O May 2026; DOE national labs; enterprise (Daiichi Sankyo, Bayer Crop Science) |
| Autoscience (Carl + Mira) | Closed, early access | Mar 2025 | First AI papers through double-blind review (ICLR 2025 workshops; withdrawn amid controversy); Mira deploys research into production ML models; $14M seed led by General Catalyst |
| AutoResearchClaw | Open source (MIT) | Mar 15, 2026 | UNC AIMING Lab's 23-stage pipeline: idea → cited LaTeX paper; 13.4K stars in under 3 months; 4-layer citation verification; walked back its own "no human intervention" claim |
| AI Scientist (Sakana) | Open source (custom license) | Aug 2024 | First AI paper through peer review; methodology published in Nature (Mar 26, 2026), but no v3 and quiet repos since Dec 2025 |

The Commercial Arrival

The April edition excluded proprietary products. That's no longer tenable — the three most consequential entrants are closed:

Kosmos is the most credible commercial AI scientist: traceable citations on every statement, three of seven reported discoveries reproducing unpublished findings, and a real pharma deployment. The catch is accuracy — independent scientists rated 79.4% of statements accurate, with synthesis claims at just 58%.

Autoscience has the strongest external validation (double-blind peer review, a Kaggle featured-competition silver medal) and the most tainted milestone — the ICLR papers were submitted without organizers' knowledge and withdrawn after academics accused the company of co-opting peer review for publicity. Its commercial product Mira reads 1,200+ papers a week and implements improvements directly into customers' production models.

Google Co-Scientist is the scale play: a supervisor coordinating Generation, Reflection, Ranking, Evolution, Proximity, and Meta-review agents, with Nature-published validations and distribution through Gemini for Science to individual researchers, enterprise pharma, and all 17 DOE national labs.

AutoResearchClaw is the open-source counterweight — MIT-licensed where AI Scientist relicensed restrictively, actively shipping where everything else froze, and refreshingly honest: its own paper concludes targeted human collaboration beats full autonomy. Caveats: its 54.7% benchmark win over AI Scientist v2 is self-graded on the lab's own ARC-Bench, and its 13.4K stars coexist with near-zero independent community discussion.

The Verification Arms Race

The tier's emerging differentiator isn't generation — it's verification. Kosmos cites every statement to code or literature; AutoResearchClaw ships 4-layer citation checks and anti-fabrication registries; arXiv now bans authors who submit unverified AI-generated content. The systems winning trust are the ones engineering their errors to be findable.
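A minimal sketch of what "findable errors" can mean in practice: flag every statement in a generated report that carries no traceable source. This is an illustration only, not how Kosmos or AutoResearchClaw implement their checks; the `Statement` type, the function names, and the sample report contents are all made up.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    text: str
    sources: list[str] = field(default_factory=list)  # code paths or paper IDs


def uncited(statements: list[Statement]) -> list[Statement]:
    """Return statements with no source, i.e. claims a reviewer cannot trace."""
    return [s for s in statements if not s.sources]


def citation_coverage(statements: list[Statement]) -> float:
    """Fraction of statements carrying at least one source."""
    if not statements:
        return 1.0
    return 1 - len(uncited(statements)) / len(statements)


# Hypothetical report; the claims and source identifiers are placeholders.
report = [
    Statement("Gene A expression rises under treatment.", ["analysis/run_17.py"]),
    Statement("This matches prior work.", ["doi:placeholder"]),
    Statement("The effect generalizes across tissues."),  # no source attached
]

print([s.text for s in uncited(report)])    # ['The effect generalizes across tissues.']
print(round(citation_coverage(report), 2))  # 0.67
```

Coverage only shows that a source is attached, not that the source supports the claim; that second step is where accuracy audits come in.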


Competitive Dynamics

What Changed Since April

  1. The ecosystem flywheel stopped. April's story was "7 days from Karpathy's release to 6+ variants." June's story is that nearly all of those variants — and the original — are frozen. The pattern won; the repos lost.

  2. Capital replaced stars as the scoreboard. Kosmos's $70M, Autoscience's $14M, and Recursive Superintelligence's $650M say more about where autoresearch is going than any GitHub metric.

  3. Credentialing became the moat. Nature publication (Sakana), peer-review acceptances (Autoscience's Carl), Nature-validated discoveries (Google Co-Scientist) — the scientific establishment's gatekeepers are now the benchmark that matters.

  4. The platform gap is being filled top-down. April's "nobody has built autoresearch-as-a-service" observation is being answered — but by closed commercial platforms (per-run Kosmos, embedded Mira), not the horizontal open-source layer the ecosystem expected.

The Platform Gap, Revisited

The vertical fragmentation remains on the open-source side: Karpathy's repo is LLM-training-specific (and dormant), pi-autoresearch is Pi-editor-specific (though its ports spread the architecture), and AutoKernel is kernel-specific (and stalled). The horizontal layer is emerging as a commercial product category instead, which is exactly what the funding signals predicted.


Technical Comparison

| Dimension | Experiment Loops | Deep Research | Scientific Discovery |
| --- | --- | --- | --- |
| Input | Code + metric | Query/topic | Research goal + datasets |
| Agent action | Edit code, run benchmark | Search web, read sources | Design experiments, analyze data, write papers |
| Output | Optimized code + log | Markdown/PDF report | Cited report or LaTeX paper |
| Loop type | Keep/revert per experiment | Depth/breadth exploration | Tree search / multi-cycle discovery runs |
| Duration | Hours to days | Minutes to hours | Hours (Kosmos: 12-hour runs) |
| Business model | Free OSS (mostly dormant) | Free OSS + BYO API keys | Per-run ($200), seats, enterprise deals |
| Verification | Benchmark + git history | Source citations | Citation-to-code/literature, accuracy audits |

What to Watch

Near-term (H2 2026)

  • Whether any dormant Tier 1 repo revives — a Karpathy follow-up (the "research community of PhD students" vision) would restart the wave
  • Named Mira customers with verifiable results, and whether Kosmos's Incyte collaboration reports productivity gains net of the verification tax
  • Independent replication of AutoResearchClaw's ARC-Bench results, and a first peer-review acceptance for one of its generated papers

Medium-term (2026-2027)

  • Tongyi's promised next-generation agentic models — if nothing ships in 2026, prompt-based agents on newer frontier models erode its benchmark lead
  • Whether venue defenses (arXiv's ban policy, reviewer-consent norms) make autonomous paper generation publishable at all — or push the category fully toward Kosmos-style data discovery
  • Sakana's v3 leveraging the Nature paper's scaling-law claim: paper quality tracks foundation-model capability

Long-term (2027+)

  • Whether "automated AI research" as a venture category (Recursive Superintelligence's $650M bet) produces a genuine research breakthrough attributable to an agent
  • Consolidation: the deep research tier's likely endgame is absorption into frontier-lab products, with GPT Researcher as the durable open-source baseline

Bottom Line

The April thesis — autoresearch as a paradigm shift hiding inside a simple loop — survived. The repos didn't. Karpathy's 86K-star category definition is dormant and unlicensed; the launch-week derivatives froze; only pi-autoresearch still ships. The pattern's value was proven in production (PostHog's three-year-old bug) and its failure mode documented (Shopify's overfit benchmark) — and then the energy moved up-stack.

The category's second act is commercial science: Kosmos selling $200 discovery runs into pharma R&D, Autoscience deploying research agents into production ML models, Google distributing Co-Scientist through Gemini for Science to national labs. Nature publishing Sakana's methodology gave the field its scientific credential at exactly the moment the open-source wave receded.

The biggest opportunity has changed shape: in April it was a horizontal open-source platform — any repo + any metric. In June it's trust infrastructure: verification, citation integrity, and accuracy auditing for autonomous research systems whose best-in-class still gets one in five conclusions wrong. The tools that make AI research checkable will outlast the tools that merely make it fast.


Research by Ry Walker Research • methodology