Key takeaways
- Karpathy's autoresearch hit 30k stars in 7 days, spawning an entire ecosystem of experiment loop agents for LLM training, GPU kernels, and generic metrics
- The core innovation is program.md as agent orchestration — natural language specs that define what the agent optimizes, replacing traditional code
- The pattern generalizes beyond ML: pi-autoresearch proves any measurable metric (test speed, bundle size, Lighthouse scores) can be an autoresearch target
- Nobody has built the general-purpose autoresearch-as-a-service platform yet — the gap between vertical tools and a horizontal orchestration layer is wide open
FAQ
What is autoresearch?
Autoresearch is a pattern where AI agents autonomously run experiment loops — modify code, run a benchmark, measure the result, keep improvements or revert failures, and repeat. Coined by Andrej Karpathy in March 2026.
How many experiments can autoresearch run overnight?
Karpathy's original runs ~12 experiments/hour (~100 overnight) with 5-minute training budgets. AutoKernel runs ~40 experiments/hour with 90-second cycles. Results vary by domain and benchmark duration.
Does autoresearch only work for ML training?
No. pi-autoresearch generalizes the pattern to any measurable metric — test speed, bundle size, build times, Lighthouse scores. AutoKernel applies it to GPU kernel optimization. The pattern is domain-agnostic.
What is the difference between autoresearch and deep research agents?
Autoresearch runs code experiment loops (edit → benchmark → keep/revert). Deep research agents search the web, read sources, and synthesize knowledge reports. Both are autonomous but solve different problems.
Executive Summary
Autoresearch is the hottest new category in AI tooling. In one week, Karpathy's autoresearch repo hit 30k stars and spawned an ecosystem of tools that let AI agents autonomously run experiments — modifying code, measuring results, and iterating without human intervention.
The category spans three tiers: experiment loop agents (edit → benchmark → keep/revert), deep research agents (search → read → synthesize reports), and scientific discovery agents (ideate → experiment → write papers). The connecting thread: the human defines the objective, the agent runs the loop.
Key Findings:
- Karpathy's framing shift is the real innovation — program.md as "research org code" turns natural language into agent orchestration
- The pattern generalizes immediately — within 7 days: MLX port, Windows port, GPU kernel variant, distributed variant, generic-metric variant
- Deep research is already commoditized — 10+ open-source agents, differentiation is shifting to coordination and specialized models
- The platform gap is wide open — nobody has built general-purpose "autoresearch-as-a-service" yet
Strategic Planning Assumptions:
- By Q3 2026, every major coding agent (Cursor, Claude Code, Codex) will have native experiment loop support
- By 2027, distributed autoresearch (multi-agent coordination) will be the default, not single-agent loops
- The deep research agent space will consolidate around 2-3 winners with specialized fine-tuned models
Market Definition
Autoresearch tools are AI systems that autonomously conduct research — whether through code experimentation, web knowledge synthesis, or scientific discovery — with minimal human intervention.
Inclusion Criteria:
- Autonomous operation (agent decides what to try next)
- Measurable output (metrics, reports, or papers)
- Open source or publicly documented
- Active development (commits in last 6 months)
Exclusion Criteria:
- Manual AI-assisted tools (copilots that suggest but don't act)
- Pure benchmarking frameworks without agent loops
- Proprietary-only products (OpenAI Deep Research, Gemini Deep Research)
Tier 1: Experiment Loop Agents
The "overnight optimization" pattern. Agent modifies a single file, runs a fixed-time benchmark, keeps improvements, reverts failures, repeats autonomously.
Market Map
| Tool | ⭐ Stars | Created | Domain | Key Differentiator |
|---|---|---|---|---|
| karpathy/autoresearch | 30,307 | Mar 6, 2026 | LLM training | The original. program.md as org chart |
| davebcn87/pi-autoresearch | 817 | Mar 11, 2026 | Any metric | Generalizes to test speed, bundle size, Lighthouse |
| RightNow-AI/autokernel | 556 | Mar 11, 2026 | GPU kernels | Profiles PyTorch models, optimizes Triton/CUDA |
| autoresearch-at-home | 188 | Mar 10, 2026 | Distributed | SETI@home-style multi-agent coordination |
| Hyperspace AGI | 696 | Mar 8, 2026 | Multi-domain | Distributed P2P autoresearch on 2M-node network |
| autoresearch-mlx | 633 | Mar 8, 2026 | LLM training (Mac) | Apple Silicon MLX port |
| autoresearch-mlx-mkw | 1 | Mar 12, 2026 | LLM training (Mac) | Deep MLX rewrite: Gated DeltaNet hybrid attention, runs on 16GB M4 Air |
| autoresearch-win-rtx | 158 | Mar 8, 2026 | LLM training (Win) | Windows/RTX port |
The Core Architecture
All experiment loop tools share this pattern:
- program.md — Natural language instructions defining what to optimize, constraints, and strategy
- Single file constraint — Agent only modifies one file (e.g., train.py), keeping scope manageable
- Fixed time budget — Each experiment runs for the same duration, making results comparable
- Append-only log — Results survive restarts and context resets
- Keep/revert decision — Binary outcome per experiment, committed to git
The agent reads program.md, modifies the target file, runs the benchmark, and decides to keep or revert. Then it repeats — forever, or until interrupted.
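A minimal sketch of that loop in Python follows. To be clear about assumptions: the agent object and its propose_edit method are hypothetical stand-ins for a real coding agent, and the benchmark convention (train.py self-limits to the time budget and prints its final validation loss as the last stdout line) is invented for illustration. This is not Karpathy's implementation.

```python
import subprocess
from pathlib import Path

TARGET = Path("train.py")      # the single file the agent is allowed to modify
LOG = Path("experiments.log")  # append-only log: survives restarts and context resets

def benchmark() -> float:
    """Run one fixed-budget experiment. Assumes train.py self-limits to the
    time budget and prints its final validation loss as the last stdout line."""
    result = subprocess.run(["python", str(TARGET)], capture_output=True, text=True)
    return float(result.stdout.strip().splitlines()[-1])

def experiment_loop(agent, best: float) -> None:
    """Edit -> benchmark -> keep/revert, forever. `agent` and its
    propose_edit() method are hypothetical, not a real library API."""
    while True:
        idea = agent.propose_edit(TARGET.read_text())
        TARGET.write_text(idea.patched_source)
        score = benchmark()
        if score < best:  # lower loss wins: keep the edit and commit it
            best = score
            subprocess.run(["git", "commit", "-am", f"keep: {idea.summary} ({score:.4f})"])
            verdict = "keep"
        else:             # regression: revert the working tree to the last commit
            subprocess.run(["git", "checkout", "--", str(TARGET)])
            verdict = "revert"
        with LOG.open("a") as log:
            log.write(f"{idea.summary}\t{score:.4f}\t{verdict}\n")
```

The per-experiment git commit or revert gives the binary outcome described above, and the append-only log file is what lets runs survive restarts and context resets.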
Key Innovation: program.md as Orchestration
Karpathy's insight: you're not writing code, you're writing the markdown that tells the agent how to write code. The human is the meta-researcher. program.md is essentially a lightweight "skill" — a natural language specification that defines agent behavior.
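A concrete example makes this tangible. The following is a hypothetical program.md written in the spirit of the pattern, not the actual file from karpathy/autoresearch:

```markdown
# program.md (illustrative sketch, not Karpathy's actual file)

## Objective
Minimize validation loss of train.py after a 5-minute training run on one GPU.

## Constraints
- You may only edit train.py. Do not touch data loading or eval code.
- Each experiment gets exactly 5 minutes of wall-clock training.
- Keep the change if validation loss improves; otherwise revert via git.

## Strategy hints
- Test one idea per experiment (LR schedule, init, architecture tweak).
- Log every result to experiments.log with a one-line summary.
```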
This is directly analogous to how coding agent orchestration platforms define agent tasks through specs rather than code.
Tier 2: Deep Research Agents
Web-based knowledge synthesis. These agents search, read, reason, and produce comprehensive research reports.
Market Map
| Tool | ⭐ Stars | Created | Key Differentiator |
|---|---|---|---|
| gpt-researcher | 25,701 | May 2023 | OG. Planner/execution pattern. 20+ sources per report |
| dzhng/deep-research | 18,562 | Feb 2025 | Simplest implementation (under 500 LoC). Depth/breadth controls |
| Tongyi DeepResearch | 18,422 | Jan 2025 | SOTA benchmarks. RL-trained Qwen3-30B-A3B |
| open_deep_research | 10,797 | Nov 2024 | LangGraph. Multi-provider, MCP, no-code UI |
| open-deep-research (Firecrawl) | 6,191 | Feb 2025 | Firecrawl-powered. Simple clone |
| DeepResearchAgent | 3,237 | May 2025 | Self-evolving agents with Autogenesis protocol |
| agents-deep-research | 739 | Mar 2025 | OpenAI Agents SDK |
Diverging Approaches
Two schools of thought are emerging:
Prompt-based: Use frontier models (GPT-5, Claude, Gemini) with good prompting and tool orchestration. Exemplified by gpt-researcher, dzhng/deep-research, LangChain's open_deep_research.
Fine-tuned: Train specialized models for research agent tasks using RL (GRPO, Reinforce++). Exemplified by Tongyi DeepResearch (Qwen3-30B-A3B) and SkyworkAI's Autogenesis.
The fine-tuned approach is gaining ground — Tongyi leads benchmarks across BrowseComp, GAIA, and HLE despite being a smaller model than frontier alternatives.
Tier 3: Scientific Discovery Agents
End-to-end: ideation → experiment → paper writing → review.
Market Map
| Tool | ⭐ Stars | Created | Key Differentiator |
|---|---|---|---|
| AI Scientist v1 | 12,330 | Aug 2024 | First fully automated science pipeline |
| AI Scientist v2 | 2,266 | Apr 2025 | Agentic tree search. First AI paper accepted through peer review |
| AutoRA | 80 | 2020 | Academic. Model discovery, experimental design, open science |
AI Scientist v2 is a milestone — the first paper written entirely by AI accepted through peer review at an ICLR workshop. The system autonomously generates hypotheses, designs experiments, runs them, and writes LaTeX papers without human templates.
Competitive Dynamics
What Makes This a Category
- The framing shift. "You're not editing Python, you're programming the program.md." This reframes agent orchestration as natural language specification.
- Immediate ecosystem. 7 days from Karpathy's release to 6+ variants. The pattern is extractable — it works for LLM training, GPU kernels, test suites, bundle sizes.
- Convergence with coding agents. Experiment loops are what coding agents already do (edit → test → iterate), but with a clear fitness function. Any CI pipeline + metric = autoresearch target (see the sketch after this list).
- The "research community" vision. Karpathy: "The goal is not to emulate a single PhD student, it's to emulate a research community of them." autoresearch-at-home is the first implementation of distributed agent coordination for research.
The Platform Gap
Currently fragmented:
- Karpathy's is LLM training-specific
- pi-autoresearch is pi editor-specific
- AutoKernel is GPU kernel-specific
- Deep research agents are web search-specific
Nobody has built the horizontal layer: point any coding agent at any repo + any metric, define a program.md, and get autoresearch-as-a-service. This is the obvious platform opportunity.
Technical Comparison
| Dimension | Experiment Loops | Deep Research | Scientific Discovery |
|---|---|---|---|
| Input | Code + metric | Query/topic | Research area |
| Agent action | Edit code, run benchmark | Search web, read sources | Design experiments, write papers |
| Output | Optimized code + log | Markdown/PDF report | LaTeX paper with results |
| Loop type | Keep/revert per experiment | Depth/breadth exploration | Tree search over hypotheses |
| Duration | Hours to days | Minutes to hours | Hours to days |
| Compute | GPU required | API calls only | GPU + API calls |
| Coordination | Emerging (at-home) | Not needed | Not implemented |
What to Watch
Near-term (Q2 2026)
- Every coding agent adds native experiment loop support
- Karpathy's autoresearch expands beyond nanochat
- More vertical applications (compiler optimization, API latency, cost reduction)
Medium-term (2026-2027)
- Distributed autoresearch becomes default (multi-agent swarms)
- Deep research consolidates around specialized fine-tuned models
- Platform layer emerges for "autoresearch-as-a-service"
Long-term (2027+)
- Self-evolving agents (SkyworkAI's Autogenesis pattern) become mainstream
- Research agents coordinate across organizations (the "research community" vision)
- Experiment loops extend to non-code domains (business metrics, marketing, operations)
Bottom Line
Autoresearch is a paradigm shift hiding inside a simple loop. The core pattern — agent modifies code, measures result, keeps or reverts — is trivially simple. What makes it powerful is the orchestration layer: program.md as natural language agent specs, fixed-budget experiments for comparability, and append-only logs for continuity across sessions.
The experiment loop agents (Tier 1) are most relevant to coding agent platforms. The pattern generalizes to any measurable outcome, and the multi-agent coordination problem (autoresearch-at-home) is the next frontier.
The biggest opportunity: A platform that makes any repo + any metric into an autoresearch target, with multi-agent coordination, shared results, and optimized scheduling. The vertical tools exist. The horizontal platform doesn't — yet.
Research by Ry Walker Research
Sources
- [1] karpathy/autoresearch
- [2] davebcn87/pi-autoresearch
- [3] RightNow-AI/autokernel
- [4] mutable-state-inc/autoresearch-at-home
- [5] assafelovic/gpt-researcher
- [6] dzhng/deep-research
- [7] Alibaba-NLP/DeepResearch (Tongyi)
- [8] langchain-ai/open_deep_research
- [9] SakanaAI/AI-Scientist-v2
- [10] SkyworkAI/DeepResearchAgent
- [11] VentureBeat Coverage
- [12] Awesome Deep Research (comprehensive list)
- [13] matt-k-wong/autoresearch_mlx_mkw
- [14] hyperspaceai/agi