AutoResearchClaw | Ry Walker Research

Key takeaways

13,360 GitHub stars and 1,570 forks within three months of its March 15, 2026 launch — one of the fastest-growing autoresearch repos — backed by an arXiv paper claiming it outperforms AI Scientist v2 by 54.7% on the lab's own ARC-Bench benchmark
The launch pitch was "no human intervention required"; three weeks later v0.4.0 declared the system "no longer purely autonomous," adding a human-in-the-loop co-pilot with six intervention modes — and the paper concludes targeted collaboration beats full autonomy
Differentiators are verification-heavy: 4-layer citation integrity checks against arXiv/CrossRef/DataCite, anti-fabrication result registries, multi-agent peer review, and cross-run "self-evolution" via MetaClaw
Free and MIT-licensed; you bring your own LLM API keys, and per-paper API cost is not publicly disclosed

FAQ

What is AutoResearchClaw?

An open-source 23-stage autonomous research pipeline from AIMING Lab at UNC-Chapel Hill that turns a natural-language research idea into a conference-ready LaTeX paper, with literature retrieval, sandboxed experiments, statistical analysis, multi-agent peer review, and citation verification.

How much does AutoResearchClaw cost?

The software is free under the MIT license. You supply your own LLM API keys; the project ships cost budget guardrails but does not publicly disclose a typical per-paper API cost.

Is AutoResearchClaw really fully autonomous?

It can run end-to-end with `--auto-approve`, but the project itself walked back the launch claim: v0.4.0 added a human-in-the-loop co-pilot system, and the team's paper reports that targeted human collaboration at key decision points outperforms full autonomy.

How is AutoResearchClaw different from AI Scientist?

Both generate complete papers autonomously, but AutoResearchClaw is MIT-licensed (AI Scientist moved to a restrictive custom license), runs on commodity hardware via pluggable CLI agent backends, and emphasizes citation verification and human-in-the-loop modes — though unlike AI Scientist, none of its papers has passed peer review.

Executive Summary

AutoResearchClaw is an open-source autonomous research system from AIMING Lab at UNC-Chapel Hill: drop a research topic into a chat or CLI and a 23-stage pipeline retrieves real literature from OpenAlex, Semantic Scholar, and arXiv, designs and runs sandboxed experiments, performs statistical analysis, runs multi-agent peer review, and emits a conference-ready LaTeX paper targeting NeurIPS/ICML/ICLR templates.^[1]^[2] Traction has been steep for the category: 13,360 GitHub stars and 1,570 forks as of June 2026, less than three months after the repo's creation on March 15, 2026, with the last push on June 3, 2026 and just 10 open issues.^[3]

The most honest thing about the project is that it contradicted itself in public. v0.1.0 launched as "a fully autonomous 23-stage research pipeline… No human intervention required"; by April 1, 2026, v0.4.0 opened with "AutoResearchClaw is no longer purely autonomous," adding a human-in-the-loop co-pilot system with six intervention modes.^[1] The accompanying arXiv paper — 35 authors including Caiming Xiong, James Zou, Cihang Xie, and lab lead Huaxiu Yao — claims a 54.7% win over AI Scientist v2 on ARC-Bench (a benchmark the same lab created) and concludes that "targeted collaboration at high-leverage decision points consistently outperforms both full autonomy and exhaustive step-by-step oversight," repositioning the tool as an "amplifier that augments rather than replaces human scientific judgment."^[4]^[5]

Attribute	Value
Creator	AIMING Lab, UNC-Chapel Hill ("Adaptive Intelligence through Alignment, Interaction and Learning")^[2]
Launched	March 15, 2026 (v0.1.0)^[3]^[6]
GitHub Stars	13,360 (1,570 forks, 10 open issues, as of June 2026)^[3]
License	MIT^[3]
Paper	arXiv 2605.20025, submitted May 19, 2026^[4]
Funding	Academic lab project; no commercial funding disclosed^[2]

Product Overview

The core loop is a single command. Running `researchclaw run --topic "Your research idea" --auto-approve` produces an artifacts folder containing a full paper draft, compile-ready LaTeX, a BibTeX file auto-pruned to inline citations, a citation verification report, experiment code with structured JSON metrics, comparison charts with error bars, multi-agent reviews, and "evolution" lessons extracted from the run.^[1] A showcase documents 8 generated papers across 8 domains (math, statistics, biology, computing, NLP, RL, vision, and robustness), produced "fully autonomously or with Human-in-the-Loop co-pilot guidance."^[1]
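
For scripted use, the same invocation can be wrapped from Python. This is a minimal sketch that uses only the subcommand and flags quoted above. The helper names `build_run_command` and `run_pipeline` are hypothetical, and treating `--auto-approve` as optional is an assumption based on the project describing it as the switch for end-to-end runs.

```python
import shlex
import shutil
import subprocess


def build_run_command(topic: str, auto_approve: bool = False) -> list[str]:
    """Build the argv list for the documented `researchclaw run` call."""
    cmd = ["researchclaw", "run", "--topic", topic]
    if auto_approve:
        # Assumption: leaving this flag off keeps human approval gates active.
        cmd.append("--auto-approve")
    return cmd


def run_pipeline(topic: str, auto_approve: bool = False) -> int:
    """Run the CLI if it is installed and return its exit code."""
    if shutil.which("researchclaw") is None:
        raise FileNotFoundError("researchclaw CLI not found on PATH")
    result = subprocess.run(build_run_command(topic, auto_approve), check=False)
    return result.returncode


# Print the shell-quoted command without executing anything.
print(shlex.join(build_run_command("Your research idea", auto_approve=True)))
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems when the topic contains spaces or punctuation.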

The pipeline rests on five mechanisms described in the paper: structured multi-agent debate for hypothesis generation, a self-healing executor with a Pivot/Refine loop that treats failures as information, verifiable result reporting to prevent fabricated numbers and hallucinated citations, human-in-the-loop collaboration spanning full autonomy to step-by-step oversight, and cross-run evolution that converts past mistakes into future safeguards.^[4]
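
The paper's wording leaves the executor's internals open, so the following is only an illustrative sketch of a Pivot/Refine-style loop, not AutoResearchClaw's implementation. It assumes "refine" means retrying the same approach with the accumulated error messages, and "pivot" means moving to the next candidate approach once a retry budget is spent. `Attempt`, `run_with_pivot_refine`, and `fake_execute` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    approach: str
    ok: bool
    error: str = ""


def run_with_pivot_refine(
    approaches: list[str],
    execute: Callable[[str, list[str]], tuple[bool, str]],
    max_refines: int = 2,
) -> tuple[Optional[str], list[Attempt]]:
    """Refine each approach up to max_refines times, then pivot to the next."""
    history: list[Attempt] = []
    lessons: list[str] = []  # failures kept as information for later tries
    for approach in approaches:  # pivot: move on to the next approach
        for _ in range(1 + max_refines):  # refine: retry the same approach
            ok, error = execute(approach, lessons)
            history.append(Attempt(approach, ok, error))
            if ok:
                return approach, history
            lessons.append(error)
    return None, history


def fake_execute(approach: str, lessons: list[str]) -> tuple[bool, str]:
    """Stand-in for a sandboxed experiment run (purely illustrative)."""
    if approach == "baseline-A":
        return False, "A: out of memory"
    if "B: missing seed" not in lessons:
        return False, "B: missing seed"
    return True, ""


winner, history = run_with_pivot_refine(["baseline-A", "baseline-B"], fake_execute)
print(winner, [a.ok for a in history])
```

In this toy run the first approach exhausts its retries, and the second succeeds on its second try because the earlier error message is available to it.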

Key Capabilities

Capability	Description
23-stage pipeline	Idea → literature → hypothesis → experiments → analysis → review → LaTeX, end-to-end^[1]
Citation verification	4-layer integrity + relevance checks (arXiv, CrossRef, DataCite, LLM); real references only^[1]
HITL co-pilot	Six intervention modes (`full-auto` to `co-pilot`), SmartPause, Idea Workshop, Paper Co-Writer, CLI `attach`/`approve`/`guide`^[1]
Self-evolution	MetaClaw cross-run learning: failures → structured lessons → reusable skills; +18.3% robustness claimed in controlled experiments^[1]
Domain-specialist agents	v0.5.0 routes experiments to field-specific executors — high-energy physics, biology (COBRApy), statistics — beyond the default ML sandbox^[6]^[1]
ARC-Bench	55-topic open-ended autonomous-research benchmark (ML 25, HEP 10, quantum 10, biology 7, statistics 3), released on Hugging Face^[1]^[5]
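
One way to picture layered citation checking is as a chain of registry lookups followed by a relevance gate. The sketch below is an illustration under stated assumptions, not the project's code. The three registry layers are stubbed with in-memory sets of placeholder identifiers instead of real arXiv, CrossRef, or DataCite calls, a keyword-overlap function stands in for the LLM relevance check, and every name (`Citation`, `verify`, `keyword_relevance`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Citation:
    key: str
    title: str
    arxiv_id: Optional[str] = None
    doi: Optional[str] = None


# Placeholder identifiers standing in for live registry lookups (made up).
ARXIV_IDS = {"0000.00001"}
CROSSREF_DOIS = {"10.0000/example-crossref"}
DATACITE_DOIS = {"10.0000/example-datacite"}

Layer = Callable[[Citation], bool]


def in_arxiv(c: Citation) -> bool:
    return c.arxiv_id in ARXIV_IDS


def in_crossref(c: Citation) -> bool:
    return c.doi in CROSSREF_DOIS


def in_datacite(c: Citation) -> bool:
    return c.doi in DATACITE_DOIS


def keyword_relevance(topic: str) -> Layer:
    """Crude stand-in for an LLM relevance judgment: any shared word passes."""
    topic_words = set(topic.lower().split())
    return lambda c: bool(topic_words & set(c.title.lower().split()))


def verify(c: Citation, registries: list[Layer], relevance: Layer) -> str:
    """Keep a citation only if some registry resolves it and it is on-topic."""
    if not any(layer(c) for layer in registries):
        return "drop: unresolved"
    if not relevance(c):
        return "drop: off-topic"
    return "keep"


registries = [in_arxiv, in_crossref, in_datacite]
relevance = keyword_relevance("graph neural networks")
refs = [
    Citation("a", "Graph attention for molecules", arxiv_id="0000.00001"),
    Citation("b", "Graph networks that do not exist", doi="10.0000/not-registered"),
    Citation("c", "A survey of soil chemistry", doi="10.0000/example-crossref"),
]
report = {c.key: verify(c, registries, relevance) for c in refs}
print(report)
```

Separating existence checks from the relevance check means a real but off-topic reference and a fabricated one are reported differently, which makes the verification report easier to audit.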

Product Surfaces

Surface	Description	Availability
CLI (`researchclaw`)	Standalone pipeline runner with config YAML	GA^[1]
OpenClaw integration	"Research X" in chat → full run; bridges to Discord, Telegram, Lark, WeChat	GA^[1]
ACP agent backends	Delegates code-generation stages to Claude Code, Codex CLI, Copilot CLI, Gemini CLI, or Kimi CLI	GA (v0.3.2)^[1]

Technical Architecture

AutoResearchClaw is a Python 3.11+ application installed from source via `pip install -e .` and configured interactively (`researchclaw setup` / `init`) with a chosen LLM provider. Experiments execute in a hardened Docker sandbox with network-policy-aware execution and GPU/MPS/CPU auto-detection.^[1] Anti-fabrication machinery is a recurring theme: a `VerifiedRegistry` plus an experiment diagnosis-and-repair loop (v0.3.2), a 4-round paper quality audit with "AI-slop detection" and NeurIPS-checklist scoring (v0.2.0), and the citation kill-switch ("When citations are fake, it kills them").^[1]^[6] The README badge reports 2,699 passing tests.^[1]
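
The general idea behind a verified result registry can be sketched in a few lines: numbers enter the registry only from experiment output, and any decimal in the draft that matches no registered value is flagged. This is a hypothetical illustration, not the project's `VerifiedRegistry`. The class name `ResultRegistry`, the rounding tolerance, and the regex-based number extraction are all assumptions.

```python
import re


class ResultRegistry:
    """Holds metric values recorded from experiment runs (illustrative only)."""

    def __init__(self) -> None:
        self._values: dict[str, float] = {}

    def record(self, name: str, value: float) -> None:
        # Assumption: only the experiment runner calls this, never the writer.
        self._values[name] = float(value)

    def unverified_numbers(self, text: str, tol: float = 5e-4) -> list[float]:
        """Return decimals in the text that match no recorded value within tol."""
        cited = [float(m) for m in re.findall(r"\d+\.\d+", text)]
        return [
            x for x in cited
            if not any(abs(x - v) <= tol for v in self._values.values())
        ]


registry = ResultRegistry()
registry.record("accuracy", 0.9132)

draft = "Our method reaches an accuracy of 0.913 versus 0.871 for the baseline."
unverified = registry.unverified_numbers(draft)
print(unverified)  # the baseline figure was never recorded, so it is flagged
```

The tolerance lets a draft round a recorded value to three decimals without tripping the check, while a figure with no experimental origin is still caught.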

Key Technical Details

Aspect	Detail
Deployment	Self-hosted only; local CLI or any ACP-compatible agent backend^[1]
Model(s)	Provider-agnostic: bring your own API key (OpenAI-style config shown; Novita AI added in v0.3.1)^[1]
Integrations	OpenClaw, MetaClaw, OpenCode "Beast Mode" for complex codegen, messaging platforms via OpenClaw bridge^[1]
Open Source	Fully open, MIT license^[3]

Strengths

Exceptional star velocity for the category — 13,360 stars in under three months puts it within range of AI Scientist v1's all-time total, with active maintenance (last push June 3, 2026) and a clean 10-open-issue tracker.^[3]
Verification is the design center, not an afterthought — 4-layer citation checking, verified result registries, and AI-slop detection directly target the failure modes that made earlier paper generators infamous.^[1]
Credible academic backing — a 35-author arXiv paper with recognizable names (James Zou, Caiming Xiong, Cihang Xie) and a UNC-Chapel Hill lab behind it, plus a public 55-topic benchmark others can run.^[4]^[5]^[2]
Honest course-correction on autonomy — shipping six HITL intervention modes and publishing data that targeted collaboration beats full autonomy is more intellectually honest than the launch tagline.^[1]^[4]
Pluggable everything — LLM providers, CLI agent backends, messaging frontends, and domain-specialist executors keep it from being locked to one vendor's stack.^[1]

Cautions

The headline claim is self-refuted — "No human intervention required" (v0.1.0) became "no longer purely autonomous" (v0.4.0) in 17 days of releases; treat "chat an idea, get a paper" as marketing for a system whose own paper says humans at key decision points produce better output.^[1]^[6]^[4]
The benchmark win is self-graded — the 54.7% advantage over AI Scientist v2 is measured on ARC-Bench, a benchmark created and released by the same lab; no independent replication exists as of June 2026.^[4]^[5]
No peer-review milestone — unlike AI Scientist, whose output passed an ICLR workshop review, none of AutoResearchClaw's 8 showcase papers is claimed to have been accepted anywhere.^[1]
It feeds a pipeline the venues are actively defending against — arXiv announced one-year bans (May 15, 2026) for authors who submit unverified LLM-generated content, after Nikkei found hidden prompts in 17 preprints designed to manipulate AI reviewers; a tool that mass-produces NeurIPS-formatted papers lowers the cost of exactly this flood.^[7]
Per-paper cost is undisclosed — the project ships budget guardrails but publishes no typical API spend, and a 23-stage pipeline with multi-agent debate and self-healing retries is structurally token-hungry.^[1]
Three months old — rapid iteration (v0.1.0 → v0.5.0 in nine weeks) cuts both ways; the v0.3.2 changelog alone lists 100+ bug fixes.^[6]^[1]
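
The timeline figures in this report can be cross-checked with simple date arithmetic. The sketch below uses only dates quoted in the text, with one labeled assumption: the star count is described as "as of June 2026", so June 11, 2026 (the date of the report's Hacker News check) is used here as the snapshot date for the stars-per-day estimate.

```python
from datetime import date

LAUNCH = date(2026, 3, 15)    # v0.1.0 and repo creation
V040 = date(2026, 4, 1)       # v0.4.0, human-in-the-loop release
PAPER = date(2026, 5, 19)     # arXiv submission
SNAPSHOT = date(2026, 6, 11)  # assumption: stand-in for "as of June 2026"
STARS = 13_360

days_to_hitl = (V040 - LAUNCH).days
days_to_paper = (PAPER - LAUNCH).days
days_to_snapshot = (SNAPSHOT - LAUNCH).days
stars_per_day = STARS / days_to_snapshot

print(f"launch to v0.4.0:   {days_to_hitl} days")
print(f"launch to paper:    {days_to_paper} days")
print(f"launch to snapshot: {days_to_snapshot} days")
print(f"average stars/day:  {stars_per_day:.1f}")
```

Under that snapshot assumption the repo averaged roughly 150 stars per day; a different snapshot date within June shifts the estimate but not the order of magnitude.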

What Developers Say

Independent community discussion is strikingly thin relative to the star count: as of June 11, 2026, the only Hacker News submission of the repo (March 15, 2026) drew 2 points and zero comments, and no substantive Reddit threads surfaced in searches — for a 13K-star project, that absence is itself a data point worth weighing.^[8]^[3]

The criticism that does exist targets the genre rather than the tool. arXiv's computer-science moderators, announcing the May 2026 crackdown on unverified AI-generated submissions:

"Authors bear full responsibility for the content of their papers, regardless of how that content was produced." — arXiv CS moderation announcement, via The Decoder^[7]

"Hallucinated references and meta-comments left in by the language model — things like 'Here is a 200-word summary.'" — Thomas G. Dietterich, chair of arXiv's CS section, describing the failure pattern that triggers bans^[7]

The project's own authors supply the most quotable self-assessment: AutoResearchClaw is "an amplifier that augments rather than replaces human scientific judgment" — vendor voice, but a notable walk-back of the README's tagline.^[4]

Pricing & Licensing

Tier	Price	Includes
Open source	Free	Full pipeline, all 23 stages, HITL system, ARC-Bench, Discord community

You bring your own LLM API keys (provider-agnostic); compute for the Docker experiment sandbox is local.^[1]

Licensing model: MIT — fully permissive, unlike AI Scientist's December 2025 move to a custom responsible-AI license.^[3]

Hidden costs: LLM API spend per paper is not publicly disclosed and is likely the dominant real cost; the budget guardrails added in v0.4.0 suggest that runs can get expensive. LaTeX and Docker setup are prerequisites.^[1]

Competitive Positioning

Direct Competitors

Competitor	Differentiation
AI Scientist	The category pioneer with the peer-review milestone (ICLR 2025 workshop) and a Nature-published methodology; but restrictively relicensed and quiet since December 2025, while AutoResearchClaw is MIT-licensed and shipping — and claims a 54.7% ARC-Bench win over v2 (self-graded)^[4]
GPT Researcher	Deep research reports with citations, not novel science — no experiments, no LaTeX papers; far more battle-tested (since May 2023) for the literature-synthesis half of the job
Proprietary deep research (OpenAI, Google, Anthropic)	Produce research reports, not conference submissions with executed experiments; closed and platform-locked^[1]

When to Choose AutoResearchClaw Over Alternatives

Choose AutoResearchClaw when: you want the full idea-to-LaTeX pipeline with executed experiments, citation verification, and the option to steer at decision points — under a permissive MIT license.
Choose AI Scientist when: you want the system with a peer-reviewed track record and published-in-Nature methodology, and can accept the custom license and dormant repo.
Choose GPT Researcher when: you need literature-grounded research reports, not generated papers — it is cheaper, older, and does not pretend to do science.

Ideal Customer Profile

Best fit:

ML researchers using it as a co-pilot — hypothesis workshopping, baseline scaffolding, experiment orchestration, draft generation — with a human owning the claims
Labs studying autonomous-research systems themselves, for whom ARC-Bench and the open MIT codebase are research artifacts
Teams prototyping research ideas who want reproducible, sandboxed experiment runs with structured metrics

Poor fit:

Anyone planning to submit fully autonomous output to venues — arXiv now bans authors who fail to verify LLM-generated content, and no AutoResearchClaw paper has passed peer review^[7]^[1]
Non-ML domains beyond the v0.5.0 specialist executors (HEP, biology, statistics); coverage elsewhere is a generic Docker executor^[1]
Budget-sensitive users without API-cost headroom for a 23-stage multi-agent pipeline

Viability Assessment

Factor	Assessment
Financial Health	Academic lab project, no commercial entity or funding to evaluate; sustainability depends on AIMING Lab's continued investment^[2]
Market Position	Momentum leader in open-source autoresearch — 13.3K+ stars in three months while AI Scientist sits dormant^[3]
Innovation Pace	Very high — five minor versions, HITL system, domain agents, and a 55-topic benchmark in nine weeks^[6]^[5]
Community/Ecosystem	Stars without discourse — near-zero HN/Reddit engagement despite 1,570 forks; Discord exists but independent evaluation is absent^[8]^[3]
Long-term Outlook	Hinges on independent replication of ARC-Bench results and on whether venues' anti-AI-slop policies make the category's output publishable at all^[4]^[7]

The star-to-scrutiny ratio is the anomaly: 13,360 stars with one 2-point, zero-comment HN thread suggests adoption is running well ahead of independent evaluation.^[3]^[8] The strongest viability signal is the author list and the lab's willingness to publish data undercutting its own autonomy pitch; the weakest is that every performance claim currently traces back to the lab itself.^[4]

Bottom Line

AutoResearchClaw is the most active and most permissively licensed entry in the autonomous-research category, and its verification machinery — citation kill-switches, verified result registries, AI-slop audits — addresses the genre's real failure modes head-on. But the honest read of its own paper is that it is a research co-pilot, not a scientist: full autonomy is the marketing, targeted human collaboration is the documented best practice, and the headline benchmark win is graded on the lab's own rubric.

Recommended for: ML researchers who want an experiment-running, draft-writing co-pilot they can audit and steer; labs studying autonomous research systems.

Not recommended for: anyone hoping to "chat an idea" into a publishable paper without owning every claim — arXiv's ban policy now makes that a career risk, not a shortcut.

Outlook: Watch for independent ARC-Bench replications, a first peer-review acceptance for a generated paper, and whether community scrutiny ever catches up to the star count.

Research by Ry Walker Research

Sources