Key takeaways
- Generalizes autoresearch beyond ML — any measurable metric becomes an optimization target: test speed, bundle size, Lighthouse scores, build times
- Clean extension/skill separation: infrastructure (run/log/dashboard) is global, domain knowledge (command, metric, scope) is per-project
- Resumable sessions — `.auto/prompt.md` + `.auto/log.jsonl` let any fresh agent continue where the last left off, surviving context resets and crashes
- Grew from 817 stars at our March 2026 profile to ~7K by June 2026, with five releases since March and a confidence-scoring system that separates real gains from benchmark jitter
- The most-starred derivative of Karpathy's autoresearch — and the template the community ported to Claude Code and Cursor
FAQ
What is pi-autoresearch?
An extension for the Pi coding agent that adds autonomous experiment loops for any optimization target. Try an idea, measure it, keep what works, discard what doesn't, repeat forever.
What metrics can it optimize?
Anything measurable: test execution time, JavaScript bundle size, build speed, Lighthouse performance scores, LLM training loss, or any custom metric that outputs a number.
How much does pi-autoresearch cost?
It's free, MIT-licensed open source, distributed as the `pi-autoresearch` npm package. You pay only for the LLM tokens your Pi agent consumes while running experiments.
Overview
pi-autoresearch takes Karpathy's autoresearch pattern and makes it domain-agnostic. Built as a TypeScript extension for the Pi coding agent (now stewarded by Earendil), it adds autonomous experiment loops that work for any optimization target — not just ML training.
"Try an idea, measure it, keep what works, discard what doesn't, repeat forever."
Status (as of June 2026)
The project is healthy and actively maintained. The GitHub repo sits at ~7K stars and 413 forks (up from 817 stars at our March 2026 profile), with the latest push on June 8, 2026. Five releases have shipped since this profile was first written — v1.2.0 through v1.6.0 — with v1.6.0 (June 8, 2026) migrating the npm scope from @mariozechner to @earendil-works, tracking Pi's move under Earendil stewardship at pi.dev. It is listed as an official package in the pi.dev registry.
Pricing: free, MIT-licensed open source; the only cost is the LLM tokens Pi consumes while iterating.
Architecture
Extension + Skill Pattern
The key insight is separating infrastructure from domain knowledge:
| Layer | Scope | What it provides |
|---|---|---|
| Extension (global) | All projects | run_experiment, log_experiment, widget, dashboard |
| Skill (per-domain) | Specific project | command, metric, direction, scope, ideas |
This means one extension serves unlimited domains — from optimizing React bundle size to ML training loss.
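As a rough illustration of that split, the sketch below models a skill as plain data and keeps the comparison logic domain-agnostic. The `Skill` interface, the `isImprovement` helper, and the example scope and idea strings are all hypothetical names invented here; they are not the extension's actual types.

```typescript
// Hypothetical shapes for illustration only; not the extension's real API.
type Direction = "lower" | "higher";

interface Skill {
  command: string;      // shell command that produces the measurement
  metric: string;       // name of the number being optimized
  direction: Direction; // whether smaller or larger values are better
  scope: string[];      // files the agent may change (made-up examples below)
  ideas: string[];      // seed hypotheses to try (made-up examples below)
}

// Domain-agnostic infrastructure: the same check works for any Skill.
function isImprovement(skill: Skill, best: number, candidate: number): boolean {
  return skill.direction === "lower" ? candidate < best : candidate > best;
}

const bundleSize: Skill = {
  command: "pnpm build && du -sb dist",
  metric: "bundle_kb",
  direction: "lower",
  scope: ["src/**"],
  ideas: ["lazy-load routes", "drop unused locales"],
};

const lighthouse: Skill = {
  command: "lighthouse http://localhost:3000 --output=json",
  metric: "perf_score",
  direction: "higher",
  scope: ["src/**"],
  ideas: ["preload hero image"],
};

console.log(isImprovement(bundleSize, 512, 498));   // true: a smaller bundle wins
console.log(isImprovement(lighthouse, 0.91, 0.88)); // false: the score dropped
```

Only the data changes between the two skills; the infrastructure function never needs to know which domain it is serving.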
Three Tools
| Tool | Description |
|---|---|
| init_experiment | One-time session config — name, metric, unit, direction (lower/higher) |
| run_experiment | Runs any command, times wall-clock duration, captures output |
| log_experiment | Records result, auto-commits, updates widget and dashboard |
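The cycle these three tools form can be sketched in memory as follows. This is a simplified model under stated assumptions: the function names mirror the tool names, but the `Session` shape, the parameters, and the keep/discard rule are guesses for illustration, and the real tools also run shell commands and create git commits, which this sketch omits.

```typescript
// In-memory sketch of the init -> run -> log cycle (assumed shapes).
type Direction = "lower" | "higher";
type Status = "kept" | "discarded";

interface Session {
  name: string;
  metric: string;
  unit: string;
  direction: Direction;
  best: number | null;
  log: { idea: string; value: number; status: Status }[];
}

function initExperiment(name: string, metric: string, unit: string, direction: Direction): Session {
  return { name, metric, unit, direction, best: null, log: [] };
}

// Stand-in for run_experiment: calls a measurement function and times it.
function runExperiment(measure: () => number): { value: number; wallMs: number } {
  const start = Date.now();
  const value = measure();
  return { value, wallMs: Date.now() - start };
}

// Stand-in for log_experiment: keep the change only if it beats the best so far.
function logExperiment(session: Session, idea: string, value: number): Status {
  const better =
    session.best === null ||
    (session.direction === "lower" ? value < session.best : value > session.best);
  const status: Status = better ? "kept" : "discarded";
  if (better) session.best = value;
  session.log.push({ idea, value, status });
  return status;
}

const session = initExperiment("test-speed", "duration", "s", "lower");
// Made-up timings standing in for real `pnpm test` runs.
const fakeTimings: [string, number][] = [
  ["baseline", 42.0],
  ["parallel workers", 31.5],
  ["more mocks", 33.0],
];
for (const [idea, seconds] of fakeTimings) {
  const { value } = runExperiment(() => seconds);
  logExperiment(session, idea, value);
}
console.log(session.best, session.log.map((e) => e.status));
```

The first run becomes the baseline, the second is kept because it is faster, and the third is discarded because it does not beat the current best.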
Session Files
As of v1.6.0, all session files live in a .auto/ subfolder (previously top-level autoresearch.* files):
| File | Purpose |
|---|---|
| `.auto/prompt.md` | Session document: objective, metrics, files in scope, what's been tried |
| `.auto/log.jsonl` | Append-only experiment log: commit hash, metric value, kept/discarded status |
| Backpressure checks | Optional: tests, types, lint. Failures block keeping changes |
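To make the resumability idea concrete, here is a minimal sketch of how a fresh agent might recover the bar to beat from an append-only JSONL log. The field names (`commit`, `value`, `status`) are assumptions inferred from the table's description rather than the file's documented schema, and the sample entries are made up.

```typescript
// Assumed entry shape; the real log.jsonl schema may differ.
interface LogEntry {
  commit: string;
  value: number;
  status: "kept" | "discarded";
}

// One JSON object per line; blank lines are skipped.
function parseLog(jsonl: string): LogEntry[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as LogEntry);
}

// Recover the best kept value so a resumed session knows what to beat.
function bestKept(entries: LogEntry[], direction: "lower" | "higher"): number | null {
  const kept = entries.filter((e) => e.status === "kept").map((e) => e.value);
  if (kept.length === 0) return null;
  return direction === "lower" ? Math.min(...kept) : Math.max(...kept);
}

// Made-up sample content for illustration.
const sampleLog = [
  '{"commit":"a1b2c3d","value":42.0,"status":"kept"}',
  '{"commit":"d4e5f6a","value":44.1,"status":"discarded"}',
  '{"commit":"b7c8d9e","value":31.5,"status":"kept"}',
].join("\n");

const entries = parseLog(sampleLog);
console.log(entries.length, bestKept(entries, "lower"));
```

Because the log is append-only, replaying it from the top is enough to rebuild session state after a crash or context reset.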
Example Targets
| Domain | Metric | Command |
|---|---|---|
| Test speed | seconds ↓ | pnpm test |
| Bundle size | KB ↓ | pnpm build && du -sb dist |
| LLM training | val_bpb ↓ | uv run train.py |
| Build speed | seconds ↓ | pnpm build |
| Lighthouse | perf score ↑ | lighthouse http://localhost:3000 --output=json |
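Each command above prints more than a bare number, so something has to reduce its output to a single metric. The helpers below are a hypothetical sketch of that step (the function names are invented here). They rely on two output formats: `du -sb` prints a byte count followed by the path, and a Lighthouse JSON report stores the performance score as a 0 to 1 value under `categories.performance.score`.

```typescript
// `du -sb dist` prints "<bytes>\t<path>"; convert to KB (1 KB = 1024 bytes here).
function bundleKbFromDu(output: string): number {
  const match = /^(\d+)/.exec(output.trim());
  if (!match) throw new Error(`unexpected du output: ${output}`);
  return Number(match[1]) / 1024;
}

// Lighthouse reports the performance score as 0..1; scale it to 0..100.
function lighthousePerfScore(reportJson: string): number {
  const report = JSON.parse(reportJson) as {
    categories?: { performance?: { score?: number | null } };
  };
  const score = report.categories?.performance?.score;
  if (typeof score !== "number") throw new Error("performance score missing");
  return Math.round(score * 100);
}

console.log(bundleKbFromDu("1536000\tdist")); // 1500
console.log(
  lighthousePerfScore('{"categories":{"performance":{"score":0.93}}}')
); // 93
```

Note that `du -sb` reports bytes, so a bundle-size metric tracked in KB needs a conversion like the one shown.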
UX
- Status widget — Always visible above the editor with live run counts and best metric
- Dashboard — Defaults to expanded since v1.6.0; fullscreen overlay via Ctrl+Shift+F
- Confidence scoring — After 3+ experiments, green/yellow/red indicators distinguish real gains from benchmark jitter on noisy signals
- Resumable — `.auto/prompt.md` captures enough context for any fresh agent to continue; auto-compaction keeps long loops running across context limits
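The profile does not describe how the confidence indicator is computed, so the following is only one plausible sketch of the idea, not the extension's method: estimate benchmark noise from the spread of earlier measurements (here, the median absolute deviation) and compare the claimed improvement against it. The thresholds of 2x and 1x noise, the function names, and the `"unknown"` state are all assumptions.

```typescript
type Confidence = "green" | "yellow" | "red" | "unknown";

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[mid]! : (s[mid - 1]! + s[mid]!) / 2;
}

// Assumed rule: compare the improvement to the median absolute deviation (MAD)
// of earlier runs. Thresholds are illustrative, not taken from the project.
function confidence(previous: number[], improvement: number): Confidence {
  if (previous.length < 3) return "unknown"; // too few runs to estimate noise
  const m = median(previous);
  const mad = median(previous.map((x) => Math.abs(x - m)));
  if (mad === 0) return improvement !== 0 ? "green" : "red";
  const ratio = Math.abs(improvement) / mad;
  if (ratio >= 2) return "green";  // well outside the noise band
  if (ratio >= 1) return "yellow"; // comparable to the noise
  return "red";                    // indistinguishable from jitter
}

// Made-up timings in seconds from earlier runs of the same benchmark.
const earlier = [42.0, 41.6, 42.3, 41.9];
console.log(confidence(earlier, 5.0)); // large gain relative to spread
console.log(confidence(earlier, 0.3)); // borderline gain
console.log(confidence(earlier, 0.1)); // within jitter
```

A median-based spread is used here because a single slow outlier run would inflate a standard deviation far more; any robust noise estimate would serve the same purpose.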
Changes Since March 2026
- Five releases (v1.2.0–v1.6.0) between late April and June 8, 2026
- Earendil migration — npm scope moved from `@mariozechner` to `@earendil-works` as Pi itself (61.7K stars) moved under Earendil stewardship at pi.dev
- Confidence scoring, hooks system, and config-file support added for custom behavior at iteration boundaries
- Ecosystem spread — the architecture has been ported to a Claude Code plugin (ozeron/autoresearch) and a Cursor port (cursor-autoresearch), making pi-autoresearch the de facto template for non-ML autoresearch
Strengths & Limitations
Strengths:
- Proves autoresearch generalizes beyond ML
- Clean architecture (extension/skill separation)
- Correctness checks gate keeping changes (tests/lint must pass)
- Good UX with real-time dashboard
- Branch-aware, resumable across context resets
Limitations:
- Pi agent only (not standalone CLI) — though community ports cover Claude Code and Cursor
- Single-maintainer project; bus factor remains a risk despite the Earendil ecosystem alignment
- No multi-agent coordination
What Developers Say
Community commentary is positive but still thin — most discussion happens in ecosystem roundups rather than long review threads.
- paddo.dev's autoresearch ecosystem survey called it "the most popular derivative by stars, largely because it makes the loop usable for non-ML tasks with a proper interface," noting it "adds persistent sessions that survive restarts and context resets, a dashboard UI, and branch-aware experiment tracking."
- On Hacker News, developer ozeron called it a "great find" and announced a Claude Code plugin port of its architecture.
Bottom Line
Recommended if you use Pi and have any metric worth optimizing overnight. pi-autoresearch is the proof point that the autoresearch pattern is bigger than ML. By separating infrastructure from domain knowledge, it shows that any CI pipeline with a metric is an autoresearch target. Not recommended as your entry point if you're not on Pi — use the Claude Code or Cursor ports instead.
Outlook: the March question — would this stay editor-specific or become a platform? — has been half-answered: it stayed a Pi extension, but its architecture became the template everyone else ports. With Earendil now stewarding Pi and the npm scope migrated accordingly, pi-autoresearch looks like durable first-party-adjacent infrastructure rather than a weekend hack.
Research by Ry Walker Research • methodology