
pi-autoresearch

pi-autoresearch generalizes Karpathy's autoresearch loop to any optimization target — test speed, bundle size, Lighthouse scores, build times. Domain-agnostic experiment loops for the pi editor.

Key takeaways

  • Generalizes autoresearch beyond ML — any measurable metric becomes an optimization target: test speed, bundle size, Lighthouse scores, build times
  • Clean extension/skill separation: infrastructure (run/log/dashboard) is global, domain knowledge (command, metric, scope) is per-project
  • Resumable sessions — autoresearch.md + autoresearch.jsonl let any fresh agent continue where the last left off, surviving context resets
  • Built for pi editor; proves the autoresearch pattern is extractable from ML into general software engineering

FAQ

What is pi-autoresearch?

An extension for the pi editor that adds autonomous experiment loops for any optimization target. Try an idea, measure it, keep what works, discard what doesn't, repeat forever.

What metrics can it optimize?

Anything measurable: test execution time, JavaScript bundle size, build speed, Lighthouse performance scores, LLM training loss, or any custom metric that outputs a number.

Overview

pi-autoresearch takes Karpathy's autoresearch pattern and makes it domain-agnostic. Built as an extension for the pi editor, it adds autonomous experiment loops that work for any optimization target — not just ML training.

"Try an idea, measure it, keep what works, discard what doesn't, repeat forever."

Architecture

Extension + Skill Pattern

The key insight is separating infrastructure from domain knowledge:

| Layer | Scope | What it provides |
| --- | --- | --- |
| Extension (global) | All projects | run_experiment, log_experiment, widget, dashboard |
| Skill (per-domain) | Specific project | command, metric, direction, scope, ideas |

This means one extension serves unlimited domains — from optimizing React bundle size to ML training loss.

Three Tools

| Tool | Description |
| --- | --- |
| init_experiment | One-time session config — name, metric, unit, direction (lower/higher) |
| run_experiment | Runs any command, times wall-clock duration, captures output |
| log_experiment | Records result, auto-commits, updates widget and dashboard |

Session Files

| File | Purpose |
| --- | --- |
| autoresearch.md | Session document: objective, metrics, files in scope, what's been tried |
| autoresearch.sh | Benchmark script: pre-checks, runs workload, outputs METRIC name=number |
| autoresearch.checks.sh | Optional backpressure: tests, types, lint. Failures block keeping changes |
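A minimal autoresearch.sh sketch following the contract described above (pre-check, run the workload, print METRIC name=number). The workload is a trivial stand-in so the sketch is self-contained and the metric name is assumed; a real script would run something like pnpm test:

```shell
#!/usr/bin/env sh
set -eu

run_benchmark() {
    # Pre-check: fail fast if a required tool is missing.
    command -v awk >/dev/null || { echo "pre-check failed: awk missing" >&2; exit 1; }

    # Run the workload and time wall-clock duration in nanoseconds (GNU date %N).
    start=$(date +%s%N)
    true    # workload placeholder; a real script would run the test suite here
    end=$(date +%s%N)

    # Emit the metric line in the expected "METRIC name=number" shape.
    awk -v s="$start" -v e="$end" \
        'BEGIN { printf "METRIC test_seconds=%.3f\n", (e - s) / 1e9 }'
}

run_benchmark
```

An autoresearch.checks.sh would follow the same pattern but exit nonzero on any test, type, or lint failure, which is what blocks a change from being kept.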

Example Targets

| Domain | Metric | Command |
| --- | --- | --- |
| Test speed | seconds ↓ | pnpm test |
| Bundle size | KB ↓ | pnpm build && du -sb dist |
| LLM training | val_bpb ↓ | uv run train.py |
| Build speed | seconds ↓ | pnpm build |
| Lighthouse | perf score ↑ | lighthouse http://localhost:3000 --output=json |
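The bundle-size row, for instance, can be scripted the same way. This hedged sketch assumes the same METRIC output contract and measures a throwaway directory instead of a real dist/ so it runs anywhere (du -sb is GNU coreutils):

```shell
#!/usr/bin/env sh
set -eu

# Hypothetical bundle-size target (KB, lower is better). A real script would
# run `pnpm build` first and point du at dist/.
dir=$(mktemp -d)
printf 'console.log("hello")\n' > "$dir/bundle.js"

bytes=$(du -sb "$dir" | awk '{print $1}')
metric_line="METRIC bundle_kb=$((bytes / 1024))"
echo "$metric_line"

rm -rf "$dir"
```

Because the extension only parses the METRIC line, swapping targets means swapping the script body; nothing in the infrastructure changes.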

UX

  • Status widget — Always visible above editor: 🔬 autoresearch 12 runs 8 kept │ best: 42.3s
  • /autoresearch dashboard — Full results table (Ctrl+X to toggle, Escape to close)
  • Resumable — autoresearch.md captures enough context for any fresh agent to continue

Strengths & Limitations

Strengths:

  • Proves autoresearch generalizes beyond ML
  • Clean architecture (extension/skill separation)
  • Correctness checks gate keeping changes (tests/lint must pass)
  • Good UX with real-time dashboard
  • Branch-aware, resumable across context resets

Limitations:

  • pi editor only (not a standalone CLI)
  • 2 days old, small community (817 stars)
  • No multi-agent coordination

Bottom Line

pi-autoresearch is the proof point that the autoresearch pattern is bigger than ML. By separating infrastructure from domain knowledge, it shows that any CI pipeline with a metric is an autoresearch target. The question is whether this pattern stays editor-specific or becomes a standalone platform.


Research by Ry Walker Research • methodology