Key takeaways
- Autoresearch applied to GPU kernel optimization — profiles any PyTorch model, extracts bottleneck ops, autonomously optimizes Triton or CUDA C++ kernels
- Uses Amdahl's law to prioritize which kernels to optimize next, maximizing aggregate speedup per experiment hour
- ~40 experiments/hour (~90s each), ~320 overnight. 5-stage correctness checks prevent regressions during autonomous operation
- Ships with pre-built profiles for GPT-2, LLaMA, BERT — start optimizing immediately without model setup
FAQ
What is AutoKernel?
AutoKernel is an autonomous AI agent that optimizes GPU kernels. It profiles a PyTorch model, finds bottleneck operations, and lets an AI agent iteratively optimize the Triton or CUDA kernels — keeping improvements, reverting failures.
How does it decide which kernel to optimize?
It uses Amdahl's law reasoning — prioritizing kernels where optimization would have the largest impact on total model performance, based on GPU time profiling data.
Overview
AutoKernel takes the autoresearch pattern and applies it to a high-value vertical: GPU kernel optimization. Give it any PyTorch model, and an AI agent will profile it, extract bottleneck operations as standalone kernels, and optimize each one autonomously.
~40 experiments/hour. ~320 overnight. Each experiment takes ~90 seconds with 5-stage correctness verification.
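The keep-improvements, revert-failures loop can be sketched in a few lines of pure Python. Everything below is a hypothetical stand-in, not AutoKernel's actual API: `propose_and_measure` abstracts the real pipeline (the agent edits a kernel, bench.py runs correctness checks, then times it), and the random deltas are illustrative.

```python
import random

def optimize_kernel(baseline_ms, n_experiments=40, seed=0):
    """Hill-climbing sketch of the agent loop: keep improvements, revert failures."""
    rng = random.Random(seed)

    def propose_and_measure(current_ms):
        # Hypothetical stand-in: a variant is either a small win or a regression,
        # and some variants fail the correctness checks outright.
        delta = rng.uniform(-0.15, 0.10)   # -15%..+10% change in runtime
        correct = rng.random() > 0.2
        return current_ms * (1 + delta), correct

    best_ms = baseline_ms
    for _ in range(n_experiments):
        candidate_ms, correct = propose_and_measure(best_ms)
        if correct and candidate_ms < best_ms:
            best_ms = candidate_ms          # keep the improvement
        # otherwise revert: best_ms is unchanged
    return best_ms

final = optimize_kernel(baseline_ms=2.0)    # never worse than the baseline
```

Because failed or incorrect candidates are discarded, the loop is monotone: the kept kernel is always at least as fast as the baseline.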
Architecture
Pipeline
Any PyTorch ──▶ Rank kernels ──▶ Generate baseline   ──▶ Optimize each   ──▶ End-to-end
model           by GPU time      Triton/CUDA kernels      kernel (agent)      verification
(profile.py)    (extract.py)                              (bench.py loop)     (verify.py)
Tools
| Tool | What it does |
|---|---|
| profile.py | Profiles any PyTorch model with torch.profiler, ranks kernels by GPU time |
| extract.py | Extracts top-N bottleneck kernels into standalone Triton or CUDA C++ files |
| orchestrate.py | Multi-kernel scheduler using Amdahl's law to prioritize work |
| bench.py | Fixed benchmark with 5-stage correctness checks + roofline analysis |
| verify.py | End-to-end correctness and speedup verification |
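The roofline analysis mentioned for bench.py can be approximated in a few lines: a kernel's attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. This is a minimal sketch, and the hardware numbers below are illustrative defaults, not measured values from any of the tested GPUs.

```python
def roofline_bound(flops, bytes_moved, peak_tflops=60.0, bandwidth_gbs=2000.0):
    """Return (arithmetic intensity, attainable TFLOP/s) under a simple roofline model.

    attainable = min(peak_compute, bandwidth * intensity)
    """
    intensity = flops / bytes_moved                           # FLOPs per byte
    attainable = min(peak_tflops, bandwidth_gbs / 1000.0 * intensity)
    return intensity, attainable

# A memory-bound elementwise op: 1 FLOP per 8 bytes moved.
# intensity = 0.125 FLOP/byte, so the bound is 2 TB/s * 0.125 = 0.25 TFLOP/s,
# far below peak compute -- the kernel is memory-bound.
i, t = roofline_bound(flops=1e9, bytes_moved=8e9)
```

A kernel sitting well under its roofline bound is the kind of target where agent-driven rewrites (fusion, better memory access patterns) have room to pay off.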
Amdahl's Law Scheduling
The orchestrate.py scheduler decides which kernel to optimize next based on potential impact. A kernel consuming 40% of GPU time has more optimization headroom than one at 2%. This is a form of agent task prioritization — "which sub-task has the most leverage?" — applicable beyond kernel optimization.
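That prioritization can be sketched directly from Amdahl's law. The sketch below is not orchestrate.py's actual code; the kernel names, GPU-time fractions, and the assumed 2x per-kernel speedup are all illustrative.

```python
def overall_speedup(fraction, kernel_speedup):
    """Amdahl's law: total speedup when a fraction f of runtime gets s times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)

def pick_next(kernels, assumed_speedup=2.0):
    """Pick the kernel whose optimization would lift end-to-end performance most."""
    return max(kernels, key=lambda k: overall_speedup(k["gpu_frac"], assumed_speedup))

profile = [
    {"name": "fused_attention", "gpu_frac": 0.40},
    {"name": "layernorm",       "gpu_frac": 0.10},
    {"name": "bias_gelu",       "gpu_frac": 0.02},
]
best = pick_next(profile)
# A 2x win on the 40% kernel yields 1/(0.6 + 0.2) = 1.25x overall,
# while the same 2x win on the 2% kernel yields only ~1.01x.
```

The same scoring function works for any multi-target optimization problem: replace "gpu_frac" with whatever fraction of the total cost each sub-task accounts for.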
Strengths & Limitations
Strengths:
- Solves an expensive, real problem (kernel optimization is expert work)
- Smart Amdahl's law prioritization
- Supports both Triton and CUDA C++ backends
- 5-stage correctness checks prevent silent regressions
- Ships with model profiles for immediate use
Limitations:
- NVIDIA GPU required (H100/A100/RTX 4090 tested)
- 2 days old, unproven at scale
- Vertical-specific (GPU kernels only)
Bottom Line
AutoKernel validates that autoresearch works for specialized optimization domains beyond LLM training. The Amdahl's law scheduler is an interesting agent orchestration primitive — prioritizing tasks by expected impact. The pattern could apply to any multi-target optimization: which test to speed up first, which API endpoint to optimize, which component to shrink.
Research by Ry Walker Research