AutoKernel | Ry Walker Research

Key takeaways

Autoresearch applied to GPU kernel optimization — profiles any PyTorch model, extracts bottleneck ops, autonomously optimizes Triton or CUDA C++ kernels; 300–400 experiments per overnight 10-hour run
Reported H100 results beat PyTorch eager by 5.29x (RMSNorm), 2.82x (softmax), 2.21x (cross-entropy) — but matmul is a weak spot at ~28% of cuBLAS throughput, a gap HN commenters called out
Development appears stalled — no commits since March 19, 2026, six days after launch, despite 1.4k stars and a companion arXiv paper
Meta's KernelEvolve (April 2026, ISCA 2026 paper) validates the agentic kernel-optimization category at production scale across NVIDIA, AMD, and MTIA — dwarfing AutoKernel in scope

FAQ

What is AutoKernel?

AutoKernel is an open-source (MIT) autonomous AI agent that optimizes GPU kernels. It profiles a PyTorch model, finds bottleneck operations, and lets an AI agent iteratively optimize the Triton or CUDA kernels — keeping improvements, reverting failures.

How does it decide which kernel to optimize?

It uses Amdahl's law reasoning — prioritizing kernels where optimization would have the largest impact on total model performance, based on GPU time profiling data.

Is AutoKernel still maintained?

Unclear. As of June 2026 the repo has had no commits since March 19, 2026 — six days after launch — with 12 open issues. It is not archived, but treat it as a frozen research artifact rather than a maintained tool.

How much does AutoKernel cost?

Nothing — it is MIT-licensed open source. You pay only for your own GPU time and LLM API usage during agent runs. RightNow AI's commercial product is its CUDA IDE; AutoKernel has no paid tier.

Overview

AutoKernel takes the autoresearch pattern and applies it to a high-value vertical: GPU kernel optimization. Give it any PyTorch model, and an AI agent will profile it, extract bottleneck operations as standalone kernels, and optimize each one autonomously.

The agent runs ~40 experiments per hour, or 300–400 per overnight 10-hour run. Each experiment takes ~90 seconds and goes through 5-stage correctness verification.
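The keep-or-revert loop and the experiment budget can be sketched as follows. This is a minimal illustration, not AutoKernel's actual code: `propose_variant`, `passes_checks`, and the simulated latencies are hypothetical stand-ins for the agent's edit, the 5-stage correctness checks, and a real benchmark. Only the 90-second and 10-hour figures come from the text above.

```python
import random

SECONDS_PER_EXPERIMENT = 90   # from the article: ~90 s per experiment
RUN_HOURS = 10                # from the article: overnight 10-hour run


def experiment_budget(hours: float, seconds_each: float) -> int:
    """Upper bound on how many experiments fit in the run."""
    return int(hours * 3600 // seconds_each)


def propose_variant(latency_ms: float, rng: random.Random) -> float:
    """Hypothetical agent edit: simulated latency moves by -10%..+10%."""
    return latency_ms * rng.uniform(0.90, 1.10)


def passes_checks(rng: random.Random) -> bool:
    """Hypothetical correctness gate: about 20% of variants are rejected."""
    return rng.random() > 0.2


def optimize(baseline_ms: float, n_experiments: int, seed: int = 0) -> float:
    """Greedy loop: keep a variant only if it is both correct and faster."""
    rng = random.Random(seed)
    best_ms = baseline_ms
    for _ in range(n_experiments):
        candidate_ms = propose_variant(best_ms, rng)
        if passes_checks(rng) and candidate_ms < best_ms:
            best_ms = candidate_ms   # keep the improvement
        # otherwise revert: best_ms is left unchanged
    return best_ms


budget = experiment_budget(RUN_HOURS, SECONDS_PER_EXPERIMENT)
final_ms = optimize(baseline_ms=1.0, n_experiments=budget)
print(f"budget: {budget} experiments, never worse than baseline: {final_ms <= 1.0}")
```

The budget works out to 400, the top of the quoted 300–400 range. The simulated latencies carry no meaning; the point is the control flow, where a failed or slower variant never replaces the current best.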

Status note (June 2026): the project appears stalled. The GitHub repo's last push was March 19, 2026 — six days after launch — and it sits at 1,404 stars, 141 forks, and 12 open issues, unarchived but inactive. The team did publish a companion arXiv paper in late March 2026.

Pricing

Free and open source under MIT. There is no hosted or paid tier — costs are your own GPU time plus LLM API usage during agent runs. RightNow AI's commercial product is its CUDA IDE, not AutoKernel.

Architecture

Pipeline

Any PyTorch ──▶ Rank kernels ──▶ Generate baseline ──▶ Optimize each ──▶ End-to-end
  model         by GPU time      Triton/CUDA kernels   kernel (agent)    verification
                (profile.py)     (extract.py)          (bench.py loop)   (verify.py)

Tools

All tooling is Python scripts driving the agent loop:

Tool	What it does
`profile.py`	Profiles any PyTorch model with torch.profiler, ranks kernels by GPU time
`extract.py`	Extracts top-N bottleneck kernels into standalone Triton or CUDA C++ files
`orchestrate.py`	Multi-kernel scheduler using Amdahl's law to prioritize work
`bench.py`	Fixed benchmark with 5-stage correctness checks + roofline analysis
`verify.py`	End-to-end correctness and speedup verification
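
The roofline analysis mentioned for `bench.py` can be illustrated with a small model. This is a sketch of the general roofline idea, not the tool's implementation. The 989.5 TFLOPS figure is the one quoted in this article; the 3.35 TB/s memory bandwidth, the reading of "4k" as 4096, and the ~4 FLOPs per element for a normalization-like op are assumptions made for illustration.

```python
H100_PEAK_TFLOPS = 989.5     # figure quoted in this article
H100_BANDWIDTH_TBPS = 3.35   # ASSUMPTION: H100 SXM HBM3 bandwidth, not from the article


def attainable_tflops(intensity: float,
                      peak_tflops: float = H100_PEAK_TFLOPS,
                      bandwidth_tbps: float = H100_BANDWIDTH_TBPS) -> float:
    """Roofline bound: min(compute peak, bandwidth * arithmetic intensity)."""
    return min(peak_tflops, bandwidth_tbps * intensity)


def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for C = A @ B if each matrix crosses memory exactly once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def elementwise_intensity(flops_per_elem: float, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an op that reads and writes each element once."""
    return flops_per_elem / (2 * bytes_per_elem)


ridge = H100_PEAK_TFLOPS / H100_BANDWIDTH_TBPS   # intensity where both bounds meet
gemm = matmul_intensity(4096, 4096, 4096)        # ASSUMPTION: "4k" means 4096, fp16
norm = elementwise_intensity(flops_per_elem=4)   # ASSUMPTION: ~4 FLOPs per element

for name, ai in [("4096^3 fp16 GEMM", gemm), ("normalization-like op", norm)]:
    bound = "compute-bound" if ai > ridge else "memory-bound"
    print(f"{name}: {ai:.1f} FLOP/byte, {bound}, <= {attainable_tflops(ai):.2f} TFLOPS")
```

Under these assumptions the GEMM sits far to the right of the ridge point and is limited by compute, while the normalization-like op is limited by memory bandwidth. That split is one plausible reason the reported wins cluster on RMSNorm, softmax, and cross-entropy rather than matmul.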

Amdahl's Law Scheduling

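The orchestrate.py scheduler decides which kernel to optimize next based on potential impact. A kernel consuming 40% of GPU time has far more optimization headroom than one at 2%: by Amdahl's law, eliminating the 40% kernel entirely would cap the end-to-end speedup at about 1.67x, while eliminating the 2% kernel would cap it at about 1.02x. This is a form of agent task prioritization ("which sub-task has the most leverage?") that applies beyond kernel optimization.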
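
A minimal sketch of this kind of prioritization, assuming the scheduler only needs each kernel's share of GPU time. The function names, the example profile, and the fixed 2x per-kernel speedup are hypothetical; the article does not describe how orchestrate.py actually scores kernels.

```python
def overall_speedup(fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when `fraction` of the runtime
    becomes `kernel_speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def rank_by_impact(gpu_time_share: dict[str, float],
                   assumed_speedup: float = 2.0) -> list[tuple[str, float]]:
    """Order kernels by the end-to-end gain a fixed per-kernel speedup would give."""
    gains = {name: overall_speedup(share, assumed_speedup)
             for name, share in gpu_time_share.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical profile: each kernel's share of total GPU time (sums to 1.0).
profile = {"matmul": 0.40, "attention": 0.33, "softmax": 0.25, "rmsnorm": 0.02}

ranked = rank_by_impact(profile)
for name, gain in ranked:
    print(f"{name:10s} {gain:.3f}x end-to-end if made 2x faster")
```

With this profile, doubling the speed of the 40% kernel yields 1.25x overall, while doubling the 2% kernel yields about 1.01x. A real scheduler would presumably also weigh how much speedup is realistically achievable per kernel, which this sketch ignores.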

Reported Results

On an NVIDIA H100, the team reports its Triton kernels beating PyTorch eager by 5.29x on RMSNorm, 2.82x on softmax, and 2.21x on cross-entropy — and beating torch.compile (max-autotune) by 2.83x, 3.44x, and 2.94x respectively. These are memory-bound elementwise/reduction kernels; on compute-bound matmul the picture inverts — the Triton starter reaches ~278 TFLOPS against cuBLAS at ~989.5 TFLOPS (~28% of peak).
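
The reported ratios can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in this article and assumes the eager and torch.compile comparisons were measured against the same AutoKernel timings, which the article does not state explicitly.

```python
# Speedups of AutoKernel's Triton kernels, as reported in the article.
vs_eager = {"rmsnorm": 5.29, "softmax": 2.82, "cross_entropy": 2.21}
vs_compile = {"rmsnorm": 2.83, "softmax": 3.44, "cross_entropy": 2.94}

# speedup_vs_X = time_X / time_autokernel, so dividing the two ratios gives
# time_eager / time_compile: how torch.compile compares with eager.
compile_vs_eager = {op: vs_eager[op] / vs_compile[op] for op in vs_eager}

for op, ratio in compile_vs_eager.items():
    verdict = "faster" if ratio > 1 else "slower"
    print(f"{op}: torch.compile runs at {ratio:.2f}x eager speed ({verdict})")

# Matmul: achieved throughput as a fraction of the 989.5 TFLOPS reference.
REFERENCE_TFLOPS = 989.5
for label, tflops in [("Triton starter", 278.0), ("HN-quoted run", 187.0)]:
    print(f"{label}: {tflops / REFERENCE_TFLOPS:.1%} of {REFERENCE_TFLOPS} TFLOPS")
```

Taken at face value, the ratios imply that torch.compile (max-autotune) was slower than eager on softmax and cross-entropy in these benchmarks. The 187 TFLOPS run quoted in the Hacker News thread works out to 18.9% of the same 989.5 TFLOPS reference, matching the "peak utilization" figure in that comment.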

Strengths & Limitations

Strengths:

Solves an expensive, real problem (kernel optimization is expert work)
Smart Amdahl's law prioritization
Supports both Triton and CUDA C++ backends
5-stage correctness checks prevent silent regressions
Ships with model profiles (GPT-2, LLaMA, BERT) for immediate use
Methodology written up in an arXiv paper

Limitations:

NVIDIA GPU required (H100/A100/RTX 4090 tested)
No commits since March 19, 2026 — effectively unmaintained as of June 2026
Weak on compute-bound matmul vs cuBLAS/CUTLASS
Vertical-specific (GPU kernels only)

What Developers Say

The Hacker News launch thread (which drove roughly 1,000 stars within hours) praised the project's narrow scope but was skeptical of the benchmark claims:

"This is very cool! I like the scope of this project, keeping it limited to Triton and specific kinds of kernels makes it quite simple and efficient." — HN commenter ademeure

"Something seems off. For the 4kx4kx4k fp16 GEMM, cutlass is like 3x faster than this." — HN commenter aviinuo

"I'm confused by the progress graph though...claims a 1.31x improvement vs cuBLAS...while running at 187 TFLOPS which is 18.9% of peak utilization? Benchmarking is hard!" — HN commenter ademeure

What Changed Since March 2026

Development stopped. Last push March 19, 2026; 1,404 stars and 141 forks accumulated, then silence.
arXiv paper published. "AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search" (arXiv:2603.21331) formalizes the approach.
The category got a heavyweight. Meta published KernelEvolve (April 2, 2026; ISCA 2026 paper), an agentic kernel-coding framework running in production across NVIDIA and AMD GPUs, MTIA, and CPUs. It reports 60%+ inference throughput gains on Meta's Andromeda ads model and kernel development time cut from weeks to hours. KernelEvolve validates AutoKernel's core thesis at hyperscale while making the standalone tool look like a proof of concept.

Bottom Line

AutoKernel validated that autoresearch works for specialized optimization domains beyond LLM training, and its Amdahl's law scheduler, which prioritizes tasks by expected impact, remains an interesting agent orchestration primitive. But as of June 2026 it reads as a frozen research artifact: no commits since March 19, 2026 (six days after launch), benchmark claims contested on compute-bound workloads, and Meta's KernelEvolve now demonstrating the same idea in production at vastly larger scale.

Recommended for: studying the agentic kernel-optimization loop, mining the Amdahl's-law scheduler pattern, or one-off optimization runs on memory-bound kernels where you can verify results yourself.

Not recommended for: anyone needing a maintained tool, compute-bound matmul optimization, or non-NVIDIA hardware.

Outlook: absent renewed commits, AutoKernel's lasting contribution is the paper and the pattern — the production future of this category looks like KernelEvolve-style in-house systems, not standalone OSS tools.

Research by Ry Walker Research