
AutoKernel

AutoKernel applies autoresearch to GPU kernel optimization. Give it any PyTorch model, go to sleep, wake up to optimized Triton or CUDA C++ kernels.

Key takeaways

  • Autoresearch applied to GPU kernel optimization — profiles any PyTorch model, extracts bottleneck ops, autonomously optimizes Triton or CUDA C++ kernels
  • Uses Amdahl's law to prioritize which kernels to optimize next, maximizing aggregate speedup per experiment hour
  • ~40 experiments/hour (~90s each), ~320 overnight. 5-stage correctness checks prevent regressions during autonomous operation
  • Ships with pre-built profiles for GPT-2, LLaMA, BERT — start optimizing immediately without model setup

FAQ

What is AutoKernel?

AutoKernel is an autonomous AI agent that optimizes GPU kernels. It profiles a PyTorch model, finds bottleneck operations, and lets an AI agent iteratively optimize the Triton or CUDA kernels — keeping improvements, reverting failures.
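
The keep-improvements / revert-failures loop can be sketched as follows. This is a minimal illustration, not AutoKernel's actual API: `agent_propose` and `benchmark` are hypothetical callables, and a benchmark result of `None` is assumed to signal a correctness failure.

```python
def optimize_kernel(kernel_src, agent_propose, benchmark, n_iters=10):
    """Iteratively refine a kernel: keep faster candidates, revert the rest."""
    best_src = kernel_src
    best_time = benchmark(kernel_src)
    for _ in range(n_iters):
        candidate = agent_propose(best_src)   # agent rewrites the kernel
        t = benchmark(candidate)              # None => failed correctness checks
        if t is not None and t < best_time:
            best_src, best_time = candidate, t  # keep the improvement
        # otherwise: revert by leaving best_src unchanged
    return best_src, best_time
```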

How does it decide which kernel to optimize?

It uses Amdahl's law reasoning — prioritizing kernels where optimization would have the largest impact on total model performance, based on GPU time profiling data.

Overview

AutoKernel takes the autoresearch pattern and applies it to a high-value vertical: GPU kernel optimization. Give it any PyTorch model, and an AI agent will profile it, extract bottleneck operations as standalone kernels, and optimize each one autonomously.

~40 experiments/hour. ~320 overnight. Each experiment takes ~90 seconds with 5-stage correctness verification.

Architecture

Pipeline

Any PyTorch ──▶ Rank kernels ──▶ Generate baseline ──▶ Optimize each ──▶ End-to-end
  model         by GPU time     Triton/CUDA kernels     kernel (agent)   verification
              (profile.py)      (extract.py)            (bench.py loop)  (verify.py)

Tools

Tool             What it does
profile.py       Profiles any PyTorch model with torch.profiler, ranks kernels by GPU time
extract.py       Extracts the top-N bottleneck kernels into standalone Triton or CUDA C++ files
orchestrate.py   Multi-kernel scheduler using Amdahl's law to prioritize work
bench.py         Fixed benchmark with 5-stage correctness checks + roofline analysis
verify.py        End-to-end correctness and speedup verification
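
The ranking step in profile.py can be illustrated with a small helper. This is a sketch under assumptions: the `(op_name, gpu_time_us)` pairs stand in for data that would come from torch.profiler's `key_averages()`, and the op names below are illustrative.

```python
def rank_kernels(events, top_n=5):
    """Aggregate GPU time per op and return the top-N by total time, descending."""
    totals = {}
    for name, gpu_time_us in events:
        totals[name] = totals.get(name, 0.0) + gpu_time_us
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

The top-ranked ops are the ones extract.py would then pull out as standalone kernels.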

Amdahl's Law Scheduling

The orchestrate.py scheduler decides which kernel to optimize next based on potential impact. A kernel consuming 40% of GPU time has more optimization headroom than one at 2%. This is a form of agent task prioritization — "which sub-task has the most leverage?" — applicable beyond kernel optimization.
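
The intuition above follows directly from Amdahl's law: if a kernel accounts for fraction f of GPU time and is made s times faster, the end-to-end speedup is 1 / ((1 - f) + f / s). A minimal sketch of impact-based selection (the uniform assumed per-kernel speedup is an illustrative simplification, not AutoKernel's actual scheduling policy):

```python
def amdahl_speedup(fraction, kernel_speedup):
    """End-to-end speedup when `fraction` of GPU time gets `kernel_speedup`x faster."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)

def next_kernel(time_fractions, assumed_speedup=2.0):
    """Pick the kernel whose optimization would maximize end-to-end speedup,
    assuming each could plausibly be made `assumed_speedup`x faster."""
    return max(
        time_fractions,
        key=lambda k: amdahl_speedup(time_fractions[k], assumed_speedup),
    )
```

For example, doubling the speed of a kernel at 40% of GPU time yields a 1.25x end-to-end speedup, while doubling one at 2% yields only ~1.01x, so the scheduler picks the former.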

Strengths & Limitations

Strengths:

  • Solves an expensive, real problem (kernel optimization is expert work)
  • Smart Amdahl's law prioritization
  • Supports both Triton and CUDA C++ backends
  • 5-stage correctness checks prevent silent regressions
  • Ships with model profiles for immediate use

Limitations:

  • NVIDIA GPU required (H100/A100/RTX 4090 tested)
  • 2 days old, unproven at scale
  • Vertical-specific (GPU kernels only)

Bottom Line

AutoKernel validates that autoresearch works for specialized optimization domains beyond LLM training. The Amdahl's law scheduler is an interesting agent orchestration primitive — prioritizing tasks by expected impact. The pattern could apply to any multi-target optimization: which test to speed up first, which API endpoint to optimize, which component to shrink.


Research by Ry Walker Research • methodology