The Best GPU Kernels Are No Longer Written by Humans

NVIDIA's AVO agents beat FlashAttention-4 by up to 10.5% on attention kernels. The systems programming moat just evaporated.

AI agents just outperformed FlashAttention-4 and cuDNN on attention kernels. Not by a rounding error. By up to 10.5% over FlashAttention-4 and up to 3.5% over cuDNN.

Let that sink in. The most aggressively hand-optimized piece of GPU code in the entire AI stack, the thing that kernel engineers spend months tuning with intimate knowledge of memory hierarchies and warp scheduling, just got beaten by a swarm of coding agents running an evolutionary loop for seven days.

NVIDIA's new paper, "AVO: Agentic Variation Operators for Autonomous Evolutionary Search" (arXiv:2603.24517), doesn't just push the boundary of what AI agents can do. It obliterates the assumption that low-level systems programming is somehow immune to AI disruption.

What AVO Actually Does

The core idea is deceptively simple. Traditional evolutionary search uses fixed mutation and crossover operators: swap a parameter here, shuffle a gene there. It's mechanical. AVO replaces these fixed operators with autonomous coding agents that can reason about the code they're modifying.

Each "variation" isn't a random perturbation. It's a deliberate, context-aware edit proposed by an agent that has access to:

  • The current lineage of solutions (what worked before, what didn't)
  • A domain-specific knowledge base (GPU architecture docs, prior kernel implementations)
  • Execution feedback (actual benchmark results from running the code)

The agent doesn't just generate candidate code. It proposes edits, runs them, debugs failures, critiques the results, and verifies improvements. It's not "LLM generates code and we pick the best one." It's a self-directed optimization loop where the agent is the optimizer.

Here's the critical distinction from prior work: previous "LLM-in-the-loop" evolutionary systems (like FunSearch or EvoPrompting) confine the language model to candidate generation within a predetermined pipeline. AVO elevates the agent from candidate generator to variation operator. The agent decides how to mutate, not just what to mutate.
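
To make that distinction concrete, here is a minimal Python sketch. Everything in it is hypothetical: `Candidate`, `fixed_mutation`, and `agentic_variation` are illustrative names, and the "agent" is a stub heuristic standing in for an LLM-driven coding agent, not AVO's implementation. The point is only the shape of the interface: a fixed operator sees one candidate and perturbs it blindly, while an agentic operator sees the lineage and its measured results and chooses the edit itself.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A hypothetical kernel candidate reduced to one tunable knob."""
    tile_size: int
    score: float = 0.0                            # higher is better
    history: list = field(default_factory=list)   # (tile_size, score) of ancestors


def fixed_mutation(parent: Candidate, rng: random.Random) -> Candidate:
    """Classic operator: a blind perturbation that ignores all feedback."""
    step = rng.choice([-16, 16])
    return Candidate(tile_size=max(16, parent.tile_size + step))


def agentic_variation(parent: Candidate) -> Candidate:
    """Stub for an agent: reads the lineage and decides how to mutate."""
    direction = 1
    if parent.history:
        prev_tile, prev_score = parent.history[-1]
        moved_up = parent.tile_size > prev_tile
        improved = parent.score > prev_score
        # Keep going in the same direction if it helped, otherwise reverse.
        direction = 1 if moved_up == improved else -1
    child = Candidate(tile_size=max(16, parent.tile_size + 16 * direction))
    child.history = parent.history + [(parent.tile_size, parent.score)]
    return child
```

A real agent would replace the two-line heuristic with reasoning over source code, documentation, and profiler output, but the signature difference (what the operator gets to look at) is the part the paragraph above describes.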

The Results Are Hard to Argue With

Over 7 days of continuous autonomous evolution targeting multi-head attention on NVIDIA Blackwell (B200) GPUs:

  • Up to 3.5% faster than cuDNN across evaluated configurations
  • Up to 10.5% faster than FlashAttention-4 across evaluated configurations
  • Optimizations transferred to grouped-query attention with only 30 minutes of additional adaptation
  • GQA gains: up to 7.0% over cuDNN and 9.3% over FlashAttention-4

This isn't a toy workload. Attention is among the most performance-critical kernels in modern transformer inference, so every percentage point here feeds into lower latency and cost at scale.
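
As a rough worked example of what those percentages buy, the sketch below converts a kernel speedup into time saved. It rests on two assumptions that do not come from the paper: that "X% faster" means X% higher throughput, and that attention takes a fixed share of end-to-end runtime (the 0.3 used here is a made-up figure). The function names are illustrative.

```python
def time_saved_fraction(speedup_pct: float) -> float:
    """Fraction of kernel time saved if throughput rises by speedup_pct percent."""
    return 1.0 - 1.0 / (1.0 + speedup_pct / 100.0)


def end_to_end_saving(speedup_pct: float, attention_share: float) -> float:
    """Amdahl-style estimate: only the attention share of runtime gets faster."""
    return attention_share * time_saved_fraction(speedup_pct)


kernel_saving = time_saved_fraction(10.5)
overall = end_to_end_saving(10.5, attention_share=0.3)  # 0.3 is a made-up share
print(f"kernel time saved: {kernel_saving:.1%}, end-to-end: {overall:.1%}")
```

Under those assumptions, a 10.5% throughput gain removes about 9.5% of the kernel's time and about 2.9% of end-to-end time. The real figure depends on how much of your workload is attention.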

Why This Matters More Than You Think

1. The Optimization Surface Is Bigger Than Human Intuition

The reason hand-tuned kernels have dominated for so long is that GPU optimization requires juggling dozens of interacting constraints simultaneously: register pressure, shared memory bank conflicts, warp divergence, instruction-level parallelism, memory coalescing patterns. Human kernel engineers develop intuitions about these tradeoffs over years.

But intuition has limits. An evolutionary agent that can try thousands of micro-architectural variations, test each one empirically, and learn from the results can explore regions of the optimization space that no human would think to try. The gap of up to 10.5% over FlashAttention-4 likely comes from combinations of optimizations that would seem counterintuitive to a human engineer.

2. The Transfer Result Is the Real Story

The 30-minute adaptation to GQA deserves more attention than it's getting. Hand-tuning a kernel for a new attention variant typically takes weeks. AVO did it in half an hour while maintaining most of the performance advantage.

This suggests AVO isn't just memorizing one good solution. It's discovering generalizable optimization principles that transfer across attention configurations. That's the difference between a one-time lucky search and a genuine optimization capability.
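
One way to see why transfer can be cheap is a toy warm start. This is not a description of how AVO adapts kernels; it is a hypothetical illustration that assumes the best settings for two related workloads sit close together. The cost functions and the `hill_climb` helper are invented for the example.

```python
def hill_climb(cost, start: int, max_steps: int = 1000) -> tuple[int, int]:
    """Greedy integer search; returns (best_x, evaluations_used)."""
    x, evals = start, 1
    best = cost(x)
    for _ in range(max_steps):
        improved = False
        for nxt in (x - 1, x + 1):
            evals += 1
            c = cost(nxt)
            if c < best:
                x, best, improved = nxt, c, True
                break
        if not improved:
            break
    return x, evals


def mha_cost(x: int) -> int:
    """Toy cost for the original workload, minimized at x = 40."""
    return (x - 40) ** 2


def gqa_cost(x: int) -> int:
    """Toy cost for the related workload, minimized at x = 43."""
    return (x - 43) ** 2


mha_best, _ = hill_climb(mha_cost, start=0)
cold_best, cold_evals = hill_climb(gqa_cost, start=0)          # search from scratch
warm_best, warm_evals = hill_climb(gqa_cost, start=mha_best)   # reuse prior result
print(cold_evals, warm_evals)
```

Both searches reach the same optimum, but the warm start needs far fewer evaluations. If the optima were not close, the advantage would shrink, which is why a fast transfer hints at shared structure between the two problems.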

3. Agentic vs. Generative Code

Most AI code generation tools (Copilot, Cursor, Codex) generate code from natural language descriptions. They're translators. AVO is fundamentally different: it's an optimizer. It doesn't translate intent into code. It takes existing code and makes it faster through iterative experimentation.

This is a much harder problem and a much more valuable one. Translation is close to solved, or at least good enough for most uses. Optimization at the kernel level, where the difference between a good and a great implementation is 10% of wall-clock time on million-dollar GPU clusters, is where the real value lives.

What a 7-Day Agent Loop Looks Like

Let's think about the computational structure here. The agent:

  1. Examines the current best kernel implementation
  2. Consults architecture documentation and prior optimization history
  3. Proposes a specific code edit with a hypothesis for why it should help
  4. Compiles and benchmarks the modified kernel
  5. Analyzes the results against the hypothesis
  6. Updates its understanding of what works

Repeat thousands of times across a population of candidate kernels, with the best solutions surviving and being further refined.
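
Step 4 only works if the measurement is trustworthy, since a noisy benchmark would reward luck. Below is a minimal timing harness, assuming a CPU-side callable as a stand-in for a kernel launch. `benchmark` and `speedup_pct` are illustrative names, not anything from the paper, and a real GPU harness would also need to synchronize the device before reading the clock.

```python
import statistics
import time
from typing import Callable


def benchmark(fn: Callable[[], object], warmup: int = 3, repeats: int = 10) -> float:
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):            # discard cold-start effects
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)  # median resists outlier runs


def speedup_pct(baseline_s: float, candidate_s: float) -> float:
    """Percent throughput gain of the candidate relative to the baseline."""
    return (baseline_s / candidate_s - 1.0) * 100.0


baseline = benchmark(lambda: sorted(range(50_000), reverse=True))
candidate = benchmark(lambda: list(range(49_999, -1, -1)))
print(f"candidate vs baseline: {speedup_pct(baseline, candidate):+.1f}%")
```

Warmup runs and a median over repeats are a common way to keep a single slow run from deciding whether a candidate survives.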

This is not brute force. The agent's ability to reason about why changes should help means it explores the search space orders of magnitude more efficiently than random mutation. The knowledge base consultation means it can leverage existing engineering knowledge without being constrained by it.

# Simplified pseudocode for AVO's core loop (helpers and agent methods are placeholders, not a real API)
def avo_step(agent, population, knowledge_base, benchmarks):
    parent = select_from_population(population)

    # Agent reasons about the code, not just mutates it
    context = {
        "lineage": parent.history,
        "architecture_docs": knowledge_base.query(parent.code),
        "prior_results": benchmarks.get_trends()
    }

    # Propose, don't just perturb
    edit = agent.propose_edit(parent.code, context)
    hypothesis = agent.explain_hypothesis(edit)

    # Test empirically
    new_kernel = apply_edit(parent.code, edit)
    result = benchmark(new_kernel)

    # Learn from outcome
    agent.update_understanding(hypothesis, result)

    return new_kernel, result
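
The pseudocode above leaves its helpers undefined. For readers who want something they can run, here is a toy version under loud assumptions: the "kernel" is just a dictionary of two knobs, `toy_benchmark` is a made-up latency model, and `StubAgent` is a trivial heuristic standing in for an LLM agent. None of these names or numbers come from the paper. It shows only the loop structure: select, propose, measure, learn, keep the survivors.

```python
import random
from dataclasses import dataclass, field

MIN_VALUE = {"tile": 32, "stages": 1}   # hypothetical lower bounds for each knob


@dataclass
class Kernel:
    params: dict                        # stand-in for kernel source code
    latency: float = float("inf")       # lower is better
    history: list = field(default_factory=list)


def toy_benchmark(params: dict) -> float:
    """Made-up latency model with its minimum at tile=128, stages=3."""
    return 1.0 + abs(params["tile"] - 128) / 128 + abs(params["stages"] - 3) * 0.1


class StubAgent:
    """Placeholder for an LLM agent: remembers which edits helped."""

    def __init__(self, rng: random.Random):
        self.rng = rng
        self.helpful = {}               # edit -> net count of improvements

    def propose_edit(self, kernel: Kernel):
        # A real agent would read the kernel; this stub only uses its memory.
        edits = [("tile", 32), ("tile", -32), ("stages", 1), ("stages", -1)]
        # Prefer edits that worked before; break ties randomly.
        return max(edits, key=lambda e: (self.helpful.get(e, 0), self.rng.random()))

    def update(self, edit, improved: bool):
        self.helpful[edit] = self.helpful.get(edit, 0) + (1 if improved else -1)


def avo_step(agent: StubAgent, population: list, rng: random.Random) -> Kernel:
    # Tournament selection: sample up to two candidates, keep the faster one.
    contenders = rng.sample(population, k=min(2, len(population)))
    parent = min(contenders, key=lambda k: k.latency)
    edit = agent.propose_edit(parent)
    key, delta = edit
    params = dict(parent.params)
    params[key] = max(MIN_VALUE[key], params[key] + delta)
    child = Kernel(params, toy_benchmark(params), parent.history + [edit])
    agent.update(edit, improved=child.latency < parent.latency)
    return child


def evolve(steps: int = 200, pop_size: int = 4, seed: int = 0) -> Kernel:
    rng = random.Random(seed)
    start = {"tile": 32, "stages": 1}
    population = [Kernel(dict(start), toy_benchmark(start))]
    agent = StubAgent(rng)
    for _ in range(steps):
        population.append(avo_step(agent, population, rng))
        population.sort(key=lambda k: k.latency)
        del population[pop_size:]       # only the fastest candidates survive
    return population[0]


best = evolve()
print(best.params, round(best.latency, 3))
```

Because the population is sorted and truncated each step, the best candidate found so far is never lost, so the result can only match or improve on the starting point.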

The Uncomfortable Implication

If agents can write better attention kernels than the teams at NVIDIA who literally designed the hardware, what other "expert-only" optimization problems are about to fall?

  • Compiler optimization passes. The same evolutionary approach could discover better LLVM IR transformations.
  • Database query planning. Join ordering and index selection are optimization problems with enormous search spaces.
  • Network protocol tuning. TCP congestion control parameters, buffer sizes, batching strategies.
  • Chip design. Placement and routing is already partially automated, but AVO-style agents could push further.

The pattern is clear. Any optimization problem is now potentially in scope for agentic optimization if (a) the search space is too large for exhaustive exploration, (b) the quality of a solution can be measured automatically, and (c) domain knowledge helps but doesn't determine the optimal solution.

What This Means for AI Engineers

If you're building AI infrastructure, this paper should change how you think about performance optimization. The traditional approach (hire kernel engineers, spend months hand-tuning) isn't going away tomorrow, but the ceiling just got raised.

More practically: if your workload involves attention (and whose doesn't?), the kernels discovered by AVO will likely make their way into production libraries within months. The GQA results are particularly relevant for inference workloads using models with grouped-query attention (Llama 3, Mistral, most modern architectures).

For those of us building agent systems, AVO is validation of a broader thesis: agents that can reason about code, execute it, and learn from the results are qualitatively different from agents that only generate code. The feedback loop is everything.

The best GPU kernels are no longer written by humans. They're written by agents that learned from humans and then surpassed them. The kernel engineers aren't obsolete. But their role just shifted from writing optimal code to designing the optimization process itself.

And that, honestly, might be the more interesting job.

Paper: "AVO: Agentic Variation Operators for Autonomous Evolutionary Search" (arXiv:2603.24517)
Authors: Terry Chen, Zhifan Ye, Bing Xu, et al. (NVIDIA Research)


Manas Vardhan builds open-source tools for production AI agents, including agent-sentry and llm-cost-guardian. Find all his work on GitHub.

The 30-minute GQA transfer result is the signal that AVO is doing something qualitatively different from search.

Most optimization approaches memorize the search space — they find a great solution but cannot adapt it. AVO appears to be learning optimization heuristics, not just optimal configurations. That is the difference between caching and generalizing.

For production AI teams, the implication cuts deeper than kernels. Any system where you can:

  1. Define a measurable objective
  2. Execute variations automatically
  3. Feed results back into the next iteration

...is now a candidate for agent-driven optimization. The pattern AVO demonstrates — propose edit, benchmark, analyze gap, update hypothesis — is portable to database query planning, cache eviction policies, even hyperparameter tuning.

The real moat is not the kernel. It is the optimization loop itself.