Poster
ThunderKittens: Simple, Fast, and Adorable AI Kernels
Benjamin Spector · Simran Arora · Aaryan Singhal · Arjun Parthasarathy · Dan Fu · Christopher Ré
Hall 3 + Hall 2B #538
Thu 24 Apr, midnight – 2:30 a.m. PDT
Abstract:
The challenge of mapping AI architectures to GPU hardware is creating a critical bottleneck in AI progress. Despite substantial efforts, hand-written custom kernels fail to meet their theoretical performance thresholds, even on well-established operations like linear attention. The diverse capabilities of GPUs suggest we might need a wide variety of techniques to achieve high performance; however, our work explores whether a small number of key abstractions can drastically simplify the process. We present ThunderKittens (TK), a framework for writing performant AI kernels that remains easy to use. Our abstractions map to the three levels of the GPU hierarchy: (1) at the warp level, we provide 16x16 matrix tiles as basic data structures and PyTorch-like operations over them; (2) at the thread-block level, we provide templates for asynchronously overlapping operations; and (3) at the grid level, TK helps hide block launch, tear-down, and memory costs. We show the value of TK by providing simple and diverse kernels that match or outperform prior art. We match cuBLAS and FlashAttention-3 on GEMM and attention inference performance, and outperform the strongest baselines by 10-40% on attention backwards, 8x on state space models, and 14x on linear attention.
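To make the warp-level abstraction in (1) concrete, below is a minimal CUDA sketch of the idea: a 16x16 tile held collectively in one warp's registers, with a cooperative load/store and a PyTorch-like elementwise op. The names here (tile16, tile_load, tile_relu, tile_store) are hypothetical illustrations, not ThunderKittens' actual API, and the lane-to-element mapping is a simplification of the register layouts real tensor-core kernels use.

    // Illustrative sketch only: tile16 and the tile_* helpers are
    // hypothetical names, not ThunderKittens' actual API.
    #include <cuda_runtime.h>
    #include <cstdio>

    // Each of a warp's 32 lanes owns 8 elements of a 16x16 tile
    // (32 * 8 = 256), so the whole tile lives in registers and is
    // operated on cooperatively by the warp.
    struct tile16 {
        float data[8];
    };

    // Cooperative load: lane i grabs its 8-element slice of a
    // row-major 16x16 tile starting at `src`.
    __device__ void tile_load(tile16 &t, const float *src) {
        int lane = threadIdx.x % 32;
        for (int i = 0; i < 8; i++) t.data[i] = src[lane * 8 + i];
    }

    __device__ void tile_store(const tile16 &t, float *dst) {
        int lane = threadIdx.x % 32;
        for (int i = 0; i < 8; i++) dst[lane * 8 + i] = t.data[i];
    }

    // A PyTorch-like elementwise op over the whole tile: t = relu(t).
    __device__ void tile_relu(tile16 &t) {
        for (int i = 0; i < 8; i++) t.data[i] = fmaxf(t.data[i], 0.0f);
    }

    // One warp applies relu to one 16x16 tile.
    __global__ void relu_tile_kernel(const float *in, float *out) {
        tile16 t;
        tile_load(t, in);
        tile_relu(t);
        tile_store(t, out);
    }

    int main() {
        const int N = 16 * 16;
        float h_in[N], h_out[N];
        for (int i = 0; i < N; i++) h_in[i] = (i % 2 ? 1.0f : -1.0f) * i;

        float *d_in, *d_out;
        cudaMalloc(&d_in, N * sizeof(float));
        cudaMalloc(&d_out, N * sizeof(float));
        cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

        relu_tile_kernel<<<1, 32>>>(d_in, d_out);  // one warp, one tile
        cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);

        printf("out[3] = %f (expect 3), out[4] = %f (expect 0)\n",
               h_out[3], h_out[4]);
        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }

Launched as a single 32-thread block, the kernel processes exactly one tile per warp; TK's real tile types additionally target tensor-core matrix layouts and half-precision formats.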