Evolution of Flash Attention
Abstract
Standard attention is a bottleneck. As Large Language Models scale, the memory-bound nature of attention operations and GPU memory hierarchy constraints create performance walls that raw compute alone cannot overcome. This post delivers a deep mathematical and technical breakdown of FlashAttention's evolution from V1 to V4, revealing how IO-aware algorithm design has become critical to unlocking scale. We trace the architectural journey: from V1's tiled exact attention and online softmax, through V2's parallelism refinements and improved work partitioning, to V3's asynchronous execution on the Hopper architecture, and finally V4's advanced pipelining designed for Blackwell GPUs. Each version fundamentally reshaped attention computation by reducing data movement, maximizing parallelism, and aligning more closely with evolving GPU hardware capabilities. We analyze the design choices that drive these improvements—examining IO complexity, arithmetic intensity, the recomputation trade-off in the backward pass, and their impact on throughput and utilization across A100, H100, and Blackwell-class hardware. These optimizations translate to measurable gains: we present concrete throughput improvements for long-context transformer workloads that demonstrate why memory hierarchy optimization now matters as much as algorithmic complexity. We conclude that modern LLM system performance depends critically on memory hierarchy optimization and execution efficiency, challenging the traditional focus on asymptotic complexity alone.
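The online softmax mentioned above is the trick that lets V1 process attention in tiles without materializing the full score matrix. As a minimal illustrative sketch (not FlashAttention's actual kernel code), the following scalar Python function shows the core idea: a single pass that keeps a running maximum, a running softmax denominator, and a running weighted accumulator, rescaling previous partial results whenever a new maximum appears. The function name and variables are hypothetical, chosen for clarity.

```python
import math

def online_softmax_weighted_sum(scores, values):
    """One-pass softmax-weighted sum over (score, value) pairs.

    Illustrative sketch of the online-softmax rescaling trick:
    `m` tracks the running max (for numerical stability), `l` the
    running softmax denominator, and `acc` the running numerator.
    When a larger score arrives, previously accumulated terms are
    rescaled by exp(m_old - m_new) so the final result equals the
    standard two-pass softmax-weighted sum.
    """
    m = float("-inf")   # running max of scores seen so far
    l = 0.0             # running softmax denominator
    acc = 0.0           # running numerator (weighted sum of values)
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new)   # exp(-inf) == 0.0 on the first step
        w = math.exp(s - m_new)
        l = l * scale + w
        acc = acc * scale + w * v
        m = m_new
    return acc / l

# Agrees with the conventional two-pass computation:
scores = [1.0, 3.0, 0.5, 2.0]
values = [10.0, 20.0, 30.0, 40.0]
result = online_softmax_weighted_sum(scores, values)
```

In the real kernels the same recurrence runs per tile over vector-valued rows of V, which is what lets each block of Q, K, and V stay in on-chip SRAM while the global softmax normalization is recovered exactly.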