Poster Fri, Apr 24, 2026 • 6:30 AM – 9:00 AM PDT Pavilion 3 P3-#1009

FlexLinearAttention: Compiling a Unified Abstraction into Scalable Kernels for Linear Attention

Haojie Duanmu ⋅ Size Zheng ⋅ Ningxin Zheng ⋅ Jianqiao Lu ⋅ Xuegui Zheng ⋅ Xingcheng Zhang ⋅ Li-Wen Chang ⋅ Xin Liu ⋅ Dahua Lin

[ Slides] [ OpenReview]

Abstract

The quadratic complexity of softmax attention poses a major bottleneck for long-context modeling, motivating a surge of linear attention variants with linear complexity. Unlike softmax attention, which benefits from optimized kernels, linear attention lacks general-purpose, hardware-efficient support and scalable distributed implementations. We introduce Flexible Linear Attention (FlexLA), a domain-specific compiler that automates the generation of high-performance, scalable kernels for a wide range of linear attention models directly from high-level PyTorch code. At its core, FlexLA employs an intuitive programming abstraction that decomposes any linear attention algorithm into three canonical phases: intra-chunk computation, inter-chunk state propagation, and output merging. This unified abstraction enables FlexLA to perform domain-specific optimizations, automatically generating kernels that fuse computation and communication at a fine-grained tile level and eliminating host synchronization. Our evaluation demonstrates that FlexLA combines programmability with performance: a wide range of linear attention variants can be implemented in just a few dozen lines of code, while the generated kernels deliver 1.01x-4.9x the performance of sate-of-the-art expert-optimized library and scale with near-linear efficiency on scalar gated linear attention to 16 million tokens on 128 GPUs, surpassing the state-of-the-art distributed baseline by up to 7.2x.

Video

Chat is not available.