Poster in Workshop: 2nd Workshop on Mathematical and Empirical Understanding of Foundation Models
Simple linear attention language models balance the recall-throughput tradeoff
Simran Arora · Sabri Eyuboglu · Michael Zhang · Aman Timalsina · Silas Alberti · James Y Zou · Atri Rudra · Christopher Ré
We seek sequence mixers that are both high-quality and efficient. Recently, the ability to perform a skill called "recall" -- i.e., grounding generations in previously seen tokens -- has become a critical test of sequence mixer quality. We empirically and theoretically study a broad set of attention and attention-free architectures, identifying a key tradeoff between an architecture's "state size" and its recall ability: attention excels at recall but maintains a full KV-cache, while recurrent models (e.g., H3, Mamba, RWKV) keep a fixed-size state and struggle to match attention's recall. Our proposed architecture, BASED, explores a new region of this tradeoff curve: it approximates attention with a simple combination of linear attention and sliding-window attention. Relative to strong baselines (FlashAttention-2, Mamba), BASED is competitive at the 1.3B-parameter scale on pretraining and downstream language benchmarks while offering 45-55% faster prefill, and it outperforms prior sub-quadratic architectures on real-world recall tasks.
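To make the state-size framing concrete, the sketch below illustrates the two ingredients named above: causal linear attention, whose recurrent state has constant size (no growing KV-cache), and a short window of exact softmax attention. This is a minimal illustration, not the authors' released implementation; the `feature_map`, the window size, and the way the two outputs are combined at the end are assumptions made here for clarity.

```python
# Minimal sketch of a BASED-style sequence mixer: linear attention with a
# fixed-size recurrent state, plus exact attention over a small sliding window.
# The feature map and the final combination are illustrative assumptions,
# not the paper's exact recipe or kernels.
import torch
import torch.nn.functional as F

def feature_map(x):
    # Stand-in feature map phi(.); a simple positive map, not necessarily
    # the approximation of softmax used in the paper.
    return F.elu(x) + 1

def causal_linear_attention(q, k, v):
    """q, k, v: (batch, seq_len, dim) -> (batch, seq_len, dim).

    Recurrent form: the state S (dim_phi x dim) and normalizer z have
    constant size, so memory does not grow with sequence length.
    """
    q, k = feature_map(q), feature_map(k)
    B, T, D = v.shape
    Dp = q.shape[-1]
    S = torch.zeros(B, Dp, D, dtype=v.dtype, device=v.device)  # running sum of phi(k)^T v
    z = torch.zeros(B, Dp, dtype=v.dtype, device=v.device)     # running sum of phi(k)
    out = torch.empty_like(v)
    for t in range(T):
        S = S + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(1)   # rank-1 state update
        z = z + k[:, t]
        num = torch.einsum("bd,bdv->bv", q[:, t], S)
        den = torch.einsum("bd,bd->b", q[:, t], z).clamp_min(1e-6)
        out[:, t] = num / den.unsqueeze(-1)
    return out

def sliding_window_attention(q, k, v, window=64):
    """Exact causal softmax attention restricted to the last `window` tokens."""
    B, T, D = q.shape
    out = torch.empty_like(v)
    for t in range(T):
        s = max(0, t - window + 1)
        scores = (q[:, t:t + 1] @ k[:, s:t + 1].transpose(1, 2)) / D ** 0.5
        out[:, t] = (scores.softmax(-1) @ v[:, s:t + 1]).squeeze(1)
    return out

# Toy usage: combining the global (linear) and local (windowed) terms by
# summation is one possible design choice for illustration only.
q = k = v = torch.randn(2, 128, 16)
y = causal_linear_attention(q, k, v) + sliding_window_attention(q, k, v, window=16)
print(y.shape)  # torch.Size([2, 128, 16])
```

The key point of the sketch is the memory profile: the linear-attention state `S` stays at a fixed size regardless of sequence length, while exact attention over only a short window bounds the local cache, in contrast to full attention's KV-cache that grows with every token.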