Ant Group: Scaling Hybrid Linear Attention Architecture to Trillion-Scale
Abstract
In this talk, we present our experience scaling hybrid linear attention architectures to the trillion scale through two models from the Ling Team: Ling-2.5-1T and Ring-2.5-1T. These models interleave linear attention with selected softmax attention layers to support efficient long-context training while preserving strong reasoning and representation capability. We share key algorithm–system co-design insights that make trillion-scale hybrid attention practical, including stability techniques for large-scale linear attention training and efficient distributed training for ultra-long sequences.
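The abstract does not describe the exact layer layout of Ling-2.5-1T or Ring-2.5-1T. As a rough illustration of the general idea of a hybrid stack, the sketch below interleaves simple elu(x)+1 feature-map linear attention layers with periodic softmax attention layers; the interleave ratio, dimensions, and all class names are illustrative assumptions, not the Ling/Ring implementation.

```python
# Minimal sketch of a hybrid attention stack (illustrative only):
# mostly linear-attention layers, with a softmax attention layer
# inserted every `softmax_every` blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearAttention(nn.Module):
    """Non-causal linear attention using the elu(x) + 1 feature map."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        h = self.heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, time, head_dim).
        q, k, v = (z.view(b, t, h, d // h).transpose(1, 2) for z in (q, k, v))
        q, k = F.elu(q) + 1, F.elu(k) + 1
        # Linear attention: cost scales with t * d^2 rather than t^2 * d.
        kv = torch.einsum("bhtd,bhte->bhde", k, v)
        z = 1.0 / (torch.einsum("bhtd,bhd->bht", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhtd,bhde,bht->bhte", q, kv, z)
        return self.out(out.transpose(1, 2).reshape(b, t, d))


class HybridBlock(nn.Module):
    """Pre-norm residual block wrapping either attention variant."""

    def __init__(self, dim: int, use_softmax: bool):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.use_softmax = use_softmax
        if use_softmax:
            self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        else:
            self.attn = LinearAttention(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        if self.use_softmax:
            h, _ = self.attn(h, h, h, need_weights=False)
        else:
            h = self.attn(h)
        return x + h


class HybridAttentionStack(nn.Module):
    """Mostly linear-attention layers with periodic softmax layers."""

    def __init__(self, dim: int = 512, depth: int = 12, softmax_every: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            HybridBlock(dim, use_softmax=((i + 1) % softmax_every == 0))
            for i in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = block(x)
        return x


if __name__ == "__main__":
    model = HybridAttentionStack()
    tokens = torch.randn(2, 128, 512)  # (batch, sequence, dim)
    print(model(tokens).shape)  # torch.Size([2, 128, 512])
```

In a hybrid design of this kind, the occasional softmax layers retain full pairwise token interactions while the surrounding linear layers keep per-token cost roughly constant in sequence length, which is what makes long-context training tractable at scale.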