Oral in Workshop: SCOPE: SCALABLE OPTIMIZATION FOR EFFICIENT AND ADAPTIVE FOUNDATION MODELS
Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts
Weigao Sun · Disen Lan · Tong Zhu · Xiaoye Qu · Yu Cheng
Keywords: [ Distributed Training ] [ Mixture-of-Experts ] [ Linear Sequence Modeling ]
Linear Sequence Modeling (LSM) and Mixture-of-Experts (MoE) have recently emerged as effective architectural improvements. In this paper, we introduce Linear-MoE, a production-level system for modeling and training large-scale models that integrate LSM with MoE. Linear-MoE leverages the advantages of both LSM modules for linear-complexity sequence modeling and MoE layers for sparse activation, aiming to offer high performance with efficient training and deployment. The Linear-MoE system comprises two primary subsystems: Modeling and Training. The Modeling subsystem provides a unified framework supporting multiple types of LSM methods, including linear attention, state space models (SSMs), and linear RNNs. The Training subsystem facilitates efficient training by incorporating advanced parallelism techniques such as Tensor, Pipeline, and Expert Parallelism, along with LASP-based Sequence Parallelism for handling very long input sequences. The system is designed to be extensible, allowing more sequence modeling and training capabilities to be integrated in the future. Additionally, we explore hybrid Linear-MoE models that combine Linear-MoE layers with standard Transformer-MoE layers to further enhance model flexibility and performance. Experimental evaluations on two model series, A0.3B-2B and A1B-7B, demonstrate that Linear-MoE achieves efficiency gains while maintaining competitive performance on various benchmarks.
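To make the hybrid architecture described in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of how a hybrid Linear-MoE stack might interleave linear-attention MoE blocks with standard softmax-attention Transformer-MoE blocks. All module names, the elu+1 feature map, the top-1 router, and the 3:1 interleaving ratio are assumptions chosen for illustration only.

```python
# Illustrative sketch of a hybrid Linear-MoE stack; hyperparameters are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearAttention(nn.Module):
    """Linear-complexity attention using a positive kernel feature map phi(x) = elu(x) + 1."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                              # x: (batch, seq, dim)
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        q, k = F.elu(q) + 1, F.elu(k) + 1              # positive feature map
        kv = torch.einsum("bhnd,bhne->bhde", k, v)     # summed over seq: O(n) cost
        z = 1 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
        return self.out(out.transpose(1, 2).reshape(b, n, d))


class MoEFeedForward(nn.Module):
    """Sparsely activated FFN: a top-1 router sends each token to a single expert."""
    def __init__(self, dim, num_experts=4, mult=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, mult * dim), nn.GELU(), nn.Linear(mult * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):
        weights, idx = self.router(x).softmax(dim=-1).max(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = weights[mask].unsqueeze(-1) * expert(x[mask])
        return out


class Block(nn.Module):
    """Pre-norm block: a linear or softmax token mixer followed by an MoE FFN."""
    def __init__(self, dim, linear=True):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.linear = linear
        self.mixer = LinearAttention(dim) if linear else nn.MultiheadAttention(dim, 4, batch_first=True)
        self.moe = MoEFeedForward(dim)

    def forward(self, x):
        h = self.norm1(x)
        h = self.mixer(h) if self.linear else self.mixer(h, h, h, need_weights=False)[0]
        x = x + h
        return x + self.moe(self.norm2(x))


class HybridLinearMoE(nn.Module):
    """Interleaves Linear-MoE blocks with standard Transformer-MoE blocks (here 3:1)."""
    def __init__(self, dim=64, depth=8, softmax_every=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            Block(dim, linear=(i + 1) % softmax_every != 0) for i in range(depth)
        )

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x


if __name__ == "__main__":
    model = HybridLinearMoE()
    tokens = torch.randn(2, 128, 64)                   # dummy (batch, seq, dim) embeddings
    print(model(tokens).shape)                         # torch.Size([2, 128, 64])
```

In this sketch the linear-attention blocks keep the per-token cost independent of sequence length, while the occasional softmax-attention block restores full pairwise interaction; the MoE FFN in every block keeps the parameter count large but the activated compute small, which is the combination the paper's system is built to train efficiently.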