DeltaMoE: Memory-Efficient Inference for Merged Mixture of Experts with Delta Compression
Boyko Borisov · Xiaozhe Yao · Nezihe Merve Gürel · Ana Klimovic
2025 Poster
in
Workshop: Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference
Abstract
Sparse Mixture-of-Experts (SMoE) models have emerged as an efficient architecture for large language models. While recent community efforts have focused on merging multiple models to create SMoEs, deploying these merged models remains challenging due to their substantial memory requirements. In this paper, we present DeltaMoE, a training-free delta compression pipeline that enables efficient deployment of SMoE models through structured sparsity and quantization. Our evaluation shows that DeltaMoE achieves up to a $2.34\times$ compression ratio and a $2.57\times$ throughput improvement. DeltaMoE also scales with the number of experts, making it particularly suitable for large SMoE models.
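To make the idea concrete, the sketch below illustrates the general delta-compression recipe the abstract describes: each expert is stored as its difference (delta) from a shared base model, the delta is pruned with 2:4-style structured sparsity, and the surviving values are quantized to int8. This is a minimal illustration under assumed choices (symmetric per-tensor quantization, a 2-of-4 sparsity pattern); the function names such as `compress_delta` are illustrative and not the paper's API.

```python
import numpy as np

def structured_sparsify(delta: np.ndarray, group: int = 4, keep: int = 2) -> np.ndarray:
    """Keep the `keep` largest-magnitude entries in every group of `group`
    consecutive values (2:4-style structured sparsity), zeroing the rest."""
    flat = delta.reshape(-1, group)
    # Indices of the smallest-magnitude entries in each group get zeroed.
    drop = np.argsort(np.abs(flat), axis=1)[:, : group - keep]
    mask = np.ones_like(flat, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (flat * mask).reshape(delta.shape)

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization; returns codes and a scale."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    return np.round(x / scale).astype(np.int8), scale

def compress_delta(expert_w: np.ndarray, base_w: np.ndarray):
    """Store an expert as a sparse, quantized delta against the shared base."""
    delta = structured_sparsify(expert_w - base_w)
    return quantize_int8(delta)

def reconstruct(base_w: np.ndarray, codes: np.ndarray, scale: float) -> np.ndarray:
    """Approximate the original expert weights at inference time."""
    return base_w + codes.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.standard_normal((128, 128)).astype(np.float32)
    expert = base + 0.05 * rng.standard_normal((128, 128)).astype(np.float32)

    codes, scale = compress_delta(expert, base)
    approx = reconstruct(base, codes, scale)
    print(f"mean reconstruction error: {np.abs(approx - expert).mean():.5f}")
```

In this setup only the base weights are kept in full precision; each additional expert costs a sparse int8 delta plus a scale, which is where the compression and throughput gains would come from as the number of experts grows.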