

Poster in Workshop: Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference

DeltaMoE: Memory-Efficient Inference for Merged Mixture of Experts with Delta Compression

Boyko Borisov · Xiaozhe Yao · Nezihe Merve Gürel · Ana Klimovic


Abstract: Sparse Mixture of Experts (SMoE) models have emerged as an efficient architecture for large language models. While recent community efforts have focused on merging multiple models to create SMoEs, deploying these merged models remains challenging due to their substantial memory requirements. In this paper, we present DeltaMoE, a training-free delta compression pipeline that enables efficient deployment of SMoE models through structured sparsity and quantization. Our evaluation shows that DeltaMoE achieves up to a $2.34\times$ compression ratio and a $2.57\times$ throughput improvement. DeltaMoE also scales with the number of experts, making it particularly suitable for large SMoE models.
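
The sketch below illustrates the general idea of delta compression with structured sparsity and quantization as described in the abstract: store each expert as a sparse, quantized difference from a shared base weight rather than as a dense copy. This is a minimal, hypothetical example (2:4 magnitude sparsity and symmetric per-row int8 quantization, with invented function names and shapes), not the DeltaMoE pipeline itself.

```python
# Illustrative sketch of delta compression for a merged-expert weight matrix.
# Assumptions: 2:4 structured sparsity and symmetric per-row int8 quantization.
# This is NOT the DeltaMoE implementation, only a conceptual example.
import torch


def sparsify_2_4(delta: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude values in every group of 4 along the last dim."""
    rows, cols = delta.shape
    groups = delta.reshape(rows, cols // 4, 4)
    keep_idx = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep_idx, True)
    return (groups * mask).reshape(rows, cols)


def quantize_int8(delta: torch.Tensor):
    """Symmetric per-row int8 quantization; returns (int8 tensor, per-row scale)."""
    scale = delta.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((delta / scale).round(), -127, 127).to(torch.int8)
    return q, scale


def compress_expert(base_weight: torch.Tensor, expert_weight: torch.Tensor):
    """Store only a sparse, quantized delta instead of the full expert weight."""
    delta = expert_weight - base_weight
    return quantize_int8(sparsify_2_4(delta))


def reconstruct_expert(base_weight: torch.Tensor, q: torch.Tensor, scale: torch.Tensor):
    """Approximate expert weight recovered at inference time from base + delta."""
    return base_weight + q.to(base_weight.dtype) * scale


# Usage with random weights (hypothetical shapes):
base = torch.randn(4096, 4096)
expert = base + 0.01 * torch.randn(4096, 4096)
q, s = compress_expert(base, expert)
approx = reconstruct_expert(base, q, s)
```

Because every expert shares the same dense base weight and only the compressed deltas are stored per expert, the memory cost of adding experts grows with the (much smaller) delta size, which is consistent with the scalability claim in the abstract.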
