Invited Talk by Sewon Min: Are Mixture-of-Experts Modular? Why It Matters and How to Fix It
Abstract
Mixture-of-Experts (MoE) models are designed as modular architectures, but are they functionally modular, i.e., do they enable the independent use of expert subsets for downstream domains? We argue they are not, and that this gap matters: as MoEs grow larger, sparser, and more fine-grained, they become increasingly difficult to use, adapt, and fine-tune without heavy infrastructure. We introduce ModMoE, a self-supervised approach that makes modularity a first-class property without relying on human priors or sacrificing overall performance. ModMoE induces semantically specialized experts (rather than lexical partitioning) and enables effective selective expert usage across pool sizes, improving efficiency and performance in both zero-shot inference and fine-tuning. These results point toward more accessible and flexible MoEs, and a path to large-scale, sparse, and truly modular expert architectures.