Poster in Workshop: Modular, Collaborative and Decentralized Deep Learning
Tight Clusters Make Specialized Experts
Stefan Nielsen · Rachel Teo · Laziz Abdullaev · Tan Nguyen
At the core of Sparse Mixture-of-Experts (MoE) models is the router, which learns the clustering structure of the input distribution in order to direct tokens to suitable experts. However, these latent clusters may be unidentifiable, causing slow convergence, vulnerability to data contamination, and degraded representations. We examine the router through the lens of clustering optimization, deriving optimal feature weights that maximally distinguish these clusters. Using these weights, we compute token-expert assignments in an adaptively transformed space that better separates the clusters, helping to identify the best-matched expert for each token. In particular, for each expert cluster, we compute weights that scale each feature according to how tightly that expert's tokens cluster along it. We term this novel router the Adaptive Clustering (AC) router. Our AC router confers three connected benefits: 1) faster convergence, 2) better robustness, and 3) overall performance improvement, as experts specialize in semantically distinct regions of the input space. We empirically demonstrate the advantages of our AC router on language modeling and image classification in both clean and corrupted settings.
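To make the routing idea concrete, below is a minimal sketch (not the authors' released code) of an adaptive-clustering-style router: each expert keeps per-feature dispersion estimates for the tokens routed to it, tighter features receive larger weights, and token-expert scores are computed as weighted distances to expert centroids in that re-weighted space. The inverse-variance weighting rule, the EMA update, and all shapes and hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of an adaptive-clustering router (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveClusteringRouter(nn.Module):
    def __init__(self, d_model: int, num_experts: int, top_k: int = 2, eps: float = 1e-6):
        super().__init__()
        # Learnable expert centroids in feature space.
        self.centroids = nn.Parameter(torch.randn(num_experts, d_model) * 0.02)
        # Running per-expert, per-feature dispersion estimates (not learned by gradient).
        self.register_buffer("dispersion", torch.ones(num_experts, d_model))
        self.top_k = top_k
        self.eps = eps

    def feature_weights(self) -> torch.Tensor:
        # Tighter clustering along a feature (low dispersion) -> larger weight (assumed rule).
        w = 1.0 / (self.dispersion + self.eps)
        return w / w.sum(dim=-1, keepdim=True) * w.shape[-1]  # normalize per expert

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model)
        w = self.feature_weights()                               # (E, d)
        diff = x.unsqueeze(1) - self.centroids.unsqueeze(0)      # (T, E, d)
        # Weighted squared distance to each expert centroid; smaller distance = better match.
        dist = (w.unsqueeze(0) * diff.pow(2)).sum(dim=-1)        # (T, E)
        gates = F.softmax(-dist, dim=-1)
        topk_vals, topk_idx = gates.topk(self.top_k, dim=-1)

        if self.training:
            # Refresh dispersion from tokens currently routed to each expert (EMA update).
            with torch.no_grad():
                for e in range(self.centroids.shape[0]):
                    mask = (topk_idx == e).any(dim=-1)
                    if mask.any():
                        var_e = (x[mask] - self.centroids[e]).pow(2).mean(dim=0)
                        self.dispersion[e] = 0.9 * self.dispersion[e] + 0.1 * var_e
        return topk_vals, topk_idx
```

The key design choice this sketch tries to capture is that routing distances are anisotropic and expert-specific: features along which an expert's cluster is tight dominate the assignment score, which is what lets the router separate otherwise overlapping clusters.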