ICLR Poster Soft Merging of Experts with Adaptive Routing

Haokun Liu · Muqeeth Mohammed · Colin Raffel

Hall 3 + Hall 2B #537

Fri 25 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract: Neural networks that learn to route their inputs through different "expert" subnetworks provide a form of modularity that standard dense models lack. Despite their possible benefits, modular models with learned routing often underperform their parameter-matched dense counterparts as well as models that use non-learned heuristic routing strategies. In this paper, we hypothesize that these shortcomings stem from the gradient estimation techniques used to train modular models that use non-differentiable discrete routing decisions. To address this issue, we introduce $\textbf{S}$oft $\textbf{M}$erging of $\textbf{E}$xperts with $\textbf{A}$daptive $\textbf{R}$outing (SMEAR), which avoids discrete routing by using a single "merged" expert constructed via a weighted average of all of the experts' parameters. By routing activations through a single merged expert, SMEAR does not incur a significant increase in computational costs and enables standard gradient-based training. We empirically validate that models using SMEAR outperform models that route based on metadata or learn routing through gradient estimation. Furthermore, we provide qualitative analysis demonstrating that the experts learned via SMEAR exhibit a significant amount of specialization.
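
The merging step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear-plus-softmax router, the single-linear-layer experts, and all names (`smear_layer`, `router_w`, `expert_ws`, `expert_bs`) are assumptions chosen for illustration. It shows only the core idea, that the router's probabilities weight the experts' parameters into one merged expert, so each input passes through a single expert-sized computation and every step stays differentiable.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def smear_layer(x, router_w, expert_ws, expert_bs):
    """Illustrative soft-merging layer (hypothetical, not the paper's code).

    x:         (d_in,)                  input activation
    router_w:  (d_in, n_experts)        assumed linear router
    expert_ws: (n_experts, d_in, d_out) one weight matrix per expert
    expert_bs: (n_experts, d_out)       one bias vector per expert
    """
    # Routing distribution over experts; no discrete choice is made.
    probs = softmax(x @ router_w)
    # Weighted average of the experts' parameters gives one merged expert.
    merged_w = np.tensordot(probs, expert_ws, axes=1)  # (d_in, d_out)
    merged_b = probs @ expert_bs                        # (d_out,)
    # A single forward pass through the merged expert.
    return x @ merged_w + merged_b, probs


# Example usage with random parameters (3 experts, 4 inputs, 5 outputs).
rng = np.random.default_rng(0)
x = rng.normal(size=4)
router_w = rng.normal(size=(4, 3))
expert_ws = rng.normal(size=(3, 4, 5))
expert_bs = rng.normal(size=(3, 5))
y, probs = smear_layer(x, router_w, expert_ws, expert_bs)
```

For the purely linear experts assumed here, merging parameters gives the same output as averaging the individual experts' outputs with the same weights. With nonlinear experts the two generally differ, and the merged expert still requires only one forward pass.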
