

Poster

Bilinear MLPs enable weight-based mechanistic interpretability

Michael Pearce · Thomas Dooms · Alice Rigg · Jose Oramas · Lee Sharkey

Hall 3 + Hall 2B #524
Wed 23 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract:

A mechanistic understanding of how MLPs do computation in deep neural networks remains elusive. Current interpretability work can extract features from hidden activations over an input dataset but generally cannot explain how MLP weights construct features. One challenge is that element-wise nonlinearities introduce higher-order interactions and make it difficult to trace computations through the MLP layer. In this paper, we analyze bilinear MLPs, a type of Gated Linear Unit (GLU) without any element-wise nonlinearity that nevertheless achieves competitive performance. Bilinear MLPs can be fully expressed in terms of linear operations using a third-order tensor, allowing flexible analysis of the weights. Analyzing the spectra of bilinear MLP weights using eigendecomposition reveals interpretable low-rank structure across toy tasks, image classification, and language modeling. We use this understanding to craft adversarial examples, uncover overfitting, and identify small language model circuits directly from the weights alone. Our results demonstrate that bilinear layers serve as an interpretable drop-in replacement for current activation functions and that weight-based interpretability is viable for understanding deep-learning models.
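To make the construction in the abstract concrete, the sketch below shows a minimal bilinear layer and the weight-based analysis it enables. This is an illustrative assumption, not the authors' code: the names `BilinearMLP` and `interaction_matrix` and the bias-free parameterization are hypothetical. The layer is an element-wise product of two linear maps, so for any output direction `u` its contribution is a quadratic form `x^T Q x`, where `Q` is a symmetrized slice of the third-order weight tensor; eigendecomposing `Q` is one way to surface the low-rank structure the abstract describes.

```python
import torch
import torch.nn as nn

class BilinearMLP(nn.Module):
    """Bilinear MLP: a GLU with the element-wise nonlinearity removed.
    Illustrative sketch; names and the bias-free setup are assumptions."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.W = nn.Linear(d_model, d_hidden, bias=False)  # first linear branch
        self.V = nn.Linear(d_model, d_hidden, bias=False)  # second linear branch
        self.P = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise product of two linear maps: quadratic in x,
        # with no element-wise nonlinearity anywhere.
        return self.P(self.W(x) * self.V(x))

def interaction_matrix(mlp: BilinearMLP, u: torch.Tensor) -> torch.Tensor:
    """Symmetric Q such that u . mlp(x) == x^T Q x in this bias-free sketch.

    Each output direction u selects a slice of the third-order weight
    tensor B[i, j, k] = W[i, j] * V[i, k]; symmetrizing that slice gives
    a matrix whose spectrum can be inspected directly from the weights.
    """
    W, V, P = mlp.W.weight, mlp.V.weight, mlp.P.weight  # (h,d), (h,d), (d,h)
    coeffs = u @ P                                      # (h,): weight of each hidden unit
    Q = torch.einsum('i,ij,ik->jk', coeffs, W, V)       # contract over hidden units
    return 0.5 * (Q + Q.T)                              # symmetrize; x^T Q x unchanged

# Usage: eigenvectors are input directions; large-|eigenvalue| ones dominate.
mlp = BilinearMLP(d_model=16, d_hidden=64)
u, x = torch.randn(16), torch.randn(16)
Q = interaction_matrix(mlp, u)
assert torch.allclose(u @ mlp(x), x @ Q @ x, atol=1e-5)  # quadratic form matches
eigvals, eigvecs = torch.linalg.eigh(Q)
```

Because the layer has no element-wise nonlinearity, this analysis needs no input dataset: the matrix `Q`, and hence its eigendecomposition, is a function of the weights alone.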
