

Poster

Improving Language Model Distillation through Hidden State Matching

Sayantan Dasgupta · Trevor Cohn

Hall 3 + Hall 2B #609
Sat 26 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

Hidden state matching has been used to improve knowledge distillation of language models since DistilBERT, by encouraging similarity between a student's hidden states and its teacher's. This typically relies on a cosine loss, which restricts the student's hidden dimensionality to that of the teacher and severely limits the compression ratio. We present an alternative technique using Centered Kernel Alignment (CKA) to match hidden states of different dimensionality, allowing for smaller students and higher compression ratios. We show the efficacy of our method on encoder-decoder (BART, mBART & T5) and encoder-only (BERT) architectures across a range of tasks, from classification to summarization and translation. Our technique is competitive with current state-of-the-art distillation methods at comparable compression rates and does not require already-pretrained student models. It scales to students smaller than current methods allow, is no slower in training or inference, and is considerably more flexible. The code is available on GitHub.
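To illustrate why CKA removes the dimensionality constraint that a cosine loss imposes, below is a minimal PyTorch sketch of a linear-CKA matching term between student and teacher hidden states of different widths. This is an assumption-laden illustration of the general idea in the abstract, not the authors' released code: the function names (`linear_cka`, `cka_loss`) and the choice of linear rather than kernel CKA are hypothetical.

```python
# Minimal sketch of a linear-CKA hidden-state matching loss (illustrative only).
import torch


def linear_cka(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Linear CKA between two feature matrices.

    x: (n, d_student) student hidden states for n tokens.
    y: (n, d_teacher) teacher hidden states for the same n tokens.
    d_student and d_teacher may differ, unlike with a cosine loss.
    """
    # Centre each feature dimension over the n tokens.
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)

    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F): all products are taken in
    # feature space, so no equal-dimensionality constraint arises.
    cross = (y.t() @ x).norm(p="fro") ** 2
    self_x = (x.t() @ x).norm(p="fro")
    self_y = (y.t() @ y).norm(p="fro")
    return cross / (self_x * self_y + eps)


def cka_loss(student_hidden: torch.Tensor, teacher_hidden: torch.Tensor) -> torch.Tensor:
    """Distillation term that pushes CKA towards 1 (perfect alignment)."""
    # Flatten (batch, seq_len, dim) into (batch * seq_len, dim) token features.
    s = student_hidden.reshape(-1, student_hidden.size(-1))
    t = teacher_hidden.reshape(-1, teacher_hidden.size(-1)).detach()  # no gradient to teacher
    return 1.0 - linear_cka(s, t)
```

In use, such a term would be added to the usual distillation objective (e.g. soft-label cross-entropy) for selected student/teacher layer pairs; the layer-mapping and weighting choices are left unspecified here.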
