Leveraging Low-Rank Structure for Effective Weight-Sharing in Language Models
Abstract
Small language models are often built by scaling down standard large language model architectures. We argue that this design choice is suboptimal, and that small models can be parameterized more effectively by sharing weights, with differences captured by low-rank adaptation (LoRA) modules. We test this hypothesis by comparing several weight-tying strategies. We find that attention matrices, and even entire layers, can be shared without degrading performance. Sharing weights in this way increases FLOPs per parameter, reduces optimizer-state memory, and outperforms parameter-matched baselines. We also reduce the parameter count of the embedding layer via a factorized construction, which yields additional memory savings. To motivate these design choices, we analyze the effective rank of model weights and the residual stream. Our analysis leads to more efficient compact models.
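
To make the parameterization concrete, the following is a minimal sketch of one weight matrix shared across layers, with a per-layer LoRA correction. It is an illustration under stated assumptions, not the implementation used in this work: the function names, the dimensions, and the zero initialization of the up-projection (borrowed from common LoRA practice) are all hypothetical choices made for the example.

```python
import numpy as np


def make_shared_lora_layers(d_model, n_layers, rank, seed=0):
    """Create one shared weight matrix plus a low-rank adapter per layer.

    All names and initializations here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    shared_w = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
    adapters = []
    for _ in range(n_layers):
        # Down-projection A: (rank, d_model), randomly initialized.
        a = rng.normal(scale=d_model ** -0.5, size=(rank, d_model))
        # Up-projection B: (d_model, rank), zero-initialized so that every
        # layer starts out identical to the shared weight (assumption).
        b = np.zeros((d_model, rank))
        adapters.append((a, b))
    return shared_w, adapters


def layer_forward(x, shared_w, adapter):
    """Apply the effective weight shared_w + B @ A to x of shape (batch, d_model).

    The low-rank product is applied in two steps, so the full
    d_model x d_model per-layer matrix is never materialized.
    """
    a, b = adapter
    return x @ shared_w.T + (x @ a.T) @ b.T


def count_params(d_model, n_layers, rank):
    """Return (untied, tied) parameter counts for n_layers square matrices."""
    untied = n_layers * d_model * d_model
    tied = d_model * d_model + n_layers * 2 * rank * d_model
    return untied, tied


if __name__ == "__main__":
    shared_w, adapters = make_shared_lora_layers(d_model=8, n_layers=3, rank=2)
    x = np.ones((4, 8))
    y = layer_forward(x, shared_w, adapters[0])
    print(y.shape)
    print(count_params(d_model=512, n_layers=8, rank=16))
```

With the hypothetical sizes in the last call (width 512, 8 layers, rank 16), the untied stack has 2,097,152 parameters and the tied version has 393,216. The matrix multiply against the shared weight still runs once per layer, which is why sharing raises FLOPs per parameter rather than lowering compute.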