## HyperGrid Transformers: Towards A Single Model for Multiple Tasks

### Yi Tay · Zhe Zhao · Dara Bahri · Donald Metzler · Da-Cheng Juan

Keywords: [ Transformers ] [ Multi-Task Learning ]

Abstract: Achieving state-of-the-art performance on natural language understanding tasks typically relies on fine-tuning a fresh model for every task. Consequently, this approach leads to a higher overall parameter cost, along with higher technical maintenance for serving multiple models. Learning a single multi-task model that is able to do well for all the tasks has been a challenging and yet attractive proposition. In this paper, we propose HyperGrid Transformers, a new Transformer architecture that leverages task-conditioned hyper networks for controlling its feed-forward layers. Specifically, we propose a decomposable hypernetwork that learns grid-wise projections that help to specialize regions in weight matrices for different tasks. In order to construct the proposed hypernetwork, our method learns the interactions and composition between a global (task-agnostic) state and a local task-specific state. We conduct an extensive set of experiments on GLUE/SuperGLUE. On the SuperGLUE test set, we match the performance of the state-of-the-art while being $16$ times more parameter efficient. Our method helps bridge the gap between fine-tuning and multi-task learning approaches.