Can Transformers Really Do It All? On the Compatibility of Inductive Biases Across Tasks
Abstract
Transformers are remarkably versatile, and their design is largely consistent across a variety of applications. But are they optimal for any given task or dataset? The answer may be key to pushing AI beyond merely scaling current designs. Method. We present a method to optimize a transformer architecture for a given dataset, which we use as a tool to study optimal task-specific inductive biases. The method replaces the most important non-linearities (GeLUs, softmax) with functions learned on held-out data. We then train the resulting architectures on other datasets as a way to evaluate the compatibility between pairs of tasks. Findings. On algorithmic toy tasks, we identify new architectures with dramatic improvements in learning speed, in- and out-of-distribution generalization, and stability across seeds. The new designs prove highly task-specific, however, indicating that these tasks require inductive biases very different from those of standard transformers. On code and language modeling datasets, we also find architectures with consistent, though smaller, improvements. These designs transfer much better across datasets and domains (English and computer code). Implications. Our results show that standard transformers are rarely a local optimum in the space of architectures. Simple alternatives can perform much better but sacrifice universality. This suggests there may be room for improved architectures that better support multiple capabilities simultaneously, such as fluency and robust reasoning.
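To make the core idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation) of one way a fixed non-linearity such as GELU can be replaced by a learned function: a piecewise-linear activation whose knot positions are fixed on a grid and whose knot values are the learned parameters. Here the knot values are fit to imitate GELU on grid points; in the paper's setting the parameters would instead be learned from held-out data. All names (`KNOTS`, `pw_linear`, `learned`) are illustrative assumptions.

```python
import math

# Hypothetical sketch: a learnable piecewise-linear activation that
# could stand in for a fixed GELU. Knot positions are fixed on a
# uniform grid; the knot *values* are the learned parameters.

KNOTS = [-3.0 + 0.5 * i for i in range(13)]  # grid on [-3, 3], step 0.5

def gelu(x):
    # Exact GELU via the Gaussian CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def pw_linear(x, values):
    # Linearly interpolate between learned knot values; clamp outside the grid.
    if x <= KNOTS[0]:
        return values[0]
    if x >= KNOTS[-1]:
        return values[-1]
    i = int((x - KNOTS[0]) / 0.5)
    t = (x - KNOTS[i]) / 0.5
    return (1 - t) * values[i] + t * values[i + 1]

# "Learning" is trivial in this toy setup: each knot value is set
# directly from the target function. With held-out data as the target,
# the values would instead be fit by gradient descent alongside the
# rest of the network.
learned = [gelu(k) for k in KNOTS]

print(round(pw_linear(0.0, learned), 4))  # matches gelu(0.0) = 0.0
print(round(pw_linear(1.0, learned), 4))  # matches gelu(1.0) at a knot
```

Because the grid is fixed, the activation stays cheap to evaluate while remaining fully flexible in shape; a finer grid or a small MLP could serve the same role.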