

Poster

A Solvable Attention for Neural Scaling Laws

Bochen Lyu · Di Wang · Zhanxing Zhu

Hall 3 + Hall 2B #634
Thu 24 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

Transformers and many other deep learning models have been empirically shown to improve their performance predictably as a power law of training time, model size, or the number of training data points, a phenomenon termed the neural scaling law. This paper studies this intriguing phenomenon for the transformer architecture in theoretical setups. Specifically, we propose a framework in which linear self-attention, the underpinning building block of the transformer without softmax, learns in an in-context manner, and we model the corresponding learning dynamics as a non-linear ordinary differential equation (ODE) system. Furthermore, we establish a procedure to derive a tractable approximate solution for this ODE system by reformulating it as a Riccati equation, which allows us to precisely characterize the neural scaling laws of linear self-attention with respect to training time, model size, data size, and the optimal compute. In addition, we show that linear self-attention shares similar neural scaling laws with several other architectures when the context sequence length of the in-context learning task is fixed, whereas it exhibits a different scaling law in training time otherwise.
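As a rough, hypothetical illustration of the setup described above (not the authors' exact parameterization), the Python sketch below applies a softmax-free self-attention update to an in-context regression prompt of N example tokens plus one query token; the merged matrices W_kq and W_pv and the prompt layout are assumptions of this sketch.

import numpy as np

def linear_self_attention(Z, W_kq, W_pv):
    # Softmax-free self-attention on a token matrix Z of shape (N+1, d+1),
    # using a hypothetical merged-parameter form:
    #   f(Z) = Z + (1/N) * Z @ W_kq @ Z.T @ Z @ W_pv
    # The paper's exact parameterization may differ.
    N = Z.shape[0] - 1  # number of in-context examples; last row is the query
    return Z + (Z @ W_kq @ Z.T @ Z @ W_pv) / N

# Toy in-context linear-regression prompt: each row is a token [x_i, y_i];
# the last row holds the query [x_q, 0], whose label slot the model should fill in.
rng = np.random.default_rng(0)
d, N = 4, 32
w_star = rng.normal(size=d)          # ground-truth task vector for this prompt
X = rng.normal(size=(N, d))
y = X @ w_star
x_q = rng.normal(size=d)

Z = np.zeros((N + 1, d + 1))
Z[:N, :d], Z[:N, d] = X, y
Z[N, :d] = x_q                        # query row, label slot left at zero

W_kq = 0.01 * rng.normal(size=(d + 1, d + 1))
W_pv = 0.01 * rng.normal(size=(d + 1, d + 1))
pred = linear_self_attention(Z, W_kq, W_pv)[N, d]  # read prediction from the query's label slot
print(pred, x_q @ w_star)

Under the paper's claims, tracking the training loss of such a model against training steps, model size, or data size on log-log axes would exhibit the power-law behavior the abstract describes, with the training-time exponent depending on whether the context length N is held fixed.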
