ICLR Poster u-$\mu$P: The Unit-Scaled Maximal Update Parametrization


Charles Blake · Constantin Eichenberg · Josef Dean · Lukas Balles · Luke Prince · Björn Deiseroth · Andres Felipe Cruz Salinas · Carlo Luschi · Samuel Weinbach · Douglas Orr

Hall 3 + Hall 2B #262


Wed 23 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract: The Maximal Update Parametrization ($\mu$P) aims to make the optimal hyperparameters (HPs) of a model independent of its size, allowing them to be swept using a cheap proxy model rather than the full-size target model. We present a new scheme, u-$\mu$P, which improves upon $\mu$P by combining it with Unit Scaling, a method for designing models that makes them easy to train in low-precision. The two techniques have a natural affinity: $\mu$P ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights and gradients begin training with a scale of one. This synthesis opens the door to a simpler scheme, whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-$\mu$P models reaching a lower loss than comparable $\mu$P models and working out-of-the-box in FP8.
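The "scale of one" property mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a plain linear layer with unit-variance inputs and weights, covers only the forward pass at initialization, and the names `unit_scaled_linear` and `output_std` are hypothetical helpers introduced here. It says nothing about gradient scaling or the learning-rate rules that $\mu$P and u-$\mu$P also specify.

```python
import numpy as np


def unit_scaled_linear(x, w):
    """Linear layer whose output is rescaled by 1/sqrt(fan_in).

    With unit-variance inputs and unit-variance weights, the raw product
    x @ w has variance close to fan_in, so dividing by sqrt(fan_in)
    brings the output back to roughly unit scale.
    """
    fan_in = w.shape[0]
    return (x @ w) / np.sqrt(fan_in)


def output_std(width, batch=1024, seed=0):
    """Empirical standard deviation of the layer output at a given width."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, width))  # activations with scale ~1
    w = rng.standard_normal((width, width))  # weights initialized with scale ~1
    return float(unit_scaled_linear(x, w).std())


# Measure the output scale at several widths (illustrative sizes only).
stds = {width: output_std(width) for width in (64, 256, 1024)}
for width, std in stds.items():
    print(f"width={width:5d}  output std={std:.3f}")
```

Under these assumptions the measured standard deviation should stay near one as the width grows, which is the kind of size-independent, unit-scale behavior the abstract describes for activations at the start of training.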

