On the Residual Scaling of Looped Transformers: Stability and Transferability
Shaowen Wang ⋅ Bingrui Li ⋅ Ge Zhang ⋅ Wenhao Huang ⋅ Jian Li
Abstract
Looped (weight-tied) Transformers increase effective depth by repeatedly applying a shared block for $L$ steps. In practice, larger $L$ often improves capability but requires careful hyperparameter tuning. We study the parameterization of pre-norm looped Transformers and ask which residual scaling enables stable training and transferable hyperparameters across loop counts. In contrast to the $1/\sqrt{L}$ scaling common in deep networks, our analysis of a simplified tied-weight residual MLP shows that looped models require $1/L$ residual scaling. We validate these theoretical predictions on a standard pre-norm Transformer architecture. Our experiments with looped LLMs across various loop counts and learning rates demonstrate that $1/L$ scaling offers significantly better stability and hyperparameter transfer than $1/\sqrt{L}$ scaling.
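As a rough intuition for the difference between the two scalings, consider a toy scalar sketch. It is not the paper's analysis and omits normalization and nonlinearity. It assumes a tied linear block acting along a single eigen-direction, so each loop applies $h \leftarrow h + \alpha \lambda h$ with the same $\lambda$ every time. The function name `looped_growth` and the parameter `eigenvalue` are hypothetical, introduced only for illustration.

```python
import math


def looped_growth(num_loops, scale, eigenvalue=1.0):
    """Growth of a unit hidden state under a tied scalar residual update.

    Toy assumption: the shared block acts as multiplication by `eigenvalue`,
    so every loop applies h <- h + scale * eigenvalue * h with the same weight.
    """
    h = 1.0
    for _ in range(num_loops):
        h = h + scale * eigenvalue * h
    return h


if __name__ == "__main__":
    # Compare the two residual scales as the loop count L grows.
    for L in (4, 16, 64, 256):
        inv_l = looped_growth(L, 1.0 / L)
        inv_sqrt_l = looped_growth(L, 1.0 / math.sqrt(L))
        print(f"L={L:4d}  1/L: {inv_l:.3f}  1/sqrt(L): {inv_sqrt_l:.3e}")
```

In this toy setting the update compounds as $(1 + \alpha\lambda)^L$ because the same weight is reused at every step. With $\alpha = 1/L$ and $\lambda = 1$ the factor stays below $e$ for every $L$, while with $\alpha = 1/\sqrt{L}$ it grows roughly like $e^{\sqrt{L}}$.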