Poster in Workshop: Latent & Implicit Thinking – Going Beyond CoT Reasoning
Mon, Apr 27, 2026 • 12:00 PM – 12:50 PM PDT

On the Residual Scaling of Looped Transformers: Stability and Transferability

Shaowen Wang ⋅ Bingrui Li ⋅ Ge Zhang ⋅ Wenhao Huang ⋅ Jian Li

Abstract

Looped (weight-tied) Transformers increase effective depth by repeatedly applying a shared block for $L$ steps. In practice, larger $L$ often improves capability but requires careful hyperparameter tuning. We study the parameterization of pre-norm looped Transformers and ask which residual scaling enables stable training and transferable hyperparameters across loop counts. In contrast to the common $1/\sqrt{L}$ scaling used in deep networks, our analysis of a simplified tied-weight residual MLP shows that looped models require $1/L$ residual scaling. We validate these theoretical predictions on a standard pre-norm Transformer architecture. Our experiments with looped LLMs across a range of loop counts and learning rates demonstrate that $1/L$ scaling offers significantly better stability and hyperparameter transfer than $1/\sqrt{L}$ scaling.
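To make the two scaling choices concrete, here is a minimal sketch of a looped pre-norm residual MLP in plain Python. The abstract does not state the exact update rule, so this assumes the common form $x_{l+1} = x_l + \alpha \, f(\mathrm{norm}(x_l))$, where the same block $f$ is reused at every step and $\alpha$ is either $1/L$ or $1/\sqrt{L}$. The function names (`residual_scale`, `shared_block`, `looped_forward`), the RMS normalization, and the random initialization are all hypothetical choices for illustration, not the authors' implementation.

```python
import math
import random


def residual_scale(num_loops, scaling):
    """Return the residual multiplier alpha for a given loop count L."""
    if scaling == "1/L":
        return 1.0 / num_loops
    if scaling == "1/sqrt(L)":
        return 1.0 / math.sqrt(num_loops)
    raise ValueError(f"unknown scaling: {scaling}")


def rms_norm(x, eps=1e-6):
    """Pre-norm step (assumed here to be RMS normalization without gain)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def shared_block(x, W_in, W_out):
    """Two-layer ReLU MLP; the same weights are used at every loop step."""
    hidden = [max(0.0, v) for v in matvec(W_in, x)]
    return matvec(W_out, hidden)


def looped_forward(x, W_in, W_out, num_loops, scaling="1/L"):
    """Apply the tied-weight block num_loops times with a scaled residual."""
    alpha = residual_scale(num_loops, scaling)
    for _ in range(num_loops):
        update = shared_block(rms_norm(x), W_in, W_out)
        x = [xi + alpha * ui for xi, ui in zip(x, update)]
    return x


def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))


# Illustrative setup: small random weights and input (assumed, not from the paper).
rng = random.Random(0)
dim = 8
W_in = [[rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)] for _ in range(dim)]
W_out = [[rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)] for _ in range(dim)]
x0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]

for num_loops in (1, 4, 16, 64):
    for scaling in ("1/L", "1/sqrt(L)"):
        out = looped_forward(x0, W_in, W_out, num_loops, scaling)
        print(num_loops, scaling, round(l2_norm(out), 4))
```

Running the loop at the bottom prints the output norm for each loop count under both scalings, which is one simple way to inspect how the forward pass changes with $L$. This only probes a randomly initialized forward pass; the stability and hyperparameter-transfer results in the abstract concern training, which this sketch does not reproduce.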