Skip to yearly menu bar Skip to main content


Oral 7A

Halle A 8 - 9

Moderator: Eugene Ndiaye

Fri 10 May 1 a.m. PDT — 1:45 a.m. PDT
Chat is not available.

Fri 10 May 1:00 - 1:15 PDT

Small-scale proxies for large-scale Transformer training instabilities

Mitchell Wortsman · Peter Liu · Lechao Xiao · Katie Everett · Alexander Alemi · Ben Adlam · John Co-Reyes · Izzeddin Gur · Abhishek Kumar · Roman Novak · Jeffrey Pennington · Jascha Sohl-Dickstein · Kelvin Xu · Jaehoon Lee · Justin Gilmer · Simon Kornblith

Teams that have trained large Transformer-based models have reported training instabilities at large scale that did not appear when training with the same hyperparameters at smaller scales. Although the causes of such instabilities are of scientific interest, the amount of resources required to reproduce them has made investigation difficult. In this work, we seek ways to reproduce and study training instability at smaller scales. First, we focus on two sources of training instability described in previous work: the growth of logits in attention layers (Dehghani et al., 2023) and divergence of the output logits from the log probabilities (Chowdhery et al., 2022). By measuring the relationship between learning rate and loss across scales, we show that these instabilities also appear in small models when training at high learning rates, and that mitigations previously employed at large scales are equally effective in this regime. This prompts us to investigate the extent to which other known optimizer and model interventions influence the sensitivity of the final loss to changes in the learning rate. To this end, we study methods such as warm-up, weight decay, and the MuParam (Yang et al., 2022), and combine techniques to train small models that achieve similar losses across orders of magnitude of learning rate variation. Finally, to conclude our exploration we study two cases where instabilities can be predicted before they emerge by examining the scaling behavior of model characteristics such as activation and gradient norms.

Fri 10 May 1:15 - 1:30 PDT

An Analytical Solution to Gauss-Newton Loss for Direct Image Alignment

Sergei Solonets · Daniil Sinitsyn · Lukas Von Stumberg · Nikita Araslanov · Daniel Cremers

Direct image alignment is a widely used technique for relative 6DoF pose estimation between two images, but its accuracy strongly depends on pose initialization.Therefore, recent end-to-end frameworks increase the convergence basin of the learned feature descriptors with special training objectives, such as the Gauss-Newton loss.However, the training data may exhibit bias toward a specific type of motion and pose initialization,thus limiting the generalization of these methods.In this work, we derive a closed-form solution to the expected optimum of the Gauss-Newton loss. The solution is agnostic to the underlying feature representation and allows us to dynamically adjust the basin of convergence according to our assumptions about the uncertainty in the current estimates. These properties allow for effective control over the convergence in the alignment process.Despite using self-supervised feature embeddings, our solution achieves compelling accuracy w.r.t. the state-of-the-art direct image alignment methods trained end-to-end with pose supervision, and demonstrates improved robustness to pose initialization.Our analytical solution exposes some inherent limitations of end-to-end learning with the Gauss-Newton loss, and establishes an intriguing connection between direct image alignment and feature-matching approaches.

Fri 10 May 1:30 - 1:45 PDT

Statistically Optimal $K$-means Clustering via Nonnegative Low-rank Semidefinite Programming

Yubo Zhuang · Xiaohui Chen · Yun Yang · Richard Zhang

$K$-means clustering is a widely used machine learning method for identifying patterns in large datasets. Recently, semidefinite programming (SDP) relaxations have been proposed for solving the $K$-means optimization problem, which enjoy strong statistical optimality guarantees. However, the prohibitive cost of implementing an SDP solver renders these guarantees inaccessible to practical datasets. In contrast, nonnegative matrix factorization (NMF) is a simple clustering algorithm widely used by machine learning practitioners, but it lacks a solid statistical underpinning and theoretical guarantees. In this paper, we consider an NMF-like algorithm that solves a nonnegative low-rank restriction of the SDP-relaxed $K$-means formulation using a nonconvex Burer--Monteiro factorization approach. The resulting algorithm is as simple and scalable as state-of-the-art NMF algorithms while also enjoying the same strong statistical optimality guarantees as the SDP. In our experiments, we observe that our algorithm achieves significantly smaller mis-clustering errors compared to the existing state-of-the-art while maintaining scalability.