Spotlight Poster
How Over-Parameterization Slows Down Gradient Descent in Matrix Sensing: The Curses of Symmetry and Initialization
Nuoya Xiong · Lijun Ding · Simon Du
Halle B #142
Abstract:
This paper rigorously shows how over-parameterization dramatically changes the convergence behavior of gradient descent (GD) for the matrix sensing problem, where the goal is to recover an unknown low-rank ground-truth matrix from near-isotropic linear measurements.

First, we consider the symmetric setting with the symmetric parameterization, where $M^* \in \mathbb{R}^{n \times n}$ is an unknown positive semi-definite matrix of rank $r \ll n$, and one uses a symmetric parameterization $XX^\top$ to learn $M^*$. Here $X \in \mathbb{R}^{n \times k}$ with $k > r$ is the factor matrix. We give a novel $\Omega(1/T^2)$ lower bound for randomly initialized GD in the over-parameterized case ($k > r$), where $T$ is the number of iterations. This is in stark contrast to the exact-parameterization scenario ($k = r$), where the convergence rate is $\exp(-\Omega(T))$.

Next, we study the asymmetric setting, where $M^* \in \mathbb{R}^{n_1 \times n_2}$ is the unknown matrix of rank $r \ll \min\{n_1, n_2\}$, and one uses an asymmetric parameterization $FG^\top$ to learn $M^*$, where $F \in \mathbb{R}^{n_1 \times k}$ and $G \in \mathbb{R}^{n_2 \times k}$. We give the first global exact convergence result of randomly initialized GD for the exact-parameterization case ($k = r$) with an $\exp(-\Omega(T))$ rate. Furthermore, we give the first global exact convergence result for the over-parameterized case ($k > r$) with an $\exp(-\Omega(\alpha^2 T))$ rate, where $\alpha$ is the initialization scale. This linear convergence result in the over-parameterized case is especially significant because one can apply the asymmetric parameterization to the symmetric setting to speed up the rate from $\Omega(1/T^2)$ to linear convergence. We therefore identify a surprising phenomenon: asymmetric parameterization can exponentially speed up convergence. Equally surprising, our analysis highlights the importance of imbalance between $F$ and $G$, in sharp contrast to prior works that emphasize balance. We further give an example showing that the dependency on $\alpha$ in the convergence rate is unavoidable in the worst case. On the other hand, we propose a novel method that modifies only one step of GD and obtains a convergence rate independent of $\alpha$, recovering the rate of the exact-parameterization case. We provide empirical studies to verify our theoretical findings.
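Below is a minimal sketch, not the authors' code, illustrating the matrix sensing setup described in the abstract: recovering a rank-$r$ PSD matrix $M^*$ from linear measurements $y_i = \langle A_i, M^* \rangle$, and running gradient descent on the over-parameterized symmetric parameterization $XX^\top$ versus the asymmetric parameterization $FG^\top$ with a small initialization scale $\alpha$. All dimensions, the step size, and the hyperparameter values are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes (assumed, not from the paper):
# ambient dim n, true rank r, over-parameterized rank k > r, m measurements.
n, r, k, m = 20, 2, 5, 600
eta, alpha, T = 0.01, 1e-3, 2000   # step size, init scale, iterations (assumed)

# Rank-r PSD ground truth M* = U U^T.
U = rng.standard_normal((n, r))
M_star = U @ U.T

# Near-isotropic Gaussian measurements A_i and observations y_i = <A_i, M*>.
A = rng.standard_normal((m, n, n)) / np.sqrt(m)
y = np.einsum('mij,ij->m', A, M_star)

def grad_wrt_M(M):
    """Gradient of (1/2) * sum_i (<A_i, M> - y_i)^2 with respect to M."""
    residual = np.einsum('mij,ij->m', A, M) - y
    return np.einsum('m,mij->ij', residual, A)

# --- Symmetric parameterization: gradient descent on X with M = X X^T ---
X = alpha * rng.standard_normal((n, k))
for _ in range(T):
    G_M = grad_wrt_M(X @ X.T)
    X = X - eta * (G_M + G_M.T) @ X        # chain rule: dL/dX = (G_M + G_M^T) X

# --- Asymmetric parameterization: gradient descent on (F, G) with M = F G^T ---
F = alpha * rng.standard_normal((n, k))
G = alpha * rng.standard_normal((n, k))
for _ in range(T):
    G_M = grad_wrt_M(F @ G.T)
    F, G = F - eta * G_M @ G, G - eta * G_M.T @ F

print("symmetric  XX^T error:", np.linalg.norm(X @ X.T - M_star, 'fro'))
print("asymmetric FG^T error:", np.linalg.norm(F @ G.T - M_star, 'fro'))
```

Under the abstract's results, one would expect the symmetric run to stall at a polynomially decaying error in the over-parameterized regime, while the asymmetric run converges linearly at a rate governed by $\alpha$; the sketch only sets up this comparison and does not implement the paper's modified one-step GD procedure.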