Virtual presentation / top 25% paper
Neural Networks Efficiently Learn Low-Dimensional Representations with SGD
Alireza Mousavi-Hosseini · Sejun Park · Manuela Girotti · Ioannis Mitliagkas · Murat A Erdogdu
Keywords: [ sgd ] [ generalization ] [ feature learning ] [ neural networks ] [ compressibility ] [ Theory ]
Abstract:
We study the problem of training a two-layer neural network (NN) of arbitrary width using stochastic gradient descent (SGD) where the input x ∈ R^d is Gaussian and the target y ∈ R follows a multiple-index model, i.e., y = g(⟨u_1, x⟩, …, ⟨u_k, x⟩) with a noisy link function g. We prove that the first-layer weights of the NN converge to the k-dimensional principal subspace spanned by the vectors u_1, …, u_k of the true model, when online SGD with weight decay is used for training. This phenomenon has several important consequences when k ≪ d. First, by employing uniform convergence on this smaller subspace, we establish a generalization error bound of O(√(kd/T)) after T iterations of SGD, which is independent of the width of the NN. We further demonstrate that, by recovering the principal direction, SGD-trained ReLU NNs can learn a single-index target of the form y = f(⟨u, x⟩) + ϵ with a sample complexity linear in d (up to log factors), where f is a monotonic function with at most polynomial growth and ϵ is the noise. This is in contrast to the known d^{Ω(p)} samples required to learn any degree-p polynomial in the kernel regime, and it shows that SGD-trained NNs can outperform the Neural Tangent Kernel at initialization. Finally, we establish compressibility guarantees for NNs using the fact that SGD produces an approximately rank-k first-layer weight matrix.
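The sketch below is a minimal illustration (not the authors' implementation) of the setting in the abstract for the single-index case k = 1: a two-layer ReLU network is trained with online SGD plus weight decay on y = f(⟨u, x⟩) + ϵ with Gaussian inputs, and we then measure how well the first-layer weights align with the direction u and how close the first-layer weight matrix is to rank one. The width, step size, weight-decay coefficient, link function, and noise level are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the abstract's setting (single-index case, k = 1).
# All hyperparameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, m, T = 64, 200, 20000          # input dimension, network width, SGD iterations
eta, lam = 0.05, 0.01             # step size and weight-decay coefficient

u = rng.standard_normal(d); u /= np.linalg.norm(u)   # hidden direction u
f = lambda z: np.maximum(z, 0.0) + 0.5 * z           # monotone link with polynomial growth (illustrative)

W = rng.standard_normal((m, d)) / np.sqrt(d)         # first-layer weights (trained)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)     # second-layer weights (kept fixed in this sketch)

for t in range(T):
    x = rng.standard_normal(d)                       # fresh Gaussian sample (online SGD)
    y = f(u @ x) + 0.1 * rng.standard_normal()       # noisy single-index target
    z = W @ x                                        # pre-activations
    h = np.maximum(z, 0.0)                           # ReLU features
    resid = a @ h - y                                # squared-loss residual
    grad_W = np.outer(resid * a * (z > 0), x)        # gradient of 0.5*resid^2 w.r.t. W
    W -= eta * (grad_W + lam * W)                    # SGD step with weight decay

# Alignment of each neuron's weight vector with u; values near 1 indicate that
# the first layer has converged to the 1-dimensional principal subspace span(u).
align = np.abs(W @ u) / np.linalg.norm(W, axis=1)
print("mean |cos(w_i, u)|:", align.mean())

# Nuclear-to-spectral norm ratio of W as a rough effective-rank proxy; a value
# close to 1 indicates an approximately rank-1 (compressible) first layer.
s = np.linalg.svd(W, compute_uv=False)
print("effective rank proxy:", s.sum() / s[0])
```

Under these assumptions, the alignment statistic approaching 1 is the empirical counterpart of the paper's claim that the first-layer weights converge to the principal subspace, and the effective-rank proxy reflects the compressibility guarantee stated at the end of the abstract.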