ICLR Poster Learning Hierarchical Polynomials with Three-Layer Neural Networks

Poster

Learning Hierarchical Polynomials with Three-Layer Neural Networks

Zihao Wang · Eshaan Nichani · Jason Lee

Halle B #129


Abstract: We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form $h = g \circ p$, where $p : \mathbb{R}^d \rightarrow \mathbb{R}$ is a degree-$k$ polynomial and $g : \mathbb{R} \rightarrow \mathbb{R}$ is a degree-$q$ polynomial. This function class generalizes the single-index model, which corresponds to $k = 1$, and is a natural class of functions possessing an underlying hierarchical structure. Our main result shows that for a large subclass of degree-$k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error in $\widetilde{O}(d^k)$ samples and polynomial time. This is a strict improvement over kernel methods, which require $\widetilde{\Theta}(d^{kq})$ samples, as well as existing guarantees for two-layer networks, which require the target function to be low-rank. Our result also generalizes prior works on three-layer neural networks, which were restricted to the case of $p$ being a quadratic. When $p$ is indeed a quadratic, we achieve the information-theoretically optimal sample complexity $\widetilde{O}(d^2)$, which is an improvement over prior work (Nichani et al., 2023) requiring a sample size of $\widetilde{\Theta}(d^4)$. Our proof proceeds by showing that during the initial stage of training the network performs feature learning to recover the feature $p$ with $\widetilde{O}(d^k)$ samples. This work demonstrates the ability of three-layer neural networks to learn complex features and, as a result, learn a broad class of hierarchical functions.
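
To make the function class concrete, here is a minimal Python sketch of one hierarchical target $h = g \circ p$ evaluated on standard Gaussian inputs. The specific choices are assumptions made only for illustration and are not taken from the paper: `p` is a hypothetical degree-2 feature ($k = 2$), `g` is a hypothetical cubic link ($q = 3$), and `sample_scalings` simply evaluates $d^k$ and $d^{kq}$ while ignoring the constants and logarithmic factors hidden by the tilde notation.

```python
import numpy as np


def p(x):
    # Hypothetical degree-2 feature (k = 2), assumed for illustration:
    # a difference of squared norms over the two halves of x, scaled by
    # sqrt(2d). For even d and standard Gaussian x it has mean zero.
    d = x.shape[-1]
    first = np.sum(x[..., : d // 2] ** 2, axis=-1)
    second = np.sum(x[..., d // 2 :] ** 2, axis=-1)
    return (first - second) / np.sqrt(2 * d)


def g(z):
    # Hypothetical degree-3 link (q = 3), assumed for illustration.
    return z ** 3 - 3 * z


def h(x):
    # Hierarchical target h = g ∘ p, a polynomial of degree k * q in x.
    return g(p(x))


def sample_scalings(d, k, q):
    # Leading-order scalings quoted in the abstract, with constants and
    # log factors dropped: d^k for the three-layer network result and
    # d^(kq) for kernel methods.
    return d ** k, d ** (k * q)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((5, 10))  # 5 samples from N(0, I_10)
    print(h(x).shape)
    print(sample_scalings(d=10, k=2, q=3))
```

With $d = 10$, $k = 2$, and $q = 3$, the two scalings are $10^2$ and $10^6$, which illustrates the size of the gap between $d^k$ and $d^{kq}$ even at small dimension.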
