A universal compression theory for lottery ticket hypothesis and neural scaling laws
Hong-Yi Wang ⋅ Di Luo ⋅ Isaac Chuang ⋅ Tomaso Poggio ⋅ Liu Ziyin
Abstract
When training large-scale models, the performance typically scales with parameter count and dataset size via a slow power law. We show that comparable performance can be achieved with far smaller models and much less data. We prove that a generic permutation-invariant function of $d$ objects can be compressed into a function of $\operatorname{polylog} d$ objects with vanishing error, and that this rate is optimal. This yields two key implications: (Ia) a large neural network can be compressed to polylogarithmic width while preserving its learning dynamics; (Ib) a large dataset can be compressed to polylogarithmic size while leaving the model’s loss landscape unchanged. (Ia) establishes the \textit{dynamical} lottery ticket hypothesis: ordinary networks can be strongly compressed without changing learning dynamics or outcomes. (Ib) shows that a scaling law $L\sim d^{-\alpha}$ can be accelerated to arbitrarily fast power-law decay, and ultimately to $\exp\left(-\alpha' \sqrt[m]{d}\right)$.
Chat is not available.
Successful Page Load