
Poster in Workshop on Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference

Grokking Beyond the Euclidean Norm of Model Parameters

Tikeng Notsawo Pascal Junior · Guillaume Dumas · Guillaume Rabusseau


Abstract: Grokking refers to delayed generalization following overfitting when optimizing artificial neural networks with gradient-based methods. In this work, we demonstrate that grokking can be induced by regularization, either explicit or implicit. More precisely, we show that when there exists a model with a property $P$ (e.g., sparse or low-rank weights) that generalizes on the problem of interest, gradient descent with a small but non-zero regularization of $P$ (e.g., $\ell_1$ or nuclear norm regularization) results in grokking. This extends previous work showing that small non-zero weight decay induces grokking. Moreover, our analysis shows that over-parameterization by adding depth makes it possible to grok or ungrok without explicit regularization, which is impossible in shallow cases. We further show that the $\ell_2$ norm of the model parameters cannot replace the regularized property $P$ as a general indicator of grokking: the $\ell_2$ norm grows in many cases where no weight decay is used, yet the model generalizes anyway. We also show that grokking can be amplified through data selection alone, with all other hyperparameters fixed.
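
The effect described in the abstract can be illustrated with a toy experiment. The sketch below is not the authors' code: it trains a small MLP on modular addition (a standard grokking benchmark) with a small but non-zero $\ell_1$ penalty added to the cross-entropy loss. All concrete choices (the modulus, architecture, `lambda_l1`, learning rate, train/test split) are illustrative assumptions; with a small non-zero penalty, one would expect training accuracy to saturate long before test accuracy rises.

```python
# Minimal sketch (assumed setup, not the paper's code): modular addition
# a + b mod p with a small l1 penalty to induce grokking-like behavior.
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 97  # modulus for the toy task (assumption)
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
split = len(pairs) // 2  # 50% of pairs used for training (assumption)
train_idx, test_idx = perm[:split], perm[split:]

model = nn.Sequential(
    nn.Embedding(p, 64),          # shared embedding for both operands
    nn.Flatten(start_dim=1),      # (N, 2, 64) -> (N, 128)
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, p),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.0)
loss_fn = nn.CrossEntropyLoss()
lambda_l1 = 1e-4  # small but non-zero regularization strength (assumption)

for step in range(20000):
    logits = model(pairs[train_idx])
    l1 = sum(w.abs().sum() for w in model.parameters())
    loss = loss_fn(logits, labels[train_idx]) + lambda_l1 * l1
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            train_acc = (model(pairs[train_idx]).argmax(-1) == labels[train_idx]).float().mean()
            test_acc = (model(pairs[test_idx]).argmax(-1) == labels[test_idx]).float().mean()
        print(f"step {step}: train acc {train_acc:.2f}, test acc {test_acc:.2f}")
```

Setting `lambda_l1 = 0.0` in this sketch gives a baseline for comparing how the sparsity-promoting penalty changes the gap between memorization and generalization.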
