Poster
Grokking at the Edge of Numerical Stability
Lucas Prieto · Melih Barsbey · Pedro Mediano · Tolga Birdal
Hall 3 + Hall 2B #441
Abstract:
Grokking, or sudden generalization that occurs after prolonged overfitting, is a surprising phenomenon that has challenged our understanding of deep learning. While a lot of progress has been made in understanding grokking, it is still not clear why generalization is delayed and why grokking often does not happen without regularization. In this work we argue that without regularization, grokking tasks push models to the edge of numerical stability, introducing floating point errors in the Softmax that we refer to as _Softmax Collapse_ (SC). We show that SC prevents grokking and that mitigating SC leads to grokking _without_ regularization. Investigating the root cause of SC, we find that beyond the point of overfitting, the gradients strongly align with what we call the _naïve loss minimization_ (NLM) direction. This component of the gradient does not change the predictions of the model but decreases the loss by scaling the logits, usually through the scaling of the weights along their current direction. We show that this scaling of the logits explains the delay in generalization characteristic of grokking, and eventually leads to SC, stopping learning altogether. To validate these hypotheses, we introduce two key contributions that mitigate the issues faced in grokking tasks: (i) StableMax, a new activation function that prevents SC and enables grokking without regularization, and (ii) ⊥Grad, a training algorithm that leads to quick generalization in grokking tasks by preventing NLM altogether. These contributions provide new insights into grokking, shedding light on its delayed generalization, reliance on regularization, and the effectiveness of known grokking-inducing methods.
Live content is unavailable. Log in and register to view live content