Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory
Abstract
Empirical scaling laws for Transformers lack theoretical justification and cannot explain why smaller models sometimes outperform larger ones. We present a theoretical framework that connects Transformer attention to associative memory through three components: a distance-based energy function that models nearest-neighbor search, a majorization-minimization view of layer stacking, and an analysis of the cross-entropy loss. Our main result shows that, for memorizing well-separated patterns, the optimal model size satisfies N = O(D²), where N is the number of parameters and D is the dataset size. As a practical application, we present NeuralDB, which scales knowledge editing to 10,000 facts, an order-of-magnitude improvement over existing methods, while maintaining robust generalization. Together, these results position associative memory as a unified lens for understanding and improving large language models.
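To make the idea of a distance-based energy for nearest-neighbor search concrete, the following is a minimal sketch, not the paper's formulation. It assumes a log-sum-exp energy over squared Euclidean distances to the stored patterns and uses the familiar mean-shift-style update, in which each step replaces the query with a softmax-weighted average of the patterns. The function name `retrieve`, the inverse temperature `beta`, and the toy patterns are all hypothetical choices made for illustration.

```python
import math


def retrieve(query, patterns, beta=8.0, steps=3):
    """Pull `query` toward its nearest stored pattern (illustrative sketch).

    Assumed energy: E(x) = -(1/beta) * log(sum_k exp(-beta * ||x - p_k||^2)).
    Each step applies the softmax-weighted average of the patterns, which has
    the same form as an attention readout over the stored patterns.
    """
    x = list(query)
    for _ in range(steps):
        # Squared Euclidean distance from the current state to each pattern.
        d2 = [sum((p_i - x_i) ** 2 for p_i, x_i in zip(p, x)) for p in patterns]
        # Subtract the minimum distance before exponentiating for stability.
        d_min = min(d2)
        weights = [math.exp(-beta * (d - d_min)) for d in d2]
        total = sum(weights)
        # Weighted average of patterns: closer patterns dominate the result.
        x = [
            sum(w * p[i] for w, p in zip(weights, patterns)) / total
            for i in range(len(x))
        ]
    return x


# Toy, well-separated patterns and a noisy query near the first one.
patterns = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
recalled = retrieve([0.9, 0.2], patterns)
```

With well-separated patterns and a reasonably large `beta`, the weights concentrate on the closest pattern, so the recalled state lands near that pattern after a few steps. With a small `beta` or overlapping patterns, the update instead returns a blend of several patterns.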