Language Diffusion Models are Associative Memories
Abstract
Associative Memory (AM) systems reliably retrieve stored data points by establishing distinct basins of attraction around them. Although such systems have historically relied on explicit, well-defined energy functions, as in Hopfield networks, stable attractors can also form through conditional likelihood maximization, without any such function. Building on this observation, we demonstrate that Uniform-based Discrete Diffusion Models (UDDMs) behave like AMs through their use of conditional likelihood dynamics in both training and sampling. By evaluating token recovery, we identify a memorization-to-generalization phase transition governed by the size of the training dataset. With little training data, UDDMs exhibit near-perfect memorization, characterized by vanishing conditional entropy. As the training set grows, however, unseen test examples become stable attractors of the system and can be effectively denoised. This emergent capability marks the shift to generalization.
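
For readers unfamiliar with the energy-based setting that the abstract contrasts against, the following is a minimal sketch of a classical Hopfield network with Hebbian weights. It illustrates only what a basin of attraction means for a stored pattern; it is not the UDDM method studied here, and the function names (`train_hebbian`, `energy`, `retrieve`) and the two toy patterns are assumptions chosen for illustration.

```python
# Classical Hopfield network sketch (illustrative only; not the UDDM method).

def train_hebbian(patterns):
    """Build a symmetric weight matrix with zero diagonal via the Hebbian rule."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j]
    return weights


def energy(weights, state):
    """Hopfield energy E(s) = -1/2 * sum_ij W_ij * s_i * s_j."""
    n = len(state)
    return -0.5 * sum(
        weights[i][j] * state[i] * state[j] for i in range(n) for j in range(n)
    )


def retrieve(weights, state, max_sweeps=10):
    """Apply asynchronous sign updates until the state stops changing."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            # Local field on unit i (the diagonal weight is zero).
            field = sum(w * s for w, s in zip(weights[i], state))
            new_value = 1 if field >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break  # fixed point reached: the state sits in an attractor
    return state


# Two orthogonal toy patterns over {-1, +1} (hypothetical data).
stored = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
weights = train_hebbian(stored)

# Corrupt the first pattern by flipping its first entry, then retrieve.
probe = [-stored[0][0]] + stored[0][1:]
recovered = retrieve(weights, probe)

print("recovered stored pattern:", recovered == stored[0])
print("energy before / after:", energy(weights, probe), energy(weights, recovered))
```

In this toy setting the corrupted probe lies inside the basin of the first stored pattern, so the updates lower the energy and return that pattern. The abstract's claim is that UDDMs exhibit analogous attractor behavior without an explicit energy function of this kind.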