Learning When to Be Sparse: Adaptive Activations via Two-Parameter Entropy
Abstract
The softmax operator, while foundational to modern machine learning, arises from Shannon entropy regularization, an assumption rooted in classical statistical mechanics that breaks down for systems with long-range correlations, power-law tails, or fractal structure. Such non-extensive regimes are common in practice: real-world datasets often exhibit Zipfian class frequencies, under which classical entropy misallocates probability mass. Sparse alternatives such as α-entmax address this issue via Tsallis entropy, but they rigidly tie sparsity to a single parameter. We introduce SharMiX, a two-parameter activation based on Sharma-Mittal entropy that unifies the Shannon, Rényi, and Tsallis families. We derive closed-form, Lipschitz-continuous Jacobians for the activation outputs with respect to both the input logits and the entropy parameters (q, r), enabling end-to-end learning via implicit differentiation. This allows SharMiX to dynamically adapt to the statistical properties of the data, becoming sparse for heavy-tailed, non-extensive distributions and dense for balanced, extensive ones. Experiments on text classification, CIFAR-100, and ImageNet-1k demonstrate that SharMiX automatically navigates the accuracy-sparsity trade-off, successfully adapting to the underlying class-frequency distribution.
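
To make the unification claim concrete, the following minimal sketch evaluates the Sharma-Mittal entropy in its standard textbook form, S_{q,r}(p) = [(Σ_i p_i^q)^((1-r)/(1-q)) - 1] / (1 - r), and shows how the Tsallis (r = q), Rényi (r → 1), and Shannon (q, r → 1) entropies arise as special cases. This form is an assumption: the parameterization used by SharMiX may differ, and the function name `sharma_mittal_entropy` and the tolerance `eps` are hypothetical, introduced only for illustration. The sketch covers the entropy only, not the SharMiX activation or its Jacobians.

```python
import numpy as np

def sharma_mittal_entropy(p, q, r, eps=1e-8):
    """Sharma-Mittal entropy of a probability vector p (natural log).

    Assumes the standard textbook form; hypothetical helper, not the paper's code.
    """
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # zero-probability entries contribute nothing
    shannon = -np.sum(p * np.log(p))
    q_is_1 = abs(q - 1.0) < eps
    r_is_1 = abs(r - 1.0) < eps
    if q_is_1 and r_is_1:
        return shannon                  # q, r -> 1: Shannon entropy
    if q_is_1:
        # q -> 1 limit with r fixed: (exp((1 - r) * H) - 1) / (1 - r)
        return (np.exp((1.0 - r) * shannon) - 1.0) / (1.0 - r)
    power_sum = np.sum(p ** q)
    if r_is_1:
        return np.log(power_sum) / (1.0 - q)   # r -> 1: Renyi entropy of order q
    # General case; setting r = q reduces this to the Tsallis entropy.
    return (power_sum ** ((1.0 - r) / (1.0 - q)) - 1.0) / (1.0 - r)

p = [0.5, 0.25, 0.25]
tsallis_2 = sharma_mittal_entropy(p, q=2.0, r=2.0)   # Tsallis, q = 2
renyi_2 = sharma_mittal_entropy(p, q=2.0, r=1.0)     # Renyi, order 2
shannon = sharma_mittal_entropy(p, q=1.0, r=1.0)     # Shannon
```

Moving (q, r) continuously between these special cases is what lets a two-parameter family interpolate between regimes that a single-parameter Tsallis family cannot separate.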