Poster in Workshop: Scientific Methods for Understanding Deep Learning (Sci4DL)

Learning When to Be Sparse: Adaptive Activations via Two-Parameter Entropy

Roman Rudamenko ⋅ Dmitry Abulkhanov ⋅ Konstantin Semenov ⋅ Michael Diskin ⋅ Alexander Savchenko

Abstract

The softmax operator, while foundational to modern machine learning, arises from Shannon entropy regularization, an assumption rooted in classical statistical mechanics that breaks down for systems with long-range correlations, power-law tails, or fractal structure. Such non-extensive regimes are common in practice: real-world datasets often exhibit Zipfian class frequencies, under which classical entropy misallocates probability mass. Sparse alternatives such as α-entmax address this issue via Tsallis entropy, but they rigidly tie sparsity to a single parameter. We introduce SharMiX, a two-parameter activation based on Sharma-Mittal entropy that unifies the Shannon, Rényi, and Tsallis families. We derive closed-form, Lipschitz-continuous Jacobians for the activation outputs with respect to both the input logits and the entropy parameters (q, r), enabling end-to-end learning via implicit differentiation. This allows SharMiX to dynamically adapt to the statistical properties of the data, becoming sparse for heavy-tailed, non-extensive distributions and dense for balanced, extensive ones. Experiments on text classification, CIFAR-100, and ImageNet-1k demonstrate that SharMiX automatically navigates the accuracy-sparsity trade-off, successfully adapting to the underlying class-frequency distribution.
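
The claim that Sharma-Mittal entropy unifies the Shannon, Rényi, and Tsallis families can be checked numerically. The sketch below is not the SharMiX activation and is not taken from the paper; it assumes the textbook definition H_{q,r}(p) = [(Σ p_i^q)^((1-r)/(1-q)) - 1] / (1 - r), which may differ from the authors' parameterization. All function names are hypothetical, chosen for illustration only.

```python
import numpy as np


def sharma_mittal_entropy(p, q, r):
    """Sharma-Mittal entropy (textbook form, assumed here) for q != 1 and r != 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # zero-probability entries contribute nothing to the sum
    power_sum = np.sum(p ** q)
    return (power_sum ** ((1.0 - r) / (1.0 - q)) - 1.0) / (1.0 - r)


def shannon_entropy(p):
    """Shannon entropy in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))


def renyi_entropy(p, q):
    """Renyi entropy of order q (q != 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return np.log(np.sum(p ** q)) / (1.0 - q)


def tsallis_entropy(p, q):
    """Tsallis entropy of order q (q != 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return (np.sum(p ** q) - 1.0) / (1.0 - q)


# An arbitrary skewed distribution, used only as a test input.
p = np.array([0.5, 0.25, 0.125, 0.125])
eps = 1e-6  # small offset standing in for the limits r -> 1 and q -> 1

# r = q recovers Tsallis exactly (the outer exponent becomes 1).
print(sharma_mittal_entropy(p, q=2.0, r=2.0), tsallis_entropy(p, q=2.0))

# r -> 1 approaches Renyi.
print(sharma_mittal_entropy(p, q=2.0, r=1.0 + eps), renyi_entropy(p, q=2.0))

# q -> 1 and r -> 1 together approach Shannon.
print(sharma_mittal_entropy(p, q=1.0 + eps, r=1.0 + eps), shannon_entropy(p))
```

Each printed pair should agree up to the small error introduced by `eps`. Because q and r move independently, the two-parameter family can reach combinations that a single-parameter Tsallis regularizer cannot, which is the flexibility the abstract attributes to SharMiX.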
