Soft Gates for Sharp Experts in Tabular Representation Learning
Abstract
Neural networks consistently underperform gradient-boosted trees on tabular data, yet the structural reasons remain poorly understood. We design the Sparse Feature Routing Network (SFR Net) not as a benchmark entry but as an experimental apparatus for testing three hypotheses about tabular inductive biases: (H1) per-feature experts improve over shared encoders even with fewer parameters, with gains amplified by instance-wise routing; (H2) instance-wise sparsity helps only when it is differentiable, whereas hard gating collapses optimization; (H3) the learned routing produces faithful attributions, as confirmed by deletion tests against random baselines. The most striking finding is that hard sparsity degrades accuracy below the dense baseline, while entropy-regularized softmax gating achieves both extreme sparsity (2.9 effective features out of 14) and the highest accuracy: soft gates produce sharp experts; hard gates produce dead ones. Controlled ablations and generalization experiments across 13 benchmarks and 12 baselines yield testable design principles for tabular architectures.
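To make the contrast between soft and hard gating concrete, the following is a minimal NumPy sketch and not the SFR Net implementation. It assumes that the "effective features" count is the exponential of the gate entropy (a perplexity-style measure), that the entropy regularizer is added to the task loss with a weight `lam`, and that hard gating means keeping the top `k` features per instance. All function names, `lam`, and `k` are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the feature axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gate_entropy(weights, eps=1e-12):
    # Shannon entropy (in nats) of each instance's routing distribution.
    return -(weights * np.log(weights + eps)).sum(axis=-1)

def effective_features(weights):
    # Assumed definition: exp(entropy). Uniform weights over n features
    # give n; a one-hot gate gives 1.
    return np.exp(gate_entropy(weights))

def hard_top_k(logits, k):
    # Hard gate for comparison: uniform weight on the k largest logits per
    # instance, zero elsewhere. The selection step is not differentiable.
    idx = np.argsort(logits, axis=-1)[:, -k:]
    mask = np.zeros_like(logits)
    np.put_along_axis(mask, idx, 1.0, axis=-1)
    return mask / k

def route(expert_outputs, weights):
    # expert_outputs: (batch, n_features, d), one embedding per feature expert.
    # weights: (batch, n_features). Returns the gated mixture, shape (batch, d).
    return np.einsum("bf,bfd->bd", weights, expert_outputs)

def regularized_loss(task_loss, weights, lam):
    # Assumed form of the objective: penalizing mean gate entropy pushes the
    # soft gate toward sparse routing while keeping every step differentiable.
    return task_loss + lam * gate_entropy(weights).mean()
```

Under these assumptions, a soft gate can concentrate nearly all of its mass on a few features while every expert still receives a nonzero weight, and therefore a gradient. A hard top-k gate assigns exactly zero weight to unselected experts, which is one plausible reading of why they end up "dead".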