Knowing When to Quit: Probabilistic Early Exits for Speech Separation Networks
Abstract
In recent years, deep learning-based single-channel speech separation has improved considerably, driven in large part by increasingly compute- and parameter-efficient neural network architectures. Most such architectures are, however, designed with a fixed compute and parameter budget and consequently cannot scale to varying compute demands or resources, which limits their use in embedded and heterogeneous devices such as mobile phones and hearables. To enable such use cases, we design a neural network architecture for speech separation and enhancement that is capable of early exit, and we propose an uncertainty-aware probabilistic framework that jointly models the clean speech signal and the error variance, from which we derive probabilistic early-exit conditions expressed in terms of desired signal-to-noise ratios. We evaluate our methods on both speech separation and enhancement tasks and demonstrate that early-exit capabilities can be introduced without compromising reconstruction quality, and that our early-exit conditions are well-calibrated on training data and can easily be post-calibrated on validation data, leading to large energy savings over single-exit baselines. Our framework enables fine-grained dynamic compute scaling of neural networks while achieving state-of-the-art performance with interpretable exit conditions.
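The early-exit idea summarized above can be illustrated with a minimal sketch: assuming each exit branch predicts the clean signal together with a per-sample error variance (a Gaussian error model), an expected signal-to-noise ratio can be estimated from these two quantities and compared against a desired target. The function names, the Gaussian assumption, and the synthetic numbers below are illustrative stand-ins, not the paper's implementation.

```python
# Illustrative sketch of an SNR-based early-exit decision (not the paper's exact method).
import numpy as np

def estimate_snr_db(pred_mean: np.ndarray, pred_var: np.ndarray) -> float:
    """Estimate the expected SNR (dB) of a predicted clean-speech signal.

    pred_mean: predicted clean signal, shape (num_samples,)
    pred_var:  predicted per-sample error variance, shape (num_samples,)
    Under a Gaussian error model, the expected residual energy is the
    sum of the predicted variances.
    """
    signal_energy = float(np.sum(pred_mean ** 2))
    expected_error_energy = float(np.sum(pred_var)) + 1e-12  # guard against division by zero
    return 10.0 * np.log10(signal_energy / expected_error_energy)

def should_exit(pred_mean: np.ndarray, pred_var: np.ndarray, target_snr_db: float) -> bool:
    """Exit at the current branch if the estimated SNR meets the desired target."""
    return estimate_snr_db(pred_mean, pred_var) >= target_snr_db

if __name__ == "__main__":
    # Hypothetical usage: walk through exit branches of increasing depth and
    # stop at the first one whose estimated SNR reaches the 15 dB target.
    rng = np.random.default_rng(0)
    target_snr_db = 15.0
    for depth in range(1, 9):
        pred_mean = rng.standard_normal(16000)           # stand-in for a 1 s estimate at 16 kHz
        pred_var = np.full(16000, 10 ** (-depth / 2))    # variance shrinking with depth (synthetic)
        if should_exit(pred_mean, pred_var, target_snr_db):
            print(f"exit at depth {depth}, estimated SNR = "
                  f"{estimate_snr_db(pred_mean, pred_var):.1f} dB")
            break
```

In this toy setup the predicted variance shrinks with depth, so the loop exits as soon as the estimated SNR crosses the target rather than always running the full network; the actual calibration of such conditions is what the paper evaluates on training and validation data.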