Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

David Harwath, Wei-Ning Hsu, James Glass

Keywords: nlp, quantization, representation learning, robustness, self-supervised learning

Monday: Signals and Systems

Abstract: In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that our method is capable of capturing both word-level and sub-word units, depending on how it is configured. What differentiates this paper from prior work on speech unit learning is the choice of training objective. Rather than using a reconstruction-based loss, we use a discriminative, multimodal grounding objective which forces the learned units to be useful for semantic image retrieval. We evaluate the sub-word units on the ZeroSpeech 2019 challenge, achieving a 27.3% reduction in ABX error rate over the top-performing submission, while keeping the bitrate approximately the same. We also present experiments demonstrating the noise robustness of these units. Finally, we show that a model with multiple quantizers can simultaneously learn phone-like detectors at a lower layer and word-like detectors at a higher layer. We show that these detectors are highly accurate, discovering 279 words with an F1 score of greater than 0.5.
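The abstract references two mechanisms worth sketching: a vector quantization layer inserted into the speech encoder, and a discriminative retrieval objective used in place of a reconstruction loss. Below is a minimal PyTorch sketch of both, assuming a VQ-VAE-style quantizer with a straight-through estimator and a shifted-impostor triplet margin loss. It illustrates the general technique only, not the paper's implementation; every name in it (VectorQuantizer, triplet_retrieval_loss, num_codes, beta, margin) is an assumption.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VectorQuantizer(nn.Module):
        # VQ-VAE-style quantization layer (illustrative, not the paper's code).
        # Snaps each frame-level feature to its nearest codebook entry and
        # copies gradients through with the straight-through estimator.
        def __init__(self, num_codes=1024, dim=256, beta=0.25):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)
            nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)
            self.beta = beta

        def forward(self, z):                          # z: (batch, time, dim)
            flat = z.reshape(-1, z.size(-1))
            # Squared L2 distance from every frame to every code word.
            dists = (flat.pow(2).sum(1, keepdim=True)
                     - 2.0 * flat @ self.codebook.weight.t()
                     + self.codebook.weight.pow(2).sum(1))
            idx = dists.argmin(dim=1)                  # discrete unit ids
            quantized = self.codebook(idx).view_as(z)
            # Codebook loss pulls codes toward encoder outputs; the beta-weighted
            # commitment loss keeps encoder outputs near their chosen codes.
            vq_loss = (F.mse_loss(quantized, z.detach())
                       + self.beta * F.mse_loss(z, quantized.detach()))
            quantized = z + (quantized - z).detach()   # straight-through copy
            return quantized, idx.view(z.shape[:-1]), vq_loss

    def triplet_retrieval_loss(image_emb, audio_emb, margin=1.0):
        # Discriminative grounding objective (illustrative): each matched
        # image/speech pair must outscore mismatched pairs from the batch,
        # with impostors drawn by shifting the batch by one position.
        sim = image_emb @ audio_emb.t()                # (batch, batch) dot products
        pos = sim.diagonal()
        imp_audio = sim.roll(shifts=1, dims=1).diagonal()  # image vs. wrong caption
        imp_image = sim.roll(shifts=1, dims=0).diagonal()  # caption vs. wrong image
        return (F.relu(margin - pos + imp_audio)
                + F.relu(margin - pos + imp_image)).mean()

    # Example: quantize a batch of 4 utterances, 100 frames each.
    vq = VectorQuantizer()
    feats = torch.randn(4, 100, 256)
    quantized, units, vq_loss = vq(feats)

In a hierarchical configuration of the kind the abstract describes, one such quantizer would follow a lower encoder block (yielding phone-like codes) and a second would follow a higher block (yielding word-like codes), with their vq_loss terms added to the retrieval loss during training.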
