

Poster

VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words?

Xize Cheng · Ruofan Hu · Xiaoda Yang · Jingyu Lu · Dongjie Fu · zehan wang · Shengpeng Ji · Rongjie Huang · Boyang Zhang · Tao Jin · Zhou Zhao

Hall 3 + Hall 2B #633
Thu 24 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

With the rapid advancement of large models, voice assistants are gradually acquiring the ability to engage in open-ended daily conversations with humans. However, current spoken dialogue systems often overlook multi-modal information in audio beyond text, such as speech rate, volume, emphasis, and background sounds. Relying solely on Automatic Speech Recognition (ASR) can discard valuable auditory cues, weakening the system's ability to generate contextually appropriate responses. To address this limitation, we propose VoxDialogue, a comprehensive benchmark for evaluating the ability of spoken dialogue systems to understand multi-modal information beyond text. Specifically, we identify 12 attributes highly correlated with acoustic information beyond words and carefully design a spoken dialogue test set for each attribute, totaling 4.5K multi-turn spoken dialogue samples. Finally, we evaluate several existing spoken dialogue models on the 12 attribute subsets of VoxDialogue. Experiments show that, in spoken dialogue scenarios, many acoustic cues cannot be conveyed through text and must be interpreted directly from the audio input; conversely, while direct spoken dialogue systems excel at processing acoustic signals, they still struggle with complex dialogue tasks due to their limited context-understanding capabilities. All data and code will be open-sourced at https://voxdialogue.github.io/.
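As an illustration of how such a per-attribute benchmark could be consumed, the minimal Python sketch below iterates over attribute subsets and aggregates a score for each. The file layout, field names (`dialogue`, `reference`), attribute list, and the `respond`/`score` stand-ins are assumptions for illustration only; the actual format will be whatever the official release at https://voxdialogue.github.io/ provides.

```python
# Hypothetical evaluation loop over per-attribute subsets of a VoxDialogue-style
# benchmark. All paths, field names, and helper functions are illustrative
# assumptions, not the official API.
import json
from pathlib import Path

ATTRIBUTES = [  # placeholder names; the paper defines 12 acoustic attributes
    "speech_rate", "volume", "emphasis", "background_sound",
]


def respond(dialogue):
    """Stand-in for the spoken dialogue system under test."""
    return "placeholder response"


def score(response, reference):
    """Stand-in metric; the benchmark would define its own scoring."""
    return float(response == reference)


def evaluate(root="voxdialogue"):
    results = {}
    for attr in ATTRIBUTES:
        subset = Path(root) / f"{attr}.jsonl"
        if not subset.exists():
            continue
        scores = []
        with subset.open() as f:
            for line in f:
                sample = json.loads(line)  # assumed fields: "dialogue", "reference"
                scores.append(score(respond(sample["dialogue"]), sample["reference"]))
        results[attr] = sum(scores) / len(scores) if scores else None
    return results


if __name__ == "__main__":
    print(evaluate())
```

Reporting one score per attribute subset, rather than a single aggregate, mirrors the paper's per-attribute analysis of how well each system captures information beyond words.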
