Poster
Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks
Chien-yu Huang · Wei-Chih Chen · Shu-wen Yang · Andy T. Liu · Chen-An Li · Yu-Xiang Lin · Wei-Cheng Tseng · Anuj Diwan · Yi-Jen Shih · Jiatong Shi · William Chen · Xuanjun Chen · Chi-Yuan Hsiao · Puyuan Peng · Shih-Heng Wang · Chun-Yi Kuan · Ke-Han Lu · Kai-Wei Chang · Chih-Kai Yang · Fabian Ritter-Gutierrez · Kuan-Po Huang · Siddhant Arora · You-Kuan Lin · Chuang To · Eunjung Yeo · Kalvin Chang · Chung-Ming Chien · Kwanghee Choi · Cheng-Hsiu Hsieh · Yi-Cheng Lin · Chee-En Yu · I-Hsiang Chiu · Heitor Rodrigues Guimarães · Jionghao Han · Tzu-Quan Lin · Tzu-Yuan Lin · Homu Chang · Ting-Wu Chang · Chun Chen · Shou-Jen Chen · Yu-Hua Chen · Hsi-Chun Cheng · Kunal Dhawan · Jia-Lin Fang · Shi-Xin Fang · Kuan Chiang · Chi-An Fu · Hsien-Fu Hsiao · Ching Hsu · Shao-Syuan Huang · Lee Wei · Hsi-Che Lin · Hsuan-Hao Lin · Hsuan-Ting Lin · Jian-Ren Lin · Ting-Chun Liu · Li-Chun Lu · Tsung-Min Pai · Ankita Pasad · Shih-Yun Kuan · Suwon Shon · Yuxun Tang · Yun-Shao Tsai · Wei Chiang · Tzu-Chieh Wei · Chengxi Wu · Dien-Ruei Wu · Chao-Han Yang · Chieh-Chi Yang · Jia Qi Yip · Shao-Xiang Yuan · Haibin Wu · Karen Livescu · David Harwath · Shinji Watanabe · Hung-yi Lee
Multimodal foundation models such as Gemini and GPT-4 have revolutionized human-machine interaction by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks and making it the largest benchmark for speech and audio evaluation to date. While the first generation of Dynamic-SUPERB was limited to classification tasks, Phase-2 broadens its evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results indicate that no model performed well universally: SALMONN-13B excelled at English ASR, and WavLLM achieved high accuracy on emotion recognition, but current models still require further innovation to handle a broader range of tasks. We open-source all task data and the evaluation pipeline, which will be released upon publication of the paper.
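The abstract does not describe the pipeline's interface, so as a rough illustration only, here is a minimal Python sketch of what an instruction-based evaluation loop over such tasks might look like. TaskInstance, exact_match, evaluate, and model.generate are hypothetical names invented for this example, not the actual Dynamic-SUPERB Phase-2 API or task schema.

    # Hypothetical sketch of instruction-based evaluation in the style the
    # abstract describes (instruction + audio -> model response -> scoring).
    # All names below are illustrative assumptions, not the real codebase.
    from dataclasses import dataclass

    @dataclass
    class TaskInstance:
        instruction: str   # natural-language prompt, e.g. "Transcribe the speech."
        audio_path: str    # path to the input waveform
        reference: str     # ground-truth answer used for scoring

    def exact_match(prediction: str, reference: str) -> float:
        """Simplest classification-style metric; real tasks may instead use
        WER (sequence generation) or MSE (regression)."""
        return float(prediction.strip().lower() == reference.strip().lower())

    def evaluate(model, instances: list[TaskInstance]) -> float:
        """Average score of one model over one task's instances."""
        scores = []
        for inst in instances:
            # A universal spoken language model consumes both the instruction
            # text and the raw audio, and returns a free-form text response.
            prediction = model.generate(instruction=inst.instruction,
                                        audio=inst.audio_path)
            scores.append(exact_match(prediction, inst.reference))
        return sum(scores) / len(scores)

Because every task is posed through a natural-language instruction, the same loop covers classification, regression, and sequence-generation tasks; only the metric swapped into evaluate changes per task.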