Poster

Aligned Better, Listen Better For Audio-Visual Large Language Models

Yuxin Guo · Shuailei Ma · Shijie Ma · Xiaoyi Bao · Chen-Wei Xie · Kecheng Zheng · Tingyu Weng · Siyang Sun · Yun Zheng · Wei Zou

Hall 3 + Hall 2B #559
Thu 24 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract:

Audio is essential for multimodal video understanding. Video inherently contains audio, which supplies information complementary to the visual modality; moreover, video large language models (Video-LLMs) frequently encounter audio-centric settings. However, existing Video-LLMs and Audio-Visual Large Language Models (AV-LLMs) fall short in exploiting audio information, leading to weak understanding and hallucinations. To address these issues, we examine both the model architecture and the data. (1) From the architectural perspective, we propose a fine-grained AV-LLM, named Dolphin. Concurrently aligning the audio and visual modalities in both the temporal and spatial dimensions ensures a comprehensive and accurate understanding of videos. Specifically, we devise an audio-visual multi-scale adapter that aggregates multi-scale information to achieve spatial alignment, and we propose audio-visual interleaved merging for temporal alignment. (2) From the data perspective, we curate an audio-visual caption and instruction-tuning dataset, called AVU. It comprises 5.2 million diverse, open-ended data tuples (video, audio, question, answer) and introduces a novel data partitioning strategy. Extensive experiments show that our model not only achieves remarkable performance in audio-visual understanding but also mitigates hallucinations. Our code and dataset will be made publicly available.
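
To make the two alignment ideas concrete, here is a minimal sketch of how multi-scale spatial aggregation and audio-visual interleaved merging could be realized. The function names, token shapes, and pooling choices are illustrative assumptions based only on the abstract, not the paper's actual implementation.

```python
# Illustrative sketch only: shapes, pooling scales, and function names are
# assumptions, not Dolphin's actual architecture.
import torch
import torch.nn.functional as F

def multiscale_aggregate(visual_tokens, scales=(1, 2, 4)):
    """Pool per-frame visual patch tokens at several spatial scales and concatenate.

    visual_tokens: (T, H*W, D) patch tokens per frame (square grid assumed).
    Returns: (T, sum(s*s for s in scales), D)
    """
    T, N, D = visual_tokens.shape
    side = int(N ** 0.5)
    grid = visual_tokens.transpose(1, 2).reshape(T, D, side, side)
    pooled = []
    for s in scales:
        p = F.adaptive_avg_pool2d(grid, s)           # (T, D, s, s)
        pooled.append(p.flatten(2).transpose(1, 2))  # (T, s*s, D)
    return torch.cat(pooled, dim=1)

def interleave_av(visual_tokens, audio_tokens):
    """Merge tokens in temporal order: [v_1, a_1, v_2, a_2, ...].

    visual_tokens: (T, Nv, D); audio_tokens: (T, Na, D), one audio segment per frame.
    Returns: (T * (Nv + Na), D) token sequence fed to the LLM.
    """
    T, Nv, D = visual_tokens.shape
    Na = audio_tokens.shape[1]
    merged = torch.cat([visual_tokens, audio_tokens], dim=1)  # (T, Nv+Na, D)
    return merged.reshape(T * (Nv + Na), D)

# Toy usage: 8 frames, 16x16 visual patches, 32 audio tokens per segment, dim 768.
v = torch.randn(8, 256, 768)
a = torch.randn(8, 32, 768)
tokens = interleave_av(multiscale_aggregate(v), a)
print(tokens.shape)  # (8 * (1 + 4 + 16 + 32), 768) -> torch.Size([424, 768])
```

The point of the interleaving step is that audio and visual tokens from the same time step stay adjacent in the LLM input, rather than all audio being appended after all video.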
