OmniCVR: A Benchmark for Omni-Composed Video Retrieval with Vision, Audio, and Text
Abstract
Composed video retrieval presents a complex challenge: retrieving a target video based on a source video and a textual modification instruction, a task that demands fine-grained reasoning over multimodal transformations. However, existing benchmarks predominantly focus on vision–text alignment, largely overlooking the rich semantic signals embedded in audio, such as speech, music, and environmental sounds, which are often decisive for comprehensive video understanding. To bridge this gap, we introduce OmniCVR, a large-scale benchmark for omni-composed video retrieval that establishes vision, audio, and text as first-class modalities. OmniCVR is constructed via a scalable, automated pipeline integrating content-aware segmentation, omni-modal annotation, and a rigorous dual-validation protocol involving both large language models and human experts. The benchmark comprises vision-centric, audio-centric, and integrated queries, with integrated queries forming the majority to reflect real-world multimodal complexity. Furthermore, we propose AudioVLM2Vec, an audio-aware extension of VLM2Vec. By incorporating explicit audio semantics, AudioVLM2Vec achieves state-of-the-art performance, exposing fundamental limitations in the audio reasoning capabilities of current multimodal retrieval systems.