Poster

Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach

Zechen Bai · Tianjun Xiao · Tong He · Pichao Wang · Zheng Zhang · Thomas Brox · Mike Zheng Shou

Hall 3 + Hall 2B #92
Fri 25 Apr, midnight – 2:30 a.m. PDT

Abstract:

As online video content rapidly grows, the task of text-video retrieval (TVR) becomes increasingly important. A key challenge in TVR is the information asymmetry between video and text: videos are inherently richer in information, while their textual descriptions often capture only fragments of this complexity. This paper introduces a novel, data-centric framework to bridge this gap by enriching textual representations to better match the richness of video content. During training, videos are segmented into event-level clips and captioned to ensure comprehensive coverage. During retrieval, a large language model (LLM) generates semantically diverse queries to capture a broader range of possible matches. To enhance retrieval efficiency, we propose a query selection mechanism that identifies the most relevant and diverse queries, reducing computational cost while improving accuracy. Our method achieves state-of-the-art results across multiple benchmarks, demonstrating the power of data-centric approaches in addressing information asymmetry in TVR. This work paves the way for new research focused on leveraging data to improve cross-modal retrieval.
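The abstract does not spell out how the query selection mechanism trades off relevance against diversity. One plausible reading is a greedy, maximal-marginal-relevance-style heuristic over embedded candidate queries; the sketch below illustrates that idea only. The function name, the `lam` trade-off weight, and the MMR formulation itself are illustrative assumptions, not the authors' published method.

```python
# Hypothetical sketch of relevance-plus-diversity query selection
# (MMR-style greedy heuristic); not the paper's actual algorithm.
import numpy as np


def select_queries(query_embs: np.ndarray,
                   anchor_emb: np.ndarray,
                   k: int = 3,
                   lam: float = 0.7) -> list[int]:
    """Pick k candidate-query indices balancing relevance and diversity.

    query_embs: (n, d) L2-normalized embeddings of LLM-generated queries.
    anchor_emb: (d,)  L2-normalized embedding of the original text query.
    lam:        weight on relevance; (1 - lam) penalizes redundancy.
    """
    relevance = query_embs @ anchor_emb  # cosine similarity to the anchor
    selected: list[int] = []
    candidates = list(range(len(query_embs)))

    while candidates and len(selected) < k:
        if not selected:
            # First pick: most relevant query overall.
            scores = {i: float(relevance[i]) for i in candidates}
        else:
            # Later picks: relevance minus similarity to queries chosen so far.
            chosen = query_embs[selected]  # (|selected|, d)
            scores = {
                i: lam * float(relevance[i])
                   - (1 - lam) * float((query_embs[i] @ chosen.T).max())
                for i in candidates
            }
        best = max(scores, key=scores.get)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Under this reading, retrieval would score the video index against each selected query and aggregate the results, so keeping `k` small is what yields the claimed reduction in computational cost.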
