Poster
CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
Guo Chen · Yicheng Liu · Yifei Huang · Baoqi Pei · Jilan Xu · Yuping He · Tong Lu · Yali Wang · Limin Wang
Hall 3 + Hall 2B #614
Existing video understanding benchmarks for multimodal large language models (MLLMs) mainly focus on short videos. The few benchmarks for long video understanding often rely on multiple-choice questions (MCQs). Because of the limitations of MCQ evaluation and the advanced reasoning abilities of MLLMs, models can often answer correctly by combining insights from short video segments with answer elimination, without truly understanding the content. To bridge this gap, we introduce CG-Bench, a benchmark for clue-grounded question answering in long videos. CG-Bench emphasizes a model's ability to retrieve relevant clues, enhancing evaluation credibility. It includes 1,219 manually curated videos organized into 14 primary, 171 secondary, and 638 tertiary categories, making it the largest benchmark for long video analysis. The dataset features 12,129 QA pairs in three question types: perception, reasoning, and hallucination. To address the limitations of MCQ-based evaluation, we develop two novel clue-based methods: clue-grounded white-box and black-box evaluations, which assess whether models generate answers based on accurate video understanding. We evaluated multiple closed-source and open-source MLLMs on CG-Bench. The results show that current models struggle significantly with long videos compared to short ones, and that there is a notable gap between open-source and commercial models. We hope CG-Bench will drive the development of more reliable and capable MLLMs for long video comprehension.
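As a rough illustration of what clue-grounded (white-box) scoring can look like, the sketch below counts an answer as credible only if it is correct and its cited clue interval overlaps the annotated one. This is a minimal sketch, not the paper's protocol: the field names (`clue_start`, `clue_end`, `answer`) and the IoU threshold are assumptions for demonstration only.

```python
# Illustrative sketch: the exact CG-Bench metrics and data schema are not
# specified in this abstract. Field names and the 0.5 IoU threshold are
# hypothetical choices for demonstration.

def temporal_iou(pred_start: float, pred_end: float,
                 gt_start: float, gt_end: float) -> float:
    """Intersection-over-union of two time intervals (in seconds)."""
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0

def clue_grounded_correct(prediction: dict, annotation: dict,
                          iou_threshold: float = 0.5) -> bool:
    """Accept an answer only if it is correct AND grounded in a clue
    interval that sufficiently overlaps the annotated clue interval."""
    correct = prediction["answer"] == annotation["answer"]
    grounded = temporal_iou(prediction["clue_start"], prediction["clue_end"],
                            annotation["clue_start"], annotation["clue_end"]) >= iou_threshold
    return correct and grounded

# Example usage with made-up values:
pred = {"answer": "B", "clue_start": 1020.0, "clue_end": 1045.0}
gt = {"answer": "B", "clue_start": 1015.0, "clue_end": 1040.0}
print(clue_grounded_correct(pred, gt))  # True: correct answer, overlapping clue
```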