ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
Abstract
Understanding long videos requires Multimodal Large Language Models (MLLMs) to grasp multi-timescale information, which is often organized hierarchically. However, current long-video understanding benchmarks either overlook multi-timescale design or distribute questions targeting different timescales across different videos. This approach entangles timescale with video content, hindering a clear assessment of MLLM performance across timescales. To address this, we introduce ScaleLong, the first benchmark to disentangle these factors by embedding questions targeting four hierarchical timescales\textemdash clip (seconds), shot (tens of seconds), event (minutes), and story (hours)\textemdash all within the same video content. This ``within-content'' multi-timescale questioning design enables direct comparison of model performance across timescales on identical videos. ScaleLong comprises 269 videos (avg. 86 min) spanning 5 main categories and 36 sub-categories, each annotated with 4–8 carefully designed questions, including at least one question targeting every timescale. Evaluating 22 MLLMs reveals a distinct U-shaped performance trend: higher accuracy at the shortest (clip) and longest (story) timescales, with a dip at the intermediate shot and event timescales. Furthermore, ablation studies demonstrate that increased visual token capacity consistently enhances reasoning across all timescales. ScaleLong thus offers a fine-grained, multi-timescale benchmark for advancing MLLM capabilities in long-video understanding. The code and dataset are available at \url{https://anonymous.4open.science/r/ScaleLong-7717}.