RIVER: A Real-Time Interaction Benchmark for Video LLMs
Abstract
Multimodal large language models (MLLMs) have demonstrated impressive capabilities, yet nearly all operate in an offline paradigm, which hinders real-time interactivity. To address this gap, we introduce the Real-tIme intERaction Benchmark for Video LLMs (RIVER Bench), designed to evaluate models' ability to interact with humans in real time while perceiving streaming video. RIVER Bench introduces a novel evaluation framework comprising Retrospective Memory, Live Perception, and Proactive Response tasks, closely mimicking interactive dialogue with humans rather than understanding an entire video at once. We collect videos from diverse sources and of varying lengths, annotate them in detail, and precisely define the real-time interactive format. Evaluations across various model categories reveal that while offline models perform well on single question-answering tasks, they struggle with real-time processing. To address the limitations of existing models in the online interaction paradigm, especially their deficiencies in long-term memory and future perception, we propose a general improvement method that enhances models' flexibility in real-time interaction. We believe this work will significantly advance the development of real-time interactive video understanding models and inspire future research in this emerging field. The code and data will be released.