Skip to yearly menu bar Skip to main content


Poster

Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge

Haomiao Xiong · Zongxin Yang · Jiazuo Yu · Yunzhi Zhuge · Lu Zhang · Jiawen Zhu · Huchuan Lu

Hall 3 + Hall 2B #229
[ ]
Thu 24 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract:

Recent advances in Large Language Models (LLMs) have enabled the development of Video-LLMs, advancing multimodal learning by bridging video data with language tasks. However, current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios. To address these issues, we propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction. StreamChat leverages a novel hierarchical memory system to efficiently process and compress video features over extended sequences, enabling real-time, multi-turn dialogue. Our framework incorporates a parallel system scheduling strategy that enhances processing speed and reduces latency, ensuring robust performance in real-world applications. Furthermore, we introduce StreamBench, a versatile benchmark that evaluates streaming video understanding across diverse media types and interactive scenarios, including multi-turn interactions and complex reasoning tasks. Extensive evaluations on StreamBench and other public benchmarks demonstrate that StreamChat significantly outperforms existingstate-of-the-art models in terms of accuracy and response times, confirming its effectiveness for streaming video understanding. Code is available at StreamChat.

Live content is unavailable. Log in and register to view live content