

Poster

MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA

Hanrong Ye · Haotian Zhang · Erik Daxberger · Lin Chen · Zongyu Lin · Yanghao Li · Bowen Zhang · Haoxuan You · Dan Xu · Zhe Gan · Jiasen Lu · Yinfei Yang

Hall 3 + Hall 2B #205
Thu 24 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract:

This research aims to comprehensively explore building a multimodal foundation model for egocentric video understanding. To achieve this goal, we work on three fronts. First, as there is a lack of QA data for egocentric video understanding, we automatically generate 7M high-quality QA samples for egocentric videos in Ego4D ranging from 30 seconds to one hour long, based on human-annotated data. This is one of the largest egocentric QA datasets. Second, we contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability to recognize and memorize visual details across videos of varying lengths. We introduce a new de-biasing evaluation method to help mitigate the unavoidable language bias present in the models being evaluated. Third, we propose a specialized multimodal architecture featuring a novel "Memory Pointer Prompting" mechanism. This design includes a global glimpse step to gain an overarching understanding of the entire video and identify key visual information, followed by a fallback step that utilizes the key visual information to generate responses. This enables the model to more effectively comprehend extended video content. With the data, benchmark, and model, we build MM-Ego, an egocentric multimodal LLM that shows powerful performance on egocentric video understanding.
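
The following is a minimal sketch of how a two-step "global glimpse then fallback" selection might look in practice, based only on the abstract's description. The function names, feature dimensions, and the cosine-similarity scoring rule are illustrative assumptions, not the authors' implementation.

# Sketch of Memory Pointer Prompting-style frame selection (assumptions noted above).
import numpy as np

def global_glimpse(frame_feats: np.ndarray, query_feat: np.ndarray, top_k: int = 8) -> np.ndarray:
    """Step 1: scan compressed per-frame features once, score their relevance
    to the question, and return indices ("pointers") of the key frames."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    scores = f @ q                      # cosine similarity of each frame to the question
    return np.argsort(scores)[::-1][:top_k]

def fallback_answer(detailed_feats: np.ndarray, pointers: np.ndarray, question: str) -> str:
    """Step 2: revisit only the pointed-to frames at full detail and condition the
    language model on them (the LLM call is stubbed out here)."""
    key_visual_info = detailed_feats[pointers]
    # In a real system these features would be projected into the LLM's token
    # space and prepended to the question prompt.
    return f"LLM answer conditioned on {key_visual_info.shape[0]} key frames for: {question!r}"

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_frames, dim = 1800, 256                               # e.g. a long egocentric video
    frame_feats = rng.normal(size=(num_frames, dim))          # compressed per-frame features
    detailed_feats = rng.normal(size=(num_frames, dim * 4))   # full-detail features
    query_feat = rng.normal(size=dim)                         # question embedding
    pointers = global_glimpse(frame_feats, query_feat)
    print(fallback_answer(detailed_feats, pointers, "What did I put in the fridge?"))

The design intuition conveyed by the abstract is that a single cheap pass over the whole video decides where to look, so the expensive detailed features only need to be processed for a handful of key frames.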
