Poster in Workshop: The 2nd Workshop on World Models: Understanding, Modelling and Scaling

EGO-FLIGHT: Egocentric Grounding of Order for Frame-Level Inference in General Human Timelines

Jiahang He ⋅ Anya Singh ⋅ Udai Relan ⋅ Varun Nair


Abstract

Multimodal Large Language Models have shown impressive progress across vision-language tasks, but they still struggle with temporal reasoning, a critical skill for understanding dynamic visual content. We introduce EGO-FLIGHT, a benchmark and dataset designed to directly evaluate temporal reasoning in Vision-Language Models (VLMs) through frame-ordering tasks in human-like, first-person visual contexts. Our dataset contains 1,056 continuous, egocentric video clips that capture natural variations in lighting, motion, and occlusion, providing a first-person perspective that mirrors how humans experience and interpret dynamic scenes. Experiments on frame-sorting tasks that vary four controlled variables reveal that current models perform substantially below the human baseline, though longer videos, fewer frames, and more annotation generally improve performance. Finally, fine-tuning a VLM on our hand-collected data with two LoRA strategies improves performance over the base model, suggesting a promising path toward enhancing temporal reasoning capabilities. We hope this work advances research on temporal understanding and encourages the development of models that more closely align with human perception while supporting realistic learning in embodied systems such as robots.
