EGO-FLIGHT: Egocentric Grounding of Order for Frame-Level Inference in General Human Timelines
Abstract
Multimodal Large Language Models have made impressive progress across vision-language tasks, but they still struggle with temporal reasoning, a skill critical to understanding dynamic visual content. We introduce EGO-FLIGHT, a benchmark and dataset designed to directly evaluate temporal reasoning in Vision-Language Models (VLMs) through frame-ordering tasks in first-person visual contexts. Our dataset contains 1,056 continuous egocentric video clips that capture natural variations in lighting, motion, and occlusion, mirroring how humans experience and interpret dynamic scenes. Experiments on frame-sorting tasks that vary four controlled variables reveal that current models perform substantially below the human baseline, though longer videos, fewer frames, and more annotation generally improve performance. Finally, fine-tuning a VLM on our hand-collected data with two LoRA strategies improves performance over the base model, suggesting a promising path toward stronger temporal reasoning. We hope this work advances research on temporal understanding and encourages the development of models that align more closely with human perception while supporting realistic learning in embodied systems such as robots.