HiTeA: Hierarchical Temporal Alignment for Training-Free Long-Video Temporal Grounding
Abstract
Temporal grounding in long, untrimmed videos is critical for real-world video understanding, yet it remains challenging owing to complex temporal structures and pervasive visual redundancy. Existing methods rely heavily on supervised training with task-specific annotations, which inherently limits their scalability and adaptability due to the substantial cost of data collection and model retraining. Although a few recent works have explored training-free or zero-shot grounding, they seldom address the unique challenges posed by long videos. In this paper, we propose HiTeA (Hierarchical Temporal Alignment), a novel, training-free framework explicitly designed for long-video temporal grounding. HiTeA introduces a hierarchical temporal decomposition mechanism that structures videos into events, scenes, and actions, thereby aligning natural language queries with the most appropriate temporal granularity. Candidate segments are then matched with queries by leveraging pre-trained vision–language models (VLMs) to directly compute segment–text similarity, which obviates the need for any task-specific training or fine-tuning. Extensive experiments on both short- and long-video benchmarks show that HiTeA not only substantially outperforms all existing training-free methods (e.g., achieving 44.94% R@0.1 on TACoS, an absolute gain of 12.4 percentage points) but also remains competitive with state-of-the-art supervised baselines under stricter metrics. The code is available at https://anonymous.4open.science/r/HiTeA_code.
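To make the training-free scoring idea mentioned above concrete, the following is a minimal sketch (not the authors' implementation) of computing segment–text similarity with a frozen, off-the-shelf VLM. The choice of CLIP via the HuggingFace `transformers` API, mean pooling of frame embeddings, and the representation of a candidate segment as a list of sampled frames are illustrative assumptions, not details taken from HiTeA.

```python
# Sketch: score candidate video segments against a text query with a frozen
# pre-trained VLM (CLIP here), with no task-specific training or fine-tuning.
# Segment boundaries, frame sampling, and mean pooling are assumptions made
# for illustration only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def segment_text_similarity(frames: list[Image.Image], query: str) -> float:
    """Cosine similarity between a mean-pooled segment embedding and the query."""
    image_inputs = processor(images=frames, return_tensors="pt")
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    frame_emb = model.get_image_features(**image_inputs)   # (num_frames, d)
    text_emb = model.get_text_features(**text_inputs)      # (1, d)
    frame_emb = frame_emb / frame_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    segment_emb = frame_emb.mean(dim=0, keepdim=True)      # pool frames into one segment vector
    segment_emb = segment_emb / segment_emb.norm(dim=-1, keepdim=True)
    return (segment_emb @ text_emb.T).item()

# Usage (hypothetical): pick the candidate whose frames best match the query.
# candidates = {(start, end): [frame_1, frame_2, ...], ...}
# best = max(candidates, key=lambda seg: segment_text_similarity(candidates[seg], query))
```

In a hierarchical setup such as the one the abstract describes, the same scorer could in principle be applied to candidates at each temporal level (event, scene, action), with the query grounded at whichever granularity yields the strongest match.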