The Low-Frequency Trap: Why Scaling Doesn't Solve Simple Temporal Counting
Sarvesh Baskar ⋅ Muhammad Islam ⋅ Zikui Cai ⋅ Ankit Nakhawa ⋅ Anirudh Satheesh ⋅ Tom Goldstein ⋅ Furong Huang
Abstract
Large multimodal models demonstrate strong performance on complex video understanding benchmarks, leading to the expectation that they should trivialy handle simple temporal reasoning tasks. In this work, we show that this assumption is fundamentally flawed. Using parametric profiling -- systematically varying event frequency, event count, and temporal span -- we uncover a striking failure mode: state-of-the-art video–language models fail catastrophically on conceptually simple tasks. While performance generally degrades as event frequency increases (as expected), we observe a counter-intuitive collapse in the easy regime: even at low frequencies (0.5 -- 1 Hz) with visually distinct events, performance plummets once the event count exceeds a trivial threshold (e.g., $N > 4$). Moreover, scaling model size from 8B to 235B does not resolve this limitation; large and small models exhibit nearly identical capability boundaries. Our analysis suggests that errors arise not from high-level reasoning or counting per se, but from systematic temporal misinterpretation, including event merging, hallucinated intermediate states, and color-based temporal interpolation. These results reveal a blind spot in current models’ temporal abstraction that is masked by aggregate benchmark scores and largely invariant to scale. Our findings highlight the need for diagnostic evaluation beyond average accuracy and suggest that scaling alone is insufficient to resolve fundamental limitations in temporal event reasoning.
Chat is not available.
Successful Page Load