Arbitrary Generative Video Interpolation
Abstract
Generative Video Frame Interpolation (VFI), which synthesizes intermediate frames from a given pair of start and end frames, plays a pivotal role in video creation. However, existing generative VFI methods are constrained to producing a fixed number of intermediate frames, which significantly limits flexibility in adjusting the frame rate or duration of videos during creation. In this work, we present \textbf{ArbInterp}, a novel generative VFI framework that enables efficient interpolation at any timestamp and of any length. Specifically, to support interpolation at any timestamp, we propose the Timestamp-aware Rotary Position Embedding (TaRoPE), which modulates the positions in temporal RoPE to align generated frames with target normalized timestamps. This design enables fine-grained control over frame timestamps, addressing the inflexibility of the fixed-position paradigm in prior work. For any-length interpolation, we decompose long-sequence generation into segment-wise frame synthesis and design an appearance-motion decoupled conditioning strategy: it leverages the endpoint frames of the preceding segment to enforce appearance consistency and its temporal semantics to maintain motion coherence, ensuring seamless spatiotemporal transitions across segments. Experimentally, we build comprehensive benchmarks for multi-scale frame interpolation (2× to 32×) to assess generalizability across arbitrary interpolation factors. Results show that ArbInterp outperforms prior methods in all scenarios, with higher fidelity and more seamless spatiotemporal continuity. Video demos are provided on the website: https://mcg-nju.github.io/ArbInterp-Web.
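
To make the timestamp-conditioned positional encoding concrete, the sketch below illustrates one way a timestamp-aware temporal RoPE could work: rotary phases are derived from continuous, normalized frame timestamps in [0, 1] rather than from fixed integer frame indices. This is an illustrative assumption, not the released implementation; the function names (tarope_angles, apply_rope) and the max_temporal_pos scaling factor are hypothetical.

\begin{verbatim}
# Minimal sketch (assumed, not the authors' code) of timestamp-aware temporal RoPE:
# each generated frame carries a normalized timestamp t in [0, 1], which is mapped
# onto a continuous temporal position before computing the rotary angles.
import torch

def tarope_angles(timestamps: torch.Tensor, dim: int,
                  max_temporal_pos: float = 1024.0,
                  base: float = 10000.0) -> torch.Tensor:
    """Rotary angles for frames at arbitrary normalized timestamps in [0, 1]."""
    # Standard RoPE inverse frequencies over half the head dimension.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))  # (dim/2,)
    # Map normalized timestamps onto the temporal axis (continuous, not integer).
    positions = timestamps.float() * max_temporal_pos                   # (frames,)
    return torch.outer(positions, inv_freq)                             # (frames, dim/2)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate query/key features x of shape (frames, dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Example: 7 frames at arbitrary (non-uniform) timestamps between the endpoints.
timestamps = torch.tensor([0.0, 0.1, 0.25, 0.4, 0.6, 0.8, 1.0])
angles = tarope_angles(timestamps, dim=64)
q = torch.randn(len(timestamps), 64)
q_rot = apply_rope(q, angles)
\end{verbatim}

In such a scheme, the timestamp-derived angles would presumably be applied to the temporal dimension of the query/key features in the video diffusion backbone, so that a frame generated for timestamp t attends as if it occupied the corresponding temporal position, regardless of how many frames are synthesized between the endpoints.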