One-Step Video Depth Estimation via Self-Distillation
Abstract
Diffusion-based video depth estimation methods have recently set new benchmarks by leveraging rich generative priors learned from video synthesis, delivering high depth accuracy and strong temporal consistency. However, their iterative denoising creates a computational bottleneck that limits their use in autonomous or dynamic environments requiring real-time adaptation. To bridge this gap, we frame the efficiency-accuracy trade-off as a self-improvement problem and propose a two-stage self-distillation strategy. In the first stage, we distill a multi-step diffusion model into a one-step student by applying latent-space distillation to the U-Net via score matching and latent gradient matching. In the second stage, we further distill the decoder using feature-alignment and pixel-wise distillation losses. Our method achieves depth accuracy comparable to state-of-the-art multi-step video depth models while reducing denoising time by up to 3× and decoding time by up to 20×.
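As a rough illustration of the second stage, the sketch below combines a feature-alignment term with a pixel-wise term into one decoder distillation objective. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `decoder_distillation_loss`, the choice of mean squared error for feature alignment, the choice of mean absolute error for the pixel-wise term, and the weights `feat_weight` and `pixel_weight` are all hypothetical, since the abstract does not specify the exact form of either loss.

```python
import numpy as np


def decoder_distillation_loss(student_feats, teacher_feats,
                              student_depth, teacher_depth,
                              feat_weight=1.0, pixel_weight=1.0):
    """Hypothetical stage-two objective: feature alignment plus pixel-wise loss.

    student_feats / teacher_feats: lists of intermediate decoder feature maps,
        paired layer by layer (assumed to have matching shapes per layer).
    student_depth / teacher_depth: decoded depth maps of identical shape.
    """
    if not student_feats or len(student_feats) != len(teacher_feats):
        raise ValueError("expected the same non-zero number of feature maps")

    # Feature alignment (assumed MSE): average the per-layer squared error.
    feat_loss = float(np.mean([
        np.mean((s - t) ** 2) for s, t in zip(student_feats, teacher_feats)
    ]))

    # Pixel-wise distillation (assumed L1): mean absolute depth difference.
    pixel_loss = float(np.mean(np.abs(student_depth - teacher_depth)))

    # Weighted sum of the two terms; the weights are illustrative only.
    return feat_weight * feat_loss + pixel_weight * pixel_loss
```

In this sketch the teacher decoder's features and output act as fixed targets, so only the student decoder would receive gradients in an actual training loop; an autodiff framework would replace NumPy there.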