ScalingCache: Extreme Acceleration of DiTs through Difference Scaling and Dynamic Interval Caching
Lihui Gu · Jingbin He · Lianghao Su · Kang He · Wenxiao Wang · Yuliang Liu
Abstract
Diffusion Transformers (DiTs) have emerged as powerful generative models, but their iterative denoising process and deep stacks of transformer blocks incur substantial computational overhead, limiting the accessibility and practical deployment of high-quality video generation. To address this bottleneck, we propose ScalingCache, a training-free acceleration framework designed specifically for DiTs. ScalingCache exploits the inherent redundancy in model representations: it performs a lightweight offline analysis on a small number of samples and dynamically reuses previously computed activations during inference, thereby skipping full computation at selected denoising steps. Experimental results demonstrate that ScalingCache achieves significant acceleration in both image and video generation while maintaining near-lossless generation quality. On widely used video generation models, including Wan2.1 and HunyuanVideo, it achieves approximately 2.5$\times$ acceleration with only a 0.5$\%$ drop in VBench scores; on FLUX, it achieves 3.1$\times$ near-lossless acceleration, with human preference tests showing quality comparable to the original outputs. Moreover, at similar acceleration ratios, ScalingCache outperforms prior state-of-the-art caching strategies, reducing LPIPS by 45$\%$ for text-to-image generation and by 20$-$30$\%$ for text-to-video generation, highlighting its superior fidelity preservation.
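To make the caching idea concrete, the sketch below illustrates one plausible form of activation reuse between denoising steps: outputs of an expensive block are recomputed only every few steps, and intermediate steps advance the cached output along a scaled copy of the last observed difference. The abstract does not specify ScalingCache's actual rule, so every name and parameter here (the fixed `refresh_interval`, the linear `scale` extrapolation) is a hypothetical illustration, not the paper's method.

```python
# Minimal sketch of difference-scaled activation caching for an
# iterative denoiser. Hypothetical illustration only: the fixed
# refresh interval and linear extrapolation are assumptions, not
# the ScalingCache algorithm itself.
import torch


class DifferenceScalingCache:
    """Reuse a cached block output between full computations,
    extrapolating it with a scaled copy of the last observed change."""

    def __init__(self, refresh_interval: int = 3, scale: float = 1.0):
        self.refresh_interval = refresh_interval  # steps between full computes
        self.scale = scale                        # difference-scaling factor
        self.cached = None                        # last fully computed output
        self.delta = None                         # difference between the two
                                                  # most recent full computes

    def step(self, t: int, compute_block):
        if t % self.refresh_interval == 0 or self.cached is None:
            out = compute_block()                 # full forward pass
            if self.cached is not None:
                self.delta = out - self.cached    # record the new difference
            self.cached = out
            return out
        if self.delta is None:
            return self.cached                    # nothing to extrapolate yet
        # Skip the block: advance the cached output along the scaled difference.
        return self.cached + self.scale * self.delta


# Usage: wrap an expensive block inside a toy denoising loop.
cache = DifferenceScalingCache(refresh_interval=3, scale=0.5)
block = torch.nn.Linear(16, 16)
x = torch.randn(1, 16)
for t in range(10):
    x = cache.step(t, lambda: block(x))
```

In this toy version, the block runs fully on every third step and is otherwise skipped, so roughly a third of the block computation remains; tuning the interval and scale per block or per step would be one way to realize the "dynamic interval" behavior the title refers to.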