Neodragon: Mobile Video Generation Using Diffusion Transformer
Animesh Karnewar · Denis Korzhenkov · Ioannis Lelekas · Noor Fathima · Adil Karjauv · Mohsen Ghafoorian · Amirhossein Habibian
Abstract
We propose Neodragon, a video DiT (Diffusion Transformer) designed to run on the low-power NPUs found in devices such as phones and laptop computers. We demonstrate that, despite the high memory and compute cost of video transformers, mobile devices can run these models when they are carefully optimised for efficiency. To achieve this level of efficiency: i) We replace the original large Text-Encoder with a much smaller one, with minimal quality loss, through our novel distillation framework, which does not require any image or video data. ii) We propose an Asymmetric Decoder distillation approach, which allows us to replace the native codec-latent-VAE decoder with a more efficient one without disturbing the generative latent space of the video generation pipeline. iii) With our Block Pruning strategy, we remove entire blocks from the MMDiT denoiser based on their relative importance and recover the original performance through a two-stage distillation process. iv) We reduce the diffusion sampling cost using our novel extension of DMD (Distribution Matching Distillation) to the Pyramidal Flow-Matching objective. Neodragon generates 49 frames at [640$\times$1024] resolution within 7.6 seconds on the Qualcomm Hexagon NPU with a VBench total score of 81.61, setting a new state of the art for mobile video generation.
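To make the idea of removing blocks "based on their relative importance" concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state how importance is measured, so the sketch assumes a leave-one-out score: the mean squared change in the final output when a single block is skipped. The names `make_residual_block`, `run_blocks`, `block_importance`, and `select_blocks_to_keep` are hypothetical, the "blocks" are toy residual functions on plain lists rather than MMDiT layers, and the two-stage distillation that recovers performance after pruning is not shown.

```python
from typing import Callable, List, Optional

Vector = List[float]
Block = Callable[[Vector], Vector]


def make_residual_block(scale: float) -> Block:
    """Toy residual block: x + scale * x (stand-in for a transformer block)."""
    return lambda x: [v + scale * v for v in x]


def run_blocks(blocks: List[Block], x: Vector, skip: Optional[int] = None) -> Vector:
    """Apply the blocks in order, optionally skipping the block at index `skip`."""
    for i, block in enumerate(blocks):
        if i == skip:
            continue
        x = block(x)
    return x


def block_importance(blocks: List[Block], x: Vector) -> List[float]:
    """ASSUMED metric: mean squared output change when one block is skipped."""
    reference = run_blocks(blocks, x)
    scores = []
    for i in range(len(blocks)):
        out = run_blocks(blocks, x, skip=i)
        mse = sum((a - b) ** 2 for a, b in zip(reference, out)) / len(reference)
        scores.append(mse)
    return scores


def select_blocks_to_keep(blocks: List[Block], x: Vector, num_to_remove: int) -> List[int]:
    """Return indices of the blocks that remain after dropping the least important ones."""
    scores = block_importance(blocks, x)
    least_important_first = sorted(range(len(blocks)), key=lambda i: scores[i])
    removed = set(least_important_first[:num_to_remove])
    return [i for i in range(len(blocks)) if i not in removed]


# Toy example: blocks 1 and 3 barely change the signal, so they are pruned first.
toy_blocks = [make_residual_block(s) for s in (0.5, 0.001, 0.3, 0.0)]
calibration_input = [1.0, -2.0, 0.5]
kept = select_blocks_to_keep(toy_blocks, calibration_input, num_to_remove=2)
print("kept block indices:", kept)
```

In practice the importance score would be averaged over many calibration inputs and timesteps rather than computed on a single vector; the single-input version here only keeps the sketch short.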