Oral Session
Oral Session 3B Image generation
201 A/B
Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation
Zhuoyang Zhang ⋅ Luke Huang ⋅ Chengyue Wu ⋅ Shang Yang ⋅ Kelly Peng ⋅ Yao Lu ⋅ Song Han
We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation. Traditional autoregressive image generation relies on next-patch prediction, a memory-bound process that leads to high latency. Existing works have tried to parallelize next-patch prediction by shifting to multi-patch prediction, but have achieved only limited parallelization. To achieve high parallelization while maintaining generation quality, we introduce two key techniques: (1) Flexible Parallelized Autoregressive Modeling, a novel architecture that enables arbitrary generation ordering and degrees of parallelization. It uses learnable position query tokens to guide generation at target positions while ensuring mutual visibility among concurrently generated tokens for consistent parallel decoding. (2) Locality-aware Generation Ordering, a novel schedule that forms groups to minimize intra-group dependencies and maximize contextual support, enhancing generation quality. With these designs, we reduce the generation steps from 256 to 20 (256×256 res.) and from 1024 to 48 (512×512 res.) without compromising quality on ImageNet class-conditional generation, and achieve at least 3.4× lower latency than previous parallelized autoregressive models.
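The grouping idea in (2) can be illustrated with a toy schedule. This is a minimal sketch, not the authors' algorithm: the greedy rule, the function name `locality_aware_order`, and the `min_dist` spacing parameter are all assumptions made for illustration. Each step prefers positions close to already generated tokens (more contextual support) while keeping members of the same group apart (fewer intra-group dependencies).

```python
import math

def locality_aware_order(grid, group_sizes, min_dist=2.0):
    """Toy greedy schedule over a grid x grid token map (hypothetical heuristic).

    group_sizes must sum to grid * grid. Returns a list of groups,
    each a list of (row, col) positions decoded in the same step.
    """
    remaining = [(r, c) for r in range(grid) for c in range(grid)]
    done, groups = [], []
    for size in group_sizes:
        # Contextual support: distance to the nearest generated token (smaller is better).
        def support(p):
            return min((math.dist(p, q) for q in done), default=0.0)
        candidates = sorted(remaining, key=support)
        group = []
        # First pass: only accept positions at least min_dist from the rest of the group.
        for p in candidates:
            if len(group) == size:
                break
            if all(math.dist(p, q) >= min_dist for q in group):
                group.append(p)
        # Second pass: relax the spacing rule if the group is still short.
        for p in candidates:
            if len(group) == size:
                break
            if p not in group:
                group.append(p)
        for p in group:
            remaining.remove(p)
        done.extend(group)
        groups.append(group)
    return groups

groups = locality_aware_order(grid=4, group_sizes=[1, 2, 4, 9])
print([len(g) for g in groups])  # 16 tokens decoded in 4 steps instead of 16
```

The group sizes here are arbitrary; the abstract does not say how the 20 or 48 steps are divided among tokens.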
SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
Junsong Chen ⋅ Yuyang Zhao ⋅ Jincheng YU ⋅ Ruihang Chu ⋅ Junyu Chen ⋅ Shuai Yang ⋅ Xianbang Wang ⋅ Yicheng Pan ⋅ Zhou Daquan ⋅ Huan Ling ⋅ Haozhe Liu ⋅ Hongwei Yi ⋅ Hao Zhang ⋅ Muyang Li ⋅ Yukang Chen ⋅ Han Cai ⋅ Sanja Fidler ⋅ Ping Luo ⋅ Song Han ⋅ Enze Xie
We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720×1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, and is deployable on an RTX 5090 GPU. Two core designs ensure efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: We design a block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, reducing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16× faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4× speedup). In summary, SANA-Video enables low-cost, high-quality video generation. Code and model will be publicly released.
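The constant-memory state in (2) follows from the standard linear-attention identity: the sums of phi(k)^T v and phi(k) over past tokens can be accumulated block by block, so earlier keys and values need not be stored. The sketch below is a generic pure-Python illustration of that identity, not the SANA-Video implementation; the ReLU-style feature map `phi`, the class name, and the helper names are assumptions.

```python
import random

def phi(x):
    # Feature map: ReLU plus a small constant so the normaliser stays positive (assumed choice).
    return [max(v, 0.0) + 1e-6 for v in x]

class LinearAttentionState:
    """Running sums whose size depends only on (d_k, d_v), not on sequence length."""

    def __init__(self, d_k, d_v):
        self.S = [[0.0] * d_v for _ in range(d_k)]  # sum over seen tokens of phi(k)^T v
        self.z = [0.0] * d_k                         # sum over seen tokens of phi(k)

    def update(self, keys, values):
        # Fold one block of keys and values into the fixed-size state.
        for k, v in zip(keys, values):
            fk = phi(k)
            for i, ki in enumerate(fk):
                self.z[i] += ki
                for j, vj in enumerate(v):
                    self.S[i][j] += ki * vj

    def attend(self, q):
        # Output for one query using only the accumulated state.
        fq = phi(q)
        denom = sum(a * b for a, b in zip(fq, self.z))
        return [sum(fq[i] * self.S[i][j] for i in range(len(fq))) / denom
                for j in range(len(self.S[0]))]

def direct_attention(q, keys, values):
    """Reference: the same quantity computed from all stored keys and values."""
    fq = phi(q)
    weights = [sum(a * b for a, b in zip(fq, phi(k))) for k in keys]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def make_block(rng, n_tokens, d_k, d_v):
    # Random (query, key, value) triples standing in for one block of video tokens.
    vec = lambda d: [rng.uniform(-1, 1) for _ in range(d)]
    return [(vec(d_k), vec(d_k), vec(d_v)) for _ in range(n_tokens)]

rng = random.Random(0)
state = LinearAttentionState(d_k=4, d_v=3)
for _ in range(5):  # five blocks processed in order
    block = make_block(rng, n_tokens=6, d_k=4, d_v=3)
    state.update([k for _, k, _ in block], [v for _, _, v in block])
    outputs = [state.attend(q) for q, _, _ in block]
print(len(state.S), len(state.S[0]))  # state shape is unchanged after any number of blocks
```

Each query in a block sees its own block plus all earlier blocks, and the memory held between blocks is a d_k × d_v matrix plus a d_k vector regardless of video length.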
Partition Generative Modeling: Masked Modeling Without Masks
Justin Deschenaux ⋅ Lan Tran ⋅ Caglar Gulcehre
Masked generative models (MGMs) can generate tokens in parallel and in any order, unlike autoregressive models (ARMs), which decode one token at a time, left-to-right. However, MGMs process the full-length sequence at every sampling step, including mask tokens that carry no information. In contrast, ARMs process only the previously generated tokens. We introduce "Partition Generative Models" (PGMs), which replace masking with partitioning. Tokens are split into two groups that cannot attend to each other, and the model learns to predict each group conditioned on the other, eliminating mask tokens entirely. Because the groups do not interact, PGMs can process only the clean tokens during sampling, like ARMs, while retaining parallel, any-order generation, like MGMs. On OpenWebText, PGMs achieve 5–5.5× higher throughput than MDLM while producing samples with lower Generative Perplexity. On ImageNet, PGMs reach comparable FID to MaskGIT with a 7.5× throughput improvement. With twice as many steps, the FID improves to 4.56 while remaining 3.9× faster than MGMs. Finally, PGMs remain compatible with existing MGM samplers and distillation methods.
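Two points from the abstract can be made concrete with a small sketch: the rule that the two groups cannot attend to each other, and the saving from processing only clean tokens. Both functions below are illustrative assumptions rather than the paper's method; in particular, `tokens_processed` assumes a uniform unmasking schedule and counts tokens only, so it does not reproduce the reported throughput figures, and it does not show how one group is predicted from the other.

```python
def partition_mask(group_ids):
    """allowed[i][j] is True when token i may attend to token j (same group only)."""
    n = len(group_ids)
    return [[group_ids[i] == group_ids[j] for j in range(n)] for i in range(n)]

def tokens_processed(seq_len, steps):
    """Total tokens fed to the model over one sample, under an assumed uniform schedule.

    MGM-style: every step sees the full sequence, mask tokens included.
    PGM-style: every step sees only the clean tokens generated so far.
    """
    assert seq_len % steps == 0, "sketch assumes an equal number of tokens per step"
    per_step = seq_len // steps
    mgm = seq_len * steps
    pgm = sum(per_step * s for s in range(steps))  # clean tokens available before step s
    return mgm, pgm

mask = partition_mask([0, 1, 0, 1])
print(mask[0])                    # [True, False, True, False]
print(tokens_processed(256, 16))  # (4096, 1920)
```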
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
Chunrui Han ⋅ Guopeng Li ⋅ Jingwei Wu ⋅ Quan Sun ⋅ Yan Cai ⋅ Yuang Peng ⋅ Zheng Ge ⋅ Deyu Zhou ⋅ Haomiao Tang ⋅ Hongyu Zhou ⋅ Kenkun Liu ⋅ Shu-Tao Xia ⋅ Binxing Jiao ⋅ Daxin Jiang ⋅ Xiangyu Zhang ⋅ Yibo Zhu
Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, trained on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we have released our code and models to the community at https://github.com/stepfun-ai/NextStep-1.
TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems
Christoph Minixhofer ⋅ Ondrej Klejch ⋅ Peter Bell
Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive. Subjective metrics such as Mean Opinion Score (MOS) are not easily comparable between works. Objective metrics are frequently used, but rarely validated against subjective ones. Both kinds of metrics are challenged by recent TTS systems capable of producing synthetic speech indistinguishable from real speech. In this work, we introduce Text to Speech Distribution Score 2 (TTSDS2), a more robust and improved version of TTSDS. Across a range of domains and languages, it is the only one of 16 compared metrics to achieve a Spearman correlation above 0.50 for every domain and subjective score evaluated. We also release a range of resources for evaluating synthetic speech close to real speech: a dataset with over 11,000 subjective opinion score ratings; a pipeline for recreating a multilingual test dataset to avoid data leakage; and a benchmark for TTS in 14 languages.
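The validation criterion above (Spearman correlation above 0.50 in every domain) can be sketched in a few lines. The code computes Spearman's rank correlation as the Pearson correlation of average ranks; the helper `passes_everywhere` and all scores in the example are made up for illustration and are not taken from the benchmark.

```python
def average_ranks(xs):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def passes_everywhere(scores_by_domain, threshold=0.5):
    """True if the metric correlates above the threshold in every domain (hypothetical helper)."""
    return all(spearman(metric, human) > threshold
               for metric, human in scores_by_domain.values())

# Made-up per-system scores: (objective metric, subjective rating) for two domains.
example = {
    "domain_a": ([0.2, 0.5, 0.4, 0.9], [2.1, 3.0, 3.5, 4.4]),
    "domain_b": ([0.1, 0.3, 0.6, 0.8], [1.9, 2.8, 3.9, 4.1]),
}
print(passes_everywhere(example))
```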
VibeVoice: Expressive Podcast Generation with Next-Token Diffusion
Zhiliang Peng ⋅ Jianwei Yu ⋅ Wenhui Wang ⋅ Yaoyao Chang ⋅ Yutao Sun ⋅ Li Dong ⋅ Yi Zhu ⋅ Weijiang Xu ⋅ Hangbo Bao ⋅ Zehua Wang ⋅ Shaohan Huang ⋅ Yan Xia ⋅ Furu Wei
Generating long-form, multi-speaker conversational audio like podcasts poses significant challenges for traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking. We present VibeVoice, a novel model designed to synthesize expressive, long-form speech with multiple speakers in a zero-shot manner. A core component of our approach is a continuous speech tokenizer operating at an ultra-low frame rate of 7.5. This tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. To facilitate training on authentic conversational dynamics, we have developed an annotation pipeline that generates pseudo transcriptions and turn-taking labels for extensive podcast data. Leveraging this data and our efficient tokenizer, VibeVoice employs the next-token diffusion framework. This enables VibeVoice to: (1) synthesize long-form speech (up to 30 minutes) with up to 4 speakers, surpassing the typical 1-2 speaker limits of many prior models; and (2) achieve a high degree of naturalness in turn-taking, pacing, and the rendition of subtle non-lexical cues (such as breaths and lip smacks), which are crucial for listener immersion and capturing the authentic vibe of expressive conversations.
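A quick calculation shows why a low frame rate matters for long sequences. The abstract gives the rate as 7.5 without a unit; the sketch below assumes frames per second, and the 50 frames-per-second comparison rate is a hypothetical reference point, not a figure from the paper.

```python
def num_frames(duration_minutes, frames_per_second):
    """Number of speech frames (tokens) needed to cover a clip of the given length."""
    return duration_minutes * 60 * frames_per_second

low_rate = num_frames(30, 7.5)   # 30-minute clip at the stated rate (unit assumed)
reference = num_frames(30, 50)   # same clip at a hypothetical 50 frames per second
print(low_rate)                  # 13500.0
print(reference / low_rate)      # sequence is about 6.7 times shorter
```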
UALM: Unified Audio Language Model for Understanding, Generation and Reasoning
Jinchuan Tian ⋅ Sang-gil Lee ⋅ Zhifeng Kong ⋅ Sreyan Ghosh ⋅ Arushi Goel ⋅ Chao-Han Huck Yang ⋅ Wenliang Dai ⋅ Zihan Liu ⋅ Hanrong Ye ⋅ Shinji Watanabe ⋅ Mohammad Shoeybi ⋅ Bryan Catanzaro ⋅ Rafael Valle ⋅ Wei Ping
Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks -- an essential step toward advanced multimodal reasoning. This paper introduces Unified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Furthermore, we present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross-modal generative reasoning, with its effectiveness confirmed by subjective evaluations.