Scaling Speech Tokenizers with Diffusion Autoencoders
Yuancheng Wang · Zhenyu Tang · Yun Wang · Arthur Hinsvark · Yingru Liu · Yinghao Li · Kainan Peng · Junyi Ao · Mingbo Ma · Mike Seltzer · Qing He · Xubo Liu
Abstract
Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing the trade-off between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bit rates and low token rates. We propose the Speech Diffusion Tokenizer (SiTok), a diffusion autoencoder that jointly learns semantically rich representations through supervised learning and enables high-fidelity audio reconstruction via diffusion. We scale SiTok to 1.6B parameters and train it on 2 million hours of speech. Experiments show that SiTok outperforms strong baselines on both reconstruction and understanding tasks at an extremely low token rate of 12.5 Hz and a bit rate of 200 bits per second.
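For context, the two rate figures in the abstract jointly pin down the per-token budget: at 12.5 tokens per second and 200 bits per second, each token carries 16 bits, equivalent to a single 65,536-entry codebook (or a factorized scheme of the same capacity). The snippet below is a minimal sketch of that arithmetic only; the variable names are illustrative and not from the paper.

```python
# Hypothetical sketch (not from the paper): checks the per-token bit budget
# implied by the abstract's figures of 12.5 tokens/s and 200 bits/s.

TOKEN_RATE_HZ = 12.5   # tokens per second of speech (from the abstract)
BIT_RATE_BPS = 200.0   # bits per second (from the abstract)

bits_per_token = BIT_RATE_BPS / TOKEN_RATE_HZ  # 200 / 12.5 = 16.0
codebook_size = 2 ** int(bits_per_token)       # 2**16 = 65,536 entries

print(f"{bits_per_token:.0f} bits/token -> codebook of {codebook_size:,} entries")
```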