

Poster

HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis

Yuto Nishimura · Takumi Hirose · Masanari Ohi · Hideki Nakayama · Nakamasa Inoue

Hall 3 + Hall 2B #155
Fri 25 Apr 7 p.m. PDT — 9:30 p.m. PDT

Abstract:

Recently, text-to-speech (TTS) models based on large language models (LLMs) that translate natural language text into sequences of discrete audio tokens have gained great research attention, with advances in neural audio codec (NAC) models using residual vector quantization (RVQ). However, long-form speech synthesis remains a significant challenge due to the high frame rate, which lengthens the audio token sequence and makes it difficult for autoregressive language models to generate audio tokens for even a minute of speech. To address this challenge, this paper introduces two novel post-training approaches: 1) Multi-Resolution Requantization (MReQ) and 2) HALL-E. MReQ is a framework to reduce the frame rate of pre-trained NAC models. Specifically, it incorporates a multi-resolution residual vector quantization (MRVQ) module that hierarchically reorganizes discrete audio tokens through teacher-student distillation. HALL-E is an LLM-based TTS model designed to predict the hierarchical tokens of MReQ. Specifically, it incorporates the technique of using MRVQ sub-modules and continues training from a pre-trained LLM-based TTS model. Furthermore, to promote TTS research, we create MinutesSpeech, a new benchmark dataset consisting of 40k hours of filtered speech data for training and evaluating speech synthesis ranging from 3 s up to 180 s. In experiments, we demonstrated the effectiveness of our approaches by applying our post-training framework to VALL-E. We achieved a frame rate as low as 8 Hz, enabling stable minute-long speech synthesis in a single inference step. Audio samples, dataset, code, and pre-trained models are available at https://yutonishimura-v2.github.io/HALL-E_DEMO.
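
To make the frame-rate argument concrete, the following is a minimal sketch of the arithmetic only, not of MReQ or HALL-E themselves. It counts the tokens per quantizer layer that an autoregressive model would have to generate at a given frame rate. The 8 Hz figure and the 3 s to 180 s range come from the abstract; the 75 Hz baseline and the function name `tokens_per_layer` are assumptions chosen for illustration and are not stated in the abstract.

```python
def tokens_per_layer(duration_s: float, frame_rate_hz: float) -> int:
    """Number of audio tokens in one quantizer layer for a clip of the given length."""
    return round(duration_s * frame_rate_hz)


# ASSUMPTION: 75 Hz is an illustrative baseline frame rate, not a value from the abstract.
ASSUMED_BASELINE_HZ = 75.0
# The abstract reports a frame rate as low as 8 Hz.
REDUCED_HZ = 8.0

if __name__ == "__main__":
    # Durations spanning the 3 s to 180 s range mentioned for MinutesSpeech.
    for duration_s in (3, 60, 180):
        baseline = tokens_per_layer(duration_s, ASSUMED_BASELINE_HZ)
        reduced = tokens_per_layer(duration_s, REDUCED_HZ)
        print(f"{duration_s:>4} s: {baseline:>6} tokens at 75 Hz, {reduced:>5} tokens at 8 Hz")
```

Under these assumptions, one minute of speech needs 4500 tokens per layer at the baseline rate but only 480 at 8 Hz, which is the kind of reduction that makes minute-long autoregressive generation tractable.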
