

Poster

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

Wei Ping · Kainan Peng · Andrew Gibiansky · Sercan Arik · Ajay Kannan · Sharan Narang · Jonathan Raiman · John Miller

East Meeting level; 1,2,3 #41

Abstract:

We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training an order of magnitude faster. We scale Deep Voice 3 to dataset sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on a single GPU server.
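To put the inference figure in perspective, the sketch below converts ten million queries per day into an average request rate. It is plain arithmetic on the number quoted in the abstract, not a description of the authors' serving setup. The function names and the assumption of uniformly spread, one-at-a-time traffic are illustrative only; real traffic is bursty and a GPU server would typically handle many queries concurrently.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_qps(queries_per_day: float) -> float:
    """Average queries per second implied by a daily query volume."""
    return queries_per_day / SECONDS_PER_DAY


def sequential_budget_ms(queries_per_day: float) -> float:
    """Time available per query in milliseconds, ASSUMING queries arrive
    uniformly and are served strictly one at a time (a simplification)."""
    return 1000.0 * SECONDS_PER_DAY / queries_per_day


if __name__ == "__main__":
    daily = 10_000_000  # figure quoted in the abstract
    print(f"average rate: {average_qps(daily):.1f} queries/s")        # 115.7
    print(f"sequential budget: {sequential_budget_ms(daily):.2f} ms")  # 8.64
```

Under these assumptions the quoted volume corresponds to roughly 116 queries per second on average, or under 9 ms per query if nothing ran in parallel.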
