“VALL-E: Neural Codec Language Models Are Zero-Shot Text to Speech Synthesizers”, Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei (2023-01-05):

We introduce a language modeling approach for text-to-speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems.

VALL-E demonstrates in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second recording of an unseen speaker as an acoustic prompt.

Experimental results show that VALL-E outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find that VALL-E can preserve the emotion and acoustic environment of the acoustic prompt in synthesis.

See GitHub for demos of our work.

Figure 1: The overview of VALL-E. Unlike the previous pipeline (eg. phoneme → mel-spectrogram → waveform), the pipeline of VALL-E is phoneme → discrete code → waveform. VALL-E generates the discrete audio codec codes based on phoneme and acoustic code prompts, corresponding to the target content and the speaker’s voice. VALL-E directly enables various speech synthesis applications, such as zero-shot TTS, speech editing, and content creation combined with other generative AI models like GPT-3 (Brown et al 2020).
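The phoneme → discrete code → waveform pipeline can be sketched as plain data flow. This is a minimal illustrative stub, not VALL-E's implementation: `phonemize`, `codec_lm_generate`, and `codec_decode` are hypothetical stand-ins (the real codec LM is a Transformer decoder, and the real decoder is the neural codec's synthesis network), and the codebook size is assumed for illustration.

```python
import random

CODEBOOK_SIZE = 1024  # assumed codebook size, for illustration only

def phonemize(text):
    # Hypothetical phonemizer stub: maps each letter to a small integer id.
    return [ord(c) - ord('a') for c in text.lower() if c.isalpha()]

def codec_lm_generate(phonemes, prompt_codes, n_frames, seed=0):
    # Stand-in for the codec language model: the real model autoregressively
    # continues the acoustic-prompt codes conditioned on the phonemes;
    # here we just emit deterministic pseudo-random codes of the right shape.
    rng = random.Random(seed)
    return [rng.randrange(CODEBOOK_SIZE) for _ in range(n_frames)]

def codec_decode(codes):
    # Stand-in for the neural codec decoder (discrete codes -> waveform),
    # mapping each code into the [-1, 1] sample range.
    return [c / (CODEBOOK_SIZE - 1) * 2 - 1 for c in codes]

phonemes = phonemize("hello world")          # target content
prompt_codes = [17, 903, 222]                # codes from a 3-second enrollment clip
codes = codec_lm_generate(phonemes, prompt_codes, n_frames=75)
waveform = codec_decode(codes)
```

The key point the sketch captures is that, unlike a mel-spectrogram pipeline, every intermediate representation here is a sequence of discrete tokens, so the TTS problem reduces to next-token prediction.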

…It is worth noting that existing TTS systems are typically trained with dozens of hours of single-speaker data or hundreds of hours of multi-speaker data, hundreds of times less than VALL-E uses.

…We evaluate VALL-E on LibriSpeech [Panayotov et al 2015] and VCTK [Veaux et al 2016] datasets, where all test speakers are unseen in the training corpus. VALL-E substantially outperforms the state-of-the-art zero-shot TTS system [Casanova et al 2022b] in terms of speech naturalness and speaker similarity, with +0.12 comparative mean opinion score (CMOS) and +0.93 similarity mean opinion score (SMOS) improvement on LibriSpeech. VALL-E also beats the baseline on VCTK with +0.11 SMOS and +0.23 CMOS improvements. It even achieves a +0.04 CMOS score against ground truth, showing that the synthesized speech of unseen speakers is as natural as human recordings on VCTK. Moreover, the qualitative analysis shows that VALL-E is able to synthesize diverse outputs with the same text and target speaker, which could benefit pseudo-data creation for the speech recognition task. We also find that VALL-E can keep the acoustic environment (eg. reverberation) and emotion (eg. anger) of the acoustic prompt.

In summary, we make the following contributions.

…In this paper, we follow AudioLM [Borsos et al 2022] to leverage neural codec models to represent speech in discrete tokens.
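Neural codecs of this kind typically use residual vector quantization (RVQ): each quantizer stage encodes the residual error left by the previous stages, so the first codes are coarse and later codes refine them. A toy scalar version (illustrative only; real codecs quantize learned latent vectors, and the codebook values below are made up) is:

```python
def rvq_encode(x, codebooks):
    # Residual VQ: each stage quantizes the residual left by earlier stages.
    codes, residual = [], x
    for cb in codebooks:
        idx = min(range(len(cb)), key=lambda i: abs(cb[i] - residual))
        codes.append(idx)
        residual -= cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    # Reconstruction is the sum of the selected entries across stages.
    return sum(cb[i] for cb, i in zip(codebooks, codes))

# Toy codebooks: coarse values first, finer corrections in later stages.
codebooks = [[-0.5, 0.0, 0.5], [-0.1, 0.0, 0.1], [-0.02, 0.0, 0.02]]
codes = rvq_encode(0.43, codebooks)   # [2, 0, 2] -> 0.5 - 0.1 + 0.02 = 0.42
```

This coarse-to-fine structure is what VALL-E exploits: the first-stage codes can be generated autoregressively and the refinement codes in parallel.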

…The models are trained using 16 NVIDIA Tesla V100 32GB GPUs with a batch size of 6k acoustic tokens per GPU for 800k steps.
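A batch size given in tokens rather than utterances implies length-aware batching: utterances are packed until the per-GPU token budget is reached. The paper does not describe its batching code; this is one plausible greedy sketch under that assumption, with a hypothetical `bucket_by_tokens` helper.

```python
def bucket_by_tokens(utterance_lengths, max_tokens=6000):
    # Greedily pack utterances (given as acoustic-token counts) into batches
    # holding at most `max_tokens` tokens each, mirroring a "6k acoustic
    # tokens per GPU" budget. Assumes no utterance exceeds the budget.
    batches, current, current_tokens = [], [], 0
    for n in utterance_lengths:
        if current and current_tokens + n > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(n)
        current_tokens += n
    if current:
        batches.append(current)
    return batches

batches = bucket_by_tokens([3000, 2500, 4000, 6000, 100])
# Each resulting batch stays within the 6000-token budget.
```

With 16 GPUs at 6k tokens each, one optimizer step covers roughly 96k acoustic tokens.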