“NaturalSpeech: End-To-End Text to Speech Synthesis With Human-Level Quality”, 2022-05-09:
[samples] Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge human-level quality, and how to achieve it…Using this judging method, we found that several previous TTS systems have not achieved it (see Table 1).
In this paper, we answer these questions by first defining human-level quality based on the statistical significance of measurements and describing guidelines for judging it, and then proposing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset. Specifically, we leverage a variational autoencoder (VAE) for end-to-end text-to-waveform generation, with several key designs to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in the VAE.
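As a rough orientation for the "prior from text" and "posterior from speech" wording, the generic conditional-VAE training objective can be written as below. This is the textbook evidence lower bound, not a formula quoted from the abstract; the symbols are assumptions introduced here: $x$ is the speech waveform, $y$ the input text (phonemes), $z$ the latent, $q_\phi$ the posterior encoder, and $p_\theta$ the prior and decoder.

```latex
\log p_\theta(x \mid y)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
  \;-\;
  \mathrm{KL}\!\left( q_\phi(z \mid x) \,\|\, p_\theta(z \mid y) \right)
```

Under this reading, enhancing the prior and simplifying the posterior both aim at shrinking the KL term, so that latents sampled from text alone at inference time resemble those seen by the decoder during training.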
Experimental evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves −0.01 CMOS (comparative mean opinion score) relative to human recordings at the sentence level, with a Wilcoxon signed-rank test at p-level p ≫ 0.05, which demonstrates no statistically significant difference from human recordings for the first time on this dataset.
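The evaluation logic described above (a mean comparative score plus a Wilcoxon signed-rank test) can be sketched as follows. This is a minimal, standard-library-only illustration, not the authors' evaluation code: the function names `cmos_score` and `wilcoxon_signed_rank` and the ratings are hypothetical, the ratings are assumed to be per-pair comparative scores (system minus human, e.g. on a −3 to +3 scale), and the p-value uses the normal approximation with a tie correction rather than an exact distribution.

```python
import math


def cmos_score(ratings):
    """Mean of per-pair comparative ratings (system vs. human recording)."""
    return sum(ratings) / len(ratings)


def wilcoxon_signed_rank(ratings):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (w_plus, p_value). Zero ratings are discarded and tied
    absolute values receive their average rank.
    """
    nonzero = [r for r in ratings if r != 0]
    n = len(nonzero)
    if n == 0:
        return 0.0, 1.0  # no evidence of any difference

    ordered = sorted(nonzero, key=abs)
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j

    # Sum of ranks belonging to positive ratings.
    w_plus = sum(rank for rank, r in zip(ranks, ordered) if r > 0)
    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean_w) / math.sqrt(var_w)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return w_plus, p_value


if __name__ == "__main__":
    # Hypothetical ratings, invented purely for illustration.
    ratings = [0, 1, -1, 0, 2, -2, 1, -1, 0, 0, -1, 1]
    w_plus, p = wilcoxon_signed_rank(ratings)
    print(f"CMOS = {cmos_score(ratings):+.2f}, W+ = {w_plus}, p = {p:.3f}")
```

A p-value well above 0.05 means the test fails to reject the hypothesis that the system and the human recordings are rated equally, which is the sense in which the abstract claims no statistically significant difference. For small samples an exact test would be preferable to the normal approximation used here.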