“TorToise: Better Speech Synthesis through Scaling”, 2023-05-12:
In recent years, the field of image generation has been revolutionized by the application of autoregressive transformers and DDPMs. These approaches model image generation as a step-wise probabilistic process and leverage large amounts of compute and data to learn the image distribution. This methodology of improving performance need not be confined to images.
This paper describes a way to apply advances in the image generative domain to speech synthesis. The result is TorToise, an expressive, multi-voice text-to-speech system.
All model code and trained weights have been open-sourced on GitHub.