“StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-To-Image Synthesis”, 2023-01-23:
[video, code] Text-to-image synthesis has recently seen progress thanks to large pretrained language models, large-scale training data, and the introduction of scalable model families such as diffusion and autoregressive models. However, the best-performing models require iterative evaluation to generate a single sample. In contrast, generative adversarial networks (GANs) only need a single forward pass. They are thus much faster, but they currently remain far behind the state-of-the-art in large-scale text-to-image synthesis.
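The speed gap described above comes down to network-call counts: a diffusion sampler invokes its network once per denoising step, while a GAN generator is invoked exactly once per sample. A minimal illustrative sketch (not from the paper; the update rule and step count are mocked placeholders):

```python
# Mock comparison of sampling cost: forward passes per generated sample.
# The "network" is a stand-in; only the call-count structure matters.

def diffusion_sample(num_steps: int) -> int:
    """Iterative sampler: one network forward pass per denoising step."""
    calls = 0
    x = 1.0  # stand-in for the noisy latent being denoised
    for _ in range(num_steps):
        x *= 0.9  # mock denoising update
        calls += 1
    return calls

def gan_sample() -> int:
    """GAN generator: a single forward pass maps latent -> image."""
    return 1

# A typical 50-step diffusion sampler costs 50x the forward passes of a GAN.
assert diffusion_sample(50) // gan_sample() == 50
```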
This paper aims to identify the necessary steps to regain competitiveness. Our proposed model, StyleGAN-T, addresses the specific requirements of large-scale text-to-image synthesis, such as large capacity, stable training on diverse datasets, strong text alignment, and a controllable tradeoff between variation and text alignment.
StyleGAN-T improves over previous GANs and outperforms distilled diffusion models—the previous state-of-the-art in fast text-to-image synthesis—in terms of sample quality and speed.
…Using the final configuration developed in §3, we scale the model size, dataset, and training time. Our final model consists of ∼1 billion parameters; we did not observe any instabilities when increasing the model size. [emphasis added] We train on a union of several datasets amounting to 250M text-image pairs in total. We use progressive growing similar to StyleGAN-XL, except that all layers remain trainable. The hyperparameters and dataset details are listed in Appendix A.
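The growing schedule can be sketched as follows; this is a hypothetical toy model (class names and the base resolution are assumptions), meant only to show the stated difference from StyleGAN-XL: newly grown layers are appended during training, but earlier layers are never frozen.

```python
# Toy sketch of progressive growing where all layers remain trainable,
# in contrast to StyleGAN-XL, which freezes previously grown layers.

class Layer:
    def __init__(self, resolution: int):
        self.resolution = resolution
        self.trainable = True  # StyleGAN-T variant: never frozen after growing

class Generator:
    def __init__(self, base_resolution: int = 4):  # base resolution is an assumption
        self.layers = [Layer(base_resolution)]

    def grow(self) -> None:
        """Append a new layer at double the current output resolution."""
        self.layers.append(Layer(self.layers[-1].resolution * 2))
        # No freezing step here: existing layers keep receiving gradients.

    def trainable_resolutions(self) -> list:
        return [layer.resolution for layer in self.layers if layer.trainable]

g = Generator()
for _ in range(4):  # grow 4 -> 64
    g.grow()
assert g.trainable_resolutions() == [4, 8, 16, 32, 64]
```

In the StyleGAN-XL schedule, `grow()` would also set `trainable = False` on the older layers; keeping them trainable is the deviation the excerpt highlights.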