Text-to-image diffusion models have demonstrated remarkable capabilities in transforming textual prompts into coherent images, yet the computational cost of their inference remains a persistent challenge. To address this issue, we present UFOGen, a novel generative model designed for ultra-fast, one-step text-to-image synthesis.
In contrast to conventional approaches that focus on improving samplers or employing distillation techniques for diffusion models, UFOGen adopts a hybrid methodology, integrating diffusion models with a GAN objective. Leveraging a newly introduced diffusion-GAN objective and initialization with pre-trained diffusion models, UFOGen excels in efficiently generating high-quality images conditioned on textual descriptions in a single step.
Beyond traditional text-to-image generation, UFOGen is versatile in application. Notably, it is among the first models to enable both one-step text-to-image generation and diverse downstream tasks, marking a substantial advancement in the landscape of efficient generative models.
Figure 1: Images generated by our UFOGen Model with 1 sampling step.
The model is trained by fine-tuning Stable Diffusion 1.5 with our introduced techniques.
Our inspiration stems from previous work that successfully incorporated GANs into the framework of diffusion models [58, 59, 62, 68], which have demonstrated the capacity to generate images in as few as 4 steps when trained on small-scale datasets.
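To make the diffusion-GAN idea concrete, the following is a minimal sketch (not the paper's actual implementation) of its two core ingredients: the forward diffusion process that noises samples to a level t, and a non-saturating GAN objective in which a discriminator compares noised real samples against noised generator outputs. All function and variable names here (`forward_diffuse`, `gan_losses`, `alpha_bar_t`) are illustrative assumptions, not identifiers from UFOGen.

```python
import numpy as np

def forward_diffuse(x0, alpha_bar_t, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I).

    In a diffusion-GAN hybrid, both real data and one-step generator outputs
    are noised to the same level t before being shown to the discriminator.
    """
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def gan_losses(d_real_logits, d_fake_logits):
    """Non-saturating GAN losses computed from discriminator logits.

    Returns (discriminator_loss, generator_loss); the generator is trained
    to make the discriminator classify its noised outputs as real.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    d_loss = -np.mean(
        np.log(sigmoid(d_real_logits) + 1e-12)
        + np.log(1.0 - sigmoid(d_fake_logits) + 1e-12)
    )
    g_loss = -np.mean(np.log(sigmoid(d_fake_logits) + 1e-12))
    return d_loss, g_loss
```

In an actual training loop, `x0` would come alternately from the dataset and from a generator initialized with pre-trained diffusion weights, and the two losses would drive separate optimizer steps for the discriminator and generator.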
Figure 3: Illustration of the training strategy for UFOGen model.