“eDiff-I: Text-To-Image Diffusion Models With an Ensemble of Expert Denoisers”, Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, Ming-Yu Liu (2022-11-02):

Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts.

We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages.

To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. [cf. sparse upcycling warmstarting?]
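The core idea—routing each denoising step to a stage-specialized expert—can be sketched as follows. This is a hypothetical illustration, not the paper’s code: the `ToyExpert` stand-in, the interval boundaries, and the routing rule are all assumptions; real experts would be full U-Net denoisers sharing an initialization from the single warmstart model.

```python
import torch

class ToyExpert(torch.nn.Module):
    """Stand-in denoiser; a real expert would be a full U-Net."""
    def __init__(self, scale):
        super().__init__()
        self.scale = scale

    def forward(self, x_noisy, sigma, text_emb):
        return self.scale * x_noisy

class EnsembleDenoiser(torch.nn.Module):
    def __init__(self, experts, boundaries):
        # experts: one denoiser per noise-level interval
        # boundaries: ascending cutoffs, len(experts) - 1 of them
        super().__init__()
        self.experts = torch.nn.ModuleList(experts)
        self.register_buffer("boundaries", torch.tensor(boundaries))

    def forward(self, x_noisy, sigma, text_emb):
        # Count how many boundaries the current noise level exceeds
        # to select the expert specialized for this interval.
        idx = int((sigma >= self.boundaries).sum())
        return self.experts[idx](x_noisy, sigma, text_emb)
```

Because every expert shares the single pretrained model as a starting point, inference cost per step is unchanged—only which weights are loaded depends on the noise level.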

Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark.

In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. While CLIP text embeddings help determine the global look of the generated images, the outputs tend to miss the fine-grained details in the text. In contrast, images generated with T5 text embeddings alone better reflect the individual objects described in the text, but their global looks are less accurate. Using them jointly produces the best image-generation results in our model.
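One plausible way to condition on several encoders at once is to project each encoder’s token sequence to a shared width and concatenate along the token axis, so cross-attention in the denoiser can attend to both. This is a hedged sketch under those assumptions; eDiff-I’s internal fusion may differ.

```python
import torch

def fuse_conditioning(t5_emb, clip_emb, proj_t5, proj_clip):
    """Project each token-sequence embedding to a shared width, then
    concatenate along the token axis so the denoiser's cross-attention
    can attend to tokens from both encoders."""
    return torch.cat([proj_t5(t5_emb), proj_clip(clip_emb)], dim=1)
```

For example, a batch of T5 embeddings of shape `(B, 20, 1024)` and CLIP text embeddings of shape `(B, 77, 768)`, each projected to width 512, would fuse into a `(B, 97, 512)` conditioning sequence.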

Lastly, we show a technique that enables eDiff-I’s “paint-with-words” capability. A user can select a word in the input text and paint it on a canvas to control where the corresponding content appears in the output, which is very handy for crafting the exact image they have in mind.
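A paint-with-words-style mechanism can be sketched as biasing cross-attention logits: image locations the user painted for a given word get extra attention weight toward that word’s token. The bias form and `bias_scale` below are illustrative assumptions, not necessarily the exact eDiff-I formulation.

```python
import torch

def biased_cross_attention(q, k, v, word_masks, bias_scale=1.0):
    """q: (pixels, d) image queries; k, v: (tokens, d) text keys/values.
    word_masks: (pixels, tokens) binary map, 1 where a pixel was
    painted with that token's word. Adds a positive bias to the
    attention logits for painted (pixel, token) pairs."""
    d = q.shape[-1]
    logits = q @ k.T / d ** 0.5
    logits = logits + bias_scale * word_masks
    attn = torch.softmax(logits, dim=-1)
    return attn @ v
```

With a large `bias_scale`, a painted pixel attends almost exclusively to its assigned word, which is how the mask steers object layout without retraining.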

The project page is available at https://deepimagination.cc/eDiff-I/.

Figure 1: Example results and capabilities from our proposed method, eDiff-I. The first row shows that eDiff-I can faithfully turn complex input text prompts into artistic and photorealistic images. In the second row, we first show that eDiff-I can combine the text input and a reference image for generating the target output image, where the reference image can be conveniently used to represent a style or concept that is difficult to describe in words but for which a visual example exists. We also show the paint-with-words capability of eDiff-I, where phrases in the input text can be painted on a canvas to control the specific layout of objects described in the input text. The paint-with-words capability complements the text-to-image capability and provides an artist with more control over the generation outputs.
Figure 2: Synthesis in diffusion models corresponds to an iterative denoising process that gradually generates images from random noise; a corresponding stochastic process is visualized for a one-dimensional distribution. Usually, the same denoiser neural network is used throughout the denoising process. In eDiff-I, we instead train an ensemble of expert denoisers that are specialized for denoising in different intervals of the generative process.

[eDiff-I beats DALL·E 2 by 3.4 FID points, Stable Diffusion 1.× by 1.6 FID, & Imagen/Parti by ~0.3 FID:]

Table 1: Zero-shot FID comparison with recent state-of-the-art methods on the COCO 2014 validation dataset. We include the text encoder size in our model parameter size calculation.

Datasets: We use a collection of public and proprietary datasets to train our model. To ensure high-quality training data, we apply heavy filtering: a pretrained CLIP model measures the image-text alignment score, and an aesthetic scorer ranks image quality. We remove image-text pairs that fail to meet a preset CLIP score threshold or a preset aesthetic score threshold. The final training dataset contains about one billion text-image pairs, all with a shortest side greater than 64 pixels. We use all of them to train our base model, but only images with a shortest side greater than 256 and 1,024 pixels to train our SR256 and SR1024 models, respectively. For training the base and SR256 models, we perform a resize-then-center-crop: images are first resized so that the shortest side matches the model input resolution, then center-cropped. For training the SR1024 model, we randomly crop 256×256 regions during training and apply the model at 1,024×1,024 resolution during inference. We use the COCO and Visual Genome datasets for evaluation; they are excluded from our training data so that we measure zero-shot text-to-image generation performance.
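The described filtering step—keep only pairs that clear both the CLIP alignment threshold and the aesthetic threshold—can be sketched as below. The scorer functions and threshold values are placeholders, not the paper’s actual models or cutoffs.

```python
def filter_pairs(pairs, clip_score, aesthetic_score,
                 clip_threshold=0.3, aesthetic_threshold=5.0):
    """pairs: iterable of (image, text) pairs.
    clip_score(image, text) and aesthetic_score(image) return floats;
    a pair survives only if both scores clear their thresholds."""
    return [
        (img, txt) for img, txt in pairs
        if clip_score(img, txt) >= clip_threshold
        and aesthetic_score(img) >= aesthetic_threshold
    ]
```

In practice such scorers run batched on GPU over the raw corpus; the list comprehension above just makes the keep/drop rule explicit.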