[SDXL use of my conditional aspect-ratio training idea] …A notorious shortcoming of the LDM paradigm is that training a model requires a minimal image size, due to its two-stage architecture. The two main approaches to tackling this problem are either to discard all training images below a certain minimal resolution (for example, Stable Diffusion 1.4/1.5 discarded all images with any side shorter than 512 pixels), or, alternatively, to upscale images that are too small…For this particular choice of [LAION] data, discarding all samples below our pretraining resolution of 256² pixels would lead to 39% of the data being discarded.
Figure 2: Height-vs-Width distribution of our pre-training dataset. Without the proposed size-conditioning, 39% of the data would be discarded due to edge lengths smaller than 256 pixels as visualized by the dashed black lines. Color intensity in each visualized cell is proportional to the number of samples.
The second method, on the other hand, usually introduces upscaling artifacts which may leak into the final model outputs, causing, for example, blurry samples.
Instead, we propose to [randomly crop the image to fit and then] condition the U-Net model on the original image resolution, which is trivially available during training. At inference time, a user can then set the desired apparent resolution of the image via this size-conditioning.
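The paper feeds csize = (original height, original width) into the U-Net via Fourier feature embeddings added to the timestep embedding. A minimal sketch of what that might look like in plain Python, in the style of standard diffusion timestep embeddings; the function names and the embedding width are illustrative assumptions, not the paper's actual code:

```python
import math

def fourier_embed(value, dim=256, max_period=10000.0):
    """Sinusoidal (Fourier-feature) embedding of one scalar conditioning
    value, analogous to a diffusion timestep embedding. `dim` is an
    assumed width, not the paper's exact choice."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return ([math.sin(value * f) for f in freqs] +
            [math.cos(value * f) for f in freqs])

def embed_size(h_original, w_original, dim=256):
    # c_size: concatenate the embeddings of the original height and width;
    # per the paper, this vector is added to the timestep embedding.
    return fourier_embed(h_original, dim) + fourier_embed(w_original, dim)

c_size = embed_size(384, 512)  # original resolution, known during training
```

At inference time the same embedding is computed from whatever "apparent resolution" the user requests, so a single trained model can be steered toward low- or high-resolution image characteristics.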
Figure 3: The effects of varying the size-conditioning. We draw 4 samples with the same random seed from SDXL and vary the size-conditioning as depicted above each column. Image quality clearly increases when conditioning on larger image sizes. Note: for this visualization we use the 512 × 512 pixel base model (see §2.5), since the effect of size-conditioning is more clearly visible before the 1024 × 1024 px finetuning. Best viewed zoomed in.
Evidently (see Figure 3), the model has learned to associate the conditioning csize with resolution-dependent image features, which can be leveraged to modify the appearance of an output corresponding to a given prompt.
…We quantitatively assess the effects of this simple but effective conditioning technique by training and evaluating 3 LDMs on class-conditional ImageNet at spatial size 512².
Figure 4: Comparison of the output of SDXL with previous versions of Stable Diffusion. For each prompt, we show 3 random samples from the respective model, generated with 50 steps of the DDIM sampler at cfg-scale 8.0. Additional samples in Figure 14.
Table 2: Conditioning on the original spatial size of the training examples improves performance on class-conditional ImageNet at 512² resolution.
model           FID-5k ↓    IS-5k ↑
CIN-512-only    43.84       110.64
CIN-nocond      39.76       211.50
CIN-size-cond   36.53       215.34
…Conditioning the Model on Cropping Parameters: The first two rows of Figure 4 illustrate a typical failure mode of previous SD models: synthesized objects can be cropped, such as the cut-off head of the cat in the left examples for SD-1.5 and SD-2.1. An intuitive explanation for this behavior is the use of random cropping during training of the model: as collating a batch in DL frameworks such as PyTorch requires tensors of the same size, a typical processing pipeline is to (1) resize an image such that the shortest side matches the desired target size, followed by (2) randomly cropping the image along the longer axis. While random cropping is a natural form of data augmentation, it can leak into the generated samples, causing the detrimental effects shown above [cf. GAN data-augmentation].
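The resize-then-random-crop pipeline in steps (1)–(2) can be sketched as follows; this is a hedged, stdlib-only stand-in for what e.g. torchvision's resize and random-crop transforms do, with hypothetical function names, operating on image dimensions only:

```python
import random

def resize_shortest_side(h, w, target=512):
    # (1) resize so the SHORTER side matches the target size
    scale = target / min(h, w)
    return round(h * scale), round(w * scale)

def random_crop_offsets(h, w, target=512):
    # (2) randomly place a target x target window; only the longer
    # axis has any slack, so only that axis actually gets cropped
    top = random.randint(0, h - target)
    left = random.randint(0, w - target)
    return top, left

h, w = resize_shortest_side(768, 1024)  # -> (512, 683)
top, left = random_crop_offsets(h, w)   # top is always 0 here; left varies
```

Whatever falls outside the sampled window is silently discarded, which is exactly how a cat's head can vanish from the training crop while the caption still mentions the whole cat.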
To fix this problem, we propose another simple yet effective conditioning method: during dataloading, we uniformly sample crop coordinates ctop and cleft (integers specifying the number of pixels cropped from the top-left corner along the height and width axes, respectively) and feed them into the model as conditioning parameters via Fourier feature embeddings, similar to the size-conditioning described above. The concatenated embedding ccrop is then used as an additional conditioning parameter. We emphasize that this technique is not limited to LDMs and could be used for any DM. Note that crop- and size-conditioning can be readily combined; in that case, we concatenate the feature embeddings along the channel dimension before adding them to the timestep embedding in the U-Net. Algorithm 1 illustrates how we sample ccrop and csize during training when such a combination is applied.
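A per-example dataloading step combining the two micro-conditionings, in the spirit of the Algorithm 1 referenced above, might look like the following. This is a sketch under stated assumptions (stdlib-only, illustrative names, an assumed embedding width of 256 per scalar), not the paper's implementation:

```python
import math
import random

def fourier_embed(value, dim=256, max_period=10000.0):
    # sinusoidal embedding of one scalar (assumed width, not the paper's)
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return ([math.sin(value * f) for f in freqs] +
            [math.cos(value * f) for f in freqs])

def sample_micro_conditioning(h_orig, w_orig, target=512, dim=256):
    """Combined size- and crop-conditioning for one training example."""
    # size-conditioning uses the ORIGINAL resolution, before any resizing
    scale = target / min(h_orig, w_orig)
    h, w = round(h_orig * scale), round(w_orig * scale)
    # uniformly sample the crop offsets actually applied to the image
    c_top = random.randint(0, h - target)
    c_left = random.randint(0, w - target)
    # embed all four scalars and concatenate; the joint vector is then
    # added to the timestep embedding in the U-Net
    cond = []
    for v in (h_orig, w_orig, c_top, c_left):
        cond += fourier_embed(v, dim)
    return (c_top, c_left), cond
```

The key point is that the offsets used for conditioning are the same ones used to perform the crop, so the model sees an honest label for the augmentation it was trained under.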
Given that, in our experience, large-scale datasets are, on average, object-centric, we set (ctop, cleft) = (0, 0) during inference and thereby obtain object-centered samples from the trained model. See Figure 5 for an illustration: by tuning (ctop, cleft), we can simulate the amount of cropping at inference time. This is a form of conditioning-augmentation, and has been used in various forms with autoregressive models, and more recently with diffusion models. While other methods like data bucketing successfully tackle the same task, we still benefit from cropping-induced data augmentation while ensuring that it does not leak into the generation process; in fact, we use it to our advantage to gain more control over the image synthesis process. Furthermore, it is easy to implement and can be applied online during training, without additional data preprocessing.
Figure 5: Varying the crop conditioning as discussed in §2.2 "Micro-conditioning". See Figure 4 & Figure 14 for samples from SD-1.5 & SD-2.1, which provide no explicit control of this parameter and thus introduce cropping artifacts. Samples from the 512² model, see §2.5.