NovelAI Improvements on Stable Diffusion

As part of the development process for our NovelAI Diffusion image generation models, we modified the model architecture of Stable Diffusion and its training process.

These changes improved the overall quality of generations and the user experience, and better suited our use case of enhancing storytelling through image generation.

In this blog post, we’d like to give a technical overview of some of the modifications and additions we performed.

Using Hidden States of CLIP’s Penultimate Layer

Stable Diffusion uses the final hidden states of CLIP’s transformer-based text encoder to guide generations using classifier-free guidance.

In Imagen (Saharia et al., 2022), instead of the final layer’s hidden states, the penultimate layer’s hidden states are used for guidance.

Discussions on the EleutherAI Discord also indicated that the penultimate layer might give superior results for guidance, as the hidden state values change abruptly in the last layer, which prepares them for being condensed into the smaller vector usually used for CLIP-based similarity search.

During experimentation, we found that Stable Diffusion is able to interpret the hidden states from the penultimate layer, as long as the final layer norm of CLIP’s text transformer is applied, and generate images that still match the prompt, although with slightly reduced accuracy.
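As an illustration, a minimal sketch of extracting these hidden states with the Hugging Face `transformers` CLIP text encoder used by Stable Diffusion might look like this (illustrative code, not our training pipeline):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Illustrative sketch: take the penultimate layer's hidden states and apply
# CLIP's final layer norm before handing them to the diffusion model.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(["Hatsune Miku, Red Dress"], padding="max_length",
                   max_length=77, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = text_encoder(**tokens, output_hidden_states=True)
    penultimate = out.hidden_states[-2]  # hidden_states[-1] is the final layer
    # apply CLIP's final layer norm so the diffusion model sees normalized states
    cond = text_encoder.text_model.final_layer_norm(penultimate)
# `cond` is then used as the conditioning for classifier-free guidance
```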

Further testing led us to perform our training with the penultimate layer’s hidden states rather than the final layer’s, because we found it let the model make better use of the dense information in tag-based prompts, allowing it to more quickly learn how to disentangle certain concepts. When using the final layer, for example, the model had more difficulty disentangling disparate concepts and correctly assigning colors.

“Hatsune Miku, Red Dress”

During the experimentation phase, we evaluated training runs with different parameters on different prompts like “Hatsune Miku, Red Dress”, which tended to have the red of the dress leak into Miku’s hair and eye color until a certain point in training, especially when using the final layer’s hidden states. We also used other more complex prompts to evaluate the ability of different training runs to combine tagged concepts accurately, such as:

`Tags: purple eyes, 1girl, short hair, smile, open mouth, ruffled blouse, red blouse, pleated skirt, blonde hair, green scarf, waving at viewer`

Generation with NovelAI Diffusion (Curated), our anime model for “Tags: purple eyes, 1girl, short hair, smile, open mouth, ruffled blouse, red blouse, pleated skirt, blonde hair, green scarf, waving at viewer”

Aspect Ratio Bucketing

One common issue of existing image generation models is that they are very prone to producing images with unnatural crops. This is because these models are trained to produce square images, whereas most photos and artworks are not square. However, the model can only work on images of the same size at the same time, and during training, it is common practice to operate on multiple training samples at once to optimize the efficiency of the GPUs used. As a compromise, square images are chosen, and during training, only the center of each image is cropped out and then shown to the image generation model as a training example.

Knight wearing a crown with darkened regions removed by the center crop

For example, humans are often generated without feet or heads, and swords consist of only a blade with a hilt and point outside the frame.
As we are creating an image generation model to accompany our storytelling experience, it is important that our model is able to produce proper, uncropped characters, and generated knights should not be holding a metallic-looking straight line extending to infinity.

Another issue with training on cropped images is that it can lead to a mismatch between the text and the image.

For example, an image with a `crown` tag will often no longer contain a crown after a center crop is applied and the monarch has thereby been decapitated.

We found that using random crops instead of center crops only slightly improves these issues.

Using Stable Diffusion with variable image sizes is possible, although going too far beyond the native resolution of 512x512 tends to introduce repeated image elements, and very low resolutions produce indiscernible images.

Still, this indicated to us that training the model on variable-sized images should be possible. Training on single, variable-sized samples would be trivial, but also extremely slow and more liable to training instability due to the lack of regularization provided by the use of mini-batches.

Custom Batch Generation

As no ready-made solution for this problem seems to exist, we implemented custom batch generation code for our dataset that allows the creation of batches where every item in the batch has the same size, but the image size of batches may differ.

We do this through a method we call aspect ratio bucketing. An alternative approach would be to use a fixed image size, scale each image to fit within this fixed size and apply padding that is masked out during training. Since this leads to unnecessary computation during training, we have not chosen to follow this alternative approach.

In the following, we describe the original idea behind our custom batch generation scheme for aspect ratio bucketing.

First, we have to define which buckets we want to sort the images of our dataset into. For this purpose, we define a maximum image size of 512x768 with a maximum dimension size of 1024. Since the maximum image size is 512x768, which is larger than 512x512 and requires more VRAM, the per-GPU batch size has to be lowered, which can be compensated for through gradient accumulation.

We generate buckets by applying the following algorithm:

● Set the width to 256.
● While the width is less than or equal to 1024:
• Find the largest height such that height is less than or equal to 1024 and that width multiplied by height is less than or equal to 512 * 768.
• Add the resolution given by height and width as a bucket.
• Increase the width by 64.

The same is repeated with width and height exchanged. Duplicated buckets are pruned from the list, and an additional bucket sized 512x512 is added.
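A minimal sketch of this bucket generation procedure might look as follows; rounding the computed heights down to multiples of 64 is our assumption, as the algorithm above only specifies the 64-pixel step for the swept dimension:

```python
MAX_PIXELS = 512 * 768  # maximum image area
MAX_DIM = 1024          # maximum size of a single dimension
STEP = 64

def generate_buckets():
    buckets = set()
    width = 256
    while width <= MAX_DIM:
        # largest height <= MAX_DIM such that width * height <= 512 * 768,
        # rounded down to a multiple of STEP (assumption)
        height = min(MAX_DIM, (MAX_PIXELS // width) // STEP * STEP)
        buckets.add((width, height))
        width += STEP
    # repeat with width and height exchanged; the set prunes duplicates
    buckets |= {(h, w) for (w, h) in buckets}
    buckets.add((512, 512))
    return sorted(buckets)
```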

Next, we assign images to their corresponding buckets. For this purpose, we first store the bucket resolutions in a NumPy array and calculate the aspect ratio of each resolution. For each image in the dataset, we then retrieve its resolution and calculate the aspect ratio. The image aspect ratio is subtracted from the array of bucket aspect ratios, allowing us to efficiently select the closest bucket according to the absolute value of the difference between aspect ratios:

`image_bucket = argmin(abs(bucket_aspects - image_aspect))`

The image’s bucket number is stored associated with its item ID in the dataset. If the image’s aspect ratio is very extreme and too different from even the best-fitting bucket, the image is pruned from the dataset.
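A sketch of this assignment step, building on the `generate_buckets()` sketch above; the `max_aspect_error` pruning threshold is an illustrative value, as the post does not state the exact cutoff:

```python
import numpy as np

bucket_resolutions = np.array(generate_buckets(), dtype=np.float64)
bucket_aspects = bucket_resolutions[:, 0] / bucket_resolutions[:, 1]

def assign_bucket(image_width, image_height, max_aspect_error=0.25):
    image_aspect = image_width / image_height
    errors = np.abs(bucket_aspects - image_aspect)
    best = int(np.argmin(errors))
    # prune images whose aspect ratio is too far from even the closest bucket
    if errors[best] > max_aspect_error:
        return None
    return best
```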

Since we train on multiple GPUs, before each epoch, we shard the dataset to ensure that each GPU works on a distinct subset of equal size. To do this, we first copy the list of item IDs in the dataset and shuffle them. If this copied list is not divisible by the number of GPUs multiplied by the batch size, the list is trimmed, and the last items are dropped to make it divisible.

We then select a distinct subset of `1/world_size*bsz` item IDs according to the global rank of the current process. The rest of the custom batch generation will be described as seen from any single one of these processes and operates on the subset of dataset item IDs.

For the current shard, lists for each bucket are created by iterating over the list of shuffled dataset item IDs and assigning the ID to the list corresponding to the bucket that was assigned to the image.

Once all images are processed, we iterate over the lists for each bucket. If a list’s length is not divisible by the batch size, we remove the last elements from the list as necessary to make it divisible and add them to a separate catch-all bucket. As the overall shard size is guaranteed to contain a number of elements divisible by the batch size, doing so is guaranteed to produce a catch-all bucket with a length divisible by the batch size as well.
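Putting the sharding and per-bucket list construction together, a rough sketch (with illustrative names such as `world_size`, `rank`, and `bsz`) could look like this:

```python
import random

def build_shard_buckets(item_ids, item_buckets, world_size, rank, bsz, seed):
    ids = list(item_ids)
    random.Random(seed).shuffle(ids)
    # trim the list so it divides evenly into world_size * bsz
    usable = len(ids) - len(ids) % (world_size * bsz)
    shard_size = usable // world_size
    shard = ids[rank * shard_size:(rank + 1) * shard_size]
    # group the shard's item IDs by their precomputed bucket index
    per_bucket = {}
    for item_id in shard:
        per_bucket.setdefault(item_buckets[item_id], []).append(item_id)
    # move leftovers that do not fill a whole batch into a catch-all bucket
    catch_all = []
    for items in per_bucket.values():
        extra = len(items) % bsz
        if extra:
            catch_all.extend(items[-extra:])
            del items[-extra:]
    if catch_all:
        per_bucket["catch_all"] = catch_all
    return per_bucket
```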

When a batch is requested, we randomly draw a bucket from a weighted distribution. Each bucket’s weight is its remaining size divided by the total size of all remaining buckets. This ensures that, even with buckets of widely varying sizes, the custom batch generation does not introduce a strong bias in when, during training, an image shows up according to its size. If buckets were chosen without weighting, small buckets would empty out early during the training process, and only the biggest buckets would remain towards the end of training. Weighting buckets by size avoids this.

A batch of items is finally taken from the chosen bucket. The items taken are removed from the bucket. If the bucket is now empty, it is deleted for the rest of the epoch. The chosen item IDs and the chosen bucket’s resolution are now passed to an image-loading function. Each item ID’s image is loaded and processed to fit within the bucket resolution. For fitting the image, two approaches are possible.
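Continuing the sketch above, drawing a batch from the per-shard buckets could look roughly like this:

```python
import random

def draw_batch(per_bucket, bsz):
    # pick a bucket with probability proportional to its remaining size
    keys = list(per_bucket.keys())
    weights = [len(per_bucket[k]) for k in keys]
    chosen = random.choices(keys, weights=weights, k=1)[0]
    # take a batch of item IDs and remove them from the bucket
    batch_ids = per_bucket[chosen][:bsz]
    del per_bucket[chosen][:bsz]
    if not per_bucket[chosen]:
        del per_bucket[chosen]  # bucket is exhausted for the rest of the epoch
    # `chosen` identifies the bucket, and therefore the target resolution,
    # that the image-loading function should use for this batch
    return chosen, batch_ids
```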

First, the image could be simply rescaled. This would lead to a slight distortion of the image. For this reason, we have opted for the second approach:

The image is scaled, while preserving its aspect ratio, in such a way that it:

● Either fits the bucket resolution exactly if the aspect ratio happens to match
● or it extends past the bucket resolution on one dimension while fitting it exactly on the other.

In the latter case, a random crop is applied.

As we found that the mean aspect ratio error per image is only 0.033, these random crops only remove very little of the actual image, usually less than 32 pixels.
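A simplified version of this fit-and-crop step, using Pillow, might look like the following sketch:

```python
import random
from PIL import Image

def fit_image(path, bucket_w, bucket_h):
    """Scale an image to cover the bucket resolution while preserving its
    aspect ratio, then randomly crop the dimension that extends past it."""
    img = Image.open(path).convert("RGB")
    scale = max(bucket_w / img.width, bucket_h / img.height)
    new_w, new_h = round(img.width * scale), round(img.height * scale)
    img = img.resize((new_w, new_h), Image.LANCZOS)
    # random crop of the overhang (zero on the exactly fitting dimension)
    left = random.randint(0, new_w - bucket_w)
    top = random.randint(0, new_h - bucket_h)
    return img.crop((left, top, left + bucket_w, top + bucket_h))
```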

The loaded and processed images are finally returned as the image part of the batch.

Extending the Stable Diffusion Token Limit by 3x

The original Stable Diffusion model has a maximum prompt length of 75 CLIP tokens, plus a start and end token (77 total). This is because CLIP itself has this limitation and provides the conditioning used in classifier-free guidance.

Since we are working with information-dense tags, it is easy to exceed this token limit. We managed to extend the maximum prompt length of our model to three times the original.

This allows for much more information to be packed into a single prompt and allows fine-grained control of the generated image.
It is also perfectly suited to use text snippets from your adventures generated with our AI storyteller!

To do this, we determine the longest prompt inside a batch of prompts and round its length up to the next multiple of 75. All shorter prompts within the batch are padded to the same length as the longest prompt with CLIP’s end-of-sentence token. Should the total length be above our determined cutoff point of 225 tokens, the batch is truncated along the sequence dimension to a length of 225. Thereafter, it is split into individual chunks of 75 tokens along the sequence dimension. Each chunk is passed through CLIP’s text encoder individually, and the resulting encoded chunks are concatenated.
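A rough sketch of this encoding scheme using the Hugging Face `transformers` CLIP classes is shown below; re-adding start/end tokens per chunk is our assumption, and the penultimate-layer trick from earlier is omitted for brevity:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def encode_prompts(prompts, chunk_len=75, max_len=225):
    # tokenize without special tokens and pad the batch with the EOS token
    ids = tokenizer(prompts, add_special_tokens=False).input_ids
    longest = max(len(x) for x in ids)
    # round the longest prompt up to the next multiple of 75, capped at 225
    target = min(max(-(-longest // chunk_len), 1) * chunk_len, max_len)
    pad_id = tokenizer.eos_token_id
    batch = torch.tensor([x[:target] + [pad_id] * (target - len(x[:target]))
                          for x in ids])
    # encode each 75-token chunk separately (wrapping each chunk in BOS/EOS
    # is an assumption) and concatenate the resulting hidden states
    chunks = []
    for i in range(0, target, chunk_len):
        chunk = batch[:, i:i + chunk_len]
        bos = torch.full((chunk.shape[0], 1), tokenizer.bos_token_id)
        eos = torch.full((chunk.shape[0], 1), tokenizer.eos_token_id)
        chunk = torch.cat([bos, chunk, eos], dim=1)
        chunks.append(text_encoder(chunk).last_hidden_state)
    return torch.cat(chunks, dim=1)
```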

Since unconditional conditioning (UC) and prompt conditioning are combined in the form of `uc + (prompt - uc) * scale` for classifier-free guidance, additional care needs to be taken to pad the UC to the same length as the prompt when performing inference on the model.
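One simple way to satisfy this, continuing the sketch above where each encoded chunk spans 77 positions, is to repeat the unconditional embedding’s chunk until it covers as many chunks as the prompt embedding; this is only one possible implementation of the padding described here:

```python
import torch

def pad_uncond(uc_emb, prompt_emb, chunk_positions=77):
    # repeat the empty-prompt chunk so the UC covers as many chunks as the prompt
    n_prompt = prompt_emb.shape[1] // chunk_positions
    n_uc = uc_emb.shape[1] // chunk_positions
    if n_uc < n_prompt:
        extra = uc_emb[:, :chunk_positions].repeat(1, n_prompt - n_uc, 1)
        uc_emb = torch.cat([uc_emb, extra], dim=1)
    return uc_emb
```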

During early experimentation, we noted that the base Stable Diffusion model is capable of making use of additional information in prompts given in this format to some degree. To ensure the best possible performance, we vary the length of prompts during training from below 75 tokens up to 225 tokens, letting the model adapt to differently sized prompts.

Hypernetworks

In 2021, Kurumuz began developing Hypernetworks as a new way to control model generations.
The goal was to provide better text generation modules for NovelAI, which are currently based on prompt tuning.

It should be noted that this concept is entirely distinct from the HyperNetworks introduced by Ha et al. (2016), which work by modifying or generating the weights of the model, while our Hypernetworks apply a single small neural network (either a linear layer or multi-layer perceptron) at multiple points within the larger network, modifying the hidden states.

During the course of development, we experimented with numerous different configurations of Hypernetworks, either applying single or multiple networks at different points in the models to be modified. Most experiments in these early stages took place in large-scale transformer models for text generation.

The results from these experiments are going into what will become our future AI Modules V2.

Initial testing at smaller scales has shown very promising results, allowing the creation of modules that have a much stronger influence on model behavior than the previous prompt tunes. Due to this, datasets for various additional modules, which become possible through the new Hypernetwork architecture, are being created and optimized.

An important consideration with this technology is performance. Some more complex architectures are able to achieve higher accuracy after training, but the inference slowdown can become a big problem in a production environment, to the point where the quality improvement becomes moot: at that point, a larger model without Hypernetworks can both run faster and produce better results.

During the early days of Stable Diffusion development before the launch, we were given access to the models for research purposes. During that time, our researchers dug into the model from many angles to see how we could improve it. One thing that would fit well into our service would be modules, seeing as our users are already familiar with the concept from our text generation service, and they can give unprecedented control over the model outputs. Initially, we tried training an embedding similar to our text generation modules. (This technique is similarly applied in textual inversion.)

However, we found the model wasn’t able to generalize well enough with the learned embedding, and the overall learning capacity was very small, as it was limited by that embedding. We then thought to apply our Hypernetwork technology to Stable Diffusion: if it worked, it could have much more capacity to learn while still remaining performant enough to be viable in a production environment.

After many iterations of testing many different architectures, Aero was able to come up with one that is both performant and achieves high accuracy with varied dataset sizes. The hypernets are applied to the k and v vectors of the CrossAttention layers in Stable Diffusion, while not touching any other parts of the U-Net. We found that the shallow attention layers overfit quickly with this approach, so we penalize those layers during training. This mostly mitigates the overfitting issue and results in better generalization at the end of training.
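To make the idea concrete, here is a heavily simplified sketch of such a Hypernetwork module; the two-layer MLP, the Mish activation, and the near-zero initialization are assumptions for illustration rather than the exact production architecture:

```python
import torch
import torch.nn as nn

class HypernetModule(nn.Module):
    """Small residual network applied to the hidden states that feed the
    k/v projections of a cross-attention layer."""

    def __init__(self, dim, hidden_mult=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * hidden_mult),
            nn.Mish(),
            nn.Linear(dim * hidden_mult, dim),
        )
        # start near the identity so training begins close to the base model
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, context):
        return context + self.net(context)

# Inside a cross-attention forward pass, this would be used roughly as:
#   k = attn.to_k(hypernet_k(context))
#   v = attn.to_v(hypernet_v(context))
```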

We found this architecture to perform as well as, or in some cases even better than, fine-tuning.

The approach performs especially well compared to fine-tuning when data on the target concept is limited. We believe this is because the original model is preserved, and the hypernets can find sparse areas of the latent space to match the data, while fine-tuning on similarly small datasets causes the model to lose generalization quality as it tries to fit the few training examples.

Well, that’s that!
We hope you had fun learning some of our deeper developments discovered over the past three months and have a great time generating with NovelAI Diffusion!

novelai.net Driven by AI, painlessly construct unique stories, thrilling tales, seductive romances, or just fool around. Anything goes!
