Anime Neural Net Graveyard

Gwern Branwen

Danbooru AI, BigGAN, ProGAN, autoencoder NN

Post-mortems of failed neural network experiments in generating anime images, pre-StyleGAN/BigGAN.

2019-02-04^–_1y2021-01-29 finished certainty: highly likely importance: 2 backlinks bibliography

GAN
Normalizing Flow
- Glow
VAE
- IntroVAE

[Warning: JavaScript Disabled!]

[For support of key website features (link annotation popups/popovers & transclusions, collapsible sections, backlinks, tablesorting, image zooming, sidenotes etc), you must enable JavaScript.]

My experiments in generating anime faces, tried periodically since 2015, succeeded in 2019 with the release of StyleGAN. But for comparison, here are the failures from some of my older GAN or other NN attempts; as the quality is worse than StyleGAN, I won’t bother going into details in these post-mortems—creating the datasets & training the ProGAN & tuning & transfer-learning were all much the same as already outlined at length for the StyleGAN results.

Included are:

ProGAN

Glow

MSG-GAN

PokeGAN

Self-Attention-GAN-TensorFlow

VGAN

BigGAN unofficial

BigGAN-TensorFlow

BigGAN-PyTorch

(official BigGAN)

GAN-QP

WGAN

IntroVAE

Examples of anime faces generated by neural networks have wowed weebs across the world. Who could not be impressed by TWDNE samples like these:

64 TWDNE face samples selected from social media, in an 8×8 grid.

They are colorful, well-drawn, near-flawless, and attractive to look at. This page is not about them. This page is about… the others—the failed experiments that came before.

GAN

ProGAN

Using Karras et al 2017’s official implementation:

2018-09-08, 512–1024px whole-Asuka images ProGAN samples:

1024px, whole-Asuka images, ProGAN

512px whole-Asuka images, ProGAN
2018-09-18, 512px Asuka faces, ProGAN samples:

512px Asuka faces, ProGAN
2018-10-29, 512px Holo faces, ProGAN:

Random samples of 512px ProGAN Holo faces

After generating ~1k Holo faces, I selected the top decile (n = 103) of the faces (Imgur mirror):

512px ProGAN Holo faces, random samples from top decile (6×6)

The top decile images are, nevertheless, showing distinct signs of both artifacting & overfitting/memorization of data points. Another 2 weeks proved this out further:

ProGAN samples of 512px Holo faces, after badly overfitting (iteration #10,325)

Interpolation video of the October 2018 512px Holo face ProGAN; note the gross overfitting indicated by the abruptness of the interpolations jumping from face (mode) to face (mode) and lack of meaningful intermediate faces in addition to the overall blurriness & low visual quality.
2019-01-17, Danbooru2017 512px SFW images, ProGAN:

512px SFW Danbooru2017, ProGAN
2019-02-05 (stopped in order to train with the new StyleGAN codebase), the 512px anime face dataset used elsewhere, ProGAN:

512px anime faces, ProGAN

Interpolation video of the 2018-02-05 512px anime face ProGAN; while the image quality is low, the diversity is good & shows no overfitting/memorization or blatant mode collapse

Downloads:
- random samples archive (Imgur mirror)
- final model checkpoint (265MB)

PokeGAN

nshepperd’s (unpublished) multi-scale GAN with self-attention layers, spectral normalization, and a few other tweaks:

Self-Attention-GAN-TensorFlow

SAGAN did not have an official implementation released at the time so I used the Junho Kim implementation; 128px SAGAN, WGAN-LP loss, on Asuka faces & whole Asuka images:

Training montage of the 2018-08-18 128px whole-Asuka SAGAN; possibly too-high LR

Self-Attention-GAN-TensorFlow, Asuka faces, 2019-09-13

VGAN

The official VGAN code for Peng et al 2018 had not been released when I began trying VGAN, so I used akanimax’s implementation.

The variational discriminator bottleneck, along with self-attention layers and progressive growing, is one of the few strategies which permit 512px images, and I was intrigued to see that it worked relatively well, although I ran into persistent issues with instability & mode collapse. I suspect that VGAN could’ve worked better than it did with some more work.

BigGAN Unofficial

Brock et al 2018^s official implementation & models were not released until late March 2019 (nor the semi-official compare_gan implementation until February 2019), and I experimented with 2 unofficial implementations in late 2018–early 2019.

BigGAN-TensorFlow

Junho Kim implementation; 128px spectral norm hinge loss, anime faces:

This one never worked well at all, and I am still puzzled what went wrong.

BigGAN-PyTorch

Aaron Leong’s PyTorch BigGAN implementation (not the official BigGAN implementation). As it’s class-conditional, I faked having 1000_1,024ya classes by constructing a variant anime face dataset: taking the top 1000_1,024ya characters by tag count in the Danbooru2017 metadata, I then filtered for those character tags 1 by 1, and copied them & cropped faces into matching subdirectories 1–1000. This let me try out both faces & whole images. I also attempted to hack in gradient accumulation for big minibatches to make it a true BigGAN implementation, but didn’t help too much; the problem here might simply have been that I couldn’t run it long enough.

Results upon abandoning:

Leong BigGAN-PyTorch, 1000-class anime character dataset, 2018-11-30 (#314,000)

Leong BigGAN-PyTorch, 1000-class anime face dataset, 2018-12-24 (#1,006,320)

GAN-QP

Implementation of Su2018:

Training oscillated enormously, with all the samples closely linked and changing simultaneously. This was despite the checkpoint model being enormous (551MB) and I am suspicious that something was seriously wrong—either the model architecture was wrong (too many layers or filters?) or the learning rate was many orders of magnitude too large. Because of the small minibatch, progress was difficult to make in a reasonable amount of wallclock time, so I moved on.

WGAN

WGAN-GP official implementation; I did most of the early anime face work with WGAN on a different machine and didn’t keep copies. However, a sample from a short run gives an idea of what WGAN tended to look like on anime runs:

Normalizing Flow

Glow

Used Glow (Kingma & Dhariwal2018) official implementation.

Due to the enormous model size (4.2GB), I had to modify Glow’s settings to get training working reasonably well, after extensive tinkering to figure out what any meant:

{"verbose": true, "restore_path": "logs/model_4.ckpt", "inference": false, "logdir": "./logs", "problem": "asuka",
"category": "", "data_dir": "../glow/data/asuka/", "dal": 2, "fmap": 1, "pmap": 16, "n_train": 20000, "n_test": 1000,
"n_batch_train": 16, "n_batch_test": 50, "n_batch_init": 16, "optimizer": "adamax", "lr": 0.0005, "beta1": 0.9,
"polyak_epochs": 1, "weight_decay": 1.0, "epochs": 1000000, "epochs_warmup": 10, "epochs_full_valid": 3,
"gradient_checkpointing": 1, "image_size": 512, "anchor_size": 128, "width": 512, "depth": 13, "weight_y": 0.0,
"n_bits_x": 8, "n_levels": 7, "n_sample": 16, "epochs_full_sample": 5, "learntop": false, "ycond": false, "seed": 0,
"flow_permutation": 2, "flow_coupling": 1, "n_y": 1, "rnd_crop": false, "local_batch_train": 1, "local_batch_test": 1,
"local_batch_init": 1, "direct_iterator": true, "train_its": 1250, "test_its": 63, "full_test_its": 1000, "n_bins": 256.0, "top_shape": [4, 4, 768]}
...
{"epoch": 5, "n_processed": 100000, "n_images": 6250, "train_time": 14496, "loss": "2.0090", "bits_x": "2.0090", "bits_y": "0.0000", "pred_loss": "1.0000"}

An additional challenge was numerical instability in the reversing of matrices, giving rise to many ‘invertibility’ crashes.

Final sample before I looked up the compute requirements more carefully & gave up on Glow:

Glow, Asuka faces, 5 epoches (2018-08-02)

VAE

IntroVAE

A hybrid GAN-VAE architecture introduced in mid-2018 by “IntroVAE: Introspective Variational Autoencoders for Photographic Image Synthesis”, Huang et al 2018, with the official PyTorch implementation released in April 2019, IntroVAE attempts to reuse the encoder-decoder for an adversarial loss as well, to combine the best of both worlds: the principled stable training & reversible encoder of the VAE with the sharpness & high quality of a GAN.

Quality-wise, they show IntroVAE works on CelebA & LSUN BEDROOM at up to 1024px resolution with results they claim are comparable to ProGAN. Performance-wise, for 512px, they give a runtime of 7 days with a minibatch n = 12, or presumably 4 GPUs (since their 1024px run script implies they used 4 GPUs and I can fit a minibatch of n = 4 onto 1×1080ti, so 4 GPUs would be consistent with n = 12), and so 28 GPU-days.

I adapted the 256px suggested settings for my 512px anime portraits dataset:

python main.py --hdim=512 --output_height=512 --channels='32, 64, 128, 256, 512, 512, 512' --m_plus=120 \
    --weight_rec=0.05 --weight_kl=1.0 --weight_neg=0.5 --num_vae=0 \
    --dataroot=/media/gwern/Data2/danbooru2018/portrait/1/ --trainsize=302652 --test_iter=1000 --save_iter=1 \
    --start_epoch=0 --batchSize=4 --nrow=8 --lr_e=0.0001 --lr_g=0.0001 --cuda --nEpochs=500
# ...====> Cur_iter: [187060]: Epoch [3] (5467⁄60531): time: 142675: Rec: 19569, Kl_E: 162, 151, 121, Kl_G: 151, 121,

There was a minor bug in the codebase where it would crash on trying to print out the log data, perhaps because it assumes multi-GPU and I was running on 1 GPU, and was trying to index into an array which was actually a simple scalar, which I fixed by removing the indexing:

-        info += 'Rec: {:.4f}, '.format(loss_rec.data[0])
-        info += 'Kl_E: {:.4f}, {:.4f}, {:.4f}, '.format(lossE_real_kl.data[0],
-                                lossE_rec_kl.data[0], lossE_fake_kl.data[0])
-        info += 'Kl_G: {:.4f}, {:.4f}, '.format(lossG_rec_kl.data[0], lossG_fake_kl.data[0])
-
+
+        info += 'Rec: {:.4f}, '.format(loss_rec.data)
+        info += 'Kl_E: {:.4f}, {:.4f}, {:.4f}, '.format(lossE_real_kl.data,
+                                lossE_rec_kl.data, lossE_fake_kl.data)
+        info += 'Kl_G: {:.4f}, {:.4f}, '.format(lossG_rec_kl.data, lossG_fake_kl.data)

Sample results after ~1.7 GPU-days:

IntroVAE, 512px anime portrait (n = 4, 3 sets: real datapoints, encoded → decoded versions of the real datapoints, and random generated samples)

By this point, StyleGAN would have been generating recognizable faces from scratch, while the IntroVAE random samples are not even face-like, and the IntroVAE training curve was not improving at a notable rate. IntroVAE has some hyperparameters which could probably be tuned better for the anime portrait faces (they briefly discuss the use of the --num_vae option to run in classic VAE mode to let you tune the VAE-related hyperparameters before enabling the GAN-like part), but it should be fairly insensitive overall to hyperparameters and unlikely to help all that much. So IntroVAE probably can’t replace StyleGAN (yet?) for general-purpose image synthesis. This demonstrates again that it seems like everything works on CelebA these days and just because something works on a photographic dataset does not mean it’ll work on other datasets. Image generation papers should probably branch out some more and consider non-photographic tests.