Following my StyleGAN anime face experiments, I explore BigGAN, another recent GAN with SOTA results on one of the most complex image domains tackled by GANs so far (ImageNet). BigGAN’s capabilities come at a steep compute cost, however.
Using the unofficial BigGAN-PyTorch reimplementation, I experimented in 2019 with 128px ImageNet transfer learning (successful) with ~6 GPU-days, and with from-scratch 256px anime portraits of 1,000 characters on an 8×2080ti machine (semi-successful). While StyleGAN suffices for many purposes, BigGAN-like approaches may be necessary to scale to whole anime images.
For followup experiments, Shawn Presser, I, and others (collectively, “Tensorfork”) have used TensorFlow Research Cloud TPU credits & the compare_gan BigGAN reimplementation. Running this at scale on the full Danbooru2019 dataset in May 2020, we have reached the best anime GAN results to date (later exceeded by This Anime Does Not Exist).
StyleGAN approaches have proven difficult and possibly inadequate for the ultimate goal, which motivated my evaluation of NNs that have demonstrated the ability to model much harder datasets like ImageNet at large pixel sizes. The primary rival GAN to StyleGAN for large-scale image synthesis as of mid-2019 is BigGAN (Brock et al 2018; official BigGAN-PyTorch implementation & models).
BigGAN successfully trains on up to 512px images from ImageNet, from all 1,000 categories (conditioned on category), with near-photorealistic results on the best-represented categories (dogs), and apparently can even handle the far larger internal Google JFT dataset. In contrast, StyleGAN, while far less computationally demanding, shows poorer results on more complex categories (Karras et al 2018’s LSUN CATS StyleGAN; our whole-Danbooru2018 pilots) and has not been demonstrated to scale to ImageNet, much less beyond.
The downside is that, as the name indicates, BigGAN is both a big model and requires big compute (particularly, big minibatches)—somewhere around $20k in 2019 dollars (~$25,424 inflation-adjusted), we estimate, based on public TPU pricing.
This presents a dilemma: larger-scale portrait modeling or whole-anime image modeling may be beyond StyleGAN’s current capabilities; but while BigGAN may be able to handle those tasks, we can’t afford to train it!
Must it cost that much? Probably not. In particular, BigGAN’s use of a fixed large minibatch throughout training is probably inefficient: it is highly unlikely that the benefits of an n = 2,048 minibatch are necessary at the beginning of training, when the Generator is generating static which looks nothing at all like real data; and at the end of training, that may still be too small a minibatch (Brock et al 2018 note that the benefits of larger minibatches had not saturated at n = 2,048, but time/compute was not available to test still-larger minibatches, which is consistent with the gradient noise scale observation that the harder & more RL-like a problem, the larger the minibatch it needs). Typically, minibatches and/or learning rates are scheduled: imprecise gradients are acceptable early on, while as the model approaches perfection, more exact gradients are necessary. So it should be possible to start out with minibatches a tiny fraction of the size and gradually scale them up during training, saving an enormous amount of compute compared to BigGAN’s reported numbers. The gradient noise scale could be used to automatically set the total minibatch size, although I didn’t find any examples of anyone using it in PyTorch this way. And using TPU pods provides large amounts of VRAM, but is not necessarily the cheapest form of compute.
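As a concrete illustration of the idea (a sketch only: the schedule shape, constants, and stand-in model below are assumptions, not BigGAN-PyTorch code), the effective minibatch can be ramped up during training by increasing the number of gradient-accumulation steps:

```python
import torch
import torch.nn as nn

def accum_schedule(iteration, base_batch=32, start_eff=64, end_eff=2048, total_iters=100_000):
    """Geometrically ramp the effective batch size from start_eff to end_eff."""
    frac = min(iteration / total_iters, 1.0)
    eff = start_eff * (end_eff / start_eff) ** frac
    return max(1, round(eff / base_batch))   # accumulation steps at this iteration

model = nn.Linear(128, 1)                           # stand-in for the GAN's D (or G)
opt = torch.optim.Adam(model.parameters(), lr=2e-4)

for iteration in range(1_000):
    accums = accum_schedule(iteration)              # effective batch = base_batch * accums
    opt.zero_grad()
    for _ in range(accums):
        x = torch.randn(32, 128)                    # stand-in minibatch of size base_batch
        loss = model(x).pow(2).mean() / accums      # divide so gradients average over the whole effective batch
        loss.backward()
    opt.step()
```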
Another optimization is to exploit transfer learning from the released models, and reuse the enormous amount of compute invested in them. The practical details there are fiddly. The original BigGAN 2018 release included the 128px/256px/512px Generator TensorFlow models but not their Discriminators, nor a training codebase. The compare_gan TensorFlow codebase released in early 2019 includes an independent implementation of BigGAN that can potentially train them, and I believe that the Generator may still be usable for transfer learning on its own; if not—given the arguments that Discriminators simply memorize data and do not learn much beyond that—a Discriminator can be trained from scratch by simply freezing a G while training its D on G outputs for as long as necessary (see the sketch below). The 2019 PyTorch release includes a different model, a full 128px model with G/D (at 2 points in its training), and code to convert the original TensorFlow models into PyTorch format; the catch there is that the pretrained model must be loaded into exactly the same architecture, and while the PyTorch codebase defines the architecture for 32/64/128/256px BigGANs, it does not (as of 2019-06-04) define the architecture for a 512px BigGAN or BigGAN-deep (I tried but couldn’t get it quite right). It would also be possible to do model surgery and promote the 128px model to a 512px model, since the two upscaling blocks (128px → 256px and 256px → 512px) should be easy to learn (similar to my use of waifu2x to fake a 1024px StyleGAN anime face model). Anyway, the upshot is that one can only use the 128px/256px pretrained models for now; 512px will become possible with a small update to the PyTorch codebase.
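A sketch of that “recover a Discriminator” idea (the tiny stand-in G/D and the training loop here are illustrative; in practice G would be a released BigGAN Generator and D a fresh BigGAN Discriminator):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

G = nn.Sequential(nn.Linear(120, 3 * 128 * 128), nn.Tanh())    # stand-in for a pretrained G
D = nn.Sequential(nn.Flatten(), nn.Linear(3 * 128 * 128, 1))   # stand-in for a fresh D

for p in G.parameters():
    p.requires_grad_(False)        # freeze G: only D learns

opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.0, 0.999))

for step in range(1_000):
    real = torch.rand(32, 3 * 128 * 128) * 2 - 1   # stand-in for real images in [-1, 1]
    with torch.no_grad():
        fake = G(torch.randn(32, 120))             # samples from the frozen G
    # standard hinge loss for D (as used by BigGAN)
    d_loss = F.relu(1.0 - D(real)).mean() + F.relu(1.0 + D(fake)).mean()
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()
```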
All in all, it is possible that BigGAN with some tweaks could be affordable to train. (At least, with some crowdfunding…)
To test the waters, I ran three BigGAN experiments:
I first experimented with retraining the ImageNet 128px model1.
That resulted in almost total mode collapse when I re-enabled G after 2 days (having frozen it to train a new D); investigating, I realized that I had misunderstood: it was a brand-new BigGAN model, trained independently, and came with its fully-trained D already. Oops.
Partially successful after ~240 GPU-days: it reached comparable quality to StyleGAN before suffering serious mode collapse, possibly due to being forced to run with small minibatch sizes by BigGAN bugs.
Constructing a new Danbooru-1k dataset: as BigGAN requires conditioning information, I constructed new 512px whole-image & portrait datasets by taking the 1,000 most popular Danbooru2018 characters, with characters as categories, and cropping out portraits as usual:
I merged a number of redundant folders by hand2, cleaned as usual, and did further cropping as necessary to reach 1,000 classes. This resulted in 212,359 portrait faces, with the largest class (Hatsune Miku) having 6,624 images and the smallest classes having only 0–1 images. (I don’t know if the class imbalance constitutes a real problem for BigGAN, as ImageNet itself is imbalanced on many levels.)
The data-loading code attempts to make the class index/ID number line up with the folder count, so the nth alphabetical folder (character) should have class ID n, which is important to know for generating conditional samples. The final set of classes/IDs for my Danbooru 1K dataset is as defined by find_classes.
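For reference, a simplified sketch of that class-ID assignment (it mirrors the standard torchvision ImageFolder behavior; the path is illustrative):

```python
import os

def find_classes(root):
    """Folders are sorted alphabetically, so the nth character folder gets class index n."""
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return classes, {cls: idx for idx, cls in enumerate(classes)}

# classes, class_to_idx = find_classes('danbooru1k-portraits/')
```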
(Aside from being potentially useful to stabilize training by providing supervision/metadata, use of classes/categories reduces the need for character-specific transfer learning for specialized StyleGAN models, since you can just generate samples from a specific class. For the 256px model, I provide downloadable samples for each of the 1,000 classes.)
The JPG compression turned out to be too aggressive and resulted in noticeable artifacting, so in early 2020 I regenerated D1k from Danbooru2019 for future projects, creating D1K-2019-512px: a fresh set of top-1k solo character images, s/q Danbooru2019, no JPEG compression.
Merges of overlapping characters were again necessary; the full set of tag merges:
BigGAN requires the dataset metadata to be defined in utils.py, and then, if using HDF5 archives, the dataset must be processed into an HDF5 archive, along with Inception statistics for the periodic testing (although I minimize testing, the preprocessed statistics are still necessary).
HDF5 is not necessary and can be omitted if you prefer to avoid the hassle, since BigGAN-PyTorch can read image folders directly.
utils.py must be edited to add metadata per dataset (there is no CLI option for this), which looks something like this to define a 128px Danbooru-1k portrait dataset:
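A hedged sketch of that edit (the entries go into BigGAN-PyTorch’s per-dataset dictionaries in utils.py; the 'D128' key name, path, and classes-per-sheet value here are illustrative rather than the exact values I used):

```python
# Inside utils.py, which already imports the repo's datasets module as `dset`,
# each dataset gets an entry in several convenience dictionaries:
dset_dict.update(             {'D128': dset.ImageFolder})        # loader for the new dataset
imsize_dict.update(           {'D128': 128})                      # pixel resolution
root_dict.update(             {'D128': 'danbooru1k-portraits'})   # subfolder under --data_root
nclass_dict.update(           {'D128': 1000})                      # 1,000 character classes
classes_per_sheet_dict.update({'D128': 32})                        # classes per sample sheet
```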
The architecture is specified on the command line and must be correct; examples are in the scripts/ directory. In those example scripts, the --num_D_steps...--D_ch architecture block should be left strictly alone, and the key parameters come before/after it. In my case, my 2×1080tis can support a batch size of n = 32 & the gradient-accumulation overhead without OOMing. In addition to that, it’s important to enable EMA, which makes a truly remarkable difference in the generated sample quality (which is interesting because EMA sounds redundant with momentum/learning rates, but isn’t). The big batches of BigGAN are implemented by --batch_size times --num_{G/D}_accumulations; I would need an accumulation of 64 to match n = 2,048. Without EMA, samples are low quality and change drastically at each iteration; but after a certain number of iterations, sampling is done with EMA, which averages the model over iterations offline (but one doesn’t train using the averaged model!3). The averaged samples show that collectively these iterations are similar, ‘orbiting’ around a central point, and the image quality clearly and gradually improves once EMA kicks in.
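A minimal sketch of how such an EMA copy is maintained (the 0.9999 decay is the usual BigGAN default; the rest is illustrative rather than the BigGAN-PyTorch implementation):

```python
import copy
import torch
import torch.nn as nn

def ema_update(G_ema, G, decay=0.9999):
    """Exponential moving average of G's weights; sample from G_ema, never train it."""
    with torch.no_grad():
        for p_ema, p in zip(G_ema.parameters(), G.parameters()):
            p_ema.lerp_(p, 1.0 - decay)   # p_ema = decay*p_ema + (1-decay)*p

G = nn.Linear(16, 16)        # stand-in for the Generator
G_ema = copy.deepcopy(G)     # averaged copy, updated after every G step
ema_update(G_ema, G)
```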
Transfer learning is not supported natively, but a similar trick as with StyleGAN is feasible: just drop the pretrained models into the checkpoint folder and resume (which will work as long as the architecture is identical to the CLI parameters).
The sample sheet functionality can easily overload a GPU and OOM. In utils.py, it may be necessary to simply comment out all of the sampling functionality starting with utils.sample_sheet.
The main problem running BigGAN is odd bugs in BigGAN’s handling of epochs/iterations and changing gradient accumulations. With --use_multiepoch_sampler, it does complicated calculations to try to keep sampling consistent across epochs, with precisely the same ordering of samples regardless of how often the BigGAN job is started/stopped (eg. on a cluster); but as one increases the total minibatch size and it progresses through an epoch, it tries to index data which doesn’t exist and crashes. I was unable to figure out exactly how the calculations were going wrong.4
With that option disabled and larger total minibatches used, a different bug gets triggered, leading to inscrutable crashes:
```
# ...
# ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
# Traceback (most recent call last):
#   File "train.py", line 228, in <module>
#     main()
#   File "train.py", line 225, in main
#     run(config)
#   File "train.py", line 172, in run
#     for i, (x, y) in enumerate(pbar):
#   File "/root/BigGAN-PyTorch-mooch/utils.py", line 842, in progress
#     for n, item in enumerate(items):
#   File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
#     idx, batch = self._get_batch()
#   File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 601, in _get_batch
#     return self.data_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
#   File "/opt/conda/lib/python3.7/queue.py", line 179, in get
#     self.not_empty.wait(remaining)
#   File "/opt/conda/lib/python3.7/threading.py", line 300, in wait
#     gotit = waiter.acquire(True, timeout)
#   File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 274, in handler
#     _error_if_any_worker_fails()
# RuntimeError: DataLoader worker (pid 21103) is killed by signal: Bus error.
```
There is no good workaround here: starting with small fast minibatches compromises final quality, while starting with big slow minibatches may work but then costs far more compute. I did find that the G/D accumulations can be imbalanced to allow increasing the G’s total minibatch (which appears to be the key for better quality) but then this risks destabilizing training. These bugs need to be fixed before trying BigGAN for real.
In any case, I ran the 128px ImageNet → Danbooru2018-1K for ~6 GPU-days (or ~3 days on my 2×1080ti workstation) and the training montage indicates it was working fine:
Training montage of the 128px ImageNet → Danbooru2018-1K; successful
Sometime after that, while continuing to play with imbalanced minibatches to avoid triggering the iteration/crash bugs, it diverged badly and mode-collapsed into static, so I killed the run, as the point appears to have been made: transfer learning is indeed possible, and the speed of the adaptation suggests benefits to training time by starting with a highly-trained model already.
More seriously, I began training a 256px model on Danbooru2018-1K portraits. This required rebuilding the HDF5 with 256px settings, and since I wasn’t doing transfer learning, I used the BigGAN-deep architecture settings since that has better results & is smaller than the original BigGAN.
My own 2×1080tis were inadequate for reasonable turnaround on training a 256px BigGAN from scratch (they would take something like 4+ months wallclock), so I decided to shell out for a big cloud instance. AWS/GCP are too expensive, so I used this as an opportunity to investigate Vast.ai as an alternative: they typically have much lower prices.
Vast.ai setup was straightforward, and I found a nice instance: an 8×2080ti machine available for just $1.70/hour in 2019 dollars (~$2.16 inflation-adjusted); AWS, for comparison, would charge closer to $2.16/hour in 2019 (~$2.75) for just 8 K80 halves. So I ran their 8×2080ti instance from 2019-05-02 to 2019-06-03, for a total of $1,374 in 2019 dollars (~$1,746).
That is ~250 GPU-days of training, although this is a misleading way to put it: the Vast.ai bill includes bandwidth/hard-drive in that total, the GPU utilization was poor (so each ‘GPU-day’ is worth about a third less than with the 128px BigGAN, which had good GPU utilization), and the 2080tis were overkill. It should be possible to do much better with the same budget in the future.
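A rough sanity check of those numbers (approximate, since the bill also covered storage & bandwidth and there was downtime from crashes & setup):

```python
gpus, days, hourly = 8, 32, 1.70      # 8x2080ti, 2019-05-02 to 2019-06-03, $/hour (2019)
print(gpus * days)                    # 256 GPU-days
print(round(days * 24 * hourly))      # ~$1,306 of the ~$1,374 (2019) instance bill
```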
The system worked well, but BigGAN turns out to have serious performance bottlenecks (apparently in synchronizing batchnorm across GPUs) and did not make good use of the 8 GPUs, averaging ~30% GPU utilization according to nvidia-smi. (On my 2×1080tis with the 128px model, GPU utilization was closer to 95%.) In retrospect, I probably should’ve switched to a less expensive instance like an 8×1080ti, where it likely would’ve had similar throughput but cost less.
Training progressed well up until iterations #80–90k, when I began seeing signs of mode collapse:
Training montage of the 256px Danbooru2018-1K; semi-successful (note when EMA begins to be used for sampling images at ~8s, and the mode collapse at the end)
I was unable to increase the minibatch to more than ~500 because of the bugs, limiting what I could do against mode collapse, and I suspect the small minibatch was why mode collapse was happening in the first place. (Gokaslan tried the last checkpoint I saved—#95,160—with the same settings, and ran it to #100,000 iterations and experienced near-total mode collapse.)
The last checkpoint I saved from before mode collapse was #83,520, saved on 2019-05-28 after ~24 wallclock days (accounting for various crashes & time setting up & tweaking).
Random samples, interpolation grids (not videos), and class-conditional samples can be generated using sample.py; like train.py, it requires the exact architecture to be specified, so I invoked it with the full set of architecture & training options (many of them are probably not necessary, but I didn’t know which).
Random samples are already well-represented by the training montage. The interpolations look similar to StyleGAN interpolations. The class-conditional samples are the most fun to look at, because one can look at specific characters without the need to retrain the entire model, which, while only taking a few hours at most, is a hassle.
Interpolation images and 5 character-specific random samples (Asuka, Holo, Rin, Chen, Ruri) for our 256px BigGAN trained on 1,000 characters from Danbooru2018:
Random interpolation samples (256px BigGAN trained on 1,000 Danbooru2018 character portraits)
Souryuu Asuka Langley (Neon Genesis Evangelion), class #825 random samples
Holo (Spice and Wolf), class #273 random samples
Rin Tohsaka (Fate/Stay Night), class #891
Yakumo Chen (Touhou), class #123 random samples
Ruri Hoshino (Martian Successor Nadesico), class #286 random samples
Sarcastic commentary on BigGAN quality by /u/Klockbox
The best results from the 128px BigGAN model look about as good as could be expected from 128px samples; the 256px model is fairly good, but suffers from much more noticeable artifacting than 512px StyleGAN, and cost $1,373 in 2019 dollars (~$1,745); a 256px StyleGAN would have been closer to $400 (~$508) on AWS. In BigGAN’s defense, it had clearly not converged yet and could have benefited from much more training and much larger minibatches, had that been possible. Qualitatively, looking at the more complex elements of samples, like hair ornaments/hats, I feel like BigGAN was doing a much better job of coping with complexity & fine detail than StyleGAN would have at a similar point.
However, training 512px portraits or whole-Danbooru images is infeasible at this point: while the cost might be only a few thousand dollars, the various bugs mean that it may not be possible to stably train to a useful quality. It’s a dilemma: at small or easy domains, StyleGAN is much faster (if not better); but at large or hard domains, mode collapse is too risky and endangers the big investment necessary to surpass StyleGAN.
To make BigGAN viable, it needs at least:
minibatch size bugs fixed to enable up to n = 2,048 (or larger, as gradient noise scale indicates)
512px architectures defined, to allow transfer learning from the released Tensorflow 512px ImageNet model
optimization work to reduce overhead and allow reasonable GPU utilization on >2-GPU systems
With those done, it should be possible to train 512px portraits for <$1k in 2019 dollars (~$1,271) and whole-Danbooru images for <$10k (~$12,712). (Given the release of DeepDanbooru as a TensorFlow model, enabling an anime-specific perceptual loss, it would also be interesting to investigate applying “NoGAN” pretraining to BigGAN.)
Release of a 256px BigGAN model trained on Danbooru2019 & e621. This is a prototype model testing our ability to train a BigGAN stably for hundreds of thousands of iterations on a TPU-256 pod on 3 million+ anime/illustration images. While the generated samples are far from ‘photorealistic’, they serve as proof of concept that—unlike our failed StyleGAN2 scaling experiments—BigGAN can successfully model anime images with great generality, and that we can potentially scale up to 512px or even 1024px and match the DeepMind ImageNet BigGAN for quality.
As part of testing our modifications to compare_gan, including sampling from multiple datasets to increase n, using flood loss to stabilize training, and adding an additional (crude, limited) kind of self-supervised SimCLR loss to the D, we trained several 256px BigGANs, initially on Danbooru2019 SFW but then adding in the TWDNE portraits & e621/e621-portraits partway through training. This destabilized the models greatly, but the flood loss appears to have stopped divergence and they gradually recovered. Run #39 did somewhat better than run #40; the self-supervised variants never recovered. This indicated to us that our self-supervised loss needed heavy revision (as indeed it did) and that flood loss was more valuable than expected, so we investigated it further; the important part appears, for GANs anyway, to be the stop-loss aspect: halting training of G/D when it gets ‘too good’. Freezing models is an old GAN trick which is mostly unused post-WGAN, but appears useful for BigGAN, perhaps because of the spiky loss curve, especially early in training. (The quality of later BigGAN runs was much higher. We only discovered much later in 2020 that the compare_gan default behavior of resizing images to 256px by ‘random cropping’ was seriously destabilizing for both BigGAN & StyleGAN compared to ‘top cropping’, and only in 2021 did Shawn Presser discover what seems to be the fatal flaw of compare_gan’s implementation: an innocent-seeming omission of a ‘+1’ in the gamma parameter of an image processing step.)
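A sketch of the flood-loss/stop-loss trick (the formula is Ishida et al 2020’s flooding; the flood level and usage here are illustrative, not the exact compare_gan modification):

```python
import torch

def flood(loss: torch.Tensor, b: float = 0.3) -> torch.Tensor:
    """Reflect the loss around a floor b: once loss < b, its gradient reverses and
    pushes it back up, effectively halting further improvement ('stop-loss')."""
    return (loss - b).abs() + b

# usage sketch: d_loss = flood(d_hinge_loss); d_loss.backward()
```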
We ran it for 607,250 iterations on a TPUv3-256 pod until 2020-05-15. Config:
ImageNet requires you to sign up & be approved to download from them, but 2 months later I had heard nothing back (and still have not as of January 2021). So I used the data from ILSVRC2012_img_train.tar (MD5: 1d675b47d978889d74fa0da5fadfb00e; 138GB), which I downloaded from the ImageNet LSVRC 2012 Training Set (Object Detection) torrent.
Danbooru can classify the same character under multiple tags: for example, Sailor Moon characters are tagged under their “Sailor X” name for images of their transformed version, and their real names for ‘civilian’ images (eg. ‘Sailor Venus’ or ‘Cure Moonlight’, the former of which I merged with ‘Aino Minako’). Some popular franchises have many variants of each character: the Fate franchise, especially with the success of Fate/Grand Order, is a particular offender, with quite a few variants of characters like Saber.
One would think it would, but I asked Brock and apparently it doesn’t help to occasionally initialize from the EMA snapshots. EMA is a mysterious thing.
As far as I can tell, it has something to do with the dataloader code in utils.py: the calculation of length and the iterator do something weird to adjust for previous training. The net effect is that you can run with a fixed minibatch accumulation and it’ll be fine, and you can reduce the number of accumulations and it’ll simply underrun the dataloader; but if you increase the number of accumulations after training far enough percentage-wise, the length immediately flips over to a negative number and indexing into it becomes completely impossible, leading to crashes. Unfortunately, I only ever want to increase the minibatch accumulation… I tried to fix it but the logic is too convoluted for me to follow.
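Illustrative arithmetic for my reading of that failure mode (not the exact BigGAN-PyTorch code): the sampler’s remaining length is roughly num_epochs × len(dataset) − start_itr × batch_size, so resuming late in training with a larger effective batch size can make it negative:

```python
num_epochs, dataset_len, start_itr = 100, 212_359, 80_000
print(num_epochs * dataset_len - start_itr * 256)   #    755,900  (fine)
print(num_epochs * dataset_len - start_itr * 512)   # -19,724,100 (negative -> indexing crashes)
```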