GANs Didn’t Fail, They Were Abandoned

Gwern Branwen

Diffusion models supposedly beat GANs because they scale better and train more stably. That is unproven, and false. GANs should be revisited.

by: Gwern Branwen; 2022-10-04–2022-10-15; notes; certainty: possible; importance: 4

GAN Advantages
Diffusion Advantages
Diffusion Non-Advantages

Image generation has broken out of machine learning into the mainstream: starting in 2021, the latest autoregressive and diffusion models (like DALL·E, CompVis, Make-A-Scene, or DALL·E 2/Imagen/Parti/Stable Diffusion) reached new heights of photorealism & artistic versatility. This would have surprised anyone who had not been following progress since 2018, when BigGAN astounded people with its 512px ImageNet samples. Where are the GANs? How can autoregressive models generate such images? Heck, what even is a diffusion model, which was only introduced in 2020?

A more detailed history doesn’t help: BigGAN hit in late 2018, scaling up to JFT-300M for the unreleased model, and then… no GANs followed it, even as models from other families were trained on literally billions of images. What happened? Why did GANs appear to die overnight?

Everyone seems to assume it happened for good reasons, but has trouble saying exactly why. It wasn’t because the other kinds of models were strictly superior: in several ways, GANs are still much better than diffusion models. The most common narrative blames GAN weaknesses, namely failure to scale and instability, but neither seems to be true: GANs do not scale conspicuously worse, and scaling actually fixes the instability issues.

The simplest answer seems to just be historical contingency: the people who were interested in GAN scaling moved on or ran into trivial roadblocks so they didn’t scale up GANs, those who later did want to scale were interested in other kinds of models, and scientific fields go through their own fashion cycles.

So, probably generative modeling researchers should revisit GANs. They are a useful alternative paradigm, have advantages the other paradigms don’t, and their success or failure would be valuable knowledge.

GAN Advantages

fast sampling:
- 1 forward pass, not n passes (few FLOPs, low latency, realtime often easily achievable)
- minimal VRAM use, due to upsampling rather than U-Net architecture
disentangled, interpretable, editable latent space
- edits (eg), controls, easy latent space walk
simple to sample from: you just control z to sample, rather than a score of exotic hyperparameters, multiple ODE optimizers, step/noise schedule, classifier guidance…
precision rather than recall bias

For most applications, it is more useful to have high-quality samples from most of the distribution than low-quality samples from almost all the distribution.
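The "1 forward pass, not n passes" point can be made concrete with a toy sketch. Everything here is hypothetical: `CountingNet` is a stand-in that only counts calls, not a real generator or denoiser, and `n_steps=50` is an arbitrary illustrative value rather than a measured setting.

```python
import random


class CountingNet:
    """Toy stand-in for a neural network; it only counts forward passes."""

    def __init__(self, scale):
        self.scale = scale
        self.forward_passes = 0

    def __call__(self, x):
        self.forward_passes += 1
        return [self.scale * v for v in x]


def gan_sample(generator, latent_dim, rng):
    # A GAN sample is one forward pass G(z) on a random latent z.
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return generator(z)


def iterative_sample(denoiser, dim, n_steps, rng):
    # An iterative sampler starts from noise and calls the network once per step.
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(n_steps):
        x = denoiser(x)
    return x


rng = random.Random(0)
generator = CountingNet(scale=0.5)
denoiser = CountingNet(scale=0.9)

gan_sample(generator, latent_dim=8, rng=rng)
iterative_sample(denoiser, dim=8, n_steps=50, rng=rng)

print(generator.forward_passes, denoiser.forward_passes)
```

For networks of similar size, sampling cost and latency scale roughly with the number of passes, which is why the single-pass sampler is the cheaper one here.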

Diffusion Advantages

better quality (at present)
usually text conditional¹
- new tricks like classifier guidance²
much stabler training—supposedly
you can write a lot of papers about them
denoising videos are fun to watch
?

Diffusion Non-Advantages

Scaling

https://openai.com/research/how-ai-training-scales

One suggestion for why GANs were abandoned is that perhaps someone showed that other model families scale better; eg. Grigorii Sotnikov:

I am convinced that autoregressive and diffusion models just scale better than GANs (w.r.t. compute 📈).

This may be true, but there is little evidence for it, because there are only a few compute comparisons at equivalent quality metrics, and those are not good. In particular, there is only one modest scaling law for any diffusion model (Nichol & Dhariwal 2021, for an obsolete DDPM inferior to BigGAN), only toy scaling laws for autoregressive models (Henighan et al 2020: GPT-3 on downscaled 32×32px images), and no scaling laws for any GAN.
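To show what a compute comparison "at equivalent quality" would involve, here is a minimal sketch of fitting a power law to (compute, metric) points and inverting it. The data are synthetic, generated from made-up constants purely to exercise the code; `family_a`, `family_b`, and the functional form `metric = a * compute**(-b)` are assumptions for illustration, not results for any real model family.

```python
import math


def fit_power_law(compute, metric):
    """Least-squares fit of metric = a * compute**(-b) in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(m) for m in metric]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def compute_to_reach(target, a, b):
    # Invert metric = a * C**(-b) to get the compute C needed for a target metric.
    return (a / target) ** (1.0 / b)


# Synthetic, noise-free points from made-up constants (lower metric = better).
compute = [1e15, 1e16, 1e17, 1e18]
family_a = [5.0 * c ** -0.05 for c in compute]
family_b = [8.0 * c ** -0.06 for c in compute]

a1, b1 = fit_power_law(compute, family_a)
a2, b2 = fit_power_law(compute, family_b)

target = 0.6
print(compute_to_reach(target, a1, b1), compute_to_reach(target, a2, b2))
```

A claim that one family "scales better" would need fitted curves like these for both families, on the same metric and dataset, which is the evidence the paragraph above says is missing for GANs.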

GAN Instability

The most fundamental reason people give for abandoning GANs is the claim that GANs are unstable; if they are near-impossible to train on small datasets, then surely someone has proven it to be de facto impossible to train on large contemporary datasets like LAION-4B. GANs might be a nice idea, but if you can’t train them on meaningful datasets, then no wonder they were abandoned to niche uses.

Two typical expressions of this sentiment among DL researchers come from Tom Goldstein, explaining “Why have diffusion models displaced GANs so quickly?” It is because:

GAN training requires solving a saddle-point/minimax problem, which is quite unstable. Simple tricks can help stabilize this, but the problem remains…In fact, GANs were proposed by Goodfellow in 2014, but it took 3 years and countless gallons of grad student tears before stable routines for ImageNet appeared in 2017. Training on the very large and diverse DALL·E dataset would have been extremely challenging.³

And Dhariwal & Nichol 2021 giving background on why to train diffusion rather than GANs (referencing IAN in 2016, and BigGAN & SN-GAN in 2018⁴):

Furthermore, GANs are often difficult to train, collapsing without carefully selected hyperparameters and regularizers [5, 41, 4]. While GANs hold the state-of-the-art, their drawbacks make them difficult to scale and apply to new domains.

This belief is puzzling because… it’s not true?

First, the references given seem oddly inadequate for such a major finding. Surely, if GANs had been definitively disproven as a dead end, after having been a major research topic for so many years, there would be a thorough demonstration of this? Why in late 2022 would one still be citing papers from 6 years ago (literally using MNIST)?⁵

Second, the most relevant GAN stability paper, BigGAN, shows the opposite: aside from instability being tame enough to get good results fairly reliably on 512px ImageNet-1M images⁶, BigGAN gains radically in stability as the dataset scales in size & diversity from ImageNet-1M to JFT-300M. This stability in the ≫1M regime has been replicated at least once, by Tensorfork.

I have additionally asked around on Reddit and Twitter, and asked various DL researchers, whether anyone knows of any additional published or unpublished GAN runs on similar scales, especially GAN runs which diverged or otherwise failed; I received no examples. (I also asked Brock if the lack of BigGAN followups besides EvoNorm was because BigGAN was having any problems; it was not, and his research interest simply changed to more fundamental architecture work like NFNets.)

It seems improbable that GANs would be highly intrinsically unstable at small scales like ImageNet 𝒪(1M), highly intrinsically stable at 𝒪(100M), and then revert to being hopelessly unstable at 𝒪(1,000M) and beyond. So my conclusion is: the evidence that GANs are fundamentally unstable is largely irrelevant as it is mostly based on obsolete architectures over half a decade old used to train models 3 orders of magnitude smaller on datasets 5 orders of magnitude smaller, and the relevant evidence shows increasing stability with scale.

BigGAN JFT-300M

Indeed, Brock reports that he did not observe any divergence when training BigGAN (with larger model capacity) on JFT-300M without any tuning or architecture changes. If GANs were fundamentally unstable, and it would be “extremely challenging” to train on any “large and diverse dataset”, then this would be impossible: multiplying the size and variety 300×, and increasing the model size, should have led to near-instantaneous divergence.

This is not mentioned by those citing the BigGAN paper as showing the instability of all GAN training, which is surprising because I thought it was the most exciting finding of the whole paper, and more important than getting then-SOTA FIDs: it emphasized GANs’ links to reinforcement learning by indicating that GAN instability, like reinforcement learning instability or CNN classifier errors or language model stupidity, is merely an artifact of training too-small models on too-small datasets (a flaw because it allows memorization), and that many problems just go away, enjoying the blessings of scale. (In DL, the lesson of the past decade has been that the easy problems are hard, and the hard problems are easy; so if you are having trouble solving a problem, you should go find a harder problem instead.)

Tensorfork Chaos Runs

The BigGAN JFT-300M result is not the only such result, because in 2020, Tensorfork replicated the stability of BigGAN scaling on a dataset of n > 100M.

Tensorfork was an ad hoc collaboration of myself, Shawn Presser, and other individuals, who had been using TFRC TPU pod time to train GPT-2 & StyleGAN 2 & BigGAN models (producing a number of things like ThisAnimeDoesNotExist). Our primary goal was to train & release 512px BigGAN models on not just ImageNet but all the other datasets we had, like anime datasets. The compare_gan BigGAN implementation turned out to have a subtle +1 gamma bug, which stopped us from reaching comparable results when training from scratch, but which did not show up when sampling from a pretrained model like the official DeepMind ImageNet BigGAN G checkpoint. While we beat our heads against the wall trying to figure out why it was working but not well enough,⁷ we experimented with other things, including what we called “chaos runs”.
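As a rough illustration of why a batchnorm scale ("gamma") convention can matter, consider the toy sketch below. It is not compare_gan or DeepMind code: the function `apply_scale_shift` and the `plus_one` flag are hypothetical, and the exact mechanism of the real bug is an assumption here. The source only says that gamma was initialized to 0 in one implementation but not in the original checkpoint.

```python
def apply_scale_shift(normalized, gamma, beta, plus_one=False):
    """Apply a batchnorm-style learned scale and shift to normalized activations.

    plus_one=True treats the stored gamma as an offset from 1 (scale = 1 + gamma);
    plus_one=False uses the stored gamma directly as the scale.
    """
    scale = 1.0 + gamma if plus_one else gamma
    return [scale * v + beta for v in normalized]


normalized = [-1.0, 0.0, 1.0]

# Stored gamma of 0 used directly: every activation is multiplied by zero.
zeroed = apply_scale_shift(normalized, gamma=0.0, beta=0.0)

# Same stored value read as an offset from 1: the layer starts as the identity.
passed = apply_scale_shift(normalized, gamma=0.0, beta=0.0, plus_one=True)

print(zeroed)
print(passed)
```

Under this assumed reading, the same stored number means "block the signal" in one convention and "pass it through" in the other, which is the kind of mismatch that can still train, only worse.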

We had assembled a variety of datasets (ImageNet, FFHQ, Danbooru2019, Portraits/Figures/PALM, YFCC100M⁸), collectively constituting ~110M images; the meta-dataset option was named “chaos”. The conventional wisdom was that you would have to train separate models on each one, and that they would be unstable. I had read the BigGAN paper and believed otherwise. So, we simply ran unconditional StyleGAN 2 & BigGAN training runs on all the datasets combined. And… the chaos runs worked.

StyleGAN 2, unsurprisingly, trained stably without a problem but was severely underfit, so the chaos results looked mostly like suggestive blobs. StyleGAN 2 trains stably and rarely diverges, but the cost of its extreme regularization which makes it so good at datasets like faces is that it underfits any kind of multiple-object dataset like ImageNet or Danbooru2019. To get the results we did for ThisAnimeDoesNotExist, Aydao had to extensively modify it to increase model capacity & disable the many forms of regularization built into it.

BigGAN likewise trained stably, getting steadily better. The quality was still not great, because it suffered from the gamma bug & we were far from convergence, but the important thing was we observed the same thing Brock did with JFT-300M: the feared instability just never happened.

¹ GANs are typically unconditional or, at most, category-conditional, because most GANs were created before CLIP text embeddings became universal. There is no reason GANs can’t be made text-conditional.
² Not yet implemented for GANs that I know of. Classifier guidance can probably be easily incorporated. One approach that comes to mind is to include a single additional variable 0–1, which represents the ratio of the true embedding added to a random ‘noise’ embedding like 𝒩(0,1): at 0, the embedding is completely uninformative & the GAN learns to ignore it; at 1, it is simply the true embedding. Then for sampling, one can choose the ‘strength’ 0–1.
³ One could remark, given the extensive material in the DALL·E paper about challenges with reduced-precision, that it involved tears & was challenging too.
⁴ Another popular reference is Figure 4 of “Are GANs Created Equal? A Large-Scale Study”, Lucic et al 2017 (MNIST/Fashion-MNIST/CIFAR-10/CelebA).
⁵ Another example of using weak baselines would be Pinaya et al 2022, which compares GANs from 2017 with the latest diffusion models trained on 40GB GPUs; unsurprisingly, the GANs are worse.
⁶ The instability is no worse than what is dealt with routinely by, say, researchers trying to get mixed-precision to work with DALL·E 1 or diffusion models, or getting language models to work.
⁷ “Neural nets want to work”, even if they start out being effectively multiplied by zero. We figured it out far too late (after we had disbanded), only by Presser compiling the raw Tensorflow graph to compare with the DM checkpoint (reasoning that they ought to be identical if the compare_gan reimplementation was exactly correct), and then stepping operation by operation, until he finally saw that the gamma batchnorm was initialized to 0 in compare_gan but not the original raw checkpoint. In retrospect, if we had had the guts to try to ‘finetune’ the original checkpoint (which we couldn’t do because DM withheld the Discriminator checkpoint specifically to prevent any training) by simply training a Discriminator ‘from scratch’ (ie. freezing the G for an epoch to let the D ‘catch up’ since Discriminators appear to function mostly by memorization) and starting our training from there and doing transfer learning for all subsequent projects as we planned, we would probably never have experienced this bug in the first place!
⁸ YFCC100M has become well-known by 2022, but in 2019 I had barely heard of it when Aaron Gokaslan contributed his copy, which he had mostly because he’s a data-hoarder and was abusing an unlimited Google Drive account. YFCC100M was too far ahead of its time.
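The conditioning-strength idea from the footnote on classifier guidance can be sketched as follows. This is only one possible reading: the linear interpolation, the function name `mix_embedding`, and appending the strength as an extra input are all assumptions, since the footnote describes the idea only in outline and notes it has not been implemented for GANs.

```python
import random


def mix_embedding(true_embedding, strength, rng):
    """Blend a conditioning embedding with N(0,1) noise by a strength in [0, 1].

    Returns the blended embedding with the strength appended, so a generator
    could learn how much to trust the conditioning signal.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    noise = [rng.gauss(0.0, 1.0) for _ in true_embedding]
    mixed = [
        strength * t + (1.0 - strength) * n
        for t, n in zip(true_embedding, noise)
    ]
    return mixed + [strength]


embedding = [0.2, -1.3, 0.7]

# Training would draw a random strength per example; sampling picks one directly.
fully_conditioned = mix_embedding(embedding, 1.0, random.Random(1))
unconditioned = mix_embedding(embedding, 0.0, random.Random(1))

print(fully_conditioned)
print(unconditioned)
```

At strength 1 the generator sees the true embedding; at strength 0 it sees pure noise that carries no information about the embedding, matching the two endpoints described in the footnote.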
