GANs Didn’t Fail, They Were Abandoned
Diffusion models supposedly beat GANs because they scale better and are more stable; that claim is unproven, and false. GANs should be revisited.
Image generation has broken out of machine learning into the mainstream: starting in 2021, the latest autoregressive and diffusion models (like DALL·E, CompVis's latent diffusion, Make-A-Scene, or DALL·E 2) reached unprecedented levels of quality and popularity.
A more detailed history doesn’t help: BigGAN hit in late 2018, scaling up to JFT-300M for the unreleased model, and then… no GANs followed it, even as models from other families were trained on literally billions of images. What happened? Why did GANs appear to die overnight?
Everyone seems to assume it happened for good reasons, but has trouble saying exactly why. It wasn't because the other kinds of models were strictly superior: in several ways, GANs are still much better than diffusion models. The most common narrative blames GAN weaknesses, namely a failure to scale and instability, but neither seems to be true: GANs do not scale conspicuously worse, and scaling actually fixes the instability issues.
The simplest answer seems to be historical contingency: the people interested in GAN scaling moved on or ran into trivial roadblocks and so never scaled up GANs; those who later did want to scale were interested in other kinds of models; and scientific fields go through their own fashion cycles.
So, probably generative modeling researchers should revisit GANs. They are a useful alternative paradigm, have advantages the other paradigms don’t, and their success or failure would be valuable knowledge.
GAN Advantages
- fast sampling (see the sketch after this list):
  - 1 forward pass, not n passes (few FLOPs, low latency, realtime often easily achievable)
  - minimal VRAM use, due to upsampling rather than U-Net architecture
- disentangled, interpretable, editable latent space:
  - edits, controls, easy latent-space walks
- simple to sample from: you just control z to sample, rather than a score of exotic hyperparameters (multiple ODE optimizers, step/noise schedules, classifier guidance…)
- precision rather than recall bias: for most applications, it is more useful to have high-quality samples from most of the distribution than low-quality samples from almost all the distribution.
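To make the fast-sampling point concrete, here is a minimal sketch (PyTorch; the generator and denoiser are hypothetical stand-ins, not any particular model) contrasting the two procedures: a GAN produces an image in a single forward pass controlled only by z, while a diffusion model loops over n denoising steps governed by a step/noise schedule, so its FLOPs and latency scale with the step count.

```python
import torch
import torch.nn as nn

Z_DIM = 512
G = nn.Sequential(nn.Linear(Z_DIM, 3 * 64 * 64), nn.Tanh())  # stand-in generator

# GAN sampling: one forward pass; the only knob is z itself.
z = torch.randn(16, Z_DIM)
with torch.no_grad():
    gan_images = G(z).view(16, 3, 64, 64)

# Diffusion sampling (schematic DDPM-style loop, not a real sampler): n sequential
# network evaluations, plus a step/noise schedule and guidance scale to choose.
def diffusion_sample(denoiser, steps=250, shape=(16, 3, 64, 64)):
    x = torch.randn(shape)           # start from pure noise
    for t in reversed(range(steps)):  # n forward passes, one per step
        with torch.no_grad():
            eps = denoiser(x, t)      # predict the noise at step t
        x = x - eps / steps           # schematic update rule
    return x
```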
Diffusion Advantages
Diffusion Non-Advantages
Scaling
One suggestion for why GANs were abandoned is that perhaps someone showed they scaled better; eg. Grigorii Sotnikov:
I am convinced that autoregressive and diffusion models just scale better than GANs (w.r.t. compute 📈).
This may be true, but there is little evidence for it, because there are only a few compute comparisons at equivalent quality metrics, and those are not good. In particular, there is only one modest scaling law for any diffusion model (2021, for an obsolete DDPM inferior to BigGAN), only toy scaling laws for autoregressive models (et al 2020: GPT-3 on downscaled 32×32px images), and no scaling laws for any GAN.
GAN Instability
The most fundamental reason people give for abandoning GANs is the claim that GANs are unstable: if they are near-impossible to train on small datasets, then surely it is de facto impossible to train them on large contemporary datasets like LAION-5B. GANs might be a nice idea, but if you can't train them on meaningful datasets, then no wonder they were abandoned to niche uses.
Two typical expressions of this DL researcher sentiment: the first is Tom Goldstein, answering "Why have diffusion models displaced GANs so quickly?" It is because:
GAN training requires solving a saddle-point/minimax problem, which is quite unstable. Simple tricks can help stabilize this, but the problem remains… In fact, GANs were proposed by Goodfellow in 2014, but it took 3 years and countless gallons of grad student tears before stable routines for ImageNet appeared in 2017. Training on the very large and diverse DALL·E dataset would have been extremely challenging.[3]
And Dhariwal & Nichol 2021, giving background on why to train diffusion models rather than GANs (referencing IAN in 2016, and BigGAN & SN-GAN in 2018[4]):
Furthermore, GANs are often difficult to train, collapsing without carefully selected hyperparameters and regularizers[5, 41, 4]. While GANs hold the state-of-the-art, their drawbacks make them difficult to scale and apply to new domains.
This belief is puzzling because… it’s not true?
First, the references given seem oddly inadequate for such a major finding. Surely, if GANs had been definitively shown to be a dead end, after having been a major research topic for so many years, there would be a thorough demonstration of this? Why in late 2022 would one still be citing papers from 6 years ago (literally using MNIST)?[5]
Second, the most relevant GAN stability paper, BigGAN, shows the opposite: it shows that, aside from the instability being tame enough to get good results fairly reliably on 512px ImageNet-1M images[6], BigGAN gains radically in stability as the dataset scales in size & diversity from ImageNet-1M to JFT-300M. This stability in the ≫1M regime has been replicated at least once, by Tensorfork.
I have additionally asked around on Reddit, Twitter, and among various DL researchers whether anyone knows of any additional published or unpublished GAN runs at similar scales, especially GAN runs which diverged or otherwise failed; I received no examples. (I also asked Brock whether the lack of BigGAN followups besides EvoNorm was because BigGAN was having any problems; it was not, and his research interests simply shifted to more fundamental architecture work like NFNets.)
It seems improbable that GANs would be highly intrinsically unstable at small scales like ImageNet 𝒪(1M), highly intrinsically stable at 𝒪(100M), and then revert to being hopelessly unstable at 𝒪(1,000M) and beyond. So my conclusion is: the evidence that GANs are fundamentally unstable is largely irrelevant as it is mostly based on obsolete architectures over half a decade old used to train models 3 orders of magnitude smaller on datasets 5 orders of magnitude smaller, and the relevant evidence shows increasing stability with scale.
BigGAN JFT-300M
Indeed, Brock reports that he did not observe any divergence when training BigGAN (with larger model capacity) on JFT-300M, without any tuning or architecture changes. If GANs were fundamentally unstable, and it were "extremely challenging" to train on any "large and diverse dataset", then this would be impossible: multiplying the dataset's size and variety 300×, along with the model size, should have led to near-instantaneous divergence.
This is not mentioned by those citing the BigGAN paper as showing the instability of all GAN training, which is surprising, because I thought it was the most exciting finding of the whole paper, more important than the then-SOTA FIDs: it emphasized GANs' links to reinforcement learning by indicating that GAN instability, like reinforcement-learning instability or CNN classifier errors or language-model stupidity, is merely an artifact of training too-small models on too-small datasets (a flaw because it permits memorization), and that many problems simply go away with scale, enjoying its blessings. (In DL, the lesson of the past decade has been that the easy problems are hard, and the hard problems are easy; so if you are having trouble solving a problem, you should go find a harder problem instead.)
1. GANs are typically unconditional or, at most, category-conditional, because most GANs were created before CLIP text embeddings became universal. There is no reason GANs can't be made text-conditional.
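As a minimal sketch of what text-conditioning could look like (PyTorch; all module sizes and names here are hypothetical, not taken from any existing GAN codebase): the generator simply receives a frozen text encoder's embedding, e.g. concatenated with z, and the discriminator can be given the same embedding.

```python
import torch
import torch.nn as nn

Z_DIM, TXT_DIM = 512, 512       # e.g. CLIP ViT-B/32 text embeddings are 512-d

class TextConditionalG(nn.Module):
    """Toy stand-in for a real upsampling generator, conditioned by concatenation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(Z_DIM + TXT_DIM, 1024), nn.ReLU(),
                                 nn.Linear(1024, 3 * 64 * 64), nn.Tanh())

    def forward(self, z, txt_emb):
        h = torch.cat([z, txt_emb], dim=1)      # text conditioning via concatenation
        return self.net(h).view(-1, 3, 64, 64)

G = TextConditionalG()
z = torch.randn(4, Z_DIM)
txt_emb = torch.randn(4, TXT_DIM)               # stand-in for a frozen CLIP text encoder's output
fake = G(z, txt_emb)                            # (4, 3, 64, 64)
```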
2. Not yet implemented for GANs that I know of. Classifier guidance could probably be incorporated easily. One approach that comes to mind is to include a single additional variable in [0,1] which represents the ratio of the true embedding mixed with a random 'noise' embedding drawn from 𝒩(0,1): at 0, the embedding is completely uninformative & the GAN learns to ignore it; at 1, it is simply the true embedding. Then, at sampling time, one can choose the 'strength' in [0,1] (see the sketch below).
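A minimal sketch of that mixing idea (PyTorch; the function names and shapes are mine, purely illustrative): during training the strength s is drawn uniformly at random and fed to the GAN along with the blended embedding, so it learns how much the conditioning can be trusted at each s; at sampling time the user sets s directly as a guidance knob.

```python
import torch

def mix_embedding(txt_emb, s):
    """Blend the true embedding with N(0,1) noise: s=0 is uninformative, s=1 is the true embedding."""
    noise_emb = torch.randn_like(txt_emb)
    return s * txt_emb + (1.0 - s) * noise_emb

def training_inputs(txt_emb, z):
    # random per-sample strength, so the GAN sees the whole 0-1 range during training
    s = torch.rand(txt_emb.shape[0], 1)
    cond = torch.cat([mix_embedding(txt_emb, s), s], dim=1)   # blended embedding + the strength itself
    return z, cond

def sampling_inputs(txt_emb, z, s=0.8):
    # at sampling time, the strength is chosen explicitly
    s_col = torch.full((txt_emb.shape[0], 1), s)
    cond = torch.cat([mix_embedding(txt_emb, s_col), s_col], dim=1)
    return z, cond

# usage: cond would then be fed to a conditional generator, e.g. G(z, cond)
z, txt_emb = torch.randn(4, 512), torch.randn(4, 512)
_, cond = sampling_inputs(txt_emb, z, s=0.8)                  # cond: (4, 513)
```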
3. One could remark that, given the extensive material in the DALL·E paper about the challenges of reduced-precision training, it involved tears & was challenging too.
4. Another popular reference is Figure 4 of "Are GANs Created Equal? A Large-Scale Study", Lucic et al 2017 (MNIST/Fashion-MNIST/CIFAR-10/CelebA).
5. Another example of using weak baselines is et al 2022, which compares GANs from 2017 with the latest diffusion models trained on 40GB GPUs; unsurprisingly, the GANs are worse.
6. The instability being no worse than what is dealt with routinely by, say, researchers trying to get mixed-precision training to work with DALL·E 1 or diffusion models, or getting language models to work at all.
7. "Neural nets want to work", even if they start out being effectively multiplied by zero. We figured it out far too late (after we had disbanded), only by Presser compiling the raw TensorFlow graph to compare with the DM checkpoint (reasoning that they ought to be identical if the compare_gan reimplementation were exactly correct), and then stepping through operation by operation, until he finally saw that the batchnorm gamma was initialized to 0 in compare_gan but not in the original raw checkpoint. In retrospect, if we had had the guts to try to 'finetune' the original checkpoint (which we couldn't do directly, because DM withheld the Discriminator checkpoint specifically to prevent any further training) by simply training a Discriminator 'from scratch' (ie. freezing the G for an epoch to let the D 'catch up', since Discriminators appear to function mostly by memorization), starting our training from there, and doing transfer learning for all subsequent projects as we had planned, we would probably never have encountered this bug in the first place!
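For illustration (PyTorch here, although compare_gan itself was TensorFlow; this is a sketch of the failure mode, not the actual code): a batchnorm layer computes gamma * x_hat + beta, so initializing gamma to 0 (with the default beta = 0) zeroes every activation passing through it, which is the sense in which the network "starts out multiplied by zero"; yet gradients still flow into gamma, so training can slowly recover.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 64, 16, 16)

bn_default = nn.BatchNorm2d(64)            # PyTorch default: gamma (weight) initialized to 1
bn_zeroed = nn.BatchNorm2d(64)
nn.init.zeros_(bn_zeroed.weight)           # the compare_gan-style gamma = 0 initialization

print(bn_default(x).abs().mean())          # normal-looking activations
print(bn_zeroed(x).abs().mean())           # 0: everything downstream sees zeros at initialization
```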
8. YFCC100M had become well-known by 2022, but in 2019 I had barely heard of it when Aaron Gokaslan contributed his copy—which he had mostly because he's a data-hoarder and was abusing an unlimited Google Drive account. YFCC100M was too far ahead of its time.