“Generating Images With Sparse Representations”, 2021-03-05 (; similar):
The high dimensionality of images presents architecture and sampling-efficiency challenges for likelihood-based generative models. Previous approaches such as VQ-VAE use deep autoencoders to obtain compact representations, which are more practical as inputs for likelihood-based models.
We present an alternative approach, inspired by common image compression methods like JPEG, and convert images to quantized discrete cosine transform (DCT) blocks, which are represented sparsely as a sequence of DCT channel, spatial location, and DCT coefficient triples. We propose a Transformer-based autoregressive architecture, which is trained to sequentially predict the conditional distribution of the next element in such sequences, and which scales effectively to high resolution images.
On a range of image datasets, we demonstrate that our approach can generate high quality, diverse images, with sample metric scores competitive with state-of-the-art methods.
We additionally show that simple modifications to our method yield effective image colorization and super-resolution models.