“Generating Images With Sparse Representations”, Charlie Nash, Jacob Menick, Sander Dieleman, Peter W. Battaglia (2021-03-05):

The high dimensionality of images presents architecture and sampling-efficiency challenges for likelihood-based generative models. Previous approaches such as VQ-VAE use deep autoencoders to obtain compact representations, which are more practical as inputs for likelihood-based models.

We present an alternative approach, inspired by common image compression methods like JPEG: we convert images to quantized discrete cosine transform (DCT) blocks, which are represented sparsely as a sequence of (DCT channel, spatial location, DCT coefficient) triples. We propose a Transformer-based autoregressive architecture, which is trained to sequentially predict the conditional distribution of the next element in such sequences, and which scales effectively to high-resolution images.
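The JPEG-style preprocessing described above can be sketched roughly as follows. This is an illustrative simplification, not the paper's exact pipeline (the paper additionally handles chroma channels, a specific channel ordering, and learned sequence ordering); the function names and the scalar quantization step `q` are assumptions made for the example:

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix: row i is the i-th cosine basis vector.
    k = np.arange(n)
    M = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    M[0] /= np.sqrt(2)
    return M * np.sqrt(2 / n)

def image_to_sparse_triples(img, block=8, q=20):
    """Quantize each 8x8 DCT block of a grayscale image and keep only the
    nonzero coefficients, as (DCT channel, block position, value) triples.
    `q` is a single uniform quantization step (a simplification; JPEG uses
    a per-frequency quantization table)."""
    D = dct_matrix(block)
    h, w = img.shape
    triples = []
    for by in range(0, h, block):
        for bx in range(0, w, block):
            patch = img[by:by + block, bx:bx + block].astype(np.float64) - 128.0
            coeffs = D @ patch @ D.T                # 2-D DCT of the block
            qc = np.round(coeffs / q).astype(int)   # uniform quantization
            pos = (by // block) * (w // block) + bx // block
            for c in range(block * block):          # c = flattened DCT channel
                v = qc.flat[c]
                if v != 0:
                    triples.append((c, pos, int(v)))
    return triples
```

Most quantized coefficients in natural-image blocks are zero, so the triple list is far shorter than the dense pixel grid, which is what makes it practical as an autoregressive Transformer's input sequence.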

On a range of image datasets, we demonstrate that our approach can generate high quality, diverse images, with sample metric scores competitive with state-of-the-art methods.

We additionally show that simple modifications to our method yield effective image colorization and super-resolution models.