“GIVT: Generative Infinite-Vocabulary Transformers”, Michael Tschannen, Cian Eastwood, Fabian Mentzer, 2023-12-04:

We introduce generative infinite-vocabulary transformers (GIVT) which generate vector sequences with real-valued entries, instead of discrete tokens from a finite vocabulary.

To this end, we propose two surprisingly simple modifications to decoder-only transformers: (1) at the input, we replace the finite-vocabulary lookup table with a linear projection of the input vectors; and (2) at the output, we replace the logits prediction (usually mapped to a categorical distribution) with the parameters of a multivariate Gaussian mixture model.
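The two modifications can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the transformer body is omitted entirely, the mixture components are assumed to have diagonal covariance, and every name and size (`embed`, `gmm_params`, `gmm_nll`, `sample`, `d`, `d_model`, `k`) is hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): latent dim, model width, mixture components.
d, d_model, k = 4, 16, 3

# (1) Input side: a linear projection replaces the embedding lookup table.
W_in = rng.normal(scale=0.1, size=(d, d_model))

def embed(x):
    """Map real-valued vectors of shape (seq, d) to (seq, d_model)."""
    return x @ W_in

# (2) Output side: a linear head emits mixture parameters instead of logits.
W_out = rng.normal(scale=0.1, size=(d_model, k + 2 * k * d))

def gmm_params(h):
    """Split head outputs into log-weights, means, and standard deviations."""
    out = h @ W_out
    logits, mu, raw_scale = np.split(out, [k, k + k * d], axis=-1)
    # Log-softmax over the k mixture weights.
    log_pi = logits - np.logaddexp.reduce(logits, axis=-1, keepdims=True)
    mu = mu.reshape(-1, k, d)
    # Softplus keeps the scales positive (diagonal covariance is an assumption).
    sigma = np.logaddexp(0.0, raw_scale).reshape(-1, k, d) + 1e-4
    return log_pi, mu, sigma

def gmm_nll(x, log_pi, mu, sigma):
    """Per-position negative log-likelihood of targets x with shape (seq, d)."""
    z = (x[:, None, :] - mu) / sigma
    log_comp = -0.5 * (z ** 2 + np.log(2 * np.pi)) - np.log(sigma)
    log_mix = log_pi + log_comp.sum(axis=-1)       # shape (seq, k)
    return -np.logaddexp.reduce(log_mix, axis=-1)  # shape (seq,)

def sample(log_pi, mu, sigma):
    """Draw one real-valued vector per position from the predicted mixture."""
    comp = np.array([rng.choice(k, p=np.exp(lp)) for lp in log_pi])
    idx = np.arange(len(comp))
    return mu[idx, comp] + sigma[idx, comp] * rng.normal(size=(len(comp), d))

x = rng.normal(size=(5, d))  # stand-in for a sequence of real-valued latents
h = embed(x)                 # a real model would apply transformer blocks here
log_pi, mu, sigma = gmm_params(h)
nll = gmm_nll(x, log_pi, mu, sigma)
x_next = sample(log_pi, mu, sigma)
```

In this sketch the training loss would be the mean of `nll`, taking the place of the cross-entropy over a finite vocabulary, and generation draws a real-valued vector from the mixture where a discrete model would draw a token index.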

Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, where transformers are used to model the discrete latent sequences of a VQ-VAE, we use GIVT to model the unquantized real-valued latent sequences of a VAE.

When applying GIVT to class-conditional image generation with iterative masked modeling, we obtain results competitive with MaskGIT, while our approach outperforms both VQ-GAN and MaskGIT when used for causal modeling.

Finally, we obtain competitive results outside of image generation when applying our approach to panoptic segmentation and depth estimation with a VAE-based variant of the UViM framework.