"MaskBit: Embedding-Free Image Generation via Bit Tokens", 2024-09-24:
Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models. Typically comprising two stages (an initial VQGAN model for transitioning between latent space and image space, and a subsequent Transformer model for image generation within latent space), these frameworks offer promising avenues for image synthesis.
In this study, we present two primary contributions: first, an empirical and systematic examination of VQGANs, leading to a modernized VQGAN; second, a novel embedding-free generation network operating directly on bit tokens, a binary quantized representation of tokens with rich semantics.
The first contribution furnishes a transparent, reproducible, and high-performing VQGAN model, enhancing accessibility and matching the performance of current state-of-the-art methods while revealing previously undisclosed details.
The second contribution demonstrates that embedding-free image generation using bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256×256 benchmark, with a compact generator model of a mere 0.3B parameters.
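To make the idea of a bit token concrete, the following is a minimal sketch of one way a binary quantized token could be formed and consumed. It assumes sign-based binarization of a K-channel latent vector (in the spirit of lookup-free quantization); the abstract does not specify the quantizer, and the function names `to_bit_token`, `bits_to_index`, and `bits_to_input` are hypothetical, not taken from the paper's code.

```python
from typing import List


def to_bit_token(latent: List[float]) -> List[int]:
    """Binarize a K-channel latent vector by sign (assumed scheme): 1 if v > 0 else 0."""
    return [1 if v > 0 else 0 for v in latent]


def bits_to_index(bits: List[int]) -> int:
    """Pack the bits (most significant first) into an integer token index."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def bits_to_input(bits: List[int]) -> List[float]:
    """Map bits {0, 1} to {-1.0, +1.0} so the bits themselves serve as the
    generator input, with no learned embedding table (illustrative assumption)."""
    return [2.0 * b - 1.0 for b in bits]


latent = [0.7, -1.2, 0.1, -0.3]   # toy 4-channel latent at one spatial position
bits = to_bit_token(latent)       # [1, 0, 1, 0]
index = bits_to_index(bits)       # 10, one of 2**4 possible tokens
features = bits_to_input(bits)    # [1.0, -1.0, 1.0, -1.0]
```

Under this assumption, K bits implicitly define a codebook of 2^K entries without storing any codebook vectors, and the generator sees the bit pattern directly instead of looking up an embedding by index.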