“Muse: Text-To-Image Generation via Masked Generative Transformers”, 2023-01-02:
[Twitter; cf. Paella, MaskGIT followup; video] We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being more efficient than diffusion or autoregressive models.
Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL·E 2, Muse is more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding. The use of a pre-trained LLM [T5] enables fine-grained language understanding [ie. it can do text inside images], translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality etc.
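The masked-modeling objective can be sketched in a few lines. A minimal, unconditioned sketch (assuming a flat sequence of VQ token ids with a `MASK_ID` one past the codebook; the actual model samples the mask rate from a cosine-based distribution, conditions on the T5 text embedding, and takes cross-entropy loss only on masked positions):

```python
import numpy as np

MASK_ID = 8192  # hypothetical: one id past an 8,192-entry VQ codebook

def mask_for_training(image_tokens, rng):
    """Randomly mask a variable fraction of the image tokens; the training
    loss is then cross-entropy on the masked positions only (the text
    conditioning is omitted in this sketch)."""
    n = image_tokens.shape[0]
    # Variable mask rate via a cosine schedule (an assumption of this
    # sketch; the paper samples the rate from a cosine-based distribution).
    rate = np.cos(0.5 * np.pi * rng.uniform())
    idx = rng.choice(n, size=max(1, int(rate * n)), replace=False)
    corrupted = image_tokens.copy()
    corrupted[idx] = MASK_ID
    return corrupted, idx  # idx marks the positions the loss is taken over
```

Because the mask rate varies per example, the model learns to predict tokens at every corruption level, which is what later makes iterative parallel decoding possible.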
…We train on the Imagen dataset consisting of 460M text-image pairs (Saharia et al 2022). Training is performed for 1M steps, with a batch size of 512 on 512-core TPU-v4 chips ( et al 2020). This takes about 1 week of training time.
…Decoding is performed based on a cosine schedule (MaskGIT) that chooses a certain fixed fraction of the highest confidence masked tokens that are to be predicted at that step. These tokens are then set to unmasked for the remainder of the steps and the set of masked tokens is appropriately reduced. Using this procedure, we are able to perform inference of 256 tokens using only 24 decoding steps in our base model and 4,096 tokens using 8 decoding steps in our super-resolution model, as compared to the 256 or 4,096 steps required for autoregressive models (eg. Parti) and hundreds of steps for diffusion models (eg. et al 2022; Imagen).
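A minimal sketch of that parallel-decoding loop (greedy and unconditioned, with a stand-in `logits_fn`; the actual model samples with temperature and conditions on the text embedding):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def parallel_decode(logits_fn, seq_len, mask_id, steps):
    """At each step, predict all masked tokens in parallel, commit the
    highest-confidence predictions, and re-mask the rest; the cosine
    schedule fixes how many tokens remain masked after each step."""
    tokens = np.full(seq_len, mask_id, dtype=np.int64)
    for step in range(1, steps + 1):
        probs = softmax(logits_fn(tokens))   # (seq_len, vocab)
        pred = probs.argmax(axis=-1)
        conf = probs[np.arange(seq_len), pred]
        conf[tokens != mask_id] = np.inf     # committed tokens never re-masked
        tokens = np.where(tokens == mask_id, pred, tokens)
        # Cosine schedule: fraction of tokens still masked after this step.
        n_masked = int(seq_len * np.cos(np.pi / 2 * step / steps))
        if n_masked > 0:
            # Re-mask the least-confident predictions for the next pass.
            tokens[np.argsort(conf)[:n_masked]] = mask_id
    return tokens
```

At the final step the cosine term reaches zero, so every position is committed; with `steps=24` and `seq_len=256` this matches the base-model regime described above.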
…Our 900M-parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse-3B model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing.
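Those editing applications fall out of the masking formulation: to inpaint, tokenize the image, overwrite the tokens under the edit region with the mask token, and run the same iterative decoding against the new prompt. A minimal sketch (the helper name, `MASK_ID`, and the 16×16 token grid are assumptions of this sketch):

```python
import numpy as np

MASK_ID = 8192  # hypothetical mask token id, one past the codebook

def mask_region_for_inpainting(token_grid, region):
    """Replace the tokens under a boolean edit region with [MASK]; the
    ordinary Muse decoding loop then re-predicts only those positions,
    conditioned on the surrounding tokens and the text prompt, with no
    fine-tuning or model inversion required."""
    tokens = token_grid.copy()
    tokens[region] = MASK_ID
    return tokens
```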
More results are available at https://muse-model.github.io/.
[Announcing Muse, a super-fast text-to-image generation model based on masked generative transformers. 2–10× faster than diffusion or autoregressive models! SOTA CLIP score and excellent FID score!
Our training and architecture enable text-to-image generation at 512×512 resolution in under 2s on a TPUv4; and zero-shot editing capabilities right out of the box.
We leverage a pre-trained T5-XXL LLM for fine-grained text understanding, plus masking-based training that enables fast parallel decoding without loss of generation quality (as measured by FID or CLIP score).
Muse achieves a SOTA CLIP score of 0.32 and excellent FID of 7.88 on zero-shot COCO evaluation.]