“SANA: Efficient High-Resolution Image Synthesis With Linear Diffusion Transformers”, 2024-10-14:
[homepage] We introduce Sana, a text-to-image framework that can efficiently generate images up to 4,096×4,096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, and it is deployable on a laptop GPU.
Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with a modern decoder-only small LLM [Gemma-2] as the text encoder, and designed complex human instructions with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence.
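To make designs (1) and (2) concrete, here is a minimal pure-Python sketch, not the authors' implementation. It shows how the spatial compression factor changes the latent token count, and how linear attention avoids the N×N similarity matrix by reordering the computation. All function names are hypothetical. Two details are assumptions not stated above: the transformer patch size (taken as 1) and the ReLU feature map used as the attention kernel.

```python
# Illustrative sketch only: names are hypothetical, not from the Sana codebase.

def latent_tokens(height, width, ae_factor, patch_size=1):
    """Token count seen by the transformer for one image.

    ae_factor is the autoencoder's spatial compression per side (8 or 32 in
    the text); patch_size=1 is an assumption made for this example.
    """
    stride = ae_factor * patch_size
    return (height // stride) * (width // stride)


def relu(vec):
    # Assumed kernel feature map; the text only says "linear attention".
    return [max(0.0, x) for x in vec]


def quadratic_attention(Q, K, V, eps=1e-6):
    """Reference form: computes all N*N query-key similarities explicitly."""
    dim_v = len(V[0])
    out = []
    for q in Q:
        fq = relu(q)
        weights = [sum(a * b for a, b in zip(fq, relu(k))) for k in K]
        norm = sum(weights) + eps
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / norm
                    for c in range(dim_v)])
    return out


def linear_attention(Q, K, V, eps=1e-6):
    """Same result, but keys/values are summarized once: cost grows with N, not N*N."""
    d, dim_v = len(K[0]), len(V[0])
    kv = [[0.0] * dim_v for _ in range(d)]  # sum_j relu(k_j)^T v_j, shape (d, dim_v)
    k_sum = [0.0] * d                        # sum_j relu(k_j), shape (d,)
    for k, v in zip(K, V):
        fk = relu(k)
        for i in range(d):
            k_sum[i] += fk[i]
            for c in range(dim_v):
                kv[i][c] += fk[i] * v[c]
    out = []
    for q in Q:
        fq = relu(q)
        norm = sum(a * b for a, b in zip(fq, k_sum)) + eps
        out.append([sum(fq[i] * kv[i][c] for i in range(d)) / norm
                    for c in range(dim_v)])
    return out


if __name__ == "__main__":
    print(latent_tokens(1024, 1024, ae_factor=8))   # 16384
    print(latent_tokens(1024, 1024, ae_factor=32))  # 1024
    print(latent_tokens(4096, 4096, ae_factor=32))  # 16384
```

Under these assumptions, a 32× autoencoder gives a 4,096×4,096 image the same token count that an 8× autoencoder gives a 1,024×1,024 image. The quadratic reference here uses the same ReLU kernel rather than softmax, so the sketch demonstrates the reordering that makes the cost linear in the number of tokens; it does not claim equivalence to softmax attention.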
As a result, Sana-0.6B is very competitive with modern giant diffusion models (e.g., Flux-12B), being 20× smaller and more than 100× faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1,024×1,024 resolution image.
Sana enables content creation at low cost.
Code and model will be publicly released.