“Make-A-Scene: Scene-Based Text-To-Image Generation With Human Priors”, 2022-03-24:
[video; reimplementation; impressive hard samples, but almost immediately surpassed by DALL·E 2, Imagen, & Parti.] Recent text-to-image generation methods provide a simple yet exciting conversion capability between the text and image domains. While these methods have incrementally improved generated image fidelity and text relevancy, several pivotal gaps remain unaddressed, limiting applicability and quality.
We propose a novel text-to-image method that addresses these gaps by (1) enabling a simple control mechanism complementary to text in the form of a scene, (2) introducing elements that substantially improve the tokenization process by employing domain-specific knowledge over key image regions (faces and salient objects), and (3) adapting classifier-free guidance for the transformer use case.
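[Contribution (3) amounts to running the autoregressive sampler twice per token, once with and once without the conditioning, and mixing the two logit streams, in analogy to classifier-free guidance in diffusion models. A minimal sketch of that idea, assuming a decoder-only transformer `model` mapping token IDs to next-token logits; the function name, signature, and guidance scale here are illustrative assumptions, not the paper's code (in Make-A-Scene, `cond_tokens` would hold the text tokens, optionally preceded by scene-segmentation tokens):

```python
import torch

@torch.no_grad()
def sample_with_cf_guidance(model, cond_tokens, uncond_tokens,
                            n_image_tokens, guidance_scale=3.0):
    """Autoregressive sampling with classifier-free guidance (sketch).

    Assumes `model(tokens)` returns logits of shape (batch, seq, vocab).
    `uncond_tokens` is the same prefix with the conditioning dropped/masked.
    """
    cond, uncond = cond_tokens.clone(), uncond_tokens.clone()
    generated = []
    for _ in range(n_image_tokens):
        logits_c = model(cond)[:, -1, :]    # conditional next-token logits
        logits_u = model(uncond)[:, -1, :]  # unconditional next-token logits
        # Guidance: push the conditional distribution away from the
        # unconditional one in logit space.
        logits = logits_u + guidance_scale * (logits_c - logits_u)
        probs = torch.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        generated.append(next_tok)
        # Append the sampled token to BOTH streams so they stay in sync.
        cond = torch.cat([cond, next_tok], dim=1)
        uncond = torch.cat([uncond, next_tok], dim=1)
    return torch.cat(generated, dim=1)
```

Where diffusion models apply this interpolation to predicted noise, here it is applied to next-token logits; the paper's exact formulation may differ in details such as where masking is applied.]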
Our model achieves state-of-the-art FID and human-evaluation results, unlocking the ability to generate high-fidelity images at a resolution of 512×512 pixels, substantially improving visual quality.
Through scene controllability, we introduce several new capabilities: (1) scene editing, (2) text editing with anchor scenes, (3) overcoming out-of-distribution text prompts, and (4) story illustration generation, as demonstrated in the story we wrote.
[July blog: “…Since the research paper was released, Make-A-Scene has incorporated a super resolution network that generates imagery at 2048×2048, 4× the resolution, and we’re continuously improving our generative AI models. We aim to provide broader access to our research demos in the future to give more people the opportunity to be in control of their own creations and unlock entirely new forms of expression.” No specifics on release of models or a service/API.]