“Design Guidelines for Prompt Engineering Text-To-Image Generative Models”, 2022-01-07:
Text-to-image generative models are a new and powerful way to generate visual artwork. However, the open-ended nature of text as interaction is double-edged; while users can input anything and have access to an infinite range of generations, they also must engage in brute-force trial and error with the text prompt when the result quality is poor.
We conduct a [VQGAN+CLIP] study exploring what prompt keywords and model hyperparameters can help produce coherent outputs. In particular, we study prompts structured to include subject and style keywords and investigate success and failure modes of these prompts. Our evaluation of 5,493 generations over the course of 5 experiments spans 51 abstract and concrete subjects as well as 51 abstract and figurative styles.
From this evaluation, we present design guidelines that can help people produce better outcomes from text-to-image generative models.
…3.1 Methodology: To study different permutations of prompts, we first had to generate a large number of images. To do this, we used the checkpoint and configuration of VQGAN+CLIP pretrained on ImageNet with the 16384 codebook size [35]. Each image was generated to be 256×256 pixels and iterated on for 300 steps on a local NVIDIA GeForce RTX 3080 GPU.
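The generation settings above can be collected into a small configuration sketch. The key names below are illustrative, not the actual option names of the VQGAN+CLIP repository.

```python
# Generation settings from the methodology, as an illustrative config dict.
# Key names are assumptions, not the VQGAN+CLIP repository's actual options.
GENERATION_CONFIG = {
    "checkpoint": "vqgan_imagenet_f16_16384",  # ImageNet-pretrained, 16384 codebook
    "image_size": (256, 256),                  # output resolution in pixels
    "iterations": 300,                         # optimization steps per image
    "device": "cuda",                          # local NVIDIA GeForce RTX 3080
}
```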
Each image was generated according to a prompt involving a subject and style. We chose the following subjects: love, hate, happiness, sadness, man, woman, tree, river, dog, cat, ocean, and forest. These subjects were chosen for their universality across media and across cultures. These subjects additionally were balanced for how abstract or concrete they were as a concept as well as for positive and negative sentiment. We decided on whether a subject fell into the abstract or concrete category based upon ratings taken from a dataset of concreteness values [7]. Our set of abstract subjects averaged 2.12 on a scale from 1 to 5 (1 being most abstract), and our set of concrete subjects averaged 4.80.
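The abstract/concrete split described above can be sketched as a simple thresholding over per-word concreteness ratings. The ratings below are illustrative placeholders, not values from the cited dataset [7], and the threshold is an assumption.

```python
# Illustrative concreteness ratings on a 1-5 scale (1 = most abstract).
# These values are placeholders, NOT ratings from the cited dataset [7].
CONCRETENESS = {
    "love": 2.0, "hate": 2.1, "happiness": 2.2, "sadness": 2.2,
    "man": 4.8, "woman": 4.8, "tree": 5.0, "river": 4.9,
    "dog": 4.9, "cat": 4.9, "ocean": 4.7, "forest": 4.7,
}

def split_by_concreteness(ratings, threshold=3.0):
    """Partition subjects into abstract (< threshold) and concrete (>= threshold)."""
    abstract = [s for s, r in ratings.items() if r < threshold]
    concrete = [s for s, r in ratings.items() if r >= threshold]
    return abstract, concrete

abstract_subjects, concrete_subjects = split_by_concreteness(CONCRETENESS)
```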
Similarly, we chose 12 styles spanning different time periods, cultural traditions, and aesthetics: Cubist, Islamic geometric art, Surrealism, action painting, ukiyo-e, ancient Egyptian art, High Renaissance, Impressionism, cyberpunk, unreal engine, Disney, VSCO. These styles likewise varied in whether they represented the world in an abstract or figurative manner. Specifically, we chose 4 abstract styles, 4 figurative styles, and 4 aesthetics related to the digital age. We balanced for time periods (with 6 styles predating the 20th century, and 6 styles from the 20th and 21st century).
We used these 12×12 subject and style combinations to study the effect of prompt permutations: how different rephrasings of the same keywords affect the image generation. For each of these combinations, we tested 9 permutations derived from the CLIP code repository and discussion within the online community, generating 1,296 images in total.
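The full prompt grid described above can be enumerated mechanically. The nine rephrasing templates below are assumptions for illustration; the actual permutations came from the CLIP code repository and community discussion, not this list.

```python
from itertools import product

SUBJECTS = ["love", "hate", "happiness", "sadness", "man", "woman",
            "tree", "river", "dog", "cat", "ocean", "forest"]
STYLES = ["Cubist", "Islamic geometric art", "Surrealism", "action painting",
          "ukiyo-e", "ancient Egyptian art", "High Renaissance", "Impressionism",
          "cyberpunk", "unreal engine", "Disney", "VSCO"]

# Nine illustrative rephrasing templates (placeholders, not the study's actual set).
TEMPLATES = [
    "{subject} in the style of {style}",
    "{subject}, {style}",
    "a picture of {subject} in the style of {style}",
    "{style} painting of {subject}",
    "an image of {subject}, {style}",
    "{subject} rendered in {style}",
    "{style} artwork depicting {subject}",
    "{subject} made to look like {style}",
    "a {style} rendering of {subject}",
]

def enumerate_prompts(subjects, styles, templates):
    """Yield every subject x style x template combination as a prompt string."""
    for subject, style, template in product(subjects, styles, templates):
        yield template.format(subject=subject, style=style)

prompts = list(enumerate_prompts(SUBJECTS, STYLES, TEMPLATES))
# 12 subjects x 12 styles x 9 templates = 1,296 prompts in total
```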
…We condense our findings from the previous experiments into design guidelines that suggest default parameters and methods for end users interacting with text-to-image models.
When picking the prompt, focus on subject and style keywords instead of connecting words.
Rephrasings using the same keywords do not make a large difference in the quality of the generation, as no prompt permutation consistently outperforms the rest.
When generating, use 3–9 different seeds to get a representative idea of what a prompt can return.
Generations may be substantially different owing to the stochastic nature of hyperparameters such as random seeds and initializations. Returning multiple results acknowledges this stochastic nature to users.
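A minimal sketch of this multi-seed workflow, assuming a hypothetical `generate_image` stand-in for the actual VQGAN+CLIP call (the real model is not shown here):

```python
import random

def generate_image(prompt, seed):
    """Stand-in for an actual VQGAN+CLIP generation call (model not shown).
    Seeding the RNG makes the otherwise stochastic initialization reproducible."""
    rng = random.Random(f"{prompt}|{seed}")  # deterministic per (prompt, seed)
    return [rng.random() for _ in range(4)]  # placeholder "image" latent

def sample_prompt(prompt, n_seeds=9):
    """Generate one image per seed so users see a representative range of outputs."""
    return {seed: generate_image(prompt, seed) for seed in range(n_seeds)}

results = sample_prompt("a dog in the style of ukiyo-e")
```

Returning the whole dictionary of per-seed results, rather than a single image, surfaces the stochastic variation to the user directly.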
When generating for fast iteration, shorter optimization lengths of 100–500 iterations are sufficient.
We found that the number of iterations and length of optimization did not meaningfully correlate with user satisfaction with the generation.
When choosing the style of the generation, feel free to try any style, no matter how niche or broad.
The deep learning models capture an impressive breadth of style information, and can be surprisingly good even for niche styles. However, avoid style keywords that may be prone to misinterpretation.
When picking the subject of the generation, pick subjects that can complement the chosen style in level of abstractness.
This can be done by matching how abstract or concrete the subject is to how abstract or figurative the style is, or by pairing the style with subjects that are easily interpretable or highly relevant to it.