“Design Guidelines for Prompt Engineering Text-To-Image Generative Models”, Vivian Liu and Lydia B. Chilton, 2022-01-07:

Text-to-image generative models are a new and powerful way to generate visual artwork. However, the open-ended nature of text as interaction is double-edged; while users can input anything and have access to an infinite range of generations, they also must engage in brute-force trial and error with the text prompt when the result quality is poor.

We conduct a study of VQGAN+CLIP exploring which prompt keywords and model hyperparameters can help produce coherent outputs. In particular, we study prompts structured to include subject and style keywords and investigate the success and failure modes of these prompts. Our evaluation of 5,493 generations over the course of 5 experiments spans 51 abstract and concrete subjects as well as 51 abstract and figurative styles.

From this evaluation, we present design guidelines that can help people produce better outcomes from text-to-image generative models.

3.1 Methodology: To study different permutations of prompts, we first had to generate a large number of images. To do this, we used the checkpoint and configuration of VQGAN+CLIP pretrained on ImageNet with the 16384 codebook size [35]. Each image was generated at 256×256 pixels and iterated for 300 steps on a local NVIDIA GeForce RTX 3080 GPU.
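The generation settings above can be collected into a single configuration, which is useful when sweeping hyperparameters across many runs. This is an illustrative sketch only: the key names below are hypothetical and do not correspond to VQGAN+CLIP's actual command-line flags, though the values match the setup described in the text.

```python
# Illustrative run configuration for the study's setup.
# Key names are hypothetical; values come from the methodology above.
GEN_CONFIG = {
    "checkpoint": "vqgan_imagenet_f16_16384",  # ImageNet-pretrained, 16384 codebook
    "image_size": (256, 256),                  # output resolution in pixels
    "steps": 300,                              # optimization iterations per image
    "device": "cuda",                          # run on the local GPU (RTX 3080)
}
```

Keeping the configuration in one place makes it straightforward to hold everything fixed while varying only the prompt, as the experiments below do.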

Each image was generated according to a prompt involving a subject and a style. We chose the following 12 subjects: love, hate, happiness, sadness, man, woman, tree, river, dog, cat, ocean, and forest. These subjects were chosen for their universality across media and across cultures. They were additionally balanced for how abstract or concrete they were as concepts, as well as for positive and negative sentiment. We decided whether a subject fell into the abstract or concrete category based upon ratings taken from a dataset of concreteness values [7]. Our set of abstract subjects averaged 2.12 on a scale from 1 to 5 (1 being most abstract), and our set of concrete subjects averaged 4.80.
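The abstract/concrete split described above can be sketched as a threshold on a concreteness lexicon rated 1 to 5 (1 = most abstract). The helper and the ratings below are illustrative only: the numbers are placeholder values, not figures from the cited concreteness dataset, and the threshold is an assumption.

```python
def split_by_concreteness(ratings, threshold=3.0):
    """Partition words into abstract (< threshold) and concrete (>= threshold)
    using concreteness ratings on a 1-5 scale (1 = most abstract)."""
    abstract = [w for w, r in ratings.items() if r < threshold]
    concrete = [w for w, r in ratings.items() if r >= threshold]
    return abstract, concrete

# Placeholder ratings for illustration; NOT values from the actual dataset.
toy_ratings = {"love": 2.0, "tree": 4.9, "hate": 1.8, "dog": 4.8}
abstract, concrete = split_by_concreteness(toy_ratings)
```

A per-category mean over the resulting lists would reproduce summary statistics like the 2.12 and 4.80 averages reported for the study's subject sets.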

Similarly, we chose 12 styles spanning different time periods, cultural traditions, and esthetics: Cubist, Islamic geometric art, Surrealism, action painting, ukiyo-e, ancient Egyptian art, High Renaissance, Impressionism, cyberpunk, unreal engine, Disney, and VSCO. These styles likewise varied in whether they represented the world in an abstract or figurative manner. Specifically, we chose 4 abstract styles, 4 figurative styles, and 4 esthetics related to the digital age. We also balanced for time period, with 6 styles predating the 20th century and 6 styles from the 20th and 21st centuries.

We used these 12×12 subject and style combinations (144 in total) to study the effect of prompt permutations: how different rephrasings of the same keywords affect the image generation. For each of these combinations, we tested 9 permutations derived from the CLIP code repository and from discussion within the online community, generating 1,296 images in total.
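Enumerating the full prompt set is a simple cross product of subjects, styles, and phrasing templates. The subjects and styles below are the 12 of each named in the study; the nine templates, however, are illustrative stand-ins, since the paper attributes its exact permutations to the CLIP code repository and community discussion without listing them here.

```python
# The 12 subjects and 12 styles from the study.
SUBJECTS = ["love", "hate", "happiness", "sadness", "man", "woman",
            "tree", "river", "dog", "cat", "ocean", "forest"]
STYLES = ["Cubist", "Islamic geometric art", "Surrealism", "action painting",
          "ukiyo-e", "ancient Egyptian art", "High Renaissance",
          "Impressionism", "cyberpunk", "unreal engine", "Disney", "VSCO"]

# Hypothetical rephrasing templates, standing in for the paper's nine permutations.
TEMPLATES = [
    "{subject} {style}",
    "{style} {subject}",
    "{subject} in the style of {style}",
    "a {style} depiction of {subject}",
    "{subject}, {style}",
    "an image of {subject} in the {style} style",
    "{subject} rendered as {style}",
    "{style} artwork of {subject}",
    "{subject} inspired by {style}",
]

# 12 subjects x 12 styles x 9 templates = 1,296 prompts,
# matching the 1,296 images reported.
prompts = [t.format(subject=s, style=st)
           for s in SUBJECTS for st in STYLES for t in TEMPLATES]
```

Each prompt then drives one 300-step generation run, so the cross product directly fixes the experiment's budget of images.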

…We condense our findings from the previous experiments into design guidelines that establish default parameters and methods for end users interacting with text-to-image models.