The ability to collect a large dataset of human preferences from text-to-image users is usually limited to companies, making such datasets inaccessible to the public.
To address this issue, we create a web app that enables text-to-image users to generate images and specify their preferences. Using this web app we build Pick-a-Pic, a large, open dataset of text-to-image prompts and real users' preferences over generated images [37,000 prompts; 538,000 images; 967,000 comparisons]. We leverage this dataset to train a CLIP-based scoring function, PickScore, using an InstructGPT-style reward-modeling objective. PickScore achieves superhuman performance on the task of predicting user preferences (70.2% accuracy, compared to humans' 68.0%), while zero-shot CLIP-H (60.8%) and the popular aesthetics predictor (56.8%) perform much closer to chance.
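As a minimal sketch of the InstructGPT-style reward-modeling idea mentioned above: given CLIP-style scores for the two images in a comparison, predict a preference distribution with a softmax and minimize its KL divergence from the human label. The function name and exact formulation here are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def pickscore_style_loss(score_a, score_b, pref):
    """KL divergence between the predicted preference distribution (softmax
    over the two scores) and the human preference label `pref`, e.g. [1, 0]
    for a clear win or [0.5, 0.5] for a tie. Hypothetical helper; the paper's
    exact objective and scaling may differ."""
    scores = np.array([score_a, score_b], dtype=float)
    probs = np.exp(scores - scores.max())   # numerically stable softmax
    probs /= probs.sum()
    pref = np.asarray(pref, dtype=float)
    mask = pref > 0                          # treat 0 * log(0) as 0
    return float(np.sum(pref[mask] * np.log(pref[mask] / probs[mask])))
```

Under this objective, a tie label with equal scores yields zero loss, and widening the score gap in favor of the preferred image drives the loss toward zero.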
Then, we test PickScore's ability to perform model evaluation and observe that it correlates better with human rankings than other automatic evaluation metrics… Even when evaluated against MS-COCO captions, PickScore exhibits a strong correlation with human preferences (0.917), while ranking with FID yields a strong negative correlation (−0.900) [cf. Stein et al. 2023].
Therefore, we recommend using PickScore for evaluating future text-to-image generation models, and using Pick-a-Pic prompts as a more relevant dataset than MS-COCO.
Finally, we demonstrate how PickScore can enhance existing text-to-image models via ranking.
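Enhancing a model "via ranking" amounts to best-of-n selection: sample several candidates for a prompt and keep the one the scoring function prefers. The sketch below assumes stand-in `generate` and `score` callables; they are placeholders for a text-to-image model and PickScore, not the paper's API.

```python
import numpy as np

def best_of_n(prompt, generate, score, n=16):
    """Draw n candidate images for `prompt` and return the one with the
    highest score. `generate(prompt)` and `score(prompt, image)` are
    hypothetical stand-ins for a text-to-image model and PickScore."""
    images = [generate(prompt) for _ in range(n)]
    scores = [score(prompt, img) for img in images]
    return images[int(np.argmax(scores))]
```

Because selection happens purely at inference time, this requires no changes to the underlying generator, only n forward passes plus n scoring calls.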
Figure 6: Correlation of model win ratios according to FID and PickScore with win ratios according to human experts, on the MS-COCO validation set.
…Figure 6 shows the correlation between model win rates induced by human rankings (horizontal axis) and model win rates induced by each automatic scoring function. PickScore exhibits a strong correlation (0.917) with human raters on MS-COCO captions, while FID, surprisingly, exhibits a strong negative correlation (−0.900). As FID is oblivious to the prompt, one would expect zero correlation, not a strong negative one. We hypothesize that this is related to the classifier-free guidance scale hyperparameter: larger scales tend to produce more vivid images (which humans typically prefer) but diverge from the distribution of ground-truth images in MS-COCO, yielding worse (higher) FID scores. Figure 7 visualizes these differences by presenting pairs of images generated with the same random seed but different classifier-free guidance (CFG) scales.
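A simplified reading of this evaluation protocol can be sketched as follows: compute each model's win ratio from per-prompt pairwise score comparisons, then correlate those ratios with the human-derived ones. This is an illustrative reconstruction, not the paper's evaluation code; ties are counted as losses for both sides here.

```python
import numpy as np
from itertools import combinations

def win_ratios(model_scores):
    """model_scores maps model name -> array of per-prompt scores (one entry
    per shared prompt). Returns each model's fraction of won pairwise,
    per-prompt comparisons against all other models."""
    names = list(model_scores)
    wins = {m: 0 for m in names}
    total = {m: 0 for m in names}
    for a, b in combinations(names, 2):
        sa = np.asarray(model_scores[a], dtype=float)
        sb = np.asarray(model_scores[b], dtype=float)
        wins[a] += int((sa > sb).sum())
        wins[b] += int((sb > sa).sum())
        total[a] += len(sa)
        total[b] += len(sb)
    return {m: wins[m] / total[m] for m in names}

def pearson(x, y):
    """Pearson correlation between two win-ratio vectors."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])
```

Feeding the win ratios induced by an automatic metric and those induced by human rankings into `pearson` yields correlations of the kind reported above (0.917 for PickScore, −0.900 for FID).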
Figure 7: Images generated with the same seed and model but different classifier-free guidance (CFG) scales. Even though higher guidance scales lead to worse FID, humans usually find the resulting images more pleasing than those produced with lower scales.
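For reference, the standard classifier-free guidance rule (not stated in this excerpt, but the usual formulation) combines the conditional and unconditional noise predictions, where $w$ is the guidance scale discussed above:

$$\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)$$

Setting $w = 1$ recovers the plain conditional prediction; larger $w$ pushes samples further toward the prompt, producing the more vivid but less MS-COCO-like images that drive FID up while human preference also goes up.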
…Acknowledgments: We gratefully acknowledge the support of Stability AI, the Google TRC program, and Jonathan Berant, who provided us with invaluable compute resources, credits, and storage that were crucial to the success of this project.