“CommonCanvas: An Open Diffusion Model Trained With Creative-Commons Images”, 2023-10-25:
We assemble a dataset of Creative-Commons-licensed (CC) images, which we use to train a set of open diffusion models that are qualitatively competitive with Stable Diffusion 2 (SD2). This task presents two challenges: (1) high-resolution CC images lack the captions necessary to train text-to-image generative models; (2) CC images are relatively scarce.
To address these challenges, we use an intuitive transfer-learning technique to produce a set of high-quality synthetic captions [generated with BLIP-2] paired with curated CC images. We then develop a data- and compute-efficient training recipe that requires as little as 3% of the LAION-2B data needed to train existing SD2 models, yet obtains comparable quality. These results indicate that we have a sufficient number of CC images (~70 million) for training high-quality models.
Our training recipe also implements a variety of optimizations that achieve ~3X training speed-ups, enabling rapid model iteration. We leverage this recipe to train several high-quality text-to-image models, which we dub the CommonCanvas family. Our largest model achieves performance comparable to SD2 in a human evaluation, despite being trained on our CC dataset, which is far smaller than LAION, and on synthetic captions.
We release our models, data, and code at https://github.com/mosaicml/diffusion.
4. CommonCatalog: A Dataset of CC Images & Synthetic Captions
In this section, we introduce our open dataset, CommonCatalog. First, we describe the collection and curation process for the openly licensed CC images. This process brings to light two challenges: caption-data incompleteness and image-data scarcity. To address the lack of CC captions, we show concretely how we use telephoning to produce high-quality synthetic captions to accompany our set of curated images. We investigate the topic of data scarcity in the next section, where we also discuss the systems-level training optimizations that enable efficient SD-model iteration.
4.1 Sourcing provenanced, licensed images for CommonCatalog
We focus on locating high-resolution Creative-Commons images with open licenses. We began with the YFCC100M dataset, which consists of 100 million CC-licensed images and multimedia files, as well as Flickr IDs linking to the original data. The images distributed with the original dataset exhibit two issues that make them ill-suited for direct use in training Stable Diffusion: they are low-resolution, and many of them have licenses that do not expressly allow for the distribution of derivative works, an area of unsettled copyright law in the context of model training. We therefore re-scraped these images from Flickr, based on the IDs provided in the YFCC100M metadata. Our scraped images are very high resolution (exceeding 4K), which makes them more suitable for T2I training.
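For concreteness, here is a minimal sketch of what such a re-scrape could look like, using Flickr's public `flickr.photos.getSizes` REST method to find the largest available rendition of each photo. The API key is a placeholder, and production concerns at YFCC100M scale (rate limiting, retries, dead-link bookkeeping) are omitted; this is not the paper's actual pipeline.

```python
# Sketch: re-fetch the largest available rendition of a YFCC100M photo
# from Flickr by photo ID (IDs come from the YFCC100M metadata).
import requests

FLICKR_REST = "https://api.flickr.com/services/rest/"
API_KEY = "YOUR_FLICKR_API_KEY"  # placeholder: obtain from Flickr


def largest_source_url(photo_id: str) -> str | None:
    """Query flickr.photos.getSizes and return the largest size's URL."""
    resp = requests.get(
        FLICKR_REST,
        params={
            "method": "flickr.photos.getSizes",
            "api_key": API_KEY,
            "photo_id": photo_id,
            "format": "json",
            "nojsoncallback": 1,
        },
        timeout=30,
    )
    data = resp.json()
    if data.get("stat") != "ok":
        return None  # photo deleted, private, or otherwise unavailable
    sizes = data["sizes"]["size"]
    best = max(sizes, key=lambda s: int(s["width"]) * int(s["height"]))
    return best["source"]


def download(photo_id: str, path: str) -> bool:
    """Save the largest rendition to disk; False if unavailable."""
    url = largest_source_url(photo_id)
    if url is None:
        return False
    with open(path, "wb") as f:
        f.write(requests.get(url, timeout=60).content)
    return True
```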
We exclude images with non-derivative (ND) licenses. The remaining images can be further divided into those that can be used for commercial (C) purposes and those that cannot (non-commercial, NC). As shown in Table 4, we accordingly construct two datasets, CommonCatalog-C and CommonCatalog-NC. We defer additional details about licenses to Appendix B.1.1, but emphasize that all of the included images have open licenses: individuals are free to use, adapt, and remix the images, so long as they attribute them. In total, CommonCatalog contains roughly 70 million CC images usable for non-commercial purposes, of which a subset of ~25 million can also be used commercially.
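As an illustration, a sketch of the license-based partitioning this implies; the tag strings below are assumptions (the paper's exact license taxonomy is in its Appendix B.1.1), and whether CommonCatalog-NC is released as a strict superset of CommonCatalog-C is inferred from the counts above.

```python
# Sketch: map a CC license tag to the CommonCatalog split(s) an image joins.
COMMERCIAL_OK = {"CC-BY", "CC-BY-SA"}          # commercial use allowed
NONCOMMERCIAL = {"CC-BY-NC", "CC-BY-NC-SA"}    # non-commercial use only
NO_DERIVATIVES = {"CC-BY-ND", "CC-BY-NC-ND"}   # ND: excluded entirely


def commoncatalog_splits(license_tag: str) -> set[str]:
    """Return the set of CommonCatalog splits an image belongs to."""
    if license_tag in NO_DERIVATIVES:
        return set()  # ND images are dropped (used only for caption evaluation)
    if license_tag in COMMERCIAL_OK:
        # Commercially usable images appear in both splits:
        # CommonCatalog-NC is the ~70M superset, CommonCatalog-C the ~25M subset.
        return {"CommonCatalog-C", "CommonCatalog-NC"}
    if license_tag in NONCOMMERCIAL:
        return {"CommonCatalog-NC"}
    return set()  # unknown or non-CC license: exclude conservatively
```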
Directly sourcing CommonCatalog avoids some concerns (§2.2); however, it also comes with its own challenges. For one, CC images rarely have the alt-text captions necessary to train a T2I model like Stable Diffusion (Table 4); those that do have associated text often just include the image title or a URL. For another, we could only find roughly 70 million usable CC images, which pales in comparison to the billions of images in LAION used to train SD2 (§5). We take each of these challenges in turn. First, in the next subsection, we show how we instantiate ‘telephoning’ (§3) to produce high-quality, synthetic captions for CC images.
…Based on these preliminary results, we captioned all of the YFCC100M Creative-Commons images, which required about 1,120 A100 GPU-hours. To do so, we center-cropped and resized all of the images to a maximum size of 512×512 pixels, since captioning images at their native resolution would be very expensive. At diffusion-model training time, images remain at their native resolution. We release our commercial (CommonCatalog-C) and non-commercial (CommonCatalog-NC) CC-image and synthetic-caption datasets on HuggingFace with associated data cards. As an evaluation set, we also release the BLIP-2 captions that we produced for the non-derivative (ND) CC images that we did not use for training.
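A minimal sketch of the preprocessing-plus-captioning step, using the Hugging Face transformers BLIP-2 API. The checkpoint name, the square-crop reading of "center-cropped and resized to a maximum size of 512×512", and the generation settings are assumptions, not the paper's recorded configuration.

```python
# Sketch: caption a CC image with BLIP-2 after cheap 512x512 preprocessing.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

CHECKPOINT = "Salesforce/blip2-opt-2.7b"  # assumed; paper may use another variant
processor = Blip2Processor.from_pretrained(CHECKPOINT)
model = Blip2ForConditionalGeneration.from_pretrained(
    CHECKPOINT, torch_dtype=torch.float16
).to("cuda")


def preprocess(img: Image.Image, size: int = 512) -> Image.Image:
    """Center-crop to a square, then resize, so captioning stays cheap."""
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    return img.resize((size, size), Image.LANCZOS)


def caption(img: Image.Image) -> str:
    """Generate one synthetic caption for a preprocessed image."""
    inputs = processor(images=preprocess(img), return_tensors="pt").to(
        "cuda", torch.float16
    )
    out = model.generate(**inputs, max_new_tokens=40)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()


print(caption(Image.open("example.jpg").convert("RGB")))
```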
[If diffusion-generated images which do not visibly resemble a copyrighted character are nevertheless derivative works of the training set, why are BLIP-2-generated descriptive captions not themselves derivative works of the non-CC-licensed image+caption pairs on which BLIP-2 was trained, and thus tainting? The text captions are still colored…]