“LAION-400-Million Open Dataset”, 2021-08-20:
We present LAION-400M, a dataset of 400M English (image, text) pairs; see also our NeurIPS 2021 Data-Centric AI workshop paper. The LAION-400M dataset is entirely open and freely accessible.
The image-text pairs have been extracted from the Common Crawl web data dump and come from random web pages crawled between 2014 and 2021… You can find:
- The CLIP image embeddings (NumPy files)
- The parquet files
- KNN index of image embeddings
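To illustrate how the NumPy embedding files and the KNN index fit together, here is a minimal sketch using random vectors in place of the released files (the shapes are illustrative; a brute-force search like this is what the KNN index accelerates at scale):

```python
import numpy as np

# Stand-in for one shard of the released NumPy embedding files;
# CLIP ViT-B/32 image embeddings are 512-dimensional.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 512)).astype(np.float32)

# Normalize so the dot product equals cosine similarity, which is
# the metric a KNN index over CLIP embeddings retrieves by.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def knn(query, k=5):
    """Brute-force k-nearest-neighbour search by cosine similarity."""
    sims = embeddings @ query
    top = np.argsort(-sims)[:k]
    return top, sims[top]

# Querying with a stored vector should return itself first.
idx, sims = knn(embeddings[42])
```

A production KNN index (e.g. the ones served in the web demo) replaces the exhaustive `argsort` with an approximate structure, trading a little recall for sub-linear query time.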
…LAION-400M Open Dataset structure: We produced the dataset in several formats to address the various use cases:
- a 50GB url+caption metadata dataset in parquet files. We can use the metadata to compute statistics and re-download parts of the dataset
- a 10TB webdataset with 256×256 images, captions and metadata. This is the full version of the dataset, usable directly for training (it is for internal use; you need to re-download the images yourself due to licensing issues)
- a 1TB set of the 400M text and image CLIP embeddings, useful for rebuilding new k-NN indices
- pairs of 16GB, 32GB, 64GB and 128GB k-NN indices (running in the web demo)
See Also:
RedCaps: web-curated image-text data created by the people, for the people
CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models
WebVision Database: Visual Learning and Understanding from Web Data
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
WebVision Challenge: Visual Learning and Understanding With Web Data
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB
ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision