LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
RedCaps: web-curated image-text data created by the people, for the people
CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models
WebVision Database: Visual Learning and Understanding from Web Data
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
WebVision Challenge: Visual Learning and Understanding With Web Data
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web
ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision