Bibliography (18):

  1. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

  2. CLIP: Connecting Text and Images

  3. RedCaps: web-curated image-text data created by the people, for the people

  4. CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs

  5. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning

  6. WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models

  7. WebVision Database: Visual Learning and Understanding from Web Data

  8. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

  9. WebVision Challenge: Visual Learning and Understanding With Web Data

  10. WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning

  11. CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web

  12. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale

  13. ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision