VinVL: Revisiting Visual Representations in Vision-Language Models
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Wikipedia Bibliography: