Bibliography (10):

  1. VinVL: Revisiting Visual Representations in Vision-Language Models

  2. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

  3. Attention Is All You Need

  4. Microsoft COCO: Common Objects in Context

  5. nocaps: novel object captioning at scale

  6. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning

  7. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

  Wikipedia Bibliography:

    1. Alt attribute

    2. N-gram

    3. Cross-entropy