Bibliography (4):

  1. https://towardsdatascience.com/your-vision-language-model-might-be-a-bag-of-words-30b1beaef7f8

  2. https://x.com/james_y_zou/status/1638947761562476545

  3. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

  4. Contrastive Representation Learning: A Framework and Review