Bibliography (4):

  1. PaLI: A Jointly-Scaled Multilingual Language-Image Model

  2. PaLI-X: On Scaling up a Multilingual Vision and Language Model

  3. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale

  4. Sigmoid Loss for Language Image Pre-Training