Bibliography (6):

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling (https://github.com/microsoft/LAVENDER)
VL-T5: Unifying Vision-and-Language Tasks via Text Generation
UNICORN: Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer