BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Robust Open-Vocabulary Translation from Visual Text Representations
Vision Transformer: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
M3AE: Multimodal Masked Autoencoders Learn Transferable Representations
StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding
Building Machine Translation Systems for the Next Thousand Languages