Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
Vision Transformer: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding