Vision Transformer: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
Scaling Vision Transformers
ImageNet: A Large-Scale Hierarchical Image Database
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
https://arxiv.org/pdf/2205.04596.pdf#page=19&org=google
Evaluating Machine Accuracy on ImageNet
BASIC: Combined Scaling for Open-Vocabulary Image Classification
ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
ImageNet Large Scale Visual Recognition Challenge
CoCa: Contrastive Captioners are Image-Text Foundation Models
Deep Residual Learning for Image Recognition