- Vision Transformer: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
- Scaling Vision Transformers
- ImageNet: A Large-Scale Hierarchical Image Database
- Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
- https://arxiv.org/pdf/2205.04596.pdf#page=19&org=google
- Evaluating Machine Accuracy on ImageNet
- BASIC: Combined Scaling for Open-Vocabulary Image Classification
- ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
- ImageNet Large Scale Visual Recognition Challenge
- CoCa: Contrastive Captioners are Image-Text Foundation Models
- Deep Residual Learning for Image Recognition
-