Attention Is All You Need
ImageNet Large Scale Visual Recognition Challenge
https://arxiv.org/pdf/2305.09828#page=11
https://arxiv.org/pdf/2305.09828#page=12
Scaling MLPs: A Tale of Inductive Bias
Vision Transformer: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
ImageNet: A Large-Scale Hierarchical Image Database
https://arxiv.org/pdf/2305.09828#page=3
Language Models are Unsupervised Multitask Learners
Do Vision Transformers See Like Convolutional Neural Networks?
What do Vision Transformers Learn? A Visual Exploration
Training data-efficient image transformers & distillation through attention
Vision Transformers Need Registers
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases