An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale (Vision Transformer)
Training data-efficient image transformers & distillation through attention
CoAtNet: Marrying Convolution and Attention for All Data Sizes
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Wikipedia Bibliography: