Bibliography (7):

ImageNet Large Scale Visual Recognition Challenge
An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
DeepViT: Towards Deeper Vision Transformer
Towards Learning Convolutions from Scratch
Finding the Needle in the Haystack with Convolutions: On the Benefits of Architectural Bias
Homotopy Analysis for Tensor PCA