“Training Data-Efficient Image Transformers & Distillation through Attention”, Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou (2020-12-23):

Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption.

In this work, we produce competitive convolution-free transformers by training on ImageNet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data.

More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token that ensures the student learns from the teacher through attention. We show the benefit of this token-based distillation, especially when using a convnet as a teacher. With it we report results competitive with convnets both on ImageNet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
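
To make the distillation-token mechanism concrete, here is a minimal PyTorch sketch (not the authors’ released code; the class name `DistilledViT`, the `encoder` argument, and the head names are illustrative). A learnable distillation token is appended alongside the usual class token, interacts with the patch tokens through self-attention, and its own head is trained on the teacher’s hard predictions while the class head is trained on the ground-truth labels:

```python
# Minimal sketch of DeiT-style hard distillation through a distillation token.
# Assumptions: `encoder` is any transformer encoder mapping (B, N, D) token
# sequences to (B, N, D); names here are illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistilledViT(nn.Module):
    def __init__(self, encoder, embed_dim, num_classes):
        super().__init__()
        self.encoder = encoder
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # distillation token
        self.head_cls = nn.Linear(embed_dim, num_classes)   # supervised head
        self.head_dist = nn.Linear(embed_dim, num_classes)  # distillation head

    def forward(self, patch_tokens):  # patch_tokens: (B, N, D) embedded image patches
        B = patch_tokens.size(0)
        # Both extra tokens attend to the patches (and each other) via self-attention.
        tokens = torch.cat([self.cls_token.expand(B, -1, -1),
                            self.dist_token.expand(B, -1, -1),
                            patch_tokens], dim=1)
        z = self.encoder(tokens)
        return self.head_cls(z[:, 0]), self.head_dist(z[:, 1])

def hard_distillation_loss(logits_cls, logits_dist, teacher_logits, labels):
    # Class head learns from ground truth; distillation head learns the
    # teacher's hard decision (argmax), with the two terms weighted equally.
    teacher_labels = teacher_logits.argmax(dim=-1)
    return 0.5 * F.cross_entropy(logits_cls, labels) \
         + 0.5 * F.cross_entropy(logits_dist, teacher_labels)
```

At test time the paper fuses the two classifiers by adding their softmax outputs, so the final prediction draws on both the ground-truth labels and the teacher’s decisions.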

[cf. “ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases”, d’Ascoli et al 2021; “DeepViT: Towards Deeper Vision Transformer”, Zhou et al 2021; also of interest: “Towards Learning Convolutions from Scratch”, Neyshabur 2020; “Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias”, d’Ascoli et al 2019; “Homotopy Analysis for Tensor PCA”, Anandkumar et al 2016]