“Vision Transformer: An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale”, 2020-09-28:
One-sentence Summary: Transformers applied directly to image patches and pre-trained on large datasets work really well on image classification.
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place.
We show that this reliance on CNNs is not necessary and a pure transformer can perform very well on image classification tasks when applied directly to sequences of image patches.
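As an illustration of what "sequences of image patches" means, here is a minimal sketch in plain Python that cuts an image into non-overlapping square patches and flattens each one into a vector, in row-major order. The helper name `patchify`, the nested-list image representation, and the 224×224 RGB input are assumptions made for this example and are not taken from the abstract; only the 16×16 patch size comes from the title. The sketch covers the splitting step alone, not the Transformer that consumes the resulting sequence.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into a row-major
    sequence of flattened patch_size x patch_size patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append this pixel's channel values
            patches.append(patch)
    return patches


# Hypothetical input: a 224 x 224 RGB image filled with zeros.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image, patch_size=16)
print(len(tokens), len(tokens[0]))  # 196 768
```

Under these assumptions the image becomes a sequence of (224/16)² = 196 "words", each a vector of 16 × 16 × 3 = 768 values, which is the kind of input a standard Transformer expects.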
When pre-trained on large amounts of data [JFT-300M] and transferred to multiple recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train…Our Vision Transformer, pre-trained on the JFT-300M dataset, approaches or beats state-of-the-art on multiple image recognition benchmarks, reaching accuracy of 88.36% on ImageNet, 90.77% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.16% on the VTAB suite of 19 tasks…Interestingly, our models took substantially less compute to pre-train than prior state-of-the-art; however, we note that pre-training efficiency may be affected not only by the architecture choice, but also by other parameters, such as training schedule, optimizer, weight decay, etc. We provide a controlled study of performance vs. compute for different architectures in §4.4…Finally, [we plan] to further scale ViT, given that the performance does not yet seem to be saturating with the increased model size.
[Keywords: computer vision, image recognition, self-attention, transformer, large-scale training]
[Blog. See also “Not All Images are Worth 16×16 Words: Dynamic Vision Transformers with Adaptive Sequence Length”, et al 2021.]