Exploring the Limits of Weakly Supervised Pretraining
Deep Residual Learning for Image Recognition
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
ImageNet Large Scale Visual Recognition Challenge
A Simple Framework for Contrastive Learning of Visual Representations
SimCLRv2: Big Self-Supervised Models are Strong Semi-Supervised Learners
SEER: Self-supervised Pretraining of Visual Features in the Wild
BEiT: BERT Pre-Training of Image Transformers
https://arxiv.org/pdf/2201.08371
CLIP: Connecting Text and Images: We’re introducing a neural network called CLIP which efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the ‘zero-shot’ capabilities of GPT-2 and GPT-3.
Vision Transformer: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Do ImageNet Classifiers Generalize to ImageNet?
The Devil is in the Tails: Fine-grained Classification in the Wild