"Contrasting Contrastive Self-Supervised Representation Learning Models", 2021-03-25:
In the past few years, we have witnessed remarkable breakthroughs in self-supervised representation learning. Despite the success and adoption of representations learned through this paradigm, much is yet to be understood about how different training methods and datasets influence performance on downstream tasks.
In this paper, we analyze contrastive approaches as one of the most successful and popular variants of self-supervised representation learning. We perform this analysis from the perspective of training algorithms, pre-training datasets, and end tasks. We examine over 700 training experiments covering 30 encoders, 4 pre-training datasets, and 20 diverse downstream tasks.
Our experiments address various questions regarding the performance of self-supervised models compared to their supervised counterparts, current benchmarks used for evaluation, and the effect of the pre-training data on end task performance.
We hope the insights and empirical evidence provided by this work will help future research in learning better visual representations.
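Contrastive methods train the encoder to pull embeddings of two augmented views of the same image together while pushing embeddings of other images apart. As a concrete sketch (not the paper's implementation), MoCo v2 builds on an InfoNCE-style objective; the minimal NumPy version below assumes L2-normalized embeddings, and all names and values are illustrative:

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss for a single anchor.

    anchor, positive: (d,) L2-normalized embeddings of two augmented
    views of the same image. negatives: (n, d) embeddings of other images.
    """
    pos_logit = anchor @ positive / temperature
    neg_logits = negatives @ anchor / temperature
    logits = np.concatenate(([pos_logit], neg_logits))
    # numerically stable cross-entropy with the positive as the target class
    m = logits.max()
    return float(-(pos_logit - m) + np.log(np.exp(logits - m).sum()))

# toy check: two identical views give a much lower loss than unrelated images
rng = np.random.default_rng(0)
z = rng.normal(size=(10, 128))
z /= np.linalg.norm(z, axis=1, keepdims=True)
matched = info_nce_loss(z[0], z[0], z[2:])     # anchor and positive agree
mismatched = info_nce_loss(z[0], z[1], z[2:])  # positive is a different image
```

SwAV, by contrast, replaces direct instance comparison with an online clustering objective, so this loss describes only one family of the methods studied.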
Here we provide a summary of the analysis:
First, we showed that a backbone trained in a supervised fashion on ImageNet is not the best encoder for end tasks other than ImageNet classification and Pets classification (which is a similar end task).
Second, we showed that in many cases there is little to no correlation between ImageNet accuracy and performance on end tasks that are not semantic image-level tasks.
Third, we showed that different training algorithms provide better encoders for certain classes of end tasks. More specifically, MoCo v2 proved better for pixel-wise tasks, while SwAV showed better performance on image-level tasks.
Fourth, we showed that structural end tasks benefit more from self-supervision compared to semantic tasks.
Fifth, we showed that pre-training the encoder on a dataset the same as or similar to that of the end task yields higher performance. This is well known for supervised representation learning, but it was not evident for self-supervised methods, which do not use any labels.
Sixth, we showed that representations learned on class-unbalanced ImageNet are as good as, or even slightly better than, representations learned from balanced data.
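The second finding above rests on measuring how well encoders' ImageNet accuracies predict their end-task scores across many encoders. This summary does not specify the paper's exact statistic; one common choice for such rankings is Spearman's rho, sketched below (assuming no tied scores; the encoder scores in the example are made up, not from the paper):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Assumes no ties (double-argsort ranking breaks ties arbitrarily).
    """
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

# hypothetical per-encoder scores (illustrative only):
imagenet_acc = [76.1, 71.3, 67.5, 74.2]   # ImageNet top-1 per encoder
pixel_task = [0.61, 0.64, 0.58, 0.60]     # score on a pixel-wise end task
rho = spearman_rho(imagenet_acc, pixel_task)
```

A rho near zero across encoders would indicate that ImageNet accuracy does not predict performance on that end task.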