"Solving ImageNet: a Unified Scheme for Training Any Backbone to Top Results", 2022-04-07:
ImageNet serves as the primary dataset for evaluating the quality of computer-vision models. The common practice today is training each architecture with a tailor-made scheme, designed and tuned by an expert.
In this paper, we present a unified scheme for training any backbone on ImageNet. The scheme, named USI (Unified Scheme for ImageNet), is based on knowledge distillation [KD] and modern tricks. It requires no adjustments or hyper-parameter tuning between different models, and is efficient in terms of training time.
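To make the knowledge-distillation component concrete, the sketch below shows the standard soft-label KD loss (KL divergence between temperature-softened teacher and student distributions). This is a generic illustration of the KD objective referenced above, written in NumPy for self-containment; the temperature value and the T² scaling convention are standard KD practice, not details specified by USI.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft-label distillation loss: mean KL(teacher || student).

    Both logit arrays have shape (batch, num_classes). The temperature
    softens both distributions; the T**2 factor keeps gradient magnitudes
    comparable across temperatures (the usual KD convention).
    """
    t = temperature
    p_teacher = softmax(teacher_logits / t)
    log_p_student = np.log(softmax(student_logits / t) + 1e-12)
    kl = (p_teacher * (np.log(p_teacher + 1e-12) - log_p_student)).sum(axis=-1)
    return t * t * kl.mean()
```

In a full training loop this term is typically combined with (or replaces) the hard-label cross-entropy, with the teacher kept frozen in evaluation mode.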
We test USI on a wide variety of architectures, including CNNs, Transformers, mobile-oriented networks, and MLP-only models. On all models tested, USI outperforms previous state-of-the-art results. Hence, we are able to transform training on ImageNet from an expert-oriented task into an automatic, seamless routine. Since USI accepts any backbone and trains it to top results, it also enables methodical comparisons, identifying the most efficient backbones along the speed-accuracy Pareto curve.
Implementation is available at: Github.
Table 5: Accuracy for different numbers of epochs. Model tested: LeViT-384.

Training epochs | Top-1 Accuracy [%]
100             | 80.0
200             | 81.9
300             | 82.7
600             | 83.0
1,000           | 83.2

3.4. Number of Training Epochs: USI's default configuration is KD training for 300 epochs. However, 300 epochs is not enough to reach the maximal possible accuracy, and longer training further improves it. This phenomenon, where the student model in KD benefits from very long training, is known as the "patient teacher" effect. In Table 5 we present the accuracies obtained for various training lengths. As can be seen, accuracy continues to improve as the number of training epochs increases from 300 to 600 and to 1,000.
Our default training configuration was chosen to be 300 epochs, since this value provides a good compromise: training times are reasonable (1-3 days on 8 V100 GPUs, depending on the model), while we consistently achieve good results, as can be seen in Figure 1c. However, if training time is not a limitation, we recommend increasing the number of training epochs.