Masked image modeling has demonstrated great potential for alleviating the label-hungry problem of training large-scale vision Transformers, achieving impressive performance on various downstream tasks.
In this work, we propose a unified view of masked image modeling after revisiting existing methods. Under the unified view, we introduce a simple yet effective method, termed MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions, conditioning on corrupted input images.
Experimental results on image classification and semantic segmentation show that MaskDistill achieves performance comparable to or better than state-of-the-art methods. With a huge vision Transformer pretrained for 300 epochs, MaskDistill obtains 88.3% fine-tuning top-1 accuracy on ImageNet-1k (224 resolution) and 58.8 mIoU for semantic segmentation on ADE20K (512 resolution).
…In this work, we provide a unified view of masked image modeling, as illustrated in Equation 1 & Figure 1: a teacher model, a normalization layer, a student model, a MIM head, and a proper loss function. Based on this view, we conduct a systematic comparison of recent MIM works, presented in Table 1. The most important difference lies in the choice of teacher model, e.g., pixel values, tokenizers, pretrained models, or a momentum-updated teacher.
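The unified view above can be sketched in code: targets come from a teacher model, pass through a normalization layer, and the student is supervised only at masked positions. The sketch below is a minimal NumPy illustration under assumptions of ours; the function names are hypothetical, random arrays stand in for real teacher/student network outputs, and Smooth-L1 is just one possible choice of loss function.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each patch's feature vector to zero mean and unit
    # variance, playing the role of the normalization layer applied
    # to the teacher's target features in the unified view.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def mim_loss(teacher_feats, student_preds, mask):
    # teacher_feats, student_preds: (num_patches, dim) arrays.
    # mask: boolean (num_patches,), True where the patch was masked.
    # Targets are normalized teacher features; the loss is computed
    # only at masked positions (Smooth-L1 here as an illustrative choice).
    target = layer_norm(teacher_feats)[mask]
    pred = student_preds[mask]
    diff = np.abs(pred - target)
    loss = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return loss.mean()

# Toy example: 16 patches, 8-dim features, roughly 40% of patches masked.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(16, 8))   # stand-in for teacher features
student = rng.normal(size=(16, 8))   # stand-in for student predictions
mask = rng.random(16) < 0.4
print(mim_loss(teacher, student, mask))
```

In this framing, swapping the teacher (raw pixels, a tokenizer, a pretrained model, or a momentum copy of the student) changes only where `teacher_feats` comes from, which is exactly the axis of comparison in Table 1.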
Table 1: Systematic comparison of masked image modeling methods from a unified view.