Convolutional neural networks (CNNs) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer-based and multi-layer perceptron (MLP)-based models, such as the Vision Transformer and MLP-Mixer, have started to set new trends, showing promising results on the ImageNet classification task.
In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons. To ensure a fair comparison, we first develop a unified framework called SPACH which adopts separate modules for spatial and channel processing. Our experiments under the SPACH framework reveal that all structures can achieve competitive performance at a moderate scale. However, they demonstrate distinctive behaviors when the network size scales up.
Based on our findings, we propose two hybrid models using convolution and Transformer modules. The resulting Hybrid-MS-S+ model achieves 83.9% top-1 accuracy with 63M parameters and 12.3G FLOPs, which is already on par with SOTA models of sophisticated design.
The code and models are publicly available on GitHub.
…The SPACH framework contains a plug-and-play module called the mixing block, which can be implemented with convolution layers, Transformer layers, or MLP layers. Aside from the mixing block, all other components in the framework are kept the same when we explore different structures. This is in stark contrast to previous work, which compares network structures within different frameworks that vary greatly in layer cascade, normalization, and other non-trivial implementation details. In fact, we found that these structure-free components play an important role in the final performance of the model, a fact commonly neglected in the literature.
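The plug-and-play idea can be sketched in a few lines: everything outside the mixing block (residual wiring, pooling, head) stays fixed, and only the mixing function is swapped. This is a minimal illustrative sketch, assuming hypothetical names (`SpachModel`, `mlp_mixing`); it is not the authors' actual code.

```python
import numpy as np

def mlp_mixing(x):
    # Toy spatial mixing block: a fixed linear map over the token axis.
    # An identity weight matrix keeps the sketch deterministic; a real
    # model would learn these weights.
    n_tokens = x.shape[0]
    w = np.eye(n_tokens)
    return w @ x

class SpachModel:
    """Shared framework; only mixing_fn differs between structures."""
    def __init__(self, mixing_fn, depth=2):
        self.mixing_fn = mixing_fn  # plug-and-play mixing block
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):
            x = x + self.mixing_fn(x)  # residual connection, kept identical
        return x.mean(axis=0)          # shared pooling; classifier omitted

model = SpachModel(mlp_mixing)
tokens = np.ones((4, 8))               # 4 tokens, 8 channels
features = model.forward(tokens)
print(features.shape)                  # (8,)
```

Swapping `mlp_mixing` for a convolution- or attention-based function changes only the one argument, which is what makes the comparison controlled.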
Figure 1: Illustration of the proposed experimental framework named SPACH.
With this unified framework, we design a series of controlled experiments to compare the 3 network structures. The results show that all 3 network structures can perform well on the image classification task when pre-trained on ImageNet-1K. In addition, each individual structure has distinctive properties leading to different behaviors when the network size scales up. We also find several common design choices that contribute substantially to the performance of our SPACH framework. Our detailed findings are as follows.
Multi-stage design is standard in CNN models, but its effectiveness is largely overlooked in Transformer-based or MLP-based models.
We find that the multi-stage framework consistently and notably outperforms the single-stage framework no matter which of the 3 network structures is chosen.
Local modeling is efficient and crucial. With only light-weight depth-wise convolutions, the convolution model can achieve similar performance as a Transformer model in our SPACH framework.
By adding a local modeling bypass in both MLP and Transformer structures, a substantial performance boost is obtained with negligible parameters and FLOPs increase.
MLP can achieve strong performance at small model sizes, but it suffers severely from over-fitting when the model size scales up. We believe that over-fitting is the main obstacle preventing MLP from achieving SOTA performance.
Convolution and Transformer are complementary in the sense that convolution structure has the best generalization capability while Transformer structure has the largest model capacity among the 3 structures.
This suggests that convolution is still the best choice for designing lightweight models, while Transformer should be taken into account when designing large models.
Based on these findings, we propose two hybrid models of different scales which are built upon convolution and Transformer layers. Experimental results show that, when a sweet point between generalization capability and model capacity is reached, the performance of these straightforward hybrid models is already on par with SOTA models with sophisticated architecture designs.
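The local-modeling bypass mentioned in the findings above can be sketched as a depth-wise convolution over the token dimension, added alongside the global mixing path. This is an assumed minimal implementation (kernel size 3, merge by addition); the names and sizes are illustrative, not the paper's code.

```python
import numpy as np

def depthwise_conv1d(x, kernel):
    # x: (tokens, channels); kernel: (k, channels), one filter per channel,
    # so the cost is only k weights per channel (the "light-weight" part).
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = np.sum(xp[t:t + k] * kernel, axis=0)
    return out

def block_with_local_bypass(x, global_mix, kernel):
    # Global path (MLP/attention) plus a local depth-wise path, merged
    # by addition so the extra parameter/FLOP cost stays negligible.
    return x + global_mix(x) + depthwise_conv1d(x, kernel)

x = np.ones((6, 4))                         # 6 tokens, 4 channels
kernel = np.zeros((3, 4))
kernel[1] = 1.0                             # identity depth-wise filter
y = block_with_local_bypass(x, lambda t: 0 * t, kernel)
print(float(y[0, 0]))                       # 2.0: identity bypass doubles input
```

With a zeroed global path and an identity kernel, the output is exactly `x + x`, which makes the bypass wiring easy to verify before plugging in a real mixing function.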
Figure 4: Illustration of the over-fitting problem in MLP-based models.
Both multi-stage framework and weight sharing alleviate the problem.
Table 5: The performance of MLP models is greatly boosted when weight sharing is adopted to alleviate over-fitting.
| Model    | # Params | FLOPs | Throughput (images/s) | ImageNet-1K Top-1 (%) |
|----------|----------|-------|-----------------------|-----------------------|
| MLP-S    | 41M      | 8.7G  | 272                   | 78.6                  |
| +Shared  | 39M      | 8.7G  | 274                   | 80.2                  |
| MLP-MS-S | 46M      | 8.2G  | 254                   | 82.1                  |
| +Shared  | 45M      | 8.2G  | 244                   | 82.5                  |
…4.4. A Detailed Analysis of MLP: Due to their excessive number of parameters, MLP models suffer severely from over-fitting. We believe that over-fitting is the main obstacle preventing MLP from achieving SOTA performance. In this part, we discuss two mechanisms that can potentially alleviate this problem.
One is the use of the multi-stage framework. We have already shown in Table 3 that the multi-stage framework brings gains. The gain is even more prominent for larger MLP models. In particular, the MLP-MS-S model achieves a 2.6-point accuracy gain over the single-stage model MLP-S. We attribute this to the strong generalization capability of the multi-stage framework. Figure 4 shows how the test accuracy increases as the training loss decreases; over-fitting can be observed when the test accuracy starts to flatten. These results also establish a very promising baseline for MLP-based models. Without bells and whistles, the MLP-MS-S model achieves 82.1% ImageNet top-1 accuracy, which is 5.7 points higher than the best result reported by MLP-Mixer when ImageNet-1K is used as training data.
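A back-of-envelope calculation illustrates why a multi-stage design helps an MLP's parameter budget: each stage downsamples the spatial resolution, so the dense token-mixing matrices shrink quadratically in later stages instead of staying at full resolution throughout. The stage sizes below are assumed for illustration, not the paper's configuration.

```python
def token_mix_params(n_tokens):
    # A dense spatial-mixing layer over n_tokens positions needs a
    # full n_tokens x n_tokens weight matrix.
    return n_tokens * n_tokens

def single_stage(n_tokens, n_blocks):
    # Every block mixes at full resolution.
    return n_blocks * token_mix_params(n_tokens)

def multi_stage(n_tokens, blocks_per_stage):
    # Each stage halves height and width, i.e. divides tokens by 4.
    total, t = 0, n_tokens
    for blocks in blocks_per_stage:
        total += blocks * token_mix_params(t)
        t //= 4
    return total

# 196 tokens = 14x14 patches; 12 blocks total in both configurations.
print(single_stage(196, 12))                 # 460992
print(multi_stage(196, [3, 3, 3, 3]))        # far fewer spatial-mixing weights
```

Fewer spatial-mixing parameters at the same depth is one plausible reading of why the multi-stage MLP over-fits later, consistent with the training curves in Figure 4.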
The other mechanism is parameter reduction through weight sharing. We apply weight sharing to the spatial mixing function F_s. For the single-stage model, all n mixing blocks use the same F_s, while for the multi-stage model, each stage uses the same F_s for its N_s mixing blocks. We present the results of the S-scale models in Table 5: the shared-weight variants, denoted by "+Shared", achieve higher accuracy with almost the same model size and computation cost. Although they are still inferior to Transformer models, their performance is on par with or even better than that of convolution models. Figure 4 confirms that using shared weights in the MLP-MS model further delays the onset of over-fitting.
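The parameter arithmetic behind this scheme is straightforward: when all n blocks reuse one F_s, the spatial-mixing weights are counted once instead of n times. The following sketch uses illustrative layer sizes (196 tokens, 12 blocks), not the paper's exact configuration.

```python
def spatial_params(n_tokens):
    # A token-mixing layer over n_tokens positions: weight matrix + bias.
    return n_tokens * n_tokens + n_tokens

def model_spatial_params(n_blocks, n_tokens, shared):
    # With sharing, one copy of F_s serves every mixing block;
    # without it, each block owns its own copy.
    copies = 1 if shared else n_blocks
    return copies * spatial_params(n_tokens)

independent = model_spatial_params(n_blocks=12, n_tokens=196, shared=False)
shared = model_spatial_params(n_blocks=12, n_tokens=196, shared=True)
print(independent // shared)   # 12: sharing divides spatial params by n_blocks
```

Channel-mixing weights are unaffected, which matches Table 5: the total model size barely changes (41M vs. 39M) because spatial mixing is only part of the budget.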
Therefore, we conclude that MLP-based models remain competitive if the over-fitting problem can be solved or alleviated.
…Under the SPACH framework, we discover, somewhat surprisingly, that all 3 network structures are similarly competitive in terms of the accuracy-complexity tradeoff, although they show distinctive properties when the network scales up…Our work also raises several questions worth exploring. First, given that the performance of MLP-based models is largely limited by over-fitting, is it possible to design a high-performing MLP model that is not subject to over-fitting?