[code; cf. Liu et al. 2021, Nie et al. 2021] We revisit the most fundamental building block in deep learning, the multi-layer perceptron (MLP), and study the limits of its performance on vision tasks. Empirical insights into MLPs are important for multiple reasons.
Given the recent narrative that "less inductive bias is better", popularized by transformers eclipsing convolutional models, it is natural to explore the limits of this hypothesis. To that end, MLPs offer an ideal test bed, being completely free of any inductive bias.
Owing to their mathematical simplicity, MLPs have long been the main protagonist of the deep learning theory literature, serving as a proxy to explain empirical phenomena observed for more complex architectures. Surprisingly, experimental datapoints for MLPs are very difficult to find in the literature, especially when coupled with large pre-training protocols. This discrepancy between practice and theory is worrying: Do MLPs reflect the empirical advances exhibited by practical models? Or do theorists need to rethink the role of MLPs as a proxy?
We provide insights into both of these aspects. We show that the performance of MLPs improves drastically with scale (93% on CIFAR-10, 79% on CIFAR-100, 69% on Tiny ImageNet), highlighting that a lack of inductive bias can indeed be compensated for. We observe that MLPs faithfully mimic the behavior of their modern counterparts, although some components of the learning setting exhibit stronger or unexpected behaviors.
Due to the inherent computational efficiency of MLPs, large pre-training experiments become more accessible for academic researchers. All of our experiments were run on a single GPU.
Figure 1: Test error on CIFAR-100 as a function of PFLOPS.
…Due to their inferior performance, MLPs are rarely used, and very little is known about their behavior in more modern settings. For instance, to the best of our knowledge, there is not a single published result showcasing an MLP trained on ImageNet-1k, the de facto standard benchmark in vision, let alone any pre-training/transfer learning studies. This lack of empirical data is concerning, as theory aims to understand the characteristics of modern architectures through the lens of MLPs, yet few assessments have been made of how well such a proxy actually works.
…The MLP architecture is the ideal candidate to test the limits of such a hypothesis, as it exhibits the least inductive bias for vision due to its invariance to permutations of pixels. Unfortunately, the scale where Transformers and MLP-Mixers start to outperform convolutional models is out of reach for most researchers, requiring billions of annotated images and thousands of TPUs.
…In contrast to previous work, however, we find that compute-optimal MLPs allocate a larger share of their budget to sample size, again highlighting their lack of inductive bias. While regularization in the form of data augmentation is also helpful for CNNs, its role is substantially amplified for MLPs even at large sample sizes, leading to fatal degradation if turned off. We further investigate how the implicit bias of SGD affects performance and make a very counter-intuitive discovery: contrary to CNNs, we find that larger batch sizes generalize substantially better for MLPs.
…Standard MLP: As a starting point, we investigate simple MLPs with ReLU activations and isotropic design, i.e. except for the first, every layer has the same width m ∈ ℕ. In order to avoid training instabilities we further enhance the standard MLP with layer normalizations (Ba et al. 2016) placed after the activations…To embed the image x ∈ ℝ^(d×d×3) we use a linear layer emb(x) = W_emb vec(x) with W_emb ∈ ℝ^(m×3d²). Such an embedding layer is crucial since for high-resolution images, 3d² can be quite large and thus m needs to be chosen smaller. We empirically find that such a network design is the minimal choice needed to guarantee successful training across all scales of parameter count and sample size.
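The standard architecture above can be sketched in a few lines of numpy (a minimal illustration with toy dimensions, not our training code; biases are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def standard_mlp(x, W_emb, Ws):
    """Forward pass of the isotropic standard MLP.

    x     : (batch, d, d, 3) images
    W_emb : (3*d*d, m) linear embedding
    Ws    : list of (m, m) weight matrices, one per layer
    """
    h = x.reshape(x.shape[0], -1) @ W_emb       # emb(x) = W_emb vec(x)
    for W in Ws:
        h = layer_norm(np.maximum(h @ W, 0.0))  # ReLU, then LayerNorm
    return h

# Tiny example: batch of 2 "images" at resolution 8x8x3, width m = 16.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 8, 3))
W_emb = rng.standard_normal((3 * 8 * 8, 16)) * 0.1
Ws = [rng.standard_normal((16, 16)) * 0.1 for _ in range(6)]
out = standard_mlp(x, W_emb, Ws)
print(out.shape)  # (2, 16)
```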
Inverted Bottleneck MLP: Inspired by Lin et al. 2015 & Tolstikhin et al. 2021, we add a bottleneck structure to each MLP block, as well as skip connections.
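A single block of this design can be sketched as follows (our simplified illustration: the expansion factor of 4 and the exact placement of the normalization are assumptions, and biases are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def inverted_bottleneck_block(h, W_up, W_down):
    """One block: expand width m to k*m, apply the nonlinearity,
    project back to m, and add the skip connection.

    h      : (batch, m) activations
    W_up   : (m, k*m) expansion weights
    W_down : (k*m, m) projection weights
    """
    z = layer_norm(h)              # normalize the block input
    z = np.maximum(z @ W_up, 0.0)  # widen by expansion factor k
    z = z @ W_down                 # project back to width m
    return h + z                   # skip connection

rng = np.random.default_rng(0)
m, k = 16, 4
h = rng.standard_normal((2, m))
W_up = rng.standard_normal((m, k * m)) * 0.1
W_down = rng.standard_normal((k * m, m)) * 0.1
out = inverted_bottleneck_block(h, W_up, W_down)
print(out.shape)  # (2, 16)
```

With zero weights the block reduces to the identity, which is exactly what the skip connection buys for optimization.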
…In order to limit the size of the embedding layer and the computational needs, we downscale all images to resolution 64 × 64 × 3 (if needed), as done in Chrabaszcz et al. 2017.
…4.2 Training from Scratch: We start the empirical exploration of MLPs by training them from scratch (i.e. without any extra data) on popular vision benchmarks. All models were trained with the LION optimizer (Chen et al. 2023) with a learning rate η = 5 × 10⁻⁵. To combat overfitting, we use strong label smoothing with α = 0.3. We display the resulting test accuracies in Table 1. We first observe that the standard architecture of depth L = 6 and width m = 1,024 without any data augmentation suffers from severe overfitting, leading to very suboptimal performance. Even with data augmentation turned on, it struggles to learn, and performance gains are very modest. As observed in Lin et al. 2015, switching to the inverted bottleneck architecture improves performance across all datasets. Moreover, data augmentation as a regularizer now unfolds its full power, pushing performance up by roughly 20% across all tasks. Learning, on the other hand, slows down substantially with strong augmentations such as MixUp, enabling training for up to 5,000 epochs without overfitting. However, compared to simple modern baselines such as a ResNet18 (He et al. 2015), a large gap in performance remains, highlighting the importance of inductive bias in the small-sample regime. We remark that ViTs and MLP-Mixers likewise exhibit learning difficulties when the dataset is small (Dosovitskiy et al. 2021; Tolstikhin et al. 2021). We provide more ablation studies in Appendix A.2.
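For reference, label smoothing with strength α redistributes probability mass from the true class uniformly over all K classes; a minimal sketch:

```python
import numpy as np

def smooth_labels(y, num_classes, alpha=0.3):
    """Label smoothing: the true class keeps 1 - alpha + alpha/K
    probability mass; the remainder is spread uniformly."""
    one_hot = np.eye(num_classes)[y]
    return (1.0 - alpha) * one_hot + alpha / num_classes

# With K = 4 and alpha = 0.3 the true class gets 0.7 + 0.3/4 = 0.775.
targets = smooth_labels(np.array([0, 2]), num_classes=4, alpha=0.3)
print(targets[0])  # [0.775 0.075 0.075 0.075]
```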
4.3 Transfer Learning: In this section, we analyze how transferable the features learnt by MLPs are across different vision tasks. Transferability is one of the hallmark characteristics of modern deep learning, enabling practitioners to fine-tune large models on their specific datasets for superior performance. We are, to the best of our knowledge, the first to measure the transferability of MLPs, an assessment that is crucial for building a theoretical understanding of the process.
…Surprisingly, the learnt features are highly transferable [rather than being non-robust features], dramatically improving in Table 2 upon the performance previously reported in Table 1. While the MLP is of course pre-trained on a large quantity of data, we nevertheless want to highlight that it becomes competitive with a ResNet18 trained from scratch on all datasets except ImageNet-1k, where performance falls surprisingly short. We hypothesize that MLPs struggle with the more fine-grained distinctions between classes, in combination with the reduced resolution of the images.
Overall, however, these results underline that the bad inductive bias exhibited by an MLP can indeed be overcome with enough scale. For theory, the results are double-edged: while MLPs prove to be a good proxy for understanding transfer learning, data augmentation and the inverted bottleneck structure appear to be essential components of the success. Both of these characteristics remain rather understudied in theoretical works.
Figure 4: Linear downstream error on CIFAR-100 (in %) when pretrained for varying batch-sizes on ImageNet-21k, on a log-log scale.
Large batch sizes: We further make the counter-intuitive observation that training with larger batch sizes substantially boosts performance, both upstream and downstream. In Figure 4 we plot the pre-training batch size against the resulting linear downstream accuracy on CIFAR-100 for different numbers of pre-training epochs. We observe that across all training times, a larger batch size leads to substantially better performance. Moreover, we want to highlight that such a plot even favors small batch sizes, since those models perform more gradient updates for a fixed number of epochs.
This effect is in stark contrast to convolutional architectures, where entire lines of work have focused on preserving the performance of the small batch-size regime at larger ones (Goyal et al. 2017; You et al. 2017; Hoffer et al. 2017; Keskar et al. 2017). Training with large batch sizes without degradation is of high interest, as it can lead to more efficient training pipelines since computation can be sharded among more devices. This observation about optimal batch sizes is in line with similar recent conclusions for Transformers (Kaplan et al. 2020; Touvron et al. 2023).
Role of augmentations: The role of data augmentation is very pronounced for MLPs, largely because it provides indirect inductive bias to the model. Remarkably, a model pre-trained on 12 million examples without data augmentation shows inferior performance on CIFAR-10 compared to a network trained from scratch with augmentations turned on. This emphasizes that augmentations go beyond merely enlarging the dataset: they provide the model with useful invariances. We investigate the learnt weights in depth in Appendix B, showing that markedly more localized features are learnt when data augmentation is employed. [cf. learning convolutional priors]
Figure 10: Visualization of the first layer weights for models trained with and without data augmentation.
…4.4 Scaling Laws: …The test error can be measured upstream (i.e. on the pre-training task) or downstream when fine-tuning on a different task. We investigate various pre-training schemes with different numbers of examples, parameter counts and training times. We subsample ImageNet-21k proportionally across classes and pre-train variously sized inverted bottleneck MLPs. We summarize the configurations in Table 3. We then measure test error on the downstream task of CIFAR-100 in Figure 1, as well as CIFAR-10 and ImageNet-1k in Figure 5, by linearly transferring the learnt features. The plotting style is inspired by Zhai et al. 2022. Each point in the curve is the downstream performance of an MLP, where the color of the point indicates the model type (blue denotes smaller and red larger models) and the size of the point indicates the number of pre-training examples. Points connected by a line indicate longer training times, where T ∈ {50, 100, 200, 400, 800} is measured in epochs. In all experiments, we employ data augmentation for pre-training. We observe that the compute-optimal performance of MLPs strongly exhibits the characteristics of a power law with coefficients α ∈ {0.12, 0.25, 0.35}. This is very encouraging for future theoretical work, showing that MLPs indeed mirror the scaling behavior of modern models. We further study how performance E evolves when compute is bottlenecked by either the number of parameters P or the dataset size N.
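Fitting such a power law E = a · C^(−α) to compute/error pairs amounts to a linear regression in log-log space; a minimal sketch on synthetic data (illustrative only, not our measurements):

```python
import numpy as np

def fit_power_law(compute, error):
    """Fit E = a * C^(-alpha) by least squares in log-log space:
    log E = log a - alpha * log C."""
    slope, intercept = np.polyfit(np.log(compute), np.log(error), 1)
    return np.exp(intercept), -slope  # (a, alpha)

# Synthetic data generated with alpha = 0.35 for illustration.
C = np.array([1e0, 1e1, 1e2, 1e3, 1e4])
E = 0.9 * C ** (-0.35)
a, alpha = fit_power_law(C, E)
print(round(alpha, 2))  # 0.35
```

On a log-log plot such a fit appears as a straight line with slope −α, which is how the dotted lines in Figure 6 are obtained.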
We visualize the resulting scaling laws in Figure 6. We find a very steep decay rate in terms of parameters P, with roughly αP ≈ 1, whereas for dataset size N we identify a substantially slower rate of αN ≈ 0.35. This shows that the performance of MLPs is much more strongly limited by dataset size, in line with the fact that MLPs exhibit a bad inductive bias. We investigate the roles of dataset size and parameters further in the next paragraph.
Figure 5: Test error (in %) on CIFAR-10 (left) and ImageNet-1k (right) when linearly transferred as a function of PFLOPS, measured according to Equation 4, on a log-log scale.
Figure 6: Power law in linear evaluation error on CIFAR-100 (in %) when either bottlenecked by the number of parameters (left) or the number of examples (right), on a log-log scale.
The dotted line visualizes the fitted functional form.
…Parameters or examples: [MLPs have supra-Chinchilla data scaling] Given a fixed level of compute C, what is the optimal way to allocate it between parameter count P and number of examples N? In order to be more comparable to previous work, we assume a fixed training time T = 50. To answer this question, we follow the approach outlined in Hoffmann et al. 2022 and plot the compute-optimal models identified in Figure 1 against both model size P and number of examples N. We visualize the results in Figure 7. We empirically observe that the optimal parameter count P✱(C) and dataset size N✱(C) as functions of compute C exhibit power-law behavior of the approximate form
P✱(C) ∝ C^0.35,   N✱(C) ∝ C^0.65
While for transformers, the number of examples (or tokens) N and the number of parameters P are scaled equally in a 1:1 ratio (Hoffmann et al. 2022) (i.e. αP ≈ αN ≈ 0.5), we observe that the optimal strategy for MLPs invests substantially more compute into dataset size N. [That is, there is no simple fixed ratio here; rather, the allocation increasingly skews towards data as compute scales, with N✱/P✱ ∝ C^0.30; e.g. doubling compute would increase model size by only 2^0.35 ≈ 1.3×, but data size by 2^0.65 ≈ 1.6×. Implication: highly efficient, small, intelligent MLPs, if you can get the data.] This is further evidence for the weaker inductive bias present in MLPs, which needs more examples in order to be compensated for.
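The allocation rule can be made concrete with a few lines of arithmetic (using the fitted exponents above; a rough illustration, not an exact prescription):

```python
# Under P*(C) ∝ C^0.35 and N*(C) ∝ C^0.65, doubling compute scales
# the optimal model size and dataset size by:
param_factor = 2 ** 0.35
data_factor = 2 ** 0.65
print(round(param_factor, 2), round(data_factor, 2))  # 1.27 1.57

# The data-to-parameter ratio N*/P* therefore grows with compute
# as C^(0.65 - 0.35) = C^0.30 rather than staying fixed.
```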
Figure 7: Optimal model size (left) and number of examples (right) for a given level of compute for linear evaluation on CIFAR-100, on a log-log scale.
4.5 Computational Feasibility: We believe that a further exciting feature of our study is its computational feasibility, while at the same time preserving the main characteristics of large-scale pre-training. All of our experiments were conducted on a single NVIDIA RTX A5000 GPU with 24 GB of memory. In conjunction with the strongly optimized FFCV data-loading framework and the inherent efficiency of MLPs, we are able to train very rapidly. For instance, we complete a single epoch on ImageNet-21k with the B-12/Wi-1024 architecture, equipped with 124 million parameters, in only roughly 450 seconds, while the smaller variant B-6/Wi-1024, at a parameter count of 74 million, requires roughly 250 seconds on the specified hardware. Low memory requirements allow us to train with a batch size of 16,384 without having to shard computation among multiple GPUs. We compare the computational efficiency of MLPs with contemporary networks of similar size, such as ResNet-152, ViT-B/4 and ViT-B/8, in Appendix A.5…We highlight that although MLPs require a lot of training data, inference is extremely efficient from a computational perspective…As quickly becomes evident, MLPs require substantially fewer FLOPs to make predictions on individual images, in essence using their parameters far more economically. As a result, latency and throughput are substantially better compared to the other candidate architectures. We measure throughput using the optimal batch size on an NVIDIA RTX A5000. We highlight that our MLPs, in contrast to the other architectures, are memory-bound, meaning that their throughput is determined by the prefetching bandwidth of our GPU. Hardware advancements and specialized architectures could mitigate this effect. Neglecting memory transfer time by repeatedly propagating the same input through the network yields a further 6× increase in potential throughput.
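As a back-of-the-envelope estimate (ours, not the measured numbers in Appendix A.5), a dense MLP spends roughly one multiply-accumulate per parameter per image, so inference cost scales directly with parameter count:

```python
def inference_gflops(num_params):
    """Rough inference cost for a dense MLP: about 2 FLOPs (one multiply,
    one add) per parameter per image, ignoring normalization and
    activation costs."""
    return 2 * num_params / 1e9

print(round(inference_gflops(124e6), 2))  # B-12/Wi-1024 (124M): 0.25 GFLOPs
print(round(inference_gflops(74e6), 2))   # B-6/Wi-1024 (74M):   0.15 GFLOPs
```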
…Architecture: We make the following observations/recommendations to boost the model's performance, in line with results reported in the literature (Liu et al. 2022): (1) replacing ReLUs with GELUs boosts results substantially; (2) adding skip connections every two layers helps with optimization, especially for deeper networks; (3) using an inverted bottleneck increases performance even more; (4) using a normalization layer in the PRE-LN configuration helps with optimization; and (5) layer normalization leads to substantially better results compared to batch normalization, while also being more stable during training. Optimization: As discussed in the main text, augmentations are crucial, and disabling them can have a detrimental effect. We also found that clipping gradients, using weight decay and dropout have a small positive effect on downstream performance. Finally, replacing LION (Chen et al. 2023) with Adam(W) leads to a decrease in performance.
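The difference between the PRE-LN and POST-LN placements in recommendation (4) can be sketched as follows (a simplified single-weight residual block of our own, not the exact architecture used in the experiments):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def pre_ln_block(h, W):
    # PRE-LN: normalize before the weights; the identity skip path is
    # left untouched, which tends to stabilize deep stacks.
    return h + np.maximum(layer_norm(h) @ W, 0.0)

def post_ln_block(h, W):
    # POST-LN: normalize after the residual addition; gradients must
    # pass through the normalization on every block.
    return layer_norm(h + np.maximum(h @ W, 0.0))

rng = np.random.default_rng(0)
h = rng.standard_normal((2, 8))
W = rng.standard_normal((8, 8)) * 0.1
print(pre_ln_block(h, W).shape, post_ln_block(h, W).shape)  # (2, 8) (2, 8)
```

With zero weights the PRE-LN block is exactly the identity, while the POST-LN block still normalizes its input; this untouched skip path is the usual argument for PRE-LN's easier optimization.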
Figure 8: Ablations on different architecture and optimization choices when training on ImageNet. Numbers indicate linear-probing Top-1 accuracies on CIFAR-100.