Is MLP-Mixer a CNN in Disguise?

In this blog post, we take a detailed look at the MLP-Mixer architecture and explain why it is not considered "Conv-free."

Introduction

Recently, a new kind of architecture - MLP-Mixer: An all-MLP Architecture for Vision (Tolstikhin et al., 2021) - was proposed, claiming competitive performance with SOTA models on ImageNet without using convolutions or attention. But is this really true? Are the token-mixing and channel-mixing layers in the MLP-Mixer architecture actually "Conv-free"? (Figure-1)
The deep learning community is split on this idea.
On one side, Yann LeCun tweeted that the architecture is not exactly Conv-free.
On the other side, Lucas Beyer defended the idea wonderfully, addressing the "Mixer is a CNN" problem head-on.
"It's a touchy subject that boils down to semantics", says Ross Wightman.
In this report, Dr. Habib Bukhari and I got together to look into the MLP-Mixer architecture in detail and to explain why the community thinks that the architecture is not "Conv-free."
But first, let's understand everything that goes on inside the MLP-Mixer architecture.

MLP-Mixer

The overall MLP-Mixer architecture is simple to understand and can be implemented in a few lines of code using popular frameworks such as PyTorch or JAX.
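To make that concrete, below is a minimal PyTorch sketch of a single Mixer block (a token-mixing MLP followed by a channel-mixing MLP), written from the paper's description. The class names, hidden sizes, and the shape check at the end are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal, illustrative PyTorch sketch of one Mixer block.
# Names and hyperparameters are assumptions for demonstration only.
import torch
import torch.nn as nn


class MlpBlock(nn.Module):
    """Two-layer MLP with a GELU non-linearity, applied along the last dimension."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)


class MixerBlock(nn.Module):
    """Token-mixing MLP (across patches) followed by a channel-mixing MLP (across channels)."""
    def __init__(self, num_patches, channels, tokens_hidden, channels_hidden):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.token_mlp = MlpBlock(num_patches, tokens_hidden)
        self.norm2 = nn.LayerNorm(channels)
        self.channel_mlp = MlpBlock(channels, channels_hidden)

    def forward(self, x):                              # x: (batch, patches, channels)
        y = self.norm1(x).transpose(1, 2)              # (batch, channels, patches)
        x = x + self.token_mlp(y).transpose(1, 2)      # token mixing + skip connection
        x = x + self.channel_mlp(self.norm2(x))        # channel mixing + skip connection
        return x


# Quick shape check: 196 patches (14x14 grid of 16px patches on a 224px image), 512 channels.
block = MixerBlock(num_patches=196, channels=512, tokens_hidden=256, channels_hidden=2048)
out = block(torch.randn(2, 196, 512))
print(out.shape)  # torch.Size([2, 196, 512])
```

The full model simply stacks several such blocks on top of a patch-embedding stem and a classification head.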
Figure-1: MLP-Mixer Architecture
As shown in Figure-1 above, this is what the overall MLP-Mixer architecture looks like. Thanks to DrHB, the same architecture can also be represented as in Figure-2 below.