Fully-Connected Neural Nets

Bibliography of ML papers on multi-layer perceptrons (fully-connected neural nets), which often show surprising efficacy despite their reputation for being too general to be usable (a possible future Bitter Lesson).


  1. Why now, if MLPs were always roughly data- & compute-competitive with Transformers (and thus with CNNs)?

    My current theory is that the critical ingredient is normalization and/or gating (to enable signal propagation, like residual layers for CNNs or self-attention over history for Transformers): MLPs, while always acknowledged as extremely powerful, underperform in practice or are highly unstable. Normalization & gating are relatively recent, typically post-2015, and they stabilize MLPs to the point where they Just Work.

    If you look at the current crop of MLP papers, what they all seem to have in common is normalization/gating (sometimes hidden or dismissed as an ‘Affine’ layer), and if you remove those ingredients, performance collapses, eg. perplexity jumping from ~4 to >100; conversely, the MLPs which don’t use these tricks, like many NeRF models, are also extremely shallow. (See the code sketch after this note for what such a block looks like.)

    Combined with the great success of ResNet CNNs & then Transformers, it’s unsurprising that MLPs were not trial-and-errored enough post-2015 to discover that they worked, until the cost of self-attention in Transformers drove interest in removing as much self-attention as possible, eventually leading to the discovery that you can remove all of it with surprisingly little damage.
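
To make the normalization/gating point above concrete, here is a minimal sketch (PyTorch-style) of the kind of MLP block the note describes: pre-normalization, a multiplicative gate, and a residual connection. The class and parameter names (`GatedMLPBlock`, `hidden_mult`, etc.) are illustrative assumptions, not taken from any particular paper.

```python
import torch
import torch.nn as nn

class GatedMLPBlock(nn.Module):
    """Illustrative MLP block: pre-norm + GLU-style gating + residual."""
    def __init__(self, dim: int, hidden_mult: int = 4):
        super().__init__()
        hidden = dim * hidden_mult
        self.norm = nn.LayerNorm(dim)               # the normalization the note argues is critical
        self.proj_in = nn.Linear(dim, hidden * 2)   # produces both the value and the gate
        self.proj_out = nn.Linear(hidden, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.norm(x)                            # pre-norm keeps activations well-scaled with depth
        value, gate = self.proj_in(y).chunk(2, dim=-1)
        y = self.act(value) * gate                  # multiplicative gating
        return x + self.proj_out(y)                 # residual path preserves signal propagation

# Stacking many such blocks stays trainable; dropping the LayerNorm, gate, and
# residual leaves a deep stack of plain Linear+activation layers, which is the
# unstable regime the note describes.
x = torch.randn(8, 16, 256)                         # (batch, tokens, dim)
blocks = nn.Sequential(*[GatedMLPBlock(256) for _ in range(12)])
print(blocks(x).shape)                              # torch.Size([8, 16, 256])
```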