Bibliography (92):

  1. A Neural Probabilistic Language Model (https://jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)

  2. Revisiting Simple Neural Probabilistic Language Models

  3. PairConnect: A Compute-Efficient MLP Alternative to Attention

  4. Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis

  5. Extraction de séquences numériques dans des documents manuscrits quelconques [Extraction of numerical sequences from arbitrary handwritten documents]

  6. Deep Big Multilayer Perceptrons for Digit Recognition

  7. Do Deep Nets Really Need to be Deep?

  8. Network In Network

  9. How far can we go without convolution: Improving fully-connected networks

  10. Deep Neural Networks for Large Vocabulary Handwritten Text Recognition

  11. Tensorizing Neural Networks

  12. Do Deep Convolutional Nets Really Need to be Deep and Convolutional?

  13. Do Deep Convolutional Nets Really Need to be Deep and Convolutional? (https://arxiv.org/pdf/1603.05691.pdf#page=7)

  14. The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers

  15. Sharp Models on Dull Hardware: Fast and Accurate Neural Machine Translation Decoding on the CPU

  16. face#sussman-attains-enlightenment

  17. The Shattered Gradients Problem: If resnets are the answer, then what is the question?

  18. NFNet: High-Performance Large-Scale Image Recognition Without Normalization

  19. Fixup Initialization: Residual Learning Without Normalization

  20. Improving Transformer Optimization Through Better Initialization

  21. Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping

  22. ZerO Initialization: Initializing Residual Networks with only Zeros and Ones

  23. Understanding the Covariance Structure of Convolutional Filters

  24. Mimetic Initialization of Self-Attention Layers

  25. Tweet by @hi_tysam (https://x.com/hi_tysam/status/1721764010159477161)

  26. The Goldilocks zone: Towards better understanding of neural network loss landscapes

  27. Skip Connections Eliminate Singularities

  28. Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

  29. NAIS-Net: Stable Deep Networks from Non-Autonomous Differential Equations

  30. SwitchNet: a neural network model for forward and inverse scattering problems

  31. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science

  32. Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias

  33. ReZero is All You Need: Fast Convergence at Large Depth

  34. Towards Learning Convolutions from Scratch

  35. Sifting out the features by pruning: Are convolutional networks the winning lottery ticket of fully connected ones?

  36. Towards Biologically Plausible Convolutional Networks

  37. Adapting the Function Approximation Architecture in Online Reinforcement Learning

  38. Data-driven emergence of convolutional structure in neural networks

  39. Noise Transforms Feed-Forward Networks into Sparse Coding Networks

  40. A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

  41. Scaling MLPs: A Tale of Inductive Bias

  42. Gesticulator: A framework for semantically-aware speech-driven gesture generation

  43. RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition

  44. Less is More: Pay Less Attention in Vision Transformers

  45. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

  46. Well-tuned Simple Nets Excel on Tabular Datasets

  47. How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers

  48. MLPs Learn In-Context

  49. MLP-Mixer: An all-MLP Architecture for Vision

  50. When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations

  51. MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis

  52. S2-MLP: Spatial-Shift MLP Architecture for Vision

  53. S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision

  54. When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism (ShiftViT)

  55. Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

  56. ResMLP: Feedforward networks for image classification with data-efficient training

  57. Pay Attention to MLPs

  58. MLP-ASR: Sequence-length agnostic all-MLP architectures for speech recognition

  59. Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition

  60. Container: Context Aggregation Network

  61. CycleMLP: A MLP-like Architecture for Dense Prediction

  62. PointMixer: MLP-Mixer for Point Cloud Understanding

  63. RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?

  64. AS-MLP: An Axial Shifted MLP Architecture for Vision

  65. Hire-MLP: Vision MLP via Hierarchical Rearrangement

  66. Sparse-MLP: A Fully-MLP Architecture with Conditional Computation

  67. ConvMLP: Hierarchical Convolutional MLPs for Vision

  68. Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?

  69. ConvMixer: Patches Are All You Need?

  70. Exploring the Limits of Large Scale Pre-training

  71. MLP Architectures for Vision-and-Language Modeling: An Empirical Study

  72. pNLP-Mixer: an Efficient all-MLP Architecture for Language

  73. Masked Mixers for Language Generation and Retrieval

  74. MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video

  75. Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs

  76. ‘self-attention’ directory

  77. AFT: An Attention Free Transformer

  78. Synthesizer: Rethinking Self-Attention in Transformer Models

  79. Linformer: Self-Attention with Linear Complexity

  80. Luna: Linear Unified Nested Attention

  81. Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks (EAMLP)

  82. MetaFormer is Actually What You Need for Vision

  83. MoGlow: Probabilistic and controllable motion synthesis using normalizing flows

  84. A Style-Based Generator Architecture for Generative Adversarial Networks

  85. StyleGAN architecture diagram (Figure 1, Karras et al 2018): 2018-karras-stylegan-figure1-styleganarchitecture.png

  86. Image Generators with Conditionally-Independent Pixel Synthesis

  87. Fourier Neural Operator for Parametric Partial Differential Equations

  88. SIREN: Implicit Neural Representations with Periodic Activation Functions

  89. Neural Geometric Level of Detail: Real-time Rendering with Implicit 3D Shapes

  90. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

  91. KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs

  92. MixerGAN: An MLP-Based Architecture for Unpaired Image-to-Image Translation