Links
- “Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks As an Alternative to Attention Layers in Transformers”, Bozic et al 2023
- “HyperFields: Towards Zero-Shot Generation of NeRFs from Text”, Babu et al 2023
- “Polynomial Time Cryptanalytic Extraction of Neural Network Models”, Shamir et al 2023
- “Absolute Unit NNs: Regression-Based MLPs for Everything”, Gwern 2023
- “Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla”, Lieberum et al 2023
- “Self Expanding Neural Networks”, Mitchell et al 2023
- “Scaling MLPs: A Tale of Inductive Bias”, Bachmann et al 2023
- “Any Deep ReLU Network Is Shallow”, Villani & Schoots 2023
- “Does the First Letter of One’s Name Affect Life Decisions? A Natural Language Processing Examination of Nominative Determinism”, Chatterjee et al 2023
- “Modular Brain AUNNs for Uploads”, Gwern 2023
- “How Does GPT-2 Compute Greater-than?: Interpreting Mathematical Abilities in a Pre-trained Language Model”, Hanna et al 2023
- “Two-Step Training: Adjustable Sketch Colorization via Reference Image and Text Tag”, Yan et al 2023
- “TSMixer: An All-MLP Architecture for Time Series Forecasting”, Chen et al 2023
- “Loss Landscapes Are All You Need: Neural Network Generalization Can Be Explained Without the Implicit Bias of Gradient Descent”, Chiang et al 2023
- “Organic Reaction Mechanism Classification Using Machine Learning”, Burés & Larrosa 2023
- “DataMUX: Data Multiplexing for Neural Networks”, Murahari et al 2023
- “Merging Enzymatic and Synthetic Chemistry With Computational Synthesis Planning”, Levin et al 2022
- “Magic3D: High-Resolution Text-to-3D Content Creation”, Lin et al 2022
- “How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers”, Hassid et al 2022
- “Language-Conditioned Absolute Unit NNs”, Gwern 2022
- “The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes”, Kocsis et al 2022
- “Scaling Forward Gradient With Local Losses”, Ren et al 2022
- “Omnigrok: Grokking Beyond Algorithmic Data”, Liu et al 2022
- “DreamFusion: Text-to-3D Using 2D Diffusion”, Poole et al 2022
- “g.pt: Learning to Learn With Generative Models of Neural Network Checkpoints”, Peebles et al 2022
- “Random Initializations Performing above Chance and How to Find Them”, Benzing et al 2022
- “Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?”, Tay et al 2022
- “Why Do Tree-based Models Still Outperform Deep Learning on Tabular Data?”, Grinsztajn et al 2022
- “Revisiting Pretraining Objectives for Tabular Deep Learning”, Rubachev et al 2022
- “RHO-LOSS: Prioritized Training on Points That Are Learnable, Worth Learning, and Not Yet Learnt”, Mindermann et al 2022
- “MLP-3D: A MLP-like 3D Architecture With Grouped Time Mixing”, Qiu et al 2022
- “ChordMixer: A Scalable Neural Attention Model for Sequences With Different Lengths”, Khalitov et al 2022
- “Sparse Mixers: Combining MoE and Mixing to Build a More Efficient BERT”, Lee-Thorp & Ainslie 2022
- “Towards Understanding Grokking: An Effective Theory of Representation Learning”, Liu et al 2022
- “Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention”, Yu et al 2022
- “Deep Learning Meets Nonparametric Regression: Are Weight-Decayed DNNs Locally Adaptive?”, Zhang & Wang 2022
- “MLP-ASR: Sequence-length Agnostic All-MLP Architectures for Speech Recognition”, Sakuma et al 2022
- “Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs”, Zheng et al 2022
- “pNLP-Mixer: an Efficient All-MLP Architecture for Language”, Fusco et al 2022
- “Data-driven Emergence of Convolutional Structure in Neural Networks”, Ingrosso & Goldt 2022
- “When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism (ShiftViT)”, Wang et al 2022
- “ConvMixer: Patches Are All You Need?”, Trockman & Kolter 2022
- “MAXIM: Multi-Axis MLP for Image Processing”, Tu et al 2022
- “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets [paper]”, Power et al 2022
- “The GatedTabTransformer: An Enhanced Deep Learning Architecture for Tabular Modeling”, Cholakov & Kolev 2022
- “MLP Architectures for Vision-and-Language Modeling: An Empirical Study”, Nie et al 2021
- “Noether Networks: Meta-Learning Useful Conserved Quantities”, Alet et al 2021
- “Zero-Shot Text-Guided Object Generation With Dream Fields”, Jain et al 2021
- “MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video”, Zhang et al 2021
- “MetaFormer Is Actually What You Need for Vision”, Yu et al 2021
- “PointMixer: MLP-Mixer for Point Cloud Understanding”, Choe et al 2021
- “Deep Learning without Shortcuts: Shaping the Kernel With Tailored Rectifiers”, Zhang et al 2021
- “ZerO Initialization: Initializing Residual Networks With Only Zeros and Ones”, Zhao et al 2021
- “ADOP: Approximate Differentiable One-Pixel Point Rendering”, Rückert et al 2021
- “Exploring the Limits of Large Scale Pre-training”, Abnar et al 2021
- “Rapid Training of Deep Neural Networks without Skip Connections or Normalization Layers Using Deep Kernel Shaping”, Martens et al 2021
- “Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?”, Tang et al 2021
- “ConvMLP: Hierarchical Convolutional MLPs for Vision”, Li et al 2021
- “Sparse-MLP: A Fully-MLP Architecture With Conditional Computation”, Lou et al 2021
- “Hire-MLP: Vision MLP via Hierarchical Rearrangement”, Guo et al 2021
- “RaftMLP: How Much Can Be Done Without Attention and With Less Spatial Locality?”, Tatsunami & Taki 2021
- “S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision”, Yu et al 2021
- “CycleMLP: A MLP-like Architecture for Dense Prediction”, Chen et al 2021
- “AS-MLP: An Axial Shifted MLP Architecture for Vision”, Lian et al 2021
- “Real-time Neural Radiance Caching for Path Tracing”, Müller et al 2021
- “Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition”, Hou et al 2021
- “Towards Biologically Plausible Convolutional Networks”, Pogodin et al 2021
- “Well-tuned Simple Nets Excel on Tabular Datasets”, Kadra et al 2021
- “PairConnect: A Compute-Efficient MLP Alternative to Attention”, Xu et al 2021
- “MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis”, Tae et al 2021
- “S2-MLP: Spatial-Shift MLP Architecture for Vision”, Yu et al 2021
- “When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations”, Chen et al 2021
- “Container: Context Aggregation Network”, Gao et al 2021
- “MixerGAN: An MLP-Based Architecture for Unpaired Image-to-Image Translation”, Cazenavette & Guevara 2021
- “One4all User Representation for Recommender Systems in E-commerce”, Shin et al 2021
- “Pay Attention to MLPs”, Liu et al 2021
- “FNet: Mixing Tokens With Fourier Transforms”, Lee-Thorp et al 2021
- “ResMLP: Feedforward Networks for Image Classification With Data-efficient Training”, Touvron et al 2021
- “Multi-scale Inference of Genetic Trait Architecture Using Biologically Annotated Neural Networks”, Demetci et al 2021
- “Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet”, Melas-Kyriazi 2021
- “RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition”, Ding et al 2021
- “MLP-Mixer: An All-MLP Architecture for Vision”, Tolstikhin et al 2021
- “Grokking: Generalization Beyond Overfitting On Small Algorithmic Datasets”, Power et al 2021
- “Sifting out the Features by Pruning: Are Convolutional Networks the Winning Lottery Ticket of Fully Connected Ones?”, Pellegrini & Biroli 2021
- “Fully-Connected Neural Nets”, Gwern 2021
- “Revisiting Simple Neural Probabilistic Language Models”, Sun & Iyyer 2021
- “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows”, Liu et al 2021
- “KiloNeRF: Speeding up Neural Radiance Fields With Thousands of Tiny MLPs”, Reiser et al 2021
- “Clusterability in Neural Networks”, Filan et al 2021
- “Training Larger Networks for Deep Reinforcement Learning”, Ota et al 2021
- “Explaining Neural Scaling Laws”, Bahri et al 2021
- “Neural Geometric Level of Detail: Real-time Rendering With Implicit 3D Shapes”, Takikawa et al 2021
- “Is MLP-Mixer a CNN in Disguise? As Part of This Blog Post, We Look at the MLP Mixer Architecture in Detail and Also Understand Why It Is Not Considered Convolution Free.”
- “AdnFM: An Attentive DenseNet Based Factorization Machine for CTR Prediction”, Wang et al 2020
- “TabTransformer: Tabular Data Modeling Using Contextual Embeddings”, Huang et al 2020
- “Scaling down Deep Learning”, Greydanus 2020
- “Image Generators With Conditionally-Independent Pixel Synthesis”, Anokhin et al 2020
- “D2RL: Deep Dense Architectures in Reinforcement Learning”, Sinha et al 2020
- “Fourier Neural Operator for Parametric Partial Differential Equations”, Li et al 2020
- “AFT: An Attention Free Transformer”, Anonymous 2020
- “Towards Learning Convolutions from Scratch”, Neyshabur 2020
- “Efficient Attention: Breaking The Quadratic Transformer Bottleneck”, Gwern 2020
- “SIREN: Implicit Neural Representations With Periodic Activation Functions”, Sitzmann et al 2020
- “Linformer: Self-Attention With Linear Complexity”, Wang et al 2020
- “A Map of Object Space in Primate Inferotemporal Cortex”, Bao et al 2020
- “Synthesizer: Rethinking Self-Attention in Transformer Models”, Tay et al 2020
- “Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems”, Naumov et al 2020
- “NeRF: Representing Scenes As Neural Radiance Fields for View Synthesis”, Mildenhall et al 2020
- “ReZero Is All You Need: Fast Convergence at Large Depth”, Bachlechner et al 2020
- “Cryptanalytic Extraction of Neural Network Models”, Carlini et al 2020
- “Can Increasing Input Dimensionality Improve Deep Reinforcement Learning?”, Ota et al 2020
- “Gesticulator: A Framework for Semantically-aware Speech-driven Gesture Generation”, Kucherenko et al 2020
- “Understanding the Generalization of ‘Lottery Tickets’ in Neural Networks”, Morcos & Tian 2019
- “The Bouncer Problem: Challenges to Remote Explainability”, Merrer & Tredan 2019
- “3D Human Pose Estimation via Human Structure-aware Fully Connected Network”, Zhang et al 2019d
- “Finding the Needle in the Haystack With Convolutions: on the Benefits of Architectural Bias”, d’Ascoli et al 2019
- “MoGlow: Probabilistic and Controllable Motion Synthesis Using Normalizing Flows”, Henter et al 2019
- “Fixup Initialization: Residual Learning Without Normalization”, Zhang et al 2019
- “SwitchNet: a Neural Network Model for Forward and Inverse Scattering Problems”, Khoo & Ying 2018
- “Scalable Training of Artificial Neural Networks With Adaptive Sparse Connectivity Inspired by Network Science”, Mocanu et al 2018
- “Deep Learning Generalizes Because the Parameter-function Map Is Biased towards Simple Functions”, Valle-Pérez et al 2018
- “Bidirectional Learning for Robust Neural Networks”, Pontes-Filho & Liwicki 2018
- “NAIS-Net: Stable Deep Networks from Non-Autonomous Differential Equations”, Ciccone et al 2018
- “Improving Palliative Care With Deep Learning”, An et al 2018
- “Repurposing High-Throughput Image Assays Enables Biological Activity Prediction for Drug Discovery”, Simm et al 2018
- “Learning to Play Chess With Minimal Lookahead and Deep Value Neural Networks”, Sabatelli 2017 (page 3)
- “Neural Collaborative Filtering”, He et al 2017
- “Sharp Models on Dull Hardware: Fast and Accurate Neural Machine Translation Decoding on the CPU”, Devlin 2017
- “Research Ideas”, Gwern 2017
- “The Shattered Gradients Problem: If Resnets Are the Answer, Then What Is the Question?”, Balduzzi et al 2017
- “Gender-From-Iris or Gender-From-Mascara?”, Kuehlkamp et al 2017
- “Skip Connections Eliminate Singularities”, Orhan & Pitkow 2017
- “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”, Keskar et al 2016
- “Learning to Optimize”, Li & Malik 2016
- “Do Deep Convolutional Nets Really Need to Be Deep and Convolutional?”, Urban et al 2016
- “Network Morphism”, Wei et al 2016
- “Adding Gradient Noise Improves Learning for Very Deep Networks”, Neelakantan et al 2015
- “How Far Can We Go without Convolution: Improving Fully-connected Networks”, Lin et al 2015
- “BinaryConnect: Training Deep Neural Networks With Binary Weights during Propagations”, Courbariaux et al 2015
- “Tensorizing Neural Networks”, Novikov et al 2015
- “A Neural Attention Model for Abstractive Sentence Summarization”, Rush et al 2015
- “Deep Neural Networks for Large Vocabulary Handwritten Text Recognition”, Bluche 2015
- “The Loss Surfaces of Multilayer Networks”, Choromanska et al 2014
- “One Weird Trick for Parallelizing Convolutional Neural Networks”, Krizhevsky 2014
- “Do Deep Nets Really Need to Be Deep?”, Ba & Caruana 2013
- “Network In Network”, Lin et al 2013
- “Deep Big Multilayer Perceptrons for Digit Recognition”, Cireşan et al 2012
- “Compositional Pattern Producing Networks: A Novel Abstraction of Development”, Stanley 2007
- “Extraction De Séquences Numériques Dans Des Documents Manuscrits Quelconques”, Chatelain 2006
- “Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis”, Simard et al 2003
- “NEAT: Evolving Neural Networks through Augmenting Topologies”, Stanley & Miikkulainen 2002
- “Quantitative Analysis of Multivariate Data Using Artificial Neural Networks: A Tutorial Review and Applications to the Deconvolution of Pyrolysis Mass Spectra”, Goodacre et al 1996
- “On the Ability of the Optimal Perceptron to Generalize”, Opper et al 1990
- “Learning To Tell Two Spirals Apart”, Lang & Witbrock 1988
- “Learning Internal Representations by Error Propagation”, Rumelhart et al 1986
See Also
Links
“Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks As an Alternative to Attention Layers in Transformers”, Bozic et al 2023
“HyperFields: Towards Zero-Shot Generation of NeRFs from Text”, Babu et al 2023
“HyperFields: Towards Zero-Shot Generation of NeRFs from Text”
“Polynomial Time Cryptanalytic Extraction of Neural Network Models”, Shamir et al 2023
“Polynomial Time Cryptanalytic Extraction of Neural Network Models”
“Absolute Unit NNs: Regression-Based MLPs for Everything”, Gwern 2023
“Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla”, Lieberum et al 2023
“Self Expanding Neural Networks”, Mitchell et al 2023
“Scaling MLPs: A Tale of Inductive Bias”, Bachmann et al 2023
“Any Deep ReLU Network Is Shallow”, Villani & Schoots 2023
“Does the First Letter of One’s Name Affect Life Decisions? A Natural Language Processing Examination of Nominative Determinism”, Chatterjee et al 2023
“Modular Brain AUNNs for Uploads”, Gwern 2023
“How Does GPT-2 Compute Greater-than?: Interpreting Mathematical Abilities in a Pre-trained Language Model”, Hanna et al 2023
“Two-Step Training: Adjustable Sketch Colorization via Reference Image and Text Tag”, Yan et al 2023
“Two-Step Training: Adjustable Sketch Colorization via Reference Image and Text Tag”
“TSMixer: An All-MLP Architecture for Time Series Forecasting”, Chen et al 2023
“TSMixer: An All-MLP Architecture for Time Series Forecasting”
“TSMixer: An All-MLP Architecture for Time Series Forecasting”, Chen et al 2023
“TSMixer: An All-MLP Architecture for Time Series Forecasting”
“Loss Landscapes Are All You Need: Neural Network Generalization Can Be Explained Without the Implicit Bias of Gradient Descent”, Chiang et al 2023
“Organic Reaction Mechanism Classification Using Machine Learning”, Burés & Larrosa 2023
“Organic reaction mechanism classification using machine learning”
“DataMUX: Data Multiplexing for Neural Networks”, Murahari et al 2023
“Merging Enzymatic and Synthetic Chemistry With Computational Synthesis Planning”, Levin et al 2022
“Merging enzymatic and synthetic chemistry with computational synthesis planning”
“Magic3D: High-Resolution Text-to-3D Content Creation”, Lin et al 2022
“How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers”, Hassid et al 2022
“Language-Conditioned Absolute Unit NNs”, Gwern 2022
“The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes”, Kocsis et al 2022
“The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes”
“Scaling Forward Gradient With Local Losses”, Ren et al 2022
“Omnigrok: Grokking Beyond Algorithmic Data”, Liu et al 2022
“DreamFusion: Text-to-3D Using 2D Diffusion”, Poole et al 2022
“g.pt
: Learning to Learn With Generative Models of Neural Network Checkpoints”, Peebles et al 2022
“g.pt
: Learning to Learn with Generative Models of Neural Network Checkpoints”
“Random Initializations Performing above Chance and How to Find Them”, Benzing et al 2022
“Random initializations performing above chance and how to find them”
“Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?”, Tay et al 2022
“Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?”
“Why Do Tree-based Models Still Outperform Deep Learning on Tabular Data?”, Grinsztajn et al 2022
“Why do tree-based models still outperform deep learning on tabular data?”
“Revisiting Pretraining Objectives for Tabular Deep Learning”, Rubachev et al 2022
“Revisiting Pretraining Objectives for Tabular Deep Learning”
“RHO-LOSS: Prioritized Training on Points That Are Learnable, Worth Learning, and Not Yet Learnt”, Mindermann et al 2022
“RHO-LOSS: Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt”
“MLP-3D: A MLP-like 3D Architecture With Grouped Time Mixing”, Qiu et al 2022
“MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing”
“ChordMixer: A Scalable Neural Attention Model for Sequences With Different Lengths”, Khalitov et al 2022
“ChordMixer: A Scalable Neural Attention Model for Sequences with Different Lengths”
“Sparse Mixers: Combining MoE and Mixing to Build a More Efficient BERT”, Lee-Thorp & Ainslie 2022
“Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT”
“Towards Understanding Grokking: An Effective Theory of Representation Learning”, Liu et al 2022
“Towards Understanding Grokking: An Effective Theory of Representation Learning”
“Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention”, Yu et al 2022
“Deep Learning Meets Nonparametric Regression: Are Weight-Decayed DNNs Locally Adaptive?”, Zhang & Wang 2022
“Deep Learning meets Nonparametric Regression: Are Weight-Decayed DNNs Locally Adaptive?”
“MLP-ASR: Sequence-length Agnostic All-MLP Architectures for Speech Recognition”, Sakuma et al 2022
“MLP-ASR: Sequence-length agnostic all-MLP architectures for speech recognition”
“Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs”, Zheng et al 2022
“Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs”
“PNLP-Mixer: an Efficient All-MLP Architecture for Language”, Fusco et al 2022
“pNLP-Mixer: an Efficient all-MLP Architecture for Language”
“Data-driven Emergence of Convolutional Structure in Neural Networks”, Ingrosso & Goldt 2022
“Data-driven emergence of convolutional structure in neural networks”
“When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism (ShiftViT)”, Wang et al 2022
“ConvMixer: Patches Are All You Need?”, Trockman & Kolter 2022
“MAXIM: Multi-Axis MLP for Image Processing”, Tu et al 2022
“Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets [paper]”, Power et al 2022
“Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets [paper]”
“The GatedTabTransformer: An Enhanced Deep Learning Architecture for Tabular Modeling”, Cholakov & Kolev 2022
“The GatedTabTransformer: An enhanced deep learning architecture for tabular modeling”
“MLP Architectures for Vision-and-Language Modeling: An Empirical Study”, Nie et al 2021
“MLP Architectures for Vision-and-Language Modeling: An Empirical Study”
“Noether Networks: Meta-Learning Useful Conserved Quantities”, Alet et al 2021
“Noether Networks: Meta-Learning Useful Conserved Quantities”
“Zero-Shot Text-Guided Object Generation With Dream Fields”, Jain et al 2021
“MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video”, Zhang et al 2021
“MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video”
“MetaFormer Is Actually What You Need for Vision”, Yu et al 2021
“PointMixer: MLP-Mixer for Point Cloud Understanding”, Choe et al 2021
“Deep Learning without Shortcuts: Shaping the Kernel With Tailored Rectifiers”, Zhang et al 2021
“Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers”
“ZerO Initialization: Initializing Residual Networks With Only Zeros and Ones”, Zhao et al 2021
“ZerO Initialization: Initializing Residual Networks with only Zeros and Ones”
“ADOP: Approximate Differentiable One-Pixel Point Rendering”, Rückert et al 2021
“ADOP: Approximate Differentiable One-Pixel Point Rendering”
“Exploring the Limits of Large Scale Pre-training”, Abnar et al 2021
“Rapid Training of Deep Neural Networks without Skip Connections or Normalization Layers Using Deep Kernel Shaping”, Martens et al 2021
“Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?”, Tang et al 2021
“Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?”
“ConvMLP: Hierarchical Convolutional MLPs for Vision”, Li et al 2021
“Sparse-MLP: A Fully-MLP Architecture With Conditional Computation”, Lou et al 2021
“Sparse-MLP: A Fully-MLP Architecture with Conditional Computation”
“Hire-MLP: Vision MLP via Hierarchical Rearrangement”, Guo et al 2021
“RaftMLP: How Much Can Be Done Without Attention and With Less Spatial Locality?”, Tatsunami & Taki 2021
“RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?”
“S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision”, Yu et al 2021
“S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision”
“CycleMLP: A MLP-like Architecture for Dense Prediction”, Chen et al 2021
“AS-MLP: An Axial Shifted MLP Architecture for Vision”, Lian et al 2021
“Real-time Neural Radiance Caching for Path Tracing”, Müller et al 2021
“Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition”, Hou et al 2021
“Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition”
“Towards Biologically Plausible Convolutional Networks”, Pogodin et al 2021
“Well-tuned Simple Nets Excel on Tabular Datasets”, Kadra et al 2021
“PairConnect: A Compute-Efficient MLP Alternative to Attention”, Xu et al 2021
“PairConnect: A Compute-Efficient MLP Alternative to Attention”
“MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis”, Tae et al 2021
“MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis”
“S2-MLP: Spatial-Shift MLP Architecture for Vision”, Yu et al 2021
“When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations”, Chen et al 2021
“When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations”
“Container: Context Aggregation Network”, Gao et al 2021
“One4all User Representation for Recommender Systems in E-commerce”, Shin et al 2021
“One4all User Representation for Recommender Systems in E-commerce”
“Pay Attention to MLPs”, Liu et al 2021
“FNet: Mixing Tokens With Fourier Transforms”, Lee-Thorp et al 2021
“ResMLP: Feedforward Networks for Image Classification With Data-efficient Training”, Touvron et al 2021
“ResMLP: Feedforward networks for image classification with data-efficient training”
“Multi-scale Inference of Genetic Trait Architecture Using Biologically Annotated Neural Networks”, Demetci et al 2021
“Multi-scale Inference of Genetic Trait Architecture using Biologically Annotated Neural Networks”
“Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet”, Melas-Kyriazi 2021
“Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet”
“RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition”, Ding et al 2021
“RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition”
“MLP-Mixer: An All-MLP Architecture for Vision”, Tolstikhin et al 2021
“Grokking: Generalization Beyond Overfitting On Small Algorithmic Datasets”, Power et al 2021
“Grokking: Generalization Beyond Overfitting On Small Algorithmic Datasets”
“Sifting out the Features by Pruning: Are Convolutional Networks the Winning Lottery Ticket of Fully Connected Ones?”, Pellegrini & Biroli 2021
“Fully-Connected Neural Nets”, Gwern 2021
“Revisiting Simple Neural Probabilistic Language Models”, Sun & Iyyer 2021
“Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows”, Liu et al 2021
“Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”
“KiloNeRF: Speeding up Neural Radiance Fields With Thousands of Tiny MLPs”, Reiser et al 2021
“KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs”
“Clusterability in Neural Networks”, Filan et al 2021
“Training Larger Networks for Deep Reinforcement Learning”, Ota et al 2021
“Explaining Neural Scaling Laws”, Bahri et al 2021
“Neural Geometric Level of Detail: Real-time Rendering With Implicit 3D Shapes”, Takikawa et al 2021
“Neural Geometric Level of Detail: Real-time Rendering with Implicit 3D Shapes”
“Is MLP-Mixer a CNN in Disguise? As Part of This Blog Post, We Look at the MLP Mixer Architecture in Detail and Also Understand Why It Is Not Considered Convolution Free.”
“AdnFM: An Attentive DenseNet Based Factorization Machine for CTR Prediction”, Wang et al 2020
“AdnFM: An Attentive DenseNet based Factorization Machine for CTR Prediction”
“TabTransformer: Tabular Data Modeling Using Contextual Embeddings”, Huang et al 2020
“TabTransformer: Tabular Data Modeling Using Contextual Embeddings”
“Scaling down Deep Learning”, Greydanus 2020
“Image Generators With Conditionally-Independent Pixel Synthesis”, Anokhin et al 2020
“Image Generators with Conditionally-Independent Pixel Synthesis”
“D2RL: Deep Dense Architectures in Reinforcement Learning”, Sinha et al 2020
“Fourier Neural Operator for Parametric Partial Differential Equations”, Li et al 2020
“Fourier Neural Operator for Parametric Partial Differential Equations”
“AFT: An Attention Free Transformer”, Anonymous 2020
“Towards Learning Convolutions from Scratch”, Neyshabur 2020
“Efficient Attention: Breaking The Quadratic Transformer Bottleneck”, Gwern 2020
“Efficient Attention: Breaking The Quadratic Transformer Bottleneck”
“SIREN: Implicit Neural Representations With Periodic Activation Functions”, Sitzmann et al 2020
“SIREN: Implicit Neural Representations with Periodic Activation Functions”
“Linformer: Self-Attention With Linear Complexity”, Wang et al 2020
“A Map of Object Space in Primate Inferotemporal Cortex”, Bao et al 2020
“Synthesizer: Rethinking Self-Attention in Transformer Models”, Tay et al 2020
“Synthesizer: Rethinking Self-Attention in Transformer Models”
“Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems”, Naumov et al 2020
“Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems”
“NeRF: Representing Scenes As Neural Radiance Fields for View Synthesis”, Mildenhall et al 2020
“NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis”
“ReZero Is All You Need: Fast Convergence at Large Depth”, Bachlechner et al 2020
“Cryptanalytic Extraction of Neural Network Models”, Carlini et al 2020
“Can Increasing Input Dimensionality Improve Deep Reinforcement Learning?”, Ota et al 2020
“Can Increasing Input Dimensionality Improve Deep Reinforcement Learning?”
“Gesticulator: A Framework for Semantically-aware Speech-driven Gesture Generation”, Kucherenko et al 2020
“Gesticulator: A framework for semantically-aware speech-driven gesture generation”
“Understanding the Generalization of ‘lottery Tickets’ in Neural Networks”, Morcos & Tian 2019
“Understanding the generalization of ‘lottery tickets’ in neural networks”
“The Bouncer Problem: Challenges to Remote Explainability”, Merrer & Tredan 2019
“3D Human Pose Estimation via Human Structure-aware Fully Connected Network”, Zhang et al 2019d
“3D human pose estimation via human structure-aware fully connected network”
“Finding the Needle in the Haystack With Convolutions: on the Benefits of Architectural Bias”, d’Ascoli et al 2019
“Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias”
“MoGlow: Probabilistic and Controllable Motion Synthesis Using Normalizing Flows”, Henter et al 2019
“MoGlow: Probabilistic and controllable motion synthesis using normalizing flows”
“Fixup Initialization: Residual Learning Without Normalization”, Zhang et al 2019
“Fixup Initialization: Residual Learning Without Normalization”
“SwitchNet: a Neural Network Model for Forward and Inverse Scattering Problems”, Khoo & Ying 2018
“SwitchNet: a neural network model for forward and inverse scattering problems”
“Scalable Training of Artificial Neural Networks With Adaptive Sparse Connectivity Inspired by Network Science”, Mocanu et al 2018
“Deep Learning Generalizes Because the Parameter-function Map Is Biased towards Simple Functions”, Valle-Pérez et al 2018
“Deep learning generalizes because the parameter-function map is biased towards simple functions”
“Bidirectional Learning for Robust Neural Networks”, Pontes-Filho & Liwicki 2018
“NAIS-Net: Stable Deep Networks from Non-Autonomous Differential Equations”, Ciccone et al 2018
“NAIS-Net: Stable Deep Networks from Non-Autonomous Differential Equations”
“Improving Palliative Care With Deep Learning”, An et al 2018
“Repurposing High-Throughput Image Assays Enables Biological Activity Prediction for Drug Discovery”, Simm et al 2018
“Repurposing High-Throughput Image Assays Enables Biological Activity Prediction for Drug Discovery”
“Learning to Play Chess With Minimal Lookahead and Deep Value Neural Networks”, Sabatelli 2017 (page 3)
“Learning to Play Chess with Minimal Lookahead and Deep Value Neural Networks”
“Neural Collaborative Filtering”, He et al 2017
“Sharp Models on Dull Hardware: Fast and Accurate Neural Machine Translation Decoding on the CPU”, Devlin 2017
“Sharp Models on Dull Hardware: Fast and Accurate Neural Machine Translation Decoding on the CPU”
“Research Ideas”, Gwern 2017
“The Shattered Gradients Problem: If Resnets Are the Answer, Then What Is the Question?”, Balduzzi et al 2017
“The Shattered Gradients Problem: If resnets are the answer, then what is the question?”
“Gender-From-Iris or Gender-From-Mascara?”, Kuehlkamp et al 2017
“Skip Connections Eliminate Singularities”, Orhan & Pitkow 2017
“On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”, Keskar et al 2016
“On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”
“Learning to Optimize”, Li & Malik 2016
“Do Deep Convolutional Nets Really Need to Be Deep and Convolutional?”, Urban et al 2016
“Do Deep Convolutional Nets Really Need to be Deep and Convolutional?”
“Network Morphism”, Wei et al 2016
“Adding Gradient Noise Improves Learning for Very Deep Networks”, Neelakantan et al 2015
“Adding Gradient Noise Improves Learning for Very Deep Networks”
“How Far Can We Go without Convolution: Improving Fully-connected Networks”, Lin et al 2015
“How far can we go without convolution: Improving fully-connected networks”
“BinaryConnect: Training Deep Neural Networks With Binary Weights during Propagations”, Courbariaux et al 2015
“BinaryConnect: Training Deep Neural Networks with binary weights during propagations”
“Tensorizing Neural Networks”, Novikov et al 2015
“A Neural Attention Model for Abstractive Sentence Summarization”, Rush et al 2015
“Deep Neural Networks for Large Vocabulary Handwritten Text Recognition”, Bluche 2015
“The Loss Surfaces of Multilayer Networks”, Choromanska et al 2014
“One Weird Trick for Parallelizing Convolutional Neural Networks”, Krizhevsky 2014
“Do Deep Nets Really Need to Be Deep?”, Ba & Caruana 2013
“Network In Network”, Lin et al 2013
“Deep Big Multilayer Perceptrons for Digit Recognition”, Cireşan et al 2012
“Compositional Pattern Producing Networks: A Novel Abstraction of Development”, Stanley 2007
“Extraction De Séquences Numériques Dans Des Documents Manuscrits Quelconques”, Chatelain 2006
“Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis”, Simard et al 2003
“NEAT: Evolving Neural Networks through Augmenting Topologies”, Stanley & Miikkulainen 2002
“Quantitative Analysis of Multivariate Data Using Artificial Neural Networks: A Tutorial Review and Applications to the Deconvolution of Pyrolysis Mass Spectra”, Goodacre et al 1996
“On the Ability of the Optimal Perceptron to Generalize”, Opper et al 1990
“Learning To Tell Two Spirals Apart”, Lang & Witbrock 1988
“Learning Internal Representations by Error Propagation”, Rumelhart et al 1986
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses the embedding of each annotation to build a chain of nearest-neighbor annotations, so that adjacent entries are topically related and the list reads as a progression of topics. For more details, see the link.
neural-architectures
nerf
mlp-architectures
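The ordering described above can be illustrated with a minimal sketch. The page does not specify the embedding model, the distance metric, or the actual implementation, so everything here is an assumption: the function name `sort_by_magic` is hypothetical, cosine similarity stands in for whatever measure is really used, and the toy 2D vectors are made up. The clustering and auto-labeling step is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sort_by_magic(embeddings):
    """Greedy nearest-neighbor ordering (hypothetical helper, not the site's code).

    embeddings[0] is assumed to be the newest annotation. Each step appends the
    not-yet-used annotation most similar to the one just placed.
    """
    if not embeddings:
        return []
    order = [0]
    remaining = set(range(1, len(embeddings)))
    while remaining:
        current = embeddings[order[-1]]
        # Iterate in index order so ties resolve to the lowest index.
        nxt = max(sorted(remaining), key=lambda i: cosine(current, embeddings[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Toy embeddings (made up): items 0 and 2 point one way, items 1 and 3 another.
toy = [
    [1.0, 0.0],  # 0: newest annotation
    [0.0, 1.0],  # 1
    [0.9, 0.1],  # 2
    [0.1, 0.9],  # 3
]
print(sort_by_magic(toy))  # [0, 2, 3, 1]
```

On the toy data the chain visits the item closest to the newest one first, then drifts toward the other group, which is the "progression of topics" effect. A greedy chain like this is quadratic in the number of annotations; that is acceptable for a single tag page but is a design choice of this sketch, not a claim about the real system.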
Wikipedia
Miscellaneous
- /doc/ai/nn/fully-connected/2023-08-17-gwern-aunn-architecture.svg
- /doc/ai/nn/fully-connected/2023-08-17-gwern-aunn-architecture.png
- /doc/ai/nn/fully-connected/2023-bachmann-figure8-mlparchitectureablations.png
- /doc/ai/nn/fully-connected/2023-bachmann-figure7-suprachinchilladatascalingformlpsoncifar100loss.png
- /doc/ai/nn/fully-connected/2023-bachmann-figure5-scalingofmlpsoncifar10andimagenet1k.png
- /doc/ai/nn/fully-connected/2023-bachmann-figure4-mlpsscalewellwithincreasingbatchsize.png
- /doc/ai/nn/fully-connected/2023-bachmann-figure1-mlpcomputescalingoncifar100.png
- /doc/ai/nn/fully-connected/2021-power-figure1-grokkinglearningcurves.png
- /doc/ai/nn/fully-connected/2021-ni-figure2-vilmlpvstransformerbypretrainingdatafraction.png
- /doc/ai/nn/fully-connected/2021-muller-figure7-fullyfusedfullyconnectednetworkspeedupongpu.png
- /doc/ai/nn/fully-connected/2020-ota-figure2-overallofenetarchitectureshematic.png
- /doc/ai/nn/fully-connected/2020-ota-figure1-densenetmlpschematicarchitecture.png
- https://colab.research.google.com/github/murphyka/ml_colabs/blob/main/Simple_MLP_Visualization.ipynb
- https://cprimozic.net/blog/reverse-engineering-a-small-neural-network/
- https://twitter.com/francoisfleuret/status/1714531085512544760
- https://twitter.com/stephenroller/status/1579993017234382849
- https://www.lesswrong.com/s/5omSW4wNKbEvYsyje/p/GpSzShaaf8po4rcmA
Link Bibliography
- https://arxiv.org/abs/2310.08708: “Polynomial Time Cryptanalytic Extraction of Neural Network Models”, Adi Shamir, Isaac Canales-Martinez, Anna Hambitzer, Jorge Chavez-Saab, Francisco Rodrigez-Henriquez, Nitin Satpute
- aunn: “Absolute Unit NNs: Regression-Based MLPs for Everything”, Gwern
- https://arxiv.org/abs/2306.13575: “Scaling MLPs: A Tale of Inductive Bias”, Gregor Bachmann, Sotiris Anagnostidis, Thomas Hofmann
- aunn-brain: “Modular Brain AUNNs for Uploads”, Gwern
- https://arxiv.org/abs/2303.06053#google: “TSMixer: An All-MLP Architecture for Time Series Forecasting”, Si-An Chen, Chun-Liang Li, Nate Yoder, Sercan O. Arik, Tomas Pfister
- 2023-bures.pdf: “Organic Reaction Mechanism Classification Using Machine Learning”, Jordi Burés, Igor Larrosa
- https://www.nature.com/articles/s41467-022-35422-y: “Merging Enzymatic and Synthetic Chemistry With Computational Synthesis Planning”, Itai Levin, Mengjie Liu, Christopher A. Voigt, Connor W. Coley
- https://arxiv.org/abs/2211.03495: “How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers”, Michael Hassid, Hao Peng, Daniel Rotem, Jungo Kasai, Ivan Montero, Noah A. Smith, Roy Schwartz
- aunn-papyrus: “Language-Conditioned Absolute Unit NNs”, Gwern
- https://arxiv.org/abs/2210.03310#google: “Scaling Forward Gradient With Local Losses”, Mengye Ren, Simon Kornblith, Renjie Liao, Geoffrey Hinton
- https://arxiv.org/abs/2209.12892: “g.pt: Learning to Learn With Generative Models of Neural Network Checkpoints”, William Peebles, Ilija Radosavovic, Tim Brooks, Alexei A. Efros, Jitendra Malik
- https://arxiv.org/abs/2207.10551#google: “Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?”
- https://arxiv.org/abs/2206.07137: “RHO-LOSS: Prioritized Training on Points That Are Learnable, Worth Learning, and Not Yet Learnt”
- https://arxiv.org/abs/2206.05852: “ChordMixer: A Scalable Neural Attention Model for Sequences With Different Lengths”, Ruslan Khalitov, Tong Yu, Lei Cheng, Zhirong Yang
- https://arxiv.org/abs/2205.12399#google: “Sparse Mixers: Combining MoE and Mixing to Build a More Efficient BERT”, James Lee-Thorp, Joshua Ainslie
- https://arxiv.org/abs/2204.10670: “Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention”, Tong Yu, Ruslan Khalitov, Lei Cheng, Zhirong Yang
- https://arxiv.org/abs/2202.06510#microsoft: “Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs”, Huangjie Zheng, Pengcheng He, Weizhu Chen, Mingyuan Zhou
- https://arxiv.org/abs/2201.10801: “When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism (ShiftViT)”, Guangting Wang, Yucheng Zhao, Chuanxin Tang, Chong Luo, Wenjun Zeng
- https://arxiv.org/abs/2201.09792: “ConvMixer: Patches Are All You Need?”, Asher Trockman, J. Zico Kolter
- https://arxiv.org/abs/2111.11418: “MetaFormer Is Actually What You Need for Vision”, Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan
- https://arxiv.org/abs/2110.02095#google: “Exploring the Limits of Large Scale Pre-training”, Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, Hanie Sedghi
- https://arxiv.org/abs/2109.05422: “Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?”, Chuanxin Tang, Yucheng Zhao, Guangting Wang, Chong Luo, Wenxuan Xie, Wenjun Zeng
- https://arxiv.org/abs/2109.04454: “ConvMLP: Hierarchical Convolutional MLPs for Vision”, Jiachen Li, Ali Hassani, Steven Walton, Humphrey Shi
- https://arxiv.org/abs/2108.13341#huawei: “Hire-MLP: Vision MLP via Hierarchical Rearrangement”, Jianyuan Guo, Yehui Tang, Kai Han, Xinghao Chen, Han Wu, Chao Xu, Chang Xu, Yunhe Wang
- https://arxiv.org/abs/2108.04384: “RaftMLP: How Much Can Be Done Without Attention and With Less Spatial Locality?”, Yuki Tatsunami, Masato Taki
- https://arxiv.org/abs/2108.01072#baidu: “S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision”, Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, Ping Li
- https://arxiv.org/abs/2107.10224: “CycleMLP: A MLP-like Architecture for Dense Prediction”, Shoufa Chen, Enze Xie, Chongjian Ge, Runjian Chen, Ding Liang, Ping Luo
- https://arxiv.org/abs/2107.08391: “AS-MLP: An Axial Shifted MLP Architecture for Vision”, Dongze Lian, Zehao Yu, Xing Sun, Shenghua Gao
- https://arxiv.org/abs/2106.12372#nvidia: “Real-time Neural Radiance Caching for Path Tracing”, Thomas Müller, Fabrice Rousselle, Jan Novák, Alexander Keller
- https://arxiv.org/abs/2106.12368: “Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition”, Qibin Hou, Zihang Jiang, Li Yuan, Ming-Ming Cheng, Shuicheng Yan, Jiashi Feng
- https://arxiv.org/abs/2106.07477#baidu: “S2-MLP: Spatial-Shift MLP Architecture for Vision”, Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, Ping Li
- https://arxiv.org/abs/2106.01548: “When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations”, Xiangning Chen, Cho-Jui Hsieh, Boqing Gong
- https://arxiv.org/abs/2106.01401: “Container: Context Aggregation Network”, Peng Gao, Jiasen Lu, Hongsheng Li, Roozbeh Mottaghi, Aniruddha Kembhavi
- https://arxiv.org/abs/2105.08050#google: “Pay Attention to MLPs”, Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le
- https://arxiv.org/abs/2105.03824#google: “FNet: Mixing Tokens With Fourier Transforms”, James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon
- https://arxiv.org/abs/2105.02723: “Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet”, Luke Melas-Kyriazi
- https://arxiv.org/abs/2105.01883: “RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition”, Xiaohan Ding, Chunlong Xia, Xiangyu Zhang, Xiaojie Chu, Jungong Han, Guiguang Ding
- https://arxiv.org/abs/2105.01601#google: “MLP-Mixer: An All-MLP Architecture for Vision”
- 2021-power.pdf#openai: “Grokking: Generalization Beyond Overfitting On Small Algorithmic Datasets”, Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, Vedant Misra
- fc: “Fully-Connected Neural Nets”, Gwern
- https://arxiv.org/abs/2103.14030: “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows”, Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo
- https://greydanus.github.io/2020/12/01/scaling-down/: “Scaling down Deep Learning”, Sam Greydanus
- https://arxiv.org/abs/2011.13775: “Image Generators With Conditionally-Independent Pixel Synthesis”, Ivan Anokhin, Kirill Demochkin, Taras Khakhulin, Gleb Sterkin, Victor Lempitsky, Denis Korzhenkov
- attention: “Efficient Attention: Breaking The Quadratic Transformer Bottleneck”, Gwern
- https://arxiv.org/abs/2005.00743#google: “Synthesizer: Rethinking Self-Attention in Transformer Models”, Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng
- https://arxiv.org/abs/2003.01629: “Can Increasing Input Dimensionality Improve Deep Reinforcement Learning?”, Kei Ota, Tomoaki Oiki, Devesh K. Jha, Toshisada Mariyama, Daniel Nikovski
- 2017-sabatelli.pdf#page=3: “Learning to Play Chess With Minimal Lookahead and Deep Value Neural Networks”, Matthia Sabatelli
- idea: “Research Ideas”, Gwern