Efficient Attention: Breaking The Quadratic Transformer Bottleneck
Flexible task abstractions emerge in linear networks with fast and bounded units
SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning
nGPT: Normalized Transformer with Representation Learning on the Hypersphere
GSoC 2024: Differentiable Logic for Interactive Systems and Generative Music
When Parts are Greater Than Sums: Individual LLM Components Can Outperform Full Models
Probing the Decision Boundaries of In-context Learning in Large Language Models
MAR: Autoregressive Image Generation without Vector Quantization
Grokfast: Accelerated Grokking by Amplifying Slow Gradients
Lateralization MLP: A Simple Brain-inspired Architecture for Diffusion
Neural Spline Fields for Burst Image Fusion and Layer Separation
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
SMERF: Streamable Memory Efficient Radiance Fields for Real-Time Large-Scene Exploration
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
HyperFields: Towards Zero-Shot Generation of NeRFs from Text
Grokking Beyond Neural Networks: An Empirical Exploration with Model Complexity
To grok or not to grok: Disentangling generalization and memorization on corrupted algorithmic datasets
Polynomial Time Cryptanalytic Extraction of Neural Network Models
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks
Does the First Letter of One’s Name Affect Life Decisions? A Natural Language Processing Examination of Nominative Determinism
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Two-Step Training: Adjustable Sketch Colorization via Reference Image and Text Tag
HyperDiffusion: Generating Implicit Neural Fields with Weight-Space Diffusion
TSMixer: An All-MLP Architecture for Time Series Forecasting
Loss Landscapes are All You Need: Neural Network Generalization Can Be Explained Without the Implicit Bias of Gradient Descent
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations
Organic reaction mechanism classification using machine learning
Merging enzymatic and synthetic chemistry with computational synthesis planning
How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers
The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers
The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes
g.pt: Learning to Learn with Generative Models of Neural Network Checkpoints
Random initializations performing above chance and how to find them
Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
Why do tree-based models still outperform deep learning on tabular data?
Revisiting Pretraining Objectives for Tabular Deep Learning
RHO-LOSS: Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt
MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing
ChordMixer: A Scalable Neural Attention Model for Sequences with Different Lengths
Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT
Towards Understanding Grokking: An Effective Theory of Representation Learning
Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better than Dot-Product Self-Attention
Deep Learning meets Nonparametric Regression: Are Weight-Decayed DNNs Locally Adaptive?
HyperMixer: An MLP-based Low Cost Alternative to Transformers
MLP-ASR: Sequence-length agnostic all-MLP architectures for speech recognition
Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs
pNLP-Mixer: an Efficient all-MLP Architecture for Language
Data-driven emergence of convolutional structure in neural networks
When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism (ShiftViT)
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets [paper]
The GatedTabTransformer: An enhanced deep learning architecture for tabular modeling
MLP Architectures for Vision-and-Language Modeling: An Empirical Study
Noether Networks: Meta-Learning Useful Conserved Quantities
MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video
Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers
ZerO Initialization: Initializing Residual Networks with only Zeros and Ones
ADOP: Approximate Differentiable One-Pixel Point Rendering
Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping
Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?
Sparse-MLP: A Fully-MLP Architecture with Conditional Computation
A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP
RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?
S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision
Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition
MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis
PairConnect: A Compute-Efficient MLP Alternative to Attention
When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations
MixerGAN: An MLP-Based Architecture for Unpaired Image-to-Image Translation
One4all User Representation for Recommender Systems in E-commerce
ResMLP: Feedforward networks for image classification with data-efficient training
Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet
Multi-scale Inference of Genetic Trait Architecture using Biologically Annotated Neural Networks
RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition
Grokking: Generalization Beyond Overfitting On Small Algorithmic Datasets
Sifting out the features by pruning: Are convolutional networks the winning lottery ticket of fully connected ones?
KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
Neural Geometric Level of Detail: Real-time Rendering with Implicit 3D Shapes
Is MLP-Mixer a CNN in Disguise? As part of this blog post, we look at the MLP-Mixer architecture in detail and also understand why it is not considered convolution-free.
AdnFM: An Attentive DenseNet based Factorization Machine for CTR Prediction
TabTransformer: Tabular Data Modeling Using Contextual Embeddings
Image Generators with Conditionally-Independent Pixel Synthesis
Fourier Neural Operator for Parametric Partial Differential Equations
Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains
SIREN: Implicit Neural Representations with Periodic Activation Functions
Synthesizer: Rethinking Self-Attention in Transformer Models
Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
Train-by-Reconnect: Decoupling Locations of Weights from their Values (LaPerm)
Can Increasing Input Dimensionality Improve Deep Reinforcement Learning?
Gesticulator: A framework for semantically-aware speech-driven gesture generation
Understanding the generalization of ‘lottery tickets’ in neural networks
3D human pose estimation via human structure-aware fully connected network
Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias
MoGlow: Probabilistic and controllable motion synthesis using normalizing flows
Fixup Initialization: Residual Learning Without Normalization
SwitchNet: a neural network model for forward and inverse scattering problems
A jamming transition from under-parameterization to over-parameterization affects loss landscape and generalization
The Goldilocks zone: Towards better understanding of neural network loss landscapes
Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science
Deep learning generalizes because the parameter-function map is biased towards simple functions
NAIS-Net: Stable Deep Networks from Non-Autonomous Differential Equations
Meta-Learning Update Rules for Unsupervised Representation Learning
Repurposing High-Throughput Image Assays Enables Biological Activity Prediction for Drug Discovery
Learning to Play Chess with Minimal Lookahead and Deep Value Neural Networks
Sharp Models on Dull Hardware: Fast and Accurate Neural Machine Translation Decoding on the CPU
The Shattered Gradients Problem: If resnets are the answer, then what is the question?
Topology and Geometry of Half-Rectified Network Optimization
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
Do Deep Convolutional Nets Really Need to be Deep and Convolutional?
Adding Gradient Noise Improves Learning for Very Deep Networks
How far can we go without convolution: Improving fully-connected networks
BinaryConnect: Training Deep Neural Networks with binary weights during propagations
A Neural Attention Model for Abstractive Sentence Summarization
Deep Neural Networks for Large Vocabulary Handwritten Text Recognition
In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning
On the number of response regions of deep feed forward networks with piece-wise linear activations
Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition
Compositional pattern producing networks: A novel abstraction of development
Extraction de séquences numériques dans des documents manuscrits quelconques [Extraction of Numerical Sequences from Arbitrary Handwritten Documents]
Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis
NEAT: Evolving Neural Networks through Augmenting Topologies
Quantitative Analysis of Multivariate Data Using Artificial Neural Networks: A Tutorial Review and Applications to the Deconvolution of Pyrolysis Mass Spectra
Neural Networks and Physical Systems With Emergent Collective Computational Abilities
2024-chang-figure7-mlpandattentionheadsbypredictioncorrectnessshowsbothcanworkforiclmetalearning.png
2024-zhao-figure1-llmshavemuchrougherdecisionboundariesthanmlpsorsvmsordecisiontrees.png
2023-bachmann-figure10-dataaugmentationinducesmoresparselocalfeaturesinfirstlayermlpweights.png
2023-bachmann-figure4-mlpsscalewellwithincreasingbatchsize.jpg
2023-bachmann-figure5-scalingofmlpsoncifar10andimagenet1k.png
2023-bachmann-figure6-powerlawincifar100losswhenconstrainingparametersordatasetsize.jpg
2023-bachmann-figure7-suprachinchilladatascalingformlpsoncifar100loss.jpg
2023-mitchell-figure2-2dvisualizationofannbeingexpandedbysenntobetterapproximatetheline.png
2023-mitchell-figure3-visualizationofsennlossoveradditionsforhalfmoonstoydataset.jpg
2022-grinsztajn-figure10-treesvsneuralnetson3regressiontasksusingnumericalfeaturesonmediumvslargedatasets.png
2022-grinsztajn-figure11-treesvsneuralnetson2classificationtasksusingallfeaturesonmediumvslargedatasets.png
2022-grinsztajn-figure12-treesvsneuralnetson5regressiontasksusingallfeaturesonmediumvslargedatasets.png
2022-hassid-figure2-contributionoftransformerattentionwhenablatedtomlbenchmarkperformance.jpg
2021-muller-figure7-fullyfusedfullyconnectednetworkspeedupongpu.jpg
2021-ni-figure2-vilmlpvstransformerbypretrainingdatafraction.png
2021-ni-figure3-scalingofmlpvilvsmlpviltinyattentionvstransformeronvisualquestionansweringaccuracy.png
2021-zhao-figure4-mlpsoverfitbutcanberegularizedbyweightsharingandmultistagearchitecture.jpg
2021-zhao-multistagespachframeworkforcomparingmodularblocksofmlpsvscnnsvstransformers.png
2014-montufar-figure1-binaryclassificationdecisionboundaryofshallowvsdeepneuralnetworkshowingdeeperequalssmoother.png
2014-pascanu-figure2-topologyofdeepnetworksinfoldingaroundaxislayerbylayer.png
2014-pascanu-figure3-spacefoldingof2dspaceassheetofpapermodeledbydeepneuralnetworks.png
1988-lang-figure3-densenetresidualarchitectureforneuralnetsolvingswissspiralproblem.jpg
https://colab.research.google.com/github/murphyka/ml_colabs/blob/main/Simple_MLP_Visualization.ipynb
https://cpldcpu.wordpress.com/2024/04/24/implementing-neural-networks-on-the-10-cent-risc-v-mcu-without-multiplier/
https://cprimozic.net/blog/reverse-engineering-a-small-neural-network/
https://transformer-circuits.pub/2024/jan-update/index.html#mnist-sparse
https://www.lesswrong.com/posts/7fxusXdkMNmAhkAfc/finding-sparse-linear-connections-between-features-in-llms
https://www.lesswrong.com/posts/K7AyY8LMrcKhwfbyj/no-really-attention-is-all-you-need-attention-can-do
https://www.lesswrong.com/posts/YmkjnWtZGLbHRbzrP/transcoders-enable-fine-grained-interpretable-circuit
https://www.lesswrong.com/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall
https://www.lesswrong.com/s/5omSW4wNKbEvYsyje/p/GpSzShaaf8po4rcmA
https://www.lesswrong.com/posts/LncYobrn3vRr7qkZW/the-slingshot-helps-with-learning