Bibliography:

  1. Fully-Connected Neural Nets

  2. ‘neural net’ tag

  3. ‘RNN’ tag

  4. ‘Transformer’ tag

  5. Absolute Unit NNs: Regression-Based MLPs for Everything

  6. Research Ideas

  7. Modular Brain AUNNs for Uploads

  8. Language-Conditioned Absolute Unit NNs

  9. Fully-Connected Neural Nets

  10. Efficient Attention: Breaking The Quadratic Transformer Bottleneck

  11. AUNN: Simple Implementation of Gwern’s AUNN Proposal

  12. Flexible task abstractions emerge in linear networks with fast and bounded units

  13. The slingshot helps with learning

  14. SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning

  15. nGPT: Normalized Transformer with Representation Learning on the Hypersphere

  16. How Feature Learning Can Improve Neural Scaling Laws

  17. Magika: AI-Powered Content-Type Detection

  18. On the Complexity of Neural Computation in Superposition

  19. Masked Mixers for Language Generation and Retrieval

  20. GSoC 2024: Differentiable Logic for Interactive Systems and Generative Music

  21. What Matters in Transformers? Not All Attention is Needed

  22. When Parts are Greater Than Sums: Individual LLM Components Can Outperform Full Models

  23. Probing the Decision Boundaries of In-context Learning in Large Language Models

  24. MAR: Autoregressive Image Generation without Vector Quantization

  25. Grokking Modular Polynomials

  26. Grokfast: Accelerated Grokking by Amplifying Slow Gradients

  27. Lateralization MLP: A Simple Brain-inspired Architecture for Diffusion

  28. MLPs Learn In-Context

  29. Verified Neural Compressed Sensing

  30. Neural Redshift: Random Networks are not Random Functions

  31. Neural Spline Fields for Burst Image Fusion and Layer Separation

  32. SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

  33. SMERF: Streamable Memory Efficient Radiance Fields for Real-Time Large-Scene Exploration

  34. Grokking Group Multiplication with Cosets

  35. Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers

  36. HyperFields: Towards Zero-Shot Generation of NeRFs from Text

  37. Grokking Beyond Neural Networks: An Empirical Exploration with Model Complexity

  38. To grok or not to grok: Disentangling generalization and memorization on corrupted algorithmic datasets

  39. Polynomial Time Cryptanalytic Extraction of Neural Network Models

  40. One Wide Feedforward is All You Need

  41. Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

  42. Self Expanding Neural Networks

  43. The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks

  44. Scaling MLPs: A Tale of Inductive Bias

  45. Any Deep ReLU Network is Shallow

  46. Does the First Letter of One’s Name Affect Life Decisions? A Natural Language Processing Examination of Nominative Determinism

  47. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model

  48. Two-Step Training: Adjustable Sketch Colorization via Reference Image and Text Tag

  49. HyperDiffusion: Generating Implicit Neural Fields with Weight-Space Diffusion

  50. The Quantization Model of Neural Scaling

  51. TSMixer: An All-MLP Architecture for Time Series Forecasting

  52. Loss Landscapes are All You Need: Neural Network Generalization Can Be Explained Without the Implicit Bias of Gradient Descent

  53. A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

  54. Looped Transformers as Programmable Computers

  55. Organic reaction mechanism classification using machine learning

  56. DataMUX: Data Multiplexing for Neural Networks

  57. Merging enzymatic and synthetic chemistry with computational synthesis planning

  58. Magic3D: High-Resolution Text-to-3D Content Creation

  59. How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers

  60. Deep Differentiable Logic Gate Networks

  61. The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers

  62. The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes

  63. Scaling Forward Gradient With Local Losses

  64. Omnigrok: Grokking Beyond Algorithmic Data

  65. DreamFusion: Text-to-3D using 2D Diffusion

  66. g.pt: Learning to Learn with Generative Models of Neural Network Checkpoints

  67. Random initializations performing above chance and how to find them

  68. Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?

  69. Why do tree-based models still outperform deep learning on tabular data?

  70. Revisiting Pretraining Objectives for Tabular Deep Learning

  71. RHO-LOSS: Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt

  72. MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing

  73. ChordMixer: A Scalable Neural Attention Model for Sequences with Different Lengths

  74. Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT

  75. Towards Understanding Grokking: An Effective Theory of Representation Learning

  76. Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better than Dot-Product Self-Attention

  77. Deep Learning meets Nonparametric Regression: Are Weight-Decayed DNNs Locally Adaptive?

  78. Efficient Language Modeling with Sparse All-MLP

  79. HyperMixer: An MLP-based Low Cost Alternative to Transformers

  80. MLP-ASR: Sequence-length agnostic all-MLP architectures for speech recognition

  81. Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs

  82. pNLP-Mixer: an Efficient all-MLP Architecture for Language

  83. Data-driven emergence of convolutional structure in neural networks

  84. When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism (ShiftViT)

  85. ConvMixer: Patches Are All You Need?

  86. MAXIM: Multi-Axis MLP for Image Processing

  87. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets [paper]

  88. The GatedTabTransformer: An enhanced deep learning architecture for tabular modeling

  89. MLP Architectures for Vision-and-Language Modeling: An Empirical Study

  90. Noether Networks: Meta-Learning Useful Conserved Quantities

  91. Zero-Shot Text-Guided Object Generation with Dream Fields

  92. MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video

  93. PointMixer: MLP-Mixer for Point Cloud Understanding

  94. MetaFormer is Actually What You Need for Vision

  95. Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers

  96. ZerO Initialization: Initializing Residual Networks with only Zeros and Ones

  97. Wide Neural Networks Forget Less Catastrophically

  98. ADOP: Approximate Differentiable One-Pixel Point Rendering

  99. Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping

  100. Exploring the Limits of Large Scale Pre-training

  101. Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?

  102. ConvMLP: Hierarchical Convolutional MLPs for Vision

  103. Sparse-MLP: A Fully-MLP Architecture with Conditional Computation

  104. A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

  105. Hire-MLP: Vision MLP via Hierarchical Rearrangement

  106. RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?

  107. S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision

  108. CycleMLP: A MLP-like Architecture for Dense Prediction

  109. AS-MLP: An Axial Shifted MLP Architecture for Vision

  110. Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition

  111. Real-time Neural Radiance Caching for Path Tracing

  112. Towards Biologically Plausible Convolutional Networks

  113. Well-tuned Simple Nets Excel on Tabular Datasets

  114. MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis

  115. PairConnect: A Compute-Efficient MLP Alternative to Attention

  116. S2-MLP: Spatial-Shift MLP Architecture for Vision

  117. When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations

  118. Container: Context Aggregation Network

  119. MixerGAN: An MLP-Based Architecture for Unpaired Image-to-Image Translation

  120. One4all User Representation for Recommender Systems in E-commerce

  121. Pay Attention to MLPs

  122. FNet: Mixing Tokens with Fourier Transforms

  123. ResMLP: Feedforward networks for image classification with data-efficient training

  124. Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

  125. Multi-scale Inference of Genetic Trait Architecture using Biologically Annotated Neural Networks

  126. RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition

  127. MLP-Mixer: An all-MLP Architecture for Vision

  128. Grokking: Generalization Beyond Overfitting On Small Algorithmic Datasets

  129. Sifting out the features by pruning: Are convolutional networks the winning lottery ticket of fully connected ones?

  130. Revisiting Simple Neural Probabilistic Language Models

  131. KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs

  132. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

  133. Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

  134. Clusterability in Neural Networks

  135. Training Larger Networks for Deep Reinforcement Learning

  136. Explaining Neural Scaling Laws

  137. Neural Geometric Level of Detail: Real-time Rendering with Implicit 3D Shapes

  138. Is MLP-Mixer a CNN in Disguise? As part of this blog post, we look at the MLP-Mixer architecture in detail and also understand why it is not considered convolution-free.

  139. Transformer Feed-Forward Layers Are Key-Value Memories

  140. AdnFM: An Attentive DenseNet based Factorization Machine for CTR Prediction

  141. TabTransformer: Tabular Data Modeling Using Contextual Embeddings

  142. Scaling down Deep Learning

  143. Image Generators with Conditionally-Independent Pixel Synthesis

  144. D2RL: Deep Dense Architectures in Reinforcement Learning

  145. Fourier Neural Operator for Parametric Partial Differential Equations

  146. AFT: An Attention Free Transformer

  147. Towards Learning Convolutions from Scratch

  148. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains

  149. SIREN: Implicit Neural Representations with Periodic Activation Functions

  150. Linformer: Self-Attention with Linear Complexity

  151. A map of object space in primate inferotemporal cortex

  152. Synthesizer: Rethinking Self-Attention in Transformer Models

  153. Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems

  154. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

  155. Cryptanalytic Extraction of Neural Network Models

  156. ReZero is All You Need: Fast Convergence at Large Depth

  157. Train-by-Reconnect: Decoupling Locations of Weights from their Values (LaPerm)

  158. Can Increasing Input Dimensionality Improve Deep Reinforcement Learning?

  159. Quasi-Equivalence of Width and Depth of Neural Networks

  160. Gesticulator: A framework for semantically-aware speech-driven gesture generation

  161. What’s Hidden in a Randomly Weighted Neural Network?

  162. Understanding the generalization of ‘lottery tickets’ in neural networks

  163. The Bouncer Problem: Challenges to Remote Explainability

  164. 3D human pose estimation via human structure-aware fully connected network

  165. Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias

  166. MoGlow: Probabilistic and controllable motion synthesis using normalizing flows

  167. Fixup Initialization: Residual Learning Without Normalization

  168. SwitchNet: a neural network model for forward and inverse scattering problems

  169. A jamming transition from under-parameterization to over-parameterization affects loss landscape and generalization

  170. Neural Arithmetic Logic Units

  171. The Goldilocks zone: Towards better understanding of neural network loss landscapes

  172. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science

  173. Deep learning generalizes because the parameter-function map is biased towards simple functions

  174. Bidirectional Learning for Robust Neural Networks

  175. NAIS-Net: Stable Deep Networks from Non-Autonomous Differential Equations

  176. Meta-Learning Update Rules for Unsupervised Representation Learning

  177. Learning and Memorization

  178. Repurposing High-Throughput Image Assays Enables Biological Activity Prediction for Drug Discovery

  179. Improving palliative care with deep learning

  180. Learning to Play Chess with Minimal Lookahead and Deep Value Neural Networks

  181. Neural Collaborative Filtering

  182. Sharp Models on Dull Hardware: Fast and Accurate Neural Machine Translation Decoding on the CPU

  183. The Shattered Gradients Problem: If resnets are the answer, then what is the question?

  184. Gender-From-Iris or Gender-From-Mascara?

  185. Skip Connections Eliminate Singularities

  186. Deep Information Propagation

  187. Topology and Geometry of Half-Rectified Network Optimization

  188. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

  189. Decoupled Neural Interfaces using Synthetic Gradients

  190. Learning to Optimize

  191. Do Deep Convolutional Nets Really Need to be Deep and Convolutional?

  192. Network Morphism

  193. Adding Gradient Noise Improves Learning for Very Deep Networks

  194. How far can we go without convolution: Improving fully-connected networks

  195. BinaryConnect: Training Deep Neural Networks with binary weights during propagations

  196. Tensorizing Neural Networks

  197. A Neural Attention Model for Abstractive Sentence Summarization

  198. Deep Neural Networks for Large Vocabulary Handwritten Text Recognition

  199. In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning

  200. The Loss Surfaces of Multilayer Networks

  201. On the Number of Linear Regions of Deep Neural Networks

  202. Do Deep Nets Really Need to be Deep?

  203. On the number of response regions of deep feed forward networks with piece-wise linear activations

  204. Network In Network

  205. Deep Big Multilayer Perceptrons for Digit Recognition

  206. Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition

  207. Compositional pattern producing networks: A novel abstraction of development

  208. Extraction de séquences numériques dans des documents manuscrits quelconques [Extraction of Numerical Sequences from Arbitrary Handwritten Documents]

  209. Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis

  210. NEAT: Evolving Neural Networks through Augmenting Topologies

  211. DARPA and the Quest for Machine Intelligence, 1983–1993

  212. Quantitative Analysis of Multivariate Data Using Artificial Neural Networks: A Tutorial Review and Applications to the Deconvolution of Pyrolysis Mass Spectra

  213. Statistical Mechanics of Generalization

  214. On the ability of the optimal perceptron to generalize

  215. Learning To Tell Two Spirals Apart

  216. Learning Internal Representations by Error Propagation

  217. Neural Networks and Physical Systems With Emergent Collective Computational Abilities

  218. 2024-chang-figure7-mlpandattentionheadsbypredictioncorrectnessshowsbothcanworkforiclmetalearning.png

  219. 2024-zhao-figure1-llmshavemuchrougherdecisionboundariesthanmlpsorsvmsordecisiontrees.png

  220. 2023-08-17-gwern-aunn-architecture.png

  221. 2023-08-17-gwern-aunn-architecture.svg

  222. 2023-bachmann-figure1-mlpcomputescalingoncifar100.jpg

  223. 2023-bachmann-figure10-dataaugmentationinducesmoresparselocalfeaturesinfirstlayermlpweights.png

  224. 2023-bachmann-figure4-mlpsscalewellwithincreasingbatchsize.jpg

  225. 2023-bachmann-figure5-scalingofmlpsoncifar10andimagenet1k.png

  226. 2023-bachmann-figure6-powerlawincifar100losswhenconstrainingparametersordatasetsize.jpg

  227. 2023-bachmann-figure7-suprachinchilladatascalingformlpsoncifar100loss.jpg

  228. 2023-bachmann-figure8-mlparchitectureablations.png

  229. 2023-mitchell-figure2-2dvisualizationofannbeingexpandedbysenntobetterapproximatetheline.png

  230. 2023-mitchell-figure3-visualizationofsennlossoveradditionsforhalfmoonstoydataset.jpg

  231. 2022-grinsztajn-figure10-treesvsneuralnetson3regressiontasksusingnumericalfeaturesonmediumvslargedatasets.png

  232. 2022-grinsztajn-figure11-treesvsneuralnetson2classificationtasksusingallfeaturesonmediumvslargedatasets.png

  233. 2022-grinsztajn-figure12-treesvsneuralnetson5regressiontasksusingallfeaturesonmediumvslargedatasets.png

  234. 2022-hassid-figure2-contributionoftransformerattentionwhenablatedtomlbenchmarkperformance.jpg

  235. 2021-muller-figure7-fullyfusedfullyconnectednetworkspeedupongpu.jpg

  236. 2021-ni-figure2-vilmlpvstransformerbypretrainingdatafraction.png

  237. 2021-ni-figure3-scalingofmlpvilvsmlpviltinyattentionvstransformeronvisualquestionansweringaccuracy.png

  238. 2021-power-figure1-grokkinglearningcurves.jpg

  239. 2021-power-poster.png#openai

  240. 2021-zhao-figure4-mlpsoverfitbutcanberegularizedbyweightsharingandmultistagearchitecture.jpg

  241. 2021-zhao-multistagespachframeworkforcomparingmodularblocksofmlpsvscnnsvstransformers.png

  242. 2020-ota-figure1-densenetmlpschematicarchitecture.jpg

  243. 2020-ota-figure2-overallofenetarchitectureshematic.png

  244. 2014-montufar-figure1-binaryclassificationdecisionboundaryofshallowvsdeepneuralnetworkshowingdeeperequalssmoother.png

  245. 2014-pascanu-figure2-topologyofdeepnetworksinfoldingaroundaxislayerbylayer.png

  246. 2014-pascanu-figure3-spacefoldingof2dspaceassheetofpapermodeledbydeepneuralnetworks.png

  247. 1988-lang-figure3-densenetresidualarchitectureforneuralnetsolvingswissspiralproblem.jpg

  248. https://colab.research.google.com/github/murphyka/ml_colabs/blob/main/Simple_MLP_Visualization.ipynb

  249. https://cpldcpu.wordpress.com/2024/04/24/implementing-neural-networks-on-the-10-cent-risc-v-mcu-without-multiplier/

  250. https://cprimozic.net/blog/reverse-engineering-a-small-neural-network/

  251. https://fourmilab.ch/documents/commodore/BrainSim/

  252. https://github.com/thomasahle/fastchess

  253. https://jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

  254. https://thenumb.at/Neural-Graphics/

  255. https://transformer-circuits.pub/2024/jan-update/index.html#mnist-sparse

  256. https://www.lesswrong.com/posts/7fxusXdkMNmAhkAfc/finding-sparse-linear-connections-between-features-in-llms

  257. https://www.lesswrong.com/posts/K7AyY8LMrcKhwfbyj/no-really-attention-is-all-you-need-attention-can-do

  258. https://www.lesswrong.com/posts/YmkjnWtZGLbHRbzrP/transcoders-enable-fine-grained-interpretable-circuit

  259. https://www.lesswrong.com/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall

  260. https://www.lesswrong.com/s/5omSW4wNKbEvYsyje/p/GpSzShaaf8po4rcmA

  261. https://x.com/francoisfleuret/status/1714531085512544760

  262. https://x.com/stephenroller/status/1579993017234382849
