‘NN sparsity’ directory

Gwern

‘NN sparsity’ directory

Neural nets are extremely ‘overparameterized’ in the sense that they have orders of magnitude more parameters than necessary to solve the problems they are trained on, as can be proven by the regular improvements in training⁠ smaller/faster but still performant networks, but also in directly creating smaller neural nets with similar or identical performance on those problems. Major techniques are: deleting parameters (pruning⁠)/reducing precision of the numeric encoding (quantizing)⁠/training a smaller network from scratch using the original large network somehow (distillation⁠).

Mysteriously, these smaller networks typically cannot be trained from scratch; performance gains can be obtained without the original data; models can be trained to imitate themselves in self-distillation; despite this indicating overfitting ought to be a major concern, they generalize well; and many of these smaller networks are in some sense already present in the original neural network. This is frequently taken to indicate some sort of ⁠blessing of scale⁠ in large NNs having smoother loss landscapes, which simple optimizers can successfully traverse to good optima no matter how hard the problem, as compared to smaller networks which may wind up ‘trapped’ at a bad place with no free parameters to let it slip around obstacles and find some way to improve (much less the loss landscape of equivalently powerful but extremely brittle encodings such as Brainf—k⁠ or X86 assembler programs). As well as their great theoretical interest—How can we train these small models directly? What does this tell us about how NNs work?—such smaller NNs are critical to practical real-world deployment to servers & smartphones at scale, the design of accelerator hardware supporting reduced-precision operations, and also are an interesting case of capability growth for AI risk: as soon as any NN exists which can achieve performance goal X, it is likely that a much more efficient NN (potentially orders of magnitude smaller or faster) can be created to achieve X thereafter. (These are merely one way that ⁠your software can be much faster⁠⁠.)

This tag covers some examples of NNs being compressed in size or FLOPs by anywhere from 50% to ~17,000% (an incomplete bibliography, merely papers I have noted during my reading).

Links

“Convolutional Differentiable Logic Gate Networks ”, Petersen et al 2024

Convolutional Differentiable Logic Gate Networks⁠

“LoRA vs Full Fine-Tuning: An Illusion of Equivalence ”, Shuttleworth et al 2024

LoRA vs Full Fine-tuning: An Illusion of Equivalence⁠

“On the Complexity of Neural Computation in Superposition ”, Adler & Shavit 2024

On the Complexity of Neural Computation in Superposition⁠

“GSoC 2024: Differentiable Logic for Interactive Systems and Generative Music ”

⁠GSoC 2024: Differentiable Logic for Interactive Systems and Generative Music

“High-Performance Deep Spiking Neural Networks With 0.3 Spikes per Neuron ”, Stanojevic et al 2024

High-performance deep spiking neural networks with 0.3 spikes per neuron⁠

“LoRA Learns Less and Forgets Less ”, Biderman et al 2024

LoRA Learns Less and Forgets Less⁠

“CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models ”, Lee et al 2024

CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models⁠

“Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers? ”, Jin et al 2024

Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?⁠

“ReFT: Representation Finetuning for Language Models ”, Wu et al 2024

ReFT: Representation Finetuning for Language Models⁠

“Mechanistic Design and Scaling of Hybrid Architectures ”, Poli et al 2024

Mechanistic Design and Scaling of Hybrid Architectures⁠

“LTE: Training Neural Networks from Scratch With Parallel Low-Rank Adapters ”, Huh et al 2024

LTE: Training Neural Networks from Scratch with Parallel Low-Rank Adapters⁠

“Scaling Laws for Fine-Grained Mixture of Experts ”, Krajewski et al 2024

Scaling Laws for Fine-Grained Mixture of Experts⁠

“Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet ”

⁠Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet⁠

“Exponentially Faster Language Modeling ”, Belcak & Wattenhofer 2023

Exponentially Faster Language Modeling⁠

“DiLoCo: Distributed Low-Communication Training of Language Models ”, Douillard et al 2023

DiLoCo: Distributed Low-Communication Training of Language Models⁠

“Language Models Are Super Mario (DARE): Absorbing Abilities from Homologous Models As a Free Lunch ”, Yu et al 2023

Language Models are Super Mario (DARE): Absorbing Abilities from Homologous Models as a Free Lunch⁠

“ProSG: Using Prompt Synthetic Gradients to Alleviate Prompt Forgetting of RNN-Like Language Models ”, Luo et al 2023

ProSG: Using Prompt Synthetic Gradients to Alleviate Prompt Forgetting of RNN-like Language Models⁠

“An Exact Mapping from ReLU Networks to Spiking Neural Networks ”, Stanojevic et al 2023

An exact mapping from ReLU networks to spiking neural networks⁠

“The Impact of Depth and Width on Transformer Language Model Generalization ”, Petty et al 2023

The Impact of Depth and Width on Transformer Language Model Generalization⁠

“Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time ”, Liu et al 2023

Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time⁠

“Fast Feedforward Networks ”, Belcak & Wattenhofer 2023

Fast Feedforward Networks⁠

“Any Deep ReLU Network Is Shallow ”, Villani & Schoots 2023

Any Deep ReLU Network is Shallow⁠

“Cerebras Architecture Deep Dive: First Look Inside the HW/SW Co-Design for Deep Learning [Updated] ”, Lie 2023

Cerebras Architecture Deep Dive: First Look Inside the HW/SW Co-Design for Deep Learning [Updated]⁠

“JaxPruner: A Concise Library for Sparsity Research ”, Lee et al 2023

JaxPruner: A concise library for sparsity research⁠

“Reusing Deep Neural Network Models through Model Re-Engineering ”, Qi et al 2023

Reusing Deep Neural Network Models through Model Re-engineering⁠

“Accelerating Large GPT Training With Sparse Pre-Training and Dense Fine-Tuning ”, Thangarasa 2023

Accelerating Large GPT Training with Sparse Pre-Training and Dense Fine-Tuning⁠

“MUX-PLMs: Pre-Training Language Models With Data Multiplexing ”, Murahari et al 2023

MUX-PLMs: Pre-training Language Models with Data Multiplexing⁠

“DataMUX: Data Multiplexing for Neural Networks ”, Murahari et al 2023

DataMUX: Data Multiplexing for Neural Networks⁠

“Deep Differentiable Logic Gate Networks ”, Petersen et al 2022

Deep Differentiable Logic Gate Networks⁠

“The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers ”, Li et al 2022

The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers⁠

“Neural Net Sparsity ”, Gwern 2022

⁠Neural Net Sparsity⁠

“Noise Transforms Feed-Forward Networks into Sparse Coding Networks ”, Anonymous 2022

Noise Transforms Feed-Forward Networks into Sparse Coding Networks⁠

“Exploring Low Rank Training of Deep Neural Networks ”, Kamalakara et al 2022

Exploring Low Rank Training of Deep Neural Networks⁠

“Monolith: Real Time Recommendation System With Collisionless Embedding Table ”, Liu et al 2022

Monolith: Real Time Recommendation System With Collisionless Embedding Table⁠

“More ConvNets in the 2020s: Scaling up Kernels Beyond 51×51 Using Sparsity (SLaK) ”, Liu et al 2022

More ConvNets in the 2020s: Scaling up Kernels Beyond 51×51 using Sparsity (SLaK)⁠

“Building Machine Translation Systems for the Next Thousand Languages ”, Bapna et al 2022

Building Machine Translation Systems for the Next Thousand Languages⁠

“Monarch: Expressive Structured Matrices for Efficient and Accurate Training ”, Dao et al 2022

Monarch: Expressive Structured Matrices for Efficient and Accurate Training⁠

“Efficient Language Modeling With Sparse All-MLP ”, Yu et al 2022

Efficient Language Modeling with Sparse All-MLP⁠

“NeuPL: Neural Population Learning ”, Liu et al 2022

NeuPL: Neural Population Learning⁠

“Datamodels: Predicting Predictions from Training Data ”, Ilyas et al 2022

Datamodels: Predicting Predictions from Training Data⁠

“Spiking Neural Networks and Their Applications: A Review ”, Yamazaki et al 2022

Spiking Neural Networks and Their Applications: A Review⁠

“Persia: An Open, Hybrid System Scaling Deep Learning-Based Recommenders up to 100 Trillion Parameters ”, Lian et al 2021

Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters⁠

“EvilModel: Hiding Malware Inside of Neural Network Models ”, Wang et al 2021

EvilModel: Hiding Malware Inside of Neural Network Models⁠

“LoRA: Low-Rank Adaptation of Large Language Models ”, Hu et al 2021

LoRA: Low-Rank Adaptation of Large Language Models⁠

“On the Distribution, Sparsity, and Inference-Time Quantization of Attention Values in Transformers ”, Ji et al 2021

On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers⁠

“The Neural Basis of Intelligence in Fine-Grained Cortical Topographies ”, Feilong et al 2021

The neural basis of intelligence in fine-grained cortical topographies⁠

“Clusterability in Neural Networks ”, Filan et al 2021

Clusterability in Neural Networks⁠

“Sparsity in Deep Learning: Pruning and Growth for Efficient Inference and Training in Neural Networks ”, Hoefler et al 2021

Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks⁠

“Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning ”, Aghajanyan et al 2020

⁠Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning⁠

“Scaling down Deep Learning ”, Greydanus 2020

Scaling down Deep Learning

“Extreme Model Compression for On-Device Natural Language Understanding ”, Sathyendra et al 2020

Extreme Model Compression for On-device Natural Language Understanding⁠

“Training Independent Subnetworks for Robust Prediction ”, Havasi et al 2020

Training independent subnetworks for robust prediction⁠

“EventProp: Event-Based Backpropagation Can Compute Exact Gradients for Spiking Neural Networks ”, Wunderlich & Pehle 2020

EventProp: Event-Based Backpropagation can compute Exact Gradients for Spiking Neural Networks⁠

“On Linear Identifiability of Learned Representations ”, Roeder et al 2020

On Linear Identifiability of Learned Representations⁠

“Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited ”, Maddox et al 2020

Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited⁠

“Bayesian Deep Learning and a Probabilistic Perspective of Generalization ”, Wilson & Izmailov 2020

Bayesian Deep Learning and a Probabilistic Perspective of Generalization⁠

“Neural Arithmetic Units ”, Madsen & Johansen 2020

Neural Arithmetic Units⁠

“Linear Mode Connectivity and the Lottery Ticket Hypothesis ”, Frankle et al 2019

Linear Mode Connectivity and the Lottery Ticket Hypothesis⁠

“Learning to Seek: Autonomous Source Seeking With Deep Reinforcement Learning Onboard a Nano Drone Microcontroller ”, Duisterhof et al 2019

Learning to Seek: Autonomous Source Seeking with Deep Reinforcement Learning Onboard a Nano Drone Microcontroller⁠

“Does Learning Require Memorization? A Short Tale about a Long Tail ”, Feldman 2019

Does Learning Require Memorization? A Short Tale about a Long Tail⁠

“Weight Agnostic Neural Networks ”, Gaier & Ha 2019

Weight Agnostic Neural Networks⁠

“StyleNAS: An Empirical Study of Neural Architecture Search to Uncover Surprisingly Fast End-To-End Universal Style Transfer Networks ”, An et al 2019

StyleNAS: An Empirical Study of Neural Architecture Search to Uncover Surprisingly Fast End-to-End Universal Style Transfer Networks⁠

“EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks ”, Tan & Le 2019

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks⁠

“Superposition of Many Models into One ”, Cheung et al 2019

Superposition of many models into one⁠

“Playing Atari With Six Neurons ”, Cuccu et al 2018

Playing Atari with Six Neurons⁠

“Measuring the Intrinsic Dimension of Objective Landscapes ”, Li et al 2018

Measuring the Intrinsic Dimension of Objective Landscapes⁠

“SqueezeNext: Hardware-Aware Neural Network Design ”, Gholami et al 2018

SqueezeNext: Hardware-Aware Neural Network Design⁠

“Wide Compression: Tensor Ring Nets ”, Wang et al 2018

Wide Compression: Tensor Ring Nets⁠

“Intriguing Properties of Randomly Weighted Networks: Generalizing While Learning Next to Nothing ”, Rosenfeld & Tsotsos 2018

Intriguing Properties of Randomly Weighted Networks: Generalizing while Learning Next to Nothing⁠

“Fix Your Classifier: the Marginal Value of Training the Last Weight Layer ”, Hoffer et al 2018

Fix your classifier: the marginal value of training the last weight layer⁠

“Learning Compact Recurrent Neural Networks With Block-Term Tensor Decomposition ”, Ye et al 2017

Learning Compact Recurrent Neural Networks with Block-Term Tensor Decomposition⁠

“3D Semantic Segmentation With Submanifold Sparse Convolutional Networks ”, Graham et al 2017

3D Semantic Segmentation with Submanifold Sparse Convolutional Networks⁠

“XUnit: Learning a Spatial Activation Function for Efficient Image Restoration ”, Kligvasser et al 2017

xUnit: Learning a Spatial Activation Function for Efficient Image Restoration⁠

“Natural Language Processing With Small Feed-Forward Networks ”, Botha et al 2017

Natural Language Processing with Small Feed-Forward Networks⁠

“ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices ”, Zhang et al 2017

ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices⁠

“Submanifold Sparse Convolutional Networks ”, Graham & Maaten 2017

Submanifold Sparse Convolutional Networks⁠

“Shake-Shake Regularization of 3-Branch Residual Networks ”, Gastaldi 2017

Shake-Shake regularization of 3-branch residual networks⁠

“Using the Output Embedding to Improve Language Models ”, Press & Wolf 2016

Using the Output Embedding to Improve Language Models⁠

“Deep Residual Learning for Image Recognition ”, He et al 2015

Deep Residual Learning for Image Recognition⁠

“Tensorizing Neural Networks ”, Novikov et al 2015

Tensorizing Neural Networks⁠

“Eight Pairs of Descending Visual Neurons in the Dragonfly Give Wing Motor Centers Accurate Population Vector of Prey Direction ”, Gonzalez-Bellido et al 2013

Eight pairs of descending visual neurons in the dragonfly give wing motor centers accurate population vector of prey direction⁠

“The Cat Is out of the Bag: Cortical Simulations With 10⁹ Neurons, 10¹³ Synapses ”, Ananthanarayanan et al 2009

The cat is out of the bag: cortical simulations with 10⁹ neurons, 10¹³ synapses⁠

“On the Computational Power of Threshold Circuits With Sparse Activity ”, Uchizawa et al 2006

On the Computational Power of Threshold Circuits with Sparse Activity⁠

“Networks of Spiking Neurons: The Third Generation of Neural Network Models ”, Maass 1997

Networks of spiking neurons: The third generation of neural network models⁠

“Characteristics of Sparsely Encoded Associative Memory ”, Amari 1989

Characteristics of sparsely encoded associative memory⁠

“[2110.08152] Kronecker Decomposition for GPT Compression ”

⁠[2110.08152] Kronecker Decomposition for GPT Compression⁠ :

View PDF:

⁠/doc/www/arxiv.org/ae4a089397d3b8667469ba90ca313ead5a4bdcb0.pdf⁠

“Higher Accuracy on Vision Models With EfficientNet-Lite ”

⁠Higher accuracy on vision models with EfficientNet-Lite⁠ :

View HTML:

⁠/doc/www/blog.tensorflow.org/5190b62fb9f2d53675a2f934d01f87ef413057a8.html⁠

“Something Weird Is Happening With LLMs and Chess ”, Dynomight 2025

⁠Something weird is happening with LLMs and chess⁠

“Delivering Real-Time AI in the Palm of Your Hand ”

⁠Delivering real-time AI in the palm of your hand⁠ :

View HTML:

⁠/doc/www/engineering.fb.com/65910fdbbc7e7f5970d2ecf96c18a0eb77eab3cf.html⁠

“Sparsity-Aware Deep Learning Inference Runtime for CPUs ”

Sparsity-aware deep learning inference runtime for CPUs⁠

“Neuralmagic/sparseml: Libraries for Applying Sparsification Recipes to Neural Networks With a Few Lines of Code, Enabling Faster and Smaller Models ”

neuralmagic/sparseml: Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models⁠

“An Estimation of the Absolute Number of Axons Indicates That Human Cortical Areas Are Sparsely Connected ”

An estimation of the absolute number of axons indicates that human cortical areas are sparsely connected⁠

“Creating a 17 KB Style Transfer Model With Layer Pruning and Quantization ”, Toole 2025

Creating a 17 KB style transfer model with layer pruning and quantization⁠

“BERT-Large: Prune Once for DistilBERT Inference Performance ”

⁠BERT-Large: Prune Once for DistilBERT Inference Performance :

View HTML:

⁠/doc/www/neuralmagic.com/4e89fd35918a0a8e03c1d63ee7c5af3e1d76e968.html⁠

“Circuits in Superposition: Compressing Many Small Neural Networks into One ”

⁠Circuits in Superposition: Compressing many small neural networks into one⁠ :

View HTML:

⁠/doc/www/www.greaterwrong.com/56cb7ccd134aaa922ba1f32126ca7c67fc25fb15.html#Read_in_interference⁠

“Measuring the Intrinsic Dimension of Objective Landscapes [Video] ”

⁠Measuring the Intrinsic Dimension of Objective Landscapes [video]⁠ :

⁠https://www.youtube.com/watch?v=uSZWeRADTFI#uber⁠

Sort By Magic

Annotations sorted by machine learning into ⁠inferred 'tags'⁠. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.

Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.

Autoencoder § Sparse autoencoder (SAE)⁠ :

https://en.wikipedia.org/wiki/Autoencoder#Sparse_autoencoder_(SAE)⁠

Miscellaneous

Bibliography

https://arxiv.org/abs/2403.17844: “Mechanistic Design and Scaling of Hybrid Architectures ”⁠, Michael Poli, Armin W. Thomas, Eric Nguyen …, Pragaash Ponnusamy, Björn Deiseroth, Kristian Kersting⁠, Taiji Suzuki, Brian Hie, Stefano Ermon⁠, Christopher Ré⁠, Ce Zhang, Stefano Massaroli
link-bibliography⁠
https://arxiv.org/abs/2311.10770: “Exponentially Faster Language Modeling ”⁠, Peter Belcak, Roger Wattenhofer⁠
link-bibliography⁠
https://www.sciencedirect.com/science/article/pii/S0893608023005051: “An Exact Mapping from ReLU Networks to Spiking Neural Networks ”⁠, Ana Stanojevic, Stanisław Woźniak, Guillaume Bellec …, Giovanni Cherubini, Angeliki Pantazi, Wulfram Gerstner⁠
link-bibliography⁠
https://arxiv.org/abs/2310.17157: “Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time ”⁠, Zichang Liu, Jue Wang⁠, ⁠Tri Dao …, Tianyi Zhou, Binhang Yuan, Zhao Song⁠, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re⁠, Beidi Chen
link-bibliography⁠
https://arxiv.org/abs/2308.14711: “Fast Feedforward Networks ”⁠, Peter Belcak, Roger Wattenhofer⁠
link-bibliography⁠
https://arxiv.org/abs/2302.12441: “MUX-PLMs: Pre-Training Language Models With Data Multiplexing ”⁠, Vishvak Murahari, Ameet Deshpande, Carlos E. Jimenez …, Izhak Shafran, Mingqiu Wang, Yuan Cao⁠, Karthik Narasimhan
link-bibliography⁠
https://arxiv.org/abs/2210.06313#google: “The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers ”⁠, Zonglin Li, Chong You, Srinadh Bhojanapalli …, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar⁠
link-bibliography⁠
https://arxiv.org/abs/2207.03620: “More ConvNets in the 2020s: Scaling up Kernels Beyond 51×51 Using Sparsity (SLaK) ”⁠, Shiwei Liu, Tianlong Chen, Xiaohan Chen …, Xuxi Chen, Qiao Xiao, Boqian Wu, Mykola Pechenizkiy, Decebal Mocanu, Zhangyang Wang
link-bibliography⁠
https://arxiv.org/abs/2205.03983#google: “Building Machine Translation Systems for the Next Thousand Languages ”⁠, Ankur Bapna, Isaac Caswell, Julia Kreutzer …, Orhan Firat, Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey, Theresa Breiner, Vera Axelrod, Jason Riesa, Yuan Cao⁠, Mia Xu Chen, Klaus Macherey, Maxim Krikun, Pidong Wang, Alexander Gutkin, Apurva Shah, Yanping Huang, Zhifeng Chen, Yonghui Wu⁠, Macduff Hughes
link-bibliography⁠
https://arxiv.org/abs/2204.00595: “Monarch: Expressive Structured Matrices for Efficient and Accurate Training ”⁠, ⁠Tri Dao, Beidi Chen, Nimit Sohoni …, Arjun Desai, Michael Poli, Jessica Grogan, Alexander Liu, Aniruddh Rao, Atri Rudra, Christopher Ré⁠
link-bibliography⁠
https://arxiv.org/abs/2203.06850: “Efficient Language Modeling With Sparse All-MLP ”⁠, Ping Yu, Mikel Artetxe⁠, Myle Ott …, Sam Shleifer, Hongyu Gong, Ves Stoyanov, Xian Li
link-bibliography⁠
https://arxiv.org/abs/2202.07415#deepmind: “NeuPL: Neural Population Learning ”⁠, Siqi Liu, Luke Marris, Daniel Hennes …, Josh Merel, Nicolas Heess⁠, ⁠Thore Graepel
link-bibliography⁠
https://arxiv.org/abs/2106.09685#microsoft: “LoRA: Low-Rank Adaptation of Large Language Models ”⁠, Edward J. Hu, Yelong Shen, Phillip Wallis …, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
link-bibliography⁠
https://arxiv.org/abs/1905.11946#google: “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks ”⁠, Mingxing Tan, Quoc V. Le⁠
link-bibliography⁠
https://arxiv.org/abs/1803.10615: “SqueezeNext: Hardware-Aware Neural Network Design ”⁠, Amir Gholami, Kiseok Kwon, Bichen Wu …, Zizheng Tai, Xiangyu Yue, Peter Jin, Sicheng Zhao, Kurt Keutzer⁠
link-bibliography⁠
https://arxiv.org/abs/1512.03385#microsoft: “Deep Residual Learning for Image Recognition ”⁠, Kaiming He⁠, Xiangyu Zhang, Shaoqing Ren, Jian Sun
link-bibliography⁠