 See Also

Links
 “Sheared LLaMA: Accelerating Language Model Pretraining via Structured Pruning”, Xia et al 2023
 “A Comparative Study between Full-Parameter and LoRA-based Fine-Tuning on Chinese Instruction Data for Instruction Following Large Language Model”, Sun et al 2023
 “Fast As CHITA: Neural Network Pruning With Combinatorial Optimization”, Benbaki et al 2023
 “Pruning Compact ConvNets for Efficient Inference”, Ghosh et al 2023
 “Lottery Tickets on a Data Diet: Finding Initializations With Sparse Trainable Networks”, Paul et al 2022
 “PPCD-GAN: Progressive Pruning and Class-Aware Distillation for Large-Scale Conditional GANs Compression”, Vo et al 2022
 “Data-Efficient Structured Pruning via Submodular Optimization”, Halabi et al 2022
 “The Combinatorial Brain Surgeon: Pruning Weights That Cancel One Another in Neural Networks”, Yu et al 2022
 “Sparsity Winning Twice: Better Robust Generalization from More Efficient Training”, Chen et al 2022
 “Fortuitous Forgetting in Connectionist Networks”, Zhou et al 2022
 “How Many Degrees of Freedom Do We Need to Train Deep Networks: a Loss Landscape Perspective”, Larsen et al 2021
 “Prune Once for All: Sparse Pre-Trained Language Models”, Zafrir et al 2021
 “DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models”, Chen et al 2021
 “HALP: Hardware-Aware Latency Pruning”, Shen et al 2021
 “On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis”, Lai et al 2021
 “Block Pruning For Faster Transformers”, Lagunas et al 2021
 “Scaling Laws for Deep Learning”, Rosenfeld 2021
 “A Winning Hand: Compressing Deep Networks Can Improve Out-Of-Distribution Robustness”, Diffenderfer et al 2021
 “Chasing Sparsity in Vision Transformers: An End-to-End Exploration”, Chen et al 2021
 “On Lottery Tickets and Minimal Task Representations in Deep Reinforcement Learning”, Vischer et al 2021
 “Sifting out the Features by Pruning: Are Convolutional Networks the Winning Lottery Ticket of Fully Connected Ones?”, Pellegrini & Biroli 2021
 “Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch”, Zhou et al 2021
 “Postnatal Connectomic Development of Inhibition in Mouse Barrel Cortex”, Gour et al 2021
 “ES-ENAS: Black-box Optimization over Hybrid Spaces via Combinatorial and Continuous Evolution”, Song et al 2021
 “A Primer in BERTology: What We Know about How BERT Works”, Rogers et al 2020
 “Bort: Optimal Subarchitecture Extraction For BERT”, Wynter & Perry 2020
 “Pruning Neural Networks at Initialization: Why Are We Missing the Mark?”, Frankle et al 2020
 “Logarithmic Pruning Is All You Need”, Orseau et al 2020
 “On the Predictability of Pruning Across Scales”, Rosenfeld et al 2020
 “Progressive Skeletonization: Trimming More Fat from a Network at Initialization”, Jorge et al 2020
 “Pruning Neural Networks without Any Data by Iteratively Conserving Synaptic Flow”, Tanaka et al 2020
 “Movement Pruning: Adaptive Sparsity by Fine-Tuning”, Sanh et al 2020
 “Bayesian Bits: Unifying Quantization and Pruning”, Baalen et al 2020
 “Lite Transformer With Long-Short Range Attention”, Wu et al 2020
 “On the Effect of Dropping Layers of Pretrained Transformer Models”, Sajjad et al 2020
 “Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers”, Li et al 2020
 “Sparse Networks from Scratch: Faster Training without Losing Performance”, Dettmers & Zettlemoyer 2019
 “Playing the Lottery With Rewards and Multiple Languages: Lottery Tickets in RL and NLP”, Yu et al 2019
 “SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers”, Fedorov et al 2019
 “Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned”, Voita et al 2019
 “Stabilizing the Lottery Ticket Hypothesis”, Frankle et al 2019
 “The State of Sparsity in Deep Neural Networks”, Gale et al 2019
 “Differential Contribution of Cortical Thickness, Surface Area, and Gyrification to Fluid and Crystallized Intelligence”, Tadayon et al 2019
 “A Closer Look at Structured Pruning for Neural Network Compression”, Crowley et al 2018
 “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks”, Frankle & Carbin 2018
 “Efficient Neural Audio Synthesis”, Kalchbrenner et al 2018
 “Recovering from Random Pruning: On the Plasticity of Deep Convolutional Neural Networks”, Mittal et al 2018
 “Learning to Prune Filters in Convolutional Neural Networks”, Huang et al 2018
 “Faster Gaze Prediction With Dense Networks and Fisher Pruning”, Theis et al 2018
 “Automated Pruning for Deep Neural Network Compression”, Manessi et al 2017
 “Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method”, Sun et al 2017
 “NeST: A Neural Network Synthesis Tool Based on a Grow-and-Prune Paradigm”, Dai et al 2017
 “To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression”, Zhu & Gupta 2017
 “Bayesian Sparsification of Recurrent Neural Networks”, Lobacheva et al 2017
 “Structured Bayesian Pruning via Log-Normal Multiplicative Noise”, Neklyudov et al 2017
 “Exploring Sparsity in Recurrent Neural Networks”, Narang et al 2017
 “Variational Dropout Sparsifies Deep Neural Networks”, Molchanov et al 2017
 “Iterative Magnitude Pruning: Learning Both Weights and Connections for Efficient Neural Networks”, Han et al 2015
 “Flat Minima”, Hochreiter & Schmidhuber 1997
 “Optimal Brain Surgeon and General Network Pruning”, Hassibi et al 1993
 “Fault Tolerance of Pruned Multilayer Networks”, Segee & Carter 1991
 “Using Relevance to Reduce Network Size Automatically”, Mozer & Smolensky 1989
 “Optimal Brain Damage”, LeCun et al 1989
 Sort By Magic
 Wikipedia
 Miscellaneous
 Link Bibliography
Sort By Magic
Annotations sorted by machine learning into inferred ‘tags’. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The ‘sorted’ list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.
subnetwork
sparsity
net-pruning
neural-pruning
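The greedy nearest-neighbor ordering described above can be sketched as follows. This is a minimal illustration, not the site’s actual implementation; the embeddings and the `magic_sort` helper are hypothetical toy stand-ins, and similarity is measured with plain cosine similarity:

```python
# Sketch of "sort by magic": starting from the newest annotation, greedily
# walk to the most-similar remaining embedding, producing a topic progression.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def magic_sort(embeddings):
    """Greedy nearest-neighbor ordering, beginning with the first (newest) item."""
    remaining = list(range(len(embeddings)))
    order = [remaining.pop(0)]  # start from the newest annotation
    while remaining:
        last = embeddings[order[-1]]
        nxt = max(remaining, key=lambda i: cosine(last, embeddings[i]))
        remaining.remove(nxt)
        order.append(nxt)
    return order

# Toy 2-D "embeddings": items 0 and 2 are similar; item 1 is the odd one out.
vecs = [(1.0, 0.1), (0.0, 1.0), (0.9, 0.2)]
print(magic_sort(vecs))  # similar items end up adjacent: [0, 2, 1]
```

A real pipeline would use high-dimensional text embeddings and would additionally cluster the resulting chain into labeled sections, but the adjacency-by-similarity idea is the same.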
Wikipedia
Miscellaneous

/doc/ai/nn/sparsity/pruning/2020rosenfeldequation1functionalformofdlscalingpruninglaw.png

/doc/ai/nn/sparsity/pruning/2020rogerstable1bertcompression.png

https://cprimozic.net/blog/reverseengineeringasmallneuralnetwork/

https://magazine.sebastianraschka.com/p/practicaltipsforfinetuningllms

https://twitter.com/RamaswmySridhar/status/1621870497070981121
Link Bibliography

https://arxiv.org/abs/2310.06694
: “Sheared LLaMA: Accelerating Language Model Pretraining via Structured Pruning”, Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen 
https://arxiv.org/abs/2202.09844
: “Sparsity Winning Twice: Better Robust Generalization from More Efficient Training”, Tianlong Chen, Zhenyu Zhang, Pengjun Wang, Santosh Balachandra, Haoyu Ma, Zehao Wang, Zhangyang Wang 
https://arxiv.org/abs/2111.05754
: “Prune Once for All: Sparse Pre-Trained Language Models”, Ofir Zafrir, Ariel Larey, Guy Boudoukh, Haihao Shen, Moshe Wasserblat
https://arxiv.org/abs/2111.00160
: “DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models”, Xuxi Chen, Tianlong Chen, Yu Cheng, Weizhu Chen, Zhangyang Wang, Ahmed Hassan Awadallah
https://arxiv.org/abs/2108.07686
: “Scaling Laws for Deep Learning”, Jonathan S. Rosenfeld 
https://arxiv.org/abs/2106.04533
: “Chasing Sparsity in Vision Transformers: An End-to-End Exploration”, Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, Zhangyang Wang
https://arxiv.org/abs/2006.10621
: “On the Predictability of Pruning Across Scales”, Jonathan S. Rosenfeld, Jonathan Frankle, Michael Carbin, Nir Shavit 
https://arxiv.org/abs/2004.03844
: “On the Effect of Dropping Layers of Pretrained Transformer Models”, Hassan Sajjad, Fahim Dalvi, Nadir Durrani, Preslav Nakov 
https://arxiv.org/abs/1903.01611
: “Stabilizing the Lottery Ticket Hypothesis”, Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin 
https://arxiv.org/abs/1902.09574
: “The State of Sparsity in Deep Neural Networks”, Trevor Gale, Erich Elsen, Sara Hooker 
https://arxiv.org/abs/1810.04622
: “A Closer Look at Structured Pruning for Neural Network Compression”, Elliot J. Crowley, Jack Turner, Amos Storkey, Michael O’Boyle 
https://arxiv.org/abs/1801.10447
: “Recovering from Random Pruning: On the Plasticity of Deep Convolutional Neural Networks”, Deepak Mittal, Shweta Bhardwaj, Mitesh M. Khapra, Balaraman Ravindran 
1993hassibi.pdf
: “Optimal Brain Surgeon and General Network Pruning”, Babak Hassibi, David G. Stork, Gregory J. Wolff