Bibliography:

  1. ‘NN sparsity’ tag

  2. 2:4 Sparse Llama: Smaller Models for Efficient GPU Inference

  3. The Super Weight in Large Language Models

  4. What Matters in Transformers? Not All Attention is Needed

  5. When Parts are Greater Than Sums: Individual LLM Components Can Outperform Full Models

  6. Pre-training Small Base LMs with Fewer Tokens

  7. Streamlining Redundant Layers to Compress Large Language Models

  8. The Unreasonable Ineffectiveness of the Deeper Layers

  9. Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression

  10. SliceGPT: Compress Large Language Models by Deleting Rows and Columns

  11. Weight subcloning: direct initialization of transformers using larger pretrained ones

  12. To grok or not to grok: Disentangling generalization and memorization on corrupted algorithmic datasets

  13. Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

  14. One Wide Feedforward is All You Need

  15. A Comparative Study between Full-Parameter and LoRA-based Fine-Tuning on Chinese Instruction Data for Instruction Following Large Language Model

  16. Fast as CHITA: Neural Network Pruning with Combinatorial Optimization

  17. Self-Compressing Neural Networks

  18. Pruning Compact ConvNets for Efficient Inference

  19. Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale

  20. Lottery Tickets on a Data Diet: Finding Initializations with Sparse Trainable Networks

  21. Heavy-tailed neuronal connectivity arises from Hebbian self-organization

  22. PPCD-GAN: Progressive Pruning and Class-Aware Distillation for Large-Scale Conditional GANs Compression

  23. The Combinatorial Brain Surgeon: Pruning Weights That Cancel One Another in Neural Networks

  24. Data-Efficient Structured Pruning via Submodular Optimization

  25. Sparsity Winning Twice: Better Robust Generalization from More Efficient Training

  26. Fortuitous Forgetting in Connectionist Networks

  27. How many degrees of freedom do we need to train deep networks: a loss landscape perspective

  28. Prune Once for All: Sparse Pre-Trained Language Models

  29. DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models

  30. HALP: Hardware-Aware Latency Pruning

  31. On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis

  32. Block Pruning For Faster Transformers

  33. Scaling Laws for Deep Learning

  34. A Winning Hand: Compressing Deep Networks Can Improve Out-Of-Distribution Robustness

  35. Chasing Sparsity in Vision Transformers: An End-to-End Exploration

  36. On Lottery Tickets and Minimal Task Representations in Deep Reinforcement Learning

  37. Sifting out the features by pruning: Are convolutional networks the winning lottery ticket of fully connected ones?

  38. Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch

  39. Postnatal connectomic development of inhibition in mouse barrel cortex

  40. ES-ENAS: Blackbox Optimization over Hybrid Spaces via Combinatorial and Continuous Evolution

  41. Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup

  42. A Primer in BERTology: What we know about how BERT works

  43. Bort: Optimal Subarchitecture Extraction For BERT

  44. Pruning Neural Networks at Initialization: Why are We Missing the Mark?

  45. Logarithmic Pruning is All You Need

  46. On the Predictability of Pruning Across Scales

  47. Progressive Skeletonization: Trimming more fat from a network at initialization

  48. Pruning neural networks without any data by iteratively conserving synaptic flow

  49. Movement Pruning: Adaptive Sparsity by Fine-Tuning

  50. Bayesian Bits: Unifying Quantization and Pruning

  51. Lite Transformer with Long-Short Range Attention

  52. On the Effect of Dropping Layers of Pre-trained Transformer Models

  53. Train-by-Reconnect: Decoupling Locations of Weights from their Values (LaPerm)

  54. Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

  55. What’s Hidden in a Randomly Weighted Neural Network?

  56. Sparse Networks from Scratch: Faster Training without Losing Performance

  57. Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP

  58. SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers

  59. Are 16 Heads Really Better than One?

  60. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

  61. Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask

  62. Stabilizing the Lottery Ticket Hypothesis

  63. The State of Sparsity in Deep Neural Networks

  64. Differential Contribution of Cortical Thickness, Surface Area, and Gyrification to Fluid and Crystallized Intelligence

  65. Efficient Training of BERT by Progressively Stacking

  66. A Closer Look at Structured Pruning for Neural Network Compression

  67. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

  68. Efficient Neural Audio Synthesis

  69. Recovering from Random Pruning: On the Plasticity of Deep Convolutional Neural Networks

  70. Learning to Prune Filters in Convolutional Neural Networks

  71. Faster gaze prediction with dense networks and Fisher pruning

  72. Automated Pruning for Deep Neural Network Compression

  73. Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method

  74. NeST: A Neural Network Synthesis Tool Based on a Grow-and-Prune Paradigm

  75. To prune, or not to prune: exploring the efficacy of pruning for model compression

  76. Bayesian Sparsification of Recurrent Neural Networks

  77. Structured Bayesian Pruning via Log-Normal Multiplicative Noise

  78. Exploring Sparsity in Recurrent Neural Networks

  79. Variational Dropout Sparsifies Deep Neural Networks

  80. Iterative Magnitude Pruning: Learning both Weights and Connections for Efficient Neural Networks

  81. Flat Minima

  82. Optimal Brain Surgeon and general network pruning

  83. Fault tolerance of pruned multilayer networks

  84. Using Relevance to Reduce Network Size Automatically

  85. Optimal Brain Damage

  86. Trading Off Compute in Training and Inference § Pruning

  87. 2024-chang-figure3-lotteryticketsemergeearlyintrainingandthengetupweighted.jpg

  88. 2020-rogers-table1-bertcompression.png

  89. 2020-rosenfeld-equation1-functionalformofdlscalingpruninglaw.png

  90. 2020-rosenfeld-figure1-relationshipbetweenpruningsparsificationandclassificationerrorincifar10cnnresnets.jpg

  91. 2020-rosenfeld-figure2-extrapolatedvsactualrelationshipbetweenpruningsparsificationandclassificationerrorincifar10cnnresnets.png

  92. 2020-rosenfeld-figure8-sweepingwidthparametercountofcifar10resnettofindoptimallylargemodelforbestpossibleprunedmodel.jpg

  93. https://cprimozic.net/blog/reverse-engineering-a-small-neural-network/

  94. https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms

  95. https://x.com/RamaswmySridhar/status/1621870497070981121
