Neural nets are extremely ‘overparameterized’ in the sense that they have orders of magnitude more parameters than necessary to solve the problems they are trained on, as demonstrated both by the regular improvements in training smaller/faster but still performant networks and by directly creating smaller neural nets with similar or identical performance on those problems. The major techniques are: deleting parameters (pruning), reducing the precision of the numeric encoding (quantization), and training a smaller network from scratch using the original large network somehow (distillation).
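To make the first two techniques concrete, here is a minimal illustrative sketch (plain NumPy; the function names are my own, not from any library) of unstructured magnitude pruning and symmetric int8 quantization applied to a random weight matrix:

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Unstructured magnitude pruning: zero out the smallest-magnitude weights."""
    w = weights.copy()
    k = int(np.ceil(w.size * sparsity))  # number of weights to delete
    if k == 0:
        return w
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    w[np.abs(w) <= threshold] = 0.0
    return w

def quantize_int8(weights):
    """Symmetric linear quantization of float32 weights to int8.
    Returns the int8 codes and the scale; dequantize with q * scale."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)

pruned = magnitude_prune(w, sparsity=0.9)     # ≥90% of entries now exactly zero
q, scale = quantize_int8(w)                   # 4 bytes/weight → 1 byte/weight
recon = q.astype(np.float32) * scale          # rounding error bounded by scale/2
```

In a real pipeline the network would then be fine-tuned after pruning, and the quantized weights stored/executed in int8; this sketch only shows why the storage drops 4× (int8 vs float32) and ~10× (zeroed weights need not be stored in sparse formats).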
Mysteriously, these smaller networks typically cannot be trained from scratch; performance gains can be obtained without the original data; models can be trained to imitate themselves in self-distillation; despite all this indicating that overfitting ought to be a major concern, they generalize well; and many of these smaller networks are in some sense already present in the original neural network. This is frequently taken to indicate some sort of blessing of scale: large NNs have smoother loss landscapes, which simple optimizers can successfully traverse to good optima no matter how hard the problem, as compared to smaller networks, which may wind up ‘trapped’ at a bad place with no free parameters to let them slip around obstacles and find some way to improve (much less the loss landscapes of equivalently powerful but extremely brittle encodings such as Brainf—k or assembler programs). As well as their great theoretical interest (How can we train these small models directly? What does this tell us about how NNs work?), such smaller NNs are critical to practical real-world deployment to servers & smartphones at scale and to the design of accelerator hardware supporting reduced-precision operations, and they are also an interesting case of capability growth for AI risk: as soon as any NN exists which can achieve performance goal X, it is likely that a much more efficient NN (potentially orders of magnitude smaller or faster) can be created to achieve X thereafter. (These are merely one way that your software can be much faster.)
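The self-distillation point is easiest to see in the soft-label term of the standard knowledge-distillation objective (Hinton et al 2015): the student is trained on the teacher's temperature-softened output distribution rather than the hard labels. A toy NumPy sketch (illustrative only, not any particular paper's implementation):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened distribution,
    scaled by T^2 so its gradient magnitude stays comparable as T varies."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    return -(temperature ** 2) * np.mean(np.sum(p_teacher * log_p_student, axis=-1))

teacher = np.array([[4.0, 1.0, -2.0]])                     # a confident teacher's logits
matched = distillation_loss(teacher, teacher)              # student imitating the teacher
mismatched = distillation_loss(np.zeros((1, 3)), teacher)  # uninformative uniform student
```

The loss is minimized exactly when the student reproduces the teacher's softened distribution, which is why a small network can be trained from the large network's outputs alone, with no access to the original labels or even the original data.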
Below are some examples of NNs being compressed in size or FLOPs by anywhere from 50% to ~17,000% (an incomplete bibliography: merely papers I have noted during my reading).
See Also
Links
“MUX-PLMs: Pretraining Language Models With Data Multiplexing”, Murahari et al 2023
“MUX-PLMs: Pretraining Language Models with Data Multiplexing”, 2023-02-24 ( ; similar; bibliography)
“DataMUX: Data Multiplexing for Neural Networks”, Et Al 2023
“DataMUX: Data Multiplexing for Neural Networks”, 2023-01-13 ( ; backlinks; similar)
“Noise Transforms Feed-Forward Networks into Sparse Coding Networks”, 2022
“Noise Transforms Feed-Forward Networks into Sparse Coding Networks”, 2022-09-29 ( ; backlinks; similar)
“Exploring Low Rank Training of Deep Neural Networks”, Et Al 2022
“Exploring Low Rank Training of Deep Neural Networks”, 2022-09-27 (similar)
“Monolith: Real Time Recommendation System With Collisionless Embedding Table”, Et Al 2022
“Monolith: Real Time Recommendation System With Collisionless Embedding Table”, 2022-09-16 ( ; similar)
“More ConvNets in the 2020s: Scaling up Kernels Beyond 51×51 Using Sparsity (SLaK)”, Et Al 2022
“More ConvNets in the 2020s: Scaling up Kernels Beyond 51×51 using Sparsity (SLaK)”, 2022-07-07 (similar; bibliography)
“Building Machine Translation Systems for the Next Thousand Languages”, Et Al 2022
“Building Machine Translation Systems for the Next Thousand Languages”, 2022-05-09 ( ; similar; bibliography)
“Monarch: Expressive Structured Matrices for Efficient and Accurate Training”, Et Al 2022
“Monarch: Expressive Structured Matrices for Efficient and Accurate Training”, 2022-04-01 ( ; similar; bibliography)
“NeuPL: Neural Population Learning”, Liu et al 2022
“NeuPL: Neural Population Learning”, 2022-02-15 ( ; similar; bibliography)
“Datamodels: Predicting Predictions from Training Data”, Et Al 2022
“Datamodels: Predicting Predictions from Training Data”, 2022-02-01 (similar)
“Spiking Neural Networks and Their Applications: A Review”, Et Al 2022
“Spiking Neural Networks and Their Applications: A Review”, 2022 ( ; similar)
“Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters”, Et Al 2021
“Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters”, 2021-11-10 ( ; backlinks; similar)
“EvilModel: Hiding Malware Inside of Neural Network Models”, Et Al 2021
“EvilModel: Hiding Malware Inside of Neural Network Models”, 2021-07-19 ( ; similar)
“LoRA: Low-Rank Adaptation of Large Language Models”, Hu et al 2021
“LoRA: Low-Rank Adaptation of Large Language Models”, 2021-06-17 ( ; similar; bibliography)
“Clusterability in Neural Networks”, Et Al 2021
“Clusterability in Neural Networks”, 2021-03-04 (similar)
“Sparsity in Deep Learning: Pruning and Growth for Efficient Inference and Training in Neural Networks”, Et Al 2021
“Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks”, 2021-01-31 (similar)
“Scaling down Deep Learning”, Greydanus 2020
“Scaling down Deep Learning”, 2020-12-01 ( ; backlinks; similar; bibliography)
“Extreme Model Compression for On-device Natural Language Understanding”, Et Al 2020
“Extreme Model Compression for On-device Natural Language Understanding”, 2020-11-30 (similar)
“Neural Arithmetic Units”, 2020
“Neural Arithmetic Units”, 2020-01-14 ( ; similar)
“Learning to Seek: Autonomous Source Seeking With Deep Reinforcement Learning Onboard a Nano Drone Microcontroller”, Et Al 2019
“Learning to Seek: Autonomous Source Seeking with Deep Reinforcement Learning Onboard a Nano Drone Microcontroller”, 2019-09-25 ( ; similar)
“Weight Agnostic Neural Networks”, 2019
“Weight Agnostic Neural Networks”, 2019-06-11 ( ; similar)
“StyleNAS: An Empirical Study of Neural Architecture Search to Uncover Surprisingly Fast End-to-End Universal Style Transfer Networks”, An et al 2019
“StyleNAS: An Empirical Study of Neural Architecture Search to Uncover Surprisingly Fast End-to-End Universal Style Transfer Networks”, 2019-06-06 (backlinks; similar)
“EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”, Tan & Le 2019
“EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”, 2019-05-28 ( ; similar; bibliography)
“Superposition of Many Models into One”, Et Al 2019
“Superposition of many models into one”, 2019-02-14 (similar)
“Playing Atari With Six Neurons”, Et Al 2018
“Playing Atari with Six Neurons”, 2018-06-04 ( ; similar)
“Measuring the Intrinsic Dimension of Objective Landscapes”, Et Al 2018
“Measuring the Intrinsic Dimension of Objective Landscapes”, 2018-04-24 ( ; similar)
“SqueezeNext: Hardware-Aware Neural Network Design”, Et Al 2018
“SqueezeNext: Hardware-Aware Neural Network Design”, 2018-03-23 (similar)
“Wide Compression: Tensor Ring Nets”, Et Al 2018
“Wide Compression: Tensor Ring Nets”, 2018-02-25 (similar)
“Intriguing Properties of Randomly Weighted Networks: Generalizing While Learning Next to Nothing”, 2018
“Intriguing Properties of Randomly Weighted Networks: Generalizing while Learning Next to Nothing”, 2018-01-25 (similar)
“Fix Your Classifier: the Marginal Value of Training the Last Weight Layer”, Et Al 2018
“Fix your classifier: the marginal value of training the last weight layer”, 2018-01-14 (similar)
“Learning Compact Recurrent Neural Networks With Block-Term Tensor Decomposition”, Et Al 2017
“Learning Compact Recurrent Neural Networks with Block-Term Tensor Decomposition”, 2017-12-14 ( ; similar)
“xUnit: Learning a Spatial Activation Function for Efficient Image Restoration”, Et Al 2017
“xUnit: Learning a Spatial Activation Function for Efficient Image Restoration”, 2017-11-17 (similar)
“Natural Language Processing With Small Feed-Forward Networks”, Et Al 2017
“Natural Language Processing with Small Feed-Forward Networks”, 2017-08-01
“ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices”, Et Al 2017
“ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices”, 2017-07-04 (similar)
“Shake-Shake Regularization of 3-branch Residual Networks”, 2017
“Shake-Shake regularization of 3-branch residual networks”, 2017-03-15 (similar)
“Bonsai: Resource-efficient Machine Learning in 2KB RAM for the Internet of Things”, Kumar et al 2017
“Tensorizing Neural Networks”, Et Al 2015
“Tensorizing Neural Networks”, 2015-09-22 ( ; backlinks; similar)
“Eight Pairs of Descending Visual Neurons in the Dragonfly Give Wing Motor Centers Accurate Population Vector of Prey Direction”, Gonzalez-Bellido et al 2013
“Eight pairs of descending visual neurons in the dragonfly give wing motor centers accurate population vector of prey direction”, 2013-01-08 ( ; backlinks; similar)
“Networks of Spiking Neurons: The Third Generation of Neural Network Models”, 1997
“Networks of spiking neurons: The third generation of neural network models”, 1997-12 ( ; similar)
“Delivering Real-time AI in the Palm of Your Hand”
Wikipedia
Miscellaneous

2018-cheng.pdf
2018 
http://www.mitpressjournals.org/doi/pdf/10.1162/neco_a_00990

https://ai.facebook.com/blog/ahighlyefficientrealtimetexttospeechsystemdeployedoncpus/

https://ai.googleblog.com/2018/05/customondevicemlmodels.html

https://ai.googleblog.com/2019/03/anallneuralondevicespeech.html

https://ai.googleblog.com/2021/10/grammarcorrectionasyoutypeonpixel.html

https://ai.googleblog.com/2022/03/autogeneratedsummariesingoogledocs.html

https://ai.googleblog.com/2022/08/efficientsequencemodelingforon.html

https://blog.roblox.com/2020/05/scaledbertserve1billiondailyrequestscpus/

https://blog.tensorflow.org/2020/03/higheraccuracyonvisionmodelswithefficientnetlite.html

https://neuralmagic.com/blog/bertlargepruneoncefordistilbertinferenceperformance/
Link Bibliography

https://arxiv.org/abs/2302.12441
: “MUX-PLMs: Pretraining Language Models With Data Multiplexing”, Vishvak Murahari, Ameet Deshpande, Carlos E. Jimenez, Izhak Shafran, Mingqiu Wang, Yuan Cao, Karthik Narasimhan: 
https://arxiv.org/abs/2207.03620
: “More ConvNets in the 2020s: Scaling up Kernels Beyond 51×51 Using Sparsity (SLaK)”, : 
https://arxiv.org/abs/2205.03983#google
: “Building Machine Translation Systems for the Next Thousand Languages”, : 
https://arxiv.org/abs/2204.00595
: “Monarch: Expressive Structured Matrices for Efficient and Accurate Training”, : 
https://arxiv.org/abs/2202.07415#deepmind
: “NeuPL: Neural Population Learning”, Siqi Liu, Luke Marris, Daniel Hennes, Josh Merel, Nicolas Heess, Thore Graepel: 
https://arxiv.org/abs/2106.09685
: “LoRA: Low-Rank Adaptation of Large Language Models”, Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen: 
https://greydanus.github.io/2020/12/01/scalingdown/
: “Scaling down Deep Learning”, Sam Greydanus: 
https://arxiv.org/abs/1905.11946#google
: “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”, Mingxing Tan, Quoc V. Le: