“Sparse Networks from Scratch: Faster Training without Losing Performance”, Tim Dettmers, Luke Zettlemoyer, 2019-07-10:

We demonstrate the possibility of what we call sparse learning: accelerated training of deep neural networks that maintain sparse weights throughout training while achieving dense performance levels. In sparse learning, only a small fraction of the weights are active (non-zero) at any point during training, which reduces both the computation per update and the model’s memory footprint, yet need not compromise final accuracy.
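To make “maintaining sparse weights throughout training” concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper) of a layer whose weight matrix keeps a fixed set of active entries selected by a boolean mask:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 4x4 weight matrix sparsified to 50% density: only the entries selected
# by the boolean mask are ever non-zero, updated, or used in the forward pass.
density = 0.5
w = rng.normal(size=(4, 4))
mask = rng.random((4, 4)) < density  # fixed set of active positions
w[~mask] = 0.0                       # inactive weights stay exactly zero

def forward(x):
    # A real sparse kernel would skip the zero entries entirely; a dense
    # matmul on the masked matrix gives the same result for illustration.
    return x @ w

y = forward(rng.normal(size=(2, 4)))
```

In actual sparse training, gradient updates are likewise masked so the inactive positions remain zero from one step to the next.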

We accomplish this by developing sparse momentum, an algorithm which uses exponentially smoothed gradients (momentum) to identify layers and weights which reduce the error efficiently. Sparse momentum redistributes pruned weights across layers according to the mean momentum magnitude of each layer. Within a layer, sparse momentum grows weights according to the momentum magnitude of zero-valued weights. Because momentum is an exponentially smoothed gradient, a large momentum magnitude marks a weight or layer that would consistently reduce the training error, which is why it serves as the criterion for both redistribution and growth.
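The paragraph above describes one prune/redistribute/grow cycle. The following NumPy sketch shows that cycle under our own naming and a stand-in `prune_rate` hyperparameter; it is an illustration of the abstract’s description, not the paper’s released implementation:

```python
import numpy as np

def sparse_momentum_step(weights, momenta, masks, prune_rate=0.2):
    """One prune/redistribute/grow cycle (illustrative sketch).

    weights, momenta, masks: lists of same-shape arrays, one per layer;
    masks are boolean, marking the active (non-zero) positions.
    """
    # 1. Prune: in each layer, deactivate the prune_rate fraction of
    #    active weights with the smallest magnitude.
    total_pruned = 0
    for w, m in zip(weights, masks):
        active = np.flatnonzero(m)
        k = int(prune_rate * active.size)
        if k == 0:
            continue
        drop = active[np.argsort(np.abs(w.flat[active]))[:k]]
        m.flat[drop] = False
        w.flat[drop] = 0.0
        total_pruned += k

    # 2. Redistribute: share the freed weight budget across layers in
    #    proportion to each layer's mean momentum magnitude over active weights.
    contrib = np.array([np.abs(v[m]).mean() if m.any() else 0.0
                        for v, m in zip(momenta, masks)])
    shares = contrib / contrib.sum()

    # 3. Grow: in each layer, re-enable the zero-valued positions with the
    #    largest momentum magnitude; regrown weights start at zero.
    for w, v, m, s in zip(weights, momenta, masks, shares):
        budget = int(round(s * total_pruned))
        inactive = np.flatnonzero(~m)
        grow = inactive[np.argsort(-np.abs(v.flat[inactive]))[:budget]]
        m.flat[grow] = True
        w.flat[grow] = 0.0
    return weights, momenta, masks
```

Note that the total number of active weights is conserved (up to rounding), so the overall sparsity level stays fixed throughout training while the sparsity pattern adapts.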

We demonstrate state-of-the-art sparse performance on MNIST, CIFAR-10, and ImageNet, decreasing the mean error by a relative 8%, 15%, and 6% compared to other sparse algorithms. Furthermore, we show that sparse momentum reliably reproduces dense performance levels while providing up to 5.61× faster training. Because these three datasets are the standard computer-vision benchmarks for sparse training, the results can be compared directly against prior sparse algorithms.

In our analysis, ablations show that the benefits of momentum redistribution and growth increase with the depth and size of the network. Additionally, we find that sparse momentum is insensitive to the choice of its hyperparameters, suggesting that it is robust and easy to use. Since hyperparameters are configuration choices fixed before training rather than learned from data, this insensitivity means sparse momentum can be applied to new problems without extensive tuning.