Neural Net Sparsity
Trained neural networks can typically be compressed, pruned, or distilled into much smaller & faster networks with little loss of performance. Mysteriously, these smaller networks typically cannot be trained from scratch; performance gains can be obtained without the original data; models can be trained to imitate themselves in self-distillation; they generalize well despite all this suggesting that overfitting ought to be a major concern; and many of these smaller networks are in some sense already present in the original neural network. This is frequently taken to indicate a ‘blessing of scale’: large NNs have smoother loss landscapes, which simple optimizers can successfully traverse to good optima no matter how hard the problem, as compared to smaller networks, which may wind up ‘trapped’ at a bad place with no free parameters to let them slip around obstacles and find some way to improve (much less the loss landscapes of equivalently powerful but extremely brittle encodings such as Brainf—k or x86 assembler programs).

As well as their great theoretical interest—How can we train these small models directly? What does this tell us about how NNs work?—such smaller NNs are critical to practical real-world deployment to servers & smartphones at scale and to the design of accelerator hardware supporting reduced-precision operations, and they are also an interesting case of capability growth for AI risk: as soon as any NN exists which can achieve performance goal X, it is likely that a much more efficient NN (potentially orders of magnitude smaller or faster) can be created to achieve X thereafter. (These are merely one way that your software can be much faster.)
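To give a concrete sense of the simplest case of “a smaller network already present in the original”, below is a minimal sketch of global unstructured magnitude pruning: zero out the smallest-magnitude weights of a trained network and keep a mask marking the surviving subnetwork, which is then typically fine-tuned. This is an illustrative example only (PyTorch is assumed; the function name, sparsity level, and toy model are mine), not the method of any particular paper in this bibliography.

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, sparsity: float = 0.9) -> dict:
    """Zero out the `sparsity` fraction of smallest-magnitude weights globally.

    Returns {parameter name: boolean mask} with True marking kept weights.
    """
    # One global threshold across all weight matrices (biases left untouched).
    all_weights = torch.cat([p.detach().abs().flatten()
                             for name, p in model.named_parameters()
                             if "weight" in name])
    k = int(sparsity * all_weights.numel())
    threshold = all_weights.kthvalue(k).values if k > 0 else all_weights.new_tensor(0.0)

    masks = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            if "weight" not in name:
                continue
            mask = p.abs() > threshold
            p.mul_(mask.to(p.dtype))   # zero the pruned weights in place
            masks[name] = mask         # re-apply after each update during fine-tuning
    return masks

# Usage (illustrative): prune a trained model to 90% sparsity; during fine-tuning,
# re-apply the masks after every optimizer step so pruned weights stay at zero.
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
masks = magnitude_prune(model, sparsity=0.9)
```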
This tag covers some examples of NNs being compressed in size or FLOPs by anywhere from 50% to ~17,000% (an incomplete bibliography, merely papers I have noted during my reading).
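The ‘self-distillation’ mentioned above refers to training a network to imitate the softened outputs of a teacher of the same architecture (ordinary distillation uses a smaller student). As a hedged illustration rather than any particular paper’s method, here is a minimal PyTorch sketch of the standard soft-target distillation loss; the function name, temperature, and weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 4.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of soft-target KL divergence and hard-label cross-entropy."""
    # Soft targets: KL between temperature-softened teacher & student distributions;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    F.softmax(teacher_logits / temperature, dim=-1),
                    reduction="batchmean") * (temperature ** 2)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Usage (illustrative): the teacher is frozen (eval mode, no gradients) and the
# student is trained on this combined loss. With student architecture equal to
# the teacher's, this is self-distillation; dropping the hard term lets the
# student train from teacher outputs alone, with no ground-truth labels.
```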