“Optimal Brain Damage”, Yann LeCun, John S. Denker, Sara A. Solla (1989):

We have used information-theoretic ideas to derive a class of practical and nearly optimal schemes for adapting the size of a neural network. By removing unimportant weights from a network, several improvements can be expected: better generalization, fewer training examples required, and improved speed of learning and/or classification.

The basic idea is to use second-derivative information to make a tradeoff between network complexity and training set error.
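In the paper, that second-derivative information takes the form of a per-weight saliency $s_k = \tfrac{1}{2} h_{kk} w_k^2$, where $h_{kk}$ is the diagonal Hessian of the training error: the weights with the smallest saliency are the ones whose removal is predicted to perturb the error least. Below is a minimal sketch of that ranking-and-pruning step in Python/NumPy; it assumes a diagonal Hessian estimate is already available (in the paper this comes from a backpropagation-like second-order pass, which is omitted here), and the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def obd_prune_mask(weights, hessian_diag, prune_fraction=0.2):
    """Rank weights by the OBD saliency s_k = h_kk * w_k^2 / 2 and
    return a 0/1 mask that zeroes out the lowest-saliency fraction.

    `hessian_diag`: per-weight diagonal Hessian estimate (assumed given;
    computing it is outside this sketch).
    """
    saliency = 0.5 * hessian_diag * weights**2
    k = int(prune_fraction * weights.size)            # number of weights to remove
    cutoff = np.partition(saliency.ravel(), k)[k]     # (k+1)-th smallest saliency
    return (saliency >= cutoff).astype(weights.dtype)

# Hypothetical usage: prune 20% of a layer's weights, then retrain.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 128))
h = np.abs(rng.standard_normal((256, 128)))  # stand-in for a real Hessian diagonal
w *= obd_prune_mask(w, h, prune_fraction=0.2)
```

In the paper this step is not applied once but inside a loop: train to a reasonable error, delete the lowest-saliency weights, retrain the smaller network, and repeat.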

Experiments confirm the usefulness of the methods on a real-world application.