“Gradient-Based Hyperparameter Optimization through Reversible Learning”, 2015-02-11:
Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable.
We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure.
These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures.
We compute hyperparameter gradients by exactly reversing the dynamics of stochastic gradient descent with momentum.
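A minimal NumPy sketch of the idea, under simplifying assumptions (a single constant learning rate and momentum coefficient, and helper callables `grad_train`, `hvp_train`, `grad_val` that are illustrative rather than the paper's actual interface): run momentum SGD forward keeping only the final state, then invert each update exactly on the way back while accumulating the adjoint of the validation loss. In plain floating point the reversal drifts slightly; the paper makes it exact with careful bookkeeping of the information discarded by the momentum decay.

```python
import numpy as np

def hypergrad_learning_rate(alpha, gamma, w0, grad_train, hvp_train, grad_val, T):
    """Sketch of reverse-mode differentiation through momentum SGD.

    Assumed interface (not the paper's code):
      grad_train(w)   -> gradient of the training loss at w
      hvp_train(w, u) -> training-loss Hessian-vector product H(w) @ u
      grad_val(w)     -> gradient of the validation loss at w
    Returns d(validation loss)/d(alpha) for a single, constant step size.
    """
    # Forward pass: plain SGD with momentum; only the final (w, v) is kept.
    w, v = w0.copy(), np.zeros_like(w0)
    for _ in range(T):
        v = gamma * v - alpha * grad_train(w)
        w = w + v

    # Reverse pass: undo each update exactly and accumulate adjoints.
    dw = grad_val(w)            # d f / d w_T
    dv = np.zeros_like(w)       # d f / d v_T
    dalpha = 0.0
    for _ in range(T):
        dv = dv + dw                        # w_t = w_{t-1} + v_t
        w = w - v                           # recover w_{t-1}
        g = grad_train(w)
        dalpha += -dv @ g                   # v_t = gamma*v_{t-1} - alpha*g
        dw = dw - alpha * hvp_train(w, dv)  # chain rule through grad_train(w_{t-1})
        v = (v + alpha * g) / gamma         # recover v_{t-1}
        dv = gamma * dv                     # d f / d v_{t-1}
    return dalpha


# Tiny check on a quadratic problem, where every derivative is analytic.
A = np.diag([1.0, 10.0])          # training loss   0.5 * w' A w
w_star = np.array([0.3, -0.2])    # validation loss 0.5 * ||w - w_star||^2
g = hypergrad_learning_rate(
    alpha=0.05, gamma=0.9, w0=np.ones(2), T=50,
    grad_train=lambda w: A @ w,
    hvp_train=lambda w, u: A @ u,
    grad_val=lambda w: w - w_star,
)
print(g)
```

Because each reversed step only needs the current weights, velocity, and a Hessian-vector product, memory stays constant in the number of training iterations, which is what lets the method scale to thousands of hyperparameters.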