This chapter develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation-estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the under-lying optimization algorithm in non-trivial ways. For instance, a mediocre optimization algorithm, stochastic gradient descent, is shown to perform very well on large-scale learning problems.
…This chapter develops the ideas initially proposed by Bottou & Bousquet2008 [“The tradeoffs of large scale learning”, NIPS 2007]. §13.2 proposes a decomposition of the test error where an additional term represents the impact of approximate optimization. In the case of small-scale learning problems, this decomposition reduces to the well-known tradeoff between approximation error and estimation error. In the case of large-scale learning problems, the tradeoff is more complex because it involves the computational complexity of the learning algorithm. §13.3 explores the asymptotic properties of the large-scale learning tradeoff for various prototypical learning algorithms under various assumptions regarding the statistical estimation rates associated with the chosen objective functions. This part clearly shows that the best optimization algorithms are not necessarily the best learning algorithms. Maybe more surprisingly, certain algorithms perform well regardless of the assumed rate of the statistical estimation error. §13.4 reports experimental results supporting this analysis.
…These results clearly show that the generalization performance of large-scale learning systems depends on both the statistical properties of the objective function and the computational properties of the chosen optimization algorithm. Their combination leads to surprising consequences:
The SGD and 2SGD results do not depend on the estimation rate α. When the estimation rate is poor, there is less need to optimize accurately. That leaves time to process more examples. A potentially more useful interpretation leverages the fact that (13.11) is already a kind of generalization bound: its fast rate trumps the slower rate assumed for the estimation error.
Second-order algorithms bring few asymptotical improvements in ε. Although the superlinear 2GD algorithm improves the logarithmic term, all 4 algorithms are dominated by the polynomial term in (1⁄ε). However, there are important variations in the influence of the constants d, κ, and ν.These constants are very important in practice.
Stochastic algorithms (SGD, 2SGD) yield the best generalization performance despite showing the worst optimization performance on the empirical cost. This phenomenon has already been described and observed in experiments (eg. Bottou & LeCun2004).
In contrast, since the optimization error εopt of small-scale learning systems can be reduced to insignificant levels, their generalization performance is determined solely by the statistical properties of the objective function.
…Figure 13.1 shows how much time each algorithm takes to reach a given optimization accuracy. The superlinear algorithm TRON reaches the optimum with 10 digits of accuracy in less than one minute. The stochastic gradient starts more quickly but is unable to deliver such a high accuracy. The upper part of the figure clearly shows that the testing set loss stops decreasing long before the superlinear algorithm overcomes the SGD algorithm.
Figure 13.1: Training time and testing loss as a function of the optimization accuracy ρ for SGD and TRON (Linet al2007).
Figure 13.2 shows how the testing loss evolves with the training time. The stochastic gradient descent curve can be compared with the curves obtained using conjugate gradients on subsets of the training examples with increasing sizes. Assume, for instance, that our computing time budget is 1 second. Running the conjugate gradient algorithm on a random subset of 30,000 training examples achieves a much better performance than running it on the whole training set. How to guess the right subset size a priori remains unclear. Meanwhile, running the SGD algorithm on the full training set reaches the same testing set performance much faster.
Figure 13.2: Testing loss versus training time for SGD, and for conjugate gradients running on subsets of the training set.
…Conclusion: Taking into account budget constraints on both the number of examples and the computation time, we find qualitative differences between the generalization performance of small-scale learning systems and large-scale learning systems. The generalization properties of large-scale learning systems depend on both the statistical properties of the objective function and the computational properties of the optimization algorithm. We illustrate this fact with some asymptotic results on gradient algorithms.
This framework leaves room for considerable refinements. Shalev-Shwartz & Srebro2008 rigorously extend the analysis to regularized risk formulations with linear parameterization and find again that, for learning purposes, SGD algorithms are often more attractive than standard primal or dual algorithms with good optimization complexity (Joachims2006; Hushet al2006). It could also be interesting to investigate how the choice of a surrogate loss function(Zhang2004; Bartlettet al2006) impacts the large-scale case.