“LHOPT: A Generalizable Approach to Learning Optimizers”, Diogo Almeida, Clemens Winter, Jie Tang, Wojciech Zaremba (2021-06-02):

[learning rate tuning; code; cf. Chinchilla] A core issue with learning to optimize neural networks has been the lack of generalization to real-world problems.

To address this, we describe a system designed from a generalization-first perspective, learning to update [using PPO] optimizer hyperparameters instead of model parameters directly, using novel features, actions, and a reward function. This system outperforms Adam at all neural network tasks, including on modalities not seen during training. We achieve 2× speedups on ImageNet, and a 2.5× speedup on a language modeling task using over 5 orders of magnitude more compute than the training tasks.

…Because even the largest language modeling tasks are trained for less than an epoch [8], we choose to train for only a single epoch to evaluate performance in an underfitting regime…The baselines are all AdamW-based combinations of 5 learning rates (1e−4, 3e−4, 1e−3, 3e−3, 1e−2) and 7 commonly used schedules (constant, multi-step, linear decay, quadratic decay, exponential decay, cosine [25] to 0, cosine to 0.1 of original LR)…We also had one additional class of actions that were not hyperparameter updates but fit nicely within the existing framework: learning to restart from checkpoints. There are many motivations for such an action:
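
The 5 × 7 baseline grid can be sketched as learning rates crossed with schedule multipliers over training progress t ∈ [0, 1]. A minimal Python sketch; the multi-step milestones, the multi-step decay factor, and the exponential-decay floor are assumed illustrative values, not taken from the paper:

```python
import math

# The 5 learning rates from the excerpt
LEARNING_RATES = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2]

def constant(t):
    return 1.0

def multi_step(t, milestones=(0.5, 0.75), gamma=0.1):
    # drop the LR by `gamma` at each milestone (milestones/gamma are assumptions)
    return gamma ** sum(t >= m for m in milestones)

def linear_decay(t):
    return 1.0 - t

def quadratic_decay(t):
    return (1.0 - t) ** 2

def exponential_decay(t, final=0.01):
    # decay so the multiplier reaches `final` (an assumed floor) at t = 1
    return final ** t

def cosine_to_zero(t):
    return 0.5 * (1.0 + math.cos(math.pi * t))

def cosine_to_tenth(t):
    # cosine decay down to 0.1 of the original LR
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * t))

SCHEDULES = [constant, multi_step, linear_decay, quadratic_decay,
             exponential_decay, cosine_to_zero, cosine_to_tenth]

# 5 learning rates x 7 schedules = 35 baseline configurations
baseline_grid = [(lr, sched) for lr in LEARNING_RATES for sched in SCHEDULES]
```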

  1. ideally, learned optimizers would be able to handle all the task-specific tuning that a practitioner would have to do, and restarting on divergence is one such task,

  2. previous work has noted that SGD often works best with the highest possible stable learning rate [43] and it may not be possible to determine that value without venturing into unstable territory,

  3. sophisticated hyperparameter optimization algorithms such as Population-Based Training could be learned from such a simple action, and finally

  4. even if restarting were never used by a trained model, it could greatly help with exploration while training, both by decreasing the length of credit-assignment paths and by making it less punishing for models to sample suboptimal settings.
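
The motivations above can be illustrated with a minimal sketch of a restart action sitting alongside ordinary hyperparameter updates; the `InnerTraining` interface and action names here are hypothetical, not the paper's implementation:

```python
# Hypothetical controller actions: ordinary hyperparameter updates plus a
# checkpoint-restart action that rolls the inner optimization back, e.g.
# after probing an unstably high learning rate.
RESTART = "restart"
SET_LR = "set_lr"

class InnerTraining:
    def __init__(self, params, lr):
        self.params = dict(params)
        self.lr = lr
        self._ckpt = (dict(params), lr)

    def checkpoint(self):
        self._ckpt = (dict(self.params), self.lr)

    def apply_action(self, action, value=None):
        if action == SET_LR:
            # an ordinary hyperparameter update
            self.lr = value
        elif action == RESTART:
            # restore parameters and hyperparameters from the checkpoint
            self.params, self.lr = dict(self._ckpt[0]), self._ckpt[1]

# usage: probe a high LR, diverge, then recover cheaply via restart
run = InnerTraining({"w": 0.0}, lr=1e-3)
run.checkpoint()
run.apply_action(SET_LR, 1e-1)   # venture into unstable territory
run.params["w"] = float("nan")   # pretend training diverged
run.apply_action(RESTART)        # params and lr restored from checkpoint
```

A restart action like this also makes exploration during meta-training cheaper, since a bad sampled setting can be undone rather than ruining the whole inner run.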

Figure 3 shows the learning curves for the LHOPTs and the best baseline. An interesting observation that we will see repeated throughout the paper is that despite being capable of achieving a lower loss earlier, the chosen hyperparameters tend to underperform the best possible loss for that compute, presumably to achieve a better loss later. It’s unclear how necessary it is to trade off early performance for later gains, but many successful hand-made schedules tend to do this: multi-step schedules tend to stay at the same learning rate long after they’ve hit a plateau, and cosine schedules tend to decay their learning rates much less aggressively than other commonly used schedules.

Figure 3: Performance of learned optimizers on optimizing 1 epoch of GPT-2-Large on WikiText-103. Our learned optimizers get almost 2× speedups on this task despite it being over 2 orders of magnitude larger than the training tasks.

…We then trained a range of model sizes to compute scaling laws [21] for both baselines and models trained with the LHOPT schedule, and present the results in Figure 2. The LHOPT schedule demonstrates a consistent speedup over baselines, with a slightly steeper slope. We can estimate what a constant speedup would be for this range of points by assuming their scaling-law slopes are equal; from this we calculate a 2.5× speedup. To take the change in slope into account as well, we extrapolate the curves to 175 billion parameters (the same size as GPT-3); at that size, the estimated speedup would be 3.6×.
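
The speedup arithmetic can be sketched directly from power-law fits L(C) = a·C^(−b) to each compute-efficient frontier; the coefficients in the test usage below are illustrative assumptions, not the paper's fitted values:

```python
def speedup_equal_slopes(a_base, a_lhopt, b):
    # If both fits share slope b, the LHOPT curve is a horizontal shift of
    # the baseline. Setting losses equal:
    #   a_base * C_base**(-b) = a_lhopt * C_lhopt**(-b)
    # gives the constant compute ratio:
    #   C_base / C_lhopt = (a_base / a_lhopt)**(1 / b)
    return (a_base / a_lhopt) ** (1.0 / b)

def speedup_at_compute(a_base, b_base, a_lhopt, b_lhopt, c_lhopt):
    # With unequal slopes the speedup depends on where it is evaluated:
    # find the baseline compute that reaches the same loss as the LHOPT run
    # at compute c_lhopt, then take the ratio. (Extrapolating c_lhopt to a
    # 175B-parameter budget is how a size-dependent estimate is obtained.)
    loss = a_lhopt * c_lhopt ** (-b_lhopt)
    c_base = (a_base / loss) ** (1.0 / b_base)
    return c_base / c_lhopt
```

With equal slopes the two estimates coincide, which is a quick sanity check on the second function.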

Note that this result holds despite the codebase employing multiple optimization techniques that our LHOPT had no way of taking into account: gradient clipping to a fixed value, and a gradually increasing batch size.
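
A minimal sketch of those two codebase-level techniques, which the LHOPT controller cannot observe; the clipping threshold and batch-size ramp constants are assumed illustrative values:

```python
def clip_grad_norm(grads, max_norm=1.0):
    # Scale gradients so their global L2 norm is at most max_norm
    # (the fixed-value clipping mentioned above; max_norm is an assumption).
    total = sum(g * g for g in grads) ** 0.5
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads

def batch_size(step, start=32, final=512, ramp_steps=10_000):
    # Linearly ramp the batch size up over the first ramp_steps
    # (all three constants are assumptions, not the paper's values).
    frac = min(step / ramp_steps, 1.0)
    return int(start + frac * (final - start))
```

Both interventions change the effective optimization dynamics mid-run, which is why a controller that only sees its own hyperparameter state cannot account for them.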

Figure 2: Test learning curves and scaling-law fit of the compute-efficient frontier on a large, well-tuned language modeling codebase. Our learned optimizers demonstrate consistent speedups ≥2×, with the speedup increasing with model size and no computational overhead. Dotted lines are baselines; solid lines use an LHOPT hyperparameter schedule from a similar but smaller task.