“General Cyclical Training of Neural Networks”, Leslie N. Smith (2022-02-17):

This paper describes the principle of General Cyclical Training in machine learning, where training starts and ends with “easy training” and the “hard training” happens during the middle epochs.

We propose several manifestations for training neural networks, including algorithmic examples (via hyper-parameters and loss functions), data-based examples, and model-based examples. Specifically, we introduce several novel techniques: cyclical weight decay, cyclical batch size, cyclical focal loss, cyclical softmax temperature, cyclical data augmentation, cyclical gradient clipping, and cyclical semi-supervised learning.

In addition, we demonstrate that cyclical weight decay, cyclical softmax temperature, and cyclical gradient clipping (as 3 examples of this principle) are beneficial in the test accuracy performance of a trained model. Furthermore, we discuss model-based examples (such as pretraining and knowledge distillation) from the perspective of general cyclical training and recommend some changes to the typical training methodology.

In summary, this paper defines the general cyclical training concept and discusses several specific ways in which this concept can be applied to training neural networks.

In the spirit of reproducibility, the code used in our experiments is available on GitHub.

Table 1: Cyclical Weight Decay: Top-1 test classification accuracies comparing cyclical weight decay (CWD) [with a constant learning rate] to constant weight decay for CIFAR-10, 4K CIFAR-10 (i.e., only 4,000 training samples), CIFAR-100, and ImageNet. In all of these experiments, CWD improved the network’s performance compared to training with a constant weight decay.

Table 1 compares the test accuracies for cyclical weight decay (CWD) to training with tuned hyper-parameters (with a constant weight decay) and a learning rate warmstart with cosine annealing [23]. For each dataset in this table there are two rows: the first row presents the mean test accuracy and the standard deviation over 4 runs (for ImageNet, the mean and standard deviation over two runs), and the second row provides the range of weight decay used in the training. The second column of the table provides the results of training with a constant weight decay, and the subsequent columns show the results of training with an increasing range for weight decay. In our experiments, we found that the performance was relatively insensitive to the value of f_c.

The results in Table 1 show that there is a benefit to training over a range of weight decay values. For CIFAR-10, using cyclical weight decay improves the network performance relative to using a constant value of 5 × 10⁻³; the range from 10⁻⁴ to 10⁻³ performs best, although the range from 2 × 10⁻⁴ to 8 × 10⁻³ is within the precision of our experiments.
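To illustrate how a weight-decay range such as 10⁻⁴ to 10⁻³ might be swept during training, the following is a minimal sketch of a cyclical weight-decay schedule. The triangular easy-hard-easy waveform and the parameter names (`wd_min`, `wd_max`) are our assumptions for illustration; the experiments above report only the ranges used, not the exact shape of the schedule.

```python
def cyclical_weight_decay(epoch, total_epochs, wd_min, wd_max):
    """Triangular cyclical weight-decay schedule (illustrative assumption).

    Weight decay starts at wd_min ("easy" regularization), rises linearly
    to wd_max at the midpoint of training ("hard"), then falls back to
    wd_min by the final epoch, matching the easy-hard-easy principle of
    general cyclical training.
    """
    half = total_epochs / 2.0
    if epoch <= half:
        frac = epoch / half            # rising phase: 0 -> 1
    else:
        frac = (total_epochs - epoch) / half  # falling phase: 1 -> 0
    return wd_min + (wd_max - wd_min) * frac


# Example usage with the CIFAR-10 range from Table 1 (1e-4 to 1e-3):
for epoch in (0, 50, 100):
    wd = cyclical_weight_decay(epoch, total_epochs=100,
                               wd_min=1e-4, wd_max=1e-3)
    # In a training loop, wd would be written into the optimizer's
    # weight_decay setting before each epoch.
    print(epoch, wd)
```

In a framework such as PyTorch, the returned value would typically be assigned to each parameter group's `weight_decay` entry at the start of every epoch.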

The second row of Table 1 shows the results when training on only a fraction of the CIFAR-10 training set. Here we used the first 4,000 samples in the CIFAR-10 training dataset. Using cyclical weight decay improves the network performance relative to using a constant value of 5 × 10⁻³, and the range from 10⁻⁴ to 10⁻³ has the best performance. It is noteworthy that CWD provides a more substantial benefit when the amount of training data is limited. In addition, the third row of Table 1 shows results for CIFAR-100, where the range from 10⁻⁴ to 8 × 10⁻⁴ has the best performance.