ā€œScaling Laws for Acoustic Modelsā€, Jasha Droppo & Oguz Elibol (2021-06-11):

There is a recent trend in machine learning to increase model quality by growing models to sizes previously thought to be unreasonable. Recent work has shown that autoregressive generative models with cross-entropy objective functions exhibit smooth power-law relationships, or scaling laws, that predict model quality from model size, training set size, and the available compute budget. These scaling laws allow one to choose nearly optimal hyper-parameters given constraints on available training data, model parameter count, or training computation budget. In this paper, we demonstrate that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws. We extend previous work to jointly predict loss due to model size, to training set size, and to the inherent ā€œirreducible lossā€ of the task. We find that the scaling laws accurately match model performance over 2 orders of magnitude in both model size and training set size, and make predictions about the limits of model performance.
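As a toy illustration of the functional form such a joint scaling law takes — an irreducible term plus power-law terms in model size and data size — the sketch below uses invented constants, not the paper's fitted values:

```python
# Illustrative joint scaling law (toy constants, NOT the paper's fit):
# loss = irreducible E + power-law penalty in params N + power-law penalty in data D.
def predicted_loss(n_params, n_data, E=0.3, A=5.0, alpha=0.35, B=8.0, beta=0.37):
    # E is the "irreducible loss" asymptote; A, alpha, B, beta are made up.
    return E + A / n_params**alpha + B / n_data**beta

# Growing either axis drives loss toward, but never below, E.
small = predicted_loss(1e6, 1e6)
large = predicted_loss(1e8, 1e8)
assert large < small
assert large > 0.3
```

Fitting the four constants on small runs then lets one extrapolate the curve, which is how the paper predicts performance two orders of magnitude beyond its measurements.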

…The context module is a sequence-to-sequence model that converts the encoded input sequence into a sequence of context vectors. We experiment with 2 different designs for the context model: the LSTM and the Transformer. To maintain causality, the LSTMs are uni-directional and the Transformer uses masking to prevent the network from attending to future frames…All acoustic data used in this paper was drawn from a 23,000 hour corpus of untranscribed, de-identified, far-field, English voice command and voice query speech collected from home environments [Alexa?]. This data is presented to the network as a series of log-Mel frequency filterbank feature vectors, at a rate of 100 vectors per second of audio. Although this data is not publicly available, the authors believe that the phenomena described in this paper should apply to any similar set of speech recordings.

Figure 5: Development set loss for both LSTM and Transformer models for models with the indicated number of layers. The dashed line represents the computationally efficient frontier defined in Equation 4.

When a model reaches L(C), it means that a different model with enough capacity, but with fewer parameters, would need more computation and more data to reach the same loss value. Alternatively, a model with more parameters would need more computation and less data to reach the same loss value.

Where curves for 2 experiments meet, it is an indication that the same amount of compute can reach the given loss value through 2 different methods. One can either use more parameters and fewer data, or use fewer parameters and more data.
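A compute-efficient frontier L(C) of this kind can be sketched directly: at each fixed compute budget (roughly parameters Ɨ data), sweep the split between parameters and data and keep the best achievable loss. The constants below are toy values for illustration, not the paper's:

```python
# Toy scaling law (invented constants): irreducible loss plus power laws in
# model size n and data size d.
def loss(n, d, E=0.3, A=5.0, alpha=0.35, B=8.0, beta=0.37):
    return E + A / n**alpha + B / d**beta

def frontier(C, splits=200):
    # Treat compute as C ~ n * d and try allocations n = C**f, d = C**(1-f);
    # the frontier is the best loss over all splits of the budget.
    return min(loss(C**f, C**(1 - f))
               for f in (i / splits for i in range(1, splits)))

# More compute always lowers the frontier loss toward the asymptote E:
assert frontier(1e14) < frontier(1e12) < frontier(1e10)
```

Two different (parameters, data) allocations with the same budget can land on the same loss, which is exactly the crossing of curves described above; the frontier is the lower envelope of all such curves.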

The constant š’œāˆž is 0.306 in both figures. This represents a shared asymptote between the LSTM and Transformer systems, which will never be surpassed, regardless of the computational or data budget. The fact that the same asymptote applies to both systems hints that irreducible loss is indeed a fundamental property of the data and not the model. Additionally, this constant is similar to the value found in §3.1. The authors suspect that the constants should be identical, but the precision in measuring it is limited.

The LSTM models exhibit a compute-efficient frontier with a slope of āˆ’0.167. A doubling of computation yields a 10.9% reduction in objective function. A halving of objective function would come with a 63.5Ɨ increase in computation. The slope of the compute-efficient frontier for Transformer models is āˆ’0.197. When computation is increased by a factor of r, then the reducible loss will be changed by a factor of r^āˆ’0.197. At that rate, a doubling of computation yields a 12.7% reduction in objective function. A halving of objective function would come with a 33.7Ɨ increase in computation. [These results are consistent with LSTMs vs Transformers on text, and would probably be more impressive if acoustic modeling wasn’t so close to the irreducible loss (ie. solved).]
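The quoted figures follow directly from the frontier slopes: scaling compute by a factor r scales the reducible loss by r raised to the slope. A quick check of the arithmetic:

```python
# Reducible loss scales as r**slope when compute is scaled by r.
def reduction_per_doubling(slope):
    # fractional reduction in reducible loss when compute doubles (r = 2)
    return 1 - 2 ** slope

def compute_factor_for_halving(slope):
    # solve r**slope = 0.5 for r
    return 0.5 ** (1 / slope)

for name, slope in [("LSTM", -0.167), ("Transformer", -0.197)]:
    print(name,
          f"{100 * reduction_per_doubling(slope):.1f}% per doubling,",
          f"{compute_factor_for_halving(slope):.1f}x compute to halve loss")
```

This reproduces the LSTM numbers (10.9% per doubling, 63.5Ɨ to halve) and the Transformer halving cost (33.7Ɨ); the Transformer per-doubling figure comes out at ~12.8%, essentially matching the 12.7% quoted.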

The difference in slope between the LSTM and Transformer experiments indicates that the Transformer architecture makes more efficient use of increased model parameters and increased training data. Although the LSTM is superior to the Transformer at smaller model sizes, as model size grows and these trends continue, the Transformer will eventually be more efficient.

Finally, the experimental data show that larger models learn more quickly from the same amount of data. Each of the points plotted in Figure 5 represents the consumption of an additional 25,000 minibatches of training data. At the first, second, or third point, each model has processed the same data, but the larger models have achieved better accuracy on the held-out development set.