"Resolving Discrepancies in Compute-Optimal Scaling of Language Models", 2024-06-27:
Kaplan et al 2020 and Hoffmann et al 2022 developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions.
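To make the disagreement concrete, here is a minimal sketch (not from the paper) comparing the two laws' optimal-model-size predictions under the commonly cited power-law forms N_opt ∝ C^a, with exponents of roughly 0.73 (Kaplan) and 0.50 (Chinchilla); the coefficients and compute budgets below are arbitrary placeholders, chosen only to show how the predictions diverge:

```python
# Minimal sketch (not from the paper): optimal model size N_opt = k * C^a under
# the two commonly cited compute-optimal exponents. The coefficients k are
# arbitrary placeholders; the point is only how the predictions diverge with C.

def n_opt(compute_flops: float, exponent: float, coefficient: float) -> float:
    """Optimal parameter count under a power-law rule N_opt = coefficient * C**exponent."""
    return coefficient * compute_flops ** exponent

for c in (1e19, 1e21, 1e23):
    kaplan_style = n_opt(c, 0.73, 1e-2)      # Kaplan et al 2020 exponent ~0.73
    chinchilla_style = n_opt(c, 0.50, 1e-1)  # Hoffmann et al 2022 exponent ~0.50
    print(f"C = {c:.0e} FLOPs: Kaplan-style N_opt ~ {kaplan_style:.2e}, "
          f"Chinchilla-style N_opt ~ {chinchilla_style:.2e}")
```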
We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying 3 factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning.
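For intuition on the first factor, a rough sketch (my own approximation, not the paper's exact accounting): using the standard ~12·n_layer·d_model² estimate for transformer-body parameters, d_model·vocab for the final unembedding ("last layer"), and ~6 FLOPs per parameter per training token, the last layer is a large share of compute at small scales but nearly negligible at large ones:

```python
# Rough sketch (my approximation, not the paper's accounting): share of training
# compute attributable to the final unembedding layer at two model scales.
# Assumes ~12 * n_layer * d_model^2 body parameters, d_model * vocab unembedding
# parameters, and ~6 FLOPs per parameter per training token.

def body_params(n_layer: int, d_model: int) -> int:
    return 12 * n_layer * d_model ** 2

def unembed_params(d_model: int, vocab: int) -> int:
    return d_model * vocab

for n_layer, d_model in [(8, 512), (24, 2048)]:
    body = body_params(n_layer, d_model)
    head = unembed_params(d_model, vocab=50257)
    flops_with = 6 * (body + head)  # FLOPs/token counting the last layer
    flops_without = 6 * body        # FLOPs/token ignoring it (roughly the Kaplan-style non-embedding count)
    print(f"d_model={d_model}: last layer is {head / (body + head):.0%} of params; "
          f"FLOPs/token {flops_without:.2e} (excl.) vs {flops_with:.2e} (incl.)")
```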
With these factors corrected, we obtain excellent agreement with the Hoffmann et al 2022 (ie. "Chinchilla") scaling law. Counter to a hypothesis of Hoffmann et al 2022, we find that careful learning rate decay is not essential for the validity of their scaling law.
As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW β2 parameter is essential at lower batch sizes.
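As an illustration of where β2 enters in practice, a minimal configuration sketch using PyTorch's AdamW (the learning rate, weight decay, and β values here are placeholders, not the paper's fitted scaling laws; the abstract's claim is only that β2 must be tuned jointly with batch size):

```python
# Illustrative configuration sketch only: where AdamW's beta2 is set in PyTorch.
# Specific values are placeholders, not the paper's fitted recommendations.
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for a language model

# Default second-moment horizon (beta2 = 0.999):
opt_default = torch.optim.AdamW(model.parameters(), lr=3e-4,
                                betas=(0.9, 0.999), weight_decay=0.1)

# Re-tuned beta2 (placeholder value) for a different batch-size regime:
opt_tuned = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)
```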