"Resolving Discrepancies in Compute-Optimal Scaling of Language Models", Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, Yair Carmon (2024-06-27):

Kaplan et al 2020 and Hoffmann et al 2022 developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions.
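For context, the two laws differ mainly in the exponent relating optimal parameter count to compute: Kaplan et al report roughly N* ∝ C^0.73, while Hoffmann et al report roughly N* ∝ C^0.5. A minimal sketch of how far apart these predictions drift, using a made-up common anchor point (the 10^19 FLOPs / 400M-parameter reference is illustrative, not from either paper):

```python
# Illustrative comparison (not the paper's code): the commonly quoted exponents are
# N_opt ∝ C^0.73 (Kaplan et al) vs. N_opt ∝ C^0.50 (Hoffmann et al). The anchor
# point below is an arbitrary reference chosen only to visualize the divergence.

def optimal_params(compute_flops, exponent, anchor_flops=1e19, anchor_params=4e8):
    """Power-law prediction N_opt = anchor_params * (C / anchor_flops)**exponent."""
    return anchor_params * (compute_flops / anchor_flops) ** exponent

for c in [1e19, 1e21, 1e23, 1e25]:
    kaplan = optimal_params(c, 0.73)
    chinchilla = optimal_params(c, 0.50)
    print(f"C={c:.0e} FLOPs  Kaplan-style N*≈{kaplan:.2e}  "
          f"Chinchilla-style N*≈{chinchilla:.2e}  ratio≈{kaplan / chinchilla:.1f}x")
```

With this (arbitrary) shared anchor, the two exponents already disagree on the optimal model size by more than an order of magnitude at frontier-scale compute budgets; that gap is the discrepancy the paper sets out to explain.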

We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying 3 factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning.
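The first factor can be sketched concretely: roughly speaking, Kaplan et al's compute accounting is based on non-embedding parameters and leaves out the output ("last layer") projection, whose relative cost is large for small models and shrinks with scale, which bends the fitted exponent. A rough illustration using the standard ≈6N-training-FLOPs-per-token accounting (model widths and vocabulary size are illustrative, not the paper's exact configurations):

```python
# Rough sketch of the FLOPs-accounting difference (illustrative numbers only).
# Counting compute against non-embedding parameters omits the final vocabulary
# projection, which adds roughly 6 * d_model * vocab_size training FLOPs per token.

def train_flops_per_token(n_nonembed_params, d_model, vocab_size, count_last_layer=True):
    flops = 6 * n_nonembed_params              # standard 6N approximation
    if count_last_layer:
        flops += 6 * d_model * vocab_size      # output ("last layer") projection
    return flops

vocab = 50257  # eg. a GPT-2-style BPE vocabulary
for n_params, d_model in [(10e6, 512), (100e6, 1024), (1e9, 2048)]:
    with_head = train_flops_per_token(n_params, d_model, vocab, True)
    without_head = train_flops_per_token(n_params, d_model, vocab, False)
    print(f"N={n_params:.0e}: last layer ≈ {1 - without_head / with_head:.0%} of per-token compute")
```

Under these illustrative settings the omitted term is most of the compute for a ~10M-parameter model but under 10% for a ~1B-parameter model, so the omission distorts the small-model end of the fit far more than the large-model end.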

With these factors corrected, we obtain excellent agreement with the Hoffmann et al 2022 (ie. "Chinchilla") scaling law. Counter to a hypothesis of Hoffmann et al 2022, we find that careful learning rate decay is not essential for the validity of their scaling law.
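The hypothesis in question concerns the decay horizon of the learning-rate schedule: Hoffmann et al argued that the cosine decay length should be matched to the actual number of training steps. A minimal sketch of that knob (this is not the paper's training code, and the warmup length, peak learning rate, and horizons are arbitrary illustrative values):

```python
import math

# Linear warmup followed by cosine decay; the question is whether decay_horizon
# must match the true number of training steps for the scaling law to hold.
def lr_schedule(step, max_lr, warmup_steps, decay_horizon):
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps            # linear warmup
    progress = min(1.0, (step - warmup_steps) / max(1, decay_horizon - warmup_steps))
    return 0.5 * max_lr * (1 + math.cos(math.pi * progress))  # cosine decay

total_steps = 10_000
matched   = [lr_schedule(s, 3e-4, 500, total_steps) for s in range(total_steps)]
unmatched = [lr_schedule(s, 3e-4, 500, 5 * total_steps) for s in range(total_steps)]  # decays far too slowly
```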

As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW β2 parameter is essential at lower batch sizes.
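For readers wanting to act on the secondary result, the relevant knob is the second-moment decay rate β2 passed to AdamW. The sketch below (PyTorch, with illustrative hyperparameters rather than the paper's fitted values) only shows where that knob is set; the paper's point is that it should be swept alongside learning rate and batch size rather than left at a fixed default when batches are small.

```python
import torch

# Where the β2 knob lives in AdamW (PyTorch shown as an example; the paper's
# tuned values and sweep ranges are not reproduced here).
model = torch.nn.Linear(1024, 1024)  # stand-in for a transformer LM
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,              # illustrative; the paper fits a scaling law for the optimal value
    betas=(0.9, 0.95),    # (β1, β2); β2 is the second-moment decay rate to tune
    weight_decay=0.1,
)
```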