“How Big Should My Language Model Be?”, Teven Le Scao, 2020-06-08:

[Discussion of DL scaling laws and how big = better, with interactive graphs to help visualize the multi-way relationship between dataset / model / validation-loss / FLOPS.]

Research at Hugging Face also leverages this phenomenon, and we’ve combined it with GPU speed estimations to ensure model size is just right for the compute budget of the experiment (when in doubt, it’s bigger than you think!). This blog post will show how this impacts architecture decisions on a standard language modeling benchmark: we replicate the 14-layer state-of-the-art result from Zhang et al’s Transformer-XL paper without any hyper-parameter optimization, while saving 25% of training time. We also estimate that the 18-layer model from the same paper was trained for an order of magnitude too many training steps. Wanna play with our demo before reading? Just click here!
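The intuition behind "bigger than you think" can be sketched with a toy power-law model of validation loss versus training compute. All coefficients below are illustrative assumptions, not fitted values from the post: we assume each model's loss decays as a power law in FLOPS toward an irreducible floor, with the larger model paying more compute per step but enjoying a better exponent and floor.

```python
import numpy as np

# Hedged sketch: assume each model's validation loss follows a
# power law plus an irreducible floor in total training compute C:
#     loss(C) = a * C**(-alpha) + L_inf
# All coefficients are made up for illustration.

def loss(C, a, alpha, L_inf):
    return a * C**(-alpha) + L_inf

# Hypothetical small vs. large model on the same compute axis.
C = np.logspace(15, 21, 200)  # FLOPS budgets
small = loss(C, a=1e3, alpha=0.15, L_inf=2.0)
large = loss(C, a=1e4, alpha=0.20, L_inf=1.8)

# At low budgets the small model wins; past a crossover budget,
# the large model reaches any given loss with less compute.
crossover = C[np.argmax(large < small)]
print(f"large model overtakes small at ~{crossover:.2e} FLOPS")
```

Under these assumed curves, the large model is the better choice for any budget past the crossover, even if training is stopped well before convergence.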

  1. There is an optimal time to stop training (and it’s earlier than you think)

  2. GPUs are optimized for large, wide models

  3. Demonstration on a language modeling task: Wikitext-103

  4. Takeaways

    • Big models are surprisingly efficient!

    • Training until convergence is not efficient at all.

    • Benchmarking smaller-scale runs allows us to predict model performance and optimal stopping time for production-scale models.

    • Using larger models stopped earlier, and optimizing model size for GPU speed, lowers training costs.
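The third takeaway, predicting production-scale performance from small benchmark runs, can be sketched as a log-log fit. The data points and the exact fitting form below are hypothetical assumptions for illustration, not measurements from the post:

```python
import numpy as np

# Hedged sketch of "benchmark small, extrapolate big": fit a power
# law loss ≈ a * params**(-alpha) to a few cheap runs, then predict
# a production-scale model's loss. Data points are made up.
params = np.array([1e6, 3e6, 1e7, 3e7])    # model sizes benchmarked
losses = np.array([5.0, 4.2, 3.5, 2.95])   # measured validation losses

# A linear fit in log-log space recovers the power-law exponent.
slope, intercept = np.polyfit(np.log(params), np.log(losses), 1)
alpha = -slope

def predict(n_params):
    # Extrapolate the fitted power law to a larger model.
    return np.exp(intercept) * n_params**slope

print(f"fitted exponent alpha ~ {alpha:.2f}")
print(f"predicted loss at 1e9 params ~ {predict(1e9):.2f}")
```

The same fit also yields an optimal stopping point: once the marginal loss improvement per unit of compute drops below what a larger model would deliver, further training of the small model is wasted budget.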