There remain many open questions pertaining to the scaling behavior of Transformer architectures. These scaling decisions and findings can be critical, as training runs often come with an associated computational cost that has both financial and environmental impact.
The goal of this paper is to present scaling insights from pretraining and finetuning Transformers. While Kaplan et al. (2020) present a comprehensive study of the scaling behavior of Transformer language models, their scope is limited to the upstream (pretraining) loss. Therefore, it remains unclear whether these findings transfer to downstream tasks within the context of the pretrain-finetune paradigm.
The key findings of this paper are as follows: (1) we show that aside from model size, model shape matters for downstream fine-tuning; (2) scaling protocols operate differently at different compute regions; (3) the widely adopted T5-Base and T5-Large sizes are Pareto-inefficient.
To this end, we present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality while having 50% fewer parameters and training 40% faster compared to the widely adopted T5-Base model.
We publicly release over 100 pretrained checkpoints of different T5 configurations to facilitate future research and analysis.
Figure 1: The predictability and unpredictability of pre-training versus fine-tuning. While upstream pre-training performance measured by negative log-perplexity scales with model size quite independently of model shape, downstream performance (SuperGLUE (avg) score) does not. This indicates that model shape plays an important role in how a model performs on the target task, and that performance is not merely a function of parameter count.
The overall findings and insights of the paper can be summarized as follows:
We find that scaling laws may differ in upstream and downstream setups. Specifically, contrary to Kaplan et al. (2020), we find that downstream performance strongly depends on model shape and not only on model size.
Hence, pretraining performance may not necessarily transfer to downstream applications (Figure 1).
Our findings show that pre-training perplexity can often be a deceptive indicator of downstream quality, and therefore model building based on upstream perplexity can be challenging.
Scaling laws can differ substantially when measured on actual downstream fine-tuning (Figure 1).
Given that empirical scaling laws differ when considering downstream quality, our work investigates the Pareto-frontier of Transformer configurations in this setup.
We find that canonical model configurations such as the T5-Base and T5-Large sizes (Raffel et al., 2019) are relatively inefficient (Figure 2). Note that these sizes are based on the canonical BERT (Devlin et al., 2018) base and large sizes.
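To make the notion of Pareto-inefficiency concrete, the sketch below extracts the Pareto frontier from a set of (training cost, downstream score) points; a configuration is inefficient if another configuration matches or beats its score at lower cost. This is an illustrative helper, not the paper's code, and the example numbers are hypothetical.

```python
def pareto_frontier(points):
    """Given (cost, score) pairs, return the Pareto-optimal subset:
    points for which no cheaper configuration achieves an equal or
    better score. Assumes lower cost and higher score are better."""
    frontier = []
    for cost, score in sorted(points):  # ascending cost
        # keep the point only if it strictly improves on the best score so far
        if not frontier or score > frontier[-1][1]:
            frontier.append((cost, score))
    return frontier

# Hypothetical configurations: (relative training cost, avg downstream score).
configs = [(1.0, 70.0), (2.0, 72.0), (3.0, 71.0), (4.0, 75.0)]
print(pareto_frontier(configs))  # the (3.0, 71.0) point is dominated
```

A configuration like the third one above, which costs more than the second yet scores lower, is exactly the kind of Pareto-inefficiency the paper attributes to some canonical sizes.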
We find that scaling strategies differ at different compute regions, i.e., applying the same strategy at different compute regions (small vs. large) has a different effect on model quality.
This has practical implications, since strategies found at small scale might not necessarily transfer or generalize to higher compute regions (§4.2).
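One simple way to see how scaling behavior can differ by compute region is to fit a separate power law L(C) = a · C^(−b) to each region and compare the fitted exponents. The sketch below does this via a least-squares fit in log-log space on synthetic data; it is an illustrative assumption about how such a check could be done, not the paper's actual fitting procedure.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) by linear regression in log-log space;
    returns (a, b)."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic data drawn from an exact power law, for demonstration only.
C = np.array([1e17, 1e18, 1e19, 1e20])
L = 10.0 * C ** -0.1
a, b = fit_power_law(C, L)
# If the small- and large-compute regions obeyed the same law, fitting
# each region separately would recover (approximately) the same (a, b).
```

In practice one would fit the small-compute and large-compute points separately; materially different exponents across the two fits would indicate that a scaling strategy tuned in one region need not transfer to the other.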
After extensive empirical exploration of the Pareto-frontier of Transformer models, we propose a simple but effective scaling strategy which we call the DeepNarrow strategy. We show that we are able to obtain model quality on par with or better than canonical model sizes (e.g., base) with 50% fewer parameters while training 40% faster.
While we highlight the limitations of this strategy, we also show that the DeepNarrow strategy is applicable to all model sizes (Table 4).
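To build intuition for why a deeper-but-narrower model can have fewer parameters than a shallower, wider one, the sketch below estimates Transformer parameter counts from layer count, model width, and feed-forward width. The formula is a rough approximation (embeddings and layer norms are omitted), and the two configurations compared are hypothetical shapes, not the exact T5 or DeepNarrow configurations from the paper.

```python
def approx_params(layers, d_model, d_ff):
    """Rough parameter count for an encoder-decoder Transformer,
    ignoring embeddings and layer norms."""
    attention = 4 * d_model * d_model        # Q, K, V, output projections
    ffn = 2 * d_model * d_ff                 # two feed-forward projections
    per_encoder_layer = attention + ffn
    per_decoder_layer = 2 * attention + ffn  # self- plus cross-attention
    return layers * (per_encoder_layer + per_decoder_layer)

# Hypothetical shapes: a wide-shallow model vs. a deeper, narrower one.
wide_shallow = approx_params(layers=12, d_model=768, d_ff=3072)
deep_narrow = approx_params(layers=24, d_model=512, d_ff=2048)
print(wide_shallow, deep_narrow)  # the deeper-narrower model is smaller
```

Because attention and feed-forward costs grow quadratically in width but only linearly in depth, doubling depth while shrinking width can reduce the total parameter count, which is the arithmetic underlying a DeepNarrow-style trade-off.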
To consider how well these scaling strategies generalize, we conduct additional experiments on vision Transformers (ViT; Dosovitskiy et al., 2020) to verify them in the vision domain.
Moreover, on top of the 17 GLUE (Wang et al., 2018) / SuperGLUE (Wang et al., 2019) and SQuAD (Rajpurkar et al., 2016) tasks we employed in our extensive study, we verify our findings via additional downstream experiments across 12 diverse language tasks (§4.6).
We release (1) the pre-trained checkpoints for our T5 models with improved scaling protocols and (2) all 100+ model checkpoints, including intermediate training checkpoints, to the research community.
We believe this is a treasure trove of data for studying the behavior of large LM pretraining and finetuning, especially pertaining to scaling laws. The code will be released on GitHub, and the checkpoints are publicly available at our Google Cloud bucket gs://scenic-bucket/scaling_explorer/scaling_explorer.