[cf. Narang et al. 2021; Tay's commentary on the efficient-attention bust] There has been a lot of interest in the scaling properties of Transformer models. However, little work has investigated how scaling behavior varies across different inductive biases and model architectures. Do model architectures scale differently? If so, how does inductive bias affect scaling behavior? And how does this influence upstream (pretraining) and downstream (transfer) performance?
Via extensive experiments, we show that (1) architecture is indeed an important consideration when performing scaling and (2) the best-performing model can fluctuate at different scales.
We believe that the findings outlined in this work have substantial implications for how model architectures are currently evaluated in the community.
Figure 1b: Downstream accuracy. An overview compute–performance (FLOPs vs. performance) plot of all the diverse models and architectures we pretrained and finetuned in this study. Colors represent different model architectures, and circle size represents model size (parameters).
…For the first time, we derive scaling laws for different inductive biases and model architectures. We find that the scaling coefficient differs greatly from model to model. We believe this is an important consideration in model development. It turns out that amongst all ten architectures that we consider, the vanilla Transformer has the best scaling behavior, even if its absolute performance at each compute region is not the greatest…We also find concerning trends where linear-time attention models such as Performer struggle with scaling up…We also note that ALBERT scales (trends) negatively (gets worse) as we scale the model up.
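The scaling coefficient discussed here is the exponent of a power law relating compute to loss, which can be recovered by a linear fit in log-log space. A minimal sketch of that fit (the function name, the synthetic data, and the specific exponent are illustrative assumptions, not from the paper):

```python
import numpy as np

def fit_scaling_law(flops, loss):
    """Fit a power law loss(C) = a * C**(-b) via linear regression
    in log-log space. Returns (a, b); b is the per-architecture
    scaling coefficient that the excerpt says differs across models."""
    slope, intercept = np.polyfit(np.log(flops), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic illustration: a loss curve that follows 3.0 * C**(-0.05).
flops = np.logspace(18, 22, 8)
a, b = fit_scaling_law(flops, 3.0 * flops ** -0.05)
print(round(a, 3), round(b, 3))  # recovers a = 3.0, b = 0.05
```

Comparing the fitted exponent b across architectures is one way to make "scales better" precise: a larger b means loss falls faster with added compute.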
Figure 2: Upstream Negative Log-Perplexity of the vanilla Transformer compared to other models.
…Another somewhat surprising finding is that model shapes such as the width or depth of the Transformer network have minimal effects on the cross-entropy loss over a wide range of scales. [Do we need self-attention at all?] Subsequent works (Henighan et al. 2020; Hernandez et al. 2021) reached similar conclusions for autoregressive generative modeling and for transfer learning, respectively. This finding is also generally supported by Tay et al. 2021b, but discrepancies were found in the gap between pretraining and finetuning—highlighting that observing the downstream performance of large language models is indeed important. In Tay et al. 2021b, the effect of depth was unusually pronounced for downstream performance.