“Training Compute-Optimal Protein Language Models”, 2024-06-09:
We explore how to optimally train protein language models, an area of substantial interest in biological research where guidance on best practices is limited. Most models are trained with extensive compute until performance gains plateau, focusing primarily on increasing model size rather than on the compute-efficient frontier that balances performance against compute budget.
Our investigation is grounded in a massive dataset of 939 million protein sequences. We trained over 300 models, ranging from 0.003 to 10.7 billion parameters, on 5–200 billion unique tokens to investigate the relationships among model size, number of training tokens, and training objective.
First, we observed diminishing returns for the Causal Language Model (CLM) and overfitting for the Masked Language Model (MLM) when training repeatedly on the commonly used UniRef database.
To address this, we included metagenomic protein sequences in the training set to increase diversity and avoid these plateau and overfitting effects.
Second, we derived scaling laws for CLM and MLM on the Transformer architecture, tailored to the specific characteristics of protein sequence data.
Third, we observe a transfer-scaling phenomenon from CLM to MLM, demonstrating the effectiveness of transfer via scaling behavior based on estimated Effectively Transferred Tokens.
Finally, to validate our scaling laws, we compare against the large-scale versions of ESM-2 and PROGEN2 on downstream tasks, encompassing evaluations of protein generation as well as structure- and function-related tasks, all within equivalent or smaller pre-training compute budgets.
…We focus on the best practices, which include revisiting datasets, optimization objectives, and model parameters as key factors. Our goal is to investigate an optimal training scheme for protein language models given predetermined compute budgets. Our core findings are as follows:
We revisited the protein sequence data used for training PLMs and collected a dataset of 194 billion unique tokens across 939M unique sequences from publicly available sources, to address the issues of overfitting and performance plateaus in protein language modeling.
We find that, in both MLM and CLM, training data scales sub-linearly with model size, but the two objectives follow distinct power laws.
For MLM, model size scales with compute with an exponent of ~0.77. In other words, a 10× increase in compute leads to roughly a 6× increase in MLM model size and a 70% increase in data, versus a 4× increase in CLM model size and a 3× increase in training tokens.
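The quoted multipliers follow mechanically from the power-law exponent. A minimal sketch, assuming the standard simplification that compute scales as the product of model size and training tokens (C ∝ N·D); only the 0.77 exponent comes from the text, the function and its name are illustrative:

```python
def optimal_allocation_multipliers(compute_factor: float, exponent: float):
    """Given a k-fold compute increase and a model-size compute exponent a
    (N_opt proportional to C^a), return the implied multipliers for model
    size and training tokens, assuming C is proportional to N * D."""
    model_mult = compute_factor ** exponent
    data_mult = compute_factor / model_mult  # D_opt proportional to C^(1 - a)
    return model_mult, data_mult

# MLM: exponent ~0.77 (from the text)
n_mult, d_mult = optimal_allocation_multipliers(10, 0.77)
# n_mult is about 5.9 (the "6x" model-size increase)
# d_mult is about 1.7 (the "70% increase in data")
```

With exponent 0.77, a 10× compute budget gives 10^0.77 ≈ 5.9× the model size, leaving 10/5.9 ≈ 1.7× the tokens, matching the figures above.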
We also find that models trained with CLM can be transferred to MLM.
Given a predetermined compute budget, when one wants to obtain both a CLM and an MLM, there is a trade-off in allocating training tokens to each model to jointly optimize their performance. Interestingly, the optimal allocation for CLM pre-training is determined by the scaling laws of CLM and MLM together with the Effectively Transferred Tokens Dt from CLM to MLM. Furthermore, we verify this method experimentally using a 470M-parameter model fine-tuned on downstream tasks.
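The allocation trade-off above can be sketched as a one-dimensional search over the token split. This is an illustration only: the power-law loss forms, all coefficients, and the constant `transfer_rate` standing in for Effectively Transferred Tokens Dt are placeholder assumptions, not the paper's fitted values:

```python
def clm_loss(tokens, A=12.0, alpha=0.25):
    # Generic power-law loss in training tokens (placeholder constants).
    return A * tokens ** (-alpha)

def mlm_loss(tokens, A=10.0, alpha=0.3):
    return A * tokens ** (-alpha)

def joint_objective(split, total_tokens, transfer_rate=0.5):
    """split = fraction of the budget spent on CLM pre-training.
    transfer_rate crudely models Dt as a fixed fraction of CLM tokens
    credited toward MLM (an assumption for illustration)."""
    clm_tokens = split * total_tokens
    mlm_tokens = (1 - split) * total_tokens + transfer_rate * clm_tokens
    return clm_loss(clm_tokens) + mlm_loss(mlm_tokens)

# Crude grid search over the split for a fixed token budget.
total = 1e11
best = min((joint_objective(s / 100, total), s / 100) for s in range(1, 100))
```

Under these placeholder assumptions the optimum lies strictly between spending everything on one objective, which is the qualitative trade-off the text describes; the paper instead derives the split analytically from its fitted scaling laws and estimated Dt.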
Building on our scaling strategies, we reevaluated the allocation of model size and training tokens under the compute budgets of established PROGEN2-xlarge and ESM-2 (3B) setups.
Consequently, with the same compute budgets, we trained two corresponding models, one with 7.2B parameters and the other with 10.7B, which exhibited enhanced performance across a diverse range of downstream tasks.