Building upon OpenAI’s recent work on scaling laws, my project explores how much pre-training on English helps when transferring across different languages.
Here, I will discuss scaling laws discovered while fine-tuning across different languages with pre-trained English language models. Specifically, I found that (1) pre-trained English models help most when learning German, then Spanish, and finally Chinese and (2) transfer from English to Chinese, German, and Spanish scales predictably in terms of parameters, data, and compute.
My experiments try to answer the question: How much does pre-training on English help when transferring across different languages as we vary the dataset size and model size?
…Effective Data Transfer:
Figure 4: The performance of a 16M parameter transformer model on Chinese, both trained from scratch on Chinese and pre-trained on English then fine-tuned on Chinese.
In my experiments, I wanted to measure the effective data transferred from English pre-training when fine-tuning on Chinese, Spanish, and German text. Effective data transferred is defined in “Scaling Laws for Transfer” as the amount of additional fine-tuning data that a model of the same size, trained only on that fine-tuning dataset, would have needed to achieve the same loss as the pre-trained model. In the figure above, each point is a 16M parameter transformer trained to convergence on a dataset of X tokens. The total amount of data required by the model trained from scratch can be written as De = Df + Dt, where De is the total effective data, Df is the amount of fine-tuning data the pre-trained model used, and Dt is the additional data the from-scratch model would need to match the fine-tuned model's loss. Dt is therefore the amount of data transferred from pre-training on English.
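Concretely, Dt can be estimated by reading the from-scratch loss curve at the loss the fine-tuned model reached: the tokens the from-scratch model would need is De, and Dt = De − Df. A minimal sketch of that calculation with NumPy, using made-up loss curves (the function name and all numbers are hypothetical, not my actual measurements):

```python
import numpy as np

def effective_data_transferred(scratch_tokens, scratch_loss,
                               finetune_tokens, finetune_loss):
    """Estimate De and Dt = De - Df for each fine-tuning dataset size.

    scratch_tokens, scratch_loss: from-scratch curve (tokens vs. converged loss)
    finetune_tokens, finetune_loss: fine-tuned curve (Df vs. converged loss)
    """
    # Loss decreases as data grows, and np.interp needs increasing x,
    # so sort the from-scratch curve by loss and interpolate log(tokens).
    order = np.argsort(scratch_loss)
    log_de = np.interp(finetune_loss,
                       np.asarray(scratch_loss)[order],
                       np.log(np.asarray(scratch_tokens))[order])
    d_e = np.exp(log_de)                      # De: from-scratch data needed
    d_t = d_e - np.asarray(finetune_tokens)   # Dt: effective data transferred
    return d_e, d_t
```

Interpolating in log-token space keeps the estimate sensible when loss curves span several orders of magnitude in dataset size.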
Figure 5: Comparing the performance of 16M parameter transformers trained from scratch and fine-tuned on Chinese, Spanish, and German.
For a fine-tuning dataset of 8,000 tokens, Dt, the amount of data transferred, is largest for German. The dashed line on each graph represents Dt. As the number of tokens in the fine-tuning dataset increases, Dt shrinks across all languages.
As seen in the figures above, for the same model size, English to German had the greatest amount of data transferred, English to Spanish less, and English to Chinese the least. Pre-trained English models therefore help most when learning German, followed by Spanish, and finally Chinese. I believe these results reflect the degree of linguistic similarity between English and each language. English and German both derive from Proto-Germanic and are linguistically closest. Spanish shares almost all of its alphabet with English but is a Romance language, and Chinese does not share an alphabet with English at all. Each language also has a distinctive curve shape and gap between fine-tuning and training from scratch. For instance, at the smallest dataset size, 8,000 tokens, the effective data transferred for Spanish is not much greater than for Chinese. However, as the dataset grows, pre-training continues to help on Spanish for roughly another order of magnitude, with the curves converging around the 100M token dataset size, whereas for Chinese they converge around 10M tokens.
…I find many of the same trends and relationships that “Scaling Laws for Transfer” found between text and code also hold between English and these languages. In the low-data regime, pre-training is helpful across model sizes, and especially for large models…Lastly, pre-trained models are more compute efficient than models trained from scratch across dataset sizes. This comparison does not account for the compute spent on pre-training itself.
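The "scales predictably" claim can be illustrated by fitting a power law of the form Dt = k · Df^α · N^β, the functional form used in “Scaling Laws for Transfer”, to measurements of effective data transferred. A sketch with NumPy, where every measurement below is invented for illustration, not taken from my experiments:

```python
import numpy as np

# Hypothetical measurements: fine-tuning dataset size D_F (tokens),
# model size N (parameters), and effective data transferred D_T (tokens).
D_F = np.array([1e4, 1e5, 1e6, 1e4, 1e5, 1e6])
N   = np.array([1e6, 1e6, 1e6, 1e7, 1e7, 1e7])
D_T = np.array([2e4, 8e4, 3e5, 5e4, 2e5, 7e5])

# Taking logs turns the power law into a linear model:
#   log D_T = log k + alpha * log D_F + beta * log N
X = np.column_stack([np.ones_like(D_F), np.log(D_F), np.log(N)])
coef, *_ = np.linalg.lstsq(X, np.log(D_T), rcond=None)
log_k, alpha, beta = coef
print(f"k = {np.exp(log_k):.3g}, alpha = {alpha:.2f}, beta = {beta:.2f}")
```

A log-linear least-squares fit like this is the simplest way to check whether measured Dt values actually follow a power law: if they do, the residuals are small and the fitted exponents summarize how transfer grows with fine-tuning data and model size.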