“Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model”, 2022-01-28:
[blog] Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models.
As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer-based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. In this paper, we first focus on the infrastructure as well as the 3D parallelism methodology used to train this model using DeepSpeed and Megatron. Next, we detail the training process, the design of our training corpus, and our data curation techniques, which we believe are key ingredients in the success of the model. Finally, we discuss various evaluation results, as well as other interesting observations and new properties exhibited by MT-NLG.
We demonstrate that MT-NLG achieves superior zero-shot, one-shot, and few-shot learning accuracies on several NLP benchmarks and establishes new state-of-the-art results.
We believe that our contributions will help further the development of large-scale training infrastructures, large-scale language models, and natural language generation.
…The validation cross-entropy loss is 3.15 after the model is trained on the first 1 billion tokens. As mentioned earlier, we increase the batch size linearly over the first 12 billion tokens. At the end of this phase, the loss becomes 2.31. When the model reaches our targeted number of tokens, 270 billion, the validation loss becomes 1.85.
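The linear batch-size ramp over the first 12 billion tokens can be sketched as a simple interpolation. The starting and final batch sizes below (32 and 1,920 sequences) are illustrative assumptions, not values stated in this excerpt:

```python
def batch_size_at(tokens_seen, start_bs=32, final_bs=1920,
                  ramp_tokens=12_000_000_000):
    """Linearly interpolate the batch size over the warm-up window.

    start_bs / final_bs are assumed values for illustration; the excerpt
    only states that the ramp is linear over the first 12B tokens.
    """
    if tokens_seen >= ramp_tokens:
        return final_bs
    frac = tokens_seen / ramp_tokens  # fraction of the ramp completed
    return int(start_bs + frac * (final_bs - start_bs))
```

In practice the interpolated value would also be rounded to a multiple of the data-parallel degree so each replica receives an equal share.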
Table 2: LAMBADA zero-shot, one-shot and few-shot accuracy. MT-NLG outperforms previous models across different settings and establishes a new SOTA for all 3 settings. We did not find any recent strong supervised baseline for LAMBADA, hence we omit the comparison with supervised models here.

Model           Zero-shot  One-shot  Few-shot
GPT-3           76.20      72.50     86.40
Gopher          74.50      -         -
MT-NLG (ours)   76.56      73.06     87.15

…To our pleasant surprise, MT-NLG is quite capable of solving riddles, answering Jeopardy! questions, and even generating code off-the-shelf. We present some examples of each category below.
Riddle Answer Generation: We used riddles to probe the model’s reasoning capability in an ambiguous context, crafting each riddle ourselves to ensure it did not appear in the training set. We first observe that in a riddle-solving context, the model tends to generate its interpretation of each line in the riddle along with its answer. While not always perfect, these interpretations make good sense most of the time. One such example is shown in Table 13. For riddles that are ambiguous enough to have multiple plausible answers, MT-NLG not only generates alternative plausible answers through stochastic sampling, but it can also generate alternative interpretations matching the answer it has generated (Table 14).
Jeopardy! Questions: Question answering datasets [30, 25] often pose specific and direct questions to benchmark the models. However, we are also interested in how the model can use the knowledge it has memorized in a guessing-game setting, where some reasoning over the hints is required. To this end, we take several Jeopardy! questions from the most recent episode and let our model generate the answers. Since Jeopardy! questions use a reverse trivia format, where the “question” is phrased as an answer and contestants are asked to respond with the matching question, we use a few-shot setting to inform the model of the task format. MT-NLG generates fairly plausible answers and in fact gets the correct ones in most cases. Some examples are shown in Table 15.
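The few-shot setup described above can be sketched as a prompt builder. The clue/response labels and example clues here are hypothetical illustrations of the reverse-trivia format, not the exact prompt from the paper:

```python
def build_fewshot_prompt(examples, clue):
    """Build a few-shot prompt for the Jeopardy! reverse-trivia format.

    Each example is a (clue, question) pair, where the clue reads like an
    answer and the target is the matching question. The labels "Clue:" and
    "Question:" are assumed formatting, chosen for illustration.
    """
    lines = []
    for example_clue, example_question in examples:
        lines.append(f"Clue: {example_clue}\nQuestion: {example_question}\n")
    # Leave the final question blank for the model to complete.
    lines.append(f"Clue: {clue}\nQuestion:")
    return "\n".join(lines)
```

The in-context examples teach the model that a well-formed continuation is itself phrased as a question ("What is …?"), which a direct QA format would not convey.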
Code Generation: The recent development of code generation using language models suggests that large-scale pretrained LMs already show decent code generation capabilities from pretraining alone. To this end, we investigate the code generation capability of MT-NLG off-the-shelf. We present some function signatures with detailed comments to see how MT-NLG would complete the implementation of the missing function. We observe that MT-NLG is capable of generating syntactically correct code consistently, and is also able to arrive at correct implementations for simple tasks. We sometimes observe that the model will generate an answer making use of another function, and then move on to generate the invoked function after the current one is finished. Some examples of this are shown in Table 16.
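A hypothetical illustration of the setup: the model is shown a signature plus a detailed comment, and asked to complete the body. The task below (`count_vowels`) is an invented stand-in, not one of the paper's actual prompts; the body is the kind of completion described for simple tasks:

```python
def count_vowels(s):
    """Return the number of vowels (a, e, i, o, u), case-insensitive, in s."""
    # The signature and docstring above play the role of the prompt;
    # the line below stands in for the model's completion (illustrative).
    return sum(1 for ch in s.lower() if ch in "aeiou")
```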
Inferring Arithmetic Operations: Understanding and using mathematical operations is yet another aspect of language understanding. Prior work [GPT-3] has demonstrated that a strong language model, even if not trained specifically to solve math problems, can answer simple arithmetic questions with a degree of accuracy beyond chance. However, some doubts remain as to whether the model indeed has some understanding of math expressions, or whether it simply rehashes examples encountered during training. To this end, we devise a new task where we obfuscate operator symbols in an expression and check if our model can reverse-engineer the arithmetic operation. We observe that common operations like addition, subtraction, multiplication and division can usually be inferred correctly. Some examples of this task are shown in Table 17.
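The task's ground truth can be checked mechanically: given few-shot examples that use an obfuscated symbol (say `@` in place of `+`), which standard operations are consistent with all of them? A minimal sketch, assuming the four operations named above as the candidate set:

```python
import operator

# Candidate operations the obfuscated symbol might stand for.
CANDIDATES = {"+": operator.add, "-": operator.sub,
              "*": operator.mul, "/": operator.truediv}

def infer_operator(examples):
    """Return the operator symbols consistent with every (a, b, result) triple.

    Mirrors the reverse-engineering task: the model sees expressions like
    "5 @ 3 = 8" and must infer which operation "@" denotes.
    """
    matches = []
    for sym, fn in CANDIDATES.items():
        try:
            if all(abs(fn(a, b) - r) < 1e-9 for a, b, r in examples):
                matches.append(sym)
        except ZeroDivisionError:
            continue  # division is inconsistent with any b == 0 example
    return matches
```

Note that a single example can be ambiguous (e.g. `2 @ 2 = 4` fits both `+` and `*`), which is why the task is posed with several few-shot examples per symbol.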
Free-form Generative Writing Assistance: We qualitatively examined the free-form generation capability of MT-NLG by enlisting the model to help author the abstract of this paper. This was done by prompting MT-NLG with the text from §1, then sampling the model sentence by sentence. For each sentence, multiple candidates were generated, from which one was picked and edited if necessary. We repeated this process until the abstract excerpt appeared complete. [not included?]