“[Ali Released PLUG: 27 Billion Parameters, the Largest Pre-Trained Language Model in the Chinese Community]”, Zhao Yuying, 2021-04-19⁠:

[Google Translate] Today, Alibaba officially released the pre-trained language model PLUG (Pre-training for Language Understanding and Generation), the largest pre-trained language model in the Chinese community to date, with 27 billion parameters. It has just taken first place on the classification tasks of CLUE, the most authoritative benchmark for Chinese language models.

… It is understood that PLUG was trained on more than 1TB of high-quality Chinese text covering a wide range of genres and domains, including news, novels, poetry, and Q&A; model training relied on the Alibaba Cloud EXAFLOPS high-performance AI computing cluster.

PLUG is a super-large-scale pre-trained Chinese model that unifies language understanding and generation, and is currently the largest pure-text pre-trained language model in the Chinese community. The goal is to use the capabilities of a very large model to substantially improve performance on the major tasks of Chinese NLP, ultimately exceeding human performance.

According to Alibaba's DAMO Academy, compared with other large-scale generative models such as OpenAI's GPT-3, PLUG has the following advantages:

On the latest Chinese Language Understanding Evaluation benchmark (CLUE), the PLUG R&D team evaluated PLUG's language-understanding ability on CLUE's classification tasks. Using only an ensemble of downstream models trained with several sets of hyperparameters, it achieved first place.

PLUG technical details: Previously, the NLU language model StructBERT and the NLG language model PALM, both developed by the Machine Intelligence Laboratory of DAMO Academy, achieved SOTA results in their respective fields. Briefly, StructBERT strengthens the model's grasp of grammar by adding sentence-level (Sentence Structural Objective) and word-level (Word Structural Objective) structure-modeling terms to the training objective. PALM combines two pre-training paradigms, autoencoding and autoregression: it introduces a Masked LM objective to improve the encoder's representations, while improving the decoder's generation ability by training it to predict the second half of the text. For training the large-scale language model, the DAMO Academy team combined the strengths of the two and proposed a simple framework for joint NLU & NLG training. Compared with the GPT-series models, this large generative model uses StructBERT as its encoder, giving it strong bidirectional understanding of the input text, so it can generate content more relevant to the input.
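The joint objective described above can be sketched in a few lines. The following is an illustrative, simplified data-preparation step (not DAMO Academy's actual code, and the helper name is hypothetical): the encoder reads the first half of a passage with MLM-style masking, while the decoder is trained to generate the second half, combining the autoencoding and autoregressive signals.

```python
import random

MASK = "[MASK]"

def make_palm_example(tokens, mask_prob=0.15, rng=None):
    """Build one PALM-style training example: the encoder input is the
    first half of the passage with random MLM masking, and the decoder
    target is the second half, to be generated autoregressively."""
    rng = rng or random.Random(0)
    split = len(tokens) // 2
    enc_input = [MASK if rng.random() < mask_prob else t
                 for t in tokens[:split]]
    # MLM targets are the original tokens at the masked positions.
    mlm_targets = [t for t, m in zip(tokens[:split], enc_input)
                   if m == MASK]
    dec_target = tokens[split:]
    return enc_input, mlm_targets, dec_target
```

In a real implementation the two losses (masked-token prediction on the encoder side and next-token prediction on the decoder side) would be computed from these three pieces and summed.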

The entire training process is divided into two stages. In the first stage, the DAMO Academy team trained a standard 24-layer StructBERT model with hidden size 8,192 as the encoder, on a total of 300B tokens of training data, a scale comparable to GPT-3's.
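As a rough sanity check on the headline figure, the standard ~12·h² parameters-per-layer approximation for a transformer (ignoring embeddings, biases, and layer norms) puts the 24-layer encoder plus the 6-layer decoder described below in the same ballpark as the reported 27 billion parameters:

```python
def transformer_params(layers: int, hidden: int) -> int:
    # Rough count: per layer, ~4*h^2 for attention (Q, K, V, and output
    # projections) plus ~8*h^2 for a feed-forward block with 4x expansion.
    # Embeddings, biases, and layer norms are ignored.
    return layers * 12 * hidden ** 2

encoder = transformer_params(24, 8192)  # stage-1 StructBERT encoder
decoder = transformer_params(6, 8192)   # stage-2 decoder
print(f"~{(encoder + decoder) / 1e9:.1f}B parameters")  # ~24.2B
```

The gap between ~24B and the reported 27B is plausibly the embedding matrices and other terms this approximation omits.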

In the second stage, the DAMO Academy team used this encoder to initialize the generation model and attached a 6-layer decoder with hidden size 8,192. During generation-model training, sequence lengths were randomly sampled from [32, 512] on both the encoder and decoder sides, to make the model suitable for a wide range of downstream generation tasks. This stage trained on a total of 100B tokens. For the first 90% of training, the team retained the Masked LM task to preserve the model's NLU capability; for the final 10%, the MLM task was removed and the model fine-tuned, lowering generation perplexity (PPL) and achieving better generation quality.
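The stage-2 sampling and task schedule can be sketched minimally as follows; the helper names are hypothetical, and the article specifies only the [32, 512] length range and the 90%/10% split:

```python
import random

def sample_length(rng: random.Random, lo: int = 32, hi: int = 512) -> int:
    """Draw a sequence length uniformly from [32, 512]; per the article,
    this is done on both the encoder and the decoder side."""
    return rng.randint(lo, hi)

def keep_mlm(step: int, total_steps: int, mlm_fraction: float = 0.9) -> bool:
    """Retain the auxiliary Masked LM task for the first 90% of stage-2
    training, then drop it for the final 10% to lower generation PPL."""
    return step < mlm_fraction * total_steps
```

A training loop would call `sample_length` once per side when building each batch, and consult `keep_mlm` to decide whether to add the MLM loss term at the current step.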

… According to Qingyuan CPM’s plan, from July to September 2021 the full model will contain about 100 billion parameters, with training data comprising 1TB of Chinese-centric multilingual data and a knowledge graph with hundreds of millions of entity relations.

Now, with the official release of PLUG, Alibaba has once again advanced pre-trained models in the Chinese community. Next, PLUG will expand to 200 billion parameters and further improve the quality of its text generation. Beyond the Chinese-centric PLUG (27 billion parameters), DAMO Academy has also released, jointly with the Zhiyuan Research Institute and Tsinghua University, a new super-large-scale pre-trained model for cognition, “Wenhui” (11.3 billion parameters), and, jointly with Tsinghua University, the super-large-scale multi-modal pre-trained model “M6” (100 billion parameters).