“GLaM: Efficient Scaling of Language Models With Mixture-Of-Experts”, 2021-12-13:
Scaling language models with more data, compute, and parameters has driven progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires substantial computing resources.
In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
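The core idea of a sparsely activated mixture-of-experts layer is that a lightweight gating network routes each token to only a few of many expert sub-networks, so parameter count grows without a proportional growth in per-token compute. A minimal illustrative sketch (not the paper's actual implementation; the function name, shapes, and top-k routing with softmax-weighted combination are assumptions for exposition):

```python
import numpy as np

def moe_layer(x, w_gate, experts, k=2):
    """Illustrative sparse mixture-of-experts layer.

    x:       (tokens, d_model) input activations
    w_gate:  (d_model, n_experts) gating weights
    experts: list of callables, each mapping (d_model,) -> (d_model,)
    Only the top-k experts per token are evaluated, so per-token
    compute stays roughly constant as n_experts grows.
    """
    logits = x @ w_gate                         # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -k:]   # top-k expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        weights = np.exp(sel - sel.max())
        weights /= weights.sum()                # softmax over the selected experts
        for w, e in zip(weights, top[t]):
            out[t] += w * experts[e](x[t])      # weighted sum of k expert outputs
    return out
```

With `k=2` and, say, 64 experts, each token pays for only 2 expert evaluations while the layer holds 64 experts' worth of parameters; this is the mechanism by which capacity scales faster than training cost.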
The largest GLaM has 1.2 trillion parameters, ~7× larger than GPT-3. It consumes only 1⁄3 of the energy used to train GPT-3 and requires half the computation FLOPs for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.