[blog; cf. UniLM, MAE] Existing pre-trained models are generally geared towards a particular class of problems. To date, there still seems to be no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups.
We begin by disentangling architectural archetypes from pre-training objectives, two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP, showing how different pre-training objectives can be cast as one another and how interpolating between objectives can be effective.
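As a rough illustration of this unified perspective, the sketch below casts denoising as a single function that maps a token sequence to an (inputs, targets) pair; different settings of the span length and corruption rate recover span-corruption, prefix-LM, and causal-LM-like objectives. The function name, sentinel format, and sampling scheme are simplifications for illustration, not the paper's implementation.

```python
import random

def span_corrupt(tokens, mean_span_len, corruption_rate, seed=0):
    """Cast a token sequence into a single (inputs, targets) denoising pair.

    Minimal sketch: spans are replaced with sentinel markers in the inputs,
    and the targets list the masked spans to be reconstructed. Varying the
    two knobs interpolates between objectives:
      - short spans, low corruption rate       -> T5-style span corruption
      - one long span covering the suffix      -> a prefix language model
      - one span covering (almost) all tokens  -> close to a causal LM
    """
    rng = random.Random(seed)
    n_spans = max(1, round(len(tokens) * corruption_rate / mean_span_len))
    starts = sorted(rng.sample(range(len(tokens)), n_spans))  # simplified sampling
    inputs, targets, end = [], [], 0
    for i, start in enumerate(starts):
        start = max(start, end)                        # keep spans non-overlapping
        stop = min(len(tokens), start + mean_span_len)
        if start >= stop:
            continue
        sentinel = f"<extra_id_{i}>"
        inputs += tokens[end:start] + [sentinel]       # visible context + placeholder
        targets += [sentinel] + tokens[start:stop]     # masked span to reconstruct
        end = stop
    inputs += tokens[end:]                             # trailing uncorrupted tokens
    return inputs, targets

# Example: corrupt roughly 30% of tokens in spans of mean length 2.
print(span_corrupt("the quick brown fox jumps over the lazy dog".split(), 2, 0.3))
```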
We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms. We also introduce the notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. Extensive ablative experiments comparing multiple pre-training objectives show that our method pushes the Pareto frontier, outperforming T5- and GPT-like models across multiple diverse setups.
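To make MoD and mode switching concrete, here is a minimal sketch building on the `span_corrupt` function above: each pre-training example samples one denoiser from a weighted mixture and prepends the corresponding mode token. The mixture weights, span lengths, and corruption rates shown are illustrative assumptions, not the exact UL2 configuration.

```python
import random

# Illustrative Mixture-of-Denoisers configuration (assumed values for the sketch).
DENOISERS = [
    # (mode token, mean span length, corruption rate, sampling weight)
    ("[R]", 3,    0.15, 0.50),  # R: regular, T5-style span corruption
    ("[X]", 32,   0.50, 0.25),  # X: extreme corruption (long spans / high rate)
    ("[S]", None, None, 0.25),  # S: sequential denoising, i.e. a prefix LM
]

def make_pretraining_example(tokens, rng):
    """Sample one denoiser per example and prepend its mode token.

    Mode switching: the same [R]/[S]/[X] tokens are prepended again at
    fine-tuning or inference time to switch the model into the mode that
    best matches the downstream task.
    """
    mode, mu, rate, _ = rng.choices(DENOISERS, weights=[d[3] for d in DENOISERS])[0]
    if mode == "[S]":
        split = rng.randint(1, len(tokens) - 1)   # prefix LM: predict the suffix
        inputs, targets = tokens[:split], tokens[split:]
    else:
        inputs, targets = span_corrupt(tokens, mean_span_len=mu, corruption_rate=rate)
    return [mode] + inputs, targets

rng = random.Random(0)
print(make_pretraining_example("a b c d e f g h i j".split(), rng))
```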
Finally, by scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised NLP tasks, spanning language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long-text reasoning, structured knowledge grounding, and information retrieval.
Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
We also show that UL2 20B works well with chain-of-thought prompting and reasoning.
…UL2 for Few-Shot Prompting and Chain-of-Thought Reasoning: We scale up UL2 and train a 20-billion-parameter encoder-decoder model on the public C4 corpus, and demonstrate some impressive capabilities of the UL2 20B model.
UL2 is a powerful in-context learner that excels at both few-shot and chain-of-thought (CoT) prompting. In the table below, we compare UL2 with other state-of-the-art models (e.g., T5 XXL and PaLM) for few-shot prompting on the XSum summarization dataset. Our results show that UL2 20B outperforms PaLM and T5, both of which are in the same ballpark in terms of compute cost.
Comparison of UL2 with T5 XXL, PaLM, and LaMDA 137B on 1-shot summarization (XSum), in terms of ROUGE-1/2/L (higher is better), which measure quality by comparing the generated summaries against gold reference summaries.
Most CoT prompting results have been obtained with much larger language models, such as GPT-3 175B, PaLM 540B, or LaMDA 137B. We show that reasoning via CoT prompting can be achieved with UL2 20B, which is both publicly available and several times smaller than prior models that leverage chain-of-thought prompting. This opens an avenue for researchers to conduct research on CoT prompting and reasoning at an accessible scale. In the table below, we show that for UL2, CoT prompting outperforms standard prompting on math word problems with a range of difficulties (GSM8K, SVAMP, ASDiv, AQuA, and MAWPS). We also show that self-consistency further improves performance.
Chain-of-thought (CoT) prompting and self-consistency (SC) results on 5 arithmetic reasoning benchmarks.
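The self-consistency step is simple to state: instead of greedily decoding a single reasoning chain, sample several chains at non-zero temperature and take a majority vote over their final answers. Below is a minimal sketch, assuming a hypothetical `model.generate(prompt, temperature)` interface and a simple answer-extraction regex; neither is part of the paper.

```python
import re
from collections import Counter

def self_consistency_answer(model, cot_prompt, num_samples=16, temperature=0.7):
    """Self-consistency sketch: sample several chain-of-thought completions
    and return the most frequent final answer (majority vote).

    `model.generate` is a placeholder for whatever text-generation interface
    is available; the regex assumes completions end with a phrase such as
    'The answer is 42.'
    """
    answers = []
    for _ in range(num_samples):
        completion = model.generate(cot_prompt, temperature=temperature)
        match = re.search(r"The answer is\s*(-?[\d.,]+)", completion)
        if match:
            answers.append(match.group(1).rstrip(".").replace(",", ""))
    # Majority vote over the sampled reasoning paths.
    return Counter(answers).most_common(1)[0][0] if answers else None
```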