Large language models such as GPT-3 (Brown et al., 2020) can perform arbitrary tasks without undergoing fine-tuning after being prompted with only a few labeled examples. An arbitrary task can be reformulated as a natural language prompt, and a language model can be asked to generate the completion, indirectly performing the task in a paradigm known as prompt-based learning.
To date, emergent prompt-based learning capabilities have mainly been demonstrated for unidirectional language models. However, bidirectional language models pre-trained on denoising objectives such as masked language modeling produce stronger learned representations for transfer learning. This motivates the possibility of prompting bidirectional models, but their pre-training objectives have made them largely incompatible with the existing prompting paradigm.
We present SAP (Sequential Autoregressive Prompting), a technique that enables the prompting of bidirectional models.
Using machine translation as a case study, we prompt the bidirectional mT5 model (Xue et al., 2021) with SAP and demonstrate that its few-shot and zero-shot translations outperform the few-shot translations of unidirectional models like GPT-3 and XGLM (Lin et al., 2021), despite mT5's approximately 50% fewer parameters.
We further show SAP is effective on question answering and summarization.
…We propose a range of improvements, namely filtering, prompt ensembling, and English-centric bootstrapping, to the unsupervised machine translation procedure outlined by Han et al. (2021) to better adapt the bootstrapping process for unsupervised low-resource machine translation.
For the first time, our results demonstrate prompt-based learning is an emergent property of a broader class of language models, rather than only unidirectional models.
…We hypothesize that these future bidirectional training schemes could yield an approach that overcomes the efficiency limitations of SAP, while maintaining its performance and parameter-count benefits. Concurrent recent work that compares or mixes unidirectional and bidirectional pre-training objectives (Wang et al., 2022; Tay et al., 2022; Soltan et al., 2022) already provides some early evidence for this hypothesis.
…Sequential Autoregressive Prompting (SAP) Technique: By requiring mT5 to in-fill <X>, we are effectively asking it to translate the Spanish sentence. However, due to the limitations of the denoising pre-training objective on prompting (described in §2.1), we observe mT5 often outputs a partial translation of the beginning of the source sentence, rather than the full translation. To overcome this, we prompt mT5 T times until the model generates a stop token </s>, resulting in a longer translation. At each time step, we keep the first word generated (using the space character as the delimiter) and concatenate it onto the last line of the prompt to use in the next time step. This iterative prompting enables us to extract longer generations. Formally, we denote the generation at each time step t as Gt. We denote the first word generated at each time step as Ft, where Ft = SPLIT(Gt, " ")[0]. We update the prompt at each time step Pt to include the cumulative generation from all previous time steps concatenated onto the last line of the prompt. The prompt used at each time step Pt is as follows:
Translate Spanish to English. Spanish: El clima es soleado.</s> English: The weather is sunny.</s> Spanish: Mi perro es un cachorro.</s> English: My dog is a puppy.</s> Spanish: Los Ć”rboles son importantes.</s> English: CONCAT(F0, ā¦, Ftā1)<X>
Figure 1: A visualization of our SAP technique extracting high-quality translations from mT5. In the zero-shot setting, the examples used in the prompt are synthetic examples retrieved in a fully unsupervised manner.
In Table 1, we also consider sequential prompting, which concatenates the entire generation Gt instead of just the first word Ft, but find that it produces substantially inferior results, as low-quality tokens are generated after the first word. By conditioning the model to generate the next word of the translation on the previously generated words, this technique resembles autoregression. mT5 is already autoregressive, but only at the decoder level; adding previously generated words back into the prompt allows them to pass through the encoder layers as well. For this reason, we call this technique SAP (Sequential Autoregressive Prompting). To provide a signal to stop generation, we add our stop token at the end of each example in the prompt, and we stop prompting after the model generates a stop token. The overall process is graphically depicted, with stop tokens omitted, in Figure 1.
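The iterative procedure above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `infill` stands in for the model's in-fill interface (returning the text the model generates for the <X> sentinel), and `stub_infill` is a toy stand-in for mT5 used only to exercise the loop.

```python
def sap_generate(infill, prompt_prefix, stop_token="</s>", max_steps=50):
    """SAP loop: keep only the first word F_t of each generation G_t
    (F_t = SPLIT(G_t, " ")[0]) and append it to the prompt P_t before
    re-prompting, until the model emits the stop token."""
    kept = []  # F_0, ..., F_{t-1}
    for _ in range(max_steps):
        # P_t ends with the cumulative generation followed by the sentinel.
        prompt = prompt_prefix + " ".join(kept) + "<X>"
        g_t = infill(prompt)
        if not g_t:
            break
        f_t = g_t.split(" ")[0]
        if f_t == stop_token:  # model signals the translation is complete
            break
        kept.append(f_t)
    return " ".join(kept)


# Hypothetical stand-in for mT5: continues a fixed target translation from
# wherever the prompt's final "English:" line leaves off (illustration only).
TARGET = ["The", "trees", "are", "important.", "</s>"]
PREFIX = ("Translate Spanish to English. "
          "Spanish: Los árboles son importantes.</s> English: ")

def stub_infill(prompt):
    produced = prompt.rsplit("English: ", 1)[1].replace("<X>", "").split()
    return " ".join(TARGET[len(produced):])

print(sap_generate(stub_infill, PREFIX))  # The trees are important.
```

Note that the stub returns the full remaining translation at each step, mimicking the observation that only the first generated word is reliable; the loop discards everything after F_t by construction.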
Figure 2: A visualization of the bootstrapping process described in §4.