Crosslingual conditional generation (e.g., machine translation) has long enjoyed the benefits of scaling. Nonetheless, there are still issues that scale alone may not overcome. A source query in one language, for instance, may map to several possible translations in another language without any extra context. Depending on the translator’s preferences and goals, however, only one of those translations may be acceptable, and choosing the wrong one can degrade translation usefulness and quality.
We propose interactive-chain prompting (INTERCPT), a novel method consisting of a series of question, answer, and generation intermediate steps between a Translator model and a User model. It reduces translation into a list of subproblems addressing ambiguities, resolves those subproblems, and then produces the final translation.
To check ambiguity-resolution capabilities and evaluate translation quality, we create a dataset exhibiting different linguistic phenomena that lead to ambiguity at inference time, covering 4 target languages.
To encourage further exploration in this direction, we release all datasets.
We find that interactive-chain prompting, using only 8 interactions as exemplars, consistently surpasses prompt-based methods that have direct access to the background information needed to resolve ambiguities.
Figure 1: Interactive-Chain-Prompting (INTERCPT).
Figure 2: Translation queries with multiple possible predictions. Correctly solving subproblems around ambiguities with “you” and “it” greatly affects the BLEU (Papineni et al., 2002) translation metric.
…It consists of a 3-step reasoning chain (see Figure 1):
The first step is for identifying ambiguities. The prompt in this step always contains the same constant exemplars, showing multiple queries to translate and questions about each query’s ambiguities. During inference, the Translator LM uses the prompt to generate a pointed question that identifies the specific ambiguity.
The second step is for resolving ambiguities. The prompt in this step contains exemplars answering the questions about the ambiguity subproblems from step one. The User LM answers each question using additional information from the provided context. In real-life applications, we assume that a real user has similar background information about the text to be translated.
The third step is for translating. Generated questions and answers are appended to the prompt in step one before the final translation is produced. Constant prompts in this step demonstrate how to translate into the specified target language using only the details provided by the User LM and no additional context. During inference, the Translator LM uses the prompt to generate the translation.
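The three steps above can be sketched as a small prompting loop. This is a minimal illustration, not the paper’s implementation: the exemplar strings, prompt formats, and function names are all hypothetical, and the two LM calls are stubbed with canned responses so the sketch runs end to end.

```python
from typing import Callable

# Short illustrative exemplars (hypothetical; the actual constant prompts
# in the paper contain several full demonstrations per step).
QUESTION_EXEMPLARS = (
    "Query: I saw her duck.\n"
    "Question: Does 'duck' refer to the bird or to the action of ducking?\n\n"
)
ANSWER_EXEMPLARS = (
    "Question: Does 'duck' refer to the bird or to the action of ducking?\n"
    "Answer: It refers to the bird.\n\n"
)

def interactive_chain_prompting(
    query: str,
    target_lang: str,
    translator_lm: Callable[[str], str],
    user_lm: Callable[[str], str],
) -> str:
    """Sketch of the 3-step INTERCPT chain; prompt layouts are assumptions."""
    # Step 1: the Translator LM generates a pointed question about the ambiguity.
    question = translator_lm(
        QUESTION_EXEMPLARS + f"Query: {query}\nQuestion:"
    ).strip()
    # Step 2: the User LM answers the question from its background context.
    answer = user_lm(ANSWER_EXEMPLARS + f"Question: {question}\nAnswer:").strip()
    # Step 3: question and answer are appended to the step-1 prompt before
    # the Translator LM produces the final translation.
    final_prompt = (
        QUESTION_EXEMPLARS
        + f"Query: {query}\nQuestion: {question}\nAnswer: {answer}\n"
        + f"Translation ({target_lang}):"
    )
    return translator_lm(final_prompt).strip()

# Canned stand-ins for the two PaLM calls, so the sketch is runnable.
def stub_translator_lm(prompt: str) -> str:
    if prompt.rstrip().endswith("Question:"):
        return " Is 'you' formal or informal in this context?"
    return " Usted se ve muy bien hoy."

def stub_user_lm(prompt: str) -> str:
    return " The speaker addresses a superior, so 'you' is formal."

print(interactive_chain_prompting(
    "You look great today.", "es", stub_translator_lm, stub_user_lm
))
```

Keeping the LMs behind plain `Callable[[str], str]` interfaces mirrors the setup in which the Translator LM and User LM are separate models that interact only through generated text.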
…Our test results for en-es, en-fr, en-de and en-ja are summarized in Table 2: We first notice that INTERCPT surpasses all other baselines. Surprisingly, PaLM-with-context, even with all the necessary background to resolve ambiguities, lags behind INTERCPT on F-Acc. for formality, G-Acc. for “it resolution” and both hit@3 and B@3 for polysemy. This result suggests that the multistep approach of first resolving the ambiguity subproblems and then generating text has an advantage over the other baselines. BLEU scores are also 2–3 points higher, while BLEURT scores are only slightly higher. This suggests that INTERCPT generates sentences syntactically much closer to the ground truth while preserving the correct semantics.
Figure 3: INTERCPT enables large LMs to solve ambiguity subproblems in cross-lingual generation. The multistep disambiguate-translate capability is an emergent ability that is reached at higher parameter scales. Note that interactive = INTERCPT.
…6.3. Are interactive generation abilities emergent? We show in Figure 3, for each prompt template, the effects of scaling PaLM parameters on the performance of formality, “it” resolution and polysemy for Spanish (ES), French (FR), German (DE) and Japanese (JA) target languages. Please note that while we vary the parameter count (8B, 62B and 540B) of the Translator LM, the User LM is a 540B-parameter PaLM model in all experiments. The plots provide interesting insights. First, at the 8B parameter scale, PaLM-no-extras performs best for formality and “it” resolution across all language pairs. Neither context nor interaction seems to provide benefits to translation.
Second, at the 62B parameter scale, the PaLM-with-context and INTERCPT methods perform on par. Context or interaction in this case are only clearly beneficial for polysemy. Third, at 540B parameters, INTERCPT outpaces the other prompt-based methods across language pairs and ambiguity subproblems. At this stage, the baselines’ scaling trend decelerates, with their scaling curves flattening compared to INTERCPT’s. This shows that INTERCPT is an emergent ability of model scale (Wei et al., 2022a). We conjecture that the emergent behavior of INTERCPT is due to a better ability to ask questions and incorporate answers before generating the final prediction.