“Self-Ask: Measuring and Narrowing the Compositionality Gap in Language Models (Bamboogle)”, Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, Mike Lewis (2022-10-07)⁠:

We investigate the ability of language models to perform compositional reasoning tasks where the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems but not generate the overall solution, a ratio we call the compositionality gap. We evaluate this ratio by asking multi-hop questions with answers that require composing multiple facts unlikely to have been observed together during pretraining.
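Concretely, the ratio can be computed as follows (a minimal sketch; the bookkeeping below is ours, not the paper's code — the gap is taken over the questions whose sub-questions are both answered correctly):

```python
def compositionality_gap(results):
    """results: list of (sub1_correct, sub2_correct, comp_correct) booleans,
    one tuple per compositional question.

    Returns the fraction of questions whose sub-questions are both answered
    correctly but whose compositional question is answered incorrectly,
    out of all questions with both sub-questions correct."""
    both_correct = [r for r in results if r[0] and r[1]]
    if not both_correct:
        return 0.0
    comp_wrong = [r for r in both_correct if not r[2]]
    return len(comp_wrong) / len(both_correct)
```

For example, if a model answers both sub-questions correctly for 100 questions but composes only 60 of them correctly, the gap is 0.4.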

In the GPT-3 family of models, we show that as model size increases, single-hop question answering performance improves faster than multi-hop performance does, so the compositionality gap does not decrease. This surprising result suggests that while more powerful models memorize and recall more factual knowledge, they show no corresponding improvement in their ability to perform this kind of compositional reasoning.

We then demonstrate how elicitive prompting (such as chain-of-thought) narrows the compositionality gap by reasoning explicitly instead of implicitly. We present a new method, self-ask, that further improves on chain-of-thought. In our method, the model explicitly asks itself (and then answers) follow-up questions before answering the initial question. We finally show that self-ask’s structured prompting lets us easily plug in a search engine to answer the follow-up questions, which additionally improves accuracy.

…Unsurprisingly, we find that performance on both single- and multi-hop question answering improves monotonically with the scale of the pretrained model. Intriguingly, however, we find that the compositionality gap remains at a roughly constant 40% across different model sizes and training techniques, with no apparent improvement from scale (Figure 1). This result is especially surprising given the straightforward reasoning steps required for such questions, and it suggests that larger-scale pretraining is highly effective at teaching models to memorize facts but not how to compose them. We also find a positive correlation between an LM’s confidence in a given fact and its ability to compose that fact with other facts to answer compositional questions. [calibration?]

Figure 1: The compositionality gap does not decrease with scale. This plot shows accuracy for the compositional questions (blue) and the two sub-questions that comprise them (green) in the Compositional Celebrities dataset. ⧍ are the regular GPT models and ◦ and × are the 001 and 002 instruct models. Percentages are relative and indicate the compositionality gaps.

…We evaluate GPT-3 on CC, using a 2-shot prompt, separately for the 2 & 1-hop questions. We use a category-specific prompt for each question that contains 2 randomly chosen examples that have been removed from the dataset; see Appendix Tables 3 & 4. GPT-3 (davinci-002) correctly answers 45.4% of the 2-hop questions. In some categories, such as Birthplace/Domain Name, the accuracy reaches 84.6% even though the model most likely did not explicitly see the vast majority of these questions during training. The model even answers some extreme questions correctly, such as “What is the top-level domain of the birthplace of Plato?” [born in Athens, so ancient Greece; modern Greece owns .gr] On the hardest categories, such as Birth Year/Literature Nobel Prize Winner, GPT-3 answers only 1.2% of the questions correctly. However, it answers 80% of the sub-questions on this dataset correctly, showing that it has seen many of the individual facts required to answer these questions but lacks sufficient compositional capabilities to answer the question correctly. Appendix Table 5 shows the full results.
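A 2-shot, category-specific direct prompt of the kind described above might be assembled like this (a sketch; the `Q:`/`A:` template is our assumption, not the paper's exact format — see Appendix Tables 3 & 4 for theirs):

```python
def build_direct_prompt(held_out_examples, question):
    """Build a 2-shot direct prompt for one CC category.

    held_out_examples: two (question, answer) pairs drawn from the same
    category and removed from the evaluation set, so the model sees the
    answer format without seeing the test question."""
    shots = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in held_out_examples)
    return shots + f"Q: {question}\nA:"
```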

…Beyond CC, we also apply elicitive prompts to two existing, automatically generated datasets, (2WikiMultiHopQA, Ho et al 2020, and MuSiQue-Ans, Trivedi et al 2022), and on a third dataset of 125 questions, Bamboogle, that we manually constructed. Bamboogle is a small dataset with 2-hop questions written by the authors, where all questions are sufficiently difficult to be unanswerable by a popular internet search engine, but where both supporting pieces of evidence can be found in Wikipedia (and hence were probably included in the pretraining set of any LM).

The two datasets we constructed—Bamboogle and the previously mentioned Compositional Celebrities—are complementary and serve different research purposes. Bamboogle is a small, hand-crafted dataset covering many different types of questions across different areas, each written in a unique way, whereas CC (like MuSiQue and 2WikiMultiHopQA) is a large, automatically generated dataset in which every question fits one of the 17 templates we made (ie. much lower variation than Bamboogle). Compositional Celebrities is designed to estimate the compositionality gap on a large set of questions, and Bamboogle is designed to measure the extent to which a question answering system can answer varied compositional questions, albeit with less statistical power.

…How can we determine which facts GPT-3 will and will not be able to compose? Figure 2 shows that as the perplexities assigned to correct sub-question answers decrease (ie. the model becomes more confident about the correct answers), the probability of answering the compositional question correctly increases. For example, when the maximal perplexity assigned to the correct sub-question answers (ie. the perplexity of whichever of the two correct answers the model is less confident about) is 1.232–6.738, the model answers 42.6% of the compositional questions correctly. However, when the maximal perplexity is 1.000–1.002, the model answers 81.1% of the compositional questions correctly. We observed a similar pattern when sorting sub-question pairs by average instead of maximal perplexity.
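The confidence measure above can be computed from per-token log-probabilities of the correct answer, as returned by (for example) the GPT-3 API's `logprobs` field; this helper is our illustrative sketch, not the paper's code:

```python
import math

def answer_perplexity(token_logprobs):
    """Perplexity of a correct answer's tokens under the model:
    exp of the mean negative log-probability. 1.0 means the model
    assigns the answer probability 1; higher values mean less confidence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def max_pair_perplexity(logprobs_sub1, logprobs_sub2):
    """Take the worse (higher) of the two sub-question answer perplexities,
    matching the 'maximal perplexity' statistic described above."""
    return max(answer_perplexity(logprobs_sub1),
               answer_perplexity(logprobs_sub2))
```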

Figure 3: Direct prompting [zero-shot] (Brown et al 2020) compared to chain-of-thought and our ‘self-ask’ method on a question from Bamboogle. Text with a white background is the prompt, text with a green background is the LM output, and underlined text is the inference-time question. The prompt has been shortened here; we actually used a 4-shot prompt for this dataset (see §3.5).

…“Let’s think step by step” from Kojima et al 2022 is also an elicitive method, but in our experiments it obtained 45.7%/1.1% accuracy with InstructGPT-davinci-002/Davinci whereas self-ask obtains 79.6%/54.2% accuracy with those models on CC. This is consistent with results from Kojima et al 2022 showing that their method is not as strong as chain-of-thought and that using non-instruct models further degrades performance. We therefore do not run more experiments with this method here.

3.1 Self-Ask: Our method builds on chain-of-thought prompting, but, instead of outputting a continuous, undemarcated chain of thought, our prompt has the model explicitly state the next follow-up question it wants to ask before answering it. In addition, our method inserts scaffolds like “Follow up:”, which we found to improve the ability to output the correct final answer in an easily parseable way. As we later show, this makes it easy to integrate our method with an internet search engine to answer follow-up questions, which further improves performance.

Self-ask (depicted in Figure 3) requires a one- or few-shot prompt that demonstrates how to answer the questions. Our prompt starts with those examples, after which we append the inference-time question. We then insert the phrase “Are follow up questions needed here:” at the end of the prompt, since we found that doing so slightly improves results. The model then outputs a response. In most cases it first outputs “Yes.”, meaning that follow-up questions are necessary. The LM then outputs the first follow-up question, answers it, and continues asking and answering follow-up questions until it decides it has sufficient information; at that point, it outputs “So the final answer is:” before providing the final answer, which makes the final answer easily parseable as what appears after the ‘:’ on the last output line. In rare cases the LM decides that it need not ask follow-up questions and can answer the question immediately. As in chain-of-thought, our method is completely automatic: we simply input the prompt and the test-time question, and the model executes the entire process by itself, including deciding how many follow-up questions to ask.
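The transcript format and the final-answer parsing can be sketched as follows; the example transcript mirrors the style of the paper's self-ask prompt, while `extract_final_answer` is our illustrative helper, not the paper's code:

```python
# One self-ask exemplar in the style of the paper's few-shot prompt:
# explicit follow-up questions, intermediate answers, and a scaffolded
# final-answer line.
SELF_ASK_TRANSCRIPT = """\
Question: Who lived longer, Theodor Haecker or Harry Vaughan Watkins?
Are follow up questions needed here: Yes.
Follow up: How old was Theodor Haecker when he died?
Intermediate answer: Theodor Haecker was 65 years old when he died.
Follow up: How old was Harry Vaughan Watkins when he died?
Intermediate answer: Harry Vaughan Watkins was 69 years old when he died.
So the final answer is: Harry Vaughan Watkins
"""

def extract_final_answer(output):
    """Return the text after 'So the final answer is:' on the last
    matching output line, or None if the scaffold never appears."""
    for line in reversed(output.strip().splitlines()):
        if "So the final answer is:" in line:
            return line.split("So the final answer is:", 1)[1].strip()
    return None
```

Because the scaffold string is fixed by the prompt, this parse is a single string split rather than the free-form answer extraction that chain-of-thought outputs can require.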

We hypothesize that the advantage of self-ask over chain-of-thought is that it disentangles the decomposition of the full question (by formulating sub-questions) from the actual answers to those sub-questions. In addition, the rigid scaffolding self-ask provides makes it easier for the model to state the final answer in a concise, parseable way. In some cases chain-of-thought did not output a short-form final answer, opting instead for full sentences, which is not the format demonstrated in its prompt. In Bamboogle, 40% of chain-of-thought’s final answers were not in short form, while that metric was 17% for self-ask and 3% for self-ask + search engine. Appendix Table 14 contains examples of chain-of-thought failure cases.

Figure 4: Self-ask is able to narrow and sometimes close the compositionality gap on CC. Here, self-ask uses a 1-shot prompt. chain-of-thought performed within 1% of self-ask on this dataset and is not shown here. ⧍ are the regular GPT models and ◦ and × are the 001 and 002 instruct models.

3.3 Improving Self-Ask With A Search Engine: Unlike chain-of-thought, self-ask clearly demarcates the beginning and end of every sub-question. Therefore, we can use a search engine to answer the sub-questions instead of the LM. Search engines have features that LMs lack, such as an ability to be easily and quickly updated (Kasai et al 2022).

We therefore integrate a popular internet search engine into self-ask. Figure 5 depicts Self-ask + Search Engine (SA+SE). Note that SA+SE uses the same prompt as self-ask. We input the prompt to the language model; if the LM outputs “Follow up:”, we let it finish generating the question, indicated by outputting the string “Intermediate answer:”. Upon this response, we stop the LM, and, instead of having it output its own answer, we input the full sub-question the model asked to a search engine API. [We used “SerpAPI: Google Search API”, which provides an API for a popular internet search engine.] We then add the answer returned by the search engine to the prompt before asking the LM to continue generating its answer.

Thus, the LM takes as input a compositional question and decomposes it by first outputting an initial sub-question that is fed into the search engine; the answer is fed back to the LM, which generates another sub-question, and so on, until it outputs the final answer (marked as such). Since we insert the results from the search engine back into the prompt as if the LM had output that result, we need not finetune our model with any special syntax or modify the model’s architecture. Furthermore, we need not even modify our prompt to integrate the search engine into self-ask! In all experiments, the prompt we use for the self-ask + search engine method is exactly the same as the one we use for self-ask. This method is implementable in only a few lines of code. It lets the LM—with no changes—use an API, and API calls are not directly exposed to the LM, only their results.
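The control loop described above can be sketched in a few lines; `lm_complete` (a completion call that stops at a given string) and `search` (e.g. a SerpAPI wrapper) are placeholders for whatever APIs are available — neither signature comes from the paper's code:

```python
def self_ask_with_search(prompt, question, lm_complete, search, max_hops=5):
    """Self-ask + search engine: the LM decomposes the question into
    follow-ups, and each follow-up is answered by the search engine,
    with the result spliced back in as if the LM had generated it."""
    text = prompt + "Question: " + question + "\nAre follow up questions needed here:"
    for _ in range(max_hops):
        # Generate until the model either finishes a follow-up question
        # (it would next emit "Intermediate answer:") or answers outright.
        out = lm_complete(text, stop="Intermediate answer:")
        text += out
        if "So the final answer is:" in out:
            return out.rsplit("So the final answer is:", 1)[1].strip()
        if "Follow up:" in out:
            sub_q = out.rsplit("Follow up:", 1)[1].strip()
            # Answer with the search engine instead of the LM, using the
            # same syntax the prompt demonstrates, so no prompt changes
            # or finetuning are needed.
            text += "Intermediate answer: " + search(sub_q) + "\n"
    return None  # gave up after max_hops rounds
```

Because the search result is appended in exactly the format the few-shot examples demonstrate, the same prompt works with or without the search engine.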

Figure 5: Self-ask + Search Engine: prompt on white background, LM-generated text in green. We start by using a few-shot prompt (reduced here for space) and appending the test-time question (underlined) to it. The LM then generates a follow-up question which we input to an internet search engine. The response is inserted back into the rest of the prompt to let the LM generate the next follow-up question. This process then repeats until the LM decides to output the final answer.
Table 1: davinci-002 accuracy (%) on Bamboogle, 2WikiMultiHopQA, and MuSiQue. For Bamboogle, accuracy is manually judged by the authors; for the other datasets, the metric is exact match. Search Engine is a popular internet search engine. Appendix Table 13 shows results on 2WikiMultiHopQA and MuSiQue using other metrics, which rank the systems in the same way as here.

…We use only the 1,252 questions from the MuSiQue development set that are labeled as 2-hop since we found the 3 & 4-hop questions to be too convoluted even for the paper authors to understand at times.

…Self-ask improves over chain-of-thought by smaller margins on 2WikiMultiHopQA and MuSiQue but by a large 11% (absolute) on Bamboogle. We hypothesize that the much more varied nature of Bamboogle, and the fact that most questions are not similar to those in the few-shot prompt, might make it harder for chain-of-thought to decompose the questions, whereas our self-ask model, which explicitly decomposes questions before answering them, deals much better with novel inference questions. Integrating the search engine into self-ask further improves performance on all datasets, sometimes by as much as 10% (absolute).