“What Can a Generative Language Model Answer About a Passage?”, 2021-11-10:
Generative language models trained on large, diverse corpora can answer questions about a passage by generating the most likely continuation of the passage followed by a question/answer pair. However, accuracy rates vary depending on the type of question asked.
In this paper we keep the passage fixed and test with a wide variety of question types, exploring the strengths and weaknesses of the GPT-3 language model.
We provide the passage and test questions as a challenge set for other language models.
…4.2.3 Reasoning: The most challenging questions we posed were those requiring some kind of reasoning process to arrive at the answer. There has been some success at getting GPT to correctly follow a reasoning process by giving examples of the reasoning steps to follow and having it imitate those steps one at a time. In the zero-shot prompts we are using, however, reasoning beyond what was required for the earlier types of questions seemed to be beyond its capabilities. It is unclear to what extent these difficulties with reasoning lie with the architecture (a limited number of layers can only carry out so many steps) or with the training set. Certainly other transformers trained on, for example, calculus problems (2019) rather than web text are able to correctly generate valid chains of reasoning.
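The few-shot "reasoning steps" technique mentioned above can be sketched as a prompt that demonstrates explicit intermediate steps before the answer. The demonstration question and wording below are invented for illustration; they are not the prompts actually used in these experiments.

```python
# Sketch of a few-shot prompt with explicit reasoning steps, as contrasted
# with the zero-shot prompts used in this paper. The Q/A demonstration is
# a hypothetical example, not one of the paper's test questions.
def build_few_shot_prompt(question):
    demonstrations = (
        "Q: A store had 3 crates of 12 apples and sold 10 apples. "
        "How many apples remain?\n"
        "Reasoning: 3 crates of 12 apples is 3 * 12 = 36 apples. "
        "After selling 10, 36 - 10 = 26 remain.\n"
        "A: 26\n\n"
    )
    # The model is then prompted to continue from "Reasoning:", imitating
    # the demonstrated step-by-step style before producing an answer.
    return demonstrations + "Q: " + question + "\nReasoning:"

prompt = build_few_shot_prompt(
    "The events of this story happened in 1975. "
    "How old would Mr. Hug be in 2020?"
)
```

The zero-shot prompts in this study omit the demonstration block entirely, which is where the reasoning failures described above appear.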
Mathematical Word Problems: Questions that require mathematical operations were frequently answered incorrectly. This matches what one would expect from the original paper on GPT-3, where zero-shot math questions were usually incorrect. (e.g., “The events of this story happened in 1975. How old would Mr. Hug be in 2020? He would be 91 years old.”)
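The word problem above combines a fact from the passage (Mr. Hug's age at the time of the story) with a subtraction over years. As a minimal sketch of the required computation, with a hypothetical age standing in for the value from the actual passage:

```python
# The age question reduces to: age in the story, plus the elapsed years.
# The age of 30 below is a hypothetical placeholder, not the passage's value.
def age_in_year(age_in_story, story_year, target_year):
    return age_in_story + (target_year - story_year)

# e.g., if the passage said Mr. Hug was 30 in 1975:
age_in_year(30, 1975, 2020)  # 30 + 45 = 75
```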
GPT-3’s abilities at arithmetic and “word problems” have been the subject of several investigations (2020). It is clear that its tokenization makes arithmetic more difficult for the model to learn.
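One way to see the tokenization problem is that subword merges split numbers into chunks whose boundaries ignore place value, so numerals of the same magnitude can have different token shapes. The toy greedy merger below is a simplified illustration with an invented vocabulary, not GPT-3's actual byte-pair encoder.

```python
# Toy illustration (not GPT-3's real tokenizer) of why subword tokenization
# hurts arithmetic: frequent two-digit chunks merge into single tokens, so
# token boundaries do not line up with place value.
def toy_bpe(number, merges):
    # Greedy left-to-right merge of two-digit chunks found in the vocab.
    tokens, i = [], 0
    while i < len(number):
        pair = number[i:i + 2]
        if pair in merges:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(number[i])
            i += 1
    return tokens

merges = {"19", "75"}    # hypothetical merged chunks
toy_bpe("1975", merges)  # -> ["19", "75"]
toy_bpe("1976", merges)  # -> ["19", "7", "6"]: same magnitude, different shape
```

Two numbers one apart thus get differently shaped token sequences, which makes digit-by-digit carrying hard to learn from text alone.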
Temporal Reasoning: InstructGPT is able to successfully answer questions about what events happened during a particular time interval (e.g., “What happened after the robbers arrived and before they pushed Mr. Hug down the elevator shaft?”). Questions about a time interval that require reasoning about the beginnings and ends of events, however, were difficult for the model. These questions were inspired by (2020). (e.g., “Did the elevator car reach the springs before Mr. Hug finished falling? Yes.”)
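The begin/end reasoning these questions require can be made explicit by representing events as (start, end) intervals and comparing endpoints, in the style of Allen's interval relations. The events and timestamps below are hypothetical stand-ins, not values from the passage.

```python
# Minimal sketch: events as (start, end) pairs, with the two interval checks
# the question types above require. All timestamps are hypothetical.
def during(event, interval):
    """Did the event happen entirely within the interval?"""
    (s, e), (lo, hi) = event, interval
    return lo <= s and e <= hi

def ended_before(a, b):
    """Did event a end before event b ended?"""
    return a[1] < b[1]

arrive = (0.0, 1.0)     # robbers arrive (hypothetical times)
push = (5.0, 5.5)       # push Mr. Hug down the shaft
cut_wires = (2.0, 3.0)  # some intervening event

# "What happened after the robbers arrived and before the push?"
during(cut_wires, (arrive[1], push[0]))  # -> True
```

The first question type only needs `during`, which the model handles; the harder questions need endpoint comparisons like `ended_before`, where it struggles.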
False Premises: Questions with false premises were almost never answered correctly. Answering such a question correctly would mean pointing out the error in the question [see inner-monologue on eliciting multiple-step answers]. Instead, the model answers plausibly as if the premise of the question were true. (e.g., “Why was there an airplane in the furniture store? The airplane was a display in the store.”)