We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web.
By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers.
We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences.
This model’s answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
Figure 2: Human evaluations on ELI5 comparing against (a) demonstrations collected using our web browser, (b) the highest-voted answer for each question. The amount of rejection sampling (the n in best-of-n) was chosen to be compute-efficient (see Figure 8). Error bars represent ±1 standard error.
Figure 3: TruthfulQA results. The amount of rejection sampling (the n in best-of-n) was chosen to be compute-efficient (see Figure 8). Error bars represent ±1 standard error.
In this work we leverage existing solutions to these components: we outsource document retrieval to the Microsoft Bing Web Search API, and use unsupervised pre-training to achieve high-quality synthesis by fine-tuning GPT-3. Instead of trying to improve these ingredients, we focus on combining them using more faithful training objectives. Following Stiennon et al. [2020], we use human feedback to directly optimize answer quality, allowing us to achieve performance competitive with humans.
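To make the retrieval component concrete, the sketch below issues a single query to the Bing Web Search API and returns the top results. It is a minimal illustration rather than the paper's actual retrieval wrapper; the endpoint and header follow the public Bing v7 conventions, and BING_API_KEY is a placeholder.

```python
import requests

# Minimal sketch of outsourced retrieval via the Bing Web Search API (v7).
# BING_API_KEY is a placeholder; the paper's own retrieval setup is not shown here.
BING_API_KEY = "..."

def bing_search(query: str, count: int = 10) -> list[dict]:
    """Return the top web results (title, url, snippet) for a query."""
    response = requests.get(
        "https://api.bing.microsoft.com/v7.0/search",
        headers={"Ocp-Apim-Subscription-Key": BING_API_KEY},
        params={"q": query, "count": count},
    )
    response.raise_for_status()
    results = response.json().get("webPages", {}).get("value", [])
    return [{"title": r["name"], "url": r["url"], "snippet": r["snippet"]} for r in results]
```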
We make two key contributions:
We create a text-based web-browsing environment that a fine-tuned language model can interact with. This allows us to improve both retrieval and synthesis in an end-to-end fashion using general methods such as imitation learning and reinforcement learning.
We generate answers with references: passages extracted by the model from web pages while browsing. This is crucial for allowing labelers to judge the factual accuracy of answers, without engaging in a difficult and subjective process of independent research.
…We use this data in four main ways: behavior cloning (i.e., supervised fine-tuning) using the demonstrations, reward modeling using the comparisons, reinforcement learning against the reward model, and rejection sampling against the reward model. Our best model uses a combination of behavior cloning and rejection sampling. We also find reinforcement learning to provide some benefit when inference-time compute is more limited.
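As a concrete illustration of rejection sampling (best-of-n), the sketch below draws n candidate answers from a policy and keeps the one the reward model scores highest. The policy.sample and reward_model.score interfaces are hypothetical stand-ins for the fine-tuned policy and the learned reward model, not the paper's actual code.

```python
def best_of_n(question: str, references: list[str], policy, reward_model, n: int = 64):
    """Rejection sampling: draw n candidate answers and return the one the
    reward model scores highest. No gradient updates are involved; the reward
    model is only used to select among samples at inference time.
    """
    candidates = [policy.sample(question, references) for _ in range(n)]
    scores = [reward_model.score(question, references, answer) for answer in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]
```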
…We evaluate our best model in three different ways. First, we compare our model’s answers to answers written by our human demonstrators on a held-out set of questions. Our model’s answers are preferred 56% of the time, demonstrating human-level usage of the text-based browser. Second, we compare our model’s answers (with references stripped, for fairness) to the highest-voted answer provided by the ELI5 dataset. Our model’s answers are preferred 69% of the time. Third, we evaluate our model on TruthfulQA, an adversarial dataset of short-form questions. Our model’s answers are true 75% of the time, and are both true and informative 54% of the time, outperforming our base model (GPT-3), but falling short of human performance.
Figure 1: An observation from our text-based web-browsing environment, as shown to human demonstrators (left) and models (right). The web page text has been abridged for illustrative purposes.
…Environment design: …For this approach, we designed a text-based web-browsing environment. The language model is prompted with a written summary of the current state of the environment, including the question, the text of the current page at the current cursor location, and some other information (see Figure 1(b)). In response to this, the model must issue one of the commands given in Table 1, which performs an action such as running a Bing search, clicking on a link, or scrolling around. This process is then repeated with a fresh context (hence, the only memory of previous steps is what is recorded in the summary).
While the model is browsing, one of the actions it can take is to quote an extract from the current page. When this is performed, the page title, domain name and extract are recorded to be used later as a reference. Browsing then continues until either the model issues a command to end browsing, the maximum number of actions has been reached, or the maximum total length of references has been reached. At this point, as long as there is at least one reference, the model is prompted with the question and the references, and must compose its final answer.
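A simplified version of this browsing loop might look as follows. The environment and model interfaces (env.summary, env.step, model.next_command, and so on) are illustrative placeholders; the actual command set is the one given in Table 1.

```python
def browse_and_answer(question: str, model, env, max_actions: int = 100, max_ref_tokens: int = 2000):
    """Sketch of the browsing loop: at each step the model sees a fresh text
    summary of the environment state and emits one command, until it ends
    browsing or hits the action/reference limits. It then composes an answer
    conditioned on the question and the quoted references."""
    references = []
    for _ in range(max_actions):
        summary = env.summary(question, references)    # fresh context each step
        command = model.next_command(summary)          # e.g. search, click, scroll, quote, end
        if command.name == "end":
            break
        if command.name == "quote":
            references.append(env.extract(command))    # records page title, domain, extract
            if env.reference_length(references) >= max_ref_tokens:
                break
        else:
            env.step(command)                          # run a Bing search, follow a link, scroll
    if references:                                     # answer only if at least one reference
        return model.compose_answer(question, references)
    return None
```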
Table 2: An answer produced by our 175B best-of-64 model to a randomly-chosen question from the ELI5 test set (not cherry-picked). The full text of the references can be found in Appendix J, along with answers from our human demonstrators and the ELI5 dataset. Further samples are available.
Question: Why did we decide that certain words were “bad” and shouldn’t be used in social settings?

Answer: We have all been taught that there are certain words that should not be used in public, at least not without good reason. But why do some words fall into this category while others do not? It turns out that there is not a single answer to this question, as each word has a different history and carries different meanings in different cultures and families [2]. However, there are some general categories that “bad” words tend to fall into. “Bad” words generally relate to parts of life that we don’t like talking about in public, like bathroom functions, or negative ways of talking about people’s religion, skin color, ability, or gender [3]. Often, people say bad words when they are experiencing strong emotions, and those words are linked to our emotions [3].

References (titles only):
[1, 2, 3] Why Are Some Words ‘Bad’? | Vermont Public Radio (www.vpr.org)
[4] On Words: ‘Bad’ Words and Why We Should Study Them | UVA Today (news.virginia.edu)
[5] The Science of Curse Words: Why The &@$! Do We Swear? (www.babbel.com)
Figure 5: Preference of the 175B best-of-n BC model over the BC model. The validation RM prediction is obtained using the estimator described in Appendix I, and predicts human preference well in this setting. The shaded region represents ±1 standard error.
…Our results are shown in Figure 2. Our best model, the 175B best-of-64 model, produces answers that are preferred to those written by our human demonstrators 56% of the time. This suggests that the use of human feedback is essential, since one would not expect to exceed 50% preference by imitating demonstrations alone (although it may still be possible, by producing a less noisy policy). The same model produces answers that are preferred to the reference answers from the ELI5 dataset 69% of the time. This is a substantial improvement over Krishna et al. [2021], whose best model’s answers are preferred 23% of the time to the reference answers, although they use substantially less compute than even our smallest model.
Figure 6: BC scaling, varying the proportion of the demonstration dataset and parameter count of the policy.
…The combination of RL and rejection sampling also fails to offer much benefit over rejection sampling alone. One possible reason for this is that RL and rejection sampling are optimizing against the same reward model, which can easily be overoptimized (especially by RL, as noted above). In addition to this, RL reduces the entropy of the policy, which hurts exploration. Adapting the RL objective to optimize rejection sampling performance is an interesting direction for future research. It is also worth highlighting the importance of carefully tuning the BC baseline for these comparisons. As discussed in Appendix E, we tuned the number of BC epochs and the sampling temperature using a combination of human evaluations and reward model score. This alone closed much of the gap we originally saw between BC and RL.
Figure 7: RM scaling, varying the proportion of the comparison dataset and parameter count of the reward model.
…Scaling trends with dataset size and parameter count are shown in Figures 6 and 7. For dataset size, doubling the number of demonstrations increased the policy’s reward model score by about 0.13, and doubling the number of comparisons increased the reward model’s accuracy by about 1.8%. For parameter count, the trends were noisier, but doubling the number of parameters in the policy increased its reward model score by roughly 0.09, and doubling the number of parameters in the reward model increased its accuracy by roughly 0.4%.
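These per-doubling figures correspond to a trend that is roughly linear in the log of dataset size. The snippet below shows how such a slope could be estimated from (dataset fraction, metric) pairs; the numbers are placeholders chosen only to be consistent with the approximate 1.8%-per-doubling figure quoted above, not the paper's measurements.

```python
import numpy as np

# Hypothetical (dataset_fraction, reward_model_accuracy) pairs, e.g. from reward
# models trained on subsets of the comparison data. Placeholder values only.
fractions = np.array([0.125, 0.25, 0.5, 1.0])
accuracies = np.array([0.575, 0.593, 0.611, 0.629])

# Fit accuracy as a linear function of log2(dataset size); the slope is then
# the gain per doubling of the comparison dataset.
slope, intercept = np.polyfit(np.log2(fractions), accuracies, deg=1)
print(f"gain per doubling: {slope:.3f}")   # ~0.018, i.e. about 1.8% per doubling
```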
Figure 8: Best-of-n scaling, varying the parameter count of the policy and reward model together, as well as the number of answers sampled.
…For rejection sampling, we analyzed how to trade off the number of samples against the number of model parameters for a given inference-time compute budget (see Figure 8). We found that it is generally compute-efficient to use some amount of rejection sampling, but not too much. The models for our main evaluations come from the Pareto frontier of this trade-off: the 760M best-of-4 model, the 13B best-of-16 model, and the 175B best-of-64 model. [cf. Jones, 2021]
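As a back-of-the-envelope illustration of this trade-off, the cost of producing a best-of-n answer scales with roughly n forward passes of the policy. Treating per-sample cost as proportional to parameter count (and ignoring answer length and the cost of reward model scoring), the relative inference compute of the three frontier configurations can be compared as follows; this is a simplification for illustration, not the accounting used for Figure 8.

```python
# Rough relative inference compute for each best-of-n configuration, treating
# the cost of one sampled answer as proportional to policy parameter count.
configs = [("760M best-of-4", 0.76e9, 4),
           ("13B best-of-16", 13e9, 16),
           ("175B best-of-64", 175e9, 64)]

baseline = 0.76e9  # cost of one sample from the smallest policy
for name, params, n in configs:
    relative_cost = params * n / baseline
    print(f"{name}: ~{relative_cost:,.0f}x the compute of a single 760M sample")
```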