“Fine-Tuning GPT-2 from Human Preferences”, Daniel Ziegler, Nisan Stiennon, Jeffrey Wu, Tom Brown, Dario Amodei, Alec Radford, Paul Christiano, Geoffrey Irving2019-09-19 (, ; backlinks; similar)⁠:

We’ve fine-tuned the 774M parameter GPT-2 language model using human feedback for various tasks, successfully matching the preferences of the external human labelers, though those preferences did not always match our own. Specifically, for summarization tasks the labelers preferred sentences copied wholesale from the input (we’d only asked them to ensure accuracy), so our models learned to copy. Summarization required 60k human labels; simpler tasks which continue text in various styles required only 5k. Our motivation is to move safety techniques closer to the general task of “machines talking to humans”, which we believe is key to extracting information about human values.

This work applies human preference learning to several natural language tasks: continuing text with positive sentiment or physically descriptive language using the BookCorpus, and summarizing content from the TL;DR and CNN/Daily Mail datasets. Each of these tasks can be viewed as a text completion problem: starting with some text X, we ask what text Y should follow. [For summarization, the text is the article plus the string “TL;DR:”.]

We start with a pretrained language model (the 774M parameter version of GPT-2) and fine-tune the model by asking human labelers which of four samples is best. Fine-tuning for the stylistic continuation tasks is sample efficient: 5,000 human samples suffice for strong performance according to humans. For summarization, models trained with 60,000 comparisons learn to copy whole sentences from the input while skipping irrelevant preamble; this copying is an easy way to ensure accurate summaries, but may exploit the fact that labelers rely on simple heuristics.

Bugs can optimize for bad behavior

Looking forward: We’ve demonstrated reward learning from human preferences on two kinds of natural language tasks, stylistic continuation and summarization. Our results are mixed: for continuation we achieve good results with very few samples, but our summarization models are only “smart copiers”: they copy from the input text but skip over irrelevant preamble. The advantage of smart copying is truthfulness: the zero-shot and supervised models produce natural, plausible-looking summaries that are often lies. We believe the limiting factor in our experiments is data quality exacerbated by the online data collection setting, and plan to use batched data collection in the future.

We believe the application of reward learning to language is important both from a capability and safety perspective. On the capability side, reinforcement learning lets us correct mistakes that supervised learning would not catch, but RL with programmatic reward functions “can be detrimental to model quality.” On the safety side, reward learning for language allows important criteria like “don’t lie” to be represented during training, and is a step towards scalable safety methods such as a debate and amplification. [Followup: “Learning to summarize from human feedback”, Stiennon et al 2020.]