“CARP: Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning”, Louis Castricato, Alexander Havrilla, Shahbuland Matiana, Michael Pieler, Anbang Ye, Ian Yang, Spencer Frazier, Mark Riedl (2022-10-14)⁠:

Controlled automated story generation seeks to generate natural language stories satisfying constraints from natural language critiques or preferences. Existing methods to control for story preference rely on prompt engineering, which is labor-intensive and often inconsistent, or on logit-manipulation methods, which require annotated datasets to exist for the desired attributes.

To address these issues, we first train a contrastive bi-encoder model, CARP, to align stories with corresponding human critiques, yielding a general-purpose preference model. This model is subsequently used as a reward function to fine-tune a generative language model via reinforcement learning.
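The bi-encoder objective can be sketched as a CLIP-style symmetric InfoNCE loss over a batch of matched (story, critique) pairs. This is an illustrative sketch, not the authors' code; the function name, `temperature` value, and tensor shapes are assumptions.

```python
# Illustrative CLIP-style contrastive (InfoNCE) objective for a bi-encoder
# like CARP, which aligns story passages with human critiques.
# All names here are assumptions, not taken from the CARP codebase.
import torch
import torch.nn.functional as F

def carp_contrastive_loss(story_emb: torch.Tensor,
                          critique_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of (story, critique) pairs.

    story_emb, critique_emb: (batch, dim) embeddings from the two encoders.
    Row i of each tensor is assumed to be a matching pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    s = F.normalize(story_emb, dim=-1)
    c = F.normalize(critique_emb, dim=-1)
    logits = s @ c.T / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(len(s))      # diagonal entries are positive pairs
    # Cross-entropy in both directions (story->critique, critique->story).
    loss_s = F.cross_entropy(logits, targets)
    loss_c = F.cross_entropy(logits.T, targets)
    return (loss_s + loss_c) / 2
```

Training this objective pulls each story embedding toward its own critique and pushes it away from the other critiques in the batch, which is what lets the model later score an arbitrary story against an arbitrary preference.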

However, simply fine-tuning a generative language model with a contrastive reward model does not reliably result in a story generation system capable of generating stories that meet user preferences. To increase story generation robustness, we further fine-tune the contrastive reward model using a prompt-learning technique.

A human participant study is then conducted comparing generations from our full system, ablations, and two baselines. We show that the full fine-tuning pipeline results in a story generator preferred over an LLM 20× as large, as well as over logit-based methods.

This motivates the use of contrastive learning for general purpose human preference modeling.

Figure 1: Illustration of our technique for generating story content controlled by preferences. A language model generates candidates, which are ranked by the CARP model to produce scores. The scores are used to fine-tune the language model to produce higher scoring—and thus more aligned with preferences—story continuations.

…We use Proximal Policy Optimization (PPO) (Schulman et al 2017) to fine-tune GPT-2-750M (Radford et al 2019) to generate text consistent with a given initial criterion. The reward is the CARP similarity between the generated story and the desired preference. Initial attempts indicated the reward signal generated by CARP could sometimes be exploited by the generator, resulting in collapse. In other cases the generator failed to learn anything at all.
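Concretely, the per-sample scalar reward for PPO can be computed as the cosine similarity between the generated continuation and the target preference, each embedded by its (frozen) half of the bi-encoder. This is a minimal sketch under assumed interfaces; the encoder callables and function name are hypothetical stand-ins for the trained CARP encoders.

```python
# Sketch (not the authors' code) of using a CARP-style bi-encoder as a
# scalar reward for RL fine-tuning: the reward for a generated continuation
# is its cosine similarity to the desired preference.
import torch
import torch.nn.functional as F

def carp_reward(story_encoder, critique_encoder,
                story_text: str, preference_text: str) -> float:
    """Cosine similarity between the generated story and the preference,
    both embedded by the frozen CARP encoders.

    story_encoder / critique_encoder: callables mapping text -> (dim,)
    tensor; hypothetical stand-ins for the two halves of the bi-encoder.
    """
    with torch.no_grad():  # the reward model stays frozen during PPO
        s = F.normalize(story_encoder(story_text), dim=-1)
        c = F.normalize(critique_encoder(preference_text), dim=-1)
    return float(s @ c)
```

Because the reward is a similarity in a learned latent space rather than a classifier over fixed labels, any preference expressible as a natural-language critique can in principle serve as the control signal; the failure modes above arise when the generator finds degenerate text that scores highly under this similarity.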

To address this, we present CARP CoOp: a robust version of CARP that leverages prompt tuning for a stronger reward signal. We deploy a pseudo-labeling technique (Van Gansbeke et al 2020; Pham et al 2021) on CARP’s latent space, which allows us to identify preferences that are resistant or susceptible to collapse. Further, we find CARP CoOp is extremely data-efficient, easily incorporating previously unknown preferences from only a couple hundred examples when they are available. We demonstrate this efficiency on a moral-alignment dataset which classifies character stories as ‘good’, ‘neutral’, or ‘evil’. We present this as a pipeline for fine-tuning a new language model that can robustly generate text consistent with a complex set of preferences expressed in natural language.
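The CoOp-style prompt tuning referenced here keeps the text encoder frozen and learns only a small set of continuous "context" vectors prepended to each preference's token embeddings. The class and shapes below are illustrative assumptions, not the CARP CoOp implementation.

```python
# Sketch of CoOp-style prompt tuning as applied to a frozen text encoder:
# only a handful of continuous context vectors are trained, which is what
# makes the method data-efficient. Names and shapes are assumptions.
import torch
import torch.nn as nn

class CoOpPrompt(nn.Module):
    def __init__(self, n_ctx: int = 8, dim: int = 64):
        super().__init__()
        # Learned context vectors shared across preferences; these are the
        # only trainable parameters when the encoder is frozen.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, pref_tok_emb: torch.Tensor) -> torch.Tensor:
        """Prepend the learned context to one preference's token embeddings.

        pref_tok_emb: (n_tokens, dim) embeddings of a preference label.
        Returns (n_ctx + n_tokens, dim), to be fed to the frozen encoder.
        """
        return torch.cat([self.ctx, pref_tok_emb], dim=0)
```

With, say, 8 context vectors of dimension 64 there are only 512 trainable parameters, which is consistent with fitting a new preference from a couple hundred labeled examples.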

We evaluate our approach with a human participant study; participants were asked to match sections of generated stories to a list of preference labels ranging from moral alignment to detailed subjective imagery. We show that our proposed technique is better at producing story segments that capture the given preference than prompting a much larger language model (GPT-NeoX-20B) or the GeDi method, which utilizes logit manipulation. Further, we conduct an ablation study by fine-tuning GPT-2-750M with standard CARP but without CoOp, showing CARP can still improve preference alignment over the NeoX baseline.

In summary, we make the following contributions:

  1. Introduction of a contrastively trained preference model, CARP, as a reward signal for preference learning in story generation.

  2. A new model, Pseudo CARP CoOp, that improves the robustness of preference learning via CARP over a wide class of preferences.

  3. The introduction of the Alignment CARP CoOp model which signals the moral alignment of story characters. This demonstrates the data efficiency of CARP CoOp when annotated data is available.

  4. A human subject study evaluating how well existing and proposed generation methodologies satisfy desired human preferences.