“GPT-2 Preference Learning for Music Generation § Decision Transformers: Preference Learning As Simple As Possible”, Gwern (2019-12-16):

Experiments with OpenAI’s ‘preference learning’ approach, which trains a NN to predict global quality of datapoints, and then uses reinforcement learning to optimize that directly, rather than proxies. I am unable to improve quality, perhaps due to too-few ratings.
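To make the reward-modeling step concrete, here is a minimal PyTorch sketch of the pairwise preference loss that this style of preference learning rests on (the Bradley–Terry comparison loss from Christiano et al 2017); the bag-of-embeddings encoder is a toy stand-in for a real GPT-2 reward model, and all names here are illustrative, not OpenAI’s implementation:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scalar 'quality' head over a sequence encoder. A toy
    bag-of-embeddings encoder stands in for a full GPT-2."""
    def __init__(self, vocab_size=50257, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, tokens):                # tokens: (batch, seq)
        h = self.embed(tokens).mean(dim=1)    # (batch, dim)
        return self.head(h).squeeze(-1)       # (batch,) scalar rewards

def preference_loss(model, preferred, rejected):
    """Bradley-Terry pairwise loss: push the reward of the
    human-preferred sample above that of the rejected sample."""
    return -torch.log(torch.sigmoid(model(preferred) - model(rejected))).mean()

# Toy usage: batches of token IDs for preferred vs. rejected samples.
model = RewardModel()
preferred = torch.randint(0, 50257, (4, 32))
rejected = torch.randint(0, 50257, (4, 32))
loss = preference_loss(model, preferred, rejected)
loss.backward()
```

The trained reward model is then used as the objective for an RL algorithm like PPO to finetune the generator, which is where the complexity and instability come in.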

I propose an extremely simple form of preference learning using the ‘Decision Transformer’ approach: simply encode the human choices as a sorted list of options, finetune a sequence model like GPT on a dataset including that encoded preference/choice data, and then generate text in the form of said sorted lists. The model will learn what preferred text looks like, and in generating sorted lists, will generate preferred samples first.
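The encoding itself can be almost trivially simple. A sketch of what such training samples might look like (the delimiters and numbering scheme here are hypothetical, just something consistent enough for the model to learn):

```python
# Hypothetical encoding: each human comparison becomes one text sample
# listing the options best-first, wrapped in made-up delimiter tokens.
def encode_ranking(options_best_first):
    lines = [f"{i + 1}. {opt}" for i, opt in enumerate(options_best_first)]
    return "<|ranking|>\n" + "\n".join(lines) + "\n<|endranking|>"

# A human rated these 3 continuations best-to-worst, so the training
# sample simply lists them in that order:
sample = encode_ranking([
    "You draw your sword and parry the goblin's thrust.",
    "You attack the goblin.",
    "You attack attack attack the the goblin goblin goblin.",
])
print(sample)
# Finetune GPT-2 on a corpus of such samples with any standard LM
# training script; no reward model or RL machinery is needed.
```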

This RL approach, by gardening the data, is closer to prompt programming or Software 2.0 than to traditional DRL algorithms like PPO. It avoids all of the complexity, instability, and compute requirements of the GPT preference-learning approach used previously, moving the reward learning inside the dataset itself, and is particularly applicable to tasks like AI Dungeon-style text adventure games, where the complexity of training rankers & RL-finetuned models has barred their use to date.
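At inference time, using such a model is equally simple: prompt it with the list delimiter and take the first entry, which by construction of the training data is its guess at the preferred continuation. A sketch using HuggingFace Transformers (the `"gpt2"` checkpoint is a stand-in for the hypothetical ranking-finetuned model, and the `<|ranking|>` delimiter is the made-up one from the encoding sketch above):

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")    # stand-in checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Prompt with the (hypothetical) delimiter so the model continues a
# sorted list, best option first.
prompt = "<|ranking|>\n1."
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=50, do_sample=True, top_p=0.9,
                     pad_token_id=tok.eos_token_id)
text = tok.decode(out[0])

# The first numbered entry is the model's guess at the *preferred*
# sample; later entries should be progressively worse.
lines = text.splitlines()
print(lines[1] if len(lines) > 1 else text)
```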