“GPT-2 Preference Learning for Music Generation § Optimization by Backprop, Not Blackbox”, Gwern, 2019-12-16:

Experiments with OpenAI’s ‘preference learning’ approach, which trains a NN to predict the global quality of datapoints, and then uses reinforcement learning to optimize that prediction directly rather than proxy objectives. I am unable to improve generation quality, perhaps due to too few ratings.
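For context, the reward model in this kind of preference learning is typically fit on pairwise human comparisons with a Bradley-Terry loss; a minimal PyTorch sketch of that loss, where `reward_model` is a hypothetical stand-in mapping a sequence to a scalar score:

```python
import torch.nn.functional as F

def preference_loss(reward_model, preferred, rejected):
    """Pairwise (Bradley-Terry) loss: the reward model should assign
    the human-preferred sequence a higher scalar score than the rejected one."""
    margin = reward_model(preferred) - reward_model(rejected)
    # P(preferred beats rejected) = sigmoid(margin); maximize its log-likelihood.
    return -F.logsigmoid(margin).mean()
```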

Here I propose dropping the agent/generator model entirely and explicitly optimizing the reward model’s utility/reward score by improving candidate sequences through gradient ascent on the (differentiable) reward model. There is no need to build a redundant agent model when the reward model is differentiable and its gradients can directly specify how an input sequence ought to change to improve it.
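A minimal sketch of this gradient ascent, assuming a frozen differentiable `reward_model` that maps token embeddings to a scalar score and an `embed_matrix` embedding table (both hypothetical interfaces, not from the original post). One complication is that token sequences are discrete, so the ascent here runs over a continuous relaxation, per-position logits softmaxed into mixtures of embeddings, and is discretized only at the end:

```python
import torch

def optimize_sequence(reward_model, embed_matrix, seq_len=64, steps=200, lr=0.1):
    """Improve a sequence by gradient ascent on a frozen, differentiable
    reward model, with no RL agent in the loop. `reward_model` maps
    (1, seq_len, embed_dim) embeddings to a scalar; `embed_matrix` is the
    (vocab_size, embed_dim) token-embedding table."""
    vocab_size = embed_matrix.shape[0]
    # Continuous relaxation: free logits over the vocabulary at each position.
    logits = torch.randn(1, seq_len, vocab_size, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Softmax mixture of embeddings keeps the input differentiable.
        soft_tokens = torch.softmax(logits, dim=-1) @ embed_matrix
        (-reward_model(soft_tokens)).backward()  # ascend reward = descend its negative
        opt.step()
    # Discretize: highest-probability token at each position.
    return logits.argmax(dim=-1)
```

Because each random initialization of `logits` converges to a different local optimum, repeated runs cheaply yield the diverse high-scoring candidates discussed below.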

This simplifies the overall architecture greatly, avoids expensive & unstable & complex blackbox training of DRL agents, and enables easy generation of both high-scoring & highly-diverse (thus informative) sequences for an oracle to rate, which can then be fed back into the reward model for further training. To the extent an agent/generator is necessary to efficiently generate many sequences, it can be trained quickly & stably by imitation learning on a corpus of datapoints optimized via the reward model, as sketched below.
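The resulting active-learning loop might look like the following sketch, reusing the hypothetical `optimize_sequence` from above; `oracle_rate` and `retrain_reward_model` are assumed stand-ins for human rating and a supervised reward-model update:

```python
def improvement_loop(reward_model, embed_matrix, oracle_rate,
                     retrain_reward_model, n_candidates=100, rounds=10):
    """Alternate gradient-ascent generation, oracle rating, and reward-model
    retraining. Diverse random inits in optimize_sequence yield diverse
    high-scoring candidates, so each round's new labels are informative."""
    labeled = []
    for _ in range(rounds):
        candidates = [optimize_sequence(reward_model, embed_matrix)
                      for _ in range(n_candidates)]
        ratings = [oracle_rate(c) for c in candidates]  # human/oracle labels
        labeled.extend(zip(candidates, ratings))
        retrain_reward_model(reward_model, labeled)     # supervised update
    # The best optimized sequences can also serve as an imitation-learning
    # corpus for training a fast generator, if one is still wanted.
    return labeled
```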