“GPT-2 Preference Learning for Music Generation § Bradley-Terry Preference Learning”, Gwern, 2019-12-16:

Experiments with OpenAI’s ‘preference learning’ approach, which trains a NN to predict the global quality of datapoints and then uses reinforcement learning to optimize that quality directly, rather than proxy objectives such as maximum likelihood. I was unable to improve generation quality, perhaps due to too few ratings.

Christiano et al 2017 introduced a deep reinforcement learning architecture for learning “I know it when I see it” subjectively-defined reward functions from human feedback: a human compares pairs of actions/datapoints/episodes and selects the ‘better’ one, a NN is trained to predict the better one from these comparisons, and another NN is RL-trained using the predicted comparisons as a reward. Since the human cannot write down a conventional reward function in software, the predictor NN (analogous to a Discriminator in a GAN or a ‘critic’ in actor-critic RL) learns the reward function by example; the RL agent NN (analogous to a Generator in a GAN) then learns by trial-and-error which sequences optimize this complex reward function, while ongoing human feedback provides guidance on new parts of the problem as the pair of NNs bootstrap toward better performance. This is demonstrated on video-game and robotic-style simulations, but the approach appears equally applicable to other sequence problems where reward functions are impossible to write and existing losses like maximum likelihood are imperfect for generation (such as music or poetry composition).
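
To make the training signal concrete, here is a minimal PyTorch sketch of how such a reward predictor can be trained from pairwise human preferences with a cross-entropy loss on the preference probability; this is only an illustration, not OpenAI’s implementation, and the `RewardModel` architecture (embedding + mean-pooling) is a placeholder assumption:

```python
# Sketch: train a scalar reward predictor from human pairwise preferences,
# where P(better ≻ worse) = sigmoid(r(better) - r(worse)).
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward predictor: embeds a token sequence and returns one scalar score."""
    def __init__(self, vocab_size=256, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, tokens):               # tokens: (batch, length) integer IDs
        h = self.embed(tokens).mean(dim=1)   # crude pooling over the sequence
        return self.head(h).squeeze(-1)      # (batch,) scalar rewards

def preference_loss(model, better, worse):
    """Cross-entropy on the predicted preference probability."""
    r_better = model(better)
    r_worse = model(worse)
    return -torch.nn.functional.logsigmoid(r_better - r_worse).mean()

# Usage: each training example is a human-labelled pair of sequences.
model = RewardModel()
better = torch.randint(0, 256, (8, 32))   # 8 preferred sequences of 32 tokens
worse = torch.randint(0, 256, (8, 32))    # 8 dispreferred sequences
loss = preference_loss(model, better, worse)
loss.backward()
```

The RL agent is then trained against the scores emitted by such a predictor, with fresh human comparisons periodically added to keep the reward model honest on the agent’s new behavior.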

As originally framed, the predictor merely does comparisons, receiving & providing binary feedback. This is justified as being implicitly equivalent to a standard paired-comparison/competition model, the Bradley-Terry model (akin to the famous Elo chess rating system), in which each datapoint has a latent quality variable on a common cardinal scale (often, like a liability-threshold model, scaled to 𝒩(0,1) for convenience), yielding a total order that efficiently extracts all possible information from the comparisons.
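
For reference, the Bradley-Terry model assigns each item i a latent quality θᵢ and models the outcome of a comparison purely through the difference of qualities:

```latex
% Bradley-Terry model: each item i has a latent quality \theta_i on a shared scale;
% the probability that i is preferred to j depends only on the quality difference.
P(i \succ j) \;=\; \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}}
             \;=\; \sigma(\theta_i - \theta_j),
\qquad \theta_i \sim \mathcal{N}(0,1) \ \text{(the conventional scaling of the latent scale)}
```

Note that this is the same functional form as the pairwise cross-entropy loss above: training the predictor on comparisons amounts to fitting latent Bradley-Terry-style scores implicitly.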

I suggest that this is not necessarily the case: examples from GANs indicate that such a preference-learning architecture may be learning something odder (such as memorizing comparisons). The architecture could be improved by removing the implicitness of the B-T ranking and computing the B-T rankings directly, which can be done even with non-overlapping comparisons by using a Bayesian model with priors and covariates such as the predictor’s own estimates. This would provide absolute quality scores useful for checking the correctness of comparisons, more efficient regression, RL rewards, and meaningful, interpretable scores for downstream uses; a sketch of such a direct fit follows.
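
As an illustration of the direct approach (a sketch under simplifying assumptions, not the article’s actual implementation; the function name, the plain 𝒩(0,1) prior, and the omission of covariates are choices made for the example), latent Bradley-Terry scores can be fit by MAP estimation, which remains well-defined even when the comparison graph is sparse or disconnected:

```python
# Sketch: MAP estimation of latent Bradley-Terry quality scores from (winner, loser) pairs.
import numpy as np
from scipy.optimize import minimize

def fit_bradley_terry(comparisons, n_items):
    """Return MAP estimates of latent scores given a list of (winner_index, loser_index) pairs."""
    winners = np.array([w for w, _ in comparisons])
    losers = np.array([l for _, l in comparisons])

    def neg_log_posterior(theta):
        diff = theta[winners] - theta[losers]
        # Log-likelihood: log P(winner ≻ loser) = log sigmoid(diff) = -log(1 + e^(-diff))
        log_lik = -np.logaddexp(0.0, -diff).sum()
        # Prior theta_i ~ N(0, 1): anchors the scale and keeps items identified
        # even if they never share a comparison with the rest of the graph.
        log_prior = -0.5 * np.sum(theta ** 2)
        return -(log_lik + log_prior)

    result = minimize(neg_log_posterior, np.zeros(n_items), method="L-BFGS-B")
    return result.x

# Toy usage: 4 items; item 0 beats item 1 twice, 1 beats 2, 2 beats 3.
scores = fit_bradley_terry([(0, 1), (0, 1), (1, 2), (2, 3)], n_items=4)
print(scores)  # scores decrease roughly from item 0 down to item 3
```

The resulting scores are absolute (up to the prior’s scale), so they can be handed to the RL agent as rewards, used to audit individual human comparisons, or reported directly as interpretable quality estimates.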