“Learning through Human Feedback [Blog]”, 2017-06-12:
The system—described in our paper “Deep Reinforcement Learning from Human Preferences”—departs from classic RL systems by training the agent on rewards produced by a neural network, the ‘reward predictor’, rather than on rewards it collects as it explores an environment.
It consists of three processes running in parallel:
A reinforcement learning agent explores and interacts with its environment, such as an Atari game.
Periodically, a pair of 1–2 second clips of its behavior is sent to a human operator, who is asked to select which of the two better shows progress towards the desired goal.
The human’s choice is used to train a reward predictor, which in turn trains the agent. Over time, the agent learns to maximise the reward from the predictor and improve its behavior in line with the human’s preferences.
The system separates learning the goal from learning the behavior to achieve it.
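The reward predictor's training step can be sketched as a Bradley-Terry model over clip rewards: the probability that the human prefers one clip is a logistic function of the difference in predicted rewards, trained with cross-entropy against the human's choice. The sketch below is a minimal illustration with a toy linear reward over per-step features and a synthetic stand-in for the human operator; the paper itself uses deep networks over raw observations, and all names here are hypothetical:

```python
import numpy as np

def clip_reward(w, clip):
    # Predicted reward for a clip: per-step rewards (clip @ w) summed
    # over the clip's timesteps. `clip` has shape (timesteps, features).
    return float((clip @ w).sum())

def preference_loss_grad(w, clip_a, clip_b, human_prefers_a):
    # Bradley-Terry model: P(a preferred over b) = sigmoid(r_a - r_b),
    # fit by cross-entropy against the human's binary choice.
    r_a, r_b = clip_reward(w, clip_a), clip_reward(w, clip_b)
    p_a = 1.0 / (1.0 + np.exp(r_b - r_a))
    label = 1.0 if human_prefers_a else 0.0
    loss = -(label * np.log(p_a) + (1.0 - label) * np.log(1.0 - p_a))
    # d loss / d w = (p_a - label) * d(r_a - r_b)/dw
    grad = (p_a - label) * (clip_a.sum(axis=0) - clip_b.sum(axis=0))
    return loss, grad

# Synthetic stand-in for the human operator: prefer whichever clip has
# the larger total first feature (e.g. "distance travelled").
rng = np.random.default_rng(0)
w = np.zeros(4)
for _ in range(500):
    a = rng.normal(size=(10, 4))  # two 10-step clips, 4 features each
    b = rng.normal(size=(10, 4))
    prefers_a = a[:, 0].sum() > b[:, 0].sum()
    _, grad = preference_loss_grad(w, a, b, prefers_a)
    w -= 0.05 * grad  # SGD step on the reward predictor
```

After a few hundred comparisons the predictor scores clips roughly by their first feature; in the full system, the RL agent would then be trained to maximise `clip_reward` in place of an environment reward, while comparisons continue to refine `w`.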
This iterative approach to learning means that a human can spot and correct any undesired behaviors, a crucial part of any safety system. The design also does not put an onerous burden on the human operator, who only has to review around 0.1% of the agent’s behavior to get it to do what they want. However, this can still mean reviewing several hundred to several thousand pairs of clips, something that will need to be reduced to make the approach applicable to real-world problems.