“Training a Helpful and Harmless Assistant With Reinforcement Learning from Human Feedback”, 2022-04-12
We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants.
We explore an iterated online mode of training, where preference models and reinforcement learning policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models.
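The iterated online loop above can be sketched as follows. This is a minimal toy illustration of the update structure only; every function here is a placeholder assumption, not the authors' training code (a real system would deploy the policy to collect human preference comparisons, finetune a preference model on them, and run RL such as PPO against that preference model).

```python
def collect_human_feedback(policy):
    # Stand-in (assumption): deploy `policy` and gather human
    # preference comparisons over pairs of its samples.
    return [("prompt", "response_a", "response_b", 0)]

def train_preference_model(preference_data):
    # Stand-in (assumption): retrain the preference model on the
    # enlarged comparison dataset; here it is just a dummy scorer.
    return lambda response: len(preference_data)

def rl_finetune(policy, preference_model):
    # Stand-in (assumption): run RL against the preference model's
    # reward signal; here the "policy" is just a counter.
    return policy + 1

def iterated_online_rlhf(num_rounds):
    policy, preference_data = 0, []
    for _ in range(num_rounds):  # one pass ~ one weekly update
        preference_data += collect_human_feedback(policy)
        preference_model = train_preference_model(preference_data)
        policy = rl_finetune(policy, preference_model)
    return policy, len(preference_data)

policy, num_comparisons = iterated_online_rlhf(4)
```

The key design point the sketch captures is that both the preference model and the policy are refreshed each round, so later feedback is collected on outputs of the improved policy rather than the original one.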
We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as Python coding and summarization.
Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the Kullback–Leibler (KL) divergence between the policy and its initialization.
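The stated relation is that reward grows roughly as a linear function of sqrt(KL(policy || init)). A minimal sketch of how one might check such a relation on measured (KL, reward) pairs is below; the data here are synthetic, and the slope and noise level are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
kl = np.linspace(0.0, 50.0, 200)  # KL divergence from initialization (nats)
# Synthetic rewards following reward ~ a * sqrt(KL) + noise (a = 2.5 assumed).
reward = 2.5 * np.sqrt(kl) + rng.normal(0.0, 0.05, kl.size)

# Regress reward against sqrt(KL); a near-perfect linear fit (high R^2)
# is the signature of the claimed relation.
slope, intercept = np.polyfit(np.sqrt(kl), reward, 1)
residual = reward - (slope * np.sqrt(kl) + intercept)
r_squared = 1.0 - residual.var() / reward.var()
print(f"slope={slope:.2f} intercept={intercept:.2f} R^2={r_squared:.3f}")
```

On real training runs one would collect (KL, reward) checkpoints during RL and apply the same regression; a slope that is stable across model scales is what makes the relation useful as a diagnostic.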
Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of out-of-distribution (OOD) detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work.