"UltraFeedback: Boosting Language Models With High-Quality Feedback", 2023-10-02:
Reinforcement learning from human feedback (RLHF) has become a pivotal technique in aligning large language models (LLMs) with human preferences. In RLHF practice, preference data plays a crucial role in bridging human preferences and LLM behavior. However, the scarcity of diverse, naturalistic, large-scale datasets of human preferences on LLM outputs poses a significant challenge to RLHF and feedback learning research within the open-source community. Current preference datasets are either proprietary or limited in size and prompt variety, which restricts RLHF adoption in open-source models and hinders further exploration.
In this study, we propose UltraFeedback, a large-scale, high-quality, and diversified preference dataset designed to overcome these limitations and foster RLHF development. To create UltraFeedback, we compile a diverse array of instructions and models from multiple sources to produce comparative data. We meticulously devise annotation instructions and employ GPT-4 to offer detailed feedback in both numerical and textual forms. UltraFeedback establishes a reproducible and expandable preference data construction pipeline, serving as a solid foundation for future RLHF and feedback learning research.
Utilizing UltraFeedback, we train various models to demonstrate its effectiveness, including the reward model UltraRM, chat language model UltraLM-13B-PPO, and critique model UltraCM. Experimental results indicate that our models outperform existing open-source models, achieving top performance across multiple benchmarks.
Our data and models are available on GitHub.