“AlpaGasus: Training A Better Alpaca With Fewer Data”, Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, Hongxia Jin (2023-07-17):

Large language models (LLMs) obtain instruction-following capability through instruction-finetuning (IFT) on supervised instruction/response data. However, widely used IFT datasets (e.g., Alpaca’s 52k data) surprisingly contain many low-quality instances with incorrect or irrelevant responses, which are misleading and detrimental to IFT.

In this paper, we propose a simple and effective data selection strategy that automatically identifies and removes low-quality data using a strong LLM (e.g., ChatGPT). To this end, we introduce AlpaGasus, which is finetuned on only 9k high-quality examples filtered from the 52k Alpaca data.
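The core idea is to prompt a strong grader LLM to score each instruction/response pair and keep only the high-scoring ones for finetuning. Below is a minimal Python sketch of such a filtering loop; the exact rating prompt, the 0–5 scale, the 4.5 keep-threshold, and the file names are illustrative assumptions rather than the paper's exact settings, and the OpenAI client is used only as an example of a grader LLM.

```python
# Minimal sketch of LLM-based data filtering in the spirit of AlpaGasus.
# Assumptions: the rating prompt, the 0-5 scale, and the 4.5 threshold
# are illustrative choices, not the paper's verbatim configuration.
import json
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RATING_PROMPT = (
    "You are grading instruction-tuning data. Given an instruction, an optional "
    "input, and a response, rate the accuracy and helpfulness of the response "
    "on a scale of 0 to 5. Reply with only the numeric score.\n\n"
    "Instruction: {instruction}\nInput: {input}\nResponse: {output}\nScore:"
)

def rate_example(example: dict) -> float:
    """Ask the grader LLM for a quality score for one (instruction, input, response) triple."""
    prompt = RATING_PROMPT.format(
        instruction=example["instruction"],
        input=example.get("input", ""),
        output=example["output"],
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    text = reply.choices[0].message.content
    match = re.search(r"\d+(\.\d+)?", text)
    return float(match.group()) if match else 0.0

def filter_dataset(path_in: str, path_out: str, threshold: float = 4.5) -> None:
    """Keep only examples whose grader score meets the threshold."""
    with open(path_in) as f:
        data = json.load(f)  # Alpaca-style list of instruction/input/output dicts
    kept = [ex for ex in data if rate_example(ex) >= threshold]
    with open(path_out, "w") as f:
        json.dump(kept, f, indent=2)
    print(f"kept {len(kept)} of {len(data)} examples")

if __name__ == "__main__":
    # Hypothetical file names for the full and filtered Alpaca-format datasets.
    filter_dataset("alpaca_data.json", "alpaca_filtered.json")
```

The resulting filtered subset can then be used as a drop-in replacement for the full 52k set in the standard Alpaca finetuning recipe.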

AlpaGasus outperforms the original Alpaca as evaluated by GPT-4 on multiple test sets, and its 13B variant matches more than 90% of the performance of its teacher LLM (i.e., Text-Davinci-003) on test tasks. It also trains 5.7× faster, reducing the training time for a 7B variant from 80 minutes (for Alpaca) to 14 minutes.

AlpaGasus demonstrates a novel data-centric IFT paradigm that can be generally applied to instruction-tuning data, leading to faster training and better instruction-following models.

Our project page is available at: https://lichang-chen.github.io/AlpaGasus/.