“Large Language Models Can Self-Improve”, 2022-10-20:
Large Language Models (LLMs) have achieved excellent performance on various tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on the other hand, can improve their reasoning abilities by self-thinking without external inputs.
In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained PaLM LLM to generate “high-confidence” rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs.
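The core of the self-improvement loop is self-consistency: sample several chain-of-thought rationales per question at nonzero temperature, majority-vote over the final answers, and keep only questions where agreement is high enough to trust the pseudo-label. A minimal sketch of that filtering step, with a hypothetical `self_consistency_label` helper and toy stand-in samples (the paper's actual sampling and fine-tuning pipeline is not shown):

```python
from collections import Counter

def self_consistency_label(samples, threshold=0.6):
    """Majority-vote over sampled chain-of-thought answers.

    `samples` is a list of (rationale, answer) pairs drawn from the model
    at nonzero temperature. Returns (majority_answer, agreeing_rationales)
    if the majority's vote share clears `threshold`, else None -- only
    high-confidence questions become fine-tuning targets.
    """
    votes = Counter(answer for _, answer in samples)
    answer, count = votes.most_common(1)[0]
    confidence = count / len(samples)
    if confidence < threshold:
        return None  # low agreement: discard this question entirely
    # Keep every rationale that led to the majority answer; these
    # rationale-augmented solutions become the fine-tuning targets.
    rationales = [r for r, a in samples if a == answer]
    return answer, rationales

# Toy stand-ins for real sampled CoT outputs on one GSM8K-style question:
samples = [
    ("step A ... so 18", "18"),
    ("step B ... so 18", "18"),
    ("step C ... so 20", "20"),
    ("step D ... so 18", "18"),
]
result = self_consistency_label(samples)  # 3/4 agree -> kept
```

The confidence threshold is the key design choice: it trades dataset size against pseudo-label accuracy, since discarded questions contribute no training signal at all.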
We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4% → 82.1% on GSM8K, 78.2% → 83.0% on DROP, 90.0% → 94.4% on OpenbookQA, and 63.4% → 67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth label.
We conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement.
[Jack Clark summary:
The results are mind-blowing: Using this technique, the researchers are able to get new state-of-the-art results on 4 of 6 reasoning benchmarks. They also show very good results on out-of-domain tasks, e.g. arithmetic reasoning and natural language reasoning. It generally seems like chain-of-thought plus self-consistency leads to robust gains on a large set of diverse tasks. Also, it's an inherently simple approach, and simple tends to scale.
Why this matters—self-bootstrapping systems: This is an example of a self-bootstrapping AI; the language model can get better performance purely by leveraging its own capabilities. This is also a neat illustration of how there’s a current capabilities overhang in AI development; the LMs we have today are actually much more powerful than they appear, and we mostly need to invent ways to uncover these techniques or, as in the research here, figure out how to get LMs to themselves reveal their capabilities to us.]