“Exploring Length Generalization in Large Language Models”, 2022-07-11:
The ability to extrapolate from short problem instances to longer ones is an important form of out-of-distribution generalization in reasoning tasks, and is crucial when learning from datasets where longer problem instances are rare. These include theorem proving, solving quantitative mathematics problems, and reading/summarizing novels. In this paper, we run careful empirical studies exploring the length generalization capabilities of transformer-based language models.
We first establish that naively finetuning transformers on length generalization tasks shows generalization deficiencies independent of model scale. We then show that combining pretrained large language models’ in-context learning abilities with scratchpad prompting (asking the model to output solution steps before producing an answer) results in a dramatic improvement in length generalization.
We run careful failure analyses on each of the learning modalities and identify common sources of mistakes that highlight opportunities for equipping language models with the ability to generalize to longer problems.
…Few-shot scratchpad prompting enables pretrained large language models to pick up scratchpad templates that extrapolate to arbitrary lengths, leading to dramatic improvements on longer problem instances. Unlike raw finetuning, this approach does scale with model size [4].
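To make the idea concrete, here is a minimal sketch of what a few-shot scratchpad prompt might look like. The task (parity of a bit string) and the exact prompt wording are illustrative assumptions, not the paper's templates; the point is that the scratchpad spells out one step per input symbol, so the per-step pattern is independent of input length and can be matched at lengths beyond the few-shot examples.

```python
def parity_scratchpad(bits: str) -> str:
    """Render one worked example: input, step-by-step running parity, answer."""
    lines = [f"Input: {bits}", "Scratchpad:"]
    parity = 0
    for i, b in enumerate(bits):
        parity ^= int(b)  # running XOR over the bits seen so far
        lines.append(f"  after bit {i} ({b}): parity = {parity}")
    lines.append(f"Answer: {parity}")
    return "\n".join(lines)

def build_prompt(examples: list[str], query: str) -> str:
    """Few-shot prompt: short worked examples, then a longer query instance
    ending at 'Scratchpad:' so the model must produce the steps itself."""
    shots = "\n\n".join(parity_scratchpad(e) for e in examples)
    return f"{shots}\n\nInput: {query}\nScratchpad:"

# Few-shot examples are short; the query is deliberately longer,
# probing whether the model extrapolates the per-step template.
prompt = build_prompt(["101", "0110"], "1011010011")
```

The contrast with direct prompting is that an `Input → Answer` format forces the model to compute the whole function in one step, whereas the scratchpad reduces it to repeating a fixed-size step, which is what makes length extrapolation plausible.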
Trying to further enhance the performance of few-shot scratchpad-prompted LLMs via finetuning yields mixed results, depending on the base model's non-finetuned performance on the target task. We emphasize that this few-shot variable-length pattern-matching capability, which requires no change to the model architecture, offers a qualitatively different approach to length generalization than prior work that introduced architectural modifications to the same end. It also implies that for LLMs there are certain skills, like length generalization, that are learned better through in-context learning than through finetuning, even given unlimited data.