[code/model; cf. Tirumala et al., 2022, Ostapenko et al., 2022, Ding et al., 2022, Jin et al., 2021, Wang et al., 2022; based on instruction-tuning] Recent work on large language models relies on the intuition that most natural language processing tasks can be described via natural language instructions. Language models trained on these instructions show strong zero-shot performance on several standard datasets. However, though impressive, these models still perform poorly on a wide range of tasks outside of their respective training and evaluation sets [e.g., catastrophic forgetting].
To address this limitation, we argue that a model should be able to keep extending its knowledge and abilities without forgetting previous skills. Despite the limited success of Continual Learning to date, we show that Fine-tuned Language Models can be continual learners.
We empirically investigate the reasons for this success and conclude that Continual Learning emerges from self-supervised pre-training.
Our resulting model, Continual-T0 (CT0), is able to learn diverse new tasks while still maintaining good performance on previous tasks, spanning a remarkable 70 datasets in total.
Finally, we show that CT0 is able to combine instructions in ways it was never trained for, demonstrating some compositionality.
…6. Conclusion: We explored, for the first time, Continual Learning for instruction-based models. Our results indicate that fine-tuned Language Models are efficient continual learners: 1% rehearsal is enough to maintain high performance on previously learned tasks while learning new ones. Additionally, we show that our model, CT0, is able to comprehend new instructions obtained via instruction composition.
The current technique for learning multiple tasks is to train a model from scratch. We hope this work paves the way toward a new paradigm in which models do not have to be retrained all over again. We believe our experimental findings will contribute to the effectiveness of large language models, enabling them to progressively adapt to new concepts and acquire ever more abilities. By analogy with software development, this could be seen as learning new features: new checkpoints are like new versions of a model. In this context, Continual Learning is a step toward the “Call to Build Models Like We Build Open-Source Software”.
…Notably, we also maintain performance on the T0 zero-shot evaluation datasets, even though no rehearsal was performed for those, a first-of-its-kind setup for CL (§4).
Our final model, Continual-T0 (CT0), in addition to performing as well as T0 on all the different T0 datasets, can also understand instructions for the newly introduced tasks, which focus on language generation problems such as writing a haiku, generating empathetic responses in a dialogue, simplifying text, generating a headline under decoding constraints, generating natural language explanations for Natural Language Inference (NLI) tasks, generating a Tweet on a given topic in the style of a given author, or answering questions about new domains/concepts such as COVID-19.
…Why are transformer models (like T0) continual learners? Is it because of their multi-task nature or the instruction-tuning paradigm? Or does the large-scale parameterization of language models contribute to this success? Our experimental analysis shows that the easy adaptability and continual learning capabilities actually emerge from pre-training, not from any of the above, including scale (Table 5, §5.1).
…To test this hypothesis, we start from the T0 checkpoint, a model trained on 50 datasets. We progressively train it on a sequence of 8 new NLG tasks (see §7.3.1 & Table 2 for a description of those tasks) using Continual Learning via rehearsal (r = 1%). We call our final model CT0…In Figure 2, we display the progressive sequential learning on the 8 new tasks. We learn a new task, starting from T0, and add 1% of the learned task's data to our rehearsal buffer. We observe progressive improvement on each task; that is, our model keeps learning new tasks. At the same time, performance is preserved on the other tasks (i.e., the Relative Gain remains around 1), indicating the success of our CLR method in a sequential learning setup through more than 1,000 gradient steps over 8 different tasks.
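The rehearsal procedure described above can be sketched as a data pipeline: after each task is learned, a small fraction (here 1%) of its examples is retained and mixed into the training data of every subsequent task. This is an illustrative sketch only; the function names and the toy data are ours, and the actual gradient updates on T0 are omitted.

```python
import random

def build_rehearsal_buffer(task_data, rate=0.01):
    """Sample a fraction (e.g. 1%) of a learned task's examples for replay."""
    k = max(1, int(len(task_data) * rate))
    return random.sample(task_data, k)

def continual_learning_with_rehearsal(tasks, rate=0.01):
    """Sequentially 'learn' tasks, mixing in a rehearsal buffer of old ones.

    Returns (task name, mixture size) per stage; this sketches only the
    data mixing, not the fine-tuning itself.
    """
    buffer, stages = [], []
    for name, data in tasks:
        mixture = data + buffer                       # new task + replayed old examples
        stages.append((name, len(mixture)))
        buffer += build_rehearsal_buffer(data, rate)  # keep 1% for future replay
    return stages

# Toy usage: three "tasks" of 1,000 examples each.
tasks = [(f"task{i}", [f"ex{i}_{j}" for j in range(1000)]) for i in range(3)]
print(continual_learning_with_rehearsal(tasks))
```

Each later stage trains on its own data plus roughly 1% of every earlier task, which is what keeps the buffer (and the extra training cost) small.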
Figure 2: Progressive Relative Gain results for CT0 (11B) during sequential learning (y-axis) vs. number of training steps (x-axis). The curves for tasks T0, …, T7 appear at steps 0, …, i respectively, such that only the first task, Simplification (green and orange), is present at step 0, then HGen (red), etc.
…Our two CT0 models obtain final results very close to their UB, maintaining 99.8% for T0pp and 98.0% for T0_3B. This clearly indicates the efficiency of the CLR method. Notably, no task suffers a performance decrease of more than 2% for T0pp. Table 3 shows how the CT0 model remembers and retains knowledge from tasks trained at very early stages of the Continual Learning process. Moreover, CT0 still performs well on the zero-shot set of tasks (T0zs), despite no rehearsal for those.
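The retention figures above (99.8%, 98.0%) can be computed as an average Relative Gain: per task, the score after continual learning divided by the upper-bound score. This is our reading of the metric, not the authors' code, and the scores below are hypothetical.

```python
def relative_gain(scores_after_cl, upper_bound_scores):
    """Per-task ratio of post-continual-learning score to upper-bound score.

    A value near 1.0 means the skill was preserved (no forgetting).
    """
    return {task: scores_after_cl[task] / upper_bound_scores[task]
            for task in upper_bound_scores}

def average_retention(scores_after_cl, upper_bound_scores):
    """Mean Relative Gain across tasks (cf. the 99.8% reported for T0pp)."""
    gains = relative_gain(scores_after_cl, upper_bound_scores)
    return sum(gains.values()) / len(gains)

# Toy example with made-up task scores:
ub = {"simplification": 40.0, "haiku": 60.0, "headline": 30.0}
after = {"simplification": 39.6, "haiku": 60.0, "headline": 29.7}
print(round(average_retention(after, ub), 3))  # → 0.993
```

Under this definition, "no task decreases by more than 2%" simply means every per-task Relative Gain stays above 0.98.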
…5.1 Why could LLMs be lifelong learners? Given our current experimental protocol, one can draw different hypotheses: is CL a consequence emerging from the massive multi-task pre-training in T0? From the instruction-tuning paradigm of T0? Or from the scaling laws studied by Ramasesh et al., 2022? To answer this research question, we applied the same CL setup starting from (1) T5-small, (2) T5-3B, and (3) a T5-3B architecture randomly initialized. Our results in Table 5 show that CT5 with 3B parameters performs similarly to CT03B on the 8 tasks. While CT5-small obtains, as expected, a lower average performance, it still mostly maintains strong results w.r.t. its Upper Bound, indicating that CL does not emerge from scale. Conversely, when initialized randomly, the model is not even able to obtain a good UB. These results lead to a clear conclusion: CL emerges from the intensive pre-training stage. This confirms contemporaneous findings by Cossu et al., 2022 and Mehta et al., 2021 in other setups and even other modalities. We report detailed results for these experiments in the Appendix.
Table 5: Results including T5-small, T5-3B, T0_3B, and a 3B Transformer randomly initialized. We observe that (1) only CTrand largely degrades w.r.t. its UB, UB_rand; (2) even T5_small is able to mostly maintain its performance, indicating that scale is not what matters most.