“Evaluating Parameter Efficient Learning for Generation”, 2022-10-25:
Parameter efficient learning methods (PERMs) have recently gained attention as they provide an efficient way for pre-trained language models (PLMs) to adapt to a downstream task. However, such conclusions are mostly drawn from in-domain evaluations over the full training set. In this paper, we present comparisons between PERMs and finetuning from 3 new perspectives: (1) the effect of sample and model size on in-domain evaluations, (2) generalization to unseen domains and new datasets, and (3) the faithfulness of generations.
Our results show that for in-domain settings (a) there is a cross point of sample size below which PERMs perform better than finetuning, and (b) larger PLMs have larger cross points. For cross-domain and cross-dataset cases, we show that (a) Adapter (Houlsby et al., 2019) performs the best amongst all the PERMs studied here, and (b) it outperforms finetuning if the task dataset is below a certain size. We also compare the faithfulness of generations and show that PERMs can achieve a better faithfulness score than finetuning, by as much as 6%, especially for small training sets.
Finally, we apply Adapter to MT-NLG 530b (Smith et al., 2022) and achieve new state-of-the-art results on XSum (Narayan et al., 2018) for all ROUGE scores (ROUGE-1 49.17, ROUGE-2 27.20, ROUGE-L 40.98).
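For reference, the ROUGE scores above measure n-gram overlap between a generated summary and a reference summary. Below is a minimal sketch of ROUGE-N F1 using simple whitespace tokenization and clipped n-gram counts; the official ROUGE toolkit additionally applies stemming and other preprocessing, so this illustrative version will not exactly reproduce reported scores.

```python
from collections import Counter

def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """Illustrative ROUGE-N F1: clipped n-gram overlap, whitespace tokens."""
    def ngrams(text: str) -> Counter:
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped match counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# toy example: 3 of 3 candidate unigrams match, 3 of 6 reference unigrams covered
score = rouge_n_f1("the cat sat", "the cat sat on the mat", n=1)
```

Here precision is 1.0 and recall is 0.5, giving F1 = 2/3; ROUGE-2 and ROUGE-L differ only in using bigrams and longest common subsequence, respectively.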
…Comparing FT [finetuning] with AP [Adapter], we find there is always a cross point of sample size beyond which FT is better than AP. This shows that if we have a large number of samples in the training set, FT will work better; but if the number of samples for the task is small, AP will be better. Moreover, this cross point is larger for larger PLMs. For example, the cross point for the 1.3b model over XSum is less than 10k samples, whereas for the 8.3b model it is 50k samples. This phenomenon can be attributed to the fact that FT can easily overfit with a large model or few training samples. It motivates using AP to achieve better in-domain performance when the dataset is small or the model is large.
Interestingly, tuning AP on an 8.3b model with only 32m extra parameters over 5k samples achieves much better results than finetuning a 357m model over 100k samples. This means more than 90% of task-specific parameters can be saved for deployment and more than 97% of task-specific samples can be reduced for training by sharing the larger frozen PLMs.
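The excerpt does not spell out the adapter architecture; a common choice, and the one Houlsby et al. (2019) describe, is a bottleneck module inserted into each transformer layer while the PLM weights stay frozen. The sketch below (NumPy, with illustrative dimensions — the real adapters sit inside a transformer and use its hidden size) shows why the extra parameter count is so small: only the two projection matrices, 2·d·r parameters per adapter, are trained.

```python
import numpy as np

def adapter(h: np.ndarray, W_down: np.ndarray, W_up: np.ndarray) -> np.ndarray:
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""
    z = np.maximum(0.0, h @ W_down)  # ReLU in the low-rank bottleneck
    return h + z @ W_up              # residual connection around the adapter

d, r = 16, 4  # hidden size and bottleneck size (illustrative, r << d)
rng = np.random.default_rng(0)
h = rng.standard_normal((2, d))          # a batch of frozen-PLM hidden states
W_down = rng.standard_normal((d, r)) * 0.01
W_up = np.zeros((r, d))  # zero init: the adapter starts as the identity map

out = adapter(h, W_down, W_up)
trainable = W_down.size + W_up.size  # 2*d*r parameters vs d*d for a full layer
```

With `W_up` initialized to zero the adapter initially passes hidden states through unchanged, so training starts from the frozen PLM's behavior; only the 2·d·r adapter weights are updated, which is how a 32m-parameter add-on can adapt an 8.3b-parameter model.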