“BitFit: Simple Parameter-Efficient Fine-Tuning for Transformer-Based Masked Language-Models”, Elad Ben Zaken, Shauli Ravfogel, Yoav Goldberg (2021-06-18):

We introduce BitFit, a sparse fine-tuning method in which only the bias terms of the model (or a subset of them) are modified.
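The core idea is easy to express in code: freeze every weight matrix and leave only the bias vectors trainable. Below is a minimal PyTorch sketch of this name-based filtering; the small `nn.Sequential` model is a stand-in for a pre-trained BERT (with a HuggingFace model, the same `named_parameters()` filter applies), and `apply_bitfit` is an illustrative helper name, not an API from the paper.

```python
import torch
import torch.nn as nn

def apply_bitfit(model: nn.Module) -> None:
    """Freeze every parameter except the bias terms (the BitFit recipe)."""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")

# Toy stand-in for a pre-trained transformer.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
apply_bitfit(model)

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
total = sum(p.numel() for p in model.parameters())
tuned = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)          # ['0.bias', '2.bias']
print(tuned, "/", total)  # 18 / 178
```

When training, only the unfrozen parameters need to be handed to the optimizer, e.g. `torch.optim.AdamW(p for p in model.parameters() if p.requires_grad)`, which is where the memory and storage savings come from: per task, only the bias vectors (well under 1% of BERT's parameters) are updated and saved.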

We show that with small-to-medium training data, applying BitFit on pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model. For larger data, the method is competitive with other sparse fine-tuning methods.

Besides their practical utility, these findings bear on the question of what the commonly-used fine-tuning process actually does: they support the hypothesis that fine-tuning mainly exposes knowledge induced by language-modeling training, rather than teaching new task-specific linguistic knowledge.

Figure 2: Comparison of BitFit and Full-FT with BERT-BASE exact match score on the SQuAD validation set.

…Size of training data: The GLUE results suggest an inverse correlation between BitFit's ability to match Full-FT performance and training-set size. To test this (and to validate BitFit on another token-level task), we train on increasingly large subsets of SQuAD v1.0. The results in Figure 2 show a clear trend: BitFit outperforms Full-FT in the smaller-data regime, while the trend reverses as more training data becomes available.

We conclude that BitFit is a worthwhile targeted fine-tuning method in small-to-medium data regimes.