We introduce BitFit, a sparse-finetuning method in which only the bias terms of the model (or a subset of them) are modified.
We show that with small-to-medium training data, applying BitFit on pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model. For larger data, the method is competitive with other sparse fine-tuning methods.
Besides their practical utility, these findings are relevant for the question of understanding the commonly-used process of finetuning: they support the hypothesis that finetuning is mainly about exposing knowledge induced by language-modeling training, rather than learning new task-specific linguistic knowledge.
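The core mechanism can be sketched in a few lines of PyTorch: freeze every parameter, then re-enable gradients only for the bias terms. The toy `nn.Sequential` model below is a hypothetical stand-in for illustration; with a real pre-trained BERT checkpoint the same name-based filter would apply.

```python
import torch
from torch import nn

# Toy stand-in for a pre-trained transformer (hypothetical; BitFit would be
# applied the same way to a real BERT model's named parameters).
model = nn.Sequential(
    nn.Linear(768, 768),
    nn.GELU(),
    nn.Linear(768, 768),
)

# BitFit: freeze everything, then unfreeze only the bias terms.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in model.parameters())
print(trainable)                  # only the bias vectors remain trainable
print(n_trainable, "/", n_total)  # a tiny fraction of all parameters

# The optimizer is then given only the trainable (bias) parameters.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```

Because the weight matrices dominate the parameter count, the trainable subset here is well under 1% of the model, which is what makes BitFit a "targeted" fine-tuning method.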
Figure 2: Comparison of BitFit and Full-FT with BERT-base, exact-match score on the SQuAD validation set.
Size of training data: The GLUE results suggest an inverse correlation between BitFit's ability to match Full-FT performance and training-set size. To test this (and to validate on another token-level task), we train on increasingly large subsets of SQuAD v1.0. The results in Figure 2 show a clear trend: BitFit outperforms Full-FT in the small-data regime, while the trend reverses when more training data is available.
We conclude that BitFit is a worthwhile targeted fine-tuning method in small-to-medium data regimes.