"Evaluating the Fairness of Task-Adaptive Pretraining on Unlabeled Test Data Before Few-Shot Text Classification", 2024-09-30:
Few-shot learning benchmarks are critical for evaluating modern NLP techniques. It is possible, however, that benchmarks favor methods that easily make use of unlabeled text, because researchers can use unlabeled text from the test set to pretrain their models.
Given the dearth of research on this potential problem, we run experiments to quantify the bias caused by pretraining on unlabeled test set text instead of on unlabeled, independently drawn text.
[Finetuning Mistral-7b 2,000 times, and BERT and GPT-2 135,000 times, for science.] Controlled few-shot and zero-shot experiments on 25 classification tasks and 3 language models (BERT, GPT-2, and Mistral-7b) do not find evidence of overoptimism. Furthermore, we demonstrate the importance of repeated subsampling when studying few-shot text classification, and recommend that few-shot learning benchmarks include multiple training folds.
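The recommendation about repeated subsampling can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the data, the trivial majority-label "classifier," and all function names here are hypothetical stand-ins for finetuning a real model on each sampled fold and scoring it on the test set.

```python
# Sketch of repeated subsampling for few-shot evaluation (hypothetical
# data and model): draw many small training folds from a labeled pool,
# score each fold on the same test set, and report mean +/- std so that
# fold-to-fold variance is visible instead of a single lucky draw.
import random
import statistics

def evaluate_fold(train_fold, test_set):
    # Stand-in for "finetune on this fold, then measure test accuracy":
    # here we simply predict the majority label seen in the fold.
    labels = [y for _, y in train_fold]
    majority = max(set(labels), key=labels.count)
    return sum(1 for _, y in test_set if y == majority) / len(test_set)

def repeated_subsampling(pool, test_set, shots=16, folds=20, seed=0):
    rng = random.Random(seed)
    scores = [evaluate_fold(rng.sample(pool, shots), test_set)
              for _ in range(folds)]
    return statistics.mean(scores), statistics.stdev(scores)

# Toy labeled pool with a 60/40 label split, so different folds can
# induce different predictions and the variance is nonzero.
pool = [(f"example {i}", 1 if i < 120 else 0) for i in range(200)]
test_set = [(f"test {i}", 1 if i < 60 else 0) for i in range(100)]
mean_acc, std_acc = repeated_subsampling(pool, test_set)
print(f"accuracy over folds: {mean_acc:.3f} +/- {std_acc:.3f}")
```

Reporting the spread across folds, rather than one fold's score, is the point of the recommendation: with only a handful of training examples per fold, single-split results can be dominated by sampling noise.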
Code and data are available on GitHub.