“MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers”, 2020-02-25:
Pre-trained language models (eg. BERT (Devlin et al 2018) and its variants) have achieved remarkable success in a variety of NLP tasks. However, these models usually consist of hundreds of millions of parameters, which brings challenges for fine-tuning and online serving in real-life applications due to latency and capacity constraints.
In this work, we present a simple and effective approach to compress large Transformer (Vaswani et al 2017)-based pre-trained models, termed deep self-attention distillation. The small model (student) is trained by deeply mimicking the self-attention module, which plays a vital role in Transformer networks, of the large model (teacher). Specifically, we propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student. Furthermore, we introduce the scaled dot-product between values in the self-attention module as the new deep self-attention knowledge, in addition to the attention distributions (ie. the scaled dot-product of queries and keys) that have been used in existing works. Moreover, we show that introducing a teacher assistant (Mirzadeh et al 2019) also helps the distillation of large pre-trained Transformer models.
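As a concrete illustration of the two relation losses described above, here is a minimal PyTorch sketch (not the authors' released code): it compares teacher and student attention distributions, and their value relations, from the last self-attention layer via KL divergence. The names `relation_kl` and `minilm_loss` and all tensor shapes are illustrative assumptions; it also assumes teacher and student share the same number of attention heads and sequence length, so the per-head relation matrices are directly comparable.

```python
# Minimal sketch of deep self-attention distillation, assuming PyTorch.
# Names (relation_kl, minilm_loss) and shapes are illustrative, not the
# paper's released implementation. Assumes teacher and student use the
# same number of attention heads and the same sequence length.
import math
import torch
import torch.nn.functional as F

def relation_kl(teacher_logits: torch.Tensor, student_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over the last (key) dimension, averaged
    over batch, heads, and query positions."""
    t = F.softmax(teacher_logits, dim=-1)
    log_t = F.log_softmax(teacher_logits, dim=-1)
    log_s = F.log_softmax(student_logits, dim=-1)
    return (t * (log_t - log_s)).sum(dim=-1).mean()

def minilm_loss(q_t, k_t, v_t, q_s, k_s, v_s):
    """Queries/keys/values of shape (batch, heads, seq_len, head_dim),
    taken from the LAST self-attention layer of teacher (_t) and student (_s)."""
    # Attention distributions: scaled dot-product of queries and keys.
    att_t = q_t @ k_t.transpose(-1, -2) / math.sqrt(q_t.size(-1))
    att_s = q_s @ k_s.transpose(-1, -2) / math.sqrt(q_s.size(-1))
    # Value relation: scaled dot-product between values, the additional
    # self-attention knowledge introduced here.
    vr_t = v_t @ v_t.transpose(-1, -2) / math.sqrt(v_t.size(-1))
    vr_s = v_s @ v_s.transpose(-1, -2) / math.sqrt(v_s.size(-1))
    return relation_kl(att_t, att_s) + relation_kl(vr_t, vr_s)
```

Note that because both relations are seq_len × seq_len matrices per head, the student's hidden size (and thus head dimension) can differ from the teacher's without any learned projection; only the head count must match in this sketch, which is part of what makes last-layer relation transfer flexible.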
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different parameter sizes of student models. In particular, it retains more than 99% accuracy on SQuAD 2.0 and several GLUE benchmark tasks, using 50% of the Transformer parameters and computations of the teacher model.
We also obtain competitive results in applying deep self-attention distillation to multilingual pre-trained models.