"Self-Distillation: Born Again Neural Networks", 2018-05-12:
Knowledge distillation (KD) consists of transferring knowledge from one machine learning model (the teacher) to another (the student). Commonly, the teacher is a high-capacity model with formidable performance, while the student is more compact. By transferring knowledge, one hopes to benefit from the student's compactness while retaining performance close to the teacher's.
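The transfer is usually done by training the student to match the teacher's softened output distribution in addition to the ground-truth labels. Below is a minimal sketch of such a standard KD objective (in the style of Hinton et al. 2015), not the paper's exact code; the temperature `T` and mixing weight `alpha` are illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend of soft-target KL divergence (at temperature T) and hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                       # rescale by T^2, as is conventional for soft targets
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```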
We study KD from a new perspective: rather than compressing models, we train students parameterized identically to their teachers.
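Concretely, the born-again procedure first trains a teacher in the usual way, then trains a fresh student of the same architecture against the frozen teacher's outputs; the resulting student can in turn serve as the teacher for the next generation. The sketch below assumes the `kd_loss` above; `make_densenet`, `train_loader`, and the optimizer settings are placeholders, not the paper's configuration.

```python
def train_ban(make_densenet, train_loader, device, epochs=10):
    teacher = make_densenet().to(device)
    # ... first train `teacher` normally with cross-entropy (omitted) ...
    teacher.eval()

    student = make_densenet().to(device)          # identical parameterization to the teacher
    opt = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                t_logits = teacher(x)             # frozen teacher provides soft targets
            s_logits = student(x)
            loss = kd_loss(s_logits, t_logits, y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student                                # can become the teacher of the next generation
```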
Surprisingly, these Born-Again Networks (BANs) outperform their teachers on both computer vision and language modeling tasks. Our experiments with BANs based on DenseNets demonstrate state-of-the-art validation error on CIFAR-10 (3.5%) and CIFAR-100 (15.5%).
Additional experiments explore two distillation objectives: (1) Confidence-Weighted by Teacher Max (CWTM) and (2) Dark Knowledge with Permuted Predictions (DKPP). Both methods elucidate the essential components of KD, demonstrating the effect of the teacher outputs on both predicted and non-predicted classes.
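The two ablations can be sketched roughly as follows; the exact normalization and permutation scheme here are my assumptions, not verbatim from the paper. CWTM drops the soft targets and instead weights each sample's ground-truth cross-entropy by the teacher's confidence, while DKPP keeps soft targets but shuffles the probability mass over the non-argmax classes.

```python
def cwtm_loss(student_logits, teacher_logits, labels):
    # Confidence-Weighted by Teacher Max (assumed form): weight each sample's
    # ground-truth cross-entropy by the teacher's max softmax probability,
    # normalized over the minibatch.
    w = F.softmax(teacher_logits, dim=1).max(dim=1).values
    w = w / w.sum()
    per_sample = F.cross_entropy(student_logits, labels, reduction="none")
    return (w * per_sample).sum()

def dkpp_loss(student_logits, teacher_logits, T=4.0):
    # Dark Knowledge with Permuted Predictions (assumed form): keep the teacher's
    # soft targets but permute the probabilities of the non-argmax classes,
    # destroying class-specific "dark knowledge" while preserving the total mass.
    p = F.softmax(teacher_logits / T, dim=1)
    top = p.argmax(dim=1, keepdim=True)
    permuted = p.clone()
    for i in range(p.size(0)):                    # per-sample permutation of non-max entries
        idx = torch.arange(p.size(1), device=p.device)
        rest = idx[idx != top[i]]
        perm = torch.randperm(rest.numel(), device=rest.device)
        permuted[i, rest] = p[i, rest[perm]]
    return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    permuted, reduction="batchmean") * (T * T)
```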
We present experiments with students of various capacities, focusing on the under-explored case where students overpower teachers. Our experiments show advantages from transferring knowledge between DenseNets and ResNets in either direction.