“Self-Distillation: Born Again Neural Networks”, Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, Anima Anandkumar2018-05-12 (, ; backlinks)⁠:

Knowledge distillation (KD) consists of transferring knowledge from one machine learning model (the teacher) to another (the student). Commonly, the teacher is a high-capacity model with formidable performance, while the student is more compact. By transferring knowledge, one hopes to benefit from the student’s compactness. We desire a compact model with performance close to the teacher’s.

We study KD from a new perspective: rather than compressing models, we train students parameterized identically to their teachers.

Surprisingly, these Born-Again Networks (BANs), outperform their teachers, both on computer vision and language modeling tasks. Our experiments with BANs based on DenseNets demonstrate state-of-the-art performance on the CIFAR-10 (3.5%) and CIFAR-100 (15.5%) datasets, by validation error.

Additional experiments explore two distillation objectives: (1) Confidence-Weighted by Teacher Max (CWTM) and (2) Dark Knowledge with Permuted Predictions (DKPP). Both methods elucidate the essential components of KD, demonstrating a role of the teacher outputs on both predicted and non-predicted classes.

We present experiments with students of various capacities, focusing on the under-explored case where students overpower teachers. Our experiments show advantages from transferring knowledge between DenseNets and ResNets in either direction.