An Empirical Study of Iterative Knowledge Distillation for Neural Network Compression

Abstract

In this paper we introduce Iterative Knowledge Distillation (IKD), a procedure that successively compresses models using the Knowledge Distillation (KD) approach of Hinton et al. We study two variations of IKD, called parental-training and ancestral-training. Both use a single teacher and produce a single student model; the difference lies in which model is used as the teacher. Our results support the utility of the IKD procedure, in the form of increased model compression without significant loss in predictive accuracy. An important task in IKD is choosing the right model(s) to act as the teacher for a subsequent iteration. Across the variations of IKD studied, our results suggest that the most recently constructed model (parental-training) is the best single teacher for the model in the next iteration. This implies that training in IKD can proceed without requiring us to keep all models in the sequence.
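
The abstract describes the procedure only at a high level. The sketch below illustrates the parental-training variant of IKD, in which each successively smaller student is distilled from the model produced in the previous iteration using a standard Hinton-style distillation loss. It is a minimal sketch assuming PyTorch; the toy MLP architectures, random data, temperature T = 4, and mixing weight alpha = 0.7 are illustrative assumptions, not settings taken from the paper.

```python
# Parental-training IKD sketch: each successively smaller student is distilled
# from the model produced in the previous iteration (its "parent").
# Assumes PyTorch; toy sizes, T=4, and alpha=0.7 are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Hinton-style soft-target KL term plus hard-label cross-entropy.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    return alpha * soft + (1 - alpha) * F.cross_entropy(student_logits, labels)

def make_mlp(hidden, n_in=20, n_out=5):
    return nn.Sequential(nn.Linear(n_in, hidden), nn.ReLU(),
                         nn.Linear(hidden, n_out))

def distill(student, teacher, data, epochs=20):
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, y in data:
            with torch.no_grad():
                t_logits = teacher(x)          # parent supplies soft targets
            loss = kd_loss(student(x), t_logits, y)
            opt.zero_grad(); loss.backward(); opt.step()
    return student

# Toy data and an initial (largest) model trained on hard labels only.
x = torch.randn(512, 20)
y = torch.randint(0, 5, (512,))
data = [(x[i:i + 64], y[i:i + 64]) for i in range(0, 512, 64)]

models = [make_mlp(256)]                        # iteration 0: original teacher
opt = torch.optim.Adam(models[0].parameters(), lr=1e-3)
for _ in range(20):
    for xb, yb in data:
        loss = F.cross_entropy(models[0](xb), yb)
        opt.zero_grad(); loss.backward(); opt.step()

# Parental training: the teacher at iteration k is the student from k-1.
for hidden in (128, 64, 32):                    # successively smaller students
    student = make_mlp(hidden)
    models.append(distill(student, models[-1], data))
```

Ancestral training would differ only in which earlier model from the sequence is selected as the teacher at each iteration, rather than always the most recent one.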

Publication
28th European Symposium on Artificial Neural Networks, 2020. (Accepted)