Faculty Distillation with Optimal Transport

04/25/2022
by Su Lu et al.

Knowledge distillation (KD) has proven effective in improving a student classifier when a suitable teacher is available. The recent outpouring of diverse pre-trained models provides an abundant pool of teacher resources for KD. However, these models are often trained on tasks different from the student's, which requires the student both to select the most helpful teacher and to perform KD across different label spaces. These requirements expose the limitations of standard KD and motivate us to study a new paradigm called faculty distillation: given a group of teachers (a faculty), a student needs to select the most relevant teacher and perform generalized knowledge reuse. To this end, we propose to link the teacher's task and the student's task via optimal transport. Based on the semantic relationship between their label spaces, we bridge the support gap between their output distributions by minimizing Sinkhorn distances. The transportation cost also serves as a measure of a teacher's adaptability, so we can rank teachers efficiently by their relatedness. Experiments under various settings demonstrate the succinctness and versatility of our method.
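To make the idea concrete, below is a minimal sketch of how Sinkhorn distances could be used to rank candidate teachers by transportation cost. It assumes label distributions are represented as histograms and that a cost matrix between the two label spaces is available (e.g., derived from semantic label embeddings); the function names `sinkhorn_distance` and `rank_teachers` are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def sinkhorn_distance(mu, nu, cost, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport between histograms mu and nu
    under a given cost matrix; returns the transport cost <P, C>."""
    mu = np.asarray(mu, dtype=float)
    nu = np.asarray(nu, dtype=float)
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iters):             # Sinkhorn fixed-point iterations
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    P = np.diag(u) @ K @ np.diag(v)      # approximate transport plan
    return float(np.sum(P * cost))

def rank_teachers(student_dist, teacher_dists, cost_matrices, eps=0.1):
    """Rank teachers by the Sinkhorn distance between the student's label
    distribution and each teacher's; a smaller transport cost suggests a
    more related (more adaptable) teacher."""
    costs = [sinkhorn_distance(student_dist, t, C, eps)
             for t, C in zip(teacher_dists, cost_matrices)]
    return np.argsort(costs)             # indices, most related teacher first
```

In this sketch, each cost matrix encodes how semantically far apart the student's and a teacher's classes are (for instance, one minus the cosine similarity of label embeddings), so the resulting transport cost simultaneously bridges the mismatched label spaces and scores the teacher's relatedness.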


