Estimating and Maximizing Mutual Information for Knowledge Distillation

10/29/2021
by Aman Shrivastava, et al.

In this work, we propose Mutual Information Maximization Knowledge Distillation (MIMKD). Our method uses a contrastive objective to simultaneously estimate and maximize a lower bound on the mutual information between local and global feature representations of a teacher and a student network. We demonstrate through extensive experiments that this can be used to improve the performance of low-capacity models by transferring knowledge from more performant but computationally expensive models, making it possible to deploy better models on devices with limited computational resources. Our method is flexible: we can distill knowledge from teachers with arbitrary network architectures to arbitrary student networks. Our empirical results show that MIMKD outperforms competing approaches across a wide range of student-teacher pairs with different capacities and different architectures, including student networks with extremely low capacity. We are able to obtain 74.55% accuracy, up from a 69.8% baseline, for a ResNet-18 student network distilled from a 68.88% accuracy teacher network.
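
The core idea is a contrastive lower bound on the mutual information between teacher and student features. Below is a minimal, hypothetical PyTorch sketch of how such an InfoNCE-style bound on the mutual information between global teacher and student features could be estimated and maximized; the projection heads, embedding size, temperature, and loss weighting are illustrative assumptions, not the paper's exact formulation (which also pairs local, spatial features with global ones).

```python
# Illustrative sketch (not the authors' released code): an InfoNCE-style
# contrastive lower bound on mutual information between teacher and student
# global feature vectors. All layer sizes and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalMILowerBound(nn.Module):
    """Estimates a lower bound on I(teacher features; student features) via InfoNCE."""

    def __init__(self, t_dim, s_dim, embed_dim=128, temperature=0.1):
        super().__init__()
        self.t_proj = nn.Linear(t_dim, embed_dim)  # project teacher features
        self.s_proj = nn.Linear(s_dim, embed_dim)  # project student features
        self.temperature = temperature

    def forward(self, t_feat, s_feat):
        # t_feat: (B, t_dim) teacher global features (teacher is kept frozen)
        # s_feat: (B, s_dim) student global features
        t = F.normalize(self.t_proj(t_feat), dim=1)
        s = F.normalize(self.s_proj(s_feat), dim=1)
        logits = s @ t.t() / self.temperature        # (B, B) similarity scores
        labels = torch.arange(t.size(0), device=t.device)
        # InfoNCE: matching (student_i, teacher_i) pairs are positives,
        # all other pairs in the batch serve as negatives.
        nce_loss = F.cross_entropy(logits, labels)
        # log(B) - nce_loss is a lower bound on the mutual information.
        mi_lower_bound = torch.log(torch.tensor(float(t.size(0)))) - nce_loss
        return nce_loss, mi_lower_bound

# During distillation the student would minimize: task loss + beta * nce_loss,
# which maximizes the MI lower bound between teacher and student features.
```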

Related research

Wasserstein Contrastive Representation Distillation (12/15/2020): The primary goal of knowledge distillation (KD) is to encapsulate the in...
Variational Information Distillation for Knowledge Transfer (04/11/2019): Transferring knowledge from a teacher neural network pretrained on the s...
Efficient Knowledge Distillation from Model Checkpoints (10/12/2022): Knowledge distillation is an effective approach to learn compact models ...
Information-Theoretic GAN Compression with Variational Energy-based Model (03/28/2023): We propose an information-theoretic knowledge distillation approach for ...
Enhanced Multimodal Representation Learning with Cross-modal KD (06/13/2023): This paper explores the tasks of leveraging auxiliary modalities which a...
One-stop Training of Multiple Capacity Models for Multilingual Machine Translation (05/23/2023): Training models with varying capacities can be advantageous for deployin...
Online Knowledge Distillation via Mutual Contrastive Learning for Visual Recognition (07/23/2022): The teacher-free online Knowledge Distillation (KD) aims to train an ens...
