Distilling and Transferring Knowledge via cGAN-generated Samples for Image Classification and Regression

04/07/2021
by   Xin Ding, et al.
The University of British Columbia

Knowledge distillation (KD) has been actively studied for image classification tasks in deep learning, aiming to improve the performance of a student model based on the knowledge from a teacher model. However, there have been very few efforts to apply KD to image regression with a scalar response, and there is no KD method applicable to both tasks. Moreover, existing KD methods often require a practitioner to carefully choose or adjust the teacher and student architectures, making these methods less scalable in practice. Furthermore, although KD is usually conducted in scenarios with limited labeled data, very few techniques have been developed to alleviate such data insufficiency. To solve the above problems in an all-in-one manner, we propose in this paper a unified KD framework based on conditional generative adversarial networks (cGANs), termed cGAN-KD. Fundamentally different from existing KD methods, cGAN-KD distills and transfers knowledge from a teacher model to a student model via cGAN-generated samples. This unique mechanism makes cGAN-KD suitable for both classification and regression tasks, compatible with other KD methods, and insensitive to the teacher and student architectures. Benefiting from the recent advances in cGAN methodology and our specially designed subsampling and filtering procedures, cGAN-KD also performs well when labeled data are scarce. An error bound of a student model trained in the cGAN-KD framework is derived in this work, which theoretically explains why cGAN-KD takes effect and guides the implementation of cGAN-KD in practice. Extensive experiments on CIFAR-10 and Tiny-ImageNet show that we can incorporate state-of-the-art KD methods into the cGAN-KD framework to reach a new state of the art. Also, experiments on RC-49 and UTKFace demonstrate the effectiveness of cGAN-KD in image regression tasks, where existing KD methods are inapplicable.

1 Introduction

In deep learning, a heavyweight model is often a deep, overparameterized neural network or an ensemble of multiple deep neural networks. It usually has high precision but also incurs some costs: (1) a high memory cost due to the large model size (i.e., many learnable parameters); (2) a low inference speed (i.e., the number of images processed by the model per second). Note that a neural network's inference speed is related to both the number of learnable parameters and the connection design. For example, DenseNet-121

[19] has a densely connected pattern (which leads to a high computation cost), while VGG11 [45] is a simple, straightforward network. Although DenseNet-121 has fewer learnable parameters than VGG11 does, evaluating VGG11 is about 10 times faster than evaluating DenseNet-121. Unfortunately, in many application scenarios (e.g., deploying neural networks on mobile devices), limited computational resources are available to evaluate a trained model, so we can only afford lightweight models that are fast and memory-efficient. However, the small model capacity and the limited amount of labeled data often prevent a lightweight model from achieving sufficiently high precision. Therefore, how to use an accurate heavyweight model to improve the performance of a lightweight model has been actively studied recently.

Knowledge distillation (KD), first proposed by [3] and then developed by [18], is a popular method to improve the performance of a lightweight model by utilizing the knowledge distilled from an accurate, heavyweight model [28, 41, 49, 30, 31, 27, 56]. The heavyweight and lightweight models in KD are often known respectively as a teacher model and a student model. After [18] introduced the baseline knowledge distillation (BLKD), many KD methods have been proposed for the image classification task [50, 15], but we will only briefly review in this paper BLKD [18] and some recently proposed methods [42, 35, 54]. BLKD transfers knowledge from the teacher model to the student model by matching the logits between these two models (i.e., the raw predictions generated by a classification-oriented neural network, which are then passed to the softmax function). The teacher assistant knowledge distillation (TAKD) [35] is a recent improvement of BLKD. [35] finds that BLKD may fail if the performance gap between the teacher and student models is too big. TAKD applies BLKD to transfer knowledge from the teacher model to an intermediate model, termed the teacher assistant (TA) model, to fill this gap. The TA model often performs better than the student model but worse than the teacher model. Then, TAKD transfers knowledge from the TA model to the student model by applying BLKD again. However, our experiments in Section 5 show that TAKD is sometimes inferior to BLKD. The self-supervised learning as an auxiliary task for knowledge distillation (SSKD) [54]

is another recently proposed KD method, which introduces self-supervised learning as an auxiliary task into knowledge distillation to transfer knowledge from the teacher model to the student model. However, the architectures of the teacher and student models sometimes need to be adjusted to fit into the SSKD framework, which may deteriorate the performance of the teacher model. In many scenarios where we need to apply KD, we often have limited labeled images. Thus,

[42] incorporates the unsupervised data augmentation (UDA) [53] into BLKD to improve the performance of the baseline distillation. Different from the image classification task, the application of KD in image regression with a scalar response variable (e.g., angle and age) has rarely been studied. [59]

proposes a KD method specially designed for estimating ages from images of human faces. However, this method may be inapplicable to other image regression tasks with a scalar response because some components of the proposed framework are designed only for age estimation. To the best of our knowledge, there is no KD method general enough for all image regression tasks with a scalar response. Moreover, all the above methods are designed for either image classification or image regression; there is no unified KD framework that is suitable for both tasks.

Generative adversarial networks (GANs) [14, 34, 40, 37, 57, 2, 38, 21, 22, 7, 8] are state-of-the-art generative models for image synthesis. Some modern GAN models such as BigGAN [2] and StyleGAN [21, 22] are able to generate high-resolution, even photo-realistic images. Conditional generative adversarial networks (cGANs) are an essential family of GANs that generate images conditioned on some auxiliary information. Most cGANs are designed for categorical conditions such as class labels [34, 38, 40, 37, 57, 2], and cGANs with class labels as conditions are also known as class-conditional GANs. Recently, [7, 8] propose a new cGAN framework, termed continuous conditional GANs (CcGANs). CcGANs can generate images conditional on continuous, scalar variables (termed regression labels). In the scenario with limited training data, the performance of GANs often deteriorates. To alleviate this problem for unconditional GANs and class-conditional GANs, DiffAugment [60] proposes to apply differentiable transformations to images online during the GAN training. Our experiments show that it is also applicable to CcGANs. Besides these advances in GAN methodology, some papers [10, 46, 52, 62, 33] use GAN-generated data for data augmentation in image classification tasks with insufficient training data. However, even state-of-the-art GANs may generate low-quality samples, which may have negative effects on the classification task. Fortunately, some recently proposed subsampling methods [9, 6] may be applied to eliminate these low-quality samples. Additionally, some works [55, 51, 44, 29] propose to incorporate the adversarial loss of GANs into KD, but their performance is not state-of-the-art.

Motivated by the limitations of existing KD methods and the recent development of cGANs, we propose a general and flexible cGAN-based KD framework suitable for both image classification and regression (with a scalar response). Our contributions can be summarized as follows:

  • In Section 3, we introduce a novel KD framework termed cGAN-KD, which distills and transfers knowledge via cGAN-generated samples. As a preliminary, we propose to train BigGAN [2] for classification or CcGANs [7, 8] for regression. We also suggest incorporating DiffAugment [60] into the cGAN training when labeled data are limited. Fake image-label pairs (i.e., fake samples) generated from the cGAN are then subsampled and filtered to drop low-quality samples. The knowledge distillation takes place when a pre-trained teacher model adjusts the labels of the fake samples. Then, these processed fake samples are used to augment the training set. Finally, the student model is trained on the augmented training set, where the knowledge transfer is conducted implicitly.

  • Compared with existing KD methods, our framework has many advantages that are summarized in Section 3.6. Notably, cGAN-KD is a unified KD framework suitable for both classification and regression tasks. It is compatible with state-of-the-art KD methods and particularly ideal for limited labeled data scenarios. Moreover, unlike many existing KD methods, the teacher and student models’ architecture difference is no longer important in cGAN-KD.

  • In Section 4, we derive the error bound of a student model trained in the cGAN-KD framework, which not only helps us understand how cGAN-KD takes effect but also guides the implementation of cGAN-KD in practice. Such analysis is often omitted in many papers about knowledge distillation. The error bound implies we should generate as many processed fake samples as possible and choose a teacher model with high precision.

  • In Section 5, extensive experiments on CIFAR-10, Tiny-ImageNet, RC-49, and UTKFace datasets demonstrate the effectiveness of cGAN-KD in the image classification and regression tasks with limited training data. In image classification tasks, state-of-the-art KD methods are also improved if incorporated into cGAN-KD. An ablation study on CIFAR-10 and RC-49 is also conducted to show the necessity of the subsampling, filtering, and label adjustment modules in the cGAN-KD framework. In another ablation study, we show that more processed fake images often lead to more stable knowledge distillation performance.

Please note that our approach is fundamentally different from existing GAN-related KD methods [55, 51, 44, 29] because (1) our approach is the first framework that utilizes cGAN-generated samples as the knowledge carrier; (2) our approach applies to both classification and regression; (3) we do not need to incorporate the adversarial loss into KD. Please also note that, in our KD framework, we design a subsampling and filtering scheme to drop low-quality samples generated from cGANs, which does not exist in existing GAN-based data augmentation methods [10, 46, 52, 62, 33].

2 Related Work

2.1 Knowledge Distillation

In this section, we briefly review the four KD methods implemented in our experiments for image classification, i.e., BLKD [18], TAKD [35], SSKD [54], and BLKD+UDA [42]. The KD method designed only for age estimation [59] is not considered in this paper since it is inapplicable to other regression tasks.

BLKD [18] transfers knowledge from the teacher model to the student model by matching the logits (i.e., the output of the last layer in a neural network) between these two models, so it is also known as a logits-based KD method. BLKD does not need to change the teacher and student models' architectures, and it has been widely applied in many applications. Denote by $\mathbf{z} = [z_1, \dots, z_K]^\top$ the logits of an image $\mathbf{x}$ from a neural network, where $\mathbf{z}$ is a $K$ by 1 vector and $K$ is the number of classes. With the softmax function, we can calculate the probability that the image $\mathbf{x}$ belongs to class $k$ as follows:

$$p_k = \frac{\exp(z_k / T)}{\sum_{j=1}^{K} \exp(z_j / T)}, \qquad (1)$$

where $p_k \in [0, 1]$ and $T$ is the temperature factor. The $K$ by 1 vector $\mathbf{p} = [p_1, \dots, p_K]^\top$ is also known as the soft label of image $\mathbf{x}$. A higher $T$ leads to a softer probability distribution over classes. On the contrary, the one-hot encoded class label $\mathbf{y} = [y_1, \dots, y_K]^\top$ is also known as the hard label. An example of hard labels and soft labels is shown in Fig. 1. Usually, the soft label is more informative than the hard label because it can reflect the similarity between classes and the confidence of the prediction. The logits of the same image from the teacher model and the student model are denoted by $\mathbf{z}^t$ and $\mathbf{z}^s$, respectively. Then, the corresponding soft labels are denoted respectively by $\mathbf{p}^t$ and $\mathbf{p}^s$. The student model is trained to minimize the cross entropy between $\mathbf{p}^t$ and $\mathbf{p}^s$ as follows:

$$\mathcal{L}_{\text{KD}} = -\sum_{k=1}^{K} p^t_k \log p^s_k. \qquad (2)$$

The student model is also trained to minimize the cross entropy between the one-hot encoded class label $\mathbf{y}$ and the soft label $\mathbf{p}^s$ as follows:

$$\mathcal{L}_{\text{CE}} = -\sum_{k=1}^{K} y_k \log p^s_k. \qquad (3)$$

Finally, the overall training loss of the student model is a linear combination of Eqs. (2) and (3), i.e.,

$$\mathcal{L} = \alpha \mathcal{L}_{\text{CE}} + (1 - \alpha) \mathcal{L}_{\text{KD}}, \qquad (4)$$

where $\alpha$ is a hyperparameter controlling the trade-off between the two losses. $\mathcal{L}_{\text{CE}}$ is the standard loss for classification and $\mathcal{L}_{\text{KD}}$ encourages the knowledge transfer.

Fig. 1: An example of the hard and soft labels of a dog image in an image classification task with three classes.
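To make the BLKD objective in Eqs. (2)-(4) concrete, the snippet below sketches it in PyTorch. The function name `blkd_loss` and the default hyperparameter values are hypothetical placeholders; this is a minimal illustration rather than the implementation used in this paper.

```python
import torch.nn.functional as F

def blkd_loss(student_logits, teacher_logits, hard_labels, T=4.0, alpha=0.5):
    """Minimal sketch of the BLKD loss: a linear combination of the standard
    cross entropy on hard labels (Eq. 3) and the soft-label matching term (Eq. 2),
    following Eq. (4). `T` and `alpha` are hypothetical default values."""
    # Standard cross entropy between the hard labels and the student's predictions (Eq. 3).
    ce = F.cross_entropy(student_logits, hard_labels)
    # Soft labels of teacher and student computed with temperature T (Eq. 1).
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    # Cross entropy between teacher and student soft labels (Eq. 2).
    kd = -(p_teacher * log_p_student).sum(dim=1).mean()
    # Overall loss as a linear combination (Eq. 4).
    return alpha * ce + (1.0 - alpha) * kd
```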

TAKD [35] is a recent variant of BLKD. [35] finds that if the performance gap between a teacher model and a student model is big, BLKD usually does not perform well. Therefore, TAKD introduces a teacher assistant (TA) model, which often performs better than the student model but worse than the teacher model. BLKD is applied to the teacher-TA and TA-student pairs, respectively, where the knowledge is first transferred from the teacher model to the TA model and then from the TA model to the student model.

SSKD [54] is another recently proposed KD method. Like BLKD, SSKD encourages the student model to mimic the teacher model’s classification performance on labeled data. Additionally, SSKD also minimizes the difference between the student and teacher models’ performance on a self-supervised learning task [4] (a task to learn a more informative representation of images in an unsupervised manner). However, we often need to carefully adjust the architectures of the student model and the teacher model to let them fit into the proposed algorithm of SSKD. Our experiments in Section 5 show that such architecture adjustment may deteriorate the teacher model’s performance.

UDA [53] is an effective data augmentation method for deep learning models when labeled data are scarce. [42] incorporates UDA into BLKD (denoted by BLKD+UDA) to improve the performance of BLKD. Please note that, when implementing BLKD+UDA, UDA must be applied to both the teacher model training and the student model training.

2.2 Conditional Generative Adversarial Networks

cGANs [34] aim to estimate the distribution of images conditional on some auxiliary information. A cGAN model includes two neural networks, a generator $G$ and a discriminator $D$. The generator takes as input a random noise $\mathbf{z}$ and the condition $y$, and outputs a fake image $G(\mathbf{z}, y)$ which follows the fake conditional image distribution $p_g(\mathbf{x}|y)$. The discriminator takes as input an image $\mathbf{x}$ and the condition $y$, and outputs the probability that the image comes from the true conditional image distribution $p_r(\mathbf{x}|y)$. A typical pipeline of cGAN is shown in Fig. 2. Mathematically, the cGAN model is trained to minimize the divergence between $p_g(\mathbf{x}|y)$ and $p_r(\mathbf{x}|y)$. The condition $y$ is usually a categorical variable such as a class label. cGANs with class labels as conditions are also known as class-conditional GANs [34, 38, 40, 37, 57, 2]. Class-conditional GANs have been widely studied, and state-of-the-art models such as BigGAN [2] are already able to generate photo-realistic images. However, GANs conditional on regression labels (e.g., angles and ages) have rarely been studied because of two problems. First, very few (even zero) images exist for some regression labels, so the empirical cGAN losses may fail. Second, since regression labels are continuous and infinitely many, they cannot be embedded by one-hot encoding like class labels. Recently, [7, 8] propose a new formulation of cGANs, termed CcGANs. The CcGAN framework consists of novel empirical cGAN losses and novel label input mechanisms. To solve the first problem, the discriminator is trained by either the hard vicinal discriminator loss (HVDL) or the soft vicinal discriminator loss (SVDL). A new empirical generator loss is also proposed to alleviate the first problem. To solve the second problem, [7, 8] introduce a naive label input (NLI) mechanism and an improved label input (ILI) mechanism. Hence, [7, 8] propose four CcGAN models employing different discriminator losses and label input mechanisms, i.e., HVDL+NLI, SVDL+NLI, HVDL+ILI, and SVDL+ILI. The effectiveness of CcGANs has been demonstrated on multiple regression-oriented datasets.

Fig. 2: A typical pipeline of cGAN. The conditioning variable $y$ is assumed to follow a distribution $p(y)$, which can be easily estimated from the training data whether $y$ represents class labels or regression labels.

The performance of cGANs often deteriorates when training data are insufficient. DiffAugment [60] is one of several recent works [60, 20, 48, 61] designed to stabilize the cGAN training in this setting. Although DiffAugment is designed for unconditional GANs (e.g., StyleGAN [21, 22]) and class-conditional GANs (e.g., BigGAN [2]), our experiment shows that it is also applicable to CcGANs [7, 8].

2.3 cDRE-F-cSP+RS: Subsampling Conditional Generative Adversarial Networks

Modern cGANs have been demonstrated to be successful in many applications, but low-quality samples still appear frequently even with state-of-the-art network architectures (e.g., BigGAN [2]) and training setups. To filter out low-quality samples, [6] proposes a subsampling framework, termed cDRE-F-cSP+RS, for class-conditional GANs and CcGANs. This framework consists of two components: a conditional density ratio estimation (cDRE) method termed cDRE-F-cSP and a rejection sampling (RS) scheme. cDRE-F-cSP aims to estimate the conditional density ratio function $r(\mathbf{x}|y) = p_r(\mathbf{x}|y)/p_g(\mathbf{x}|y)$ based on real and fake images. Based on the estimated conditional density ratios, the rejection sampling scheme is utilized to sample from a trained cGAN. For class-conditional GANs, experiments in [6] demonstrate that cDRE-F-cSP+RS can substantially improve the Fréchet inception distance (FID) [17] and Intra-FID [38] scores. For CcGANs, cDRE-F-cSP+RS not only improves the Intra-FID score but also improves the image diversity and label consistency (i.e., the consistency of generated images with respect to the conditioning label) [7, 5].
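As a rough illustration of how density-ratio-based rejection sampling can keep high-quality conditional samples, the sketch below assumes a trained generator wrapped in a hypothetical `generate(label, n)` function and a hypothetical `density_ratio(x, label)` estimator (e.g., obtained from a cDRE method); it is not the authors' implementation of cDRE-F-cSP+RS.

```python
import numpy as np

def rejection_sample(generate, density_ratio, label, n_keep, ratio_max, rng=None):
    """Sketch of rejection sampling driven by estimated conditional density ratios.

    generate(label, n)      -- hypothetical: returns a list of n fake images conditioned on `label`
    density_ratio(x, label) -- hypothetical: estimated r(x|label) = p_r(x|label) / p_g(x|label)
    ratio_max               -- an estimated upper bound on r(x|label), used to normalize acceptance
    """
    rng = rng or np.random.default_rng()
    kept = []
    while len(kept) < n_keep:
        for x in generate(label, 64):                        # draw a batch of candidate fakes
            accept_prob = min(density_ratio(x, label) / ratio_max, 1.0)
            if rng.random() < accept_prob:                   # keep with probability proportional to r(x|y)
                kept.append(x)
                if len(kept) == n_keep:
                    break
    return kept
```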

3 Proposed Method

While many KD methods have been proposed for image classification [50, 15], there is only one KD method for image regression (scalar response) [59]. Unfortunately, it is a specially designed method for age estimation instead of a general KD method. Moreover, there is no KD framework applicable to both tasks.

This section proposes a unified KD framework, termed cGAN-KD, which is suitable for both image classification and regression (scalar response) tasks. Many state-of-the-art KD methods for image classification can also be incorporated into the proposed framework to improve their performance. The framework is also suitable for tasks with limited labeled data and insensitive to the teacher and student models' architecture differences.

3.1 Problem Formulation

Before we introduce cGAN-KD, let us formulate the KD task in the language of mathematics as follows. Assume we have a set of image-label pairs, i.e., $\mathcal{D}^r = \{(\mathbf{x}^r_i, y^r_i)\}_{i=1}^{n^r}$, which are randomly drawn from the true joint distribution $p_r(\mathbf{x}, y)$. We also have a teacher model $f_t$ and a student model $f_s$ which are trained on $\mathcal{D}^r$. $f_t$ often has a smaller test error than $f_s$ does, i.e.,

$$\mathbb{E}_{(\mathbf{x}, y) \sim p_r}\left[\ell(f_t(\mathbf{x}), y)\right] \leq \mathbb{E}_{(\mathbf{x}, y) \sim p_r}\left[\ell(f_s(\mathbf{x}), y)\right],$$

where $\ell$ is either the cross entropy (CE) loss (i.e., Eq. (2)) for classification or the square error (SE) loss (i.e., $(f(\mathbf{x}) - y)^2$) for regression. The objective of KD is to reduce the test error of $f_s$ by using the knowledge learned by $f_t$.

3.2 The Workflow of cGAN-KD

As a preliminary of cGAN-KD, we need to train a cGAN on $\mathcal{D}^r$. For image classification, we suggest adopting state-of-the-art class-conditional GANs such as BigGAN [2]. For image regression with a scalar response, we should use CcGANs [7, 8]. In the scenario with very few training data, we propose to apply DiffAugment [60] to stabilize the cGAN training. After the cGAN training, the proposed KD framework can be applied. In Fig. 3, we visualize the workflow of cGAN-KD, which includes three important modules denoted respectively by M1, M2, and M3. First, we draw a set of unprocessed fake image-label pairs from the trained cGAN, i.e., $\mathcal{D}^g = \{(\tilde{\mathbf{x}}^g_i, y^g_i)\}$. These fake samples are then subsampled and filtered by M1 to drop low-quality samples and form a subset of $\mathcal{D}^g$, i.e., $\mathcal{D}^{g_1} \subseteq \mathcal{D}^g$. The next module M2 in the pipeline adjusts the labels of the images in $\mathcal{D}^{g_1}$ by a pre-trained teacher model $f_t$ and outputs a set of processed samples, i.e., $\mathcal{D}^{g_2}$. The processed samples are then used to augment the training set $\mathcal{D}^r$. Finally, M3 trains the student model $f_s$ on the augmented training set $\mathcal{D}^r \cup \mathcal{D}^{g_2}$. The student model trained on $\mathcal{D}^r \cup \mathcal{D}^{g_2}$ is expected to perform better than the one trained on $\mathcal{D}^r$ alone. More details of the three modules are described in Sections 3.3 to 3.5, and the evolution of the fake sample datasets is shown in Fig. 4.

Fig. 3: The workflow of cGAN-KD. There are three important modules denoted respectively by M1, M2, and M3 in this framework.
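Putting the three modules together, the following sketch outlines the workflow in Fig. 3 at a high level. Helper names such as `subsample`, `filter_by_teacher`, and `adjust_labels` are hypothetical placeholders for M1 and M2, not functions from the authors' code.

```python
def cgan_kd(cgan, teacher, student, real_train_set, n_fake):
    """High-level sketch of the cGAN-KD workflow; all helpers are hypothetical."""
    fake = [cgan.sample() for _ in range(n_fake)]      # unprocessed fake image-label pairs
    fake = subsample(fake)                             # M1a: density-ratio-based subsampling
    fake = filter_by_teacher(fake, teacher)            # M1b: drop pairs with large teacher error (Alg. 1)
    fake = adjust_labels(fake, teacher)                # M2: replace assigned labels with teacher predictions
    augmented = real_train_set + fake                  # M3: augment the real training set
    student.fit(augmented)                             # train the student on real + processed fake samples
    return student
```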

3.3 M1: Drop Low-quality Fake Samples

Since low-quality samples may harm prediction accuracy if used to augment the training set, M1 is adopted to drop these samples. It includes two sequential submodules: a subsampling module and a filtering module.

The subsampling module implements cDRE-F-cSP+RS [6], which performs rejection sampling to accept or reject a fake image-label pair $(\tilde{\mathbf{x}}^g, y^g)$ based on the estimated conditional density ratio of $\tilde{\mathbf{x}}^g$ given $y^g$. [6] shows that cDRE-F-cSP+RS can effectively improve the overall image quality of both class-conditional GANs and CcGANs in the conditional image synthesis setting. Thus, the subsampling module is very suitable for dropping low-quality samples.

The subsequent filtering module is another strategy to drop low-quality samples. Assume we generate a fake image $\tilde{\mathbf{x}}^g$ from a trained cGAN conditional on a label $y^g$; then $y^g$ is called the assigned label of $\tilde{\mathbf{x}}^g$ in this paper. In the filtering module, we use the pre-trained teacher model $f_t$ to predict the label of $\tilde{\mathbf{x}}^g$. Our experimental study in Supp. S.8.2 and S.10.2 shows that a significant error (i.e., the cross entropy loss for classification or the mean absolute error for regression) between the assigned and predicted labels often implies terrible visual quality. Based on this observation, we propose to drop fake samples with errors larger than a threshold, which is summarized in Alg. 1. The filtering threshold equals the $\rho$-th quantile of the fake samples' errors, and the optimal $\rho$ is selected by a grid search algorithm (i.e., Alg. 2).

1 Sample fake image-label pairs from a trained cGAN with cDRE-F-cSP+RS;
2 Predict the labels of these fake images by the pre-trained teacher model $f_t$;
3 Compute the error (i.e., cross entropy for classification or MAE for regression) between the assigned and predicted labels;
4 Sort these errors from smallest to largest and set the $\rho$-th quantile of these errors as the filtering threshold;
5 Remove fake image-label pairs with errors larger than the filtering threshold.
Algorithm 1 An algorithm to implement the filtering module with a hyper-parameter $\rho$ to drop low-quality samples.
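To make the filtering step concrete, here is a minimal NumPy-style sketch of the thresholding in Alg. 1, assuming the per-sample predictions have already been computed; it is an illustration, not the authors' code.

```python
import numpy as np

def filter_fake_samples(fake_images, assigned_labels, predicted, rho, task="regression"):
    """Sketch of the filtering module (Alg. 1): keep only fake samples whose error
    between assigned and teacher-predicted labels is below the rho-th quantile.
    For classification, `predicted` holds class-probability vectors; for regression,
    it holds predicted scalar labels."""
    assigned = np.asarray(assigned_labels)
    if task == "regression":
        errors = np.abs(assigned - np.asarray(predicted))                 # per-sample absolute error
    else:
        probs = np.asarray(predicted)
        errors = -np.log(probs[np.arange(len(probs)), assigned] + 1e-12)  # per-sample cross entropy
    threshold = np.quantile(errors, rho)          # rho-th quantile as the filtering threshold
    keep = errors <= threshold                    # drop samples with errors above the threshold
    kept_images = [img for img, k in zip(fake_images, keep) if k]
    return kept_images, assigned[keep]
```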
1 Set a grid of candidate values for $\rho$ (e.g., 0.3 to 0.9 with a stepsize of 0.1 in our experiments);
2 Randomly split the training set into a sub-training set and a validation set with a ratio of 4:1;
3 for each $\rho$ in the grid do
4        Generate fake image-label pairs based on Alg. 1;
5        Augment the sub-training set with these fake samples;
6        Train the student model from scratch on the augmented training set;
7        Compute the validation error of the student model on the validation set;
8
9 end for
The optimal $\rho$ in the grid minimizes the validation error. Please note that if we have multiple student models (e.g., our experiments in Section 5), the optimal $\rho$ minimizes the average validation error of these student models.
Algorithm 2 A grid search algorithm to select the hyper-parameter $\rho$ in the filtering module.
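A corresponding sketch of the grid search in Alg. 2 could look as follows, where `random_split`, `generate_filtered_fakes`, `train_student`, and `validation_error` are hypothetical helpers standing in for data splitting, Alg. 1, student training, and evaluation.

```python
def select_rho(train_set, candidate_rhos=(0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Sketch of Alg. 2: pick the filtering quantile rho that minimizes the
    validation error of a student trained on the augmented sub-training set."""
    sub_train, valid = random_split(train_set, ratio=0.8)    # 4:1 split (hypothetical helper)
    best_rho, best_err = None, float("inf")
    for rho in candidate_rhos:
        fake = generate_filtered_fakes(rho)                  # run Alg. 1 with this rho (hypothetical helper)
        student = train_student(sub_train + fake)            # train the student from scratch
        err = validation_error(student, valid)               # validation error of the trained student
        if err < best_err:
            best_rho, best_err = rho, err
    return best_rho
```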

3.4 M2: Distill Knowledge via Label Adjustment

The subsequent module M2 adjusts the labels of the fake samples in $\mathcal{D}^{g_1}$ generated by the previous module M1 via a pre-trained teacher model $f_t$. Similar to pseudo-labeling [26, 1] in semi-supervised learning, the label adjustment is conducted by replacing the $i$-th assigned label $y^g_i$ in $\mathcal{D}^{g_1}$ with the predicted label $f_t(\tilde{\mathbf{x}}^g_i)$. Please note that in classification, the predicted labels are hard labels as described in Fig. 1. This adjustment distills the knowledge about the relation between an image and its label from the trained $f_t$ and stores it in the adjusted dataset $\mathcal{D}^{g_2}$. Moreover, cGANs, especially CcGANs, more or less suffer from the label inconsistency problem, i.e., a fake image's assigned label may diverge from its ground-truth label. The label adjustment can alleviate this issue.
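A minimal PyTorch sketch of this label adjustment is given below, assuming a trained `teacher` network and a batch of fake images; it is illustrative rather than the authors' implementation.

```python
import torch

@torch.no_grad()
def adjust_labels(teacher, fake_images, task="classification"):
    """Sketch of M2: replace each fake sample's assigned label with the teacher's
    prediction (a hard label for classification, a scalar for regression)."""
    teacher.eval()
    outputs = teacher(fake_images)                 # logits for classification, scalars for regression
    if task == "classification":
        return outputs.argmax(dim=1)               # hard labels, as described in Fig. 1
    return outputs.squeeze(1)                      # predicted regression labels
```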

3.5 M3: Transfer Knowledge via Data Augmentation

The adjusted samples in $\mathcal{D}^{g_2}$ are also called the processed fake samples. They are used to augment the original training set $\mathcal{D}^r$, i.e., $\mathcal{D}^{\text{aug}} = \mathcal{D}^r \cup \mathcal{D}^{g_2}$. To transfer the knowledge distilled from the pre-trained $f_t$, we train $f_s$ on the augmented dataset $\mathcal{D}^{\text{aug}}$ in M3. Please note that the empirical studies in Section 5 show that, as the number of processed fake samples $N^g = |\mathcal{D}^{g_2}|$ increases, the test error of $f_s$ often does not stop decreasing until $N^g$ is larger than a certain threshold, after which it fluctuates over a small range. Since it is hard to obtain the optimal $N^g$ in practice and a hefty $N^g$ usually does not cause a significant adverse effect on precision, we suggest generating the maximum number of processed samples allowed by the computational budget.

Note that M3 makes our method fundamentally different from existing KD methods [18, 42, 35, 54] because the distilled knowledge is transferred through samples instead of specially designed loss functions or network architectures.

M3 is also distinct from existing GAN-based data augmentation methods [10, 46, 52, 62, 33], because they do not have the subsampling, filtering, and label adjustment steps.
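The sketch below shows one way the augmented dataset could be assembled and used for student training in PyTorch; `real_ds` and `processed_fake_ds` are hypothetical dataset objects, and the training loop is a generic example rather than the exact setup used in the experiments.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader

def train_student_on_augmented(student, real_ds, processed_fake_ds, loss_fn, epochs=100, lr=0.01):
    """Sketch of M3: train the student on the union of the real training set and
    the processed (subsampled, filtered, label-adjusted) fake samples."""
    augmented = ConcatDataset([real_ds, processed_fake_ds])
    loader = DataLoader(augmented, batch_size=128, shuffle=True)
    optimizer = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    student.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(student(images), labels)   # CE for classification or MSE for regression
            loss.backward()
            optimizer.step()
    return student
```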

3.6 Advantages of cGAN-KD

3.6.1 A Unified Knowledge Distillation Framework for Image Classification and Regression

Since all necessary steps in the workflow of cGAN-KD are applicable to both classification and regression (scalar response), cGAN-KD is actually a unified KD framework. Moreover, the theoretical analysis of cGAN-KD (see Section 4) in both tasks also has the same general formulation.

3.6.2 Compatible with State-of-the-art KD Methods

cGAN-KD distills and transfers knowledge based on fake samples, and it does not require extra loss functions or network architecture changes. Thus, cGAN-KD can be combined with many state-of-the-art KD methods for image classification to improve their performance. To embed a state-of-the-art KD method into cGAN-KD, we just need to train the student model on the augmented training set with this KD method in M3 but keep other procedures in Fig. 3 unchanged.

3.6.3 Suitable for Limited Labeled Data Scenarios

The performance of cGANs heavily deteriorates given a limited amount of labeled training data. Fortunately, DiffAugment [60] can effectively alleviate this problem. Besides DiffAugment, the subsampling and filtering modules in the cGAN-KD framework can also deal with this issue by removing low-quality fake samples. Therefore, cGAN-KD is very suitable for scenarios with limited labeled data.

3.6.4 Architecture-agnostic

As shown by our experiments in Section 5 and some papers [35, 54], the architecture difference between a teacher model and a student model may influence the performance of some existing KD methods because these methods rely on logits or intermediate layers to transfer knowledge. Other KD methods such as SSKD [54] even require some adjustments to the teacher and student models' architectures. In contrast, since the proposed cGAN-KD framework distills and transfers knowledge via fake samples, there is no requirement on the teacher and student models' architectures, making cGAN-KD more flexible than other KD methods.

4 Theoretical Analysis

In this section, we derive the error bound of the student model $f_s$, which theoretically illustrates how the teacher model $f_t$ improves the precision of $f_s$ in the cGAN-KD framework. Before we move to the derivation, we first introduce some notation. Denote by $p_g(\mathbf{x}, y)$ the distribution of unprocessed fake samples. Denote by $p_{g_1}(\mathbf{x}, y)$ and $p_{g_2}(\mathbf{x}, y)$ the distributions of fake samples after being processed by M1 and M2, respectively. The evolution of the fake samples' distributions and datasets is visualized in Fig. 4. Additionally, we denote the augmented training dataset by $\mathcal{D}^{\text{aug}}$, i.e., $\mathcal{D}^{\text{aug}} = \mathcal{D}^r \cup \mathcal{D}^{g_2}$, and let $n = |\mathcal{D}^{\text{aug}}| = n^r + N^g$. Then, we define the true risk and the empirical risk on the augmented dataset as

$$R(f) = \mathbb{E}_{(\mathbf{x}, y) \sim p_r}\left[\ell(f(\mathbf{x}), y)\right], \qquad \widehat{R}_{\text{aug}}(f) = \frac{1}{n} \sum_{(\mathbf{x}, y) \in \mathcal{D}^{\text{aug}}} \ell(f(\mathbf{x}), y),$$

where $\ell$ is either the CE loss for classification or the SE loss for regression. Let $f^*$ be the optimal predictor which minimizes $R(f)$. We denote by $\mathcal{H}$ the hypothesis space of $f_s$. Note that $\mathcal{H}$ may not include $f^*$. Then, we define $\hat{f}_s$ and $f^*_s$ as

$$\hat{f}_s = \operatorname*{arg\,min}_{f \in \mathcal{H}} \widehat{R}_{\text{aug}}(f), \qquad f^*_s = \operatorname*{arg\,min}_{f \in \mathcal{H}} R(f).$$

Fig. 4: Evolution of fake sample datasets and their distributions.

The error bound of $\hat{f}_s$ in cGAN-KD is depicted by the distance of $R(\hat{f}_s)$ from $R(f^*)$, which is described in Theorem 1.

Theorem 1 (Error Bound).

Suppose that

(i.i.d. samples) $\mathcal{D}^r \overset{\text{i.i.d.}}{\sim} p_r(\mathbf{x}, y)$, $\mathcal{D}^{g_2} \overset{\text{i.i.d.}}{\sim} p_{g_2}(\mathbf{x}, y)$, and the augmented dataset $\mathcal{D}^{\text{aug}}$ is considered as i.i.d. samples from a mixture distribution, i.e.,

$$p_{\text{aug}}(\mathbf{x}, y) = \lambda\, p_r(\mathbf{x}, y) + (1 - \lambda)\, p_{g_2}(\mathbf{x}, y), \qquad (5)$$

where $\lambda = n^r / (n^r + N^g)$.

(Measurability) $f$ is measurable for all $f \in \mathcal{H}$.

(Distribution gap) There is a constant $C_1 \geq 0$ such that

$$d_{TV}\left(p_{g_2}, p_r\right) \leq C_1 + \mathbb{E}_{(\mathbf{x}, y) \sim p_r}\left[\ell(f_t(\mathbf{x}), y)\right], \qquad (6)$$

where $d_{TV}(p, q)$ denotes the total variation distance [13] between two probability distributions $p$ and $q$; and $p_{g_2}$ and $p_r$ mean $p_{g_2}(\mathbf{x}, y)$ and $p_r(\mathbf{x}, y)$.

(Boundedness) There exists a constant $B > 0$ such that $0 \leq \ell(f(\mathbf{x}), y) \leq B$, $\forall f \in \mathcal{H}$ and $\forall (\mathbf{x}, y)$.

Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$,

(7)

where $\widehat{\mathfrak{R}}_n(\ell \circ \mathcal{H})$ stands for the empirical Rademacher complexity [39, Definition 3.1] of $\ell \circ \mathcal{H}$, which is defined on $n$ samples independently drawn from $p_{\text{aug}}$.

Proof.

We first decompose $R(\hat{f}_s) - R(f^*)$ as follows:

$$R(\hat{f}_s) - R(f^*) = \left[R(\hat{f}_s) - R(f^*_s)\right] + \left[R(f^*_s) - R(f^*)\right]. \qquad (8)$$

The second term in Eq. (8) is a non-negative number because the student model's hypothesis space $\mathcal{H}$ may not cover the optimal predictor $f^*$. The first term of Eq. (8), $R(\hat{f}_s) - R(f^*_s)$, can be bounded as follows. Using the triangle inequality and A4 (i.e., the boundedness of $\ell$) yields that

(9)
(10)

For Eq. (9), we apply the Rademacher bound [24, Thm 7.7.1], yielding that with probability at least $1 - \delta$,

(11)

Before we bound Eq. (10), we first review the definition of the total variation distance [13] between any two distributions $p$ and $q$, i.e.,

$$d_{TV}(p, q) = \frac{1}{2} \sup_{\|h\|_\infty \leq 1} \left| \int h \, \mathrm{d}p - \int h \, \mathrm{d}q \right|,$$

where $h$ is a measurable function. Thus,

(12)

Since $f$ is measurable (by A2) and $\ell$ is continuous, $\ell(f(\mathbf{x}), y)$ is also measurable. Let $h$ be $\ell(f(\mathbf{x}), y) / B$ in Eq. (12); then by A3 (i.e., the distribution gap between $p_{g_2}$ and $p_r$), Eq. (10) can be bounded as follows

(13)

Combining Eqs. (11) and (13), we can get

(14)

Finally, incorporating Eq. (14) into Eq. (8), we can get the inequality (i.e., Eq. (7)) in Theorem 1, which completes the proof.

Remark 1 (Rationality of A3 and A4).

In the cGAN-KD framework, processed fake images are used to augment the training set, so the distribution gap between $p_{g_2}$ and $p_r$ (measured by the total variation distance) should have a significant impact on the student model's performance. Thus, in A3 of Theorem 1, we model the distribution gap by the summation of two components. The first component stands for the divergence caused by the trained cGAN and the subsampling and filtering steps. The second component is controlled by the generalization performance of $f_t$, i.e., the expected loss of the trained teacher model over the true data distribution.

It is also worth discussing the rationality of A4. The two types of learning tasks considered in this work are regression and classification, for which we use the square loss and the cross entropy loss respectively. Let us consider the regression task first. In our experiments on the regression datasets, the last layer of $f_s$ is the ReLU activation function [12, 11], so $f_s(\mathbf{x}) \geq 0$. Since the regression labels are bounded, as long as $f_s$ is not a too bad predictor, it should not output arbitrarily large values, which implies $f_s(\mathbf{x})$ can be bounded by a positive constant. Therefore, the square loss is bounded and A4 is satisfied. For the classification task, a sufficient condition for A4 is that the predicted probability of the true label is bounded away from zero, representing that our classifier cannot produce a zero probability for the true label, which is reasonable in practice.

Remark 2 (Illustration of Theorem 1).

The four terms on the right side of Eq. (7) show that the error of $\hat{f}_s$ may come from four aspects. The first and last terms are only relevant to the nature of $\mathcal{H}$, so they are not influenced by $N^g$. If the student does not output arbitrarily extreme predictions (as discussed in Remark 1), $B$ stays at a moderate level, implying the first term is also small. The last term is inevitable because $\mathcal{H}$ may not include $f^*$. The second term diminishes if we set $N^g$ large. For the third term, $1 - \lambda \to 1$ as $N^g$ increases. Then, the third term is only controlled by the property of $\ell$ and the distribution gap. To reduce the distribution gap, we can either improve the cGAN model, subsampling, and filtering, or choose an $f_t$ with better generalization performance.

Therefore, Theorem 1 implies that when implementing cGAN-KD we should (1) use state-of-the-art cGANs and subsampling methods, (2) set $N^g$ large, and (3) choose an $f_t$ with as high precision as possible.

5 Experiments

This section aims to experimentally demonstrate the effectiveness of the proposed cGAN-KD framework in image classification and regression (scalar response) tasks when limited training data are available. We conduct extensive experiments on four image datasets, i.e., CIFAR-10 [23] and Tiny-ImageNet [25] for image classification, and RC-49 [7, 8] and UTKFace [58] for image regression. Candidate baseline KD methods in the classification tasks are NOKD (i.e., no KD method is applied), BLKD [18], TAKD [35], SSKD [54], BLKD+UDA [42], the proposed cGAN-KD, and the combinations of state-of-the-art KD methods with our cGAN-KD. In the regression tasks, the candidate methods only include NOKD and cGAN-KD. Please note that for image regression, as suggested by [7, 8], regression labels are normalized to real numbers in $[0, 1]$ when training the CcGANs, $f_t$, and $f_s$. Nevertheless, in the evaluation stage of $f_t$ and $f_s$, we compute the mean absolute error (MAE) on unnormalized regression labels. For detailed experimental setups, please refer to the supplementary material.

5.1 CIFAR-10

We first evaluate the effectiveness of the proposed cGAN-KD framework on the CIFAR-10 dataset [23].

Experimental setup: CIFAR-10 consists of 60,000 ($32 \times 32$) RGB images uniformly distributed over 10 classes. The overall number of training samples is 50,000 (5,000 for each class), and the remaining 10,000 samples (1,000 for each class) are for testing. To compare our proposed method with existing KD methods when limited training data are available, we design three settings denoted respectively by C-50K, C-20K, and C-10K with different numbers of training samples. Specifically, all 50,000 training samples are available in C-50K. C-20K has 20,000 randomly selected training samples (about 2,000 per class). In C-10K, the number of training samples is further reduced to 10,000 (about 1,000 per class) to simulate the limited training data scenario.

Next, to select student and teacher models for this experiment, some popular classifiers are trained from scratch in each setting, and their test errors are shown in Table S.8.9 in the supplementary material. We choose three teacher models (i.e., MobileNet V2 [43], ResNet-18 [16], and DenseNet-121 [19]) with similarly high precision and three student models (i.e., ShuffleNet V2 [32], efficientnet-b0 [47], and VGG11 [45]) with similarly low precision based on their performance. Note that although MobileNet V2 is a popular lightweight model, its performance is surprisingly good and comparable to the other two teachers on CIFAR-10. Therefore, MobileNet V2 is chosen as a teacher model in our experiment. To implement BLKD and TAKD, we set the temperature and the loss weight following [42]. VGG13 is chosen as the TA model in TAKD. When implementing SSKD, MobileNet V2 performs poorly after the necessary architecture adjustment. Additionally, efficientnet-b0 and DenseNet-121 are not supported by the official implementation of SSKD, so we only consider one teacher model (ResNet-18) and two student models (VGG11 and ShuffleNet V2). To implement the proposed framework, we train one BigGAN model [2] for each setting. DiffAugment [60] is also incorporated into the BigGAN training in C-20K and C-10K due to the limited training samples. In all three settings, regardless of the teacher-student combination, we use DenseNet-121 to do the filtering and label adjustment because it has the highest average precision. We generate a fixed number $N^g$ of processed fake samples for each setting, and the optimal $\rho$ in Alg. 1 selected by Alg. 2 is 0.9, 0.6, and 0.7 for the three settings, respectively. In the experiments with SSKD and BLKD+UDA, we only consider the combination of cGAN-KD with these two KD methods due to limited computational resources.

An ablation study is designed to test the effectiveness of the subsampling, filtering, and label adjustment modules of the cGAN-KD framework in the C-10K setting, aiming to show how cGAN-KD performs when these three modules are added into the framework one by one. A second ablation study is conducted in the C-10K setting to analyze the effect of $N^g$, where $N^g$ varies from 0 to a large maximum value.

Please refer to Supp. S.8 for more detailed setups.

Quantitative results: The quantitative comparison among the baseline methods in the main study is shown in Tables I, II, and III. Please note that since SSKD needs to modify network architectures, and BLKD+UDA incorporates UDA into the teachers' training (so the teacher model's precision may change), their performances are not directly comparable with each other, and for the same reason it is also inappropriate to compare them with BLKD and TAKD. Therefore, the existing KD methods are classified into three groups: (1) BLKD and TAKD, (2) SSKD, and (3) BLKD+UDA, and we compare the proposed framework with them separately in Tables I to III. In Table I, we can see that cGAN-KD related methods consistently outperform NOKD, BLKD [18], and TAKD [35] under all three settings and all teacher-student combinations. BLKD [18] and TAKD [35] are improved after being incorporated into the proposed cGAN-KD framework. We also observe that cGAN-KD leads to higher performance gains when there are fewer training samples. Tables II and III show that cGAN-KD can effectively improve the performance of SSKD [54] and BLKD+UDA [42]. Additionally, we conduct ablation studies to evaluate the effect of the proposed M1, M2, and M3 modules and the parameter $N^g$. The quantitative results of these ablation studies are visualized in Figs. 5 and 6, respectively. Fig. 5 reveals the necessity of the subsampling, filtering, and label adjustment modules because their interaction results in the highest precision. Fig. 6 shows that more processed fake images stabilize the student models' performance without significantly decreasing precision, which confirms the necessity of a large $N^g$.

Settings Teachers Students NOKD BLKD TAKD   cGAN-KD   cGAN-KD+BLKD   cGAN-KD+TAKD
C-50K MobileNet V2 (5.92) VGG11 8.42 6.72 7.35   6.83 5.89 6.23
ShuffleNet V2 7.18 6.30 6.56   6.84 6.11 5.88
efficientnet-b0 7.10 6.45 6.42   6.85 6.11 6.05
ResNet-18 (4.88) VGG11 8.42 7.24 7.43   6.83 6.24 6.41
ShuffleNet V2 7.18 6.44 6.69   6.84 6.34 6.25
efficientnet-b0 7.10 6.72 6.59   6.85 6.21 6.29
DenseNet-121 (4.47) VGG11 8.42 7.40 7.27   6.83 6.37 6.51
ShuffleNet V2 7.18 6.73 6.70   6.84 6.57 6.81
efficientnet-b0 7.10 6.34 6.75   6.85 5.59 6.35
C-20K MobileNet V2 (9.28) VGG11 12.48 10.94 11.48   10.81 9.60 9.39
ShuffleNet V2 11.33 10.10 10.01   11.00 9.63 9.19
efficientnet-b0 12.46 10.11 9.99   11.02 9.28 9.16
ResNet-18 (9.26) VGG11 12.48 11.29 11.56   10.81 9.58 9.88
ShuffleNet V2 11.33 10.28 10.39   11.00 9.83 9.86
efficientnet-b0 12.46 10.55 10.38   11.02 10.10 9.75
DenseNet-121 (8.52) VGG11 12.48 10.88 11.73   10.81 9.63 9.70
ShuffleNet V2 11.33 10.36 10.23   11.00 9.97 9.94
efficientnet-b0 12.46 10.28 9.86   11.02 10.03 9.75
C-10K MobileNet V2 (13.53) VGG11 18.57 15.76 15.98   14.32 12.65 12.28
ShuffleNet V2 17.50 14.03 14.96   13.90 12.53 12.69
efficientnet-b0 15.71 14.49 15.40   13.48 12.48 12.71
ResNet-18 (16.07) VGG11 18.57 16.78 16.42   14.32 14.45 14.55
ShuffleNet V2 17.50 15.54 15.27   13.90 14.29 14.26
efficientnet-b0 15.71 14.99 15.68   13.48 14.79 13.89
DenseNet-121 (14.63) VGG11 18.57 15.95 16.43   14.32 13.81 13.62
ShuffleNet V2 17.50 14.81 14.86   13.90 13.66 13.80
efficientnet-b0 15.71 14.80 17.47   13.48 13.38 13.50
TABLE I: CIFAR-10: Comparison on test error rate among NOKD, BLKD [18], TAKD [35], and our methods (with the selected $N^g$ and $\rho$ for each setting). We always use DenseNet-121 to do the filtering and label adjustment regardless of the teacher-student combination. The test error rate is defined as the proportion of incorrectly classified test samples. NOKD implies no KD method is applied. The test error rates of the teachers are shown in parentheses. Our methods outperform NOKD, BLKD, and TAKD in all settings, and this advantage is more significant when training data are scarcer. In almost all scenarios, BLKD and TAKD are improved after being incorporated into the cGAN-KD framework (i.e., cGAN-KD+BLKD and cGAN-KD+TAKD).
Settings Teachers Students NOKD SSKD   cGAN-KD+SSKD
C-50K ResNet-18 (5.06) VGG11 7.59 5.82   5.63
ShuffleNet V2 7.03 5.38   5.27
C-20K ResNet-18 (8.05) VGG11 11.17 8.80   8.53
ShuffleNet V2 11.42 8.41   8.31
C-10K ResNet-18 (11.87) VGG11 16.05 12.60   12.33
ShuffleNet V2 18.19 12.30   12.40
TABLE II: CIFAR-10: Comparison on test error rate between SSKD [54] and cGAN-KD+SSKD (with the selected $N^g$ and $\rho$ for each setting). We always use the DenseNet-121 in Table I to do the filtering and label adjustment. In most scenarios, SSKD is improved after being incorporated into the cGAN-KD framework. The test error rates of the teachers are shown in parentheses.
Settings Teachers Students NOKD BLKD+UDA   cGAN-KD+BLKD+UDA
C-50K MobileNet V2 (5.40) VGG11 8.18 6.18   5.68
ShuffleNet V2 7.11 5.12   5.49
efficientnet-b0 8.62 6.32   6.18
ResNet-18 (5.28) VGG11 8.18 6.17   5.63
ShuffleNet V2 7.11 5.56   5.01
efficientnet-b0 8.62 6.78   6.38
DenseNet-121 (4.55) VGG11 8.18 6.56   5.32
ShuffleNet V2 7.11 5.26   4.94
efficientnet-b0 8.62 6.26   6.29
C-20K MobileNet V2 (9.25) VGG11 12.28 9.57   9.19
ShuffleNet V2 10.72 8.54   8.40
efficientnet-b0 13.27 9.82   9.81
ResNet-18 (9.87) VGG11 12.28 10.04   9.30
ShuffleNet V2 10.72 9.09   9.04
efficientnet-b0 13.27 10.15   9.85
DenseNet-121 (8.54) VGG11 12.28 10.21   8.87
ShuffleNet V2 10.72 9.64   8.60
efficientnet-b0 13.27 11.56   9.32
C-10K MobileNet V2 (13.43) VGG11 18.61 15.14   12.54
ShuffleNet V2 16.76 12.97   12.23
efficientnet-b0 22.45 14.46   12.92
ResNet-18 (14.18) VGG11 18.61 14.79   13.40
ShuffleNet V2 16.76 13.51   13.21
efficientnet-b0 22.45 15.37   13.92
DenseNet-121 (13.4) VGG11 18.61 14.86   12.80
ShuffleNet V2 16.76 13.13   12.25
efficientnet-b0 22.45 14.93   12.97
TABLE III: CIFAR-10: Comparison on test error rate between BLKD+UDA [42] and cGAN-KD+BLKD+UDA (with the selected $N^g$ and $\rho$ for each setting). We always use the DenseNet-121 in Table I to do the filtering and label adjustment. In most scenarios (25 out of 27), BLKD+UDA is improved after being incorporated into the cGAN-KD framework, and this advantage is more significant when training data are scarcer. The test error rates of the teachers are shown in parentheses.
Fig. 5: CIFAR-10: An ablation study of the subsampling, filtering, and label adjustment modules in the C-10K setting. The interaction of the subsampling, filtering, and label adjustment modules leads to the lowest error rates for all three student models.
Fig. 6: CIFAR-10: Analysis of the effect of the number of processed fake images (i.e., $N^g$) in the C-10K setting. Once $N^g$ is large enough, all three models' test error rates stop decreasing and start fluctuating in a small range.

5.2 Tiny-ImageNet

This experiment further demonstrates the effectiveness of cGAN-KD on the Tiny-ImageNet dataset [25].

Experimental setup: Tiny-ImageNet contains 200 image classes with 500 images per class for training and 50 images per class for testing. Most images are RGB images of size $64 \times 64$, and a few grey-scale images are excluded in this experiment.

Based on the test errors of some popular networks (refer to Fig. S.9.11), we choose two teacher models (ResNet-50 [16] and DenseNet-121 [19]) and two student models (ShuffleNet V2 [32] and VGG11 [45]). When implementing TAKD [35], MobileNet V2 [43] is chosen as the TA model because its performance is around the average of the performance of teachers and students. To generate fake samples, we adopt the BigGAN model [2] and DiffAugment [60]. Since DenseNet-121 performs the best in this experiment, it is used for the filtering and label adjustment. Other experimental setups are similar to those of the CIFAR-10 experiment except that all training images are used. Please refer to Supp. S.9 for more details.

Quantitative results: The quantitative results are shown in Tables IV, V, and VI. Similar to the CIFAR-10 experiment, cGAN-KD is very effective and it improves state-of-the-art KD methods in all scenarios.

Teachers Students NOKD BLKD TAKD   cGAN-KD   cGAN-KD+BLKD   cGAN-KD+TAKD
ResNet-50 (35.86) ShuffleNet V2 44.52 43.78 43.80   42.99 43.39 43.05
VGG11 44.13 42.57 41.91   41.55 39.69 39.46
DenseNet-121 (35.22) ShuffleNet V2 44.52 43.97 44.03   42.99 43.52 43.67
VGG11 44.13 42.22 42.03   41.55 38.95 39.49

TABLE IV: Tiny-ImageNet: Comparison on test error rate among NOKD, BLKD [18], TAKD [35], and our methods (with the selected $N^g$ and $\rho$). We use DenseNet-121 to do the filtering and label adjustment regardless of the teacher-student combination. The test error rate is defined as the proportion of incorrectly classified test samples. NOKD implies no KD method is applied. The test error rates of the teachers are shown in parentheses. Our methods outperform NOKD, BLKD, and TAKD in all settings. In all scenarios, BLKD and TAKD are improved after being incorporated into the cGAN-KD framework (i.e., cGAN-KD+BLKD and cGAN-KD+TAKD).

Teacher Students NOKD SSKD   cGAN-KD+SSKD
ResNet-50 (37.6) ShuffleNet V2 46.92 39.44   38.82
VGG11 43.05 36.70   36.37

TABLE V: Tiny-ImageNet: Comparison on test error rate between SSKD [54] and cGAN-KD+SSKD (with the selected $N^g$ and $\rho$). We use the DenseNet-121 in Table IV to do the filtering and label adjustment. In both scenarios, SSKD is improved after being incorporated into the cGAN-KD framework. The test error rates of the teachers are shown in parentheses.

Teachers Students NOKD BLKD+UDA   cGAN-KD+BLKD+UDA
ResNet-50 (36.07) ShuffleNet V2 43.06 37.47   36.16
VGG11 45.67 35.27   34.89
DenseNet-121 (34.59) ShuffleNet V2 43.06 37.29   35.89
VGG11 45.67 34.51   34.02

TABLE VI: Tiny-ImageNet: Comparison on test error rate between BLKD+UDA [42] and cGAN-KD+BLKD+UDA (with the selected $N^g$ and $\rho$). We use the DenseNet-121 in Table IV to do the filtering and label adjustment. In all scenarios, BLKD+UDA is improved after being incorporated into the cGAN-KD framework. The test error rates of the teachers are shown in parentheses.

5.3 RC-49

This experiment is conducted on RC-49 [7, 8] to show that cGAN-KD also performs well in the image regression tasks with a scalar response variable.

Experimental setup: The RC-49 dataset is made by rendering 49 3-D chair models individually. Each chair model is rendered at 899 yaw angles ranging from $0.1^\circ$ to $89.9^\circ$ with a stepsize of $0.1^\circ$. A yaw angle is selected for training if its last digit is odd, so only 450 angles are in the training set while the others are left for testing. This dataset contains 44,051 RGB images of size $64 \times 64$ with the corresponding yaw angles as labels. In this experiment, we design three settings denoted respectively by R-25, R-15, and R-5. The numbers in the three setting names specify the number of images for each distinct angle in the training set. For example, in the R-25 setting, each of the 450 angles in the training set has 25 images, so there are 11,250 images available for training. In all three settings, all images not used for training are held out for testing.

Three student models (ShuffleNet V2 [32], MobileNet V2 [43], and efficientnet-b0 [47]) and one teacher model (VGG16 [45]) are selected in this experiment based on the performance of some popular networks (refer to Table S.10.13). Since no general KD method exists for image regression tasks with a scalar response, we only compare cGAN-KD with NOKD. For cGAN-KD, we adopt the SNGAN architecture [37] and train one CcGAN model (SVDL+ILI) [8] with DiffAugment [60] for each setting. We use VGG16 to do the filtering and label adjustment. Alg. 2 is applied to select the optimal $\rho$ in the filtering module. 50,000 processed fake samples are generated to augment the training set in each setting. Detailed experimental setups can be found in Supp. S.10.

Similar to the CIFAR-10 experiment, an ablation study is conducted to show the necessity of the subsampling, filtering, and label adjustment modules. Another ablation study is conducted to show the effect of $N^g$.

Quantitative results: The quantitative results are shown in Table VII. We can see that cGAN-KD outperforms NOKD by a large margin, and the performance enhancement is even more significant when we have fewer training samples. The quantitative results of the two ablation studies are visualized in Figs. 7 and 8. The conclusion from these ablation studies is consistent with what we obtained in the CIFAR-10 experiment, except that the label adjustment module's effect is more substantial in regression than in classification. One explanation of this difference is that the label inconsistency issue of CcGANs for regression is more severe than that of BigGAN for classification, and the label adjustment module can effectively alleviate this problem.

Settings Teachers Students NOKD   cGAN-KD
R-25 VGG16 (0.20) ShuffleNet V2 0.52   0.34
MobileNet V2 0.55   0.39
efficientnet-b0 0.99   0.57
R-15 VGG16 (0.31) ShuffleNet V2 0.76   0.40
MobileNet V2 0.95   0.48
efficientnet-b0 1.00   0.69
R-5 VGG16 (0.49) ShuffleNet V2 1.95   0.95
MobileNet V2 1.74   1.18
efficientnet-b0 1.86   1.15
TABLE VII: RC-49: Comparison on test MAE between NOKD and cGAN-KD (with the selected $N^g$ and $\rho$ for each setting). The MAE is evaluated on all samples that are not used for training. NOKD implies no KD method is applied. The MAEs of the teacher are shown in parentheses. cGAN-KD outperforms NOKD by a large margin in all settings, and this advantage is more substantial when training data are scarcer.
Fig. 7: RC-49: An ablation study of the subsampling, filtering, and label adjustment modules in the R-5 setting. The interaction of the subsampling, filtering, and label adjustment modules leads to the lowest MAE for all three student models.
Fig. 8: RC-49: Analysis of the effect of the number of processed fake images (i.e., $N^g$) in the R-5 setting. Once $N^g$ is large enough, all three student models' test MAEs stop decreasing and start fluctuating in a small range.

5.4 UTKFace

The last experiment evaluates the performance of cGAN-KD on UTKFace [58], another benchmark regression dataset.

Experimental setup: UTKFace is an RGB human face image dataset with ages as regression labels. We use the processed UTKFace dataset [8, 7], which consists of 14,760 RGB images with ages in [1, 60]. The number of images ranges from 50 to 1,051 for different ages, and all images are of size $64 \times 64$. Among these images, 80% are randomly selected to create a training set, and the rest are held out for testing.

Similar to the RC-49 experiment, three student models (ShuffleNet V2 [32], MobileNet V2 [43], and efficientnet-b0 [47]) and one teacher model (VGG16 [45]) are selected in this experiment based on the performance of some popular networks (please refer to Table S.11.15). For cGAN-KD, we adopt the SAGAN architecture [57] and DiffAugment [60] in the CcGAN (SVDL+ILI) training [8]. We apply VGG16 to conduct the filtering and label adjustment, and the optimal $\rho$ is selected by following Alg. 2. 80,000 processed fake samples are generated to augment the training set. Please refer to Supp. S.11 for detailed setups.

Quantitative results: The quantitative results of this experiment are summarized in Table VIII. We observe that cGAN-KD performs substantially better than NOKD. Please also note that in the scenario with efficientnet-b0 as the student model, the test MAE of cGAN-KD is even smaller than that of the teacher model. This phenomenon implies that the CcGAN model may generate some human faces that are quite different from those in the training set. In other words, the CcGAN model may synthesize some new information.

Teachers Students NOKD   cGAN-KD
VGG16 (5.14) ShuffleNet V2 7.326   5.774
MobileNet V2 7.335   5.507
efficientnet-b0 6.143   4.766
TABLE VIII: UTKFace: Comparison on test MAE between NOKD and cGAN-KD (with the selected $N^g$ and $\rho$). NOKD implies no KD method is applied. The MAE of the teacher is shown in parentheses. cGAN-KD outperforms NOKD by a large margin in all scenarios.

6 Conclusion

This work proposes the first unified knowledge distillation framework that is widely applicable to both classification and regression tasks in the limited data scenario. Fundamentally different from existing knowledge distillation methods, we propose distilling and transferring knowledge from the teacher to student models through cGAN-generated samples, termed cGAN-KD. First, cGAN models are trained to generate a sufficient number of fake image samples. Then, high-quality samples are obtained via subsampling and filtering procedures. Essentially, the knowledge is distilled by adjusting fake image labels with the teacher model. Finally, the distilled knowledge is transferred to student models by training them on these knowledge-conveying samples. The proposed knowledge distillation framework is particularly effective when labeled training data are scarce. Moreover, our framework is architecture-agnostic, and it is compatible with existing state-of-the-art knowledge distillation methods. We also derive the error bound of a student model trained in the cGAN-KD framework for theoretical guidance. Extensive experiments demonstrate that cGAN-KD incorporated methods can achieve state-of-the-art knowledge distillation performance on both classification and regression tasks.

Acknowledgments

This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grants CRDPJ 476594-14, RGPIN-2019-05019, and RGPAS2017-507965.

References

  • [1] E. Arazo, D. Ortego, P. Albert, N. E. O’Connor, and K. McGuinness (2020) Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §3.4.
  • [2] A. Brock, J. Donahue, and K. Simonyan (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, Cited by: 1st item, §1, §2.2, §2.2, §2.3, §3.2, §5.1, §5.2.
  • [3] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil (2006) Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535–541. Cited by: §1.
  • [4] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. Cited by: §2.1.
  • [5] T. DeVries, A. Romero, L. Pineda, G. W. Taylor, and M. Drozdzal (2019) On the evaluation of conditional GANs. arXiv preprint arXiv:1907.08175. Cited by: §2.3.
  • [6] X. Ding, Y. Wang, Z. J. Wang, and W. J. Welch (2021) Efficient subsampling for generating high-quality images from conditional generative adversarial networks. arXiv preprint arXiv:2103.11166. Cited by: §1, §S.10.1, §2.3, §3.3, §S.8.1.
  • [7] X. Ding, Y. Wang, Z. Xu, W. J. Welch, and Z. J. Wang (2021) CcGAN: continuous conditional generative adversarial networks for image generation. In International Conference on Learning Representations, Cited by: 1st item, §1, §2.2, §2.2, §2.3, §3.2, §5.3, §5.4, §5.
  • [8] X. Ding, Y. Wang, Z. Xu, W. J. Welch, and Z. J. Wang (2020) Continuous conditional generative adversarial networks for image generation: novel losses and label input mechanisms. arXiv preprint arXiv:2011.07466. Cited by: 1st item, §1, §S.10.1, §2.2, §2.2, §3.2, §5.3, §5.3, §5.4, §5.4, §5.
  • [9] X. Ding, Z. J. Wang, and W. J. Welch (2020) Subsampling generative adversarial networks: density ratio estimation in feature space with softplus loss. IEEE Transactions on Signal Processing 68, pp. 1910–1922. Cited by: §1.
  • [10] M. Frid-Adar, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan (2018) Synthetic data augmentation using GAN for improved liver lesion classification. In 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018), pp. 289–293. Cited by: §1, §1, §3.5.
  • [11] K. Fukushima and S. Miyake (1982)

    Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition

    .
    In Competition and cooperation in neural nets, pp. 267–285. Cited by: Remark 1.
  • [12] K. Fukushima (1969)

    Visual feature extraction by a multilayered network of analog threshold elements

    .
    IEEE Transactions on Systems Science and Cybernetics 5 (4), pp. 322–333. Cited by: Remark 1.
  • [13] A. L. Gibbs and F. E. Su (2002) On choosing and bounding probability metrics. International statistical review 70 (3), pp. 419–435. Cited by: §4, Theorem 1.
  • [14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pp. 2672–2680. Cited by: §1.
  • [15] J. Gou, B. Yu, S. J. Maybank, and D. Tao (2020) Knowledge distillation: a survey. arXiv preprint arXiv:2006.05525. Cited by: §1, §3.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In

    Proceedings of the IEEE conference on computer vision and pattern recognition

    ,
    pp. 770–778. Cited by: §5.1, §5.2.
  • [17] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: §2.3.
  • [18] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. NIPS Deep Learning Workshop. Cited by: §1, §2.1, §2.1, §3.5, §5.1, TABLE I, TABLE IV, §5.
  • [19] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §1, §5.1, §5.2.
  • [20] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila (2020) Training generative adversarial networks with limited data. arXiv preprint arXiv:2006.06676. Cited by: §2.2.
  • [21] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4401–4410. Cited by: §1, §2.2.
  • [22] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2020) Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119. Cited by: §1, §2.2.
  • [23] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §5.1, §5.
  • [24] J. Lafferty, H. Liu, and L. Wasserman Concentration of measure. External Links: Link Cited by: §4.
  • [25] Y. Le and X. Yang (2015) Tiny ImageNet visual recognition challenge. Cited by: §5.2, §5.
  • [26] D. Lee et al. (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, Vol. 3. Cited by: §3.4.
  • [27] Q. Li, S. Jin, and J. Yan (2017) Mimicking very efficient network for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6356–6364. Cited by: §1.
  • [28] Z. Li and D. Hoiem (2017) Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2935–2947. Cited by: §1.
  • [29] P. Liu, W. Liu, H. Ma, Z. Jiang, and M. Seok (2020) KTAN: knowledge transfer adversarial network. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. Cited by: §1, §1.
  • [30] P. Luo, Z. Zhu, Z. Liu, X. Wang, and X. Tang (2016)

    Face model compression by distilling knowledge from neurons

    .
    In

    Proceedings of the AAAI Conference on Artificial Intelligence

    ,
    Vol. 30. Cited by: §1.
  • [31] Z. Luo, J. Hsieh, L. Jiang, J. C. Niebles, and L. Fei-Fei (2018) Graph distillation for action detection with privileged modalities. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 166–183. Cited by: §1.
  • [32] N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) Shufflenet V2: practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pp. 116–131. Cited by: §5.1, §5.2, §5.3, §5.4.
  • [33] G. Mariani, F. Scheidegger, R. Istrate, C. Bekas, and C. Malossi (2018) BAGAN: data augmentation with balancing GAN. arXiv preprint arXiv:1803.09655. Cited by: §1, §1, §3.5.
  • [34] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §1, §2.2.
  • [35] S. I. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and H. Ghasemzadeh (2020) Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 5191–5198. Cited by: §1, §2.1, §2.1, §3.5, §3.6.4, §5.1, §5.2, TABLE I, TABLE IV, §5, §S.8.1.
  • [36] A. Mittal, R. Soundararajan, and A. C. Bovik (2012) Making a “completely blind” image quality analyzer. IEEE Signal processing letters 20 (3), pp. 209–212. Cited by: Fig. S.10.10, §S.10.2, Fig. S.8.9, §S.8.2.
  • [37] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, Cited by: §1, §S.10.1, §2.2, §5.3.
  • [38] T. Miyato and M. Koyama (2018) CGANs with projection discriminator. In International Conference on Learning Representations, Cited by: §1, §2.2, §2.3.
  • [39] M. Mohri, A. Rostamizadeh, and A. Talwalkar (2018) Foundations of machine learning. MIT Press. Cited by: Theorem 1.
  • [40] A. Odena, C. Olah, and J. Shlens (2017) Conditional image synthesis with auxiliary classifier gans. In International conference on machine learning, pp. 2642–2651. Cited by: §1, §2.2.
  • [41] Z. Peng, Z. Li, J. Zhang, Y. Li, G. Qi, and J. Tang (2019) Few-shot image recognition with knowledge transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 441–449. Cited by: §1.
  • [42] F. Ruffy and K. Chahal (2019) The state of knowledge distillation for classification. arXiv preprint arXiv:1912.10850. Cited by: §1, §3.5, §5.1, §5.1, TABLE III, TABLE VI, §5.
  • [43] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNet V2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: §5.1, §5.2, §5.3, §5.4.
  • [44] Z. Shen, Z. He, and X. Xue (2019) Meal: multi-model ensemble via adversarial learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4886–4893. Cited by: §1, §1.
  • [45] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §5.1, §5.2, §5.3, §5.4.
  • [46] L. Sixt, B. Wild, and T. Landgraf (2018) RenderGAN: generating realistic labeled data. Frontiers in Robotics and AI 5, pp. 66. Cited by: §1, §1, §3.5.
  • [47] M. Tan and Q. Le (2019)

    Efficientnet: rethinking model scaling for convolutional neural networks

    .
    In International Conference on Machine Learning, pp. 6105–6114. Cited by: §5.1, §5.3, §5.4.
  • [48] N. Tran, V. Tran, N. Nguyen, T. Nguyen, and N. Cheung (2020) On data augmentation for GAN training. External Links: 2006.05338 Cited by: §2.2.
  • [49] J. Wang, L. Gou, W. Zhang, H. Yang, and H. Shen (2019) Deepvid: deep visual interpretation and diagnosis for image classifiers via knowledge distillation. IEEE transactions on visualization and computer graphics 25 (6), pp. 2168–2180. Cited by: §1.
  • [50] L. Wang and K. Yoon (2021) Knowledge distillation and student-teacher learning for visual intelligence: a review and new outlooks. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §3.
  • [51] X. Wang, R. Zhang, Y. Sun, and J. Qi (2018) KDGAN: knowledge distillation with generative adversarial networks.. In NeurIPS, pp. 783–794. Cited by: §1, §1.
  • [52] E. Wu, K. Wu, D. Cox, and W. Lotter (2018) Conditional infilling GANs for data augmentation in mammogram classification. In Image Analysis for Moving Organ, Breast, and Thoracic Images, pp. 98–106. Cited by: §1, §1, §3.5.
  • [53] Q. Xie, Z. Dai, E. Hovy, T. Luong, and Q. Le (2020) Unsupervised data augmentation for consistency training. In Advances in Neural Information Processing Systems, Vol. 33, pp. 6256–6268. Cited by: §1, §2.1.
  • [54] G. Xu, Z. Liu, X. Li, and C. C. Loy (2020) Knowledge distillation meets self-supervision. In European Conference on Computer Vision, pp. 588–604. Cited by: §1, §2.1, §2.1, §2.1, §3.5, §3.6.4, §5.1, TABLE II, TABLE V, §5.
  • [55] Z. Xu, Y. Hsu, and J. Huang (2018) Training shallow and thin networks for acceleration via knowledge distillation with conditional adversarial networks. ICLR 2018 Workshop. Cited by: §1, §1.
  • [56] C. Zhang and Y. Peng (2018) Better and faster: knowledge transfer from multiple self-supervised learning tasks via graph distillation for video classification. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 1135–1141. Cited by: §1.
  • [57] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2019) Self-attention generative adversarial networks. In International Conference on Machine Learning, pp. 7354–7363. Cited by: §1, §2.2, §5.4.
  • [58] Z. Zhang, Y. Song, and H. Qi (2017)

    Age progression/regression by conditional adversarial autoencoder

    .
    In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5810–5818. Cited by: §5.4, §5.
  • [59] Q. Zhao, J. Dong, H. Yu, and S. Chen (2020) Distilling ordinal relation and dark knowledge for facial age estimation. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §1, §2.1, §3.
  • [60] S. Zhao, Z. Liu, J. Lin, J. Zhu, and S. Han (2020) Differentiable augmentation for data-efficient GAN training. Advances in Neural Information Processing Systems 33. Cited by: 1st item, §1, §2.2, §3.2, §3.6.3, §5.1, §5.2, §5.3, §5.4.
  • [61] Z. Zhao, Z. Zhang, T. Chen, S. Singh, and H. Zhang (2020) Image augmentations for GAN training. External Links: 2006.02595 Cited by: §2.2.
  • [62] X. Zhu, Y. Liu, J. Li, T. Wan, and Z. Qin (2018) Emotion classification with data augmentation using generative adversarial networks. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 349–360. Cited by: §1, §1, §3.5.

Supplementary Material

S.7 GitHub Repository

Please find example code for this paper at https://github.com/UBCDingXin/cGAN-based_KD
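
To make the workflow concrete, the following is a minimal, hypothetical sketch of the cGAN-KD data pipeline for classification (generate fake samples, filter them, and adjust their labels with the teacher). All names and signatures here (build_distillation_set, the generator call convention, the confidence-based filtering rule, dim_z) are illustrative placeholders and do not correspond to the actual API of the repository above; in cGAN-KD itself, filtering and subsampling are done with the cDRE-based procedures described in Section 3.

```python
import torch

@torch.no_grad()
def build_distillation_set(generator, teacher, n_fake, dim_z=128,
                           n_classes=10, batch_size=256, device="cuda"):
    """Sketch: (1) sample fake images from a trained cGAN generator;
    (2) keep only "good" samples (placeholder rule below; the paper uses
    cDRE-based subsampling plus teacher-based filtering);
    (3) replace the kept samples' labels with the teacher's predictions."""
    generator.eval(); teacher.eval()
    images, labels = [], []
    while sum(t.size(0) for t in images) < n_fake:
        z = torch.randn(batch_size, dim_z, device=device)
        y = torch.randint(0, n_classes, (batch_size,), device=device)
        x_fake = generator(z, y)                      # placeholder generator signature
        probs = teacher(x_fake).softmax(dim=1)
        conf, y_teacher = probs.max(dim=1)
        keep = conf > 0.9                             # placeholder filtering threshold
        images.append(x_fake[keep].cpu())
        labels.append(y_teacher[keep].cpu())          # label adjustment via the teacher
    return torch.cat(images)[:n_fake], torch.cat(labels)[:n_fake]
```

The student would then be trained on these knowledge-conveying samples with an ordinary classification (or, for regression, mean-absolute-error/MSE) loss, which is the knowledge-transfer step of the framework.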

S.8 More Details of Experiments on the CIFAR-10 Dataset

S.8.1 Experimental Setups

For each experimental setting described in Section 5.1, we implement all KD methods as follows.

All classifiers in this experiment, except those related to SSKD, are trained for 350 epochs with the SGD optimizer, an initial learning rate of 0.1 (decayed by a factor of 0.1 at epochs 150 and 250), weight decay, and batch size 128.
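
This recipe maps onto a standard PyTorch training loop; a minimal sketch is below. The weight-decay value did not survive extraction in this copy, so the 5e-4 used here, like the momentum of 0.9 and the torchvision stand-ins for the network and data loader, is only an assumed placeholder.

```python
import torch
import torchvision
from torch import nn, optim
from torchvision import transforms

# Stand-in student network and data loader; the paper uses its own CIFAR-10 variants.
student_net = torchvision.models.vgg11(num_classes=10).cuda()
train_loader = torch.utils.data.DataLoader(
    torchvision.datasets.CIFAR10("./data", train=True, download=True,
                                 transform=transforms.ToTensor()),
    batch_size=128, shuffle=True)

# SGD, initial lr 0.1, decayed by 0.1 at epochs 150 and 250, 350 epochs in total.
# momentum=0.9 and weight_decay=5e-4 are assumed placeholders, not values from the paper.
optimizer = optim.SGD(student_net.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 250], gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(350):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(student_net(x.cuda()), y.cuda())
        loss.backward()
        optimizer.step()
    scheduler.step()
```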

To determine the teacher and student models, we first train some popular classifiers from scratch; their test errors are shown in Table S.8.9. ShuffleNet V2, efficientnet-b0, and VGG11 perform worst in the three settings, so they are chosen as the student models. Conversely, MobileNet V2, ResNet-18, and DenseNet-121 are chosen as teacher models. Although VGG11 has many more parameters than DenseNet-121, it has the highest inference speed among these classifiers; therefore, VGG11 is treated as a lightweight model in this paper. Please also note that, in all settings, we use DenseNet-121 for the filtering and label adjustment in the cGAN-KD framework because DenseNet-121 has the highest average precision over the three settings.

When implementing TAKD, we borrow some code from its official implementation at https://github.com/imirzadeh/Teacher-Assistant-Knowledge-Distillation. In TAKD, a good TA model's precision is usually close to the average of the teacher's and the student's [35]; therefore, VGG13 is chosen as the TA model based on Table S.8.9. SSKD is implemented based on https://github.com/xuguodong03/SSKD, using the default training setups at https://github.com/xuguodong03/SSKD/blob/master/command.sh. BLKD+UDA is implemented based on https://github.com/karanchahal/distiller.
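
As a rough illustration of the midpoint heuristic for TA selection, the hypothetical helper below picks the candidate whose test error is closest to the average of the teacher's and student's errors; the dictionary is copied from the C-50K column of Table S.8.9. The paper's actual choice of VGG13 also takes the C-20K and C-10K settings into account, so this simplified single-column rule need not return the same network.

```python
# Hypothetical TA selection: nearest to the teacher-student midpoint (C-50K errors).
test_error_c50k = {
    "ShuffleNet V2": 7.18, "MobileNet V2": 5.92, "efficientnet-b0": 7.10,
    "VGG11": 8.42, "VGG13": 6.63, "VGG16": 6.57,
    "ResNet-18": 4.88, "ResNet-50": 4.88, "DenseNet-121": 4.47,
}

def pick_ta(teacher: str, student: str, errors: dict) -> str:
    target = (errors[teacher] + errors[student]) / 2
    candidates = {k: v for k, v in errors.items() if k not in (teacher, student)}
    return min(candidates, key=lambda k: abs(candidates[k] - target))

ta = pick_ta("DenseNet-121", "VGG11", test_error_c50k)  # candidate nearest the midpoint
```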

The setups of cGAN-KD are as follows. The implementation of BigGAN is mainly based on https://github.com/ajbrock/BigGAN-PyTorch. The BigGAN model is trained for 2000, 2000, and 6000 epochs in C-50K, C-20K, and C-10K, respectively, with batch size 512. DiffAugment is enabled in C-20K and C-10K, using its official implementation at https://github.com/mit-han-lab/data-efficient-gans; the strongest transformation combination (Color + Translation + Cutout) is used in training. cDRE-F-cSP+RS is fitted by using the setups and codes in https://github.com/UBCDingXin/cDRE-based_Subsampling_cGANS. Different from [6], we use a specially designed DenseNet-121 instead of ResNet-34 to extract features for density ratio estimation. The grid-search results for selecting the optimal hyperparameter value in M1 are shown in Table S.8.10; in each setting, the optimal value minimizes the average validation error of the three student models.
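
The selection rule in the last sentence (pick the candidate value minimizing the mean validation error of the three students) can be sketched as follows; the candidate grid and error values are copied from the C-50K block of Table S.8.10, and the variable names are purely illustrative.

```python
# Grid-search sketch: choose the candidate whose average validation error
# over the three student models is smallest (C-50K block of Table S.8.10).
candidates = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
val_errors = {
    "VGG11":           [6.84, 6.40, 6.90, 6.83, 6.78, 6.49, 6.65],
    "ShuffleNet V2":   [6.91, 6.93, 6.80, 6.95, 6.54, 6.62, 6.35],
    "efficientnet-b0": [6.76, 6.43, 6.55, 6.46, 6.07, 6.53, 6.13],
}

avg = [sum(errs[i] for errs in val_errors.values()) / len(val_errors)
       for i in range(len(candidates))]
best = candidates[min(range(len(candidates)), key=avg.__getitem__)]
print(best)   # 0.9, whose average error (~6.38) is the smallest on this grid
```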

In each setting, a student model trained by each KD method is evaluated on the 10,000 hold-out test samples of CIFAR-10. The performance of these student models is reflected by test error rates, i.e., the proportion of incorrectly classified test samples.
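
For completeness, the evaluation metric described above amounts to the small routine below; `student_net` and `test_loader` are the same kind of hypothetical placeholders used earlier.

```python
import torch

@torch.no_grad()
def test_error_rate(net, loader, device="cuda"):
    """Proportion of incorrectly classified test samples, in percent."""
    net.eval()
    wrong, total = 0, 0
    for x, y in loader:
        pred = net(x.to(device)).argmax(dim=1)
        wrong += (pred != y.to(device)).sum().item()
        total += y.size(0)
    return 100.0 * wrong / total
```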

Please refer to our code for more detailed experimental setups.

Networks          # Params     Inference Speed (FPS)   Test Error Rate
                                                        C-50K   C-20K   C-10K
ShuffleNet V2      1,263,854   11,169                   7.18    11.33   17.50
MobileNet V2       2,296,922   14,762                   5.92     9.28   13.53
efficientnet-b0    3,599,686    7,505                   7.10    12.46   15.71
VGG11              9,231,114   21,850                   8.42    12.48   18.57
VGG13              9,416,010   16,457                   6.63    10.73   15.63
VGG16             14,728,266   12,712                   6.57    10.75   16.15
ResNet-18         11,173,962    9,883                   4.88     9.26   16.07
ResNet-50         23,520,842    2,693                   4.88     9.72   16.81
DenseNet-121       6,956,298    1,999                   4.47     8.52   14.63

TABLE S.8.9: CIFAR-10: Comparison of model size, inference speed, and test error rate among popular convolutional neural networks. Model size is evaluated by counting the number of learnable parameters in the network. Inference speed is defined as the number of images processed by a network per second; to compute it, we evaluate each network on 10,000 images with batch size 64 using one RTX 2080 Ti. Based on the test error rates, ShuffleNet V2, efficientnet-b0, and VGG11 are selected as student models, while MobileNet V2, ResNet-18, and DenseNet-121 are selected as teacher models.
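
The inference-speed figures in the caption above can be reproduced with a timing loop of the following shape; this is a minimal sketch under the stated conditions (10,000 images, batch size 64, a single GPU), with `net` standing in for any of the networks in the table.

```python
import time
import torch

@torch.no_grad()
def inference_speed(net, n_images=10_000, batch_size=64, image_size=32, device="cuda"):
    """Rough FPS estimate: images processed per second on random inputs."""
    net = net.to(device).eval()
    x = torch.randn(batch_size, 3, image_size, image_size, device=device)
    for _ in range(10):          # warm-up to exclude CUDA initialization
        net(x)
    torch.cuda.synchronize()
    start = time.time()
    n_batches = n_images // batch_size
    for _ in range(n_batches):
        net(x)
    torch.cuda.synchronize()
    return n_batches * batch_size / (time.time() - start)
```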

C-50K
Networks          0.3    0.4    0.5    0.6    0.7    0.8    0.9
VGG11             6.84   6.40   6.90   6.83   6.78   6.49   6.65
ShuffleNet V2     6.91   6.93   6.80   6.95   6.54   6.62   6.35
efficientnet-b0   6.76   6.43   6.55   6.46   6.07   6.53   6.13
Average           6.84   6.59   6.75   6.75   6.46   6.55   6.38

C-20K
Networks          0.3    0.4