A heavyweight model in deep learning is often a deep, overparameterized neural network or an ensemble of multiple deep neural networks. It usually has high precision but also incurs some costs: (1) a high memory cost due to the large model size (i.e., many learnable parameters); (2) a low inference speed (i.e., the number of images processed by the model per second). Note that a neural network's inference speed depends on both the number of learnable parameters and the connection design. For example, DenseNet-121 has a densely connected pattern (which leads to a high computation cost) while VGG11 is a simple, straightforward network. Although DenseNet-121 has fewer learnable parameters than VGG11 does, evaluating VGG11 is about 10 times faster than evaluating DenseNet-121. Unfortunately, in many application scenarios (e.g., deploying neural networks on mobile devices), only limited computational resources are available to evaluate a trained model, so we can only afford lightweight models that are fast and memory-efficient. However, the small model capacity and the limited amount of labeled data often prevent a lightweight model from achieving sufficiently high precision. Therefore, how to use an accurate heavyweight model to improve the performance of a lightweight model has been actively studied recently.
Knowledge distillation (KD), first proposed by  and then developed by , is a popular method to improve the performance of a lightweight model by utilizing the knowledge distilled from an accurate, heavyweight model [28, 41, 49, 30, 31, 27, 56]. The heavyweight and lightweight models in KD are often known as the teacher model and the student model, respectively. Since  introduced the baseline knowledge distillation (BLKD) , many KD methods have been proposed for the image classification task [50, 15], but in this paper we only briefly review BLKD  and some recently proposed methods [42, 35, 54]. BLKD transfers knowledge from the teacher model to the student model by matching the logits between these two models (i.e., the raw predictions generated by a classification-oriented neural network, which are then passed to the softmax function). The teacher assistant knowledge distillation (TAKD)  is a recent improvement of BLKD.  finds that BLKD may fail if the performance gap between the teacher and student models is too large. TAKD applies BLKD to transfer knowledge from the teacher model to an intermediate model, termed the teacher assistant (TA) model, to fill this gap. The TA model often performs better than the student model but worse than the teacher model. Then, TAKD transfers knowledge from the TA model to the student model by applying BLKD again. However, our experiments in Section 5 show that TAKD is sometimes inferior to BLKD. The self-supervised learning as an auxiliary task for knowledge distillation (SSKD)
is another recently proposed KD method, which introduces self-supervised learning as an auxiliary task into knowledge distillation to transfer knowledge from the teacher model to the student model. However, the architectures of the teacher and student models sometimes need to be adjusted to fit into the SSKD framework, which may deteriorate the performance of the teacher model. In many scenarios where we need to apply KD, we often have limited labeled images. Thus, the unsupervised data augmentation (UDA)  has been incorporated into BLKD to improve the performance of the baseline distillation. Different from the image classification task, the application of KD to image regression with a scalar response variable (e.g., angle and age) has rarely been studied.
proposes a KD method specially designed for estimating ages from images of human faces. However, this method may be inapplicable to other image regression tasks with a scalar response because some components of the proposed framework are only designed for age estimation. To the best of our knowledge, there is no KD method general enough for all image regression tasks with a scalar response. Moreover, all the above methods are designed for either image classification or image regression; there is no unified KD framework that is suitable for both tasks.
Generative adversarial networks (GANs) [14, 34, 40, 37, 57, 2, 38, 21, 22, 7, 8] are state-of-the-art generative models for image synthesis. Some modern GAN models such as BigGAN  and StyleGAN [21, 22] are able to generate high-resolution, even photo-realistic images. Conditional generative adversarial networks (cGANs) are an essential family of GANs, which can generate images conditional on some auxiliary information. Most cGANs are designed for categorical conditions such as class labels [34, 38, 40, 37, 57, 2], and cGANs with class labels as conditions are also known as class-conditional GANs. Recently, [7, 8] propose a new cGAN framework, termed continuous conditional GANs (CcGANs). CcGANs can generate images conditional on continuous, scalar variables (termed regression labels). In the scenario with limited training data, the performance of GANs often deteriorates. To alleviate this problem for unconditional GANs and class-conditional GANs, DiffAugment  proposes to conduct online transformations on images during the GAN training. Our experiments show that it also applies to CcGANs. Besides these advances in GAN methodology, some papers [10, 46, 52, 62, 33] use GAN-generated data for data augmentation in image classification tasks with insufficient training data. However, even state-of-the-art GANs may generate low-quality samples, which may have negative effects on the classification task. Fortunately, some recently proposed subsampling methods [9, 6] may be applied to eliminate these low-quality samples. Additionally, some works [55, 51, 44, 29] propose to incorporate the adversarial loss of GANs into KD, but their performance is not state-of-the-art.
Motivated by the limitations of existing KD methods and the recent development of cGANs, we propose a general and flexible cGAN-based KD framework suitable for both image classification and regression (with a scalar response). Our contributions can be summarized as follows:
In Section 3, we introduce a novel KD framework termed cGAN-KD, which distills and transfers knowledge via cGAN-generated samples. As a preliminary, we propose to train BigGAN  for classification or CcGANs [7, 8] for regression. We also suggest incorporating DiffAugment  into the cGAN training when labeled data are limited. Fake image-label pairs (i.e., fake samples) generated from cGAN are then subsampled and filtered to drop low-quality samples. The knowledge distillation takes place when a pre-trained teacher model adjusts the labels of fake samples. Then, these processed fake samples are used to augment the training set. Finally, the student model is trained on the augmented training set, where the knowledge transfer is conducted implicitly.
Compared with existing KD methods, our framework has many advantages that are summarized in Section 3.6. Notably, cGAN-KD is a unified KD framework suitable for both classification and regression tasks. It is compatible with state-of-the-art KD methods and particularly ideal for limited labeled data scenarios. Moreover, unlike many existing KD methods, the teacher and student models’ architecture difference is no longer important in cGAN-KD.
In Section 4, we derive the error bound of a student model trained in the cGAN-KD framework, which not only helps us understand how cGAN-KD takes effect but also guides the implementation of cGAN-KD in practice. Such analysis is often omitted in many papers about knowledge distillation. The error bound implies we should generate as many processed fake samples as possible and choose a teacher model with high precision.
In Section 5, extensive experiments on CIFAR-10, Tiny-ImageNet, RC-49, and UTKFace datasets demonstrate the effectiveness of cGAN-KD in the image classification and regression tasks with limited training data. In image classification tasks, state-of-the-art KD methods are also improved if incorporated into cGAN-KD. An ablation study on CIFAR-10 and RC-49 is also conducted to show the necessity of the subsampling, filtering, and label adjustment modules in the cGAN-KD framework. In another ablation study, we show that more processed fake images often lead to more stable knowledge distillation performance.
Please note that our approach is fundamentally different from existing GAN-related KD methods [55, 51, 44, 29] because (1) our approach is the first framework that utilizes cGAN-generated samples as the knowledge carrier; (2) our approach applies to both classification and regression; (3) we do not need to incorporate the adversarial loss into KD. Please also note that, in our KD framework, we design a subsampling and filtering scheme to drop low-quality samples generated from cGANs, which does not exist in existing GAN-based data augmentation methods [10, 46, 52, 62, 33].
2 Related Work
2.1 Knowledge Distillation
In this section, we briefly review four KD methods implemented in our experiments for image classification, including BLKD , TAKD , SSKD , and BLKD+UDA . The KD method only designed for age estimation  is not considered in this paper since it is inapplicable to other regression tasks.
BLKD  transfers knowledge from the teacher model to the student model by matching the logits (i.e., the output of the last layer in a neural network) between these two models, so it is also known as a logits-based KD method. BLKD does not need to change the teacher and student models' architectures, and it has been widely applied in many applications. Denote by $z = (z_1, \dots, z_C)^\top$ the logits of an image from a neural network, where $z$ is a $C$ by 1 vector and $C$ is the number of classes. With the softmax function, we can calculate the probability that the image belongs to class $i$ as follows:
$$p_i = \frac{\exp(z_i / T)}{\sum_{j=1}^{C} \exp(z_j / T)},$$
where $T$ is the temperature factor. The $C$ by 1 vector $p = (p_1, \dots, p_C)^\top$ is also known as the soft label of the image; a higher $T$ yields a softer (more uniform) soft label. In contrast, the one-hot ground-truth label is known as the hard label. An example of hard labels and soft labels is shown in Fig. 1. Usually, the soft label is more informative than the hard label because it can reflect the similarity between classes and the confidence of the prediction. The logits of the same image from the teacher model and the student model are denoted by $z^t$ and $z^s$ respectively, and the corresponding soft labels are denoted respectively by $p^t$ and $p^s$. The student model is trained to minimize the cross entropy between $p^t$ and $p^s$ as follows:
$$\mathcal{L}_{KD} = -\sum_{i=1}^{C} p_i^t \log p_i^s.$$
The student model is also trained to minimize the cross entropy between the one-hot encoded class label $y = (y_1, \dots, y_C)^\top$ and the soft label as follows:
$$\mathcal{L}_{CE} = -\sum_{i=1}^{C} y_i \log p_i^s,$$
where $p^s$ is computed with $T = 1$. The overall training loss is $\mathcal{L} = \lambda \mathcal{L}_{CE} + (1 - \lambda) \mathcal{L}_{KD}$, where $\lambda$ is a hyperparameter controlling the trade-off between the two losses: $\mathcal{L}_{CE}$ is the standard loss for classification and $\mathcal{L}_{KD}$ encourages the knowledge transfer.
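To make the two losses concrete, here is a minimal pure-Python sketch of the BLKD objective: the temperature-scaled softmax, the two cross-entropy terms, and their weighted combination. The function names, the $T^2$ scaling of the soft term (the conventional gradient-scaling choice), and the toy values of $T$ and $\lambda$ are illustrative, not taken from the paper's implementation.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: a higher T yields a softer distribution.
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def blkd_loss(z_student, z_teacher, onehot, T=4.0, lam=0.5):
    # lam * CE(hard label, student) + (1 - lam) * T^2 * CE(teacher soft, student soft).
    # The T^2 factor is the conventional gradient-scaling term for the soft loss.
    p_s_soft = softmax(z_student, T)
    p_t_soft = softmax(z_teacher, T)
    p_s_hard = softmax(z_student, 1.0)
    ce_hard = -sum(y * math.log(p + 1e-12) for y, p in zip(onehot, p_s_hard))
    ce_soft = -sum(q * math.log(p + 1e-12) for q, p in zip(p_t_soft, p_s_soft))
    return lam * ce_hard + (1.0 - lam) * (T ** 2) * ce_soft

# Toy 3-class example.
z_t = [4.0, 1.0, -1.0]   # teacher logits
z_s = [2.5, 0.5, -0.5]   # student logits
y = [1.0, 0.0, 0.0]      # one-hot hard label
print(blkd_loss(z_s, z_t, y))
```

Note how the soft labels at a high temperature spread probability mass over the wrong classes, which is exactly the inter-class similarity information the hard label cannot carry.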
TAKD  is a recent variant of BLKD.  finds that if the performance gap between a teacher model and a student model is large, BLKD usually does not perform well. Therefore, TAKD introduces a teacher assistant (TA) model, which often performs better than the student model but worse than the teacher model. BLKD is applied to the teacher-TA and TA-student pairs, respectively, where the knowledge is first transferred from the teacher model to the TA model and then from the TA model to the student model.
SSKD  is another recently proposed KD method. Like BLKD, SSKD encourages the student model to mimic the teacher model’s classification performance on labeled data. Additionally, SSKD also minimizes the difference between the student and teacher models’ performance on a self-supervised learning task  (a task to learn a more informative representation of images in an unsupervised manner). However, we often need to carefully adjust the architectures of the student model and the teacher model to let them fit into the proposed algorithm of SSKD. Our experiments in Section 5 show that such architecture adjustment may deteriorate the teacher model’s performance.
UDA  is an effective data augmentation method for deep learning models when labeled data are scarce.  incorporates UDA into BLKD (denoted by BLKD+UDA) to improve the performance of BLKD. Please note that, when implementing BLKD+UDA, UDA must be applied to both the teacher model training and the student model training.
2.2 Conditional Generative Adversarial Networks
cGANs  aim to estimate the distribution of images conditional on some auxiliary information. A cGAN model includes two neural networks: a generator $G$ and a discriminator $D$. The generator takes as input a random noise $z$ and the condition $y$, and outputs a fake image $G(z, y)$ which follows the fake conditional image distribution $p_g(\mathbf{x} \mid y)$. The discriminator takes as input an image $\mathbf{x}$ and the condition $y$, and outputs the probability that the image comes from the true conditional image distribution $p_r(\mathbf{x} \mid y)$. A typical cGAN pipeline is shown in Fig. 2. Mathematically, the cGAN model is trained to minimize the divergence between $p_g(\mathbf{x} \mid y)$ and $p_r(\mathbf{x} \mid y)$. The condition $y$ is usually a categorical variable such as a class label, and cGANs with class labels as conditions are also known as class-conditional GANs [34, 38, 40, 37, 57, 2]. Class-conditional GANs have been widely studied, and state-of-the-art models such as BigGAN  are already able to generate photo-realistic images. However, GANs conditional on regression labels (e.g., angles and ages) have rarely been studied because of two problems. First, very few (even zero) images exist for some regression labels, so the empirical cGAN losses may fail. Second, since regression labels are continuous and infinitely many, they cannot be embedded by one-hot encoding like class labels. Recently, [7, 8] propose a new formulation of cGANs, termed CcGANs. The CcGAN framework consists of novel empirical cGAN losses and novel label input mechanisms. To solve the first problem, the discriminator is trained by either the hard vicinal discriminator loss (HVDL) or the soft vicinal discriminator loss (SVDL). A new empirical generator loss is also proposed to alleviate the first problem. To solve the second problem, [7, 8] introduce a naive label input (NLI) mechanism and an improved label input (ILI) mechanism. Hence, [7, 8] propose four CcGAN models employing different discriminator losses and label input mechanisms, i.e., HVDL+NLI, SVDL+NLI, HVDL+ILI, and SVDL+ILI. The effectiveness of CcGANs has been demonstrated on multiple regression-oriented datasets.
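As a rough illustration of the hard-vicinal idea behind HVDL (the exact loss is given in [7, 8]; the helper function and the $\kappa$ value here are ours), the discriminator at a target label reuses real samples whose labels fall within a small vicinity of that label, which compensates for labels with few or zero images:

```python
def hard_vicinal_indices(labels, target, kappa):
    # Indices of samples whose label lies in the hard vicinity [target - kappa, target + kappa].
    return [i for i, y in enumerate(labels) if abs(y - target) <= kappa]

labels = [0.10, 0.12, 0.30, 0.31, 0.55]
print(hard_vicinal_indices(labels, target=0.11, kappa=0.02))  # → [0, 1]
```

Even if no training image carries the exact label 0.11, the two samples near it still contribute to the discriminator loss at that label.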
The performance of cGANs often deteriorates when training data are insufficient. DiffAugment  is one of several recent works [60, 20, 48, 61] designed to stabilize the GAN training in this setting. Although DiffAugment is designed for unconditional GANs (e.g., StyleGAN [21, 22]) and class-conditional GANs (e.g., BigGAN ), our experiment shows that it is also applicable to CcGANs [7, 8].
2.3 cDRE-F-cSP+RS: Subsampling Conditional Generative Adversarial Networks
Modern cGANs have been demonstrated to be successful in many applications, but low-quality samples still appear frequently even with state-of-the-art network architectures (e.g., BigGAN ) and training setups. To filter out low-quality samples,  proposes a subsampling framework, termed cDRE-F-cSP+RS, for class-conditional GANs and CcGANs. This framework consists of two components: a conditional density ratio estimation (cDRE) method termed cDRE-F-cSP and a rejection sampling (RS) scheme. cDRE-F-cSP aims to estimate the conditional density ratio function $p_r(\mathbf{x} \mid y) / p_g(\mathbf{x} \mid y)$ based on real and fake images. Based on the estimated conditional density ratios, the rejection sampling scheme is utilized to sample from a trained cGAN. For class-conditional GANs, experiments in  demonstrate that cDRE-F-cSP+RS can substantially improve the Fréchet inception distance (FID)  and Intra-FID  scores. For CcGANs, cDRE-F-cSP+RS not only improves the Intra-FID score but also improves the image diversity and label consistency (i.e., the consistency of generated images with respect to the conditioning label) [7, 5].
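The rejection-sampling half of such a scheme can be sketched as follows: each fake sample is accepted with probability proportional to its estimated density ratio, so samples that look less like real data are dropped more often. Normalizing by the maximum ratio is one common choice and not necessarily the exact rule used in cDRE-F-cSP+RS:

```python
import random

def subsample_by_density_ratio(samples, ratios, seed=0):
    # Accept each fake sample with probability ratio / max_ratio, so samples
    # that resemble real data (high density ratio) survive more often.
    rng = random.Random(seed)
    r_max = max(ratios)
    kept = []
    for s, r in zip(samples, ratios):
        if rng.random() <= r / r_max:
            kept.append(s)
    return kept

fakes = ["img_a", "img_b", "img_c", "img_d"]
ratios = [0.9, 0.05, 0.8, 0.02]  # a low ratio suggests a low-quality sample
print(subsample_by_density_ratio(fakes, ratios))
```

In practice the ratios would come from a trained conditional density-ratio model rather than being given, and rejected samples can be replaced by drawing new ones from the generator.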
3 Proposed Method
While many KD methods have been proposed for image classification [50, 15], there is only one KD method for image regression (scalar response) . Unfortunately, it is a specially designed method for age estimation instead of a general KD method. Moreover, there is no KD framework applicable to both tasks.
This section proposes a unified KD framework, termed cGAN-KD, which is suitable for both image classification and regression (scalar response) tasks. The proposed framework can also fit into many state-of-the-art KD methods for image classification to improve their performance. It is also suitable for tasks with limited labeled data and insensitive to the teacher and student models’ architecture differences.
3.1 Problem Formulation
Before we introduce cGAN-KD, let us formulate the KD task in the language of mathematics as follows. Assume we have a set of $n$ image-label pairs, i.e.,
$$D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n},$$
which are randomly drawn from the true joint distribution $p_r(\mathbf{x}, y)$. We also have a teacher model $T$ and a student model $S$ which are trained on $D$. $T$ often has a smaller test error than $S$ does, i.e.,
$$\mathbb{E}_{(\mathbf{x}, y) \sim p_r}\, \ell(T(\mathbf{x}), y) \le \mathbb{E}_{(\mathbf{x}, y) \sim p_r}\, \ell(S(\mathbf{x}), y),$$
where $\ell$ is either the cross entropy (CE) loss (i.e., Eq. (2)) for classification or the square error (SE) loss (i.e., $\ell(S(\mathbf{x}), y) = (S(\mathbf{x}) - y)^2$) for regression. The objective of KD is to reduce the test error of $S$ by using the knowledge learned by $T$.
3.2 The Workflow of cGAN-KD
As a preliminary of cGAN-KD, we need to train a cGAN on $D$. For image classification, we suggest adopting state-of-the-art class-conditional GANs such as BigGAN . For image regression with a scalar response, we should use CcGANs [7, 8]. In the scenario with very few training data, we propose to apply DiffAugment  to stabilize the cGAN training. After the cGAN training, the proposed KD framework can be applied. In Fig. 3, we visualize the workflow of cGAN-KD, which includes three important modules denoted respectively by M1, M2, and M3. First, we draw a set of unprocessed fake image-label pairs from the trained cGAN, i.e.,
$$D^g = \{(\mathbf{x}^g_i, y^g_i)\}.$$
These fake samples are then subsampled and filtered by M1 to drop low-quality samples and form a subset $D^{g_1}$ of $D^g$. The next module M2 in the pipeline adjusts the labels of the images in $D^{g_1}$ by a pre-trained teacher model $T$ and outputs a set of processed samples, i.e.,
$$D^{g_2} = \{(\mathbf{x}^g, T(\mathbf{x}^g)) : (\mathbf{x}^g, \cdot) \in D^{g_1}\}.$$
The processed samples are then used to augment the training set $D$. Finally, M3 trains the student model $S$ on the augmented training set $D^{aug} = D \cup D^{g_2}$. The student model trained on $D^{aug}$ is expected to perform better than the one trained on $D$. More details of the three modules are described in Sections 3.3 to 3.5, and the evolution of the fake sample datasets is shown in Fig. 4.
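The three-module workflow can be summarized as a pipeline skeleton; all functions below are toy stand-ins for the actual cGAN sampler, the M1/M2 modules, and the student training step, so only the structure (not any implementation detail) is meant to match the paper:

```python
def cgan_kd_pipeline(train_set, teacher, sample_cgan, drop_low_quality,
                     adjust_labels, train_student, n_fake):
    fake = sample_cgan(n_fake)                 # draw unprocessed fake image-label pairs
    fake = drop_low_quality(fake, teacher)     # M1: subsample/filter low-quality samples
    fake = adjust_labels(fake, teacher)        # M2: teacher re-labels the fake samples
    return train_student(train_set + fake)     # M3: train student on the augmented set

# Toy stand-ins so the skeleton runs end to end.
teacher = lambda x: round(x, 1)                        # stand-in "teacher prediction"
sample = lambda n: [(0.31, 0.5), (0.72, 0.7)][:n]      # (image_code, assigned_label)
drop = lambda fk, t: [(x, y) for x, y in fk if abs(t(x) - y) <= 0.25]
adjust = lambda fk, t: [(x, t(x)) for x, y in fk]
train = lambda data: len(data)                         # "training" just counts samples

print(cgan_kd_pipeline([(0.1, 0.1)], teacher, sample, drop, adjust, train, n_fake=2))
```

The key design point visible in the skeleton is that knowledge flows only through the data: the student trainer never sees the teacher directly, only the teacher-adjusted samples.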
3.3 M1: Drop Low-quality Fake Samples
Since low-quality samples may harm the prediction accuracy if used to augment the training set, M1 is adopted to drop these samples. It includes two sequential submodules: a subsampling module and a filtering module.
The subsampling module implements cDRE-F-cSP+RS , which performs rejection sampling to accept or reject a fake image-label pair in terms of the conditional density ratio of the image given its label.  shows that cDRE-F-cSP+RS can effectively improve the overall image quality of both class-conditional GANs and CcGANs in the conditional image synthesis setting. Thus, the subsampling module is very suitable for dropping low-quality samples.
The subsequent filtering module is another strategy to drop low-quality samples. Assume we generate a fake image $\mathbf{x}^g$ from a trained cGAN conditional on a label $y^g$; then $y^g$ is called the assigned label of $\mathbf{x}^g$ in this paper. In the filtering module, we use the pre-trained teacher model to predict the label of $\mathbf{x}^g$. Our experimental study in Supp. S.8.2 and S.10.2 shows that a significant error (i.e., the cross entropy loss for classification or the mean absolute error for regression) between the assigned and predicted labels often implies terrible visual quality. Based on this observation, we propose to drop fake samples with errors larger than a threshold, which is summarized in Alg. 1. The filtering threshold equals the $\rho$-th quantile of the fake samples' errors, and the optimal $\rho$ is selected by a grid search algorithm (i.e., Alg. 2).
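A minimal sketch of this filtering rule, with the quantile-based threshold described above (the error function, the toy teacher, and the choice of quantile level are illustrative, not the paper's exact configuration):

```python
import statistics

def filter_fakes(fakes, teacher_predict, error_fn, rho):
    # Keep fake (image, assigned_label) pairs whose teacher-vs-assigned error is at
    # most the rho-th quantile of all errors; larger errors suggest bad visual quality.
    errors = [error_fn(teacher_predict(x), y) for x, y in fakes]
    threshold = statistics.quantiles(errors, n=100)[int(rho * 100) - 1]
    return [pair for pair, e in zip(fakes, errors) if e <= threshold]

# Toy regression example: images encoded as ints, labels as floats.
fakes = [(i, i * 0.1) for i in range(10)]                 # assigned labels
teacher = lambda x: x * 0.1 + (0.5 if x == 9 else 0.0)    # one badly-labeled fake
mae = lambda a, b: abs(a - b)
print(len(filter_fakes(fakes, teacher, mae, rho=0.9)))    # the worst fake is dropped
```

In the full method, $\rho$ itself would be tuned by the grid search of Alg. 2 rather than fixed by hand.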
3.4 M2: Distill Knowledge via Label Adjustment
Similar to pseudo-labeling in semi-supervised learning, the label adjustment is conducted by replacing each fake sample's assigned label with the label predicted by the pre-trained teacher model. Please note that in classification, the predicted labels are hard labels, as described in Fig. 1. This adjustment distills the knowledge about the relation between an image and its label from the trained teacher model and stores it in the adjusted dataset. Moreover, cGANs, and especially CcGANs, more or less suffer from the label inconsistency problem, i.e., a fake image's assigned label may diverge from its ground-truth label. The label adjustment can alleviate this issue.
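The label-adjustment step itself is simple: each assigned label is replaced by the pre-trained teacher's prediction (a hard label in classification, a scalar in regression). A sketch with a hypothetical stand-in teacher:

```python
def adjust_labels(fakes, teacher_predict):
    # Replace each fake sample's assigned label with the teacher's prediction,
    # storing the teacher's knowledge in the relabeled dataset.
    return [(x, teacher_predict(x)) for x, _assigned in fakes]

# Hypothetical regression example: image codes as floats, labels as angles.
teacher = lambda x: 10.0 * x            # stand-in for a trained teacher regressor
fakes = [(1.2, 15.0), (3.0, 28.0)]      # (image_code, assigned_label) pairs
print(adjust_labels(fakes, teacher))
```

After this step the assigned labels play no further role; every label the student will ever see on a fake image comes from the teacher.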
3.5 M3: Transfer Knowledge via Data Augmentation
The adjusted samples are also called the processed fake samples. They are used to augment the original training set, and in M3 the student model is trained on the augmented dataset to transfer the knowledge distilled from the pre-trained teacher model. Please note that empirical studies in Section 5 show that, as the number of processed fake samples $n^g$ increases, the test error of the student model often does not stop decreasing until $n^g$ is larger than a certain threshold, and then it starts fluctuating over a small range. Since it is hard to obtain the optimal $n^g$ in practice and a hefty $n^g$ usually does not cause a significant adverse effect on precision, we suggest generating the maximum number of processed samples allowed by the computational budget.
M3 is different from existing KD methods because the distilled knowledge is transferred through samples instead of specially designed loss functions or network architectures. M3 is also distinct from existing GAN-based data augmentation methods [10, 46, 52, 62, 33], because they do not have the subsampling, filtering, and label adjustment steps.
3.6 Advantages of cGAN-KD
3.6.1 A Unified Knowledge Distillation Framework for Image Classification and Regression
Since all necessary steps in the workflow of cGAN-KD are applicable to both classification and regression (scalar response), cGAN-KD is actually a unified KD framework. Moreover, the theoretical analysis of cGAN-KD (see Section 4) in both tasks also has the same general formulation.
3.6.2 Compatible with State-of-the-art KD Methods
cGAN-KD distills and transfers knowledge based on fake samples, and it does not require extra loss functions or network architecture changes. Thus, cGAN-KD can be combined with many state-of-the-art KD methods for image classification to improve their performance. To embed a state-of-the-art KD method into cGAN-KD, we just need to train the student model on the augmented training set with this KD method in M3 but keep other procedures in Fig. 3 unchanged.
3.6.3 Suitable for Limited Labeled Data Scenarios
The performance of cGANs heavily deteriorates given a limited amount of labeled training data. Fortunately, DiffAugment  can effectively alleviate this problem. Besides DiffAugment, the subsampling and filtering modules in the cGAN-KD framework can also deal with this issue by removing low-quality fake samples. Therefore, cGAN-KD is very suitable for scenarios with limited labeled data.
3.6.4 Insensitive to the Teacher and Student Models' Architecture Difference
As shown by our experiments in Section 5 and some papers [35, 54], the architecture difference between a teacher model and a student model may influence the performance of some existing KD methods because these methods rely on logits or intermediate layers to transfer knowledge. Other KD methods such as SSKD  even require some adjustments to the teacher and student models' architectures. In contrast, since the proposed cGAN-KD framework distills and transfers knowledge via fake samples, there is no requirement on the teacher and student models' architectures, making cGAN-KD more flexible than other KD methods.
4 Theoretical Analysis
In this section, we derive the error bound of the student model $S$, which theoretically illustrates how the teacher model $T$ improves the precision of $S$ in the cGAN-KD framework. Before we move to the derivation, we first introduce some notation. Denote by $p_g$ the distribution of the unprocessed fake samples, and denote by $p_{g_1}$ and $p_{g_2}$ the distributions of the fake samples after being processed by M1 and M2, respectively. The evolution of the fake samples' distributions and datasets is visualized in Fig. 4. Additionally, we denote the augmented training dataset by $D^{aug} = D \cup D^{g_2}$, where $D$ is the set of $n$ real training samples and $D^{g_2}$ is the set of $n^g$ processed fake samples. Then, for a predictor $f$ and a distribution $p$ over image-label pairs, we write the expected risk as
$$R_{p}(f) = \mathbb{E}_{(\mathbf{x}, y) \sim p}\, \ell(f(\mathbf{x}), y),$$
where $\ell$ is either the CE loss for classification or the SE loss for regression. Let $f^*$ be the optimal predictor which minimizes $R_{p_r}(f)$, where $p_r$ is the true data distribution. We denote by $\mathcal{H}$ the hypothesis space of $S$. Note that $\mathcal{H}$ may not include $f^*$. Then, we define $\hat{S}$ as the student model trained on $D^{aug}$ and $\bar{S}$ as the minimizer of $R_{p_r}$ within $\mathcal{H}$.
The error bound of the student model in cGAN-KD is depicted by the distance of its expected risk from that of the optimal predictor, which is described in Theorem 1.
Theorem 1 (Error Bound).
(i.i.d. samples) $D$, $D^{g_2}$, and the augmented dataset $D^{aug} = D \cup D^{g_2}$ are considered as i.i.d. samples drawn from $p_r$, $p_{g_2}$, and a mixture distribution, i.e.,
$$p_{aug} = \frac{n}{n + n^g}\, p_r + \frac{n^g}{n + n^g}\, p_{g_2},$$
respectively, where $D$ is the set of $n$ real training samples, $p_r$ is the true data distribution, and $D^{g_2}$ is the set of $n^g$ processed fake samples with distribution $p_{g_2}$.
(Measurability) $\ell(f(\mathbf{x}), y)$ is measurable for all $f$ in the hypothesis space $\mathcal{H}$ of the student model.
(Distribution gap) There is a constant $C_0 > 0$ such that
$$d_{TV}(p_r, p_{g_2}) \le \epsilon_g + C_0\, \mathbb{E}_{(\mathbf{x}, y) \sim p_r}\, \ell(T(\mathbf{x}), y),$$
where $d_{TV}$ denotes the total variation distance  between two probability distributions, and $\epsilon_g$ accounts for the divergence introduced by the trained cGAN and the subsampling and filtering steps.
(Boundedness) There exists a constant $M > 0$ such that $|\ell(f(\mathbf{x}), y)| \le M$, $\forall f \in \mathcal{H}$.
Then, $\forall \delta \in (0, 1)$, with probability at least $1 - \delta$,
$$R_{p_r}(\hat{S}) - R_{p_r}(f^*) \le 4 M \hat{\mathfrak{R}}(\mathcal{H}) + 2 M \sqrt{\frac{\log(1/\delta)}{2 (n + n^g)}} + \frac{2 M n^g}{n + n^g}\, d_{TV}(p_r, p_{g_2}) + \big[R_{p_r}(\bar{S}) - R_{p_r}(f^*)\big], \tag{7}$$
where $R_{p}(f)$ denotes the expected loss of a predictor $f$ under a distribution $p$, $\hat{S}$ is the student model trained on the augmented dataset, $f^*$ is the optimal predictor, $\bar{S}$ is the minimizer of $R_{p_r}$ within $\mathcal{H}$, and $\hat{\mathfrak{R}}(\mathcal{H})$ stands for the empirical Rademacher complexity [39, Definition 3.1] of $\mathcal{H}$, which is defined on $n + n^g$ samples independently drawn from $p_{aug}$.
We first decompose the excess risk (writing $R_p(f)$ for the expected loss of $f$ under a distribution $p$) as follows:
$$R_{p_r}(\hat{S}) - R_{p_r}(f^*) = \big[R_{p_r}(\hat{S}) - R_{p_r}(\bar{S})\big] + \big[R_{p_r}(\bar{S}) - R_{p_r}(f^*)\big]. \tag{8}$$
The second term in Eq. (8) is a non-negative number because the student model's hypothesis space $\mathcal{H}$ may not cover the optimal predictor $f^*$. The first term of Eq. (8) can be bounded as follows. Using the triangle inequality and A4 (i.e., the boundedness of $\ell$) yields that
$$\big|R_{p_r}(f) - R_{p_{aug}}(f)\big| \le 2 M\, d_{TV}(p_r, p_{aug}),$$
where $f$ is a measurable function. Thus,
Remark 1 (Rationality of A3 and A4).
In the cGAN-KD framework, processed fake images are used to augment the training set, so the distribution gap between the true data distribution and the distribution of processed fake samples (measured by the total variation distance) should have a significant impact on the student model's performance. Thus, in A3 of Theorem 1, we model the distribution gap by the summation of two components. The first component stands for the divergence caused by the trained cGAN and the subsampling and filtering steps. The second component is controlled by the generalization performance of the teacher model, i.e., the expected loss of the trained teacher model over the true data distribution.
It is also worth discussing the rationality of A4. The two types of learning tasks considered in this work are regression and classification, for which we use the square loss and the cross entropy loss, respectively. Let us consider the regression task first. In our experiments on the regression datasets, the last layer of the network is activated by the ReLU function [12, 11], so the outputs are non-negative. Since the regression labels are bounded, as long as the trained model is not a too bad predictor, it should not output arbitrarily large values, which implies the square loss can be bounded by a positive constant, so A4 is satisfied. For the classification task, a sufficient condition for A4 is that the predicted probability of the true label is bounded away from zero, representing that our classifier cannot produce a zero probability for the true label, which is reasonable in practice.
Remark 2 (Illustration of Theorem 1).
The four terms on the right side of Eq. (7) show that the error of the student model may come from four aspects. The first and last terms are only relevant to the nature of the hypothesis space $\mathcal{H}$, so they are not influenced by the number of processed fake samples $n^g$. If the models in $\mathcal{H}$ do not output arbitrarily extreme predictions (as discussed in Remark 1), the loss bound $M$ stays at a moderate level, implying the first term is also small. The last term is inevitable because $\mathcal{H}$ may not include the optimal predictor $f^*$. The second term diminishes if we set $n^g$ large. For the third term, its coefficient converges to a constant as $n^g$ increases; then the third term is only controlled by the loss bound and the distribution gap. To reduce the distribution gap, we can either improve the cGAN model, the subsampling, and the filtering, or choose a teacher model with better generalization performance.
Therefore, Theorem 1 implies that, when implementing cGAN-KD, we should (1) use state-of-the-art cGANs and subsampling methods, (2) set $n^g$ large, and (3) choose a teacher model with as high precision as possible.
5 Experiments
This section aims to experimentally demonstrate the effectiveness of the proposed cGAN-KD framework in image classification and regression (scalar response) tasks when limited training data are available. We conduct extensive experiments on four image datasets: CIFAR-10  and Tiny-ImageNet  for image classification; RC-49 [7, 8] and UTKFace  for image regression. Candidate baseline KD methods in the classification tasks are NOKD (i.e., no KD method is applied), BLKD , TAKD , SSKD , BLKD+UDA , the proposed cGAN-KD, and state-of-the-art KD methods incorporated into our cGAN-KD. In the regression tasks, candidate methods only include NOKD and cGAN-KD. Please note that, for image regression, as suggested by [7, 8], regression labels are normalized to real numbers in $[0, 1]$ when training the CcGANs, the teacher models, and the student models. Nevertheless, in the evaluation stage of the teacher and student models, we compute the mean absolute error (MAE) on unnormalized regression labels. For detailed experimental setups, please refer to the supplementary material.
We first evaluate the effectiveness of the proposed cGAN-KD framework on the CIFAR-10 dataset .
Experimental setup: CIFAR-10 consists of 60,000 $32 \times 32$ RGB images uniformly drawn from 10 classes. The overall number of training samples is 50,000 (5,000 per class), and the remaining 10,000 samples (1,000 per class) are for testing. To compare our proposed method with existing KD methods when limited training data are available, we design three settings denoted respectively by C-50K, C-20K, and C-10K with different numbers of training samples. Specifically, all 50,000 training samples are available in C-50K. C-20K has 20,000 randomly selected training samples (about 2,000 per class). In C-10K, the number of training samples is further reduced to 10,000 (about 1,000 per class) to simulate the limited training data scenario.
Next, to select student and teacher models for this experiment, some popular classifiers are trained from scratch in each setting, and their test errors are shown in Table S.8.9 in the supplementary material. We choose three teacher models (i.e., MobileNet V2 , ResNet-18 , and DenseNet-121 ) with similarly high precision and three student models (i.e., ShuffleNet V2 , efficientnet-b0 , and VGG11 ) with similarly low precision based on their performance. Note that although MobileNet V2 is a popular lightweight model, its performance on CIFAR-10 is surprisingly good and comparable to the other two teachers; therefore, MobileNet V2 is chosen as a teacher model in our experiment. To implement BLKD and TAKD, we set the temperature $T$ and the trade-off hyperparameter $\lambda$ following . VGG13 is chosen as the TA model in TAKD. When implementing SSKD, MobileNet V2 performs poorly after the necessary architecture adjustment. Additionally, efficientnet-b0 and DenseNet-121 are not supported by the official implementation of SSKD, thus we only consider one teacher model (ResNet-18) and two student models (VGG11 and ShuffleNet V2) for SSKD. To implement the proposed framework, we train one BigGAN model  for each setting. DiffAugment  is also incorporated into the BigGAN training in C-20K and C-10K due to the limited training samples. In all three settings, no matter the teacher-student combination, we use DenseNet-121 to do the filtering and label adjustment due to its highest average precision. The optimal $\rho$ in Alg. 1 selected by Alg. 2 is 0.9, 0.6, and 0.7 for the three settings, respectively. In the experiments with SSKD and BLKD+UDA, we only consider the combination of cGAN-KD with these two KD methods due to limited computational resources.
An ablation study is designed to test the effectiveness of the subsampling, filtering, and label adjustment modules of the cGAN-KD framework in the C-10K setting, aiming to show how cGAN-KD performs if these three modules are added into the framework one by one. A second ablation study is conducted in the C-10K setting to analyze the effect of the number of processed fake samples $n^g$, where $n^g$ is varied over a wide range starting from 0.
Please refer to Supp. S.8 for more detailed setups.
Quantitative results: The quantitative comparison results of the main study among different baseline methods are shown in Tables I, II, and III. Please note that since SSKD needs to modify network architectures, and BLKD+UDA incorporates UDA into the teachers' training (so the teacher model's precision may change), their performances are not directly comparable with each other. For the same reason, it is also inappropriate to compare them with BLKD and TAKD. Therefore, these existing KD methods can be classified into three groups: (1) BLKD and TAKD, (2) SSKD, and (3) BLKD+UDA. We compare the proposed framework with them separately in Tables I - III. In Table I, we can see that cGAN-KD-related methods consistently outperform NOKD, BLKD , and TAKD  under all three settings and all teacher-student combinations. BLKD  and TAKD  are improved after being incorporated into the proposed cGAN-KD framework. We also observe that cGAN-KD leads to higher performance gains when there are fewer training samples. Tables II and III show that cGAN-KD can effectively improve the performance of SSKD  and BLKD+UDA . Additionally, we conduct ablation studies to evaluate the effect of the proposed M1, M2, and M3 modules and the parameter $n^g$. The quantitative results of the ablation studies are visualized respectively in Figs. 5 and 6. Fig. 5 reveals the necessity of the subsampling, filtering, and label adjustment modules because their interaction results in the highest precision. Fig. 6 shows that more processed fake images stabilize the student models' performance without significantly decreasing the precision, which confirms the suggestion of using a large $n^g$.
| Setting | Teacher (test err.) | Student | NOKD | BLKD | TAKD | cGAN-KD | cGAN-KD+BLKD | cGAN-KD+TAKD |
|---|---|---|---|---|---|---|---|---|
| C-50K | MobileNet V2 (5.92) | VGG11 | 8.42 | 6.72 | 7.35 | 6.83 | 5.89 | 6.23 |
| C-20K | MobileNet V2 (9.28) | VGG11 | 12.48 | 10.94 | 11.48 | 10.81 | 9.60 | 9.39 |
| C-10K | MobileNet V2 (13.53) | VGG11 | 18.57 | 15.76 | 15.98 | 14.32 | 12.65 | 12.28 |
| Setting | Teacher (test err.) | Student | NOKD | SSKD | cGAN-KD+SSKD |
|---|---|---|---|---|---|
| C-50K | MobileNet V2 (5.40) | VGG11 | 8.18 | 6.18 | 5.68 |
| C-20K | MobileNet V2 (9.25) | VGG11 | 12.28 | 9.57 | 9.19 |
| C-10K | MobileNet V2 (13.43) | VGG11 | 18.61 | 15.14 | 12.54 |
This experiment further demonstrates the effectiveness of cGAN-KD on the Tiny-ImageNet dataset .
Experimental setup: Tiny-ImageNet contains 200 image classes with 500 images per class for training and 50 images per class for testing. Most images are RGB images of size ; the few grey-scale images are excluded from this experiment.
Based on the test errors of some popular networks (refer to Fig. S.9.11), we choose two teacher models (ResNet-50 and DenseNet-121) and two student models (ShuffleNet V2 and VGG11). When implementing TAKD, MobileNet V2 is chosen as the TA model because its performance is roughly the average of the teachers' and the students'. To generate fake samples, we adopt the BigGAN model and DiffAugment. Since DenseNet-121 performs best in this experiment, it is used for the filtering and label adjustment. Other experimental setups are similar to those of the CIFAR-10 experiment, except that all training images are used. Please refer to Supp. S.9 for more details.
Experimental setup: The RC-49 dataset is made by rendering 49 3-D chair models individually. Each chair model is rendered at 899 yaw angles from to with a stepsize of . A yaw angle is selected for training if its last digit is odd, so only 450 angles are in the training set while the others are left for testing. This dataset contains 44,051 RGB images of size with corresponding yaw angles as labels. In this experiment, we design three settings, denoted respectively by R-25, R-15, and R-5, where the number in each setting name specifies the number of images per distinct angle in the training set. For example, in the R-25 setting, each of the 450 training angles has 25 images, so 11,250 images are available for training. In all three settings, all images not used for training are held out for testing.
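The odd-last-digit selection rule above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the exact angle-grid endpoints are elided in the text, so we assume the 899 yaw angles form the grid 0.1°, 0.2°, ..., 89.9°, which reproduces the stated count of 450 training angles.

```python
def rc49_split(num_angles=899, step=0.1):
    """Split RC-49 yaw angles into train/test by the odd-last-digit rule.

    Assumption (elided in the text): angles run from 0.1 to 89.9 degrees
    in 0.1-degree steps. An angle goes to the training set iff the last
    digit of its 0.1-degree grid index is odd.
    """
    angles = [round((i + 1) * step, 1) for i in range(num_angles)]
    train, test = [], []
    for a in angles:
        last_digit = int(round(a * 10)) % 10  # last digit on the 0.1-degree grid
        (train if last_digit % 2 == 1 else test).append(a)
    return train, test

train_angles, test_angles = rc49_split()
# 450 angles (0.1, 0.3, ...) for training; the remaining 449 are held out.
```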
Three student models (ShuffleNet V2 , MobileNet V2 , and efficientnet-b0 ) and one teacher model (VGG16 ) are selected in this experiment based on the performance of some popular networks (refer to Table S.10.13). Since no general KD method exists for image regression tasks with a scalar response, we only compare cGAN-KD with NOKD. For cGAN-KD, we adopt the SNGAN architecture  and train one CcGAN model (SVDL+ILI)  with DiffAugment  for each setting. We use VGG16 to do the filtering and label adjustment. Alg. 2 is applied to select the optimal in the filtering module. 50,000 processed fake samples are generated to augment the training set in each setting. Detailed experimental setups can be found in Supp. S.10.
Similar to the CIFAR-10 experiment, an ablation study is conducted to show the necessity of the subsampling, filtering, and label adjustment modules. Another ablation study is conducted to show the effect of .
Quantitative results: The quantitative results are shown in Table VII. cGAN-KD outperforms NOKD by a large margin, and the performance enhancement is even more significant when fewer training samples are available. The quantitative results of the two ablation studies are visualized in Figs. 7 and 8. The conclusions from these ablation studies are consistent with those of the CIFAR-10 experiment, except that the label adjustment module's effect is more substantial in regression than in classification. One explanation of this difference is that the label inconsistency issue of CcGANs for regression is more severe than that of BigGAN for classification, and the label adjustment module effectively alleviates this problem.
| Setting | Teacher (test MAE) | Student | NOKD | cGAN-KD |
|---|---|---|---|---|
| R-25 | VGG16 (0.20) | ShuffleNet V2 | 0.52 | 0.34 |
| R-15 | VGG16 (0.31) | ShuffleNet V2 | 0.76 | 0.40 |
| R-5 | VGG16 (0.49) | ShuffleNet V2 | 1.95 | 0.95 |
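The label adjustment module, whose effect the ablation above isolates, can be summarized in a short sketch: each cGAN-generated sample's conditioned label is replaced by the teacher model's prediction, so the teacher's knowledge is carried by the labels of the augmented training set. This is a conceptual illustration only; `teacher_predict` is a hypothetical stand-in for the trained teacher's forward pass, not the paper's actual implementation.

```python
def adjust_labels(fake_samples, teacher_predict):
    """Label adjustment (sketch): relabel each cGAN-generated sample
    with the teacher's prediction, replacing the possibly inconsistent
    label the cGAN was conditioned on."""
    return [(x, teacher_predict(x)) for x, _ in fake_samples]

# Toy regression example with made-up numbers: the (hypothetical)
# teacher overrides the conditioned labels of two fake samples.
fake = [([0.2, 0.4], 30.0), ([0.6, 0.8], 55.0)]      # (features, conditioned label)
teacher = lambda x: round(100 * sum(x) / len(x), 1)  # hypothetical teacher
adjusted = adjust_labels(fake, teacher)
```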
The last experiment evaluates the performance of cGAN-KD on UTKFace , another benchmark regression dataset.
Experimental setup: UTKFace is an RGB human face image dataset with ages as regression labels. We use the processed UTKFace dataset [8, 7], which consists of 14,760 RGB images with ages in [1, 60]. The number of images per age ranges from 50 to 1051, and all images are of size . Among these images, 80% are randomly selected to form the training set, and the rest are held out for testing.
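The random 80/20 split described above can be sketched as follows; the seed and index-based formulation are our assumptions for reproducibility, since the paper does not specify which images fall into each split.

```python
import random

def split_utkface(n_images=14760, train_frac=0.8, seed=0):
    """Sketch of a random 80/20 train/test split over image indices.
    The seed is an assumption; the paper's actual split is unspecified."""
    idx = list(range(n_images))
    random.Random(seed).shuffle(idx)  # reproducible shuffle
    n_train = int(train_frac * n_images)
    return idx[:n_train], idx[n_train:]

train_idx, test_idx = split_utkface()
# 11,808 training images and 2,952 held-out test images.
```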
Similar to the RC-49 experiment, three student models (ShuffleNet V2 , MobileNet V2 , and efficientnet-b0 ) and one teacher model (VGG16 ) are selected in this experiment based on the performance of some popular networks (please refer to Table S.11.15). For cGAN-KD, we adopt the SAGAN architecture  and DiffAugment  in the CcGAN (SVDL+ILI) training . We apply VGG16 to conduct the filtering and label adjustment and the optimal is selected by following Alg. 2. 80,000 processed fake samples are generated to augment the training set. Please refer to Supp. S.11 for detailed setups.
Quantitative results: The quantitative results of this experiment are summarized in Table VIII. We observe that cGAN-KD performs substantially better than NOKD. Note also that, in the scenario with efficientnet-b0 as the student model, the test MAE of cGAN-KD is even smaller than that of the teacher model. This phenomenon implies that the CcGAN model may generate some human faces that are quite different from those in the training set; in other words, the CcGAN model may synthesize new information.
| Teacher (test MAE) | Student | NOKD | cGAN-KD |
|---|---|---|---|
| VGG16 (5.14) | ShuffleNet V2 | 7.326 | 5.774 |
This work proposes the first unified knowledge distillation framework applicable to both classification and regression tasks in limited-data scenarios. Fundamentally different from existing knowledge distillation methods, the proposed framework, termed cGAN-KD, distills and transfers knowledge from teacher to student models through cGAN-generated samples. First, cGAN models are trained to generate a sufficient number of fake image samples. Then, high-quality samples are obtained via subsampling and filtering procedures. The knowledge is distilled by adjusting the fake images' labels using the teacher model. Finally, the distilled knowledge is transferred to student models by training them on these knowledge-conveying samples. The proposed framework is particularly effective when labeled training data are scarce. Moreover, it is architecture-agnostic and compatible with existing state-of-the-art knowledge distillation methods. We also derive the error bound of a student model trained in the cGAN-KD framework for theoretical guidance. Extensive experiments demonstrate that cGAN-KD-incorporated methods achieve state-of-the-art knowledge distillation performance on both classification and regression tasks.
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grants CRDPJ 476594-14, RGPIN-2019-05019, and RGPAS2017-507965.
-  (2020) Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §3.4.
-  (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, Cited by: 1st item, §1, §2.2, §2.2, §2.3, §3.2, §5.1, §5.2.
-  (2006) Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535–541. Cited by: §1.
-  A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. Cited by: §2.1.
-  (2019) On the evaluation of conditional GANs. arXiv preprint arXiv:1907.08175. Cited by: §2.3.
-  (2021) Efficient subsampling for generating high-quality images from conditional generative adversarial networks. arXiv preprint arXiv:2103.11166. Cited by: §1, §S.10.1, §2.3, §3.3, §S.8.1.
-  (2021) CcGAN: continuous conditional generative adversarial networks for image generation. In International Conference on Learning Representations, Cited by: 1st item, §1, §2.2, §2.2, §2.3, §3.2, §5.3, §5.4, §5.
-  (2020) Continuous conditional generative adversarial networks for image generation: novel losses and label input mechanisms. arXiv preprint arXiv:2011.07466. Cited by: 1st item, §1, §S.10.1, §2.2, §2.2, §3.2, §5.3, §5.3, §5.4, §5.4, §5.
-  (2020) Subsampling generative adversarial networks: density ratio estimation in feature space with softplus loss. IEEE Transactions on Signal Processing 68, pp. 1910–1922. Cited by: §1.
-  (2018) Synthetic data augmentation using GAN for improved liver lesion classification. In 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018), pp. 289–293. Cited by: §1, §1, §3.5.
-  Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and cooperation in neural nets, pp. 267–285. Cited by: Remark 1.
-  Visual feature extraction by a multilayered network of analog threshold elements. IEEE Transactions on Systems Science and Cybernetics 5 (4), pp. 322–333. Cited by: Remark 1.
-  (2002) On choosing and bounding probability metrics. International statistical review 70 (3), pp. 419–435. Cited by: §4, Theorem 1.
-  (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pp. 2672–2680. Cited by: §1.
-  (2020) Knowledge distillation: a survey. arXiv preprint arXiv:2006.05525. Cited by: §1, §3.
-  Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.1, §5.2.
-  (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: §2.3.
-  (2015) Distilling the knowledge in a neural network. NIPS Deep Learning Workshop. Cited by: §1, §2.1, §2.1, §3.5, §5.1, TABLE I, TABLE IV, §5.
-  (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §1, §5.1, §5.2.
-  (2020) Training generative adversarial networks with limited data. arXiv preprint arXiv:2006.06676. Cited by: §2.2.
-  (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4401–4410. Cited by: §1, §2.2.
-  (2020) Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119. Cited by: §1, §2.2.
-  (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §5.1, §5.
-  Concentration of measure. Cited by: §4.
-  (2015) Tiny ImageNet visual recognition challenge. Cited by: §5.2, §5.
-  (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, Vol. 3. Cited by: §3.4.
-  (2017) Mimicking very efficient network for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6356–6364. Cited by: §1.
-  (2017) Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2935–2947. Cited by: §1.
-  (2020) KTAN: knowledge transfer adversarial network. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. Cited by: §1, §1.
-  Face model compression by distilling knowledge from neurons. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30. Cited by: §1.
-  (2018) Graph distillation for action detection with privileged modalities. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 166–183. Cited by: §1.
-  (2018) Shufflenet V2: practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pp. 116–131. Cited by: §5.1, §5.2, §5.3, §5.4.
-  (2018) BAGAN: data augmentation with balancing GAN. arXiv preprint arXiv:1803.09655. Cited by: §1, §1, §3.5.
-  (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §1, §2.2.
-  (2020) Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 5191–5198. Cited by: §1, §2.1, §2.1, §3.5, §3.6.4, §5.1, §5.2, TABLE I, TABLE IV, §5, §S.8.1.
-  (2012) Making a “completely blind” image quality analyzer. IEEE Signal processing letters 20 (3), pp. 209–212. Cited by: Fig. S.10.10, §S.10.2, Fig. S.8.9, §S.8.2.
-  (2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, Cited by: §1, §S.10.1, §2.2, §5.3.
-  (2018) CGANs with projection discriminator. In International Conference on Learning Representations, Cited by: §1, §2.2, §2.3.
-  (2018) Foundations of machine learning. MIT Press. Cited by: Theorem 1.
-  (2017) Conditional image synthesis with auxiliary classifier gans. In International conference on machine learning, pp. 2642–2651. Cited by: §1, §2.2.
-  (2019) Few-shot image recognition with knowledge transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 441–449. Cited by: §1.
-  (2019) The state of knowledge distillation for classification. arXiv preprint arXiv:1912.10850. Cited by: §1, §3.5, §5.1, §5.1, TABLE III, TABLE VI, §5.
-  (2018) MobileNet V2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: §5.1, §5.2, §5.3, §5.4.
-  (2019) Meal: multi-model ensemble via adversarial learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4886–4893. Cited by: §1, §1.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §5.1, §5.2, §5.3, §5.4.
-  (2018) RenderGAN: generating realistic labeled data. Frontiers in Robotics and AI 5, pp. 66. Cited by: §1, §1, §3.5.
-  Efficientnet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114. Cited by: §5.1, §5.3, §5.4.
-  (2020) On data augmentation for GAN training. Cited by: §2.2.
-  (2019) Deepvid: deep visual interpretation and diagnosis for image classifiers via knowledge distillation. IEEE transactions on visualization and computer graphics 25 (6), pp. 2168–2180. Cited by: §1.
-  (2021) Knowledge distillation and student-teacher learning for visual intelligence: a review and new outlooks. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §3.
-  (2018) KDGAN: knowledge distillation with generative adversarial networks.. In NeurIPS, pp. 783–794. Cited by: §1, §1.
-  (2018) Conditional infilling GANs for data augmentation in mammogram classification. In Image Analysis for Moving Organ, Breast, and Thoracic Images, pp. 98–106. Cited by: §1, §1, §3.5.
-  (2020) Unsupervised data augmentation for consistency training. In Advances in Neural Information Processing Systems, Vol. 33, pp. 6256–6268. Cited by: §1, §2.1.
-  (2020) Knowledge distillation meets self-supervision. In European Conference on Computer Vision, pp. 588–604. Cited by: §1, §2.1, §2.1, §2.1, §3.5, §3.6.4, §5.1, TABLE II, TABLE V, §5.
-  (2018) Training shallow and thin networks for acceleration via knowledge distillation with conditional adversarial networks. ICLR 2018 Workshop. Cited by: §1, §1.
-  (2018) Better and faster: knowledge transfer from multiple self-supervised learning tasks via graph distillation for video classification. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 1135–1141. Cited by: §1.
-  (2019) Self-attention generative adversarial networks. In International Conference on Machine Learning, pp. 7354–7363. Cited by: §1, §2.2, §5.4.
Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5810–5818. Cited by: §5.4, §5.
-  (2020) Distilling ordinal relation and dark knowledge for facial age estimation. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §1, §2.1, §3.
-  (2020) Differentiable augmentation for data-efficient GAN training. Advances in Neural Information Processing Systems 33. Cited by: 1st item, §1, §2.2, §3.2, §3.6.3, §5.1, §5.2, §5.3, §5.4.
-  (2020) Image augmentations for GAN training. Cited by: §2.2.
-  (2018) Emotion classification with data augmentation using generative adversarial networks. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 349–360. Cited by: §1, §1, §3.5.
S.7 GitHub Repository
Please find some example code for this paper at
S.8 More Details of Experiments on the CIFAR-10 Dataset
S.8.1 Experimental Setups
For each experimental setting described in Section 5.1, we implement all KD methods as follows.
All classifiers in this experiment, except those related to SSKD, are trained for 350 epochs with the SGD optimizer, an initial learning rate of 0.1 (decayed at epochs 150 and 250 by a factor of 0.1), weight decay, and batch size 128.
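The step-decay schedule above (equivalent to PyTorch's `MultiStepLR` with milestones [150, 250] and gamma 0.1) can be written as a small helper. This is a sketch of the schedule only, not the authors' training code.

```python
def lr_at_epoch(epoch, base_lr=0.1, milestones=(150, 250), gamma=0.1):
    """Step-decay learning rate: multiply base_lr by gamma at each
    milestone epoch that has already been reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# lr is 0.1 for epochs 0-149, 0.01 for 150-249, and 0.001 for 250-349.
```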
To determine the teacher and student models, we first train some popular classifiers from scratch; their test errors are shown in Table S.8.9. ShuffleNet V2, efficientnet-b0, and VGG11 perform the worst in the three settings, so they are chosen as the student models. Conversely, MobileNet V2, ResNet-18, and DenseNet-121 are chosen as the teacher models. Although VGG11 has many more parameters than DenseNet-121, it has the highest inference speed among these classifiers; therefore, VGG11 is treated as a lightweight model in this paper. Please also note that, in all settings, we use DenseNet-121 to do the filtering and label adjustment in the cGAN-KD framework because DenseNet-121 has the highest average precision over the three settings.
When implementing TAKD, we borrow some code from its official implementation at https://github.com/imirzadeh/Teacher-Assistant-Knowledge-Distillation. In TAKD, the precision of a good TA model is usually around the average of those of the teacher and student models. Therefore, VGG13 is chosen as the TA model based on Table S.8.9. SSKD is implemented based on https://github.com/xuguodong03/SSKD, and we use its default training setups at https://github.com/xuguodong03/SSKD/blob/master/command.sh. BLKD+UDA is implemented based on https://github.com/karanchahal/distiller.
The setup of cGAN-KD is as follows. The implementation of BigGAN is mainly based on https://github.com/ajbrock/BigGAN-PyTorch. The BigGAN model is trained for 2000, 2000, and 6000 epochs in C-50K, C-20K, and C-10K, respectively, with batch size 512. DiffAugment is enabled in C-20K and C-10K, and we use its official implementation at https://github.com/mit-han-lab/data-efficient-gans; the strongest transformation combination (Color + Translation + Cutout) is used in training. cDRE-F-cSP+RS is fitted by using the setups and code at https://github.com/UBCDingXin/cDRE-based_Subsampling_cGANS. Different from , we use a specially designed DenseNet-121 instead of ResNet-34 to extract features for density ratio estimation. The grid search results for selecting the optimal in M1 are shown in Table S.8.10. In each setting, the optimal minimizes the average validation error of the three student models.
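The grid-search criterion just described (pick the value minimizing the average validation error over the three student models) can be sketched as below. The function names, candidate values, and error numbers are all hypothetical placeholders; the paper's actual grid and errors are in Table S.8.10.

```python
def select_optimal_value(candidates, student_models, validation_error):
    """Grid search sketch: return the candidate value that minimizes the
    average validation error over the student models. `validation_error`
    is a hypothetical callable (student, value) -> error."""
    def avg_error(v):
        return sum(validation_error(s, v) for s in student_models) / len(student_models)
    return min(candidates, key=avg_error)

# Toy illustration with made-up validation errors for three students:
errs = {0.5: [10.0, 12.0, 11.0], 0.7: [9.0, 11.5, 10.5], 0.9: [9.5, 12.5, 11.0]}
best = select_optimal_value(
    candidates=[0.5, 0.7, 0.9],
    student_models=[0, 1, 2],
    validation_error=lambda s, v: errs[v][s],
)
```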
In each setting, a student model trained by each KD method is evaluated on the 10,000 held-out test samples of CIFAR-10. The performance of these student models is measured by the test error rate, i.e., the proportion of incorrectly classified test samples.
Please refer to our code for more detailed experimental setups.