in recent years. Such achievement is largely attributable to several essential factors, including the availability of massive data, the rapid development of computing hardware, and more efficient optimization algorithms. Owing to the tremendous success of deep learning and the open-source spirit encouraged by the research community, an enormous number of pretrained deep networks can now be obtained freely from the Internet.
However, many problems may occur when we deploy these pretrained models in real-world scenarios. One prominent obstacle is that a pretrained deep model obtained online is usually large in volume, consuming computing resources that low-capacity edge devices cannot afford. A large body of literature has been devoted to compressing cumbersome deep models into more lightweight ones, among which Knowledge Distillation (KD) is one of the most popular paradigms. In most existing KD methods, given the original training data or alternative data similar to it, a lightweight student model learns from the pretrained teacher by directly imitating its output. We term these methods data-driven KD.
Unfortunately, the training data of released pretrained models is often unavailable due to privacy, transmission, or legal issues, as illustrated in Figure 1. One strategy to deal with this problem is to use some alternative data, but this raises a new problem: users are utterly ignorant of the data domain, making it almost impossible to collect similar data. Moreover, even if the domain information is known, it is still onerous and expensive to collect a large amount of data. Another compromise in this situation is to use somewhat unrelated data for training. However, this drastically deteriorates the performance of the student due to the incurred data bias.
An effective way to avert the problems mentioned above is to use synthetic samples, leading to data-free knowledge distillation [26, 6, 27]. Data-free distillation is a new research area where traditional generation techniques such as GANs and VAEs cannot be directly applied due to the lack of real data. Nayak et al. and Chen et al. have made some pilot studies on this problem. In Nayak's work, some "Data Impressions" are constructed from the teacher model. Besides, in Chen's work
, some one-hot samples are generated to highly activate the neurons of the teacher model. These exploratory studies achieve impressive results on classification tasks but still have several limitations. For example, their generation constraints are empirically designed based on the assumption that an appropriate sample usually has a high degree of confidence in the teacher model. In fact, the model maps samples from the data space to a very small output space, losing a large amount of information; it is difficult to construct samples with a fixed criterion on such a limited space. Besides, these existing data-free methods [6, 27] only take the fixed teacher model into account, ignoring information from the student, which means the generated samples cannot be customized for the student model.
To avoid the one-sidedness of empirically designed constraints, we propose a data-free adversarial distillation framework that adaptively customizes training samples for the teacher and student models. In our work, a model discrepancy is introduced to characterize the functional difference between the two models. We construct an optimizable upper bound for this discrepancy so that it can be reduced to train the student model. The contributions of our proposed framework can be summarized in three points:
We propose the first adversarial training framework for data-free knowledge distillation. To our knowledge, it is also the first data-free distillation approach that can be applied to semantic segmentation.
We introduce a novel method to quantitatively measure the discrepancy between models without any real data.
Extensive experiments demonstrate that the proposed method not only performs significantly better than existing data-free methods, but also yields results comparable to some data-driven approaches.
2 Related Work
2.1 Knowledge Distillation (KD)
Knowledge distillation aims at learning a compact yet comparable student model from pretrained teacher models. With a teacher-student training schema, it efficiently reduces the complexity and redundancy of the large teacher model. To extend the KD framework, researchers have proposed several techniques. According to their data requirements, we divide these methods into two categories: data-driven knowledge distillation and data-free knowledge distillation.
2.1.1 Data-driven Knowledge Distillation
Data-driven knowledge distillation requires real data to extract knowledge from teacher models. Bucilua et al. use a large-scale unlabeled dataset to obtain pseudo training labels from teacher models. To generalize this idea, Hinton et al. propose the concept of Knowledge Distillation (KD). In KD, the targets, softened by a temperature, are obtained from the teacher model. The temperature allows the student model to capture the similarities between different categories.
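The temperature-softened targets can be sketched in a few lines of Python; the logits and the temperature value here are illustrative, not taken from the paper:

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits divided by temperature T; T > 1 flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [6.0, 2.0, 1.0]
hard = softmax_with_temperature(logits, T=1.0)  # near one-hot
soft = softmax_with_temperature(logits, T=4.0)  # exposes inter-class similarities
```

With T = 4, the probability mass assigned to the non-maximal classes grows, which is exactly the "dark knowledge" about class similarity that the student can exploit.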
To learn more knowledge, some methods utilize intermediate representations as supervision. For example, Romero et al. train a student model by matching aligned intermediate representations. Moreover, Zagoruyko et al. add an attention-matching constraint so that the student network learns similar attention maps. In addition to classification tasks [14, 39, 10], knowledge distillation can also be applied to other tasks such as semantic segmentation [25, 17] and depth estimation. Recently, it has also been extended to multi-task settings [38, 34]. By learning from multiple models, the student model can combine knowledge from different tasks to achieve better performance.
2.1.2 Data-free Knowledge Distillation
The data-driven methods mentioned above are impractical when training data is not accessible. Intuitively, the parameters of a model are independent of its training data, so it should be possible to distill the knowledge without any real data; this is the goal of data-free methods.
To achieve this, Lopes et al. propose to store some metadata during training and reconstruct the training samples during distillation. However, this method still requires metadata at distillation time, so it is not completely data-free. Furthermore, Nayak et al. propose to craft Data Impressions (DI) as training data from random noise images. They model the softmax space as a Dirichlet distribution and update random noise images to obtain training data. Another kind of data-free distillation method synthesizes training samples directly with a generator. Chen et al. propose DAFL, in which the fixed teacher model serves as a discriminator. They use the generator to construct training samples that drive the teacher network to produce highly activated intermediate representations and one-hot predictions.
2.2 Generative Adversarial Networks (GANs)
GANs have demonstrated powerful capabilities in image generation [11, 31, 2] in the past few years. They set up a min-max game between a discriminator and a generator: the discriminator aims to distinguish generated data from real data, while the generator is dedicated to producing more realistic and indistinguishable samples to fool the discriminator. Through adversarial training, GANs can implicitly measure the difference between two distributions. However, GANs also face problems such as training instability and mode collapse [1, 12]. Arjovsky et al. propose Wasserstein GAN (WGAN) to make training more stable: WGAN replaces the traditional adversarial loss with an approximated Wasserstein distance under 1-Lipschitz constraints so that the gradients of the generator are more stable. Similarly, Qi et al. propose to regularize the adversarial loss with a Lipschitz regularization. In practical applications, GANs are highly scalable and can be extended to many tasks such as image-to-image translation [40, 16], image super-resolution [37, 24], and domain adaptation [36, 22]. These powerful capabilities make adversarial generation a qualified candidate for sample synthesis in data-free knowledge distillation.
3 Method

Harnessing the learned knowledge of a pretrained teacher model T, our goal is to craft a more lightweight student model S without any access to real-world data. To achieve this, we approximate the teacher T with a parameterized student S by minimizing the model discrepancy D(T, S), which indicates the functional difference between the teacher and the student. With this discrepancy, an optimal student model can be expressed as follows:

\[ S^{*} = \arg\min_{S} D(T, S) \tag{1} \]
In vanilla data-driven distillation, we design a loss function, e.g., Mean Square Error, and optimize it with real data. The loss function in this procedure can be seen as a specific measurement of the model discrepancy. However, this measurement becomes intractable when the original training data is unavailable. To tackle this problem, we introduce our Data-Free Adversarial Distillation (DFAD) framework, which approximately estimates the discrepancy so that it can be optimized to achieve data-free distillation.
3.1 Discrepancy Estimation
Given a teacher model T, a student model S, and a specific data distribution p(x), we first define a data-driven model discrepancy:

\[ D_{p}(T, S) = \mathbb{E}_{x \sim p(x)} \big[ \tfrac{1}{N} \, \lVert T(x) - S(x) \rVert_{1} \big] \tag{2} \]

The constant factor 1/N in Eqn. 2 normalizes by the number of elements N in the model output. This discrepancy simply measures the Mean Absolute Error (MAE) of the model outputs across all data points. Note that S is functionally identical to T if and only if they produce the same output for any input x. Therefore, if p(x) is a uniform distribution covering the whole data space, we can obtain the true model discrepancy D(T, S). Optimizing such a discrepancy is equivalent to training with random inputs sampled from the whole data space, which is obviously impossible due to the curse of dimensionality. To avoid estimating this intractable quantity, we introduce a generator network G to control the data distribution. As in GANs, the generator accepts a random variable z drawn from a distribution p(z) and generates a fake sample x = G(z). The discrepancy can then be evaluated with the generator:

\[ D_{G}(T, S) = \mathbb{E}_{z \sim p(z)} \big[ \tfrac{1}{N} \, \lVert T(G(z)) - S(G(z)) \rVert_{1} \big] \tag{3} \]
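As a concrete sketch of this MAE-based discrepancy, here is a pure-Python illustration; the output vectors are hypothetical values standing in for the teacher's and student's predictions on one sample:

```python
def mae_discrepancy(teacher_out, student_out):
    """Mean Absolute Error between two model outputs with N elements each
    (the per-sample term inside the expectation of Eqn. 2)."""
    assert len(teacher_out) == len(student_out)
    n = len(teacher_out)
    return sum(abs(t - s) for t, s in zip(teacher_out, student_out)) / n

t_out = [0.7, 0.2, 0.1]
s_out = [0.5, 0.3, 0.2]
d = mae_discrepancy(t_out, s_out)  # (0.2 + 0.1 + 0.1) / 3
```

The discrepancy is zero exactly when the two outputs coincide, which matches the functional-identity condition stated above.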
The key idea of our framework is to approximate the true discrepancy D(T, S) with D_G(T, S). In other words, we estimate the true discrepancy between the teacher and the student with a limited number of generated samples. In this work, we divide the generated samples into two types: "hard samples" and "easy samples". A hard sample produces a relatively large output difference between model T and model S, while an easy sample corresponds to a small difference. Suppose we have a generator that always generates hard samples; then, according to Eqn. 3, we obtain a "hard sample discrepancy" D_hard(T, S). Since hard samples always cause large output differences, the following inequality clearly holds:

\[ D_{hard}(T, S) \ge D_{p_u}(T, S) \tag{4} \]

In this inequality, p_u is the uniform distribution covering the whole data space, which comprises a large number of both hard and easy samples. The easy samples make D_{p_u}(T, S) numerically lower than the discrepancy estimated on hard samples alone. The inequality always holds as long as the generated samples are guaranteed to be hard samples. Under this constraint, D_hard(T, S) provides an upper bound for the real model discrepancy. Since our goal is to optimize the true model discrepancy, it can be achieved by optimizing this upper bound instead.
However, as the student model is trained, hard samples are gradually mastered by the student and converted into easy samples. Hence we need a mechanism that pushes the generator to continuously generate hard samples, which can be achieved by adversarial distillation.
3.2 Adversarial Distillation
To maintain the constraint of generating hard samples, we introduce a two-stage adversarial training scheme. As in GANs, our framework contains a generator and a discriminator. The generator G, as mentioned above, is used to generate hard samples. The student model S and the teacher model T are jointly viewed as the discriminator, which measures the hard sample discrepancy. The adversarial training process consists of two stages: the imitation stage, which minimizes the discrepancy, and the generation stage, which maximizes it, as shown in Fig. 2.
3.2.1 Imitation Stage
In this stage, we fix the generator and only update the student in the discriminator. We sample a batch of random noise vectors z from a Gaussian distribution and construct fake samples x = G(z) with the generator. Each sample is then fed to both the teacher and the student to produce the outputs T(x) and S(x). In classification tasks, the output is a vector indicating the scores of different categories; in other tasks such as semantic segmentation, it can be a matrix.
There are several ways to define the discrepancy that drives student learning. Hinton et al. utilize the KD loss, which can be the Kullback–Leibler Divergence (KLD) or the Mean Square Error (MSE), to train the student model. These loss functions are very effective in data-driven KD, yet problematic if directly applied to our framework. The main reason is that, when the student converges on the generated samples, both losses produce decayed gradients, which deactivate the learning of the generator and result in a dying minmax game. Hence, the Mean Absolute Error (MAE) between T(x) and S(x) is used as the loss function. We define the loss for the imitation stage as follows:

\[ \mathcal{L}_{imi} = \mathbb{E}_{z \sim p(z)} \big[ \tfrac{1}{N} \, \lVert T(G(z)) - S(G(z)) \rVert_{1} \big] \tag{5} \]
Given the outputs, the gradient of the loss with respect to the student output S(x) is shown in Eqn. 6: it is simply 1/N times the sign of the residual, so even when S(x) is very close to T(x) the gradient magnitude does not decay. This provides stable gradients for the generator, alleviating the vanishing-gradient problem.

\[ \frac{\partial \mathcal{L}_{imi}}{\partial S(x)} = -\tfrac{1}{N} \, \mathrm{sign}\big( T(x) - S(x) \big) \tag{6} \]
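The non-vanishing gradient behavior of MAE can be illustrated with a small pure-Python sketch; the output values are hypothetical:

```python
def mae_grad_wrt_student(teacher_out, student_out):
    """Analytic per-element gradient of the MAE loss w.r.t. the student output:
    -sign(t_i - s_i) / N. Its magnitude stays 1/N even when s is very close to t,
    unlike MSE, whose gradient (s_i - t_i) vanishes as the student converges."""
    n = len(student_out)
    def sign(v):
        return (v > 0) - (v < 0)
    return [-sign(t - s) / n for t, s in zip(teacher_out, student_out)]

# Student nearly matches the teacher, yet each gradient element still has magnitude 1/3:
g = mae_grad_wrt_student([0.30, 0.50, 0.20], [0.31, 0.48, 0.21])
```

This constant-magnitude signal is what keeps the minmax game alive once the student has fit the current batch of generated samples.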
Intuitively, this stage is very similar to KD, but the goals differ slightly. In KD, the student can greedily learn from the soft targets produced by the teacher, as these targets are obtained from real data and contain useful knowledge for the specific task. In our setting, however, we have no access to any real data, and the fake samples synthesized by the generator are not guaranteed to be useful, especially at the beginning of training. As aforementioned, the generator is required to produce hard samples to measure the model discrepancy between teacher and student. Thus, another essential purpose of the imitation stage, in addition to learning knowledge from the teacher, is to construct a better search space that forces the generator to find new hard samples.
3.2.2 Generation Stage
The goal of the generation stage is to push the generation of hard samples and maintain the constraint behind Inequality 4. In this stage, we fix the discriminator and only update the generator. This is inspired by the human learning process, where basic knowledge is learned at the beginning and more advanced knowledge is mastered by solving more challenging problems. Therefore, we encourage the generator to produce more confusing training samples. A straightforward way to achieve this is to take the negative MAE loss as the objective for optimizing the generator:

\[ \mathcal{L}_{gen} = -\mathcal{L}_{imi} = -\mathbb{E}_{z \sim p(z)} \big[ \tfrac{1}{N} \, \lVert T(G(z)) - S(G(z)) \rVert_{1} \big] \tag{7} \]
With the generation loss, the error first back-propagates through the discriminator, i.e., the teacher and student models, and then through the generator, yielding the gradients for optimizing the generator. The gradient from the teacher model is indispensable at the beginning of adversarial training, because the randomly initialized student provides practically no instructive information for exploring hard samples.
However, the training procedure with the objective in Eqn. 7 may be unstable if the student learns relatively slowly. By minimizing Eqn. 7, the generator tends to generate "abnormal" training samples that produce extremely different predictions when fed to the teacher and the student. This deteriorates the adversarial training process and makes the data distribution change drastically. It is therefore essential to ensure that the generated samples remain normal. To this end, we propose taking the log value of the MAE as an adaptive loss function for the generation stage:

\[ \mathcal{L}_{gen} = -\log \Big( \mathbb{E}_{z \sim p(z)} \big[ \tfrac{1}{N} \, \lVert T(G(z)) - S(G(z)) \rVert_{1} \big] \Big) \tag{8} \]
Unlike Eqn. 7, which always encourages the generator to produce hard samples with a large discrepancy, the new objective in Eqn. 8 gradually decays the generator's gradients toward zero as the discrepancy becomes large. This slows down the training of the generator and makes training more stable. Without the log term, we would have to carefully adjust the learning rate to keep training stable.
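The damping effect of the log term follows directly from the derivative d(-log d)/dd = -1/d: the larger the discrepancy, the smaller the generator's gradient. A tiny pure-Python sketch, with illustrative discrepancy values:

```python
import math

def gen_loss_plain(mae):
    """Eqn. 7-style objective: the negative MAE discrepancy."""
    return -mae

def gen_loss_log(mae):
    """Eqn. 8-style adaptive objective: negative log of the MAE discrepancy."""
    return -math.log(mae)

def log_grad_magnitude(mae):
    # |d/d(mae) of -log(mae)| = 1/mae: a large discrepancy yields a small
    # generator gradient, damping the generator when it races ahead of the student.
    return 1.0 / mae

g_small_disc = log_grad_magnitude(0.1)   # 10.0: strong push when samples are easy
g_large_disc = log_grad_magnitude(10.0)  # 0.1: gentle push when samples are already hard
```

In contrast, the plain objective of Eqn. 7 has a constant unit gradient in the discrepancy regardless of its magnitude, which is why the learning rate must be tuned so carefully without the log.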
Two-stage training. The whole distillation process is summarized in Algorithm 1. Our framework trains the student and the generator by repeating the two stages. It begins with the imitation stage, which minimizes the discrepancy; then, in the generation stage, we update the generator to maximize it. Based on the learning progress of the student model, the generator crafts hard samples to further estimate the model discrepancy. The competition in this adversarial game drives the generator to discover missing knowledge, leading to complete knowledge transfer. After several steps of training, the system ideally reaches a balance point at which the student has mastered all hard samples and the generator is no longer able to differentiate between the two models. In this case, S is functionally identical to T.
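A minimal PyTorch sketch of this two-stage loop, using tiny linear layers as stand-ins for the teacher, student, and generator; all dimensions, learning rates, iteration counts, and the value k = 5 here are illustrative, not the paper's actual configuration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
z_dim, x_dim, y_dim, k = 8, 16, 4, 5

teacher = torch.nn.Linear(x_dim, y_dim)    # stands in for the pretrained, frozen teacher
for p in teacher.parameters():
    p.requires_grad_(False)
student = torch.nn.Linear(x_dim, y_dim)
generator = torch.nn.Linear(z_dim, x_dim)

opt_s = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)

for step in range(20):
    # Imitation stage: generator fixed, update the student k times on fake samples.
    for _ in range(k):
        z = torch.randn(32, z_dim)
        x = generator(z).detach()                   # no gradient into the generator here
        loss_s = F.l1_loss(student(x), teacher(x))  # MAE discrepancy (Eqn. 5)
        opt_s.zero_grad()
        loss_s.backward()
        opt_s.step()

    # Generation stage: discriminator fixed, update the generator once
    # to enlarge the discrepancy (adaptive log objective, Eqn. 8).
    z = torch.randn(32, z_dim)
    x = generator(z)
    mae = F.l1_loss(student(x), teacher(x))
    loss_g = -torch.log(mae + 1e-6)                 # epsilon for numerical safety
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```

Note that in the generation stage the error back-propagates through both the teacher and the student before reaching the generator, which is exactly why the teacher's gradient matters early in training.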
Training Stability It is essential to maintain stability in adversarial training. In the imitation stage, we update the student model k times to ensure its convergence. However, since the generated samples are not guaranteed to be useful for our tasks, k cannot be set too large, as this leads to an extraordinarily biased student model. We find that setting k to 5 makes training stable. In addition, we suggest using the adaptive loss (Eqn. 8) in dense prediction tasks such as segmentation, where each pixel provides statistical information for adjusting the gradient. In classification tasks, only a few samples are used to compute the generation loss and the statistical information is less accurate, so the plain loss of Eqn. 7 is preferred.
Sample Diversity Unlike GANs, our approach naturally maintains the diversity of generated samples. When mode collapse occurs, it is easy for the student to fit the duplicated samples, resulting in a very low model discrepancy. In that case, the generator is forced to generate different samples to enlarge the discrepancy.
4 Experiments

We conduct extensive experiments to verify the effectiveness of the proposed method, exploring knowledge distillation on two types of models: classification models and segmentation models.
4.1 Experimental Settings
| Method | FLOPs | MNIST | FLOPs | CIFAR10 | FLOPs | CIFAR100 | FLOPs | Caltech101 |
|---|---|---|---|---|---|---|---|---|
| KD-ORI | 139K | 0.988 ± 0.001 | 557M | 0.939 ± 0.011 | 558M | 0.733 ± 0.003 | 595M | 0.775 ± 0.002 |
| KD-REL | 139K | 0.960 ± 0.006 | 557M | 0.912 ± 0.002 | 558M | 0.690 ± 0.004 | 595M | 0.748 ± 0.003 |
| KD-UNR | 139K | 0.957 ± 0.007 | 557M | 0.445 ± 0.012 | 558M | 0.133 ± 0.003 | 595M | 0.352 ± 0.015 |
| RANDOM | 139K | 0.747 ± 0.033 | 557M | 0.101 ± 0.002 | 558M | 0.015 ± 0.001 | 595M | 0.010 ± 0.000 |
| DAFL | 139K | 0.981 ± 0.001 | 557M | 0.885 ± 0.003 | 558M | 0.614 ± 0.005 | 595M | FAILED |
| Ours | 139K | 0.983 ± 0.002 | 557M | 0.933 ± 0.000 | 558M | 0.677 ± 0.003 | 595M | 0.735 ± 0.008 |
4.1.1 Models and Datasets
We adopt six pretrained models to demonstrate the effectiveness of the proposed method, trained on MNIST, CIFAR10, CIFAR100, and Caltech101 for classification, and on CamVid [4, 3] and NYUv2 for semantic segmentation. The models are named after their corresponding training data.
MNIST. MNIST  is a simple image dataset for recognition of handwritten digits containing 60,000 training images and 10,000 test images from 10 categories. Following [26, 6], we use a LeNet-5 as the pretrained teacher model and use a LeNet-5-Half as the student model.
CIFAR10 and CIFAR100. CIFAR10 and CIFAR100 both contain 60,000 RGB images, of which 50,000 are used for training and 10,000 for testing. CIFAR10 contains 10 classes, while CIFAR100 contains 100. Due to the small image resolution, we use a modified ResNet-34, which has only three downsampling layers, as our teacher, and a ResNet-18 as our student model.
Caltech101. Caltech101 is a classification dataset with 101 categories, each containing at least 40 images. We randomly split the dataset into two parts: a training set of 6,982 images and a test set of 1,695 images. During training, the images are resized and cropped to a fixed resolution. We use the standard ResNet-34 architecture as the teacher model and ResNet-18 as the student model.
CamVid. CamVid [4, 3] is a road scene segmentation dataset consisting of 367 training and 233 testing RGB images. There are 11 categories in CamVid, such as road, cars, poles, and traffic lights. Due to the difficulty of generating high-resolution images, we resize the short side of each image to 256 and train our teacher with random crops. The teacher model is a DeepLabV3 model with a ResNet-50 backbone; for the student model, we adopt a MobileNetV2 backbone.
NYUv2. NYUv2 is collected for indoor scene parsing. It provides 1,449 labeled RGB-D images with 13 categories and 407,024 unlabeled images. We use 795 pixel-wise labeled images to train our teacher and the remaining 654 images as the test set. As with CamVid, we resize and crop the images into fixed-size blocks for training and use DeepLabV3 as our model architecture.
4.1.2 Implementation Details and Evaluation Metrics
Our method is implemented in PyTorch on an NVIDIA Titan Xp. For training, we use SGD with momentum 0.9 and weight decay 5e-4 to update the student models, while the generator is trained with Adam. During training, the learning rates of SGD and Adam are decayed by 0.1 every 100 epochs. To measure the functional discrepancy, we use a large batch size for adversarial training: 512 for MNIST, 256 for CIFAR, and 64 for the other datasets. In our experiments, all models are randomly initialized, except that in semantic segmentation tasks the backbones of the teacher models are pretrained on ImageNet. More detailed hyperparameter settings for each dataset can be found in the supplementary materials.
To evaluate our methods, we use prediction accuracy as the metric for classification tasks. For semantic segmentation, we calculate the Mean Intersection over Union (mIoU) on the whole test set.
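The mIoU metric can be sketched on flattened label maps in pure Python; the toy labels and class count below are illustrative:

```python
def mean_iou(pred, gt, num_classes):
    """Mean Intersection-over-Union, averaged over classes that appear
    in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Toy flattened segmentation maps with 3 classes:
miou = mean_iou([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
```

In practice the per-class intersections and unions are accumulated over the entire test set before averaging, rather than per image.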
We compare against a set of baselines, both data-driven and data-free, to demonstrate the effectiveness of our proposed method. The baselines are briefly described as follows.
Teacher: the given pretrained model which serves as the teacher in the distillation process.
KD-ORI: the student trained with the vanilla KD  method on original training data.
KD-REL: the student trained with the vanilla KD on an alternative dataset similar to the original training data.
KD-UNR: the student trained with the vanilla KD on an alternative dataset unrelated to the original training data.
RANDOM: the student trained with randomly generated noise images.
DAFL: the student trained with DAta-Free Learning, without any real data.
4.2 KD in Classification Models
The test accuracy of our method and the compared baselines is provided in Table 2. To eliminate the effects of randomness, we repeat each experiment 5 times and report the mean and standard deviation of the highest accuracy. The first part of the table gives the results of data-driven distillation methods. KD-ORI requires the original training data, while KD-REL and KD-UNR use unlabeled alternative data for training. In KD-REL, the training data should be similar to the original training data; however, the domain gap between the alternative and original data is unavoidable and results in incomplete knowledge. As shown in the table, the accuracy of KD-REL is slightly lower than that of KD-ORI. Note that in our experiments the original training data is available, so we can easily find similar data for training. In the real world, however, we are ignorant of the domain information, which makes it almost impossible to collect similar data. In this case, blindly collected data may contain many unrelated samples, leading to the KD-UNR setting, where the incurred data bias makes training very difficult and deteriorates the performance of the student model.
The second part of the table shows the results of data-free distillation methods. We compare our method with DAFL using its released code. In this experiment, we set the batch size to 256 for CIFAR and 64 for Caltech101, and train each model for 500 epochs. Our adversarial learning method achieves the highest accuracy among the data-free methods, and its performance is even comparable to the data-driven methods. Note that with the Caltech101 batch size set to 64, DAFL fails, while our method is still able to learn a student model from the teacher. The influence of different batch sizes can be found in the supplementary materials.
Visualization of Generated Samples. The generated and real samples are shown in Figure 3. The images in the first row are produced by the generator during adversarial learning, and the real images are listed in the second row. Although the generated samples are not recognizable by humans, they can be used to craft a comparable student model, which means that realistic samples are not the only route to knowledge distillation. Comparing the generated samples on CIFAR10 and CIFAR100, we find that the generator on CIFAR100 produces more complicated samples than on CIFAR10. As the difficulty of classification increases, the teacher model becomes more knowledgeable, so in adversarial learning the generator can recover more complicated images. As mentioned above, the diversity of generated samples is guaranteed by the adversarial loss; in our results, the generator indeed maintains good image diversity, and almost every generated image is different.
Comparison between Loss Functions. It is essential to keep the adversarial game balanced: an appropriate adversarial loss should provide stable gradients during training. In this experiment, we explore four candidates: MAE, MSE, KLD, and MSE+MAE. By comparing the accuracy curves of the different loss functions, we find that MAE indeed provides the best results, owing to its stable gradients for the generator.
4.3 KD in Segmentation Models
Our method extends naturally to semantic segmentation tasks. In this experiment, we adopt an ImageNet-pretrained ResNet-50 to initialize the teacher model and train all student models from scratch. All models in the data-driven methods are trained with cropped images; for the data-free methods, images are generated directly for training. Table 3 shows the performance of the student models obtained with different methods. On CamVid, our method obtains a competitive student model even compared with KD-ORI, which requires the original training data. On NYUv2, our approach surpasses KD-UNR and all data-free methods, although it is not comparable to KD-ORI. In fact, our method is the first data-free distillation method shown to work on segmentation tasks.
The main difficulty DAFL encounters is that its one-hot constraint is detrimental to segmentation tasks, in which each pixel has strong correlations with its neighbors. In our framework, the generator is encouraged to produce complicated patterns by combining multiple pixels to make the game more challenging. As shown in Figure 6, the generator for CamVid indeed captures the co-occurrence of traffic lights and poles with reasonable spatial correlations. To further study these generated samples, we also train a student model with a fixed generator obtained from adversarial distillation, reaching an mIoU of 0.460. This demonstrates that the generator indeed learns "what should be generated."
5 Conclusion

This paper introduces a data-free adversarial distillation framework for model compression. We propose a novel method to estimate an optimizable upper bound of the intractable model discrepancy between the teacher and the student. Without any access to real data, we successfully reduce the discrepancy by optimizing this upper bound and obtain a comparable student model. Our experiments on classification and segmentation demonstrate that the framework is highly scalable and can be effectively applied to different network architectures. To the best of our knowledge, it is also the first effective data-free method for semantic segmentation. However, it is still very difficult to generate complicated samples. We believe that introducing human priors can effectively improve the generator by avoiding useless search space. In the future, we will explore the impact of different prior information on the proposed adversarial distillation framework.
References

- Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223.
- (2018) Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
- (2009) Semantic object classes in video: a high-definition ground truth database. Pattern Recognition Letters 30(2), pp. 88–97.
- Segmentation and recognition using structure from motion point clouds. In European Conference on Computer Vision, pp. 44–57.
- (2006) Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535–541.
- (2019) Data-free learning of student networks. arXiv preprint arXiv:1904.01186.
- (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
- (2014) Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pp. 2366–2374.
- (2006) One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(4), pp. 594–611.
- Born again neural networks. arXiv preprint arXiv:1805.04770.
- (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
- (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
- Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
- (2019) Geometry-aware distillation for indoor semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2869–2878.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- (2009) Learning multiple layers of features from tiny images. Technical report.
- (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- (2019) Attending to discriminative certainty for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 491–500.
- The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
- (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681–4690.
- (2019) Structured knowledge distillation for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2604–2613.
- (2017) Data-free knowledge distillation for deep neural networks. arXiv preprint arXiv:1710.07535.
- (2019) Zero-shot knowledge distillation in deep networks. arXiv preprint arXiv:1905.08114.
- (2017) Automatic differentiation in PyTorch.
- (2019) Refine and distill: exploiting cycle-inconsistency and knowledge distillation for unsupervised monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9768–9777.
- (2017) Loss-sensitive generative adversarial networks on Lipschitz densities. arXiv preprint arXiv:1701.06264.
- (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
- (2014) FitNets: hints for thin deep nets. arXiv preprint arXiv:1412.6550.
- (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
- (2019) Customizing student networks from heterogeneous teachers via adaptive knowledge amalgamation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3504–3513.
- (2012) Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision, pp. 746–760.
- (2017) Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176.
-  (2018) Esrgan: enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0. Cited by: §2.2.
-  (2019) Student becoming the master: knowledge amalgamation for joint scene parsing, depth estimation, and more. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2829–2838. Cited by: §2.1.1.
-  (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928. Cited by: §2.1.1.
-  (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §1, §2.2.
Appendix A Model Architectures and Hyperparameters
Table 7 summarizes the basic configurations for each dataset. In our experiments, teacher models are trained on labeled data, while student models and generators are trained without access to any real-world data. We validate our models every 50 iterations and, for simplicity, regard each such period as an "epoch".
As illustrated in Fig. 4, two kinds of vanilla generator architectures are adopted in our experiments. The first generator, denoted as "Generator-A", uses nearest-neighbor interpolation for upsampling. The second one, denoted as "Generator-B", is isomorphic to the generator proposed by DCGAN, which replaces the interpolations with deconvolutions. We use Generator-A for MNIST and CIFAR, and apply the more powerful Generator-B to the other datasets. The slope of LeakyReLU is set to 0.2 for more stable gradients. During distillation, all generators are optimized with Adam with a learning rate of 1e-3; the betas are kept at their default values of 0.9 and 0.999.
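A minimal PyTorch sketch of Generator-A under these settings. The layer types, channel counts, LeakyReLU slope, and Adam hyperparameters follow the description above; the noise dimension (100), the 32×32 output size, and the FC width are our own illustrative assumptions:

```python
import torch
import torch.nn as nn

class GeneratorA(nn.Module):
    """Interpolation-based generator: FC projection, then Upsample + Conv blocks."""
    def __init__(self, z_dim=100, img_size=32, ch=128):
        super().__init__()
        self.init_size = img_size // 4  # two 2x upsamplings recover img_size
        self.fc = nn.Linear(z_dim, ch * self.init_size ** 2)
        self.body = nn.Sequential(
            nn.BatchNorm2d(ch),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.LeakyReLU(0.2),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(ch, 64, 3, padding=1), nn.BatchNorm2d(64), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
            nn.BatchNorm2d(3, affine=False),  # trailing normalization on the image
        )

    def forward(self, z):
        x = self.fc(z).view(z.size(0), -1, self.init_size, self.init_size)
        return self.body(x)

g = GeneratorA()
opt = torch.optim.Adam(g.parameters(), lr=1e-3, betas=(0.9, 0.999))
fake = g(torch.randn(4, 100))  # synthetic batch of shape (4, 3, 32, 32)
```

Generator-B would replace each Upsample + Conv pair with a single stride-2 deconvolution (`nn.ConvTranspose2d`), as in DCGAN.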
A.2 Teachers and Students
MNIST. Table 5 provides detailed information about the architectures of LeNet-5 and LeNet-5-Half. In distillation, we use SGD with a fixed learning rate of 0.01 to train the student model for 40 epochs.
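For reference, the distillation objective on a single example can be sketched in plain Python. This assumes a Hinton-style temperature-softened KL divergence between teacher and student logits; the temperature value and the KL form are illustrative assumptions rather than values stated in this appendix:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax over a list of logits."""
    m = max(l / T for l in logits)
    exps = [math.exp(l / T - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(teacher_logits, student_logits, T=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# The loss vanishes exactly when the student matches the teacher's outputs.
print(kd_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # -> 0.0
```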
CIFAR10 and CIFAR100. The modified ResNet architectures with an 8× downsampling factor for CIFAR10 and CIFAR100 are presented in Table 6. The learning rate starts from 0.1 and is divided by 10 at epochs 100 and 200. We apply a weight decay of 5e-4 and train the student model for 500 epochs.
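The step schedule above (divide by 10 at epochs 100 and 200) is the behavior of `torch.optim.lr_scheduler.MultiStepLR`; a dependency-free sketch of the resulting learning rate, with the milestones and decay factor taken from the text (the helper name is ours):

```python
def lr_at_epoch(epoch, base_lr=0.1, milestones=(100, 200), gamma=0.1):
    """Learning rate under a multi-step decay schedule."""
    decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** decays

assert lr_at_epoch(0) == 0.1               # before any milestone
assert abs(lr_at_epoch(150) - 0.01) < 1e-12   # after epoch 100
assert abs(lr_at_epoch(250) - 0.001) < 1e-12  # after epoch 200
```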
Caltech101. The standard ResNet architectures with a 32× downsampling factor are adopted for Caltech101. During training, the learning rate of SGD is set to 0.05 and is decayed every 100 epochs. The student model and generator are optimized for 300 epochs with a weight decay of 5e-4.
| Generator-A | Generator-B |
| --- | --- |
| FC, Reshape, BN | FC, Reshape, BN |
| Upsample | 3×3 512 Deconv, BN, LReLU |
| 3×3 128 Conv, BN, LReLU | 3×3 256 Deconv, BN, LReLU |
| Upsample | 3×3 128 Deconv, BN, LReLU |
| 3×3 64 Conv, BN, LReLU | 3×3 64 Deconv, BN, LReLU |
| 3×3 3 Conv, Tanh, BN | 3×3 3 Conv, Tanh |
| LeNet-5 | LeNet-5-Half |
| --- | --- |
| 5×5 6 Conv, ReLU | 5×5 3 Conv, ReLU |
| 5×5 16 Conv, ReLU | 5×5 8 Conv, ReLU |
| 5×5 120 Conv, ReLU | 5×5 60 Conv, ReLU |
| 84 FC | 42 FC |
Table 6 (modified ResNet for CIFAR, first layers): 32×32 input; 3×3 64 Conv, BN, ReLU.
Table 7 columns: Dataset, Teacher, Student, Generator, Input Size, Batch Size, lrS, lrG, wd.
CamVid. We use DeepLabV3  with dilated convolutions to tackle this segmentation problem. In distillation, we construct a MobileNet-V2  student model and optimize it for 300 epochs using SGD with a learning rate of 0.1 and a weight decay of 5e-4. The learning rate of SGD and Adam is decayed every 100 epochs.
NYUv2. We adopt the same architectures and hyperparameters as CamVid for NYUv2, except that the learning rate of SGD is reduced to 0.05 and the weight decay to 5e-5. We train the student model and the generator for 300 epochs and multiply the learning rate by 0.3 at epochs 150 and 250.
Appendix B Influence of Different Batch Sizes
In our method, a large batch size is required to train the generator and to ensure the accuracy of the discrepancy estimation. To explore the influence of different batch sizes, we conduct several experiments on classification and semantic segmentation datasets. As illustrated in Fig. 7, a small batch size degrades the performance of student models, while increasing the batch size yields substantial gains. An important reason for this phenomenon is that a large batch provides sufficient statistical information for hard-sample generation and makes the training more stable.
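The statistical intuition can be illustrated with a toy simulation (our own construction, not the paper's estimator): any batch-averaged statistic, such as a discrepancy estimate, has a spread that shrinks roughly as 1/sqrt(B), so larger batches give more reliable estimates.

```python
import random
import statistics

def batch_mean_spread(batch_size, trials=2000, seed=0):
    """Std-dev of a batch-mean estimate of a unit-Gaussian statistic."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(batch_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

# Larger batches -> a tighter (lower-variance) estimate of the statistic.
small, large = batch_mean_spread(4), batch_mean_spread(64)
```

Here `small` is roughly 4× larger than `large`, mirroring the benefit of big batches observed in Fig. 7.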
Appendix C More Visualization
We provide more visualization results in this section. Fig. 8 and Fig. 9 compare the generated samples with real samples on classification datasets. These synthetic samples cannot be recognized by humans, yet they contain sufficient knowledge for their tasks. Fig. 10 provides some generated samples on segmentation datasets, as well as the predictions produced by teacher models. To further demonstrate the effectiveness of our method, we provide more segmentation results in Fig. 11 and Fig. 12. As in the main paper, our data-free method is compared with several data-driven baselines, such as KD-REL and KD-UNR, where KD-REL requires related data and KD-UNR uses unrelated data.