Deep convolutional neural networks (CNNs) have been successfully used in various computer vision applications such as image classification [24, 10], object detection [20] and semantic segmentation [14]. However, launching most of the widely used CNNs requires heavy computation and storage, so they can only be used on PCs with modern GPU cards. For example, processing a single image with VGGNet [24] demands a large amount of memory and a very large number of multiplications, which makes it almost impossible to deploy on edge devices such as autonomous cars and micro robots. Although these pre-trained CNNs have a large number of parameters, Han et al. [6] showed that discarding a large fraction of the weights in a given neural network would not obviously damage its performance, which demonstrates that there is significant redundancy in these CNNs.
In order to compress and speed up pre-trained heavy deep models, various effective approaches have been proposed recently. For example, Gong et al. [5] utilized a vector quantization approach to represent similar weights as cluster centers. Denton et al. [3] exploited low-rank decomposition to process the weight matrices of fully-connected layers. Chen et al. [1] proposed a hashing-based method to encode parameters in CNNs. Han et al. [6] employed pruning, quantization and Huffman coding to obtain a compact deep CNN with lower computational complexity. Hinton et al. [8] proposed the knowledge distillation approach, which distills the information of the pre-trained teacher network for learning a portable student network.
Although the above mentioned methods have made tremendous progress on benchmark datasets and models, an important issue has not been widely noticed: most existing network compression and speed-up algorithms strongly assume that the training samples of the original network are available. However, the training dataset is routinely unknown in real-world applications due to privacy and transmission limitations. For instance, users do not want their photos leaked to others, and some training datasets are too large to upload quickly to the cloud. In addition, the parameters and architecture of pre-trained networks are sometimes also unknown, except for the input and output layers. Therefore, conventional methods cannot be directly used for learning portable deep models under these practical constraints.
Nevertheless, only a few works have been proposed for compressing deep models without training data. Lopes et al. [15] utilized the “meta-data” (means and standard deviations of activations from each layer) recorded from the original training dataset, which is not provided for most well-trained CNNs. Srinivas and Babu [25] compressed the pre-trained network by merging similar neurons in fully-connected layers. However, the performance of networks compressed using these methods is much lower than that of the original network, because they cannot effectively utilize the pre-trained neural networks. To address the aforementioned problem, we propose a novel framework for compressing deep neural networks without the original training dataset. To be specific, the given heavy neural network is regarded as a fixed discriminator. Then, a generative network is established as an alternative to the original training set by extracting information from the network during the adversarial procedure, and its samples can be utilized for learning smaller networks with acceptable performance. The superiority of the proposed method is demonstrated through extensive experiments on benchmark datasets and models.
The rest of this paper is organized as follows. Section 2 reviews related work on CNN compression algorithms. Section 3 proposes the data-free teacher-student paradigm by exploiting GANs. Section 4 presents experimental results of the proposed method on benchmark datasets and models, and Section 5 concludes the paper.
2 Related Works
Based on different assumptions and applications, existing portable network learning methods can be divided into two categories, data-driven and data-free methods.
2.1 Data-Driven Network Compression
In order to learn efficient deep neural networks, a number of methods have been proposed to eliminate redundancy in pre-trained deep models. For example, Gong et al. [5] employed the vector quantization scheme to represent similar weights in neural networks. Denton et al. [3] exploited the singular value decomposition (SVD) approach to decompose weight matrices of fully-connected layers. Han et al. [6] proposed the pruning approach for removing subtle weights in pre-trained neural networks. Wang et al. [26] further introduced the discrete cosine transform (DCT) bases and converted convolution filters into the frequency domain to achieve higher compression and speed-up ratios.
Besides eliminating redundant weights or filters, Hinton et al. [8] proposed a knowledge distillation (KD) paradigm for transferring useful information from a given teacher network to a portable student network. Yim et al. [27] introduced the FSP (Flow of Solution Procedure) matrix to inherit the relationship between features from two layers. Li et al. [12] further presented a feature mimic framework to train efficient convolutional networks for object detection. Shen et al. [22] conducted feature amalgamation to learn a compact student model by inheriting knowledge from multiple teacher networks. In addition, Rastegari et al. [19] and Courbariaux et al. [2] explored binarized neural networks, whose weights are -1/1 or -1/0/1, to achieve considerable compression and speed-up ratios.
Although the above mentioned algorithms obtained promising results on most benchmark datasets and deep models, they cannot be effectively launched without the original training dataset. In practice, the training dataset could be unavailable for some reasons, e.g. transmission limitations and privacy. Therefore, it is necessary to study data-free approaches for compressing neural networks.
2.2 Data-Free Network Compression
Only a few methods have been proposed for compressing deep neural networks without the original training dataset. Srinivas and Babu [25] proposed to directly merge similar neurons in fully-connected layers, which cannot be applied to convolutional layers or to networks whose detailed architecture and parameter information is unknown. In addition, Lopes et al. [15] attempted to reconstruct the original data from “meta-data” and utilized the knowledge distillation scheme to learn a smaller network.
Since the fine-tuning procedure cannot be accurately conducted without the original training dataset, the performance of networks compressed by existing algorithms is worse than that of baseline models. Therefore, an effective data-free approach for learning efficient CNNs with comparable performance is highly required.
3 Data-Free Student Network Learning
In this section, we propose a novel data-free framework for compressing deep neural networks by embedding a generator network into the teacher-student learning paradigm.
3.1 Teacher-Student Interactions
As mentioned above, the original training dataset is usually not provided by customers due to various concerns. In addition, the parameters and detailed architecture information may also be unavailable. Thus, we propose to utilize the teacher-student learning paradigm for learning portable CNNs.
Knowledge Distillation (KD) [8] is a widely used approach for transferring the output information of a heavy network to a smaller network to achieve higher performance, and it does not rely on the parameters or the architecture of the given network. Although the given deep models may only be provided with limited interfaces (the input and output interfaces), we can still transfer knowledge to inherit useful information from the teacher networks. Let $\mathcal{N}_T$ and $\mathcal{N}_S$ denote the original pre-trained convolutional neural network (teacher network) and the desired portable network (student network), respectively. The student network can be optimized using the following loss function based on knowledge distillation:
$$\mathcal{L}_{KD} = \frac{1}{n}\sum_{i}\mathcal{H}_{cross}(y_S^i, y_T^i), \quad (1)$$
where $\mathcal{H}_{cross}$ is the cross-entropy loss, and $y_T^i = \mathcal{N}_T(x^i)$ and $y_S^i = \mathcal{N}_S(x^i)$ are the outputs of the teacher network $\mathcal{N}_T$ and the student network $\mathcal{N}_S$ for the $i$-th input $x^i$, respectively. Therefore, by utilizing the knowledge transfer technique, a portable network can be optimized without knowing the specific architecture of the given network.
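The distillation objective described above can be illustrated with a minimal sketch. It assumes that both networks output probability vectors and that the teacher output plays the role of the soft target; the function names `cross_entropy` and `kd_loss` and the toy values are hypothetical and only serve the illustration.

```python
import math

def cross_entropy(pred, target, eps=1e-12):
    """Cross-entropy -sum_j target_j * log(pred_j) between two probability vectors."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred, target))

def kd_loss(student_outputs, teacher_outputs):
    """Average cross-entropy of student predictions against teacher soft targets."""
    n = len(student_outputs)
    return sum(cross_entropy(s, t)
               for s, t in zip(student_outputs, teacher_outputs)) / n

# Toy outputs for two inputs and two classes (made-up numbers).
teacher = [[0.9, 0.1], [0.2, 0.8]]
good_student = [[0.85, 0.15], [0.25, 0.75]]   # close to the teacher
bad_student = [[0.1, 0.9], [0.8, 0.2]]        # disagrees with the teacher

print(kd_loss(good_student, teacher), kd_loss(bad_student, teacher))
```

A student that agrees with the teacher receives a lower loss than one that disagrees, and only the teacher's outputs are needed, not its parameters.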
3.2 GAN for Generating Training Samples
In order to learn a portable network without the original data, we exploit a GAN to generate training samples utilizing the available information of the given network.
Generative adversarial networks (GANs) have been widely applied for generating samples. A GAN consists of a generator $G$ and a discriminator $D$. $G$ is expected to generate desired data while $D$ is trained to identify the differences between real images and those produced by the generator. To be specific, given an input noise vector $z$, $G$ maps $z$ to the desired data $x$, i.e. $G: z \rightarrow x$. On the other hand, the goal of $D$ is to distinguish the real data from the synthetic data $G(z)$. For an arbitrary vanilla GAN, the objective function can be formulated as
$$\mathcal{L}_{GAN} = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \quad (2)$$
In the adversarial procedure, the generator is continuously upgraded according to the training error produced by $D$. The optimal $G$ is obtained by optimizing the following problem
$$G^* = \arg\min_{G} \mathbb{E}_{z \sim p_z(z)}[\log(1 - D^*(G(z)))], \quad (3)$$
where $D^*$ is the optimal discriminator. Adversarial learning techniques can be naturally employed to synthesize training data. However, according to Eq. (2), the discriminator requires real images for training. In the absence of training data, it is thus impossible to train the discriminator as in vanilla GANs.
Recent works [18] have shown that the discriminator $D$ can learn a hierarchy of representations from samples, which encourages the generalization of $D$ to other tasks such as image classification. Odena [17] further suggested that the tasks of discrimination and classification can improve each other. Instead of training a new discriminator as in vanilla GANs, the given deep neural network can extract semantic features from images as well, since it has already been well trained on large-scale datasets. Hence, we propose to regard this given deep neural network (e.g. ResNet-50 [7]) as a fixed discriminator. Therefore, $G$ can be optimized directly without training $D$ together, i.e. the parameters of the original network $D$ are fixed during the training of $G$. In addition, in vanilla GANs the output of the discriminator is a probability indicating whether an input image is real or fake. However, given the teacher deep neural network as the discriminator, the output classifies images into different concept sets instead of indicating the reality of images. The loss function in vanilla GANs is therefore inapplicable for approximating the original training set. Thus, we conduct a thorough analysis of real images and their responses on this teacher network. Several new loss functions are devised to reflect our observations.
On the image classification task, the teacher deep neural network adopts the cross-entropy loss in the training stage, which enforces the outputs to be close to the ground-truth labels of the inputs. Specifically, for multi-class classification, the outputs are encouraged to be one-hot vectors, where only one entry is 1 and all the others are 0. Denote the generator and the teacher network as $G$ and $\mathcal{N}_T$, respectively. Given a set of random vectors $\{z^1, z^2, \cdots, z^n\}$, the images generated from these vectors are $\{x^1, x^2, \cdots, x^n\}$, where $x^i = G(z^i)$. Inputting these images into the teacher network, we can obtain the outputs $\{y_T^1, y_T^2, \cdots, y_T^n\}$ with $y_T^i = \mathcal{N}_T(x^i)$. The predicted labels $\{t^1, t^2, \cdots, t^n\}$ are then calculated by $t^i = \arg\max_j (y_T^i)_j$. If the images generated by $G$ follow the same distribution as the training data of the teacher network, they should also have outputs similar to those of the training data. We thus introduce the one-hot loss, which encourages the outputs of generated images by the teacher network to be close to one-hot like vectors. By taking $\{t^1, t^2, \cdots, t^n\}$ as pseudo ground-truth labels, we formulate the one-hot loss function as
$$\mathcal{L}_{oh} = \frac{1}{n}\sum_{i}\mathcal{H}_{cross}(y_T^i, t^i), \quad (4)$$
where $\mathcal{H}_{cross}$ is the cross-entropy loss function. By introducing the one-hot loss, we expect that a generated image can be classified into one particular category concerned by the teacher network with a higher probability. In other words, we pursue synthetic images that are exclusively compatible with the teacher network, rather than general real images for any scenario.
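As a rough illustration of the one-hot loss, the sketch below takes teacher outputs that are already probability vectors, uses their arg-max as pseudo labels, and averages the resulting cross-entropy. The function name `one_hot_loss` and the example vectors are assumptions made for this sketch.

```python
import math

def one_hot_loss(teacher_outputs, eps=1e-12):
    """Mean cross-entropy between teacher outputs and their own arg-max pseudo labels."""
    total = 0.0
    for y in teacher_outputs:
        t = max(range(len(y)), key=lambda j: y[j])  # pseudo label: index of the largest entry
        total += -math.log(y[t] + eps)              # cross-entropy against a one-hot target
    return total / len(teacher_outputs)

# Teacher responses to two batches of synthetic images (made-up numbers).
confident = [[0.98, 0.01, 0.01], [0.02, 0.97, 0.01]]
uncertain = [[0.40, 0.30, 0.30], [0.34, 0.33, 0.33]]

print(one_hot_loss(confident), one_hot_loss(uncertain))
```

Outputs that are nearly one-hot give a loss close to zero, whereas flat outputs are penalized, which is the behaviour the generator is pushed towards.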
Besides the class labels predicted by DNNs, intermediate features extracted by convolution layers are also important representations of input images. A large number of works have investigated the interpretability of deep neural networks [28, 21, 4]. Features extracted by convolution filters are supposed to contain valuable information about the input images. In particular, Zhang et al. [29] assigned each filter in a higher convolution layer to a part of an object, which demonstrates that each filter stands for different semantics. We denote the features of $x^i$ extracted by the teacher network as $f_T^i$, which corresponds to the output before the fully-connected layer. Since filters in the teacher DNNs have been trained to extract intrinsic patterns in the training data, feature maps tend to receive higher activation values if the input images are real rather than random vectors. Hence, we define an activation loss function as:
$$\mathcal{L}_a = -\frac{1}{n}\sum_{i}\|f_T^i\|_1, \quad (5)$$
where $\|\cdot\|_1$ is the conventional $\ell_1$ norm. Different from the $\ell_2$ norm, which prefers a dense representation, the $\ell_1$ norm yields a sparse solution, which is naturally suitable for our aim since images of a category may only receive responses from some of the filters.
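The activation loss can be sketched in a few lines, assuming it is the negative mean ℓ1 norm of the teacher features, so that stronger activations give a lower loss. The function name `activation_loss` and the feature values are hypothetical.

```python
def activation_loss(features):
    """Negative mean l1 norm of the feature vectors extracted by the teacher."""
    n = len(features)
    return -sum(sum(abs(v) for v in f) for f in features) / n

# Two feature vectors with l1 norms 3.0 and 3.0 (made-up numbers).
strong = [[1.0, -2.0], [0.0, 3.0]]
weak = [[0.1, -0.1], [0.0, 0.2]]

print(activation_loss(strong), activation_loss(weak))
```

Because the loss is minimized, the generator is rewarded for images that trigger large responses in the teacher's filters.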
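Table 1: Results of different methods on the MNIST dataset.
| Algorithm | Required data | LeNet-5 [11] Accuracy | LeNet-5 [11] FLOPs | LeNet-5 [11] #params | HintonNet [8] Accuracy | HintonNet [8] FLOPs | HintonNet [8] #params |
|---|---|---|---|---|---|---|---|
| Standard back-propagation | Original data | 98.65% | 144K | 16K | 98.11% | 1.28M | 1.28M |
| Knowledge Distillation [8] | Original data | 98.91% | 144K | 16K | 98.39% | 1.28M | 1.28M |
| Normal distribution | No data | 88.01% | 144K | 16K | 87.58% | 1.28M | 1.28M |
| Alternative data | USPS dataset | 94.56% | 144K | 16K | 93.99% | 1.28M | 1.28M |
| Meta data [15] | Meta data | 92.47% | 144K | 16K | 91.24% | 1.28M | 1.28M |
| Data-Free Learning | No data | 98.20% | 144K | 16K | 97.91% | 1.28M | 1.28M |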
Moreover, to ease the training procedure of a deep neural network, the number of training examples in each category is usually balanced, e.g. there are 6,000 images in each class in the MNIST dataset. We employ the information entropy loss to measure the class balance of generated images. Specifically, given a probability vector $p = (p_1, p_2, \cdots, p_k)$, the information entropy of $p$, which measures the degree of confusion, is calculated as $\mathcal{H}_{info}(p) = -\frac{1}{k}\sum_{i} p_i \log(p_i)$. The value of $\mathcal{H}_{info}(p)$ indicates the amount of information that $p$ owns, which takes the maximum when all variables equal $\frac{1}{k}$. Given a set of output vectors $\{y_T^1, y_T^2, \cdots, y_T^n\}$, where $y_T^i = \mathcal{N}_T(x^i)$, the frequency distribution of generated images for every class is $\frac{1}{n}\sum_{i} y_T^i$. The information entropy loss of generated images is therefore defined as
$$\mathcal{L}_{ie} = -\mathcal{H}_{info}\Big(\frac{1}{n}\sum_{i} y_T^i\Big). \quad (6)$$
When the loss takes the minimum, every element in the vector $\frac{1}{n}\sum_{i} y_T^i$ equals $\frac{1}{k}$, which implies that $G$ generates images of each category with roughly the same probability. Therefore, minimizing the information entropy loss of generated images (equivalently, maximizing their information entropy) can lead to a balanced set of synthetic images.
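To make the class-balance argument concrete, the following sketch computes the information entropy loss as the negative entropy of the mean teacher output over a batch. The 1/k scaling of the entropy is an assumption of this sketch (it does not change where the minimum lies), and the function names and toy batches are hypothetical.

```python
import math

def information_entropy(p, eps=1e-12):
    """Entropy of a probability vector, scaled by 1/k (scaling assumed for this sketch)."""
    k = len(p)
    return -sum(pi * math.log(pi + eps) for pi in p) / k

def information_entropy_loss(teacher_outputs):
    """Negative entropy of the class frequency distribution over a batch."""
    n = len(teacher_outputs)
    k = len(teacher_outputs[0])
    mean = [sum(y[j] for y in teacher_outputs) / n for j in range(k)]
    return -information_entropy(mean)

balanced = [[1.0, 0.0], [0.0, 1.0]]    # one sample per class
collapsed = [[1.0, 0.0], [1.0, 0.0]]   # every sample falls into class 0

print(information_entropy_loss(balanced), information_entropy_loss(collapsed))
```

The balanced batch attains the lower loss, so a generator that collapses onto a single class is penalized even if each individual output is confidently one-hot.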
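By combining the aforementioned three loss functions, we obtain the final objective function
$$\mathcal{L}_{Total} = \mathcal{L}_{oh} + \alpha\mathcal{L}_a + \beta\mathcal{L}_{ie}, \quad (7)$$
where $\mathcal{L}_{oh}$, $\mathcal{L}_a$ and $\mathcal{L}_{ie}$ denote the one-hot loss, the activation loss and the information entropy loss, and $\alpha$ and $\beta$ are hyper-parameters for balancing the three terms. By minimizing the above function, the optimal generator can synthesize images that have a distribution similar to that of the training data previously used for training the teacher network (the discriminator network).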
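A compact sketch of the combined objective is given below. It assumes the three terms are the one-hot loss, the negative mean ℓ1 activation norm and the negative (1/k-scaled) entropy of the mean output, weighted as one-hot + alpha * activation + beta * entropy; the default weights 0.1 and 5 are the values reported for the MNIST experiments, and the function name and inputs are hypothetical.

```python
import math

def total_generator_loss(teacher_outputs, teacher_features,
                         alpha=0.1, beta=5.0, eps=1e-12):
    """Weighted sum of one-hot, activation and information entropy losses (sketch)."""
    n = len(teacher_outputs)
    k = len(teacher_outputs[0])
    # One-hot loss: cross-entropy against the arg-max pseudo label is -log(max prob).
    l_oh = -sum(math.log(max(y) + eps) for y in teacher_outputs) / n
    # Activation loss: negative mean l1 norm of the teacher features.
    l_a = -sum(sum(abs(v) for v in f) for f in teacher_features) / n
    # Information entropy loss: negative entropy of the mean output.
    mean = [sum(y[j] for y in teacher_outputs) / n for j in range(k)]
    l_ie = sum(p * math.log(p + eps) for p in mean) / k
    return l_oh + alpha * l_a + beta * l_ie

# Confident, balanced, strongly activating batch versus a flat, collapsed one.
good = total_generator_loss([[0.99, 0.01], [0.01, 0.99]], [[2.0, 0.0], [0.0, 2.0]])
bad = total_generator_loss([[0.6, 0.4], [0.6, 0.4]], [[0.1, 0.1], [0.1, 0.1]])
print(good, bad)
```

In an actual training loop these quantities would be computed on the outputs of the fixed teacher for a batch of generated images, and only the generator would receive gradient updates.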
It should be noted that some previous works [23, 16] can synthesize images by optimizing the input of the neural network using back-propagation. However, it is difficult to generate abundant images for the subsequent student network training in this way, since each synthetic image leads to an independent optimization problem solved by back-propagation. In contrast, the proposed method imitates the distribution of the training data directly, which makes generating new images more flexible and efficient.
|Information entropy loss|
|Feature maps activation loss|
|Top 1 accuracy||88.01%||78.77%||88.14%||15.95%||42.07%||97.25%||95.53%||98.20%|
The learning procedure of our algorithm can be divided into two stages. First, we regard the well-trained teacher network as a fixed discriminator. Using the loss function in Eq. 7, we optimize a generator to generate images that follow a distribution similar to that of the original training images of the teacher network. Second, we utilize the knowledge distillation approach to directly transfer knowledge from the teacher network to the student network. The student network with fewer parameters is then optimized using the KD loss in Eq. 1. The diagram of the proposed method is shown in Figure 1.
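We use the stochastic gradient descent (SGD) method to optimize the image generator $G$ and the student network $\mathcal{N}_S$. In the training of $G$, the first term of $\mathcal{L}_{Total}$ is the cross-entropy loss, which can be trained traditionally. The second term in Eq. 7 is exactly a linear operation, and the gradient of $\mathcal{L}_a$ with respect to $f_T^i$ can be easily calculated as:
$$\frac{\partial \mathcal{L}_a}{\partial f_T^i} = -\frac{1}{n}\,\mathrm{sign}(f_T^i), \quad (8)$$
where $\mathrm{sign}(\cdot)$ denotes the sign function. The parameters $W_G$ in $G$ will be updated by:
$$W_G \leftarrow W_G - \eta_G\,\alpha\,\frac{\partial \mathcal{L}_a}{\partial f_T^i}\cdot\frac{\partial f_T^i}{\partial W_G}, \quad (9)$$
where $\eta_G$ is the learning rate of $G$ and $\frac{\partial f_T^i}{\partial W_G}$ is the gradient of the feature $f_T^i$. The gradient of the final term $\mathcal{L}_{ie}$ with respect to $y_T^i$ can be easily calculated as:
$$\frac{\partial \mathcal{L}_{ie}}{\partial y_T^i} = \frac{1}{nk}\Big[\log\Big(\frac{1}{n}\sum_{j=1}^{n} y_T^j\Big) + \mathbf{1}\Big], \quad (10)$$
where the logarithm is applied element-wise and $\mathbf{1}$ denotes the $k$-dimensional vector with all values equal to 1. The parameters in $G$ will be additionally updated by:
$$W_G \leftarrow W_G - \eta_G\,\beta\,\frac{\partial \mathcal{L}_{ie}}{\partial y_T^i}\cdot\frac{\partial y_T^i}{\partial W_G}. \quad (11)$$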
The detailed procedure of the proposed Data-Free Learning (DFL) scheme for learning efficient student neural networks is summarized in Algorithm 1.
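The closed-form gradients used in the generator update can be checked numerically. The sketch below assumes the activation loss is the negative mean ℓ1 norm of the features and the information entropy loss is the negative 1/k-scaled entropy of the mean output; under these assumptions the analytic gradients are -sign(f)/n and (log(mean) + 1)/(nk), and they are compared against central finite differences. All function names and test values are hypothetical.

```python
import math

def l_a(features):
    """Activation loss: negative mean l1 norm."""
    return -sum(sum(abs(v) for v in f) for f in features) / len(features)

def grad_l_a(features):
    """Analytic gradient of l_a: -sign(f) / n for every feature entry."""
    n = len(features)
    sign = lambda v: (v > 0) - (v < 0)
    return [[-sign(v) / n for v in f] for f in features]

def l_ie(outputs):
    """Information entropy loss: negative (1/k-scaled) entropy of the mean output."""
    n, k = len(outputs), len(outputs[0])
    mean = [sum(y[j] for y in outputs) / n for j in range(k)]
    return sum(p * math.log(p) for p in mean) / k

def grad_l_ie(outputs):
    """Analytic gradient of l_ie: (log(mean) + 1) / (n * k), identical for every sample."""
    n, k = len(outputs), len(outputs[0])
    mean = [sum(y[j] for y in outputs) / n for j in range(k)]
    row = [(math.log(p) + 1.0) / (n * k) for p in mean]
    return [row[:] for _ in outputs]

def numeric_grad(fn, x, i, j, h=1e-6):
    """Central finite difference of fn with respect to x[i][j]."""
    hi = [r[:] for r in x]
    lo = [r[:] for r in x]
    hi[i][j] += h
    lo[i][j] -= h
    return (fn(hi) - fn(lo)) / (2 * h)

feats = [[0.5, -1.5], [2.0, -0.3]]
outs = [[0.7, 0.3], [0.2, 0.8]]
max_err_a = max(abs(numeric_grad(l_a, feats, i, j) - grad_l_a(feats)[i][j])
                for i in range(2) for j in range(2))
max_err_ie = max(abs(numeric_grad(l_ie, outs, i, j) - grad_l_ie(outs)[i][j])
                 for i in range(2) for j in range(2))
print(max_err_a, max_err_ie)
```

In practice an automatic differentiation framework would propagate these gradients through the fixed teacher to the generator parameters, so they rarely need to be coded by hand.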
4 Experiments
In this section, we demonstrate the effectiveness of the proposed data-free knowledge distillation method and conduct extensive ablation experiments to gain an explicit understanding of each component of the proposed method.
4.1 Experiments on MNIST
We first conduct experiments on the MNIST dataset, which is composed of 28×28 pixel images from 10 categories (digits 0 to 9). The whole dataset consists of 60,000 training images and 10,000 testing images. To choose the hyper-parameters of the proposed method, we take 10,000 of the training images as a validation set. Then, we train models on the full 60,000 images to obtain the final network.
To make a fair comparison, we follow the setting in [15]. Two architectures are used for investigating the performance of the proposed method: a convolution-based architecture and a network consisting of fully-connected layers. For the convolutional models, we use LeNet-5 [11] as the teacher model and LeNet-5-HALF (a modified version with half the number of channels per layer) as the student model. For the second architecture, the teacher network consists of two hidden layers of 1,200 units (Hinton-784-1200-1200-10) [8] and the student network consists of two hidden layers of 800 units (Hinton-784-800-800-10). The student networks have significantly fewer parameters than the teacher networks. The models are trained for 30 epochs using Adam with a learning rate of 0.001. For our method, the hyper-parameters α and β in Eq. 7 are set to 0.1 and 5, respectively, and are tuned on the validation set. The generator is trained for 200 epochs using Adam. We use a deep convolutional generator (https://github.com/eriklindernoren/PyTorch-GAN/blob/master/implementations/dcgan/dcgan.py) following [18] and add a batch normalization layer at the end of the generator to smooth the sample values.
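As a quick sanity check on the sizes of the fully-connected models described above, the parameter counts can be computed directly from the layer widths. The helper name `mlp_param_count` and the choice to count bias terms are assumptions of this sketch, not details given in the text.

```python
def mlp_param_count(layer_sizes, include_bias=True):
    """Number of weights (and optionally biases) in a fully-connected network."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out      # weight matrix between consecutive layers
        if include_bias:
            total += fan_out           # one bias per output unit
    return total

teacher_params = mlp_param_count([784, 1200, 1200, 10])  # Hinton-784-1200-1200-10
student_params = mlp_param_count([784, 800, 800, 10])    # Hinton-784-800-800-10

print(teacher_params, student_params)
```

Under these assumptions the student has about 1.28M parameters, close to the 1.28M figure listed for HintonNet in Table 1, while the teacher has about 2.40M.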
Table 1 reports the results of different methods on the MNIST dataset. On the LeNet-5 models, the teacher network achieves a 98.91% accuracy while the student network trained using standard back-propagation achieves a 98.65% accuracy. Knowledge distillation improves the accuracy of the student network to 98.91%. These methods use the original data to train the student network. We then train a student network with the proposed method to evaluate the effectiveness of the synthetic data.
We first use data randomly generated from a normal distribution to train the student network. By utilizing knowledge distillation, the student network achieves only an 88.01% accuracy. In addition, we use another handwritten digits dataset, namely USPS [9], to conduct the same experiment for training the student network. Although images in the two datasets have similar properties, the student network learned using USPS can only obtain a 94.56% accuracy on the MNIST dataset, which demonstrates that it is extremely hard to find an alternative to the original training dataset. To this end, Lopes et al. [15] used the “meta data”, which is the activation record of the original data, to reconstruct the dataset and achieved only a 92.47% accuracy. Note that the upper bound of the accuracy of the student network is 98.65%, which could be achieved only if we could find a dataset whose distribution is the same as that of the original dataset (MNIST). The proposed method utilizing generative adversarial networks achieves a 98.20% accuracy, which is very close to this upper bound. Also, the accuracy of the student network using the proposed algorithm is superior to those using other data (normal distribution, USPS dataset and the dataset reconstructed using “meta data”), which suggests that our method imitates the distribution of the training dataset better.
On the fully-connected models, the classification accuracies of the teacher and student networks are 98.39% and 98.11%, respectively. Knowledge distillation raises the performance of the student network to 98.39% by transferring information from the teacher network. However, in the absence of training data, the results become unacceptable. Randomly generated noise only achieves an 87.58% accuracy and “meta data” [15] achieves a higher accuracy of 91.24%. Using the USPS dataset as an alternative achieves an accuracy of 93.99%. The proposed method results in the highest performance, 97.91%, among all methods without the original data, which demonstrates the effectiveness of the generator.
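Table 3: Classification results on the CIFAR-10 and CIFAR-100 datasets.
| Algorithm | Required data | FLOPs | #params | CIFAR-10 | CIFAR-100 |
|---|---|---|---|---|---|
| Standard back-propagation | Original data | 557M | 11M | 93.92% | 76.53% |
| Knowledge Distillation [8] | Original data | 557M | 11M | 94.34% | 76.87% |
| Normal distribution | No data | 557M | 11M | 14.89% | 1.44% |
| Alternative data | Similar data | 557M | 11M | 90.65% | 69.88% |
| Data-Free Learning (DFL) | No data | 557M | 11M | 92.22% | 74.47% |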
Impact of parameters. As discussed above, the proposed method has two hyper-parameters: α and β. We test their impact on the accuracy of the student network by conducting experiments on the MNIST dataset. We use LeNet-5 and LeNet-5-HALF as the teacher and student networks, respectively. Other settings are the same as above.
It can be seen from Figure 2 that the student network trained with the proposed method achieves the highest accuracy (98.20%) when α = 0.1 and β = 5. Based on the above analysis, we keep this setting of hyper-parameters for the proposed method.
4.2 Ablation Experiments
In the above sections, we have tested and verified the effectiveness of the proposed generative method for student network learning without training data. However, there are a number of components, i.e. the three terms in Eq. 7, when optimizing the generator. We further conduct ablation experiments for an explicit understanding and analysis.
The ablation experiments are also conducted on the MNIST dataset. We use LeNet-5 as the teacher network and LeNet-5-HALF as the student network. The training settings are the same as those in Section 4.1. Table 2 reports the results of various design components. Using randomly generated samples, i.e. when the generator is not trained, the student network achieves an 88.01% accuracy. However, when utilizing only the one-hot loss, only the feature map activation loss, or both of them, the generated samples are unbalanced, which results in poor performance of the student networks. When only the information entropy loss is introduced, the student network achieves an 88.14% accuracy, since the samples do not contain enough useful information. When combining the one-hot loss or the activation loss with the information entropy loss, the student network achieves higher performance of 97.25% and 95.53%, respectively. Moreover, the accuracy of the student network is 98.20% when using all these loss functions, which is the best performance.
The ablation experiments suggest that each component of the loss function of the generator is meaningful. By applying the proposed method, the generator can generate balanced samples from different classes with a distribution similar to that of the original dataset, which is effective for the training of the student network.
4.3 Experiments on CIFAR
To further evaluate the effectiveness of our method, we conduct experiments on the CIFAR dataset. The CIFAR dataset consists of 32×32 pixel RGB images. There are 50,000 training images and 10,000 test images in this dataset, and the last 10,000 training images are selected as a validation set for tuning hyper-parameters. CIFAR-10 contains 10 categories and CIFAR-100 contains 100 categories. We use ResNet-34 as the teacher network and ResNet-18 as the student network (https://github.com/kuangliu/pytorch-cifar), which are more complex and advanced architectures for further investigating the effectiveness of the proposed method. These networks are optimized using Nesterov Accelerated Gradient (NAG) with weight decay and a momentum of 0.9. We train the networks for 200 epochs; the initial learning rate is set to 0.1 and divided by 10 at epochs 80 and 120. Random flipping, random cropping and zero padding are used for data augmentation as suggested in [7]. The generator and the student networks of the proposed method are trained for 1,500 epochs, and the other settings and hyper-parameters are the same as those in the MNIST experiments.
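The step-wise learning rate schedule described above (start at 0.1, divide by 10 at epochs 80 and 120) can be written as a small function. The name `learning_rate`, zero-based epoch indexing, and the convention that the drop takes effect from the milestone epoch onwards are assumptions of this sketch.

```python
def learning_rate(epoch, initial_lr=0.1, milestones=(80, 120), factor=0.1):
    """Step schedule: multiply the rate by `factor` at every milestone already reached."""
    lr = initial_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= factor
    return lr

# Learning rates at a few representative epochs of a 200-epoch run.
for epoch in (0, 79, 80, 119, 120, 199):
    print(epoch, learning_rate(epoch))
```

Deep learning frameworks usually ship an equivalent multi-step scheduler, so a hand-written function like this is mainly useful for checking the intended values.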
Table 3 reports the classification results on the CIFAR-10 and CIFAR-100 datasets. The teacher network achieves a 94.85% accuracy in CIFAR-10. The student network using knowledge distillation achieves a 94.34% accuracy, which is slightly higher than that of standard BP (93.92%).
We then explore optimizing the student network without real data. Since the CIFAR dataset is more complex than MNIST, it is impossible to optimize a student network using randomly generated data that follows the normal distribution. Therefore, we regard the MNIST dataset without labels as alternative data to train the student network using knowledge distillation. The student network only achieves a 28.29% accuracy on the CIFAR-10 dataset. Moreover, we train the student network using the CIFAR-100 dataset, which has considerable overlap with the original CIFAR-10 dataset, but this network only achieves a 90.65% accuracy, which is obviously lower than that of the teacher model. In contrast, the student network trained with the proposed method achieves a 92.22% accuracy with only synthetic data.
Besides CIFAR-10, we further verify the capability of the proposed method on the CIFAR-100 dataset, which has 100 categories and 600 images per class. Therefore, the dimensionality of the input random vectors for the generator in our method is increased to 1,000. The accuracy of the teacher network is 77.34% and that of the student network is only 76.53%. Using normal distribution data, MNIST, or CIFAR-10 to train the student network cannot obtain promising results, as shown in Table 3. In contrast, the student network learned with the proposed method obtains a 74.47% accuracy without any real-world training data.
4.4 Experiments on CelebA
Besides the CIFAR dataset, we conduct experiments on the CelebA dataset, which contains 202,599 face images of 178×218 pixels. To evaluate our approach fairly, we use AlexNet [10] to classify the most balanced attribute in CelebA [13], following the settings in [15]. The student network is AlexNet-Half, whose number of filters is half that of AlexNet. The original teacher network has about 57M parameters while the student network has only about 40M parameters. The networks are optimized for 100 epochs using Adam. We use an alternative model of DCGAN [18] to generate color images. The hyper-parameters of the proposed method are the same as those in the MNIST and CIFAR experiments.
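Table 4: Classification results of student networks on the CelebA dataset.
| Algorithm | FLOPs | Accuracy |
|---|---|---|
| Knowledge Distillation [8] | 222M | 81.35% |
| Meta data [15] | 222M | 77.56% |
| Data-Free Learning (DFL) | 222M | 80.03% |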
Table 4 reports the classification results of student networks on the CelebA dataset obtained with the proposed method and state-of-the-art learning methods. The teacher network achieves an 81.59% accuracy and the student network trained using standard BP achieves an 80.82% accuracy. Lopes et al. [15] achieve only a 77.56% accuracy using the “meta data”. The accuracy of the student network trained using the proposed method is 80.03%, which is comparable with that of the teacher network.
4.5 Extended Experiments
Extensive experiments have been conducted on several benchmarks to verify the performance of the DFL method for learning student networks using generated images. In those experiments, the architectures of the student networks are more portable than those of the teacher networks. To investigate the difference between the original training images and the generated images, we use the generated images to train networks with the same architectures as the teacher networks using the proposed method. The results are reported in Table 5.
It can be seen in Table 5 that the teacher networks LeNet-5 and HintonNet achieve a 98.91% accuracy and a 98.39% accuracy on the MNIST dataset, respectively. In contrast, the accuracies of student networks trained from scratch with the same architectures are 98.47% and 98.08%, respectively, which are very close to those of the teacher networks. In addition, student networks on the CIFAR-10 and CIFAR-100 datasets also obtain results similar to those of the teacher networks. These results demonstrate that the proposed method can effectively approximate the original training dataset by extracting information from teacher networks. If the network architectures are given, we can even replicate the teacher networks and achieve similar accuracies.
Filter visualization. Moreover, we visualize the filters of the LeNet-5 teacher network and student network in Figure 3. Though the student network is trained without real-world data, filters of the student network learned by the proposed method (see Figure 3 (b)) are still similar to those of the teacher network (see Figure 3 (a)). The visualization experiments further demonstrate that the generator can produce images that have similar patterns as the original images, and by utilizing generated samples, the student network could acquire valuable knowledge from the teacher network.
5 Conclusion
Conventional methods require the original training dataset for fine-tuning compressed deep neural networks to an acceptable accuracy. However, the training set and detailed architecture information of the given deep network are routinely unavailable due to privacy and transmission limitations. In this paper, we present a novel framework to train a generator for approximating the original dataset without the training data. Then, a portable network can be learned effectively through the knowledge distillation scheme. By regarding the given pre-trained network as a fixed discriminator, the generator can produce images with properties similar to those in the training set. Experiments on benchmark datasets demonstrate that the proposed DFL method is able to learn portable deep neural networks without any training data.
References
[1] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In ICML, 2015.
[2] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[3] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
[4] Y. Dong, H. Su, J. Zhu, and F. Bao. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv preprint arXiv:1708.05493, 2017.
[5] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
[6] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[8] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[9] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[12] Q. Li, S. Jin, and J. Yan. Mimicking very efficient network for object detection. In CVPR, pages 7341–7349. IEEE, 2017.
[13] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, pages 3730–3738, 2015.
[14] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[15] R. G. Lopes, S. Fenu, and T. Starner. Data-free knowledge distillation for deep neural networks. arXiv preprint arXiv:1710.07535, 2017.
[16] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In CVPR, pages 5188–5196, 2015.
[17] A. Odena. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583, 2016.
[18] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[19] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, pages 525–542. Springer, 2016.
[20] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
[21] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017.
[22] C. Shen, X. Wang, J. Song, L. Sun, and M. Song. Amalgamating knowledge towards comprehensive classification. arXiv preprint arXiv:1811.02796, 2018.
[23] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[25] S. Srinivas and R. V. Babu. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149, 2015.
[26] Y. Wang, C. Xu, S. You, D. Tao, and C. Xu. Cnnpack: packing convolutional neural networks in the frequency domain. In NIPS, pages 253–261, 2016.
[27] J. Yim, D. Joo, J. Bae, and J. Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In CVPR, 2017.
[28] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833. Springer, 2014.
[29] Q. Zhang, Y. N. Wu, and S.-C. Zhu. Interpretable convolutional neural networks. In CVPR, pages 8827–8836, 2018.