1 Introduction
Deep convolutional neural networks (CNNs) have been successfully used in various computer vision applications such as image classification
[24, 10], object detection [20] and semantic segmentation [14]. However, launching most of the widely used CNNs requires heavy computation and storage, which can only be afforded by PCs with modern GPU cards. For example, processing a single image with VGGNet [24] demands a large amount of memory and billions of multiplications, which is almost impossible on edge devices such as autonomous cars and micro robots. Although these pretrained CNNs contain a huge number of parameters, Han et al. [6] showed that discarding a large proportion of the weights in a given neural network does not obviously damage its performance, which demonstrates that there is significant redundancy in these CNNs.
In order to compress and speed up pretrained heavy deep models, various effective approaches have been proposed recently. For example, Gong et al. [5] utilized a vector quantization approach to represent similar weights by cluster centers. Denton et al. [3] exploited low-rank decomposition to process the weight matrices of fully-connected layers. Chen et al. [1] proposed a hashing-based method to encode parameters in CNNs. Han et al. [6] employed pruning, quantization and Huffman coding to obtain a compact deep CNN with lower computational complexity. Hinton et al. [8] proposed the knowledge distillation approach, which distills the information of a pretrained teacher network for learning a portable student network.
Although the above-mentioned methods have achieved tremendous progress on benchmark datasets and models, an important issue has not been widely noticed: most existing network compression and speed-up algorithms rely on the strong assumption that training samples of the original network are available. However, the training dataset is routinely unknown in real-world applications due to privacy and transmission limitations. For instance, users do not want their photos leaked to others, and some training datasets are too huge to be quickly uploaded to the cloud. In addition, the parameters and architecture of a pretrained network are also sometimes unknown, except for its input and output layers. Therefore, conventional methods cannot be directly used for learning portable deep models under these practical constraints.
Nevertheless, only a few works have addressed compressing deep models without training data. Lopes et al. [15] utilized "meta data" (means and standard deviations of activations from each layer) recorded from the original training dataset, which is not provided with most well-trained CNNs. Srinivas and Babu [25] compressed the pretrained network by merging similar neurons in fully-connected layers. However, the performance of networks compressed by these methods is much lower than that of the original network, because they cannot effectively utilize the pretrained neural network. To address the aforementioned problem, we propose a novel framework for compressing deep neural networks without the original training dataset. To be specific, the given heavy neural network is regarded as a fixed discriminator. Then, a generative network is established as a substitute for the original training set by extracting information from the network during an adversarial procedure, which can be utilized for learning smaller networks with acceptable performance. The superiority of the proposed method is demonstrated through extensive experiments on benchmark datasets and models.
The rest of this paper is organized as follows. Section 2 investigates related works on CNN compression algorithms. Section 3 proposes the data-free teacher-student paradigm by exploiting GANs. Section 4 illustrates experimental results of the proposed method on benchmark datasets and models, and Section 5 concludes the paper.
2 Related Works
Based on different assumptions and applications, existing portable network learning methods can be divided into two categories: data-driven and data-free methods.
2.1 Data-Driven Network Compression
In order to learn efficient deep neural networks, a number of methods have been proposed to eliminate the redundancy in pretrained deep models. For example, Gong et al. [5] employed a vector quantization scheme to represent similar weights in neural networks. Denton et al. [3] exploited the singular value decomposition (SVD) approach to decompose weight matrices of fully-connected layers. Han et al. [6] proposed a pruning approach for removing subtle weights in pretrained neural networks. Wang et al. [26] further introduced discrete cosine transform (DCT) bases and converted convolution filters into the frequency domain to achieve higher compression and speed-up ratios.
Besides eliminating redundant weights or filters, Hinton et al. [8] proposed the knowledge distillation (KD) paradigm for transferring useful information from a given teacher network to a portable student network. Yim et al. [27] introduced the FSP (Flow of Solution Procedure) matrix to inherit the relationship between features from two layers. Li et al. [12] further presented a feature-mimic framework to train efficient convolutional networks for object detection. Shen et al. [22] conducted feature amalgamation to learn a compact student model by inheriting knowledge from multiple teacher networks. In addition, Rastegari et al. [19] and Courbariaux et al. [2] explored binarized neural networks, whose weights are constrained to -1/+1 or -1/0/+1, to achieve considerable compression and speed-up ratios.
Although the above-mentioned algorithms obtain promising results on most benchmark datasets and deep models, they cannot be effectively launched without the original training dataset. In practice, the training dataset could be unavailable for several reasons, e.g., transmission limitations and privacy. Therefore, it is necessary to study data-free approaches for compressing neural networks.
2.2 Data-Free Network Compression
Only a few methods have been proposed for compressing deep neural networks without the original training dataset. Srinivas and Babu [25] proposed to directly merge similar neurons in fully-connected layers, which cannot be applied to convolutional layers or to networks whose detailed architectures and parameters are unknown. In addition, Lopes et al. [15] attempted to reconstruct the original data from "meta data" and utilized the knowledge distillation scheme to learn a smaller network.
Since the fine-tuning procedure cannot be accurately conducted without the original training dataset, the performance of networks compressed by existing algorithms is worse than that of the baseline models. Therefore, an effective data-free approach for learning efficient CNNs with comparable performance is highly desired.
3 Data-Free Student Network Learning
In this section, we propose a novel data-free framework for compressing deep neural networks by embedding a generator network into the teacher-student learning paradigm.
3.1 Teacher-Student Interactions
As mentioned above, the original training dataset is usually not provided by customers for various reasons. In addition, parameters and detailed architecture information could also be unavailable. Thus, we propose to utilize the teacher-student learning paradigm for learning portable CNNs.
Knowledge distillation (KD) [8] is a widely used approach that transfers the output information from a heavy network to a smaller network for achieving higher performance; it does not require the parameters or the architecture of the given network. Although the given deep model may be provided with only limited interfaces (i.e., its input and output), we can still transfer knowledge to inherit the useful information of the teacher network. Let $\mathcal{N}_T$ and $\mathcal{N}_S$ denote the original pretrained convolutional neural network (teacher network) and the desired portable network (student network), respectively. The student network can be optimized using the following loss function based on knowledge distillation:
\mathcal{L}_{KD} = \frac{1}{n}\sum_{i=1}^{n}\mathcal{H}_{cross}\big(y_S^i, y_T^i\big), \qquad (1)
where $\mathcal{H}_{cross}$ is the cross-entropy loss, and $y_T^i = \mathcal{N}_T(x^i)$ and $y_S^i = \mathcal{N}_S(x^i)$ are the outputs of the teacher network $\mathcal{N}_T$ and the student network $\mathcal{N}_S$, respectively. Therefore, utilizing this knowledge transfer technique, a portable network can be optimized without knowing the specific architecture of the given network.
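To make Eq. (1) concrete, here is a minimal NumPy sketch of the knowledge-distillation loss; the function names (`softmax`, `kd_loss`) are our own, and the teacher/student outputs are assumed to be raw logits.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits):
    # Eq. (1): cross-entropy between the student's predictions and the
    # teacher's soft outputs, averaged over the batch.
    p_teacher = softmax(teacher_logits)
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())
```

The loss is smallest when the student reproduces the teacher's output distribution exactly.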
3.2 GAN for Generating Training Samples
In order to learn a portable network without the original data, we exploit GANs to generate training samples using the available information of the given network.
Generative adversarial networks (GANs) have been widely applied for generating samples. A GAN consists of a generator $G$ and a discriminator $D$. $G$ is expected to generate desired data, while $D$ is trained to identify the differences between real images and those produced by the generator. To be specific, given an input noise vector $z$, $G$ maps $z$ to the desired data $x$, i.e., $x = G(z)$. On the other hand, the goal of $D$ is to distinguish real data from synthetic data $G(z)$. For a vanilla GAN, the objective function can be formulated as
\min_{G}\max_{D}\; \mathbb{E}_{x\sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z\sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]. \qquad (2)
In the adversarial procedure, the generator is continuously upgraded according to the training error produced by $D$. The optimal $G$ is obtained by solving the following problem:
G^{*} = \arg\min_{G}\; \mathbb{E}_{z\sim p_z(z)}\big[\log\big(1 - D^{*}(G(z))\big)\big], \qquad (3)
where $D^{*}$ is the optimal discriminator. Adversarial learning techniques can be naturally employed to synthesize training data. However, according to Eq. (2), the discriminator requires real images for training. In the absence of training data, it is thus impossible to train the discriminator as in vanilla GANs.
Recent works [18] have shown that the discriminator can learn a hierarchy of representations from samples, which encourages the generalization of $D$ to other tasks such as image classification. Odena [17] further suggested that the tasks of discrimination and classification can improve each other. Instead of training a new discriminator as in vanilla GANs, the given deep neural network (e.g., ResNet-50 [7]) can also extract semantic features from images, since it has already been well trained on large-scale datasets. Hence, we propose to regard this given deep neural network as a fixed discriminator. Therefore, $G$ can be optimized directly without training $D$; the parameters of the original network are fixed during the training of $G$.
In addition, in vanilla GANs, the output of the discriminator is a probability indicating whether an input image is real or fake. However, with the teacher deep neural network taken as the discriminator, the output classifies images into different concept sets instead of indicating their reality. The loss function of vanilla GANs is therefore inapplicable for approximating the original training set. Thus, we conduct a thorough analysis of real images and their responses on the teacher network, and devise several new loss functions to reflect our observations.
For the image classification task, the teacher deep neural network is trained with the cross-entropy loss, which enforces the outputs to be close to the ground-truth labels of the inputs. Specifically, for multi-class classification, the outputs are encouraged to be one-hot vectors, where only one entry is 1 and all the others are 0s. Denote the generator and the teacher network as $G$ and $\mathcal{N}_T$, respectively. Given a set of random vectors $\{z^1, z^2, \dots, z^n\}$, the images generated from these vectors are $\{x^1, x^2, \dots, x^n\}$, where $x^i = G(z^i)$. Inputting these images into the teacher network, we obtain the outputs $\{y_T^1, y_T^2, \dots, y_T^n\}$ with $y_T^i = \mathcal{N}_T(x^i)$. The predicted labels are then calculated by $t^i = \arg\max_j (y_T^i)_j$. If the images generated by $G$ follow the same distribution as the training data of the teacher network, they should have outputs similar to those of the training data. We thus introduce a one-hot loss, which encourages the outputs of generated images computed by the teacher network to be close to one-hot vectors. Taking $\{t^i\}$ as pseudo ground-truth labels, we formulate the one-hot loss function as
\mathcal{L}_{oh} = \frac{1}{n}\sum_{i=1}^{n}\mathcal{H}_{cross}\big(y_T^i, t^i\big), \qquad (4)
where $\mathcal{H}_{cross}$ is the cross-entropy loss function. By introducing the one-hot loss, we expect a generated image to be classified into one particular category concerned by the teacher network with a high probability. In other words, we pursue synthetic images that are exclusively compatible with the teacher network, rather than general real images for an arbitrary scenario.
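A minimal NumPy sketch of the one-hot loss in Eq. (4); the names are illustrative, and the teacher's outputs are again assumed to be raw logits.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def one_hot_loss(teacher_logits):
    # Eq. (4): cross-entropy of the teacher's predictions against their
    # own argmax pseudo-labels; small only when outputs are one-hot-like.
    p = softmax(teacher_logits)
    pseudo = p.argmax(axis=-1)
    picked = p[np.arange(p.shape[0]), pseudo]
    return float(-np.log(picked + 1e-12).mean())
```

Confident, one-hot-like outputs give a loss near zero, while near-uniform outputs are heavily penalized.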
Besides the class labels predicted by DNNs, the intermediate features extracted by convolution layers are also important representations of the input images. A large number of works have investigated the interpretability of deep neural networks [28, 21, 4]. Features extracted by convolution filters are supposed to contain valuable information about the input images. In particular, Zhang et al. [29] assigned each filter in a higher convolution layer a part of an object, which demonstrates that each filter stands for a different semantic. We denote the features of $x^i$ extracted by the teacher network as $f_T^i$, which corresponds to the output before the fully-connected layer. Since the filters in the teacher DNN have been trained to extract intrinsic patterns in the training data, feature maps tend to receive higher activation values if the input images are real rather than random noise. Hence, we define the activation loss function as

\mathcal{L}_{a} = -\frac{1}{n}\sum_{i=1}^{n}\big\|f_T^i\big\|_1, \qquad (5)
where $\|\cdot\|_1$ is the conventional $\ell_1$ norm. Different from the $\ell_2$ norm, which prefers a dense representation, the $\ell_1$ norm yields a sparse solution, which naturally suits our aim, since an image of a given category should only receive responses from some of the filters.
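The activation loss of Eq. (5) is essentially a one-liner; a sketch with illustrative names, where `features` holds the teacher's pre-fully-connected features for a batch of generated images.

```python
import numpy as np

def activation_loss(features):
    # Eq. (5): negative mean l1-norm of the teacher's features, so that
    # minimising the loss rewards strong, sparse filter responses.
    return float(-np.abs(features).sum(axis=-1).mean())
```

For example, `activation_loss(np.array([[1.0, -2.0, 3.0]]))` evaluates to -6.0: the l1-norm of the single feature vector is 6, and its negation is the loss.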
Table 1. Classification results on the MNIST dataset. The left block reports LeNet-5 [11] models and the right block reports HintonNet [8] models.

| Algorithm | Required data | Accuracy | FLOPs | #params | Accuracy | FLOPs | #params |
|---|---|---|---|---|---|---|---|
| Teacher | Original data | 98.91% | 436K | 62K | 98.39% | 2.39M | 2.4M |
| Standard back-propagation | Original data | 98.65% | 144K | 16K | 98.11% | 1.28M | 1.28M |
| Knowledge Distillation [8] | Original data | 98.91% | 144K | 16K | 98.39% | 1.28M | 1.28M |
| Normal distribution | No data | 88.01% | 144K | 16K | 87.58% | 1.28M | 1.28M |
| Alternative data | USPS dataset | 94.56% | 144K | 16K | 93.99% | 1.28M | 1.28M |
| Meta data [15] | Meta data | 92.47% | 144K | 16K | 91.24% | 1.28M | 1.28M |
| Data-Free Learning | No data | 98.20% | 144K | 16K | 97.91% | 1.28M | 1.28M |
Moreover, to ease the training procedure of a deep neural network, the number of training examples in each category is usually balanced; e.g., there are 6,000 images in each class of the MNIST dataset. We employ the information entropy to measure the class balance of generated images. Specifically, given a probability vector $p = (p_1, p_2, \dots, p_k)$, the information entropy of $p$, which measures the degree of confusion, is calculated as $\mathcal{H}_{info}(p) = -\frac{1}{k}\sum_{j} p_j \log p_j$. The value of $\mathcal{H}_{info}(p)$ indicates the amount of information that $p$ carries, and it takes its maximum when all entries are equal to $\frac{1}{k}$. Given a set of output vectors $\{y_T^1, y_T^2, \dots, y_T^n\}$, where $y_T^i = \mathcal{N}_T(x^i)$, the frequency distribution of generated images over the classes is $\frac{1}{n}\sum_i y_T^i$. The information entropy loss of generated images is therefore defined as
\mathcal{L}_{ie} = -\mathcal{H}_{info}\Big(\frac{1}{n}\sum_{i=1}^{n} y_T^i\Big), \qquad (6)
When this loss takes its minimum, every element of the vector $\frac{1}{n}\sum_i y_T^i$ equals $\frac{1}{k}$, which implies that $G$ generates images of each category with roughly the same probability. Therefore, minimizing the information entropy loss of generated images leads to a balanced set of synthetic images.
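A NumPy sketch of the information-entropy loss in Eq. (6); the names are our own, and we assume $\mathcal{H}_{info}$ carries a $\frac{1}{k}$ normalization over the $k$ classes.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy_loss(teacher_logits):
    # Eq. (6): negative information entropy of the batch-averaged class
    # distribution; minimal when every class is predicted equally often.
    p_bar = softmax(teacher_logits).mean(axis=0)
    k = p_bar.shape[0]
    return float((p_bar * np.log(p_bar + 1e-12)).sum() / k)
```

A batch whose predictions are spread evenly over the classes yields a lower (more negative) loss than a batch collapsed onto a single class.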
By combining the aforementioned three loss functions, we obtain the final objective function
\mathcal{L}_{Total} = \mathcal{L}_{oh} + \alpha\,\mathcal{L}_{a} + \beta\,\mathcal{L}_{ie}, \qquad (7)
where $\alpha$ and $\beta$ are hyper-parameters for balancing the three terms. By minimizing the above function, the optimal generator can synthesize images whose distribution is similar to that of the training data previously used for training the teacher network (i.e., the discriminator).
It is noted that some previous works [23, 16] synthesize images by optimizing the input of the neural network using back-propagation. However, it is difficult to generate abundant images for the subsequent student network training in this way, since each synthetic image leads to an independent optimization problem solved by back-propagation. In contrast, the proposed method imitates the distribution of the training data directly, which is more flexible and efficient for generating new images.
3.3 Optimization
Table 2. Ablation study on the MNIST dataset: top-1 accuracy of the student network when the generator is trained with different combinations of the three loss terms.

| One-hot loss $\mathcal{L}_{oh}$ | Information entropy loss $\mathcal{L}_{ie}$ | Activation loss $\mathcal{L}_{a}$ | Top-1 accuracy |
|---|---|---|---|
|  |  |  | 88.01% |
| ✓ |  |  | 78.77% |
|  | ✓ |  | 88.14% |
|  |  | ✓ | 15.95% |
| ✓ |  | ✓ | 42.07% |
| ✓ | ✓ |  | 97.25% |
|  | ✓ | ✓ | 95.53% |
| ✓ | ✓ | ✓ | 98.20% |
The learning procedure of our algorithm can be divided into two stages of training. First, we regard the welltrained teacher network as a fixed discriminator. Using the loss function in Eq. 7, we optimize a generator to generate images that follow the similar distribution as that of the original training images for the teacher network. Second, we utilize the knowledge distillation approach to directly transfer knowledge from the teacher network to the student network. The student network with fewer parameters is then optimized using the KD loss in Eq. 1. The diagram of the proposed method is shown in Figure 1.
We use the stochastic gradient descent (SGD) method to optimize the image generator $G$ and the student network $\mathcal{N}_S$. In the training of $G$, the first term of $\mathcal{L}_{Total}$ is a cross-entropy loss, which can be optimized conventionally. The second term, $\mathcal{L}_a$, is a linear operation on the features, and its gradient with respect to $f_T^i$ can be easily calculated as:

\frac{\partial \mathcal{L}_{a}}{\partial f_T^i} = -\frac{1}{n}\,\mathrm{sign}\big(f_T^i\big), \qquad (8)
where $\mathrm{sign}(\cdot)$ denotes the sign function. The parameters $W_G$ of $G$ are then updated by:
W_G \leftarrow W_G - \eta\,\alpha\sum_{i=1}^{n}\frac{\partial \mathcal{L}_{a}}{\partial f_T^i}\,\frac{\partial f_T^i}{\partial W_G}, \qquad (9)
where $\frac{\partial f_T^i}{\partial W_G}$ is the gradient of the feature $f_T^i$ with respect to the generator parameters, and $\eta$ is the learning rate. The gradient of the final term $\mathcal{L}_{ie}$ with respect to $y_T^i$ can be easily calculated as:
\frac{\partial \mathcal{L}_{ie}}{\partial y_T^i} = \frac{1}{nk}\Big(\log\frac{1}{n}\sum_{j=1}^{n} y_T^j + \mathbf{1}_k\Big), \qquad (10)
where $\mathbf{1}_k$ denotes the $k$-dimensional vector with all values equal to 1, and the logarithm is taken element-wise. The parameters of $G$ will be additionally updated by:
W_G \leftarrow W_G - \eta\,\beta\sum_{i=1}^{n}\frac{\partial \mathcal{L}_{ie}}{\partial y_T^i}\,\frac{\partial y_T^i}{\partial W_G}. \qquad (11)
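The sign-function gradient used in Eq. (8) can be checked numerically; a small self-contained sketch (our own, assuming feature values away from zero, where the l1-norm is differentiable).

```python
import numpy as np

def l1_norm(f):
    # Conventional l1-norm of a feature vector.
    return np.abs(f).sum()

f = np.array([0.5, -1.2, 2.0])
analytic = np.sign(f)  # gradient of ||f||_1 claimed in Eq. (8), up to -1/n

# Central finite differences along each coordinate.
eps = 1e-6
numeric = np.array([
    (l1_norm(f + eps * e) - l1_norm(f - eps * e)) / (2 * eps)
    for e in np.eye(f.size)
])
max_err = float(np.abs(analytic - numeric).max())
```

Here `max_err` comes out on the order of floating-point round-off, confirming $\mathrm{sign}(f)$ as the per-sample gradient of the l1-norm.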
The detailed procedure of the proposed Data-Free Learning (DFL) scheme for learning efficient student neural networks is summarized in Algorithm 1.
4 Experiments
In this section, we demonstrate the effectiveness of the proposed data-free knowledge distillation method and conduct extensive ablation experiments to obtain an explicit understanding of each component of the proposed method.
4.1 Experiments on MNIST
We first conduct experiments on the MNIST dataset, which is composed of gray-scale handwritten digit images from 10 categories (0 to 9). The whole dataset consists of 60,000 training images and 10,000 testing images. For choosing the hyper-parameters of the proposed method, we take 10,000 images from the training set as a validation set. Then, we train models on the full 60,000 images to obtain the final network.
To make a fair comparison, we follow the settings in [15]. Two architectures are used for investigating the performance of the proposed method: a convolution-based architecture and a network consisting of fully-connected layers. For the convolution models, we use LeNet-5 [11] as the teacher model and LeNet-5-HALF (a modified version with half the number of channels per layer) as the student model. For the second architecture, the teacher network consists of two hidden layers of 1,200 units (Hinton-784-1200-1200-10) [8] and the student network consists of two hidden layers of 800 units (Hinton-784-800-800-10). The student networks have significantly fewer parameters than the teacher networks. The models are trained for 30 epochs using Adam with a learning rate of 0.001. For our method, $\alpha$ and $\beta$ in Eq. (7) are set to 0.1 and 5, respectively, tuned on the validation set. The generator is trained for 200 epochs using Adam. We use a deep convolutional generator (https://github.com/eriklindernoren/PyTorch-GAN/blob/master/implementations/dcgan/dcgan.py) following [18], and add a batch normalization layer at the end of the generator to smooth the sample values.
Table 1 reports the results of different methods on the MNIST dataset. On the LeNet-5 models, the teacher network achieves a 98.91% accuracy while the student network trained with standard back-propagation achieves a 98.65% accuracy. Knowledge distillation improves the accuracy of the student network to 98.91%. These methods use the original data to train the student network. We then train a student network exploiting the proposed method to evaluate the effectiveness of the synthetic data.
We first use data randomly generated from a normal distribution to train the student network. By utilizing knowledge distillation, the student network achieves only an 88.01% accuracy. In addition, we use another handwritten digit dataset, namely USPS [9], to conduct the same experiment. Although images in the two datasets have similar properties, the student network learned using USPS only obtains a 94.56% accuracy on the MNIST dataset, which demonstrates that it is extremely hard to find an alternative to the original training dataset. Further, Lopes et al. [15] used the "meta data", i.e., the activation records of the original data, to reconstruct the dataset, and achieved only a 92.47% accuracy. Note that the upper bound of the accuracy of the student network is 98.65%, which could be achieved only if we could find a dataset whose distribution is the same as that of the original dataset (MNIST). The proposed method, utilizing generative adversarial networks, achieves a 98.20% accuracy, which is very close to this upper bound. Also, the accuracy of the student network trained using the proposed algorithm is superior to those trained using other data (normal distribution, the USPS dataset and the dataset reconstructed from "meta data"), which suggests that our method imitates the distribution of the training dataset better.
On the fully-connected models, the classification accuracies of the teacher and the student network are 98.39% and 98.11%, respectively. Knowledge distillation improves the performance of the student network to 98.39% by transferring information from the teacher network. However, in the absence of training data, the results become unacceptable: randomly generated noise only achieves an 87.58% accuracy, and "meta data" [15] achieves a higher accuracy of 91.24%. Using the USPS dataset as an alternative achieves an accuracy of 93.99%. The proposed method achieves the highest performance of 97.91% among all methods without the original data, which demonstrates the effectiveness of the generator.
Table 3. Classification results on the CIFAR-10 and CIFAR-100 datasets.

| Algorithm | Required data | FLOPs | #params | CIFAR-10 | CIFAR-100 |
|---|---|---|---|---|---|
| Teacher | Original data | 1.16G | 21M | 94.85% | 77.34% |
| Standard back-propagation | Original data | 557M | 11M | 93.92% | 76.53% |
| Knowledge Distillation [8] | Original data | 557M | 11M | 94.34% | 76.87% |
| Normal distribution | No data | 557M | 11M | 14.89% | 1.44% |
| Alternative data | Similar data | 557M | 11M | 90.65% | 69.88% |
| Data-Free Learning (DFL) | No data | 557M | 11M | 92.22% | 74.47% |
Impact of parameters. As discussed above, the proposed method has two hyper-parameters: $\alpha$ and $\beta$. We test their impact on the accuracy of the student network by conducting experiments on the MNIST dataset. We use LeNet-5 and LeNet-5-HALF as the teacher and the student network, respectively. The other settings are the same as above.
It can be seen from Figure 2 that the student network trained with the proposed method achieves the highest accuracy (98.20%) when $\alpha = 0.1$ and $\beta = 5$. Based on the above analysis, we keep this setting of the hyper-parameters for the proposed method.
4.2 Ablation Experiments
In the sections above, we have tested and verified the effectiveness of the proposed generative method for student network learning without training data. However, there are several components, i.e., the three terms in Eq. (7), used when optimizing the generator. We further conduct ablation experiments for an explicit understanding and analysis.
The ablation experiment is also conducted on the MNIST dataset. We use LeNet-5 as the teacher network and LeNet-5-HALF as the student network. The training settings are the same as those in Section 4.1. Table 2 reports the results of various design components. Using randomly generated samples, i.e., with an untrained generator, the student network achieves an 88.01% accuracy. However, when only utilizing the one-hot loss and the activation loss, either jointly or individually, the generated samples are unbalanced, which results in poor performance of the student networks. Only introducing the information entropy loss, the student network achieves an 88.14% accuracy, since the samples do not contain enough useful information. When combining $\mathcal{L}_{oh}$ or $\mathcal{L}_{a}$ with $\mathcal{L}_{ie}$, the student network achieves a higher performance of 97.25% or 95.53%, respectively. Moreover, the accuracy of the student network is 98.20% when using all three loss functions, which is the best performance.
These ablation experiments suggest that each component of the loss function of $G$ is meaningful. By applying the proposed method, $G$ can generate balanced samples of different classes with a distribution similar to that of the original dataset, which is effective for the training of the student network.
4.3 Experiments on CIFAR
To further evaluate the effectiveness of our method, we conduct experiments on the CIFAR dataset. The CIFAR dataset consists of 32×32 pixel RGB images. There are 50,000 training images and 10,000 test images, and the last 10,000 training images are selected as a validation set for tuning the hyper-parameters. CIFAR-10 contains 10 categories and CIFAR-100 contains 100 categories. We use ResNet-34 as the teacher network and ResNet-18 as the student network (implementations follow https://github.com/kuangliu/pytorch-cifar), which are more complex and advanced architectures for further investigating the effectiveness of the proposed method. These networks are optimized using Nesterov Accelerated Gradient (NAG) with a momentum of 0.9 and weight decay. We train the networks for 200 epochs; the initial learning rate is set to 0.1 and divided by 10 at epochs 80 and 120. Random flipping, random crop and zero padding are used for data augmentation, as suggested in [7]. The generator and the student networks of the proposed method are trained for 1,500 epochs, and the other settings and hyper-parameters are the same as those in the MNIST experiments.
Table 3 reports the classification results on the CIFAR-10 and CIFAR-100 datasets. The teacher network achieves a 94.85% accuracy on CIFAR-10. The student network using knowledge distillation achieves a 94.34% accuracy, which is slightly higher than that of standard back-propagation (93.92%).
We then explore optimizing the student network without true data. Since the CIFAR dataset is more complex than MNIST, it is impossible to optimize a student network using randomly generated data following a normal distribution. Therefore, we regard the MNIST dataset without labels as alternative data to train the student network using knowledge distillation; the resulting student network only achieves a 28.29% accuracy on the CIFAR-10 dataset. Moreover, we train the student network using the CIFAR-100 dataset, which has considerable overlap with the original CIFAR-10 dataset, but this network only achieves a 90.65% accuracy, which is obviously lower than that of the teacher model. In contrast, the student network trained using the proposed method achieves a 92.22% accuracy with only synthetic data.
Besides CIFAR-10, we further verify the capability of the proposed method on the CIFAR-100 dataset, which has 100 categories and 600 images per class. Accordingly, the dimensionality of the input random vectors for the generator in our method is increased to 1,000. The accuracy of the teacher network is 77.34%, and that of the student network trained with standard back-propagation is only 76.53%. As shown in Table 3, using normally distributed data, MNIST, or CIFAR-10 to train the student network does not obtain promising results. In contrast, the student network learned by the proposed method obtains a 74.47% accuracy without any real-world training data.
4.4 Experiments on CelebA
Besides the CIFAR datasets, we also conduct experiments on the CelebA dataset, which contains 202,599 face images. To evaluate our approach fairly, we use AlexNet [10] to classify the most balanced attribute in CelebA [13], following the settings in [15]. The student network is AlexNet-Half, whose number of filters is half that of AlexNet. The original teacher network has about 57M parameters while the student network has only about 40M parameters. The networks are optimized for 100 epochs using Adam. We use a variant of the DCGAN generator [18] to generate color images. The hyper-parameters of the proposed method are the same as those in the MNIST and CIFAR experiments.
Table 4. Classification results on the CelebA dataset.

| Algorithm | FLOPs | Accuracy |
|---|---|---|
| Teacher | 711M | 81.59% |
| Standard back-propagation | 222M | 80.82% |
| Knowledge Distillation [8] | 222M | 81.35% |
| Meta data [15] | 222M | 77.56% |
| Data-Free Learning (DFL) | 222M | 80.03% |
Table 4 reports the classification results of student networks on the CelebA dataset using the proposed method and state-of-the-art learning methods. The teacher network achieves an 81.59% accuracy and the student network using standard back-propagation achieves an 80.82% accuracy. Lopes et al. [15] achieve only a 77.56% accuracy using the "meta data". The accuracy of the student network trained using the proposed method is 80.03%, which is comparable with that of the teacher network.
4.5 Extended Experiments
Extensive experiments have been conducted on several benchmarks to verify the performance of the DFL method for learning student networks using generated images, where the architectures of the student networks are more portable than those of the teacher networks. To investigate the difference between the original training images and the generated images, we use these generated images to train networks with the same architectures as the teacher networks using the proposed method. The results are reported in Table 5.
It can be found in Table 5 that LeNet-5 and HintonNet on the MNIST dataset achieve a 98.91% accuracy and a 98.39% accuracy, respectively. In contrast, the accuracies of networks with the same architectures trained on generated images are 98.47% and 98.08%, respectively, which are very close to those of the teacher networks. In addition, the networks trained on the CIFAR-10 and CIFAR-100 datasets also obtain results similar to those of the teacher networks. These results demonstrate that the proposed method can effectively approximate the original training dataset by extracting information from the teacher networks. If the network architecture is given, we can even replicate the teacher network and achieve a similar accuracy.
Table 5. Accuracies of teacher networks and of networks with the same architectures trained on generated images.

| Dataset | Model | Teacher | Student |
|---|---|---|---|
| MNIST | LeNet-5 [11] | 98.91% | 98.47% |
| MNIST | HintonNet [8] | 98.39% | 98.08% |
| CIFAR-10 | ResNet-34 [7] | 94.85% | 93.21% |
| CIFAR-100 | ResNet-34 [7] | 77.34% | 75.32% |
| CelebA | AlexNet [10] | 81.59% | 80.56% |
Filter visualization. Moreover, we visualize the filters of the LeNet-5 teacher and student networks in Figure 3. Although the student network is trained without real-world data, the filters of the student network learned by the proposed method (Figure 3(b)) are still similar to those of the teacher network (Figure 3(a)). These visualization experiments further demonstrate that the generator can produce images with patterns similar to those of the original images, and that, by utilizing the generated samples, the student network acquires valuable knowledge from the teacher network.
5 Conclusion
Conventional methods require the original training dataset for fine-tuning compressed deep neural networks to an acceptable accuracy. However, the training set and the detailed architecture information of a given deep network are routinely unavailable due to privacy and transmission limitations. In this paper, we present a novel framework to train a generator for approximating the original dataset without the training data. Portable networks can then be learned effectively through the knowledge distillation scheme. By regarding the given pretrained network as a fixed discriminator, the generator can produce images with properties similar to those in the training set. Experiments on benchmark datasets demonstrate that the proposed DFL method is able to learn portable deep neural networks without any training data.
References
 [1] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In ICML, 2015.
 [2] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
 [3] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
 [4] Y. Dong, H. Su, J. Zhu, and F. Bao. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv preprint arXiv:1708.05493, 2017.
 [5] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
 [6] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
 [7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
 [8] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [9] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.
 [10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
 [11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [12] Q. Li, S. Jin, and J. Yan. Mimicking very efficient network for object detection. In CVPR, pages 7341–7349, 2017.
 [13] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, pages 3730–3738, 2015.
 [14] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
 [15] R. G. Lopes, S. Fenu, and T. Starner. Data-free knowledge distillation for deep neural networks. arXiv preprint arXiv:1710.07535, 2017.

 [16] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In CVPR, pages 5188–5196, 2015.
 [17] A. Odena. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583, 2016.
 [18] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 [19] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, pages 525–542. Springer, 2016.
 [20] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
 [21] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017.
 [22] C. Shen, X. Wang, J. Song, L. Sun, and M. Song. Amalgamating knowledge towards comprehensive classification. arXiv preprint arXiv:1811.02796, 2018.
 [23] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
 [24] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
 [25] S. Srinivas and R. V. Babu. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149, 2015.
 [26] Y. Wang, C. Xu, S. You, D. Tao, and C. Xu. CNNpack: Packing convolutional neural networks in the frequency domain. In NIPS, pages 253–261, 2016.

 [27] J. Yim, D. Joo, J. Bae, and J. Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In CVPR, 2017.
 [28] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833. Springer, 2014.
 [29] Q. Zhang, Y. N. Wu, and S.-C. Zhu. Interpretable convolutional neural networks. In CVPR, pages 8827–8836, 2018.