1 Introduction
Deep learning is now leading many performance breakthroughs in various computer vision tasks [34]. The state-of-the-art performance of deep learning came with over-parameterized deep neural networks, which enable extracting useful representations (features) of the data automatically for a target task, when trained on a very large dataset. The optimization framework of deep neural networks with stochastic gradient descent has become very fast and efficient recently with the backpropagation technique [18, Section 6.5], using hardware units specialized for matrix/tensor computations such as graphics processing units (GPUs). The benefit of over-parameterization is empirically shown to be the key factor of the great success of deep learning, but once we find a well-trained high-accuracy model, its deployment on various inference platforms faces different requirements and challenges [53, 11]. In particular, to deploy pre-trained models on resource-limited platforms such as mobile or edge devices, computational costs and memory requirements are the critical factors that need to be considered carefully for efficient inference. Hence, model compression, also called network compression, is an important procedure for development of efficient inference models.
Model compression includes various methods such as (1) weight pruning, (2) network quantization, and (3) distillation to a network with a more efficient architecture. Weight pruning and network quantization reduce the computational cost as well as the storage/memory size, without altering the network architecture. Weight pruning compresses a model by removing redundant weights completely from it, i.e., by setting them to be zero, so we can skip computation as well as storage for the pruned weights [24, 59, 22, 41, 37, 38, 17, 15]. Network quantization reduces the memory footprint for weights and activations by quantization and is usually followed by lossless source coding for compression [23, 12, 55, 48, 54, 13]. Moreover, the convolutional and fully-connected layers can be implemented with low-precision fixed-point operations, e.g., 8-bit fixed-point operations, to lower latency and to increase power efficiency [51, 65, 66, 7, 64, 29, 58]. On the other hand, the network architecture can be modified to be simpler and easier to implement on a target platform. For example, the number of layers and/or the number of channels in each layer can be curtailed. Conventional spatial-domain convolution can be replaced with more efficient depth-wise separable convolution as in MobileNet [28].
Knowledge distillation (KD) is a well-known knowledge transfer framework to train a small “student” network under the guidance of a large pre-trained “teacher” model. The original idea from Hinton et al. [26] utilizes the soft decision output of a well-trained classification model to help train another small-size network. This original idea was further refined and advanced mostly (1) by introducing losses of matching the outputs from intermediate layers of the teacher and student [52, 63, 1], and (2) by using more sophisticated distance metrics, for example, mutual relations for multiple samples [10, 49].
One issue with existing model compression approaches (including KD) is that they are developed under a strong assumption that the original training data is accessible during the compression procedure. As datasets get larger, the distribution of datasets becomes more expensive and more difficult. Additionally, data privacy and security have emerged as one of the primary concerns in deep learning. Consequently, regulations and compliance requirements around security and privacy complicate both data sharing by the original model trainer and data collection by the model compressor, for example, in the case of medical and biometric data. Thus, there is a strong need to compress a pre-trained model without access to the original or even alternative datasets.
There have been some attempts to address the problem of data sharing in model compression [35, 6, 9, 40]. They aim to perform KD without the original datasets. The early attempts in [35, 6] circumvent this issue by assuming that some form of compressed and/or partial information on the original training data, called metadata, is provided instead, to protect the privacy and to reduce the size of the data to share. Given a pre-trained model with metadata, for example, statistics of activation outputs (feature maps) at any intermediate layers, the input is inferred in a backward manner so it matches the statistics in the metadata. On the other hand, in [9, 40], generators are introduced to produce synthetic samples for KD. Chen et al. [9] proposed training a generator by using the pre-trained teacher as a fixed discriminator. Micaelli et al. [40] used the mismatch between the teacher and the student as an adversarial loss for training a generator to produce adversarial examples for KD. The previous generator-based KD framework in [9] is rather heuristic, relying on ad-hoc losses. In [40], adversarial examples can be any images far different from the original data, which degrades the KD performance.
In this paper, we propose an adversarial knowledge distillation framework, which minimizes the possible loss for a worst case (maximum loss) via adversarial learning, when the loss with the original training data is not accessible. The key difference from [40] lies in the fact that given any metadata, we utilize them to constrain a generator in the adversarial learning framework. To avoid additional efforts to craft new metadata to share, we use the statistics stored in batch normalization layers to constrain a generator to produce synthetic samples that mimic the original training data. Furthermore, we propose producing diverse synthetic samples by using multiple generators. We also empirically show that performing adversarial KD concurrently for multiple students yields better results. The proposed data-free adversarial KD framework is summarized in Figure 1.
For model compression, we perform experiments on two scenarios, (1) data-free KD and (2) data-free network quantization. The proposed scheme shows the state-of-the-art data-free KD performance on residual networks [25] and wide residual networks [62] for SVHN [46], CIFAR-10, CIFAR-100 [33], and Tiny-ImageNet (https://tinyimagenet.herokuapp.com), compared to the previous work [9, 40, 60]. Data-free network quantization (data-free quantization-aware training) has not been investigated before to the best of our knowledge. We use TensorFlow’s quantization-aware training [29, 32] as the baseline scheme, and we evaluate the performance on residual networks, wide residual networks, and MobileNet trained on various datasets, when quantization-aware training is performed with the synthetic data generated from our data-free KD framework. The experimental results show marginal performance loss from the proposed data-free framework, compared to the case of using the original training datasets.
2 Related work
Data-free KD and quantization. Data-free KD has attracted interest owing to the need to compress pre-trained models for deployment on resource-limited mobile or edge platforms, while sharing the original training data is often restricted due to privacy and license issues.
Some of the early attempts to address this issue suggest using metadata that are the statistics of intermediate features collected from a pre-trained model [35, 6]. For example, the mean and variance of activation outputs for selected intermediate layers are proposed to be collected and assumed to be provided, instead of the original dataset. Given any metadata, they find samples that help to train student networks by directly inferring them in the image domain such that they produce similar statistics as the metadata when fed to the teacher. Recent approaches, however, aim to solve this problem without metadata specifically designed for the data-free KD task. In [44], class similarities are computed from the weights of the last fully-connected layer, and they are used instead of metadata. Very recently, it was proposed to use the statistics stored in batch normalization layers with no additional costs instead of crafting new metadata [60].
On the other hand, some of the previous approaches introduce another network, called a generator, that yields synthetic samples for training student networks [9, 40, 61]. They basically propose optimizing a generator so that the generator output produces high accuracy when fed to a pre-trained teacher. Adversarial learning was introduced to produce dynamic samples for which the teacher and the student are poorly matched in their classification output and to perform KD on those adversarial samples [40].
To our knowledge, there are few works on data-free network quantization. Weight equalization and bias correction are proposed for data-free weight quantization in [43], but data-free activation quantization is not considered. Weight equalization is a procedure to transform a pre-trained model into a quantization-friendly model by redistributing (equalizing) its weights across layers so they have smaller deviation in each layer and smaller quantization errors. The biases introduced in activations owing to weight quantization are calculated and corrected with no data but based on the statistics stored in batch normalization layers. We note that no synthetic data are produced in [43], and no data-free quantization-aware training is considered in [43]. We compare data-free KD and quantization schemes in Table 1.
Synthetic data  Metadata  Data-free
Not used  N/A  [43]* 
Inferred in the image domain  [35], [6]  [44], [60]* 
Generated from generators  N/A  [9], [40], Ours* 
* Used the statistics stored in batch normalization layers. 
Robust optimization. Robust optimization is a sub-field of optimization that addresses data uncertainty in optimization problems (e.g., see [4, 5]). Under this framework, the objective and constraint functions are assumed to belong to certain sets, called “uncertainty sets.” The goal is to make a decision that is feasible no matter what the constraints turn out to be, and optimal for the worst-case objective function. With no data provided, we formulate the problem of data-free KD into a robust optimization problem, while the uncertainty sets are decided based on the pre-trained teacher using the statistics at its batch normalization layers.
Adversarial attacks. Generating synthetic data that fool a pre-trained model is closely related to the problem of adversarial attacks (e.g., see [2]). Although their purpose is completely different from ours, the way of generating synthetic data (or adversarial samples) follows a similar procedure. In adversarial attacks, there are also two approaches, i.e., (1) generating adversarial images directly in the image domain [19, 8, 39] and (2) using generators to produce adversarial images [50, 57, 30].
Deep image prior. We also note that generator networks consisting of a series of convolutional layers can serve as a good regularizer that imposes a prior for image generation [56]. Hence, we adopt generators, instead of adding any prior regularization [42] that is employed in [60] to obtain synthetic images without generators.
Generative adversarial networks (GANs). Adversarial learning is also well-known in GANs [20]. GANs are of great interest in deep learning for image synthesis problems. Mode collapse is one of the well-known issues in GANs (e.g., see [21]). A straightforward but effective way to overcome mode collapse is to introduce multiple generators and/or multiple discriminators [16, 47, 3, 27]. We also found that using multiple generators and/or multiple students (a student acts as a discriminator in our case) helps to produce diverse samples and avoid over-fitting in our data-free KD framework.
3 Data-free model compression
3.1 Knowledge distillation (KD)
Let $t_\theta$ be a general non-linear neural network for classification, which is designed to yield a categorical probability distribution $P_\theta(y|x)$ for the label $y$ of input $x$ over the label set $\mathcal{C}$, i.e., $t_\theta(x) = [P_\theta(y|x)]_{y \in \mathcal{C}}$. Let $\mathbf{y}$ be the one-hot encoded ground-truth label $y$ over the set $\mathcal{C}$ for input $x$. The network $t_\theta$ is pre-trained with a labeled dataset, called the training dataset, of probability distribution $p(x, y)$, as below:
$$\theta^* = \arg\min_{\theta} \mathbb{E}_{p(x,y)}\left[ D(\mathbf{y}, t_\theta(x)) \right],$$
where $\mathbb{E}_{p(x,y)}$ is, in practice, an empirical expectation over the training dataset, and $D$ stands for Kullback-Leibler (KL) divergence (e.g., see [14, Section 2.3]); note that the minimization of KL divergence is equivalent to the minimization of cross-entropy, given the distribution $p(x, y)$.
Suppose that we want to train another neural network $s_\phi$, called “student”, possibly smaller and less complex than the pre-trained network $t_{\theta^*}$, called “teacher.” The student also produces its estimate of the categorical probability distribution for input $x$ such that $s_\phi(x) = [Q_\phi(y|x)]_{y \in \mathcal{C}}$. Knowledge distillation [26] suggests optimizing the student by
$$\min_{\phi} \mathbb{E}_{p(x,y)}\left[ D(\mathbf{y}, s_\phi(x)) + \lambda D(t_{\theta^*}(x), s_\phi(x)) \right], \qquad (1)$$
where $\lambda \geq 0$; note that we omitted the temperature parameter for simplicity, which can be applied before softmax for $t_{\theta^*}$ and $s_\phi$ in the second KL divergence term of (1).
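As a rough illustration of the objective in (1), the following minimal Python sketch evaluates the two KL divergence terms for a single input. The function names (`kl_divergence`, `kd_loss`), the weight name `lam`, and the toy probability vectors are illustrative assumptions that do not come from the paper; the temperature parameter is omitted, as in the text.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) between two categorical distributions."""
    # Terms with p_i = 0 contribute nothing; eps guards against division by zero in q.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def kd_loss(one_hot, teacher_probs, student_probs, lam=1.0):
    """Per-sample KD loss: label term plus lam times the teacher-matching term."""
    # For a one-hot label this term equals the cross-entropy of the student.
    label_term = kl_divergence(one_hot, student_probs)
    teacher_term = kl_divergence(teacher_probs, student_probs)
    return label_term + lam * teacher_term

# Toy example with three classes (all values are made up for illustration).
y = [0.0, 1.0, 0.0]
t = [0.1, 0.8, 0.1]
s = [0.2, 0.6, 0.2]
loss = kd_loss(y, t, s, lam=1.0)
```

In practice the expectation in (1) would be approximated by averaging such per-sample losses over mini-batches of the training dataset.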
3.2 Data-free adversarial KD
As shown in (1), the original KD is developed under the assumption that a training dataset is given for the expectation over $p(x, y)$. However, sharing a large dataset is expensive and sometimes not even possible due to privacy and security concerns. Hence, it is of interest to devise a method of KD in the situation where the training dataset is not accessible, but only a pre-trained teacher is given.
Robust optimization (e.g., see [4]) suggests minimizing the possible loss for a worst-case scenario (maximum loss) with adversarial learning under data uncertainty, which is similar to the situation we encounter when we are not given a training dataset for optimization. To adopt the robust minimax optimization (also known as adversarial learning) in KD, we first introduce a generator network $g_\psi$, which is used to produce synthetic adversarial data for the input to KD. Then, using the minimax approach, we propose data-free adversarial KD, which is given by
$$\min_{\phi} \max_{\psi} \left\{ \mathbb{E}_{p(z)}\left[ D(t_{\theta^*}(g_\psi(z)), s_\phi(g_\psi(z))) \right] - \alpha L_\psi \right\}, \qquad (2)$$
for $\alpha \geq 0$, where $p(z)$ is the distribution of the generator input $z$ and $L_\psi$ is an additional loss that a pre-trained teacher can provide for the generator based on the generator output. We defer our proposed terms in $L_\psi$ to Section 3.3.
Remark 1.
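Comparing (2) to the original KD in (1), we omit the first KL divergence term related to ground-truth labels:
$$\min_{\phi} \mathbb{E}_{p(x)}\left[ D(t_{\theta^*}(x), s_\phi(x)) \right]. \qquad (3)$$
If we have a generator $g_{\psi^*}$ optimized to mimic the training data exactly, such that $g_{\psi^*}(z)$ with $z \sim p(z)$ follows the training data distribution $p(x)$, then (3) reduces to
$$\min_{\phi} \mathbb{E}_{p(z)}\left[ D(t_{\theta^*}(g_{\psi^*}(z)), s_\phi(g_{\psi^*}(z))) \right].$$
However, we do not have access to the original training data and cannot find the optimal generator $g_{\psi^*}$. Instead, we minimize the upper bound of $\mathbb{E}_{p(z)}[D(t_{\theta^*}(g_{\psi^*}(z)), s_\phi(g_{\psi^*}(z)))]$ by solving the minimax problem in (2), while we give the generator some constraints with the auxiliary loss $L_\psi$ so that the generator produces data similar to the original training data.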
3.3 Generator constraints
We consider the following three auxiliary loss terms for the generator in the maximization step of (2) to make the generator produce “good” adversarial samples that are as similar to the original data as possible, based on the teacher.
(a) Batch normalization statistics. Batch normalization layers contain the mean and variance of layer inputs, which we can utilize as a proxy to confirm that the generator output is similar to the original training data. We propose using the KL divergence of two Gaussian distributions to match the mean and variance stored in batch normalization layers (which are obtained from the original data) and the empirical statistics obtained with the generator output.
(b) Instance categorical entropy. If the teacher is trained well enough for accurate classification, the generator output is of interest only when the categorical distribution output, i.e., softmax output, of the teacher yields small entropy (the probability for one category should be high); the entropy is minimized to zero if one category has probability $1$. That is, we need small entropy for $t_{\theta^*}(g_\psi(z))$ on each sampled $z$.
(c) Batch categorical entropy. Assuming that each class appears in the dataset with similar probability, the categorical probability distribution averaged for any batch should tend to the uniform distribution, where the entropy is maximized to $\log |\mathcal{C}|$. That is, we need high entropy for $\mathbb{E}_{p(z)}[t_{\theta^*}(g_\psi(z))]$.
Let $\mu_{l,c}$ and $\sigma^2_{l,c}$ be the mean and the variance stored in batch normalization layer $l$ for channel $c$, which are learned from the original training data. Let $\hat{\mu}_{l,c}(\psi)$ and $\hat{\sigma}^2_{l,c}(\psi)$ be the corresponding mean and variance computed for the synthetic samples from the generator $g_\psi$. The auxiliary loss $L_\psi$ for the generator in (2) is given by
$$L_\psi = \sum_{l,c} D_{\mathcal{N}}\left( (\hat{\mu}_{l,c}(\psi), \hat{\sigma}^2_{l,c}(\psi)), (\mu_{l,c}, \sigma^2_{l,c}) \right) + \mathbb{E}_{p(z)}\left[ H(t_{\theta^*}(g_\psi(z))) \right] - H\left( \mathbb{E}_{p(z)}\left[ t_{\theta^*}(g_\psi(z)) \right] \right), \qquad (4)$$
where $H$ denotes entropy (e.g., see [14, Section 2.1]), and $D_{\mathcal{N}}$ is the KL divergence of two Gaussian distributions, which can be represented as
$$D_{\mathcal{N}}\left( (\hat{\mu}, \hat{\sigma}^2), (\mu, \sigma^2) \right) = \frac{(\hat{\mu} - \mu)^2 + \hat{\sigma}^2}{2\sigma^2} - \log\frac{\hat{\sigma}}{\sigma} - \frac{1}{2}. \qquad (5)$$
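The three terms (a) to (c) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names, the list-based data layout, and the use of the natural logarithm are choices made here, and the Gaussian term is the standard closed-form KL divergence between two univariate Gaussians.

```python
import math

def gaussian_kl(mu_hat, var_hat, mu, var):
    """KL divergence D(N(mu_hat, var_hat) || N(mu, var)) for scalar Gaussians."""
    # -0.5 * log(var_hat / var) is the same as -log(sigma_hat / sigma).
    return ((mu_hat - mu) ** 2 + var_hat) / (2.0 * var) - 0.5 * math.log(var_hat / var) - 0.5

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def auxiliary_loss(synth_stats, stored_stats, teacher_outputs):
    """(a) BN-statistics KL + (b) mean instance entropy - (c) batch entropy.

    synth_stats / stored_stats: lists of (mean, variance) pairs, one per BN channel.
    teacher_outputs: list of teacher softmax vectors, one per synthetic sample.
    """
    bn_term = sum(gaussian_kl(mh, vh, m, v)
                  for (mh, vh), (m, v) in zip(synth_stats, stored_stats))
    batch_size = len(teacher_outputs)
    instance_term = sum(entropy(p) for p in teacher_outputs) / batch_size
    num_classes = len(teacher_outputs[0])
    batch_mean = [sum(p[k] for p in teacher_outputs) / batch_size
                  for k in range(num_classes)]
    return bn_term + instance_term - entropy(batch_mean)
```

A batch whose statistics match the stored ones, whose samples are each classified with full confidence, and whose classes are balanced attains the smallest value of this loss, namely minus the logarithm of the number of classes.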
Remark 2.
If $\alpha = 0$ in (2), the proposed scheme reduces to the adversarial belief matching presented in [40]. Adding the auxiliary loss $L_\psi$, we constrain the generator so it produces synthetic images that yield similar statistics in the teacher as the original data, which helps the minimax optimization avoid any adversarial samples that are very different from the original data and leads to better distillation performance (basically we reduce the loss due to fitting the model for “bad” examples not close to the original dataset). The entropy loss terms (b) and (c) are similar to those already proposed in [9]. Batch normalization statistics are used in [43, 60]. Yin et al. [60] find synthetic samples directly in the image domain with no generators by optimizing an input batch such that it produces similar batch normalization statistics in a pre-trained model. In contrast, we utilize batch normalization statistics to constrain generators. Furthermore, to match the mean and variance, the squared L2 distance is used in [60], while we propose using the KL divergence of two Gaussian distributions, which is a distance measure normalized by scale (i.e., standard deviation $\sigma$ in (5)). In [43], batch normalization statistics are used to calculate any quantization biases for correction. No synthetic images are produced in [43].
3.4 Multiple generators and multiple students
Using a mixture of generators has been proposed to avoid the mode collapse issue and to yield diverse samples that cover the whole support of a target dataset [27]. Similarly, we propose training multiple generators in our data-free KD framework to increase the diversity of generated samples. Moreover, using multiple discriminators has also been proposed to reduce the mode collapse problem in GANs [16]. A similar idea can be adopted in our framework, since we utilize the KL divergence of the student and teacher outputs as the discriminator output. The average KL divergence between the teacher and the students is maximized in minimax optimization. Intuitively, taking the average not only reduces the noise in minimax optimization using stochastic gradient descent, but also steers a generator to produce better adversarial samples that are poorly matched to every student on average. The final objective with multiple generators and multiple students is given by
$$\min_{\phi_i, 1 \leq i \leq S} \; \max_{\psi_j, 1 \leq j \leq G} \; \sum_{j=1}^{G} \left\{ \mathbb{E}_{p(z)}\left[ \frac{1}{S} \sum_{i=1}^{S} D(t_{\theta^*}(g_{\psi_j}(z)), s_{\phi_i}(g_{\psi_j}(z))) \right] - \alpha L_{\psi_j} \right\},$$
where $s_{\phi_i}$ is the $i$-th student and $g_{\psi_j}$ is the $j$-th generator for $1 \leq i \leq S$ and $1 \leq j \leq G$.
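To make the averaging over students concrete, the short sketch below computes the quantity that a single generator would maximize, assuming the objective has the form "mean teacher-student KL divergence over the students, minus a weighted auxiliary loss", consistent with the description of (2). The function name `generator_objective` and the argument names are hypothetical and are used only for illustration.

```python
def generator_objective(kl_per_student, aux_loss, alpha):
    """Value one generator maximizes: mean KL over students minus alpha * auxiliary loss.

    kl_per_student: batch-averaged KL divergences, one entry per student.
    aux_loss: auxiliary generator loss from Section 3.3 (assumed already computed).
    alpha: non-negative weighting factor for the auxiliary loss (assumed name).
    """
    mean_kl = sum(kl_per_student) / len(kl_per_student)
    return mean_kl - alpha * aux_loss

# Two students, made-up loss values.
value = generator_objective([0.2, 0.4], aux_loss=0.5, alpha=2.0)
```

With several generators, the same quantity would be evaluated once per generator and summed; each student, in turn, minimizes its own KL divergence on the samples from all generators.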
3.5 Implementation
We summarize the proposed data-free adversarial KD scheme in Algorithm 1. The generators take a random input batch, and the KD and auxiliary losses are computed and averaged over that batch. We suggest “warm-up” training of generators, optionally, before the main adversarial KD. In the warm-up stage, we train generators only to minimize the auxiliary loss so that their output matches the batch normalization statistics and entropy constraints when fed to the teacher. This pre-training procedure reduces generation of unreliable samples in the early steps of data-free KD. Furthermore, updating students more frequently than generators reduces the chances of falling into any local maximum in the minimax optimization. In the minimization step, one can additionally match intermediate layer outputs as proposed in [52, 63, 1]. Finally, data-free network quantization is implemented by letting the student be a quantized version of the teacher (see Section 4.2).
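The ordering of updates described above can be sketched as a simple schedule. Algorithm 1 itself is not reproduced here, so this is only one assumed reading of it: the function name `training_schedule`, the step labels, and the rule "update the generator once every `generator_interval` steps, update the student at every step" are illustrative assumptions.

```python
def training_schedule(warmup_steps, main_steps, generator_interval):
    """Return the ordered list of update types for one run.

    warmup_steps: generator-only updates that minimize the auxiliary loss.
    main_steps: adversarial KD steps; the student is updated at every step.
    generator_interval: the generator is updated once every this many steps,
        so students are updated more frequently than generators.
    """
    schedule = ["generator_warmup"] * warmup_steps
    for step in range(main_steps):
        if step % generator_interval == 0:
            schedule.append("generator_max")  # maximization step for the generator
        schedule.append("student_min")        # minimization step for the student
    return schedule

plan = training_schedule(warmup_steps=2, main_steps=4, generator_interval=2)
```

In a real training loop each label would be replaced by one optimizer step on a freshly sampled random input batch.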
4 Experiments
We evaluate the proposed data-free adversarial KD algorithm on two model compression tasks: (1) data-free KD to smaller networks and (2) data-free network quantization.
Generator architecture. Let conv3-$k$ denote a convolutional layer with $k$ $3 \times 3$ filters and stride $1$. Let fc-$k$ be a fully-connected layer with $k$ units. Let upsampling be a $2 \times 2$ nearest-neighbor upsampling layer. The generator input is of size and is sampled from the standard normal distribution. Given that the image size of the original data is (W,H,3), we build a generator as below:
fc-8WH, reshape-(W/8,H/8,512)
upsampling, conv3-256, batchnorm, ReLU
upsampling, conv3-128, batchnorm, ReLU
upsampling, conv3-64, batchnorm, ReLU
conv3-3, tanh, batchnorm
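As a sanity check on the layer list above, the following sketch traces the tensor shapes through the generator for a given image size. It assumes that each upsampling layer doubles the spatial resolution and that each convolution preserves it (stride 1 with "same" padding); the helper name `generator_shapes` and the latent dimension used in the example are placeholders, since the input size is not recoverable from the text.

```python
def generator_shapes(width, height, latent_dim):
    """Trace output shapes through the generator for a (width, height, 3) target image."""
    if width % 8 or height % 8:
        raise ValueError("width and height are assumed to be multiples of 8")
    shapes = [("input", (latent_dim,))]
    shapes.append(("fc", (8 * width * height,)))
    w, h, c = width // 8, height // 8, 512
    shapes.append(("reshape", (w, h, c)))
    for channels in (256, 128, 64):
        w, h = 2 * w, 2 * h            # assumed 2x nearest-neighbor upsampling
        shapes.append(("upsampling", (w, h, c)))
        c = channels                   # 3x3 conv, stride 1, same padding assumed
        shapes.append(("conv3+batchnorm+ReLU", (w, h, c)))
    shapes.append(("conv3+tanh+batchnorm", (w, h, 3)))
    return shapes

# latent_dim=128 is an arbitrary placeholder, not a value from the paper.
layers = generator_shapes(32, 32, latent_dim=128)
```

For a 32x32 target, the fully-connected layer has 8192 units, which reshape exactly to (4, 4, 512), and three upsampling stages bring the output back to (32, 32, 3).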
Training. For training generators in maximization, we use Adam optimizer [31] with momentum and learning rate . On the other hand, for training students in minimization, we use Nesterov accelerated gradient [45] with momentum and learning rate . The learning rates are annealed by cosine decaying [36]. We adopt the vanilla KD for data-free KD from WRN-40-2 to WRN-16-1 on CIFAR-10. We use epochs in the warm-up stage and epochs for the main adversarial KD, where each epoch consists of batches of batch size . In the other cases, we adopt variational information distillation (VID) [1] to match intermediate layer outputs, where we reduce the number of batches per epoch to ; VID is one of the state-of-the-art KD variants, and it yields better student accuracy with faster convergence. For the weighting factor in (2), we perform experiments on and choose the best results. The generator update interval is set to be for wide residual networks and for the others. Except for the results in Table 3, we use one generator and one student in Algorithm 1 for our data-free KD.
4.1 Data-free model compression
Student accuracy (%) is reported for the data-free KD methods (Ours, [40], [9], [60]) and for training from scratch* and VID [1]*.
Original dataset  Teacher (# params)  Student (# params)  Teacher accuracy (%)  Ours  [40]  [9]  [60]  Training from scratch*  VID [1]*
SVHN  WRN-40-2 (2.2M)  WRN-16-1 (0.2M)  98.04  96.48  94.06  N/A  N/A  97.67  97.60
CIFAR-10  WRN-40-2 (2.2M)  WRN-16-1 (0.2M)  94.77  86.14  83.69  N/A  N/A  90.97  91.78
CIFAR-10  WRN-40-2 (2.2M)  WRN-40-1 (0.6M)  94.77  91.69  86.60  N/A  N/A  93.35  93.67
CIFAR-10  WRN-40-2 (2.2M)  WRN-16-2 (0.7M)  94.77  92.01  89.71  N/A  N/A  93.72  94.06
CIFAR-10  VGG-11 (9.2M)  ResNet-18 (11.2M)  92.37  90.84  N/A  N/A  90.36  94.56  91.47
CIFAR-10  ResNet-34 (21.3M)  ResNet-18 (11.2M)  95.11  94.61  N/A  92.22  93.26  94.56  94.90
CIFAR-100  ResNet-34 (21.3M)  ResNet-18 (11.2M)  78.34  77.01  N/A  74.47  N/A  77.32  77.77
Tiny-ImageNet  ResNet-34 (21.4M)  ResNet-18 (11.3M)  66.34  63.73  N/A  N/A  N/A  64.87  66.01
* Used the original datasets.
We evaluate the performance of the proposed data-free model compression scheme on SVHN, CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets for KD of residual networks (ResNets) and wide residual networks (WRNs). We summarize the main results in Table 2. We compare our scheme to the previous data-free KD methods in [40, 9, 60] and show that we achieve the state-of-the-art data-free KD performance in all evaluation cases. We also obtain the student accuracy when students are trained with the original datasets from scratch and by using variational information distillation (VID) in [1]. Table 2 shows that the accuracy losses of our data-free KD method are marginal, compared to the cases of using the original datasets.
Example synthetic images. We show example synthetic images obtained from generators trained with teachers pre-trained for SVHN, CIFAR-10, and CIFAR-100 datasets, respectively, in Figure 2, Figure 3, and Figure 4. The figures show that the generators regularized with pre-trained teachers produce samples that are similar to the original datasets.
[Figure: (a) Training KL divergence; (b) Student test accuracy.]
[Figure: synthetic images of the classes Automobile, Bird, and Horse at epochs 10, 50, 100, and 200, in panels (a) and (b).]
Ablation study. For the ablation study, we evaluate the proposed data-free KD scheme with and without each term of the auxiliary loss for the generator in (4), and the results are summarized in Figure 5. The bar graph shows that the major contribution comes from (a), which is to match batch normalization statistics (see Section 3.3). In Figure 6, we present the impact of the weighting factor $\alpha$ in (2) on KD performance. Moreover, to visually show the impact of $\alpha$ on the generation of synthetic data, we collect synthetic images for two values of $\alpha$ and show them at different epochs in Figure 7. The figures show that smaller $\alpha$ yields more diverse adversarial images, since the generator is constrained less. As $\alpha$ gets larger, the generated images collapse to one mode for each class, which leads to over-fitting.
# students / # generators  1  2
1  86.14  86.67
2  86.44  87.04
Multiple generators and multiple students. We show the gain of using multiple generators and/or multiple students in Table 3. We compare the cases of using two generators and/or two students. For the second generator, we replace one middle convolutional layer with a residual block. For KD to two students, we use identical students with different initialization. Table 3 shows that increasing the number of generators and/or the number of students results in better student accuracy in data-free KD.
4.2 Data-free network quantization
Quantized model accuracy (%) is reported for our data-free schemes (DFQ, DF-QAT-KD) and for the data-dependent schemes [29]* (Q, QAT, QAT-KD).
Original dataset  Pre-trained model (accuracy %)  Quantization bit-width for weights / activations  DFQ  DF-QAT-KD  Q*  QAT*  QAT-KD*
SVHN  WRN-16-1 (97.67)  8 / 8  97.67  97.74  97.70  97.71  97.78
SVHN  WRN-16-1 (97.67)  4 / 8  91.92  97.53  93.83  97.66  97.70
CIFAR-10  WRN-16-1 (90.97)  8 / 8  90.51  90.90  90.95  91.21  91.16
CIFAR-10  WRN-16-1 (90.97)  4 / 8  86.29  88.91  86.74  90.92  90.71
CIFAR-10  WRN-40-2 (94.77)  8 / 8  94.47  94.76  94.75  94.91  95.02
CIFAR-10  WRN-40-2 (94.77)  4 / 8  93.14  94.22  93.56  94.73  94.42
CIFAR-100  ResNet-18 (77.32)  8 / 8  76.68  77.30  77.43  77.84  77.73
CIFAR-100  ResNet-18 (77.32)  4 / 8  71.02  75.15  69.63  75.52  75.62
Tiny-ImageNet  MobileNet v1 (64.34)  8 / 8  51.76  63.11  54.48  61.94  64.53
* Used the original datasets.
Quantized model accuracy (%) before / after fine-tuning with KD.
Dataset used in KD  WRN-16-1 (SVHN)  WRN-40-2 (CIFAR-10)  ResNet-18 (CIFAR-100)
SVHN  93.83 / 97.70  71.89 / 92.08  13.41 / 65.07
CIFAR-10  93.50 / 97.24  93.56 / 94.42  67.50 / 75.62
CIFAR-100  94.11 / 97.26  92.18 / 94.10  69.63 / 75.62
Ours (data-free)  91.92 / 97.53  93.14 / 94.22  71.02 / 75.15
In this subsection, we present the experimental results of the proposed data-free adversarial KD scheme on network quantization. For the baseline quantization scheme, we use TensorFlow’s quantization framework. In particular, we implement our data-free KD scheme in the quantization-aware training framework [29, 32] of TensorFlow (https://github.com/tensorflow/tensorflow/tree/r1.15/tensorflow/contrib/quantize).
TensorFlow’s quantization-aware training performs per-layer asymmetric quantization of weights and activations. For quantization only, no data are needed for weight quantization, but quantization of activations requires representative data, which are used to collect the range (the minimum and the maximum) of activations and to determine the quantization bin size based on the range. In our data-free quantization, we use synthetic data from a generator as the representative data. To this end, we train a generator with no adversarial loss as in the warm-up stage of Algorithm 1 (see DFQ in Table 4). For our data-free quantization-aware training, we utilize the proposed adversarial KD on top of TensorFlow’s quantization-aware framework, where a quantized network is set as the student and a pre-trained floating-point model is given as the teacher, which is denoted by DF-QAT-KD in Table 4.
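The role of the activation range can be illustrated with a generic uniform asymmetric quantizer. The sketch below is not TensorFlow's implementation; it is a minimal example, with hypothetical names (`quantize_asymmetric`, `zero_point`) and made-up activation values, of how a collected minimum and maximum determine the bin size and the integer codes. It assumes the maximum is strictly larger than the minimum.

```python
def quantize_asymmetric(values, v_min, v_max, num_bits=8):
    """Uniform asymmetric quantization of floats to integers in [0, 2**num_bits - 1].

    v_min / v_max: range collected from representative data (assumes v_max > v_min).
    Returns the integer codes, the de-quantized values, and the bin size.
    """
    levels = 2 ** num_bits - 1
    scale = (v_max - v_min) / levels                 # quantization bin size
    zero_point = round(-v_min / scale)               # integer code that represents 0.0
    zero_point = min(max(zero_point, 0), levels)
    codes = [min(max(round(v / scale) + zero_point, 0), levels) for v in values]
    dequantized = [(q - zero_point) * scale for q in codes]
    return codes, dequantized, scale

# Range taken from (synthetic) representative activations; values are made up.
acts = [-0.7, 0.0, 0.31, 1.9]
q, dq, scale = quantize_asymmetric(acts, min(acts), max(acts), num_bits=8)
```

In the data-free setting described above, the only change is where `acts` comes from: the range is collected by feeding generator outputs, rather than original training samples, through the network.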
We follow the training hyperparameters as described in Section 4.1, while we set the initial learning rate for KD to be . We use epochs for the warm-up stage and epochs for quantization-aware training with data-free KD. We adopt the vanilla KD with no intermediate layer output matching terms. We summarize the results in Table 4.
For comparison, we evaluate three conventional data-dependent quantization schemes using the original training datasets, i.e., quantization only (Q), quantization-aware training (QAT), and quantization-aware training with KD (QAT-KD). As presented in Table 4, our data-free quantization shows very marginal accuracy losses less than 2% for 4-bit/8-bit weight and 8-bit activation quantization in all the evaluated cases, compared to using the original datasets.
Finally, we compare our data-free quantization to using alternative datasets. We consider two cases: (1) when a similar dataset is used (e.g., CIFAR-100 instead of CIFAR-10) and (2) when a mismatched dataset is used (e.g., SVHN instead of CIFAR-10). The results in Table 5 show that using a mismatched dataset degrades the performance considerably. Using a similar dataset achieves comparable performance to our data-free scheme, which shows small accuracy losses less than 0.5% compared to using the original datasets. We note that even alternative data, which are safe from privacy and regulatory concerns, are hard to collect in usual cases.
5 Conclusion
In this paper, we proposed data-free adversarial KD for network quantization and compression. No original data are used in the proposed framework, while we train a generator to produce synthetic data adversarial to KD. In particular, we propose matching batch normalization statistics in the teacher to additionally constrain the generator to produce samples similar to the original training data. We used the proposed data-free KD scheme for compression of various models trained on SVHN, CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. In our experiments, we achieved the state-of-the-art data-free KD performance over the existing data-free KD schemes. For network quantization, we obtained quantized models that achieve comparable accuracy to the models quantized and fine-tuned with the original training datasets. The proposed framework shows great potential to keep data privacy in model compression.
References

[1]
(2019)
Variational information distillation for knowledge transfer.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp. 9163–9171. Cited by: §1, §3.5, §4.1, Table 2, §4.  [2] (2018) Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access 6, pp. 14410–14430. Cited by: §2.

[3]
(2017)
Generalization and equilibrium in generative adversarial nets (GANs).
In
International Conference on Machine Learning
, pp. 224–232. Cited by: §2.  [4] (2009) Robust optimization. Vol. 28, Princeton University Press. Cited by: §2, §3.2.
 [5] (2011) Theory and applications of robust optimization. SIAM review 53 (3), pp. 464–501. Cited by: §2.
 [6] (2019) Dream distillation: a dataindependent model compression framework. In ICML Joint Workshop on OnDevice Machine Learning and Compact Deep Neural Network Representations (ODMLCDNNR), Cited by: §1, Table 1, §2.
 [7] (2017) Deep learning with low precision by halfwave Gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5918–5926. Cited by: §1.
 [8] (2017) Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57. Cited by: §2.
 [9] (2019) Datafree learning of student networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3514–3522. Cited by: §1, §1, Table 1, §2, §4.1, Table 2, Remark 2.

[10]
(2018)
DarkRank: accelerating deep metric learning via cross sample similarities transfer.
In
Proceedings of the AAAI Conference on Artificial Intelligence
, Cited by: §1.  [11] (2018) Model compression and acceleration for deep neural networks: the principles, progress, and challenges. IEEE Signal Processing Magazine 35 (1), pp. 126–136. Cited by: §1.
 [12] (2017) Towards the limit of network quantization. In International Conference on Learning Representations, Cited by: §1.
 [13] (2020) Universal deep neural network compression. IEEE Journal of Selected Topics in Signal Processing. Cited by: §1.
 [14] (2012) Elements of information theory. John Wiley & Sons. Cited by: §3.1, §3.3.
 [15] (2018) Compressing neural networks using the variational information bottleneck. In International Conference on Machine Learning, pp. 1135–1144. Cited by: §1.
 [16] (2017) Generative multi-adversarial networks. In International Conference on Learning Representations, Cited by: §2, §3.4.
 [17] (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. In International Conference on Learning Representations, Cited by: §1.
 [18] (2016) Deep learning. MIT Press. Cited by: §1.
 [19] (2014) Explaining and harnessing adversarial examples. In International Conference on Learning Representations, Cited by: §2.
 [20] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680. Cited by: §2.
 [21] (2016) NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160. Cited by: §2.
 [22] (2016) Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, pp. 1379–1387. Cited by: §1.
 [23] (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations, Cited by: §1.
 [24] (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143. Cited by: §1.
 [25] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §1.
 [26] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1, §3.1.
 [27] (2018) MGAN: training generative adversarial nets with multiple generators. In International Conference on Learning Representations, Cited by: §2, §3.4.

 [28] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1.
 [29] (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713. Cited by: §1, §1, §4.2, Table 4.
 [30] (2019) Adversarial defense via learning to generate diverse attacks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2740–2749. Cited by: §2.
 [31] (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations, Cited by: §4.
 [32] (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342. Cited by: §1, §4.2.
 [33] (2009) Learning multiple layers of features from tiny images. Technical report, Univ. of Toronto. Cited by: §1.
 [34] (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1.
 [35] (2017) Data-free knowledge distillation for deep neural networks. In NeurIPS Workshop on Learning with Limited Data, Cited by: §1, Table 1, §2.
 [36] (2017) SGDR: stochastic gradient descent with warm restarts. In International Conference on Learning Representations, Cited by: §4.
 [37] (2017) Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pp. 3290–3300. Cited by: §1.
 [38] (2018) Learning sparse neural networks through $L_0$ regularization. In International Conference on Learning Representations, Cited by: §1.
 [39] (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, Cited by: §2.
 [40] (2019) Zero-shot knowledge transfer via adversarial belief matching. In Advances in Neural Information Processing Systems, pp. 9547–9557. Cited by: §1, §1, §1, Table 1, §2, §4.1, Table 2, Remark 2.
 [41] (2017) Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning, pp. 2498–2507. Cited by: §1.
 [42] (2015) (Website) Note: https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html [Online; accessed 18-April-2020]. Cited by: §2.
 [43] (2019) Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1325–1334. Cited by: Table 1, §2, Remark 2.
 [44] (2019) Zero-shot knowledge distillation in deep networks. In International Conference on Machine Learning, pp. 4743–4751. Cited by: Table 1, §2.
 [45] (1983) A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$. In Doklady AN USSR, Vol. 269, pp. 543–547. Cited by: §4.
 [46] (2011) Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, Cited by: §1.
 [47] (2017) Dual discriminator generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2670–2680. Cited by: §2.
 [48] (2017) Weighted-entropy-based quantization for deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7197–7205. Cited by: §1.
 [49] (2019) Relational knowledge distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3967–3976. Cited by: §1.
 [50] (2018) Generative adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4422–4431. Cited by: §2.
 [51] (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proceedings of the European Conference on Computer Vision, pp. 525–542. Cited by: §1.
 [52] (2015) FitNets: hints for thin deep nets. In International Conference on Learning Representations, Cited by: §1, §3.5.
 [53] (2017) Efficient processing of deep neural networks: a tutorial and survey. Proceedings of the IEEE 105 (12), pp. 2295–2329. Cited by: §1.
 [54] (2018) Deep neural network compression by in-parallel pruning-quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
 [55] (2017) Soft weight-sharing for neural network compression. In International Conference on Learning Representations, Cited by: §1.
 [56] (2018) Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454. Cited by: §2.
 [57] (2019) A direct approach to robust deep learning using adversarial networks. In International Conference on Learning Representations, Cited by: §2.
 [58] (2019) HAQ: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612–8620. Cited by: §1.
 [59] (2016) Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082. Cited by: §1.
 [60] (2019) Dreaming to distill: data-free knowledge transfer via DeepInversion. arXiv preprint arXiv:1912.08795. Cited by: §1, Table 1, §2, §2, §4.1, Table 2, Remark 2.
 [61] (2019) Knowledge extraction with no observable data. In Advances in Neural Information Processing Systems, pp. 2701–2710. Cited by: §2.
 [62] (2016) Wide residual networks. In Proceedings of the British Machine Vision Conference, pp. 87.1–87.12. Cited by: §1.
 [63] (2017) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations, Cited by: §1, §3.5.
 [64] (2018) LQ-Nets: learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision, pp. 365–382. Cited by: §1.
 [65] (2016) DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160. Cited by: §1.
 [66] (2017) Trained ternary quantization. In International Conference on Learning Representations, Cited by: §1.