Data-Free Network Quantization With Adversarial Knowledge Distillation

05/08/2020 ∙ by Yoojin Choi, et al. ∙ DGIST ∙ SAMSUNG

Network quantization is an essential procedure in deep learning for development of efficient fixed-point inference models on mobile or edge platforms. However, as datasets grow larger and privacy regulations become stricter, data sharing for model compression becomes more difficult and restricted. In this paper, we consider data-free network quantization with synthetic data. The synthetic data are generated from a generator, while no data are used in training the generator or in quantization. To this end, we propose data-free adversarial knowledge distillation, which minimizes the maximum distance between the outputs of the teacher and the (quantized) student for any adversarial samples from a generator. To generate adversarial samples similar to the original data, we additionally propose matching statistics from the batch normalization layers of the teacher for generated data and the original data. Furthermore, we show the gain of producing diverse adversarial samples by using multiple generators and multiple students. Our experiments show state-of-the-art data-free model compression and quantization results for (wide) residual networks and MobileNet on the SVHN, CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. The accuracy losses compared to using the original datasets are shown to be minimal.

1 Introduction

Figure 1: Data-free adversarial knowledge distillation. We minimize the maximum of the Kullback-Leibler (KL) divergence between the teacher and student outputs. In the maximization step for training the generator to produce adversarial images, the generator is constrained to produce synthetic images similar to the original data by matching the statistics from the batch normalization layers of the teacher.

Deep learning is now leading many performance breakthroughs in various computer vision tasks [34]. The state-of-the-art performance of deep learning comes from over-parameterized deep neural networks, which, when trained on a very large dataset, automatically extract useful representations (features) of the data for a target task. The optimization of deep neural networks with stochastic gradient descent has become fast and efficient thanks to the backpropagation technique [18, Section 6.5] and hardware units specialized for matrix/tensor computations such as graphics processing units (GPUs). The benefit of over-parameterization is empirically shown to be a key factor in the success of deep learning, but once a well-trained high-accuracy model is found, its deployment on various inference platforms faces different requirements and challenges [53, 11]. In particular, to deploy pre-trained models on resource-limited platforms such as mobile or edge devices, computational costs and memory requirements are critical factors that need to be considered carefully for efficient inference. Hence, model compression, also called network compression, is an important procedure for the development of efficient inference models.

Model compression includes various methods such as (1) weight pruning, (2) network quantization, and (3) distillation to a network with a more efficient architecture. Weight pruning and network quantization reduce the computational cost as well as the storage/memory size, without altering the network architecture. Weight pruning compresses a model by removing redundant weights completely from it, i.e., by setting them to zero, so we can skip both computation and storage for the pruned weights [24, 59, 22, 41, 37, 38, 17, 15]. Network quantization reduces the memory footprint for weights and activations by quantization and is usually followed by lossless source coding for compression [23, 12, 55, 48, 54, 13]. Moreover, the convolutional and fully-connected layers can be implemented with low-precision fixed-point operations, e.g., 8-bit fixed-point operations, to lower latency and to increase power efficiency [51, 65, 66, 7, 64, 29, 58]. On the other hand, the network architecture can be modified to be simpler and easier to implement on a target platform. For example, the number of layers and/or the number of channels in each layer can be curtailed, or conventional spatial-domain convolution can be replaced with more efficient depth-wise separable convolution as in MobileNet [28].

Knowledge distillation (KD) is a well-known knowledge transfer framework for training a small “student” network under the guidance of a large pre-trained “teacher” model. The original idea from Hinton et al. [26] utilizes the soft decision output of a well-trained classification model to help train another, smaller network. This idea was further refined and advanced mostly (1) by introducing losses that match the outputs of intermediate layers of the teacher and student [52, 63, 1], and (2) by using more sophisticated distance metrics, for example, mutual relations among multiple samples [10, 49].

One issue with existing model compression approaches (including KD) is that they are developed under the strong assumption that the original training data are accessible during the compression procedure. As datasets grow larger, distributing them becomes more expensive and more difficult. Moreover, data privacy and security have emerged as primary concerns in deep learning. Consequently, regulations and compliance requirements around security and privacy complicate both data sharing by the original model trainer and data collection by the model compressor, for example, in the case of medical and biometric data. Thus, there is a strong need to compress a pre-trained model without access to the original or even alternative datasets.

There have been some attempts to address the problem of data sharing in model compression [35, 6, 9, 40]. They aim to perform KD without the original datasets. The early attempts in [35, 6] circumvent this issue by assuming that some form of compressed and/or partial information on the original training data, called meta-data, is provided instead, to protect privacy and to reduce the size of the data to share. Given a pre-trained model with meta-data, for example, statistics of activation outputs (feature maps) at intermediate layers, the input is inferred in a backward manner so that it matches the statistics in the meta-data. On the other hand, in [9, 40], generators are introduced to produce synthetic samples for KD. Chen et al. [9] proposed training a generator by using the pre-trained teacher as a fixed discriminator. Micaelli et al. [40] used the mismatch between the teacher and the student as an adversarial loss for training a generator to produce adversarial examples for KD. The previous generator-based KD framework in [9] is rather heuristic, relying on ad-hoc losses. In [40], the adversarial examples can be arbitrary images far different from the original data, which degrades the KD performance.

In this paper, we propose an adversarial knowledge distillation framework, which minimizes the possible loss for a worst case (maximum loss) via adversarial learning, when the loss with the original training data is not accessible. The key difference from [40] lies in the fact that given any meta-data, we utilize them to constrain a generator in the adversarial learning framework. To avoid additional efforts to craft new meta-data to share, we use the statistics stored in batch normalization layers to constrain a generator to produce synthetic samples that mimic the original training data. Furthermore, we propose producing diverse synthetic samples by using multiple generators. We also empirically show that performing adversarial KD concurrently for multiple students yields better results. The proposed data-free adversarial KD framework is summarized in Figure 1.

For model compression, we perform experiments on two scenarios: (1) data-free KD and (2) data-free network quantization. The proposed scheme shows the state-of-the-art data-free KD performance on residual networks [25] and wide residual networks [62] for SVHN [46], CIFAR-10, CIFAR-100 [33], and Tiny-ImageNet (https://tiny-imagenet.herokuapp.com), compared to the previous work [9, 40, 60]. Data-free network quantization (data-free quantization-aware training) has not been investigated before, to the best of our knowledge. We use TensorFlow’s quantization-aware training [29, 32] as the baseline scheme, and we evaluate the performance on residual networks, wide residual networks, and MobileNet trained on various datasets, when quantization-aware training is performed with the synthetic data generated from our data-free KD framework. The experimental results show marginal performance loss for the proposed data-free framework, compared to the case of using the original training datasets.

2 Related work

Data-free KD and quantization. Data-free KD has attracted interest owing to the need to compress pre-trained models for deployment on resource-limited mobile or edge platforms, while sharing the original training data is often restricted due to privacy and license issues.

Some early attempts to address this issue suggest using meta-data, i.e., the statistics of intermediate features collected from a pre-trained model [35, 6]. For example, the mean and variance of activation outputs at selected intermediate layers are collected and assumed to be provided instead of the original dataset. Given such meta-data, these methods find samples that help to train student networks by directly inferring them in the image domain such that they produce statistics similar to the meta-data when fed to the teacher. Recent approaches, however, aim to solve this problem without meta-data specifically designed for the data-free KD task. In [44], class similarities are computed from the weights of the last fully-connected layer and are used instead of meta-data. Very recently, it has been proposed to use the statistics stored in batch normalization layers, at no additional cost, instead of crafting new meta-data [60].

On the other hand, some previous approaches introduce another network, called a generator, that yields synthetic samples for training student networks [9, 40, 61]. They basically propose optimizing a generator so that its output yields high accuracy when fed to a pre-trained teacher. Adversarial learning was introduced to produce dynamic samples on which the teacher and the student are poorly matched in their classification outputs and to perform KD on those adversarial samples [40].

To our knowledge, there are few works on data-free network quantization. Weight equalization and bias correction are proposed for data-free weight quantization in [43], but data-free activation quantization is not considered. Weight equalization is a procedure to transform a pre-trained model into a quantization-friendly model by re-distributing (equalizing) its weights across layers so they have smaller deviation in each layer and smaller quantization errors. The biases introduced in activations owing to weight quantization are calculated and corrected with no data but based on the statistics stored in batch normalization layers. We note that no synthetic data are produced in [43], and no data-free quantization-aware training is considered in [43]. We compare data-free KD and quantization schemes in Table 1.

Synthetic data | Meta-data | Data-free
Not used | N/A | [43]*
Inferred in the image domain | [35], [6] | [44], [60]*
Generated from generators | N/A | [9], [40], Ours*
* Used the statistics stored in batch normalization layers.
Table 1: Comparison of data-free KD and network quantization schemes based on (1) how they generate synthetic data and (2) whether they rely on meta-data or not.

Robust optimization. Robust optimization is a sub-field of optimization that addresses data uncertainty in optimization problems (e.g., see [4, 5]). Under this framework, the objective and constraint functions are assumed to belong to certain sets, called “uncertainty sets.” The goal is to make a decision that is feasible no matter what the constraints turn out to be, and optimal for the worst-case objective function. With no data provided, we formulate the problem of data-free KD into a robust optimization problem, while the uncertainty sets are decided based on the pre-trained teacher using the statistics at its batch normalization layers.
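In symbols, a robust optimization problem takes the generic minimax form (stated here for concreteness; see [4, 5] for the general theory)

$\min_{x} \; \max_{u \in \mathcal{U}} \; f(x, u),$

where $x$ is the decision variable, $\mathcal{U}$ is the uncertainty set, and $f$ is the loss. In our setting, $x$ corresponds to the student parameters, $u$ to the synthetic inputs produced by the generator (restricted via the batch normalization statistics of the teacher), and $f$ to the teacher-student KL divergence, as formalized in Section 3.2.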

Adversarial attacks. Generating synthetic data that fool a pre-trained model is closely related to the problem of adversarial attacks (e.g., see [2]). Although their purpose is completely different from ours, the way of generating synthetic data (or adversarial samples) follows a similar procedure. In adversarial attacks, there are also two approaches, i.e., (1) generating adversarial images directly in the image domain [19, 8, 39] and (2) using generators to produce adversarial images [50, 57, 30].

Deep image prior. We also note that a generator network consisting of a series of convolutional layers can serve as a good regularizer, i.e., an image prior, for image generation [56]. Hence, we adopt generators instead of adding the prior regularization [42] that is employed in [60] to obtain synthetic images without generators.

Generative adversarial networks (GANs). Adversarial learning is also well known from GANs [20], which are of great interest in deep learning for image synthesis problems. Mode collapse is one of the well-known issues in GANs (e.g., see [21]). A straightforward but effective way to overcome mode collapse is to introduce multiple generators and/or multiple discriminators [16, 47, 3, 27]. We also found that using multiple generators and/or multiple students (a student acts as a discriminator in our case) helps to produce diverse samples and avoid over-fitting in our data-free KD framework.

3 Data-free model compression

3.1 Knowledge distillation (KD)

Let $t_\theta$ be a general non-linear neural network for classification, which is designed to yield a categorical probability distribution $P_\theta(y|x)$ for the label $y$ of input $x$ over the label set $\mathcal{C}$, i.e., $P_\theta(y|x) = \mathrm{softmax}(t_\theta(x))$. Let $\hat{P}(y|x)$ be the one-hot encoded ground-truth label over the set $\mathcal{C}$ for input $x$. The network $t_\theta$ is pre-trained with a labeled dataset, called the training dataset, of probability distribution $p(x)$, as below:

$\theta^* = \arg\min_\theta \mathbb{E}_{p(x)}\big[\mathrm{KL}\big(\hat{P}(y|x) \,\|\, P_\theta(y|x)\big)\big],$

where $\mathbb{E}_{p(x)}$ is, in practice, an empirical expectation over the training dataset, and $\mathrm{KL}$ stands for the Kullback-Leibler (KL) divergence (e.g., see [14, Section 2.3]); note that the minimization of the KL divergence is equivalent to the minimization of cross-entropy, given the distribution $\hat{P}$.

Suppose that we want to train another neural network $s_\phi$, called the “student”, possibly smaller and less complex than the pre-trained network $t_\theta$, called the “teacher.” The student also produces its estimate $Q_\phi(y|x)$ of the categorical probability distribution for input $x$ such that $Q_\phi(y|x) = \mathrm{softmax}(s_\phi(x))$. Knowledge distillation [26] suggests optimizing the student by

$\phi^* = \arg\min_\phi \mathbb{E}_{p(x)}\big[(1-\lambda)\,\mathrm{KL}\big(\hat{P}(y|x) \,\|\, Q_\phi(y|x)\big) + \lambda\,\mathrm{KL}\big(P_\theta(y|x) \,\|\, Q_\phi(y|x)\big)\big], \quad (1)$

where $0 \le \lambda \le 1$; note that we omit the temperature parameter for simplicity, which can be applied before the softmax of $P_\theta$ and $Q_\phi$ in the second KL divergence term of (1).
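To make (1) concrete, below is a minimal PyTorch-style sketch of the distillation loss; the function name, the weighting factor lambda_kd, and the temperature T are illustrative assumptions rather than the authors' implementation.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, lambda_kd=0.9, T=1.0):
    """Vanilla KD objective of (1): a weighted sum of the hard-label cross-entropy
    and the teacher-student KL divergence, with an optional temperature T."""
    # Hard-label term: cross-entropy with the one-hot ground truth.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL(P_teacher || Q_student) on temperature-scaled logits.
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_q_student = F.log_softmax(student_logits / T, dim=1)
    kl = F.kl_div(log_q_student, p_teacher, reduction="batchmean") * (T * T)
    return (1.0 - lambda_kd) * ce + lambda_kd * kl
```

Note that F.kl_div expects log-probabilities for its first argument and probabilities for its second, which matches the direction KL(P_teacher || Q_student) used in (1).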

3.2 Data-free adversarial KD

As shown in (1), the original KD is developed under the assumption that a training dataset is given for the expectation over $p(x)$. However, sharing a large dataset is expensive and sometimes not even possible due to privacy and security concerns. Hence, it is of interest to devise a method of KD for the situation where the training dataset is not accessible and only a pre-trained teacher is given.

Robust optimization (e.g., see [4]) suggests minimizing the possible loss for a worst-case scenario (maximum loss) with adversarial learning under data uncertainty, which is similar to the situation we encounter when no training dataset is given for optimization. To adopt robust minimax optimization (also known as adversarial learning) in KD, we first introduce a generator network $g_\psi$, which produces synthetic adversarial data as the input to KD. Then, using the minimax approach, we propose data-free adversarial KD, which is given by

$\phi^* = \arg\min_\phi \max_\psi \; \mathbb{E}_{z \sim \mathcal{N}(0, I)}\big[\mathrm{KL}\big(P_\theta(y|x) \,\|\, Q_\phi(y|x)\big)\big] - \alpha L_{\text{gen}}(\psi), \quad (2)$

for $x = g_\psi(z)$, where $\alpha \ge 0$ is a weighting factor and $L_{\text{gen}}(\psi)$ is an additional loss that a pre-trained teacher can provide for the generator based on the generator output. We defer our proposed terms in $L_{\text{gen}}$ to Section 3.3.

Remark 1.

Comparing (2) to the original KD in (1), we omit the first KL divergence term related to the ground-truth labels, which leaves

$\min_\phi \mathbb{E}_{p(x)}\big[\mathrm{KL}\big(P_\theta(y|x) \,\|\, Q_\phi(y|x)\big)\big]. \quad (3)$

If we had a generator $g_{\psi^*}$ optimized to mimic the training data exactly, so that the distribution $p_{\psi^*}(x)$ of its outputs satisfies $p_{\psi^*}(x) = p(x)$, then (3) would reduce to

$\min_\phi \mathbb{E}_{p_{\psi^*}(x)}\big[\mathrm{KL}\big(P_\theta(y|x) \,\|\, Q_\phi(y|x)\big)\big].$

However, we do not have access to the original training data and cannot find the optimal generator $g_{\psi^*}$. Instead, we minimize an upper bound of (3) by solving the minimax problem in (2), while we constrain the generator with the auxiliary loss $L_{\text{gen}}$ so that it produces data similar to the original training data.

3.3 Generator constraints

We consider the following three auxiliary loss terms for the generator in the maximization step of (2), which encourage the generator to produce “good” adversarial samples that are as similar to the original data as possible, based on the teacher.

  (a) Batch normalization statistics. Batch normalization layers store the mean and variance of their inputs, which we can utilize as a proxy to confirm that the generator output is similar to the original training data. We propose using the KL divergence of two Gaussian distributions to match the mean and variance stored in the batch normalization layers (which are obtained from the original data) and the empirical statistics obtained with the generator output.

  (b) Instance categorical entropy. If the teacher is trained well enough for accurate classification, the generator output is of interest only when the categorical distribution output of the teacher, i.e., its softmax output, has small entropy (the probability of one category should be high); the entropy is minimized to zero when one category has probability 1. That is, we need a small entropy of $P_\theta(y|g_\psi(z))$ for each sampled $z$.

  (c) Batch categorical entropy. Assuming that each class appears in the dataset with similar probability, the categorical probability distribution averaged over any batch should tend to the uniform distribution, whose entropy is maximal ($\log|\mathcal{C}|$). That is, we need a high entropy of the batch average $\mathbb{E}_{z}\big[P_\theta(y|g_\psi(z))\big]$.

Let $\mu_{l,c}$ and $\sigma_{l,c}^2$ be the mean and the variance stored in batch normalization layer $l$ for channel $c$, which are learned from the original training data. Let $\hat{\mu}_{l,c}(\psi)$ and $\hat{\sigma}_{l,c}^2(\psi)$ be the corresponding mean and variance computed for the synthetic samples from the generator $g_\psi$. The auxiliary loss $L_{\text{gen}}$ for the generator in (2) is given by

$L_{\text{gen}}(\psi) = \sum_{l,c} \mathrm{KL}\big(\mathcal{N}(\hat{\mu}_{l,c}(\psi), \hat{\sigma}_{l,c}^2(\psi)) \,\|\, \mathcal{N}(\mu_{l,c}, \sigma_{l,c}^2)\big) + \mathbb{E}_{z}\big[H\big(P_\theta(y|g_\psi(z))\big)\big] - H\big(\mathbb{E}_{z}\big[P_\theta(y|g_\psi(z))\big]\big), \quad (4)$

where $H$ denotes entropy (e.g., see [14, Section 2.1]), and the KL divergence of two Gaussian distributions can be represented as

$\mathrm{KL}\big(\mathcal{N}(\hat{\mu}, \hat{\sigma}^2) \,\|\, \mathcal{N}(\mu, \sigma^2)\big) = \frac{(\hat{\mu} - \mu)^2 + \hat{\sigma}^2}{2\sigma^2} - \log\frac{\hat{\sigma}}{\sigma} - \frac{1}{2}. \quad (5)$
Remark 2.

If $\alpha = 0$ in (2), the proposed scheme reduces to the adversarial belief matching presented in [40]. By adding the auxiliary loss $L_{\text{gen}}$, we constrain the generator so that it produces synthetic images that yield similar statistics in the teacher as the original data. This helps the minimax optimization avoid adversarial samples that are very different from the original data and leads to better distillation performance (basically, we reduce the loss incurred by fitting the model to “bad” examples not close to the original dataset). For (b) and (c), we note that similar entropy loss terms were already proposed in [9]. Batch normalization statistics are used in [43, 60]. Yin et al. [60] find synthetic samples directly in the image domain, with no generators, by optimizing an input batch such that it produces batch normalization statistics similar to those in a pre-trained model. In contrast, we utilize batch normalization statistics to constrain generators. Furthermore, to match the mean and variance, the squared L2 distance is used in [60], while we propose using the KL divergence of two Gaussian distributions, which is a distance measure normalized by scale (i.e., by the standard deviation $\sigma$ in (5)). In [43], batch normalization statistics are used to calculate quantization biases for correction, and no synthetic images are produced.
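As an illustration of (4) and (5), the following PyTorch-style sketch computes the auxiliary loss for a batch of generated images by hooking the teacher's batch normalization layers; the hook-based collection of batch statistics, the epsilon constants, and the function names are assumptions made for this sketch, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaussian_kl(mu_hat, var_hat, mu, var, eps=1e-6):
    """Per-channel KL( N(mu_hat, var_hat) || N(mu, var) ), as in (5)."""
    return ((mu_hat - mu) ** 2 + var_hat) / (2 * (var + eps)) \
        - 0.5 * torch.log((var_hat + eps) / (var + eps)) - 0.5

def generator_aux_loss(teacher, x_fake):
    """L_gen of (4): (a) BN-statistics matching + (b) instance entropy - (c) batch entropy."""
    stats = []  # (batch_mean, batch_var, stored_mean, stored_var) per BN layer

    def hook(module, inputs):
        z = inputs[0]
        # Per-channel statistics of the current (synthetic) batch.
        stats.append((z.mean(dim=(0, 2, 3)), z.var(dim=(0, 2, 3), unbiased=False),
                      module.running_mean, module.running_var))

    handles = [m.register_forward_pre_hook(hook)
               for m in teacher.modules() if isinstance(m, nn.BatchNorm2d)]
    logits = teacher(x_fake)
    for h in handles:
        h.remove()

    # (a) Match the batch statistics to the stored (original-data) statistics.
    bn_loss = sum(gaussian_kl(m, v, rm.detach(), rv.detach()).sum()
                  for m, v, rm, rv in stats)
    p = F.softmax(logits, dim=1)
    # (b) Instance entropy: each synthetic sample should be confidently classified.
    inst_entropy = -(p * torch.log(p + 1e-8)).sum(dim=1).mean()
    # (c) Batch entropy: the class distribution over the batch should be near uniform.
    p_bar = p.mean(dim=0)
    batch_entropy = -(p_bar * torch.log(p_bar + 1e-8)).sum()
    return bn_loss + inst_entropy - batch_entropy
```

The generator is then trained to keep this loss small while maximizing the teacher-student KL divergence, as in the maximization step of (2).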

3.4 Multiple generators and multiple students

Using a mixture of generators has been proposed to avoid the mode collapse issue and to yield diverse samples that cover the whole support of a target dataset [27]. Similarly, we propose training multiple generators in our data-free KD framework to increase the diversity of generated samples. Moreover, using multiple discriminators has also been proposed to reduce the mode collapse problem in GANs [16]. A similar idea can be adopted in our framework, since we utilize the KL divergence between the student and teacher outputs as the discriminator output. The average KL divergence between the teacher and the students is maximized in the minimax optimization. Intuitively, taking the average not only reduces the noise in minimax optimization with stochastic gradient descent, but also steers a generator to produce adversarial samples that are poorly matched to every student on average. The final objective with multiple generators and multiple students is given by

$\min_{\phi_1, \dots, \phi_S} \max_{\psi_1, \dots, \psi_G} \; \frac{1}{SG} \sum_{s=1}^{S} \sum_{g=1}^{G} \mathbb{E}_{z}\big[\mathrm{KL}\big(P_\theta(y|g_{\psi_g}(z)) \,\|\, Q_{\phi_s}(y|g_{\psi_g}(z))\big)\big] - \frac{\alpha}{G} \sum_{g=1}^{G} L_{\text{gen}}(\psi_g),$

where $s_{\phi_s}$ is the $s$-th student and $g_{\psi_g}$ is the $g$-th generator for $1 \le s \le S$ and $1 \le g \le G$.

3.5 Implementation

We summarize the proposed data-free adversarial KD scheme in Algorithm 1. In each step, a random input batch is drawn for the generators, and the KL divergence and the auxiliary loss $L_{\text{gen}}$ are computed and averaged over the batch. We suggest optional “warm-up” training of the generators before the main adversarial KD. In the warm-up stage, we train the generators only to minimize the auxiliary loss $L_{\text{gen}}$, so that their outputs match the batch normalization statistics and entropy constraints when fed to the teacher. This pre-training procedure reduces the generation of unreliable samples in the early steps of data-free KD. Furthermore, updating the students more frequently than the generators reduces the chance of falling into a local maximum in the minimax optimization. In the minimization step, one can additionally match intermediate layer outputs as proposed in [52, 63, 1]. Finally, data-free network quantization is implemented by letting the student be a quantized version of the teacher (see Section 4.2).

  Input: pre-trained teacher $t_\theta$, generators $g_{\psi_1}, \dots, g_{\psi_G}$, students $s_{\phi_1}, \dots, s_{\phi_S}$
  Generator update interval: $k$
  Warm-up training for generators (optional)
  for each warm-up epoch do
     for each batch do
        sample a batch of generator inputs $z \sim \mathcal{N}(0, I)$
        update each $\psi_g$ by gradient descent on $L_{\text{gen}}(\psi_g)$
     end for
  end for
  Adversarial knowledge distillation
  for each batch step $i$ do
     Maximization
     if $i$ mod $k = 0$ then
        for $g = 1$ to $G$ do
           sample a batch of generator inputs $z \sim \mathcal{N}(0, I)$
           for $s = 1$ to $S$ do
              compute $\mathrm{KL}_s = \mathrm{KL}\big(P_\theta(y|g_{\psi_g}(z)) \,\|\, Q_{\phi_s}(y|g_{\psi_g}(z))\big)$
           end for
           update $\psi_g$ by gradient ascent on $\frac{1}{S}\sum_{s=1}^{S} \mathrm{KL}_s - \alpha L_{\text{gen}}(\psi_g)$
        end for
     end if
     Minimization
     sample a batch of generator inputs $z \sim \mathcal{N}(0, I)$
     for $g = 1$ to $G$ do
        generate $x_g = g_{\psi_g}(z)$ and compute the teacher output $P_\theta(y|x_g)$
     end for
     for $s = 1$ to $S$ do
        update $\phi_s$ by gradient descent on $\frac{1}{G}\sum_{g=1}^{G} \mathrm{KL}\big(P_\theta(y|x_g) \,\|\, Q_{\phi_s}(y|x_g)\big)$
     end for
  end for
Algorithm 1 Data-free adversarial knowledge distillation.
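The following is a compact PyTorch-style sketch of the adversarial loop of Algorithm 1 for a single generator and a single student ($G = S = 1$); the optimizers, latent dimension, batch size, and the helper generator_aux_loss (from the sketch in Section 3.3) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def kl_teacher_student(teacher_logits, student_logits):
    # KL( P_teacher || Q_student ), averaged over the batch.
    return F.kl_div(F.log_softmax(student_logits, dim=1),
                    F.softmax(teacher_logits, dim=1),
                    reduction="batchmean")

def adversarial_kd(teacher, student, generator, num_steps, alpha=1.0,
                   gen_interval=10, batch_size=256, z_dim=256, device="cpu"):
    teacher.eval()
    for p in teacher.parameters():  # the teacher stays fixed throughout
        p.requires_grad_(False)
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
    opt_s = torch.optim.SGD(student.parameters(), lr=0.1,
                            momentum=0.9, nesterov=True)
    for step in range(num_steps):
        # Maximization: update the generator every gen_interval steps.
        if step % gen_interval == 0:
            z = torch.randn(batch_size, z_dim, device=device)
            x = generator(z)
            loss_g = (-kl_teacher_student(teacher(x), student(x))
                      + alpha * generator_aux_loss(teacher, x))
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
        # Minimization: update the student on freshly generated samples.
        with torch.no_grad():
            x = generator(torch.randn(batch_size, z_dim, device=device))
            t_logits = teacher(x)
        loss_s = kl_teacher_student(t_logits, student(x))
        opt_s.zero_grad(); loss_s.backward(); opt_s.step()
```

With multiple generators and students, the same loop simply averages the KL term over all students in the maximization step and over all generators in the minimization step, as written in Algorithm 1.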

4 Experiments

We evaluate the proposed data-free adversarial KD algorithm on two model compression tasks: (1) data-free KD to smaller networks and (2) data-free network quantization.

Generator architecture. Let conv3-$n$ denote a convolutional layer with $n$ 3×3 filters and stride 1, and let fc-$n$ be a fully-connected layer with $n$ units. Let upsampling be a nearest-neighbor upsampling layer that doubles the spatial resolution. The generator input $z$ is sampled from the standard normal distribution. Given that the image size of the original data is (W, H, 3), we build a generator as below:

fc-8WH, reshape-(W/8,H/8,512)
upsampling, conv3-256, batchnorm, ReLU
upsampling, conv3-128, batchnorm, ReLU
upsampling, conv3-64, batchnorm, ReLU
conv3-3, tanh, batchnorm
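A PyTorch-style sketch of this generator for 32×32 inputs (W = H = 32) is given below; the latent dimension and the module layout are assumptions for illustration, chosen to match the layer list above.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of the generator above: fc-8WH, then three upsample+conv3 blocks."""
    def __init__(self, z_dim=256, img_size=32):
        super().__init__()
        self.init_size = img_size // 8
        self.fc = nn.Linear(z_dim, 512 * self.init_size ** 2)  # fc-8WH

        def block(c_in, c_out):
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode="nearest"),
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),  # conv3-c_out
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True))

        self.body = nn.Sequential(
            block(512, 256), block(256, 128), block(128, 64),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),  # conv3-3
            nn.Tanh(),
            nn.BatchNorm2d(3))  # final batchnorm on the image output

    def forward(self, z):
        x = self.fc(z).view(z.size(0), 512, self.init_size, self.init_size)
        return self.body(x)

# Example: a batch of 8 synthetic 32x32 RGB images.
fake = Generator()(torch.randn(8, 256))  # shape: (8, 3, 32, 32)
```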

Training. For training the generators in the maximization step, we use the Adam optimizer [31]; for training the students in the minimization step, we use Nesterov accelerated gradient [45]. The learning rates are annealed by cosine decaying [36]. We adopt the vanilla KD for data-free KD from WRN40-2 to WRN16-1 on CIFAR-10, running warm-up epochs followed by the main adversarial KD epochs, where each epoch consists of a fixed number of batches. In the other cases, we adopt variational information distillation (VID) [1] to match intermediate layer outputs, with fewer batches per epoch; VID is one of the state-of-the-art KD variants, and it yields better student accuracy with faster convergence. For the weighting factor $\alpha$ in (2), we sweep over several values and choose the best results. The generator update interval $k$ is set separately for wide residual networks and for the other models. Except for the results in Table 3, we use one generator and one student in our data-free KD, i.e., $S = G = 1$ in Algorithm 1.

4.1 Data-free model compression

Original dataset | Teacher (# params) | Student (# params) | Teacher accuracy (%) | Student accuracy (%): Ours | [40] | [9] | [60] | From scratch* | VID [1]*
SVHN | WRN40-2 (2.2M) | WRN16-1 (0.2M) | 98.04 | 96.48 | 94.06 | N/A | N/A | 97.67 | 97.60
CIFAR-10 | WRN40-2 (2.2M) | WRN16-1 (0.2M) | 94.77 | 86.14 | 83.69 | N/A | N/A | 90.97 | 91.78
CIFAR-10 | WRN40-2 (2.2M) | WRN40-1 (0.6M) | 94.77 | 91.69 | 86.60 | N/A | N/A | 93.35 | 93.67
CIFAR-10 | WRN40-2 (2.2M) | WRN16-2 (0.7M) | 94.77 | 92.01 | 89.71 | N/A | N/A | 93.72 | 94.06
CIFAR-10 | VGG-11 (9.2M) | ResNet-18 (11.2M) | 92.37 | 90.84 | N/A | N/A | 90.36 | 94.56 | 91.47
CIFAR-10 | ResNet-34 (21.3M) | ResNet-18 (11.2M) | 95.11 | 94.61 | N/A | 92.22 | 93.26 | 94.56 | 94.90
CIFAR-100 | ResNet-34 (21.3M) | ResNet-18 (11.2M) | 78.34 | 77.01 | N/A | 74.47 | N/A | 77.32 | 77.77
Tiny-ImageNet | ResNet-34 (21.4M) | ResNet-18 (11.3M) | 66.34 | 63.73 | N/A | N/A | N/A | 64.87 | 66.01
* Used the original datasets.
Table 2: Comparison of the proposed data-free adversarial KD scheme to the previous works. Student accuracy is reported for the data-free KD methods (Ours, [40], [9], [60]) and, for reference, for training with the original datasets (from scratch and with VID [1]).

We evaluate the performance of the proposed data-free model compression scheme on SVHN, CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets for KD of residual networks (ResNets) and wide residual networks (WRNs). We summarize the main results in Table 2. We compare our scheme to the previous data-free KD methods in [40, 9, 60] and show that we achieve the state-of-the-art data-free KD performance in all evaluation cases. We also obtain the student accuracy when students are trained with the original datasets from scratch and by using variational information distillation (VID) in [1]. Table 2 shows that the accuracy losses of our data-free KD method are marginal, compared to the cases of using the original datasets.

Figure 2: Example synthetic images generated in data-free KD from WRN40-2 to WRN16-1 for SVHN. For presentation, we classify the synthetic images using the teacher and show 4 samples for each digit class, from 0 to 9, in each column.
Figure 3: Example synthetic images generated in data-free KD from WRN40-2 to WRN16-1 for CIFAR-10. Similar to Figure 2, we classify the synthetic images using the teacher and show 4 samples for each class of CIFAR-10 (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck) in each column.
Figure 4: Example synthetic images generated in data-free KD from ResNet-34 to ResNet-18 for CIFAR-100.

Example synthetic images. We show example synthetic images obtained from generators trained with teachers pre-trained for SVHN, CIFAR-10, and CIFAR-100 datasets, respectively, in Figure 2, Figure 3, and Figure 4. The figures show that the generators regularized with pre-trained teachers produce samples that are similar to the original datasets.

Figure 5: Ablation study on the three terms in the auxiliary loss $L_{\text{gen}}$ of (4), i.e., (a) batch normalization statistics, (b) instance categorical entropy, and (c) batch categorical entropy (see Section 3.3).
Figure 6: Training KL divergence (a) and student test accuracy (b) of data-free KD for different values of $\alpha$ in (2). The student over-fits to the generator output when the weighting factor $\alpha$ is too large.
Figure 7: Example synthetic images generated for the CIFAR-10 automobile, bird, and horse classes at different training epochs (10, 50, 100, and 200), comparing a smaller and a larger value of the weighting factor $\alpha$ in (2) to show its impact on the generator output.

Ablation study. For the ablation study, we evaluate the proposed data-free KD scheme with and without each term of the auxiliary loss $L_{\text{gen}}$ for the generator in (4); the results are summarized in Figure 5. The bar graph shows that the major contribution comes from (a), matching the batch normalization statistics (see Section 3.3). In Figure 6, we present the impact of the weighting factor $\alpha$ in (2) on the KD performance. Moreover, to visually show the impact of $\alpha$ on the generation of synthetic data, we collect synthetic images for a smaller and a larger value of $\alpha$ and show them at different epochs in Figure 7. The figures show that a smaller $\alpha$ yields more diverse adversarial images, since the generator is constrained less. As $\alpha$ gets larger, the generated images collapse to one mode for each class, which leads to over-fitting.

# students (S) \ # generators (G) | G = 1 | G = 2
S = 1 | 86.14 | 86.67
S = 2 | 86.44 | 87.04
Table 3: Comparison of the student accuracy (%) when using multiple generators and/or multiple students in our data-free KD from WRN40-2 to WRN16-1 on CIFAR-10.

Multiple generators and multiple students. We show the gain of using multiple generators and/or multiple students in Table 3. We compare the cases of using two generators and/or two students. For the second generator, we replace one middle convolutional layer with a residual block. For KD to two students, we use identical students with different initialization. Table 3 shows that increasing the number of generators and/or the number of students results in better student accuracy in data-free KD.

4.2 Data-free network quantization

Original dataset | Pre-trained model (accuracy %) | Bit-width (weights / activations) | Ours (data-free): DF-Q | DF-QAT-KD | Data-dependent [29]*: Q | QAT | QAT-KD
SVHN | WRN16-1 (97.67) | 8 / 8 | 97.67 | 97.74 | 97.70 | 97.71 | 97.78
SVHN | WRN16-1 (97.67) | 4 / 8 | 91.92 | 97.53 | 93.83 | 97.66 | 97.70
CIFAR-10 | WRN16-1 (90.97) | 8 / 8 | 90.51 | 90.90 | 90.95 | 91.21 | 91.16
CIFAR-10 | WRN16-1 (90.97) | 4 / 8 | 86.29 | 88.91 | 86.74 | 90.92 | 90.71
CIFAR-10 | WRN40-2 (94.77) | 8 / 8 | 94.47 | 94.76 | 94.75 | 94.91 | 95.02
CIFAR-10 | WRN40-2 (94.77) | 4 / 8 | 93.14 | 94.22 | 93.56 | 94.73 | 94.42
CIFAR-100 | ResNet-18 (77.32) | 8 / 8 | 76.68 | 77.30 | 77.43 | 77.84 | 77.73
CIFAR-100 | ResNet-18 (77.32) | 4 / 8 | 71.02 | 75.15 | 69.63 | 75.52 | 75.62
Tiny-ImageNet | MobileNet v1 (64.34) | 8 / 8 | 51.76 | 63.11 | 54.48 | 61.94 | 64.53
* Used the original datasets.
Table 4: Results of network quantization with the proposed data-free adversarial KD scheme. For our data-free quantization, we show the results for data-free quantization only (DF-Q) and data-free quantization-aware training with data-free KD (DF-QAT-KD). For conventional data-dependent quantization [29], we show the results for quantization only (Q), quantization-aware training (QAT), and quantization-aware training with KD (QAT-KD).
Dataset used in KD | WRN16-1 (SVHN) | WRN40-2 (CIFAR-10) | ResNet-18 (CIFAR-100)
SVHN | 93.83 / 97.70 | 71.89 / 92.08 | 13.41 / 65.07
CIFAR-10 | 93.50 / 97.24 | 93.56 / 94.42 | 67.50 / 75.62
CIFAR-100 | 94.11 / 97.26 | 92.18 / 94.10 | 69.63 / 75.62
Ours (data-free) | 91.92 / 97.53 | 93.14 / 94.22 | 71.02 / 75.15
Table 5: Impact of using different datasets for 4-bit weight and 8-bit activation quantization. Each entry is the quantized model accuracy (%) before / after fine-tuning with KD.

In this subsection, we present the experimental results of the proposed data-free adversarial KD scheme for network quantization. For the baseline quantization scheme, we use TensorFlow's quantization framework; in particular, we implement our data-free KD scheme in TensorFlow's quantization-aware training framework [29, 32] (https://github.com/tensorflow/tensorflow/tree/r1.15/tensorflow/contrib/quantize).

TensorFlow's quantization-aware training performs per-layer asymmetric quantization of weights and activations. For quantization only, no data are needed to quantize the weights, but quantization of activations requires representative data, which are used to collect the range (the minimum and the maximum) of the activations and to determine the quantization bin size from that range. In our data-free quantization, we use synthetic data from a generator as the representative data. To this end, we train a generator with no adversarial loss, as in the warm-up stage of Algorithm 1 (DF-Q in Table 4). For our data-free quantization-aware training, we apply the proposed adversarial KD on top of TensorFlow's quantization-aware training framework, where a quantized network is set as the student and the pre-trained floating-point model is given as the teacher (DF-QAT-KD in Table 4).
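To illustrate how synthetic samples can stand in for representative data in this calibration step, the sketch below collects an activation range from generator-driven activations and applies per-tensor asymmetric uniform quantization; this is a simplified NumPy illustration of the idea rather than TensorFlow's actual per-layer quantization code, and all names are assumptions.

```python
import numpy as np

def calibrate_range(activations):
    """Collect the activation range (min, max) over representative samples.
    In the data-free setting, these activations are produced by feeding
    generator outputs through the pre-trained network."""
    return float(np.min(activations)), float(np.max(activations))

def asymmetric_quantize(x, x_min, x_max, num_bits=8):
    """Per-tensor asymmetric uniform quantization given a calibrated range."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = max(x_max - x_min, 1e-8) / (qmax - qmin)
    zero_point = int(np.clip(round(qmin - x_min / scale), qmin, qmax))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Example: calibrate on activations driven by synthetic data, then quantize.
synthetic_acts = np.random.randn(1024, 64).astype(np.float32)  # stand-in activations
x_min, x_max = calibrate_range(synthetic_acts)
q, scale, zp = asymmetric_quantize(synthetic_acts, x_min, x_max, num_bits=8)
recovered = dequantize(q, scale, zp)  # approximate float activations
```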

We follow the training hyperparameters described in Section 4.1, apart from the initial learning rate for KD. We run a warm-up stage followed by quantization-aware training with data-free KD, and we adopt the vanilla KD with no intermediate-layer output matching terms. We summarize the results in Table 4.

For comparison, we evaluate three conventional data-dependent quantization schemes that use the original training datasets: quantization only (Q), quantization-aware training (QAT), and quantization-aware training with KD (QAT-KD). As presented in Table 4, our data-free quantization shows marginal accuracy losses of less than 2% for 4-bit/8-bit weight and 8-bit activation quantization in all the evaluated cases, compared to using the original datasets.

Finally, we compare our data-free quantization to quantization using alternative datasets. We consider two cases: (1) a similar dataset is used (e.g., CIFAR-100 instead of CIFAR-10), and (2) a mismatched dataset is used (e.g., SVHN instead of CIFAR-10). The results in Table 5 show that using a mismatched dataset degrades the performance considerably. Using a similar dataset achieves performance comparable to our data-free scheme, which shows small accuracy losses of less than 0.5% compared to using the original datasets. We note, however, that even alternative data, which are free of privacy and regulatory concerns, are usually hard to collect.

5 Conclusion

In this paper, we proposed data-free adversarial KD for network quantization and compression. No original data are used in the proposed framework; instead, we train a generator to produce synthetic data that are adversarial to KD. In particular, we proposed matching the batch normalization statistics of the teacher to additionally constrain the generator to produce samples similar to the original training data. We used the proposed data-free KD scheme to compress various models trained on the SVHN, CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. In our experiments, we achieved state-of-the-art data-free KD performance, surpassing the existing data-free KD schemes. For network quantization, we obtained quantized models that achieve accuracy comparable to models quantized and fine-tuned with the original training datasets. The proposed framework shows great potential to preserve data privacy in model compression.

References

  • [1] S. Ahn, S. X. Hu, A. Damianou, N. D. Lawrence, and Z. Dai (2019) Variational information distillation for knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9163–9171. Cited by: §1, §3.5, §4.1, Table 2, §4.
  • [2] N. Akhtar and A. Mian (2018) Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access 6, pp. 14410–14430. Cited by: §2.
  • [3] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang (2017) Generalization and equilibrium in generative adversarial nets (GANs). In International Conference on Machine Learning, pp. 224–232. Cited by: §2.
  • [4] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski (2009) Robust optimization. Vol. 28, Princeton University Press. Cited by: §2, §3.2.
  • [5] D. Bertsimas, D. B. Brown, and C. Caramanis (2011) Theory and applications of robust optimization. SIAM review 53 (3), pp. 464–501. Cited by: §2.
  • [6] K. Bhardwaj, N. Suda, and R. Marculescu (2019) Dream distillation: a data-independent model compression framework. In ICML Joint Workshop on On-Device Machine Learning and Compact Deep Neural Network Representations (ODML-CDNNR), Cited by: §1, Table 1, §2.
  • [7] Z. Cai, X. He, J. Sun, and N. Vasconcelos (2017) Deep learning with low precision by half-wave Gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5918–5926. Cited by: §1.
  • [8] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57. Cited by: §2.
  • [9] H. Chen, Y. Wang, C. Xu, Z. Yang, C. Liu, B. Shi, C. Xu, C. Xu, and Q. Tian (2019) Data-free learning of student networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3514–3522. Cited by: §1, §1, Table 1, §2, §4.1, Table 2, Remark 2.
  • [10] Y. Chen, N. Wang, and Z. Zhang (2018) DarkRank: accelerating deep metric learning via cross sample similarities transfer. In Proceedings of the AAAI Conference on Artificial Intelligence. Cited by: §1.
  • [11] Y. Cheng, D. Wang, P. Zhou, and T. Zhang (2018) Model compression and acceleration for deep neural networks: the principles, progress, and challenges. IEEE Signal Processing Magazine 35 (1), pp. 126–136. Cited by: §1.
  • [12] Y. Choi, M. El-Khamy, and J. Lee (2017) Towards the limit of network quantization. In International Conference on Learning Representations, Cited by: §1.
  • [13] Y. Choi, M. El-Khamy, and J. Lee (2020) Universal deep neural network compression. IEEE Journal of Selected Topics in Signal Processing. Cited by: §1.
  • [14] T. M. Cover and J. A. Thomas (2012) Elements of information theory. John Wiley & Sons. Cited by: §3.1, §3.3.
  • [15] B. Dai, C. Zhu, B. Guo, and D. Wipf (2018) Compressing neural networks using the variational information bottleneck. In International Conference on Machine Learning, pp. 1135–1144. Cited by: §1.
  • [16] I. Durugkar, I. Gemp, and S. Mahadevan (2017) Generative multi-adversarial networks. In International Conference on Learning Representations, Cited by: §2, §3.4.
  • [17] J. Frankle and M. Carbin (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. In International Conference on Learning Representations, Cited by: §1.
  • [18] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT press. Cited by: §1.
  • [19] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. In International Conference on Learning Representations, Cited by: §2.
  • [20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680. Cited by: §2.
  • [21] I. Goodfellow (2016) NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160. Cited by: §2.
  • [22] Y. Guo, A. Yao, and Y. Chen (2016) Dynamic network surgery for efficient DNNs. In Advances In Neural Information Processing Systems, pp. 1379–1387. Cited by: §1.
  • [23] S. Han, H. Mao, and W. J. Dally (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations, Cited by: §1.
  • [24] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143. Cited by: §1.
  • [25] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §1.
  • [26] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1, §3.1.
  • [27] Q. Hoang, T. D. Nguyen, T. Le, and D. Phung (2018) MGAN: training generative adversarial nets with multiple generators. In International Conference on Learning Representations, Cited by: §2, §3.4.
  • [28] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1.
  • [29] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713. Cited by: §1, §1, §4.2, Table 4.
  • [30] Y. Jang, T. Zhao, S. Hong, and H. Lee (2019) Adversarial defense via learning to generate diverse attacks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2740–2749. Cited by: §2.
  • [31] D. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations, Cited by: §4.
  • [32] R. Krishnamoorthi (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342. Cited by: §1, §4.2.
  • [33] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report, Univ. of Toronto. Cited by: §1.
  • [34] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1.
  • [35] R. G. Lopes, S. Fenu, and T. Starner (2017) Data-free knowledge distillation for deep neural networks. In NeurIPS Workshop on Learning with Limited Data, Cited by: §1, Table 1, §2.
  • [36] I. Loshchilov and F. Hutter (2017) SGDR: stochastic gradient descent with warm restarts. In International Conference on Learning Representations, Cited by: §4.
  • [37] C. Louizos, K. Ullrich, and M. Welling (2017) Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pp. 3290–3300. Cited by: §1.
  • [38] C. Louizos, M. Welling, and D. P. Kingma (2018) Learning sparse neural networks through regularization. In International Conference on Learning Representations, Cited by: §1.
  • [39] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, Cited by: §2.
  • [40] P. Micaelli and A. J. Storkey (2019) Zero-shot knowledge transfer via adversarial belief matching. In Advances in Neural Information Processing Systems, pp. 9547–9557. Cited by: §1, §1, §1, Table 1, §2, §4.1, Table 2, Remark 2.
  • [41] D. Molchanov, A. Ashukha, and D. Vetrov (2017) Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning, pp. 2498–2507. Cited by: §1.
  • [42] A. Mordvintsev, C. Olah, and M. Tyka (2015) Inceptionism: going deeper into neural networks. Google Research Blog. https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html [Online; accessed 18-April-2020]. Cited by: §2.
  • [43] M. Nagel, M. v. Baalen, T. Blankevoort, and M. Welling (2019) Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1325–1334. Cited by: Table 1, §2, Remark 2.
  • [44] G. K. Nayak, K. R. Mopuri, V. Shaj, V. B. Radhakrishnan, and A. Chakraborty (2019) Zero-shot knowledge distillation in deep networks. In International Conference on Machine Learning, pp. 4743–4751. Cited by: Table 1, §2.
  • [45] Y. Nesterov (1983) A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$. In Doklady AN USSR, Vol. 269, pp. 543–547. Cited by: §4.
  • [46] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, Cited by: §1.
  • [47] T. Nguyen, T. Le, H. Vu, and D. Phung (2017) Dual discriminator generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2670–2680. Cited by: §2.
  • [48] E. Park, J. Ahn, and S. Yoo (2017) Weighted-entropy-based quantization for deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7197–7205. Cited by: §1.
  • [49] W. Park, D. Kim, Y. Lu, and M. Cho (2019) Relational knowledge distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3967–3976. Cited by: §1.
  • [50] O. Poursaeed, I. Katsman, B. Gao, and S. Belongie (2018) Generative adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4422–4431. Cited by: §2.
  • [51] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) XNOR-Net: imagenet classification using binary convolutional neural networks. In Proceedings of the European Conference on Computer Vision, pp. 525–542. Cited by: §1.
  • [52] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2015) FitNets: hints for thin deep nets. In International Conference on Learning Representations, Cited by: §1, §3.5.
  • [53] V. Sze, Y. Chen, T. Yang, and J. S. Emer (2017) Efficient processing of deep neural networks: a tutorial and survey. Proceedings of the IEEE 105 (12), pp. 2295–2329. Cited by: §1.
  • [54] F. Tung and G. Mori (2018) Deep neural network compression by in-parallel pruning-quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
  • [55] K. Ullrich, E. Meeds, and M. Welling (2017) Soft weight-sharing for neural network compression. In International Conference on Learning Representations, Cited by: §1.
  • [56] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2018) Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454. Cited by: §2.
  • [57] H. Wang and C. Yu (2019) A direct approach to robust deep learning using adversarial networks. In International Conference on Learning Representations, Cited by: §2.
  • [58] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han (2019) HAQ: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612–8620. Cited by: §1.
  • [59] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li (2016) Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082. Cited by: §1.
  • [60] H. Yin, P. Molchanov, Z. Li, J. M. Alvarez, A. Mallya, D. Hoiem, N. K. Jha, and J. Kautz (2019) Dreaming to distill: data-free knowledge transfer via DeepInversion. arXiv preprint arXiv:1912.08795. Cited by: §1, Table 1, §2, §2, §4.1, Table 2, Remark 2.
  • [61] J. Yoo, M. Cho, T. Kim, and U. Kang (2019) Knowledge extraction with no observable data. In Advances in Neural Information Processing Systems, pp. 2701–2710. Cited by: §2.
  • [62] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In Proceedings of the British Machine Vision Conference, pp. 87.1–87.12. Cited by: §1.
  • [63] S. Zagoruyko and N. Komodakis (2017) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations, Cited by: §1, §3.5.
  • [64] D. Zhang, J. Yang, D. Ye, and G. Hua (2018) LQ-Nets: learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision, pp. 365–382. Cited by: §1.
  • [65] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou (2016) DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160. Cited by: §1.
  • [66] C. Zhu, S. Han, H. Mao, and W. J. Dally (2017) Trained ternary quantization. In International Conference on Learning Representations, Cited by: §1.