Bringing Giant Neural Networks Down to Earth with Unlabeled Data

Yehui Tang et al. · Peking University · The University of Sydney · 07/13/2019

Compressing giant neural networks has gained much attention due to their extensive applications on edge devices such as cellphones. During the compression process, one of the most important procedures is to retrain the pre-trained model on the original training dataset. However, out of concerns about security, privacy or commercial profit, in practice only a fraction of sample training data are made available, which makes such retraining infeasible. To solve this issue, this paper proposes to resort to unlabeled data at hand, which can be much cheaper to acquire. Specifically, we exploit the unlabeled data to mimic the classification characteristics of giant networks, so that the original capacity can be nicely preserved. Nevertheless, there exists a dataset bias between the labeled and unlabeled data, which disturbs the mimicking to some extent. We fix this bias with an adversarial loss that aligns the distributions of their low-level feature representations. We further provide theoretical discussions about how the unlabeled data help compressed networks to generalize better. Experimental results demonstrate that the unlabeled data can significantly improve the performance of the compressed networks.


I Introduction

Deep learning has demonstrated state-of-the-art performance in many tasks, such as object classification [1, 2], speech recognition [3, 4] and image generation [5, 6]. The major component underlying these successes is the development of sophisticated deep neural networks (DNNs), e.g., AlexNet [7], VGGNet [8], Inception [9] and ResNet [10]. However, the large volume of parameters, huge run-time memory cost and heavy dependence on GPU devices hamper the deployment of these giant DNNs in real-world applications. For example, ResNet-50 [10] needs 95MB of memory to store its parameters, 97MB of memory to store feature maps, and a huge number of floating-point multiplications to infer a single image [11]. It is well known that there is significant redundancy in a large over-parameterized network and that fewer parameters can express the same amount of information [12, 13]. This motivates the research on neural network compression.

Basically, network compression methods can be categorized into several families, including parameter quantization [14, 15], low-rank approximation [16, 17], knowledge distillation [18, 19, 20] and pruning (non-structured pruning [21, 22, 23, 24] and channel pruning [25]). Quantization methods represent weights or activations with low-bit integers [14], and even binary weights have been used [26, 27], while low-rank approximation takes advantage of tensor factorization techniques and decomposes one giant filter into multiple smaller components. Knowledge distillation focuses on training a compact light network under the soft supervision of pre-trained powerful giant networks [18]. As for pruning methods, non-structured pruning removes unimportant weights and can achieve extremely high compression rates without accuracy loss [22]; however, special hardware is needed to accelerate the computation in practice. In contrast, channel-wise pruning [28, 25] removes whole spatial filters over channels and results in a simultaneous reduction of parameters, memory footprint and computation cost. Note that channel pruning does not destroy the structure of the giant network, so it is compatible with other compression methods and has attracted much attention recently [29].

Existing neural network compression methods have achieved impressive performance in experiments, but they usually need many iterations of retraining to preserve the original accuracy, especially when the compression ratio is fairly high. An immediate question therefore arises: how can we retrain the compressed network if the original training dataset is incomplete? Demo models (e.g., well-trained DNNs) are usually released on the Internet, ready for users to import. However, out of concerns about security, privacy or commercial profit, model providers often supply only sample data (sometimes with unknown sources) for verification purposes instead of the complete training set. This is a very practical scenario, especially in medical diagnosis [30] and in drug discovery and toxicology [31], since the datasets used there are usually not completely open-source, as discussed in [32]. Beyond the medical field, compressing with limited data is also useful in many other scenarios. For example, the website of EmotionNet (https://github.com/co60ca/EmotionNet) for emotion recognition only provides example images for display, and the original datasets can only be obtained with the approval of the administrators under a rigorous agreement. Likewise, CNNs can be trained to predict hashtags on Instagram images [33], but only some example images are provided and the dataset is not released. Sample data with ground truth may support retraining to tailor giant deep models in a supervised manner, but such insufficient training data tend to result in severe over-fitting.

In this paper, we propose to bring giant demo neural networks down to earth with unlabeled data. Instead of struggling to search for the original training data of these giant models, we turn our attention to unlabeled data at hand, which can be much cheaper to acquire. The output of the giant network reflects its classification characteristics and contains the information necessary to preserve its capacity. We therefore regard unlabeled data as a portal to distill this intrinsic information: concretely, we exploit unlabeled data to mimic the softened output of the giant network, so that its powerful classification ability is well preserved. However, since unlabeled and labeled data are usually collected in different ways, there is a dataset bias that hampers the mimicking process. We fix this issue by aligning the distributions of the low-level features of the unlabeled and labeled data. Furthermore, we provide theoretical discussions about how the unlabeled data help compressed networks generalize better. Experimental results on benchmark datasets demonstrate the effectiveness of exploiting unlabeled data to assist network compression.

The rest of this paper is organized as follows. Section II reviews related work on compressing and accelerating networks. Section III briefly introduces channel pruning with scaling factors as preliminaries. Section IV then elaborates how to prune networks with unlabeled data and analyzes the proposed method from a theoretical perspective. The experimental results and analysis are presented in Section V, with concluding remarks given in Section VI.

Fig. 1: Pruning pipeline of exploiting unlabeled data. The green solid arrows indicate the forward calculation for labeled data while the red ones are for unlabeled data. Note that both labeled and unlabeled data share the same network. An adversarial loss is used to align low-level feature maps of labeled and unlabeled data for fixing the data bias. A mimicking loss is imposed on the output to take full advantage of training data, especially unlabeled data. The scaling factors in BN layers are constrained to be sparse for implementing channel pruning.

II Related Work

For the compression and acceleration of CNNs, the mainstream works are mainly divided into four categories: quantization, sparse or low-rank approximation, knowledge distillation and pruning (non-structured pruning & channel pruning).

Quantization. It aims to reduce the number of bits for representing each weight or activation in CNNs. For example, Vanhoucke et al. [34] find that 8-bit quantization of weights can induce significant speed-up with almost no drop in accuracy. Binary weights have even been investigated to obtain extremely compressed networks, which constrain weights to only two values (i.e., 1 or -1) so that most time-consuming multiply-accumulate operations are replaced by simple accumulations [26, 27]. However, binarizing very large networks (e.g., GoogleNet) incurs a large accuracy loss. To improve the performance of quantized networks, Li et al. [35] propose the Ternary Weight Network (TWN), constraining weights to ternary values (i.e., -1, 0, 1). Zhu et al. [36] further develop it by learning both the ternary values and their assignment during training. The proposed Trained Ternary Quantization (TTQ) can be trained from scratch as easily as a normal full-precision model.

Low-rank approximation. Since convolutional filters can be seen as 4D tensors, under a low-rank assumption they can be decomposed into multiple components with fewer parameters, reducing both storage and computation cost. For example, SVD-based methods have been widely studied [16, 17] to decompose a tensor into two layers of compact convolutional filters. Components that remain large may still be time-consuming and can be further decomposed [17]. In this way, an original redundant filter is replaced by multiple compact filters.

Knowledge distillation. By distilling the knowledge from a pre-trained giant network, the performance of a target light and small network can be boosted. Hinton et al. [18] propose to mimic the informative softened outputs of the teacher network. In addition to the output level, intermediate representations of the giant network can be transferred as hints to assist training. In [19], attention maps computed from activations or gradients are also used for mimicking. Besides, You et al. [37] propose to combine multiple teacher networks: the relative dissimilarity between different examples serves as guidance, and a voting strategy unifies the dissimilarity information provided by the various teacher networks. Other works attempt to transfer more diverse representations (e.g., the flow of solution procedure (FSP) matrix [20]) or adopt a more sophisticated transfer manner (e.g., transferring information via adversarial learning [38]).

Non-structured pruning. To prune redundant parameters, an intuitive method is to remove each weight with small magnitude and obtain a sparser network. Han et al. [39] proposed to apply $\ell_1$ or $\ell_2$ regularization to make weights sparse and prune tiny weights in an iterative way. The pruned network can be further compressed with quantization and Huffman encoding, resulting in a high compression rate on AlexNet without sacrificing accuracy [40]. To avoid the accuracy drop incurred by incorrectly pruned weights, a splicing operation [41] was introduced to recover mistakenly removed connections. Pruning and splicing operations together constitute the dynamic network surgery framework and obtain sparser networks with fewer training epochs. However, although high compression and acceleration rates are obtained in theory, specially designed hardware is needed to realize practical speed-up. Compared to these fine-grained pruning methods, group-wise pruning methods [42] are more common in practice. Nevertheless, the structures of the original networks are destroyed as a result, and real inference speed-up still depends heavily on dedicated libraries.

Channel pruning. Channel pruning methods aim to directly remove redundant channels without destroying the structure of the original networks. After a whole filter of a layer is pruned, the channels of the corresponding feature maps are pruned as well, so parameters, computation cost and memory footprint are reduced simultaneously. There are mainly two strategies for channel pruning. The first selects important channels by training the whole network with sparsity regularization [43, 25, 24]. The Slimming method [25] uses the scaling factors of batch normalization layers [44] to measure the importance of each channel: during training, a sparsity constraint is imposed on the scaling factors, and channels with tiny scaling factors are then pruned. The pruned networks are subsequently fine-tuned with the normal cross-entropy loss to recover performance. The second strategy, reconstruction-based methods [45, 46, 47], seeks to identify the important channels layer by layer by minimizing the gap between the feature maps of the pruned network and those of the original pre-trained network. Note that channel pruning methods are complementary to other compression methods and can further improve the compression performance.

III Channel-wise Pruning with Scaling Factors

A number of well-trained DNNs can easily be obtained from the Internet and tailored for various tasks. Most of the time, however, these downloaded networks are too cumbersome to be applied directly in practical tasks, especially on edge mobile devices. Some questions therefore arise immediately: How many parameters would be sufficient for a DNN to reach decent performance? How much computational budget can our computing platform offer? The answers are not unique and depend on the real-world application. It is therefore impractical to ask model providers to release well-trained models of various sizes, from a few hundred KB to several hundred MB, to meet every user's demand. A practical solution is to compress the released giant models to an appropriate size that meets the given requirements. In the sequel, we revisit how giant neural networks can be compressed to a specific size using channel pruning techniques. We also discuss why channel pruning degrades when the labeled data are quite limited.

Suppose a dataset $\mathcal{D}^l = \{(x_i, y_i)\}_{i=1}^{n_l} \subset \mathcal{X} \times \mathcal{Y}$ containing $n_l$ examples and their corresponding labels is released with a well-trained DNN, where $\mathcal{X}$ and $\mathcal{Y}$ are the raw feature space and the label space, respectively. (In our problem, the capacity $n_l$ of the labeled dataset is usually small.) Denote the released well-trained neural network of $L$ layers as a function $\mathcal{N}_o \in \mathcal{H}$, where $\mathcal{H}$ denotes the hypothesis space of DNNs. Let $F_l(x)$ be example $x$'s feature map input to the $l$-th layer of the network. Given $W_l$ as the weights of the $l$-th layer, the feature map is transformed as

$$F_{l+1}(x) = W_l * F_l(x), \qquad (1)$$

where $*$ is the typical convolutional operation in CNNs and $F_{l+1}(x)$ is the convolutional result produced by the filters $W_l$.

In practice, most modern DNNs adopt Batch Normalization (BN) [44] layers in their architectures. BN layers aim to diminish the covariate shift of the network's internal activations so as to accelerate and stabilize training. Specifically, assume the output $F_{l+1}$ has $c_{l+1}$ channels; BN then works channel-wise. For each channel, BN first normalizes the activations and then rescales and shifts them with a trainable scaling factor $\gamma_c$ and bias factor $\beta_c$, i.e.,

$$\hat{F}_{l+1}^{(c)} = \gamma_c \cdot \frac{F_{l+1}^{(c)} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}} + \beta_c, \qquad (2)$$

where $F_{l+1}^{(c)}$ is the $c$-th channel of $F_{l+1}$, and $\mu_c$ and $\sigma_c$ are respectively the mean and standard deviation of that channel over a mini-batch; $\epsilon$ is a small quantity added for numerical stability. As indicated in Eq. (2), the scaling factors rescale the feature maps and control the information flow channel-wise. For each channel, $\gamma_c$ can therefore be employed to measure the importance of the corresponding channel. Making $\gamma$ sparse will thus reduce the number of channels and make the feature maps and filters more compact. Therefore, retraining the well-trained giant network with channel-level sparsity regularization can eliminate unimportant channels automatically [25].
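The scaling factors in Eq. (2) correspond to the weight parameters of PyTorch's BatchNorm2d layers, so channel importance can be read directly from a pre-trained model. Below is a minimal sketch (not the authors' code; the torchvision VGG variant is only an example) that collects all of these factors into one vector, which is the quantity later regularized and thresholded.

```python
import torch
import torch.nn as nn
from torchvision import models

def gather_bn_scales(model: nn.Module) -> torch.Tensor:
    """Concatenate |gamma| of every BatchNorm2d layer into a single vector."""
    scales = [m.weight.detach().abs().flatten()
              for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    return torch.cat(scales)

if __name__ == "__main__":
    net = models.vgg11_bn()   # any BN-equipped network works here
    print(gather_bn_scales(net).numel(), "channel-wise scaling factors collected")
```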

Channel-wise pruning usually contains three steps for compressing the given trained network $\mathcal{N}_o$. First, retrain the network (i.e., train a network $\mathcal{N}_s$ initialized by $\mathcal{N}_o$) with $\ell_1$ regularization over the scaling factors $\gamma$ to obtain a sparse network. Second, prune the network according to the values of the scaling factors and, finally, fine-tune the pruned network. We elaborate on these three steps as follows.

Sparse retraining. Denote the sparse network we want to achieve as $\mathcal{N}_s$; for an example $x$, its logits $z(x)$ are the output of the network before the softmax function, and the corresponding prediction score vector over labels is $p(x) = \mathrm{softmax}(z(x))$. Therefore, the objective function on the labeled data is

$$\min_{\mathcal{N}_s}\; \frac{1}{n_l}\sum_{i=1}^{n_l} \mathcal{L}_{ce}\big(y_i, p(x_i)\big) + \lambda \|\gamma\|_1, \qquad (3)$$

where $\mathcal{L}_{ce}$ is the cross-entropy loss that guarantees the network's performance, and $\|\gamma\|_1$ is the $\ell_1$ norm that encourages only part of the channels to be selected to establish the network, with $\gamma$ now being a vector composed of all the scaling factors over the whole network. The coefficient $\lambda$ balances the classification accuracy and the sparsity of $\gamma$.
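As a minimal sketch of the objective in Eq. (3), the helper below adds the cross-entropy on a labeled minibatch to an $\ell_1$ penalty over all BN scaling factors; the toy model and the value of `lambda_s` are placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sparse_retrain_loss(model: nn.Module, images, labels, lambda_s: float = 1e-4):
    """Cross-entropy on labeled data plus an L1 penalty on all BN scaling factors."""
    ce = F.cross_entropy(model(images), labels)
    l1 = sum(m.weight.abs().sum()
             for m in model.modules() if isinstance(m, nn.BatchNorm2d))
    return ce + lambda_s * l1

if __name__ == "__main__":
    toy = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
    x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
    sparse_retrain_loss(toy, x, y).backward()   # gradients also reach the gammas
```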

Pruning the sparse network. When the sparse retraining with Eq. (3) is finished, sparsity of the scaling factors has been achieved. A global threshold across layers is set according to the percentage of channels users plan to keep. For example, if users want to prune 60% of the channels of the network, the smallest 60% of the elements in $\gamma$ are removed together with their corresponding channels. The structure of the network is thus decided automatically according to the threshold, and a pruned network is obtained as a result.
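A minimal sketch of the global-threshold step described above: gather every scaling factor, take the quantile that matches the desired pruning percentage, and flag the channels whose factors fall below it. Actually building the physically smaller network from these masks is omitted, and the function name is illustrative.

```python
import torch
import torch.nn as nn

def channel_keep_masks(model: nn.Module, prune_ratio: float = 0.7):
    """Global threshold over all |gamma|; True marks channels to keep."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, prune_ratio)
    return {name: m.weight.detach().abs() > threshold
            for name, m in model.named_modules() if isinstance(m, nn.BatchNorm2d)}
```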

Fine-tuning the pruned network. Nevertheless, after pruning, the network usually has limited classification accuracy. Fine-tuning the pruned network is therefore essential to restore the accuracy. Typically, when we fine-tune the pruned network, the objective is only the supervision loss (i.e., the first term in Eq. (3)).

However, with only the extremely limited released data, Eq. (3) (with or without the sparsity term $\lambda\|\gamma\|_1$) cannot be well optimized. Serious over-fitting will occur and the accuracy on the test set will drop rapidly. In addition, the small dataset capacity $n_l$ makes the estimation of the mean and standard deviation in the BN layers less reliable, which further weakens the performance of the compressed network.

IV Pruning Networks with Unlabeled Data

Few labeled data limit the performance of channel-pruned networks. Instead of accepting the poor performance of a network compressed with only limited sample data, or struggling to search for the original training data, we turn our attention to cheaper unlabeled data at hand. In this section, we present a solution that brings giant networks down to earth by pruning their channels, and we investigate the potential benefits of unlabeled data.

IV-A Exploiting Unlabeled Data by Mimicking the Giant Network

Unlabeled data are much easier to collect. For example, one can easily gather a large number of natural images from the Internet to assist the compression of giant networks trained on large natural image sets such as ImageNet. The unlabeled data collected by users may differ from the original data used to train the giant network, but they can still provide helpful information for the compression task.

Suppose the collected unlabeled dataset $\mathcal{D}^u = \{x_j^u\}_{j=1}^{n_u}$ contains $n_u$ examples. (Here we assume the labeled and unlabeled data have the same dimension, which can easily be achieved by image resizing or cropping.) In the sequel, we distinguish the labeled and unlabeled examples with the notations $x^l$ and $x^u$, respectively; the first term in Eq. (3) is then written over $(x_i^l, y_i^l)$ as well. Similarly, the softened outputs of the sparse network $\mathcal{N}_s$ and of the released well-trained network $\mathcal{N}_o$ after the softmax function are represented as $p_s^T(x)$ and $p_o^T(x)$, respectively, with

$$p^T(x) = \mathrm{softmax}\big(z(x)/T\big), \qquad (4)$$

where $T$ is a temperature parameter [18] that controls the smoothness, so that a higher value of $T$ produces a softer probability distribution over classes.

The softened output of the giant network reveals clues about its classification characteristics as well as the similarity among classes. Thus we can use unlabeled data to mimic its classification behavior and distill its knowledge into the target sparse network. Concretely, we encourage the sparse network $\mathcal{N}_s$ to produce softened outputs similar to those of the giant network $\mathcal{N}_o$, and the objective of sparse retraining on the unlabeled data can be written, analogously to Eq. (3), as

$$\mathcal{L}_m = \frac{1}{n_u}\sum_{j=1}^{n_u} \mathcal{L}_{ce}\big(p_o^T(x_j^u),\, p_s^T(x_j^u)\big), \qquad (5)$$

which is called the mimicking loss. As a result, the unlabeled data can provide rich information to guide the sparse training by mimicking the classification behavior of the pre-trained model.
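A minimal sketch of Eqs. (4) and (5), written as a soft cross-entropy between the temperature-softened outputs of the sparse (student) and giant (teacher) networks. Whether an extra $T^2$ rescaling is applied, as is common in distillation, is not stated in the text, so it is left out; the default temperature is a placeholder.

```python
import torch
import torch.nn.functional as F

def mimic_loss(student_logits, teacher_logits, T: float = 3.0):
    """Soft cross-entropy between temperature-softened outputs, in the spirit of Eq. (5)."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)          # softened targets, Eq. (4)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return -(p_teacher * log_p_student).sum(dim=1).mean()

if __name__ == "__main__":
    s, t = torch.randn(8, 10, requires_grad=True), torch.randn(8, 10)
    mimic_loss(s, t).backward()
```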

Confidence on the unlabeled data. Examples in the unlabeled dataset can present different levels of difficulty to the giant network. If the giant network cannot understand an example well, less importance should be placed on this example during sparse training. The confidence of the giant network on an unlabeled example can be read from its output: if the network is highly confident on example $x_j^u$, one element of its output vector will be far larger than the others. The softmax function normalizes the output vector into a probability vector; however, the original softmax (i.e., with $T=1$) is so sharp that its maximum would be very close to 1 in most cases. The temperature $T$ is therefore used to soften the probability vector so that its maximum becomes more sensitive to changes in confidence. The weight for example $x_j^u$ is defined as

$$w_j = \max\big(p_o^T(x_j^u)\big), \qquad (6)$$

where $\max(\cdot)$ returns the maximum value of a vector. The mimicking loss then evolves into

$$\mathcal{L}_m = \frac{1}{n_u}\sum_{j=1}^{n_u} w_j\, \mathcal{L}_{ce}\big(p_o^T(x_j^u),\, p_s^T(x_j^u)\big), \qquad (7)$$

and, considering both labeled and unlabeled data, the objective function for training is

$$\mathcal{L} = \frac{1}{n_l}\sum_{i=1}^{n_l} \mathcal{L}_{ce}\big(y_i^l, p(x_i^l)\big) + \alpha\, \mathcal{L}_m + \lambda \|\gamma\|_1, \qquad (8)$$

with a constant weight $\alpha$. In this way, the unlabeled data can help the labeled data by supplying more information about the giant network.
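A minimal sketch of the confidence weighting in Eq. (6) and the joint objective of Eqs. (7)-(8), leaving out the sparsity term; the defaults alpha = 0.7 and T = 3 echo the values reported later in Section V, and all tensor names are placeholders.

```python
import torch
import torch.nn.functional as F

def confidence_weights(teacher_logits, T: float = 3.0):
    """Eq. (6): the maximum entry of the softened teacher output, one weight per example."""
    return F.softmax(teacher_logits / T, dim=1).max(dim=1).values

def joint_loss(student_logits_l, labels, student_logits_u, teacher_logits_u,
               alpha: float = 0.7, T: float = 3.0):
    """Labeled cross-entropy plus confidence-weighted mimicking on unlabeled data."""
    ce = F.cross_entropy(student_logits_l, labels)
    w = confidence_weights(teacher_logits_u, T)
    p_t = F.softmax(teacher_logits_u / T, dim=1)
    log_p_s = F.log_softmax(student_logits_u / T, dim=1)
    mimic = (w * -(p_t * log_p_s).sum(dim=1)).mean()          # Eq. (7)
    return ce + alpha * mimic                                 # Eq. (8) without the L1 term
```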

Remark. Note that the labeled data can also be used to mimic the output of the pre-trained giant network; in that case, the mimicking term in Eq. (7) is additionally computed over the labeled examples $x_i^l$. Because the labeled data are much fewer than the unlabeled data, the effect of the mimicking loss on labeled data is also very limited, which will be further verified in the experiments.

IV-B Fixing the Dataset Bias between Unlabeled and Labeled Data

In Eq. (7) above, the unlabeled data are used to enforce consistency between the probabilistic outputs of the sparse network and the original giant network. However, in practice, the unlabeled data collected by users usually differ from the original labeled data, so there exists a dataset bias (or domain shift) [48] between the unlabeled and labeled data. As a result, the consistency achieved on the unlabeled data may not hold on the labeled data. Since both the pruned network and the giant network are designed for the labeled data, the classification performance of the pruned network would be degraded by this dataset bias.

Inspired by domain adaptation [49, 50], we fix this issue by encouraging the sparse network to learn domain-invariant representations (i.e., representations with the same distribution) for the unlabeled and labeled data. In this way, the distributions of the unlabeled and labeled data are aligned on the low-level features, and the probabilistic outputs for unlabeled and labeled data will have similar distributions as well. Note that typical domain adaptation tasks adapt a network pre-trained on labeled data (the source domain) to unlabeled data (the target domain). In contrast, we aim to let the unlabeled data guide the feature learning of the labeled data in the pruned network, so that the knowledge of the unlabeled data can be well transferred into the compression process.

To learn domain-invariant representations, the distributions of the low-level features of the unlabeled and labeled data should be similar to each other. Following the wisdom of adversarial domain adaptation methods [51, 52], we minimize the discrepancy between the unlabeled and labeled feature distributions by introducing a discriminator. The discriminator aims to distinguish whether the learned features come from labeled or unlabeled data. Typically the discriminator is co-trained with a generator in an adversarial learning manner [53], which has been successfully used in many tasks, such as image style transfer [54], image super-resolution [55] and domain adaptation [52, 56].

The adversarial training [53] can be regarded as a two-player minimax game. We divide the sparse network into two parts, i.e., $\mathcal{N}_s = \mathcal{N}_{s2} \circ \mathcal{N}_{s1}$, where the first part $\mathcal{N}_{s1}$ extracts low-level features and the second part $\mathcal{N}_{s2}$ outputs the classification results. Given examples $x$, the discriminator $D$ makes binary predictions about whether their low-level feature representations $\mathcal{N}_{s1}(x)$ come from the unlabeled dataset or not. In this case, the low-level feature extractor $\mathcal{N}_{s1}$ plays the role of the generator, which tries to make the two kinds of feature maps indistinguishable. The game can be modelled with a value function $V(\mathcal{N}_{s1}, D)$:

$$\min_{\mathcal{N}_{s1}} \max_{D}\; V(\mathcal{N}_{s1}, D) = \mathbb{E}_{x^u \sim \mathcal{D}^u}\Big[\log D\big(\mathcal{N}_{s1}(x^u)\big)\Big] + \mathbb{E}_{x^l \sim \mathcal{D}^l}\Big[\log\Big(1 - D\big(\mathcal{N}_{s1}(x^l)\big)\Big)\Big]. \qquad (9)$$

Eq. (9) is usually solved by alternately optimizing the discriminator $D$ and the generator $\mathcal{N}_{s1}$, whose loss functions are

$$\mathcal{L}_D = -\hat{V}(\mathcal{N}_{s1}, D), \qquad (10)$$
$$\mathcal{L}_{adv} = \hat{V}(\mathcal{N}_{s1}, D), \qquad (11)$$

where $\hat{V}$ is the empirical estimate of the value function $V$, i.e.,

$$\hat{V}(\mathcal{N}_{s1}, D) = \frac{1}{n_u}\sum_{j=1}^{n_u} \log D\big(\mathcal{N}_{s1}(x_j^u)\big) + \frac{1}{n_l}\sum_{i=1}^{n_l} \log\Big(1 - D\big(\mathcal{N}_{s1}(x_i^l)\big)\Big). \qquad (12)$$

The optimization of $D$ and $\mathcal{N}_{s1}$ alternates, and the distance between the low-level feature distributions of the labeled and unlabeled data is minimized in the end. In practice, when training the sparse network $\mathcal{N}_s$, we can augment the original objective of Eq. (8) with the adversarial loss in an end-to-end fashion, i.e.,

$$\mathcal{L} = \frac{1}{n_l}\sum_{i=1}^{n_l} \mathcal{L}_{ce}\big(y_i^l, p(x_i^l)\big) + \alpha\, \mathcal{L}_m + \beta\, \mathcal{L}_{adv} + \lambda \|\gamma\|_1, \qquad (13)$$

where $\beta$ is the weight coefficient of the adversarial loss. As a consequence, the sparse network can fully receive help from the unlabeled data when mimicking the giant network, with only a subtle influence from the dataset bias.
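A minimal sketch of the alternating losses in Eqs. (10)-(12), assuming the discriminator outputs a single logit per feature map. The alignment (generator) term is written with flipped labels, a common non-saturating surrogate for the minimax objective, and treating the unlabeled features as the positive class is an arbitrary choice.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, feat_labeled, feat_unlabeled):
    """Train D to tell unlabeled (target 1) from labeled (target 0) low-level features."""
    out_u, out_l = D(feat_unlabeled), D(feat_labeled)
    return (F.binary_cross_entropy_with_logits(out_u, torch.ones_like(out_u)) +
            F.binary_cross_entropy_with_logits(out_l, torch.zeros_like(out_l)))

def alignment_loss(D, feat_labeled, feat_unlabeled):
    """Update the low-level extractor so that D can no longer tell the two apart."""
    out_u, out_l = D(feat_unlabeled), D(feat_labeled)
    return (F.binary_cross_entropy_with_logits(out_u, torch.zeros_like(out_u)) +
            F.binary_cross_entropy_with_logits(out_l, torch.ones_like(out_l)))
```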

IV-C Theoretical Discussions

Now we attempt to investigate how the unlabeled data help the pruned network to generalize better than it would with only a few labeled data. Our method trains the pruned network jointly in an end-to-end way. For simplicity of the theoretical discussion, we decompose the training into two sequential steps. In the first step, we adversarially train the low-level feature extractor $\mathcal{N}_{s1}$ and the discriminator $D$; then, in the second step, we fix the learned low-level features and train the remaining part of the network, $\mathcal{N}_{s2}$, to mimic the output of the original giant network $\mathcal{N}_o$.

First, using the unlabeled and labeled data at the low-level layers, we train the feature extractor $\mathcal{N}_{s1}$ via Eq. (9). In theory, this makes the two feature distributions identical, via the following Theorem 1 [53].

Theorem 1 (Feature alignment)

With $\mathcal{N}_{s1}$ fixed, the optimal discriminator is $D^*(f) = \frac{p_u(f)}{p_u(f) + p_l(f)}$, where $p_u$ and $p_l$ denote the distributions of the low-level features of the unlabeled and labeled data, respectively. The global optimum of Eq. (9) is then achieved if and only if $p_u = p_l$.

As a result, we can align the distribution of the unlabeled data's low-level features with that of the labeled data. Since $p_u = p_l$ at the optimum, the input of the remaining part $\mathcal{N}_{s2}$ has no dataset bias in theory. Then $\mathcal{N}_{s2}$ can be trained with the loss of Eq. (8), which can be cast into the framework of empirical risk minimization (ERM) with regularization. To facilitate the analysis of the effect of the unlabeled data, we leave out the regularization term. We then investigate the generalization ability of the learned $\mathcal{N}_{s2}$ by checking its generalization error bound, which is related to its population risk $R(\mathcal{N}_{s2})$ and empirical risk $\hat{R}(\mathcal{N}_{s2})$, defined as

$$R(\mathcal{N}_{s2}) = \mathbb{E}_{(f, y) \sim P}\Big[\mathcal{L}_{ce}\big(y, \mathcal{N}_{s2}(f)\big)\Big], \qquad (14)$$
$$\hat{R}(\mathcal{N}_{s2}) = \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}_{ce}\big(y_i, \mathcal{N}_{s2}(f_i)\big), \qquad (15)$$

where $P$ is the ground-truth distribution of the aligned features and targets. (Here we do not distinguish the hard label vectors from the softened output vectors in Eq. (8) and regard both as the target space for simplicity.) Usually, there exists a gap between the population risk $R$ and the empirical risk $\hat{R}$, and a desired model should have a small gap. Via McDiarmid's inequality, the gap can be bounded as in Theorem 2 [57].

Theorem 2 (Generalization error bound)

Given a fixed hypothesis space $\mathcal{H}_2$, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, for all $\mathcal{N}_{s2} \in \mathcal{H}_2$,

$$R(\mathcal{N}_{s2}) \le \hat{R}(\mathcal{N}_{s2}) + C_1\, K\, \mathfrak{R}_n(\mathcal{H}_2) + C_2 \sqrt{\frac{\ln(1/\delta)}{n}}, \qquad (16)$$

where $K$ is the number of classes, $C_1$ and $C_2$ are constants, $n$ is the number of training examples and $\mathfrak{R}_n(\mathcal{H}_2)$ is the Rademacher complexity.

In Theorem 2, the third term shows that a large dataset capacity $n$ induces a tight bound. In this way, with the unlabeled data involved, we boost the generalization ability of $\mathcal{N}_{s2}$ by increasing the number of training examples, which have identical distributions after the feature alignment. The second term involves the Rademacher complexity, defined as follows.

Definition 1 (Rademacher complexity)

Given Rademacher variables $\sigma_1, \dots, \sigma_n$ (independent uniform random variables taking values in $\{-1, +1\}$) and samples $f_1, \dots, f_n$, the Rademacher complexity of the hypothesis space $\mathcal{H}_2$ is defined as

$$\mathfrak{R}_n(\mathcal{H}_2) = \mathbb{E}_{\sigma}\Big[\sup_{h \in \mathcal{H}_2} \frac{1}{n}\sum_{i=1}^{n} \sigma_i\, h_k(f_i)\Big], \qquad (17)$$

where $h_k(f_i)$ is the $k$-th element of the output $h(f_i)$.

$\mathfrak{R}_n(\mathcal{H}_2)$ is directly related to the complexity of the hypothesis space and hence to the generalization gap. Thus, to tighten the gap, we can train the network by minimizing $\mathfrak{R}_n(\mathcal{H}_2)$. However, its exact computation is very hard. In practice, we usually use an upper bound or estimate [58] computed per minibatch, e.g.,

$$\mathcal{L}_R = \Big\|\frac{1}{n_b}\sum_{i=1}^{n_b} \sigma_i\, \mathcal{N}_s(x_i)\Big\|_1, \qquad (18)$$

where $n_b$ is the number of labeled and unlabeled samples in a minibatch and the $\sigma_i$ are Rademacher variables. The term $\mathcal{L}_R$ can thus serve as a regularization term (called the Rademacher loss) to control the generalization ability during training; the overall loss function is

$$\mathcal{L}_{total} = \frac{1}{n_l}\sum_{i=1}^{n_l} \mathcal{L}_{ce}\big(y_i^l, p(x_i^l)\big) + \alpha\, \mathcal{L}_m + \beta\, \mathcal{L}_{adv} + \mu\, \mathcal{L}_R + \lambda \|\gamma\|_1, \qquad (19)$$

where $\mu$ is a constant parameter and $\mathcal{L}_R$ is calculated per minibatch. Although unlabeled data are much cheaper than labeled data, they are not free, and large disks are needed to store them. Serious over-fitting usually occurs when tailoring or retraining the giant network with insufficient data, so in this case promoting generalization with $\mathcal{L}_R$ is helpful. Our proposed method is summarized in Algorithm 1. It also contains three steps, and the unlabeled data play an important part in the sparse retraining and fine-tuning to assist the labeled data.
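Since the exact estimator behind Eq. (18) is not reproduced here, the sketch below uses one simple per-minibatch surrogate: draw random ±1 signs, average the signed outputs of the current network, and penalize the magnitude of the result; the weights passed to the total loss are placeholders.

```python
import torch

def rademacher_surrogate(outputs):
    """A per-minibatch stand-in for Eq. (18): magnitude of randomly signed mean outputs."""
    sigma = (torch.randint(0, 2, (outputs.size(0), 1), device=outputs.device) * 2 - 1).float()
    return (sigma * outputs).mean(dim=0).abs().sum()

def total_loss(supervised_and_mimic, adversarial, rademacher,
               beta: float = 0.01, mu: float = 0.01):
    """Weighted sum of the terms of Eq. (19); the L1 sparsity penalty can be added as in the earlier sketch."""
    return supervised_and_mimic + beta * adversarial + mu * rademacher
```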

0:  Input: pre-trained network $\mathcal{N}_o$, released labeled dataset $\mathcal{D}^l$ and collected unlabeled dataset $\mathcal{D}^u$
1:  Initialize the sparse network $\mathcal{N}_s$ with $\mathcal{N}_o$.
2:  repeat
3:     Randomly select labeled examples from $\mathcal{D}^l$ and unlabeled examples from $\mathcal{D}^u$ as a minibatch.
4:     Forward the pre-trained network $\mathcal{N}_o$ on the minibatch.
5:     Forward the sparse network $\mathcal{N}_s$ on the minibatch.
6:     Calculate the loss of the discriminator $D$ with Eq. (10) and update the parameters of $D$.
7:     Calculate the loss of the network $\mathcal{N}_s$ with Eq. (19).
8:     Update the parameters of the network $\mathcal{N}_s$.
9:  until convergence
10:  Prune channels with small scaling factors in network $\mathcal{N}_s$.
11:  Fine-tune the pruned network.
Output: a pruned network ready to deploy.
Algorithm 1 Pruning with Unlabeled Data
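To tie the pieces together, here is a compact sketch of one iteration of the sparse-retraining loop (steps 3-8 of Algorithm 1). The attributes `student.low_level` and `student.head`, the helpers in `losses` and the weights in `w` are placeholders for the corresponding modules and loss terms discussed above; the pruning and fine-tuning steps are not shown.

```python
import torch

def train_step(student, teacher, D, opt_s, opt_d,
               labeled_batch, unlabeled_images, losses, w):
    """One sparse-retraining iteration in the spirit of Algorithm 1."""
    x_l, y_l = labeled_batch
    with torch.no_grad():
        t_logits_u = teacher(unlabeled_images)                     # step 4

    # Step 6: update the discriminator on detached low-level features.
    f_l = student.low_level(x_l).detach()
    f_u = student.low_level(unlabeled_images).detach()
    d_loss = losses["disc"](D, f_l, f_u)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Steps 5, 7, 8: forward the sparse network and update it with Eq. (19).
    f_l, f_u = student.low_level(x_l), student.low_level(unlabeled_images)
    logits_l, logits_u = student.head(f_l), student.head(f_u)
    loss = (losses["joint"](logits_l, y_l, logits_u, t_logits_u)    # CE + mimicking
            + w["beta"] * losses["align"](D, f_l, f_u)              # adversarial term
            + w["mu"] * losses["rad"](torch.cat([logits_l, logits_u]))
            + w["lambda"] * losses["l1_bn"](student))               # sparsity on gamma
    opt_s.zero_grad(); loss.backward(); opt_s.step()
    return d_loss.item(), loss.item()
```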

V Experiments

In this section, we compress several prevalent neural networks for different applications to validate the effectiveness of the proposed method. Concretely, we conduct experiments on the benchmark CIFAR-10 dataset [59] and the large-scale ILSVRC2012 dataset [60], together with the widely used VGGNet and ResNet. As for the assistant unlabeled data, we adopt the STL-10 dataset [61] and the COCO dataset [62], respectively.

Comparison methods. We adopt a Vanilla Pruning method as a baseline, which simply removes the small scaling factors of the giant networks and then fine-tunes the pruned networks with the labeled data. Furthermore, we also compare our method with the state-of-the-art Slimming method [25]. Note that the labeled data can be independently involved in the Rademacher loss or the mimicking loss, so we have two variants of the Slimming method: "Slimming+" denotes attaching the Rademacher loss to the labeled data, while "Slimming++" uses both the Rademacher loss and the mimicking loss. In addition, to show the lower bound of performance when labeled data are quite limited, we also train the pruned networks from scratch by randomly initializing their parameters, denoted as the Scratch method in our experiments.

$n_l$ | Scratch | Vanilla Pruning | Slimming [25] | Slimming+ [25] | Slimming++ [25] | Ours
100 | 41.47 | 59.28 | 62.49 | 63.57 | 65.83 | 75.04
500 | 56.97 | 78.31 | 84.52 | 85.26 | 85.84 | 88.51
1K | 69.86 | 82.56 | 87.23 | 87.86 | 87.95 | 91.04
TABLE I: Classification accuracy (%) of the pruned VGGNet on the CIFAR-10 dataset with the unlabeled STL-10 dataset, for different numbers of labeled images $n_l$. All methods achieve approximately the same compression and acceleration rates.
$n_l$ | Scratch | Vanilla Pruning | Slimming [25] | Slimming+ [25] | Slimming++ [25] | Ours
200 | 42.32 | 51.49 | 55.48 | 55.71 | 56.24 | 62.03
500 | 52.78 | 65.78 | 67.02 | 67.33 | 67.89 | 73.29
1K | 65.24 | 77.19 | 79.22 | 79.43 | 79.91 | 82.42
TABLE II: Classification accuracy (%) of the pruned ResNet-56 on the CIFAR-10 dataset with the unlabeled STL-10 dataset, for different numbers of labeled images $n_l$. All methods achieve approximately the same compression and acceleration rates.

V-A Experiments on CIFAR-10 Dataset

Dataset. The CIFAR-10 dataset [59] is composed of 60,000 color images from ten categories, 50,000 for training and 10,000 for testing. In our setting, only a small fraction of the training images are randomly selected as labeled data. The standard data augmentation [10] is adopted, including padding (with size 4), random cropping and horizontal flipping. As for the unlabeled data, we choose the STL-10 dataset [61], which is also an image recognition dataset containing a large number of (labeled and unlabeled) RGB images. STL-10 has categories similar to CIFAR-10, but the two datasets were collected in different ways. Some example images of the two datasets are shown in Figure 2. In our experiment, we randomly sample 5,000 images from the unlabeled part of the STL-10 dataset to assist the compression. All unlabeled images are then rescaled to the same size as the CIFAR-10 images.
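A minimal sketch, assuming torchvision's STL10 loader, of how the 5,000 unlabeled STL-10 images could be drawn and rescaled to the CIFAR-10 resolution (32×32); the data path, batch size and augmentation choices are placeholders.

```python
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

# Resize the 96x96 STL-10 images down to the CIFAR-10 resolution.
transform = transforms.Compose([
    transforms.Resize(32),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
stl_unlabeled = datasets.STL10("./data", split="unlabeled",
                               download=True, transform=transform)
subset = Subset(stl_unlabeled, torch.randperm(len(stl_unlabeled))[:5000].tolist())
unlabeled_loader = DataLoader(subset, batch_size=64, shuffle=True, num_workers=2)
```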

(a) CIFAR-10 dataset.
(b) STL-10 dataset.
Fig. 2: Sample images in the labeled CIFAR-10 dataset [59] and the unlabeled STL-10 dataset [61].

Networks. We experiment with VGGNet [8] and ResNet-56 [10], which are deep and powerful baseline networks broadly used in many tasks, such as image recognition, object detection and video action analysis. The original VGGNet is designed for the ImageNet dataset, so we tailor its structure slightly to fit the CIFAR-10 dataset following [25]. The features extracted by the convolution layers are aggregated by a pooling layer and then sent directly to a fully-connected layer to obtain predictions. The 56-layer ResNet is stacked from bottleneck blocks with the pre-activation structure [63]. We train VGGNet and ResNet from scratch on the CIFAR-10 dataset as the giant pre-trained networks. For the adversarial loss in Section IV-B, we take the second pooling layer in VGGNet and the first block in ResNet as the low-level feature layers, and the discriminator $D$ is a simple 3-layer CNN. Feature maps are first passed through two convolution layers, each followed by a ReLU nonlinearity, then forwarded to an average-pooling layer and a fully-connected layer that predicts whether the image comes from the labeled or the unlabeled dataset. The number of output channels of the first convolution layer equals the number of its input channels, while the second convolution layer doubles the channel count.
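A minimal sketch of a discriminator matching the description above; the kernel sizes, strides and the use of adaptive average pooling are assumptions, since the text only fixes the layer count and the channel-doubling pattern.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """3-layer CNN: two strided convs with ReLU, global average pooling, one FC layer."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, stride=2, padding=1),      # keeps channels
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 2 * in_channels, 3, stride=2, padding=1),  # doubles channels
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(2 * in_channels, 1)   # logit: unlabeled vs. labeled

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

if __name__ == "__main__":
    d = Discriminator(128)
    print(d(torch.randn(4, 128, 16, 16)).shape)   # -> torch.Size([4, 1])
```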

Training. For sparse retraining with Eq. (19), roughly 15K–20K iterations are used. We experimentally find that this number of training iterations suffices for both the comparison methods (using only labeled data) and our method (using both labeled and unlabeled data). For fine-tuning the pruned network, we use half as many iterations, i.e., 7.5K–10K. For VGGNet, the initial learning rate is set to 0.003 in sparse retraining and 0.001 in fine-tuning; for ResNet, it is set to 0.02 and 0.005, respectively. When training with only labeled data, the learning rate drops by 0.1 at fixed fractions of the maximum iterations; when training with additional unlabeled data, it drops by 0.3 at 40%, 70% and 90% of the maximum iterations. We empirically find that the two learning-rate schemes fit their respective settings well. For VGGNet, we select the sparsity weight $\lambda$ for the term $\lambda\|\gamma\|_1$ from an interval with step 0.0001, and for ResNet we select $\lambda$ from a small candidate set. When calculating the loss of the discriminator, the labeled data are weighted by a balancing coefficient. The weight $\alpha$ and the temperature parameter $T$ are set to 0.7 and 3, respectively. The weight $\beta$ of the adversarial loss is fixed, while the weight $\mu$ of the Rademacher loss is selected from a small candidate set. Parameters are determined with cross-validation.

Fig. 3: Detailed structure of ResNet-56 on the CIFAR-10 dataset obtained by our method. The blue bars denote the number of channels of the original network, while the red bars denote that of the pruned network.

Results. The classification accuracy of the compressed networks on the CIFAR-10 dataset assisted by the STL-10 dataset is presented in Table I and Table II for VGGNet and ResNet, respectively. The pre-trained VGGNet (ResNet) achieves 93.78% (93.96%) accuracy with 20.1M (0.59M) parameters and 398.6M (88.3M) floating-point operations (FLOPs). For fairness of comparison, all methods prune 70% of the channels of the pre-trained models, assisted by 5K unlabeled images from the STL-10 dataset, and obtain pruned networks with about 1.8M (0.3M) parameters and 159M (35M) FLOPs. (The actual compression and acceleration rates depend on the percentage of channels pruned in each layer and may vary within a small range.)

From Table I and Table II, we can see that for various numbers of labeled images $n_l$, our method significantly outperforms the comparison methods in all cases. This indicates the effectiveness and superiority of exploiting unlabeled data even when the distributions of the labeled and unlabeled images are not exactly identical. When the labeled data are insufficient, the comparison methods tend to be trapped in serious over-fitting. With 1K labeled images, the state-of-the-art Slimming method only achieves 87.23% accuracy, a large drop (6.55%) from the pre-trained VGGNet (93.78%). Training with the Rademacher loss and mimicking loss on the labeled data (Slimming++) improves the performance a little (from 87.23% to 87.95%). However, with the assistance of unlabeled data, our method improves the performance by a large margin and achieves an accuracy of 91.04%. Note that the pre-trained ResNet, with shortcut connections and bottleneck blocks [63], is already parameter-efficient; thus, when a similar percentage of channels is pruned, ResNet is more challenging and usually suffers a larger accuracy drop than VGGNet.

Table I and Table II also show how the number of labeled images affects the performance of the pruned networks. Fewer data inevitably incur a larger accuracy drop; however, the drop of our proposed method is much slower owing to the unlabeled data. For example, with only 100 labeled images, the state-of-the-art Slimming method [25] achieves merely 62.49% accuracy, which is unacceptable for real applications. However, the improvement from unlabeled data is very prominent (i.e., accuracy improved by more than 12% compared to Slimming [25]). The results show that unlabeled data provide a good platform to transfer the knowledge of the giant network and improve the accuracy accordingly, which is essential when labeled data are extremely limited.

The detailed structure of the pruned VGGNet and ResNet are shown in Table III and Figure 3, respectively. For VGGNet on CIFAR-10 dataset, more than 90% channels can be pruned in the later layers, implying much redundancy. For ResNet with bottleneck structure, a large number of channels in the “wider” layers can be pruned.

Layer # Channel # Channel* Pruning rate (%)
conv 1-1 64 45 29.69
conv 1-2 64 60 6.25
conv 2-1 128 120 6.25
conv 2-2 128 112 12.50
conv 3-1 256 218 14.84
conv 3-2 256 211 17.58
conv 3-3 256 205 19.92
conv 3-4 256 124 51.56
conv 4-1 512 64 87.50
conv 4-2 512 59 88.48
conv 4-3 512 61 88.09
conv 4-4 512 37 92.77
conv 4-5 512 41 92.00
conv 4-6 512 39 92.38
conv 4-7 512 44 91.41
conv 4-8 512 248 51.56
TABLE III: Detailed structure of VGGNet on the CIFAR-10 dataset obtained by our method. "# Channel" and "# Channel*" denote the number of output channels of the convolutional layers in the original network and the pruned network, respectively.
$n_l$ | Performance | Scratch | Vanilla Pruning | Slimming [25] | Slimming+ [25] | Slimming++ [25] | Ours
50K | Top-5 Acc | 65.46 | 74.86 | 74.96 | 75.16 | 75.37 | 78.41
100K | Top-5 Acc | 70.37 | 76.94 | 77.52 | 78.14 | 78.15 | 82.21
TABLE IV: Classification performance of the pruned VGGNet on the ILSVRC2012 dataset with the unlabeled COCO dataset; all pruned networks have approximately the same number of FLOPs.

V-B Experiments on ILSVRC2012 Dataset

Dataset. The ILSVRC2012 dataset [60] contains over 1.2M training images and 50k validation images from 1,000 categories. For training, all images are randomly cropped to 224×224 and horizontally flipped. As for the unlabeled data, the COCO dataset [62] is adopted, since it is also a large-scale benchmark image dataset widely used for object detection, segmentation and captioning. The COCO dataset has 80 object categories, far fewer than ILSVRC2012. We randomly sample 100k images as the unlabeled data. Using the COCO dataset to assist compression on ILSVRC2012 is a very challenging task because of the large differences in their distributions and categories. Some example images are shown in Figure 4.

Networks. Following [25], we use the "VGG-A" network model [8] with batch normalization [44] released by PyTorch (https://pytorch.org/docs/master/torchvision/models.html) as the pre-trained model and evaluate performance with top-5 single-center-crop validation accuracy. The pre-trained model has 89.81% top-5 accuracy, with 132.87M parameters and 7.62B FLOPs. The feature maps after the second pooling layer are sent to a 4-layer convolutional discriminator, which has a structure similar to the one used for the CIFAR-10 dataset, except that an extra convolution layer is added at the beginning and the average-pooling layer is enlarged accordingly. All the convolution layers of the discriminator use stride 2.

(a) ILSVRC2012 dataset.
(b) COCO dataset.
Fig. 4: Sample images in the labeled ILSVRC2012 dataset [60] and the unlabeled COCO dataset [62].

Training. For all comparison methods, we use 100k iterations for the sparse retraining and 50k iterations for fine-tuning. The initial learning rate is set to 0.01 for the sparse retraining and chosen from a small candidate set for fine-tuning. The learning rate drops by 0.3 at 40%, 70% and 90% of the total iterations. The mimicking weight $\alpha$ is set to 0.5, while the adversarial weight $\beta$ and the Rademacher weight $\mu$ are determined from small candidate sets. The sparsity weight $\lambda$ is set to 0.005 and 0.003 for the 50k and 100k labeled settings, respectively. For all methods, we prune 50% of the channels of the pre-trained network.

Results. We randomly sample 50k and 100k labeled images from the ILSVRC2012 dataset to implement the compression, assisted with 100k unlabeled samples from the COCO dataset. As Table IV shows, after 50% of the channels are pruned, all the pruned networks have approximately the same acceleration rate. However, our method achieves the best classification accuracy in all cases. It can be safely concluded that the unlabeled data do boost the compression performance on large-scale datasets. Comparing our results with those of Slimming++, e.g., 78.41% vs. 75.37% and 82.21% vs. 78.15% top-5 accuracy, we can infer that the classification ability of the pre-trained model is well preserved by the unlabeled data via mimicking the softened outputs and fixing the dataset bias. Considering the difference between the ILSVRC2012 and COCO datasets, the significant improvement in accuracy shows the effectiveness and superiority of our proposed method. The detailed structure of the pruned VGG-A is shown in Table V. For VGG-A on ILSVRC2012, most of the layers have similar redundancy.

Layer # Channel # Channel* Pruning rate (%)
conv 1-1 64 30 52.13
conv 2-1 128 57 55.47
conv 3-1 256 85 66.80
conv 3-2 256 123 51.95
conv 4-1 512 172 66.41
conv 4-2 512 223 56.45
conv 4-3 512 238 53.52
conv 4-4 512 499 2.54
TABLE V: Detailed structure of the pruned VGG-A model on the ILSVRC2012 dataset obtained by our method. "# Channel" and "# Channel*" denote the number of output channels of the convolutional layers in the original network and the pruned network, respectively.
Mimicking | Confidence | Adversarial | Rademacher | $n_l$=100 | $n_l$=500 | $n_l$=1K
✗ | ✗ | ✗ | ✗ | 62.49 | 84.52 | 87.23
✓ | ✗ | ✗ | ✗ | 70.34 | 87.62 | 89.84
✓ | ✓ | ✗ | ✗ | 71.26 | 88.18 | 90.21
✓ | ✓ | ✓ | ✗ | 72.61 | 88.10 | 90.72
✓ | ✓ | ✗ | ✓ | 73.82 | 88.19 | 90.50
✓ | ✓ | ✓ | ✓ | 75.04 | 88.51 | 91.14
TABLE VI: Effect of each individual component for pruning VGGNet on the CIFAR-10 dataset with the unlabeled STL-10 dataset (accuracy in %). All methods achieve approximately the same compression and acceleration rates.
Mimicking | Confidence | Adversarial | Rademacher | $n_l$=200 | $n_l$=500 | $n_l$=1K
✗ | ✗ | ✗ | ✗ | 55.48 | 67.02 | 79.22
✓ | ✗ | ✗ | ✗ | 59.31 | 68.89 | 80.97
✓ | ✓ | ✗ | ✗ | 59.92 | 69.37 | 81.28
✓ | ✓ | ✓ | ✗ | 60.51 | 71.95 | 82.33
✓ | ✓ | ✗ | ✓ | 61.18 | 72.45 | 82.16
✓ | ✓ | ✓ | ✓ | 62.03 | 73.29 | 82.42
TABLE VII: Effect of each individual component for pruning ResNet-56 on the CIFAR-10 dataset with the unlabeled STL-10 dataset (accuracy in %). All methods achieve approximately the same compression and acceleration rates.
Fig. 5: Classification accuracy of the pruned networks on the CIFAR-10 dataset w.r.t. (a) different numbers of unlabeled data with 100 labeled images ($n_l = 100$) and a 70% pruning ratio, and (b) various pruning ratios.
(a) Weight $\alpha$ of the mimicking loss $\mathcal{L}_m$.
(b) Weight $\beta$ of the adversarial loss $\mathcal{L}_{adv}$.
(c) Weight $\mu$ of the Rademacher loss $\mathcal{L}_R$.
Fig. 6: Analysis on VGGNet of the three losses in Eq. (19) by varying their weights.

V-C Ablation Studies

V-C1 Effect of the number of unlabeled data

Furthermore, we investigate how the number of unlabeled data influences the classification accuracy of the pruned networks. In Figure 5(a), we report the corresponding accuracies with 100 labeled examples and various numbers of unlabeled ones. As shown in the results, when the number of unlabeled data is fairly limited, the help is also limited and the accuracy is low accordingly. But with the increase of unlabeled data, the accuracies rise steadily. When the unlabeled data are much more than labeled data, e.g., 1K vs 100, the accuracy tends to stabilize. Note that more unlabeled data also bring more training cost, thus in practice for the sake of training efficiency, users do not need to collect too many unlabeled examples.

V-C2 Effect of pruning ratio

We also investigate how the accuracy of the pruned networks changes when different ratios of their channels are pruned. As Figure 5(b) shows, the accuracy drops as more channels are pruned, since more of the information stored in the giant network is lost and cannot be fully recovered with limited labeled data. The accuracy of our method is always higher than that of pruning without unlabeled data, especially at high pruning ratios (e.g., over 60%). This is likely because our method leverages the unlabeled data to reduce the loss of information during sparse retraining and to restore information during fine-tuning.

V-C3 Effect of each individual component

We now investigate the effect of each individual component, i.e., the mimicking loss $\mathcal{L}_m$, the adversarial loss $\mathcal{L}_{adv}$, the Rademacher loss $\mathcal{L}_R$ and the confidence weighting on unlabeled data. The results with ("✓") or without ("✗") each component for both VGGNet and ResNet are shown in Table VI and Table VII. The mimicking loss directly involves unlabeled data in the retraining process and improves the performance by a large margin (i.e., from 87.23% to 89.84% with 1K labeled images), which verifies the prominent effect of unlabeled data as a good platform to distill knowledge from the pre-trained network and to further alleviate over-fitting. Weighting the unlabeled data with confidence further improves performance (i.e., from 89.84% to 90.21%). However, due to the bias between labeled and unlabeled data, there is still large room to boost performance. The adversarial loss alleviates this bias in the low-level feature space, letting the unlabeled data play a more positive role and yielding an improvement from 90.21% to 90.72%. The loss derived from the theoretical generalization error bound strengthens the robustness of the proposed method and enhances performance slightly (i.e., from 90.21% to 90.50%). With all the components and their mutual effects, the proposed method achieves the best performance (i.e., 91.14%).

To further study how each individual loss and its weight coefficient affect the final performance, we vary the weights $\alpha$, $\beta$ and $\mu$ in turn while fixing the others at the optimal parameter configuration, using 100 labeled images and 5K unlabeled images, as shown in Figure 6.

Mimicking loss $\mathcal{L}_m$ and weight $\alpha$. The main function of $\mathcal{L}_m$ is to introduce unlabeled data that mimic the classification characteristics of the pre-trained model. As $\alpha$ varies from 0 to 1, the degree of mimicking increases accordingly. From Figure 6(a), the accuracy reaches a high level once $\alpha$ exceeds 0.001 and then increases steadily with $\alpha$. We also observe that an overly large $\alpha$ (e.g., 1) induces the accuracy to drop a bit. This might be because, in practice, the distribution of the unlabeled data differs from that of the labeled data, and overemphasizing the unlabeled data disturbs the network's fit to the labeled data.

Adversarial loss $\mathcal{L}_{adv}$ and weight $\beta$. The low-level features of the pre-trained model are usually fairly different between unlabeled and labeled data, so the adversarial loss is very large at the beginning of the retraining. We empirically find that even a small $\beta$ has a more significant impact on the update of the low-level features than on the output layer, since the adversarial loss is imposed directly on the low-level layers. The adversarial loss aligns the distributions of the unlabeled and labeled low-level features; however, this may cause the feature distributions of the labeled data on the pruned network to drift slightly away from those on the original network. In Figure 6(b), when the weight $\beta$ is too large, much of the information of the original giant network is lost and the accuracy drops slightly.

Rademacher loss $\mathcal{L}_R$ and weight $\mu$. The Rademacher loss acts as a regularization term to boost the generalization ability of the pruned networks. It complements the mimicking and adversarial losses and works best with a moderate $\mu$. Nevertheless, stronger regularization (e.g., $\mu = 0.1$ in Figure 6(c)) may also hamper the classification accuracy.

Verification of the limitation of few labeled data. Note that the mimicking loss and the Rademacher loss can be imposed on both labeled and unlabeled data. Fixing the other components, we additionally conduct experiments in which $\mathcal{L}_m$ and $\mathcal{L}_R$ cover the labeled data, the unlabeled data, or both. Accuracies are presented in Table VIII and Table IX, where "L" represents labeled data and "U" unlabeled data. We can see that, for both the mimicking loss and the Rademacher loss, implementing them only with labeled data has a small effect. For example, with 100 labeled images, applying the mimicking loss (Rademacher loss) only to labeled data improves performance by just 2.26 (0.05). However, once unlabeled data are introduced, the performance improves by a large margin: even implementing the mimicking loss (Rademacher loss) only with unlabeled data improves the accuracy by 10.84 (1.92). We can safely conclude that the unlabeled data play a vital part in improving the performance of the pruned networks.

L | U | $n_l$=100 | $n_l$=500 | $n_l$=1K
✗ | ✗ | 63.57 | 85.26 | 87.86
✓ | ✗ | 65.83 | 85.89 | 88.14
✗ | ✓ | 74.71 | 88.44 | 91.05
✓ | ✓ | 75.04 | 88.51 | 91.14
TABLE VIII: Effect of applying the mimicking loss $\mathcal{L}_m$ to the labeled data (L) and/or the unlabeled data (U).
L | U | $n_l$=100 | $n_l$=500 | $n_l$=1K
✗ | ✗ | 72.61 | 88.10 | 90.72
✓ | ✗ | 72.66 | 88.11 | 90.95
✗ | ✓ | 74.53 | 88.42 | 91.10
✓ | ✓ | 75.04 | 88.51 | 91.14
TABLE IX: Effect of applying the Rademacher loss $\mathcal{L}_R$ to the labeled data (L) and/or the unlabeled data (U).

VI Conclusion

We solved a practical problem of compressing giant demo neural networks given only a few labeled examples instead of the original, complete training data. We exploited unlabeled data to distill the knowledge of the giant network into the pruned network and boosted the compression performance. To alleviate the dataset bias between labeled and unlabeled data, we trained the low-level feature extractor of the pruned network to align their feature distributions. Experimental results validated the effectiveness of our method. In future work, we plan to investigate the extreme situation in which not a single example is released with the giant network, which might demand an even higher generalization ability from the compressed networks.

References

  • [1] S. Qiao, Z. Zhang, W. Shen, B. Wang, and A. L. Yuille, “Gradually updated neural networks for large-scale image recognition,” in ICML, ser. JMLR Workshop and Conference Proceedings, vol. 80.   JMLR.org, 2018, pp. 4185–4194.
  • [2] H. Wen, K. Han, J. Shi, Y. Zhang, E. Culurciello, and Z. Liu, “Deep predictive coding network for object recognition,” in ICML, ser. JMLR Workshop and Conference Proceedings, vol. 80.   JMLR.org, 2018, pp. 5263–5272.
  • [3] T. Nagamine and N. Mesgarani, “Understanding the representation and computation of multilayer perceptrons: A case study in speech recognition,” in ICML, ser. Proceedings of Machine Learning Research, vol. 70.   PMLR, 2017, pp. 2564–2573.

  • [4] T. Ochiai, S. Watanabe, T. Hori, and J. R. Hershey, “Multichannel end-to-end speech recognition,” in ICML, ser. Proceedings of Machine Learning Research, vol. 70.   PMLR, 2017, pp. 2632–2641.
  • [5] D. J. Rezende, S. Mohamed, I. Danihelka, K. Gregor, and D. Wierstra, “One-shot generalization in deep generative models,” in ICML, ser. JMLR Workshop and Conference Proceedings, vol. 48.   JMLR.org, 2016, pp. 1521–1529.
  • [6] L. Maaloe, C. K. Sonderby, S. K. Sonderby, and O. Winther, “Auxiliary deep generative models,” in ICML, ser. JMLR Workshop and Conference Proceedings, vol. 48.   JMLR.org, 2016, pp. 1445–1453.
  • [7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
  • [8] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [11] Y. Wang, C. Xu, C. Xu, and D. Tao, “Beyond filters: Compact feature map for portable deep model,” in International Conference on Machine Learning, 2017, pp. 3703–3711.
  • [12] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding,” CoRR, vol. abs/1510.00149, 2015.
  • [13] J. M. Alvarez and M. Salzmann, “Compression-aware training of deep networks,” in Advances in Neural Information Processing Systems, 2017, pp. 856–867.
  • [14] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Advances in neural information processing systems, 2016, pp. 4107–4115.
  • [15] Q. Hu, P. Wang, and J. Cheng, “From hashing to cnns: Training binaryweight networks via hashing,” arXiv preprint arXiv:1802.02733, 2018.
  • [16] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas, “Predicting parameters in deep learning,” in Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States., 2013, pp. 2148–2156. [Online]. Available: http://papers.nips.cc/paper/5025-predicting-parameters-in-deep-learning
  • [17] X. Zhang, J. Zou, K. He, and J. Sun, “Accelerating very deep convolutional networks for classification and detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 10, pp. 1943–1955, 2016.
  • [18] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • [19] S. Zagoruyko and N. Komodakis, “Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer,” arXiv preprint arXiv:1612.03928, 2016.
  • [20] J. Yim, D. Joo, J. Bae, and J. Kim, “A gift from knowledge distillation: Fast optimization, network minimization and transfer learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4133–4141.
  • [21] B. Reagen, U. Gupta, B. Adolf, M. Mitzenmacher, A. M. Rush, G. Wei, and D. Brooks, “Weightless: Lossy weight encoding for deep neural network compression,” in ICML, ser. JMLR Workshop and Conference Proceedings, vol. 80.   JMLR.org, 2018, pp. 4321–4330.
  • [22] M. A. Carreira-Perpinán and Y. Idelbayev, “Learning-compression algorithms for neural net pruning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8532–8541.
  • [23] Y. Wang, C. Xu, C. Xu, and D. Tao, “Packing convolutional neural networks in the frequency domain,” IEEE transactions on pattern analysis and machine intelligence, 2018.
  • [24] J. M. Alvarez and M. Salzmann, “Learning the number of neurons in deep networks,” in Advances in Neural Information Processing Systems, 2016, pp. 2270–2278.
  • [25] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in Computer Vision (ICCV), 2017 IEEE International Conference on.   IEEE, 2017, pp. 2755–2763.
  • [26] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” in Advances in neural information processing systems, 2015, pp. 3123–3131.
  • [27] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in European Conference on Computer Vision.   Springer, 2016, pp. 525–542.
  • [28] R. Yu, A. Li, C.-F. Chen, J.-H. Lai, V. I. Morariu, X. Han, M. Gao, C.-Y. Lin, and L. S. Davis, “Nisp: Pruning networks using neuron importance score propagation,” arXiv preprint arXiv:1711.05908, 2017.
  • [29] J. Cheng, P. Wang, G. Li, Q. Hu, and H. Lu, “Recent advances in efficient computation of deep convolutional neural networks,” Frontiers of IT & EE, vol. 19, no. 1, pp. 64–77, 2018.
  • [30] U. Djuric, G. Zadeh, K. Aldape, and P. Diamandis, “Precision histology: how deep learning is poised to revitalize histomorphology for personalized cancer care,” NPJ precision oncology, vol. 1, no. 1, p. 22, 2017.
  • [31] R. Burbidge, M. Trotter, B. Buxton, and S. Holden, “Drug design by machine learning: support vector machines for pharmaceutical data analysis,” Computers & Chemistry, vol. 26, no. 1, pp. 5–14, 2001.
  • [32] J. Cheng, P.-s. Wang, G. Li, Q.-h. Hu, and H.-q. Lu, “Recent advances in efficient computation of deep convolutional neural networks,” Frontiers of Information Technology & Electronic Engineering, vol. 19, no. 1, pp. 64–77, 2018.
  • [33] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten, “Exploring the limits of weakly supervised pretraining,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 181–196.
  • [34] V. Vanhoucke, A. Senior, and M. Z. Mao, “Improving the speed of neural networks on cpus,” 2011.
  • [35] F. Li, B. Zhang, and B. Liu, “Ternary weight networks,” arXiv preprint arXiv:1605.04711, 2016.
  • [36] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” arXiv preprint arXiv:1612.01064, 2016.
  • [37] S. You, C. Xu, C. Xu, and D. Tao, “Learning from multiple teacher networks,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.   ACM, 2017, pp. 1285–1294.
  • [38] Y. Wang, C. Xu, C. Xu, and D. Tao, “Adversarial learning of portable student networks,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [39] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in neural information processing systems, 2015, pp. 1135–1143.
  • [40] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
  • [41] Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient dnns,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds.   Curran Associates, Inc., 2016, pp. 1379–1387. [Online]. Available: http://papers.nips.cc/paper/6165-dynamic-network-surgery-for-efficient-dnns.pdf
  • [42] V. Lebedev and V. S. Lempitsky, “Fast convnets using group-wise brain damage,” in CVPR.   IEEE Computer Society, 2016, pp. 2554–2564.
  • [43] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 2074–2082.
  • [44] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [45] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1389–1397.
  • [46] J.-H. Luo, H. Zhang, H.-Y. Zhou, C.-W. Xie, J. Wu, and W. Lin, “Thinet: pruning cnn filters for a thinner net,” IEEE transactions on pattern analysis and machine intelligence, 2018.
  • [47] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu, “Discrimination-aware channel pruning for deep neural networks,” in Advances in Neural Information Processing Systems, 2018, pp. 883–894.
  • [48] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. Lawrence, “Covariate shift and local learning by distribution matching,” 2008.
  • [49] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,” Machine learning, vol. 79, no. 1-2, pp. 151–175, 2010.
  • [50] M. Wang and W. Deng, “Deep visual domain adaptation: A survey,” Neurocomputing, 2018.
  • [51] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
  • [52] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, 2017, p. 4.
  • [53] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [54] H. Chang, J. Lu, F. Yu, and A. Finkelstein, “Pairedcyclegan: Asymmetric style transfer for applying and removing makeup,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [55] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network.” in CVPR, vol. 2, no. 3, 2017, p. 4.
  • [56] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell, “Cycada: Cycle-consistent adversarial domain adaptation,” arXiv preprint arXiv:1711.03213, 2017.
  • [57] V. Koltchinskii and D. Panchenko, “Empirical margin distributions and bounding the generalization error of combined classifiers,” The Annals of Statistics, vol. 30, no. 1, pp. 1–50, 2002.
  • [58] K. Kawaguchi, L. P. Kaelbling, and Y. Bengio, “Generalization in deep learning,” arXiv preprint arXiv:1710.05468, 2017.
  • [59] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
  • [60] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
  • [61] A. Coates, A. Y. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in AISTATS, ser. JMLR Proceedings, vol. 15.   JMLR.org, 2011, pp. 215–223.
  • [62] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” in ECCV (5), ser. Lecture Notes in Computer Science, vol. 8693.   Springer, 2014, pp. 740–755.
  • [63] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European conference on computer vision.   Springer, 2016, pp. 630–645.