Data-Driven Sparse Structure Selection for Deep Neural Networks

07/05/2017, by Zehao Huang et al.

Deep convolutional neural networks have demonstrated extraordinary power on various tasks. However, it is still very challenging to deploy state-of-the-art models in real-world applications due to their high computational complexity. How can we design a compact and effective network without massive experiments and expert knowledge? In this paper, we propose a simple and effective framework to learn and prune deep models in an end-to-end manner. In our framework, a new type of parameter -- the scaling factor -- is first introduced to scale the outputs of specific structures, such as neurons, groups or residual blocks. Then we add sparsity regularizations on these factors and solve this optimization problem by a modified stochastic Accelerated Proximal Gradient (APG) method. By forcing some of the factors to zero, we can safely remove the corresponding structures and thus prune the unimportant parts of a CNN. Compared with other structure selection methods that may need thousands of trials or iterative fine-tuning, our method is trained fully end-to-end in one training pass without bells and whistles. We evaluate our method, Sparse Structure Selection (SSS), with two state-of-the-art CNNs, ResNet and ResNeXt, and demonstrate very promising results with adaptive depth and width selection.


1 Introduction

Deep learning methods, especially convolutional neural networks (CNNs), have achieved remarkable performance in many fields, such as computer vision, natural language processing and speech recognition. However, this extraordinary performance comes at the expense of high computational and storage demand. Although the power of modern GPUs has skyrocketed in recent years, these high costs are still prohibitive for deploying CNNs in latency-critical applications such as self-driving cars and augmented reality.

Recently, a significant amount of work on accelerating CNNs at inference time has been proposed. Methods that focus on accelerating pre-trained models include direct pruning [8, 23, 28, 12, 26], low-rank decomposition [6, 19, 44], and quantization [30, 5, 41]. Another stream of research trains small and efficient networks directly, such as knowledge distillation [13, 32, 34], novel architecture design [17, 14] and sparse learning [24, 45, 1, 40]. In sparse learning, prior work [24] pursued sparsity of individual weights. However, non-structured sparsity only produces random connectivity and can hardly utilize current off-the-shelf hardware such as GPUs to accelerate model inference in terms of wall-clock time. To address this problem, recent methods [45, 1, 40] proposed to apply group sparsity to retain a hardware-friendly CNN structure.

In this paper, we take another view to jointly learn and prune a CNN. First, we introduce a new type of parameter -- scaling factors which scale the outputs of specific structures (e.g., neurons, groups or blocks) in CNNs. These scaling factors give the CNN more flexibility at the cost of very few additional parameters. Then, we add sparsity regularizations on these scaling factors to push them to zero during training. Finally, we can safely remove the structures corresponding to zero scaling factors and obtain a pruned model. Compared with direct pruning methods, this method is data-driven and fully end-to-end. In other words, the network can select its unique configuration based on the difficulty and needs of each task. Moreover, the model selection is accomplished jointly with the normal training of the CNN. We do not require extra fine-tuning or multi-stage optimizations, and it introduces only minor additional training cost.

To summarize, our contributions are threefold:

  • We propose a unified framework for model training and pruning in CNNs. Particularly, we formulate it as a joint sparse regularized optimization problem by introducing scaling factors and corresponding sparse regularizations on certain structures of CNNs.

  • We utilize a modified stochastic Accelerated Proximal Gradient (APG) method to jointly optimize the weights of CNNs and the scaling factors with sparsity regularizations. Compared with previous methods that force sparsity heuristically, our method enjoys more stable convergence and better results without fine-tuning or multi-stage optimization.

  • We test our proposed method on several state-of-the-art networks, VGG, ResNet and ResNeXt, to prune neurons, residual blocks and groups, respectively, and can thus adaptively adjust network depth and width. We show very promising acceleration performance on the CIFAR and large-scale ILSVRC 2012 image classification datasets.

2 Related Works

Network pruning was pioneered in the early development of neural networks. In Optimal Brain Damage [22] and Optimal Brain Surgeon [9], unimportant connections are removed based on the Hessian matrix derived from the loss function. Recently, Han et al. [8] brought back this idea by pruning the weights whose absolute values are smaller than a given threshold. This approach requires iterative pruning and fine-tuning, which is very time-consuming. To tackle this problem, Guo et al. [7] proposed dynamic network surgery to prune parameters during training. However, the irregular nature of the resulting sparse weights means they yield effective compression but not faster inference in terms of wall-clock time. To tackle this issue, several works pruned neurons directly [15, 23, 28] by evaluating neuron importance with specific criteria. These methods all focus on removing the neurons whose removal affects the final prediction the least. On the other hand, the diversity of the neurons to be kept is also an important factor to consider [27]. More recently, [26] and [12] formulate pruning as an optimization problem: they first select the most representative neurons and then minimize the reconstruction error to recover the accuracy of the pruned networks. While neuron-level pruning can achieve practical acceleration with moderate accuracy loss, it is still hard to implement in an end-to-end manner without iterative pruning and retraining. Very recently, Liu et al. [25] used a technique similar to ours to prune neurons. They sparsify the scaling parameters of batch normalization (BN) [18] to select channels. As discussed later, their work can be seen as a special case of our framework.

Model structure learning for deep learning models has attracted increasing attention recently. Several methods have been explored to learn CNN architectures without handcrafted design [2, 46, 31]. One stream explores the design space by reinforcement learning [2, 46] or genetic algorithms [31, 42]. Another stream utilizes sparse learning: [45, 1] added group sparsity regularizations on the weights of neurons and sparsified them during training. Lately, Wen et al. [40] proposed a more general approach, which applies group sparsity to multiple structures of networks, including filter shapes, channels and layers in skip connections.

CNNs with skip connections have become the mainstream of modern network design, since skip connections mitigate the gradient vanishing/exploding issue in ultra-deep networks [37, 10]. Among these works, ResNet and its variants [11, 43] have attracted the most attention because of their simple design principle and state-of-the-art performance. Recently, Veit et al. [39] interpreted ResNet as an exponential ensemble of many shallow networks. They found that removing a single residual block has only a minor impact on performance; however, deleting more and more residual blocks impairs accuracy significantly. Therefore, accelerating this state-of-the-art network architecture remains a challenging problem. In this paper, we propose a data-driven method to learn the architecture of such networks. By scaling and pruning residual blocks during training, our method can produce a more compact ResNet with faster inference speed and even better performance.

3 Proposed Method

Notations. Consider the weights of a convolutional layer l in an L-layer CNN as a 4-dimensional tensor W^l ∈ R^{N_l × M_l × H_l × W_l}, where N_l is the number of output channels, M_l is the number of input channels, and H_l and W_l are the height and width of a 2-dimensional kernel. We use W^l_k to denote the weights of the k-th neuron in layer l. The scaling factors are represented as a 1-dimensional vector λ ∈ R^s, where s is the number of structures we consider to prune, and λ_i refers to the i-th entry of λ. We denote the soft-threshold operator as S_α(z)_i = sign(z_i)(|z_i| − α)_+.
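For reference, the soft-threshold operator can be written in a few lines of Python (the function name is ours):

```python
import numpy as np

def soft_threshold(z, alpha):
    """Element-wise soft-threshold: S_alpha(z)_i = sign(z_i) * max(|z_i| - alpha, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

# Entries whose magnitude is below alpha are set exactly to zero.
print(soft_threshold(np.array([0.03, -0.50, 1.20]), 0.1))  # [ 0.  -0.4  1.1]
```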

Figure 1: The network architecture of our method. F^i represents a residual function. Gray blocks, groups and neurons are inactive and can be pruned, since their corresponding scaling factors are 0.

3.1 Sparse Structure Selection

Given a training set consisting of N sample-label pairs {x_i, y_i}_{i=1}^N, an L-layer CNN can be represented as a function C(x, W), where W = {W^l}_{1≤l≤L} represents the collection of all weights in the CNN. W is learned by solving an optimization problem of the form:

min_W  (1/N) Σ_{i=1}^{N} L(y_i, C(x_i, W)) + R(W)    (1)

where L(y_i, C(x_i, W)) is the loss on the sample x_i, and R(·) is a non-structured regularization applied to every weight, e.g. the ℓ2-norm used as weight decay.

Prior sparsity-based model structure learning works [45, 1] tried to learn the number of neurons in a CNN. To achieve this goal, they added group sparsity regularization on W^l_k to Eqn. 1 and enforced entire W^l_k to zero during training. Another concurrent work by Wen et al. [40] adopted a similar method on multiple different structures. These ideas are straightforward, but the implementations are nontrivial. First, the optimization is difficult since there are several constraints on the weights simultaneously, including weight decay and group sparsity; an improper optimization technique may result in slow convergence and inferior results. Consequently, there has been no successful attempt to directly apply these methods to large-scale applications with complicated modern network architectures.

In this paper, we address the structure learning problem in a simpler and more effective way. Instead of directly pushing the weights in the same group to zero, we enforce the output of the group to zero. To achieve this goal, we introduce a new type of parameter -- the scaling factor λ -- to scale the outputs of specific structures (neurons, groups or blocks), and add a sparsity constraint on λ during training. Our goal is to obtain a sparse λ: if λ_i = 0, we can safely remove the corresponding structure, since its outputs make no contribution to subsequent computation. Fig. 1 illustrates our framework.

Formally, the objective function of our proposed method can be formulated as:

min_{W, λ}  (1/N) Σ_{i=1}^{N} L(y_i, C(x_i, W, λ)) + R(W) + R_s(λ)    (2)

where R_s(·) is a sparsity regularization on λ with weight γ. In this work, we consider its most commonly used convex relaxation, the ℓ1-norm, defined as γ‖λ‖_1.
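To make Eqn. 2 concrete, here is a minimal Python sketch that assembles the three terms for given weights and scaling factors. The function name and default hyperparameter values are ours for illustration; note that in the optimization described next, the ℓ1 term is handled by a proximal (soft-threshold) step rather than by differentiating it.

```python
import numpy as np

def sss_objective(data_loss, weights, lam, weight_decay=1e-4, gamma=1e-3):
    """Scalar objective of Eqn. 2 (illustrative sketch).

    data_loss : (1/N) * sum_i L(y_i, C(x_i, W, lambda)), e.g. mean cross-entropy
    weights   : list of weight arrays W
    lam       : 1-D array of scaling factors lambda
    """
    r_w = 0.5 * weight_decay * sum(np.sum(w ** 2) for w in weights)  # R(W): l2 weight decay
    r_s = gamma * np.sum(np.abs(lam))                                # R_s(lambda) = gamma * ||lambda||_1
    return data_loss + r_w + r_s

# Example with dummy values.
print(sss_objective(2.3, [np.ones((8, 3, 3, 3))], np.ones(8)))
```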

For W, we can update it by Stochastic Gradient Descent (SGD) with momentum or its variants. For λ, we adopt the Accelerated Proximal Gradient (APG) [29] method. For better illustration, we shorten (1/N) Σ_{i=1}^{N} L(y_i, C(x_i, W, λ)) as G(λ) and reformulate the optimization of λ as:

min_λ  G(λ) + R_s(λ)    (3)

Then we can update λ by APG:

d_(t) = λ_(t−1) + ((t−2)/(t+1)) (λ_(t−1) − λ_(t−2))    (4)
z_(t) = d_(t) − η_(t) ∇G(d_(t))    (5)
λ_(t) = prox_{η_(t) R_s}(z_(t))    (6)

where η_(t) is the gradient step size at iteration t and prox_{η_(t) R_s}(·) = S_{η_(t)γ}(·) since R_s(λ) = γ‖λ‖_1. However, this formulation is not friendly for deep learning: in addition to the pass for updating W, we need to obtain ∇G(d_(t)) by an extra forward-backward computation, which is computationally expensive for deep neural networks. Thus, following the derivation in [38], we reformulate APG as a momentum-based method:

z_(t) = λ_(t−1) + μ_(t−1) v_(t−1) − η_(t) ∇G(λ_(t−1) + μ_(t−1) v_(t−1))    (7)
v_(t) = S_{η_(t)γ}(z_(t)) − λ_(t−1)    (8)
λ_(t) = λ_(t−1) + v_(t)    (9)

where we define v_(t−1) = λ_(t−1) − λ_(t−2) and μ_(t−1) = (t−2)/(t+1). This formulation is similar to the modified Nesterov Accelerated Gradient (NAG) in [38] except for the update of v_(t). Furthermore, we simplify the update of λ_(t) by replacing λ_(t−1) with its look-ahead counterpart λ'_(t−1) = λ_(t−1) + μ_(t−1) v_(t−1), following the modification of NAG in [3] which has been widely used in practical deep learning frameworks [4]. Our new parameter updates become:

z_(t) = λ'_(t−1) − η_(t) ∇G(λ'_(t−1))    (10)
v_(t) = μ_(t−1) v_(t−1) + S_{η_(t)γ}(z_(t)) − λ'_(t−1)    (11)
λ'_(t) = S_{η_(t)γ}(z_(t)) + μ_(t) v_(t)    (12)

In practice, we follow a stochastic approach with mini-batches and fix the momentum μ to a constant value. Both W and λ are updated in each iteration.
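To make the update concrete, below is a minimal NumPy sketch of one step of the modified APG update in Eqns. 10-12. The function names and default hyperparameter values are ours for illustration, not from the authors' implementation; grad stands for the mini-batch gradient of G with respect to λ obtained from the normal backward pass.

```python
import numpy as np

def soft_threshold(z, alpha):
    # S_alpha(z)_i = sign(z_i) * max(|z_i| - alpha, 0)
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def apg_step(lam, v, grad, lr=0.1, mu=0.9, gamma=1e-3):
    """One modified-APG update of the scaling factors (cf. Eqns. 10-12).

    lam  : stored (look-ahead) scaling factors lambda'
    v    : momentum buffer of the same shape
    grad : mini-batch gradient of the data loss G w.r.t. lam
    """
    z = lam - lr * grad                  # Eqn. 10: gradient step on G
    s = soft_threshold(z, lr * gamma)    # proximal step for the l1 penalty
    v_new = mu * v + s - lam             # Eqn. 11: momentum update
    lam_new = s + mu * v_new             # Eqn. 12: next look-ahead factors
    return lam_new, v_new

# Toy usage with a made-up gradient; W would be updated by NAG in the same iteration.
lam, v = np.ones(4), np.zeros(4)
lam, v = apg_step(lam, v, grad=np.array([0.5, -0.2, 0.0, 1.0]))
print(lam, v)
```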

In our framework, we add scaling factors to three different CNN micro-structures, including neurons, groups and blocks to yield flexible structure selection. We will introduce these three cases in the following. Note that for networks with BN, we add scaling factors after BN to prevent the influence of bias parameters.

3.2 Neuron Selection

We introduce a scaling factor for the output of each channel to prune neurons. After training, removing the filters with zero scaling factors results in a more compact network. A recent work by Liu et al. [25] adopted a similar idea for network slimming. They absorb the scaling parameters into the parameters of batch normalization and solve the optimization by subgradient descent; during training, scaling parameters whose absolute values are lower than a threshold are set to 0. Compared with [25], our method is more general and effective. Firstly, introducing scaling factors is more universal than reusing BN parameters. On one hand, some networks have no batch normalization layers, such as AlexNet [21] and VGG [36]; on the other hand, when fine-tuning pre-trained models on object detection or semantic segmentation tasks, the parameters of batch normalization are usually fixed due to the small batch size. Secondly, the optimization of [25] is heuristic and needs iterative pruning and retraining, whereas our optimization is more stable and runs end-to-end. Above all, [25] can be seen as a special case of our method.
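As a simple illustration of channel selection, the sketch below applies per-channel scaling factors to a feature map, as would be done after a BN layer; the function name and tensor layout are our own conventions.

```python
import numpy as np

def scale_channels(x, lam):
    """Scale each output channel by its factor (applied after batch normalization).

    x   : feature map of shape (batch, channels, height, width)
    lam : scaling factors of shape (channels,)
    """
    return x * lam.reshape(1, -1, 1, 1)

# Channels whose factor reaches zero contribute nothing and can be removed,
# together with the corresponding filters of the preceding convolution.
x = np.random.randn(2, 4, 8, 8)
lam = np.array([1.0, 0.0, 0.7, 0.0])
y = scale_channels(x, lam)
print(np.abs(y[:, 1]).max(), np.abs(y[:, 3]).max())  # 0.0 0.0
```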

3.3 Block Selection

The structure of CNNs with skip connections allows us to skip the computation of specific layers without cutting off the information flow in the network. By stacking residual blocks, ResNet [10, 11] can easily exploit the advantage of very deep networks. Formally, a residual block with identity mapping can be formulated as:

r^(i+1) = r^i + F^i(r^i, W^i)    (13)

where r^i and r^(i+1) are the input and output of the i-th block, F^i is a residual function and W^i are the parameters of the block.

To prune blocks, we add a scaling factor λ^i after each residual block. In our framework, Eqn. 13 then becomes:

r^(i+1) = r^i + λ^i F^i(r^i, W^i)    (14)

As shown in Fig. 1, after optimization we obtain a sparse λ. Residual blocks with scaling factor 0 are pruned entirely, so we can learn a much shallower ResNet. A prior work that also adds scaling factors to the residuals in ResNet is Weighted Residual Networks [35]. Though the two works share many similarities, their motivations are different: [35] focuses on how to train ultra-deep ResNets (increasing the depth from 100+ to 1000+) to get better results with the help of scaling factors, whereas our method aims to decrease the depth of ResNet by using the scaling factors together with sparse regularization to sparsify the outputs of residual blocks.
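Below is a minimal sketch of a scaled residual block as in Eqn. 14; residual_fn is a stand-in for the block's residual function F^i and is purely illustrative.

```python
import numpy as np

def scaled_residual_block(r, residual_fn, lam_i):
    """Eqn. 14: r^{i+1} = r^i + lam_i * F^i(r^i, W^i).

    A block whose scaling factor reaches 0 reduces to the identity mapping
    and can be removed entirely at inference time.
    """
    return r + lam_i * residual_fn(r)

# Toy usage with a stand-in residual function.
r = np.random.randn(2, 16)
out = scaled_residual_block(r, residual_fn=lambda t: 0.1 * t, lam_i=0.0)
print(np.allclose(out, r))  # True: the block is effectively skipped
```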

3.4 Group Selection

Recently, Xie et al. introduced a new dimension -- cardinality -- into ResNets and proposed ResNeXt [43]. Formally, they presented aggregated transformations as:

F(x) = Σ_{i=1}^{C} T_i(x, W^i)    (15)

where T_i represents a transformation with parameters W^i, and C is the cardinality of the set of transformations to be aggregated. In practice, grouped convolution is used to ease the implementation of aggregated transformations, so in our framework we refer to C as the number of groups and formulate a weighted aggregation as:

F(x) = Σ_{i=1}^{C} λ_i T_i(x, W^i)    (16)

After training, several basic cardinalities are chosen by the sparse λ to form the final transformation, and the inactive groups with zero scaling factors can be safely removed, as shown in Fig. 1. Note that neuron pruning can also be seen as a special case of group pruning in which each group contains only one neuron. Furthermore, block pruning and group pruning can be combined to learn even more flexible network structures.
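Below is a minimal sketch of the weighted aggregation in Eqn. 16; the list transforms stands in for the grouped-convolution branches T_i and is purely illustrative.

```python
import numpy as np

def weighted_aggregation(x, transforms, lam):
    """Eqn. 16: sum_i lam_i * T_i(x, W^i) over the C groups.

    Groups whose scaling factor is 0 can be dropped without changing the output.
    """
    return sum(l * t(x) for l, t in zip(lam, transforms))

# Toy usage with three stand-in group transformations (C = 3).
x = np.random.randn(2, 8)
transforms = [lambda t: t, lambda t: 2.0 * t, lambda t: -t]
lam = np.array([0.5, 0.0, 1.0])   # the second group is inactive
print(weighted_aggregation(x, transforms, lam).shape)  # (2, 8)
```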

4 Experiments

Figure 2: Error vs. number of parameters and FLOPs after SSS training for VGG on the CIFAR-10 and CIFAR-100 datasets.

Figure 3: Error vs. number of parameters and FLOPs after SSS training for ResNet-20 and ResNet-164 on the CIFAR-10 and CIFAR-100 datasets.

Figure 4: Error vs. number of parameters and FLOPs after SSS training for ResNeXt-20 and ResNeXt-164 on the CIFAR-10 and CIFAR-100 datasets.

In this section, we evaluate the effectiveness of our method on three standard datasets: CIFAR-10, CIFAR-100 [20] and ImageNet LSVRC 2012 [33]. For neuron pruning, we adopt VGG16 [36], a classical plain network, to validate our method. For blocks and groups, we use two state-of-the-art networks, ResNet [11] and ResNeXt [43], respectively.

For optimization, we adopt NAG [38, 3] and our modified APG to update the weights W and the scaling factors λ, respectively. We set the weight decay of W to 0.0001 and fix the momentum to 0.9 for both W and λ. The weights are initialized as in [10] and all scaling factors are initialized to 1. All experiments are conducted in MXNet [4]. The code will be made publicly available if the paper is accepted.

4.1 CIFAR

We start with the CIFAR datasets to evaluate our method. CIFAR-10 consists of 50K training and 10K testing RGB images in 10 classes; CIFAR-100 is similar, except that it has 100 classes. As suggested in [10], the input image is a 32×32 crop randomly sampled from a zero-padded 40×40 image or its horizontal flip. The models in our experiments are trained with a mini-batch size of 64 on a single GPU. We start from a learning rate of 0.1 and train the models for 240 epochs; the learning rate is divided by 10 at the 120-th, 160-th and 200-th epochs.

VGG: The baseline network is a modified VGG16 with BN [18] (without BN, the performance of this network is much worse on CIFAR-100). We remove fc6 and fc7 and use only one fully-connected layer for classification. We add scaling factors after every batch normalization layer. Fig. 2 shows the results of our method; both parameters and floating-point operations (FLOPs, counted as multiply-adds) are reported. Our method saves about 30% of the parameters and 30% - 50% of the computational cost with a minor loss of performance.

ResNet: To learn the number of residual blocks, we use ResNet-20 and ResNet-164 [11] as our baseline networks. ResNet-20 consists of 9 residual blocks, each with 2 convolutional layers, while ResNet-164 has 54 blocks with a bottleneck structure in each block. Fig. 3 summarizes our results. It is easy to see that our SSS achieves better performance than the baseline models with similar parameters and FLOPs. For ResNet-164, our SSS yields about 2.5x speedup with a small performance loss on both CIFAR-10 and CIFAR-100. After optimization, we found that the blocks in the early stages are pruned first. This observation coincides with the common design principle that a network should spend more of its budget in the later stages, since more diverse and complicated patterns may emerge as the receptive field increases.

Additionally, we tried to evaluate the performance of SSL proposed in [40] by adding group sparsity on all the weights in a residual block. However, we found that the optimization of the network could not converge. We conjecture that because the number of parameters in the group we define (one residual block) is much larger than in the original paper (a single layer), normal SGD with the heuristic thresholding adopted in [40] is unable to solve this optimization problem.

ResNeXt: We also test our method on ResNeXt [43], choosing ResNeXt-20 and ResNeXt-164 as our base networks. Both networks have bottleneck structures with 32 groups in each residual block. For ResNeXt-20, we focus on group pruning since it contains only 6 residual blocks. For ResNeXt-164, we add sparsity on both groups and blocks. Fig. 4 shows our experimental results. Both group pruning and block pruning show a good trade-off between parameters and performance, especially for ResNeXt-164. The combination of group and block pruning is extremely effective on CIFAR-10: our SSS saves a considerable fraction of FLOPs while achieving higher accuracy. In ResNeXt-20, groups in the first and second blocks are pruned first; similarly, in ResNeXt-164, groups in the shallow residual blocks are pruned the most.

4.2 ImageNet LSVRC 2012

To further demonstrate the effectiveness of our method on large-scale CNNs, we conduct more experiments on the ImageNet LSVRC 2012 classification task with VGG16 [36], ResNet-50 [11] and ResNeXt-50 (32×4d) [43]. We perform data augmentation based on the publicly available implementation of fb.resnet.torch (https://github.com/facebook/fb.resnet.torch). The mini-batch size is 128 on 4 GPUs for VGG16 and ResNet-50, and 256 on 8 GPUs for ResNeXt-50. The optimization and initialization are similar to those in the CIFAR experiments. We train the models for 100 epochs; the learning rate is set to an initial value of 0.1 and then divided by 10 at the 30-th, 60-th and 90-th epochs. All the results for the ImageNet dataset are summarized in Table 2.

stage    output    ResNet-50 / ResNet-26 / ResNet-32 / ResNet-41
conv1    112×112   7×7, 64, stride 2
conv2    56×56     3×3 max pool, stride 2
conv3    28×28
conv4    14×14
conv5    7×7
         1×1       global average pool, 1000-d FC, softmax
Table 1: Network architectures of ResNet-50 and our pruned ResNets for ImageNet. ✓ represents that the corresponding block is kept, while ✗ denotes that the block is pruned.
Model            Top-1   Top-5   #Parameters   #FLOPs
VGG-16           27.54   9.16    138.3M        30.97B
VGG-16 (pruned)  31.47   11.80   130.5M        7.667B
ResNet-50        23.88   7.14    25.5M         4.089B
ResNet-41        24.56   7.39    25.3M         3.473B
ResNet-32        25.82   8.09    18.6M         2.818B
ResNet-26        28.18   9.21    15.6M         2.329B
ResNeXt-50       22.43   6.32    25.0M         4.230B
ResNeXt-41       24.07   7.00    12.4M         3.234B
ResNeXt-38       25.02   7.50    10.7M         2.431B
ResNeXt-35-A     25.43   7.83    10.0M         2.068B
ResNeXt-35-B     26.83   8.42    8.50M         1.549B
Table 2: Results on the ImageNet dataset. Both top-1 and top-5 validation errors (single crop) are reported. The number of parameters and FLOPs for inference of different models are also shown. Here, M/B means million/billion (10^6/10^9), respectively.
Layer     Width
conv1_1   18
conv1_2   35
conv2_1   87
conv2_2   79
conv3_1   88
conv3_2   61
conv3_3   79
conv4_1   180
conv4_2   198
conv4_3   230
conv5_1   512
conv5_2   512
conv5_3   512
Table 3: Pruned VGG16 structure.

Figure 5: Top-1 error vs. number of parameters and FLOPs for our SSS models and original ResNets on the ImageNet validation set.

VGG16: In our VGG16 pruning experiments, we found the results of pruning all convolutional layers unpromising. This is because in VGG16 the computational cost in terms of FLOPs is not equally distributed across layers: the conv5 layers account for 2.77 billion FLOPs in total, only 9% of the whole network (30.97 billion). Thus, we consider that the sparse penalty should be adjusted by the computational cost of different layers. A similar idea has been adopted in [28] and [12]: [28] introduces FLOPs regularization into the pruning criteria, and He et al. [12] do not prune the conv5 layers in their VGG16 experiments. Following [12], we set the sparse penalty of conv5 to 0 and only prune conv1 to conv4. The results can be found in Table 2, and Table 3 shows the detailed structure of our pruned VGG16. The pruned model saves about 75% of the FLOPs, while the parameter saving is negligible. This is because the fully-connected layers hold a large number of parameters (123 million in the original VGG16), and we do not prune them for fair comparison with other methods.

ResNet-50: For ResNet-50, we experiment with three different settings of the sparsity weight γ to explore the performance of our method in block pruning. For simplicity, we denote the trained models as ResNet-26, ResNet-32 and ResNet-41 according to their depths. Their structures are shown in Table 1. All the pruned models suffer some loss of accuracy. Compared with the original ResNet-50, ResNet-41 provides about 15% FLOPs reduction with roughly 0.7% top-1 accuracy loss, while ResNet-32 saves about 31% FLOPs with about 2% top-1 loss. Fig. 5 shows the top-1 validation errors of our SSS models and ResNets as a function of the number of parameters and FLOPs. The results reveal that our pruned models perform on par with the original hand-crafted ResNets, whilst requiring fewer parameters and less computational cost. For example, compared with ResNet-34 [11], both our ResNet-41 and ResNet-32 yield better performance with fewer FLOPs.

ResNeXt-50: For ResNeXt-50, we add the sparsity constraint on both residual blocks and groups, which results in several pruned models. Table 2 summarizes their performance. The learned ResNeXt-41 yields 24.07% top-1 error on the ILSVRC validation set. It achieves results similar to the original ResNet-50, but with half the parameters and more than 20% fewer FLOPs. In ResNeXt-41, three residual blocks in the "conv5" stage are pruned entirely. This pruning result somewhat contradicts the common design of CNNs and is worth studying in depth in the future.

Model                                  Top-1   Top-5   #FLOPs
ResNet-34-pruned [23]                  27.44   -       3.080B
ResNet-50-pruned-A [23] (our impl.)    27.12   8.95    3.070B
ResNet-50-pruned-B [23] (our impl.)    27.02   8.92    3.400B
ResNet-32 (ours)                       25.82   8.09    2.818B
ResNet-50-pruned (2×) [12]             -       9.20    2.726B
ResNet-26 (ours)                       28.18   9.21    2.329B
VGG16-pruned [28]                      -       15.5    8.0B
VGG16-pruned (5×) [12]                 32.20   11.90   7.033B
VGG16-pruned (ThiNet-Conv) [26]        30.20   10.47   9.580B
VGG16-pruned (ours)                    31.47   11.80   7.667B
Table 4: Comparison among several state-of-the-art pruning methods on the ResNet and VGG16 networks.
Model Top-1 Top-5 #FLOPs
DenseNet-121 [16] 25.02 7.71 2.834B
DenseNet-121 [16] (Our impl.) 25.58 7.89 2.834B
ResNeXt-38 (Ours) 25.02 7.50 2.431B
Table 5: Comparison between pruned ResNeXt-38 and DenseNet-121.

4.3 Comparison with other methods

We compare our SSS with other pruning methods, including filter pruning [23], channel pruning [12], ThiNet [26] and [28]. Table 4 shows the pruning results on the ImageNet LSVRC 2012 dataset. To the best of our knowledge, only a few works have reported ResNet pruning results with FLOPs. Compared with the filter pruning results, our ResNet-32 performs best with the fewest FLOPs. As for channel pruning, our ResNet-26 yields a top-5 error similar to the pruned ResNet-50 provided by [12], but saves about 14.5% FLOPs (we calculate the FLOPs of He's models from the provided network structures). We also show a comparison on VGG16: all the methods, including channel pruning, ThiNet and our SSS, achieve significant improvements over [28], and our VGG16 pruning result is competitive with the other state-of-the-art methods.

We further compare our pruned ResNeXt with DenseNet [16] in Table 5. With 14% fewer FLOPs, our ResNeXt-38 achieves 0.2% lower top-5 error than DenseNet-121.

5 Conclusions

In this paper, we have proposed a data-driven method, Sparse Structure Selection (SSS), to adaptively learn the structure of CNNs. In our framework, the training and pruning of CNNs are formulated as a joint sparse regularized optimization problem. By pushing the scaling factors, which are introduced to scale the outputs of specific structures, towards zero, our method can remove the structures corresponding to zero scaling factors. To solve this challenging optimization problem and adapt it to deep learning models, we modified the Accelerated Proximal Gradient method. In our experiments, we demonstrate very promising pruning results on VGG, ResNet and ResNeXt: we can adaptively adjust the depth and width of these CNNs based on the budget at hand and the difficulty of each task. We believe these results can further inspire the design of more compact CNNs.

In future work, we plan to apply our method to more applications, such as object detection. It would also be interesting to investigate more advanced sparse regularizers, such as non-convex relaxations, and to adjust the penalty adaptively based on the complexity of different structures.

References

  • [1] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In NIPS, 2016.
  • [2] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. In ICLR, 2017.
  • [3] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu. Advances in optimizing recurrent networks. In ICASSP, 2013.
  • [4] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In NIPS Workshop, 2015.
  • [5] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. In NIPS, 2016.
  • [6] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
  • [7] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient dnns. In NIPS, 2016.
  • [8] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
  • [9] B. Hassibi, D. G. Stork, et al. Second order derivatives for network pruning: Optimal brain surgeon. In NIPS, 1993.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
  • [12] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.
  • [13] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Workshop, 2014.
  • [14] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. In CVPR, 2017.
  • [15] H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv:1607.03250, 2016.
  • [16] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
  • [17] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv:1602.07360, 2016.
  • [18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.
  • [19] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. BMVC, 2014.
  • [20] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Tech Report, 2009.
  • [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [22] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel. Optimal brain damage. In NIPS, 1990.
  • [23] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient ConvNets. In ICLR, 2017.
  • [24] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In CVPR, 2015.
  • [25] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017.
  • [26] J.-H. Luo, J. Wu, and W. Lin. ThiNet: A filter level pruning method for deep neural network compression. In ICCV, 2017.
  • [27] Z. Mariet and S. Sra. Diversity networks. In ICLR, 2016.
  • [28] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning convolutional neural networks for resource efficient inference. In ICLR, 2017.
  • [29] N. Parikh, S. Boyd, et al. Proximal algorithms. Foundations and Trends® in Optimization, 1(3):127–239, 2014.
  • [30] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.
  • [31] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, Q. Le, and A. Kurakin. Large-scale evolution of image classifiers. In ICML, 2017.
  • [32] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. In ICLR, 2015.
  • [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [34] S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
  • [35] F. Shen, R. Gan, and G. Zeng. Weighted residuals for very deep networks. In ICSAI, 2016.
  • [36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [37] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. In ICML, 2015.
  • [38] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
  • [39] A. Veit, M. J. Wilber, and S. Belongie. Residual networks behave like ensembles of relatively shallow networks. In NIPS, 2016.
  • [40] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
  • [41] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantized convolutional neural networks for mobile devices. In CVPR, 2016.
  • [42] L. Xie and A. Yuille. Genetic CNN. In ICCV, 2017.
  • [43] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
  • [44] X. Zhang, J. Zou, X. Ming, K. He, and J. Sun. Efficient and accurate approximations of nonlinear convolutional networks. In CVPR, 2015.
  • [45] H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards compact cnns. In ECCV, 2016.
  • [46] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.