1 Introduction
Deep learning methods, especially convolutional neural networks (CNNs), have achieved remarkable performance in many fields, such as computer vision, natural language processing and speech recognition. However, this extraordinary performance comes at the expense of high computational and storage demands. Although the power of modern GPUs has skyrocketed in recent years, these high costs are still prohibitive for deploying CNNs in latency-critical applications such as self-driving cars and augmented reality.
Recently, a significant amount of work on accelerating CNNs at inference time has been proposed. Methods that focus on accelerating pretrained models include direct pruning [8, 23, 28, 12, 26], low-rank decomposition [6, 19, 44], and quantization [30, 5, 41]. Another stream of research trains small and efficient networks directly, via knowledge distillation [13, 32, 34], novel architecture designs [17, 14] and sparse learning [24, 45, 1, 40]. In sparse learning, prior work [24] pursued the sparsity of weights. However, non-structured sparsity only produces random connectivity and can hardly utilize current off-the-shelf hardware such as GPUs to accelerate model inference in wall-clock time. To address this problem, recent methods [45, 1, 40] proposed to apply group sparsity to retain a hardware-friendly CNN structure.
In this paper, we take another view to jointly learn and prune a CNN. First, we introduce a new type of parameter – scaling factors – which scale the outputs of specific structures (e.g., neurons, groups or blocks) in CNNs. These scaling factors endow CNNs with more flexibility at the cost of very few parameters. Then, we add sparsity regularization on these scaling factors to push them to zero during training. Finally, we can safely remove the structures corresponding to zero scaling factors and obtain a pruned model. Compared with direct pruning methods, this method is data-driven and fully end-to-end. In other words, the network can select its unique configuration based on the difficulty and needs of each task. Moreover, the model selection is accomplished jointly with the normal training of CNNs. We do not require extra fine-tuning or multi-stage optimization, and the method introduces only minor additional cost during training.
To summarize, our contributions are threefold:

We propose a unified framework for model training and pruning in CNNs. Particularly, we formulate it as a joint sparse regularized optimization problem by introducing scaling factors and corresponding sparse regularizations on certain structures of CNNs.

We utilize a modified stochastic Accelerated Proximal Gradient (APG) method to jointly optimize the weights of CNNs and the scaling factors with sparsity regularization. Compared with previous methods that force sparsity in heuristic ways, our method enjoys more stable convergence and better results without fine-tuning or multi-stage optimization.

We test our proposed method on several state-of-the-art networks, VGG, ResNet and ResNeXt, to prune neurons, residual blocks and groups, respectively. We can adaptively adjust the depth and width accordingly. We show very promising acceleration performance on CIFAR and the large-scale ILSVRC 2012 image classification datasets.
2 Related Works
Network pruning was pioneered in the early development of neural networks. In Optimal Brain Damage [22] and Optimal Brain Surgeon [9], unimportant connections are removed based on the Hessian matrix derived from the loss function. Recently, Han et al. [8] brought back this idea by pruning the weights whose absolute values are smaller than a given threshold. This approach requires iterative pruning and fine-tuning, which is very time-consuming. To tackle this problem, Guo et al. [7] proposed dynamic network surgery to prune parameters during training. However, the irregular nature of the resulting sparse weights means they yield effective compression but not faster inference in terms of wall-clock time. To tackle this issue, several works prune neurons directly [15, 23, 28] by evaluating neuron importance with specific criteria. These methods all focus on removing the neurons whose removal affects the final prediction least. On the other hand, the diversity of the neurons to be kept is also an important factor to consider [27]. More recently, [26] and [12] formulated pruning as an optimization problem: they first select the most representative neurons and then minimize the reconstruction error to recover the accuracy of the pruned network. While neuron-level pruning can achieve practical acceleration with moderate accuracy loss, it is still hard to implement in an end-to-end manner without iterative pruning and retraining. Very recently, Liu et al. [25] used a technique similar to ours to prune neurons. They sparsify the scaling parameters of batch normalization (BN)
[18] to select channels. As discussed later, their work can be seen as a special case of our framework.

Model structure learning for deep learning models has attracted increasing attention recently. Several methods have been explored to learn CNN architectures without hand-crafted design [2, 46, 31]. One stream explores the design space by reinforcement learning [2, 46] or evolutionary algorithms [31, 42]. Another stream utilizes sparse learning: [45, 1] added group sparsity regularization on the weights of neurons and sparsified them in the training stage. Lately, Wen et al. [40] proposed a more general approach, which applies group sparsity to multiple structures of networks, including filter shapes, channels and layers in skip connections.

CNNs with skip connections have become the mainstream of modern network design, since skip connections mitigate the gradient vanishing/exploding issue in ultra-deep networks [37, 10]. Among these works, ResNet and its variants [11, 43] have attracted the most attention because of their simple design principle and state-of-the-art performance. Recently, Veit et al. [39] interpreted ResNet as an exponential ensemble of many shallow networks. They found that removing a single residual block has only a minor impact on performance; however, deleting more and more residual blocks impairs accuracy significantly. Accelerating this state-of-the-art network architecture therefore remains a challenging problem. In this paper, we propose a data-driven method to learn the architecture of such networks. By scaling and pruning residual blocks during training, our method can produce a more compact ResNet with faster inference speed and even better performance.
3 Proposed Method
Notations. Consider the weights of a convolutional layer $l$ in an $L$-layer CNN as a 4-dimensional tensor $\mathbf{W}^{l} \in \mathbb{R}^{N_l \times M_l \times H_l \times W_l}$, where $N_l$ is the number of output channels, $M_l$ is the number of input channels, and $H_l$ and $W_l$ are the height and width of a 2-dimensional kernel. We use $\mathbf{W}^{l}_{k}$ to denote the weights of the $k$-th neuron in layer $l$. The scaling factors are represented as a 1-dimensional vector $\boldsymbol{\lambda} \in \mathbb{R}^{s}$, where $s$ is the number of structures we consider pruning; $\lambda_i$ refers to the $i$-th value of $\boldsymbol{\lambda}$. The soft-threshold operator is denoted as $\mathcal{S}_{\alpha}(\mathbf{z})_i = \mathrm{sign}(z_i)\,(|z_i| - \alpha)_{+}$.

3.1 Sparse Structure Selection
Given a training set consisting of $N$ sample-label pairs $\{\mathbf{x}_i, y_i\}_{i=1}^{N}$, an $L$-layer CNN can be represented as a function $\mathcal{C}(\mathbf{x}_i, \mathbf{W})$, where $\mathbf{W} = \{\mathbf{W}^{l}\}_{l=1}^{L}$ is the collection of all weights in the CNN. $\mathbf{W}$ is learned by solving an optimization problem of the form:
\[
\min_{\mathbf{W}} \; \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(y_i, \mathcal{C}(\mathbf{x}_i, \mathbf{W})\big) + \mathcal{R}(\mathbf{W}) \tag{1}
\]
where $\mathcal{L}\big(y_i, \mathcal{C}(\mathbf{x}_i, \mathbf{W})\big)$ is the loss on sample $\mathbf{x}_i$, and $\mathcal{R}(\cdot)$ is a non-structured regularization applied to every weight, e.g. the $\ell_2$-norm used as weight decay.
Prior sparsity-based model structure learning works [45, 1] tried to learn the number of neurons in a CNN. To achieve this goal, they added group sparsity regularization on $\mathbf{W}^{l}_{k}$ in Eqn. 1, and enforced entire $\mathbf{W}^{l}_{k}$ to zero during training. Another concurrent work by Wen et al. [40] adopted a similar method but on multiple different structures. These ideas are straightforward, but the implementations are non-trivial. First, the optimization is difficult since there are several constraints on the weights simultaneously, including weight decay and group sparsity; improper optimization techniques may result in slow convergence and inferior results. Consequently, there has been no successful attempt to directly apply these methods to large-scale applications with complicated modern network architectures.
In this paper, we address the structure learning problem in a simpler and more effective way. Different from directly pushing the weights in the same group to zero, we enforce the output of the group to zero. To achieve this goal, we introduce a new type of parameter – scaling factors $\boldsymbol{\lambda}$ – to scale the outputs of some specific structures (neurons, groups or blocks), and add a sparsity constraint on $\boldsymbol{\lambda}$ during training. Our goal is to obtain a sparse $\boldsymbol{\lambda}$: if $\lambda_i = 0$, we can safely remove the corresponding structure, since its outputs contribute nothing to subsequent computation. Fig. 1 illustrates our framework.
Formally, the objective function of our proposed method can be formulated as:
\[
\min_{\mathbf{W}, \boldsymbol{\lambda}} \; \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(y_i, \mathcal{C}(\mathbf{x}_i, \mathbf{W}, \boldsymbol{\lambda})\big) + \mathcal{R}(\mathbf{W}) + \mathcal{R}_s(\boldsymbol{\lambda}) \tag{2}
\]
where $\mathcal{R}_s(\cdot)$ is a sparsity regularization for $\boldsymbol{\lambda}$ with weight $\gamma$. In this work, we consider its most commonly used convex relaxation, the $\ell_1$-norm, defined as $\gamma \|\boldsymbol{\lambda}\|_1$.
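As a concrete reference, the regularizer is just a scaled $\ell_1$-norm; a minimal sketch (names such as `lam` are illustrative, not the paper's implementation):

```python
import numpy as np

def l1_penalty(lam, gamma):
    # R_s(lam) = gamma * ||lam||_1, the convex sparsity regularizer above
    return gamma * np.abs(lam).sum()
```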
For $\mathbf{W}$, we can update it by Stochastic Gradient Descent (SGD) with momentum or its variants. For $\boldsymbol{\lambda}$, we adopt the Accelerated Proximal Gradient (APG) method [29]. For better illustration, we shorten $\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}\big(y_i, \mathcal{C}(\mathbf{x}_i, \mathbf{W}, \boldsymbol{\lambda})\big)$ as $\mathcal{G}(\boldsymbol{\lambda})$, and reformulate the optimization of $\boldsymbol{\lambda}$ as:
\[
\min_{\boldsymbol{\lambda}} \; \mathcal{G}(\boldsymbol{\lambda}) + \mathcal{R}_s(\boldsymbol{\lambda}) \tag{3}
\]
Then we can update $\boldsymbol{\lambda}$ by APG:
\[
\mathbf{d}_{(t)} = \boldsymbol{\lambda}_{(t-1)} + \frac{t-2}{t+1}\big(\boldsymbol{\lambda}_{(t-1)} - \boldsymbol{\lambda}_{(t-2)}\big) \tag{4}
\]
\[
\mathbf{z}_{(t)} = \mathbf{d}_{(t)} - \eta_{(t)}\,\nabla\mathcal{G}(\mathbf{d}_{(t)}) \tag{5}
\]
\[
\boldsymbol{\lambda}_{(t)} = \mathrm{prox}_{\eta_{(t)}\mathcal{R}_s}(\mathbf{z}_{(t)}) \tag{6}
\]
where $\eta_{(t)}$ is the gradient step size at iteration $t$, and $\mathrm{prox}_{\eta\mathcal{R}_s}(\cdot) = \mathcal{S}_{\eta\gamma}(\cdot)$ since $\mathcal{R}_s(\boldsymbol{\lambda}) = \gamma\|\boldsymbol{\lambda}\|_1$. However, this formulation is not friendly for deep learning: in addition to the pass for updating $\mathbf{W}$, we need to obtain $\nabla\mathcal{G}(\mathbf{d}_{(t)})$ by an extra forward-backward computation, which is computationally expensive for deep neural networks. Thus, following the derivation in [38], we reformulate APG as a momentum-based method:
\[
\mathbf{z}_{(t)} = \boldsymbol{\lambda}_{(t-1)} + \mu_{(t-1)}\mathbf{v}_{(t-1)} - \eta_{(t)}\,\nabla\mathcal{G}\big(\boldsymbol{\lambda}_{(t-1)} + \mu_{(t-1)}\mathbf{v}_{(t-1)}\big) \tag{7}
\]
\[
\mathbf{v}_{(t)} = \mathcal{S}_{\eta_{(t)}\gamma}(\mathbf{z}_{(t)}) - \boldsymbol{\lambda}_{(t-1)} \tag{8}
\]
\[
\boldsymbol{\lambda}_{(t)} = \boldsymbol{\lambda}_{(t-1)} + \mathbf{v}_{(t)} \tag{9}
\]
where we define $\mathbf{v}_{(t-1)} = \boldsymbol{\lambda}_{(t-1)} - \boldsymbol{\lambda}_{(t-2)}$ and $\mu_{(t-1)} = \frac{t-2}{t+1}$. This formulation is similar to the modified Nesterov Accelerated Gradient (NAG) in [38] except for the update of $\mathbf{v}_{(t)}$. Furthermore, we simplify the update of $\boldsymbol{\lambda}_{(t)}$ by replacing $\boldsymbol{\lambda}_{(t-1)}$ with $\boldsymbol{\lambda}'_{(t-1)} = \boldsymbol{\lambda}_{(t-1)} + \mu_{(t-1)}\mathbf{v}_{(t-1)}$, following the modification of NAG in [3], which has been widely used in practical deep learning frameworks [4]. Our new parameter updates become:
\[
\mathbf{z}_{(t)} = \boldsymbol{\lambda}'_{(t-1)} - \eta_{(t)}\,\nabla\mathcal{G}(\boldsymbol{\lambda}'_{(t-1)}) \tag{10}
\]
\[
\mathbf{v}_{(t)} = \mathcal{S}_{\eta_{(t)}\gamma}(\mathbf{z}_{(t)}) - \boldsymbol{\lambda}'_{(t-1)} + \mu_{(t-1)}\mathbf{v}_{(t-1)} \tag{11}
\]
\[
\boldsymbol{\lambda}'_{(t)} = \mathcal{S}_{\eta_{(t)}\gamma}(\mathbf{z}_{(t)}) + \mu_{(t)}\mathbf{v}_{(t)} \tag{12}
\]
In practice, we follow a stochastic approach with mini-batches and fix the momentum $\mu$ to a constant value. Both $\mathbf{W}$ and $\boldsymbol{\lambda}$ are updated in each iteration.
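The simplified update can be sketched in NumPy as follows, assuming the equation forms above, a fixed momentum, and a stochastic gradient `grad` of the data loss with respect to the scaling factors (all names here are illustrative, not the paper's MXNet implementation; `lam` plays the role of $\boldsymbol{\lambda}'$):

```python
import numpy as np

def soft_threshold(z, alpha):
    # S_alpha(z)_i = sign(z_i) * max(|z_i| - alpha, 0)
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def apg_step(lam, v, grad, lr, gamma, momentum=0.9):
    """One simplified APG update for the scaling factors (Eqns. 10-12)."""
    z = lam - lr * grad                                          # Eqn. 10
    v_new = soft_threshold(z, lr * gamma) - lam + momentum * v   # Eqn. 11
    lam_new = soft_threshold(z, lr * gamma) + momentum * v_new   # Eqn. 12
    return lam_new, v_new
```

With `gamma = 0` this degenerates to a NAG-style momentum step; a positive `gamma` makes the soft-threshold clamp small scaling factors to exactly zero, which is what allows whole structures to be pruned.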
In our framework, we add scaling factors to three different CNN micro-structures, neurons, groups and blocks, to yield flexible structure selection. We introduce these three cases below. Note that for networks with BN, we add the scaling factors after BN to avoid the influence of the bias parameters.
3.2 Neuron Selection
We introduce scaling factors for the outputs of channels to prune neurons. After training, removing the filters with zero scaling factors results in a more compact network. A recent work by Liu et al. [25] adopted a similar idea for network slimming. They absorb the scaling parameters into the parameters of batch normalization and solve the optimization by subgradient descent; during training, scaling parameters whose absolute values are lower than a threshold are set to 0. Compared with [25], our method is more general and effective. First, introducing scaling factors is more universal than reusing BN parameters: on one hand, some networks, such as AlexNet [21] and VGG [36], have no batch normalization layers; on the other hand, when fine-tuning pretrained models on object detection or semantic segmentation tasks, the batch normalization parameters are usually fixed due to small batch sizes. Second, the optimization of [25] is heuristic and needs iterative pruning and retraining, whereas our optimization is more stable and fully end-to-end. Above all, [25] can be seen as a special case of our method.
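The channel-selection mechanics can be sketched as follows (illustrative NumPy, not the actual implementation): a per-channel factor multiplies each output channel during training, and filters whose factor has reached zero are dropped afterwards.

```python
import numpy as np

def scale_channels(feature_map, lam):
    # feature_map: (N, C, H, W); lam: (C,) per-channel scaling factors
    return feature_map * lam.reshape(1, -1, 1, 1)

def prune_filters(weights, lam):
    # weights: (C_out, C_in, kH, kW); keep only filters with non-zero factor
    keep = np.flatnonzero(lam)
    return weights[keep], lam[keep]
```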
3.3 Block Selection
The structure of skip-connection CNNs allows us to skip the computation of specific layers without cutting off the information flow in the network. By stacking residual blocks, ResNet [10, 11] can easily exploit the advantages of very deep networks. Formally, a residual block with identity mapping can be formulated as:
\[
\mathbf{r}^{i} = \mathbf{r}^{i-1} + \mathcal{F}^{i}(\mathbf{r}^{i-1}, \mathbf{W}^{i}) \tag{13}
\]
where $\mathbf{r}^{i-1}$ and $\mathbf{r}^{i}$ are the input and output of the $i$-th block, $\mathcal{F}^{i}$ is a residual function and $\mathbf{W}^{i}$ are the parameters of the block.
To prune blocks, we add a scaling factor after each residual block. In our framework, Eqn. 13 becomes:
\[
\mathbf{r}^{i} = \mathbf{r}^{i-1} + \lambda^{i}\,\mathcal{F}^{i}(\mathbf{r}^{i-1}, \mathbf{W}^{i}) \tag{14}
\]
As shown in Fig. 1, after optimization we obtain a sparse $\boldsymbol{\lambda}$. The residual blocks with scaling factor 0 are pruned entirely, so we can learn a much shallower ResNet. A prior work that also adds scaling factors to the residuals in ResNet is Weighted Residual Networks [35]. Though sharing many similarities, the motivations behind the two works are different: their work focuses on training ultra-deep ResNets (increasing depth from 100+ to 1000+ layers) to get better results with the help of scaling factors, while our method aims to decrease the depth of ResNet by using scaling factors and sparse regularization to sparsify the outputs of residual blocks.
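The scaled residual block of Eqn. 14 can be sketched as follows (illustrative; `residual_fn` stands for the block's residual function $\mathcal{F}^{i}$):

```python
def scaled_residual_block(x, residual_fn, lam):
    # Eqn. 14: r_i = r_{i-1} + lam_i * F_i(r_{i-1});
    # lam == 0 means the whole block can be removed at inference time
    if lam == 0.0:
        return x
    return x + lam * residual_fn(x)
```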
3.4 Group Selection
Recently, Xie et al. introduced a new dimension – cardinality – into ResNets and proposed ResNeXt [43]. Formally, they presented aggregated transformations as:
\[
\mathcal{F}(\mathbf{x}) = \sum_{i=1}^{C} \mathcal{T}_{i}(\mathbf{x}) \tag{15}
\]
where $\mathcal{T}_{i}(\mathbf{x})$ is a transformation with its own parameters, and $C$ is the cardinality of the set of transformations to be aggregated. In practice, they use grouped convolution to ease the implementation of aggregated transformations. In our framework, we regard $C$ as the number of groups, and formulate a weighted transformation as:
\[
\mathcal{F}(\mathbf{x}) = \sum_{i=1}^{C} \lambda_{i}\,\mathcal{T}_{i}(\mathbf{x}) \tag{16}
\]
After training, several basic cardinalities are chosen by a sparse $\boldsymbol{\lambda}$ to form the final transformation, and the inactive groups with zero scaling factors can be safely removed, as shown in Fig. 1. Note that neuron pruning can also be seen as a special case of group pruning in which each group contains only one neuron. Furthermore, block pruning and group pruning can be combined to learn even more flexible network structures.
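The weighted aggregated transformation of Eqn. 16 can be sketched as follows (illustrative; each element of `transforms` stands for one group's transformation $\mathcal{T}_i$):

```python
def scaled_aggregated_transform(x, transforms, lam):
    # Eqn. 16: F(x) = sum_i lam_i * T_i(x); zero-factor groups drop out
    # of the sum and can be pruned from the network
    return sum(l * t(x) for l, t in zip(lam, transforms) if l != 0.0)
```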
4 Experiments
In this section, we evaluate the effectiveness of our method on three standard datasets: CIFAR-10, CIFAR-100 [20] and ImageNet LSVRC 2012 [33]. For neuron pruning, we adopt VGG16 [36], a classical plain network, to validate our method. For blocks and groups, we use two state-of-the-art networks, ResNet [11] and ResNeXt [43], respectively.

For optimization, we adopt NAG [38, 3] and our modified APG to update the weights $\mathbf{W}$ and the scaling factors $\boldsymbol{\lambda}$, respectively. We set the weight decay of $\mathbf{W}$ to 0.0001 and fix the momentum to 0.9 for both $\mathbf{W}$ and $\boldsymbol{\lambda}$. The weights are initialized as in [10] and all scaling factors are initialized to 1. All experiments are conducted in MXNet [4]. The code will be made publicly available if the paper is accepted.
4.1 CIFAR
We start with the CIFAR datasets to evaluate our method. CIFAR-10 consists of 50K training and 10K testing RGB images in 10 classes. CIFAR-100 is similar to CIFAR-10, except that it has 100 classes. As suggested in [10], the input image is a 32×32 crop randomly sampled from a zero-padded 40×40 image or its horizontal flip. The models in our experiments are trained with a mini-batch size of 64 on a single GPU. We start from a learning rate of 0.1 and train the models for 240 epochs. The learning rate is divided by 10 at the 120th, 160th and 200th epoch.
VGG: The baseline network is a modified VGG16 with BN [18] (without BN, this network performs much worse on CIFAR-100). We remove fc6 and fc7 and use only one fully-connected layer for classification. We add scaling factors after every batch normalization layer. Fig. 2 shows the results of our method. Both parameters and floating-point operations (FLOPs, counted as multiply-adds) are reported. Our method saves about 30% of parameters and 30%–50% of computational cost with minor loss of performance.
ResNet: To learn the number of residual blocks, we use ResNet-20 and ResNet-164 [11] as our baseline networks. ResNet-20 consists of 9 residual blocks, each with 2 convolutional layers, while ResNet-164 has 54 blocks with a bottleneck structure in each block. Fig. 3 summarizes our results. Our SSS achieves better performance than the baseline models with similar parameters and FLOPs. For ResNet-164, our SSS yields a 2.5x speedup with only a small performance loss on both CIFAR-10 and CIFAR-100. After optimization, we found that blocks in the early stages are pruned first. This coincides with the common design principle that a network should spend more of its budget in its later stages, since more diverse and complicated patterns may emerge as the receptive field increases.
Additionally, we tried to evaluate the performance of SSL proposed in [40] by adding group sparsity on all the weights in a residual block. However, we found that the optimization of the network could not converge. We conjecture that because the group we define (one residual block) contains far more parameters than the groups in the original paper (a single layer), the normal SGD with heuristic thresholding adopted in [40] is unable to solve this optimization problem.
ResNeXt: We also test our method on ResNeXt [43]. We choose ResNeXt-20 and ResNeXt-164 as our base networks. Both networks have bottleneck structures with 32 groups in each residual block. For ResNeXt-20, we focus on group pruning since it has only 6 residual blocks. For ResNeXt-164, we add sparsity on both groups and blocks. Fig. 4 shows our experimental results. Both group pruning and block pruning show a good trade-off between parameters and performance, especially on ResNeXt-164. The combination of group and block pruning is extremely effective on CIFAR-10: our SSS saves a large fraction of FLOPs while achieving higher accuracy. In ResNeXt-20, groups in the first and second blocks are pruned first. Similarly, in ResNeXt-164, groups in the shallow residual blocks are pruned the most.
4.2 ImageNet LSVRC 2012
To further demonstrate the effectiveness of our method on large-scale CNNs, we conduct more experiments on the ImageNet LSVRC 2012 classification task with VGG16 [36], ResNet-50 [11] and ResNeXt-50 (32×4d) [43]. We perform data augmentation based on the publicly available implementation of "fb.resnet" (https://github.com/facebook/fb.resnet.torch). The mini-batch size is 128 on 4 GPUs for VGG16 and ResNet-50, and 256 on 8 GPUs for ResNeXt-50. The optimization and initialization are similar to those in the CIFAR experiments. We train the models for 100 epochs. The learning rate is set to an initial value of 0.1 and then divided by 10 at the 30th, 60th and 90th epoch. All results on the ImageNet dataset are summarized in Table 2.
Table 1: Structures of ResNet-50 and the pruned ResNet-26, ResNet-32 and ResNet-41.

stage  output  ResNet-50 / ResNet-26 / ResNet-32 / ResNet-41
conv1  112×112  7×7, 64, stride 2
conv2  56×56  3×3 max pool, stride 2, followed by bottleneck residual blocks
conv3  28×28  bottleneck residual blocks
conv4  14×14  bottleneck residual blocks
conv5  7×7  bottleneck residual blocks
  1×1  global average pool, 1000-d FC, softmax
Table 2: Results on the ImageNet LSVRC 2012 validation set.

Model  Top-1  Top-5  #Parameters  #FLOPs
VGG16  27.54  9.16  138.3M  30.97B
VGG16 (pruned)  31.47  11.80  130.5M  7.667B
ResNet-50  23.88  7.14  25.5M  4.089B
ResNet-41  24.56  7.39  25.3M  3.473B
ResNet-32  25.82  8.09  18.6M  2.818B
ResNet-26  28.18  9.21  15.6M  2.329B
ResNeXt-50  22.43  6.32  25.0M  4.230B
ResNeXt-41  24.07  7.00  12.4M  3.234B
ResNeXt-38  25.02  7.50  10.7M  2.431B
ResNeXt-35-A  25.43  7.83  10.0M  2.068B
ResNeXt-35-B  26.83  8.42  8.50M  1.549B
VGG16: In our VGG16 pruning experiments, we found that pruning all convolutional layers gave unpromising results. This is because in VGG16 the computational cost in terms of FLOPs is not equally distributed across layers: the conv5 layers account for 2.77 billion FLOPs in total, which is only 9% of the whole network (30.97 billion). Thus, we consider that the sparse penalty should be adjusted by the computational cost of different layers. A similar idea has been adopted in [28] and [12]: [28] introduces FLOPs regularization into the pruning criteria, while He et al. [12] do not prune the conv5 layers in their VGG16 experiments. Following [12], we set the sparse penalty of conv5 to 0 and only prune conv1 to conv4. The results can be found in Table 2, and Table 3 shows the detailed structure of our pruned VGG16. The pruned model saves about 75% of FLOPs, while the parameter saving is negligible. This is because the fully-connected layers contain a large number of parameters (123 million in the original VGG16), and we do not prune them for fair comparison with other methods.
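The conv5 figures quoted above can be checked with a small FLOPs helper (a sketch; each multiply-add is counted as two operations, the convention under which the VGG16 total comes out at about 30.97 billion, and the conv5 shapes are the standard VGG16 configuration):

```python
def conv_flops(c_in, c_out, k, out_h, out_w):
    # one multiply and one add per multiply-accumulate
    return 2 * c_in * c_out * k * k * out_h * out_w

# the three 3x3, 512-channel conv5 convolutions of VGG16 act on 14x14 maps
conv5_flops = 3 * conv_flops(512, 512, 3, 14, 14)   # roughly 2.77 billion
conv5_share = conv5_flops / 30.97e9                 # roughly 9% of the network
```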
ResNet-50: For ResNet-50, we experiment with three different settings of $\gamma$ to explore the performance of our method in block pruning. For simplicity, we denote the trained models as ResNet-26, ResNet-32 and ResNet-41 according to their depths. Their structures are shown in Table 1. All the pruned models suffer some accuracy loss. Compared with the original ResNet-50, ResNet-41 provides a 15% FLOPs reduction with a 0.7% top-1 accuracy loss, while ResNet-32 saves 31% of FLOPs with about 1.9% top-1 loss. Fig. 5 shows the top-1 validation errors of our SSS models and ResNets as a function of the number of parameters and FLOPs. The results reveal that our pruned models perform on par with the original hand-crafted ResNets, whilst requiring fewer parameters and less computational cost. For example, compared with ResNet-34 [11], both our ResNet-41 and ResNet-32 yield better performance with fewer FLOPs.
ResNeXt-50: For ResNeXt-50, we add sparsity constraints on both residual blocks and groups, which results in several pruned models. Table 2 summarizes their performance. The learned ResNeXt-41 yields 24.07% top-1 error on the ILSVRC validation set. It achieves results similar to the original ResNet-50, but with half the parameters and more than 20% fewer FLOPs. In ResNeXt-41, three residual blocks in the "conv5" stage are pruned entirely. This pruning result somewhat contradicts the common design of CNNs and is worth studying in depth in the future.
Table 4: Comparison with other pruning methods on the ImageNet LSVRC 2012 dataset.

Model  Top-1  Top-5  #FLOPs
ResNet-34-pruned [23]  27.44  –  3.080B
ResNet-50-pruned-A [23] (our impl.)  27.12  8.95  3.070B
ResNet-50-pruned-B [23] (our impl.)  27.02  8.92  3.400B
ResNet-32 (ours)  25.82  8.09  2.818B
ResNet-50-pruned (2×) [12]  –  9.20  2.726B
ResNet-26 (ours)  28.18  9.21  2.329B
VGG16-pruned [28]  –  15.5  8.0B
VGG16-pruned (5×) [12]  32.20  11.90  7.033B
VGG16-pruned (ThiNet-Conv) [26]  30.20  10.47  9.580B
VGG16-pruned (ours)  31.47  11.80  7.667B
4.3 Comparison with other methods
We compare our SSS with other pruning methods, including filter pruning [23], channel pruning [12], ThiNet [26] and the method of [28]. Table 4 shows the pruning results on the ImageNet LSVRC 2012 dataset. To the best of our knowledge, only a few works report ResNet pruning results with FLOPs. Compared with the filter pruning results, our ResNet-32 performs best with the least FLOPs. As for channel pruning, our ResNet-26 yields a top-5 error similar to the pruned ResNet-50 provided by [12], but saves about 14.5% of FLOPs (we calculate the FLOPs of He's models from the provided network structures). We also show a comparison on VGG16. All the methods, including channel pruning, ThiNet and our SSS, achieve significant improvements over [28]. Our VGG16 pruning result is competitive with other state-of-the-art methods.
5 Conclusions
In this paper, we have proposed a data-driven method, Sparse Structure Selection (SSS), to adaptively learn the structure of CNNs. In our framework, the training and pruning of CNNs are formulated as a joint sparse regularized optimization problem. By pushing the scaling factors, which are introduced to scale the outputs of specific structures, towards zero, our method can remove the structures corresponding to zero scaling factors. To solve this challenging optimization problem and adapt it to deep learning models, we modified the Accelerated Proximal Gradient method. In our experiments, we demonstrate very promising pruning results on VGG, ResNet and ResNeXt: we can adaptively adjust the depth and width of these CNNs based on the budget at hand and the difficulty of each task. We believe these pruning results can further inspire the design of more compact CNNs.
In future work, we plan to apply our method to more applications such as object detection. It would also be interesting to investigate more advanced sparse regularizers such as non-convex relaxations, and to adjust the penalty adaptively based on the complexity of different structures.
References
 [1] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In NIPS, 2016.
 [2] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. In ICLR, 2017.
 [3] Y. Bengio, N. BoulangerLewandowski, and R. Pascanu. Advances in optimizing recurrent networks. In ICASSP, 2013.

 [4] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In NIPS Workshop, 2015.
 [5] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. In NIPS, 2016.
 [6] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
 [7] Y. Guo, A. Yao, and Y. Chen. Dynamic network surgery for efficient dnns. In NIPS, 2016.
 [8] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
 [9] B. Hassibi, D. G. Stork, et al. Second order derivatives for network pruning: Optimal brain surgeon. In NIPS, 1993.
 [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [11] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
 [12] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.
 [13] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Workshop, 2014.
 [14] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.
 [15] H. Hu, R. Peng, Y.W. Tai, and C.K. Tang. Network trimming: A datadriven neuron pruning approach towards efficient deep architectures. arXiv:1607.03250, 2016.
 [16] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
 [17] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv:1602.07360, 2016.
 [18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.
 [19] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. BMVC, 2014.
 [20] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Tech Report, 2009.
 [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
 [22] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel. Optimal brain damage. In NIPS, 1990.
 [23] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient ConvNets. In ICLR, 2017.
 [24] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In CVPR, 2015.
 [25] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017.
 [26] J.H. Luo, J. Wu, and W. Lin. ThiNet: A filter level pruning method for deep neural network compression. In ICCV, 2017.
 [27] Z. Mariet and S. Sra. Diversity networks. In ICLR, 2016.
 [28] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning convolutional neural networks for resource efficient inference. In ICLR, 2017.
 [29] N. Parikh, S. Boyd, et al. Proximal algorithms. Foundations and Trends® in Optimization, 1(3):127–239, 2014.
 [30] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNORNet: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.

 [31] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, Q. Le, and A. Kurakin. Large-scale evolution of image classifiers. In ICML, 2017.
 [32] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. In ICLR, 2015.
 [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [34] S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
 [35] F. Shen, R. Gan, and G. Zeng. Weighted residuals for very deep networks. In ICSAI, 2016.
 [36] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. In ICLR, 2015.
 [37] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. In ICML, 2015.
 [38] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
 [39] A. Veit, M. J. Wilber, and S. Belongie. Residual networks behave like ensembles of relatively shallow networks. In NIPS, 2016.
 [40] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In NIPS, 2016.
 [41] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantized convolutional neural networks for mobile devices. In CVPR, 2016.
 [42] L. Xie and A. Yuille. Genetic CNN. In ICCV, 2017.
 [43] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
 [44] X. Zhang, J. Zou, X. Ming, K. He, and J. Sun. Efficient and accurate approximations of nonlinear convolutional networks. In CVPR, 2015.
 [45] H. Zhou, J. M. Alvarez, and F. Porikli. Less is more: Towards compact cnns. In ECCV, 2016.
 [46] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.