1 Introduction
Deep convolutional neural networks (CNNs) have become wider and deeper to achieve high performance on different applications he2016deep ; huang2017densely ; zoph2017neural . Despite their great success, it is impractical to deploy them on resource-constrained devices, such as mobile phones and drones. A straightforward solution to this problem is network pruning lecun1990optimal ; han2016deep ; han2016learning ; he2017channel ; he2018soft , which reduces the computation cost of over-parameterized CNNs. A typical pipeline for network pruning, as shown in the top of Fig. 1, removes the redundant filters of the original network and then fine-tunes the slimmed network. Different criteria for the importance of the filters have been applied, such as the L2-norm of the filter li2017pruning , reconstruction error he2017channel , and learnable scaling factors liu2017learning . Lastly, researchers apply various fine-tuning strategies li2017pruning ; he2018soft to efficiently transfer the parameters of the unpruned network and maximize the performance of the pruned network.
Traditional network pruning approaches effectively compact networks while maintaining performance, but their network structures are designed by hand, e.g., pruning 30% of the filters in each layer li2017pruning ; he2018soft , predicting a sparsity ratio he2018amc , or leveraging regularization alvarez2016learning . The performance of the pruned network is therefore upper-bounded by the hand-crafted structures or the rules for generating them. To break this limitation, we apply Neural Architecture Search (NAS) to turn the design of the structure into a learning procedure and propose a new paradigm for network pruning, as illustrated in the bottom of Fig. 1.
Prevailing NAS methods liu2019darts ; zoph2017neural ; dong2019search ; cai2018proxylessnas ; real2019regularized optimize the network topology, whereas the network pruning literature focuses on the selection of channels. To satisfy this requirement and enable a fair comparison with previous pruning strategies, we propose a new NAS scheme termed Transformable Architecture Search (TAS). TAS searches for the best size of a network instead of its topology, regularized by a penalty on the computation cost, e.g., floating-point operations (FLOPs). The parameters of the searched/pruned networks are then learned by knowledge transfer hinton2014distilling ; yim2017gift ; zagoruyko2017paying .
TAS is a differentiable searching algorithm that searches for the width and depth of a network effectively and efficiently. Specifically, each candidate number of channels/layers is attached to a learnable probability. The probability distribution is learned by back-propagating the loss of the pruned network, whose feature map is an aggregation of K feature map fragments (outputs of networks of different sizes) sampled from the distribution. Feature maps of different channel sizes are aggregated with the help of channel-wise interpolation. The size with the maximum probability in each distribution then serves as the width and depth of the pruned network.
In experiments, we show that the searched architecture, with parameters transferred by knowledge distillation, outperforms previous state-of-the-art pruning methods on three benchmarks: CIFAR-10, CIFAR-100 and ImageNet. We also test different knowledge transfer approaches on architectures generated by traditional hand-crafted pruning li2017pruning ; he2018soft and by random architecture search liu2019darts . Consistent improvements on these different architectures demonstrate the generality of knowledge transfer.
2 Related Studies
Network pruning lecun1990optimal ; liu2019rethinking is an effective technique to compress and accelerate deep neural networks, allowing powerful networks he2016deep to be deployed on hardware with limited storage and computation resources. A variety of techniques have been proposed, such as low-rank decomposition zhang2016accelerating , weight pruning hassibi1993second ; lecun1990optimal ; han2016learning ; han2016deep , channel pruning he2018soft ; liu2019rethinking , dynamic computation figurnov2017spatially ; dong2017more and quantization hubara2017quantized ; alizadeh2019empirical . These algorithms fall into two modalities: unstructured pruning lecun1990optimal ; figurnov2017spatially ; dong2017more ; han2016deep and structured pruning li2017pruning ; he2017channel ; he2018soft ; liu2019rethinking .
Unstructured pruning methods lecun1990optimal ; figurnov2017spatially ; dong2017more ; han2016deep usually enforce sparsity on the convolutional weights lecun1990optimal ; hassibi1993second or the feature maps dong2017more ; figurnov2017spatially . The pioneers of unstructured pruning, LeCun et al. lecun1990optimal and Hassibi et al. hassibi1993second , investigated the use of second-derivative information to prune the weights of shallow networks. After deep networks emerged in 2012 krizhevsky2012imagenet , Han et al. han2016deep ; han2016learning ; han2016eie proposed a series of works to obtain highly compressed deep networks based on L2 regularization. Following this development, many researchers explored different regularization techniques to improve sparsity while preserving accuracy, such as L0 regularization louizos2018learning and output sensitivity tartaglione2018learning . Since unstructured pruning methods make a big network sparse instead of changing its whole structure, they need dedicated designs for weight dependencies han2016eie and specific hardware to speed up inference.
Structured pruning methods li2017pruning ; he2017channel ; he2018soft ; liu2019rethinking prune convolutional filters or whole layers, so the pruned networks can be easily deployed. Early works in this field alvarez2016learning ; wen2016learning leveraged a group Lasso to enable structured sparsity in deep networks. After that, Li et al. li2017pruning proposed the typical three-stage pruning paradigm (training a large network, pruning, re-training). Such pruning algorithms regard filters with small norms as unimportant and tend to prune them, but this assumption does not hold in deep nonlinear networks ye2018rethinking . Therefore, many researchers have focused on better criteria for identifying informative filters. For example, Liu et al. liu2017learning leveraged an L1 regularization; Ye et al. ye2018rethinking applied an ISTA penalty; and He et al. he2019pruning utilized a geometric-median-based criterion. In contrast to previous pruning pipelines, our approach allows the number of channels/layers to be explicitly optimized, so that the learned structure achieves high performance at low cost.
Besides the criteria for informative filters, the importance of the network structure itself was highlighted in liu2019rethinking . Some methods implicitly find a data-specific architecture wen2016learning ; alvarez2016learning ; he2018amc by automatically determining the pruning and compression ratio of each layer. In contrast, we explicitly discover the architecture using NAS. Most previous NAS algorithms zoph2017neural ; dong2019search ; liu2019darts ; real2019regularized automatically discover the topology of a neural network, while we focus on searching for its depth and width. Reinforcement-learning-based zoph2017neural and evolutionary-algorithm-based real2019regularized methods can search for networks with flexible width and depth, but they require huge computational resources and cannot be directly used on large-scale target datasets. Differentiable methods dong2019search ; liu2019darts ; cai2018proxylessnas dramatically decrease the computation cost, but they usually assume that the number of channels is the same across searching candidates. TAS is a differentiable NAS method that efficiently searches for transformable networks with flexible width and depth.
Knowledge transfer has proven effective in the pruning literature. The parameters of a pruned network can be transferred from a pre-trained initialization li2017pruning ; he2018soft . Minnehan et al. minnehan2019cascaded transferred the knowledge of the uncompressed network via a block-wise reconstruction loss. In this paper, we apply a simple KD algorithm hinton2014distilling to perform knowledge transfer, which achieves robust performance for the searched networks.
3 Methodology
Our pruning approach consists of three steps: (1) training the unpruned large network with a standard classification training procedure; (2) searching for the depth and width of a small network via the proposed TAS; (3) transferring the knowledge from the unpruned large network to the searched small network with a simple KD approach hinton2014distilling . We will introduce the background, present the details of TAS, and then explain the knowledge transfer procedure.
3.1 Transformable Architecture Search
Network channel pruning aims to reduce the number of channels in each layer of a network. Given an input image, a network takes it as input and produces a probability over each target class. Suppose $X$ and $Y$ are the input and output feature tensors of a convolutional layer (we take 3-by-3 convolution as an example); this layer computes:

$Y_j = \sum_{i=1}^{c_{in}} X_i \ast W_{i,j} \quad \text{for } j = 1, \dots, c_{out}, \qquad (1)$

where $W \in \mathbb{R}^{c_{out} \times c_{in} \times 3 \times 3}$ indicates the convolutional kernel weight, $c_{in}$ is the number of input channels, and $c_{out}$ is the number of output channels. $W_{i,j}$ corresponds to the $i$-th input channel and the $j$-th output channel, and $\ast$ denotes the convolutional operation. Channel pruning methods reduce $c_{out}$; consequently, $c_{in}$ of the next layer is also reduced.
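To make the effect of channel pruning on the computation cost concrete, the following sketch (ours, not from the paper) counts the multiply-accumulate operations of two stacked 3-by-3 convolutional layers before and after halving the first layer's output channels; the helper name `conv_flops` is our own.

```python
# Sketch (not from the paper): FLOPs of two stacked 3x3 conv layers,
# before and after pruning the first layer's output channels.

def conv_flops(c_in, c_out, h, w, k=3):
    """Multiply-accumulate count of a k x k convolution on an h x w map."""
    return c_in * c_out * k * k * h * w

h = w = 32
full = conv_flops(64, 128, h, w) + conv_flops(128, 128, h, w)

# Pruning the first layer to 64 output channels also shrinks the
# second layer's input channels from 128 to 64.
pruned = conv_flops(64, 64, h, w) + conv_flops(64, 128, h, w)

print(full, pruned, pruned / full)  # the pruned pair costs exactly half
```

Note how pruning one layer's output channels reduces the cost of both that layer and the next one, which is why channel pruning compounds across consecutive layers.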
Search for width. We use learnable parameters $\alpha \in \mathbb{R}^{|C|}$ to encode the distribution over the possible numbers of channels, where $C = \{c_1, \dots, c_{|C|}\}$ denotes the set of channel candidates in one layer. The probability of choosing the $j$-th candidate $c_j$ can be formulated as:

$p_j = \frac{\exp(\alpha_j)}{\sum_{k=1}^{|C|} \exp(\alpha_k)}. \qquad (2)$

However, sampling from this distribution is non-differentiable, which prevents us from back-propagating gradients through $p_j$ to $\alpha_j$. Motivated by dong2019search , we apply Gumbel-Softmax jang2017categorical ; maddison2017concrete to soften the sampling procedure and optimize $\alpha$:

$\hat{p}_j = \frac{\exp((\log p_j + g_j)/\tau)}{\sum_{k=1}^{|C|} \exp((\log p_k + g_k)/\tau)}, \quad g_j = -\log(-\log u_j), \ u_j \sim U(0, 1), \qquad (3)$

where $U(0,1)$ denotes the uniform distribution between 0 and 1 and $\tau$ is the softmax temperature. When $\tau \to 0$, $\hat{p}$ becomes one-hot, and the Gumbel-softmax distribution drawn from it becomes identical to the categorical distribution; when $\tau \to \infty$, the Gumbel-softmax distribution becomes a uniform distribution over $C$. The feature map in our method is defined as the weighted sum of feature map fragments of different sizes, with weights $\hat{p}$. Feature maps of different sizes are aligned by channel-wise interpolation (CWI) so that they can be summed. To reduce memory cost, we aggregate only a small subset $S$ of candidates instead of all of them, and the weights are re-normalized over the selected sizes:

$\hat{Y} = \sum_{j \in S} \frac{\hat{p}_j}{\sum_{k \in S} \hat{p}_k} \, \mathrm{CWI}(Y_{c_j}), \quad S \sim \mathrm{Multinomial}(\hat{p}), \qquad (4)$

where $\mathrm{Multinomial}(\hat{p})$ indicates sampling the index subset from the multinomial probability distribution parameterized by $\hat{p}$, and $Y_{c_j}$ is the output feature map computed with $c_j$ channels. The CWI can be implemented in many ways, such as a 3D variant of the spatial transformer network jaderberg2015spatial or the adaptive pooling operation he2015spatial . In this paper, we choose the 3D adaptive average pooling operation he2015spatial as the CWI, because it brings no extra parameters and negligible extra cost. We use Batch Normalization ioffe2015batch before the CWI to normalize the different fragments. The above procedure is illustrated in Fig. 2.

Search for depth. We use parameters $\beta \in \mathbb{R}^{L}$ to encode the distribution over the possible numbers of layers in a network with $L$ convolutional layers. We utilize the same strategy as Eq. (3) to sample the number of layers and make $\beta$ differentiable in the same way as $\alpha$, obtaining the sampling distribution $\hat{q}$ for depth. We then calculate the final output feature of the pruned network as an aggregation over all possible depths:

$\hat{O}_{out} = \sum_{l=1}^{L} \hat{q}_l \cdot \mathrm{CWI}(\hat{O}_l), \qquad (5)$

where $\hat{O}_l$ indicates the feature map computed via Eq. (4) at the $l$-th layer, aligned by the CWI to the maximum sampled number of channels among all layers. The final output feature map $\hat{O}_{out}$ is fed into the last classification layer to make predictions. In this way, we can back-propagate gradients to both the width parameters $\alpha$ and the depth parameters $\beta$.
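As a concrete illustration of Eqs. (3)-(4), the following pure-Python sketch (our own simplification, not the authors' implementation) draws Gumbel-softmax weights over three width candidates, aligns per-channel feature fragments with a 1D adaptive average pooling as a stand-in for the CWI (the paper uses 3D adaptive pooling on full feature tensors), and aggregates a re-normalized two-candidate subset; for brevity it takes the top-K indexes rather than sampling them.

```python
# Toy sketch of Eqs. (3)-(4): Gumbel-softmax width weights, channel
# alignment, and re-normalized subset aggregation. Not the authors' code.
import math
import random

def gumbel_softmax(alphas, tau=1.0):
    """Softened, differentiable sampling from softmax(alphas), Eq. (3)."""
    m = max(alphas)
    log_z = m + math.log(sum(math.exp(a - m) for a in alphas))
    log_p = [a - log_z for a in alphas]
    g = [-math.log(-math.log(random.random())) for _ in alphas]  # Gumbel noise
    scaled = [(lp + gi) / tau for lp, gi in zip(log_p, g)]
    s = max(scaled)
    z = sum(math.exp(v - s) for v in scaled)
    return [math.exp(v - s) / z for v in scaled]

def cwi(fragment, out_channels):
    """Align a per-channel feature vector to out_channels entries via
    adaptive average pooling (a 1D stand-in for the paper's 3D pooling)."""
    c = len(fragment)
    out = []
    for i in range(out_channels):
        lo = (i * c) // out_channels
        hi = -(-((i + 1) * c) // out_channels)  # ceil((i+1)*c/out_channels)
        seg = fragment[lo:hi]
        out.append(sum(seg) / len(seg))
    return out

random.seed(0)
alphas = [0.0, 0.0, 0.0]               # learnable logits for 3 width candidates
hat_p = gumbel_softmax(alphas, tau=1.0)

# Feature fragments produced with 2, 3, and 4 channels (toy values).
fragments = [[1.0, 2.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0]]

# Keep K=2 candidates (top-K here for brevity; the paper samples them)
# and re-normalize their weights as in Eq. (4).
K = 2
idx = sorted(range(len(hat_p)), key=lambda j: -hat_p[j])[:K]
norm = sum(hat_p[j] for j in idx)
agg = [0.0] * 4                        # aggregate at the largest width
for j in idx:
    aligned = cwi(fragments[j], 4)
    w = hat_p[j] / norm
    agg = [a + w * x for a, x in zip(agg, aligned)]
print(agg)                             # one aggregated 4-channel feature
```

Because the aggregation is a weighted sum of differentiable terms, the loss gradients flow back through `hat_p` into `alphas`, which is exactly what makes the width searchable.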
Searching objectives. The final architecture is derived by selecting the candidate with the maximum probability, as encoded by the architecture parameters $A$, which consist of $\alpha$ for each layer and $\beta$. The goal of our TAS is to find the architecture $A$ with the minimum validation loss $\mathcal{L}_{val}$ after the weights $\omega$ are trained by minimizing the training loss $\mathcal{L}_{train}$:

$\min_{A} \ \mathcal{L}_{val}(\omega^{*}(A), A) \quad \mathrm{s.t.} \quad \omega^{*}(A) = \arg\min_{\omega} \mathcal{L}_{train}(\omega, A), \qquad (6)$

where the training loss is the cross-entropy classification loss of the pruned network. Prevailing NAS methods liu2019darts ; zoph2017neural ; dong2019search ; cai2018proxylessnas ; real2019regularized optimize over network candidates with different topologies, while our TAS searches over candidates with the same topology but smaller width and depth. As a result, the validation loss in our search procedure includes not only the classification loss but also a penalty on the computation cost:

$\mathcal{L}_{val} = -\log \frac{\exp(z_y)}{\sum_{j} \exp(z_j)} + \lambda_{cost} \, \mathcal{L}_{cost}, \qquad (7)$

where $z$ is a vector denoting the output logits of the pruned network, $y$ indicates the ground-truth class of the corresponding input, and $\lambda_{cost}$ is the weight of $\mathcal{L}_{cost}$. The cost loss encourages the FLOPs of the network to converge to a target $R$, so that the computation cost of the network can be dynamically adjusted by setting different $R$. The piecewise computation cost loss is defined as:

$\mathcal{L}_{cost} = \begin{cases} \log(E_{cost}(A)) & \text{if } F_{cost}(A) > (1 + t) \times R \\ 0 & \text{if } (1 - t) \times R \le F_{cost}(A) \le (1 + t) \times R \\ -\log(E_{cost}(A)) & \text{if } F_{cost}(A) < (1 - t) \times R \end{cases} \qquad (8)$

where $E_{cost}(A)$ computes the expectation of the computation cost based on the architecture parameters $A$, and $F_{cost}(A)$ indicates the actual cost of the searched architecture derived from $A$. $t$ denotes a toleration ratio, which slows down the speed of changing the searched architecture. Note that we use FLOPs to evaluate the computation cost of a network, but FLOPs can readily be replaced by other metrics, such as latency cai2018proxylessnas .
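The piecewise cost loss described above can be sketched as follows (the function name and the scalar-cost interface are our assumptions; in the paper both the expected and the actual cost are derived from the architecture parameters):

```python
# Sketch of the piecewise cost loss of Eq. (8); ours, not the authors' code.
import math

def cost_loss(e_cost, f_cost, target, tol=0.05):
    """Drive the searched FLOPs toward [target*(1-tol), target*(1+tol)].

    e_cost: expected cost under the architecture distribution, E_cost(A).
    f_cost: actual cost of the currently derived architecture, F_cost(A).
    tol:    toleration ratio t (5% in the paper's experiments).
    """
    if f_cost > target * (1 + tol):
        return math.log(e_cost)       # too expensive: penalize the cost
    if f_cost < target * (1 - tol):
        return -math.log(e_cost)      # too cheap: reward growing
    return 0.0                        # inside the toleration band

# Example: target of 1.2e8 FLOPs, searched network at 1.5e8 FLOPs.
print(cost_loss(1.5e8, 1.5e8, 1.2e8) > 0)  # True: the penalty is active
```

The zero-valued middle branch is what makes the toleration ratio act as a brake: once the architecture lands inside the band, no cost gradient pushes it further.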
We show the overall algorithm in Alg. 1. During searching, we forward the network using Eq. (5) to make both the weights $\omega$ and the architecture parameters $A$ differentiable. We alternately minimize $\mathcal{L}_{train}$ on the training set to optimize the pruned network's weights $\omega$ and $\mathcal{L}_{val}$ on the validation set to optimize the architecture parameters $A$. After searching, we pick the number of channels with the maximum probability as the width and the number of layers with the maximum probability as the depth. The final searched network is constructed from the selected width and depth, and is then optimized via KD (introduced in the next section).
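The alternating scheme of Alg. 1 can be illustrated on a toy bi-level problem (entirely our own construction: scalar quadratic losses stand in for the real training and validation objectives):

```python
# Toy illustration (ours) of the alternating scheme in Alg. 1: gradient
# steps on the weight w against a "training" loss, alternated with gradient
# steps on the architecture parameter a against a "validation" loss.

def grad_train(w, a):      # d/dw of (w - a)^2: the weight fits the sampled net
    return 2.0 * (w - a)

def grad_val(w, a):        # d/da of (w*a - 1)^2: the architecture is judged
    return 2.0 * (w * a - 1.0) * w   # on held-out performance

w, a, lr = 0.5, 0.5, 0.1
for _ in range(1000):
    w -= lr * grad_train(w, a)   # minimize L_train w.r.t. weights
    a -= lr * grad_val(w, a)     # minimize L_val  w.r.t. architecture
print(round(w, 3), round(a, 3))  # both settle near the fixed point w = a = 1
```

The point of the toy is the control flow, not the losses: each loop iteration mirrors one search step in which the two sets of parameters are updated against different objectives on different data splits.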
3.2 Knowledge Transfer
Knowledge transfer is important for learning a robust pruned network, and we employ a simple KD algorithm hinton2014distilling on the searched network architecture. This algorithm encourages the predictions $z$ of the small network to match the soft targets of the unpruned network via the following objective:

$\mathcal{L}_{match} = -\sum_{j} \frac{\exp(\hat{z}_j / T)}{\sum_{k} \exp(\hat{z}_k / T)} \log \frac{\exp(z_j / T)}{\sum_{k} \exp(z_k / T)}, \qquad (9)$

where $T$ is a temperature and $\hat{z}$ indicates the logit output vector of the pre-trained unpruned network. Additionally, we use a softmax with cross-entropy loss to encourage the small network to predict the true targets. The final objective of KD is:

$\mathcal{L} = \lambda \, \mathcal{L}_{match} - (1 - \lambda) \log \frac{\exp(z_y)}{\sum_{j} \exp(z_j)}, \qquad (10)$

where $y$ indicates the true target class of the corresponding input and $\lambda$ is the weight balancing the standard classification loss and the soft matching loss. After we obtain the searched network (Sec. 3.1), we first pre-train the unpruned network, and then optimize the searched network by transferring knowledge from the unpruned network via Eq. (10).
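Under this formulation, the KD objective can be sketched in a few lines of pure Python (our own illustration; `kd_loss` and the toy logits are assumptions, not the authors' code):

```python
# Sketch of the KD objective of Eqs. (9)-(10): soft-target matching at
# temperature T plus the standard cross-entropy on the true label.
import math

def softmax(logits, T=1.0):
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, label, T=4.0, lam=0.9):
    soft_t = softmax(teacher_logits, T)      # teacher's soft targets
    soft_s = softmax(student_logits, T)
    match = -sum(t * math.log(s) for t, s in zip(soft_t, soft_s))  # Eq. (9)
    ce = -math.log(softmax(student_logits)[label])                 # hard CE
    return lam * match + (1 - lam) * ce                            # Eq. (10)

student = [2.0, 0.5, -1.0]   # pruned-network logits (toy values)
teacher = [3.0, 0.2, -2.0]   # unpruned-network logits (toy values)
print(kd_loss(student, teacher, label=0))
```

A higher temperature flattens the teacher's distribution, so the student is also supervised on the relative ordering of the wrong classes rather than only on the top prediction.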
4 Experimental Analysis
4.1 Datasets and Settings
Datasets. We evaluate our approach on CIFAR-10, CIFAR-100 krizhevsky2009learning , and ImageNet deng2009imagenet . CIFAR-10 contains 50K training images and 10K test images of 10 classes. CIFAR-100 is similar to CIFAR-10 but has 100 classes. ImageNet contains 1.28 million training images and 50K test images of 1000 classes.
The search setting. We search for the number of channels over {0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} of the original number in the unpruned network, and search for the depth within each convolutional stage. We sample a small subset of candidates in Eq. (4) to reduce the GPU memory cost during searching. We set the target cost $R$ according to the FLOPs of the compared pruning algorithms. We optimize the weights via SGD and the architecture parameters via Adam. For the weights, we start the learning rate from 0.1 and reduce it with the cosine scheduler loshchilov2017sgdr . For the architecture parameters, we use a constant learning rate of 0.001 and a weight decay of 0.001. On both CIFAR-10 and CIFAR-100, we train the model for 600 epochs with a batch size of 256. On ImageNet, we train ResNets he2016deep for 120 epochs and MobileNetV2 sandler2018mobilenetv2 for 150 epochs with a batch size of 256. The toleration ratio $t$ is always set to 5%. The temperature $\tau$ in Eq. (3) is linearly decayed from 10 to 0.1.


Training. For CIFAR experiments, we use SGD with a momentum of 0.9 and a weight decay of 0.0005. We train each model for 300 epochs, starting the learning rate at 0.1 and reducing it with the cosine scheduler loshchilov2017sgdr . We use a batch size of 256 and 2 GPUs. When using KD on CIFAR, we set $\lambda$ to 0.9 and the temperature $T$ to 4, following zagoruyko2017paying . For ResNet models on ImageNet, we follow most hyper-parameters as on CIFAR, but use a weight decay of 0.0001 and train for 120 epochs. For MobileNetV2 sandler2018mobilenetv2 , we use SGD with a momentum of 0.9 and a weight decay of 0.00004. We train each model for 150 epochs, starting the learning rate at 0.05 and reducing it with the cosine scheduler. We also set the dropout ratio of MobileNetV2 to 0. We find that this training strategy makes MobileNetV2 converge faster and obtain performance similar to that of sandler2018mobilenetv2 . For all ImageNet experiments, we use a batch size of 256 and 4 GPUs. When using KD on ImageNet, we set $\lambda$ to 0.5 and $T$ to 4.

Table 1: Comparison of structures and knowledge transfer methods when pruning ResNet-32 on CIFAR-100.
Method | FLOPs | Accuracy
Predefined | 41.1M | 68.18%
Predefined w/ Init | 41.1M | 69.34%
Predefined w/ KD | 41.1M | 71.40%
Random Search | 42.9M | 68.57%
Random Search w/ Init | 42.9M | 69.14%
Random Search w/ KD | 42.9M | 71.71%
TAS | 42.5M | 68.95%
TAS w/ Init | 42.5M | 69.70%
TAS w/ KD (TAS) | 42.5M | 72.41%
4.2 Case Studies
In this section, we evaluate different aspects of the proposed TAS. We also compare it with different searching algorithms and knowledge transfer methods to demonstrate the effectiveness of TAS.
The effect of different strategies to differentiate $\alpha$. We apply TAS on CIFAR-100 to prune ResNet-110. We try two different aggregation methods, i.e., using or not using the proposed CWI to align feature maps, and two different kinds of aggregation weights, i.e., Gumbel-softmax sampling as in Eq. (3) (denoted as "sample" in Fig. 3) and vanilla softmax as in Eq. (2) (denoted as "mixture" in Fig. 3). This yields four strategies: with/without CWI, combined with Gumbel-softmax/vanilla-softmax. Suppose we do not constrain the computation cost; then the architecture parameters should be optimized toward the maximum width and depth, because such a network has the maximum capacity and should yield the best performance on CIFAR-100. We try all four strategies without using $\mathcal{L}_{cost}$ and show the results in Fig. 3(a). Our TAS successfully finds that the best architecture has the maximum width and depth, whereas the other three strategies fail. We also investigate the discrepancy between the highest and the second-highest probability in Fig. 3(b). Intuitively, a higher discrepancy indicates that the model is more confident in selecting a certain width, while a lower discrepancy means that the model is confused and does not know which candidate to select. As shown in Fig. 3(b), as training proceeds, our TAS becomes more confident in selecting the suitable width. In contrast, the strategies without CWI cannot optimize the architecture parameters, and "mixture with CWI" shows a worse discrepancy than ours.
Table 2: Comparison of different pruning algorithms for ResNets on CIFAR-10 and CIFAR-100.
Depth | Method | CIFAR-10 Prune Acc | Acc Drop | FLOPs | CIFAR-100 Prune Acc | Acc Drop | FLOPs
20 | LCCL dong2017more | 91.68% | 1.06% | 2.61E7 (36.0%) | 64.66% | 2.87% | 2.73E7 (33.1%)
20 | SFP he2018soft | 90.83% | 1.37% | 2.43E7 (42.2%) | 64.37% | 3.25% | 2.43E7 (42.2%)
20 | FPGM he2019pruning | 91.09% | 1.11% | 2.43E7 (42.2%) | 66.86% | 0.76% | 2.43E7 (42.2%)
20 | TAS (D) | 90.97% | 1.91% | 2.19E7 (46.2%) | 64.81% | 3.88% | 2.19E7 (46.2%)
20 | TAS (W) | 92.31% | 0.57% | 1.99E7 (51.3%) | 68.08% | 0.61% | 1.92E7 (52.9%)
20 | TAS | 92.88% | 0.00% | 2.24E7 (45.0%) | 68.90% | 0.21% | 2.24E7 (45.0%)
32 | LCCL dong2017more | 90.74% | 1.59% | 4.76E7 (31.2%) | 67.39% | 2.69% | 4.32E7 (37.5%)
32 | SFP he2018soft | 92.08% | 0.55% | 4.03E7 (41.5%) | 68.37% | 1.40% | 4.03E7 (41.5%)
32 | FPGM he2019pruning | 92.31% | 0.32% | 4.03E7 (41.5%) | 68.52% | 1.25% | 4.03E7 (41.5%)
32 | TAS (D) | 91.48% | 2.41% | 4.08E7 (41.0%) | 66.94% | 3.66% | 4.08E7 (41.0%)
32 | TAS (W) | 92.92% | 0.96% | 3.78E7 (45.4%) | 71.74% | 1.12% | 3.80E7 (45.0%)
32 | TAS | 93.16% | 0.73% | 3.50E7 (49.4%) | 72.41% | 1.80% | 4.25E7 (38.5%)
56 | PFEC li2017pruning | 91.31% | 1.75% | 9.10E7 (27.6%) | – | – | –
56 | CP he2017channel | 91.80% | 1.00% | 6.29E7 (50.0%) | – | – | –
56 | LCCL dong2017more | 92.81% | 1.54% | 7.81E7 (37.9%) | 68.37% | 2.96% | 7.63E7 (39.3%)
56 | AMC he2018amc | 91.90% | 0.90% | 6.29E7 (50.0%) | – | – | –
56 | SFP he2018soft | 93.35% | 0.24% | 5.94E7 (52.6%) | 68.79% | 2.61% | 5.94E7 (52.6%)
56 | FPGM he2019pruning | 93.49% | 0.10% | 5.94E7 (52.6%) | 69.66% | 1.75% | 5.94E7 (52.6%)
56 | TAS | 93.69% | 0.77% | 5.95E7 (52.7%) | 72.25% | 0.93% | 6.12E7 (51.3%)
110 | LCCL dong2017more | 93.44% | 0.19% | 1.66E8 (34.2%) | 70.78% | 2.01% | 1.73E8 (31.3%)
110 | PFEC li2017pruning | 93.30% | 0.20% | 1.55E8 (38.6%) | – | – | –
110 | SFP he2018soft | 92.97% | 0.70% | 1.21E8 (52.3%) | 71.28% | 2.86% | 1.21E8 (52.3%)
110 | FPGM he2019pruning | 93.85% | 0.17% | 1.21E8 (52.3%) | 72.55% | 1.59% | 1.21E8 (52.3%)
110 | TAS | 94.33% | 0.64% | 1.19E8 (53.0%) | 73.16% | 1.90% | 1.20E8 (52.6%)
164 | LCCL dong2017more | 94.09% | 0.45% | 1.79E8 (27.4%) | 75.26% | 0.41% | 1.95E8 (21.3%)
164 | TAS | 94.00% | 1.47% | 1.78E8 (28.1%) | 77.76% | 0.53% | 1.71E8 (30.9%)
Comparison w.r.t. structures generated by different methods in Table 1. "Predefined" means pruning a fixed ratio at each layer li2017pruning . "Random Search" indicates the NAS baseline used in liu2019darts . "TAS" is our proposed differentiable searching algorithm. We make two observations: (1) the searched structure is better than the hand-crafted one under every knowledge transfer method; (2) our TAS is superior to the random NAS baseline.
Comparison w.r.t. different knowledge transfer methods in Table 1. The first line of each block does not use any knowledge transfer method. "w/ Init" indicates using the pre-trained unpruned network as initialization, and "w/ KD" indicates using KD. From Table 1, knowledge transfer methods consistently improve the accuracy of the pruned network, even when a simple method is applied (Init). Moreover, KD is robust and improves the pruned network by more than 2% accuracy on CIFAR-100.
Searching width vs. searching depth. In Table 2, we try (1) only searching for the depth ("TAS (D)"), (2) only searching for the width ("TAS (W)"), and (3) searching for both ("TAS"). Searching only for the depth yields worse results than searching only for the width, and searching for both depth and width achieves better accuracy at similar FLOPs than searching for either alone.
4.3 Comparison with the State-of-the-art
Results on CIFAR in Table 2. We prune different ResNets on both CIFAR-10 and CIFAR-100. Most previous algorithms perform poorly on CIFAR-100, while our TAS consistently outperforms them by more than 2% accuracy in most cases. On CIFAR-10, our TAS outperforms the state-of-the-art algorithms on ResNet-20/32/56/110. For example, TAS obtains 72.25% accuracy by pruning ResNet-56 on CIFAR-100, which is higher than the 69.66% of FPGM he2019pruning . For pruning ResNet-32 on CIFAR-100, we obtain higher accuracy and fewer FLOPs than the unpruned network. We obtain a slightly worse performance than LCCL dong2017more on ResNet-164, because the number of candidate network structures for pruning ResNet-164 is enormous; it is challenging to search over such a huge space, and very deep networks suffer from over-fitting on CIFAR-10 he2016deep .
Table 3: Comparison of different pruning algorithms on ImageNet.
Model | Method | Top-1 Prune Acc | Acc Drop | Top-5 Prune Acc | Acc Drop | FLOPs (reduction)
ResNet-18 | LCCL dong2017more | 66.33% | 3.65% | 86.94% | 2.29% | 1.19E9 (34.6%)
ResNet-18 | SFP he2018soft | 67.10% | 3.18% | 87.78% | 1.85% | 1.06E9 (41.8%)
ResNet-18 | FPGM he2019pruning | 68.41% | 1.87% | 88.48% | 1.15% | 1.06E9 (41.8%)
ResNet-18 | TAS | 69.15% | 1.50% | 89.19% | 0.68% | 1.21E9 (33.3%)
ResNet-50 | SFP he2018soft | 74.61% | 1.54% | 92.06% | 0.81% | 2.38E9 (41.8%)
ResNet-50 | CP he2017channel | – | – | 90.80% | 1.40% | 2.04E9 (50.0%)
ResNet-50 | Taylor molchanov2019importance | 74.50% | 1.68% | – | – | 2.25E9 (44.9%)
ResNet-50 | AutoSlim yu2019network | 76.00% | – | – | – | 3.00E9 (26.6%)
ResNet-50 | FPGM he2019pruning | 75.50% | 0.65% | 92.63% | 0.21% | 2.36E9 (42.2%)
ResNet-50 | TAS | 76.20% | 1.26% | 93.07% | 0.48% | 2.31E9 (43.5%)
MobileNetV2 | AMC he2018amc | 70.80% | 1.00% | – | – | 1.50E8 (50.0%)
MobileNetV2 | TAS | 70.90% | 1.18% | 89.74% | 0.76% | 1.49E8 (50.3%)
Results on ImageNet in Table 3. We prune ResNet-18, ResNet-50 and MobileNetV2 on ImageNet. For ResNet-18, it takes about 59 hours to search for the pruned network on 4 NVIDIA Tesla V100 GPUs. Training the unpruned ResNet-18 costs about 24 hours, so the searching time is acceptable; with more machines and an optimized implementation, TAS could finish in less time. We show competitive results compared to other state-of-the-art pruning algorithms. For example, TAS prunes ResNet-50 by 43.5% of its FLOPs, and the pruned network achieves 76.20% accuracy, which is 0.7% higher than FPGM. Similar improvements can be found when pruning ResNet-18 and MobileNetV2. Note that we directly apply the hyper-parameters from CIFAR-10 to prune models on ImageNet, so TAS could potentially achieve better results by carefully tuning the parameters on ImageNet.
Our proposed TAS is a preliminary work for this new network pruning pipeline. The pipeline can be improved by designing more effective searching algorithms and knowledge transfer methods. We hope that future work exploring these two components will yield powerful compact networks.
5 Conclusion
In this paper, we propose a new paradigm for network pruning, which consists of two components. For the first component, we apply NAS to search for the best depth and width of a network. Since most previous NAS approaches focus on the network topology instead of the network size, we term this new NAS scheme Transformable Architecture Search (TAS), and propose a differentiable TAS approach to efficiently and effectively find the most suitable depth and width of a network. For the second component, we optimize the searched network by transferring knowledge from the unpruned network. We apply a simple KD algorithm to perform knowledge transfer, and evaluate other transfer approaches to demonstrate the effectiveness of this component. Our results show that new efforts focusing on searching and transferring may lead to new breakthroughs in network pruning.
References
 [1] M. Alizadeh, J. FernándezMarqués, N. D. Lane, and Y. Gal. An empirical study of binary neural networks’ optimisation. In International Conference on Learning Representations (ICLR), 2019.

[2]
J. M. Alvarez and M. Salzmann.
Learning the number of neurons in deep networks.
In The Conference on Neural Information Processing Systems (NeurIPS), pages 2270–2278, 2016.  [3] H. Cai, L. Zhu, and S. Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations (ICLR), 2019.

[4]
J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei.
ImageNet: A largescale hierarchical image database.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, pages 248–255, 2009.  [5] X. Dong, J. Huang, Y. Yang, and S. Yan. More is less: A more complicated network with less inference complexity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5840–5848, 2017.
 [6] X. Dong and Y. Yang. Searching for a robust neural architecture in four gpu hours. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1761–1770, 2019.
 [7] M. Figurnov, M. D. Collins, Y. Zhu, L. Zhang, J. Huang, D. Vetrov, and R. Salakhutdinov. Spatially adaptive computation time for residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1039–1048, 2017.
 [8] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: efficient inference engine on compressed deep neural network. In The ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 243–254, 2016.
 [9] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR), 2015.
 [10] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In The Conference on Neural Information Processing Systems (NeurIPS), pages 1135–1143, 2015.
 [11] B. Hassibi and D. G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In The Conference on Neural Information Processing Systems (NeurIPS), pages 164–171, 1993.
 [12] J. L. Z. L. H. W. L.J. L. He, Yihui and S. Han. AMC: Automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pages 183–202, 2018.
 [13] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(9):1904–1916, 2015.
 [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[15]
Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang.
Soft filter pruning for accelerating deep convolutional neural
networks.
In
International Joint Conference on Artificial Intelligence (IJCAI)
, pages 2234–2240, 2018.  [16] Y. He, P. Liu, Z. Wang, and Y. Yang. Pruning filter via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4340–4349, 2019.
 [17] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1389–1397, 2017.
 [18] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In The Conference on Neural Information Processing Systems Workshop (NeurIPSW), 2014.
 [19] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4700–4708, 2017.

[20]
I. Hubara, M. Courbariaux, D. Soudry, R. ElYaniv, and Y. Bengio.
Quantized neural networks: Training neural networks with low
precision weights and activations.
The Journal of Machine Learning Research (JMLR)
, 18(1):6869–6898, 2017.  [21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In The International Conference on Machine Learning (ICML), pages 448–456, 2015.
 [22] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In The Conference on Neural Information Processing Systems (NeurIPS), pages 2017–2025, 2015.
 [23] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbelsoftmax. In International Conference on Learning Representations (ICLR), 2017.
 [24] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [25] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In The Conference on Neural Information Processing Systems (NeurIPS), pages 1097–1105, 2012.
 [26] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In The Conference on Neural Information Processing Systems (NeurIPS), pages 598–605, 1990.
 [27] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations (ICLR), 2017.
 [28] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations (ICLR), 2019.
 [29] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2736–2744, 2017.
 [30] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations (ICLR), 2018.
 [31] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2017.
 [32] C. Louizos, M. Welling, and D. P. Kingma. Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations (ICLR), 2018.
 [33] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations (ICLR), 2017.
 [34] B. Minnehan and A. Savakis. Cascaded projection: End-to-end network compression and acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10715–10724, 2019.
 [35] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz. Importance estimation for neural network pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11264–11272, 2019.
 [36] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence (AAAI), 2019.
 [37] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018.
 [38] E. Tartaglione, S. Lepsøy, A. Fiandrotti, and G. Francini. Learning sparse neural networks via sensitivity-driven regularization. In The Conference on Neural Information Processing Systems (NeurIPS), pages 3878–3888, 2018.
 [39] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In The Conference on Neural Information Processing Systems (NeurIPS), pages 2074–2082, 2016.
 [40] J. Ye, X. Lu, Z. Lin, and J. Z. Wang. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In International Conference on Learning Representations (ICLR), 2018.
 [41] J. Yim, D. Joo, J. Bae, and J. Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4133–4141, 2017.
 [42] J. Yu and T. Huang. Network slimming by slimmable networks: Towards one-shot architecture search for channel numbers. arXiv preprint arXiv:1903.11728, 2019.
 [43] S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations (ICLR), 2017.
 [44] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(10):1943–1955, 2016.
 [45] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR), 2017.