
Network Pruning via Transformable Architecture Search

Network pruning reduces the computation costs of an over-parameterized network without performance damage. Prevailing pruning algorithms pre-define the width and depth of the pruned networks, and then transfer parameters from the unpruned network to the pruned networks. To break the structure limitation of the pruned networks, we propose to apply neural architecture search to directly search for a network with flexible channel and layer sizes. The number of channels/layers is learned by minimizing the loss of the pruned networks. The feature map of the pruned network is an aggregation of K feature map fragments (generated by K networks of different sizes), which are sampled based on the probability distribution. The loss can be back-propagated not only to the network weights, but also to the parameterized distribution to explicitly tune the size of the channels/layers. Specifically, we apply channel-wise interpolation to keep the feature maps with different channel sizes aligned in the aggregation procedure. The maximum probability for the size in each distribution serves as the width and depth of the pruned network, whose parameters are learned by knowledge transfer, e.g., knowledge distillation, from the original networks. Experiments on CIFAR-10, CIFAR-100 and ImageNet demonstrate the effectiveness of our new perspective on network pruning compared to traditional network pruning algorithms. Various searching and knowledge transfer approaches are evaluated to demonstrate the effectiveness of the two components.


1 Introduction

Deep convolutional neural networks (CNNs) have become wider and deeper to achieve high performance on different applications he2016deep ; huang2017densely ; zoph2017neural . Despite their great success, it is impractical to apply them to resource-constrained devices, such as mobile devices and drones.

Figure 1: A comparison between the typical pruning paradigm and the proposed paradigm.

A straightforward solution to this problem is network pruning lecun1990optimal ; han2016deep ; han2016learning ; he2017channel ; he2018soft , which reduces the computation cost of over-parameterized CNNs. A typical pipeline for network pruning, as indicated in the first line of Fig. 1, removes the redundant filters of the original network and then fine-tunes the pruned network. Different criteria for the importance of the filters have been applied, such as the L2-norm of the filter li2017pruning , the reconstruction error he2017channel , and learnable scaling factors liu2017learning . Lastly, researchers apply various fine-tuning strategies li2017pruning ; he2018soft to efficiently transfer the parameters of the unpruned network to the pruned network and maximize its performance.

Traditional network pruning approaches effectively compact networks while maintaining performance. However, their network structures are intuitively designed by hand, e.g., pruning 30% of the filters in each layer li2017pruning ; he2018soft , predicting a sparsity ratio he2018amc , or leveraging regularization alvarez2016learning . The performance of the pruned network is therefore upper-bounded by the hand-crafted structures or the rules used to generate them. To break this limitation, we apply Neural Architecture Search (NAS) to turn the design of the architecture structure into a learning procedure and propose a new paradigm for network pruning, as illustrated at the bottom of Fig. 1.

Prevailing NAS methods liu2019darts ; zoph2017neural ; dong2019search ; cai2018proxylessnas ; real2019regularized optimize the network topology, whereas the network pruning literature focuses on the selection of channels. To satisfy this requirement and enable a fair comparison with previous pruning strategies, we propose a new NAS scheme termed Transformable Architecture Search (TAS). TAS aims to search for the best size of a network instead of its topology, regularized by a minimization of the computation cost, e.g., floating point operations (FLOPs). The parameters of the searched/pruned networks are then learned by knowledge transfer hinton2014distilling ; yim2017gift ; zagoruyko2017paying .

TAS is a differentiable searching algorithm, which can search for the width and depth of the networks effectively and efficiently. Specifically, different candidates of channels/layers are attached with a learnable probability. The probability distribution is learned by back-propagating the loss generated by the pruned networks, whose feature map is an aggregation of K feature map fragments (outputs of networks in different sizes) sampled based on the probability distribution. These feature maps of different channel sizes are aggregated with the help of channel-wise interpolation. The maximum probability for the size in each distribution serves as the width and depth of the pruned network.

In experiments, we show that the searched architecture with parameters transferred by knowledge distillation outperforms previous state-of-the-art pruning methods on three benchmarks: CIFAR-10, CIFAR-100 and ImageNet. We also test different knowledge transfer approaches on architectures generated by traditional hand-crafted pruning approaches li2017pruning ; he2018soft and random architecture search approach liu2019darts . Consistent improvements on different architectures demonstrate the generality of knowledge transfer.

2 Related Studies

Network pruning lecun1990optimal ; liu2019rethinking is an effective technique to compress and accelerate deep neural networks, and thus allows powerful networks he2016deep to be deployed on hardware devices with limited storage and computation resources. A variety of techniques have been proposed, such as low-rank decomposition zhang2016accelerating , weight pruning hassibi1993second ; lecun1990optimal ; han2016learning ; han2016deep , channel pruning he2018soft ; liu2019rethinking , dynamic computation figurnov2017spatially ; dong2017more and quantization hubara2017quantized ; alizadeh2019empirical . These algorithms fall into two categories: unstructured pruning lecun1990optimal ; figurnov2017spatially ; dong2017more ; han2016deep and structured pruning li2017pruning ; he2017channel ; he2018soft ; liu2019rethinking .

Unstructured pruning methods lecun1990optimal ; figurnov2017spatially ; dong2017more ; han2016deep usually enforce the convolutional weights lecun1990optimal ; hassibi1993second or feature maps dong2017more ; figurnov2017spatially to be sparse. The pioneers of unstructured pruning, LeCun et al. lecun1990optimal and Hassibi et al. hassibi1993second , investigated the use of second-derivative information to prune weights of shallow networks. After the breakthrough of deep networks in 2012 krizhevsky2012imagenet , Han et al. han2016deep ; han2016learning ; han2016eie proposed a series of works to obtain highly compressed deep networks based on L2 regularization. Following this development, many researchers explored different regularization techniques to improve the sparsity while preserving the accuracy, such as L0 regularization louizos2018learning and output sensitivity tartaglione2018learning . Since unstructured pruning methods aim to make a big network sparse instead of changing its whole structure, they need a dedicated design for dependencies han2016eie and specific hardware to speed up the inference procedure.

Structured pruning methods li2017pruning ; he2017channel ; he2018soft ; liu2019rethinking target the pruning of convolutional filters or whole layers, and thus the pruned networks can be easily deployed and applied. Early works in this field alvarez2016learning ; wen2016learning leveraged a group Lasso to enable structured sparsity of deep networks. After that, Li et al. li2017pruning proposed the typical three-stage pruning paradigm (training a large network, pruning, re-training). Such pruning algorithms regard filters with a small norm as unimportant and tend to prune them, but this assumption does not hold in deep nonlinear networks ye2018rethinking . Therefore, many researchers have focused on better criteria for identifying informative filters. For example, Liu et al. liu2017learning leveraged an L1 regularization; Ye et al. ye2018rethinking applied an ISTA penalty; and He et al. he2019pruning utilized a geometric-median-based criterion. In contrast to previous pruning pipelines, our approach allows the number of channels/layers to be explicitly optimized so that the learned structure has high performance and low cost.

Besides the criteria for informative filters, the importance of the network structure itself was highlighted in liu2019rethinking . Some methods implicitly find a data-specific architecture wen2016learning ; alvarez2016learning ; he2018amc by automatically determining the pruning and compression ratio of each layer. In contrast, we explicitly discover the architecture using NAS. Most previous NAS algorithms zoph2017neural ; dong2019search ; liu2019darts ; real2019regularized automatically discover the topology of a neural network, while we focus on searching for the depth and width of a neural network. Reinforcement learning-based zoph2017neural or evolutionary algorithm-based real2019regularized methods can search for networks with flexible width and depth, but they require huge computational resources and cannot be directly applied to large-scale target datasets. Differentiable methods dong2019search ; liu2019darts ; cai2018proxylessnas dramatically decrease the computation cost, but they usually assume that the number of channels in different searching candidates is the same. TAS is a differentiable NAS method, which is able to efficiently search for a transformable network with flexible width and depth.

Knowledge transfer has been proven effective in the literature of pruning. The parameters of a network can be transferred from the pre-trained initialization li2017pruning ; he2018soft . Minnehan et al. minnehan2019cascaded transferred the knowledge of the uncompressed network via a block-wise reconstruction loss. In this paper, we apply a simple KD hinton2014distilling to perform knowledge transfer, which achieves robust performance for the searched networks.

3 Methodology

Our pruning approach consists of three steps: (1) training the unpruned large network with a standard classification training procedure; (2) searching for the depth and width of a small network via the proposed TAS; and (3) transferring the knowledge from the unpruned large network to the searched small network by a simple KD approach hinton2014distilling . We first introduce the background, then present the details of TAS, and finally explain the knowledge transfer procedure.

Figure 2: Searching for the width of a pruned CNN from an unpruned three-layer CNN. Each convolutional layer is equipped with a learnable distribution over the number of channels in this layer, indicated on the left side. The feature maps are built sequentially, layer by layer, as shown on the right side. For a specific layer, K (2 in this example) feature maps of different sizes are sampled according to the corresponding distribution and combined by channel-wise interpolation (CWI) and a weighted sum. The aggregated feature map is fed as input to the next layer.

3.1 Transformable Architecture Search

Network channel pruning aims to reduce the number of channels in each layer of a network. Given an input image, a network takes it as input and produces the probability over each target class. Suppose X and Y are the input and output feature tensors of the l-th convolutional layer (we take 3-by-3 convolution as an example); this layer computes:

Y_j = \sum_{k=1}^{c_{in}} X_k ⊛ W_{j,k}, \quad 1 \le j \le c_{out},    (1)

where W \in \mathbb{R}^{c_{out} \times c_{in} \times 3 \times 3} indicates the convolutional kernel weight, c_{in} is the number of input channels, and c_{out} is the number of output channels. W_{j,k} corresponds to the k-th input channel and the j-th output channel, and ⊛ denotes the convolutional operation. Channel pruning methods reduce c_{out}, and consequently, the c_{in} of the next layer is also reduced.
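To make the coupling in Eq. (1) concrete, the following minimal PyTorch sketch (our illustration, not the paper's code; the layer sizes are arbitrary) shows that reducing c_out of one convolution also reduces c_in of the next:

import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)                        # input tensor X with c_in = 16
conv1 = nn.Conv2d(16, 32, kernel_size=3, padding=1)   # W has shape (c_out=32, c_in=16, 3, 3)
conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)   # next layer expects 32 input channels
y = conv2(conv1(x))                                   # unpruned forward pass

# Pruning conv1 to 24 output channels forces the next layer to take 24 input channels.
pruned_conv1 = nn.Conv2d(16, 24, kernel_size=3, padding=1)
pruned_conv2 = nn.Conv2d(24, 64, kernel_size=3, padding=1)
y_pruned = pruned_conv2(pruned_conv1(x))              # shape: (1, 64, 32, 32)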

Search for width. We use parameters α to indicate the distribution of the possible number of channels in one layer. Denote the candidate channel numbers as C = {c_1, ..., c_{|C|}} and the corresponding parameters as α = {α_1, ..., α_{|C|}}. The probability of choosing the j-th candidate for the number of channels can be formulated as:

p_j = \frac{\exp(\alpha_j)}{\sum_{k=1}^{|C|} \exp(\alpha_k)}.    (2)

However, the sampling operation in the above procedure is non-differentiable, which prevents us from back-propagating gradients through p_j to α_j. Motivated by dong2019search , we apply Gumbel-Softmax jang2017categorical ; maddison2017concrete to soften the sampling procedure and optimize α:

\hat{p}_j = \frac{\exp((\log(p_j) + g_j)/\tau)}{\sum_{k=1}^{|C|} \exp((\log(p_k) + g_k)/\tau)}, \quad g_j = -\log(-\log(u_j)), \quad u_j \sim U(0, 1),    (3)

where U(0, 1) means the uniform distribution between 0 and 1 and τ is the softmax temperature. When τ → 0, \hat{p} = [\hat{p}_1, ..., \hat{p}_{|C|}] becomes one-hot, and the Gumbel-softmax distribution drawn from \hat{p} becomes identical to the categorical distribution; when τ → ∞, the Gumbel-softmax distribution becomes a uniform distribution over the candidates. The feature map in our method is defined as the weighted sum of the original feature map fragments with different sizes, where the weights are \hat{p}. Feature maps with different sizes are aligned by channel-wise interpolation (CWI) to enable the weighted sum. To reduce the memory cost, we select a small subset of candidates with indexes I for aggregation instead of using all |C| candidates, and the weights are re-normalized based on the probabilities of the selected sizes:

\hat{O} = \sum_{j \in I} \frac{\hat{p}_j}{\sum_{k \in I} \hat{p}_k} \, \mathrm{CWI}(O_{c_j}, \max_{k \in I} c_k), \quad I \sim \mathcal{T}(p_1, ..., p_{|C|}),    (4)

where \mathcal{T} indicates the multinomial probability distribution parameterized by p_1, ..., p_{|C|}, and O_{c_j} denotes the feature map fragment with c_j channels. The CWI can be implemented in many ways, such as a 3D variant of the spatial transformer network jaderberg2015spatial or an adaptive pooling operation he2015spatial . In this paper, we choose the 3D adaptive pooling operation he2015spatial as CWI, because it brings no extra parameters and negligible extra cost. We use Batch Normalization ioffe2015batch before CWI to normalize the different fragments. The above procedure is illustrated in Fig. 2 with K = 2 as an example.
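The width-search step can be sketched as follows in PyTorch. This is a simplified illustration of Eqs. (2)-(4) rather than the released implementation: the fragments are taken as the leading channels of the full-width output, CWI is realized with F.adaptive_avg_pool3d over (channel, height, width) following the adaptive pooling choice above, and the function name and K=2 default are our own.

import torch
import torch.nn.functional as F

def width_search_forward(x, conv, candidates, alpha, tau=1.0, K=2):
    """Aggregate K sampled channel fragments of one convolution, as in Eqs. (2)-(4)."""
    p = F.softmax(alpha, dim=0)                        # Eq. (2): probability of each candidate
    p_hat = F.gumbel_softmax(alpha, tau=tau)           # Eq. (3): differentiable sampling weights
    idx = torch.multinomial(p.detach(), K, replacement=False)   # subset I of candidate indexes

    out = conv(x)                                      # output with the full (unpruned) width
    sizes = [candidates[i] for i in idx.tolist()]
    c_max = max(sizes)                                 # fragments are aligned to the largest size
    weights = p_hat[idx] / p_hat[idx].sum()            # re-normalized weights in Eq. (4)

    aggregated = 0.0
    for w, c in zip(weights, sizes):
        frag = out[:, :c]                              # fragment with c channels (illustrative choice)
        # CWI via 3D adaptive pooling over (channel, height, width)
        frag = F.adaptive_avg_pool3d(frag.unsqueeze(1),
                                     (c_max, frag.size(2), frag.size(3))).squeeze(1)
        aggregated = aggregated + w * frag
    return aggregated

# Example usage with 4 channel candidates for a layer whose unpruned width is 32:
conv = torch.nn.Conv2d(16, 32, kernel_size=3, padding=1)
alpha = torch.zeros(4, requires_grad=True)             # learnable architecture parameters
y = width_search_forward(torch.randn(2, 16, 8, 8), conv, [8, 16, 24, 32], alpha)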

Search for depth. We use parameters β = {β_1, ..., β_L} to indicate the distribution of the possible number of layers in a network with L convolutional layers. We utilize the same strategy as Eq. (3) to sample the number of layers and make β differentiable in the same way as α, obtaining the sampling distribution for depth \hat{q} = [\hat{q}_1, ..., \hat{q}_L]. We then calculate the final output feature of the pruned network as an aggregation over all possible depths:

\hat{O}_{out} = \sum_{l=1}^{L} \hat{q}_l \, \mathrm{CWI}(\hat{O}_l, C^{max}_{out}),    (5)

where \hat{O}_l indicates the output feature map computed via Eq. (4) at the l-th layer, and C^{max}_{out} indicates the maximum sampled channel among all \hat{O}_l. The final output feature map \hat{O}_{out} is fed into the last classification layer to make predictions. In this way, we can back-propagate gradients to both the width parameters α and the depth parameters β.
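A corresponding sketch of Eq. (5), under the assumption that every per-layer output is globally average-pooled and channel-aligned before the classifier (the authors' handling of the different spatial sizes may differ):

import torch
import torch.nn.functional as F

def depth_aggregate(layer_outputs, beta, tau=1.0):
    """Weighted sum over possible depths, Eq. (5); one output per candidate depth."""
    q_hat = F.gumbel_softmax(beta, tau=tau)            # sampling weights for each depth
    c_max = max(o.size(1) for o in layer_outputs)      # C_out^max: largest sampled channel count
    pooled = []
    for o in layer_outputs:                            # o has shape (N, C_l, H_l, W_l)
        # CWI to c_max channels plus global average pooling in one 3D adaptive pool
        pooled.append(F.adaptive_avg_pool3d(o.unsqueeze(1), (c_max, 1, 1)).flatten(1))
    return sum(w * f for w, f in zip(q_hat, pooled))    # fed to the final classification layer

# Example: three candidate depths with 16/24/32-channel outputs at different resolutions.
feats = [torch.randn(2, 16, 16, 16), torch.randn(2, 24, 8, 8), torch.randn(2, 32, 4, 4)]
beta = torch.zeros(3, requires_grad=True)
classifier_input = depth_aggregate(feats, beta)         # shape: (2, 32)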

Searching objectives. The final architecture is derived by selecting the candidate with the maximum probability, learned by the architecture parameters A, consisting of α for each layer (width) and β (depth). The goal of our TAS is to find an architecture A with the minimum validation loss L_val after its weights are trained by minimizing the training loss L_train:

\min_{A} \mathcal{L}_{val}(\omega^{*}_{A}, A) \quad \mathrm{s.t.} \quad \omega^{*}_{A} = \arg\min_{\omega} \mathcal{L}_{train}(\omega, A),    (6)

where ω*_A indicates the network weights optimized under architecture A. The training loss is the cross-entropy classification loss of the network. Prevailing NAS methods liu2019darts ; zoph2017neural ; dong2019search ; cai2018proxylessnas ; real2019regularized optimize over network candidates with different topologies, while our TAS searches over candidates with the same topology but smaller width and depth. As a result, the validation loss in our search procedure includes not only the classification loss but also a penalty on the computation cost:

\mathcal{L}_{val} = -\log \frac{\exp(z_y)}{\sum_{c=1}^{|z|} \exp(z_c)} + \lambda_{cost} \, \mathcal{L}_{cost},    (7)

where z = [z_1, ..., z_{|z|}] denotes the output logit vector of the pruned network, y indicates the ground-truth class of the corresponding input, and λ_cost is the weight of L_cost. The cost loss encourages the FLOPs of the network to converge to a target R, so that the computation cost of the network can be dynamically adjusted by setting different R. The piece-wise computation cost loss is defined as:

\mathcal{L}_{cost} = \begin{cases} \log(E_{cost}(A)) & \text{if } F_{cost}(A) > (1 + t) \times R \\ 0 & \text{if } (1 - t) \times R \le F_{cost}(A) \le (1 + t) \times R \\ -\log(E_{cost}(A)) & \text{if } F_{cost}(A) < (1 - t) \times R \end{cases}    (8)

where E_cost(A) computes the expectation of the computation cost based on the architecture parameters A, F_cost(A) indicates the actual cost of the currently searched architecture derived from A, and t denotes a toleration ratio, which slows down the speed of changing the searched architecture.

Algorithm 1 The TAS Procedure
Require: the training set D_train and the validation set D_val
1:  while not converged do
2:     Sample batch data from D_train
3:     Calculate L_train on this batch to update the network weights ω
4:     Sample batch data from D_val
5:     Calculate L_val on this batch via Eq. (7) to update the architecture parameters A
6:  end while
7:  Derive the searched network from A
8:  Optimize the searched network by KD via Eq. (10)

Note that we use FLOPs to evaluate the computation cost of a network, while FLOPs can readily be replaced with other metrics, such as latency cai2018proxylessnas .
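The piece-wise cost loss of Eq. (8) is easy to express directly; the sketch below assumes E_cost is already computed as a differentiable tensor and F_cost as a plain number, and omits how both are derived from A:

import torch

def cost_loss(e_cost, f_cost, R, t=0.05):
    """Piece-wise cost loss of Eq. (8); e_cost must be a differentiable tensor."""
    if f_cost > (1.0 + t) * R:                 # searched network too expensive: decrease expected cost
        return torch.log(e_cost)
    if f_cost < (1.0 - t) * R:                 # searched network too cheap: increase expected cost
        return -torch.log(e_cost)
    return e_cost.new_zeros(())                # inside the toleration band: no penalty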

We show the overall algorithm in Alg. 1. During searching, we forward the network using Eq. (5) so that both the weights ω and the architecture parameters A are differentiable. We alternately minimize L_train on the training set to optimize the pruned network's weights ω, and L_val on the validation set to optimize the architecture parameters A. After searching, we pick the number of channels with the maximum probability as the width and the number of layers with the maximum probability as the depth. The final searched network is constructed from the selected width and depth, and is then optimized via KD (introduced in the next section).
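A compact sketch of the alternating updates in Alg. 1, with the optimizer choices taken from Sec. 4.1; the forward and loss callables are placeholders rather than the released API:

import torch

def tas_search(forward_fn, train_loss, val_loss, net_weights, arch_params,
               train_loader, val_loader, epochs=1):
    """Alternating updates of Alg. 1; forward_fn/train_loss/val_loss are placeholder callables."""
    w_opt = torch.optim.SGD(net_weights, lr=0.1, momentum=0.9)           # weights: SGD (Sec. 4.1)
    a_opt = torch.optim.Adam(arch_params, lr=0.001, weight_decay=0.001)  # A: Adam (Sec. 4.1)
    for _ in range(epochs):
        for (x_tr, y_tr), (x_val, y_val) in zip(train_loader, val_loader):
            w_opt.zero_grad()
            train_loss(forward_fn(x_tr), y_tr).backward()    # L_train on D_train -> update weights
            w_opt.step()
            a_opt.zero_grad()
            val_loss(forward_fn(x_val), y_val).backward()    # L_val (Eq. (7)) on D_val -> update A
            a_opt.step()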

3.2 Knowledge Transfer

Knowledge transfer is important to learn a robust pruned network, and we employ a simple KD algorithm hinton2014distilling on a searched network architecture. This algorithm encourages the predictions of the small network to match soft targets from the unpruned network via the following objective:

\mathcal{L}_{match} = -\sum_{c=1}^{|z|} \frac{\exp(\hat{z}_c / T)}{\sum_{j=1}^{|z|} \exp(\hat{z}_j / T)} \log \frac{\exp(z_c / T)}{\sum_{j=1}^{|z|} \exp(z_j / T)},    (9)

where T is a temperature and \hat{z} indicates the logit output vector from the pre-trained unpruned network. Additionally, it uses a softmax with cross-entropy loss to encourage the small network to predict the true targets. The final objective of KD is as follows:

\mathcal{L}_{KD} = -\lambda \log \frac{\exp(z_y)}{\sum_{c=1}^{|z|} \exp(z_c)} + (1 - \lambda) \, \mathcal{L}_{match},    (10)

where y indicates the true target class of the corresponding input, and λ is the weight that balances the standard classification loss and the soft matching loss. After we obtain the searched network (Sec. 3.1), we first pre-train the unpruned network, and then optimize the searched network by transferring knowledge from the unpruned network via Eq. (10).
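A minimal sketch of this objective. It uses the common KL-divergence form of soft-target matching, which differs from the soft cross-entropy in Eq. (9) only by a constant with respect to the student; the T^2 scaling is standard practice and, like the exact weighting convention, may differ from the paper's formulation:

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, lam=0.9, T=4.0):
    """Standard soft-target KD combined with cross-entropy, in the spirit of Eqs. (9)-(10)."""
    ce = F.cross_entropy(student_logits, targets)                       # hard-label term
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)                    # soft matching term
    return lam * ce + (1.0 - lam) * soft

# Example usage with random logits for a 10-class problem:
s = torch.randn(4, 10, requires_grad=True)
t = torch.randn(4, 10)
y = torch.randint(0, 10, (4,))
loss = kd_loss(s, t, y)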

4 Experimental Analysis

4.1 Datasets and Settings

Datasets. We evaluate our approach on CIFAR-10, CIFAR-100 krizhevsky2009learning , and ImageNet deng2009imagenet . CIFAR-10 contains 50K training images and 10K test images with 10 classes. CIFAR-100 is similar to CIFAR-10 but has 100 classes. ImageNet contains 1.28 million training images and 50K test images with 1000 classes.

The search setting. We search for the number of channels over {0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} of the original number in the unpruned network, and we search for the depth within each convolutional stage. We sample K candidates in Eq. (4) to reduce the GPU memory cost during searching. We set the target R according to the FLOPs of the compared pruning algorithms. We optimize the weights via SGD and the architecture parameters via Adam. For the weights, we start the learning rate from 0.1 and reduce it with the cosine scheduler loshchilov2017sgdr . For the architecture parameters, we use a constant learning rate of 0.001 and a weight decay of 0.001. On both CIFAR-10 and CIFAR-100, we train the model for 600 epochs with a batch size of 256. On ImageNet, we train ResNets he2016deep for 120 epochs and MobileNet-V2 sandler2018mobilenetv2 for 150 epochs with a batch size of 256. The toleration ratio t is always set to 5%. The temperature τ in Eq. (3) is linearly decayed from 10 to 0.1.
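The candidate widths and the temperature schedule can be written down directly; the rounding rule and the per-epoch schedule below are our assumptions, not the released configuration:

# Channel candidates and temperature schedule from the search setting above.
RATIOS = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

def channel_candidates(c_out):
    """Candidate channel numbers for a layer whose unpruned width is c_out."""
    return sorted({max(1, round(r * c_out)) for r in RATIOS})

def tau_schedule(epoch, total_epochs, tau_max=10.0, tau_min=0.1):
    """Linearly decay the Gumbel-softmax temperature from tau_max to tau_min."""
    frac = epoch / max(1, total_epochs - 1)
    return tau_max - (tau_max - tau_min) * frac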

(a) FLOP of the searched network over epochs.
(b) The mean discrepancy over epochs.
Figure 3: The impact of different choices to make architecture parameters differentiable.

Training. For CIFAR experiments, we use SGD with a momentum of 0.9 and a weight decay of 0.0005. We train each model for 300 epochs, start the learning rate at 0.1, and reduce it with the cosine scheduler loshchilov2017sgdr . We use a batch size of 256 and 2 GPUs. When using KD on CIFAR, we use λ = 0.9 and a temperature T = 4 following zagoruyko2017paying . For ResNet models on ImageNet, we follow most hyper-parameters of CIFAR, but use a weight decay of 0.0001 and train each model for 120 epochs. For MobileNet-V2 sandler2018mobilenetv2 , we use SGD with a momentum of 0.9 and a weight decay of 0.00004. We train each model for 150 epochs, start the learning rate at 0.05, and reduce it with the cosine scheduler. We also set the dropout ratio to 0 in MobileNet-V2. We find this training strategy makes MobileNet-V2 converge faster and obtain a performance similar to that of sandler2018mobilenetv2 . For all ImageNet experiments, we use a batch size of 256 and 4 GPUs. When using KD on ImageNet, we set λ = 0.5 and T = 4.

Method | FLOPs | Accuracy
Pre-defined | 41.1M | 68.18%
Pre-defined w/ Init | 41.1M | 69.34%
Pre-defined w/ KD | 41.1M | 71.40%
Random Search | 42.9M | 68.57%
Random Search w/ Init | 42.9M | 69.14%
Random Search w/ KD | 42.9M | 71.71%
TAS | 42.5M | 68.95%
TAS w/ Init | 42.5M | 69.70%
TAS w/ KD (TAS) | 42.5M | 72.41%
Table 1: The accuracy on CIFAR-100 when pruning about 40% of the FLOPs of ResNet-32.
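For reference, a sketch of the CIFAR optimizer and schedule from the training setup above (warm-up and other details of the authors' training scripts are not reproduced):

import torch

def cifar_optimizer(model, epochs=300, lr=0.1):
    """SGD plus cosine annealing, following the CIFAR recipe in Sec. 4.1."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return opt, sched   # call sched.step() once per epoch after opt.step()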

4.2 Case Studies

In this section, we evaluate different aspects of our proposed TAS, and compare it with different searching algorithms and knowledge transfer methods to demonstrate its effectiveness.

The effect of different strategies to differentiate α. We apply TAS on CIFAR-100 to prune ResNet-110. We try two different aggregation methods, i.e., using our proposed CWI to align feature maps or not, and two different kinds of aggregation weights, i.e., Gumbel-softmax sampling as in Eq. (3) (denoted as "sample" in Fig. 3) and vanilla softmax as in Eq. (2) (denoted as "mixture" in Fig. 3). This yields four different strategies, i.e., with/without CWI combined with Gumbel-softmax/vanilla softmax. Suppose we do not constrain the computation cost; then the architecture parameters should be optimized toward the maximum width and depth, because such a network has the maximum capacity and results in the best performance on CIFAR-100. We try all four strategies without using L_cost and show the results in Fig. 3(a). Our TAS successfully finds that the best architecture should have the maximum width and depth, whereas the other three strategies fail. We also investigate the discrepancy between the highest probability and the second highest probability in Fig. 3(b). Theoretically, a higher discrepancy indicates that the model is more confident in selecting a certain width, while a lower discrepancy means that the model is confused and does not know which candidate to select. As shown in Fig. 3(b), as training proceeds, our TAS becomes more confident in selecting the suitable width. In contrast, the strategies without CWI cannot optimize the architecture parameters, and "mixture with CWI" shows a worse discrepancy than ours.

Depth | Method | CIFAR-10 Prune Acc | CIFAR-10 Acc Drop | CIFAR-10 FLOPs (ratio) | CIFAR-100 Prune Acc | CIFAR-100 Acc Drop | CIFAR-100 FLOPs (ratio)
20 | LCCL dong2017more | 91.68% | 1.06% | 2.61E7 (36.0%) | 64.66% | 2.87% | 2.73E7 (33.1%)
20 | SFP he2018soft | 90.83% | 1.37% | 2.43E7 (42.2%) | 64.37% | 3.25% | 2.43E7 (42.2%)
20 | FPGM he2019pruning | 91.09% | 1.11% | 2.43E7 (42.2%) | 66.86% | 0.76% | 2.43E7 (42.2%)
20 | TAS (D) | 90.97% | 1.91% | 2.19E7 (46.2%) | 64.81% | 3.88% | 2.19E7 (46.2%)
20 | TAS (W) | 92.31% | 0.57% | 1.99E7 (51.3%) | 68.08% | 0.61% | 1.92E7 (52.9%)
20 | TAS | 92.88% | 0.00% | 2.24E7 (45.0%) | 68.90% | -0.21% | 2.24E7 (45.0%)
32 | LCCL dong2017more | 90.74% | 1.59% | 4.76E7 (31.2%) | 67.39% | 2.69% | 4.32E7 (37.5%)
32 | SFP he2018soft | 92.08% | 0.55% | 4.03E7 (41.5%) | 68.37% | 1.40% | 4.03E7 (41.5%)
32 | FPGM he2019pruning | 92.31% | 0.32% | 4.03E7 (41.5%) | 68.52% | 1.25% | 4.03E7 (41.5%)
32 | TAS (D) | 91.48% | 2.41% | 4.08E7 (41.0%) | 66.94% | 3.66% | 4.08E7 (41.0%)
32 | TAS (W) | 92.92% | 0.96% | 3.78E7 (45.4%) | 71.74% | -1.12% | 3.80E7 (45.0%)
32 | TAS | 93.16% | 0.73% | 3.50E7 (49.4%) | 72.41% | -1.80% | 4.25E7 (38.5%)
56 | PFEC li2017pruning | 91.31% | 1.75% | 9.10E7 (27.6%) | - | - | -
56 | CP he2017channel | 91.80% | 1.00% | 6.29E7 (50.0%) | - | - | -
56 | LCCL dong2017more | 92.81% | 1.54% | 7.81E7 (37.9%) | 68.37% | 2.96% | 7.63E7 (39.3%)
56 | AMC he2018amc | 91.90% | 0.90% | 6.29E7 (50.0%) | - | - | -
56 | SFP he2018soft | 93.35% | 0.24% | 5.94E7 (52.6%) | 68.79% | 2.61% | 5.94E7 (52.6%)
56 | FPGM he2019pruning | 93.49% | 0.10% | 5.94E7 (52.6%) | 69.66% | 1.75% | 5.94E7 (52.6%)
56 | TAS | 93.69% | 0.77% | 5.95E7 (52.7%) | 72.25% | 0.93% | 6.12E7 (51.3%)
110 | LCCL dong2017more | 93.44% | 0.19% | 1.66E8 (34.2%) | 70.78% | 2.01% | 1.73E8 (31.3%)
110 | PFEC li2017pruning | 93.30% | 0.20% | 1.55E8 (38.6%) | - | - | -
110 | SFP he2018soft | 92.97% | 0.70% | 1.21E8 (52.3%) | 71.28% | 2.86% | 1.21E8 (52.3%)
110 | FPGM he2019pruning | 93.85% | 0.17% | 1.21E8 (52.3%) | 72.55% | 1.59% | 1.21E8 (52.3%)
110 | TAS | 94.33% | 0.64% | 1.19E8 (53.0%) | 73.16% | 1.90% | 1.20E8 (52.6%)
164 | LCCL dong2017more | 94.09% | 0.45% | 1.79E8 (27.4%) | 75.26% | 0.41% | 1.95E8 (21.3%)
164 | TAS | 94.00% | 1.47% | 1.78E8 (28.1%) | 77.76% | 0.53% | 1.71E8 (30.9%)
Table 2: Comparison of different pruning algorithms for ResNet on CIFAR. 'Acc' = accuracy, 'FLOPs' = FLOPs (pruning ratio), 'TAS (D)' = searching for depth only, 'TAS (W)' = searching for width only, 'TAS' = searching for both width and depth. '-' indicates results not reported.

Comparison w.r.t. structures generated by different methods in Table 1. 'Pre-defined' means pruning a fixed ratio at each layer li2017pruning ; 'Random Search' indicates the random NAS baseline used in liu2019darts ; 'TAS' is our proposed differentiable searching algorithm. We make two observations: (1) searching finds a better structure than the pre-defined one, regardless of the knowledge transfer method used; (2) our TAS is superior to the random NAS baseline.

Comparison w.r.t. different knowledge transfer methods in Table 1. The first line in each block does not use any knowledge transfer method. "w/ Init" indicates using the pre-trained unpruned network as initialization, and "w/ KD" indicates using KD. From Table 1, knowledge transfer methods consistently improve the accuracy of the pruned network, even when a simple method (Init) is applied. Moreover, KD is robust and improves the pruned network by more than 2% accuracy on CIFAR-100.

Searching width vs. searching depth. We try (1) only searching for the depth ("TAS (D)"), (2) only searching for the width ("TAS (W)"), and (3) searching for both depth and width ("TAS") in Table 2. Searching only for the depth gives worse results than searching for the width. Searching for both depth and width achieves better accuracy at a similar FLOP count than searching for either one alone.

4.3 Compared to the state-of-the-art

Results on CIFAR in Table 2. We prune different ResNets on both CIFAR-10 and CIFAR-100. Most previous algorithms perform poorly on CIFAR-100, while our TAS consistently outperforms them, by more than 2% accuracy in most cases. On CIFAR-10, our TAS outperforms the state-of-the-art algorithms on ResNet-20, 32, 56 and 110. For example, TAS obtains 72.25% accuracy by pruning ResNet-56 on CIFAR-100, which is higher than the 69.66% of FPGM he2019pruning . For pruning ResNet-32 on CIFAR-100, we obtain higher accuracy and fewer FLOPs than the unpruned network. We obtain slightly worse performance than LCCL dong2017more on ResNet-164. This is because the number of candidate network structures when pruning ResNet-164 is enormous; it is challenging to search over such a huge search space, and the very deep network suffers from over-fitting on CIFAR-10 he2016deep .

Model | Method | Top-1 Prune Acc | Top-1 Acc Drop | Top-5 Prune Acc | Top-5 Acc Drop | FLOPs | Prune Ratio
ResNet-18 | LCCL dong2017more | 66.33% | 3.65% | 86.94% | 2.29% | 1.19E9 | 34.6%
ResNet-18 | SFP he2018soft | 67.10% | 3.18% | 87.78% | 1.85% | 1.06E9 | 41.8%
ResNet-18 | FPGM he2019pruning | 68.41% | 1.87% | 88.48% | 1.15% | 1.06E9 | 41.8%
ResNet-18 | TAS | 69.15% | 1.50% | 89.19% | 0.68% | 1.21E9 | 33.3%
ResNet-50 | SFP he2018soft | 74.61% | 1.54% | 92.06% | 0.81% | 2.38E9 | 41.8%
ResNet-50 | CP he2017channel | - | - | 90.80% | 1.40% | 2.04E9 | 50.0%
ResNet-50 | Taylor molchanov2019importance | 74.50% | 1.68% | - | - | 2.25E9 | 44.9%
ResNet-50 | AutoSlim yu2019network | 76.00% | - | - | - | 3.00E9 | 26.6%
ResNet-50 | FPGM he2019pruning | 75.50% | 0.65% | 92.63% | 0.21% | 2.36E9 | 42.2%
ResNet-50 | TAS | 76.20% | 1.26% | 93.07% | 0.48% | 2.31E9 | 43.5%
MobileNet-V2 | AMC he2018amc | 70.80% | 1.00% | - | - | 1.50E8 | 50.0%
MobileNet-V2 | TAS | 70.90% | 1.18% | 89.74% | 0.76% | 1.49E8 | 50.3%
Table 3: Comparison of different pruning algorithms for ResNet and MobileNet-V2 on ImageNet. '-' indicates results not reported.

Results on ImageNet in Table 3. We prune ResNet-18, ResNet-50 and MobileNet-V2 on ImageNet. For ResNet-18, it takes about 59 hours to search for the pruned network on 4 NVIDIA Tesla V100 GPUs. Training the unpruned ResNet-18 costs about 24 hours, so the searching time is acceptable; with more machines and an optimized implementation, TAS could finish in less time. We show competitive results compared to other state-of-the-art pruning algorithms. For example, TAS prunes 43.5% of the FLOPs of ResNet-50, and the pruned network achieves 76.20% accuracy, which is 0.7% higher than FPGM. Similar improvements can be found when pruning ResNet-18 and MobileNet-V2. Note that we directly apply the hyper-parameters used on CIFAR-10 to prune models on ImageNet, and thus TAS could potentially achieve better results by carefully tuning the parameters on ImageNet.

Our proposed TAS is a preliminary work in this new network pruning pipeline. The pipeline can be improved by designing more effective searching algorithms and knowledge transfer methods, and we hope that future work exploring these two components will yield powerful compact networks.

5 Conclusion

In this paper, we propose a new paradigm for network pruning, which consists of two components. For the first component, we propose to apply NAS to search for the best depth and width of a network. Since most previous NAS approaches focus on the network topology instead of the network size, we name this new NAS scheme Transformable Architecture Search (TAS), and propose a differentiable TAS approach to efficiently and effectively find the most suitable depth and width of a network. For the second component, we propose to optimize the searched network by transferring knowledge from the unpruned network. In this paper, we apply a simple KD algorithm to perform knowledge transfer, and evaluate other transfer approaches to demonstrate the effectiveness of this component. Our results show that new efforts focusing on searching and transferring may lead to new breakthroughs in network pruning.

References

  • [1] M. Alizadeh, J. Fernández-Marqués, N. D. Lane, and Y. Gal. An empirical study of binary neural networks’ optimisation. In International Conference on Learning Representations (ICLR), 2019.
  • [2] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In The Conference on Neural Information Processing Systems (NeurIPS), pages 2270–2278, 2016.
  • [3] H. Cai, L. Zhu, and S. Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations (ICLR), 2019.
  • [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
  • [5] X. Dong, J. Huang, Y. Yang, and S. Yan. More is less: A more complicated network with less inference complexity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5840–5848, 2017.
  • [6] X. Dong and Y. Yang. Searching for a robust neural architecture in four gpu hours. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1761–1770, 2019.
  • [7] M. Figurnov, M. D. Collins, Y. Zhu, L. Zhang, J. Huang, D. Vetrov, and R. Salakhutdinov. Spatially adaptive computation time for residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1039–1048, 2017.
  • [8] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: efficient inference engine on compressed deep neural network. In The ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 243–254, 2016.
  • [9] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR), 2015.
  • [10] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In The Conference on Neural Information Processing Systems (NeurIPS), pages 1135–1143, 2015.
  • [11] B. Hassibi and D. G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In The Conference on Neural Information Processing Systems (NeurIPS), pages 164–171, 1993.
  • [12] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pages 183–202, 2018.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(9):1904–1916, 2015.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [15] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang. Soft filter pruning for accelerating deep convolutional neural networks. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2234–2240, 2018.
  • [16] Y. He, P. Liu, Z. Wang, and Y. Yang. Pruning filter via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4340–4349, 2019.
  • [17] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1389–1397, 2017.
  • [18] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In The Conference on Neural Information Processing Systems Workshop (NeurIPS-W), 2014.
  • [19] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4700–4708, 2017.
  • [20] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research (JMLR), 18(1):6869–6898, 2017.
  • [21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In The International Conference on Machine Learning (ICML), pages 448–456, 2015.
  • [22] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In The Conference on Neural Information Processing Systems (NeurIPS), pages 2017–2025, 2015.
  • [23] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations (ICLR), 2017.
  • [24] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • [25] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In The Conference on Neural Information Processing Systems (NeurIPS), pages 1097–1105, 2012.
  • [26] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In The Conference on Neural Information Processing Systems (NeurIPS), pages 598–605, 1990.
  • [27] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations (ICLR), 2017.
  • [28] H. Liu, K. Simonyan, and Y. Yang. Darts: Differentiable architecture search. In International Conference on Learning Representations (ICLR), 2019.
  • [29] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2736–2744, 2017.
  • [30] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations (ICLR), 2018.
  • [31] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2017.
  • [32] C. Louizos, M. Welling, and D. P. Kingma. Learning sparse neural networks through regularization. In International Conference on Learning Representations (ICLR), 2018.
  • [33] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations (ICLR), 2017.
  • [34] B. Minnehan and A. Savakis. Cascaded projection: End-to-end network compression and acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10715–10724, 2019.
  • [35] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz. Importance estimation for neural network pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11264–11272, 2019.
  • [36] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence (AAAI), 2019.
  • [37] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018.
  • [38] E. Tartaglione, S. Lepsøy, A. Fiandrotti, and G. Francini. Learning sparse neural networks via sensitivity-driven regularization. In The Conference on Neural Information Processing Systems (NeurIPS), pages 3878–3888, 2018.
  • [39] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In The Conference on Neural Information Processing Systems (NeurIPS), pages 2074–2082, 2016.
  • [40] J. Ye, X. Lu, Z. Lin, and J. Z. Wang. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In International Conference on Learning Representations (ICLR), 2018.
  • [41] J. Yim, D. Joo, J. Bae, and J. Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4133–4141, 2017.
  • [42] J. Yu and T. Huang. Network slimming by slimmable networks: Towards one-shot architecture search for channel numbers. arXiv preprint arXiv:1903.11728, 2019.
  • [43] S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations (ICLR), 2017.
  • [44] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(10):1943–1955, 2016.
  • [45] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR), 2017.