MetaPruning: Meta Learning for Automatic Neural Network Channel Pruning

In this paper, we propose a novel meta learning approach for automatic channel pruning of very deep neural networks. We first train a PruningNet, a kind of meta network, which is able to generate weight parameters for any pruned structure given the target network. We use a simple stochastic structure sampling method for training the PruningNet. Then, we apply an evolutionary procedure to search for good-performing pruned networks. The search is highly efficient because the weights are directly generated by the trained PruningNet and no finetuning is needed. With a single PruningNet trained for the target network, we can search for various Pruned Networks under different constraints with little human participation. We demonstrate competitive performance on MobileNet V1/V2 networks, with up to 9.0%/9.9% higher ImageNet accuracy than the V1/V2 baselines. Compared to previous state-of-the-art AutoML-based pruning methods, such as AMC and NetAdapt, we achieve higher or comparable accuracy under various conditions.

1 Introduction

Channel pruning has been recognized as an effective neural network compression/acceleration method [27, 19, 2, 3, 18, 48] and is widely used in industry. A typical pruning approach contains three stages: training a large over-parameterized network, pruning the less-important weights or channels, and finetuning or re-training the pruned network. The second stage is the key. It usually performs iterative layer-wise pruning and fast finetuning or weight reconstruction to retain the accuracy [15, 1, 28, 36].

Conventional channel pruning methods mainly rely on data-driven sparsity constraints [24, 30] or human-designed policies [19, 27, 35, 21, 33, 2]. Recent AutoML-style works prune channels automatically in an iterative mode, based on a feedback loop [48] or reinforcement learning [18]. Compared with the conventional pruning methods, the AutoML methods save human effort and can optimize direct metrics such as hardware latency.

Apart from the idea of keeping the important weights in the pruned network, a recent study [31] finds that the pruned network can achieve the same accuracy whether or not it inherits the weights of the original network. This finding suggests that the essence of channel pruning is finding a good pruning structure, i.e., the layer-wise channel numbers.

Figure 1:

Our MetaPruning has two steps. 1) Training a PruningNet. At each iteration, a network encoding vector (i.e., the number of channels in each layer) is randomly generated. The Pruned Network is constructed accordingly. The PruningNet takes the network encoding vector as input and generates the weights for the Pruned Network. 2) Searching for the best Pruned Network. We construct many Pruned Networks by varying the network encoding vector and evaluate their goodness on the validation data with the weights predicted by the PruningNet. No finetuning or re-training is needed.

However, exhaustively finding the optimal pruning structure is computationally prohibitive. Consider a network with 10 layers, each containing 32 channels: the possible combinations of layer-wise channel numbers amount to 32^10. Inspired by the recent Neural Architecture Search (NAS), specifically the One-Shot model [5], as well as the weight prediction mechanism in HyperNetwork [13], we propose to train a PruningNet that can generate weights for all candidate pruned network structures, such that we can search for good-performing structures by just evaluating their accuracy on the validation data, which is highly efficient.
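As a back-of-the-envelope count (ours, under the assumption that each layer can independently keep any number of channels from 1 to 32):

32^{10} = 2^{50} \approx 1.1 \times 10^{15} \ \text{possible layer-wise channel configurations.}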

To train the PruningNet, we use a stochastic training method. As shown in Figure 1, the PruningNet generates the weights for pruned networks from the corresponding network encoding vectors, i.e., the number of channels in each layer. By stochastically feeding in different network encoding vectors, the PruningNet gradually learns to generate weights for various pruned structures. After training, we search for good-performing Pruned Networks with an evolutionary search method, which can flexibly incorporate various hard constraints such as computation FLOPs or hardware latency. Moreover, by directly searching for the best pruned network via determining the channels for each layer or each stage, we can prune channels in the shortcut without extra effort, which is seldom addressed in previous channel pruning solutions. We name the proposed method MetaPruning.

We apply our approach to MobileNets [20, 41]. At the same FLOPs, our accuracy is 2.2%-6.6% higher than MobileNet V1 and 0.7%-3.7% higher than MobileNet V2. At the same latency, our accuracy is 2.1%-9.0% higher than MobileNet V1 and 1.2%-9.9% higher than MobileNet V2. Compared with state-of-the-art AutoML-based channel pruning methods [18, 48], our MetaPruning produces superior or comparable results.

Our contributions are four-fold:


  • We propose a meta learning approach, MetaPruning, for channel pruning. The core of this approach is to learn a meta network (named PruningNet) that generates weights for various pruned structures. With a single trained PruningNet, we can obtain various pruned networks under different constraints.

  • Compared to conventional pruning methods, MetaPruning liberates humans from cumbersome hyperparameter tuning and enables direct optimization of the desired metrics.

  • Compared to other AutoML methods, MetaPruning can easily enforce hard constraints in the search for desired structures, without manually tuning the reinforcement learning hyper-parameters.

  • Our meta learning approach can effortlessly prune the channels in the short-cuts of ResNet-like structures, which is non-trivial because a channel in the short-cut affects more than one layer.

2 Related Works

There are extensive studies on compressing and accelerating neural networks, such as quantization [38, 32], pruning [19, 25, 14] and compact network design [20, 41, 50, 34]. A comprehensive survey is provided in [43]. Here, we summarize the approaches that are most related to our work.

Pruning Network pruning is a prevalent approach for removing redundancy in DNNs. In weight pruning, individual weights are pruned to compress the model size [25, 16, 14, 12]. However, weight pruning results in unstructured sparse filters, which can hardly be accelerated by general-purpose hardware. Recent works [21, 27, 35, 19, 33, 49] focus on channel pruning in CNNs, which removes entire weight filters instead of individual weights. Traditional channel pruning methods trim channels based on the importance of each channel, either in an iterative mode [19, 33] or by adding a data-driven sparsity [24, 30]. In most traditional channel pruning, the compression ratio of each layer needs to be set manually based on human expertise or heuristics, which is time-consuming and prone to sub-optimal solutions.

AutoML Recently, AutoML methods [18, 48] take the real-time inference latency on multiple devices into account and iteratively prune channels in different layers of a network via reinforcement learning [18] or an automatic feedback loop [48]. Compared with traditional channel pruning methods, AutoML methods alleviate the manual effort of tuning the hyper-parameters of channel pruning. Our proposed MetaPruning also involves little human participation. Different from previous AutoML pruning methods, which are carried out in a layer-wise pruning and finetuning loop, our method is motivated by recent findings [31], which suggest that instead of selecting “important” weights, the essence of channel pruning sometimes lies in identifying the best pruned network. From this perspective, we propose MetaPruning for directly finding the optimal pruned network structures. Compared to previous AutoML pruning methods [18, 48], MetaPruning enjoys higher flexibility in precisely meeting the constraints and possesses the ability to prune the channels in the short-cut.

Meta Learning Meta-learning refers to learning from observing how different machine learning approaches perform on various learning tasks. Meta learning can be used in few/zero-shot learning [39, 11] and transfer learning [44]. A comprehensive overview of meta learning is provided in [26]. In this work we are inspired by [13] to use meta learning for weight prediction. Weight prediction refers to predicting the weights of a neural network with another neural network rather than learning them directly [13]. Recent works also apply meta learning to various tasks and achieve state-of-the-art results in detection [47], super-resolution with arbitrary magnification [23] and instance segmentation [22].

Neural Architecture Search Studies on neural architecture search try to find the optimal network structures and hyper-parameters with reinforcement learning [51, 4], genetic algorithms [46, 37, 40] or gradient-based approaches [29, 45]. Parameter sharing [7, 5, 45, 29] and weight prediction [6, 10] methods are also extensively studied in neural architecture search. One-shot architecture search [5] uses an over-parameterized network with multiple operation choices in each layer. By jointly training multiple choices with drop-path, it can search for the path with the highest accuracy in the trained network, which also inspired our two-step pruning pipeline. Tuning channel widths is also included in some neural architecture search methods. ChamNet [8] built an accuracy predictor atop a Gaussian Process with Bayesian optimization to predict the network accuracy with various channel widths, expansion ratios and numbers of blocks in each stage. Despite its high accuracy, building such an accuracy predictor requires substantial computational power. FBNet [45] and ProxylessNAS [7] include blocks with several different middle channel choices in the search space. Different from neural architecture search, in the channel pruning task the channel width choices in each layer are nearly continuous, which makes it infeasible to enumerate every channel width choice as an independent operation. The proposed MetaPruning, targeting channel pruning, solves this continuous channel pruning challenge by training the PruningNet with weight prediction, as explained in Sec. 3.

Figure 2: The proposed stochastic training method of the PruningNet. At each iteration, we randomize a network encoding vector. The PruningNet generates the weights by taking the encoding vector as input. The Pruned Network is constructed with respect to the channel width vector. We crop the weights generated by the PruningNet to match the input and output channels in the Pruned Network. By changing the channel widths in each iteration, the PruningNet learns to generate different weights for various Pruned Networks.

3 Methodology

In this section, we introduce our meta learning approach for automatically pruning channels in deep neural networks, such that the pruned network can meet various hard constraints.

We formulate the channel pruning problem as

(c_1, c_2, \ldots, c_l)^{*} = \arg\min_{c_1, c_2, \ldots, c_l} \mathcal{L}\big(\mathcal{A}(c_1, c_2, \ldots, c_l; w)\big), \quad \text{s.t.} \ \mathcal{C} < \text{constraint}, \quad (1)

where \mathcal{A} is the network before pruning. We try to find the pruned network channel widths (c_1, c_2, \ldots, c_l), from the 1st layer to the l-th layer, that minimize the loss \mathcal{L} after the weights w are trained, subject to the cost \mathcal{C} meeting the hard constraint (i.e., FLOPs or latency).

To achieve this, we propose to construct a PruningNet, a kind of meta network, from which we can quickly obtain the goodness of all potential pruned network structures by evaluating them on the validation data only. Then we can apply any search method, which is an evolutionary algorithm in this paper, to search for the best pruned network.

3.1 PruningNet training

Channel pruning is non-trivial because of the layer-wise dependence among channels: pruning one channel may significantly influence the following layers and, in turn, degrade the overall accuracy. Previous methods try to decompose the channel pruning problem into the sub-problems of pruning the unimportant channels layer-by-layer [19] or adding a sparsity regularization [24]. AutoML methods prune channels automatically with a feedback loop [48] or reinforcement learning [18]. Among those methods, how to prune channels in the short-cut is seldom addressed. Most previous methods prune the middle channels of each block only [48, 18], which limits the overall compression ratio without manually reducing the input image resolution.

Carrying out the channel pruning task with consideration of the overall pruned network structure is beneficial for finding optimal solutions and can solve the shortcut pruning problem. However, obtaining the best pruned network is not straightforward: even for a small network with 10 layers and 32 channels per layer, the number of possible pruned network structures is huge.

Inspired by the recent work [31], which suggests that the weights left by pruning are not as important as the pruned network structure, we are motivated to directly find the best pruned network structure. In this sense, we may directly predict the optimal pruned network without iteratively deciding the important weight filters. To achieve this goal, we construct a meta network, the PruningNet, which provides reasonable weights for various pruned network structures so as to rank their performance.

The PruningNet is a meta network, which takes a network encoding vector as input and outputs the weights of the pruned network:

W = \text{PruningNet}(c_1, c_2, \ldots, c_l). \quad (2)
Figure 3: (a) The network structure of the PruningNet connected with the Pruned Network. The PruningNet and the Pruned Network are jointly trained, taking the network encoding vector as well as a minibatch of images as input. (b) The reshape and crop operations on the weight matrix generated by the PruningNet block.

A PruningNet block consists of two fully-connected layers. In the forward pass, the PruningNet takes the network encoding vector (i.e., the number of channels in each layer) as input and generates the weight matrix. Meanwhile, a Pruned Network is constructed with the output channel width of each layer equal to the corresponding element of the network encoding vector. The generated weight matrix is cropped to match the number of input and output channels in the Pruned Network. Given a batch of input images, we can calculate the loss of the Pruned Network with the generated weights.

In the backward pass, instead of updating the weights in the Pruned Network, we calculate the gradients w.r.t. the weights in the PruningNet. Since the reshape operation, as well as the convolution between the output of the fully-connected layer in the PruningNet and the output of the previous convolutional layer in the Pruned Network, is also differentiable, the gradients of the weights in the PruningNet can be easily calculated by the chain rule. The PruningNet is thus end-to-end trainable. The detailed structure of the PruningNet connected with the Pruned Network is shown in Figure 3.

Stochastic training. We propose to randomly change the network encoding vector during training. At each iteration, the network encoding vector is generated by randomly choosing the number of channels in each layer. With different network encodings, different Pruned Networks are constructed and the corresponding weights are provided by the PruningNet. By training stochastically with different encoding vectors, the PruningNet learns to predict reasonable weights for various Pruned Networks.
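As an illustration only (our own toy code, not the authors' implementation), the following PyTorch-style sketch shows one stochastic training iteration for a single PruningNet block that generates the weights of one pruned 1x1 convolution; the full method applies one such block per building block of the target network.

import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class PruningNetBlock(nn.Module):
    """Two FC layers mapping a channel-width encoding to a full-size 1x1 conv weight."""
    def __init__(self, max_in, max_out, hidden=64):
        super().__init__()
        self.max_in, self.max_out = max_in, max_out
        self.fc1 = nn.Linear(2, hidden)                 # encoding: (in_ratio, out_ratio)
        self.fc2 = nn.Linear(hidden, max_out * max_in)  # weights of the full-size 1x1 conv

    def forward(self, encoding, c_in, c_out):
        w = self.fc2(F.relu(self.fc1(encoding)))        # generate the full weight matrix
        w = w.view(self.max_out, self.max_in, 1, 1)
        return w[:c_out, :c_in]                         # crop the top-left to the sampled widths

block = PruningNetBlock(max_in=32, max_out=64)
opt = torch.optim.SGD(block.parameters(), lr=0.1)

# one stochastic training iteration
c_in, c_out = random.randint(4, 32), random.randint(4, 64)   # sample the encoding vector
enc = torch.tensor([c_in / 32.0, c_out / 64.0])
x = torch.randn(8, c_in, 14, 14)                              # dummy minibatch of features
weight = block(enc, c_in, c_out)
out = F.conv2d(x, weight)                                     # the pruned conv uses generated weights
loss = out.mean()                                             # placeholder loss for this sketch
opt.zero_grad(); loss.backward(); opt.step()                  # gradients update the PruningNet only

Because the generated weights enter the convolution directly, backpropagation reaches the PruningNet parameters exactly as described above.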

3.2 Pruned-Network search

After the PruningNet is trained, we can obtain the accuracy of each potential pruned network by inputting its network encoding into the PruningNet, generating the corresponding weights and evaluating on the validation data.
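For concreteness, a minimal sketch of such an evaluation routine is given below; it is our own illustrative code, and build_pruned_network is a hypothetical helper that constructs the pruned network and fills in the weights generated by the PruningNet.

import torch

@torch.no_grad()
def evaluate_encoding(pruning_net, encoding, val_loader, device="cuda"):
    """Score one candidate pruned structure with PruningNet-generated weights."""
    model = pruning_net.build_pruned_network(encoding)   # assumed helper: builds the pruned
                                                         # net and inserts generated weights
    model.to(device).eval()
    correct = total = 0
    for images, labels in val_loader:
        logits = model(images.to(device))
        correct += (logits.argmax(dim=1) == labels.to(device)).sum().item()
        total += labels.size(0)
    return correct / total                               # validation top-1 accuracy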

Since the number of network encoding vectors is huge, we cannot enumerate all of them. To find a pruned network with high accuracy under the constraint, we use an evolutionary search, which can easily incorporate any soft or hard constraints.

In the evolutionary algorithm used in MetaPruning, each pruned network is encoded by a vector of channel numbers for each layer, named the gene of the pruned network. Under the hard constraint, we first randomly select a number of genes and obtain the accuracy of the corresponding pruned networks by evaluation. Then the top k genes with the highest accuracy are selected for generating new genes with mutation and crossover. Mutation is carried out by randomly changing a proportion of elements in the gene. Crossover means that we randomly recombine the genes of two parents to generate an offspring. We can easily enforce the hard constraint by eliminating unqualified genes. By repeating the top-k selection and new-gene generation for several iterations, we obtain the gene that meets the hard constraints while achieving the highest accuracy. The detailed algorithm is described in Algorithm 1.

Hyper Parameters: Population Size: P, Number of Mutation: M, Number of Crossover: S, Max Number of Iterations: N.
Input: PruningNet: PruningNet, Hard constraint: C.
Output: Most accurate gene: G_top1.

1:  G_0 = Random(P), s.t. C;
2:  G_topK = ∅;
3:  for i = 0, ..., N do
4:     {G_i, accuracy_i} = Inference(PruningNet(G_i));
5:     G_topK, accuracy_topK = TopK({G_i, accuracy_i} ∪ {G_topK, accuracy_topK});
6:     G_mutation = Mutation(G_topK, M), s.t. C;
7:     G_crossover = Crossover(G_topK, S), s.t. C;
8:     G_{i+1} = (G_mutation, G_crossover);
9:  end for
10: G_top1 = Top1({G_topK, accuracy_topK});
11: return G_top1;
Algorithm 1 Evolutionary Search Algorithm
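A compact Python sketch of this search loop follows (our own code, not the authors' implementation). Here evaluate is assumed to be a callable of (PruningNet, gene), such as the validation routine sketched earlier in this section, and cost_of stands for the FLOPs or latency estimator used to enforce the hard constraint.

import random

def evolutionary_search(pruning_net, evaluate, cost_of, constraint, max_channels,
                        population=50, topk=10, n_mutation=25, n_crossover=25,
                        iterations=20):
    """Search for the most accurate channel-number gene under a hard cost constraint."""
    def random_gene():
        while True:                                      # re-sample until the constraint holds
            g = tuple(random.randint(1, c) for c in max_channels)
            if cost_of(g) <= constraint:
                return g

    def mutate(g):
        g = list(g)
        for i in range(len(g)):
            if random.random() < 0.1:                    # randomly change ~10% of the elements
                g[i] = random.randint(1, max_channels[i])
        return tuple(g)

    def crossover(a, b):
        return tuple(random.choice(pair) for pair in zip(a, b))

    genes = [random_gene() for _ in range(population)]
    scored = {}                                          # gene -> validation accuracy
    for _ in range(iterations):
        for g in genes:
            if g not in scored:
                scored[g] = evaluate(pruning_net, g)     # accuracy with PruningNet weights
        top = sorted(scored, key=scored.get, reverse=True)[:topk]
        genes = []
        while len(genes) < n_mutation:                   # offspring by mutation
            g = mutate(random.choice(top))
            if cost_of(g) <= constraint:
                genes.append(g)
        while len(genes) < n_mutation + n_crossover:     # offspring by crossover
            g = crossover(random.choice(top), random.choice(top))
            if cost_of(g) <= constraint:
                genes.append(g)
    return max(scored, key=scored.get)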

4 Experimental Results

In this section, we demonstrate the effectiveness of our proposed MetaPruning method. We first explain the experiment settings and introduce how to apply MetaPruning to MobileNet V1 [20] and V2 [41], which can be easily generalized to other network structures. Second, we compare our results with the uniform pruning baselines as well as two state-of-the-art AutoML channel pruning methods. Third, we visualize the pruned networks obtained with MetaPruning. Last, ablation studies are carried out to analyze the effect of weight prediction in our method.

4.1 Experiment settings

The proposed MetaPruning is efficient to run, as it does not involve an iterative finetuning process and requires only the same number of training epochs as training a network normally. Consequently, it is feasible to carry out all experiments on the ImageNet 2012 classification dataset [9].

Our MetaPruning method consists of two stages. In the first stage, we train the PruningNet from scratch with stochastic training. In the second stage, we use an evolutionary search algorithm to find the best pruned network, and then train it from scratch normally. For the training process in both stages, we use the same standard data augmentation strategies as ResNet [17] to process the input images and the same training scheme as [34]. The resolution of the input images is set to 224×224 for all experiments. We split the original training images into a sub-validation dataset, which contains 50,000 images randomly selected from the training images (50 images for each of the 1000 classes), and a sub-training dataset with the rest of the images. We train the PruningNet on the sub-training dataset and evaluate the performance of pruned networks on the sub-validation dataset in the search phase. When evaluating a pruned network, we recalculate the running mean and running variance of the BatchNorm layers with 20,000 sub-training images, which takes only a few seconds. After obtaining the best pruned network, it is trained from scratch on the original training dataset and evaluated on the test dataset.
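The BatchNorm re-calibration step can be implemented as in the following sketch (our own code; subtrain_loader is a hypothetical loader over the 20,000 sub-training images):

import torch

@torch.no_grad()
def recalibrate_bn(model, subtrain_loader, num_images=20000, device="cuda"):
    """Re-estimate BatchNorm running mean/variance for a freshly built pruned network."""
    for m in model.modules():
        if isinstance(m, torch.nn.BatchNorm2d):
            m.reset_running_stats()                  # clear stale statistics
            m.momentum = None                        # use a cumulative moving average
    model.to(device).train()                         # train mode updates running stats
    seen = 0
    for images, _ in subtrain_loader:
        model(images.to(device))                     # forward passes only; no backprop
        seen += images.size(0)
        if seen >= num_images:
            break
    model.eval()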

4.2 MetaPruning on MobileNets

To prove the effectiveness of our MetaPruning method, we apply it to MobileNets [20, 41], which are compact networks designed for mobile and embedded devices. Compared to other networks like ResNet [17] and VGG [42], MobileNets have less redundancy and are therefore harder to prune. MobileNets extensively use depth-wise convolutions followed by point-wise convolutions, which greatly reduce the redundancy of the standard convolutions used in VGG and ResNet. In this work, we choose to prune the filters in MobileNets to prove the effectiveness of our method and to make further improvements over MobileNets, so that we can provide off-the-shelf solutions for channel pruning targeting resource-limited devices. Of course, our method can easily be generalized to other network architectures.

Figure 4: Channel pruning schemes considering the layer-wise inter-dependency. (a) For a network without shortcuts, e.g., MobileNet V1, we crop the top-left part of the original weight matrix to match the input and output channels. For simplicity, we omit the depth-wise convolution here. (b) For a network with shortcuts, e.g., MobileNet V2, we prune the middle channels in each block while keeping the input and output channels of the block equal.

4.2.1 MobileNet V1

MobileNet V1 is a network without shortcuts. The input to the PruningNet is the vector of channel numbers of each layer. A PruningNet block is composed of two concatenated fully-connected layers, and the number of PruningNet blocks equals the number of dw-conv-pw-conv blocks (i.e., two concatenated layers with a 3x3 depth-wise convolution and a 1x1 point-wise convolution) in MobileNet V1. In the PruningNet, we first decode the network encoding vector into the input compression ratio and output compression ratio of each dw-conv-pw-conv block. To generate each weight matrix, this ratio vector is fed into the first fully-connected layer of the PruningNet block, which has an output size of 64. The second fully-connected layer uses this 64-dimensional encoding to output a vector whose length equals the number of weights in the corresponding convolution layer, which we then reshape into the 4-dimensional weight matrix of that convolution layer.
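Below is a simplified sketch (our own code, not the released implementation) of one such PruningNet block for a dw-conv-pw-conv unit; it generates the depth-wise 3x3 and point-wise 1x1 weights from the (input ratio, output ratio) encoding and crops them to the sampled channel widths. Biases and BatchNorm are omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class V1PruningNetBlock(nn.Module):
    """Generates weights for one MobileNet V1 dw-conv-pw-conv block."""
    def __init__(self, max_in, max_out, hidden=64):
        super().__init__()
        self.max_in, self.max_out = max_in, max_out
        self.fc1 = nn.Linear(2, hidden)                        # (in_ratio, out_ratio) -> 64
        self.fc_dw = nn.Linear(hidden, max_in * 3 * 3)         # depth-wise 3x3 weights
        self.fc_pw = nn.Linear(hidden, max_out * max_in)       # point-wise 1x1 weights

    def forward(self, in_ratio, out_ratio):
        c_in = max(1, int(round(in_ratio * self.max_in)))
        c_out = max(1, int(round(out_ratio * self.max_out)))
        h = F.relu(self.fc1(torch.tensor([in_ratio, out_ratio])))
        w_dw = self.fc_dw(h).view(self.max_in, 1, 3, 3)[:c_in]           # crop to c_in groups
        w_pw = self.fc_pw(h).view(self.max_out, self.max_in, 1, 1)[:c_out, :c_in]
        return w_dw, w_pw, c_in, c_out

# usage: generate weights for a block pruned to 50% input and 75% output channels
block = V1PruningNetBlock(max_in=128, max_out=256)
w_dw, w_pw, c_in, c_out = block(0.5, 0.75)
x = torch.randn(1, c_in, 28, 28)
y = F.conv2d(F.conv2d(x, w_dw, padding=1, groups=c_in), w_pw)   # dw conv, then pw conv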

In stochastic training, the number of output channels of each dw-conv-pw-conv block is randomly selected between a preset lower bound and the original channel count with a fixed step; a finer or coarser step can be chosen according to the desired granularity of pruning. After the PruningNet takes the network encoding vector and generates the corresponding weight matrices, the top-left part of each generated weight matrix, matching the input and output channels, is cropped and used in training, as shown in Figure 4 (a); the rest of the weights can be regarded as 'untouched' in this iteration. In different iterations, different channel width encoding vectors are generated.

4.2.2 MobileNet V2

In MobileNet V2, each stage starts with a bottleneck block that matches the dimensions between two stages. If a stage consists of more than one block, the following blocks in the stage contain a shortcut adding the input feature maps to the output feature maps, so the input and output channels within a stage must be identical, as shown in Figure 4 (b). To prune this ResNet-like structure containing shortcuts, we generate two network encoding vectors: one encodes the overall output channels of each stage, for matching the channels in the shortcut, and the other encodes the middle channels of each block. We construct the pruned network with respect to both the stage channel encoding and the middle channel encoding. In the PruningNet, we first decode this network encoding into the input compression ratio, output compression ratio and middle channel compression ratio of each block. Then we generate the corresponding weight matrices of that block by feeding this ratio vector into the corresponding PruningNet block. The PruningNet block design is the same as that for MobileNet V1, and the number of PruningNet blocks equals the number of bottleneck blocks in MobileNet V2.
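As an illustration of the two encoding vectors (our own sketch; the stage layout and channel bounds are made-up examples, not the exact MobileNet V2 configuration):

import random

# Example V2-like layout: per stage, (number of blocks, max stage output channels,
# max middle channels of each block). Values are illustrative only.
stages = [(2, 24, 144), (3, 32, 192), (4, 64, 384), (3, 96, 576)]

def sample_v2_encoding():
    stage_channels = []   # one entry per stage: shared output width, matching the shortcut
    middle_channels = []  # one entry per block: its own middle (expanded) width
    for num_blocks, max_out, max_mid in stages:
        stage_channels.append(random.randint(max_out // 4, max_out))
        middle_channels.extend(random.randint(max_mid // 4, max_mid)
                               for _ in range(num_blocks))
    return stage_channels, middle_channels

stage_enc, mid_enc = sample_v2_encoding()
print(stage_enc)   # e.g. [18, 25, 41, 80]; all blocks in a stage share this output width
print(mid_enc)     # one middle width per bottleneck block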

4.3 Comparisons with state-of-the-arts

We compare our method with the uniform pruning baselines as well as two state-of-the-art AutoML based channel pruning methods.

- Uniform Baselines MobileNets [20, 41] propose to use a width multiplier to uniformly prune the channel width of each layer to meet the resource constraints.

- AMC [18] utilizes reinforcement learning to iteratively prune channels in each layer taking the accuracy as well as the latency as the reward.

- NetAdapt [48] automatically decides how large a proportion to prune in which layer, based on a feedback loop with the latency estimated on the device.

4.3.1 Pruning under FLOPS constraint

Ratio   Uniform Top1-Acc   Uniform FLOPs   MetaPruning Top1-Acc   MetaPruning FLOPs
1.0     70.6%              569M            -                      -
0.75    68.4%              325M            70.6%                  324M
0.5     63.7%              149M            66.1%                  149M
0.25    50.6%              41M             57.2%                  41M
Table 1: This table compares the top-1 accuracy of MetaPruning method with the uniform baselines on MobileNet V1 [20].
Uniform Top1-Acc   Uniform FLOPs   MetaPruning Top1-Acc   MetaPruning FLOPs
74.7%              585M            -                      -
72.0%              313M            72.7%                  291M
67.2%              140M            68.2%                  140M
66.5%              127M            67.3%                  124M
64.0%              106M            65.0%                  105M
62.1%              87M             63.8%                  84M
54.6%              43M             58.3%                  43M
Table 2: This table compares the top-1 accuracy of the MetaPruning method with the uniform baselines on MobileNet V2 [41]. The MobileNet V2 paper only reports the accuracy at 585M and 300M FLOPs, so we apply the uniform pruning method to MobileNet V2 to obtain the baseline accuracy for the networks with other FLOPs.
Network                   FLOPs   Top1-Acc
0.75x MobileNet V1 [20]   325M    68.4%
NetAdapt [48]             284M    69.1%
AMC [18]                  285M    70.5%
MetaPruning               281M    70.4%
0.75x MobileNet V2 [41]   220M    69.8%
AMC [18]                  220M    70.8%
MetaPruning               217M    71.2%
Table 3: This table compares the top-1 accuracy of MetaPruning method with state-of-the-art AutoML methods.

Table 1 compares our accuracy with the uniform pruning baselines reported in [20]. With the pruning scheme learned by MetaPruning, we obtain 6.6% higher accuracy than the baseline 0.25x MobileNet V1. Furthermore, as our method can be generalized to prune the shortcuts in a network, we also achieve decent improvements on MobileNet V2, shown in Table 2. Previous pruning methods only prune the middle channels of the bottleneck structure [48, 18], which limits their maximum compression ratio at a given input resolution. With MetaPruning, we obtain a 3.7% accuracy boost when the model is as small as 43M FLOPs.

In Table 3, we compare MetaPruning with the state-of-the-art AutoML pruning methods. MetaPruning achieves superior or comparable results to AMC [18] and NetAdapt [48]. Moreover, MetaPruning gets rid of manually tuning the reinforcement learning hyper-parameters and can obtain pruned networks that precisely meet the FLOPs or latency constraints. With the PruningNet trained once, using the same number of epochs as normally training the target network, we can obtain multiple pruned network structures that strike different accuracy-speed trade-offs, which is more efficient than the state-of-the-art AutoML pruning methods [18, 48].

4.3.2 Pruning under latency constraint

There is increasing attention on directly optimizing the latency on the target devices. Without knowing the implementation details inside the device, MetaPruning learns to prune channels according to the latency estimated from the device.

As the number of potential Pruned Networks is enormous, measuring the latency of each network is too time-consuming. With the reasonable assumption that the execution time of each layer is independent, we can obtain the network latency by summing up the run-time of all layers in the network. Following the practice in [45, 48], we first construct a look-up table by estimating the latency of executing different convolution layers with different input and output channel widths on the target device, which is a Titan Xp GPU in our experiments. Then we can calculate the latency of the constructed network from the look-up table.
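Given such a look-up table, estimating a candidate network's latency reduces to table lookups and a sum, as in the sketch below (the table keys and the layer description format are our own assumptions):

# latency_table maps (layer_name, c_in, c_out) -> measured run-time in ms on the target
# device; layers describes the candidate pruned network as (layer_name, c_in, c_out).
def estimate_latency(layers, latency_table):
    """Sum per-layer latencies under the layer-wise independence assumption."""
    return sum(latency_table[(name, c_in, c_out)] for name, c_in, c_out in layers)

# usage with a tiny hypothetical table
latency_table = {("conv1", 3, 16): 0.02, ("dw2", 16, 16): 0.01, ("pw2", 16, 32): 0.03}
layers = [("conv1", 3, 16), ("dw2", 16, 16), ("pw2", 16, 32)]
print(estimate_latency(layers, latency_table))   # 0.02 + 0.01 + 0.03 ≈ 0.06 ms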

We carried out experiments on MobileNet V1 and V2. Table 4 and Table 5 show that the pruned networks discovered by MetaPruning achieve significantly higher accuracy than the uniform baselines with the same latency.

Ratio   Uniform Top1-Acc   Uniform Latency   MetaPruning Top1-Acc   MetaPruning Latency
1.0     70.6%              0.62ms            -                      -
0.75    68.4%              0.48ms            70.5%                  0.48ms
0.5     63.7%              0.31ms            67.4%                  0.30ms
0.25    50.6%              0.17ms            59.6%                  0.17ms
Table 4: This table compares the top-1 accuracy of the MetaPruning method with MobileNet V1 [20] under latency constraints. The reported latency is the run-time of the corresponding network on a Titan Xp with a batch size of 32.
Ratio   Uniform Top1-Acc   Uniform Latency   MetaPruning Top1-Acc   MetaPruning Latency
1.4     74.7%              0.95ms            -                      -
1.0     72.0%              0.70ms            73.2%                  0.67ms
0.65    67.2%              0.49ms            71.7%                  0.47ms
0.35    54.6%              0.28ms            64.5%                  0.29ms
Table 5: This table compares the top-1 accuracy of the MetaPruning method with MobileNet V2 [41] under latency constraints. We re-implement MobileNet V2 to obtain the results for the 0.65 and 0.35 pruning ratios. The pruning ratio refers to uniformly pruning the input and output channels of all layers.

4.4 Pruned result visualization

In channel pruning, people are curious about what the best pruning heuristic is, and many human experts work on manually designing pruning policies. With the same curiosity, we wonder whether any reasonable pruning schemes are learned by our MetaPruning method that contribute to its high accuracy. When visualizing the pruned network structures, we find that MetaPruning did learn something interesting.

Figure 5 shows the pruned network structure of MobileNet V1. We observe significant peaks in the pruned network whenever there is a down-sampling operation. When the down-sampling occurs with a stride-2 depth-wise convolution, the resolution degradation in the feature map needs to be compensated by using more channels to carry the same amount of information. Thus, MetaPruning automatically learns to keep more channels at the down-sampling layers. The same phenomenon is also observed in MobileNet V2, shown in Figure 6: the middle channels are pruned less when the corresponding block is responsible for shrinking the feature map size.

Moreover, when we automatically prune the shortcut channels in MobileNet V2 with MetaPruning, we find that, although the 145M pruned network contains only half of the FLOPs of the 300M pruned network, the 145M network keeps a similar number of channels in the last stages as the 300M network and prunes more channels in the early stages. We suspect this is because the classifier for the ImageNet dataset contains 1000 output nodes, and thus more channels are needed at the later stages to extract sufficient features. When the FLOPs are restricted to 45M, the network almost reaches the maximum pruning ratio and has no choice but to prune the channels in the later stages, and the accuracy degradation from the 145M network to the 45M network is much more severe than that from 300M to 145M.

Figure 5: This figure presents the number of output channels of each block of the pruned MobileNet V1. Each block contains a 3x3 depth-wise convolution followed by a 1x1 point-wise convolution, except the first block, which is composed of a 3x3 convolution only.
Figure 6: A MobileNet V2 block is constructed by concatenating a 1x1 point-wise convolution, a 3x3 depth-wise convolution and a 1x1 point-wise convolution. This figure illustrates the number of middle channels of each block.
Figure 7: In MobileNet V2, each stage starts with a bottleneck block with different input and output channels, followed by several repeated bottleneck blocks. Those bottleneck blocks with the same input and output channels are connected with shortcuts. MetaPruning prunes the channels in the shortcut jointly with the middle channels. This figure illustrates the number of shortcut channels in each stage after MetaPruning.

4.5 Ablation study

In this section, we discuss the effect of weight prediction in the MetaPruning method.

We wondered what the consequence would be if we did not use the two fully-connected layers in the PruningNet for weight prediction, but instead directly applied the proposed stochastic training and cropped the same weight matrix to match the input and output channels of the Pruned Network. We compare the performance of the PruningNet with and without weight prediction. We select the channel numbers by uniformly pruning each layer at a ratio ranging from 0.25 to 1, and evaluate the accuracy of both networks. From Figure 8, we see that the PruningNet without weight prediction achieves about 10% lower accuracy. We further use the PruningNet without weight prediction to search for a pruned MobileNet V1 with less than 45M FLOPs; the obtained network achieves only 55.3% top-1 accuracy, 1.9% lower than the pruned network obtained with weight prediction. This is intuitive: for example, the weight matrix for an input channel width of 64 may no longer be optimal when the total number of input channels is increased to 128 with 64 more channels appended. In that case, the weight prediction mechanism in meta learning is effective in de-correlating the weights of different pruned structures and thus achieves much higher accuracy for the PruningNet.
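To make the ablated variant concrete, the sketch below (our own schematic code) contrasts the two settings: without weight prediction, a single shared full-size weight tensor is simply cropped for every sampled width, whereas with weight prediction the weights are a function of the channel-width encoding.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CropOnly(nn.Module):
    """Ablation baseline: one shared full-size weight, cropped for every sampled width."""
    def __init__(self, max_in, max_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out, max_in, 1, 1) * 0.01)

    def forward(self, c_in, c_out):
        return self.weight[:c_out, :c_in]             # same weights regardless of the encoding

class WithPrediction(nn.Module):
    """PruningNet-style: weights are generated from the channel-width encoding."""
    def __init__(self, max_in, max_out, hidden=64):
        super().__init__()
        self.max_in, self.max_out = max_in, max_out
        self.fc1 = nn.Linear(2, hidden)
        self.fc2 = nn.Linear(hidden, max_out * max_in)

    def forward(self, c_in, c_out):
        enc = torch.tensor([c_in / self.max_in, c_out / self.max_out])
        w = self.fc2(F.relu(self.fc1(enc))).view(self.max_out, self.max_in, 1, 1)
        return w[:c_out, :c_in]                       # different weights for each encoding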

Figure 8: We compare the performance of the PruningNet with weight prediction and that without weight prediction by inferring the accuracy of several uniformly pruned networks of MobileNet V1 [20]. The PruningNet with weight prediction achieves much higher accuracy than that without weight prediction.

5 Conclusion

In this work, we have presented MetaPruning for channel pruning. This meta learning approach has the following advantages: 1) it achieves much higher accuracy than the uniform pruning baselines and better or comparable accuracy than other AutoML-based channel pruning methods; 2) it can flexibly optimize with respect to different constraints without introducing extra hyperparameters; 3) ResNet-like architectures can be effectively handled; 4) the whole pipeline is highly efficient.

References