To filter prune, or to layer prune, that is the question

07/11/2020 ∙ by Sara Elkerdawy, et al. ∙ 37

Recent advances in neural network pruning have made it possible to remove a large number of filters or weights without any perceptible drop in accuracy. The number of parameters and the number of FLOPs are the metrics usually reported to measure the quality of pruned models. However, the gain in speed from these pruning methods is often overlooked in the literature because latency is complex to measure. In this paper, we show the limitations of filter pruning methods in terms of latency reduction and propose the LayerPrune framework. LayerPrune presents a set of layer pruning methods based on different criteria that achieve higher latency reduction than filter pruning methods at similar accuracy. The advantage of layer pruning over filter pruning in terms of latency reduction results from the fact that the former is not constrained by the original model's depth and thus allows for a larger range of latency reduction. For each filter pruning method we examined, we use the same filter importance criterion to calculate a per-layer importance score in one shot. We then prune the least important layers and fine-tune the shallower model, which obtains comparable or better accuracy than its filter-based pruning counterpart. This one-shot process makes it possible to remove layers from single-path networks like VGG before fine-tuning; in iterative filter pruning, by contrast, a minimum number of filters per layer must be kept to allow for data flow, which constrains the search space. To the best of our knowledge, we are the first to examine the effect of pruning methods on latency rather than FLOPs for multiple networks, datasets and hardware targets. LayerPrune also outperforms handcrafted architectures such as Shufflenet, MobileNet, MNASNet and ResNet18 by 7.3 dataset.


1 Introduction

Convolutional Neural Networks (CNNs) have become the state of the art in various computer vision tasks, e.g., image classification [20], object detection [34], and depth estimation [5]. These CNN models are designed with deeper [10] and wider [43] convolutional layers with a large number of parameters and convolutional operations. These architectures hinder deployment on low-power devices, e.g., phones, robots, and wearable devices, as well as in real-time critical applications such as autonomous driving. As a result, computationally efficient models are becoming increasingly important, and multiple paradigms have been proposed to minimize the complexity of CNNs.

A straightforward direction is to manually design networks with a small footprint from the start, such as [29, 13, 41, 15, 14]. This direction not only requires expert knowledge and multiple trials (e.g., up to 1000 neural architectures explored manually [21]), but also does not benefit from available pre-trained large models. Quantization [40, 17] and distillation [44, 18] are two other techniques, which utilize pre-trained models to obtain smaller architectures. Quantization reduces the bit-width of parameters and thus decreases the memory footprint, but requires specialized hardware instructions to achieve latency reduction. Distillation, in turn, trains a pre-defined smaller model (student) with guidance from a larger pre-trained model (teacher) [44]. Finally, model pruning aims to automatically remove the least important filters (or weights) to reduce the number of parameters or FLOPs (i.e., indirect measures). However, prior work [46, 47, 2] showed that neither the number of pruned parameters nor the FLOPs reduction correlates directly with latency (i.e., a direct measure). Latency reduction in that case depends on various aspects, such as the number of filters per layer (signature) and the deployment device. Most GPU programming tools require careful tuning of compute kernels (a compute kernel is a function, such as a convolution operation, that runs on a high-throughput accelerator such as a GPU) for different matrix shapes (e.g., convolution weights) [39, 32]. These aspects introduce non-linearity in modeling latency with respect to the number of filters per layer. Recognizing the limitations, in terms of latency or energy, of simply pruning away filters, recent works [47, 45, 46] proposed optimizing directly over these direct measures. These methods require collecting latency measurements per hardware platform and per architecture to create lookup tables or a latency prediction model, which can be time-intensive. In addition, these filter pruning methods are bounded by the model's depth and can only reach a limited latency reduction goal.

In this work, we show the limitations of filter pruning methods in terms of latency reduction. Fig. 1 shows the range of attainable latency reduction on randomly generated models. Each box summarizes the latency reduction of 100 random models with filter and layer pruning on different network architectures and hardware platforms. For each filter-pruned model, a bounded pruning ratio is randomly generated for every layer, so the models differ in signature/width. For each layer-pruned model, a subset of the network's layers (the total depends on the network) is randomly selected for retention, so the models differ in depth. As expected, layer pruning has a higher upper bound on latency reduction than filter pruning, especially on modern complex architectures with residual blocks. However, we want the plot to highlight quantitatively the discrepancy in attainable latency reduction between the two methods. Filter pruning is constrained not only by the depth of the model but also by the connection dependency in the architecture. An example of such a connection dependency is the element-wise sum in a residual block between the identity connection and the residual connection. Filter pruning methods commonly prune only the in-between convolution layers in a residual block to respect the number of channels and spatial dimensions. BAR [22] proposed an atypical residual block that allows mixed connectivity between blocks to tackle the issue. However, this requires special implementations to leverage the speedup gain. Another limitation of filter pruning is its iterative process, which is constrained to keep a minimum number of filters per layer during optimization to allow data to pass through. LayerPrune performs one-shot pruning before fine-tuning and thus allows layer removal even from single-path networks.

Figure 2: Evaluation on ImageNet between our LayerPrune framework, handcrafted architectures (dots) and pruning methods on ResNet50 (crosses). Inference time is measured on 1080Ti GPU.

Motivated by these points, what remains to ask is how well layer-pruned models perform in terms of accuracy compared to filter-pruned models. Fig. 2 shows accuracy and images per second for our LayerPrune, several state-of-the-art pruning methods, and several handcrafted architectures. In general, pruning methods tend to find better quality models than handcrafted architectures. It is worth noting that filter pruning methods such as ThiNet [27] and Taylor [30] show only a small speedup gain as more filters are pruned, compared to LayerPrune. This shows the limitation of filter pruning methods on latency reduction.

2 Related Work

We divide existing pruning methods into four categories: weight pruning, hardware-agnostic filter pruning, hardware-aware filter pruning and layer pruning.

Weight pruning. An early major category in pruning is individual weight pruning (unstructured pruning). Weight pruning methods leverage the fact that some weights have minimal effect on the task accuracy and thus can be zeroed out. In [8], weights with small magnitude are removed, and in [7], quantization is further applied to achieve more model compression. Another, data-free pruning method is [37], where neurons are removed iteratively from fully connected layers. A regularization-based method [26] is proposed to encourage network sparsity in training. Finally, in the lottery ticket hypothesis [6], the authors propose a method for finding winning tickets, which are subnetworks from random initialization that achieve higher accuracy than the dense model. The limitation of unstructured weight pruning is that dedicated hardware and libraries [35] are needed to achieve speedup from the compression. Given our focus on latency, and to keep the evaluation setup simple, we do not consider these methods in our evaluation.

Hardware-agnostic filter pruning. Methods in this category (also known as structured pruning) aim to reduce the footprint of a model by pruning filters without any knowledge of the inference resource consumption. Examples of these are [30, 25, 27, 42, 23], which focus on removing the least important filters and obtaining a slimmer model. Earlier filter pruning methods [27, 23] required a layer-wise sensitivity analysis to generate the signature (i.e., the number of filters per layer) as a prior and then removed filters based on a filter criterion. The sensitivity analysis is computationally expensive to conduct and becomes even less feasible for deeper models. Recent methods [30, 25, 42] learn a global importance measure, removing the need for sensitivity analysis. Molchanov et al. [30] propose a Taylor approximation on the network's weights, where the filter's gradients and norm are used to approximate its global importance score. Liu et al. [25] and Wen et al. [42] propose a sparsity loss for training alongside the classification cross-entropy loss. Filters whose criterion is below a threshold are removed, and the pruned model is finally fine-tuned. Zhao et al. [48] introduce a channel saliency that is parameterized as a Gaussian distribution and optimized in the training process. After training, channels with small mean and variance are pruned. In general, methods with a sparsity loss lack a simple approach to respect a resource consumption target and require hyperparameter tuning to balance the different losses.

Hardware-aware filter pruning. To respect a resource consumption budget, recent works [4, 47, 45, 11] take a resource target into consideration within the optimization process. NetAdapt [47] prunes a model to meet a target budget using a heuristic greedy search. A lookup table is built for latency prediction, and then multiple candidates are generated at each pruning iteration by pruning a fraction of the filters from each layer independently. The candidate with the highest accuracy is then selected, and the process continues to the next pruning iteration with a progressively increasing pruning fraction. On the other hand, AMC [11] and ECC [45] propose end-to-end constrained pruning. AMC utilizes reinforcement learning to select a model's signature by trial and error. ECC simplifies the latency reduction model to a bilinear per-layer model. The training utilizes the alternating direction method of multipliers (ADMM) to perform constrained optimization by alternating between network weight optimization and dual variables that control the layer-wise pruning ratio. Although these methods incorporate resource consumption as a constraint in the training process, the range of attainable budgets is limited by the depth of the model. In addition, generating measurements to model resource consumption per hardware platform and architecture can be expensive, especially on low-end hardware platforms.

Layer pruning. Unlike filter pruning, little attention has been paid to shallowing CNNs in the pruning literature. In SSS [16], the authors propose to train a scaling factor for structure selection over neurons, blocks, and groups. However, shallower models are only possible for architectures with residual connections, which allow data flow during the optimization process. Closest to our work as a general layer pruning approach (i.e., not constrained by architecture type) is the work of Chen et al. [3]. In their method, linear classifier probes are trained independently per layer for layer ranking. After the layer-ranking learning stage, they prune the least important layers and fine-tune the shallower model. Although [3] requires training for ranking, it brings no gain in classification accuracy compared to our one-shot LayerPrune layer ranking, as will be shown in the experiments section.

Figure 3: The main pipeline illustrates the difference between typical iterative filter pruning and the proposed LayerPrune framework. Filter pruning (top) produces a thinner architecture in an iterative process, while LayerPrune (bottom) prunes whole layers in one shot. In LayerPrune, a layer's importance is calculated as the average importance of all filters in that layer.

3 Methodology

In this section, we describe in detail LayerPrune for layer pruning using existing filter criteria, along with a novel layer-wise accuracy approximation.

3.1 Layer criteria

A typical filter pruning method follows a three-stage pipeline, as illustrated in Figure 3. Filter importance is iteratively re-evaluated after each pruning step based on a pruning meta-parameter, such as pruning N filters or pruning those below a threshold. In LayerPrune, we remove the need for the iterative pruning step and show that, using the same filter criterion, we can remove layers in one shot to respect a budget. This simplifies the pruning step to a hyperparameter-free process and is computationally efficient. Layer importance is calculated as the average of the filter importance in the layer. Unlike [3], LayerPrune does not require training for layer ranking and leverages the filters' statistics.
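The one-shot ranking described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the layer names, and the per-filter scores are hypothetical, and the scores could come from any of the filter criteria discussed in Section 3.2.

```python
from statistics import mean

def layer_importance(filter_scores):
    """Average the per-filter importance scores into one score per layer."""
    return {name: mean(scores) for name, scores in filter_scores.items()}

def select_layers_to_prune(filter_scores, num_to_prune):
    """Return the names of the `num_to_prune` least important layers."""
    scores = layer_importance(filter_scores)
    ranked = sorted(scores, key=scores.get)  # ascending importance
    return ranked[:num_to_prune]

# Hypothetical per-filter scores (e.g. weight norms) for three layers.
filter_scores = {
    "conv1": [0.9, 1.1, 1.0],
    "conv2": [0.2, 0.4],
    "conv3": [0.6, 0.8, 0.7, 0.5],
}
pruned = select_layers_to_prune(filter_scores, num_to_prune=1)
print(pruned)
```

Because the ranking is computed once from the pre-trained model, no pruning meta-parameter other than the number of layers to remove is involved; the selected layers would then be dropped and the shallower model fine-tuned.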

Layer-wise imprinting. In addition to existing filter criteria, we present a novel layer importance measure based on a layer-wise accuracy approximation. Motivated by the few-shot learning literature [33, 28], we use imprinting to approximate the classification accuracy up to each layer. Imprinting is used to approximate a classifier's weight matrix when only a few training samples are available. Although we have adequate training samples, we are inspired by the efficiency of imprinting to approximate the accuracy in one pass without the need for training. We create a classifier proxy for each prunable candidate (e.g., a convolution layer or a residual block), and then the training data is used to imprint the classifier weight matrix for each proxy. Since each layer has a different output feature shape, we apply adaptive average pooling to simplify our method and unify the embedding length, so that each layer produces an output of roughly the same size. Specifically, the pooling is done as follows:

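$$d_i = \mathrm{round}\left(\sqrt{N / n_i}\right), \qquad E_i = \mathrm{AdaptiveAvgPool}(o_i, d_i) \tag{1}$$

where $N$ is the embedding length, $n_i$ is layer $i$'s number of filters, $o_i$ is layer $i$'s output feature map, and AdaptiveAvgPool [9] reduces $o_i$ to the embedding $E_i \in \mathbb{R}^{d_i \times d_i \times n_i}$. Finally, the embeddings per layer are flattened to be used in imprinting. Imprinting calculates the proxy classifier's weight matrix as follows: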

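$$W_i[:, c] = \frac{1}{N_c} \sum_{j=1}^{D} I_{[c_j = c]} \, E_j \tag{2}$$

where $W_i$ is the proxy classifier's weight matrix for layer $i$, $E_j$ is the flattened embedding of sample $j$ at that layer, $c$ is the class id, $c_j$ is sample $j$'s class id, $N_c$ is the number of samples in class $c$, $D$ is the total number of samples, and $I_{[\cdot]}$ denotes the indicator function.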

The accuracy at each proxy is then calculated using the imprinted weight matrices. The prediction for each sample is calculated for each layer as:

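$$\hat{y}_j = \operatorname*{argmax}_{c} \; W_i[:, c]^{\top} E_j \tag{3}$$

where $\hat{y}_j$ is the predicted class of sample $j$, $W_i[:, c]$ is the imprinted weight of class $c$ at layer $i$, and $E_j$ is calculated as shown in Eq. (1). This is equivalent to finding the nearest class from the imprinted weights in the embedding space. The ranking of each layer is then calculated as the gain in accuracy over the previous pruning candidate.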
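The imprinting steps of this section can be illustrated with a small, dependency-free sketch. Several things here are assumptions rather than the authors' code: the rule for choosing the pooled spatial size (picked so that the flattened embedding is close to the target length), the omission of any embedding normalization, the convention that the first candidate's gain equals its own accuracy, and all function names and toy data.

```python
import math

def pooled_size(embedding_length, num_filters):
    """Spatial size d with d * d * num_filters close to embedding_length (assumed rule)."""
    return max(1, round(math.sqrt(embedding_length / num_filters)))

def imprint(embeddings, labels):
    """Imprinted weights: for each class, the mean embedding of its samples."""
    sums, counts = {}, {}
    for emb, c in zip(embeddings, labels):
        acc = sums.setdefault(c, [0.0] * len(emb))
        for k, value in enumerate(emb):
            acc[k] += value
        counts[c] = counts.get(c, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict(weights, emb):
    """Class whose imprinted weight vector has the largest dot product with emb."""
    return max(weights, key=lambda c: sum(w * e for w, e in zip(weights[c], emb)))

def proxy_accuracy(weights, embeddings, labels):
    """Fraction of samples whose predicted class matches the label."""
    hits = sum(predict(weights, e) == c for e, c in zip(embeddings, labels))
    return hits / len(labels)

def accuracy_gains(accuracies):
    """Gain of each candidate over the previous one.

    The first candidate keeps its own accuracy as its gain (an assumption).
    """
    return [accuracies[0]] + [b - a for a, b in zip(accuracies, accuracies[1:])]

# Toy flattened embeddings for one layer and two classes (made-up data).
embeddings = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
labels = [0, 0, 1, 1]
weights = imprint(embeddings, labels)
accuracy = proxy_accuracy(weights, embeddings, labels)
print(weights, accuracy)
```

In a full pipeline, `proxy_accuracy` would be evaluated once per prunable candidate, and the resulting list of accuracies passed to `accuracy_gains` to obtain the layer ranking; candidates with the smallest gain would be pruned first.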

3.2 Filter criteria

Although existing filter pruning methods are different in algorithms and optimization used, they focus more on finding the optimal per-layer number of filters and share common filter criteria. We divide the methods based on the filter criterion used and propose their layer importance counterpart used in LayerPrune.

Preliminary notation. Consider a network with $L$ layers; each layer $l$ has a weight matrix $W^{(l)} \in \mathbb{R}^{N_l \times F_l \times K_l \times K_l}$ with $N_l$ input channels and $F_l$ filters, where $K_l$ is the size of the filters at this layer. The evaluated criteria and methods are:

Weight statistics. The methods in [8, 23, 45] differ in the optimization algorithm but share weight statistics as the filter ranking. Layer importance for this criterion is calculated as:

(4)

Taylor weights. The Taylor method [30] differs slightly from the previous criterion in that gradients are included in the ranking as well. Filter ranking is based on the products $g_s w_s$, where $s$ iterates over all individual weights in $W$, $g_s$ is the gradient, and $w_s$ is the weight value. Similarly, layer ranking can be expressed as:

(5)

where $\odot$ is the element-wise product and $G^{(l)}$ is the gradient of the loss with respect to the weights $W^{(l)}$.

Feature map based heuristics. The methods in [27, 31, 24] rank filters based on statistics of the layer's output. In [27], ranking is based on the effect on the next layer, while [31], similar to Taylor weights, utilizes gradients and norms, but on feature maps.

Channel saliency. In this criterion, a scalar is multiplied by the feature maps and optimized within a typical training cycle with a task loss and a sparsity regularization loss to encourage sparsity. Slimming [25] utilizes the Batch Normalization scale $\gamma$ as the channel saliency. Similarly, we use the Batch Normalization scale parameter to calculate layer importance for this criterion, specifically:

(6)

Ensemble. We also consider a diverse ensemble of layer ranks, where the ensemble rank of each layer is the sum of its ranks over all methods. More specifically:

$$\mathrm{EnsembleRank}(l) = \sum_{m=1}^{M} \mathrm{LayerRank}(m, l) \tag{7}$$

where $l$ is the layer's index, $M$ is the number of all criteria, and $\mathrm{LayerRank}(m, l)$ indicates the order of layer $l$ in the sorted list for criterion $m$.
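The ensemble ranking can be sketched as a rank sum over criteria. The snippet below is an illustrative sketch under assumptions: the criterion names, the scores, and the convention that rank 0 denotes the least important layer are all hypothetical choices made for the example.

```python
def layer_ranks(scores):
    """Map each layer to its position in the list sorted by ascending score.

    Rank 0 is the least important layer (an assumed convention).
    """
    ordered = sorted(scores, key=scores.get)
    return {layer: rank for rank, layer in enumerate(ordered)}

def ensemble_rank(scores_per_criterion):
    """Sum each layer's rank over all criteria."""
    total = {}
    for scores in scores_per_criterion.values():
        for layer, rank in layer_ranks(scores).items():
            total[layer] = total.get(layer, 0) + rank
    return total

# Made-up layer importance scores for three criteria.
scores_per_criterion = {
    "weight": {"a": 0.1, "b": 0.5, "c": 0.3},
    "bn": {"a": 0.2, "b": 0.9, "c": 0.1},
    "taylor": {"a": 0.05, "b": 0.2, "c": 0.4},
}
ensemble = ensemble_rank(scores_per_criterion)
least_important = min(ensemble, key=ensemble.get)
print(ensemble, least_important)
```

Summing ranks rather than raw scores keeps criteria with different scales comparable, at the cost of discarding how large the score gaps are within each criterion.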

4 Evaluation Results

In this section, we present our experimental results comparing state-of-the-art pruning methods and LayerPrune in terms of accuracy and latency reduction on two different hardware platforms. We report latency on a high-end 1080Ti GPU and on the NVIDIA Jetson Xavier embedded device, which is used in mobile vision systems and contains a 512-core Volta GPU. We evaluate the methods on the CIFAR10/100 [19] and ImageNet [20] datasets.

4.1 Implementation details

Latency calculation. Latency is averaged over 1000 forward passes after 10 warm-up forward passes for lazy GPU initialization. Latency is calculated using batch size 1, unless otherwise stated, due to its practical importance in real-time applications such as robotics, where an online stream of frames is processed. All pruned architectures are implemented and measured using PyTorch [1]. For a fair comparison, we compare latency reduction at similar accuracy retention from the baseline as reported by the original papers, or compare accuracy at similar latency reduction with methods supporting layer or block pruning.
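The measurement protocol above (warm-up passes followed by an average over timed passes) and the latency reduction (LR) metric used in the tables can be sketched as follows. This is a generic sketch, not the authors' benchmarking script: `measure_latency`, `latency_reduction`, and the dummy workload are hypothetical names, and the LR formula is the usual relative reduction, which is assumed here to match the paper's definition.

```python
import time

def measure_latency(forward, warmup=10, runs=1000):
    """Average wall-clock seconds per call of `forward` after warm-up calls.

    On a GPU, work is queued asynchronously, so `forward` should block until
    its result is ready (in PyTorch, e.g. by calling torch.cuda.synchronize()).
    """
    for _ in range(warmup):
        forward()
    start = time.perf_counter()
    for _ in range(runs):
        forward()
    return (time.perf_counter() - start) / runs

def latency_reduction(baseline, pruned):
    """Latency reduction in percent; negative means the pruned model is slower."""
    return 100.0 * (1.0 - pruned / baseline)

# Dummy CPU workload standing in for a model's forward pass.
call_log = []

def dummy_forward():
    call_log.append(1)
    return sum(range(1000))

seconds = measure_latency(dummy_forward)
lr_percent = latency_reduction(baseline=10.0, pruned=4.0)
print(seconds, lr_percent)
```

With this definition, a model that runs 2.5x faster than the baseline corresponds to a latency reduction of about 60%, and a negative value indicates that pruning increased latency.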
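Handling filter shapes after layer removal. If the pruned layer $l$, with weight $W^{(l)} \in \mathbb{R}^{N_l \times F_l \times K_l \times K_l}$ ($N_l$ input channels and $F_l$ filters), has $N_l \neq F_l$, we replace layer $l+1$'s weight matrix from $W^{(l+1)} \in \mathbb{R}^{F_l \times F_{l+1} \times K_{l+1} \times K_{l+1}}$ to $W^{(l+1)} \in \mathbb{R}^{N_l \times F_{l+1} \times K_{l+1} \times K_{l+1}}$ with random initialization. All other layers are initialized from the pre-trained dense model.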
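The shape bookkeeping for a single-path network can be illustrated with a short sketch. It assumes the reading that the layer following a removed layer must accept the removed layer's input channels whenever the removed layer changed the channel count; the function name and the example channel sizes are hypothetical.

```python
def remove_layer(shapes, index):
    """Remove one layer from a single-path network.

    shapes: list of (in_channels, out_channels), one entry per layer.
    Returns (new_shapes, reinit), where reinit lists the positions in
    new_shapes whose weights need random re-initialization.
    """
    removed_in, removed_out = shapes[index]
    new_shapes = shapes[:index] + shapes[index + 1:]
    reinit = []
    if removed_in != removed_out and index < len(new_shapes):
        # The next layer now receives removed_in channels instead of removed_out.
        _, next_out = new_shapes[index]
        new_shapes[index] = (removed_in, next_out)
        reinit.append(index)
    return new_shapes, reinit

# Hypothetical VGG-like channel configuration.
shapes = [(3, 64), (64, 64), (64, 128), (128, 128)]
same_width = remove_layer(shapes, 1)     # 64 -> 64: nothing to re-initialize
width_change = remove_layer(shapes, 2)   # 64 -> 128: next layer is reshaped
print(same_width, width_change)
```

Only the layer directly after a width-changing removal is affected; every other layer keeps its pre-trained weights.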

4.2 Results on CIFAR

We evaluate CIFAR-10 and CIFAR-100 on ResNet56 [10] and VGG19-BN [36].

Figure 4: Example of 100 random filter-pruned and layer-pruned models generated from VGG19-BN (Top-1 = 73.11%). Accuracy mean and standard deviation are shown in parentheses. Latency is calculated on 1080Ti with batch size 8.

4.2.1 Random filters vs. Random layers

As an initial verification of our hypothesis, we generate random filter-pruned and layer-pruned models and then train them to compare their accuracy and latency reduction. The random model generation follows the same setup as explained in Section 1. Each model is trained with SGD for 164 epochs with a learning rate of 0.1 that decays by a factor of 0.1 at epochs 81, 121, and 151. Figure 4 shows the latency-accuracy plot for both random pruning methods. Layer-pruned models outperform filter-pruned ones in accuracy by 7.09% on average and can achieve up to 60% latency reduction. In addition, within the same latency budget, filter pruning shows higher variance in accuracy than layer pruning. This suggests that latency-constrained optimization with filter pruning is complex and requires careful selection of the per-layer pruning ratio. Layer pruning, on the other hand, generally shows small accuracy variation within a budget.
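The training schedule described in this section can be written as a small step function. This is a sketch under assumptions: epochs are taken to be 0-indexed, the decay is applied from the milestone epoch onward, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.1, gamma=0.1, milestones=(81, 121, 151)):
    """Step schedule: multiply base_lr by gamma for every milestone reached."""
    passed = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** passed

# Learning rate at the start of training and right after each milestone.
schedule = [learning_rate(e) for e in (0, 81, 121, 151)]
print(schedule)
```

Over the 164 training epochs, the rate therefore steps through four values, each roughly one tenth of the previous one.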

4.2.2 VGG19-BN

Results on CIFAR-100 are presented in Table 1. The table is divided according to the filter criterion categories introduced in Section 3.2. First, we compare with Chen et al. [3] at a similar latency reduction, as both [3] and LayerPrune perform layer pruning. Although [3] requires training for layer ranking, LayerPrune outperforms it by 1.11%. We achieve up to 56% latency reduction with a 1.52% accuracy increase over the baseline. As VGG19-BN is over-parametrized for CIFAR-100, removing layers acts as regularization and can find models with better accuracy than the baseline. Filter pruning methods, in contrast, are bounded by small accuracy variations around the baseline. It is worth mentioning that the latency reduction obtained by removing a similar number of filters using different filter criteria varies from -0.06% to 40.0%, while for layer pruning methods with the same number of pruned layers it ranges from 34.3% to 41% regardless of the criterion. This suggests that latency reduction using filter pruning is sensitive to the environment setup and requires complex optimization to respect a latency budget.

Figure 5: Layer-wise accuracy using imprinting on CIFAR-100.

To further explain the accuracy increase achieved by LayerPrune, Fig. 5 shows the layer-wise accuracy approximation on the baseline VGG19-BN using the imprinting method explained in Section 3.1. Each bar represents the approximated classification accuracy up to this layer (rounded for visualization). We see a drop in accuracy followed by an increasing trend from conv10 to conv15. This is likely because the number of features is the same from conv10 to conv12. We start to observe an accuracy increase only at conv13, which follows a max pooling layer and has twice as many features. This highlights the importance of downsampling and of doubling the number of features at this point of the model. Layer pruning thus not only improves inference speed but can also discover a better regularized, shallower model, especially on small datasets. It is also worth mentioning that the proxy classifier from the last layer, conv16, and the actual model classifier, GT, have the same accuracy, showing that the proxy classifier is a plausible approximation of the converged classifier.

VGG19 (73.11%)
Method | Shallower? | Top-1 accuracy (%) | LR (%) 1080Ti bs=8 | LR (%) 1080Ti bs=64 | LR (%) Xavier bs=8 | LR (%) Xavier bs=64
Chen et al. [3] | yes | 73.25 | 56.01 | 52.86 | 58.06 | 49.86
LayerPrune-Imprint | yes | 74.36 | 56.10 | 53.67 | 57.79 | 49.10
Weight norm [8] | no | 73.01 | -2.044 | -0.873 | -4.256 | -0.06
ECC [45] | no | 72.71 | 16.37 | 36.70 | 29.17 | 36.69
LayerPrune | yes | 73.60 | 17.32 | 14.57 | 19.512 | 10.97
LayerPrune | yes | 74.80 | 39.84 | 37.85 | 41.86 | 34.38
Slimming [25] | no | 72.32 | 16.84 | 40.08 | 40.55 | 39.53
LayerPrune | yes | 73.60 | 17.34 | 13.86 | 18.85 | 10.90
LayerPrune | yes | 74.80 | 39.56 | 37.30 | 41.40 | 34.35
Taylor [30] | no | 72.61 | 15.87 | 19.77 | -4.89 | 17.45
LayerPrune | yes | 73.60 | 17.12 | 13.54 | 18.81 | 10.89
LayerPrune | yes | 74.80 | 39.36 | 37.12 | 41.34 | 34.44
Table 1: Comparison of different pruning methods on VGG19-BN CIFAR-100. The accuracy of the baseline model is shown in parentheses. LR and bs stand for latency reduction and batch size, respectively. The subscript of LayerPrune indicates the number of layers removed. A negative LR indicates an increase in latency. Shallower indicates whether a method prunes layers. Best is shown in bold.

4.2.3 ResNet56

We also compare on the more complex ResNet56 architecture on CIFAR-10 and CIFAR-100 in Table 2. At a similar latency reduction, LayerPrune outperforms [3] by 0.54% and 1.23% on CIFAR-10 and CIFAR-100, respectively. Within each filter criterion, LayerPrune outperforms filter pruning and is on par with the baseline in accuracy. In addition, filter pruning can result in a latency increase (i.e., negative LR) for specific hardware targets and batch sizes [38], as shown with batch size 8. LayerPrune, however, consistently shows latency reduction under different environmental setups. We also compare with a larger batch size to further encourage filter-pruned models to better utilize the resources. Still, we found that LayerPrune achieves overall better latency reduction with a large batch size. The latency reduction variance, LR var, between different batch sizes within the same hardware platform is shown as well. Consistent with the previous results on VGG, LayerPrune is less sensitive to changes in criterion, batch size, and hardware than filter pruning. We also show results with up to 2.5x latency reduction and less than 2% accuracy drop.

Method | Shallower? | Top-1 accuracy (%) | LR (%) 1080Ti bs=8 | LR (%) 1080Ti bs=64 | LR (%) Xavier bs=8 | LR (%) Xavier bs=64
CIFAR-10 ResNet56 baseline (93.55%)
Chen et al. [3] | yes | 93.09 | 26.60 | 26.31 | 26.96 | 25.66
LayerPrune-Imprint | yes | 93.63 | 26.41 | 26.32 | 27.30 | 29.11
Taylor weight [30] | no | 93.15 | 0.31 | 5.28 | -0.11 | 2.67
LayerPrune | yes | 93.49 | 2.864 | 3.80 | 5.97 | 5.82
LayerPrune | yes | 93.35 | 6.46 | 8.12 | 9.33 | 11.38
Weight norm [8] | no | 92.95 | -0.90 | 5.22 | 1.49 | 3.87
L1 norm [23] | no | 93.30 | -1.09 | -0.48 | 2.31 | 1.64
LayerPrune | yes | 93.50 | 2.72 | 3.88 | 7.08 | 5.67
LayerPrune | yes | 93.39 | 5.84 | 7.94 | 10.63 | 11.45
Feature maps [31] | no | 92.7 | -0.79 | 6.17 | 1.09 | 8.38
LayerPrune | yes | 92.61 | 3.29 | 2.40 | 7.77 | 2.76
LayerPrune | yes | 92.28 | 6.68 | 5.63 | 11.11 | 5.05
Batch Normalization [25] | no | 93.00 | 0.6 | 3.85 | 2.26 | 1.42
LayerPrune | yes | 93.49 | 2.86 | 3.88 | 7.08 | 5.67
LayerPrune | yes | 93.35 | 6.46 | 7.94 | 10.63 | 11.31
LayerPrune-Imprint | yes | 92.49 | 57.31 | 55.14 | 57.57 | 63.27
CIFAR-100 ResNet56 baseline (71.2%)
Chen et al. [3] | yes | 69.77 | 38.30 | 34.31 | 38.53 | 39.38
LayerPrune-Imprint | yes | 71.00 | 38.68 | 35.83 | 39.52 | 54.29
Taylor weight [30] | no | 71.03 | 2.13 | 5.23 | -1.1 | 3.75
LayerPrune | yes | 71.15 | 3.07 | 3.74 | 3.66 | 5.50
LayerPrune | yes | 70.82 | 6.44 | 7.18 | 7.30 | 11.00
Weight norm [8] | no | 71.00 | 2.52 | 6.46 | -0.3 | 3.86
L1 norm [23] | no | 70.65 | -1.04 | 4.06 | 0.58 | 1.34
LayerPrune | yes | 71.26 | 3.10 | 3.68 | 4.22 | 5.47
LayerPrune | yes | 71.01 | 6.59 | 7.03 | 8.00 | 10.94
Feature maps [31] | no | 70.00 | 1.22 | 9.49 | -1.27 | 7.94
LayerPrune | yes | 71.10 | 2.81 | 3.24 | 4.46 | 5.56
LayerPrune | yes | 70.36 | 6.06 | 6.70 | 7.72 | 7.85
Batch Normalization [25] | no | 70.71 | 0.37 | 2.26 | -1.02 | 2.89
LayerPrune | yes | 71.26 | 3.10 | 3.68 | 4.22 | 5.47
LayerPrune | yes | 70.97 | 6.36 | 6.78 | 7.59 | 10.94
LayerPrune-Imprint | yes | 68.45 | 60.69 | 57.15 | 61.32 | 71.65
Table 2: Comparison of different pruning methods on ResNet56 CIFAR-10/100. The accuracy of the baseline model is shown in parentheses. LR and bs stand for latency reduction and batch size, respectively. The subscript of LayerPrune indicates the number of blocks removed.

4.3 Results on ImageNet

We evaluate the methods on the challenging ImageNet dataset for classification. For all experiments in this section, PyTorch pretrained models are used as the starting point for network pruning. We follow the same setup as in [30], where we prune 100 filters every 30 minibatches for 10 pruning iterations. (An ablation study with different hyperparameters is presented in the supplementary material.) The pruned model is then fine-tuned with learning rate 1e using the SGD optimizer and a batch size of 256. Results on ResNet50 are presented in Table 3. In general, LayerPrune methods improve accuracy over the baseline and over their counterpart filter pruning methods. Although the feature maps criterion [31] achieves better accuracy than LayerPrune by 0.92%, LayerPrune has a higher latency reduction, exceeding it by 5.7%. It is worth mentioning that the latency-aware optimization ECC has an upper bound latency reduction of 11.56%, on 1080Ti, with accuracy 16.3%. This stems from the fact that iterative filter pruning is bounded by the network's depth and by structure dependency within the network, so not all layers are considered for pruning, such as the gates at residual blocks. In addition, ECC builds a layer-wise bilinear model to approximate the latency of a model given the number of input channels and output filters per layer. This simplifies the non-linear relationship between the number of filters per layer and latency. We show latency reduction on Xavier for an ECC-pruned model optimized for 1080Ti; this pruned model results in a latency increase at batch size 1 and the lowest latency reduction at batch size 64. This suggests that a hardware-aware filter-pruned model optimized for one hardware architecture might perform worse on other hardware than even a hardware-agnostic filter pruning method. It is worth noting that the filter pruning method HRank [24], with 2.6x FLOPs reduction, shows a large accuracy degradation compared to LayerPrune (71.98 vs 74.31). Even with aggressive filter pruning, the speedup is noticeable with a large batch size but small with a small batch size. Among shallower models, LayerPrune outperforms SSS at the same latency budget, even though SSS supports block pruning for ResNet50, which shows the effectiveness of accuracy approximation as a layer importance measure.

ResNet50 baseline (76.14)
Method | Shallower? | Top-1 accuracy (%) | LR (%) 1080Ti bs=1 | LR (%) 1080Ti bs=64 | LR (%) Xavier bs=1 | LR (%) Xavier bs=64
Weight norm [8] | no | 76.50 | 6.79 | 3.46 | 6.57 | 8.06
ECC [45] | no | 75.88 | 13.52 | 1.59 | -4.91** | 3.09**
LayerPrune | yes | 76.70 | 15.95 | 4.81 | 21.38 | 6.01
LayerPrune | yes | 76.52 | 20.32 | 13.23 | 26.14 | 13.20
Batch Normalization | no | 75.23 | 2.49 | 1.61 | -2.79 | 4.13
LayerPrune | yes | 76.70 | 15.95 | 4.81 | 21.38 | 6.01
LayerPrune | yes | 76.52 | 20.41 | 8.36 | 25.11 | 9.96
Taylor [30] | no | 76.4 | 2.73 | 3.6 | -1.97 | 6.60
LayerPrune | yes | 76.48 | 15.79 | 3.01 | 21.52 | 4.85
LayerPrune | yes | 75.61 | 21.35 | 6.18 | 27.33 | 8.42
Feature maps [31] | no | 75.92 | 10.86 | 3.86 | 20.25 | 8.74
Channel pruning* [12] | no | 72.26 | 3.54 | 6.13 | 2.70 | 7.42
ThiNet* [27] | no | 72.05 | 10.76 | 10.96 | 15.52 | 17.06
LayerPrune | yes | 75.00 | 16.56 | 2.54 | 23.82 | 4.49
LayerPrune | yes | 71.90 | 22.15 | 5.73 | 29.66 | 8.03
SSS-ResNet41 [16] | yes | 75.50 | 25.58 | 24.17 | 31.39 | 21.76
LayerPrune-Imprint | yes | 76.40 | 22.63 | 25.73 | 30.44 | 20.38
LayerPrune-Imprint | yes | 75.82 | 30.75 | 27.64 | 33.93 | 25.43
SSS-ResNet32 [16] | yes | 74.20 | 41.16 | 29.69 | 42.05 | 29.59
LayerPrune-Imprint | yes | 74.74 | 40.02 | 36.59 | 41.22 | 34.50
HRank-2.6x-FLOPs* [24] | no | 71.98 | 11.89 | 36.09 | 20.63 | 40.09
LayerPrune-Imprint | yes | 74.31 | 44.26 | 41.01 | 41.01 | 38.39
Table 3: Comparison of different pruning methods on ResNet50 ImageNet. * Manually pre-defined signatures. ** The same pruned model, optimized with the 1080Ti latency consumption model in the ECC optimization.

4.4 Layer pruning comparison

In this section, we analyse different criteria for layer pruning under the same latency budget, as presented in Table 4. Our imprinting method consistently outperforms the other methods, especially at higher latency reduction rates. Imprinting is able to reach 30% latency reduction with only 0.36% accuracy loss from the baseline. The ensemble method, although it achieves better accuracy than the average over the individual criteria, is still sensitive to the errors of individual criteria. We further compare imprinting-based layer pruning with smaller ResNet variants at a similar latency budget. We outperform ResNet34 by 1.44% (LR=39%) and ResNet18 by 0.56% (LR=65%) in accuracy, showing the effectiveness of incorporating accuracy in block importance. A detailed numerical evaluation can be found in the supplementary material.

ResNet50 (76.14)
Method | 1 block (LR 15%) | 2 blocks (LR 20%) | 3 blocks (LR 25%) | 4 blocks (LR 30%)
LayerPrune-Imprint | 76.72 | 76.53 | 76.40 | 75.82
LayerPrune-Taylor | 76.48 | 75.61 | 75.34 | 75.28
LayerPrune-Feature map | 75.00 | 71.9 | 70.84 | 69.05
LayerPrune-Weight magnitude | 76.70 | 76.52 | 76.12 | 74.33
LayerPrune-Batch Normalization | 76.70 | 76.22 | 75.84 | 75.03
LayerPrune-Ensemble | 76.70 | 76.11 | 75.76 | 75.01
Table 4: Comparison of different layer pruning methods supported by LayerPrune on ResNet50 ImageNet. Latency reduction is calculated on 1080Ti with batch size 1.

5 Conclusion

We presented the LayerPrune framework, which includes a set of layer pruning methods. We showed the benefits of LayerPrune for latency reduction compared to filter pruning. The key findings of this paper are the following:

  • For a filter criterion, training a LayerPrune model based on this criterion achieves the same, if not better, accuracy as the filter pruned model obtained by using the same criterion.

  • Filter pruning reduces the number of convolution operations per layer, so its latency reduction depends on the hardware architecture, while LayerPrune removes whole layers. As a result, filter-pruned models might produce non-optimal matrix shapes for the compute kernels, which can even lead to a latency increase on some hardware targets and batch sizes.

  • Filter-pruned models within a latency budget have a larger variance in accuracy than LayerPrune models. This stems from the fact that the relation between latency and the number of filters is non-linear, and optimization constrained by a resource budget requires a complex selection of per-layer pruning ratios.

  • We also showed the importance of incorporating an accuracy approximation into layer ranking via imprinting.
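
The one-shot selection step behind these findings can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the block names and scores are hypothetical, and collapsing per-filter scores into a per-block score by taking their mean is an assumption, since the aggregation rule depends on the criterion used.

```python
from statistics import mean


def rank_blocks(filter_scores):
    """Collapse per-filter importance scores into one score per block
    (mean aggregation is an assumption) and sort blocks from least to
    most important."""
    block_scores = {name: mean(scores) for name, scores in filter_scores.items()}
    return sorted(block_scores, key=block_scores.get)


def select_blocks_to_prune(filter_scores, num_blocks):
    """Return the `num_blocks` least important blocks in one shot,
    i.e. without re-scoring after each removal."""
    if not 0 <= num_blocks <= len(filter_scores):
        raise ValueError("num_blocks must be between 0 and the number of blocks")
    return rank_blocks(filter_scores)[:num_blocks]


# Hypothetical per-filter scores for four blocks; names are illustrative only.
scores = {
    "layer1.block1": [0.9, 0.8, 0.7],
    "layer2.block2": [0.2, 0.1, 0.3],
    "layer3.block4": [0.4, 0.5, 0.3],
    "layer4.block1": [0.05, 0.1, 0.15],
}

to_prune = select_blocks_to_prune(scores, num_blocks=2)
print(to_prune)
```

The selected blocks would then be removed and the resulting shallower model fine-tuned. Because whole blocks are dropped, the choice of `num_blocks` maps more directly to a latency budget than per-layer filter ratios do.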

References

  • [1] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind (2017) Automatic differentiation in machine learning: a survey. The Journal of Machine Learning Research (1), pp. 5595–5637. Cited by: §4.1.
  • [2] S. Bianco, R. Cadene, L. Celona, and P. Napoletano (2018) Benchmark analysis of representative deep neural network architectures. IEEE Access, pp. 64270–64277. Cited by: §1.
  • [3] S. Chen and Q. Zhao (2018) Shallowing deep networks: layer-wise pruning based on feature representations. IEEE transactions on pattern analysis and machine intelligence (12), pp. 3048–3056. Cited by: §2, §3.1, §4.2.2, §4.2.3, Table 1, Table 2.
  • [4] T. Chin, C. Zhang, and D. Marculescu (2018) Layer-compensated pruning for resource-constrained convolutional neural networks. NeurIPS. Cited by: §2.
  • [5] S. Elkerdawy, H. Zhang, and N. Ray (2019) Lightweight monocular depth estimation model by joint end-to-end filter pruning. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 4290–4294. Cited by: §1.
  • [6] J. Frankle and M. Carbin (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635. Cited by: §2.
  • [7] S. Han, H. Mao, and W. Dally (2015) Compressing deep neural networks with pruning, trained quantization and huffman coding. ICLR 2017. Cited by: §2.
  • [8] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143. Cited by: §2, §3.2, Table 1, Table 2, Table 3.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2014) Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, pp. 346–361. Cited by: §3.1.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE CVPR, pp. 770–778. Cited by: §1, §4.2.
  • [11] Y. He, J. Lin, Z. Liu, H. Wang, L. Li, and S. Han (2018) Amc: automl for model compression and acceleration on mobile devices. In Proceedings of the ECCV, pp. 784–800. Cited by: §2.
  • [12] Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE ICCV, pp. 1389–1397. Cited by: Table 3.
  • [13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1.
  • [14] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger (2017) Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844. Cited by: §1.
  • [15] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §1.
  • [16] Z. Huang and N. Wang (2018) Data-driven sparse structure selection for deep neural networks. In Proceedings of the European conference on computer vision (ECCV), pp. 304–320. Cited by: §2, Table 3.
  • [17] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2017) Quantized neural networks: training neural networks with low precision weights and activations. The Journal of Machine Learning Research 18 (1), pp. 6869–6898. Cited by: §1.
  • [18] X. Jin, B. Peng, Y. Wu, Y. Liu, J. Liu, D. Liang, J. Yan, and X. Hu (2019) Knowledge distillation via route constrained optimization. In Proceedings of the IEEE ICCV, pp. 1345–1354. Cited by: §1.
  • [19] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §4.
  • [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1, §4.
  • [21] K. Keutzer et al. (2019) Abandoning the dark arts: scientific approaches to efficient deep learning. The 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing, Conference on Neural Information Processing Systems. Cited by: §1.
  • [22] C. Lemaire, A. Achkar, and P. Jodoin (2019) Structured pruning of neural networks with budget-aware regularization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9108–9116. Cited by: §1.
  • [23] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2017) Pruning filters for efficient convnets. ICLR. Cited by: §2, §3.2, Table 2.
  • [24] M. Lin, R. Ji, Y. Wang, Y. Zhang, B. Zhang, Y. Tian, and L. Shao (2020) HRank: filter pruning using high-rank feature map. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1529–1538. Cited by: §3.2, §4.3, Table 3.
  • [25] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang (2017) Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE ICCV, pp. 2736–2744. Cited by: §2, §3.2, Table 1, Table 2.
  • [26] C. Louizos, M. Welling, and D. P. Kingma (2017) Learning sparse neural networks through regularization. arXiv preprint arXiv:1712.01312. Cited by: §2.
  • [27] J. Luo, J. Wu, and W. Lin (2017) Thinet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE ICCV, pp. 5058–5066. Cited by: §1, §2, §3.2, Table 3.
  • [28] M. Siam, B. Oreshkin, and M. Jagersand (2019) AMP: adaptive masked proxies for few-shot segmentation. In Proceedings of the IEEE ICCV. Cited by: §3.1.
  • [29] N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pp. 116–131. Cited by: §1.
  • [30] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz (2019) Importance estimation for neural network pruning. In Proceedings of the IEEE CVPR, pp. 11264–11272. Cited by: §1, §2, §3.2, §4.3, Table 1, Table 2, Table 3.
  • [31] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz (2016) Pruning convolutional neural networks for resource efficient transfer learning. arXiv preprint arXiv:1611.06440. Cited by: §3.2, §4.3, Table 2, Table 3.
  • [32] C. Nugteren and V. Codreanu (2015) CLTune: a generic auto-tuner for opencl kernels. In 2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, pp. 195–202. Cited by: §1.
  • [33] H. Qi, M. Brown, and D. G. Lowe (2018) Low-shot learning with imprinted weights. In Proceedings of the IEEE CVPR, pp. 5822–5830. Cited by: §3.1.
  • [34] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §1.
  • [35] S. Sharify, A. D. Lascorz, M. Mahmoud, M. Nikolic, K. Siu, D. M. Stuart, Z. Poulos, and A. Moshovos (2019) Laconic deep learning inference acceleration. In Proceedings of the 46th International Symposium on Computer Architecture, pp. 304–317. Cited by: §2.
  • [36] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. ICLR. Cited by: §4.2.
  • [37] S. Srinivas and R. V. Babu (2015) Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149. Cited by: §2.
  • [38] V. Sze, Y. Chen, T. Yang, and J. S. Emer (2017) Efficient processing of deep neural networks: a tutorial and survey. Proceedings of the IEEE 105 (12), pp. 2295–2329. Cited by: §4.2.3.
  • [39] B. van Werkhoven (2019) Kernel tuner: a search-optimizing gpu code auto-tuner. Future Generation Computer Systems, pp. 347–358. Cited by: §1.
  • [40] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han (2019) Haq: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE CVPR, pp. 8612–8620. Cited by: §1.
  • [41] R. J. Wang, X. Li, and C. X. Ling (2018) Pelee: a real-time object detection system on mobile devices. In Advances in Neural Information Processing Systems, pp. 1963–1972. Cited by: §1.
  • [42] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li (2016) Learning structured sparsity in deep neural networks. In Advances in neural information processing systems, pp. 2074–2082. Cited by: §2.
  • [43] Z. Wu, C. Shen, and A. Van Den Hengel (2019) Wider or deeper: revisiting the resnet model for visual recognition. Pattern Recognition, pp. 119–133. Cited by: §1.
  • [44] C. Yang, L. Xie, C. Su, and A. L. Yuille (2019) Snapshot distillation: teacher-student optimization in one generation. In Proceedings of the IEEE CVPR, pp. 2859–2868. Cited by: §1.
  • [45] H. Yang, Y. Zhu, and J. Liu (2019) Ecc: platform-independent energy-constrained deep neural network compression via a bilinear regression model. In Proceedings of the IEEE CVPR, pp. 11206–11215. Cited by: §1, §2, §3.2, Table 1, Table 3.
  • [46] T. Yang, Y. Chen, and V. Sze (2017) Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5687–5695. Cited by: §1.
  • [47] T. Yang, A. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze, and H. Adam (2018) Netadapt: platform-aware neural network adaptation for mobile applications. In Proceedings of the ECCV, pp. 285–300. Cited by: §1, §2.
  • [48] C. Zhao, B. Ni, J. Zhang, Q. Zhao, W. Zhang, and Q. Tian (2019) Variational convolutional neural network pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2780–2789. Cited by: §2.