Layer-compensated Pruning for Resource-constrained Convolutional Neural Networks

by Ting-Wu Chin et al.
Carnegie Mellon University

Resource-efficient convolutional neural networks enable not only intelligence on edge devices but also opportunities in system-level optimization such as scheduling. In this work, we aim to improve the performance of resource-constrained filter pruning by merging two commonly considered sub-problems, i.e., (i) how many filters to prune for each layer and (ii) which filters to prune given a per-layer pruning budget, into a global filter ranking problem. Our framework entails a novel algorithm, dubbed layer-compensated pruning, where meta-learning is involved to determine better solutions. We show empirically that the proposed algorithm is superior to prior art in both effectiveness and efficiency. Specifically, we reduce the accuracy gap between the pruned and original networks from 0.9%, while achieving an 8x reduction in the time needed for meta-learning, i.e., from 1 hour down to 7 minutes. Finally, we demonstrate the effectiveness of our algorithm using ResNet and MobileNetV2 networks on the CIFAR-10, ImageNet, and Bird-200 datasets.



1 Introduction

With the ubiquity of mobile and edge devices, it has become desirable to bring the performance of convolutional neural networks (CNNs) to the device without going through the cloud, especially due to latency and privacy considerations. However, mobile and edge devices are often characterized by stringent resource constraints, such as energy consumption and model size. Furthermore, depending on the application domain, multiply-accumulate operations (MACs) or latency may also need to be constrained. On the other hand, as the resource usage of each workload (i.e., CNN) shrinks, the system can serve more workloads at once, which creates further system-level optimization opportunities. As a result, the main research question is how to design resource-constrained CNNs that are still capable of delivering good enough task performance, or prediction accuracy. This problem has recently been approached from two directions: (i) resource-constrained neural architecture search and (ii) resource-constrained model compression. On one hand, neural architecture search is a bottom-up approach that has shown promise in finding better solutions due to its larger search space; however, it requires massive computational power to conduct the search. On the other hand, resource-constrained model compression is a top-down approach that seeks solutions in the neighborhood of a provided pre-trained network. As a result, the final solution may be worse in terms of accuracy, but it is much more efficient to find compared to neural architecture search. We conjecture that the latter approach provides desirable means for machine learning practitioners to take advantage of off-the-shelf, carefully-crafted deep neural networks and adapt them to user-defined resource constraints without having to search for and train a network for a long time.

Among various model compression methods, structural pruning has been widely adopted to approach the resource-constrained deep neural network optimization problem Molchanov et al. (2017); He et al. (2017); Luo et al. (2017); He et al. (2018b); Gordon et al. (2018), since it provides solutions that work directly on existing deep learning frameworks and hardware. Specifically, for CNNs, the structural pruning method we discuss in this paper is filter pruning Li et al. (2017). However, resource-constrained pruning is a hard problem since (i) deciding which filters to disable such that accuracy is maximized and constraints are satisfied is an NP-hard combinatorial problem and (ii) deep neural networks are highly non-linear functions that are hard to analyze. To tackle these problems, prior art leverages approximations and considers two sub-problems: (a) how many filters to prune for each layer, which we call the layer scheduling problem, and (b) which filters to prune, given a per-layer pruning budget, which we call the ranking problem.

The majority of prior work focuses on the ranking problem and proposes various heuristics to measure the importance of each filter Jiang et al. (2018); Mao et al. (2017); He et al. (2017); Luo et al. (2017); Yoon et al. (2018); Li et al. (2017), while addressing layer scheduling by either employing manually designed rules based on experience or simply pruning the network uniformly by a fixed percentage across layers. To achieve effective and efficient filter pruning, there is a need for algorithms that determine the layer schedule without humans in the loop. Pioneering this direction, He et al. (2018b) approach this problem with reinforcement learning, learning an agent that decides how many filters to prune for each layer given a resource constraint. While achieving better results than a hand-tuned policy, we note that such a formulation is too time-consuming if the goal is to traverse the Pareto frontier of the design space of interest. We argue that traversing the Pareto frontier efficiently is a critical feature desired by machine learning practitioners, since constraints are not always known a priori when building a system that runs convolutional neural network models among other workloads.

In this work, we first theoretically derive the pruning problem such that the layer scheduling problem and the ranking problem are treated as a unified problem with a naïve solution. Then, based on the analysis of the derivation, we introduce a novel formulation that takes the approximation errors into account to further improve the naïve solution. Specifically, we leverage meta-learning to learn a set of latent variables that compensate for the approximation error, and we call it the layer-compensated pruning algorithm. Overall, our contributions are as follows:

  • We define the pruning problem from a theoretical standpoint and connect to prior solutions throughout the derivations.

  • We propose a novel, effective, and efficient algorithm, dubbed layer-compensated pruning, which improves upon prior art by learning to compensate for the approximation error incurred in the derivation in a layer-wise fashion. Specifically, we achieve slightly better results 8x faster.

  • In our general formulation, we show that layer-compensated pruning can improve various commonly-adopted heuristic metrics, such as the ℓ1-norm and ℓ2-norm of weights and the first-order Taylor approximation.

  • We conduct a comprehensive analysis to justify our findings on already-small deep neural networks, i.e., ResNet and MobileNetV2, using three datasets, i.e., CIFAR-10, ImageNet, and Bird-200.

2 Related Work

For resource-constrained DNN design, there are generally two approaches: bottom-up and top-down.

2.1 Bottom-up Resource-constrained DNN Design

Bottom-up approaches try to build the neural network from the ground up while incorporating some awareness of the resources. These types of approaches are often called multi-objective or platform-aware neural architecture search Hsu et al. (2018); Zhou et al. (2018); Tan et al. (2018); Dong et al. (2018). Within this domain, every approach has a different search space, learning algorithm, and decision space. For example, Zhou et al. (2018) used a policy gradient algorithm to learn two controllers: one is in charge of scaling the number of neurons in an existing layer and the other removes or inserts new operations on top of the current layer given the network embedding. Similarly, Hsu et al. (2018) used policy gradient to learn a controller that suggests the hyper-parameters for the target network given the network embedding, but with a smaller search space that tunes the hyper-parameters of an existing architecture (e.g., the number of filters in convolutions and the growth rate in CondenseNet Huang et al. (2017)). On the other hand, Tan et al. (2018) relied on a search space that covers connections and operations within a cell and determines how many stacked cells should form a block. Lastly, Dong et al. (2018) proposed to learn a surrogate function that predicts the accuracy of network candidates, which makes the traversal of the Pareto frontier efficient. Although bottom-up approaches enable promising solutions with a larger search space, they are much more time-consuming compared to top-down approaches.

2.2 Top-down Resource-constrained DNN Design

We consider pruning and quantization to be top-down approaches, since they start with an existing neural network and try to trim down connections or precision. While there is emerging work toward network quantization Khoram & Li (2018) in a resource-constrained setting, structural pruning is often adopted as the solution for the problem Gordon et al. (2018); Luo et al. (2017); He et al. (2017); Molchanov et al. (2017); He et al. (2018b); Yang et al. (2018a); Jiang et al. (2018); He et al. (2018a); Hu et al. (2016); Liu et al. (2017); Li et al. (2017); Lin et al. (2018); Wen et al. (2016) given its fine-grained control over the constraints. We group solutions to structural pruning into the following three categories.

Joint Optimization

In this line of work, current approaches try to jointly optimize the model weights as well as the filter mask. A common approach is to let the scaling factor of the batch normalization layer that follows the convolution layer act as the mask and to add sparsity-inducing (e.g., ℓ1-weighted) regularization terms to suppress the scaling factors. As a result, standard training procedures using stochastic gradient descent optimize both the weights and the mask jointly Ye et al. (2018); Liu et al. (2017); Gordon et al. (2018). On the other hand, Louizos et al. (2017) formulate the loss function from a Bayesian standpoint, while Dai et al. (2018) derive the loss function from an information-theoretic standpoint. However, we note that such approaches rely on non-intuitive tuning knobs to traverse the Pareto frontier.

Local Ranking with Layer Scheduling

In this line of research, resource-constrained filter pruning is done by solving two sub-problems: (a) decide how many filters to prune for each layer such that the constraint is satisfied, which we call layer scheduling, and (b) decide which filters to prune, given a layer schedule. While most of the prior art Jiang et al. (2018); Mao et al. (2017); He et al. (2017); Luo et al. (2017); Yoon et al. (2018); Li et al. (2017) decides the layer schedule manually based on experience, we argue that the layer schedule directly affects the resource usage of the resulting network. As a result, it is not scalable to rely on experts to determine the layer schedule so as to traverse the Pareto frontier. Although He et al. (2018b) provide a good and scalable (compared to human-in-the-loop) solution for resource-constrained pruning, the reinforcement learning agent generates the layer schedule sequentially for each layer, which is inefficient and presumably harder to learn for deeper networks due to the credit assignment problem Sutton et al. (1998).

Global Ranking with Single-filter Pruning

In this category, the filters of the entire network are ranked together according to some heuristic metrics and pruning is an iterative process that has three steps: rank filters globally, greedily prune one filter at a time, and fine-tune the network. The process continues until resource constraints are satisfied, which makes the traversal along different values of constraints intuitive. Molchanov et al. (2017) propose to leverage first-order Taylor approximation for the ranking and progressively prune one filter at a time, while Theis et al. (2018) later propose to leverage Fisher information for ranking. However, pruning one filter at a time is not scalable when the number of filters is large, as in the case of MobileNetV2 Sandler et al. (2018) which has 17k filters. While our work is also considered a global ranking approach, we conduct multi-filter pruning instead of single-filter pruning, and we find multi-filter pruning to be more effective and more efficient.
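The iterative single-filter process described above can be sketched as follows. This is an illustrative sketch only: the callables `score_fn`, `fine_tune`, and `fits_budget` are hypothetical placeholders, not an API from the cited works.

```python
def single_filter_prune(score_fn, fine_tune, state, n_filters, fits_budget):
    """Greedy single-filter pruning: rank all filters globally, prune the
    least important one, fine-tune, and repeat until the budget is met.

    score_fn(state, alive) -> per-filter importance scores (indexable by id)
    fine_tune(state, alive) -> updated model state after short fine-tuning
    fits_budget(alive)      -> True once the resource constraint is satisfied
    """
    alive = set(range(n_filters))
    while not fits_budget(alive):
        scores = score_fn(state, alive)               # global ranking step
        victim = min(alive, key=lambda i: scores[i])  # least important filter
        alive.remove(victim)                          # prune one filter
        state = fine_tune(state, alive)               # recover accuracy
    return state, alive
```

The fine-tuning step inside the loop is exactly what makes this approach expensive for networks with many filters, as discussed above.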

3 Problem Formulation

We are interested in solving the problem of resource-constrained filter pruning. Formally, the optimization problem to consider is:

min_{θ, z} L(θ ⊙ z)   s.t.   C(z) ≤ ζ,   z ∈ {0, 1}^K,     (1)

where L(θ) = E_{(x,y)∼D} [ℒ(f(x; θ), y)]. In the objective function, x and y are the input and label, respectively, D is the data distribution we care about, ℒ is the loss function, f is the DNN model, and θ and θ̄ are the target and pre-trained model weights, which are K-dimensional arrays of vectors with each dimension representing the weights of a filter. z is a K-dimensional indicator vector representing whether the corresponding filter is pruned (z_i = 0) or not (z_i = 1). On the constraint side, C is the target resource usage evaluation function and ζ is the desired resource constraint. For ease of notation, we denote by ΔL(z) the loss difference caused by z. We approach this optimization problem with the commonly adopted optimization framework He et al. (2018b, 2017); Luo et al. (2017); Li et al. (2017); Yang et al. (2018a); Yoon et al. (2018); Mao et al. (2017); He et al. (2018a); Jiang et al. (2018); Lin et al. (2018), as shown in Algorithm 1.

1:  Input: pre-trained model θ̄, constraint ζ, iteration count T, resource usage function C
2:  Output: pruned model θ
3:  θ ← θ̄
4:  for t = 1 to T do
5:     Pruning: z = solve (1) with θ fixed and constraint ζ_t
6:     Tuning: θ = solve (1) with z fixed
7:     Constraint tightening: decrease ζ_t toward ζ
8:  end for
Algorithm 1 Resource-Constrained Pruning Framework

Note that solving for z optimally in each iteration of Algorithm 1 (line 5) is a combinatorial problem that takes O(2^K) evaluations of the objective. In practice, a greedy approximation is often adopted to find a sub-optimal z.
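As a rough illustration, the alternating structure of Algorithm 1 can be sketched as below. The helper names (`prune_step`, `tune_step`, `resource_usage`) and the linear constraint-tightening schedule are assumptions of this sketch, not details from the paper.

```python
def constrained_pruning(theta, zeta_target, iterations, resource_usage,
                        prune_step, tune_step):
    """Alternate pruning and tuning while tightening the constraint.

    theta: per-filter weights (a flat list here, for illustration).
    prune_step(theta, zeta_t) -> binary mask z satisfying the constraint.
    tune_step(theta, z)       -> weights after fine-tuning with z fixed.
    """
    zeta_full = resource_usage([1] * len(theta))  # usage of the full model
    z = [1] * len(theta)
    for t in range(1, iterations + 1):
        # Linearly tighten the constraint from full usage down to the target.
        zeta_t = zeta_full + (zeta_target - zeta_full) * t / iterations
        z = prune_step(theta, zeta_t)   # line 5: solve for the mask z
        theta = tune_step(theta, z)     # line 6: fine-tune the weights
    return theta, z
```

With dummy prune/tune callables that keep the largest-magnitude weights, this loop reproduces the prune-then-tune cadence of the framework.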

3.1 Greedy Single-filter Pruning

To solve Algorithm 1 (line 5) in a greedy fashion, prior work assumes that z^(t) prunes one additional filter on top of z^(t−1) Molchanov et al. (2017); Theis et al. (2018). That is, they seek to optimize:

min_{i ∈ S} ΔL(z_{−i}),     (2)

where z_{−i} represents the pruning of the i-th filter, i.e., z_i = 1 before pruning and z_i = 0 after, and S is the set of remaining non-zero elements in z. And

ΔL(z) = L(θ̄ ⊙ z) − L(θ̄).     (3)

In the remainder of this section, we will drop superscripts whenever there is no confusion.

Computing the loss difference term in equation (2) is nontrivial, as a dataset needs to be prepared to compute the change in the loss function, and one needs to evaluate the difference |S| times to proceed in each iteration. A common practice is to define a metric M to approximate the loss difference. That is:

ΔL_i = M_i + ε_i,     (4)

where M_i can be easy to compute, e.g., based on the magnitude of the model parameters, and ε_i is the approximation error incurred by the metric approximation.

In the literature, various metric approximations have been proposed. For example, past work has used the ℓ2-norm of filter weights He et al. (2018a); Yang et al. (2018b), the ℓ1-norm of filter weights Mao et al. (2017); Li et al. (2017), the norm of filter weights together with the filter weights of the next layer Jiang et al. (2018), the first-order Taylor expansion of the loss Molchanov et al. (2017); Lin et al. (2018), the Fisher information of the loss Theis et al. (2018), and the variance of the max activation Yoon et al. (2018).

3.2 Greedy Multi-filter Pruning

Greedy single-filter pruning requires one evaluation of the objective and one round of fine-tuning per pruned filter, which is expensive when many filters must be pruned from a pre-trained model with a large number of filters. To prune more filters at a time, or even perform one-shot pruning, it is common to approximate the loss difference caused by a set of pruned filters by addition Luo et al. (2017); Li et al. (2017); Yang et al. (2018a); Yoon et al. (2018); He et al. (2018a, b, 2017); Mao et al. (2017); Jiang et al. (2018); Lin et al. (2018); Yang et al. (2018b). Formally,

ΔL(z) = Σ_{i ∈ P} (M_i + ε_i) + ε̂,     (5)

where P = {i | z_i = 0} are the newly pruned filters and ε̂ is the approximation error for the addition.

If the approximation errors ε and ε̂ are both negligible, solving for z in line 5 of Algorithm 1 is then equivalent to solving the following problem:

min_z Σ_{i ∈ P(z)} M_i   s.t.   C(z) ≤ ζ.     (6)

Figure 1: Calculated loss difference versus various heuristic metrics in log scale. Each point is a filter of ResNet-56 trained on CIFAR-10. Different colors represent filters in different layers.

Local Ranking with Layer Scheduling

Prior art Luo et al. (2017); Li et al. (2017); Yoon et al. (2018); He et al. (2018a, b, 2017); Jiang et al. (2018); Lin et al. (2018); Yang et al. (2018b) adopts the aforementioned assumption and approaches equation (6) by introducing layer scheduling variables k_l:

min_z Σ_l Σ_{i ∈ P(z) ∩ F_l} M_i   s.t.   Σ_{i ∈ F_l} z_i = k_l  ∀ l,   C(z) ≤ ζ,     (7)

and solving it greedily with layer-wise top-k selection. In equation (7), F_l retrieves the filters in the l-th layer and k_l is the number of filters to be kept for the l-th layer. We note that, in this fashion, the layer schedule {k_l} directly determines whether the resource constraint is met or not.
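The layer-wise top-k selection step, given a layer schedule, can be sketched as follows. The function name and the list-of-lists representation are illustrative assumptions.

```python
def local_ranking_prune(metrics_per_layer, schedule):
    """Keep the top-k_l filters (by metric) in each layer l.

    metrics_per_layer: list of lists, the metric M_i of each filter per layer.
    schedule: list of k_l, the number of filters to keep in each layer.
    Returns a per-layer binary mask (1 = keep, 0 = prune).
    """
    masks = []
    for metrics, k in zip(metrics_per_layer, schedule):
        # Indices of the k highest-metric filters in this layer.
        keep = set(sorted(range(len(metrics)), key=lambda i: -metrics[i])[:k])
        masks.append([1 if i in keep else 0 for i in range(len(metrics))])
    return masks
```

Note that the quality of the result hinges entirely on how the schedule {k_l} is chosen, which is the point the surrounding text makes.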

Naïve Pruning

Equation (6) can certainly also be solved with a global greedy pruning algorithm without layer scheduling: greedily prune the filter with the least estimated loss difference until the constraint is satisfied, which we call naïve pruning.
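A minimal sketch of this global greedy procedure is below, assuming a simple per-filter MAC count as the resource model (an assumption of this sketch; the paper supports general resource functions).

```python
def naive_prune(metrics, macs_per_filter, mac_budget):
    """Greedily prune the filter with the smallest metric (estimated loss
    difference) until the total MAC count fits the budget."""
    z = [1] * len(metrics)
    usage = sum(macs_per_filter)
    # Visit filters from least to most important.
    for i in sorted(range(len(metrics)), key=lambda i: metrics[i]):
        if usage <= mac_budget:
            break
        z[i] = 0                      # prune filter i
        usage -= macs_per_filter[i]   # update resource usage
    return z, usage
```

Because the ranking is global, no per-layer budget k_l is ever specified; the layer schedule emerges implicitly from the sorted metrics.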

4 Layer-Compensated Pruning

One might wonder how acceptable the above assumption that the errors ε and ε̂ are both negligible really is. We conducted an experiment, shown in Fig. 1, where we plot the heuristic metrics on the x-axis and the calculated loss difference on the y-axis for each filter, for three common metrics, with different colors representing different layers. It is obvious that ε is not negligible in practice; specifically, filters with similar metric values exhibit loss differences that span orders of magnitude. Since ε is not negligible, equation (6) should include the error term as follows:

min_z Σ_{i ∈ P(z)} (M_i + ε_i)   s.t.   C(z) ≤ ζ.     (8)
To avoid computing ε_i exactly, we conjecture that good approximation-error estimates lead to better solutions than naïve pruning alone. Furthermore, we leverage a heuristic that treats the approximation errors as identical for all filters in the same layer. As a result, we propose a set of layer-dependent latent variables β ∈ R^L that represent the error estimates to be solved for, where L is the number of layers in the network. Namely, before solving equation (8), we first solve the following equation to obtain the per-layer error estimates:

β* = argmin_β ΔL(z),   z = NP({M_i + β_{l(i)}}_i, ζ),     (9)

where l(i) retrieves the layer index that filter i belongs to and NP represents naïve pruning, which returns the set of filter indexes to be pruned. Intuitively speaking, we need to find error compensations β such that the solution from naïve pruning has the minimum loss difference. Once β* is obtained, we then replace M_i in equation (8) with M_i + β*_{l(i)} and solve it with naïve pruning. While this is still an approximation, we show in later experiments that this approach produces networks with a smaller loss difference on the unseen testing dataset compared to solving equation (6) naïvely.
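The compensation step itself is a simple per-filter shift of the heuristic metric; a sketch (with illustrative names) is:

```python
def layer_compensated_scores(metrics, layer_of, beta):
    """Add the learned per-layer compensation beta[l] to each filter's
    heuristic metric M_i, i.e., estimate M_i + epsilon_i as M_i + beta[l(i)].

    layer_of[i] gives the layer index of filter i.
    """
    return [m + beta[layer_of[i]] for i, m in enumerate(metrics)]
```

The compensated scores are then fed to the same global greedy (naïve) pruning routine in place of the raw metrics.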

4.1 Learning the Layer-wise Compensations

Equation (9) can be approached by derivative-free, meta-heuristic optimization algorithms such as genetic algorithms, evolutionary strategies, and evolutionary algorithms; we leverage the regularized evolutionary algorithm proposed in Real et al. (2018) for its effectiveness on the neural architecture search problem. In our regularized evolutionary algorithm, we first generate a pool of candidates (β's), then repeat the following steps: (i) sample a subset from the candidates, (ii) identify the fittest candidate, (iii) generate a new candidate by mutating the fittest candidate, and (iv) replace the oldest candidate in the pool with the generated one. We define the fittest candidate as the one with the minimum objective value in equation (9). We initialize the pool by sampling from normal distributions, i.e., β_l ∼ N(0, σ_l²), where σ_l denotes the standard deviation of the metric of the filters at layer l. For mutation, we randomly select a subset of the current β to be perturbed by noise n_l ∼ N(0, σ_l²), i.e., β_l = β_l + a · n_l, where the fraction of entries selected is the hyper-parameter that controls the exploration of the search and a is the step size, which can be gradually decreased to reduce exploration in the later stages of the optimization.

5 Evaluations

5.1 Datasets and Training Setting

Our work is evaluated on various benchmarks, including CIFAR-10 Krizhevsky & Hinton (2009), ImageNet Russakovsky et al. (2015), and Bird-200 Wah et al. (2011). The first is a standard image classification dataset that consists of 50k training images and 10k testing images with 10 classes in total, while the second is a large-scale image classification dataset that includes 1.2 million training images and 50k testing images with 1000 classes. We also benchmark layer-compensated pruning in a transfer learning setting, since in practice we often want a small and fast model on some target dataset rather than on ImageNet; hence, we use the Bird-200 dataset, which consists of 6k training images and 5.7k testing images covering 200 bird species.

For CIFAR-10, the training parameters for the baseline models follow prior work He et al. (2018a), which uses stochastic gradient descent with Nesterov momentum and an initial learning rate of 0.1, dropped by 10x at epochs 60, 120, and 160, for 200 epochs in total. For pruning, we keep all training hyper-parameters the same but change the initial learning rate to 0.01 and train for only 60 epochs, dropping the learning rate by 10x at the corresponding epochs, i.e., epochs 18, 36, and 48. For ImageNet, we use a pre-trained model, and when fine-tuning we use an initial learning rate that is dropped by 10x at epoch 20; we train for only 30 epochs, similar to prior work He et al. (2018b). For all transfer learning datasets, we fine-tune the models pre-trained on ImageNet on the target datasets with a learning rate of 0.001 for the last layer and 0.0001 for the other layers; we train for 60 epochs and drop the learning rate by 10x at epoch 48 to obtain the baseline model on the target dataset.

5.2 Implementation Details

MobileNetV2 for CIFAR-10

We made some changes to the original MobileNetV2 architecture Sandler et al. (2018) to adapt it from ImageNet, with 224x224 input images, to CIFAR-10, with 32x32 input images. Specifically, we change the convolution stride from two to one for block two and block four. With these changes, the final layer produces 8x8 feature maps, which is also the case for the ResNet designed for CIFAR-10 He et al. (2016).

Heuristic Metrics

We mainly consider the ℓ2-norm of weights unless noted otherwise, and we conduct an ablation study in Section 5.3 for other ranking metrics as well. Specifically, the ℓ1-norm of the weights of a filter is calculated as the absolute summation across all three dimensions of the filter Li et al. (2017), i.e., input channels, kernel width, and kernel height. Similarly, the ℓ2-norm is the squared summation over all three dimensions He et al. (2018a). Lastly, the first-order Taylor approximation is calculated as the absolute value of the first-order Taylor expansion of the loss with respect to the pruned unit Molchanov et al. (2017).
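For concreteness, the three metrics can be sketched as below over a nested-list filter representation. The Taylor sketch here operates on generic (value, gradient) pairs for illustration; the exact form in Molchanov et al. (2017) uses activations and their gradients.

```python
def l1_norm(filt):
    """ℓ1-norm: absolute summation over input channels, kernel height,
    and kernel width.  `filt` is a nested list [in_ch][kh][kw]."""
    return sum(abs(w) for ch in filt for row in ch for w in row)

def l2_norm_sq(filt):
    """Squared ℓ2-norm: squared summation over all three dimensions."""
    return sum(w * w for ch in filt for row in ch for w in row)

def taylor_metric(values, grads):
    """First-order Taylor estimate of the loss change: the absolute value
    of the summed gradient-times-value products."""
    return abs(sum(g * v for g, v in zip(grads, values)))
```

These per-filter scores are exactly the M_i fed to the (layer-compensated or naïve) global ranking.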

Pruning Residual Connections

We note that pruning residual connections is tricky, since pruning complementary filters in the operands of a residual addition still results in an output with the same dimension. Concretely, assume we want to prune two six-filter layers whose outputs are added together by a residual connection: if we prune the first three filters of one operand and the last three filters of the other, the output of the addition would still have six channels. To avoid this complication, we follow prior work Gordon et al. (2018) and group filters that are added together by a residual connection; that is, they are either pruned together or kept together. We sum the metric over the group to obtain the group's score.
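The grouping rule can be sketched as follows (names are illustrative): every filter in a residual group receives the group's summed metric, so a greedy pruner naturally prunes or keeps the whole group together.

```python
def group_metrics(metrics, group_of):
    """Sum filter metrics within each residual group.

    group_of[i] maps filter i to its group id; ungrouped filters simply
    get their own unique id.  Returns the per-filter group score.
    """
    totals = {}
    for i, m in enumerate(metrics):
        totals[group_of[i]] = totals.get(group_of[i], 0.0) + m
    return [totals[g] for g in group_of]
```

Because grouped filters share one (larger) score, groups tend to rank high, which is consistent with the later observation that residual connections are rarely pruned.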

Limiting Pruning Budget

We limit the number of filters left in each layer to at least 10% of the original number to avoid extreme pruning, and we experiment with various values later in Section 6.1. We note that a similar limit is adopted by He et al. (2018b), who use 20% as the per-layer limiting budget.

Evolutionary Algorithm

For the evolutionary algorithm that searches for β, we set the total number of iterations to 336 and the size of the candidate pool to 64, so that the total number of candidates seen is 400, which is the same as in prior work He et al. (2018b). We arbitrarily set the mutation ratio to 0.1, which means each mutation randomly selects a tenth of the entries of β to perturb. We linearly reduce the step size to avoid oscillating back and forth around local optima. We randomly sample 3k images from the training set, following He et al. (2018b), to evaluate the objective in equation (9) for each candidate.

5.3 Analysis on CIFAR-10

For our methods on CIFAR-10, we conduct one-shot pruning; that is, we set the constraint to the target constraint in the first iteration of Algorithm 1.

Figure 2: The Pareto frontier of pruning ResNet-56 on CIFAR-10 using various methods. Uniform means uniformly pruning a fixed percentage out of every layer. For methods that have error bars (naïve, uniform, and layer-compensated), we average across three trials and plot the mean and standard deviation.

| Network   | Method                  | Baseline Acc. (%) | Pruned Acc. (%) | MAC (10^6)    | Cost of Pareto frontier |
|-----------|-------------------------|-------------------|-----------------|---------------|-------------------------|
| ResNet-56 | Li et al. (2017)        | 93.04             | 93.06           | 90.9 (72.4%)  | –                       |
|           | Molchanov et al. (2017) | 94.18             | 93.21           | 90.8 (72.4%)  | –                       |
|           | Naïve                   | 94.18             | 93.77±0.06      | 87.8 (70%)    | –                       |
|           | LcP (Ours)              | 94.18             | 93.79±0.11      | 87.8 (70%)    | –                       |
|           | Gordon et al. (2018)    | 94.18             | 93.51           | 77.8 (62%)    | –                       |
|           | Naïve                   | 94.18             | 93.59±0.04      | 75.2 (60%)    | –                       |
|           | LcP (Ours)              | 94.18             | 93.65±0.06      | 75.3 (60%)    | –                       |
|           | He et al. (2017)        | 92.8              | 91.8            | 62.7 (50%)    | –                       |
|           | He et al. (2018b)       | 92.8              | 91.9            | 62.7 (50%)    | –                       |
|           | He et al. (2018a)       | 93.59±0.58        | 93.35±0.31      | 59.4 (47.4%)  | –                       |
|           | Naïve                   | 94.18             | 93.37±0.05      | 62.6 (50%)    | –                       |
|           | LcP (Ours)              | 94.18             | 93.41±0.12      | 62.7 (50%)    | –                       |
| VGG-13    | Louizos et al. (2017)   | 91.9              | 91.4            | 141.5 (45.1%) | –                       |
|           | Louizos et al. (2017)   | 91.9              | 91.0            | 121.9 (38.9%) | –                       |
|           | Dai et al. (2018)       | 91.9              | 91.5            | 70.6 (22.5%)  | –                       |
|           | Naïve                   | 91.9              | 91.78±0.27      | 70.1 (22.4%)  | –                       |
|           | LcP (Ours)              | 91.9              | 92.38±0.19      | 70.3 (22.4%)  | –                       |

Our implementation

Table 1: Comparison with prior art. We group methods into sections according to different MAC operations. Values for our approaches are averaged across three trials and we report the mean and standard deviation. We use bold face to denote the best numbers.


As shown in Fig. 2, we find that when targeting the MAC constraint, naïve pruning outperforms the uniform pruning baseline and single-filter pruning Molchanov et al. (2017) by a large margin and is slightly better than a joint optimization technique optimized for the MAC constraint Gordon et al. (2018). Theoretically speaking, single-filter pruning should have a smaller approximation error and lead to better solutions; however, we find that due to the small learning rate used for tuning between iterations, it is easy for the network to get stuck at a local optimum. We postulate that each point obtained by single-filter pruning could be further improved by extra fine-tuning with a larger learning rate. Beyond the naïve approach, layer-compensated pruning (LcP) obtains a better Pareto frontier than all other methods and is effective under both the model size and the MAC operation constraints. We also compare with prior art in Table 1, where LcP performs favorably. We note that among the prior methods we compare against, five Molchanov et al. (2017); Gordon et al. (2018); He et al. (2018b); Louizos et al. (2017); Dai et al. (2018) target the layer scheduling problem directly or indirectly, while the others focus on complementary problems such as the ranking problem He et al. (2017); Li et al. (2017) and the optimization process He et al. (2018a), i.e., line 6 of Algorithm 1. In particular, compared to prior work He et al. (2018b) that uses reinforcement learning to learn the layer schedule, our algorithm produces lower accuracy degradation starting from a stronger baseline model.


We include the Pareto frontier traversal time cost in Table 1, accounting for the time needed for human experts to design the layer schedule, the trade-off parameter tuning time for joint optimization approaches, the time needed for short-term fine-tuning in single-filter pruning, the time needed for meta-learning to learn the layer schedule given a target constraint value, the time to train a pruned network, the number of points of interest on the Pareto frontier, and the number of filters that must be pruned to reach the lowest constraint value of interest with single-filter pruning. First, single-filter pruning Molchanov et al. (2017) requires filters to be pruned one by one with fine-tuning between iterations, which incurs a huge overhead when targeting stringent resource usage or pruning networks with many filters. Second, the reinforcement learning approach proposed in prior work He et al. (2018b) requires the agent to be re-learned for each constraint value considered, where each learning run takes 1 hour on a GeForce GTX TITAN Xp GPU for CIFAR-10. Last, although the joint optimization techniques Gordon et al. (2018); Louizos et al. (2017); Dai et al. (2018) are efficient for Pareto frontier traversal, since there is no separate optimization for the weights and the mask, controlling the trade-off between resource usage and accuracy in joint optimization methods is not intuitive and requires several rounds of trial and error to hit the target constraint. Specifically, it is necessary to tune the regularization strength to traverse the Pareto frontier; e.g., we sweep it over a wide range to obtain the points for prior work Gordon et al. (2018) in Fig. 2. Moreover, such a trade-off knob is affected by the learning rate; that is, different learning rates result in networks with different resource usage even with the trade-off knob fixed. In comparison, LcP has an intuitive traversal knob, i.e., MAC operations or model size.
Therefore, it is more efficient for traversing the Pareto frontier than joint optimization approaches, since it eliminates the need for tuning the learning rate and the trade-off hyper-parameters. Compared to the reinforcement learning-based approach He et al. (2018b), our algorithm only requires 7 minutes on a single GeForce GTX 1080 Ti GPU, which is comparable to or slightly slower than a GeForce GTX TITAN Xp GPU, to solve equation (9) while observing the same number of candidates in meta-learning as He et al. (2018b). We note that the speed gain over prior work He et al. (2018b) mainly comes from two sources: (i) generating the layer schedule with reinforcement learning requires network inferences, while our approach merely draws random variables from normal distributions, and (ii) our evolutionary algorithm is non-parametric, which means we do not have to conduct backpropagation for learning, whereas it must be performed through the controller network in the reinforcement learning approach. Additionally, with our formulation in equation (9), we do not have to constrain the solution space, while the formulation in prior art He et al. (2018b) requires the solution space for the later layers to be constrained to satisfy the overall resource constraint.

Naïve Pruning

One might wonder why the naïve pruning works well under the MAC operations constraint; to investigate, we visualize the layer schedules it produces. As shown in Fig. 3, we plot the number of filters in each layer, normalized to the original network, under different MAC constraint values. First, we note that the residual connections are not pruned even though they are allowed to be; this is because the residual connections group many convolutions together, which results in a large metric score (Section 5.2). On the other hand, we observe that shallower layers are pruned earlier than deeper layers, and since shallower layers involve more MAC operations due to their larger feature maps, pruning them earlier yields a better Pareto frontier under the MAC operations constraint. This provides some evidence for the effectiveness of the naïve pruning under the MAC constraint and explains its worse performance under the model size constraint.

Figure 3: The layer schedule produced by the naïve pruning for different MAC constraint values. For example, the blue color bars represent the layer schedule when the network is pruned to 20% MAC operations.
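The way a layer schedule emerges from a single global ranking can be sketched as follows: every filter carries a heuristic importance score and a MAC cost, and the lowest-scoring filters are removed, regardless of layer, until the budget is met. This is a simplified sketch of the idea, with made-up tuple fields rather than the paper's actual data structures.

```python
def layer_schedule(filters, mac_budget):
    """Greedy global pruning sketch.

    `filters` is a list of (layer, score, mac_cost) tuples.  Filters are
    removed in ascending score order, regardless of which layer they
    belong to, until the total MAC count fits the budget; the per-layer
    survivor counts form the resulting "layer schedule".
    """
    total = sum(f[2] for f in filters)
    removed_mac = 0
    pruned = []
    for f in sorted(filters, key=lambda f: f[1]):   # least important first
        if total - removed_mac <= mac_budget:
            break
        pruned.append(f)
        removed_mac += f[2]
    # Count surviving filters per layer.
    schedule = {}
    for layer, _, _ in filters:
        schedule[layer] = schedule.get(layer, 0) + 1
    for layer, _, _ in pruned:
        schedule[layer] -= 1
    return schedule
```

Note how a shallow layer whose filters have both low scores and high MAC cost loses filters first, which is exactly the behavior observed in Fig. 3 under the MAC constraint.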

Layer-compensated Pruning

To reason about the effectiveness of the proposed layer-compensated pruning (LcP), we plot the pre-tuning accuracy for the naïve pruning, the uniform pruning, and LcP. As we can see from Fig. 4, LcP produces networks with higher accuracy before fine-tuning, which suggests that the latent variables found by solving equation (9) on the training dataset do not over-fit it. We conjecture that a good pre-tuning accuracy acts as a better initialization, leading to better final performance after fine-tuning.

Figure 4: The testing accuracy of the pruned network before fine-tuning.

We further plot the training curve of our evolutionary algorithm when solving equation (9) for different MAC operations constraints in Fig. 5 to understand the meta-learning process. We find that when the target constraint values are loose, i.e., 90% to 80% of the original MAC operations, the improvement brought by the latent variables is not significant, since pruning the network only slightly causes little accuracy degradation. On the other hand, our algorithm becomes increasingly preferable to the naïve solution as the pruning constraint gets more stringent. Additionally, the algorithm converges in around 100 iterations, which implies that either the step size mentioned in Section 4.1 could be further optimized or one could apply early stopping.

Figure 5: The training curves for the evolutionary algorithm solving equation (9) with different values of the MAC operation constraint. For each constraint, we plot the loss difference for the naïve approach using a triangle in the corresponding color.

Compact Networks

Other than ResNet-56, we also evaluate MobileNetV2 on CIFAR-10. In Fig. 6, we plot networks with small MAC counts, including a network obtained by neural architecture search Dong et al. (2018). We first note that a 5x theoretical speedup (53.1M MAC operations) is achieved without accuracy degradation for MobileNetV2. Consistent with the ResNet-56 results, the simple naïve pruning yields a better Pareto frontier than the uniform pruning, while LcP performs best. In particular, LcP achieves a 2% accuracy improvement (statistically significant) over both the naïve and uniform pruning at 13.3M MAC operations. Additionally, we find that LcP pushes existing networks closer to the network obtained by neural architecture search (i.e., DPP-Net-M). It is also worth noting that, when pruned with either the naïve pruning or LcP, ResNet-56 is able to compete with a uniformly pruned MobileNetV2.

Figure 6: Networks characterized by a small number of MAC operations. We plot the mean and standard deviation over three trials.

Different Heuristic Metrics

Different heuristic metrics incur different approximation errors and, hence, different performance in the pruned network. As shown in Fig. 7, we find that with LcP, the performance of all considered heuristic metrics can be further improved.

Figure 7: Pruning ResNet-56 with naïve pruning on CIFAR-10 using various heuristic metrics.
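As a minimal sketch of what such heuristic metrics look like, the snippet below scores each filter of a convolutional layer by the l1 or l2 norm of its weights, two common magnitude-based heuristics; the function name and the flat-list representation of filters are our own simplifications, and the paper's metrics are not limited to these two.

```python
def filter_scores(weights, metric="l2"):
    """Score each filter of a conv layer with a heuristic metric.

    `weights` is a list of filters, each given as a flat list of
    coefficients; supported metrics are the l1 and l2 norms of the
    filter weights (larger score = more important).
    """
    if metric == "l1":
        return [sum(abs(w) for w in f) for f in weights]
    if metric == "l2":
        return [sum(w * w for w in f) ** 0.5 for f in weights]
    raise ValueError(f"unknown metric: {metric}")
```

Since each metric only approximates the true loss change of removing a filter, swapping the metric changes the ranking and, with it, the pruned network's accuracy, which is what the per-layer latent variables compensate for.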

5.4 ImageNet Results

For ImageNet, we conduct iterative constraint tightening: we gradually prune away 25%, 42%, and 50% of the MAC operations. We note that iterative pruning is adopted for ImageNet in prior work as well He et al. (2018b). We demonstrate the effectiveness of LcP on ResNet-50. As shown in Table 2, LcP is superior to prior art that reports results on ResNet-50, which aligns with our earlier analysis. When pruning away 25% of the MAC operations, LcP achieves even higher accuracy than the original network. Furthermore, when pruned to 58% and 50% of the original MAC operations, our algorithm achieves state-of-the-art results compared to prior art. Since the number of MAC operations does not directly translate into speedup, we also report the latency of the pruned network at various inference batch sizes in Table 3.
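The iterative constraint-tightening loop can be sketched as follows: instead of jumping to the final budget in one step, the network is pruned to each successively smaller MAC target with fine-tuning in between. The `prune_to` and `fine_tune` callables are placeholders for the paper's actual pruning (LcP) and fine-tuning procedures.

```python
def iterative_tightening(network, prune_to, fine_tune,
                         targets=(0.75, 0.58, 0.50)):
    """Iterative constraint tightening sketch.

    `targets` lists the remaining MAC fractions after each round
    (here: prune away 25%, then 42%, then 50% of the original MACs);
    the network is fine-tuned after every pruning round.
    """
    for t in targets:
        network = prune_to(network, t)   # e.g. run LcP at this MAC budget
        network = fine_tune(network)     # recover accuracy before tightening
    return network
```

Tightening the constraint gradually keeps each pruning step small enough for fine-tuning to recover, which matters on a large dataset like ImageNet.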

Method               | Top-1 (orig / pruned) | Top-1 Diff | Top-5 (orig / pruned) | MAC (%)
---------------------|-----------------------|------------|-----------------------|--------
LcP (Ours)           | 76.13 / 76.22         | +0.09      | 92.86 / 93.05         | 75
Yu et al. (2018)     | - / -                 | -0.21      | - / -                 | 73
Huang & Wang (2018)  | 76.12 / 74.18         | -1.94      | 92.86 / 91.91         | 69
Luo et al. (2017)    | 72.88 / 72.04         | -0.84      | 91.14 / 90.67         | 63
Lin et al. (2018)    | 75.13 / 72.61         | -2.52      | 92.30 / 91.05         | 58
He et al. (2018a)    | 76.15 / 74.61         | -1.54      | 92.87 / 92.06         | 58
LcP (Ours)           | 76.13 / 75.28         | -0.85      | 92.86 / 92.60         | 58
Yu et al. (2018)     | - / -                 | -0.89      | - / -                 | 56
He et al. (2017)     | - / -                 | -          | 92.2 / 90.8           | 50
Wang et al. (2018)   | - / -                 | -          | 91.2 / 90.4           | 50
LcP (Ours)           | 76.13 / 75.17         | -0.96      | 92.86 / 92.44         | 50

Table 2: Summary of pruned ResNet-50 on ImageNet.

Method         | Latency per image (ms), batch size increasing left to right
---------------|-------------------------------------------------------------
ResNet50       | 5.03 | 2.17 | 1.53 | Out of Memory
ResNet50 (2x)  | 4.62 | 1.69 | 1.11 | 0.99

Table 3: Latency profile of ResNet-50. We report the latency per image in milliseconds for different batch sizes (BS).
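A per-image latency profile like Table 3 can be gathered with a simple timing harness such as the one below; the function name and the use of a generic `run_batch` callable (in place of an actual model forward pass) are our own simplifications.

```python
import time

def latency_per_image(run_batch, batch_sizes, repeats=10):
    """Measure average latency per image at each batch size.

    `run_batch(bs)` performs one inference at batch size `bs`.  Returns a
    dict mapping batch size -> milliseconds per image, or None when the
    run fails (mirroring the "Out of Memory" entry in Table 3).
    """
    results = {}
    for bs in batch_sizes:
        try:
            run_batch(bs)                           # warm-up run
            start = time.perf_counter()
            for _ in range(repeats):
                run_batch(bs)
            elapsed = time.perf_counter() - start
            results[bs] = elapsed / repeats / bs * 1000.0
        except MemoryError:
            results[bs] = None
    return results
```

Larger batches amortize fixed per-batch overhead, which is why the per-image latency in Table 3 drops as the batch size grows until memory runs out.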

5.5 Transfer Learning Results

We analyze how LcP performs in a transfer learning setting, where we have a model pre-trained on a large dataset, e.g., ImageNet, and want to transfer its knowledge to a smaller dataset, e.g., Bird-200. We prune the fine-tuned network directly on the target dataset, rather than pruning on the large dataset before transferring, for two reasons: (i) the user only cares about the network's performance on the target dataset, which means we need the Pareto frontier on the target dataset, and (ii) pruning on a small dataset is much more efficient than pruning on a large one. We first obtain a fine-tuned MobileNetV2 on the Bird-200 dataset with a top-1 accuracy of 80.22%, which matches the numbers reported for VGG-16 and DenseNet-121 in prior art Mallya & Lazebnik (2018). Even with an already small model such as MobileNetV2, we achieve 78.34% accuracy at 51% of the MAC operations (153M), while our implementation of greedy single-filter pruning Molchanov et al. (2017) achieves 75.94% at 50% of the MAC operations (150M).

6 Ablation Study

6.1 Limiting the Layer Schedule

Figure 8: Pruning ResNet-56 with naïve pruning on CIFAR-10 using various pruning budgets.

Since we pick the 10% budget arbitrarily, simply to avoid extreme pruning, we study the effect of different pruning budgets on the performance of naïve pruning. As shown in Fig. 8, different budgets perform similarly. However, we observe a drop for the 30% budget at the 20% MAC constraint. This is because, with a high budget such as 30%, pruning down to 20% MAC operations results in a layer schedule similar to the one produced by the uniform pruning. We note that although pruning on CIFAR-10 performs fine even without budgeting, such budgeting is needed when pruning for the Bird-200 dataset.
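Enforcing such a per-layer budget floor can be sketched as a simple clamp applied to the layer schedule; the function name, dict-based schedule representation, and rounding rule below are illustrative choices of ours.

```python
def clamp_schedule(schedule, originals, floor=0.10):
    """Enforce a per-layer pruning budget.

    `schedule` maps layer -> number of filters kept; `originals` maps
    layer -> original filter count.  Each layer keeps at least `floor`
    (e.g. 10%) of its original filters, so no layer is pruned away
    entirely.
    """
    clamped = {}
    for layer, kept in schedule.items():
        min_keep = max(1, int(round(floor * originals[layer])))
        clamped[layer] = max(kept, min_keep)
    return clamped
```

With `floor=0.30`, the clamp binds on many layers under a tight MAC constraint, pushing the schedule toward the uniform one, which is consistent with the drop observed at the 30% budget in Fig. 8.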

7 Conclusion

In this work, we treat the filter pruning problem as a global ranking problem rather than the two separate sub-problems commonly considered in the literature. Moreover, based on an analysis of the approximation error incurred by this simplification, we propose layer-compensated pruning (LcP), which uses meta-learning to learn a set of latent variables that compensate for the layer-wise approximation error, improving the performance of various heuristic metrics. With this formulation, we learn the layer schedule with slightly better performance using 8x less time than the reinforcement learning approach proposed in prior art, which is significant when considering Pareto frontier traversal. Moreover, when targeting networks with a small number of MAC operations, our algorithm produces networks comparable to one found by a bottom-up neural architecture search while being superior to the uniform and naïve pruning. Last, we conduct a comprehensive analysis of the proposed method to demonstrate both the effectiveness and the efficiency of our approach using two types of neural networks and three datasets.