1 Introduction
With the ubiquity of mobile and edge devices, it has become desirable to bring the performance of convolutional neural networks (CNNs) to the device without going through the cloud, especially due to latency and privacy considerations. However, mobile and edge devices are often characterized by stringent resource constraints, such as energy consumption and model size. Furthermore, depending on the application domain, multiply-accumulate operations (MACs) or latency may also need to be constrained. On the other hand, as the resource usage of each workload (i.e., CNN) decreases, the system can serve more of them at once, which further creates system-level optimization opportunities. As a result, the main research question is how we design resource-constrained CNNs that are still capable of delivering good enough task performance, i.e., prediction accuracy. This problem has recently been approached from two directions: (i) resource-constrained neural architecture search and (ii) resource-constrained model compression. On one hand, neural architecture search is a bottom-up approach that has shown promise in finding better solutions due to its larger search space; however, it requires massive computational power to conduct the search. On the other hand, resource-constrained model compression is a top-down approach that seeks solutions in the neighborhood of a provided pre-trained network. As a result, the final solution may be worse in terms of accuracy, but it is much more efficient to find when compared to neural architecture search. We conjecture that the latter approach provides desirable means for machine learning practitioners to take advantage of off-the-shelf, carefully-crafted deep neural networks and adapt them to user-defined resource constraints without having to search for and train a network for a long time.
Among various model compression methods, structural pruning has been widely adopted to approach the resource-constrained deep neural network optimization problem Molchanov et al. (2017); He et al. (2017); Luo et al. (2017); He et al. (2018b); Gordon et al. (2018), since it provides solutions that work directly on existing deep learning frameworks and hardware. Specifically, for CNNs, the structural pruning method we discuss in this paper is filter pruning Li et al. (2017). However, resource-constrained pruning is a hard problem since (i) deciding which filters to disable such that accuracy is maximized and constraints are satisfied is an NP-hard combinatorial problem, and (ii) deep neural networks are highly non-linear functions that are hard to analyze. To tackle these problems, prior art leverages approximations and considers two sub-problems: (a) how many filters to prune for each layer, which we call the layer scheduling problem, and (b) which filters to prune given a per-layer pruning budget, which we call the ranking problem. The majority of prior work focuses on the ranking
problem and proposes various heuristics to measure the importance of each filter
Jiang et al. (2018); Mao et al. (2017); He et al. (2017); Luo et al. (2017); Yoon et al. (2018); Li et al. (2017), while addressing layer scheduling by either employing manually designed rules based on experience or simply pruning the network uniformly by a fixed percentage across layers. To achieve effective and efficient filter pruning, there is a need for algorithms that determine the layer schedule without humans in the loop. Pioneering this direction, He et al. (2018b) approach this problem with reinforcement learning to learn an agent that decides how many filters to prune for each layer given a resource constraint. While achieving better results than a hand-tuned policy, we note that such a formulation is too time-consuming if the goal is the Pareto frontier of the design space of interest. We argue that traversing the Pareto frontier efficiently is a critical feature desired by machine learning practitioners, since constraints are not always known a priori when building a system running convolutional neural network models among other workloads. In this work, we first theoretically derive the pruning problem such that the layer scheduling problem and the ranking problem are treated as a unified problem with a naïve solution. Then, based on an analysis of the derivation, we introduce a novel formulation that takes the approximation errors into account to further improve the naïve solution. Specifically, we leverage meta-learning to learn a set of latent variables that compensate for the approximation error, and we call the result the layer-compensated pruning algorithm. Overall, our contributions are as follows:

We define the pruning problem from a theoretical standpoint and connect it to prior solutions throughout the derivation.

We propose a novel, effective, and efficient algorithm, dubbed layer-compensated pruning, which improves upon prior art by learning to compensate for the approximation error incurred in the derivation in a layer-wise fashion. Specifically, we achieve slightly better results 8x faster.

In our general formulation, we show that layer-compensated pruning can improve various commonly-adopted heuristic metrics, such as the ℓ1-norm and ℓ2-norm of weights and the first-order Taylor approximation.

We conduct comprehensive analysis to justify our findings on already-small deep neural networks, i.e., ResNet and MobileNetV2, using three datasets, i.e., CIFAR-10, ImageNet, and Birds-200.
2 Related Work
For resource-constrained DNN design, there are generally two approaches: bottom-up and top-down.
2.1 Bottom-up Resource-constrained DNN Design
Bottom-up approaches try to build the neural network from the ground up while incorporating some awareness of the resources. These types of approaches are often called multi-objective or platform-aware neural architecture search Hsu et al. (2018); Zhou et al. (2018); Tan et al. (2018); Dong et al. (2018). Within this domain, every approach has a different search space, learning algorithm, and decision space. For example, Zhou et al. (2018) used a policy gradient algorithm to learn two controllers: one is in charge of scaling the number of neurons in an existing layer, and the other removes or inserts new operations on top of the current layer given the network embedding. Similarly, Hsu et al. (2018) used policy gradient to learn a controller that suggests the hyper-parameters for the target network given the network embedding, but with a smaller search space that tunes the hyper-parameters of an existing architecture (e.g., number of filters in convolution and growth rate in CondenseNet Huang et al. (2017)). On the other hand, Tan et al. (2018) relied on a search space that covers connections and operations within a cell and determines how many stacked cells should form a block. Lastly, Dong et al. (2018) proposed to learn a surrogate function that predicts the accuracy of network candidates, which makes the traversal of the Pareto frontier efficient. Although bottom-up approaches enable promising solutions due to the larger search space, they are much more time-consuming compared to top-down approaches.

2.2 Top-down Resource-constrained DNN Design
We consider pruning and quantization top-down approaches since they start with an existing neural network and try to trim down connections or precision. While there is emerging work toward network quantization Khoram & Li (2018) in a resource-constrained setting, structural pruning is often adopted as the solution for the problem Gordon et al. (2018); Luo et al. (2017); He et al. (2017); Molchanov et al. (2017); He et al. (2018b); Yang et al. (2018a); Jiang et al. (2018); He et al. (2018a); Hu et al. (2016); Liu et al. (2017); Li et al. (2017); Lin et al. (2018); Wen et al. (2016), given its fine-grained control over the constraints. We group solutions to structural pruning into the following three categories.
Joint Optimization
In this line of work, current approaches try to jointly optimize for the model weights $\theta$ as well as the filter mask $z$. A common approach is to let the scaling factor of the batch normalization layer that follows the convolution layer act as the mask $z$, and to add weighted regularization terms that suppress the scaling factors of the batch normalization layer. As a result, standard training procedures using stochastic gradient descent optimize both $\theta$ and $z$ jointly Ye et al. (2018); Liu et al. (2017); Gordon et al. (2018). On the other hand, Louizos et al. (2017) formulate the loss function from a Bayesian standpoint, while Dai et al. (2018) derive the loss function from an information-theoretic standpoint. However, we note that such approaches rely on non-intuitive tuning knobs to traverse the Pareto frontier.

Local Ranking with Layer Scheduling
In this line of research, resource-constrained filter pruning is done by solving two sub-problems: (a) decide how many filters to prune for each layer such that the constraint is satisfied, which we call layer scheduling, and (b) decide which filters to prune, given a layer schedule. While most of the prior art Jiang et al. (2018); Mao et al. (2017); He et al. (2017); Luo et al. (2017); Yoon et al. (2018); Li et al. (2017) decides the layer schedule manually based on experience, we argue that the layer schedule directly affects the resource usage of the resulting network. As a result, it is not scalable to rely on experts to determine the layer schedule so as to traverse the Pareto frontier. Although He et al. (2018b) provide a good and scalable (compared to human-in-the-loop) solution for resource-constrained pruning, the reinforcement learning agent generates the layer schedule sequentially for each layer, which is inefficient and presumably harder to learn for deeper networks due to the credit assignment problem Sutton et al. (1998).
Global Ranking with Singlefilter Pruning
In this category, the filters of the entire network are ranked together according to some heuristic metric, and pruning is an iterative process with three steps: rank filters globally, greedily prune one filter at a time, and fine-tune the network. The process continues until the resource constraints are satisfied, which makes the traversal along different constraint values intuitive. Molchanov et al. (2017) propose to leverage a first-order Taylor approximation for the ranking and progressively prune one filter at a time, while Theis et al. (2018) later propose to leverage Fisher information for ranking. However, pruning one filter at a time is not scalable when the number of filters is large, as in the case of MobileNetV2 Sandler et al. (2018), which has 17k filters. While our work is also considered a global ranking approach, we conduct multi-filter pruning instead of single-filter pruning, and we find multi-filter pruning to be both more effective and more efficient.
3 Problem Formulation
We are interested in solving the problem of resource-constrained filter pruning. Formally, the optimization problem to consider is:
$$\min_{\theta,\, z}\ \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathcal{L}(f(x;\, \theta \odot z),\, y)\big] \quad \text{s.t.} \quad R(z) \leq C \tag{1}$$
In the objective function, $(x, y) \sim \mathcal{D}$, where $x$ and $y$ are the input and label, respectively, $\mathcal{D}$ is the data distribution we care about, $\mathcal{L}$ is the loss function, $f$ is the DNN model, and $\theta$ and $\tilde{\theta}$ are the target and pre-trained model weights, which are $K$-dimensional arrays of vectors with each dimension representing the weights of a filter. $z \in \{0, 1\}^K$ is a $K$-dimensional indicator vector representing whether the corresponding filter is pruned ($z_i = 0$) or not ($z_i = 1$). On the constraint side, $R$ is the target resource usage evaluation function and $C$ is the desired resource constraint. For ease of notation, we denote by $\Delta\mathcal{L}(z) = \mathcal{L}(\theta \odot z) - \mathcal{L}(\tilde{\theta})$ the loss difference caused by $z$. We approach this optimization problem with the commonly adopted optimization framework He et al. (2018b, 2017); Luo et al. (2017); Li et al. (2017); Yang et al. (2018a); Yoon et al. (2018); Mao et al. (2017); He et al. (2018a); Jiang et al. (2018); Lin et al. (2018), as shown in Algorithm 1.
Note that solving for $z$ optimally in each iteration of Algorithm 1 (line 5) is a combinatorial problem that takes $O(2^K)$ evaluations of the objective. In practice, a greedy approximation is often adopted to find a suboptimal $z$.
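The overall framework can be sketched as follows; `rank_and_prune` and `finetune` are hypothetical stand-ins for the two sub-steps (lines 5 and 6 of Algorithm 1), and the linear constraint-tightening schedule is an illustrative assumption rather than the paper's exact procedure:

```python
# A minimal sketch of the iterative prune-and-finetune framework
# (Algorithm 1). All function names here are illustrative stand-ins.

def iterative_pruning(model, target_usage, usage_fn, rank_and_prune,
                      finetune, num_iters=5):
    """Tighten the resource constraint over several iterations."""
    start = usage_fn(model)
    for t in range(1, num_iters + 1):
        # Interpolate the constraint from current usage toward the target.
        budget = start + (target_usage - start) * t / num_iters
        model = rank_and_prune(model, budget)   # line 5: solve for the mask z
        model = finetune(model)                 # line 6: recover accuracy
    return model
```

With `num_iters=1`, this degenerates to the one-shot pruning used later for CIFAR-10.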
3.1 Greedy Singlefilter Pruning
To solve Algorithm 1 (line 5) in a greedy fashion, prior work assumes that the new mask prunes one additional filter on top of the current mask $z$ Molchanov et al. (2017); Theis et al. (2018). That is, they seek to optimize:
$$i^* = \operatorname*{argmin}_{i \in S(z)}\ \Delta\mathcal{L}(z^{-i}) \tag{2}$$
where $z^{-i}$ represents the pruning of the $i^{\text{th}}$ filter, i.e., $z^{-i}_i = 0$ and $z^{-i}_j = z_j\ \forall j \neq i$. $S(z)$ is the set of remaining non-zero elements in $z$. And
$$\Delta\mathcal{L}(z^{-i}) = \mathcal{L}(\theta \odot z^{-i}) - \mathcal{L}(\tilde{\theta}) \tag{3}$$
In the remainder of this section, we drop superscripts whenever there is no confusion.
Computing the loss difference term in equation (2) is non-trivial, since a dataset needs to be prepared to compute the change in the loss function, and one needs to evaluate the difference $|S(z)|$ times to proceed in each iteration. A common practice is to define a metric $M$ that approximates the loss difference. That is:
$$\Delta\mathcal{L}(z^{-i}) = M_i + \epsilon_i \tag{4}$$
where $M_i$ can be easy to compute based on the magnitude of the model parameters and $\epsilon_i$ is the approximation error incurred by the metric approximation.
In the literature, various metric approximations have been proposed. For example, past work has used the ℓ2-norm of filter weights He et al. (2018a); Yang et al. (2018b), the ℓ1-norm of filter weights Mao et al. (2017); Li et al. (2017), the ℓ1-norm of filter weights combined with that of the next layer's filter weights Jiang et al. (2018), first-order Taylor expansion of the loss Molchanov et al. (2017); Lin et al. (2018), Fisher information of the loss Theis et al. (2018), and the variance of the max activation Yoon et al. (2018).

3.2 Greedy Multi-filter Pruning
Greedy single-filter pruning takes $O(pK)$ evaluations of the objective and $p$ rounds of fine-tuning to prune $p$ filters from a pre-trained model with $K$ filters. To prune more filters at a time, or even perform one-shot pruning, it is common to approximate the loss difference caused by a set of pruned filters with addition Luo et al. (2017); Li et al. (2017); Yang et al. (2018a); Yoon et al. (2018); He et al. (2018a, b, 2017); Mao et al. (2017); Jiang et al. (2018); Lin et al. (2018); Yang et al. (2018b). Formally,
$$\Delta\mathcal{L}(z) = \sum_{i \in P} \Delta\mathcal{L}(z^{-i}) + \hat{\epsilon} \tag{5}$$
where $P$ is the set of newly pruned filters and $\hat{\epsilon}$ is the approximation error for the addition.
If the approximation errors $\epsilon_i$ and $\hat{\epsilon}$ are both negligible, solving for $z$ in line 5 of Algorithm 1 is then equivalent to solving the following problem:
$$\min_{z}\ \sum_{i:\, z_i = 0} M_i \quad \text{s.t.} \quad R(z) \leq C \tag{6}$$
Local Ranking with Layer Scheduling
Prior art Luo et al. (2017); Li et al. (2017); Yoon et al. (2018); He et al. (2018a, b, 2017); Jiang et al. (2018); Lin et al. (2018); Yang et al. (2018b) adopts the aforementioned assumption and approaches equation (6) by introducing the layer scheduling variables $\{k_l\}$ into the picture:
$$\min_{z}\ \sum_{l=1}^{L} \sum_{i \in F_l,\, z_i = 0} M_i \quad \text{s.t.} \quad \sum_{i \in F_l} z_i = k_l\ \ \forall l, \quad R(z) \leq C \tag{7}$$
and solving it greedily with layer-wise top-$k$ selection. In equation (7), $F_l$ retrieves the filters in the $l^{\text{th}}$ layer and $k_l$ is the number of filters to be kept for the $l^{\text{th}}$ layer. We note that, in this fashion, the layer schedule directly determines whether the resource constraint is met or not.
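The layer-wise top-$k$ selection can be sketched as follows; the function name and the list-of-lists layout for metrics are our illustrative choices, not the paper's code:

```python
# Layer-wise top-k selection (equation (7)): given a layer schedule
# (number of filters to keep per layer) and per-filter metrics, keep
# the highest-scoring filters in each layer.

def layerwise_topk(metrics_per_layer, schedule):
    """Return a 0/1 keep-mask per layer."""
    masks = []
    for scores, k in zip(metrics_per_layer, schedule):
        # indices of the k largest metric values in this layer
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        keep = set(order[-k:]) if k > 0 else set()
        masks.append([1 if i in keep else 0 for i in range(len(scores))])
    return masks
```

Note that the schedule alone fixes the resource usage here; the metric only decides which filters fill each layer's quota.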
Naïve Pruning
Equation (6) can certainly also be solved with a global greedy pruning algorithm without layer scheduling. That is, greedily prune the filter with the least loss difference until the constraint is satisfied, which we call naïve pruning.
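Naïve pruning admits a compact sketch; `usage_fn`, which maps a keep-mask to resource usage (e.g., a MAC count), and the function name are illustrative assumptions:

```python
# Naive global greedy pruning: rank all filters globally by the
# heuristic metric and remove the lowest-scoring ones until the
# resource constraint is met.

def naive_pruning(metrics, usage_fn, constraint):
    """metrics: flat list of per-filter scores (lower = cheaper to prune)."""
    mask = [1] * len(metrics)
    # visit filters from least to most important
    for i in sorted(range(len(metrics)), key=lambda i: metrics[i]):
        if usage_fn(mask) <= constraint:
            break
        mask[i] = 0
    return mask
```

Unlike the layer-scheduled variant, no per-layer quota appears: the global ranking and the constraint alone determine the resulting layer schedule.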
4 Layer-Compensated Pruning
One might wonder how acceptable the above assumption is, namely that the errors $\epsilon_i$ and $\hat{\epsilon}$ are both negligible. We conducted an experiment, shown in Fig. 1, where we plot the heuristic metrics on the x-axis and the calculated loss difference on the y-axis for each filter, for three common metrics, with different colors representing different layers. It is obvious that $\epsilon_i$ is not negligible in practice; specifically, filters with similar norms of weights exhibit loss differences that span a wide range. Since $\epsilon_i$ is not negligible, equation (6) should include the error term as follows:
$$\min_{z}\ \sum_{i:\, z_i = 0} (M_i + \epsilon_i) \quad \text{s.t.} \quad R(z) \leq C \tag{8}$$
To avoid computing $\epsilon_i$ exactly, we conjecture that good estimates of the approximation errors lead to better solutions than naïve pruning alone. Furthermore, we leverage a heuristic that treats the approximation errors as identical for filters in the same layer. As a result, we propose a set of layer-dependent latent variables $\beta \in \mathbb{R}^L$ that represent the error estimates to be solved for, where $L$ is the number of layers in the network. Namely, before solving equation (8), we first solve the following equation to obtain the per-layer error estimates:
$$\beta^* = \operatorname*{argmin}_{\beta}\ \Delta\mathcal{L}\big(\text{Naïve}(\{M_i + \beta_{l(i)}\},\, C)\big) \tag{9}$$
where $l(i)$ retrieves the layer index that filter $i$ belongs to, and Naïve$(\cdot)$ represents naïve pruning, which returns the set of filter indexes to be pruned. Intuitively speaking, we need to find error compensations such that the solution from naïve pruning has the minimum loss difference. Once $\beta^*$ is obtained, we then replace $\epsilon_i$ in equation (8) with $\beta^*_{l(i)}$ and solve it with naïve pruning. While this is still an approximation, we show in later experiments that this approach produces networks with a smaller loss difference on the unseen testing dataset compared to solving equation (6) naïvely.
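The compensation step itself is a one-line transformation of the metrics; in this sketch, `layer_of` and `beta` stand in for the layer lookup and the latent variables, and the names are ours rather than the paper's:

```python
# Layer-compensated ranking: shift each filter's metric by a learned
# per-layer latent variable before running the same global greedy
# pruning as in the naive approach.

def compensated_metrics(metrics, layer_of, beta):
    """Add the per-layer compensation beta[layer_of(i)] to each metric."""
    return [m + beta[layer_of(i)] for i, m in enumerate(metrics)]
```

Feeding the shifted metrics back into the global greedy procedure is what lets the latent variables reshape the layer schedule without ever computing the per-filter errors.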
4.1 Learning the Layer-wise Compensations
Equation (9) can be approached by derivative-free meta-heuristic optimization algorithms such as genetic algorithms, evolutionary strategies, and evolutionary algorithms; we leverage the regularized evolutionary algorithm proposed in Real et al. (2018) for its effectiveness on the neural architecture search problem. In our regularized evolutionary algorithm, we first generate a pool of candidates ($\beta$s), then repeat the following steps: (i) sample a subset from the candidates, (ii) identify the fittest candidate, (iii) generate a new candidate by mutating the fittest candidate, and (iv) replace the oldest candidate in the pool with the generated one. We define the fittest candidate as the one with the minimum objective value in equation (9). We initialize the pool by sampling from normal distributions, i.e., $\beta_l \sim \mathcal{N}(0, \sigma_l^2)$, where we denote by $\sigma_l$ the standard deviation of the metric of the filters at layer $l$. For mutation, we randomly select a fraction $m$ of the current $\beta$ to be perturbed by noise drawn from $\mathcal{N}(0, \sigma_t^2)$, where $m$ is the hyper-parameter that controls the exploration of the search and $\sigma_t$ is the step size, which can be gradually decreased to reduce exploration in the later stages of the optimization.

5 Evaluations
5.1 Datasets and Training Setting
Our work is evaluated on various benchmarks, including CIFAR-10 Krizhevsky & Hinton (2009), ImageNet Russakovsky et al. (2015), and Birds-200 Wah et al. (2011). The first is a standard image classification dataset that consists of 50k training images and 10k testing images with 10 classes, while the second is a large-scale image classification dataset that includes 1.2 million training images and 50k testing images with 1000 classes. We also benchmark layer-compensated pruning in a transfer learning setting, since in practice we want a small and fast model on some target dataset rather than on ImageNet; hence, we use the Birds-200 dataset, which consists of 6k training images and 5.7k testing images covering 200 bird species.
For CIFAR-10, the training parameters for the baseline models follow prior work He et al. (2018a), which uses stochastic gradient descent with Nesterov momentum and an initial learning rate of 0.1, dropped by 10x at epochs 60, 120, and 160, training for 200 epochs in total. For pruning, we keep all training hyper-parameters the same but change the initial learning rate to 0.01 and train for only 60 epochs, dropping the learning rate by 10x at the correspondingly scaled epochs, i.e., epochs 18, 36, and 48. For ImageNet, we use a pre-trained model and, when fine-tuning, a smaller initial learning rate that is dropped by 10x at epoch 20; we train for only 30 epochs, similar to prior work He et al. (2018b). For all transfer learning datasets, we fine-tune the models that are pre-trained on ImageNet on the target datasets with a 0.001 learning rate for the last layer and 0.0001 for the other layers; we train for 60 epochs and drop the learning rate by 10x at epoch 48 to obtain the baseline model on the target dataset.

5.2 Implementation Details
MobileNetV2 for CIFAR-10
We made some changes to the original MobileNetV2 architecture design Sandler et al. (2018)
to adapt it from ImageNet with 224x224 input image size to CIFAR-10 with 32x32 input image size. Specifically, we change the convolution stride from two to one for block two and block four. With these changes, the final layer produces 8x8 feature maps, which is also the case for the ResNet designed for CIFAR-10 He et al. (2016).

Heuristic Metrics
We mainly consider the ℓ2-norm of weights unless noted otherwise, and we conduct an ablation study in Section 5.3 for other ranking metrics as well. Specifically, the ℓ1-norm of the weights of a filter is calculated as $\sum_{c,h,w} |\theta_{c,h,w}|$, the absolute summation across all three dimensions of a filter Li et al. (2017), i.e., input channels, kernel width, and kernel height. Similarly, the ℓ2-norm is $\sum_{c,h,w} \theta_{c,h,w}^2$, the squared summation over all three dimensions He et al. (2018a). Lastly, the first-order Taylor approximation is calculated as the absolute value of the averaged products of activations and their loss gradients Molchanov et al. (2017).
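As an illustration, the three metrics can be computed per filter as follows; this pure-Python sketch stores a filter as nested lists of shape input channels x kernel height x kernel width, and the Taylor score averages gradient-activation products over the feature map, following the description above (function names are ours):

```python
# Per-filter heuristic metrics: l1-norm, l2-norm, and a first-order
# Taylor score. Shapes and names are illustrative.

def l1_norm(filt):
    """Absolute summation over all three dimensions of a filter."""
    return sum(abs(w) for ch in filt for row in ch for w in row)

def l2_norm(filt):
    """Squared summation over all three dimensions of a filter."""
    return sum(w * w for ch in filt for row in ch for w in row)

def taylor_score(activations, gradients):
    """|mean over the feature map of gradient * activation| per filter."""
    n = len(activations)
    return abs(sum(a * g for a, g in zip(activations, gradients)) / n)
```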
Pruning Residual Connections
We note that pruning residual connections is tricky, since pruning complementary filters from the two operands of a residual addition still results in an output with the same dimension. Concretely, assuming we want to prune two six-filter kernels whose outputs are added together by a residual connection, if we prune the first three filters of one operand and the last three filters of the other operand, the output of the addition would still have six channels. To avoid this complication, we follow prior work Gordon et al. (2018) and group filters that are added by a residual connection together. That is, grouped filters are either all pruned or all kept. We use the sum of the individual metrics as the metric for a group.

Limiting Pruning Budget
Evolutionary Algorithm
For the evolutionary algorithm that searches for $\beta$, we set the total number of iterations to 336 and the size of the candidate pool to 64, so that the total number of candidates seen is 400, the same as in prior work He et al. (2018b). We arbitrarily select $m$ to be 0.1, which means each mutation randomly selects a tenth of the entries of $\beta$ to perturb. We linearly reduce the step size to avoid oscillating around local optima. We randomly sample 3k images from the training set, following He et al. (2018b), to evaluate the objective in equation (9) for each candidate.
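Putting the pieces of Section 4.1 together, a minimal sketch of the regularized evolutionary search might look as follows; the fitness function is a placeholder for the objective in equation (9), and the decaying step size and one-tenth mutation fraction mirror the settings described above:

```python
import random

# A sketch of regularized evolutionary search for the per-layer
# compensations beta. Defaults mirror the text (pool of 64, 336
# iterations); the sample size and seed are illustrative assumptions.

def regularized_evolution(fitness, num_layers, pool_size=64, iters=336,
                          sample_size=8, sigma=1.0, seed=0):
    rng = random.Random(seed)
    # initialize the candidate pool with Gaussian samples
    pool = [[rng.gauss(0.0, sigma) for _ in range(num_layers)]
            for _ in range(pool_size)]
    for t in range(iters):
        # (i) sample a subset, (ii) pick the fittest (lowest objective)
        subset = rng.sample(pool, sample_size)
        parent = min(subset, key=fitness)
        # (iii) mutate: perturb a random tenth of the entries,
        # with a step size that decays linearly over the search
        step = sigma * (1.0 - t / iters)
        child = list(parent)
        for i in rng.sample(range(num_layers), max(1, num_layers // 10)):
            child[i] += rng.gauss(0.0, step)
        # (iv) regularization: replace the oldest candidate
        pool.pop(0)
        pool.append(child)
    return min(pool, key=fitness)
```

Because the search is non-parametric, each iteration only draws random numbers and evaluates the fitness; no controller network or back-propagation is involved.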
5.3 Analysis on CIFAR-10
For our methods on CIFAR-10, we conduct one-shot pruning. That is, we set the constraint to the target constraint in the first iteration of Algorithm 1.
Effectiveness
As shown in Fig. 2, we find that when targeting the MAC constraint, naïve pruning outperforms the uniform pruning baseline and single-filter pruning Molchanov et al. (2017) by a large margin, and is slightly better than a joint optimization technique that optimizes for the MAC constraint Gordon et al. (2018). Theoretically speaking, single-filter pruning should have a smaller approximation error and lead to better solutions; however, we find that due to the small learning rate used for fine-tuning between iterations, it is easy for the network to get stuck at a local optimum. We postulate that each point obtained by single-filter pruning could be further improved by extra fine-tuning with a larger learning rate. Beyond the naïve approach, layer-compensated pruning (LcP) obtains a better Pareto frontier than all other methods, and it is effective under both the model size and the MAC operations constraints. We also compare with prior art in Table 1, where LcP performs favorably. We note that among the prior methods we compare against, five of them Molchanov et al. (2017); Gordon et al. (2018); He et al. (2018b); Louizos et al. (2017); Dai et al. (2018) target the layer scheduling problem directly or indirectly, while the others focus on complementary problems such as the ranking problem He et al. (2017); Li et al. (2017) and the optimization process He et al. (2018a), i.e., line 6 of Algorithm 1. In particular, compared to prior work He et al. (2018b) that uses reinforcement learning to learn the layer schedule for each layer, our algorithm produces lower accuracy degradation starting from a stronger baseline model.
Efficiency
We include the Pareto frontier traversal time cost in Table 1, where $T_h$ is the time needed for human experts to design the layer schedule, $T_\lambda$ is the trade-off parameter tuning time for joint optimization approaches, $T_s$ is the time it takes for short-term fine-tuning in single-filter pruning, $T_m$ is the time it takes for meta-learning to learn the layer schedule given a target constraint value, $T_t$ represents the time to train a pruned network, $n$ represents the number of points of interest on the Pareto frontier, and $p$ is the number of filters that must be pruned to achieve the lowest constraint value of interest in the single-filter pruning method. First, single-filter pruning Molchanov et al. (2017) requires filters to be pruned one by one and conducts fine-tuning between iterations, which incurs a huge overhead to obtain networks with stringent resource usage and to prune networks with large filter counts. Second, the reinforcement learning approach proposed in prior work He et al. (2018b) requires the reinforcement learning agent to be learned for each constraint value considered, where each learning run takes 1 hour on a GeForce GTX TITAN Xp GPU for CIFAR-10. Last, although the joint optimization techniques Gordon et al. (2018); Louizos et al. (2017); Dai et al. (2018) are efficient for Pareto frontier traversal, since there is no separate optimization for $\theta$ and $z$, controlling the trade-off between resource usage and accuracy in joint optimization methods is not intuitive and requires several rounds of trial and error to achieve the target constraint. Specifically, it is necessary to tune the regularizer strength to traverse the Pareto frontier; we sweep it over several orders of magnitude to obtain the points for prior work Gordon et al. (2018) in Fig. 2. Moreover, such a trade-off knob is affected by the learning rate; that is, different learning rates result in networks with different resource usage while keeping the trade-off knob fixed. In comparison, LcP has an intuitive traversal knob, i.e., MAC operations or model size.
Therefore, it is more efficient for traversing the Pareto frontier than joint optimization approaches, since it eliminates the need to tune the learning rate and the trade-off hyper-parameters. Compared to the reinforcement learning-based approach He et al. (2018b), our algorithm requires only 7 minutes on a single GeForce GTX 1080 Ti GPU, which is comparable to or slightly slower than a GeForce GTX TITAN Xp GPU, to solve equation (9) while observing the same number of candidates as He et al. (2018b) during meta-learning. We note that the speed gain over prior work He et al. (2018b) mainly comes from two sources: (i) generating the layer schedule with reinforcement learning requires network inferences, while our approach merely draws random variables from normal distributions, and (ii) our evolutionary algorithm is non-parametric, which means we do not have to conduct back-propagation for learning, whereas it must be performed through the controller network in the reinforcement learning approach. Additionally, with our formulation in equation (9), we do not have to constrain the solution space, while the formulation in prior art He et al. (2018b) requires the solution space for the later layers to be constrained to satisfy the overall resource constraints.

Naïve Pruning
One might wonder why naïve pruning works well under the MAC operations constraint; hence, we visualize the layer schedules it produces. As shown in Fig. 3, we plot the number of filters for each layer, normalized to the original network, under different MAC constraint values. First, we note that the residual connections are not pruned although they are allowed to be; this is because many convolutions are grouped together by the residual connections, which results in a large metric score (Section 5.2). On the other hand, we observe that shallower layers get pruned earlier than deeper layers, and since shallower layers involve more MAC operations due to larger feature maps, pruning them earlier results in a better Pareto frontier under the MAC operations constraint. This provides some evidence toward the effectiveness of norm-based naïve pruning under the MAC constraint and explains its worse performance under the model size constraint.
Layer-compensated Pruning
To reason about the effectiveness of the proposed layer-compensated pruning (LcP), we plot the accuracy before fine-tuning for naïve pruning, uniform pruning, and LcP. As we can see from Fig. 4, LcP produces networks with higher accuracy before fine-tuning, which means the latent variables found by solving equation (9) on the training dataset do not overfit the training set. We conjecture that good accuracy before fine-tuning acts as a better initialization that leads to better final performance after fine-tuning.
We further plot the training curve of our evolutionary algorithm when solving equation (9) for different target MAC operations constraints in Fig. 5 to understand the meta-learning process. We find that when the target constraint values are loose, i.e., 90% to 80% of the original MAC operations, the improvement brought by the latent variables is not significant, since it is easy to prune the network by just a little without accuracy degradation. On the other hand, our algorithm becomes preferable to the naïve solution as the pruning constraint gets more stringent. Additionally, the algorithm converges around 100 iterations, which implies that either the step size mentioned in Section 4.1 could be further optimized or one could perform early stopping.
Compact Networks
Other than ResNet-56, we also evaluate MobileNetV2 on CIFAR-10. As shown in Fig. 6, we plot the networks with small MAC counts, including a network obtained by neural architecture search Dong et al. (2018). We first note that a 5x theoretical speedup (53.1M MAC operations) is achieved without accuracy degradation for MobileNetV2. Consistent with the ResNet-56 results, simple naïve pruning yields a better Pareto frontier than uniform pruning, while LcP performs best. In particular, LcP achieves a 2% accuracy improvement (statistically significant) over both naïve and uniform pruning at 13.3M MAC operations. Additionally, we find that with LcP, we can push existing networks closer to the network obtained by neural architecture search (i.e., DPP-Net-M). It is also worth noting that ResNet-56 is able to compete with uniformly pruned MobileNetV2 when pruned with either the naïve approach or LcP.
Different Heuristic Metrics
Different heuristic metrics result in different approximation errors and, hence, different performance for the pruned network. As shown in Fig. 7, we find that with LcP, the performance of all considered heuristic metrics can be further improved.
5.4 ImageNet Results
For ImageNet, we conduct iterative constraint tightening, gradually pruning away 25%, 42%, and 50% of the MAC operations. We note that iterative pruning is adopted in prior work for ImageNet as well He et al. (2018b). We demonstrate the effectiveness of LcP on ResNet-50. As shown in Table 2, LcP is superior to prior art that reports results on ResNet-50, which aligns with our observations in the previous analysis. When pruning away 25% of the MAC operations, LcP achieves even higher accuracy than the original model. Furthermore, when pruned to 58% and 50% of the MAC operations, our algorithm achieves state-of-the-art results compared to prior art. Since the number of MAC operations does not directly translate into speedup, we also report the latency of the pruned networks at various inference batch sizes in Table 3.
5.5 Transfer Learning Results
We analyze how LcP performs in a transfer learning setting, where we have a model pre-trained on a large dataset, e.g., ImageNet, and we want to transfer its knowledge to a smaller dataset, e.g., Birds-200. We prune the fine-tuned network on the target dataset directly, instead of pruning on the large dataset before transferring, for two reasons: (i) the user only cares about the performance of the network on the target dataset rather than the source dataset, which means we need the Pareto frontier on the target dataset, and (ii) pruning on a smaller dataset is much more efficient than pruning on a large one. We first obtain a fine-tuned MobileNetV2 on the Birds-200 dataset with 80.22% top-1 accuracy, which matches the numbers reported for VGG-16 and DenseNet-121 in prior art Mallya & Lazebnik (2018). With an already-small model such as MobileNetV2, we achieve 78.34% accuracy with 51% of the MAC operations (153M), while our implementation of greedy single-filter pruning Molchanov et al. (2017) achieves 75.94% with 50% of the MAC operations (150M).
6 Ablation Study
6.1 Limiting the Layer Schedule
Since we arbitrarily picked 10% simply to avoid extreme pruning, we study the effect of different pruning budgets on the performance of naïve pruning. As shown in Fig. 8, we find that different budgets yield similar performance. However, we observe a drop for the 30% budget under a 20% MAC constraint. This is because pruning down to 20% of the MAC operations with a high budget (i.e., 30%) results in a layer schedule similar to that produced by uniform pruning. We note that although CIFAR-10 performs fine even without budgeting, such budgeting is needed when pruning for the Bird-200 dataset.
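The budgeting above can be sketched as a simple clamp on the per-layer keep ratios, so that no layer retains fewer than the budgeted fraction of its filters. The layer names and ratios below are illustrative, not from our experiments.

```python
def clamp_schedule(schedule, min_keep=0.10):
    """Clamp per-layer keep ratios so no layer is pruned below the budget.

    schedule: {layer_name: fraction of filters to keep}
    min_keep: the budget, i.e., the minimum fraction any layer may keep.
    """
    return {layer: max(keep, min_keep) for layer, keep in schedule.items()}

raw = {"conv1": 0.50, "conv2": 0.02, "conv3": 0.25}
print(clamp_schedule(raw))  # conv2 is raised to the 10% floor
```

With a large `min_keep` (e.g., 0.30) and a tight MAC constraint, most layers sit at the floor, which is why the resulting schedule degenerates toward uniform pruning.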
7 Conclusion
In this work, we consider filter pruning as a global ranking problem rather than two separate sub-problems, as is common in the literature. Based on an analysis of the approximation error incurred by this simplification, we propose layer-compensated pruning (LcP), which uses meta-learning to learn a set of latent variables that compensate for the layer-wise approximation error, and which improves the performance of various heuristic metrics. With this formulation, we can learn the layer schedule with slightly better performance using 8x less time than the reinforcement learning approach proposed in prior art, which is significant if one considers Pareto frontier traversal. Moreover, when targeting networks with a small number of MAC operations, our algorithm produces networks comparable to those determined by a bottom-up approach while being superior to uniform and naïve pruning. Lastly, we conduct a comprehensive analysis of the proposed method to demonstrate both its effectiveness and efficiency using two types of neural networks and three datasets.
References
 Dai et al. (2018) Dai, B., Zhu, C., and Wipf, D. Compressing neural networks using the variational information bottleneck. arXiv preprint arXiv:1802.10399, 2018.
 Dong et al. (2018) Dong, J.-D., Cheng, A.-C., Juan, D.-C., Wei, W., and Sun, M. DPP-Net: Device-aware progressive search for Pareto-optimal neural architectures. arXiv preprint arXiv:1806.08198, 2018.

 Gordon et al. (2018) Gordon, A., Eban, E., Nachum, O., Chen, B., Wu, H., Yang, T.-J., and Choi, E. Morphnet: Fast & simple resource-constrained structure learning of deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 He et al. (2017) He, Y., Zhang, X., and Sun, J. Channel pruning for accelerating very deep neural networks. In International Conference on Computer Vision (ICCV), volume 2, 2017.
 He et al. (2018a) He, Y., Kang, G., Dong, X., Fu, Y., and Yang, Y. Soft filter pruning for accelerating deep convolutional neural networks. In IJCAI, pp. 2234–2240, 2018a.
 He et al. (2018b) He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. AMC: AutoML for model compression and acceleration on mobile devices. arXiv preprint arXiv:1802.03494, 2018b.
 Hsu et al. (2018) Hsu, C.-H., Chang, S.-H., Juan, D.-C., Pan, J.-Y., Chen, Y.-T., Wei, W., and Chang, S.-C. MONAS: Multi-objective neural architecture search using reinforcement learning. arXiv preprint arXiv:1806.10332, 2018.
 Hu et al. (2016) Hu, H., Peng, R., Tai, Y.-W., and Tang, C.-K. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250, 2016.
 Huang et al. (2017) Huang, G., Liu, S., van der Maaten, L., and Weinberger, K. Q. Condensenet: An efficient densenet using learned group convolutions. 2017.
 Huang & Wang (2018) Huang, Z. and Wang, N. Data-driven sparse structure selection for deep neural networks. In The European Conference on Computer Vision (ECCV), September 2018.
 Jiang et al. (2018) Jiang, C., Li, G., Qian, C., and Tang, K. Efficient dnn neuron pruning by minimizing layer-wise nonlinear reconstruction error. In IJCAI, volume 2018, pp. 2–2, 2018.
 Khoram & Li (2018) Khoram, S. and Li, J. Adaptive quantization of neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SyOK1Sg0W.
 Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Li et al. (2017) Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. International Conference on Learning Representation (ICLR), 2017.
 Lin et al. (2018) Lin, S., Ji, R., Li, Y., Wu, Y., Huang, F., and Zhang, B. Accelerating convolutional networks via global & dynamic filter pruning. In IJCAI, pp. 2425–2432, 2018.
 Liu et al. (2017) Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. Learning efficient convolutional networks through network slimming. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2755–2763. IEEE, 2017.
 Louizos et al. (2017) Louizos, C., Ullrich, K., and Welling, M. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pp. 3288–3298, 2017.
 Luo et al. (2017) Luo, J.H., Wu, J., and Lin, W. Thinet: A filter level pruning method for deep neural network compression. arXiv preprint arXiv:1707.06342, 2017.
 Mallya & Lazebnik (2018) Mallya, A. and Lazebnik, S. Packnet: Adding multiple tasks to a single network by iterative pruning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 Mao et al. (2017) Mao, H., Han, S., Pool, J., Li, W., Liu, X., Wang, Y., and Dally, W. J. Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922, 2017.
 Molchanov et al. (2017) Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. Pruning convolutional neural networks for resource efficient inference. International Conference on Learning Representation (ICLR), 2017.
 Real et al. (2018) Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
 Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 Sandler et al. (2018) Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.
 Sutton et al. (1998) Sutton, R. S., Barto, A. G., et al. Reinforcement learning: An introduction. MIT press, 1998.
 Tan et al. (2018) Tan, M., Chen, B., Pang, R., Vasudevan, V., and Le, Q. V. Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.
 Theis et al. (2018) Theis, L., Korshunova, I., Tejani, A., and Huszár, F. Faster gaze prediction with dense networks and fisher pruning. arXiv preprint arXiv:1801.05787, 2018.
 Wah et al. (2011) Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The Caltech-UCSD Birds-200-2011 dataset. 2011.
 Wang et al. (2018) Wang, H., Zhang, Q., Wang, Y., and Hu, H. Structured probabilistic pruning for convolutional neural network acceleration. 2018.
 Wen et al. (2016) Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.
 Yang et al. (2018a) Yang, H., Zhu, Y., and Liu, J. End-to-end learning of energy-constrained deep neural networks. arXiv preprint arXiv:1806.04321, 2018a.
 Yang et al. (2018b) Yang, T.-J., Howard, A., Chen, B., Zhang, X., Go, A., Sze, V., and Adam, H. NetAdapt: Platform-aware neural network adaptation for mobile applications. arXiv preprint arXiv:1804.03230, 2018b.
 Ye et al. (2018) Ye, J., Lu, X., Lin, Z., and Wang, J. Z. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. International Conference on Learning Representation (ICLR), 2018.
 Yoon et al. (2018) Yoon, H.J., Robinson, S., Christian, J. B., Qiu, J. X., and Tourassi, G. D. Filter pruning of convolutional neural networks for text classification: A case study of cancer pathology report comprehension. In Biomedical & Health Informatics (BHI), 2018 IEEE EMBS International Conference on, pp. 345–348. IEEE, 2018.
 Yu et al. (2018) Yu, R., Li, A., Chen, C.F., Lai, J.H., Morariu, V. I., Han, X., Gao, M., Lin, C.Y., and Davis, L. S. Nisp: Pruning networks using neuron importance score propagation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 Zhou et al. (2018) Zhou, Y., Ebrahimi, S., Arık, S. Ö., Yu, H., Liu, H., and Diamos, G. Resource-efficient neural architect. arXiv preprint arXiv:1806.07912, 2018.