LeGR
Codebase for paper "LeGR: Filter Pruning via Learned Global Ranking"
Filter pruning has been shown to be effective for learning resource-constrained convolutional neural networks (CNNs). However, prior methods for resource-constrained filter pruning have some limitations that hinder their effectiveness and efficiency. When searching for constraint-satisfying CNNs, prior methods either alter the optimization objective or adopt local search algorithms with heuristic parameterization, which are sub-optimal, especially in the low-resource regime. From the efficiency perspective, prior methods are often costly when searching for constraint-satisfying CNNs. In this work, we propose learned global ranking, dubbed LeGR, which improves upon prior art in the two aforementioned dimensions. Inspired by theoretical analysis, LeGR is parameterized to learn layer-wise affine transformations over the filter norms to construct a learned global ranking. With global ranking, resource-constrained filter pruning at various constraint levels can be done efficiently. We conduct extensive empirical analyses to demonstrate the effectiveness of the proposed algorithm with ResNet and MobileNetV2 networks on CIFAR-10, CIFAR-100, Bird-200, and ImageNet datasets. Code is publicly available at https://github.com/cmu-enyac/LeGR.
With the ubiquity of mobile and edge devices, it has become desirable to bring the performance of convolutional neural networks (CNNs) to edge devices without going through the cloud, especially due to privacy and latency considerations. However, mobile and edge devices are often characterized by stringent resource constraints, such as energy consumption and model size. Furthermore, depending on the application domain, the number of floating-point operations (FLOPs) or latency may also need to be constrained. To simplify the discussion, we use FLOPs as the target resource throughout the paper, keeping in mind that it can be replaced with any other resource of interest (e.g., model size, inference runtime, energy consumption, etc.).
Filter pruning is one of the techniques that can trade off accuracy for FLOPs to arrive at CNNs with various FLOPs of interest. Filter pruning has received growing interest for three reasons. First, it is a simple and systematic way to explore the trade-offs between accuracy and FLOPs. Second, it is easy to apply to various CNNs. Lastly, the pruned network can be accelerated using modern deep learning frameworks without specialized hardware or software support. The key research question in the filter pruning literature is how to provide a systematic algorithm that better trades accuracy for hardware metrics such as FLOPs.
Besides the quality of the trade-off curve given by a filter pruning algorithm, the cost of obtaining the curve is also of interest when using filter pruning algorithms in production. When building a system powered by CNNs, e.g., an autonomous robot, CNN designers often do not know a priori what the most suitable constraint level is, but rather proceed in a trial-and-error fashion with some constraint levels in mind. Thus, obtaining the trade-off curve faster is also of interest.
Most prior methods in filter pruning provide a non-intuitive tunable parameter that trades off accuracy for FLOPs ye2018rethinking ; liu2017learning ; molchanov2016pruning ; he2017channel ; luo2017thinet ; li2016pruning . However, when the goal is to obtain a model at a target constraint level, these algorithms generally require many rounds of tuning for the tunable parameter, which is time-consuming. Recently, algorithms targeting the resource-constrained setting have been proposed he2018amc ; gordon2018morphnet ; however, several limitations hinder their effectiveness. Specifically, MorphNet gordon2018morphnet alters the optimization target by adding regularization when searching for a constraint-satisfying CNN. On the other hand, AMC he2018amc uses reinforcement learning to search for a constraint-satisfying CNN. However, its parameterization of the search space (i.e., the state space and action space) involves many human-induced heuristics. The aforementioned limitations are more pronounced in the low-FLOPs regime, which leads to less accurate CNNs given a target constraint. Additionally, prior art is costly when searching for constraint-satisfying CNNs.

In this research, we introduce a different parameterization for searching constraint-satisfying CNNs. Our parameterization is inspired by theoretical analysis of the loss difference between the pre-trained model and the pruned-and-fine-tuned model. Rather than searching for the percentage of filters to prune for each layer he2018amc , we search for layer-wise affine transformations over filter norms such that the transformed filter norms can rank filters globally across layers. Beyond the better empirical results shown in Sec. 4, the global ranking structure provides an efficient way to explore CNNs of different constraint levels, which can be done simply by thresholding out the bottom-ranked filters. We show empirically that our proposed method outperforms prior art in resource-constrained filter pruning. We justify our finding with extensive empirical analyses using ResNet and MobileNetV2 on CIFAR-10/100, Bird-200, and ImageNet datasets. The main contributions of this work are as follows:
We propose a novel parameterization, dubbed LeGR, for searching resource-constrained CNNs with filter pruning. LeGR is inspired by the theoretical analysis of the loss difference between the pre-trained model and the pruned-and-fine-tuned model.
We show empirically that LeGR outperforms prior art in resource-constrained filter pruning with CIFAR, Bird-200, and ImageNet datasets using VGG, ResNet, and MobileNetV2.
Recently, neural architecture search has been adopted for identifying neural networks that not only achieve good accuracy, but also are constrained by compute-related metrics, such as size or FLOPs. Within this domain there are various recent approaches dai2018chamnet ; cai2018proxylessnas ; dong2018dpp ; tan2018mnasnet ; hsu2018monas ; zhou2018resource ; stamoulis2019single ; stamoulis2018designing . However, most of them are much more expensive than top-down approaches such as quantization ding2019regularizing ; ding2019flightnns ; choi2018bridging and the filter pruning discussed in this paper.
We group prior art of filter pruning into two categories depending on whether it learns a set of non-trivial sparsity across layers. We note that the sparsity of a layer discussed in this work is defined as the percentage of the pruned filters in a layer compared to its pre-trained model.
Most literature that falls within this category focuses on proposing a metric to evaluate the importance of filters within a layer. For example, past work has used the $\ell_2$-norm of filter weights he2018soft , the $\ell_1$-norm of filter weights mao2017exploring ; li2016pruning , the error contribution toward the output of the current layer he2017channel ; luo2017thinet , and the variance of the max activation yoon2018filter . While the aforementioned work proposes insightful metrics to determine filter importance within a layer, the number of filters or channels to be pruned for each layer is either hand-crafted, which is not systematic and requires expert knowledge, or uniformly distributed across layers, which is sub-optimal.
In this category, there are two groups of work. The first group merges pruning with training using sparsity-inducing regularization in the loss function, while the second group determines which filters to prune based on the information extracted from the pre-trained model.
In the group of joint optimization, prior methods merge the training of a CNN with pruning by adding a sparsity-inducing regularization term to the loss function. In this fashion, the sparsity can be learned as a by-product of the optimization louizos2017bayesian ; dai2018compressing ; ye2018rethinking ; liu2017learning ; gordon2018morphnet . Because the regularizer controls the final sparsity, tuning its hyper-parameter to achieve a specific target constraint can be expensive. Moreover, we find that obtaining networks in the low-FLOPs regime using larger regularization has large variance.
The second group of methods proceeds from a pre-trained model. Lin et al. lin2018accelerating use first-order Taylor approximation to globally rank and prune filters. Zhuang et al. zhuang2018discrimination propose to use the impact of each channel on an auxiliary classification loss to evaluate the importance of a channel and prune gradually and greedily. Yang et al. yang2018ecc use ADMM to search for energy-constrained models. Yang et al. yang2018netadapt use filter norms to determine which filters to prune for each layer and determine the best layer to be pruned by evaluating all the layers. Molchanov et al. molchanov2016pruning propose to use the scaled first-order Taylor approximation to evaluate the importance of a filter and prune one filter at a time. While pruning iteratively and greedily produces non-trivial layer sparsity, it is often costly and sub-optimal. To counteract greedy solutions, He et al. he2018amc use filter norms to rank filters within each layer and use reinforcement learning to search for the layer-wise sparsity given a target constraint. However, their parameterization of the search space is based on heuristics. In this research, instead of searching for the layer-wise sparsity directly with heuristic parameterization, we search for the layer-wise affine transformations over filter norms, which is theoretically inspired, as we will illustrate in the following section.
In this section, we introduce the development of our proposed parameterization toward learning the sparsity for each layer. Specifically, we propose learned global ranking, or LeGR, which learns layer-wise affine transformations over filter norms such that the transformed filter norms can rank filters across layers. The overall pruning flow of LeGR is shown in Fig. 1. Given the learned affine transformations (the $\alpha$-$\kappa$ pair), the filter norms of a pre-trained model are transformed and compared across layers. With a user-defined constraint level, pruning is simply thresholding out the bottom-ranked filters until the constraint is satisfied. The pruned network is then fine-tuned to obtain the final pruned network. We note that the affine transformations for a network only need to be learned once and can be re-used many times for different user-defined constraint levels.
The rationale behind the proposed algorithm stems from minimizing a surrogate of a derived upper bound for the loss difference between (1) the pruned-and-fine-tuned CNN and (2) the pre-trained CNN.
To develop such a method, we treat filter pruning as an optimization problem with the objective of minimizing the loss difference between (1) the pruned-and-fine-tuned model and (2) the pre-trained model. Concretely, we would like to solve for the filter masking binary variables $\mathbf{z} \in \{0, 1\}^K$, with $K$ being the number of filters. If a filter $k$ is pruned, the corresponding mask will be zero ($z_k = 0$), otherwise it will be one ($z_k = 1$). Thus, we have the following optimization problem:

$$\min_{\mathbf{z}} \; L\Big(\Theta \odot \mathbf{z} - \eta \sum_{j=1}^{\tau} \Delta w^{(j)} \odot \mathbf{z}\Big) - L(\Theta) \quad \text{s.t.} \quad C(\mathbf{z}) \le \zeta \tag{1}$$

where $\Theta$ denotes all the filters of the CNN, and $L(\Theta) = \sum_{(x, y) \in D} \mathcal{L}(f(x \mid \Theta), y)$ denotes the loss function of filters, where $x$ and $y$ are the input and label, respectively. $D$ denotes the training data, $f$ is the CNN model, and $\mathcal{L}$ is the loss function for prediction (e.g., cross entropy loss). $\eta$ denotes the learning rate, $\tau$ denotes the number of gradient updates, $\Delta w^{(j)}$ denotes the gradient with respect to the filter weights computed at step $j$, and $\odot$ denotes element-wise multiplication. On the constraint side, $C(\cdot)$ is the modeling function for FLOPs and $\zeta$ is the desired FLOPs constraint. By fine-tuning, we mean updating the filter weights with stochastic gradient descent (SGD) for $\tau$ steps.

Let us assume the loss function $L$ is $\Omega_l$-Lipschitz continuous for the $l$-th layer of the CNN, then the following holds:
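$$L\Big(\Theta \odot \mathbf{z} - \eta \sum_{j=1}^{\tau} \Delta w^{(j)} \odot \mathbf{z}\Big) - L(\Theta) \;\le\; L(\Theta \odot \mathbf{z}) + \sum_{i=1}^{K} \eta \tau \Omega_{l(i)}^{2} z_i - L(\Theta) \;\le\; \sum_{i=1}^{K} \Omega_{l(i)} \|\Theta_i\| h_i + \sum_{i=1}^{K} \eta \tau \Omega_{l(i)}^{2} (1 - h_i) \tag{2}$$

where $l(i)$ is the layer index for the $i$-th filter, $\mathbf{h} = \mathbf{1} - \mathbf{z}$, and $\|\cdot\|$ denotes $\ell_2$ norms.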
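On the constraint side of equation (1), let $R(l(i))$ be the FLOPs of layer $l(i)$ where filter $i$ resides. Analytically, the FLOPs of a layer depend linearly on the number of filters in its preceding layer:

$$R(l(i)) = u_{l(i)} \big\| \mathbf{1} - \mathbf{h}_{P(l(i))} \big\|_0 \tag{3}$$

where $P(l(i))$ returns the set of filter indices for the layer that precedes layer $l(i)$ and $u_{l(i)}$ is a layer-dependent positive constant. Let $\bar{R}(l(i))$ denote the FLOPs of layer $l(i)$ for the pre-trained network ($\mathbf{h} = \mathbf{0}$); one can see from equation (3) that $R(l(i)) \le \bar{R}(l(i))$ for all $i$ and $\mathbf{h}$. Thus, the following holds:

$$C(\mathbf{1} - \mathbf{h}) = \sum_{i=1}^{K} R(l(i)) (1 - h_i) \;\le\; \sum_{i=1}^{K} \bar{R}(l(i)) (1 - h_i) \tag{4}$$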
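The linear dependence on the preceding layer's filter count, and the resulting upper bound, can be illustrated with a minimal sketch. It assumes a plain convolution whose cost is counted in multiply-accumulates; the function name `conv_flops` and the layer sizes are hypothetical and are not taken from the paper or its codebase.

```python
def conv_flops(in_filters: int, out_filters: int, kernel: int, out_hw: int) -> int:
    """Multiply-accumulate count of a standard convolution (assumed cost model)."""
    return in_filters * out_filters * kernel * kernel * out_hw * out_hw


# Hypothetical layer: 3x3 convolution producing a 32x32 feature map.
full = conv_flops(in_filters=64, out_filters=64, kernel=3, out_hw=32)

# Removing half of the filters in the preceding layer halves this layer's cost,
# so the cost of any pruned configuration never exceeds the pre-trained one.
half_inputs = conv_flops(in_filters=32, out_filters=64, kernel=3, out_hw=32)
```

Because the cost is a product of non-negative factors, lowering the number of surviving input filters can only lower the layer's cost, which is the property the bound relies on.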
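Based on equations (2) and (4), instead of minimizing equation (1), we minimize its upper bound in a Lagrangian form. That is,

$$\min_{\mathbf{h}} \; \sum_{i=1}^{K} \Big( \Omega_{l(i)} \|\Theta_i\| h_i + \eta \tau \Omega_{l(i)}^{2} (1 - h_i) \Big) + \lambda \Big( \sum_{i=1}^{K} \bar{R}(l(i)) (1 - h_i) - \zeta \Big)$$

which, after dropping the terms that do not depend on $\mathbf{h}$, is equivalent to

$$\min_{\mathbf{h}} \; \sum_{i=1}^{K} \big( \alpha_{l(i)} \|\Theta_i\| + \kappa_{l(i)} \big) h_i \tag{5}$$

where $\alpha_l = \Omega_l$ and $\kappa_l = -\eta \tau \Omega_l^{2} - \lambda \bar{R}(l)$. To guarantee that the solution satisfies the constraint, we rank all the filters by their scores $s_i = \alpha_{l(i)} \|\Theta_i\| + \kappa_{l(i)}$ and threshold out the bottom-ranked (small in scores) filters until the constraint $C(\mathbf{1} - \mathbf{h}) \le \zeta$ is satisfied, pruning no more filters than necessary. We term the process of thresholding until the constraint is met LeGR-Pruning.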
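The thresholding step described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: `legr_prune`, `flops_fn`, and `toy_flops` are hypothetical names, the FLOPs model is a toy two-layer product, and the guard that keeps at least one filter per layer is an added assumption.

```python
import numpy as np


def legr_prune(norms, layer_of, alpha, kappa, flops_fn, budget):
    """Remove the lowest-scored filters until flops_fn(mask) <= budget.

    norms[i]     : norm of filter i
    layer_of[i]  : layer index of filter i
    alpha, kappa : per-layer scale and shift of the affine transformation
    flops_fn     : hypothetical callable mapping a 0/1 keep-mask to a cost
    """
    norms = np.asarray(norms, dtype=float)
    layer_of = np.asarray(layer_of)
    # Affine transformation of the filter norms gives one global score per filter.
    scores = np.asarray(alpha)[layer_of] * norms + np.asarray(kappa)[layer_of]
    mask = np.ones(len(norms), dtype=int)
    for i in np.argsort(scores, kind="stable"):  # lowest score first
        if flops_fn(mask) <= budget:
            break
        # Assumption: never remove the last remaining filter of a layer.
        if mask[layer_of == layer_of[i]].sum() > 1:
            mask[i] = 0
    return mask


# Toy network: two layers with four filters each; cost = kept_0 * kept_1.
layer_of = [0, 0, 0, 0, 1, 1, 1, 1]
norms = [0.1, 0.9, 0.8, 0.7, 0.2, 0.3, 1.0, 1.1]


def toy_flops(mask):
    m = np.asarray(mask)
    return int(m[:4].sum() * m[4:].sum())


plain = legr_prune(norms, layer_of, [1.0, 1.0], [0.0, 0.0], toy_flops, budget=8)
shifted = legr_prune(norms, layer_of, [1.0, 1.0], [1.0, 0.0], toy_flops, budget=8)
```

In the second call, the positive shift for layer 0 lifts all of its filters above the small-norm filters of layer 1, so the same budget is met by pruning layer 1 only. This is how a layer-wise shift can change the layer-wise sparsity without changing the within-layer ordering.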
Since we do not know the layer-wise Lipschitz constants a priori, we treat $\alpha$ and $\kappa$ as latent variables to be estimated. We assume that the more accurate the estimates are, the better the network obtained by LeGR-Pruning performs on the original objective, i.e., equation (1).

Specifically, to estimate the $\alpha$-$\kappa$ pair, we use the regularized evolutionary algorithm proposed in real2018regularized for its effectiveness in the neural architecture search problem. We can treat each $\alpha$-$\kappa$ pair as a network architecture, which can be obtained by the flow introduced in Fig. 1. Once a pruned architecture is obtained, we fine-tune the resulting architecture for $\hat{\tau}$ gradient steps and use its accuracy on the validation set as the fitness for the corresponding $\alpha$-$\kappa$ pair. We note that we use $\hat{\tau}$ instead of $\tau$ as an approximation, and we empirically find that $\hat{\tau} = 200$ (200 gradient updates) works well under the pruning setting across the datasets and networks we study.
Hence, in our regularized evolutionary algorithm, as shown in Algorithm 1, we first generate a pool of candidates ($\alpha$ and $\kappa$ values) and record the fitness for each candidate, and then repeat the following steps: (i) sample a subset from the candidates, (ii) identify the fittest candidate in the subset, (iii) generate a new candidate by mutating the fittest candidate and measure its fitness accordingly, and (iv) replace the oldest candidate in the pool with the generated one. To mutate the fittest candidate, we randomly select a subset of the layers and conduct one step of random walk from their current values.
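The four steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than the paper's Algorithm 1: the initial sampling ranges, the Gaussian form of the random-walk step, the default sizes, and the choice to return the best candidate ever evaluated are all assumptions, and `fitness_fn` stands in for the prune, short fine-tune, and validate pipeline.

```python
import random
from collections import deque


def regularized_evolution(num_layers, fitness_fn, pool_size=8, iterations=24,
                          sample_size=4, mutation_ratio=0.1, step=1.0, seed=0):
    """Search per-layer (alpha, kappa) values with aging evolution."""
    rng = random.Random(seed)

    def random_candidate():
        # Initial ranges are assumptions made for this sketch.
        alpha = [rng.uniform(0.5, 1.5) for _ in range(num_layers)]
        kappa = [rng.uniform(-0.5, 0.5) for _ in range(num_layers)]
        return alpha, kappa

    def mutate(candidate):
        alpha, kappa = list(candidate[0]), list(candidate[1])
        num_mutated = max(1, round(mutation_ratio * num_layers))
        for layer in rng.sample(range(num_layers), num_mutated):
            # One random-walk step; the Gaussian form is an assumption.
            alpha[layer] += rng.gauss(0.0, step)
            kappa[layer] += rng.gauss(0.0, step)
        return alpha, kappa

    pool = deque()  # ordered from oldest to newest
    for _ in range(pool_size):
        candidate = random_candidate()
        pool.append((candidate, fitness_fn(candidate)))
    best = max(pool, key=lambda entry: entry[1])

    for _ in range(iterations):
        sample = rng.sample(list(pool), sample_size)          # (i) sample a subset
        parent = max(sample, key=lambda entry: entry[1])[0]   # (ii) fittest in subset
        child = mutate(parent)                                # (iii) mutate and evaluate
        entry = (child, fitness_fn(child))
        pool.popleft()                                        # (iv) drop the oldest
        pool.append(entry)
        best = max(best, entry, key=lambda e: e[1])
    return best


def toy_fitness(candidate):
    # Stand-in for validation accuracy: peaks at alpha = 1 and kappa = 0.
    alpha, kappa = candidate
    return -sum((a - 1.0) ** 2 for a in alpha) - sum(k ** 2 for k in kappa)


best_candidate, best_fitness = regularized_evolution(num_layers=5, fitness_fn=toy_fitness)
```

Replacing the oldest candidate rather than the worst one is what distinguishes this scheme from plain tournament selection: every candidate eventually leaves the pool, so a single lucky evaluation cannot dominate the search indefinitely.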
Our work is evaluated on various image classification benchmarks including CIFAR-10/100 krizhevsky2009learning , ImageNet russakovsky2015imagenet , and Birds-200 wah2011caltech . CIFAR-10/100 consist of 50k training images and 10k testing images with a total of 10/100 classes to be classified. ImageNet is a large-scale image classification dataset that includes 1.2 million training images and 50k testing images with 1k classes to be classified. In addition, we benchmark the proposed algorithm in a transfer learning setting since, in practice, we want a small and fast model on some target dataset. Specifically, we use the Birds-200 dataset, which consists of 6k training images and 5.7k testing images covering 200 species of bird.
For Bird-200, we use 10% of the training data as the validation set used for early stopping and to avoid over-fitting. The training scheme for CIFAR-10/100 follows he2018soft , which uses stochastic gradient descent with Nesterov momentum nesterov1983method , weight decay, batch size 128, and an initial learning rate that is decreased by 5x at epochs 60, 120, and 160, training for 200 epochs in total. For control experiments with CIFAR-100 and Bird-200, the fine-tuning setting after pruning is as follows: we keep all training hyper-parameters the same, but change the initial learning rate and train for 60 epochs. We drop the learning rate by 10x at proportionally similar times, i.e., epochs 18, 36, and 48. To compare numbers with prior art on CIFAR-10 and ImageNet, we follow the number of iterations in zhuang2018discrimination . Specifically, for CIFAR-10 we fine-tune for 400 epochs, with the initial learning rate dropped by 5x at epochs 120, 240, and 320. For ImageNet, we use pre-trained models and fine-tune the pruned models for 60 epochs, with the initial learning rate dropped by 10x at epochs 30 and 45.

For the hyper-parameters of LeGR, we select $\hat{\tau} = 200$, meaning that we fine-tune for 200 gradient steps before measuring the validation accuracy when searching for the $\alpha$-$\kappa$ pair. We note that we do the same for AMC he2018amc for a fair comparison. Moreover, we set the number of architectures explored to be the same as AMC, i.e., 400. These 400 searched architectures are divided between the initial pool and the subsequent search iterations. The sample size of the evolutionary algorithm follows real2018regularized . The exploration parameter is set to 1 with linear decrease, and the mutation ratio is set to 0.1, so that 10% of the layers are sampled for mutation. In the following experiments, we use the smallest $\zeta$ considered to search for the latent variables $\alpha$ and $\kappa$, and use the found $\alpha$-$\kappa$ pair to obtain the pruned networks at various constraint levels. For example, for ResNet-56 with CIFAR-100 (Fig. 4), we search for the $\alpha$-$\kappa$ pair at the smallest constraint level and use the same pair to obtain all seven networks with the flow described in Fig. 1.
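The step-wise learning-rate schedule used for CIFAR-10/100 training above can be written as a small helper. This is a sketch only: `step_lr` is a hypothetical name, the base learning rate is left as a parameter because its value is not restated here, and applying each drop at the start of the milestone epoch is an assumption.

```python
def step_lr(epoch: int, base_lr: float, milestones=(60, 120, 160), factor=5.0) -> float:
    """Learning rate at `epoch`: divide by `factor` once per milestone already reached."""
    reached = sum(epoch >= m for m in milestones)
    return base_lr / (factor ** reached)


# Fine-tuning variant described in the text: 10x drops at epochs 18, 36, and 48.
def finetune_lr(epoch: int, base_lr: float) -> float:
    return step_lr(epoch, base_lr, milestones=(18, 36, 48), factor=10.0)
```

The fine-tuning milestones are the training milestones scaled by 60/200, which is what "proportionally similar times" refers to.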
We first compare the parameterizations of LeGR and AMC using the same solver (DDPG lillicrap2015continuous ). We use DDPG in a sequential fashion that follows he2018amc , while our state space and action space are different. LeGR requires two continuous actions (i.e., $\alpha_l$ and $\kappa_l$) for layer $l$, while AMC needs only one action (i.e., sparsity). We conduct the comparison by pruning ResNet-56 to 50% of its original FLOPs targeting CIFAR-100, with hyper-parameters that follow he2018amc . As shown in Fig. 2, while both LeGR and AMC outperform random search, LeGR converges faster to a better solution. Moreover, the solutions explored by LeGR are always tight to the user-defined constraint, while AMC spends time exploring solutions that are not tight. Since there is a trade-off between accuracy and FLOPs for filter pruning, it is better to explore more architectures that are tight to the constraint.
We compare LeGR with resource-constrained filter pruning methods, i.e., MorphNet gordon2018morphnet , AMC he2018amc , and a baseline that prunes filters uniformly across layers, using ResNet-56 and MobileNetV2. As shown in Fig. 4, we find that all of the approaches outperform the uniform baseline in the high-FLOPs regime. However, both AMC and MorphNet have higher variances when pruned more aggressively. In both CNNs, LeGR outperforms the prior art, especially in the low-FLOPs regime.
On the efficiency side, we measure the average time an algorithm takes to generate a data point for ResNet-56 in Fig. 4 using our hardware (i.e., NVIDIA GTX 1080 Ti). Figure 4 shows the efficiency of AMC, MorphNet, and the proposed LeGR. The learning cost can be dissected into two parts: (1) pruning: the time it takes to search for a constraint-satisfying network and (2) fine-tuning: the time it takes to fine-tune the weights of a pruned network. For MorphNet, we consider three trials for each constraint level to find an appropriate hyper-parameter to satisfy the constraint. The numbers are normalized to the pruning time of AMC. In terms of pruning time, LeGR is 7x and 5x faster than AMC and MorphNet, respectively. Considering the total learning time, LeGR is 3x and 2x faster than AMC and MorphNet, respectively. The efficiency comes from the fact that LeGR searches for the $\alpha$-$\kappa$ pair only once and reuses it for seven constraint levels. In contrast, both AMC and MorphNet have to search for constraint-satisfying networks for every new constraint level.
In Table 1, we also compare LeGR with the prior art that reports results on CIFAR-10. First, for ResNet-56, we find that LeGR outperforms most of the prior art in both FLOPs and accuracy dimensions and performs similarly to he2018soft ; zhuang2018discrimination in accuracy but with fewer FLOPs. For VGG-13, LeGR achieves significantly better results compared to the prior art.
For ImageNet, we prune ResNet-50 and MobileNetV2 with LeGR to compare with prior art. As shown in Table 2, LeGR is superior to prior art that reports on ResNet-50. Specifically, when pruning to 73% of FLOPs, LeGR achieves even higher accuracy compared to the pre-trained model. When pruned to 58% and 47% our algorithm achieves better results compared to prior art. Similarly, for MobileNetV2, LeGR achieves superior results compared to prior art.
We analyze how LeGR performs in a transfer learning setting where we have a model pre-trained on a large dataset, i.e., ImageNet, and we want to transfer its knowledge to adapt to a smaller dataset, i.e., Bird-200. We prune the fine-tuned network on the target dataset directly instead of pruning on the large dataset before transferring for two reasons: (1) the user only cares about the performance of the network on the target dataset instead of the source dataset, which means we need the accuracy and FLOPs trade-off curve on the target dataset and (2) pruning on a smaller dataset is much more efficient compared to pruning on a large dataset. We note that directly pruning on the target dataset has been adopted in prior art as well zhong2018target ; luo2017thinet . Also, to avoid over-fitting, we use 10% of the training data as a validation set to pick the best model for testing. We first obtain a fine-tuned MobileNetV2 and ResNet-50 on the Bird-200 dataset with top-1 accuracy 80.2% and 79.5%, respectively. The numbers are comparable to the reported numbers for ResNet-101 li2018delta , VGG-16, and DenseNet-121 Mallya_2018_CVPR from prior art. As shown in Fig. 5, we find that LeGR outperforms Uniform and AMC, which is consistent with previous analyses. Moreover, it is interesting that MobileNetV2, a more compact model, outperforms ResNet-50 in both accuracy and FLOPs dimensions under this setting.
In this work, we propose LeGR, a novel parameterization for searching constraint-satisfying CNNs with filter pruning. The rationale behind LeGR stems from minimizing the upper bound of the loss difference between the pre-trained model and the pruned-and-fine-tuned model. Our empirical results show that LeGR outperforms prior art in resource-constrained filter pruning, especially in the low-FLOPs regime. Additionally, LeGR can be 7x and 5x faster than AMC and MorphNet, respectively, in searching for constraint-satisfying CNNs when multiple constraint levels are considered. We verify the effectiveness of LeGR using two kinds of CNNs, ResNet and MobileNetV2, on various datasets such as CIFAR, Bird-200, and ImageNet.
LeGR is a method that uses norm-based global ranking to conduct pruning, which is related to prior art [25] that uses first-order Taylor approximation to globally rank the filters. We specifically compare the norm of filter weights and the first-order Taylor approximation to the loss increase caused by pruning (no fine-tuning considered [3, 25]) as the global ranking metric. Due to the large amount of filter removal, the Taylor approximation could be erroneous. In Fig. 6, we plot the accuracy after pruning and before fine-tuning. It is clear that ranking filters by their norms consistently outperforms using the first-order Taylor expansion.
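The two ranking metrics compared above can be sketched for a single convolution layer as follows. This is an illustration under assumptions: the weight-times-gradient form of the first-order Taylor score is one common formulation and may differ from the exact variant used in the cited work, the $\ell_2$ norm is assumed for the norm-based score, and the function names are hypothetical.

```python
import numpy as np


def l2_norm_scores(weight):
    """weight: (out_filters, in_channels, k, k) -> one l2 norm per filter."""
    w = weight.reshape(weight.shape[0], -1)
    return np.sqrt((w ** 2).sum(axis=1))


def taylor_scores(weight, grad):
    """First-order estimate of the loss change when a filter is zeroed: |sum(w * g)|."""
    w = weight.reshape(weight.shape[0], -1)
    g = grad.reshape(grad.shape[0], -1)
    return np.abs((w * g).sum(axis=1))


# Tiny example: two filters, the second with twice the weights of the first.
weight = np.ones((2, 1, 1, 2))
weight[1] *= 2.0
grad = np.full_like(weight, 0.5)  # stand-in for gradients from a backward pass

norm_rank = np.argsort(l2_norm_scores(weight))
taylor_rank = np.argsort(taylor_scores(weight, grad))
```

The Taylor score is a local linear estimate around the current weights, so its accuracy degrades as more filters are removed at once, which matches the concern raised in the text about large amounts of filter removal.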
So far, we have discussed filter pruning by assuming we have a pre-trained model to begin with. To study the impact of the pre-trained model, we train the pre-trained model for varying numbers of epochs and analyze how this affects the accuracy of the LeGR-pruned model. Specifically, we fix the hyper-parameters used for training the pre-trained model and conduct early stopping. As shown in Fig. 8, we find that the pre-trained model does not need to converge to be pruned. The pruned model can achieve 68.4% top-1 accuracy when using the pre-trained model that is only trained for half of the training time (i.e., 100 epochs). In comparison, the pruned model that stems from the fully-trained model achieves 68.2% top-1 accuracy. Interestingly, the pre-trained models at epochs 100 and 190 have an 8.6% accuracy difference (63.3 vs. 71.9). This empirical evidence suggests that one might be able to obtain effective pruned models without paying the cost of fully training the original model.
Since we use $\hat{\tau}$ to approximate $\tau$ when searching for the $\alpha$-$\kappa$ pair, it is expected that the closer $\hat{\tau}$ is to $\tau$, the better the $\alpha$-$\kappa$ pair LeGR can find. In this subsection, we are interested in how $\hat{\tau}$ affects the performance of LeGR. Specifically, we use LeGR to prune ResNet-56 for CIFAR-100 and solve for the latent variables at three constraint levels, experimenting with several values of $\hat{\tau}$. We note that once the $\alpha$-$\kappa$ pair is found, we use LeGR-Pruning to obtain the resource-constrained CNN and fine-tune it for $\tau$ steps. As shown in Fig. 8, the results align with our intuition that there are diminishing returns in increasing $\hat{\tau}$. With the considered network and dataset, fine-tuning for 50 SGD steps is good enough to guide the learning of $\alpha$ and $\kappa$.