Over the last decade, the popularity of Deep Neural Networks (DNNs) has grown rapidly as their results improved, and they are now used in a variety of applications such as classification and detection. However, these improvements often come with increasing model complexity and therefore require more computational resources. Hence, even though hardware performance improves quickly, it is still hard to deploy models directly on targeted edge devices or smartphones. To this end, various approaches have been proposed to make heavy models more compact, based on different compression methods such as knowledge distillation [PolinoPA18, GuoWWYLHL20], pruning [l1norm, hrank, ABCPruner, LRP_pruning], quantization [QuZCT20], and neural architecture search (NAS) [hournas]. Among these categories, network pruning, which removes redundant and unimportant connections, is one of the most popular and promising compression methods, and it has recently received great interest from industry players that seek to compress their AI models and fit them on small, resource-constrained target devices. Indeed, running models on-device instead of relying on cloud computing brings numerous advantages such as reduced costs and energy consumption, increased speed, and data privacy.
As manually defining what percentage of each layer should be pruned is a time-consuming process that requires human expertise, recent works have proposed methods that automatically prune the redundant filters throughout the network to meet a given constraint such as the number of parameters, FLOPs, or hardware platform [liunetworkslimming, you2019gate, Li_2021_CVPR, importance_estimation, ABCPruner, LRP_pruning, nisp, Info_flow18, Info_flow21_PAMI]. To automatically find the best pruned architecture, these methods rely on various metrics such as 2nd-order Taylor expansions [importance_estimation] or the layer-wise relevance propagation score [LRP_pruning]. For further details, see Sec. 2. Although these strategies have improved over time, they usually do not explicitly aim to preserve the model accuracy, or they do so in a computationally expensive way.
In this paper, we make the hypothesis that the pruned architecture that can lead to the best accuracy after finetuning is the one that most efficiently preserves the accuracy during the pruning process (see Sec. 4.5.1). We therefore introduce an automatic pruning method, called AutoBot, that uses trainable bottlenecks to efficiently preserve the model accuracy while minimizing the FLOPs, as shown in Fig. 1. These bottlenecks only require one single epoch of training with 25.6% (CIFAR-10) or 7.49% (ILSVRC2012) of the dataset to efficiently learn which filters to prune. We compare AutoBot with various pruning methods, and show a significant improvement of the pruned models before finetuning, leading to a SOTA accuracy after finetuning. We also perform a practical deployment test on several edge devices to demonstrate the speed improvement of the pruned models.
To summarize, our contributions are as follows:
We introduce AutoBot, a novel automatic pruning method that uses a trainable bottleneck to efficiently learn which filters to prune in order to maximize the accuracy while minimizing the FLOPs of the model. This method can be implemented easily and intuitively regardless of the dataset or model architecture.
We demonstrate that preserving the accuracy during the pruning process has a strong impact on the accuracy of the finetuned model (Sec. 4.5.1).
Extensive experiments show that AutoBot can efficiently preserve the accuracy after pruning (before finetuning), and outperforms all existing pruning methods once finetuned.
2 Related Works
Pruning redundant filters is a common CNN compression solution, as it is an intuitive method that proved its efficiency multiple times in the past [l1norm, hrank, LRP_pruning]. In this section, we summarize some related works compared to our proposed method.
Traditionally, magnitude-based pruning aims to exploit the inherent characteristics of the network to define a pruning criterion, without modifying the network parameters. Popular criteria include the $\ell_p$-norm [l1norm, Minbao2021DCFF, MengCLLGLS20, Li2020CVPR, adaptive1], Taylor expansion [importance_estimation], gradient [LiuWgradient19], rank of the feature maps [hrank], sparsity of output feature maps [apoz], and geometric median [GM]. Recently, Tang et al. [SCOP] proposed a scientific control pruning method, called SCOP, which introduces knockoff features as the control group. In contrast, adaptive pruning needs to retrain the network from scratch with a modified training loss or architecture that adds new constraints. Several works [liunetworkslimming, luo2017thinet, YeRethinking18] add trainable parameters to each feature map channel to obtain data-driven channel sparsity, enabling the model to automatically identify redundant filters. Luo et al. [luo2017thinet] introduce ThiNet, which formally establishes filter pruning as an optimization problem and prunes filters based on statistical information computed from the next layer rather than the current one. Lin et al. [GAL] propose a structured pruning method that jointly prunes filters and other structures by introducing a soft mask with sparsity regularization. However, retraining the model from scratch is a time- and resource-consuming process that does not significantly improve the accuracy compared to magnitude-based pruning. Although these two pruning strategies are intuitive, the pruning ratio must be manually defined layer by layer, which is a time-consuming process that requires human expertise. Instead, in this paper, we focus on automatic pruning.
As suggested by the name, automatic network pruning prunes the redundant filters throughout the network automatically under a given constraint such as the number of parameters, FLOPs, or hardware platform. In this respect, a large number of automatic pruning methods have been proposed. Liu et al. [liunetworkslimming] optimize the scaling factor in the batch-norm layer as a channel selection indicator to decide which channels are unimportant. You et al. [you2019gate] propose an automatic pruning method, called Gate Decorator, which transforms CNN modules by multiplying their output by channel-wise scaling factors, and adopt an iterative pruning framework called Tick-Tock to boost pruning accuracy. Li et al. [Li_2021_CVPR] propose a collaborative compression method that combines channel pruning and tensor decomposition. Molchanov et al. [importance_estimation] estimate the contribution of a filter to the final loss using 2nd-order Taylor expansions and iteratively remove the filters with the smallest scores. Lin et al. [ABCPruner] propose ABCPruner to find the optimal pruned structure automatically by updating the structure set and recalculating the fitness. Back-propagation methods [LRP_pruning, nisp] compute the relevance score of each filter by following the information flow from the model output. Dai et al. [Info_flow18] and Zheng et al. [Info_flow21_PAMI] adopt information theory to preserve the information between the hidden representation and the input or output.
However, most existing methods are expensive in both computation and time because they either require retraining the model from scratch [liunetworkslimming], apply iterative pruning [you2019gate, importance_estimation, LRP_pruning, nisp, Li_2021_CVPR], or finetune the model while pruning [ABCPruner, Info_flow18]. When the model is not retrained or finetuned during the pruning process, these methods generally do not preserve the model accuracy after pruning [Info_flow21_PAMI, LRP_pruning, nisp], and thus require finetuning for a large number of epochs. In contrast to other automatic pruning methods, AutoBot stands out by its ability to efficiently preserve the accuracy of the model during the pruning process, in a simple and intuitive way.
Motivated by several bottleneck approaches [info_bottleneck, deep_info_bottleneck, schulz2020iba], our method can efficiently control the information flow throughout the pretrained network using Trainable Bottlenecks that are injected into the model. The objective function of the trainable bottleneck is to maximize the information flow from input to output while minimizing the loss by adjusting the amount of information in a model under the given constraints. Note that during the training procedure, only the parameters of the trainable bottlenecks are updated while all the pretrained parameters of the model are frozen.
Compared to other pruning methods inspired by the information bottleneck [Info_flow18, Info_flow21_PAMI], we do not consider the compression of mutual information between the input/output and the hidden representations in order to evaluate the information flow. Such methods are orthogonal to AutoBot, which explicitly quantifies how much information is passed to the next layer during the forward pass. Furthermore, we optimize the trainable bottlenecks on a fraction of a single epoch only. The AutoBot pruning process is summarized in Alg. 1.
3.1 Trainable Bottleneck
We formally define the concept of trainable bottleneck as a module that can restrict the information flow throughout the network during the forward pass, using trainable parameters. Mathematically, it can be formulated as:
$$X_{i+1} = B(\theta_i, X_i) \tag{1}$$
where $B$ stands for the trainable bottleneck, $\theta_i$ denotes the bottleneck parameters of the $i$-th module, and $X_i$ and $X_{i+1}$ denote the input and output feature maps of the bottleneck at the $i$-th module, respectively. For instance, Schulz et al. [schulz2020iba] control the amount of information into the model by injecting noise into it. In this case, $B$ is expressed as $B(\theta_i, X_i) = \theta_i X_i + (1 - \theta_i)\epsilon$, where $\epsilon$ denotes the noise.
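Inspired by the information bottleneck concept [info_bottleneck, deep_info_bottleneck], we formulate a general bottleneck that is not limited to information theory but can be optimized to satisfy any constraint, as follows:
$$\min_{\Theta} \; \mathcal{L}_{CE}\big(\mathcal{Y}, f(\mathcal{X}, \Theta)\big) \quad \text{s.t.} \quad r(\Theta) \leq \mathcal{C} \tag{2}$$
where $\mathcal{L}_{CE}$ stands for the cross-entropy loss, $\mathcal{X}$ and $\mathcal{Y}$ stand for the model input and output, $f$ is the model equipped with the bottlenecks, $\Theta$ is the set of the bottleneck parameters ($\theta_i \in \Theta$) in the model, $r$ is a constraint function, and $\mathcal{C}$ is the desired constraint.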
3.2 Pruning Strategy
In this work, we propose trainable bottlenecks for automatic network pruning. To this end, we inject a bottleneck into each convolution block throughout the network, so that the information flow of the model to be pruned is quantified layer by layer through the constrained trainable parameters.
Compared to previous works, our bottleneck function (Eq. 1) does not use noise to control the information flow:
$$B(\theta_i, X_i) = \theta_i X_i \tag{3}$$
where $\theta_i \in [0, 1]$. Therefore, the range of $X_{i+1}$ changes from $[\epsilon, X_i]$ (with $\epsilon$ the injected noise) to $[0, X_i]$. For pruning, this is more relevant since replacing a module input by zeros is equivalent to pruning the module (i.e. pruning the corresponding output of the previous module).
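To make the equivalence between a closed gate and a pruned channel concrete, the following minimal Python sketch may help. It assumes a toy 1×1 convolution over plain lists; the helper names `apply_bottleneck` and `next_layer_response` and all numeric values are illustrative assumptions, not part of the original method.

```python
def apply_bottleneck(theta, feature_map):
    """Scale each channel of feature_map by its gate in theta (no noise term)."""
    if len(theta) != len(feature_map):
        raise ValueError("one gate per channel is required")
    return [[g * v for v in channel] for g, channel in zip(theta, feature_map)]


def next_layer_response(weights, feature_map):
    """Toy 1x1 convolution: weighted sum over channels at each position."""
    positions = len(feature_map[0]) if feature_map else 0
    return [sum(w * ch[p] for w, ch in zip(weights, feature_map))
            for p in range(positions)]


# Three channels with two spatial positions each (hypothetical values).
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w = [0.5, -1.0, 2.0]
theta = [1.0, 0.0, 1.0]  # the gate of channel 1 is closed

# Response of the next layer when channel 1 is zeroed by the bottleneck.
gated = next_layer_response(w, apply_bottleneck(theta, x))

# Response when channel 1 and its weight are physically removed.
pruned = next_layer_response([w[0], w[2]], [x[0], x[2]])
```

Both `gated` and `pruned` hold the same values, which is why a gate driven to zero can later be replaced by an actual removal of the filter.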
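Following the general objective function of the trainable bottleneck (Eq. 2), we introduce two regularizers $g$ and $h$ to obtain the following function:
$$\min_{\Theta} \; \mathcal{L}_{CE}\big(\mathcal{Y}, f(\mathcal{X}, \Theta)\big) \quad \text{s.t.} \quad g(\Theta) = \mathcal{T}_F \;\text{ and }\; h(\Theta) = 0 \tag{4}$$
where $\mathcal{T}_F$ is the target FLOPs (manually fixed). As explained in more detail in the next paragraphs, the role of $g$ is to enforce the constraint $\mathcal{T}_F$ on the pruned architecture, while $h$ makes the parameters $\Theta$ converge toward binary values (0 or 1).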
As an evaluation metric, FLOPs is closely linked with inference time. Therefore, when it comes to running a neural network on a device with limited computational resources, pruning to efficiently reduce the FLOPs is a common solution. Our work follows this rule, as it can produce pruned models of any size by constraining the FLOPs according to the targeted devices. Formally, given a neural network consisting of multiple convolutional blocks, we enforce the following condition:
$$g(\Theta) = \sum_{i=1}^{L} \sum_{j=1}^{J_i} g_i^j(\theta_i) = \mathcal{T}_F \tag{5}$$
where $\theta_i$ is the vector of the parameters of the information bottleneck following the $i$-th convolution block, $g_i^j$ is the function that computes the FLOPs of the $j$-th module of the $i$-th convolution block weighted by $\theta_i$, $L$ is the total number of convolution blocks in the model and $J_i$ is the total number of modules in the $i$-th convolution block. For instance, assume that $g_i^j$ is for a convolutional module without bias and padding. Then it can be simply expressed as:
$$g_i^j(\theta_i) = \mathrm{sum}(\theta_{i-1}) \times \mathrm{sum}(\theta_i) \times h_o \times w_o \times k \times k \tag{6}$$
where $\mathrm{sum}(\cdot)$ adds up the elements of a vector, $h_o$ and $w_o$ are the height and width of the output feature map of the convolution, and $k$ is its kernel size. Notice that within the $i$-th convolution block, all modules share $\theta_i$. That is, at a block level all the modules belonging to the same convolution block are pruned together.
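As a rough illustration of how a gate-weighted FLOPs count behaves, the sketch below computes the cost of one bias-free convolution. It assumes the standard count (input channels × output channels × output height × output width × kernel area) and assumes that the channel counts are replaced by the sums of the gates on the input and output side; the function name `weighted_conv_flops` and the layer sizes are hypothetical.

```python
def weighted_conv_flops(theta_in, theta_out, out_h, out_w, kernel):
    """Soft FLOPs of one bias-free convolution.

    Channel counts are replaced by the sums of the gates on the input and
    output side (an assumption of this sketch), so binary gates recover the
    usual C_in * C_out * h * w * k * k count.
    """
    return sum(theta_in) * sum(theta_out) * out_h * out_w * kernel * kernel


# Hypothetical 3x3 convolution: 3 input channels, 64 filters, 32x32 output.
full = weighted_conv_flops([1.0] * 3, [1.0] * 64, 32, 32, 3)

# Closing half of the output gates halves the estimated cost.
half = weighted_conv_flops([1.0] * 3, [1.0] * 32 + [0.0] * 32, 32, 32, 3)
```

Because the estimate is a product of sums, it stays differentiable with respect to every gate, which is what allows the FLOPs target to act as a training signal.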
A key issue for pruning is that finding redundant filters is a discrete problem, i.e. each filter should either be pruned or kept. In our case, this problem manifests itself in the fact that $\Theta$ cannot be binary, as the optimization problem would then be non-differentiable and back-propagation would not work. To tackle this issue, we force the continuous parameters $\Theta$ to converge toward a binary solution that indicates the presence (= 1) or absence (= 0) of a filter. This is the role of the constraint $h$:
$$h(\Theta) = \sum_{n=1}^{N} \left| \vartheta_n - \mathrm{round}(\vartheta_n) \right| \tag{7}$$
where the sum runs over all $N$ scalar bottleneck parameters $\vartheta_n$ in $\Theta$.
|Method||Automatic||Top-1 Acc. before finetuning||Top-1 Acc.||Acc. change||FLOPs (Pruning Ratio)||Params (Pruning Ratio)|
|VGG-16 [vgg]||–||–||93.96%||0.0%||314.29M (0.0%)||14.99M (0.0%)|
|L1 [l1norm]||✗||88.70%*||93.40%||-0.56%||206.00M (34.5%)||5.40M (64.0%)|
|CC-0.4 [Li_2021_CVPR]||✓||–||94.15%||+0.19%||154.00M (51.0%)||5.02M (66.5%)|
|AutoBot (Ours)||✓||88.29%||94.19%||+0.23%||145.61M (53.7%)||7.53M (49.8%)|
|CC-0.5 [Li_2021_CVPR]||✓||–||94.09%||+0.13%||123.00M (60.9%)||5.02M (73.2%)|
|HRank-65 [hrank]||✗||10.00%**||92.34%||-1.62%||108.61M (65.4%)||2.64M (82.4%)|
|AutoBot (Ours)||✓||82.73%||94.01%||+0.05%||108.71M (65.4%)||6.44M (57.0%)|
|ITPruner [Info_flow21_PAMI]||✓||–||94.00%||+0.04%||98.80M (68.6%)||–|
|ABCPruner [ABCPruner]||✓||–||93.08%||-0.88%||82.81M (73.7%)||1.67M (88.9%)|
|DCFF [Minbao2021DCFF]||✗||–||93.49%||-0.47%||72.77M (76.8%)||1.06M (92.9%)|
|AutoBot (Ours)||✓||71.24%||93.62%||-0.34%||72.60M (76.9%)||5.51M (63.24%)|
|VIBNet [Info_flow18]||✓||–||91.50%||-2.46%||70.63M (77.5%)||– (94.7%)|
|ResNet-56 [resNet]||–||–||93.27%||0.0%||126.55M (0.0%)||0.85M (0.0%)|
|L1 [l1norm]||✗||–||93.06%||-0.21%||90.90M (28.2%)||0.73M (14.1%)|
|HRank-50 [hrank]||✗||10.78%**||93.17%||-0.10%||62.72M (50.4%)||0.49M (42.4%)|
|SCP [KangH20]||✗||–||93.23%||-0.04%||61.89M (51.1%)||0.44M (48.2%)|
|CC [Li_2021_CVPR]||✓||–||93.64%||+0.37%||60.00M (52.6%)||0.44M (48.2%)|
|ITPruner [Info_flow21_PAMI]||✓||–||93.43%||+0.16%||59.50M (53.0%)||–|
|FPGM [GM]||✗||–||93.26%||-0.01%||59.40M (53.0%)||–|
|LFPC [Cpruning_variousCriteria]||✗||–||93.24%||-0.03%||59.10M (53.3%)||–|
|ABCPruner [ABCPruner]||✓||–||93.23%||-0.04%||58.54M (53.7%)||0.39M (54.1%)|
|DCFF [Minbao2021DCFF]||✗||–||93.26%||-0.01%||55.84M (55.9%)||0.38M (55.3%)|
|AutoBot (Ours)||✓||85.58%||93.76%||+0.49%||55.82M (55.9%)||0.46M (45.9%)|
|SCOP [SCOP]||✗||–||93.64%||+0.37%||– (56.0%)||– (56.3%)|
|ResNet-110 [resNet]||–||–||93.50%||0.0%||254.98M (0.0%)||1.73M (0.0%)|
|L1 [l1norm]||✗||–||93.30%||-0.20%||155.00M (39.2%)||1.16M (32.9%)|
|HRank-58 [hrank]||✗||–||93.36%||-0.14%||105.70M (58.5%)||0.70M (59.5%)|
|LFPC [Cpruning_variousCriteria]||✗||–||93.07%||-0.43%||101.00M (60.3%)||–|
|ABCPruner [ABCPruner]||✓||–||93.58%||+0.08%||89.87M (64.8%)||0.56M (67.6%)|
|DCFF [Minbao2021DCFF]||✗||–||93.80%||+0.30%||85.30M (66.5%)||0.56M (67.6%)|
|AutoBot (Ours)||✓||84.37%||94.15%||+0.65%||85.28M (66.6%)||0.70M (59.5%)|
|GoogLeNet [googlenet]||–||–||95.05%||0.0%||1.53B (0.0%)||6.17M (0.0%)|
|L1 [l1norm]||✗||–||94.54%||-0.51%||1.02B (33.3%)||3.51M (43.1%)|
|Random||✗||–||94.54%||-0.51%||0.96B (37.3%)||3.58M (42.0%)|
|HRank-54 [hrank]||✗||–||94.53%||-0.52%||0.69B (54.9%)||2.74M (55.6%)|
|CC [Li_2021_CVPR]||✓||–||94.88%||-0.17%||0.61B (60.1%)||2.26M (63.4%)|
|ABCPruner [ABCPruner]||✓||–||94.84%||-0.21%||0.51B (66.7%)||2.46M (60.1%)|
|DCFF [Minbao2021DCFF]||✗||–||94.92%||-0.13%||0.46B (69.9%)||2.08M (66.3%)|
|HRank-70 [hrank]||✗||10.00%**||94.07%||-0.98%||0.45B (70.6%)||1.86M (69.9%)|
|AutoBot (Ours)||✓||90.18%||95.23%||+0.16%||0.45B (70.6%)||1.66M (73.1%)|
|DenseNet-40 [densenet]||–||–||94.81%||0.0%||287.71M (0.0%)||1.06M (0.0%)|
|Network Slimming [liunetworkslimming]||✓||–||94.81%||-0.00%||190.00M (34.0%)||0.66M (37.7%)|
|GAL-0.01 [GAL]||✗||–||94.29%||-0.52%||182.92M (36.4%)||0.67M (36.8%)|
|AutoBot (Ours)||✓||87.85%||94.67%||-0.14%||167.64M (41.7%)||0.76M (28.3%)|
|HRank-40 [hrank]||✗||25.58%**||94.24%||-0.57%||167.41M (41.8%)||0.66M (37.7%)|
|Variational CNN [zhao2019variational]||✗||–||93.16%||-1.65%||156.00M (45.8%)||0.42M (60.4%)|
|AutoBot (Ours)||✓||83.20%||94.41%||-0.40%||128.25M (55.4%)||0.62M (41.5%)|
|GAL-0.05 [GAL]||✗||–||93.53%||-1.28%||128.11M (55.5%)||0.45M (57.5%)|
*according to [neuron_merging]
**based on the code used in the corresponding paper
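To enforce the two constraints during training, we turn them into loss terms. The FLOPs term is
$$\mathcal{L}_g(\Theta) = \frac{\left| g(\Theta) - \mathcal{T}_F \right|}{\mathcal{F} - \mathcal{T}_F}$$
where $\mathcal{F}$ is the FLOPs of the original model, and $\mathcal{T}_F$ is the predefined target FLOPs. And
$$\mathcal{L}_h(\Theta) = \frac{h(\Theta)}{N}$$
where $N$ is the total number of parameters. In contrast to $g$ and $h$, these loss terms are normalized such that the scale of the loss is always the same. As a result, for a given dataset, the training parameters are stable across different architectures. The optimization problem to update the proposed information bottlenecks for automatic pruning can be summarized as follows:
$$\min_{\Theta} \; \mathcal{L}_{CE}\big(\mathcal{Y}, f(\mathcal{X}, \Theta)\big) + \beta \, \mathcal{L}_g(\Theta) + \gamma \, \mathcal{L}_h(\Theta)$$
where $\beta$ and $\gamma$ are hyperparameters that indicate the relative importance of each objective.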
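The way the normalized loss terms combine can be sketched in a few lines of Python. The exact normalizations used here (distance to the target divided by the gap between original and target FLOPs, and mean distance of each gate to its nearest binary value) are assumptions of this sketch, as are the function names and the toy numbers; only the weighted-sum structure and the two weights 6 and 0.4 (the values reported for CIFAR-10) come from the text, and assigning 6 to the FLOPs term is also an assumption.

```python
def flops_loss(weighted_flops, original_flops, target_flops):
    """Distance to the FLOPs target, scaled by (original - target). Assumed form."""
    return abs(weighted_flops - target_flops) / (original_flops - target_flops)


def binarization_loss(gates):
    """Mean distance of each gate to its nearest binary value. Assumed form."""
    return sum(abs(g - round(g)) for g in gates) / len(gates)


def total_loss(cross_entropy, weighted_flops, original_flops, target_flops,
               gates, beta, gamma):
    """Cross-entropy plus the two weighted, normalized regularizers."""
    return (cross_entropy
            + beta * flops_loss(weighted_flops, original_flops, target_flops)
            + gamma * binarization_loss(gates))


# Toy values: the model sits halfway between its original and target FLOPs.
gates = [0.9, 0.1, 0.5, 1.0]
loss = total_loss(cross_entropy=1.0, weighted_flops=75.0,
                  original_flops=100.0, target_flops=50.0,
                  gates=gates, beta=6.0, gamma=0.4)
```

With these normalizations both regularizers stay within a comparable range whatever the size of the network, which is the property the text relies on to reuse the same weights across architectures.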
Optimal threshold Once the bottlenecks are trained, $\Theta$ can be directly used as a pruning criterion. Therefore, we propose a way to quickly find the threshold under which filters should be pruned. Since our bottleneck allows us to quickly and accurately compute the weighted FLOPs (Eq. 5), we can estimate the FLOPs of the model to be pruned without actual pruning. This is done by setting $\theta$ to zero for the filters to be pruned, or one otherwise. We call this process pseudo-pruning. In order to find the optimal threshold, we initialize a threshold to 0.5 and pseudo-prune all filters with $\theta$ lower than this threshold. We then compute the weighted FLOPs and use bisection (the dichotomy algorithm) to efficiently minimize the distance between the current and targeted FLOPs. This process is repeated until the gap is small enough. Once we have found the optimal threshold, we remove all bottlenecks from the model and finally prune all the filters with $\theta$ lower than the optimal threshold to get the compressed model with the targeted FLOPs.
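The threshold search described above can be sketched as a plain bisection. The following Python example is a minimal illustration under stated assumptions: the function names, the relative tolerance `tol`, the step limit, and the toy FLOPs model `toy_flops` (a chain of layers whose cost depends on the kept channels of the current and previous layer) are all hypothetical and stand in for the weighted FLOPs of a real network.

```python
def pseudo_pruned_flops(gates, threshold, flops_fn):
    """FLOPs estimate when gates below threshold are set to 0 and the rest to 1."""
    binary = [[1.0 if g >= threshold else 0.0 for g in layer] for layer in gates]
    return flops_fn(binary)


def find_threshold(gates, target_flops, flops_fn, tol=0.01, max_steps=50):
    """Bisection on the threshold until the FLOPs gap is within tol * target."""
    low, high, threshold = 0.0, 1.0, 0.5
    for _ in range(max_steps):
        flops = pseudo_pruned_flops(gates, threshold, flops_fn)
        if abs(flops - target_flops) <= tol * target_flops:
            break
        if flops > target_flops:
            low = threshold   # too many filters kept: raise the threshold
        else:
            high = threshold  # too many filters pruned: lower the threshold
        threshold = (low + high) / 2.0
    return threshold


def toy_flops(binary_gates, in_channels=3, cost_per_pair=100.0):
    """Hypothetical cost model: each layer costs inputs * kept outputs * constant."""
    total, prev = 0.0, float(in_channels)
    for layer in binary_gates:
        kept = sum(layer)
        total += prev * kept * cost_per_pair
        prev = kept
    return total


# Trained gate values of two layers (illustrative numbers).
gates = [[0.9, 0.8, 0.2, 0.1], [0.95, 0.6, 0.3, 0.05]]
threshold = find_threshold(gates, target_flops=1200.0, flops_fn=toy_flops)
```

The search works because the pseudo-pruned FLOPs can only decrease as the threshold rises, so each step halves the interval that contains a suitable threshold.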
Parametrization Following Schulz et al. [schulz2020iba], we do not directly optimize $\Theta$, as this would require clipping to stay in the $[0, 1]$ interval. Instead, we parametrize $\theta_i = \mathrm{sigmoid}(\psi_i)$, where the elements of $\psi_i$ are in $\mathbb{R}$. Therefore, we can optimize $\psi_i$ without constraints.
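A small Python sketch can show why this parametrization removes the need for clipping. It uses a plain gradient step for readability, whereas the paper optimizes the bottlenecks with Adam; the function names, the gradient values, and the step size are illustrative assumptions.

```python
import math


def sigmoid(psi):
    """Map an unconstrained real value to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-psi))


def gates_from_logits(psi):
    """Gate values obtained from the unconstrained parameters."""
    return [sigmoid(p) for p in psi]


def gradient_step(psi, grad_theta, lr):
    """One plain gradient step on psi, using d(theta)/d(psi) = theta * (1 - theta)."""
    theta = gates_from_logits(psi)
    return [p - lr * g * t * (1.0 - t) for p, g, t in zip(psi, grad_theta, theta)]


# Hypothetical unconstrained parameters and loss gradients w.r.t. the gates.
psi = [0.0, 5.0, -5.0]
new_psi = gradient_step(psi, grad_theta=[10.0, -10.0, 10.0], lr=0.6)
new_theta = gates_from_logits(new_psi)
```

However large the update on the unconstrained parameters, the resulting gates remain strictly between 0 and 1, so no projection step is required after each iteration.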
Reduced training data We empirically observed that the training loss for the bottlenecks converges quickly, before the end of the first epoch. This suggests that, regardless of model size (i.e. FLOPs), the optimally pruned architecture can be efficiently estimated using only a small portion of the dataset.
|Method||Automatic||Top-1 Acc. before finetuning||Top-1 Acc.||Top-1 change||Top-5 Acc.||Top-5 change||FLOPs (Pruning Ratio)||Params (Pruning Ratio)|
|ResNet-50 [resNet]||–||–||76.13%||0.0%||92.87%||0.0%||4.11B (0.0%)||25.56M (0.0%)|
|ThiNet-50 [luo2017thinet]||✗||–||72.04%||-4.09%||90.67%||-2.20%||– (36.8%)||– (33.72%)|
|FPGM [GM]||✗||–||75.59%||-0.59%||92.27%||-0.60%||2.55B (37.5%)||14.74M (42.3%)|
|ABCPruner [ABCPruner]||✓||–||74.84%||-1.29%||92.31%||-0.56%||2.45B (40.8%)||16.92M (33.8%)|
|SFP [l2norm]||✗||–||74.61%||-1.52%||92.06%||-0.81%||2.38B (41.8%)||–|
|HRank-74 [hrank]||✗||–||74.98%||-1.15%||92.33%||-0.54%||2.30B (43.7%)||16.15M (36.8%)|
|Taylor [importance_estimation]||✗||–||74.50%||-1.63%||–||–||– (44.5%)||– (44.9%)|
|DCFF [Minbao2021DCFF]||✗||–||75.18%||-0.95%||92.56%||-0.31%||2.25B (45.3%)||15.16M (40.7%)|
|ITPruner [Info_flow21_PAMI]||✓||–||75.75%||-0.38%||–||–||2.23B (45.7%)||–|
|AutoPruner [luo2020autopruner]||✓||–||74.76%||-1.37%||92.15%||-0.72%||2.09B (48.7%)||–|
|RRBP [zhou2019accelerate]||✗||–||73.00%||-3.13%||91.00%||-1.87%||–||– (54.5%)|
|AutoBot (Ours)||✓||47.51%||76.63%||+0.50%||92.95%||+0.08%||1.97B (52.0%)||16.73M (34.5%)|
|ITPruner [Info_flow21_PAMI]||✓||–||75.28%||-0.85%||–||–||1.94B (52.8%)||–|
|GDP-0.6 [globalPruningIJCAI]||✓||–||71.19%||-4.94%||90.71%||-2.16%||1.88B (54.0%)||–|
|SCOP [SCOP]||✗||–||75.26%||-0.87%||92.53%||-0.33%||1.85B (54.6%)||12.29M (51.9%)|
|GAL-0.5-joint [GAL]||✗||–||71.80%||-4.33%||90.82%||-2.05%||1.84B (55.0%)||19.31M (24.5%)|
|ABCPruner [ABCPruner]||✓||–||73.52%||-2.61%||91.51%||-1.36%||1.79B (56.6%)||11.24M (56.0%)|
|GAL-1 [GAL]||✗||–||69.88%||-6.25%||89.75%||-3.12%||1.58B (61.3%)||14.67M (42.6%)|
|LFPC [Cpruning_variousCriteria]||✗||–||74.18%||-1.95%||91.92%||-0.95%||1.60B (61.4%)||–|
|GDP-0.5 [globalPruningIJCAI]||✓||–||69.58%||-6.55%||90.14%||-2.73%||1.57B (61.6%)||–|
|DCFF [Minbao2021DCFF]||✗||–||75.60%||-0.53%||92.55%||-0.32%||1.52B (63.0%)||11.05M (56.8%)|
|DCFF [Minbao2021DCFF]||✗||–||74.85%||-1.28%||92.41%||-0.46%||1.38B (66.7%)||11.81M (53.8%)|
|AutoBot (Ours)||✓||14.71%||74.68%||-1.45%||92.20%||-0.66%||1.14B (72.3%)||9.93M (61.2%)|
|GAL-1-joint [GAL]||✗||–||69.31%||-6.82%||89.12%||-3.75%||1.11B (72.8%)||10.21M (60.1%)|
|CURL [LuoW20]||✓||–||73.39%||-2.74%||91.46%||-1.41%||1.11B (73.2%)||6.67M (73.9%)|
|DCFF [Minbao2021DCFF]||✗||–||73.81%||-2.32%||91.59%||-1.28%||1.02B (75.1%)||6.56M (74.3%)|
Table 2: Pruning results on ResNet-50 with ImageNet, sorted by FLOPs. Scores in brackets of “FLOPs” and “Params” denote the pruning ratio of FLOPs and number of parameters in the compressed models.
4.1 Experimental Settings
To demonstrate the efficiency of AutoBot on a variety of experimental setups, experiments are conducted on two popular benchmark datasets and five common CNN architectures: 1) CIFAR-10 [cifar10] with VGG-16 [vgg], ResNet-56/110 [resNet], DenseNet [densenet], and GoogLeNet [googlenet], and 2) ILSVRC2012 (ImageNet) [deng2009imagenet] with ResNet-50.
Experiments are performed with the PyTorch and torchvision frameworks [paszke2017automatic] on an Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz and an NVIDIA RTX 2080 Ti GPU with 11GB of memory.
For CIFAR-10, we trained the bottlenecks for 200 iterations with a batch size of 64, a learning rate of 0.6, and the loss weights $\beta$ and $\gamma$ equal to 6 and 0.4 respectively, and we finetuned the model for 200 epochs with an initial learning rate of 0.02 scheduled by a cosine annealing scheduler and a batch size of 256. For ImageNet, we trained the bottlenecks for 3000 iterations with a batch size of 32, a learning rate of 0.3, and the loss weights $\beta$ and $\gamma$ equal to 10 and 0.6 respectively, and we finetuned the model for 200 epochs with a batch size of 512 and an initial learning rate of 0.006 scheduled by a cosine annealing scheduler. Bottlenecks are optimized via the Adam optimizer. All networks are retrained via the Stochastic Gradient Descent (SGD) optimizer, with a momentum of 0.9 and a decay factor of for CIFAR-10, and with a momentum of 0.99 and a decay factor of for ImageNet.
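The iteration counts and batch sizes above account for the dataset fractions quoted for the bottleneck training (25.6% and 7.49%). The short Python check below makes the arithmetic explicit; it assumes the standard training-set sizes of 50,000 images for CIFAR-10 and 1,281,167 images for ILSVRC2012, which are not stated in the text, and the function name is illustrative.

```python
def fraction_of_dataset(iterations, batch_size, dataset_size):
    """Share of the training set seen after a given number of iterations."""
    return iterations * batch_size / dataset_size


# Assumed training-set sizes: 50,000 (CIFAR-10) and 1,281,167 (ILSVRC2012).
cifar10 = fraction_of_dataset(iterations=200, batch_size=64, dataset_size=50_000)
imagenet = fraction_of_dataset(iterations=3000, batch_size=32, dataset_size=1_281_167)

print(f"CIFAR-10: {cifar10:.1%}, ImageNet: {imagenet:.2%}")
```

In both cases the bottlenecks see well under one full pass over the data, in line with the claim that a fraction of a single epoch is enough.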
4.2 Evaluation Metrics
In order to make a quantitative comparison, we first evaluate the Top-1 (and Top-5 for ImageNet) accuracy of the models. This comparison is done after finetuning, as is common in DNN pruning literature. Furthermore, we measure the Top-1 accuracy right after the pruning step (before finetuning) to prove that our method can effectively preserve the important filters which make a big impact on the model decision. Indeed, accuracy after finetuning depends on many parameters independent from the pruning method, such as data augmentation, learning rate, scheduler, etc. Therefore, we believe that it is not the most accurate way to compare performance across pruning methods.
We adopt commonly used metrics, i.e. FLOPs and the number of parameters, to measure the quality of the pruned models in terms of computational efficiency and model size. Note that our proposed method can freely compress the pre-trained model to any size given a target FLOPs.
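For reference, the pruning ratio reported alongside FLOPs and parameters is the relative reduction with respect to the original model. The Python snippet below recomputes it for the three VGG-16 models pruned by AutoBot, using the FLOPs values from Tab. 1 (314.29M for the baseline); the function name `pruning_ratio` is an illustrative choice.

```python
def pruning_ratio(original, pruned):
    """Relative reduction of a cost (FLOPs or parameters) after pruning."""
    return 1.0 - pruned / original


vgg16_flops = 314.29                     # baseline FLOPs in millions (Tab. 1)
autobot_flops = (145.61, 108.71, 72.60)  # AutoBot VGG-16 models (Tab. 1)

ratios = [f"{pruning_ratio(vgg16_flops, f):.1%}" for f in autobot_flops]
print(ratios)
```

The same function applies unchanged to the parameter counts, since both columns use the same definition of the ratio.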
4.3 Automatic Pruning on CIFAR-10
To demonstrate the improvement brought by our method, we first conduct automatic pruning on some of the most popular convolutional neural networks, namely VGG-16, ResNet-56/110, GoogLeNet, and DenseNet-40. Tab. 1 reports the experimental results of these architectures on CIFAR-10 for various FLOPs targets.
VGG-16 We pruned the VGG-16 architecture with three different pruning ratios. VGG-16 is a very common convolutional neural network architecture that contains thirteen convolution layers and two fully-connected layers. Tab. 1 shows that our method maintains a considerably higher accuracy before finetuning under the same FLOPs reduction (e.g. 82.73% for the proposed method vs. 10.00% for HRank at a 65.4% FLOPs reduction), thus leading to a SOTA accuracy after finetuning. For instance, we get 71.24% and 93.62% accuracy before and after finetuning, respectively, when reducing the FLOPs by 76.9%. Our method even outperforms the baseline by 0.05% and 0.23% when reducing the FLOPs by 65.4% and 53.7%, respectively.
As emphasized in Fig. 2, the per-layer filter pruning ratio is automatically determined by our method, according to the target FLOPs.
GoogLeNet GoogLeNet is a large architecture (1.53 billion FLOPs) characterized by its parallel branches named inception blocks. In total, it contains 64 convolutions and one fully-connected layer. Our accuracy of 90.18% after pruning under a FLOPs reduction of 70.6% (against 10% for HRank at the same compression) leads to the SOTA accuracy of 95.23% after finetuning, outperforming recent papers such as DCFF and CC. Moreover, we also achieve a significant improvement in terms of parameter reduction (73.1%), although it is not the primary focus of our method.
ResNet-56/110 ResNet is an architecture characterized by its residual connections. We adopted ResNet-56 and ResNet-110, which consist of 55 and 109 convolution layers, respectively. The model pruned with our method improves from 85.58% accuracy before finetuning to 93.76% after finetuning under a FLOPs reduction of 55.9% for ResNet-56, and from 84.37% to 94.15% under a FLOPs reduction of 66.6% for ResNet-110. Under similar or even smaller FLOPs, our approach achieves a higher Top-1 accuracy than existing magnitude-based or adaptive pruning methods and exceeds the baseline model’s performance (93.27% for ResNet-56 and 93.50% for ResNet-110).
DenseNet-40 Like ResNet, DenseNet-40 is an architecture based on skip connections, here in the form of dense connections between layers. It is made of 39 convolutions and one fully-connected layer. We experimented with two different target FLOPs, as shown in Tab. 1. Notably, we obtained an accuracy of 83.20% before finetuning and 94.41% after finetuning under a FLOPs reduction of 55.4%.
4.4 Automatic Pruning on ImageNet
To show the performance of our method on ILSVRC-2012, we chose the ResNet-50 architecture, which is made of 53 convolution layers followed by a fully-connected layer. Due to the complexity of this dataset (1,000 classes and millions of images) and the compact design of ResNet itself, this task is more challenging than the compression of models on CIFAR-10. While existing pruning methods that require manually defining the pruning ratio of each layer achieve reasonable performance, our global pruning method achieves competitive results on all evaluation metrics, including Top-1 and Top-5 accuracy, FLOPs reduction, and parameter reduction, as reported in Tab. 2. Under the high FLOPs compression of 72.3%, we obtain a Top-1 accuracy of 74.68%, outperforming recent works including GAL (69.31%) and CURL (73.39%) at a similar compression. Under the more moderate compression of 52%, our method even outperforms the baseline by 0.5% and leaves all previous methods behind by at least 1% by doing so. Therefore, the proposed method also works well on a complex dataset.
|Model||FLOPs||Jetson-Nano (GPU)||Raspberry Pi 4B (CPU)||Raspberry Pi 3B+ (CPU)||Raspberry Pi 2B (CPU)|
|VGG-16||73.71M||61.63 → 13.33 (×4.62)||45.73 → 17.16 (×2.66)||79.98 → 35.17 (×2.27)||351.77 → 118.36 (×2.97)|
|VGG-16||108.61M||61.63 → 13.77 (×4.48)||45.73 → 19.95 (×2.29)||79.98 → 39.99 (×2.00)||351.77 → 143.95 (×2.44)|
|VGG-16||145.55M||61.63 → 19.24 (×3.20)||45.73 → 24.33 (×1.88)||79.98 → 50.27 (×1.59)||351.77 → 184.47 (×1.91)|
|ResNet-56||55.94M||16.47 → 13.71 (×1.20)||21.95 → 15.88 (×1.38)||60.42 → 39.78 (×1.52)||170.46 → 101.70 (×1.68)|
|ResNet-110||85.30M||28.10 → 26.36 (×1.07)||41.35 → 27.90 (×1.48)||112.57 → 72.71 (×1.55)||331.60 → 179.91 (×1.84)|
|GoogLeNet||0.45B||80.84 → 28.37 (×2.85)||146.68 → 57.25 (×2.56)||342.23 → 170.17 (×2.01)||1,197.65 → 400.89 (×2.99)|
|DenseNet-40||129.13M||35.25 → 33.46 (×1.05)||71.87 → 44.73 (×1.61)||171.86 → 102.75 (×1.67)||432.03 → 252.63 (×1.71)|
|DenseNet-40||168.26M||35.25 → 35.11 (×1.00)||71.87 → 53.08 (×1.35)||171.86 → 114.37 (×1.50)||432.03 → 302.49 (×1.43)|
|Raspberry Pi 4B||No GPGPU||4GB LPDDR4|
|Raspberry Pi 3B+||No GPGPU||1GB LPDDR2|
|Raspberry Pi 2B||No GPGPU||1GB SDRAM|
4.5 Ablation Study
4.5.1 Impact of Preserving the Accuracy
To highlight the impact of preserving the accuracy during the pruning process, we compare the accuracy before and after finetuning of AutoBot with different pruning strategies in Fig. 3. To show the superiority of an architecture found by preserving the accuracy compared to a manually designed architecture, a comparison study is conducted by manually designing three different strategies: 1) Same Pruning, Different Channels (SPDC), 2) Different Pruning, Different Channels (DPDC), and 3) Reverse.
DPDC has the same FLOPs as the architecture found by AutoBot but uses a different per-layer pruning ratio proposed by Lin et al. [hrank]. To show the impact of a bad initial accuracy for finetuning, we propose the SPDC strategy that has the same per-layer pruning ratio as the architecture found by AutoBot but with randomly selected filters. We also propose to reverse the order of importance of the filters selected by AutoBot such that only the less important filters are pruned. By doing so, we can better appreciate the importance of the scores returned by AutoBot. In Fig. 3, we define this strategy as Reverse. Note that this strategy gives a different per-layer pruning ratio than the architecture found by AutoBot. We evaluate the three strategies on VGG-16 with a pruning ratio of 65.4%, and we use the same finetuning conditions for all of them. We select the best accuracy among 3 runs. As shown in Fig. 3, these three different strategies give an initial accuracy of 10%. While the DPDC strategy gives an accuracy of 93.18% after finetuning, the SPDC strategy displays 93.38% accuracy, thus showing that an architecture found by preserving the initial accuracy gives better performance. Meanwhile, the Reverse strategy obtains 93.24%, which is surprisingly better than the hand-made architecture but, as expected, it underperforms the architecture found by AutoBot, even if we apply the SPDC strategy.
4.5.2 Deployment Test
To highlight the improvement in real situations, compressed models need to be tested on edge AI devices. Therefore, we compare the inference speed-up of our compressed networks deployed on GPU-based (NVIDIA Jetson Nano) and CPU-based (Raspberry Pi 4, Raspberry Pi 3, and Raspberry Pi 2) edge devices. Specifications of these devices are available in Tab. 4. The pruned models are converted into the ONNX format. Fig. 4 compares the inference times of the original pre-trained models and our compressed models. The inference time of our pruned models improves on every target edge device (e.g. GoogLeNet is 2.85× faster on the Jetson Nano and 2.56× faster on the Raspberry Pi 4B while the accuracy improved by 0.22%). In particular, the speed-up is largest on the GPU-based device for models made of a single sequence of layers (e.g. VGG-16 and GoogLeNet), whereas models with skip connections improve the most on CPU-based devices. More detailed results are available in Tab. 3.
In this paper, we introduced AutoBot, a novel automatic pruning method focusing on FLOPs reduction. To determine which filters to prune, AutoBot employs trainable bottlenecks designed to preserve the channels that maximize the model accuracy while minimizing the FLOPs of the model. Notably, these bottlenecks only require one epoch on 25.6% (CIFAR-10) or 7.49% (ILSVRC2012) of the dataset to be trained. Extensive experiments on various CNN architectures demonstrate that the proposed method is superior to previous channel pruning methods both before and after finetuning. To the best of our knowledge, our paper is the first to compare accuracy before finetuning. For future work, we plan to extend the trainable bottleneck concept to NAS in order to find the best candidate architectures.