1 Introduction
In the last decade, the popularity of Deep Neural Networks (DNNs) has grown rapidly as their results improved, and they are now used in a wide variety of applications such as classification and detection. However, these improvements often come with increasing model complexity, resulting in a need for more computational resources. Hence, even though hardware performance improves quickly, it remains hard to directly deploy models onto targeted edge devices or smartphones. To this end, various attempts to make heavy models more compact have been proposed, based on different compression methods such as knowledge distillation [PolinoPA18, GuoWWYLHL20], pruning [l1norm, hrank, ABCPruner, LRP_pruning], quantization [QuZCT20], and neural architecture search (NAS) [hournas]. Among these categories, network pruning, which removes redundant and unimportant connections, is one of the most popular and promising compression methods, and has recently received great interest from industry players seeking to compress their AI models and fit them on small target devices with resource constraints. Indeed, running models on-device instead of relying on cloud computing brings numerous advantages such as reduced costs and energy consumption, increased speed, and improved data privacy.
As manually defining what percentage of each layer should be pruned is a time-consuming process that requires human expertise, recent works have proposed methods that automatically prune the redundant filters throughout the network to meet a given constraint such as the number of parameters, FLOPs, or a hardware platform [liunetworkslimming, you2019gate, Li_2021_CVPR, importance_estimation, ABCPruner, LRP_pruning, nisp, Info_flow18, Info_flow21_PAMI]. To automatically find the best pruned architectures, these methods rely on various metrics such as second-order Taylor expansions [importance_estimation] or the layer-wise relevance propagation score [LRP_pruning]; see Sec. 2 for further details. Although these strategies have improved over time, they usually do not explicitly aim to preserve the model accuracy, or they do so in a computationally expensive way.
In this paper, we hypothesize that the pruned architecture that leads to the best accuracy after fine-tuning is the one that most efficiently preserves the accuracy during the pruning process (see Sec. 4.5.1). We therefore introduce an automatic pruning method, called AutoBot, that uses trainable bottlenecks to efficiently preserve the model accuracy while minimizing the FLOPs, as shown in Fig. 1. These bottlenecks only require a single epoch of training on 25.6% (CIFAR-10) or 7.49% (ILSVRC-2012) of the dataset to efficiently learn which filters to prune. We compare AutoBot with various pruning methods and show a significant improvement of the pruned models before fine-tuning, leading to state-of-the-art (SOTA) accuracy after fine-tuning. We also perform a practical deployment test on several edge devices to demonstrate the speed improvement of the pruned models.
To summarize, our contributions are as follows:

We introduce AutoBot, a novel automatic pruning method that uses trainable bottlenecks to efficiently learn which filters to prune in order to maximize the accuracy while minimizing the FLOPs of the model. This method can easily and intuitively be implemented regardless of the dataset or model architecture.

We demonstrate that preserving the accuracy during the pruning process has a strong impact on the accuracy of the fine-tuned model (Sec. 4.5.1).

Extensive experiments show that AutoBot efficiently preserves the accuracy after pruning (before fine-tuning), and outperforms all existing pruning methods once fine-tuned.
2 Related Works
Pruning redundant filters is a common CNN compression solution, as it is an intuitive method that has repeatedly proved its efficiency [l1norm, hrank, LRP_pruning]. In this section, we summarize related works and contrast them with our proposed method.
Traditionally, magnitude-based pruning aims to exploit the inherent characteristics of the network to define a pruning criterion, without modifying the network parameters. Popular criteria include the l1 norm [l1norm, Minbao2021DCFF, MengCLLGLS20, Li2020CVPR, adaptive1], Taylor expansion [importance_estimation], gradients [LiuWgradient19], Singular Value Decomposition [hrank], sparsity of output feature maps [apoz], the geometric median [GM], etc. Recently, Tang et al. [SCOP] proposed a scientific control pruning method, called SCOP, which introduces knockoff features as the control group.

In contrast, adaptive pruning needs to retrain the network from scratch with a modified training loss or architecture that adds new constraints. Several works [liunetworkslimming, luo2017thinet, YeRethinking18] add trainable parameters to each feature map channel to obtain data-driven channel sparsity, enabling the model to automatically identify redundant filters. Luo et al. [luo2017thinet] introduce ThiNet, which formally establishes filter pruning as an optimization problem and prunes filters based on statistical information computed from the next layer, not the current one. Lin et al. [GAL] propose a structured pruning method that jointly prunes filters and other structures by introducing a soft mask with sparsity regularization. However, retraining the model from scratch is a time- and resource-consuming process that does not significantly improve the accuracy compared to magnitude-based pruning.

Although these two pruning strategies are intuitive, the pruning ratio must be manually defined layer by layer, which is a time-consuming process that requires human expertise. Instead, in this paper, we focus on automatic pruning. As suggested by the name, automatic network pruning prunes the redundant filters throughout the network automatically under a given constraint such as the number of parameters, FLOPs, or a hardware platform. In this respect, a large number of automatic pruning methods have been proposed. Liu et al. [liunetworkslimming] optimize the scaling factor in the batch-norm layer as a channel selection indicator to decide which channels are unimportant. You et al.
[you2019gate] propose an automatic pruning method, called Gate Decorator, which transforms CNN modules by multiplying their output by channel-wise scaling factors, and adopt an iterative pruning framework called Tick-Tock to boost pruning accuracy. Li et al. [Li_2021_CVPR] propose a collaborative compression method that mutually combines channel pruning and tensor decomposition. Molchanov et al. [importance_estimation] estimate the contribution of a filter to the final loss using second-order Taylor expansions and iteratively remove those with smaller scores. Lin et al. [ABCPruner] propose ABCPruner to find the optimal pruned structure automatically by updating the structure set and recalculating the fitness. Backpropagation methods [LRP_pruning, nisp] compute the relevance score of each filter by following the information flow from the model output. Dai et al. [Info_flow18] and Zheng et al. [Info_flow21_PAMI] adopt information theory to preserve the information between the hidden representation and the input or output.
However, most existing methods are computationally expensive and time-consuming because they either require retraining the model from scratch [liunetworkslimming], apply iterative pruning [you2019gate, importance_estimation, LRP_pruning, nisp, Li_2021_CVPR], or fine-tune the model while pruning [ABCPruner, Info_flow18]. When the model is not retrained or fine-tuned during the pruning process, these methods generally do not preserve the model accuracy after pruning [Info_flow21_PAMI, LRP_pruning, nisp], and thus require fine-tuning for a large number of epochs. In contrast to other automatic pruning methods, AutoBot stands out by its ability to efficiently preserve the accuracy of the model during the pruning process, in a simple and intuitive way.
3 Method
Motivated by several bottleneck approaches [info_bottleneck, deep_info_bottleneck, schulz2020iba], our method efficiently controls the information flow throughout the pretrained network using trainable bottlenecks that are injected into the model. The objective of the trainable bottleneck is to maximize the information flow from input to output while minimizing the loss, by adjusting the amount of information in the model under the given constraints. Note that during the training procedure, only the parameters of the trainable bottlenecks are updated, while all the pretrained parameters of the model are frozen.
Compared to other pruning methods inspired by the information bottleneck [Info_flow18, Info_flow21_PAMI], we do not consider the compression of mutual information between the input/output and the hidden representations to evaluate the information flow. Such methods are orthogonal to AutoBot, which explicitly quantifies how much information is passed to the next layer during the forward pass. Furthermore, we optimize the trainable bottlenecks on a fraction of one single epoch only. Our AutoBot pruning process is summarized in Alg. 1.
3.1 Trainable Bottleneck
We formally define the concept of trainable bottleneck as a module that can restrict the information flow throughout the network during the forward pass, using trainable parameters. Mathematically, it can be formulated as:
$z_l = B(h_l; \theta_l)$   (1)

where $B$ stands for the trainable bottleneck, $\theta_l$ denotes the bottleneck parameters of the $l$-th module, and $h_l$ and $z_l$ denote the input and output feature maps of the bottleneck at the $l$-th module, respectively. For instance, Schulz et al. [schulz2020iba] control the amount of information flowing into the model by injecting noise into it. In this case, $B$ is expressed as $B(h_l; \theta_l) = \theta_l h_l + (1 - \theta_l)\epsilon_l$, where $\epsilon_l$ denotes the noise.
Inspired by the information bottleneck concept [info_bottleneck, deep_info_bottleneck], we formulate a general bottleneck that is not limited to information theory but can be optimized to satisfy any constraint, as follows:
$\min_{\Theta} \; \mathcal{L}_{CE}\big(Y, \Phi(X; \Theta)\big) \quad \text{s.t.} \quad \Gamma(\Theta) = \mathcal{C}$   (2)

where $\mathcal{L}_{CE}$ stands for the cross-entropy loss, $X$ and $Y$ stand for the model input and output, $\Phi$ denotes the model, $\Theta$ is the set of the bottleneck parameters ($\theta_1, \dots, \theta_L$) in the model, $\Gamma$ is a constraint function, and $\mathcal{C}$ is the desired constraint.
3.2 Pruning Strategy
In this work, we propose trainable bottlenecks for automatic network pruning. To this end, we inject a bottleneck into each convolution block throughout the network such that the information flow of the candidate pruned model is quantified by restricting the trainable parameters layer-wise.
Compared to previous works, our bottleneck function (Eq. 1) does not use noise to control the information flow:
$z_l = \theta_l \, h_l$   (3)

where $\theta_l \in [0, 1]^{C_l}$, with $C_l$ the number of channels of the $l$-th block. Therefore, the range of $z_l$ goes from $0$ to $h_l$. For pruning, this is more relevant since replacing a module input by zeros is equivalent to pruning the module (i.e., pruning the corresponding output of the previous module).
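The multiplicative bottleneck of Eq. 3 can be sketched in pure Python (a minimal toy version, with the feature map reduced to one value per channel); setting a gate to zero has exactly the same effect on the forward pass as pruning the corresponding channel:

```python
def bottleneck(h, theta):
    """Trainable bottleneck of Eq. 3: z_l = theta_l * h_l, with theta in [0, 1].

    h: per-channel activations (one value per channel, for illustration).
    theta: per-channel gate values; 1 keeps a channel, 0 prunes it.
    """
    assert len(h) == len(theta)
    return [t * x for t, x in zip(theta, h)]

# theta = 1 passes the channel unchanged, theta = 0 zeroes it out,
# and intermediate values attenuate the information flow.
z = bottleneck([2.0, -3.0, 5.0], [1.0, 0.0, 0.5])
```

In the actual method the gates multiply whole feature maps channel-wise; this scalar version only illustrates the gating mechanism.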
Following the general objective function of the trainable bottleneck (Eq. 2), we introduce two regularizers $\Gamma_1$ and $\Gamma_2$ to obtain the following function:
$\min_{\Theta} \; \mathcal{L}_{CE}\big(Y, \Phi(X; \Theta)\big) \quad \text{s.t.} \quad \Gamma_1(\Theta) = \mathcal{T}, \;\; \Gamma_2(\Theta) = 0$   (4)

where $\mathcal{T}$ is the target FLOPs (manually fixed). As we explain in more detail in the next paragraphs, the role of $\Gamma_1$ is to constrain the pruned architecture to stay under $\mathcal{T}$, while $\Gamma_2$ makes the parameters $\Theta$ converge toward binary values (0 or 1).
As an evaluation metric, FLOPs is strongly correlated with inference time. Therefore, when it comes to running a neural network on a device with limited computational resources, pruning to efficiently reduce the FLOPs is a common solution. Our work tightly embraces this rule, as it can produce pruned models of any size by constraining the FLOPs according to the targeted device. Formally, given a neural network consisting of multiple convolutional blocks, we enforce the following condition:
$\Gamma_1(\Theta) = \sum_{l=1}^{L} \sum_{m=1}^{M_l} f_{l,m}(\theta_l) = \mathcal{T}$   (5)

where $\theta_l$ is the vector of the parameters at the information bottleneck following the $l$-th convolution block, $f_{l,m}$ is the function that computes the FLOPs of the $m$-th module of the $l$-th convolution block weighted by $\theta_l$, $L$ is the total number of convolution blocks in the model, and $M_l$ is the total number of modules in the $l$-th convolution block. For instance, assume that $f_{l,m}$ is for a convolutional module without bias and padding. Then it can simply be expressed as:
$f_{l,m}(\theta_l) = H_{l,m} \, W_{l,m} \, K_{l,m}^2 \, \lVert \theta_{l-1} \rVert_1 \, \lVert \theta_l \rVert_1$   (6)

where $H_{l,m}$ and $W_{l,m}$ are the height and width of the output feature map of the convolution, and $K_{l,m}$ is its kernel size. Notice that within the $l$-th convolution block, all modules share $\theta_l$. That is, at a block level, all the modules belonging to the same convolution block are pruned together.
A key issue for pruning is that finding redundant filters is a discrete problem, i.e., each filter should either be pruned or not. In our case, this problem manifests in the fact that $\Theta$ cannot be binary, as the optimization problem would then be non-differentiable, meaning that backpropagation would not work. To tackle this issue, we force the continuous parameters $\Theta$ to converge toward a binary solution that indicates the presence ($\theta = 1$) or absence ($\theta = 0$) of a filter. This is the role of the constraint $\Gamma_2$:
$\Gamma_2(\Theta) = \sum_{l=1}^{L} \sum_{i=1}^{C_l} \theta_l^{(i)} \big(1 - \theta_l^{(i)}\big) = 0$   (7)
Method  Automatic  Top-1 acc before fine-tuning  Top-1 acc  ΔTop-1  FLOPs (Pruning Ratio)  Params (Pruning Ratio)
VGG-16 [vgg]  –  93.96%  0.0%  314.29M (0.0%)  14.99M (0.0%)
L1 [l1norm]  88.70%^{*}  93.40%  -0.56%  206.00M (34.5%)  5.40M (64.0%)
CC-0.4 [Li_2021_CVPR]  ✓  –  94.15%  +0.19%  154.00M (51.0%)  5.02M (66.5%)
AutoBot (Ours)  ✓  88.29%  94.19%  +0.23%  145.61M (53.7%)  7.53M (49.8%)
CC-0.5 [Li_2021_CVPR]  ✓  –  94.09%  +0.13%  123.00M (60.9%)  5.02M (73.2%)
HRank-65 [hrank]  10.00%^{**}  92.34%  -1.62%  108.61M (65.4%)  2.64M (82.4%)
AutoBot (Ours)  ✓  82.73%  94.01%  +0.05%  108.71M (65.4%)  6.44M (57.0%)
ITPruner [Info_flow21_PAMI]  ✓  –  94.00%  +0.04%  98.80M (68.6%)  –
ABCPruner [ABCPruner]  ✓  –  93.08%  -0.88%  82.81M (73.7%)  1.67M (88.9%)
DCFF [Minbao2021DCFF]  –  93.49%  -0.47%  72.77M (76.8%)  1.06M (92.9%)
AutoBot (Ours)  ✓  71.24%  93.62%  -0.34%  72.60M (76.9%)  5.51M (63.24%)
VIBNet [Info_flow18]  ✓  –  91.50%  -2.46%  70.63M (77.5%)  – (94.7%)
ResNet-56 [resNet]  –  93.27%  0.0%  126.55M (0.0%)  0.85M (0.0%)
L1 [l1norm]  –  93.06%  -0.21%  90.90M (28.2%)  0.73M (14.1%)
HRank-50 [hrank]  10.78%^{**}  93.17%  -0.10%  62.72M (50.4%)  0.49M (42.4%)
SCP [KangH20]  –  93.23%  -0.04%  61.89M (51.1%)  0.44M (48.2%)
CC [Li_2021_CVPR]  ✓  –  93.64%  +0.37%  60.00M (52.6%)  0.44M (48.2%)
ITPruner [Info_flow21_PAMI]  ✓  –  93.43%  +0.16%  59.50M (53.0%)  –
FPGM [GM]  –  93.26%  -0.01%  59.40M (53.0%)  –
LFPC [Cpruning_variousCriteria]  –  93.24%  -0.03%  59.10M (53.3%)  –
ABCPruner [ABCPruner]  ✓  –  93.23%  -0.04%  58.54M (53.7%)  0.39M (54.1%)
DCFF [Minbao2021DCFF]  –  93.26%  -0.01%  55.84M (55.9%)  0.38M (55.3%)
AutoBot (Ours)  ✓  85.58%  93.76%  +0.49%  55.82M (55.9%)  0.46M (45.9%)
SCOP [SCOP]  –  93.64%  +0.37%  – (56.0%)  – (56.3%)
ResNet-110 [resNet]  –  93.50%  0.0%  254.98M (0.0%)  1.73M (0.0%)
L1 [l1norm]  –  93.30%  -0.20%  155.00M (39.2%)  1.16M (32.9%)
HRank-58 [hrank]  –  93.36%  -0.14%  105.70M (58.5%)  0.70M (59.5%)
LFPC [Cpruning_variousCriteria]  –  93.07%  -0.43%  101.00M (60.3%)  –
ABCPruner [ABCPruner]  ✓  –  93.58%  +0.08%  89.87M (64.8%)  0.56M (67.6%)
DCFF [Minbao2021DCFF]  –  93.80%  +0.30%  85.30M (66.5%)  0.56M (67.6%)
AutoBot (Ours)  ✓  84.37%  94.15%  +0.65%  85.28M (66.6%)  0.70M (59.5%)
GoogLeNet [googlenet]  –  95.05%  0.0%  1.53B (0.0%)  6.17M (0.0%)
L1 [l1norm]  –  94.54%  -0.51%  1.02B (33.3%)  3.51M (43.1%)
Random  –  94.54%  -0.51%  0.96B (37.3%)  3.58M (42.0%)
HRank-54 [hrank]  –  94.53%  -0.52%  0.69B (54.9%)  2.74M (55.6%)
CC [Li_2021_CVPR]  ✓  –  94.88%  -0.17%  0.61B (60.1%)  2.26M (63.4%)
ABCPruner [ABCPruner]  ✓  –  94.84%  -0.21%  0.51B (66.7%)  2.46M (60.1%)
DCFF [Minbao2021DCFF]  –  94.92%  -0.13%  0.46B (69.9%)  2.08M (66.3%)
HRank-70 [hrank]  10.00%^{**}  94.07%  -0.98%  0.45B (70.6%)  1.86M (69.9%)
AutoBot (Ours)  ✓  90.18%  95.23%  +0.16%  0.45B (70.6%)  1.66M (73.1%)
DenseNet-40 [densenet]  –  94.81%  0.0%  287.71M (0.0%)  1.06M (0.0%)
Network Slimming [liunetworkslimming]  ✓  –  94.81%  0.00%  190.00M (34.0%)  0.66M (37.7%)
GAL-0.01 [GAL]  –  94.29%  -0.52%  182.92M (36.4%)  0.67M (36.8%)
AutoBot (Ours)  ✓  87.85%  94.67%  -0.14%  167.64M (41.7%)  0.76M (28.3%)
HRank-40 [hrank]  25.58%^{**}  94.24%  -0.57%  167.41M (41.8%)  0.66M (37.7%)
Variational CNN [zhao2019variational]  –  93.16%  -1.65%  156.00M (45.8%)  0.42M (60.4%)
AutoBot (Ours)  ✓  83.20%  94.41%  -0.40%  128.25M (55.4%)  0.62M (41.5%)
GAL-0.05 [GAL]  –  93.53%  -1.28%  128.11M (55.5%)  0.45M (57.5%)
^{*} according to [neuron_merging]
^{**} based on the code used in the corresponding paper
To solve our optimization problem defined in Eq. 4, we introduce two loss terms, $\mathcal{L}_f$ and $\mathcal{L}_b$, designed to satisfy the constraints $\Gamma_1$ and $\Gamma_2$ from Eq. 5 and Eq. 7, respectively. We formulate $\mathcal{L}_f$ and $\mathcal{L}_b$ as follows:
$\mathcal{L}_f = \dfrac{\lvert \Gamma_1(\Theta) - \mathcal{T} \rvert}{F}$   (8)

where $F$ is the FLOPs of the original model, and $\mathcal{T}$ is the predefined target FLOPs. And
$\mathcal{L}_b = \dfrac{1}{N} \sum_{l=1}^{L} \sum_{i=1}^{C_l} \theta_l^{(i)} \big(1 - \theta_l^{(i)}\big)$   (9)

where $N$ is the total number of parameters $\theta$. In contrast to $\Gamma_1$ and $\Gamma_2$, these loss terms are normalized such that the scale of the loss is always the same. As a result, for a given dataset, the training parameters are stable across different architectures. The optimization problem to update the proposed information bottlenecks for automatic pruning can be summarized as follows:
$\min_{\Theta} \; \mathcal{L}_{CE}\big(Y, \Phi(X; \Theta)\big) + \beta_1 \mathcal{L}_f + \beta_2 \mathcal{L}_b$   (10)

where $\beta_1$ and $\beta_2$ are hyperparameters that indicate the relative importance of each objective.
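The two normalized regularizers and the total objective of Eq. 10 can be sketched as follows (a simplified stand-alone version; in particular, we assume here that $\mathcal{L}_f$ is the absolute FLOPs gap normalized by the original FLOPs, matching Eq. 8):

```python
def flops_loss(weighted_flops, target_flops, full_flops):
    # L_f (Eq. 8): distance to the FLOPs target, normalized by the
    # FLOPs of the original (unpruned) model.
    return abs(weighted_flops - target_flops) / full_flops

def binary_loss(thetas):
    # L_b (Eq. 9): theta * (1 - theta) vanishes iff theta is exactly 0 or 1;
    # averaging over all gates keeps the loss scale architecture-independent.
    flat = [t for layer in thetas for t in layer]
    return sum(t * (1.0 - t) for t in flat) / len(flat)

def total_objective(ce_loss, l_f, l_b, beta1, beta2):
    # Eq. 10: cross-entropy plus the two weighted regularizers.
    return ce_loss + beta1 * l_f + beta2 * l_b
```

In practice these terms are computed on tensors inside the training loop; the normalization is what lets the same $(\beta_1, \beta_2)$ work across architectures on a given dataset.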
Optimal threshold. Once the bottlenecks are trained, $\Theta$ can be directly used as a pruning criterion. Therefore, we propose a way to quickly find the threshold under which neurons should be pruned. Since our bottleneck allows us to quickly and accurately compute the weighted FLOPs (Eq. 5), we can estimate the FLOPs of the model to be pruned without actual pruning. This is done by setting $\theta$ to zero for the filters to be pruned, or one otherwise. We call this process pseudo-pruning. To find the optimal threshold, we initialize the threshold to 0.5 and pseudo-prune all filters with $\theta$ lower than this threshold. We then compute the weighted FLOPs, and adopt a bisection (dichotomy) algorithm to efficiently minimize the distance between the current and targeted FLOPs. This process is repeated until the gap is small enough. Once we have found the optimal threshold, we cut out all bottlenecks from the model and finally prune all the filters with $\theta$ lower than the optimal threshold, obtaining the compressed model with the targeted FLOPs.
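The threshold search described above can be sketched as a simple bisection over pseudo-pruned gate masks (a toy stand-alone version; `weighted_flops_fn` is a placeholder standing in for the weighted FLOPs computation of Eq. 5):

```python
def find_threshold(thetas, weighted_flops_fn, target, tol=0.01, max_iter=50):
    """Bisection (dichotomy) search for the optimal pruning threshold.

    Pseudo-pruning: gates below the threshold are snapped to 0 and the
    rest to 1, so the FLOPs of the pruned model can be estimated via
    the weighted FLOPs without actually pruning anything.
    """
    lo, hi = 0.0, 1.0
    thr = 0.5  # initial threshold, as in the text
    for _ in range(max_iter):
        mask = [[0.0 if t < thr else 1.0 for t in layer] for layer in thetas]
        flops = weighted_flops_fn(mask)
        if abs(flops - target) <= tol * target:
            break  # close enough to the target FLOPs
        if flops > target:
            lo = thr  # model still too heavy: raise the threshold, prune more
        else:
            hi = thr  # pruned too much: lower the threshold, keep more filters
        thr = (lo + hi) / 2.0
    return thr
```

In the real method, the mask weights the per-block FLOPs of Eq. 5 rather than this toy proxy, but the search logic is the same.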
Parametrization. Following Schulz et al. [schulz2020iba], we do not directly optimize $\theta$, as this would require clipping to stay in the interval $[0, 1]$. Instead, we parametrize $\theta = \mathrm{sigmoid}(w)$, where the elements of $w$ are in $\mathbb{R}$. Therefore, we can optimize $w$ without constraints.
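A minimal pure-Python sketch of this parametrization (the unconstrained variable `w` ranges over all reals, so no clipping is ever needed):

```python
import math

def theta_from_w(w):
    # theta = sigmoid(w): any real-valued w maps into (0, 1),
    # so w can be optimized freely while theta stays a valid gate.
    return [1.0 / (1.0 + math.exp(-wi)) for wi in w]

theta = theta_from_w([-4.0, 0.0, 4.0])  # roughly [0.018, 0.5, 0.982]
```

Large negative `w` pushes a gate toward 0 (prune) and large positive `w` toward 1 (keep), which works together with the binary regularizer of Eq. 9.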
Reduced training data. We empirically observed that the training loss of the bottlenecks can converge before the end of the first epoch. This suggests that, regardless of model size (i.e., FLOPs), the optimally pruned architecture can be efficiently estimated using only a small portion of the dataset.
4 Experiments
Method  Automatic  Top-1 acc before fine-tuning  Top-1 acc  ΔTop-1  Top-5 acc  ΔTop-5  FLOPs (Pruning Ratio)  Params (Pruning Ratio)
ResNet-50 [resNet]  –  76.13%  0.0%  92.87%  0.0%  4.11B (0.0%)  25.56M (0.0%)
ThiNet-50 [luo2017thinet]  –  72.04%  -4.09%  90.67%  -2.20%  – (36.8%)  – (33.72%)
FPGM [GM]  –  75.59%  -0.59%  92.27%  -0.60%  2.55B (37.5%)  14.74M (42.3%)
ABCPruner [ABCPruner]  ✓  –  74.84%  -1.29%  92.31%  -0.56%  2.45B (40.8%)  16.92M (33.8%)
SFP [l2norm]  –  74.61%  -1.52%  92.06%  -0.81%  2.38B (41.8%)  –
HRank-74 [hrank]  –  74.98%  -1.15%  92.33%  -0.54%  2.30B (43.7%)  16.15M (36.8%)
Taylor [importance_estimation]  –  74.50%  -1.63%  –  –  – (44.5%)  – (44.9%)
DCFF [Minbao2021DCFF]  –  75.18%  -0.95%  92.56%  -0.31%  2.25B (45.3%)  15.16M (40.7%)
ITPruner [Info_flow21_PAMI]  ✓  –  75.75%  -0.38%  –  –  2.23B (45.7%)  –
AutoPruner [luo2020autopruner]  ✓  –  74.76%  -1.37%  92.15%  -0.72%  2.09B (48.7%)  –
RRBP [zhou2019accelerate]  –  73.00%  -3.13%  91.00%  -1.87%  –  – (54.5%)
AutoBot (Ours)  ✓  47.51%  76.63%  +0.50%  92.95%  +0.08%  1.97B (52.0%)  16.73M (34.5%)
ITPruner [Info_flow21_PAMI]  ✓  –  75.28%  -0.85%  –  –  1.94B (52.8%)  –
GDP-0.6 [globalPruningIJCAI]  ✓  –  71.19%  -4.94%  90.71%  -2.16%  1.88B (54.0%)  –
SCOP [SCOP]  –  75.26%  -0.87%  92.53%  -0.33%  1.85B (54.6%)  12.29M (51.9%)
GAL-0.5-joint [GAL]  –  71.80%  -4.33%  90.82%  -2.05%  1.84B (55.0%)  19.31M (24.5%)
ABCPruner [ABCPruner]  ✓  –  73.52%  -2.61%  91.51%  -1.36%  1.79B (56.6%)  11.24M (56.0%)
GAL-1 [GAL]  –  69.88%  -6.25%  89.75%  -3.12%  1.58B (61.3%)  14.67M (42.6%)
LFPC [Cpruning_variousCriteria]  –  74.18%  -1.95%  91.92%  -0.95%  1.60B (61.4%)  –
GDP-0.5 [globalPruningIJCAI]  ✓  –  69.58%  -6.55%  90.14%  -2.73%  1.57B (61.6%)  –
DCFF [Minbao2021DCFF]  –  75.60%  -0.53%  92.55%  -0.32%  1.52B (63.0%)  11.05M (56.8%)
DCFF [Minbao2021DCFF]  –  74.85%  -1.28%  92.41%  -0.46%  1.38B (66.7%)  11.81M (53.8%)
AutoBot (Ours)  ✓  14.71%  74.68%  -1.45%  92.20%  -0.66%  1.14B (72.3%)  9.93M (61.2%)
GAL-1-joint [GAL]  –  69.31%  -6.82%  89.12%  -3.75%  1.11B (72.8%)  10.21M (60.1%)
CURL [LuoW20]  ✓  –  73.39%  -2.74%  91.46%  -1.41%  1.11B (73.2%)  6.67M (73.9%)
DCFF [Minbao2021DCFF]  –  73.81%  -2.32%  91.59%  -1.28%  1.02B (75.1%)  6.56M (74.3%)
Pruning results on ResNet-50 with ImageNet, sorted by FLOPs. Values in parentheses in the "FLOPs" and "Params" columns denote the pruning ratios of FLOPs and number of parameters in the compressed models.
4.1 Experimental Settings
To demonstrate the efficiency of AutoBot on a variety of experimental setups, experiments are conducted on two popular benchmark datasets and five common CNN architectures: 1) CIFAR-10 [cifar10] with VGG-16 [vgg], ResNet-56/110 [resNet], DenseNet-40 [densenet], and GoogLeNet [googlenet], and 2) ILSVRC-2012 (ImageNet) [deng2009imagenet] with ResNet-50. Experiments are performed with the PyTorch and torchvision frameworks [paszke2017automatic] on an Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz and an NVIDIA RTX 2080 Ti GPU with 11GB of memory.
For CIFAR-10, we trained the bottlenecks for 200 iterations with a batch size of 64, a learning rate of 0.6, and $\beta_1$ and $\beta_2$ equal to 6 and 0.4, respectively, and we fine-tuned the model for 200 epochs with an initial learning rate of 0.02 scheduled by a cosine annealing scheduler and a batch size of 256. For ImageNet, we trained the bottlenecks for 3,000 iterations with a batch size of 32, a learning rate of 0.3, and $\beta_1$ and $\beta_2$ equal to 10 and 0.6, respectively, and we fine-tuned the model for 200 epochs with a batch size of 512 and an initial learning rate of 0.006 scheduled by a cosine annealing scheduler. Bottlenecks are optimized via the Adam optimizer. All networks are retrained via the Stochastic Gradient Descent (SGD) optimizer with weight decay, with a momentum of 0.9 for CIFAR-10 and 0.99 for ImageNet.

4.2 Evaluation Metrics
To make a quantitative comparison, we first evaluate the Top-1 (and, for ImageNet, Top-5) accuracy of the models. This comparison is done after fine-tuning, as is common in the DNN pruning literature. Furthermore, we measure the Top-1 accuracy right after the pruning step (before fine-tuning) to show that our method effectively preserves the important filters, i.e., those with the greatest impact on the model decision. Indeed, accuracy after fine-tuning depends on many parameters independent of the pruning method, such as data augmentation, learning rate, scheduler, etc. Therefore, we believe it is not the most accurate way to compare performance across pruning methods.
We adopt commonly used metrics, i.e., FLOPs and the number of parameters, to measure the quality of the pruned models in terms of computational efficiency and model size. Note that our proposed method can freely compress the pretrained model to any size, given a target FLOPs.
4.3 Automatic Pruning on CIFAR10
To demonstrate the improvement brought by our method, we first conduct automatic pruning with some of the most popular convolutional neural networks, namely VGG-16, ResNet-56/110, GoogLeNet, and DenseNet-40. Tab. 1 reports experimental results with these architectures on CIFAR-10 for various numbers of FLOPs.

VGG-16. We experimented on the VGG-16 architecture with three different pruning ratios. VGG-16 is a very common convolutional neural network architecture that contains thirteen convolution layers and two fully-connected layers. Tab. 1 shows that our method maintains a relatively higher accuracy before fine-tuning, even under the same FLOPs reduction (e.g., 82.73% (ours) vs. 10.00% (HRank) for 65.4% FLOPs reduction), thus leading to SOTA accuracy after fine-tuning. For instance, we get 71.24% and 93.62% accuracy before and after fine-tuning, respectively, when reducing the FLOPs by 76.9%. Our method even outperforms the baseline by 0.05% and 0.23% when reducing the FLOPs by 65.4% and 53.7%, respectively.
As emphasized in Fig. 2, the per-layer filter pruning ratio is automatically determined by our method, according to the target FLOPs.
GoogLeNet. GoogLeNet is a large architecture (1.53 billion FLOPs) characterized by its parallel branches named inception blocks. In total, it contains 64 convolutions and one fully-connected layer. Our accuracy of 90.18% after pruning under a FLOPs reduction of 70.6% (against 10% for HRank at the same compression) leads to a SOTA accuracy of 95.23% after fine-tuning, outperforming recent methods such as DCFF and CC. Moreover, we also achieve a significant reduction in the number of parameters (73.1%), although it is not the primary focus of our method.
ResNet. ResNet is an architecture characterized by its residual connections. We adopted ResNet-56 and ResNet-110, which consist of 55 and 109 convolution layers, respectively. Models pruned with our method improve in accuracy from 85.58% before fine-tuning to 93.76% after fine-tuning under a FLOPs reduction of 55.9% for ResNet-56, and from 84.37% to 94.15% under a FLOPs reduction of 66.6% for ResNet-110. Under similar or even smaller FLOPs, our approach achieves excellent Top-1 accuracy compared to other existing magnitude-based or adaptive pruning methods, and exceeds the baseline models' performance (93.27% for ResNet-56 and 93.50% for ResNet-110).
DenseNet-40. Like ResNet, DenseNet-40 is an architecture based on skip connections. It is made of 39 convolutions and one fully-connected layer. We experimented with two different target FLOPs, as shown in Tab. 1. Notably, we obtained an accuracy of 83.20% before fine-tuning and 94.41% after fine-tuning under a FLOPs reduction of 55.4%.
4.4 Automatic Pruning on ImageNet
To show the performance of our method on ILSVRC-2012, we chose the ResNet-50 architecture, which is made of 53 convolution layers followed by a fully-connected layer. Due to the complexity of this dataset (1,000 classes and millions of images) and the compact design of ResNet itself, this task is more challenging than compressing models on CIFAR-10. While existing pruning methods that require manually defining the pruning ratio for each layer achieve reasonable performance, our global pruning method obtains competitive results on all evaluation metrics, including Top-1 and Top-5 accuracy, FLOPs reduction, and parameter reduction, as reported in Tab. 2. Under a high FLOPs compression of 72.3%, we obtain an accuracy of 74.68%, outperforming recent works including GAL (69.31%) and CURL (73.39%) at similar compression. Under a more moderate compression of 52%, our method even outperforms the baseline by 0.5% and leaves all previous methods behind by at least 1%. Therefore, the proposed method also works well on a complex dataset.
Hardware (Processor): inference time before → after pruning (speedup in parentheses)
Model  FLOPs  Jetson Nano (GPU)  Raspberry Pi 4B (CPU)  Raspberry Pi 3B+ (CPU)  Raspberry Pi 2B (CPU)
VGG-16  73.71M  61.63 → 13.33 (×4.62)  45.73 → 17.16 (×2.66)  79.98 → 35.17 (×2.27)  351.77 → 118.36 (×2.97)
VGG-16  108.61M  61.63 → 13.77 (×4.48)  45.73 → 19.95 (×2.29)  79.98 → 39.99 (×2.00)  351.77 → 143.95 (×2.44)
VGG-16  145.55M  61.63 → 19.24 (×3.20)  45.73 → 24.33 (×1.88)  79.98 → 50.27 (×1.59)  351.77 → 184.47 (×1.91)
ResNet-56  55.94M  16.47 → 13.71 (×1.20)  21.95 → 15.88 (×1.38)  60.42 → 39.78 (×1.52)  170.46 → 101.70 (×1.68)
ResNet-110  85.30M  28.10 → 26.36 (×1.07)  41.35 → 27.90 (×1.48)  112.57 → 72.71 (×1.55)  331.60 → 179.91 (×1.84)
GoogLeNet  0.45B  80.84 → 28.37 (×2.85)  146.68 → 57.25 (×2.56)  342.23 → 170.17 (×2.01)  1,197.65 → 400.89 (×2.99)
DenseNet-40  129.13M  35.25 → 33.46 (×1.05)  71.87 → 44.73 (×1.61)  171.86 → 102.75 (×1.67)  432.03 → 252.63 (×1.71)
DenseNet-40  168.26M  35.25 → 35.11 (×1.00)  71.87 → 53.08 (×1.35)  171.86 → 114.37 (×1.50)  432.03 → 302.49 (×1.43)
Platform  CPU  GPU  Memory
Jetson Nano  –  –  4GB LPDDR4
Raspberry Pi 4B  –  No GPGPU  4GB LPDDR4
Raspberry Pi 3B+  –  No GPGPU  1GB LPDDR2
Raspberry Pi 2B  –  No GPGPU  1GB SDRAM
4.5 Ablation Study
4.5.1 Impact of Preserving the Accuracy
To highlight the impact of preserving the accuracy during the pruning process, we compare the accuracy before and after fine-tuning of AutoBot with different pruning strategies in Fig. 3. To show the superiority of an architecture found by preserving the accuracy over a manually designed architecture, a comparison study is conducted with three manually designed strategies: 1) Same Pruning, Different Channels (SPDC), 2) Different Pruning, Different Channels (DPDC), and 3) Reverse.
DPDC has the same FLOPs as the architecture found by AutoBot but uses a different per-layer pruning ratio, proposed by Lin et al. [hrank]. To show the impact of a bad initial accuracy on fine-tuning, we propose the SPDC strategy, which has the same per-layer pruning ratio as the architecture found by AutoBot but with randomly selected filters. We also propose to reverse the order of importance of the filters selected by AutoBot, such that only the less important filters are pruned; by doing so, we can better appreciate the importance of the scores returned by AutoBot. In Fig. 3, we define this strategy as Reverse. Note that this strategy gives a different per-layer pruning ratio than the architecture found by AutoBot. We evaluate the three strategies on VGG-16 with a pruning ratio of 65.4%, and we use the same fine-tuning conditions for all of them. We select the best accuracy among 3 runs. As shown in Fig. 3, these three strategies give an initial accuracy of 10%. While the DPDC strategy reaches an accuracy of 93.18% after fine-tuning, the SPDC strategy reaches 93.38%, showing that an architecture found by preserving the initial accuracy gives better performance. Meanwhile, the Reverse strategy obtains 93.24%, which is surprisingly better than the handmade architecture but, as expected, underperforms the architecture found by AutoBot, even when the SPDC strategy is applied.
4.5.2 Deployment Test
To highlight the improvement in real situations, compressed models need to be tested on edge AI devices. Therefore, we compare the inference speedup of our compressed networks deployed on GPU-based (NVIDIA Jetson Nano) and CPU-based (Raspberry Pi 4, Raspberry Pi 3, and Raspberry Pi 2) edge devices. Specifications of these devices are available in Tab. 4. The pruned models are converted into the ONNX format. Fig. 4 shows the comparison of inference times between the original pretrained models and our compressed models. Inference time improves on every target edge device (e.g., GoogLeNet is 2.85× faster on the Jetson Nano and 2.56× faster on the Raspberry Pi 4B, while the accuracy improves by 0.22%). In particular, the speedup is largest on GPU-based devices for models with a single sequence of layers (e.g., VGG-16 and GoogLeNet), whereas it is largest on CPU-based devices for models with skip connections. More detailed results are available in Tab. 3.
5 Conclusion
In this paper, we introduced AutoBot, a novel automatic pruning method focusing on FLOPs reduction. To determine which filters to prune, AutoBot employs trainable bottlenecks designed to preserve the channels that maximize the model accuracy while minimizing the FLOPs of the model. Notably, these bottlenecks only require one epoch of training on 25.6% (CIFAR-10) or 7.49% (ILSVRC-2012) of the dataset. Extensive experiments on various CNN architectures demonstrate that the proposed method is superior to previous channel pruning methods both before and after fine-tuning. To the best of our knowledge, our paper is the first to compare accuracy before fine-tuning. For future work, we plan to apply the trainable bottleneck concept to NAS to efficiently find the best candidates.