An Improved Trade-off Between Accuracy and Complexity with Progressive Gradient Pruning

06/20/2019 ∙ by Le Thanh Nguyen-Meidine et al. ∙ Genetec Inc. and École de Technologie Supérieure (ÉTS)

Although deep neural networks (NNs) have achieved state-of-the-art accuracy in many visual recognition tasks, the growing computational complexity and energy consumption of networks remain an issue, especially for applications on platforms with limited resources that require real-time processing. Channel pruning techniques have recently shown promising results for the compression of convolutional NNs (CNNs). However, these techniques can result in low accuracy and complex optimisations because some only prune after training CNNs, while others prune from scratch during training by integrating sparsity constraints or modifying the loss function. The progressive soft filter pruning technique provides greater training efficiency, but its soft pruning strategy does not handle the backward pass, which is needed for better optimization. In this paper, a new Progressive Gradient Pruning (PGP) technique is proposed for iterative channel pruning during training. It relies on a criterion that measures the change in channel weights, which improves existing progressive pruning, and on effective hard and soft pruning strategies to adapt momentum tensors during the backward propagation pass. Experimental results obtained after training various CNNs on the MNIST and CIFAR10 datasets indicate that the PGP technique can achieve a better trade-off between classification accuracy and network (time and memory) complexity than state-of-the-art channel pruning techniques.


1 Introduction

Convolutional neural networks (CNNs) learn discriminant feature representations from labeled training data, and have achieved state-of-the-art accuracy across a wide range of visual recognition tasks, e.g., image classification, object detection, and assisted medical diagnosis. Since the breakthrough results achieved with AlexNet for the 2012 ImageNet Challenge [24], CNN accuracy has been continually improved with architectures like VGG [43], ResNet [12] and DenseNet [20], at the expense of growing complexity (deeper and wider networks) that requires more training samples and computational resources [21]. In particular, the speed of CNNs can degrade significantly with such increased complexity.

In order to deploy these powerful CNN architectures on compact platforms with limited resources (e.g., embedded systems, mobile phones, portable devices) and for real-time processing (e.g., video surveillance and monitoring, virtual reality), the time and memory complexity and energy consumption of CNNs should be reduced. For instance, the application of CNN-based architectures to real-time face detection in video surveillance remains a challenging task [35]: while more accurate detectors such as region proposal networks are too slow for real-time applications [40, 8], faster detectors such as single-shot detectors are less accurate [38, 29]. Consequently, effective methods to accelerate and compress deep networks in general, and CNNs in particular, are required to provide a reasonable trade-off between accuracy and efficiency.

Several techniques have recently been proposed to reduce the complexity of CNNs, ranging from the design of specialized compact architectures like MobileNet [17], to the distillation of knowledge from larger architectures to smaller ones [15]. Among these, pruning techniques provide an automated approach to remove insignificant network elements, e.g., filters and channels. This paper focuses on channel-level pruning techniques, which can significantly reduce the number of CNN parameters while preserving network accuracy [27, 34]. These techniques attempt to remove the output and input channels at each convolution layer using various criteria based on, e.g., the L1 norm [27], or the product of feature maps and gradients computed from a validation dataset [34].

Pruning techniques can be applied under two different scenarios: either (1) from a pre-trained network, or (2) from scratch. In the first scenario, pruning is applied as a post-processing procedure, once the network has already been trained, through a one-time pruning (followed by fine-tuning) [27] or a complex iterative process [34] using a validation dataset [27, 31], or by minimizing the reconstruction error [32]. In the second scenario, pruning is applied from scratch by introducing sparsity constraints and/or modifying the loss function used to train the network [53, 30, 47]. The latter scenario can have more difficulty converging to accurate network solutions (due to the modified loss function), and thereby increases the computational complexity of the optimisation process. For greater training efficiency, the progressive soft filter pruning (PSFP) method was introduced [13], allowing for iterative pruning from scratch, where channels are set to zero (instead of being removed) so that the network can preserve a greater learning capacity. This method, however, does not account for the optimization of soft-pruned weights. Not handling the momentum tensor can have a negative impact on accuracy, because pruned weights are still being optimized with old momentum values accumulated from previous epochs.

In this paper, a new Progressive Gradient-based Pruning (PGP) technique is proposed for iterative channel pruning to provide a better accuracy-complexity trade-off. To this end, the channels are efficiently pruned in a progressive fashion while training a network from scratch, and accuracy is maintained without requiring validation data or additional optimisation constraints. In particular, PGP integrates effective hard and soft pruning strategies to adapt the momentum tensor during the backward propagation pass. It also integrates an improved version of the Taylor criterion [34] that relies on the gradient with respect to the weights (instead of the output feature maps), making it more suitable for a progressive weight-based pruning strategy. For performance evaluation, the accuracy and complexity of the proposed and state-of-the-art techniques are compared using ResNet, LeNet and VGG networks trained on the MNIST and CIFAR10 image classification datasets.

2 Compression and Acceleration of CNNs

In general, the time complexity of a CNN is dominated by the convolutional layers, while the fully connected layers contain most of the parameters. Therefore, CNN acceleration methods typically target the complexity of the convolutional layers, while compression methods usually target the complexity of the fully connected layers [10, 11]. This section provides an overview of recent acceleration and compression approaches for CNNs, namely quantization, low-rank approximation, knowledge distillation, compact network design and pruning. Finally, a brief survey of channel pruning methods and challenges is presented.

2.1 Overview of methods:

Quantization:

A deep neural network can be accelerated by reducing the precision of its parameters. Such techniques are often used on general embedded systems, where a low-precision representation, e.g., 8-bit integer, provides faster processing than a higher-precision representation, e.g., 32-bit floating point. There are two main approaches to quantizing a neural network: the first quantizes only the weights [10, 52], while the second quantizes both weights and activations [9, 6]. These techniques can be either scalable [10, 52] or non-scalable [3, 9, 6, 37], where scalable means that an already quantized network can be further compressed.

Low-rank decomposition:

Low-rank approximation (LRA) can accelerate CNNs by approximating the result of a tensor operation using lower-rank tensors [22, 44, 26]. There are different ways of decomposing convolution filters. Techniques like [22, 44] focus on approximating filters/output channels by low-rank filters, which can be obtained either in a layer-by-layer fashion [22] or by scanning the whole network [44]. [48] proposes to force filters to coordinate more information into a lower-rank space during training, and then decompose them once the model is trained. Another technique employs CP decomposition (canonical polyadic decomposition), where a good trade-off between accuracy and efficiency is achieved [26].

Knowledge distillation:

This family of techniques focuses on training a small network, called the student, using a larger model, called the teacher [16]. Unlike traditional supervised learning, the student is trained by the teacher. These methods can obtain considerable improvements in terms of sparsity and generalization of the produced networks. Most distillation techniques use large pretrained models as teachers [16, 41]. More recently, there has been interest in training student-teacher models online, on the fly [25, 51], or in using GANs to increase training speed and accuracy [46]. Knowledge distillation has been applied to multiple problems, including object detection [4], NLP [23] and differential privacy [45].

Compact network design:

Compact model design is an alternative way to produce fast deep neural networks. The aim of these techniques is to produce lightweight models for high-speed processing. Different methods have been used to produce compact models; for instance, MobileNet [17], MobileNetV2 [42] and Xception [5] can achieve real-time speed using depth-wise convolutions to reduce computation. Other architectures like ShuffleNet [50, 33] and CondenseNet [19] use convolutions locally connected in groups (group convolutions) to reduce computation.

Pruning:

Pruning is a family of techniques that remove non-useful parameters from a neural network. There are several pruning approaches for deep neural networks. The first is weight pruning, where individual weights are pruned; this approach has proven to significantly compress and accelerate deep neural networks [10, 49, 11]. Weight pruning techniques usually rely on sparse convolution algorithms [28, 39]. The other approach is filter or channel pruning, where complete filters or channels are pruned [27, 32, 13, 53]. Since this paper proposes a method for channel pruning, we provide more details on this approach in the next section.

2.2 Channel pruning

Channel-level pruning techniques attempt to remove the output and input channels at each convolution layer using various criteria, such as the L1 norm [27], entropy [31], the L2 norm, APoZ [18], or a combination of feature maps and gradients [34]. Channel-based pruning methods have the advantage of being independent of any sparse convolution algorithm, since the convolutions remain dense, which provides a platform-independent speed-up (a sparse algorithm may not be implemented on all platforms).

Following the work on Optimal Brain Damage [7], one of the first papers to show the efficiency of channel-level pruning was [27], where the weight norm is used to identify and prune weak channels, i.e., channels that do not contribute much to the network. Afterwards, several works proposed pruning procedures and channel importance metrics. These methods can be organized into five pruning approaches: 1) pruning as a one-time post-processing step followed by fine-tuning, which is simple and easy to apply [27]; 2) pruning iteratively once the model has been trained, where alternating pruning and fine-tuning increases the chance of recovering after a channel is pruned [34]; 3) pruning by minimizing the reconstruction error at each layer, which allows the model to approximate the original performance [32, 14]; 4) pruning with sparsity constraints and a modified objective function, so that the network accounts for pruning during optimization [53, 30, 2, 1]; 5) pruning progressively while training from scratch or from a pre-trained model, where soft pruning sets channels to zero instead of actually removing them (in a hard manner), which leaves the network with more capacity to learn [13].

While the first three approaches can reduce the complexity of a model, they are only applied once the model is already trained; it would certainly be more beneficial to be able to start pruning from scratch during training. The fourth approach can start pruning from scratch by adding sparsity constraints and modifying the optimization objective, but this adds complexity and computation to the training process. It can be particularly problematic when the original loss function is hard to optimize, since modifying the loss function can make it harder for the model to converge to a good solution. The fifth approach eases this process by not removing channels and by keeping the original loss function. However, we think this approach can be improved: currently, it does not handle pruning in the backward pass and only sets the weak channels to zero. Moreover, it calculates the criterion separately from the parameter updates, i.e., not while iterating inside an epoch. In our approach, we want to compute the criterion directly during the updates, i.e., while iterating within an epoch and updating the parameters.

Another important aspect of channel pruning is the capacity to evaluate the importance of a channel. In the literature, many criteria have been used to evaluate the importance of channels, e.g., L1 [27], APoZ [18], entropy [31], L2 [13] and Taylor [34]. Among these, we think the Taylor criterion [34] has the most potential for pruning during training, since it results from minimizing the impact of pruning a channel, although we argue that it can be improved for progressive pruning.

3 Progressive Gradient Pruning

3.1 Pruning procedure

In a regular CNN, the weight tensor of a convolutional layer can be defined as $W \in \mathbb{R}^{C_{out} \times C_{in} \times k \times k}$, where $C_{in}$ and $C_{out}$ are the number of input and output channels, respectively, and $k$ is the kernel size. The weight tensor of output channel $i$ can then be defined as $W_i \in \mathbb{R}^{C_{in} \times k \times k}$. In order to select the weak channels of a layer, we evaluate the importance of an output channel using a criterion function $\Theta$, usually defined as $\Theta : \mathbb{R}^{C_{in} \times k \times k} \to \mathbb{R}$. Given an output channel, it yields a scalar that represents its rank, e.g., the L1 norm [27] or, in our case, the gradient norm.

In order to prune convolution layers progressively, an exponential decay function is defined such that there is always a solution in $\mathbb{R}$. (It is slightly different from [13], where the decay function can have solutions in $\mathbb{C}$.) This decay function allows selecting the number of weak channels at each epoch. It is defined as the ratio $r_t$ of output channels remaining after training epoch $t$:

(1)

where $p$ is a hyper-parameter that defines the ratio of output channels to be pruned, and $t$ is the epoch. Since we prune progressively, layer by layer and epoch by epoch, we calculate the number of weak channels (equivalently, the number of remaining channels) at each layer. Given the ratio $r_t$ at epoch $t$, the number of weak output channels $n_t$ for any layer is defined as:

$n_t = \mathrm{round}\big((1 - r_t)\, C_{out}\big)$   (2)

where $C_{out}$ is the original number of output channels of the layer. Using the number of weak channels $n_t$ and a pruning criterion function $\Theta$, we obtain the subset $\mathcal{W}$ of the $n_t$ output channels with the smallest criterion values. This subset is further divided into two subsets, $\mathcal{W}_h$ and $\mathcal{W}_s$, using a hyper-parameter $rr$ that decides the ratio of hard-removed output channels. The subset $\mathcal{W}_h$ is removed completely, while the subset $\mathcal{W}_s$ is reset to zero; the indexes of $\mathcal{W}_h$ and $\mathcal{W}_s$ are kept for the backward pass. Additionally, hard pruning is performed on the input channels of the next layer using the indexes of $\mathcal{W}_h$.
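To make the per-epoch selection concrete, the following PyTorch-style sketch computes the number of weak output channels of a layer and splits them into hard- and soft-pruned subsets. It is only an illustration: the decay schedule is assumed to have the form r_t = (1 - p)^(t/T), which matches the description above but may differ from the exact form of Eq. (1), and the function names, as well as the bookkeeping for channels already hard-removed in earlier epochs, are ours.

```python
import torch

def remaining_ratio(epoch, total_epochs, target_prune_rate):
    # Assumed exponential decay of the fraction of channels kept:
    # r_t = (1 - p) ** (t / T), so that r_T = 1 - p at the last epoch.
    # The exact constants of Eq. (1) in the paper may differ.
    return (1.0 - target_prune_rate) ** (epoch / total_epochs)

def select_weak_channels(conv_weight, criterion, epoch, total_epochs,
                         target_prune_rate, hard_ratio, original_c_out):
    """Return (hard_idx, soft_idx): indexes of weak output channels.

    conv_weight:    current weight tensor, shape (C_out, C_in, k, k)
    criterion:      tensor of shape (C_out,), importance score per output channel
    hard_ratio:     fraction rr of the weak set that is removed permanently
    original_c_out: number of output channels before any pruning (used by Eq. 2)
    """
    r_t = remaining_ratio(epoch, total_epochs, target_prune_rate)
    n_weak_total = int(round((1.0 - r_t) * original_c_out))      # Eq. (2)
    # Channels hard-removed in earlier epochs already count as weak
    # (this bookkeeping is our assumption, not spelled out in the paper).
    n_weak = max(n_weak_total - (original_c_out - conv_weight.shape[0]), 0)
    empty = torch.empty(0, dtype=torch.long, device=conv_weight.device)
    if n_weak == 0:
        return empty, empty
    weak_idx = torch.argsort(criterion)[:n_weak]    # smallest criterion values
    n_hard = int(round(hard_ratio * n_weak))
    return weak_idx[:n_hard], weak_idx[n_hard:]     # hard subset, soft subset
```

The hard indexes are also used to remove the corresponding input channels of the next convolutional layer, as described above.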

Since we set some weights back to zero and still apply back-propagation to them, we need to handle the pruning of the momentum tensor. The momentum tensor $M$ has the same dimensions as the weight tensor. In the existing work on progressive pruning during training [13], the authors only set the weights to zero, without handling the momentum accumulated from previous epochs, which is critical for the optimization. Using the indexes of the weak channels, we set to zero the momentum of the subset $\mathcal{W}_s$ and hard prune the momentum of the subset $\mathcal{W}_h$. Figure 1 illustrates the local procedure of PGP for hard and soft pruning between two successive convolutional layers.

Figure 1: Illustration of the pruning procedure of the proposed PGP approach between two convolutional layers.
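The momentum handling described above can be sketched as follows for a torch.optim.SGD optimizer, whose per-parameter state keeps the accumulated momentum under the key 'momentum_buffer'. This is only an illustration under that assumption; the helper name and arguments are ours, and the corresponding shrinking of the weight tensor itself (and its transfer to a new optimizer) is discussed in Section 4.

```python
import torch

def prune_momentum(optimizer, conv_weight, hard_idx, soft_idx):
    """Adapt the SGD momentum buffer of a conv weight after hard/soft pruning.

    Soft-pruned channels: momentum rows are reset to zero, so the channel is
    not pushed by stale updates and keeps a chance to recover.
    Hard-pruned channels: momentum rows are dropped to match the reduced
    weight tensor.
    """
    state = optimizer.state.get(conv_weight, {})
    buf = state.get('momentum_buffer', None)
    if buf is None:        # no momentum accumulated yet (e.g. first epoch)
        return
    buf[soft_idx] = 0.0    # soft pruning of the momentum
    keep = torch.ones(buf.shape[0], dtype=torch.bool, device=buf.device)
    keep[hard_idx] = False
    state['momentum_buffer'] = buf[keep]   # hard pruning of the momentum
```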

3.2 Criteria for progressive pruning

Molchanov et al. [34] proposed the following criterion to measure the importance of the feature map $h_i$ of an output channel $i$, computed at each layer and for each output channel:

$\Theta_{TE}(h_i) = \big|\mathcal{L}(\mathcal{D}, h_i = 0) - \mathcal{L}(\mathcal{D}, h_i)\big|$   (3)

The term $\mathcal{L}(\mathcal{D}, h_i = 0)$ refers to the loss of the model on a labeled dataset $\mathcal{D}$ when feature map $h_i$ is pruned, and $\mathcal{L}(\mathcal{D}, h_i)$ is the original loss before the model has been pruned. In summary, the criterion of Equation 3 is the difference between the loss of the pruned model and that of the original model, and it grows with the impact of the feature map. This criterion has been shown to work well on some datasets and networks. However, in the scenario where the network is pruned from scratch, we argue that the information measured from feature maps is not informative, since the model is not yet trained. Empirical results in Section 4 also indicate that the criterion of Equation 3 is less effective than other criteria for progressive pruning.

Instead of using the feature map $h_i$ to prune a feature map or filter, we can replace it with the weight tensor $W_i$, since setting an output channel to zero is the same as pruning it [13]. The same Taylor expansion from [34] can then be applied with $W_i$, resulting in:

$\Theta_{TW}(W_i) = \Big|\dfrac{\partial \mathcal{L}}{\partial W_i}\, W_i\Big|$   (4)

Equation 4 can be further simplified by taking into account the soft pruning nature of the approach. We can decompose this equation because it involves an element-wise multiplication:

$\Theta_{TW}(W_i) = \Big|\dfrac{\partial \mathcal{L}}{\partial W_i}\Big| \cdot \big|W_i\big|$   (5)

where $|W_i|$ is the absolute value of the weights of channel $i$. This means that the criterion can be zero or very close to zero if channel $i$ is one of the channels that was soft-pruned. In that case, the channel has little chance to recover, since it will likely be pruned again. In order to allow more recovery of soft-pruned channels, we propose to remove the $|W_i|$ term:

$\Theta_{GN}(W_i) = \Big|\dfrac{\partial \mathcal{L}}{\partial W_i}\Big|$   (6)

where $\Theta_{GN}$ is the criterion of our approach for channel $i$. There are two ways of calculating our criterion:

  • PGP: perform a training epoch without updating the model, and compute the pruning criterion.

  • RPGP: compute the pruning criterion directly during a forward/backward pass of the training loop while updating.

In the first case, this amounts to batch gradient descent without updating the parameters at the end. It can lead to better performance since the optimization is less noisy than with SGD. The second approach uses an SGD optimizer and calculates the criterion directly during the optimization and update of the model.

In either case, the criterion is accumulated over several iterations, which means there are two ways of interpreting Equation 6. The first is the natural way of accumulating gradients, where the gradients are summed up; since we want the total gradient of an output channel, we can use an L1 norm to sum up the variation inside the output channel, which translates to the following equation:

$\Theta_{GN\_G}(W_i) = \Big\| \sum_{k=1}^{K} \dfrac{\partial \mathcal{L}_k}{\partial W_i} \Big\|_1$   (7)

where $\dfrac{\partial \mathcal{L}_k}{\partial W_i}$ is the gradient tensor of output channel $i$ at iteration $k$ inside an epoch, and $K$ is the number of iterations in the epoch. Equation 7 measures how much change an output channel receives globally by the end of an epoch, which makes it very suitable for PGP, since we go through the whole epoch without updates; we refer to this criterion as GN_G. The second interpretation is to accumulate the actual changes of an output channel at each update, which changes the equation to:

$\Theta_{GN\_S}(W_i) = \sum_{k=1}^{K} \Big\| \dfrac{\partial \mathcal{L}_k}{\partial W_i} \Big\|_1$   (8)

Equation 8 calculates the L1 norm of the gradient tensor of an output channel at each iteration during an epoch. This means that, instead of measuring only the global change at the end of the epoch as in Equation 7, it measures the gradual changes during the epoch; we refer to this as GN_S. This criterion works better for RPGP, since we update the weights at the same time as we accumulate the gradients. In summary, PGP is described in Algorithm 1; the algorithm for RPGP is similar, except that the criterion is calculated directly during the Train step.
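As an illustration of how the two accumulation schemes of Equations 7 and 8 can be computed from the weight gradients, the following PyTorch-style sketch accumulates a per-output-channel score over an epoch. The class name and interface are ours (not from the released code), and the L1 norm over each output channel is taken over all of its C_in x k x k elements.

```python
import torch

class GradientNormCriterion:
    """Accumulate a per-output-channel importance score over one epoch.

    mode='GN_G': sum the gradient tensors over the iterations, then take the
                 L1 norm per output channel at the end (global change, Eq. 7).
    mode='GN_S': take the L1 norm per output channel at every iteration and
                 sum these norms (gradual change, Eq. 8).
    """
    def __init__(self, conv_weight, mode='GN_S'):
        self.mode = mode
        if mode == 'GN_G':
            self.acc = torch.zeros_like(conv_weight)                 # summed gradients
        else:
            self.acc = conv_weight.new_zeros(conv_weight.shape[0])   # summed norms

    def update(self, conv_weight):
        # Call after loss.backward(): before optimizer.step() for RPGP, or
        # with no optimizer step at all for PGP.
        if conv_weight.grad is None:
            return
        grad = conv_weight.grad.detach()
        if self.mode == 'GN_G':
            self.acc += grad
        else:
            self.acc += grad.abs().flatten(1).sum(dim=1)

    def value(self):
        # Returns a (C_out,) tensor of importance scores.
        if self.mode == 'GN_G':
            return self.acc.abs().flatten(1).sum(dim=1)
        return self.acc
```

For RPGP, update() would be called at every training iteration right after loss.backward(); for PGP, the model is run over the epoch without calling optimizer.step(), which is why the resulting criterion behaves like a batch gradient.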

input : A non-trained model, a target pruning ratio $p$, a remove ratio $rr$, and the number of epochs $T$
output : Pruned trained model
1 for $t = 1$ to $T$ do
2       Train the model for one epoch
3       foreach convolution layer $l$ do
4             Calculate the number of weak channels $n_t$ using Eq. (2)
5             Calculate the pruning criterion $\Theta$ using GN_G (Eq. 7) or GN_S (Eq. 8)
6             Partition the weak channels into indexes $\mathcal{W}_h$ (hard-removed channels) and $\mathcal{W}_s$ (soft-removed channels) using $rr$
7             Remove subset $\mathcal{W}_h$ and set subset $\mathcal{W}_s$ to zero
8             Remove the channels of the momentum tensor $M$ using the same indexes as $\mathcal{W}_h$
9             Set the channels of the momentum tensor $M$ to zero using the same indexes as $\mathcal{W}_s$
10            
11       end foreach
12      Evaluate the model
13      
14 end for
Algorithm 1 Progressive Gradient Pruning method.

4 Experimental Results

In this section, we compare the proposed technique with baseline techniques such as L1-norm pruning [27] and Taylor pruning [34], which represent prune-once and iterative approaches, respectively. We also compare with state-of-the-art techniques such as DCP [53] and PSFP [13], which prune while training. For techniques like ours, PSFP [13], DCP [53] and L1 [27], it is possible to set a target pruning rate hyper-parameter, and we compare them at pruning rates of 0.3, 0.5, 0.7 and 0.9 (i.e., 30, 50, 70 and 90 percent of channels pruned away). For techniques like Taylor [34], we prune until the end and select the points where 0.3, 0.5, 0.7 and 0.9 of the output channels have been pruned away. Currently, our experiments consider two datasets, MNIST and CIFAR10; for the supplemental material, we plan to add at least one more dataset.

Implementation details

One of the problems of pruning during training is how to handle the shape of the gradient and momentum tensors during the backward pass. Usually, and specifically in the case of PyTorch [36], the shapes of the gradient and momentum tensors are handled by the optimizer, which does not necessarily update them during the forward pass. Also, naively redefining a new optimizer for the pruned model is not good enough, since we would lose all the values accumulated in the momentum buffer. One way to overcome this is to also prune the gradient and momentum tensors, using the same indexes used to prune the weight tensor, and then transfer them to a newly defined optimizer. As for the pruning of ResNet, we follow the popular pruning strategy proposed in [27], meaning that we prune the downsampling layer and then use the same indexes to prune the last convolution of the residual. Our method was implemented in PyTorch [36]; the source code for our paper will be available at https://github.com/Anon6627/Pruning-PGP
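As a rough sketch of the optimizer hand-over described above, the snippet below rebuilds a plain torch.optim.SGD optimizer for the pruned model and re-injects momentum buffers that were pruned with the same channel indexes as the weights. The helper name and the pruned_buffers dictionary (mapping parameter names to already-pruned momentum tensors) are hypothetical and introduced only for illustration.

```python
import torch

def rebuild_optimizer(model, pruned_buffers, lr=0.1, momentum=0.9):
    """Create a fresh SGD optimizer for the pruned model and re-inject the
    momentum buffers that were pruned with the same channel indexes as the
    weights (pruned_buffers maps parameter name -> pruned momentum tensor)."""
    new_optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    for name, param in model.named_parameters():
        buf = pruned_buffers.get(name)
        if buf is not None and buf.shape == param.shape:
            # Only transfer buffers whose shape matches the pruned parameter.
            new_optimizer.state[param]['momentum_buffer'] = buf.clone()
    return new_optimizer
```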

Result on MNIST dataset:

For the comparison on MNIST, we use the same hyper-parameters as the original papers. For LeNet5, we use a learning rate of 0.01, momentum of 0.9 and 40 epochs, with a remove rate of 50% for PGP (our progressive pruning that does not compute the criterion during an epoch) and RPGP (which does). For PSFP, we used the same settings as mentioned before, except for the removal rate of 50%. For Taylor [34], we iteratively remove 5 channels at a time and fine-tune for 5 epochs after each removal; we slightly changed this from the original procedure because this configuration does not collapse and returns the best results. For L1 pruning, we fine-tune for 20 epochs after pruning. We use the same settings for ResNet20. For the DCP [53] algorithm, we took the available code and ran it on MNIST using 40 epochs, with 20 epochs for the channel pruning and 20 epochs for fine-tuning.

Methods Params FLOPS Error % ( gap)
Baseline LeNet5   0 61K 446k 0.84        ( 0)
L1[27] 30% 34.1K 304K 0.9    ( +0.06)
50% 18K 152K 1.05   ( +0.21)
70% 84K 82K 2.22   ( +1.38)
90% 8.8K 31K 8.17   ( +7.33)
Taylor[34] 30% 38K 404K 0.9     ( +0.06)
50% 24K 387K 1.05   ( +0.21)
70% 13K 374K 1.22   ( +0.38)
90% 3K 286K 7.73   ( +6.89)
DCP[53] 30% - - 4.27
50% - - 4.18
70% - - 6.28
90% - - 6.81
PSFP[13] 30% 34.1K 304K 1.32   ( +0.48)
50% 18K 152K 2.27   ( +1.43)
70% 84K 82K 2.99   ( +2.15)
90% 8.8K 31K 32.28 ( +31.44)
PGP(TW) 30% 34.1K 304K 0.82   ( -0.02)
50% 18K 152K 1.06   ( +0.22)
70% 84K 82K 1.52   ( +0.68)
90% 8.8K 31K 4.7   ( +3.86)
PGP(GN_G) 30% 34.1K 304K 0.87   ( +0.03)
50% 18K 152K 1.08   ( +0.24)
70% 84K 82K 1.74     ( +0.9)
90% 8.8K 31K 4.25   ( +3.41)
RPGP(GN_S) 30% 34.1K 304K 0.9     ( +0.06)
50% 18K 152K 1.25   ( +0.41)
70% 84K 82K 1.75   ( +0.91)
90% 8.8K 31K 8.14     ( +7.3)
Table 1: Performance of proposed and baseline pruning methods for training LeNet5 on the MNIST dataset.
Methods Params FLOPS Error % ( gap)
Baseline Resnet20   0 272K 41M 0.74        ( 0)
L1 [27] 30% 137K 22M 0.75   ( +0.01)
50% 68K 10M 1.09   ( +0.35)
70% 27K 4.2M 2.02   ( +1.28)
90% 3K 714K 8.17   ( +7.43)
Taylor [34] 30% 268K 40.9M 0.87   ( +0.13)
50% 260K 40.5M 0.95   ( +0.21)
70% 250K 39.8M 1.04    ( +0.3)
90% 241K 39.3M 7.73   ( +6.99)
DCP [53] 30% 193K 30.3M 1.11   ( +0.37)
50% 138K 21.1M 0.62   ( -0.12)
70% 87.7K 13.5M 1.19   ( +0.45)
90% 34K 5M 1.13   ( +0.39)
PSFP [13] 30% 137K 22M 0.5    ( -0.24)
50% 68K 10M 0.61   ( -0.13)
70% 27K 4.2M 0.72    ( -0.02)
90% 3K 714K 0.73     ( -0.01)
PGP(TW) 30% 137K 22M 0.45    ( -0.29)
50% 68K 10M 0.35    ( -0.39)
70% 27K 4.2M 0.52    ( -0.22)
90% 3K 714K 1.52    ( +0.78)
PGP(GN_G) 30% 137K 22M 0.4     ( -0.34)
50% 68K 10M 0.51   ( -0.23)
70% 27K 4.2M 0.57   ( -0.17)
90% 3K 714K 0.86  ( +0.12)
RPGP(GN_S) 30% 137K 22M 0.4   ( -0.34)
50% 68K 10M 0.48   ( -0.29)
70% 27K 4.2M 0.5   ( -0.24)
90% 3K 714K 1.87  ( +1.13)
Table 2: Performance of proposed and baseline pruning methods for training ResNet20 on the MNIST dataset.

Results in Table 1 show that our methods compare favorably against baseline techniques like L1 [27] and Taylor [34]. They also perform better than the state-of-the-art PSFP [13]. Note that the model used in this experiment is small, which makes it harder to prune. Table 2 shows a similar trend; in this setting, we also find that our methods perform slightly better than DCP [53] in some configurations. As for the difference between PGP(GN) and RPGP(GN), since both algorithms use the same criterion, it is the procedure that differs: the slightly better performance of PGP(GN) can probably be explained by the fact that its pruning criterion is calculated using batch gradient descent instead of stochastic gradient descent.

Result on the CIFAR10 dataset:

For the comparison on CIFAR10, we mostly use the same hyper-parameters as the considered papers. For VGG, we use a VGG19 adapted to CIFAR10, with learning rate 0.1, momentum 0.9 and 400 epochs, and we decrease the learning rate by a factor of 10 at 160 and 240 epochs. For PSFP, PGP and RPGP, we set the remove rate hyper-parameter to 0.5 (50%), fine-tune for 100 epochs after pruning, and keep the best score. For Taylor [34], we iteratively remove 5 channels at a time and fine-tune for 5 epochs after each removal. We slightly changed the procedure compared to the original paper because the original procedure pruned one feature map per iteration, which is not very efficient on a large model; empirically, we found that removing 5 feature maps gives the best accuracy. For L1 pruning, we fine-tune for 100 epochs after pruning and keep the best score. For PSFP [13] and DCP [53], we used the settings provided by the authors in order to obtain the best possible performance. For ResNet, we use a ResNet56 adapted to CIFAR10 and keep the same settings for this experiment, except that the number of epochs for our techniques is now 500.

Methods Params FLOPS Error % ( gap)
Baseline VGG19   0% 20M 400M 6.23         (0)
L1 [27] 30% 9M 198M 16.94   (  +8.41)
50% 5M 100M 16.51   (  +7.98)
70% 1M 37M 16.17   (  +7.64)
90% 209K 4M 19.91   (+11.38)
Taylor [34] 30% 156M 156M 9.82    (  +1.29)
50% 5M 72M 11.94   (  +3.41)
70% 1.9M 24M 16.85 (  +8.32)
90% 249K 2.2M 41.58 (+33.05)
DCP [53] 30% - - -
50% - 139M -   (  -0.58)
70% - - -
90% - - -
PSFP [13] 30% 9M 198M  8.98    (  +2.75)
50% 5M 100M 11.2    (  +4.97)
70% 1M 37M 12.06   (  +5.83)
90% 209K 4M 18.73* (  +12.5)
PGP(TW) 30% 9M 198M  8.78    (  +0.25)
50% 5M 100M  9.89    (  +1.36)
70% 1M 37M 11.97   (  +3.44)
90% 209K 4M 21.08   (+12.55)
PGP(GN_G) 30% 9M 198M  7.37     (  +1.14)
50% 5M 100M  8.38    (  +2.15)
70% 1M 37M  9.7      (  +3.47)
90% 209K 4M 16.46   (+10.33)
RPGP(GN_S) 30% 9M 198M  7.65    (  +1.42)
50% 5M 100M  8.79    (  +2.56)
70% 1M 37M 10.56   (  +4.33)
90% 209K 4M 17.53     (+11.3)
Table 3: Performance of proposed and baseline pruning methods for training VGG19 on the CIFAR10 dataset.
Methods Params FLOPS Error % ( gap)
Baseline Resnet56   0 855K 128M  6.02        ( 0)
L1[27] 30% 431K 198M 13.34   ( +7.32)
50% 215K 100M 15.57   ( +9.55)
70% 84K 37M 17.89 ( +11.87)
90% 11K 4M 40.24 ( +34.22)
Taylor[34] 40% 491K 51M 13.9     ( +7.88)
50% 268K 23M 15.34   ( +9.32)
70% 100k 8M 22.1   ( +16.08)
90% 10k 1M 45.69 ( +39.67)
DCP[53] 30% 600K 90M 5.67     ( -0.35)
50% 430K 65M 6.43    ( +0.41)
70% 270K 41M 7.18    ( +1.16)
90% 100K 17M 9.42      ( +3.4)
PSFP[13] 30% 431K 198M 8.94    ( +2.92)
50% 215K 100M 10.93   ( +4.91)
70% 84K 37M 14.18   ( +8.16)
90% 11K 4M 28.09 ( +22.07)
PGP(TW) 30% 431K 198M 10.38   ( +4.36)
50% 215K 100M 11.95   ( +5.93)
70% 84K 37M 13.63     ( +7.61)
90% 11K 4M 30.68     ( +24.66)
PGP(GN_G) 30% 431K 198M 8.95     ( +2.93)
50% 215K 100M 10.59   ( +4.57)
70% 84K 37M 13.02      ( +7)
90% 11K 4M 26.02     ( +20)
RPGP(GN_S) 30% 431K 198M 9.37   ( +3.35)
50% 215K 100M 10.46   ( +4.44)
70% 84K 37M 14.16   ( +8.14)
90% 11K 4M 36.65 ( +30.63)
Table 4: Performance of proposed and baseline pruning methods for training ResNet56 on the CIFAR10 dataset.

From Tables 3 and 4, our techniques consistently perform better than the baseline techniques. They also perform slightly better than the state-of-the-art PSFP [13] on VGGNet. For ResNet, PSFP [13] uses a different pruning strategy than ours: it does not prune the downsample layer and therefore does not prune the last convolutional layer of the residual either. This translates into slightly better accuracy in some settings. In our supplemental material, we also provide a comparison using the same pruning strategy on ResNet in order to have a fair comparison between approaches. DCP [53] performs better than ours on this dataset; however, it is difficult to compare since the two techniques do not end up with the same number of FLOPS and parameters.

Training and pruning time:

The training and pruning time of a model are important factors of a technique, for instance for deploying or adapting a model in an operational environment. One advantage of progressive pruning techniques is the reduction of processing time at each epoch, since filters are removed while training. Table 5 presents the training and pruning times for the evaluated techniques. For the progressive pruning and DCP techniques, values represent both pruning and training times, while for L1 and iterative pruning, values represent (training time) + pruning and retraining times. Experiments are conducted on the CIFAR10 dataset with the same settings as above, running on an isolated computer (Intel Xeon Gold 5118 @ 2.3 GHz) with an Nvidia Tesla P100 GPU card.

Methods VGG19 Resnet56
   p = 0.5 p = 0.9 p = 0.5 p = 0.9
Baseline 219m 219m 307m 307m
L1 [27] (219)   + 32m (219)   + 32m (307)   + 48m (307)   + 48m
Taylor [34] (219)   + 254m (219)   + 457m (307)   + 488m (307)   + 878m
DCP [53] - - 489m 443m
PSFP [13] 219m 219m 307m 307m
PGP 329m 329m 441m 441m
RPGP 211m 168m 263m 241m
Table 5: Training and pruning time for pruning techniques with target pruning rates p = 0.5 and p = 0.9.

From Tab. 5, the fastest pruning method (without considering training time) is currently L1 [27]. However, it should be noted that the original training of the model takes around 219 min for VGG and 307 min for ResNet56; taking training time into account, L1 is therefore slower than our approach. Other techniques like Taylor [34] prune in an iterative way, alternating between pruning several feature maps and fine-tuning, which can be very slow depending on the number of channels pruned at each iteration. DCP [53] is also slow for pruning, due to the channel pruning optimization process and the fine-tuning after pruning. PSFP [13] has a time similar to the original training, since it does not technically change the size of the model during training. Between PGP and RPGP, the difference is the use of an entire extra epoch to compute the pruning criterion with PGP, versus the direct computation of the criterion during a training epoch with RPGP. Also, since we hard-prune channels at each epoch, the epoch time becomes shorter as the model is pruned/trained. Overall, the progressive pruning methods train and prune in considerably less time than the other methods.

Pruning criteria:

In this experiment, we compare our pruning criterion with other approaches. The experiment is performed on CIFAR10 using the same pruning strategy and method: we use our progressive pruning with the L2-norm criterion [13] on the weights and also with the Taylor criterion [34]. The configuration for this experiment is the same as in the general comparison, except that we set the target pruning rate to 50% and we use the RPGP pruning strategy for all criteria.

Networks L2 Taylor TW GN_G GN_S
VGG19  8.47%  9.27%  8.78%  8.47%  8.79%
ResNet56 10.30% 10.97% 10.46% 10.24% 10.28%
Table 6: Error rate for RPGP with different pruning criteria.

In Tab. 6, we can see that our criteria perform better than the others in the context of progressive pruning, with the exception of the L2 norm. The comparison between Taylor Weight (TW) and Gradient Norm (GN) shows that a small gradient norm during training may be a good indicator of the importance of a channel. From the table, we can also see that Taylor Weight performs better than the original Taylor criterion. Overall, the gradient-norm criteria, which promote capturing the variation of a channel, seem to work best with progressive pruning. Finally, we notice that L2 and our criteria have similar performance. This can be understood by considering the following:

$W_i^{(K)} = W_i^{(0)} - \eta \sum_{k=1}^{K} \dfrac{\partial \mathcal{L}_k}{\partial W_i^{(k)}}$   (9)

where $W_i^{(k)}$ represents the weight of output channel $i$ at iteration $k$ in an epoch, $\eta$ is the learning rate, and $\mathcal{L}_k$ denotes the loss function at iteration $k$. From this equation, we can observe that the difference between L2 and the gradient norm lies in the initial value $W_i^{(0)}$. Taking into account the partial soft pruning nature of our approach, $W_i^{(0)}$ can be zero when the channel has been soft-pruned. Therefore, the two approaches tend to produce similar values (since $\eta$ is a scalar, it is not important in this context).
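To make the connection explicit, the following short derivation (a sketch assuming plain SGD without momentum, starting from the unrolled update of Eq. 9) shows why the two scores coincide up to a constant for a soft-pruned channel:

```latex
% If channel i was soft-pruned at the start of the epoch, then W_i^{(0)} = 0, so
\[
  \big\| W_i^{(K)} \big\|_2
  \;=\; \Big\| W_i^{(0)} - \eta \sum_{k=1}^{K}
        \frac{\partial \mathcal{L}_k}{\partial W_i^{(k)}} \Big\|_2
  \;=\; \eta \, \Big\| \sum_{k=1}^{K}
        \frac{\partial \mathcal{L}_k}{\partial W_i^{(k)}} \Big\|_2 .
\]
% The L2 criterion of [13] and the accumulated-gradient criterion then differ
% only by the constant factor eta and by the choice of norm (L2 here versus
% the L1 norm of Eq. (7)).
```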

Progressive pruning procedure:

We compare PSFP and RPGP using the same pruning criterion (the L2 norm), the same target pruning rate and the same settings. This experiment shows the importance of handling pruning in the backward pass during training.

Method VGG19 ResNet56
PSFP 11.20% 10.93%
RPGP  8.47% 10.13%
Table 7: Error rates for RPGP and PSFP with L2.

From Table 7, on VGG19, the proposed pruning procedure outperforms the original PSFP [13]. On ResNet, the error rates of the two approaches are much closer. This is because, as mentioned earlier, PSFP [13] does not perform pruning on the downsample layers or on the last layer of the residual connection, which helps to improve its performance.

Pruning from scratch vs after pre-training:

In Tab. 8, we compare the performance obtained by our method on a model that was randomly initialized (Scratch) and on a model that was already trained (Pre-trained) on CIFAR10. We set the target pruning rate to 50% and use a removal rate of 0.5 on VGGNet and ResNet56, with the same settings as before.

Training Scenario VGG19 ResNet56
Scratch 8.79 % 10.46 %
Pre-trained 8.23 %  9.51 %
Table 8: Error rate for RPGP when trained from scratch compared to a pre-trained model.

From this experiment, we can see that the difference in accuracy between a network pruned from scratch and a network pruned after training is quite small and can vary depending on the architecture. This shows that, instead of starting from a trained model and then pruning, the proposed technique can attain similar performance starting from a randomly initialized model, and thus with reduced training and pruning time.

Hard vs soft pruning:

We also want to compare the impact of different removal rates. For this experiment, we use RPGP with our gradient criterion and fix the target pruning rate at 50%, using the same hyper-parameters as before. We vary the removal rate in order to see the impact of allowing more recovery versus less.

Networks rr = 0.3 rr = 0.5 rr = 0.7 rr = 1.0
VGG19 8.74% 8.79% 8.99% 8.92%
ResNet56 10.57% 10.46% 11.03% 10.78%
Table 9: Error rate for RPGP with different removal rates rr.

The results shown in Table 9 indicate that a removal rate of 0.3 (30%) or 0.5 (50%) offers the best balance between the amount of hard and soft pruning. It is also interesting to see that, without any soft pruning (rr = 1.0), the performance of the approach remains close to that of the other removal rates.

5 Conclusion

PGP is a new progressive pruning technique that measures the change in channel weights, and applies effective hard and soft pruning strategies. In this paper, we show that it is possible to prune a deep learning model from scratch with the PGP technique while improving the trade-off between compression and accuracy. We proposed a criterion that is well adapted to progressive pruning from scratch and that considers the norm of the gradient. Experimental results obtained after pruning various CNNs on the MNIST and CIFAR10 datasets show that the proposed method can maintain a high level of accuracy with compact neural networks. Future research will involve analyzing the performance of different CNNs pruned using the proposed method on larger real-world datasets, and for other visual recognition tasks (e.g., person and face detection in video surveillance).

References

Appendix A Additional Experimental Results

A.1 Comparison of PSFP and RPGP with ResNet:

As described in previous experiments, PSFP [13] does not prune the downsampling layer of ResNet56 and, therefore, does not prune the last layer of the residual connection either. The common strategy with ResNet consists of pruning the downsampling layer, and then pruning the last layer of the residual connection. In this section, the performance of PSFP is compared with our proposed RPGP technique using the same strategy on ResNet56, i.e., the downsampling layer and the last layer of the residual connection are not pruned. We employ the CIFAR10 dataset and the same hyper-parameters as in the previous experiments of our paper.

The results in Table 10 indicate that the proposed RPGP approach typically performs better than PSFP. Interestingly, when no pruning is performed on the downsampling layer and the last layer of the residual connection, our method performs much better. These results suggest that the residual connection is very sensitive to pruning and may require a different pruning strategy.

Methods 30% 50% 70% 90%
PSFP 8.94 10.93 14.18 28.09
RPGP(GN_S) 8.87 10.09 11.02 13.94
Table 10: Error rates of PSFP and RPGP with different pruning rates, when the downsampling layer and the last layer of the residual connection are not pruned.

A.2 Graphical comparison on CIFAR10 with VGG:

The results presented in this section are similar to the ones shown in Tables 1 to 4 of our paper. In the main paper, we could only compare the performance of methods with 4 pruning rates due to space constraints. In this section, we compare the performance of methods using the same experimental settings (as in our paper), but with 10 data points for L1 [27], Taylor [34], PSFP [13] and our approach. Since the number of remaining parameters can differ slightly from one algorithm to another, some of the values on the X-axis are rounded for better visualization.

Results in Figure 2 show the proposed PGP and RPGP pruning methods consistently outperforming the other methods. Note that the proposed methods maintain a low level of error even with an important increase in the pruning rate.

Figure 2: Error rate versus the number of remaining parameters with the proposed and baseline pruning methods for VGG19 on the CIFAR10 dataset.

A.3 Comparison on Faster R-CNN

As stated previously, object detection is an important field, and there has been a lot of effort to reduce the computational complexity of existing detectors. For this comparison, we compare our pruning algorithm with others on a very popular object detection algorithm, Faster R-CNN. The setting of this experiment is as follows: due to the complexity of attaining good performance on object detection, we start this experiment from a COCO-pretrained model.

Methods VGG16 Resnet101
PGP 65.5%
RPGP 64.0%
Table 11: Comparison of pruning algorithms on Faster R-CNN