Online Filter Clustering and Pruning for Efficient Convnets

05/28/2019 · Zhengguang Zhou et al. (USTC)

Pruning filters is an effective method for accelerating deep neural networks (DNNs), but most existing approaches prune filters of a pre-trained network directly, which limits the achievable acceleration. Although each filter has its own effect in a DNN, if two filters are identical, one of them can be pruned safely. In this paper, we add an extra cluster loss term to the loss function that forces the filters in each cluster to become similar online. After training, we keep one filter in each cluster, prune the others, and fine-tune the pruned network to compensate for the loss. In particular, the clusters in every layer can be defined beforehand, which is effective for pruning DNNs with residual blocks. Extensive experiments on the CIFAR10 and CIFAR100 benchmarks demonstrate the competitive performance of our proposed filter pruning method.


1 Introduction

Deep neural networks (DNNs) have achieved state-of-the-art performance in many computer vision tasks [1][2] and have grown deeper and deeper. However, these high-capacity networks suffer from high complexity in both storage and computation, especially when used on resource-limited platforms such as mobile phones and embedded devices [3]. Thus, many researchers have a significant interest in network compression methods for reducing the storage and computation costs.

With the observation that DNNs have significant parameter redundancy, pruning methods have been widely studied for reducing the number of parameters in networks. Early work pruned individual weights in a network, but this offers limited acceleration of network inference [4][5]. Afterwards, more and more works focused on pruning filters or channels, which results in a thinner model and significant acceleration [6][7][8][9][10][11][12][13]. The main step of these pruning methods is to measure the importance score of each filter. After that, they prune the least important filters, followed by a fine-tuning process to recover the performance of the network. Some methods are based on the magnitude of the filter [9], and some approaches exploit the information of feature maps to estimate the importance of the corresponding filter [7][11]. Moreover, if two weights are similar, one of them is redundant and can be pruned [13], but that approach only considers fully-connected layers and offers limited acceleration for convolutional neural networks.

Figure 1: The flow chart of our proposed method (e.g., a convolutional layer with 8 filters and 4 clusters).

In this paper, we propose a new filter-level pruning method. Our method is based on the fact that if two filters are similar, one of them is redundant and can be effectively removed [13]. However, two filter sets are rarely exactly the same in a trained DNN. In order to force two filter sets to become similar, we propose a new training algorithm that adds a “cluster loss” to the original loss function. The network can thus learn compact representations during training with the backpropagation algorithm. As shown in Fig. 1, our method consists of three main steps: (1) Given a DNN, we define the clusters to which each filter belongs and train the network with our proposed training algorithm. (2) After training, the filters in each cluster are similar and one of them can be removed; moreover, the corresponding filter channels in the next layer can also be pruned. (3) Finally, we fine-tune the pruned network to compensate for the performance degradation.

We evaluate our proposed method on two benchmark datasets (CIFAR10 and CIFAR100) and demonstrate its competitive performance compared with state-of-the-art approaches. For VGG-16, our method achieves a 2× speedup without loss in accuracy. With WRN-16-4, it achieves about 3.4× and 1.74× speedup within a 1% accuracy drop on CIFAR10 and CIFAR100, respectively.

2 Related Work

As one of the most popular methods for accelerating network inference, pruning has been widely studied recently. In early work, Optimal Brain Damage [4] prunes unimportant weights to reduce the number of parameters and prevent over-fitting. Recently, Han et al. [5] prune the weights that are below a threshold, resulting in a very sparse model with no loss of accuracy. However, such a non-structured sparse model has limited ability to accelerate inference without specific hardware [11]. Later, researchers focused on filter-level pruning, which can not only reduce the memory footprint dramatically but also speed up network inference with any off-the-shelf library. Li et al. [9] prune the less useful filters directly based on the sum of their absolute weights.

However, a filter with small magnitude is not necessarily unimportant. Thus, methods based on the information of activations have been studied. Hu et al. [8] calculate the sparsity of activations after the ReLU function and prune the corresponding filters if the sparsity is high. Molchanov et al. [12] adopt a first-order Taylor expansion to approximate the change to the loss function induced by pruning each filter. He et al. [7] prune channels by a LASSO-regression-based channel selection and least-square reconstruction. Yu et al. [14] prune the entire network jointly to minimize the reconstruction error of important responses in the “final response layer” and propagate the importance scores of the final responses to every neuron.

Besides the above methods, which prune filters of a pre-trained network, several methods add regularization or other modifications during training. For example, Liu et al. [10] enforce channel sparsity by imposing L1 regularization on the scaling factors in batch normalization. McDanel et al. [15] utilize incomplete dot products to dynamically adjust the number of input channels used in each layer. Zhou et al. [6] introduce a scaling factor to each filter to weaken the weights step by step and prune the filters after training.

Meanwhile, other methods have been well explored to compress networks. Knowledge distillation [16] first trains a big network (i.e., the teacher network) and then trains a shallow one (i.e., the student network) to mimic the output distributions of the teacher. Network quantization reduces the number of bits for representing each parameter, and several low-bit networks have been proposed [17][18][19]. Low-rank factorization decomposes the weights into several pieces [20][21]. Note that our pruning method can be integrated with the above techniques to achieve a more compact and efficient model.

Figure 2: Pruning similar filters in a convolutional layer. Whenever two filter sets are similar, we remove one of them from A, while the corresponding channel of the output activations B can also be pruned. Moreover, the corresponding channel of the filters in the next layer can also be removed, but we need to add that channel to the other one in each set (i.e., from C to D), as shown in Eq (3).

3 Our Method

In this section, we first show how we prune similar filters for a single layer, then propose a new training algorithm that forces the filters in each cluster to be similar. Finally, we analyze the acceleration and compression of our method.

3.1 Pruning Similar Filters

We use a triplet $\langle \mathcal{I}_i, \mathcal{W}_i, * \rangle$ to denote the convolution process in layer $i$, where $\mathcal{I}_i \in \mathbb{R}^{c_i \times h_i \times w_i}$ is the input tensor, which has $c_i$ channels, $h_i$ rows and $w_i$ columns, and $\mathcal{W}_i \in \mathbb{R}^{c_{i+1} \times c_i \times k \times k}$ is a filter bank with kernel size $k \times k$. The operator $*$ denotes the 3D convolution operation which maps $\mathcal{I}_i$ to $\mathcal{I}_{i+1}$ using $\mathcal{W}_i$, where $\mathcal{I}_{i+1}$ is the input tensor of layer $i+1$ (which is also the output tensor of layer $i$). Note that the fully-connected operation is a special case of the convolution operation.

Formally, the $j$-th channel of $\mathcal{I}_{i+1}$ can be computed with the filter $\mathcal{W}_i^{j}$ and the input tensor $\mathcal{I}_i$ using the 2D convolution operation as follows:

$$\mathcal{I}_{i+1}^{j} = f\Big(\sum_{k=1}^{c_i} \mathcal{W}_i^{j,k} * \mathcal{I}_i^{k}\Big) \qquad (1)$$

where $f$ is the nonlinear function, such as sigmoid or ReLU, $\mathcal{I}_i^{k}$ is the $k$-th channel of the input and $\mathcal{W}_i^{j,k}$ is the $k$-th channel of the $j$-th filter in layer $i$. Note that we ignore the corresponding bias for simplicity. In the same way, $\mathcal{I}_{i+1}^{j'}$ can be computed as:

$$\mathcal{I}_{i+1}^{j'} = f\Big(\sum_{k=1}^{c_i} \mathcal{W}_i^{j',k} * \mathcal{I}_i^{k}\Big) \qquad (2)$$

Now let us suppose that $\mathcal{W}_i^{j} = \mathcal{W}_i^{j'}$. This means that the corresponding feature maps are the same, i.e., $\mathcal{I}_{i+1}^{j} = \mathcal{I}_{i+1}^{j'}$, according to Eq (2). For the $l$-th filter in layer $i+1$, replacing $\mathcal{I}_{i+1}^{j'}$ with $\mathcal{I}_{i+1}^{j}$ in Eq (1), we get Eq (3):

$$\mathcal{I}_{i+2}^{l} = f\Big(\big(\mathcal{W}_{i+1}^{l,j} + \mathcal{W}_{i+1}^{l,j'}\big) * \mathcal{I}_{i+1}^{j} + \mathbf{0} * \mathcal{I}_{i+1}^{j'} + \sum_{k \neq j,\, j'} \mathcal{W}_{i+1}^{l,k} * \mathcal{I}_{i+1}^{k}\Big) \qquad (3)$$

where $\mathbf{0}$ is a channel with all zero values. This means that whenever two filter sets in layer $i$ are equal, we can prune one of them safely. At the same time, we modify the filter channels in layer $i+1$ (i.e., for each filter, we add one channel to the other). Fig. 2 illustrates the operation.
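To make the merge in Eq (3) concrete, the following NumPy sketch (our illustration, not the authors' code; all shapes and names are assumptions) checks that pruning one of two identical filters in layer $i$ and adding its channel in the layer-$(i+1)$ filters to the channel of the kept filter leaves the output of layer $i+1$ unchanged:

```python
# Minimal sketch of Eq (3): prune a duplicate filter and fold its next-layer
# channel into the kept filter's channel; the two-layer output is preserved.
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)

def conv_layer(x, w):
    """x: (c_in, h, w), w: (c_out, c_in, k, k) -> ReLU(conv(x, w)) with 'same' padding."""
    out = np.stack([
        sum(correlate2d(x[c], w[j, c], mode="same") for c in range(x.shape[0]))
        for j in range(w.shape[0])
    ])
    return np.maximum(out, 0.0)

x  = rng.standard_normal((3, 8, 8))          # input tensor of layer i
w1 = rng.standard_normal((4, 3, 3, 3))       # filters of layer i
w1[1] = w1[0]                                # filters 0 and 1 are identical
w2 = rng.standard_normal((5, 4, 3, 3))       # filters of layer i+1

y_full = conv_layer(conv_layer(x, w1), w2)   # original two-layer output

w1_pruned = np.delete(w1, 1, axis=0)         # remove the duplicate filter
w2_merged = w2.copy()
w2_merged[:, 0] += w2_merged[:, 1]           # add channel 1 to channel 0 (Eq (3))
w2_pruned = np.delete(w2_merged, 1, axis=1)  # drop the now-unused channel

y_pruned = conv_layer(conv_layer(x, w1_pruned), w2_pruned)
print(np.allclose(y_full, y_pruned))         # True: the output of layer i+1 is unchanged
```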

3.2 Proposed Training Algorithm

However, two filter sets can never be exactly equal in a trained CNN, and it is hard to find two such similar filter sets. To cope with this, we propose a new training algorithm that forces selected filter sets to become equal. Let $\mathbf{W}$ denote the weights of the network and $\mathcal{L}(\mathbf{W})$ be our loss function; the optimization target can be formulated as:

$$\min_{\mathbf{W}} \ \mathcal{L}(\mathbf{W}) = \mathcal{L}_{CE}(\mathbf{W}) + \lambda \mathcal{L}_{reg}(\mathbf{W}) + \gamma \sum_{i}\sum_{m} \mathcal{L}^{cluster}_{i,m}(\mathbf{W}) \qquad (4)$$

$$\mathcal{L}^{cluster}_{i,m}(\mathbf{W}) = \sum_{\mathcal{W} \in C_{i,m}} \big\lVert \mathcal{W} - \mu_{i,m} \big\rVert_2^2, \qquad \mu_{i,m} = \frac{1}{|C_{i,m}|} \sum_{\mathcal{W} \in C_{i,m}} \mathcal{W} \qquad (5)$$

where $\mathcal{L}_{CE}$ and $\mathcal{L}_{reg}$ refer to the cross-entropy loss and the regularization loss, respectively, $\lambda$ and $\gamma$ are the tunable parameters that balance the loss terms, and $\mathcal{L}^{cluster}_{i,m}$ denotes the “cluster loss” of the $m$-th cluster $C_{i,m}$ in layer $i$ (Eq (5)). Thus, the filters in each cluster are forced to be similar by the cluster loss, and just one filter per cluster is kept after training.
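As a concrete illustration, below is a minimal TensorFlow sketch of such a cluster loss. It is our own sketch, not the authors' released code; it assumes the within-cluster squared distance to the cluster mean written in Eq (5), and the layer, the `cluster_ids` encoding and the value of $\gamma$ are illustrative assumptions.

```python
# A minimal sketch of the cluster loss of Eq (5): the summed squared distance of
# every filter to the mean of its (fixed, predefined) cluster.
import tensorflow as tf

def cluster_loss(kernel, cluster_ids, num_clusters):
    """kernel: (k, k, c_in, c_out) conv kernel; cluster_ids: (c_out,) cluster index
    of each filter. Returns the within-cluster squared distance, summed over clusters."""
    c_out = tf.shape(kernel)[-1]
    filters = tf.reshape(tf.transpose(kernel, [3, 0, 1, 2]), [c_out, -1])  # one row per filter
    means = tf.math.unsorted_segment_mean(filters, cluster_ids, num_clusters)
    return tf.reduce_sum(tf.square(filters - tf.gather(means, cluster_ids)))

# Usage sketch: add gamma * cluster_loss(...) of every convolutional layer to the task loss.
layer = tf.keras.layers.Conv2D(8, 3, padding="same")
layer.build((None, 32, 32, 3))
ids = tf.constant([0, 0, 1, 1, 2, 2, 3, 3])        # 8 filters grouped into 4 clusters of size 2
extra = 0.05 * cluster_loss(layer.kernel, ids, 4)  # gamma = 0.05, as used in Sec. 4.2
print(float(extra))
```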

If the clusters change during training, the training process may become difficult due to the high dimension of each filter and the imbalance among clusters. Therefore, in this work, we define the clusters to which the filters belong before training and keep them fixed online. The number of clusters in each layer is determined by the compression rate.

Specifically, let $n_i$ denote the number of filters in layer $i$ and $p n_i$ be the number of filters to be pruned, where $p$ is the compression rate. We restrict $p \le 0.5$, because the performance degrades severely when over half of the filters are pruned. For balance, the size of each cluster is no larger than two. Thus, there are $n_i - p n_i$ clusters, of which $p n_i$ clusters contain two filters and the remaining $n_i - 2 p n_i$ clusters contain just one filter. Note that one of the filters in each cluster of size two is pruned, while the clusters of size one are preserved.
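The following short Python sketch (our own, assumed implementation; which filters are paired is an illustrative choice) builds such a fixed assignment for one layer:

```python
# Fixed cluster assignment for one layer: with n_filters filters and compression
# rate p <= 0.5, the first p*n_filters clusters get two filters each and the
# remaining clusters get one filter each (one filter per cluster survives pruning).
def make_cluster_ids(n_filters: int, p: float) -> list:
    assert 0.0 <= p <= 0.5, "more than half of the filters should not be pruned"
    n_pruned = int(n_filters * p)             # filters removed after training
    n_clusters = n_filters - n_pruned         # one kept filter per cluster
    ids = []
    for c in range(n_pruned):                 # clusters of size two
        ids += [c, c]
    ids += list(range(n_pruned, n_clusters))  # clusters of size one
    return ids                                # length == n_filters

print(make_cluster_ids(8, 0.5))    # [0, 0, 1, 1, 2, 2, 3, 3] -> 4 clusters, as in Fig. 1
print(make_cluster_ids(8, 0.25))   # [0, 0, 1, 1, 2, 3, 4, 5] -> 6 clusters
```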

Analysis of Compression and Acceleration. According to Eq (3) and Fig. 2, if a filter in layer $i$ is removed, its corresponding channel of the filters in layer $i+1$ is also discarded, while the size of each feature map is kept the same. Since we preserve $(1-p) n_i$ filters in layer $i$, both the input channels and the output filters of layer $i+1$ are reduced by a factor of $1-p$, so the speedup ratio and compression ratio of the pruned network for layer $i+1$ compared to the original network can be computed as $\frac{1}{(1-p)^2}$. Note that the feature maps in layer $i+1$ are also compressed by a factor of $\frac{1}{1-p}$, which saves a large proportion of runtime memory.
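The sketch below (our own arithmetic check, not from the paper; the layer sizes are arbitrary) verifies this per-layer reduction by counting multiply-accumulates before and after pruning a fraction $p$ of the filters in every layer:

```python
# Pruning a fraction p of filters everywhere shrinks both the output filters and
# the input channels of an intermediate layer, so its MAC count drops by (1-p)**2.
def conv_macs(c_in, c_out, k, h, w):
    """Multiply-accumulates of a k x k convolution producing a c_out x h x w output."""
    return c_in * c_out * k * k * h * w

p = 0.5
orig   = conv_macs(c_in=64, c_out=128, k=3, h=32, w=32)
pruned = conv_macs(c_in=int(64 * (1 - p)), c_out=int(128 * (1 - p)), k=3, h=32, w=32)
print(orig / pruned)   # 4.0 == 1 / (1 - p)**2
```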

Figure 3: Accuracy after pruning filters (left) and the final “cluster loss” under different pruned ratios and different $\gamma$ (Eq (4)) for the WRN-16-4 model on CIFAR10 (right).

4 Experiments

We evaluate our proposed method on several typical networks and two datasets (CIFAR-10 and CIFAR-100) [22]. We use the TensorFlow [23] framework to implement network pruning and evaluate on an Nvidia GTX 1080Ti GPU.

Figure 4: A comparison of the performance with different filter selection criteria and different pruned ratios. The first row shows the VGG-16, ResNet-34 and WRN-16-4 networks on CIFAR10, and the second row shows the same models on CIFAR100. Our approach is consistently better (larger is better). The last column shows the speedup and compression ratios of the models.

4.1 Implementation Details and Filter Selection Criteria

CIFAR10 and CIFAR100 both consist of a training set of 50000 and a test set of 10000 color images of size 32×32, with 10 classes and 100 classes, respectively. The training images are padded by 4 pixels and randomly flipped.

We evaluate three DNNs (i.e., VGG-16, ResNet-34 and WRN-16-4) on the two datasets. All networks are trained using SGD with batch size 128 for 300 epochs. The weight decay is 0.0005 and the momentum is 0.9. The other hyper-parameters for the models are as follows. (1) The VGG-16 network is derived from [9]. The initial learning rate is set to 0.02 and is divided by 5 every 60 epochs. The baseline accuracy on CIFAR10 and CIFAR100 is 93.55% and 73.23%, respectively. (2) The ResNet-34 model replaces the shortcut layers of ResNet-32 [24] with convolutional layers. The initial learning rate is set to 0.1 and is divided by 5 every 60 epochs. The baseline accuracy on CIFAR10 and CIFAR100 is 93.56% and 69.82%, respectively. (3) The WRN-16-4 network is adopted from [25]. The initial learning rate is set to 0.2 and is divided by 5 every 60 epochs. The baseline accuracy on CIFAR10 and CIFAR100 is 95.01% and 75.99%, respectively.
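As a concrete example of these settings, here is a brief TensorFlow sketch (assumed, not the authors' script) of the VGG-16 schedule: SGD with momentum 0.9, batch size 128, 300 epochs, and an initial learning rate of 0.02 divided by 5 every 60 epochs; the 0.0005 weight decay would typically be added via an L2 kernel regularizer and is omitted here for brevity.

```python
# Sketch of the VGG-16 training schedule described above (our assumption).
import tensorflow as tf

def lr_at(epoch: int, base_lr: float = 0.02) -> float:
    """Initial learning rate divided by 5 every 60 epochs."""
    return base_lr / (5 ** (epoch // 60))

optimizer = tf.keras.optimizers.SGD(learning_rate=lr_at(0), momentum=0.9)
schedule = tf.keras.callbacks.LearningRateScheduler(lambda epoch, lr: lr_at(epoch))
# model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
# model.fit(train_images, train_labels, batch_size=128, epochs=300, callbacks=[schedule])
```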

In addition, we compare our method with other state-of-the-art criteria, including: (1) Weight sum [9]. This criterion measures the importance of a filter in each layer by calculating the sum of its absolute weights, i.e., $\sum \lvert \mathcal{W}_i^{j} \rvert$. (2) Average Percentage of Zeros (APoZ) [8]. APoZ measures the percentage of zero activations of a filter after the ReLU mapping; the APoZ of the $j$-th neuron is the fraction of zero entries in its output feature maps over $N$ training examples. (3) Random pruning, which randomly selects and removes filters during pruning. Moreover, we also compare these methods with Train from scratch, which trains a model from scratch with the same structure as the pruned network.
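For reference, a compact NumPy sketch of the two baseline criteria is given below (our illustration; the tensor layouts and variable names are assumptions): the weight-sum score of Li et al. [9] and the APoZ score of Hu et al. [8] computed from post-ReLU activations.

```python
# Baseline filter-importance criteria used for comparison.
import numpy as np

def weight_sum_scores(kernel):
    """kernel: (c_out, c_in, k, k). Sum of absolute weights per filter [9]."""
    return np.abs(kernel).reshape(kernel.shape[0], -1).sum(axis=1)

def apoz_scores(activations):
    """activations: (N, c_out, h, w) post-ReLU outputs over N examples.
    Fraction of zero entries per output channel (higher -> more prunable) [8]."""
    return (activations == 0).mean(axis=(0, 2, 3))

rng = np.random.default_rng(0)
kernel = rng.standard_normal((8, 3, 3, 3))
acts = np.maximum(rng.standard_normal((128, 8, 16, 16)), 0.0)  # fake ReLU outputs
print(weight_sum_scores(kernel).round(2))
print(apoz_scores(acts).round(2))   # roughly 0.5 for random pre-activations
```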

4.2 The effect of $\gamma$

First, we explore the influence of $\gamma$ (Eq (4)) on the performance of our pruning method. We set the pruned ratio to be the same in every layer of the WRN-16-4 network. Fig. 3 shows the accuracy of the pruned model with different pruned ratios and different $\gamma$, as well as the “cluster loss” after training the model with our proposed training algorithm. As expected, the cluster loss increases as the pruned ratio and $\gamma$ increase. In addition, the losses are always small, so one filter in each cluster can be pruned with minimal degradation in accuracy. The performance of the pruned network is similar for different values of $\gamma$. In the rest of the experiments, we set $\gamma$ to 0.05.

4.3 Comparison with other filter selection criteria

Fig. 4 shows the pruning results for VGG-16, ResNet-34 and WRN-16-4 on CIFAR10 and CIFAR100 with different filter selection criteria and different pruned ratios. Our approach is consistently better than the other filter selection criteria under different pruned ratios. The data-dependent method (i.e., APoZ) performs similarly to the data-independent approaches; this may be because the number of pruned filters is small, so the gap between these methods is not obvious. Training the pruned network from scratch is not always worse than the other methods, especially when the model is deep and wide. We can accelerate the VGG-16 network by 2× without loss in accuracy and speed up the WRN-16-4 model on CIFAR10 by about 3.4× with a 1% degradation in accuracy. However, ResNet-34 is hard to compress, which may be because the model is already compact and efficient.

Dataset    Model              Error (%)       Speedup
CIFAR10    VGG-16 [9]         6.75  (-0.15)   1.52×
           VGG-16 [10]        6.34  (-0.14)   2.04×
           VGG-16 (ours)      6.45  (-0.21)   2.77×
           ResNet-56 [9]      6.96  (-0.02)   1.38×
           ResNet-58 (ours)   6.18  (-0.01)   1.50×
CIFAR100   VGG-16 [10]        26.74 (-0.22)   1.59×
           VGG-16 (ours)      26.77 (-0.32)   2.03×
           WRN-16-4 [24]      24.53 (+0.30)   1.18×
           WRN-16-4 (ours)    24.01 (+0.06)   1.34×
Table 1: A comparison of speedup with other compression methods. Values in parentheses are the increase in error (negative values mean the error decreased).

4.4 Comparison with other compression methods

Besides filter pruning methods, we compare the acceleration of our approach with other network compression methods in Table 1. In general, different layers have different importance and sparsity [24], and the training-based method [10] can find this automatically. Even so, our approach outperforms the other methods.

5 Conclusion

In this work, we introduce a cluster loss added to the original loss function to force the filters in each cluster to be similar during training, so that one filter in every cluster can be pruned safely. The resulting compact model is efficient at inference time and requires no special hardware. Extensive experiments on two datasets demonstrate the competitive performance of our proposed method. In the future, we would like to evaluate our method on larger datasets and more vision tasks.

References