1 Introduction
Deep neural networks (DNNs) have achieved state-of-the-art performance in many computer vision tasks and have grown deeper and deeper. However, these high-capacity networks incur high storage and computation costs, especially on resource-limited platforms such as mobile phones and embedded devices. Thus, many researchers have taken a significant interest in network compression methods for reducing these costs.
With the observation that DNNs have significant parameter redundancy, pruning methods have been widely studied for reducing the number of parameters in networks. Early work pruned individual weights, but this has limited ability to accelerate network inference. Afterwards, more and more works have focused on pruning filters or channels, which results in a thinner model and significant acceleration. The main step of these pruning methods is to measure the importance score of each filter. They then prune the least important filters and fine-tune the network to recover its performance. Some methods are based on the magnitude of the filter, and some exploit the information in feature maps to estimate the importance of the corresponding filter. Moreover, if two weights are similar, one of them is redundant and can be pruned, but this approach only considers fully-connected layers and is therefore limited in accelerating convolutional neural networks.
In this paper, we propose a new filter-level pruning method. Our method is based on the fact that if two filters are similar, one of them is redundant and can be effectively removed. However, the filters in a trained DNN are generally dissimilar. To force filters to become similar, we propose a new training algorithm that adds a “cluster loss” to the original loss function. The network can then learn compact representations during training with backpropagation. As shown in Fig. 1, our method consists of three main steps: (1) Given a DNN, we define the cluster to which each filter belongs and train the network with our proposed training algorithm. (2) After training, the filters in each cluster are similar and one of them can be removed; the corresponding filter channels can also be pruned. (3) Finally, we fine-tune the pruned network to compensate for the performance degradation.
We evaluate our proposed method on two benchmark datasets (CIFAR10 and CIFAR100) and demonstrate its competitive performance compared with state-of-the-art approaches. For VGG-16, our method achieves a 2× speedup without loss in accuracy. With WRN-16-4, it achieves about 3.4× and 1.74× speedups within a 1% accuracy drop on CIFAR10 and CIFAR100, respectively.
2 Related Work
As one of the most popular methods for accelerating network inference, pruning has been widely studied recently. In earlier work, Optimal Brain Damage prunes unimportant weights to reduce the number of parameters and prevent over-fitting. More recently, Han et al. prune the weights below a threshold, resulting in a very sparse model with no loss of accuracy. But such a non-structured sparse model is limited in accelerating inference without specific hardware. Later, researchers focused on filter-level pruning, which not only reduces the memory footprint dramatically but also speeds up network inference with any off-the-shelf library. Li et al. prune the less useful filters directly based on the sum of their absolute weights.
However, a small magnitude does not mean that a filter is unimportant. Thus, methods based on the information in activations have been studied. Hu et al. calculate the sparsity of activations after the ReLU function and prune the corresponding filters if the sparsity is high. Molchanov et al. adopt a first-order Taylor expansion to approximate the change in the loss function induced by pruning each filter. He et al. prune channels by LASSO-regression-based channel selection and least-squares reconstruction. Yu et al. prune the entire network jointly to minimize the reconstruction error of important responses in the “final response layer” and propagate the importance scores of final responses to every neuron.
Besides the above methods, which prune filters in a pre-trained network, several methods add regularization or other modifications during training. For example, Liu et al. enforce channel sparsity by imposing L1 regularization on the scaling factors in batch normalization. McDanel et al. utilize incomplete dot products to dynamically adjust the number of input channels used in each layer. Zhou et al. introduce a scaling factor for each filter to weaken the weights step by step and prune the filters after training.
Meanwhile, other methods for compressing networks are well explored. Knowledge distillation first trains a big network (i.e., the teacher network) and then trains a shallow one (i.e., the student network) to mimic the output distributions of the teacher. Network quantization reduces the number of bits for representing each parameter, and some low-bit networks have been proposed. Low-rank factorization decomposes weights into several pieces. Note that our pruning method can be integrated with the above techniques to achieve a more compact and efficient model.
3 Our Method
In this section, we first show how we prune similar filters in a single layer, then propose a new training algorithm that forces the filters in each cluster to be similar. Finally, we analyze the acceleration and compression achieved by our method.
3.1 Pruning Similar Filters
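We use a triplet $\langle I_i, W_i, * \rangle$ to denote the convolution process in layer $i$, where $I_i$ is the input tensor, which has $C_i$ channels, $h_i$ rows and $w_i$ columns, $W_i$ is a filter bank with kernel size $K_i \times K_i$, and $*$ denotes the 3D convolution operation, which maps $I_i$ to $I_{i+1}$ using $W_i$. Here $I_{i+1}$ is the input tensor of layer $i+1$ (which is also the output tensor of layer $i$). Note that the fully connected operation is a special case of the convolution operation.
Formally, the $j$-th channel of $I_{i+1}$ can be computed from the $j$-th filter and the input tensor $I_i$ using the 2D convolution operation as follows:
$$I_{i+1,j} = f\Big(\sum_{c=1}^{C_i} I_{i,c} * W_{i,j,c}\Big), \tag{1}$$
where $f$ is the nonlinear function, such as sigmoid or ReLU, $I_{i,c}$ is the $c$-th channel of $I_i$ and $W_{i,j,c}$ is the $c$-th channel of the $j$-th filter in layer $i$. Note that we ignore the corresponding bias for simplicity. In the same way, $I_{i+2,j}$ can be computed as:
$$I_{i+2,j} = f\Big(\sum_{c=1}^{C_{i+1}} I_{i+1,c} * W_{i+1,j,c}\Big). \tag{2}$$
If the $p$-th and $q$-th filters in layer $i$ are equal, then $I_{i+1,p} = I_{i+1,q}$ by Eq. (1), and Eq. (2) can be rewritten as
$$I_{i+2,j} = f\Big(\sum_{c \neq p,q} I_{i+1,c} * W_{i+1,j,c} + I_{i+1,p} * (W_{i+1,j,p} + W_{i+1,j,q}) + I_{i+1,q} * \mathbf{0}\Big), \tag{3}$$
where $\mathbf{0}$ is the all-zero channel. This means that whenever two filters in layer $i$ are equal, we can safely prune one of them. At the same time, we modify the filter channels in layer $i+1$ (i.e., for each filter, we add the channel that corresponds to the pruned filter to the channel that corresponds to the kept one). Fig. 2 illustrates the operation.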
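The merging step described above can be checked numerically. The following is a minimal sketch rather than the authors' implementation: the helper names (`conv2d_valid`, `prune_duplicate`), the tensor shapes and the naive loop-based convolution are all assumptions made for illustration, and biases are ignored as in the text.

```python
import numpy as np

def conv2d_valid(x, w):
    """Naive 'valid' convolution (cross-correlation, as in deep learning libraries).
    x: (C, H, W) input tensor; w: (N, C, K, K) filter bank -> (N, H-K+1, W-K+1)."""
    k = w.shape[2]
    h_out, w_out = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((w.shape[0], h_out, w_out))
    for r in range(h_out):
        for s in range(w_out):
            patch = x[:, r:r + k, s:s + k]  # (C, K, K) window of the input
            out[:, r, s] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def relu(x):
    return np.maximum(x, 0.0)

def prune_duplicate(w_cur, w_next, keep, drop):
    """Remove filter `drop` from the current layer and fold its channel in the
    next layer into the channel of filter `keep` (hypothetical helper)."""
    w_next = w_next.copy()
    w_next[:, keep] += w_next[:, drop]        # add one channel to the other
    w_cur = np.delete(w_cur, drop, axis=0)    # drop the redundant filter
    w_next = np.delete(w_next, drop, axis=1)  # drop its now-unused input channel
    return w_cur, w_next

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))
w1 = rng.normal(size=(4, 3, 3, 3))
w1[3] = w1[1]                                 # make filters 1 and 3 exactly equal
w2 = rng.normal(size=(5, 4, 3, 3))

full = relu(conv2d_valid(relu(conv2d_valid(x, w1)), w2))
w1_p, w2_p = prune_duplicate(w1, w2, keep=1, drop=3)
pruned = relu(conv2d_valid(relu(conv2d_valid(x, w1_p)), w2_p))
max_abs_diff = float(np.abs(full - pruned).max())
```

Because convolution is linear in the filter weights, the two-layer output is unchanged up to floating-point error, while the first layer loses one filter and the second layer loses one input channel.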
3.2 Proposed Training Algorithm
However, two filters are never exactly equal in a trained CNN, and it is hard to find even two similar filters. To cope with this, we propose a new training algorithm that forces the filters in each cluster to become equal. Let $W$ denote the weights in the network and $\mathcal{L}(W)$ our loss function. The optimization target can be formulated as:
$$\min_{W} \mathcal{L}(W) = \mathcal{L}_{ce} + \lambda \mathcal{L}_{reg} + \gamma \sum_{i} \sum_{k} \mathcal{L}_{cluster}^{i,k}, \tag{4}$$
where $\mathcal{L}_{ce}$ and $\mathcal{L}_{reg}$ refer to the cross-entropy loss and the regularization loss, respectively, $\lambda$ and $\gamma$ are the tunable parameters that balance the loss terms, and $\mathcal{L}_{cluster}^{i,k}$ denotes the “cluster loss” of the $k$-th cluster in layer $i$. Thus, the filters in each cluster are forced to be similar by the cluster loss, and just one filter per cluster is kept after training.
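The text does not give the exact form of the cluster loss, so the following is only a minimal sketch under stated assumptions: the cluster loss is taken to be the sum of squared distances between each filter and the mean of its cluster, the regularization loss is taken to be an L2 penalty, and the names `cluster_loss`, `total_loss`, `reg_weight` and `cluster_weight` are hypothetical.

```python
import numpy as np

def cluster_loss(filters, clusters):
    """Assumed cluster loss: squared distance of each filter to its cluster mean.
    filters: (N, C, K, K) array; clusters: list of lists of filter indices."""
    flat = filters.reshape(filters.shape[0], -1)
    total = 0.0
    for members in clusters:
        group = flat[members]
        total += float(((group - group.mean(axis=0)) ** 2).sum())
    return total

def total_loss(ce_loss, layer_filters, layer_clusters, reg_weight, cluster_weight):
    """Cross-entropy + weighted regularization + weighted sum of cluster losses."""
    reg = sum(float((w ** 2).sum()) for w in layer_filters)  # assumed L2 penalty
    clu = sum(cluster_loss(w, c) for w, c in zip(layer_filters, layer_clusters))
    return ce_loss + reg_weight * reg + cluster_weight * clu

# Toy layer with three 1x1 single-channel filters; filters 0 and 1 share a cluster.
toy_filters = np.array([1.0, 3.0, 5.0]).reshape(3, 1, 1, 1)
toy_clusters = [[0, 1], [2]]
toy_cluster_loss = cluster_loss(toy_filters, toy_clusters)
toy_total = total_loss(1.0, [toy_filters], [toy_clusters],
                       reg_weight=0.0005, cluster_weight=0.05)
```

Under this assumed form, a cluster of size one contributes zero, and a cluster of two identical filters also contributes zero, which is the state in which one of the two filters can be removed without changing the output.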
If the clusters change during training, the training process may become difficult because of the high dimension of each filter and the unbalanced clusters. Therefore, in this work, we first define the cluster to which each filter belongs and keep the clusters fixed during training. The number of clusters in each layer is determined by the compression rate.
Specifically, let $N_i$ denote the number of filters in layer $i$ and $P_i = r N_i$ the number of filters to be pruned, where $r$ is the compression rate. We restrict $r \le 0.5$, because the performance degrades severely when over half of the filters are pruned. The size of each cluster is no larger than two for balance. Thus, there are $N_i - P_i$ clusters, of which $P_i$ contain two filters each and $N_i - 2P_i$ contain just one filter each. Note that one filter is pruned from each cluster of size two, and the clusters of size one are preserved.
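The cluster bookkeeping above can be sketched as follows. The text does not say which filters are paired, so pairing neighbouring indices is an assumption made purely for illustration, as are the function name `make_clusters` and the rounding of the number of pruned filters.

```python
def make_clusters(num_filters, pruned_ratio):
    """Split filter indices into clusters of size at most two.

    One filter is later removed from every two-filter cluster, so the number of
    pairs equals the number of pruned filters. Pairing consecutive indices is an
    assumption; the source does not specify how pairs are chosen.
    """
    if not 0.0 <= pruned_ratio <= 0.5:
        raise ValueError("pruned_ratio must lie in [0, 0.5]")
    num_pruned = int(round(num_filters * pruned_ratio))
    pairs = [[2 * k, 2 * k + 1] for k in range(num_pruned)]
    singles = [[j] for j in range(2 * num_pruned, num_filters)]
    return pairs + singles

example_clusters = make_clusters(8, 0.25)
```

For 8 filters and a ratio of 0.25, this gives two clusters of size two and four clusters of size one, i.e., six clusters and six filters kept after pruning.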
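Analysis of Compression and Acceleration. According to Eq. (3) and Fig. 2, if a filter in layer $i$ is removed, the corresponding channel of every filter in layer $i+1$ is also discarded. The size of each feature map is kept the same. Let $r$ denote the fraction of filters pruned in a layer, so that a fraction $1-r$ of its filters is preserved. Removing the filters reduces the computation and parameters of layer $i$ by a factor of $1/(1-r)$, and removing the matching input channels does the same for layer $i+1$; a layer whose input channels and filters are both pruned at ratio $r$ is therefore accelerated and compressed by $1/(1-r)^2$ compared to the original network. Note that the feature maps output by layer $i$ are also compressed $1/(1-r)$ times, which saves a large proportion of runtime memory.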
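As a rough check of this analysis, the per-layer ratios can be computed by counting parameters and multiply-accumulate operations before and after pruning. This is a sketch under assumptions: cost is measured in multiply-accumulates only, biases are ignored, and the function names and the example layer shape are hypothetical.

```python
def conv_cost(in_channels, out_channels, kernel, out_h, out_w):
    """Parameter count and multiply-accumulate count of one convolutional layer."""
    params = out_channels * in_channels * kernel * kernel
    macs = params * out_h * out_w  # every weight is used once per output position
    return params, macs

def layer_ratios(in_channels, out_channels, kernel, out_h, out_w, r_prev, r_cur):
    """Compression and speedup ratios of a layer whose input channels are pruned
    at ratio r_prev (from the previous layer) and whose filters at ratio r_cur."""
    kept_in = in_channels - round(in_channels * r_prev)
    kept_out = out_channels - round(out_channels * r_cur)
    params_0, macs_0 = conv_cost(in_channels, out_channels, kernel, out_h, out_w)
    params_1, macs_1 = conv_cost(kept_in, kept_out, kernel, out_h, out_w)
    return params_0 / params_1, macs_0 / macs_1

# Hypothetical 3x3 layer with 64 input and 64 output channels on 32x32 feature maps.
compression, speedup = layer_ratios(64, 64, 3, 32, 32, r_prev=0.25, r_cur=0.25)
```

With a quarter of the filters removed in both the previous and the current layer, both ratios equal (64/48)^2 = 16/9, since the feature-map size is unchanged and only the channel counts shrink.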
4 Experiments
We evaluate our proposed method on several typical networks and two datasets (CIFAR-10 and CIFAR-100). We use the TensorFlow framework to implement network pruning and evaluate on an Nvidia GTX 1080Ti GPU.
4.1 Implementation Details and filter selection criteria
CIFAR10 and CIFAR100 both consist of a training set of 50,000 and a test set of 10,000 color images of size 32×32, with 10 classes and 100 classes, respectively. The training images are padded by 4 pixels and randomly flipped.
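The augmentation can be sketched as below. The text only states padding and flipping; cropping the padded image back to its original size at a random offset is the usual companion step and is an assumption here, as are zero padding, the 0.5 flip probability and the function name `augment`.

```python
import numpy as np

def augment(image, rng, pad=4):
    """Pad an (H, W, 3) image by `pad` pixels, take a random crop of the original
    size (assumed step), and flip it horizontally with probability 0.5."""
    h, w, _ = image.shape
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = rng.integers(0, 2 * pad + 1)   # random vertical offset in [0, 2*pad]
    left = rng.integers(0, 2 * pad + 1)  # random horizontal offset in [0, 2*pad]
    crop = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]             # horizontal flip
    return crop

rng = np.random.default_rng(0)
sample = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = augment(sample, rng)
```

The output keeps the 32×32×3 shape and the `uint8` type of the input, so it can be fed to the same input pipeline as the unaugmented images.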
We evaluate three DNNs (i.e., VGG-16, ResNet-34 and WRN-16-4) on the two datasets. All networks are trained using SGD with a batch size of 128 for 300 epochs. The weight decay is 0.0005 and the momentum is 0.9. The other hyper-parameters for the models are as follows. (1) The VGG-16 network is derived from. The initial learning rate is set to 0.02 and is divided by 5 every 60 epochs. We get baseline accuracies on CIFAR10 and CIFAR100 of 93.55% and 73.23%, respectively. (2) The ResNet-34 model is obtained from ResNet-32 by replacing the shortcut layer with a convolutional layer. The initial learning rate is set to 0.1 and is divided by 5 every 60 epochs. We get baseline accuracies on CIFAR10 and CIFAR100 of 93.56% and 69.82%, respectively. (3) The WRN-16-4 network is adopted from Zagoruyko and Komodakis. The initial learning rate is set to 0.2 and is divided by 5 every 60 epochs. We get baseline accuracies on CIFAR10 and CIFAR100 of 95.01% and 75.99%, respectively.
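The step schedule shared by the three networks can be written as a small function. This is a sketch in which the function and parameter names are hypothetical, and the learning rate is assumed to drop at the start of epochs 60, 120, 180 and 240 (epochs counted from 0).

```python
def learning_rate(epoch, initial_lr, drop_every=60, factor=5.0):
    """Step schedule: divide the learning rate by `factor` every `drop_every` epochs."""
    return initial_lr / (factor ** (epoch // drop_every))

# Learning rates for the ResNet-34 setting (initial rate 0.1) at the start of each stage.
resnet_schedule = [learning_rate(e, 0.1) for e in (0, 60, 120, 180, 240)]
```

Over 300 epochs this yields five stages, so the final learning rate is the initial one divided by 5^4 = 625.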
In addition, we compare our method with other state-of-the-art criteria, including: (1) Weight sum. This criterion measures the importance of a filter in each layer by the sum of its absolute weights ($s_j = \sum |W_{i,j}|$, where $W_{i,j}$ is the $j$-th filter in layer $i$). (2) Average Percentage of Zeros (APoZ). APoZ measures the percentage of zero activations of a filter after the ReLU mapping. The APoZ of the $j$-th neuron is $\frac{1}{NM}\sum_{n=1}^{N}\sum_{m=1}^{M}\mathbb{1}[O_j^{(n)}(m) = 0]$, where $O_j^{(n)}$ is its output feature map for the $n$-th example, $M$ is the number of entries in that feature map and $N$ is the number of training examples used. (3) Random pruning, which randomly selects and removes filters. We also compare with training from scratch, i.e., training a model that has the same structure as the pruned network from scratch.
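The two deterministic baseline criteria can be sketched in a few lines of NumPy. The array layouts and function names are assumptions made for illustration: filters are stored as (filters, channels, height, width) and post-ReLU activations as (examples, filters, height, width).

```python
import numpy as np

def weight_sum(filters):
    """Importance of each filter as the sum of its absolute weights.
    filters: (N, C, K, K) -> (N,) scores; small scores are pruned first."""
    return np.abs(filters).reshape(filters.shape[0], -1).sum(axis=1)

def apoz(activations):
    """Average Percentage of Zeros per filter.
    activations: (num_examples, N, H, W) post-ReLU outputs -> (N,) fractions;
    filters with a high fraction of zeros are pruned first."""
    return (activations == 0).mean(axis=(0, 2, 3))

toy_filters = np.array([[[[1.0, -2.0]]], [[[0.5, 0.5]]]])   # shape (2, 1, 1, 2)
toy_acts = np.array([[[[0.0, 1.0]], [[1.0, 1.0]]],
                     [[[0.0, 0.0]], [[1.0, 0.0]]]])         # shape (2, 2, 1, 2)
toy_scores = weight_sum(toy_filters)
toy_apoz = apoz(toy_acts)
```

In the toy example, the second filter has the smaller weight sum, while the first filter has the higher APoZ, which shows that the two criteria can rank the same filters differently.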
4.2 The effect of the cluster-loss weight
First, we explore the influence of the cluster-loss weight in Eq. (4) on the performance of our pruning method. We set the pruned ratio to be the same in every layer of the WRN-16-4 network. Fig. 3 shows the accuracy of the pruned model and the “cluster loss” after training with our proposed algorithm, for different pruned ratios and cluster-loss weights. As expected, the cluster loss increases as the pruned ratio and the cluster-loss weight increase. In addition, the losses are always small, so one filter in each cluster can be pruned with minimal degradation of accuracy. The performance of the pruned network is similar for different cluster-loss weights. In the rest of the experiments, we set the cluster-loss weight to 0.05.
4.3 Comparison with other filter selection criteria
Fig. 4 shows the pruning results for VGG-16, ResNet-34 and WRN-16-4 on CIFAR10 and CIFAR100 with different filter selection criteria and different pruned ratios. Our approach is consistently better than the other filter selection criteria across pruned ratios. The data-dependent method (i.e., APoZ) performs similarly to the data-independent approaches. This may be because the number of filters pruned is small, so the gap between these methods is not obvious. Training the pruned network from scratch is not always worse than the other methods, especially when the model is deep and wide. We can accelerate the VGG-16 network by 2× without loss in accuracy and speed up the WRN-16-4 model on CIFAR10 by about 3.4× with a 1% degradation in accuracy. However, ResNet-34 is hard to compress, which may be because the model is already compact and efficient.
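Table 1: Comparison with other network compression methods.
| Dataset | Model | Baseline error % (change) | Speedup (×) |
|---|---|---|---|
| CIFAR10 | VGG-16 | 6.75 (-0.15) | 1.52 |
| CIFAR10 | VGG-16 | 6.34 (-0.14) | 2.04 |
| CIFAR10 | VGG-16 (ours) | 6.45 (-0.21) | 2.77 |
| CIFAR10 | ResNet-56 | 6.96 (-0.02) | 1.38 |
| CIFAR10 | ResNet-58 (ours) | 6.18 (-0.01) | 1.50 |
| CIFAR100 | VGG-16 | 26.74 (-0.22) | 1.59 |
| CIFAR100 | VGG-16 (ours) | 26.77 (-0.32) | 2.03 |
| CIFAR100 | WRN-16-4 | 24.53 (+0.30) | 1.18 |
| CIFAR100 | WRN-16-4 (ours) | 24.01 (+0.06) | 1.34 |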
4.4 Comparison with other compression methods
Besides filter pruning methods, we compare the acceleration of our approach with other network compression methods in Table 1. In general, different layers have different importance and sparsity, and training-based methods can find these automatically. Even so, our approach outperforms the other methods.
5 Conclusion
In this work, we add a cluster loss to the original loss function to force the filters in each cluster to be similar during training, and then safely prune one filter from every cluster. The compact model is efficient at inference and requires no special hardware. Extensive experiments on two datasets demonstrate the competitive performance of our proposed method. In the future, we would like to evaluate our method on larger datasets and more vision tasks.
References
-  Xiabing Liu, Wei Liang, Yumeng Wang, Shuyang Li, and Mingtao Pei, “3d head pose estimation with convolutional neural network trained on synthetic images,” in ICIP, 2016.
-  Aya F Khalaf, Inas A Yassine, and Ahmed S Fahmy, “Convolutional neural networks for deep feature learning in retinal vessel segmentation,” in ICIP, 2016.
-  Shaoyan Sun, Wengang Zhou, Qi Tian, and Houqiang Li, “Scalable object retrieval with compact image representation from generic object regions,” ACM TOMM, vol. 12, no. 2, pp. 29, 2016.
-  Yann LeCun, John S Denker, and Sara A Solla, “Optimal brain damage,” in NIPS, 1990.
-  Song Han, Jeff Pool, John Tran, and William Dally, “Learning both weights and connections for efficient neural network,” in NIPS, 2015.
-  Zhengguang Zhou, Wengang Zhou, Richang Hong, and Houqiang Li, “Online filter weakening and pruning for efficient convnets,” in ICME, 2018.
-  Yihui He, Xiangyu Zhang, and Jian Sun, “Channel pruning for accelerating very deep neural networks,” in ICCV, 2017.
-  Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang, “Network trimming: A data-driven neuron pruning approach towards efficient deep architectures,” arXiv preprint arXiv:1607.03250, 2016.
-  Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf, “Pruning filters for efficient convnets,” arXiv preprint arXiv:1608.08710, 2016.
-  Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang, “Learning efficient convolutional networks through network slimming,” arXiv preprint arXiv:1708.06519, 2017.
-  Jian-Hao Luo, Jianxin Wu, and Weiyao Lin, “Thinet: A filter level pruning method for deep neural network compression,” arXiv preprint arXiv:1707.06342, 2017.
-  Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz, “Pruning convolutional neural networks for resource efficient transfer learning,” arXiv preprint arXiv:1611.06440, 2016.
-  Suraj Srinivas and R Venkatesh Babu, “Data-free parameter pruning for deep neural networks,” arXiv preprint arXiv:1507.06149, 2015.
-  Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S Davis, “Nisp: Pruning networks using neuron importance score propagation,” arXiv preprint arXiv:1711.05908, 2017.
-  Bradley McDanel, Surat Teerapittayanon, and HT Kung, “Incomplete dot products for dynamic computation scaling in neural network inference,” arXiv preprint arXiv:1710.07830, 2017.
-  Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
-  Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1,” arXiv preprint arXiv:1602.02830, 2016.
-  Fengfu Li, Bo Zhang, and Bin Liu, “Ternary weight networks,” arXiv preprint arXiv:1605.04711, 2016.
-  Xiaotian Zhu, Wengang Zhou, and Houqiang Li, “Adaptive layerwise quantization for deep neural network compression,” in ICME, 2018.
-  Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
-  Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky, “Speeding-up convolutional neural networks using fine-tuned cp-decomposition,” arXiv preprint arXiv:1412.6553, 2014.
-  Alex Krizhevsky and Geoffrey Hinton, “Learning multiple layers of features from tiny images,” Tech. Rep., University of Toronto, 2009.
-  Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
-  Xuanyi Dong, Junshi Huang, Yi Yang, and Shuicheng Yan, “More is less: A more complicated network with less inference complexity,” arXiv preprint arXiv:1703.08651, 2017.
-  Sergey Zagoruyko and Nikos Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.