
Three Dimensional Convolutional Neural Network Pruning with Regularization-Based Method

In recent years, three-dimensional convolutional neural networks (3D CNNs) have been intensively applied to video analysis and achieve good performance. However, 3D CNNs incur massive computation and storage costs, which hinders their deployment on mobile and embedded devices. In this paper, we propose a three-dimensional regularization-based pruning method that assigns different regularization parameters to different weight groups based on their importance to the network. Experiments show that the proposed method outperforms other popular methods in this area.



1 Introduction

In recent years, convolutional neural networks (CNNs) have developed rapidly and achieved remarkable success in computer vision tasks such as identification, classification and segmentation. However, because they lack motion modeling, these image-based end-to-end features cannot be applied directly to videos. In Xu20123D ; Karpathy2014Large , the authors use three-dimensional convolutional neural networks (3D CNNs) to identify human actions in videos. Tran et al. proposed a 3D CNN for action recognition which contains 1.75 million parameters Du2015Learning . The higher dimensionality of 3D CNNs also brings challenges: it leads to massive computation and storage consumption, which hinders their deployment on mobile and embedded devices.

In order to reduce the computation cost, researchers have proposed methods to compress CNN models, including knowledge distillation Hin2015Knowledge , parameter quantization Courbariaux2015BinaryConnect ; Rastegari2016XNOR , matrix decomposition Zhang2015Accelerating and parameter pruning Han2015Deep . However, all of the above methods are based on two-dimensional convolution. In this paper, we extend the idea of 2018arXiv180409461W to 3D CNN acceleration. The main idea is to add group regularization terms to the objective function and prune weight groups gradually, where the regularization parameters of different weight groups are assigned differently according to an importance criterion.

2 The Proposed Method

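For a three-dimensional convolutional neural network with $L$ layers, the weights of the $l$-th ($1 \le l \le L$) convolutional layer, $W^{(l)} \in \mathbb{R}^{N_l \times C_l \times H_l \times W_l \times D_l}$, form a 5-D tensor. Here $N_l$, $C_l$, $H_l$, $W_l$ and $D_l$ are the dimensions along the axes of filter, channel, spatial height, spatial width and spatial depth. The proposed objective function for structured sparsity regularization is defined by Eqn.(1), where $G^{(l)}$ denotes the number of weight groups in layer $l$ and $W^{(l)}_g$ its $g$-th group.

$$E(W) = \mathcal{L}(W) + \frac{\lambda}{2} R(W) + \sum_{l=1}^{L} \sum_{g=1}^{G^{(l)}} \frac{\lambda_g}{2} R\big(W^{(l)}_g\big) \tag{1}$$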

Here $\mathcal{L}(W)$ is the loss on data; $R(W)$ is the non-structured regularization ($L_2$ norm in this paper). $R(W^{(l)}_g)$ is the structured sparsity regularization on each layer. In Wen2015Learning ; Lebedev2016Fast , the authors used the same $\lambda_g$ for all groups and adopted Group LASSO for $R(W^{(l)}_g)$. Recently Wang et al. 2018arXiv180409461W use the squared $L_2$ norm for $R(W^{(l)}_g)$ and vary the regularization parameters $\lambda_g$ for different groups. We build on that approach but extend it from two dimensions to three dimensions.
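To illustrate why a squared $L_2$ penalty with a per-group parameter behaves like group-specific weight decay, here is a minimal Python sketch. It assumes the penalty on one group is $\frac{\lambda}{2}\|w\|^2 + \frac{\lambda_g}{2}\|w\|^2$, so that its gradient is $(\lambda + \lambda_g)\,w$. The function name `regularized_step`, the plain SGD rule and the toy numbers are illustrative assumptions, not the authors' Caffe implementation.

```python
def regularized_step(weights, data_grads, lam, lam_g, lr):
    """One plain SGD step on a single weight group.

    Assumed penalty on the group: (lam / 2 + lam_g / 2) * sum(w ** 2),
    whose gradient with respect to each weight w is (lam + lam_g) * w.
    """
    decay = lam + lam_g
    return [w - lr * (g + decay * w) for w, g in zip(weights, data_grads)]


# Toy group with zero data gradient, so only the penalty acts on it.
group = [1.0, -2.0]
zero_grads = [0.0, 0.0]

lightly_penalized = regularized_step(group, zero_grads, lam=0.0, lam_g=0.25, lr=1.0)
heavily_penalized = regularized_step(group, zero_grads, lam=0.0, lam_g=0.5, lr=1.0)
# lightly_penalized == [0.75, -1.5]; heavily_penalized == [0.5, -1.0]
```

A larger $\lambda_g$ shrinks the group faster for the same learning rate, which is what allows a per-group parameter to steer individual groups toward zero.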

The learned structure is determined by how the weights $W^{(l)}$ are split into groups. Typical choices are filter-wise, channel-wise, shape-wise and depth-wise structured sparsity, each corresponding to a different way of grouping Wen2015Learning . Pruning of different weight groups for 3D CNN is shown in Fig.1.
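The grouping choices above can be made concrete with a small NumPy sketch. It assumes a weight tensor laid out as (filter, channel, height, width, depth), following the axis order given earlier, and it scores each group by its $L_1$ norm. The function name `group_l1_norms` and the mode strings are hypothetical, and depth-wise grouping is left out.

```python
import numpy as np


def group_l1_norms(weights, mode):
    """Return the L1 norm of every weight group of one 3D conv layer.

    weights: array of shape (N, C, H, W, D) =
             (filter, channel, height, width, depth).
    mode:    "filter", "channel" or "shape" (names chosen for this sketch).
    """
    if weights.ndim != 5:
        raise ValueError("expected a 5-D weight tensor")
    abs_w = np.abs(weights)
    if mode == "filter":
        # One group per filter: reduce over C, H, W, D.
        return abs_w.sum(axis=(1, 2, 3, 4))
    if mode == "channel":
        # One group per input channel: reduce over N, H, W, D.
        return abs_w.sum(axis=(0, 2, 3, 4))
    if mode == "shape":
        # One group per (channel, h, w, d) position shared by all filters:
        # reduce over N, then flatten the remaining positions.
        return abs_w.sum(axis=0).reshape(-1)
    raise ValueError(f"unknown grouping mode: {mode}")


# Toy layer: 4 filters, 3 channels, 3x3x3 kernels.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3, 3, 3, 3))

filter_norms = group_l1_norms(w, "filter")    # shape (4,)
channel_norms = group_l1_norms(w, "channel")  # shape (3,)
shape_norms = group_l1_norms(w, "shape")      # shape (81,)
```

Each mode partitions the same tensor differently, so the group norms of any one mode always sum to the $L_1$ norm of the whole layer.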

Figure 1: The im2col implementation of 3D CNN is to expand tensors into matrices, so that convolutional operations are transformed to matrix multiplication. The weights at the blue squares are to be pruned. (a) Pruning a filter. (b) Pruning all the weights at the same position. (c) Pruning a channel.

In 2018arXiv180409461W , Wang et al. theoretically proved that increasing the regularization parameter $\lambda_g$ drives the magnitude of the corresponding weights down. The more $\lambda_g$ increases, the more the weight magnitudes are compressed toward zero. Therefore, we can assign different $\lambda_g$ to the weight groups based on their importance to the network. Here, we use the $L_1$ norm as the criterion of importance.

Our goal is to prune $RN_g$ weight groups in the network, where $R$ is the pruning ratio of each layer and $N_g$ is the total number of weight groups in the layer. In other words, we need to prune the $RN_g$ weight groups that rank lowest. We sort the weight groups in ascending order of their $L_1$ norms. To remove the oscillation of ranks from one training iteration to the next, we average the rank over $N$ training iterations to obtain the average rank $r_{avg}$, where $r_n$ is the rank at the $n$-th iteration:

$$r_{avg} = \frac{1}{N} \sum_{n=1}^{N} r_n$$

The final average rank $r$ is obtained by sorting $r_{avg}$ of the different weight groups in ascending order, so that it ranges from $0$ to $N_g - 1$. The update of $\lambda_g$ is determined by the following formula:

$$\lambda_g^{new} = \lambda_g + \Delta\lambda_g$$

Here $\Delta\lambda_g$ is a function of the average rank $r$. We follow the formula proposed by Wang et al. 2018arXiv180409461W :

$$\Delta\lambda_g(r) = \begin{cases} -\dfrac{A}{RN_g}\, r + A, & r \le RN_g \\ -\dfrac{A}{N_g(1-R)-1}\,(r - RN_g), & r > RN_g \end{cases} \tag{2}$$

$A$ is a hyperparameter which controls the speed of convergence. According to Eqn.(2), $\Delta\lambda_g$ is zero when $r = RN_g$: we need to increase the regularization parameters of the weight groups whose ranks are below $RN_g$ to further decrease their $L_1$ norms, and for those with greater $L_1$ norms and ranks above $RN_g$, we need to decrease their regularization parameters to further increase their $L_1$ norms. Thus, we can ensure that exactly $RN_g$ weight groups are pruned at the final stage of the algorithm. Once $\lambda_g^{new}$ is obtained, the weights can be updated through back-propagation deduced from Eqn.(1). Further details can be found in 2018arXiv180409461W .
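The rank averaging and the regularization update described in this section can be sketched in a few lines of plain Python. The sketch assumes a piecewise-linear increment that equals $+A$ at rank $0$, zero at rank $RN_g$ and $-A$ at rank $N_g - 1$, together with a plain additive update of $\lambda_g$. The function names, the tie-breaking by group index and the toy norm history are illustrative assumptions rather than the authors' implementation.

```python
def ranks_ascending(scores):
    """Rank 0 goes to the smallest score, rank len(scores) - 1 to the largest."""
    order = sorted(range(len(scores)), key=lambda g: scores[g])
    ranks = [0] * len(scores)
    for rank, group in enumerate(order):
        ranks[group] = rank
    return ranks


def averaged_final_ranks(norm_history):
    """norm_history[n][g] is the norm of group g at training iteration n."""
    num_iters = len(norm_history)
    avg = [0.0] * len(norm_history[0])
    for norms in norm_history:
        for g, r in enumerate(ranks_ascending(norms)):
            avg[g] += r / num_iters
    # Re-rank the averages so the final ranks are again 0 .. N_g - 1.
    return ranks_ascending(avg)


def delta_lambda(rank, num_groups, prune_ratio, A):
    """Assumed piecewise-linear increment of the regularization parameter."""
    kept = num_groups * (1.0 - prune_ratio)
    if not 0.0 < prune_ratio < 1.0 or kept <= 1.0:
        raise ValueError("need 0 < prune_ratio < 1 and more than one kept group")
    threshold = prune_ratio * num_groups  # R * N_g
    if rank <= threshold:
        return A - A * rank / threshold
    return -A * (rank - threshold) / (kept - 1.0)


def update_lambdas(lambdas, norm_history, prune_ratio, A):
    """Add the rank-dependent increment to every group's parameter."""
    final_ranks = averaged_final_ranks(norm_history)
    n = len(lambdas)
    return [lam + delta_lambda(r, n, prune_ratio, A)
            for lam, r in zip(lambdas, final_ranks)]


# Toy history: 3 iterations, 4 weight groups.
history = [
    [0.9, 0.1, 0.5, 0.3],
    [0.8, 0.2, 0.1, 0.6],
    [0.7, 0.1, 0.4, 0.5],
]
final_ranks = averaged_final_ranks(history)  # [3, 0, 1, 2]
new_lambdas = update_lambdas([0.0] * 4, history, prune_ratio=0.5, A=0.01)
```

In this toy run group 1 has the smallest averaged rank and receives the full increment $A$, while group 0 has the largest and receives $-A$. Whether negative parameters should be clamped at zero is a design choice that this sketch leaves open.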

Method Increased err. (%)
TP (our impl.)
FP (our impl.)
Table 1: The increased error when accelerating C3D on UCF101 (baseline: 79.94%).

Method Increased err. (%)
TP (our impl.)
FP (our impl.)
Table 2: The increased error when accelerating 3D-ResNet18 on UCF101 (baseline: 72.50%).

3 Experiments

Our experiments are carried out with Caffe JiaSheDonEtAl14 . We set the weight decay factor $\lambda$ to be the same as the baseline and set the hyper-parameter $A$ to half of $\lambda$. We only compress the weights in convolutional layers and leave the fully connected layers unchanged because we focus on network acceleration. The pruning ratios of the convolutional layers are set to the same value for convenience. The methods used for comparison are Taylor Pruning (TP) Molchanov2016Pruning and Filter Pruning (FP) Li2016Pruning . For all experiments, the speedup ratio is calculated from the reduction in GFLOPs.
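As a rough illustration of how a speedup ratio can be derived from a FLOP count, the sketch below counts the multiply-accumulate operations of 3D convolution layers and compares the totals before and after pruning. It assumes that removing a fraction of the weight groups in a layer reduces that layer's cost by the same fraction, and that only convolutional layers are counted. The two layer shapes are toy values and the function names are hypothetical.

```python
def conv3d_flops(out_channels, in_channels, kernel, output_size):
    """Multiply-accumulate count of one 3D convolution layer (bias ignored)."""
    k_t, k_h, k_w = kernel
    o_t, o_h, o_w = output_size
    return out_channels * in_channels * k_t * k_h * k_w * o_t * o_h * o_w


def speedup(layers, prune_ratio):
    """Original conv FLOPs divided by the FLOPs left after pruning.

    Assumes the same pruning ratio in every layer and a cost that scales
    linearly with the fraction of weight groups kept.
    """
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    original = sum(conv3d_flops(*layer) for layer in layers)
    remaining = sum((1.0 - prune_ratio) * conv3d_flops(*layer) for layer in layers)
    return original / remaining


# Toy network: (out_channels, in_channels, kernel, output_size) per layer.
toy_layers = [
    (64, 3, (3, 3, 3), (16, 112, 112)),
    (128, 64, (3, 3, 3), (16, 56, 56)),
]
two_x = speedup(toy_layers, prune_ratio=0.5)    # 2.0
four_x = speedup(toy_layers, prune_ratio=0.75)  # 4.0
```

Under these assumptions a uniform pruning ratio $R$ gives a speedup of $1/(1-R)$ regardless of the layer shapes; per-layer ratios would require weighting each layer by its own cost.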

3.1 C3D on UCF101

We apply the proposed method to C3D Du2015Learning , which is composed of 8 convolution layers, 5 max-pooling layers, and 2 fully connected layers. We download the open Caffe model as our pre-trained model, whose accuracy on the UCF101 dataset is 79.94%. UCF101 contains 101 types of actions and a total of 13320 videos with a resolution of $320 \times 240$. All videos are decoded into image files at 25 fps. Frames are resized to $128 \times 171$ and randomly cropped to $112 \times 112$. The frames are then split into non-overlapping 16-frame clips, which are used as input to the networks. The results are shown in Table 1. At different speedup ratios, our approach is always better than TP and FP.
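The clip construction described above can be sketched as follows. The sketch assumes that a trailing group of fewer than 16 frames is dropped, which the text does not specify, and it uses integers as stand-ins for decoded frames. The function name `split_into_clips` is hypothetical.

```python
def split_into_clips(frames, clip_len=16):
    """Split a frame sequence into non-overlapping clips of clip_len frames.

    Frames left over at the end that do not fill a whole clip are dropped
    (an assumption of this sketch).
    """
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    num_clips = len(frames) // clip_len
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(num_clips)]


# A 2-second video decoded at 25 fps gives 50 frames, hence 3 full clips.
frames = list(range(50))
clips = split_into_clips(frames)
```

Each clip would then be resized and cropped frame by frame before being stacked into one network input.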

3.2 3D-ResNet18 on UCF101

We further demonstrate our method on 3D-ResNet18 Du2015Learning , which has 17 convolution layers and 1 fully-connected layer. The network is initially trained on the Sports-1M dataset. We download the model and then fine-tune it on UCF101 for 30000 iterations, obtaining an accuracy of 72.50%. The video preprocessing method is the same as above. The training settings are similar to those of C3D.

Experimental results are shown in Table 2. Our approach suffers only a 0.91% increase in error while achieving acceleration, obtaining better results than TP and FP. Fig.2 shows the loss during the pruning process for different methods. As the number of iterations increases, the losses of TP and FP change dramatically, while the loss of our method remains consistently at a lower level. This is probably because the proposed method imposes gradual regularization, making the network change little by little in the parameter space, while both TP and FP directly prune the less important weights all at once.

Figure 2: The training losses on 3D-ResNet18 for TP, FP and the proposed method.

4 Conclusion

In this paper, we implement a regularization-based method for 3D CNN acceleration. By assigning different regularization parameters to different weight groups according to an importance estimate, we gradually prune weight groups in the network. The proposed method achieves better performance than two other popular methods in this area.


This work is supported by the Fundamental Research Funds for the Central Universities under Grant 2017FZA5007, Natural Science Foundation of Zhejiang Province under Grant LY16F010004 and Zhejiang Public Welfare Research Program under Grant 2016C31062.