1 Introduction
In recent years, convolutional neural networks (CNNs) have developed rapidly and achieved remarkable success in computer vision tasks such as identification, classification and segmentation. However, due to the lack of motion modeling, these image-based end-to-end features cannot be directly applied to videos. In [1, 2], the authors use three-dimensional convolutional networks (3D CNNs) to recognize human actions in videos. Tran et al. proposed a 3D CNN for action recognition which contains 1.75 million parameters [3]. The development of 3D CNNs also brings challenges because of their higher dimensionality, which leads to massive computation and storage consumption and hinders deployment on mobile and embedded devices.

In order to reduce the computational cost, researchers have proposed methods to compress CNN models, including knowledge distillation [4], parameter quantization [5, 6], matrix decomposition [7] and parameter pruning [8]. However, all of the above methods are based on two-dimensional convolution. In this paper, we extend the idea of [9] to 3D CNN acceleration. The main idea is to add group regularization terms to the objective function and prune weight groups gradually, where the regularization parameters of different weight groups are assigned differently according to an importance criterion.
2 The Proposed Method
For a three-dimensional convolutional neural network with $L$ layers, the weights of the $l$-th ($1 \le l \le L$) convolutional layer form a 5D tensor $\mathbf{W}^{(l)} \in \mathbb{R}^{N_l \times C_l \times H_l \times W_l \times D_l}$. Here $N_l$, $C_l$, $H_l$, $W_l$ and $D_l$ are the dimensions along the axes of filter, channel, spatial height, spatial width and spatial depth. The proposed objective function for structured sparsity regularization is defined by Eqn. (1):

$$E(\mathbf{W}) = L(\mathbf{W}) + \lambda R(\mathbf{W}) + \sum_{l=1}^{L} \sum_{g=1}^{G^{(l)}} \lambda_g^{(l)} R_g\big(\mathbf{W}_g^{(l)}\big). \qquad (1)$$

Here $L(\mathbf{W})$ is the loss on data; $R(\mathbf{W})$ is the non-structured regularization ($\ell_2$ norm in this paper); $R_g$ is the structured sparsity regularization on each weight group $\mathbf{W}_g^{(l)}$, and $G^{(l)}$ is the number of weight groups in layer $l$. In [10, 11], the authors used the same $\lambda_g$ for all groups and adopted Group LASSO for $R_g$. Recently, Wang et al. [9] used the squared $\ell_2$ norm for $R_g$ and varied the regularization parameters across groups. We build on top of that approach but extend it from two dimensions to three dimensions.
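As an illustration, the structured term of Eqn. (1) can be sketched in a few lines of NumPy for filter-wise groups. This is our own sketch under an assumed (filter, channel, height, width, depth) layout, not the paper's Caffe implementation, and the function names are hypothetical:

```python
import numpy as np

def group_sparsity_penalty(W, lambdas):
    """Structured term of Eqn. (1) for filter-wise groups: each filter of
    the 5D tensor W is one group, penalized by its own regularization
    parameter via the squared L2 norm."""
    n_filters = W.shape[0]
    groups = W.reshape(n_filters, -1)          # one row per weight group
    sq_norms = np.sum(groups ** 2, axis=1)     # squared L2 norm of each group
    return float(np.dot(lambdas, sq_norms))

def objective(data_loss, W, lam, lambdas):
    """Data loss + non-structured L2 term + per-group structured term."""
    return data_loss + lam * float(np.sum(W ** 2)) + group_sparsity_penalty(W, lambdas)
```

With a single shared `lambdas` value and the (non-squared) group norm, the same skeleton reduces to Group LASSO as used in [10, 11].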
The structure learned is determined by the way $\mathbf{W}^{(l)}$ is split into groups. There are normally filter-wise, channel-wise, shape-wise and depth-wise structured sparsity, corresponding to different ways of grouping [10]. Pruning of the different weight groups of a 3D CNN is shown in Fig. 1.
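The grouping schemes amount to different reshapes of the 5D weight tensor. A small sketch (our illustration, again with an assumed (filter, channel, height, width, depth) layout):

```python
import numpy as np

def split_groups(W, mode):
    """Return a 2D array with one row per weight group of the 5D conv
    tensor W, for three common grouping schemes."""
    n, c, h, w, d = W.shape
    if mode == "filter":    # one group per whole 3D filter
        return W.reshape(n, -1)
    if mode == "channel":   # one group per input channel, across all filters
        return W.transpose(1, 0, 2, 3, 4).reshape(c, -1)
    if mode == "shape":     # one group per weight position, across all filters
        return W.transpose(1, 2, 3, 4, 0).reshape(c * h * w * d, n)
    raise ValueError(f"unknown grouping mode: {mode}")
```

Pruning a whole row of the "filter" or "channel" view removes a filter or an input channel, which is what yields actual speedup on dense hardware.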
In [9], Wang et al. theoretically proved that increasing the regularization parameter $\lambda_g$ tends to reduce the magnitude of the corresponding weights: the more $\lambda_g$ increases, the more the weight magnitudes are compressed towards zero. Therefore, we can assign a different $\lambda_g$ to each weight group based on its importance to the network. Here, we use the $\ell_1$ norm as the criterion of importance.
Our goal is to prune $RG^{(l)}$ weight groups in each layer, where $R$ is the pruning ratio of the layer and $G^{(l)}$ is the total number of weight groups in the layer. In other words, we need to prune the weight groups that rank in the lowest $RG^{(l)}$. We sort the weight groups in ascending order of their $\ell_1$ norms. In order to remove the oscillation of ranks across training iterations, we average the rank $r_i$ over $N$ training iterations to obtain

$$\bar{r} = \frac{1}{N} \sum_{i=1}^{N} r_i.$$
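The ranking and averaging step can be sketched as follows (our illustration; `rank_groups` and `average_rank` are hypothetical names):

```python
import numpy as np

def rank_groups(values):
    """Rank items by ascending value: rank 0 = smallest (least important)."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=int)
    ranks[order] = np.arange(len(values))
    return ranks

def average_rank(rank_history):
    """Average each group's rank over N iterations (rows of rank_history),
    then sort the averages ascending to get the final rank in 0..G-1."""
    return rank_groups(np.mean(rank_history, axis=0))
```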
The final average rank $r$ is obtained by sorting the $\bar{r}$ of the different weight groups in ascending order, so that $r$ ranges from $0$ to $G^{(l)}-1$. The regularization parameter of each group is then updated by

$$\lambda_g^{(l)} \leftarrow \lambda_g^{(l)} + \Delta\lambda_g^{(l)},$$

where the increment $\Delta\lambda_g^{(l)}$ is determined by the following formula:
Here $\Delta\lambda_g^{(l)}$ is a function of the average rank $r$; we follow the form proposed by Wang et al. [9]:

$$\Delta\lambda_g^{(l)} = \begin{cases} A\left(1 - \dfrac{r}{RG^{(l)}}\right), & r \le RG^{(l)}, \\[4pt] -A\,\dfrac{r - RG^{(l)}}{G^{(l)} - RG^{(l)}}, & r > RG^{(l)}, \end{cases} \qquad (2)$$

where $A$ is a hyper-parameter which controls the speed of convergence. According to Eqn. (2), $\Delta\lambda_g^{(l)}$ is zero when $r = RG^{(l)}$. For the weight groups whose ranks are below $RG^{(l)}$, we increase the regularization parameters to further decrease their $\ell_1$ norms; for those with greater norms, whose ranks are above $RG^{(l)}$, we decrease the regularization parameters to further increase their norms. Thus, we can ensure that exactly $RG^{(l)}$ weight groups are pruned at the final stage of the algorithm. Once $\lambda_g^{(l)}$ is obtained, the weights can be updated through back-propagation derived from Eqn. (1). Further details can be found in [9].
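A minimal sketch of this update, reading Eqn. (2) as a piecewise-linear function of the rank (zero at $r = RG$, positive below, negative above). This is our reconstruction; the function names and the clipping of $\lambda$ at zero are our assumptions:

```python
import numpy as np

def delta_lambda(r, G, R, A):
    """Increment of a group's regularization parameter as a function of
    its average rank r: positive for r <= RG (push the group towards
    zero), negative for r > RG, and exactly zero at the threshold RG."""
    t = R * G
    r = np.asarray(r, dtype=float)
    below = A * (1.0 - r / t)          # groups ranked below the threshold
    above = -A * (r - t) / (G - t)     # groups ranked above the threshold
    return np.where(r <= t, below, above)

def update_lambdas(lambdas, ranks, G, R, A):
    """Apply the increments, keeping every lambda non-negative."""
    return np.maximum(lambdas + delta_lambda(ranks, G, R, A), 0.0)
```

Iterating this update drives the $\lambda$ of the bottom $RG$ groups up until their weights are negligible and they can be removed.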

[Table 1: Increased error (%) of C3D on UCF101 at different speedup ratios, comparing TP (our impl.), FP (our impl.) and our method.]

[Table 2: Increased error (%) of 3D-ResNet18 on UCF101, comparing TP (our impl.), FP (our impl.) and our method.]
3 Experiments
Our experiments are carried out with Caffe [12]. We set the weight decay factor $\lambda$ to be the same as the baseline, and set the hyper-parameter $A$ to half of $\lambda$. We only compress the weights of the convolutional layers and leave the fully connected layers unchanged, because we focus on network acceleration. For convenience, the pruning ratio $R$ is set to the same value for all convolutional layers. The methods used for comparison are Taylor Pruning (TP) [13] and Filter Pruning (FP) [14]. For all experiments, the speedup ratio is calculated from the GFLOPs reduction.

3.1 C3D on UCF101
We apply the proposed method to C3D [3], which is composed of 8 convolutional layers, 5 max-pooling layers and 2 fully connected layers. We use the publicly released Caffe model as our pre-trained model, whose accuracy on the UCF101 dataset is 79.94%. UCF101 contains 101 types of actions and a total of 13320 videos with a resolution of 320×240. All videos are decoded into image files at 25 fps. Frames are resized to 128×171 and randomly cropped to 112×112, then split into non-overlapping 16-frame clips which are used as input to the network. The results are shown in Table 1: with different speedup ratios, our approach is consistently better than TP and FP.

3.2 3D-ResNet18 on UCF101

We further demonstrate our method on 3D-ResNet18 [3], which has 17 convolutional layers and 1 fully-connected layer. The network is initially trained on the Sports-1M database. We download the model and fine-tune it on UCF101 for 30000 iterations, obtaining an accuracy of 72.50%. The video pre-processing is the same as above, and the training settings are similar to those of C3D.
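The clip preparation used in both experiments can be sketched as follows. This is our NumPy illustration of the described pipeline, not the actual Caffe data layer; the frame sizes follow the standard C3D preprocessing:

```python
import numpy as np

def make_clips(frames, clip_len=16):
    """Split a decoded video of shape (T, H, W, 3) into non-overlapping
    clips of clip_len frames, dropping the incomplete tail."""
    t = (frames.shape[0] // clip_len) * clip_len
    return frames[:t].reshape(-1, clip_len, *frames.shape[1:])

def random_crop(clip, size=112, rng=None):
    """Randomly crop every frame of a clip to size x size."""
    if rng is None:
        rng = np.random.default_rng()
    _, h, w, _ = clip.shape
    y = int(rng.integers(0, h - size + 1))
    x = int(rng.integers(0, w - size + 1))
    return clip[:, y:y + size, x:x + size, :]
```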
Experimental results are shown in Table 2. Our approach suffers only a 0.91% increase in error while achieving acceleration, obtaining better results than TP and FP. Fig. 2 shows the loss during the pruning process for the different methods. As the number of iterations increases, the losses of TP and FP change dramatically, while the loss of our method consistently remains at a lower level. This is probably because the proposed method imposes regularization gradually, so the network changes little by little in the parameter space, whereas TP and FP directly prune the less important weights all at once.
4 Conclusion
In this paper, we implement a regularization-based method for 3D CNN acceleration. By assigning different regularization parameters to different weight groups according to an importance estimate, we gradually prune the weight groups in the network. The proposed method achieves better performance than two other popular methods in this area.
Acknowledgments
This work is supported by the Fundamental Research Funds for the Central Universities under Grant 2017FZA5007, Natural Science Foundation of Zhejiang Province under Grant LY16F010004 and Zhejiang Public Welfare Research Program under Grant 2016C31062.
References
(1) S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2012.
(2) A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. F. Li, "Large-scale video classification with convolutional neural networks," in Proceedings of the International Conference on Computer Vision and Pattern Recognition, CVPR'14, 2014, pp. 1725–1732.
(3) D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "C3D: Generic features for video analysis," arXiv preprint: 1412.0767, 2014.
(4) G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint: 1503.02531, 2015.
(5) M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Proceedings of the International Conference on Neural Information Processing Systems, NIPS'15, 2015, pp. 3123–3131.
(6) M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Proceedings of the European Conference on Computer Vision, ECCV'16, 2016, pp. 525–542.
(7) X. Zhang, J. Zou, K. He, and J. Sun, "Accelerating very deep convolutional networks for classification and detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 1943–1955, 2015.
(8) S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint: 1510.00149, 2015.
(9) H. Wang, Q. Zhang, Y. Wang, and R. Hu, "Structured deep neural network pruning by varying regularization parameters," arXiv preprint: 1804.09461, 2018.
(10) W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," arXiv preprint: 1608.03665, 2016.
(11) V. Lebedev and V. Lempitsky, "Fast ConvNets using group-wise brain damage," in Proceedings of the International Conference on Computer Vision and Pattern Recognition, CVPR'16, 2016, pp. 2554–2564.
(12) Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint: 1408.5093, 2014.
(13) P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, "Pruning convolutional neural networks for resource efficient transfer learning," arXiv preprint: 1611.06440, 2016.
(14) H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient ConvNets," arXiv preprint: 1608.08710, 2016.