I Introduction
Since Convolutional Neural Networks (CNNs) were popularized by the performance improvements obtained by AlexNet [1] in 2012, neural-network-based computer vision has achieved superhuman performance on many tasks. Mobile devices are becoming an important carrier for deep learning workloads. However, real-time execution is the most critical requirement for deep learning tasks on mobile devices, given their computation and storage resource constraints.
Recently, many efforts [2, 3, 4, 5, 6, 7, 8] have aimed to accelerate CNN execution on off-the-shelf mobile devices, and some of them achieve significant advancements. However, most of these optimizations focus on traditional 2D CNNs in the image domain. On the other hand, 3D CNNs have been proposed for video-domain tasks such as video classification and action recognition/detection [9, 10, 11, 12]. It remains an open problem to execute 3D CNNs on off-the-shelf mobile devices with real-time performance. For example, C3D [13], a mainstream 3D CNN, takes over 2.5 seconds to complete the inference of 16 frames on a representative mobile CPU (Kryo 585 in the Qualcomm Snapdragon platform) with PyTorch Mobile [8], which is clearly far from real-time execution.¹ The extra dimension in 3D convolution (CONV) significantly increases storage size and computation workload compared with 2D CONV.² The large memory footprint of 3D CNN models often exceeds the on-chip cache size of off-the-shelf mobile devices. As a result, 3D CNNs are currently supported by only very few mobile acceleration frameworks (PyTorch Mobile [8] and Alibaba MNN [6]), with relatively low computation efficiency, let alone real-time performance.

¹Real-time performance requires computing 30 frames/second according to the state-of-the-art industry standard.
²2D CONV is a special case of 3D CONV with the temporal dimension size equal to 1.

A natural way to bridge the gap is to turn to model compression techniques, particularly weight pruning [14, 15, 16, 17, 18, 19, 20],
which has demonstrated its efficacy in accelerating 2D CNN execution. Nevertheless, generalizing weight pruning methods from 2D to 3D CNNs is far from straightforward, owing to the higher dimensionality of the weight tensors and thus the larger search space of weight pruning. It is especially challenging to derive the best-suited weight pruning method to achieve real-time performance on off-the-shelf mobile devices. Two fundamental problems need to be solved: the sparsity scheme and the pruning algorithm. The former refers to the regularity in pruning, i.e., the specific structural characteristics of CNNs after pruning. The two representative cases for 2D CNNs are the most flexible, irregular pruning scheme, which can prune arbitrary weights [15, 16], and the computing-platform-friendly filter/channel pruning scheme, which prunes whole filters/channels [14, 20, 19]. The latter refers to the appropriate algorithm for determining the target weights to remove and training the remaining, non-zero weights. For 2D CNNs, there is also rich literature on heuristic pruning [15, 16, 17] and regularization-based pruning algorithms [14, 19, 20].

This work develops the RT3D framework, comprising the derivation of best-suited (structured) sparsity schemes and pruning algorithms for 3D CNNs, together with the design of the associated compiler-aided acceleration for off-the-shelf mobile devices. We propose and investigate two structured sparsity schemes that are highly mobile-acceleration friendly. The first, a vanilla sparsity scheme, achieves sparsity by removing kernel groups in 3D CONV layers. It can achieve straightforward acceleration for on-device inference with the aid of compiler code generation, but it suffers from relatively high accuracy loss, as whole kernel groups are pruned. The second, more optimized one is the kernel group structured (KGS) sparsity scheme. It is a more fine-grained structured sparsity that enjoys higher flexibility and results in higher accuracy under the same pruning rate. Moreover, it is important to note that the KGS sparsity scheme is beyond a mere trade-off between accuracy and mobile performance. In fact, with proper support of compiler code generation, KGS sparsity can achieve almost the same mobile acceleration (e.g., in frames/second) as the vanilla sparsity under the same pruning rate. This is owing to the delicate design of KGS sparsity to match the parallelization mechanism in compiler-assisted mobile acceleration, such that the full on-device parallelism can be exploited.
We further present three pruning algorithms to achieve the proposed structured sparsity schemes for 3D CNNs. The first two, a heuristic algorithm and a regularization-based algorithm, are natural generalizations of state-of-the-art algorithms for 2D CNN weight pruning. However, the former is a greedy algorithm, while the latter suffers from the limitation that all weights are penalized equally even after convergence of the pruning process; both result in potential accuracy loss. To overcome these shortcomings, we propose a novel reweighted regularization pruning algorithm. The basic idea is to systematically and dynamically reweight the penalties, reducing the penalties on weights with large magnitudes (which are likely to be more critical) and increasing the penalties on weights with smaller magnitudes. It possesses other advantages as well, such as not introducing additional hyperparameters and being flexible toward either parameter reduction or FLOPs (floating-point operations) reduction.
Seamlessly integrated with the above innovations, RT3D also provides the first end-to-end, compiler-assisted acceleration framework for 3D CNNs on both mobile CPUs and GPUs (the few prior works are limited to mobile CPUs), and the first to support different structured sparsity schemes. RT3D achieves up to 29.1x speedup in end-to-end inference time compared with current mobile frameworks supporting 3D CNNs, with a moderate 1%-1.5% accuracy loss, on representative 3D CNNs (C3D, R(2+1)D, S3D). The end-to-end inference time for 16 video frames can be within 150 ms.
A brief summary of contributions: (a) sparsity schemes for 3D CNNs that are both flexible and mobile-acceleration friendly, (b) highly effective pruning algorithms to achieve such sparsity schemes, (c) a compiler-assisted mobile acceleration framework, and (d) for the first time, real-time performance of 3D CNNs achieved on off-the-shelf mobile devices using a pure software solution.
II Related Work
II-A Weight Pruning for 2D CNNs
The rich literature on weight pruning for 2D CNNs can be categorized into heuristic pruning algorithms and regularization-based pruning algorithms. The former starts from the early work on irregular, unstructured weight pruning, where arbitrary weights can be pruned. [15] adopts an iterative algorithm to eliminate weights with small magnitudes and performs retraining to regain accuracy. [16] incorporates connection splicing into the pruning process to dynamically recover pruned connections that are found to be important. Later, heuristic pruning algorithms were generalized to the more hardware-friendly structured sparsity schemes. In [17], Transformable Architecture Search (TAS) is adopted to realize the pruned network, and knowledge is transferred from the unpruned network to the pruned version. The work [21] leverages a greedy algorithm to guide the pruning of the current layer with input information of the next layer. The work [19] defines a "neuron importance score" and propagates this score to conduct the weight pruning process.
Regularization-based pruning algorithms, on the other hand, are more mathematics-oriented and have a unique advantage in dealing with structured pruning problems through group Lasso regularization [22, 23]. Early works [14, 24] incorporate ℓ1 or ℓ2 regularization in the loss function to solve filter/channel pruning problems. However, the direct application of regularization terms has one limitation: all weights are penalized equally even after pruning convergence, resulting in potential accuracy loss. A number of subsequent works are dedicated to making the regularization penalty a dynamic and "soft" term. The method in [25] selects filters based on the ℓ2 norm and updates the filters that have been previously pruned. [26, 27] incorporate the advanced optimization framework ADMM (Alternating Direction Method of Multipliers) to achieve a dynamic regularization penalty, thereby improving accuracy. [20] proposes to adopt the Geometric Median, a classic robust estimator of centrality for data in Euclidean spaces. A common limitation of these improved versions is that the pruning rate for each layer needs to be set manually, which is difficult to derive a priori.
Motivated by the prosperous research on neural architecture search (NAS) [28, 29, 30, 31, 32, 33, 34, 35], there is recent research [36, 37, 17, 38] on the automatic search of hyperparameters in weight pruning. This direction is orthogonal to the work proposed here and can be combined with it to achieve even higher performance and a greater degree of automation.
II-B Mobile Acceleration Frameworks of CNNs
TVM [5], TFLite [7], Alibaba Mobile Neural Network (MNN) [6], and PyTorch Mobile (PyTorch) [8] are representative compiler-assisted deep learning acceleration frameworks for mobile devices. They mainly focus on end-to-end acceleration of 2D CNNs. Only MNN and PyTorch support 3D CONV, and only on mobile CPUs (no mobile GPU support); other popular frameworks (such as TVM and TFLite) do not support 3D CONV computation at all. To the best of our knowledge, RT3D is the first end-to-end deep learning acceleration framework for 3D CNNs on both mobile CPUs and GPUs. More than that, it is also the first to support the acceleration of various sparsity schemes of 3D CNNs.
III Structured Sparsity Schemes for 3D CNNs
This section proposes two structured sparsity schemes for 3D CNNs. We focus on the most computation-intensive convolutional (CONV) layers of 3D CNNs. Let W_i denote the 5-dimensional weight tensor of the i-th CONV layer of a 3D CNN, of size N × C × K_w × K_h × K_d, where N is the number of filters; C is the number of input channels; and K_w, K_h, and K_d are the width, height, and depth, respectively, of the 3D CONV kernels. Different from a 2D CONV kernel, a 3D CONV kernel has an additional dimension of kernel depth, making W_i a 5-dimensional tensor.
Figure 1 demonstrates the two proposed structured sparsity schemes for 3D CNNs: the Vanilla Structured Sparsity Scheme and the Kernel Group Structured (KGS) Sparsity Scheme. The weight tensor is first partitioned into groups of kernels along the filter and input-channel dimensions; each kernel group consists of g_f × g_c 3D kernels. The Vanilla sparsity scheme is shown in Figure 1(a), where whole kernel groups are determined to be pruned or not. In contrast, in the proposed KGS sparsity scheme, shown in Figure 1(b), the kernels in the same group are pruned at the same locations. This is better illustrated on the right of Figure 1(b), where the 3D kernels are reshaped into vectors of K_w · K_h · K_d weights each. Consider the g_f × g_c kernels in a group: the weights at the same location (coordinate) across these kernels are determined to be pruned or not together.

The Vanilla sparsity scheme is a relatively straightforward generalization of structured sparsity schemes [14, 36, 21] for 2D CNNs. It can achieve straightforward acceleration for on-device inference with the aid of compiler code generation, but it obviously suffers from relatively high accuracy loss, as whole kernel groups are pruned. The proposed KGS sparsity scheme, on the other hand, is a more fine-grained structured sparsity that enjoys higher flexibility. In fact, the Vanilla sparsity scheme is a special case of KGS sparsity; therefore, one can confidently state that KGS sparsity will result in higher accuracy under the same pruning rate, as long as an effective pruning algorithm is developed and employed.
It is important to note that the KGS sparsity scheme is beyond a mere trade-off between accuracy and mobile performance. In fact, with proper support of compiler code generation, KGS sparsity can achieve almost the same mobile acceleration performance (e.g., in frames/second) as Vanilla sparsity under the same pruning rate. This is owing to the delicate design of KGS sparsity to match compiler-assisted mobile acceleration. For effective mobile acceleration, the whole kernel group is transformed into a matrix multiplication (with the input feature map) [39], as shown in the reshaping step of Figure 1(b). Accordingly, KGS sparsity is equivalent to the removal of whole columns in the weight matrix of a kernel group. The computation overhead of whole-column removal is minor and can be mitigated by compilers, and the remaining computation is still based on full (albeit smaller) matrices. A key observation is that the parallelism degree on off-the-shelf mobile devices is limited, and thus the smaller matrices of remaining weights are still large enough to fully exploit the parallelism provided by mobile devices. As an illustrative example, suppose that a mobile device can execute 10 operations in parallel while the matrix contains 100 remaining operations; then the reduced-size matrix can be executed in 10 iterations, achieving full parallelism. As the hardware parallelism can be fully exploited in both the Vanilla and KGS schemes (if compiler overhead is negligible), the mobile acceleration performance in terms of FLOPs/second is almost the same for both pruning schemes, and so is the frames/second performance under the same pruning rate (and FLOPs count). As a result, the proposed KGS sparsity can fully enjoy the benefit of its high flexibility, in terms of either higher accuracy or a higher pruning rate.
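The column-removal equivalence can be checked with a small NumPy sketch. The group size, kernel volume, output size, and kept-column set below are arbitrary illustrative values: pruning the same columns of a kernel group's weight matrix gives exactly the same result as a dense multiplication over the remaining columns and the corresponding rows of the input patch matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
gf_gc, K, M = 4, 27, 64                 # kernels per group, kernel volume, output positions
Wg = rng.standard_normal((gf_gc, K))    # one kernel group, reshaped to a matrix
X = rng.standard_normal((K, M))         # im2col-style input patch matrix

# KGS pruning zeroes the same columns of Wg for all kernels in the group
keep = np.sort(rng.permutation(K)[: K // 2])   # illustrative: keep half the columns
mask = np.zeros(K)
mask[keep] = 1.0

full = (Wg * mask) @ X                  # sparse weights, full-size matmul
compact = Wg[:, keep] @ X[keep, :]      # remaining weights as a dense, smaller matmul
assert np.allclose(full, compact)
```

The compact form halves the multiply-accumulate count while remaining a dense matrix product, which is what lets the compiler keep full SIMD utilization.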
Please note that the group size needs to be determined for the Vanilla and KGS sparsity schemes in order to achieve the maximum on-device parallelism with low computation overhead. The group size is determined offline through actual mobile testing using synthesized CNN layers; in other words, it does NOT become a hyperparameter in the pruning algorithm. Group sizes that match the SIMD (Single Instruction, Multiple Data) parallelism provided by current mobile CPUs and GPUs are preferred. Such values are large enough to exploit the on-device parallelism and small enough to provide sufficient pruning flexibility and accuracy, as shall be seen in the experimental results.
IV Structured Sparsity Learning Algorithms
This section describes three pruning algorithms to achieve the proposed structured sparsity schemes for 3D CNNs. The first two are natural generalizations of state-of-the-art algorithms for 2D CNN weight pruning, and the last one is specifically designed to address the limitations of the prior two. Consider a general 3D CNN consisting of L convolutional (CONV) layers. Besides the i-th CONV layer weight tensor W_i, the bias is denoted by b_i. The loss function associated with the 3D CNN is denoted by f({W_i}, {b_i}). To achieve the proposed group-wise sparsity schemes, each weight tensor W_i is partitioned into a set of kernel groups G^(i)_{j,k} along the filter and channel dimensions, for j ∈ [N/g_f] and k ∈ [C/g_c], where each group contains g_f × g_c kernels and [n] denotes the integer set {1, ..., n}.
1. Heuristic Pruning Algorithm: As discussed in Section II, prior work has investigated heuristic pruning for 2D CNNs, for both irregular and structured sparsity schemes. The prior works [21, 19] are most relevant here, as we also focus on structured sparsity. Motivated by these works, we assign a similar "neuron importance score" to each kernel group (or to the same location of kernels in a group) and perform pruning on the current layer with input information of the next layer in a back-propagated manner (a similar procedure to [21]). This serves as our heuristic pruning algorithm for the proposed sparsity schemes of 3D CNNs.
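A minimal magnitude-based stand-in for such group scoring is sketched below for the Vanilla scheme. The group-wise L2 norm used as the "score", together with gf, gc, and the keep fraction, are illustrative assumptions; the back-propagated importance scores of [19, 21] are more involved and are not reproduced here.

```python
import numpy as np

def prune_groups_by_score(weight, gf=4, gc=1, keep_groups=0.5):
    """Vanilla-scheme heuristic sketch: score each kernel group by its L2 norm
    and zero out the lowest-scoring groups as whole units."""
    N, C, D, H, W = weight.shape
    w = weight.reshape(N // gf, gf, C // gc, gc, -1)
    score = np.sqrt((w ** 2).sum(axis=(1, 3, 4)))        # one score per group
    n_keep = int(round(score.size * keep_groups))
    cutoff = np.sort(score, axis=None)[-n_keep]          # n_keep-th largest score
    group_mask = (score >= cutoff).astype(weight.dtype)
    w = w * group_mask[:, None, :, None, None]           # zero whole groups
    return w.reshape(N, C, D, H, W)
```

A retraining pass on the surviving weights would follow in practice to regain accuracy.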
2. Regularization-based Pruning Algorithm: adds a regularization term to the loss function to achieve the Vanilla or KGS sparsity scheme. The regularization-based pruning can be formulated as

minimize_{ {W_i}, {b_i} }   f({W_i}, {b_i}) + λ Σ_{i=1}^{L} R(W_i),        (1)

where R(·) is the regularization term for the Vanilla or KGS sparsity and λ is the penalty measuring its importance. Motivated by group Lasso [22], the regularization term can be defined as R(W_i) = Σ_{j,k} ||G^(i)_{j,k}||, where ||·|| denotes a kernel-group norm. We can choose the ℓ1 norm [36], the ℓ2 norm [25, 27], or their combination for this group-wise regularization.
In the following we focus on the KGS sparsity scheme. In more detail, regularization-based pruning for KGS sparsity can be achieved by

R(W_i) = Σ_{j,k} Σ_{t=1}^{K_w·K_h·K_d} || g^(i)_{j,k}(t) ||_2,        (2)

where g^(i)_{j,k}(t) collects the g_f · g_c weights sharing kernel location t across the kernels of group G^(i)_{j,k}, so that an entire location is driven to zero together.
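A group-Lasso-style KGS penalty, i.e., the sum over kernel groups and kernel locations of the ℓ2 norm of the g_f·g_c weights sharing that location, can be evaluated as follows (a sketch with illustrative group sizes gf and gc):

```python
import numpy as np

def kgs_group_lasso(weight, gf=4, gc=1):
    """Sum over kernel groups and kernel locations of the L2 norm of the
    gf*gc weights sharing that location (a KGS group-Lasso penalty sketch)."""
    N, C, D, H, W = weight.shape
    w = weight.reshape(N // gf, gf, C // gc, gc, -1)
    # collapse the gf and gc axes into each location's L2 norm, then sum
    return np.sqrt((w ** 2).sum(axis=(1, 3))).sum()
```

Because each term couples all weights of one location across a group, minimizing this penalty zeroes locations group-wide, which is exactly the KGS pattern.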
3. Reweighted Regularization Pruning Algorithm: As discussed in Section II-A, fixed regularization-based pruning algorithms have a limitation: all weights are penalized equally even after convergence of the pruning process, resulting in potential accuracy loss. We propose a novel reweighted regularization pruning algorithm to overcome this limitation. The basic idea is to systematically and dynamically reweight the penalties. In particular, we reduce the penalties on weights with large magnitudes (which are likely to be more critical) and increase the penalties on weights with smaller magnitudes. This is performed in a systematic, gradual way to avoid the greedy solution of pruning a large number of weights at an early stage. Moreover, our proposed algorithm does not require the pruning rate of each layer to be set manually, which is a limitation of prior works based on ADMM or Geometric-Median-based regularization.
For reweighted regularization, we minimize the following objective function:

minimize_{ {W_i}, {b_i} }   f({W_i}, {b_i}) + λ Σ_{i=1}^{L} Σ_{j,k} Σ_{t} P^(i)_{j,k}(t) ⊙ || g^(i)_{j,k}(t) ||_2^2,        (3)

where ⊙ denotes element-wise multiplication of the penalty collection P_i with the group terms, and g^(i)_{j,k}(t) again denotes the g_f · g_c weights sharing kernel location t in group G^(i)_{j,k}. P_i is the collection of penalty parameters and is updated in every iteration to facilitate the degree of sparsity. In each iteration, the instance of P_i is denoted by P_i^(m), and we update it by setting

P^(i),(m+1)_{j,k}(t) = 1 / ( || g^(i),(m)_{j,k}(t) ||_2^2 + ε ),

where ε is a small positive number avoiding a zero denominator. The reweighted regularization process updates the penalty parameters based on the current weight values, does not incur extra hyperparameters, and has a fast convergence rate as analyzed in [40]. After 3-4 iterations, we prune the weights that have converged to zero and perform a slight retraining of the non-zero weights (for a few epochs) to regain accuracy.
While overcoming the limitation of fixed regularization-based algorithms, the advantages and flexibility of such algorithms are preserved. There is only one major hyperparameter, the penalty λ, without the need to manually decide per-layer pruning rates. Also, similar to fixed regularization algorithms, we can multiply each layer's term in the above objective function by its per-layer FLOPs value. In this way we can target the overall FLOPs reduction, which is more relevant to actual acceleration. In the experiments, we set FLOPs reduction as the optimization target and report the corresponding FLOPs reduction rates and actually measured mobile acceleration.
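The reweighting dynamics can be illustrated on a toy two-variable problem. The targets, λ, ε, and the closed-form inner solve (valid only for this quadratic toy loss) are assumptions of the sketch, not the actual training procedure:

```python
import numpy as np

# Toy reweighted-regularization loop: two "kernel-group norms" w[0], w[1] are
# pulled toward targets of very different magnitude by a quadratic data-fitting
# loss 0.5*(w - targets)^2, while the penalty lam * sum(P * w^2) pushes them
# toward zero. P is refreshed each outer iteration from the current weights,
# so the large group is barely penalized and the small one is driven to zero.
targets = np.array([5.0, 0.2])
w = targets.copy()
lam, eps = 0.5, 1e-3                      # illustrative constants

for _ in range(4):                        # a few reweighting iterations
    P = 1.0 / (w ** 2 + eps)              # penalty update: large weight -> small penalty
    w = targets / (1.0 + 2.0 * lam * P)   # closed-form minimizer of the inner
                                          # problem 0.5*(w - targets)**2 + lam*P*w**2
```

After the loop, w is roughly [4.79, 2e-4]: the small group has effectively converged to zero (and would be pruned), while the large group retains most of its magnitude.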
V Experimental Results
V-A Evaluation on Sparsity Schemes and Pruning Algorithms
Experimental Setup. We test the two proposed structured sparsity schemes (Vanilla and KGS) and the three pruning algorithms on 3D CNN models (including one (2+1)D CNN): C3D [13], R(2+1)D [41], and S3D [42]. Besides the two proposed sparsity schemes, a filter sparsity scheme is also implemented, where filters may be pruned as a whole; this is a direct generalization of the filter pruning of 2D CNNs. The models are all pre-trained on the Kinetics dataset [11] and transferred onto the UCF101 [43] and HMDB51 [44] datasets as the pre-trained dense models. The hyperparameter settings are the same for all pruning algorithms and sparsity schemes for fair comparison. The batch size is fixed to 32, and the video clip length is 16 frames. The initial learning rate is 5e-3 when training the dense model, and is reduced to 2e-4 in weight pruning and retraining for stability. The learning rate is fixed during the pruning process, while in retraining it is adjusted by a scheduler following the cosine function. For all sparsity schemes and pruning algorithms, the total number of epochs is fixed to 240. (Although the reweighted pruning algorithm is iterative, its later iterations require significantly fewer epochs; thus we can set the same total epochs for different algorithms.) For the pruning of all models, we use the best combination of ℓ1 and ℓ2 norms in the regularization term. The penalty factor λ is set to 5e-4. The pruning and retraining processes are carried out with eight NVIDIA GeForce GTX 1080 Ti GPUs and the PyTorch framework.
Results. The pruning results for the C3D and R(2+1)D models on the UCF101 dataset with various pruning algorithms and sparsity schemes are provided in Table I. For each pruning algorithm, the three sparsity schemes are compared under the same pruning rate (FLOPs reduction over the whole model), and KGS results of two pruning configurations are compared. As can be observed in the table, the KGS sparsity scheme consistently outperforms the Vanilla sparsity, and these two schemes both perform better than filter pruning. The reweighted regularization algorithm consistently outperforms the other two pruning algorithms. The advantages of KGS sparsity and reweighted regularization are stated in Sections III and IV. With reweighted regularization and the KGS sparsity scheme, both C3D and R(2+1)D incur only 1%-1.5% accuracy loss under a pruning rate of 2.6x.
Table I. Pruning results of C3D and R(2+1)D on UCF101 with different pruning algorithms and sparsity schemes.

Model | Pruning Algorithm | Sparsity Scheme | Overall FLOPs after Pruning | Pruning Rate of FLOPs | Base Top-1 Accuracy | Pruning Top-1 Accuracy
C3D (299MB) | Heuristic | Filter | 15.2G | 2.6x | 81.6% | 78.6%
C3D (299MB) | Heuristic | Vanilla | 15.2G | 2.6x | 81.6% | 78.8%
C3D (299MB) | Heuristic | KGS | 15.2G | 2.6x | 81.6% | 79.0%
C3D (299MB) | Heuristic | KGS | 10.8G | 3.6x | 81.6% | 78.5%
C3D (299MB) | Regularization | Filter | 15.2G | 2.6x | 81.6% | 78.8%
C3D (299MB) | Regularization | Vanilla | 15.2G | 2.6x | 81.6% | 79.0%
C3D (299MB) | Regularization | KGS | 15.2G | 2.6x | 81.6% | 79.6%
C3D (299MB) | Regularization | KGS | 10.8G | 3.6x | 81.6% | 79.3%
C3D (299MB) | Reweighted Regularization | Filter | 15.2G | 2.6x | 81.6% | 79.3%
C3D (299MB) | Reweighted Regularization | Vanilla | 15.2G | 2.6x | 81.6% | 79.7%
C3D (299MB) | Reweighted Regularization | KGS | 15.2G | 2.6x | 81.6% | 80.5%
C3D (299MB) | Reweighted Regularization | KGS | 10.8G | 3.6x | 81.6% | 80.2%
R(2+1)D (120MB) | Heuristic | Filter | 15.9G | 2.6x | 94.0% | 89.0%
R(2+1)D (120MB) | Heuristic | Vanilla | 15.9G | 2.6x | 94.0% | 89.4%
R(2+1)D (120MB) | Heuristic | KGS | 15.9G | 2.6x | 94.0% | 90.4%
R(2+1)D (120MB) | Heuristic | KGS | 12.7G | 3.2x | 94.0% | 89.9%
R(2+1)D (120MB) | Regularization | Filter | 15.9G | 2.6x | 94.0% | 89.8%
R(2+1)D (120MB) | Regularization | Vanilla | 15.9G | 2.6x | 94.0% | 90.8%
R(2+1)D (120MB) | Regularization | KGS | 15.9G | 2.6x | 94.0% | 91.7%
R(2+1)D (120MB) | Regularization | KGS | 12.7G | 3.2x | 94.0% | 91.3%
R(2+1)D (120MB) | Reweighted Regularization | Filter | 15.9G | 2.6x | 94.0% | 90.5%
R(2+1)D (120MB) | Reweighted Regularization | Vanilla | 15.9G | 2.6x | 94.0% | 91.7%
R(2+1)D (120MB) | Reweighted Regularization | KGS | 15.9G | 2.6x | 94.0% | 92.5%
R(2+1)D (120MB) | Reweighted Regularization | KGS | 12.7G | 3.2x | 94.0% | 92.0%
V-B Evaluation on Mobile Acceleration Performance
Mobile Acceleration Framework Implementation.
We design and implement an end-to-end, compiler-assisted CNN acceleration framework that supports 3D CNNs. Even without any pruning-related optimizations, RT3D is already faster than state-of-the-art CNN execution frameworks (such as MNN and PyTorch Mobile) on mobile CPUs, because it includes more advanced optimizations such as fine-tuned, highly efficient SIMD (Single Instruction, Multiple Data) execution and fine-tuned weight layout organization. Our framework is also the first to support 3D CNN execution on mobile GPUs. It is general, supporting both 2D and 3D CNNs: compared with other popular CNN acceleration frameworks that support 2D CONV (such as TVM and MNN) on standard 2D benchmarks like VGGNet, ResNet, and MobileNet-V2, our framework also yields consistently better performance.
Moreover, RT3D is also the first to support various sparsity schemes, including Filter sparsity and the proposed Vanilla and KGS sparsity. Based on the sparsity scheme, it employs a compiler-based automatic code generation approach to reorganize the model weights, regularize the computations, tune the computation configuration, and generate the optimized model inference code. The framework can automatically generate both optimized CPU (vectorized C++) and GPU (OpenCL) code to support both dense and sparse 3D CNN execution.
Testbed and Evaluation Setup.
The evaluations are conducted on a Samsung Galaxy S20 cell phone with the Qualcomm Snapdragon 865 platform, consisting of a Qualcomm Kryo 585 octa-core CPU and a Qualcomm Adreno 650 GPU. All experiments run 50 times, with 8 threads on the mobile CPU and all pipelines on the mobile GPU. Because different runs do not vary significantly, only the average inference execution time is reported for readability. All models are tuned to their best configurations, e.g., with computational graph optimizations and the best tiling and unrolling sizes. 32-bit floating point is used for CPU runs, and 16-bit floating point for GPU runs; this is the same for both the baseline mobile acceleration frameworks and RT3D for a fair comparison, as quantization is not supported by the baseline frameworks.
Mobile Acceleration Results.
We next evaluate RT3D by comparing it with MNN [6] and PyTorch Mobile (PyTorch) [8]. (Other popular mobile CNN acceleration frameworks, such as TVM and TFLite, do not support 3D CNNs.) Table II compares the end-to-end 3D CNN inference time (latency). RT3D supports both dense (original) and sparse 3D CNNs on both mobile CPU and mobile GPU; PyTorch supports dense models on CPU only, and MNN supports dense C3D on CPU only. For sparse models, RT3D uses the models pruned by the reweighted regularization pruning algorithm with KGS sparsity for C3D, R(2+1)D, and S3D (the base accuracy of S3D is 90.6%). In the table, the RT3D speedups are computed with respect to PyTorch. RT3D outperforms MNN and PyTorch on mobile CPU in all cases, and RT3D on mobile GPU performs even better. For example, for C3D, the fully optimized RT3D (Sparse) outperforms PyTorch and MNN with speedups of 7.1x and 2.7x on CPU, and 17.9x and 6.7x on GPU, respectively. Notably, on mobile GPU, the fully optimized RT3D can infer 16 frames with C3D, R(2+1)D, and S3D within 142 ms, 141 ms, and 293 ms, respectively, achieving real-time execution (i.e., 30 frames per second) of 3D CNNs on mobile devices.
Importantly, although RT3D's dense implementations are already fully optimized, our sparse implementations, especially on mobile GPU, can fully transform the pruning rate of FLOPs into inference latency speedup. For example, on C3D, from RT3D (Dense) to RT3D (Sparse) on GPU, the improvement in inference latency is 3.4x, on par with the FLOPs pruning rate of the sparse model. This validates the statement in Section III that the proposed KGS sparsity scheme can fully exploit the parallelism available on the device. Moreover, 3D CONV is memory-intensive, bounded by both memory bandwidth and latency (more severe on mobile GPUs due to their even more limited cache capacity), and our pruning/compilation co-design is able to mitigate this issue while incurring negligible overhead; our cache access count results validate this.
Ablation Study.
We also compare the two sparsity schemes, Vanilla and KGS, in terms of pruning rate and inference latency on mobile devices while controlling for the same pruning top-1 accuracy (Table III). The KGS scheme achieves both a higher pruning rate (in FLOPs) and lower inference latency under the same pruning accuracy on both C3D and R(2+1)D, owing to KGS's high flexibility and seamless match with compiler-level optimizations.
Table II. End-to-end inference latency of MNN, PyTorch Mobile, and RT3D; speedups are relative to PyTorch Mobile on CPU.

Model | MNN [6] CPU (ms) | PyTorch [8] CPU (ms) | RT3D (Dense) CPU (ms) / Speedup | RT3D (Dense) GPU (ms) / Speedup | RT3D (Sparse) CPU (ms) / Speedup | RT3D (Sparse) GPU (ms) / Speedup
C3D | 948 | 2544 | 902 / 2.8x | 488 / 5.2x | 357 / 7.1x | 142 / 17.9x
R(2+1)D | - | 4104 | 1074 / 3.8x | 513 / 8.0x | 391 / 10.5x | 141 / 29.1x
S3D | - | 6617 | 1139 / 5.8x | 565 / 11.7x | 611 / 10.8x | 293 / 22.6x
Table III. Vanilla vs. KGS sparsity under the same pruning top-1 accuracy.

Model | Sparsity Scheme | Base Top-1 Accuracy | Pruning Top-1 Accuracy | FLOPs after Pruning | Pruning Rate of FLOPs | CPU Latency (ms) | GPU Latency (ms)
C3D | Vanilla | 81.6% | 80.0% | 16.4G | 2.4x | 525 | 236
C3D | KGS | 81.6% | 80.0% | 9.7G | 4.0x | 329 | 134
R(2+1)D | Vanilla | 94.0% | 91.8% | 15.5G | 2.5x | 523 | 225
R(2+1)D | KGS | 94.0% | 91.8% | 10.2G | 4.0x | 360 | 127
VI Conclusion
This paper presents RT3D, a mobile acceleration framework for 3D CNNs that consists of two novel, mobile-friendly structured sparsity schemes (Vanilla and KGS) with best-suited pruning algorithms, and a compiler-assisted code generation framework that transforms pruning benefits into performance gains. The evaluation results show that RT3D outperforms two state-of-the-art acceleration frameworks with speedups of up to 29.1x. RT3D can infer 16 video frames within 150 ms, for the first time achieving real-time inference of 3D CNNs on off-the-shelf mobile devices with a pure software solution.
References

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NeurIPS), 2012, pp. 1097–1105.
[2] S. Han, H. Shen, M. Philipose, S. Agarwal, A. Wolman, and A. Krishnamurthy, "Mcdnn: An approximation-based execution framework for deep stream processing under resource constraints," in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys). ACM, 2016, pp. 123–136.
 [3] S. Yao, S. Hu, Y. Zhao, A. Zhang, and T. Abdelzaher, “Deepsense: A unified deep learning framework for timeseries mobile sensing data processing,” in Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 351–360.
 [4] L. N. Huynh, Y. Lee, and R. K. Balan, “Deepmon: Mobile gpubased deep learning framework for continuous vision applications,” in Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys). ACM, 2017, pp. 82–95.
 [5] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “Tvm: An automated endtoend optimizing compiler for deep learning,” in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 578–594.
 [6] https://github.com/alibaba/MNN.
 [7] https://www.tensorflow.org/mobile/tflite/.
 [8] https://pytorch.org/mobile/home.
[9] S. Ji, W. Xu, M. Yang, and K. Yu, "3d convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 35, no. 1, pp. 221–231, 2012.
 [10] Z. Wang, Q. Lan, H. He, and C. Zhang, “Winograd algorithm for 3d convolution neural networks,” in International Conference on Artificial Neural Networks (ICANN). Springer, 2017, pp. 609–616.

[11] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6299–6308.
[12] Z. Qiu, T. Yao, and T. Mei, "Learning spatio-temporal representation with pseudo-3d residual networks," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5533–5541.
 [13] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489–4497.
 [14] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in neural information processing systems (NeurIPS), 2016, pp. 2074–2082.
 [15] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” in International Conference on Learning Representations (ICLR), 2016.
 [16] Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient dnns,” in Advances in neural information processing systems (NeurIPS), 2016, pp. 1379–1387.
 [17] X. Dong and Y. Yang, “Network pruning via transformable architecture search,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 759–770.
 [18] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu, “Discriminationaware channel pruning for deep neural networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2018, pp. 875–886.
 [19] R. Yu, A. Li, C.F. Chen, J.H. Lai, V. I. Morariu, X. Han, M. Gao, C.Y. Lin, and L. S. Davis, “Nisp: Pruning networks using neuron importance score propagation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 9194–9203.
 [20] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang, “Filter pruning via geometric median for deep convolutional neural networks acceleration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4340–4349.
 [21] J.-H. Luo, J. Wu, and W. Lin, “Thinet: A filter level pruning method for deep neural network compression,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5058–5066.
 [22] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.
 [23] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, “Rethinking the value of network pruning,” arXiv preprint arXiv:1810.05270, 2018.
 [24] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1389–1397.

[25] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, “Soft filter pruning for accelerating deep convolutional neural networks,” in International Joint Conference on Artificial Intelligence (IJCAI), 2018.
 [26] T. Zhang, S. Ye, Y. Zhang, Y. Wang, and M. Fardad, “Systematic weight pruning of dnns using alternating direction method of multipliers,” arXiv preprint arXiv:1802.05747, 2018.
 [27] T. Li, B. Wu, Y. Yang, Y. Fan, Y. Zhang, and W. Liu, “Compressing convolutional neural networks via factorized convolutional filters,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3977–3986.

[28] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” in International Conference on Learning Representations (ICLR), 2017.
 [29] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, “Mnasnet: Platform-aware neural architecture search for mobile,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2820–2828.
 [30] B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing neural network architectures using reinforcement learning,” in International Conference on Learning Representations (ICLR), 2017.

[31] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolution for image classifier architecture search,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 4780–4789.
 [32] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, and Q. Le, “Understanding and simplifying one-shot architecture search,” in Proceedings of the 35th International Conference on Machine Learning (ICML), 2018, pp. 550–559.
 [33] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean, “Efficient neural architecture search via parameters sharing,” in Proceedings of the 35th International Conference on Machine Learning (ICML), 2018, pp. 4095–4104.
 [34] S. Xie, H. Zheng, C. Liu, and L. Lin, “Snas: stochastic neural architecture search,” in International Conference on Learning Representations (ICLR), 2019.
 [35] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer, “Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 10734–10742.
 [36] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2736–2744.
 [37] Y. He, J. Lin, Z. Liu, H. Wang, L.J. Li, and S. Han, “Amc: Automl for model compression and acceleration on mobile devices,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 784–800.
 [38] N. Liu, X. Ma, Z. Xu, Y. Wang, J. Tang, and J. Ye, “Autocompress: An automatic dnn structured pruning framework for ultra-high compression rates,” arXiv preprint arXiv:1907.03141, 2019.
 [39] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, “cudnn: Efficient primitives for deep learning,” arXiv preprint arXiv:1410.0759, 2014.
 [40] E. J. Candes, M. B. Wakin, and S. P. Boyd, “Enhancing sparsity by reweighted ℓ1 minimization,” Journal of Fourier Analysis and Applications, vol. 14, no. 5–6, pp. 877–905, 2008.
 [41] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6450–6459.
 [42] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, “Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 305–321.
 [43] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
 [44] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “Hmdb: a large video database for human motion recognition,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2556–2563.