Achieving Real-Time Execution of 3D Convolutional Neural Networks on Mobile Devices

Mobile devices are becoming an important carrier for deep learning tasks, as they are being equipped with powerful, high-end mobile CPUs and GPUs. However, it remains a challenging task to execute 3D Convolutional Neural Networks (CNNs) with real-time performance in addition to high inference accuracy: the more complex model structure and higher model dimensionality overwhelm the available computation/storage resources on mobile devices. A natural remedy is to turn to deep learning weight pruning techniques. However, the direct generalization of existing 2D CNN weight pruning methods to 3D CNNs is not ideal for fully exploiting mobile parallelism while achieving high inference accuracy. This paper proposes RT3D, a model compression and mobile acceleration framework for 3D CNNs that seamlessly integrates neural network weight pruning and compiler code generation techniques. We propose and investigate two structured sparsity schemes, i.e., the vanilla structured sparsity and the kernel group structured (KGS) sparsity, that are mobile acceleration friendly. The vanilla sparsity removes whole kernel groups, while KGS sparsity is a more fine-grained structured sparsity that enjoys higher flexibility while exploiting full on-device parallelism. We propose a reweighted regularization pruning algorithm to achieve the proposed sparsity schemes. The inference time speedup due to sparsity approaches the pruning rate of the whole model in FLOPs (floating-point operations). RT3D demonstrates up to 29.1× speedup in end-to-end inference time compared with current mobile frameworks supporting 3D CNNs, with moderate 1%–1.5% accuracy loss, when executing the representative C3D and R(2+1)D models on a cellphone; the end-to-end inference time for 16 video frames can be within 150 ms. For the first time, real-time execution of 3D CNNs is achieved on off-the-shelf mobile devices.


I Introduction

Since Convolutional Neural Networks (CNNs) were exemplified by the performance improvements obtained by AlexNet [1] in 2012, neural network based computer vision has achieved superhuman performance. Mobile devices are becoming an important carrier for deep learning tasks. However, for deep learning tasks on mobiles, real-time execution is the most critical requirement given the computation/storage resource constraints.

Recently, many efforts [2, 3, 4, 5, 6, 7, 8] aim to accelerate CNN execution on off-the-shelf mobile devices, and some of them achieve significant advancements. However, most of these optimizations focus on traditional 2D CNNs in the image domain. On the other hand, 3D CNNs have been proposed for video domain tasks such as video classification and action recognition/detection [9, 10, 11, 12]. It is still an open problem to execute 3D CNNs on off-the-shelf mobile devices with real-time performance. For example, C3D [13], a mainstream 3D CNN, takes over 2.5 seconds to complete the inference (of 16 frames) on a representative mobile CPU (Kryo 585 in the Qualcomm Snapdragon platform) with PyTorch Mobile [8], which is clearly far from real-time execution (real-time performance requires computing 30 frames/second according to the state-of-the-art industry standard). The extra dimension in 3D convolution (CONV) significantly increases the storage size and computation workload compared with 2D CONV (2D CONV is a special case of 3D CONV with the temporal dimension size equal to 1). The large memory footprint of 3D CNN models often exceeds the on-chip cache size of off-the-shelf mobile devices. As a result, 3D CNNs are currently supported only by very few mobile acceleration frameworks (PyTorch Mobile [8] and Alibaba MNN [6]), with relatively low computation efficiency, let alone real-time performance.

A natural way to bridge the gap is to turn to model compression techniques, particularly weight pruning [14, 15, 16, 17, 18, 19, 20], which has demonstrated its efficacy in accelerating 2D CNN execution. Nevertheless, generalizing weight pruning methods from 2D to 3D CNNs is far from straightforward owing to the higher dimensionality of weight tensors and thus the larger search space of weight pruning. It is especially challenging to derive the best-suited weight pruning method in order to achieve real-time performance on off-the-shelf mobile devices. Two fundamental problems need to be solved: the sparsity scheme and the pruning algorithm. The former refers to the regularity in pruning, i.e., the specific structural characteristics of CNNs after pruning. The two representative cases for 2D CNNs are the most flexible, irregular pruning scheme that can prune arbitrary weights [15, 16], and the computing platform-friendly filter/channel pruning scheme that prunes whole filters/channels [14, 20, 19]. The latter refers to the appropriate algorithm to determine the target weights to remove and to train the remaining, non-zero weights. For 2D CNNs, there is also rich literature on heuristic pruning [15, 16, 17] and regularization-based pruning algorithms [14, 19, 20].

This work develops the RT3D framework, including the derivation of the best-suited (structured) sparsity scheme and pruning algorithm for 3D CNNs, and the design of the associated compiler-aided acceleration, for off-the-shelf mobile devices. We propose and investigate two structured sparsity schemes that are highly mobile acceleration friendly. The first, the vanilla sparsity scheme, achieves sparsity by removing kernel groups in 3D CONV layers. It can achieve straightforward acceleration for on-device inference with the aid of compiler code generation, but it suffers from relatively high accuracy loss as whole kernel groups are pruned. The second, more optimized one is the kernel group structured (KGS) sparsity scheme. It is a more fine-grained structured sparsity that enjoys higher flexibility and results in higher accuracy under the same pruning rate. Moreover, it is important to note that the KGS sparsity scheme is beyond a mere tradeoff between accuracy and mobile performance. In fact, with proper support of compiler code generation, the KGS sparsity can achieve almost the same mobile acceleration (e.g., in frames/second) as the vanilla sparsity under the same pruning rate. This is owing to the delicate design of KGS sparsity to match the parallelization mechanism in compiler-assisted mobile acceleration, such that the full on-device parallelism can be exploited.

We further present three pruning algorithms to achieve the proposed structured sparsity schemes for 3D CNNs. The first two, i.e., the heuristic algorithm and the regularization-based algorithm, are natural generalizations of state-of-the-art algorithms for 2D CNN weight pruning. However, they are either greedy algorithms or suffer from the limitation that all weights are penalized equally even after convergence of the pruning process. Both result in potential accuracy loss. To overcome these shortcomings, we propose a novel reweighted regularization pruning algorithm. The basic idea is to systematically and dynamically reweight the penalties, reducing the penalties on weights with large magnitudes (which are likely to be more critical) and increasing the penalties on weights with smaller magnitudes. It possesses other advantages, such as not introducing additional hyperparameters and being flexible for either parameter reduction or FLOPs (floating-point operation) reduction.

Seamlessly integrating the above innovations, RT3D also develops the first end-to-end, compiler-assisted acceleration framework of 3D CNNs on both mobile CPUs and GPUs (the few prior works are limited to mobile CPUs), and is also the first to support different structured sparsity schemes. RT3D achieves up to 29.1× speedup in end-to-end inference time compared with current mobile frameworks supporting 3D CNNs, with moderate 1%–1.5% accuracy loss, on representative 3D CNNs (C3D, R(2+1)D, S3D). The end-to-end inference time for 16 video frames can be within 150 ms.

A brief summary of contributions: (a) sparsity schemes for 3D CNNs that are both flexible and mobile acceleration friendly; (b) a highly effective pruning algorithm to achieve such sparsity schemes; (c) a compiler-assisted mobile acceleration framework; and (d) for the first time, real-time performance of 3D CNNs achieved on off-the-shelf mobile devices using a pure software solution.

II Related Work

II-A Weight Pruning for 2D CNNs

The rich literature on weight pruning for 2D CNNs can be categorized into heuristic pruning algorithms and regularization-based pruning algorithms. The former starts from the early work on irregular, unstructured weight pruning, where arbitrary weights can be pruned. [15] adopts an iterative algorithm to eliminate weights with small magnitudes and performs retraining to regain accuracy. [16] incorporates connection splicing into the pruning process to dynamically recover pruned connections that are found to be important. Later, heuristic pruning algorithms were generalized to the more hardware-friendly structured sparsity schemes. In [17], Transformable Architecture Search (TAS) is adopted to realize the pruned network, and knowledge is transferred from the unpruned network to the pruned version. The work [21] leverages a greedy algorithm to guide the pruning of the current layer with input information of the next layer. The work [19] defines a "neuron importance score" and propagates this score to conduct the weight pruning process.

Regularization-based pruning algorithms, on the other hand, are more mathematics-oriented and have the unique advantage of dealing with structured pruning problems through group Lasso regularization [22, 23]. Early works [14, 24] incorporate $\ell_1$ or $\ell_2$ regularization in the loss function to solve filter/channel pruning problems. However, there is one limitation of the direct application of regularization terms: all weights are penalized equally even after pruning convergence, resulting in potential accuracy loss. A number of subsequent works are dedicated to making the regularization penalty a dynamic and "soft" term. The method in [25] selects filters based on the $\ell_2$ norm and updates the filters that have been previously pruned. [26, 27] incorporate the advanced optimization solution framework ADMM (Alternating Direction Method of Multipliers) to achieve a dynamic regularization penalty, thereby improving accuracy. [20] proposes to adopt the Geometric Median, a classic robust estimator of centrality for data in Euclidean spaces. A common limitation of these improved versions is that the pruning rate for each layer needs to be set manually, which is difficult to derive a priori.

Motivated by the prosperous research on neural architecture search (NAS) [28, 29, 30, 31, 32, 33, 34, 35], there is recent research [36, 37, 17, 38] on the automatic search of hyperparameters in weight pruning. This direction is orthogonal to the research proposed in this work, and the two can be combined to achieve even higher performance and a higher degree of automation.

II-B Mobile Acceleration Frameworks of CNNs

TVM [5], TFLite [7], Alibaba Mobile Neural Network (MNN) [6], and PyTorch Mobile (PyTorch) [8] are representative compiler-assisted deep learning acceleration frameworks for mobile devices. They mainly focus on end-to-end acceleration of 2D CNNs. Only MNN and PyTorch support 3D CONV, and only on mobile CPUs (no mobile GPU support); other popular frameworks (like TVM and TFLite) do not support 3D CONV computation at all. To the best of our knowledge, RT3D is the first end-to-end deep learning acceleration framework for 3D CNNs on both mobile CPUs and GPUs. Moreover, it is also the first to support the acceleration of various sparsity schemes of 3D CNNs.

III Structured Sparsity Schemes for 3D CNNs

This section proposes two structured sparsity schemes for 3D CNNs. We focus on the most computation-intensive convolutional (CONV) layers of 3D CNNs. Let $\mathcal{W}_l \in \mathbb{R}^{N_l \times C_l \times W_l \times H_l \times D_l}$ denote the 5-dimensional weight tensor of the $l$-th CONV layer of a 3D CNN, where $N_l$ is the number of filters; $C_l$ is the number of input channels; and $W_l$, $H_l$, and $D_l$ are the width, height, and depth, respectively, of the 3D CONV kernels. Different from a 2D CONV kernel, a 3D CONV kernel has an additional dimension of kernel depth, making $\mathcal{W}_l$ a 5-dimensional tensor.
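For concreteness, here is a minimal PyTorch sketch (layer sizes are made up for illustration) showing such a 5-dimensional weight tensor; note that PyTorch happens to order the kernel dimensions as (depth, height, width):

```python
import torch.nn as nn

# Hypothetical 3D CONV layer: 64 filters, 32 input channels, 3x3x3 kernels.
conv3d = nn.Conv3d(in_channels=32, out_channels=64, kernel_size=3)

# The weight is a 5-dimensional tensor:
# (filters, input channels, kernel depth, kernel height, kernel width).
print(conv3d.weight.shape)  # torch.Size([64, 32, 3, 3, 3])
```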

Fig. 1: Proposed structured sparsity schemes: (a) the Vanilla structured sparsity and (b) the Kernel Group Structured (KGS) sparsity for 3D CNNs. A CONV weight tensor is first split into multiple kernel groups, each consisting of $g$ 3D kernels. Within the same kernel group, kernels are pruned at the same locations (marked by the grey entries).

Figure 1 demonstrates the two proposed structured sparsity schemes for 3D CNNs: the Vanilla structured sparsity scheme and the Kernel Group Structured (KGS) sparsity scheme. The weight tensor $\mathcal{W}_l$ is first partitioned into groups of kernels along the filter and input channel dimensions; each kernel group consists of $g$ 3D kernels. The Vanilla sparsity scheme is shown in Figure 1(a), where whole kernel groups are determined to be pruned or not. In contrast, in our proposed KGS sparsity scheme, shown in Figure 1(b), the kernels in the same group are pruned at the same locations. This is illustrated better on the right of Figure 1(b), where the 3D kernels are reshaped into vectors with $W_l H_l D_l$ weights each. Consider the $g$ kernels in a group, i.e., kernels $\mathcal{K}_1, \mathcal{K}_2, \ldots, \mathcal{K}_g$. The weights at the same location $t$ in these kernels, i.e., $[\mathcal{K}_1]_t, [\mathcal{K}_2]_t, \ldots, [\mathcal{K}_g]_t$, are determined to be pruned or not together, where $t$ denotes the same location (coordinate) in each kernel.
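As a minimal illustration of the two schemes' mask structures (a NumPy sketch with arbitrary sizes; the group size $g$ and the random keep/prune decisions are purely illustrative):

```python
import numpy as np

g = 4                          # kernels per group (illustrative)
K = 3 * 3 * 3                  # weights per flattened 3D kernel
group = np.random.randn(g, K)  # one kernel group, one kernel per row

# Vanilla sparsity: the whole group is kept or pruned together.
vanilla_mask = np.zeros((g, K), dtype=bool)  # here: group pruned entirely

# KGS sparsity: each of the K locations is kept or pruned identically
# across all g kernels of the group, i.e., a per-column decision.
keep_location = np.random.rand(K) > 0.5      # illustrative decision
kgs_mask = np.broadcast_to(keep_location, (g, K))

pruned_group = group * kgs_mask  # pruned locations become zero columns
```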

The Vanilla sparsity scheme is a relatively straightforward generalization of structured sparsity schemes [14, 36, 21] for 2D CNNs. It can achieve straightforward acceleration for on-device inference with the aid of compiler code generation, but it will obviously suffer from relatively high accuracy loss as whole kernel groups are pruned. On the other hand, the proposed KGS sparsity scheme is a more fine-grained structured sparsity that enjoys higher flexibility. In fact, the Vanilla sparsity scheme is just a special case of KGS sparsity (in which all locations of a group are pruned together), and therefore the KGS sparsity will result in a higher accuracy under the same pruning rate, as long as an effective pruning algorithm is developed and employed.

It is important to note that the KGS sparsity scheme is beyond a mere tradeoff between accuracy and mobile performance. In fact, with proper support of compiler code generation, the KGS sparsity can achieve almost the same mobile acceleration performance (e.g., in frames/second) as the Vanilla sparsity under the same pruning rate. This is owing to the delicate design of KGS sparsity to match compiler-assisted mobile acceleration. For effective mobile acceleration, the whole kernel group is transformed into a matrix multiplication (with the input feature map) [39], as shown in the reshaping step of Figure 1(b). Accordingly, the KGS sparsity is equivalent to removing whole columns in the weight matrix of a kernel group. The computation overhead of whole-column removal is minor and can be mitigated by compilers, and the remaining computation is still based on full matrices (albeit smaller ones). A key observation is that the parallelism degree on off-the-shelf mobile devices is limited, and thus the smaller matrices of remaining weights are still large enough to fully exploit the parallelism provided by mobile devices. As an illustrative example, suppose that the mobile device can execute 10 operations in parallel while the matrix contains 100 remaining operations; then the reduced-size matrix can be executed in 10 fully parallel iterations. As the hardware parallelism can be fully exploited in both the Vanilla and KGS schemes (if the compiler overhead is negligible), the mobile acceleration performance in terms of FLOPs/second is almost the same for both pruning schemes, and so is the frames/second performance under the same pruning rate (and FLOPs count). As a result, the proposed KGS sparsity can fully enjoy the benefit of its high flexibility in terms of higher accuracy or higher pruning rate, as sketched below.
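A toy NumPy sketch of this matrix view (an assumed layout, not the framework's actual code): pruned locations form whole zero columns, so the surviving columns can be gathered once, and the remaining work is a smaller, fully dense multiplication that keeps the parallel units busy.

```python
import numpy as np

g, K, P = 4, 27, 100             # 4 kernels of 3x3x3; 100 input patches
weights = np.random.randn(g, K)  # kernel group as a g x K matrix
x_cols = np.random.randn(K, P)   # im2col-style input matrix (toy)

keep = np.random.rand(K) > 0.6   # KGS decision: keep/prune per column
weights[:, ~keep] = 0.0          # KGS pruning = zeroing whole columns

# Compiler-style execution: drop zero columns (and the matching input
# rows) once; what remains is a full dense matmul on smaller matrices.
out = weights[:, keep] @ x_cols[keep, :]
assert np.allclose(out, weights @ x_cols)  # identical result, less work
```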

Please note that the group size $g$ needs to be determined in the Vanilla and KGS sparsity schemes in order to achieve the maximum on-device parallelism with low computation overhead. The group size is determined offline with actual mobile testing using synthesized CNN layers; in other words, it does NOT become a hyperparameter in the pruning algorithm. Small group sizes matching the SIMD (Single Instruction, Multiple Data) parallelism provided by current mobile CPUs and GPUs are preferred: such values are large enough to exploit the on-device parallelism and small enough to provide sufficient pruning flexibility and accuracy, as shall be seen in the experimental results.

IV Structured Sparsity Learning Algorithms

This section describes three pruning algorithms to achieve the proposed structured sparsity schemes for 3D CNNs. The first two are natural generalizations of state-of-the-art algorithms for 2D CNN weight pruning, and the last one is specifically designed to address the limitations of the prior two. Consider a general 3D CNN consisting of $L$ convolutional (CONV) layers. Besides the $l$-th CONV layer weight tensor $\mathcal{W}_l$, the bias is denoted by $\mathbf{b}_l$. The loss function associated with the 3D CNN is denoted by $f\big(\{\mathcal{W}_l\}_{l=1}^{L}, \{\mathbf{b}_l\}_{l=1}^{L}\big)$. To achieve the proposed group-wise sparsity schemes, the weight tensor $\mathcal{W}_l$ is partitioned into a set of kernel groups $\mathcal{G}_{l,i,j}$ along the dimensions of filters and channels, for $i \in [p_l]$ and $j \in [q_l]$, where $p_l$ and $q_l$ are the numbers of groups along the filter and channel dimensions, respectively, and $[n]$ denotes the integer set $\{1, 2, \ldots, n\}$.

1. Heuristic Pruning Algorithm: As discussed in Section II, prior work has investigated heuristic pruning for 2D CNNs, for both irregular and structured sparsity schemes. The prior works [21, 19] are the most relevant to this work, as we also focus on structured sparsity. Motivated by these works, we assign a similar "neuron importance score" to each kernel group (or to the same location of kernels in the group), and perform pruning of the current layer with input information of the next layer in a back-propagated manner (a procedure similar to [21]). This serves as our heuristic pruning algorithm for the proposed sparsity schemes of 3D CNNs, sketched below.
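A rough sketch of this step (using a plain group-magnitude proxy in place of the propagated importance score, so the scoring function here is an assumption; the Vanilla scheme is shown for brevity):

```python
import numpy as np

def prune_groups_heuristic(groups, prune_ratio):
    """groups: list of (g, K) kernel-group matrices of one layer.
    Zeroes out the lowest-scoring whole groups (Vanilla scheme)."""
    scores = [np.linalg.norm(grp) for grp in groups]  # proxy importance
    n_prune = int(len(groups) * prune_ratio)
    prune_idx = set(np.argsort(scores)[:n_prune])     # weakest groups
    return [np.zeros_like(grp) if i in prune_idx else grp
            for i, grp in enumerate(groups)]

# Toy usage: 16 groups of 4 kernels (3x3x3 each), prune half of them.
layer_groups = [np.random.randn(4, 27) for _ in range(16)]
pruned_groups = prune_groups_heuristic(layer_groups, prune_ratio=0.5)
```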

2. Regularization-based Pruning Algorithm: This algorithm adds a regularization term to the loss function to achieve the Vanilla or KGS sparsity scheme. The regularization-based pruning can then be formulated as

$$\min_{\{\mathcal{W}_l\}, \{\mathbf{b}_l\}} \; f\big(\{\mathcal{W}_l\}_{l=1}^{L}, \{\mathbf{b}_l\}_{l=1}^{L}\big) + \lambda \sum_{l=1}^{L} R(\mathcal{W}_l), \tag{1}$$

where $R(\cdot)$ is the regularization term for the Vanilla or KGS sparsity and $\lambda$ is the penalty measuring its importance. Motivated by group Lasso [22], the regularization term can be defined as $R(\mathcal{W}_l) = \sum_{i,j} \|\mathcal{G}_{l,i,j}\|$, where $\|\cdot\|$ denotes the kernel group norm. We can choose the $\ell_1$ norm [36], the $\ell_2$ norm [25, 27], or their combination for this group-wise regularization.

In the following we focus on the KGS sparsity scheme. In more detail, the regularization-based pruning can be achieved by

$$\min_{\{\mathcal{W}_l\}, \{\mathbf{b}_l\}} \; f\big(\{\mathcal{W}_l\}_{l=1}^{L}, \{\mathbf{b}_l\}_{l=1}^{L}\big) + \lambda \sum_{l=1}^{L} \sum_{i,j} \sum_{t=1}^{W_l H_l D_l} \Big\|\big[\,[\mathcal{K}_1]_t, [\mathcal{K}_2]_t, \ldots, [\mathcal{K}_g]_t\,\big]\Big\|, \tag{2}$$

where $\mathcal{K}_1, \ldots, \mathcal{K}_g$ denote the (reshaped) kernels of group $\mathcal{G}_{l,i,j}$, so that each penalized vector collects the weights at the same location $t$ across all kernels of a group.
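A sketch of this KGS regularizer in PyTorch (the $\ell_2$ case; the reshape assumes that adjacent filters/channels form a group, which is an illustrative layout choice):

```python
import torch

def kgs_regularizer(weight, g=4):
    """weight: (N, C, D, H, W) Conv3d weight tensor.
    Sums the l2 norms of all (group, location) weight vectors; driving
    one such vector to zero prunes that location in all g kernels."""
    N, C, D, H, W = weight.shape
    K = D * H * W
    grouped = weight.reshape(N * C // g, g, K)  # (groups, g, locations)
    return grouped.norm(dim=1).sum()            # norm over the g axis

# Eq. (2) in sketch form, summed over all 3D CONV layers of a model:
# loss = f(output, target) + lam * sum(
#     kgs_regularizer(m.weight) for m in model.modules()
#     if isinstance(m, torch.nn.Conv3d))
```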

3. Reweighted Regularization Pruning Algorithm: As discussed in Section II-A, the fixed regularization-based pruning algorithm has a limitation: all weights are penalized equally even after convergence of the pruning process, resulting in potential accuracy loss. We propose a novel reweighted regularization pruning algorithm to overcome this limitation. The basic idea is to systematically and dynamically reweight the penalties. Specifically, we reduce the penalties on weights with large magnitudes (which are likely to be more critical) and increase the penalties on weights with smaller magnitudes. This is performed in a systematic, gradual way to avoid the greedy solution that prunes a large number of weights at an early stage. Moreover, our proposed algorithm does not need the pruning rate of each layer to be set manually, which is a limitation of prior works based on ADMM or Geometric Median regularizations.

For reweighted regularization, we minimize the following objective function:

$$\min_{\{\mathcal{W}_l\}, \{\mathbf{b}_l\}} \; f\big(\{\mathcal{W}_l\}_{l=1}^{L}, \{\mathbf{b}_l\}_{l=1}^{L}\big) + \lambda \sum_{l=1}^{L} R\big(\mathcal{P}_l \circ \mathcal{W}_l\big), \tag{3}$$

where $\circ$ denotes element-wise multiplication and $\mathcal{P}_l$ is the collection of penalty parameters, updated in every iteration to facilitate the degree of sparsity. Let $\mathbf{v}_{l,i,j,t}$ denote the vector of weights at location $t$ across the kernels of group $\mathcal{G}_{l,i,j}$ (the vectors regularized in Eq. (2)). In each iteration $s$, the instance of $\mathcal{P}_l$ is denoted by $\mathcal{P}_l^{(s)}$, and we update the penalty applied to the entries of $\mathbf{v}_{l,i,j,t}$ by setting

$$\mathcal{P}_{l,i,j,t}^{(s+1)} = \frac{1}{\big\|\mathbf{v}_{l,i,j,t}^{(s)}\big\| + \epsilon},$$

where $\epsilon$ is a small positive number avoiding a zero denominator. The reweighted regularization process updates the penalty parameters based on the current weight values, does not incur extra hyperparameters, and has a fast convergence rate, as analyzed in [40]. After 3–4 iterations, we prune the weights that have converged to zero, and perform a slight retraining of the non-zero weights (for a few epochs) to regain accuracy.

While overcoming the limitation of fixed regularization-based algorithms, their advantage and flexibility are preserved. There is only one major hyperparameter, the penalty parameter $\lambda$, without the need to manually decide per-layer pruning rates. Also, similar to fixed regularization algorithms, we can multiply each layer's term in the above objective by its per-layer FLOPs value; in this way we target the overall FLOPs reduction, which is more relevant to the actual acceleration. In the experiments, we set FLOPs reduction as the optimization target, and report the corresponding FLOPs reduction rates and actually measured mobile accelerations. A condensed sketch of the procedure follows.
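The sketch below shows one reweighting iteration under these choices (reusing the hypothetical group layout of the earlier `kgs_regularizer` sketch; the per-layer FLOPs weighting and `eps` follow the description above, while training-loop details are omitted):

```python
import torch

def group_location_norms(weight, g=4):
    # Per-(group, location) l2 norms, matching the KGS regularizer layout.
    N, C, D, H, W = weight.shape
    return weight.reshape(N * C // g, g, D * H * W).norm(dim=1)

def reweighted_loss_and_penalties(task_loss, conv_weights, penalties,
                                  layer_flops, lam, eps=1e-3):
    """One iteration of Eq. (3): penalize each (group, location) vector
    by its current penalty, scaled by the layer's FLOPs; then update the
    penalties so small-magnitude vectors are pushed harder toward zero."""
    reg = sum(f * (p * group_location_norms(w)).sum()
              for w, p, f in zip(conv_weights, penalties, layer_flops))
    new_penalties = [1.0 / (group_location_norms(w).detach() + eps)
                     for w in conv_weights]
    return task_loss + lam * reg, new_penalties
```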

V Experimental Results

V-A Evaluation on Sparsity Schemes and Pruning Algorithms

Experimental Setup. We test the two proposed structured sparsity schemes, i.e., Vanilla and KGS sparsity, and the three pruning algorithms on 3D CNN models (including one (2+1)D CNN): C3D [13], R(2+1)D [41], and S3D [42]. Besides the two proposed sparsity schemes, a filter sparsity scheme is also implemented, where filters may be pruned as a whole; it is a direct generalization of the filter pruning of 2D CNNs. The models are all pretrained on the Kinetics dataset [11] and transferred onto the UCF101 [43] and HMDB51 [44] datasets as the pretrained dense models. The hyperparameter settings are the same for all pruning algorithms and sparsity schemes for fair comparison. The batch size is fixed to 32, and the video clip length is 16 frames. The initial learning rate is 5e-3 when training the dense model, and is reduced to 2e-4 in the weight pruning and retraining for stability. The learning rate is fixed during the pruning process, while in retraining it is adjusted with a cosine-function scheduler. For all types of sparsity schemes and pruning algorithms, the total number of epochs is fixed to 240 (although the reweighted pruning algorithm is iterative, its latter iterations require significantly fewer epochs, so we can set the same total epochs for all algorithms). For the pruning of all the models, we use the best combination of $\ell_1$ and $\ell_2$ norms in the regularization term. The penalty factor $\lambda$ is set to 5e-4. The pruning and retraining processes are carried out with eight NVIDIA GeForce GTX 1080 Ti GPUs and the PyTorch framework.

Results. The pruning results for the C3D and R(2+1)D models on the UCF101 dataset with various pruning algorithms and sparsity schemes are provided in Table I. For each pruning algorithm, the three sparsity schemes are compared under the same pruning rate (FLOPs reduction of the overall model), and KGS results under two pruning configurations are compared. As can be observed in the table, the KGS sparsity scheme consistently outperforms the Vanilla sparsity, and these two schemes both perform better than filter pruning. The reweighted regularization algorithm consistently outperforms the other two pruning algorithms. The advantages of KGS sparsity and reweighted regularization are stated in Sections III and IV. With reweighted regularization and the KGS sparsity scheme, both C3D and R(2+1)D achieve only 1%–1.5% accuracy loss under a pruning rate of 2.6×.

Model | Pruning Algorithm | Sparsity Scheme | Overall FLOPs after Pruning | Pruning Rate of FLOPs | Base Top-1 Accuracy | Pruning Top-1 Accuracy
C3D (299 MB) | Heuristic | Filter | 15.2G | 2.6× | 81.6% | 78.6%
C3D (299 MB) | Heuristic | Vanilla | 15.2G | 2.6× | 81.6% | 78.8%
C3D (299 MB) | Heuristic | KGS | 15.2G | 2.6× | 81.6% | 79.0%
C3D (299 MB) | Heuristic | KGS | 10.8G | 3.6× | 81.6% | 78.5%
C3D (299 MB) | Regularization | Filter | 15.2G | 2.6× | 81.6% | 78.8%
C3D (299 MB) | Regularization | Vanilla | 15.2G | 2.6× | 81.6% | 79.0%
C3D (299 MB) | Regularization | KGS | 15.2G | 2.6× | 81.6% | 79.6%
C3D (299 MB) | Regularization | KGS | 10.8G | 3.6× | 81.6% | 79.3%
C3D (299 MB) | Reweighted Regularization | Filter | 15.2G | 2.6× | 81.6% | 79.3%
C3D (299 MB) | Reweighted Regularization | Vanilla | 15.2G | 2.6× | 81.6% | 79.7%
C3D (299 MB) | Reweighted Regularization | KGS | 15.2G | 2.6× | 81.6% | 80.5%
C3D (299 MB) | Reweighted Regularization | KGS | 10.8G | 3.6× | 81.6% | 80.2%
R(2+1)D (120 MB) | Heuristic | Filter | 15.9G | 2.6× | 94.0% | 89.0%
R(2+1)D (120 MB) | Heuristic | Vanilla | 15.9G | 2.6× | 94.0% | 89.4%
R(2+1)D (120 MB) | Heuristic | KGS | 15.9G | 2.6× | 94.0% | 90.4%
R(2+1)D (120 MB) | Heuristic | KGS | 12.7G | 3.2× | 94.0% | 89.9%
R(2+1)D (120 MB) | Regularization | Filter | 15.9G | 2.6× | 94.0% | 89.8%
R(2+1)D (120 MB) | Regularization | Vanilla | 15.9G | 2.6× | 94.0% | 90.8%
R(2+1)D (120 MB) | Regularization | KGS | 15.9G | 2.6× | 94.0% | 91.7%
R(2+1)D (120 MB) | Regularization | KGS | 12.7G | 3.2× | 94.0% | 91.3%
R(2+1)D (120 MB) | Reweighted Regularization | Filter | 15.9G | 2.6× | 94.0% | 90.5%
R(2+1)D (120 MB) | Reweighted Regularization | Vanilla | 15.9G | 2.6× | 94.0% | 91.7%
R(2+1)D (120 MB) | Reweighted Regularization | KGS | 15.9G | 2.6× | 94.0% | 92.5%
R(2+1)D (120 MB) | Reweighted Regularization | KGS | 12.7G | 3.2× | 94.0% | 92.0%
TABLE I: 3D CNN pruning results on the UCF101 dataset.

V-B Evaluation on Mobile Acceleration Performance

Mobile Acceleration Framework Implementation.

We design and implement an end-to-end, compiler-assisted CNN acceleration framework that supports 3D CNNs. Even without any pruning-related optimizations, RT3D is already faster than state-of-the-art CNN execution frameworks (such as MNN and PyTorch Mobile) on mobile CPUs, because it includes more advanced optimizations such as fine-tuned, highly efficient SIMD (Single Instruction, Multiple Data) execution and fine-tuned weight layout organization. Our framework is also the first to support 3D CNN execution on mobile GPUs. It is general, supporting both 2D and 3D CNNs: compared with other popular CNN acceleration frameworks that support 2D CONV (like TVM and MNN) on standard 2D benchmarks such as VGG-Net, ResNet, and MobileNet-V2, our framework also yields consistently better performance.

Moreover, RT3D is also the first to support various sparsity schemes, including Filter sparsity and the proposed Vanilla and KGS sparsity. Based on the sparsity scheme, it employs a compiler-based automatic code generation approach to reorganize the model weights, regularize the computations, tune the computation configuration, and generate the optimized model inference code. Our framework can automatically generate both optimized CPU (vectorized C++) and GPU (OpenCL) code to support both dense and sparse 3D CNN execution. The weight reorganization step is sketched below.
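The weight reorganization can be pictured with the following simplified offline pass (a Python stand-in under the KGS layout of Section III; the actual framework emits vectorized C++ and OpenCL kernels rather than running NumPy at inference time):

```python
import numpy as np

def compact_kgs_group(group):
    """Offline reorganization of one KGS-pruned kernel group: keep only
    the non-zero columns plus an index array of the kept locations, so
    the generated kernel can gather input rows and run a dense matmul."""
    keep = np.any(group != 0, axis=0)        # surviving locations
    return group[:, keep].copy(), np.flatnonzero(keep)

# Toy group: 4 kernels of 27 weights with ~60% of locations pruned.
group = np.random.randn(4, 27) * (np.random.rand(27) > 0.6)
w_compact, idx = compact_kgs_group(group)

# The generated inference code then computes, per group:
#   out = w_compact @ x_cols[idx, :]
# a gather of the needed input rows followed by a SIMD-friendly
# dense multiplication on full (smaller) matrices.
```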

Test-bed and Evaluation Setup.

The evaluations are conducted on a Samsung Galaxy S20 cellphone with the latest Qualcomm Snapdragon 865 platform, consisting of a Qualcomm Kryo 585 octa-core CPU and a Qualcomm Adreno 650 GPU. All experiments run 50 times, with 8 threads on the mobile CPU and all pipelines on the mobile GPU. Because different runs do not vary severely, only the average inference execution time is reported for readability. All models are tuned to their best configurations, e.g., with computational graph optimizations and the best tiling and unrolling sizes. 32-bit floating point is used for CPU runs, and 16-bit floating point for GPU runs; this is the same for both the baseline mobile acceleration frameworks and our RT3D framework for a fair comparison, as quantization is not supported by the baseline frameworks.

Mobile Acceleration Results.

We next evaluate RT3D by comparing it with MNN [6] and PyTorch Mobile (PyTorch) [8] (other popular mobile CNN acceleration frameworks, like TVM and TFLite, do not support 3D CNNs). Table II compares the end-to-end 3D CNN inference time (latency). RT3D supports both dense (original) and sparse 3D CNNs on both the mobile CPU and the mobile GPU; PyTorch supports dense models on the CPU only, and MNN supports dense C3D on the CPU only. For sparse models, RT3D uses models pruned by the reweighted regularization pruning algorithm with KGS sparsity, with pruning rates of 3.6× for C3D, 3.2× for R(2+1)D, and 2.1× for S3D, and accuracies of 80.2%, 92.0%, and 89.4% (the base accuracy of S3D is 90.6%), respectively. In the table, the RT3D speedups are computed against PyTorch. RT3D outperforms MNN and PyTorch on the mobile CPU in all cases, and RT3D on the mobile GPU performs even better than on the CPU. For example, for C3D, the fully optimized RT3D (Sparse) outperforms PyTorch and MNN with speedups of 7.1× and 2.7× on the CPU, and 17.9× and 6.7× on the GPU, respectively. Notably, on the mobile GPU, the fully optimized RT3D can infer 16 frames with C3D, R(2+1)D, and S3D within 142 ms, 141 ms, and 293 ms, respectively, achieving real-time execution (i.e., 30 frames per second) of 3D CNNs on mobile devices.

Importantly, although RT3D's dense implementations are already fully optimized, our sparse implementations, especially on the mobile GPU, can fully transform the pruning rate of FLOPs into inference latency speedup. For example, on C3D, from RT3D (Dense) to RT3D (Sparse) on the GPU, the improvement in inference latency is 3.4×, while the pruning rate of the sparse model is 3.6×. This validates the statement in Section III that the proposed KGS sparsity scheme can fully exploit the parallelism degree on the device. Moreover, 3D CONV is memory-intensive, bounded by both memory bandwidth and latency (which is more severe on the mobile GPU due to its even more limited cache capacity), and our pruning/compilation co-design is able to mitigate this issue while incurring negligible overhead; our cache access count results validate this.

Ablation Study.

We also compare the two sparsity schemes, Vanilla and KGS, in terms of pruning rate and inference latency on mobile devices while controlling for the same pruned top-1 accuracy (Table III). The KGS scheme achieves both a higher pruning rate (in FLOPs) and lower inference latency under the same pruned accuracy on both C3D and R(2+1)D, owing to KGS's high flexibility and seamless match with compiler-level optimizations.

Model | MNN [6] CPU (ms) | PyTorch [8] CPU (ms) | RT3D (Dense) CPU (ms / speedup) | RT3D (Dense) GPU (ms / speedup) | RT3D (Sparse) CPU (ms / speedup) | RT3D (Sparse) GPU (ms / speedup)
C3D | 948 | 2544 | 902 / 2.8× | 488 / 5.2× | 357 / 7.1× | 142 / 17.9×
R(2+1)D | – | 4104 | 1074 / 3.8× | 513 / 8.0× | 391 / 10.5× | 141 / 29.1×
S3D | – | 6617 | 1139 / 5.8× | 565 / 11.7× | 611 / 10.8× | 293 / 22.6×
TABLE II: Inference latency comparison of RT3D, MNN, and PyTorch on mobile CPU and GPU; speedups are relative to PyTorch on CPU. MNN does not support R(2+1)D and S3D yet. For RT3D (Sparse), all models are pruned by the reweighted regularization algorithm with KGS sparsity; the pruning rate (in FLOPs) is 3.6× for C3D, 3.2× for R(2+1)D, and 2.1× for S3D, and the accuracy is 80.2%, 92.0%, and 89.4%, respectively.
Model | Sparsity Scheme | Base Top-1 Accuracy | Pruning Top-1 Accuracy | FLOPs after Pruning | Pruning Rate of FLOPs | CPU Latency (ms) | GPU Latency (ms)
C3D | Vanilla | 81.6% | 80.0% | 16.4G | 2.4× | 525 | 236
C3D | KGS | 81.6% | 80.0% | 9.7G | 4.0× | 329 | 134
R(2+1)D | Vanilla | 94.0% | 91.8% | 15.5G | 2.5× | 523 | 225
R(2+1)D | KGS | 94.0% | 91.8% | 10.2G | 4.0× | 360 | 127
TABLE III: Comparison between the Vanilla and KGS sparsity schemes: pruning rate and inference latency under the same pruning top-1 accuracy on the UCF101 dataset. Reweighted regularization pruning is applied to all models.

VI Conclusion

This paper presents RT3D, a mobile acceleration framework for 3D CNNs that consists of two novel, mobile-friendly structured sparsity schemes (Vanilla and KGS) with best-suited pruning algorithms, and a compiler-assisted code generation framework that transforms pruning benefits into performance gains. The evaluation results show that RT3D outperforms two state-of-the-art acceleration frameworks with speedups of up to 29.1×. RT3D can infer 16 video frames within 150 ms, achieving, for the first time, real-time inference of 3D CNNs on off-the-shelf mobile devices with a pure software solution.

References

  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NeurIPS), 2012, pp. 1097–1105.
  • [2] S. Han, H. Shen, M. Philipose, S. Agarwal, A. Wolman, and A. Krishnamurthy, “Mcdnn: An approximation-based execution framework for deep stream processing under resource constraints,” in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys).   ACM, 2016, pp. 123–136.
  • [3] S. Yao, S. Hu, Y. Zhao, A. Zhang, and T. Abdelzaher, “Deepsense: A unified deep learning framework for time-series mobile sensing data processing,” in Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 351–360.
  • [4] L. N. Huynh, Y. Lee, and R. K. Balan, “Deepmon: Mobile gpu-based deep learning framework for continuous vision applications,” in Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys).   ACM, 2017, pp. 82–95.
  • [5] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “Tvm: An automated end-to-end optimizing compiler for deep learning,” in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 578–594.
  • [6] Alibaba MNN, https://github.com/alibaba/MNN.
  • [7] TensorFlow Lite (TFLite), https://www.tensorflow.org/mobile/tflite/.
  • [8] PyTorch Mobile, https://pytorch.org/mobile/home.
  • [9] S. Ji, W. Xu, M. Yang, and K. Yu, "3d convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 35, no. 1, pp. 221–231, 2012.
  • [10] Z. Wang, Q. Lan, H. He, and C. Zhang, “Winograd algorithm for 3d convolution neural networks,” in International Conference on Artificial Neural Networks (ICANN).   Springer, 2017, pp. 609–616.
  • [11] J. Carreira and A. Zisserman, "Quo vadis, action recognition? a new model and the kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6299–6308.
  • [12] Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal representation with pseudo-3d residual networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5533–5541.
  • [13] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489–4497.
  • [14] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in neural information processing systems (NeurIPS), 2016, pp. 2074–2082.
  • [15] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” in International Conference on Learning Representations (ICLR), 2016.
  • [16] Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient dnns,” in Advances in neural information processing systems (NeurIPS), 2016, pp. 1379–1387.
  • [17] X. Dong and Y. Yang, “Network pruning via transformable architecture search,” in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 759–770.
  • [18] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu, “Discrimination-aware channel pruning for deep neural networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2018, pp. 875–886.
  • [19] R. Yu, A. Li, C.-F. Chen, J.-H. Lai, V. I. Morariu, X. Han, M. Gao, C.-Y. Lin, and L. S. Davis, “Nisp: Pruning networks using neuron importance score propagation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 9194–9203.
  • [20] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang, “Filter pruning via geometric median for deep convolutional neural networks acceleration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4340–4349.
  • [21] J.-H. Luo, J. Wu, and W. Lin, “Thinet: A filter level pruning method for deep neural network compression,” in Proceedings of the IEEE international conference on computer vision (ICCV), 2017, pp. 5058–5066.
  • [22] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.
  • [23] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, “Rethinking the value of network pruning,” arXiv preprint arXiv:1810.05270, 2018.
  • [24] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1389–1397.
  • [25] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, "Soft filter pruning for accelerating deep convolutional neural networks," in International Joint Conference on Artificial Intelligence (IJCAI), 2018.
  • [26] T. Zhang, S. Ye, Y. Zhang, Y. Wang, and M. Fardad, “Systematic weight pruning of dnns using alternating direction method of multipliers,” arXiv preprint arXiv:1802.05747, 2018.
  • [27] T. Li, B. Wu, Y. Yang, Y. Fan, Y. Zhang, and W. Liu, “Compressing convolutional neural networks via factorized convolutional filters,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3977–3986.
  • [28] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," in International Conference on Learning Representations (ICLR), 2017.
  • [29] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, “Mnasnet: Platform-aware neural architecture search for mobile,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2820–2828.
  • [30] B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing neural network architectures using reinforcement learning,” in International Conference on Learning Representations (ICLR), 2017.
  • [31] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Regularized evolution for image classifier architecture search," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 4780–4789.
  • [32] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, and Q. Le, "Understanding and simplifying one-shot architecture search," in Proceedings of the 35th International Conference on Machine Learning (ICML), 2018, pp. 550–559.
  • [33] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean, “Efficient neural architecture search via parameters sharing,” in Proceedings of the 35th International Conference on Machine Learning (ICML), 2018, pp. 4095–4104.
  • [34] S. Xie, H. Zheng, C. Liu, and L. Lin, “Snas: stochastic neural architecture search,” in International Conference on Learning Representations (ICLR), 2019.
  • [35] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer, “Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 10 734–10 742.
  • [36] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2736–2744.
  • [37] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “Amc: Automl for model compression and acceleration on mobile devices,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 784–800.
  • [38] N. Liu, X. Ma, Z. Xu, Y. Wang, J. Tang, and J. Ye, “Autocompress: An automatic dnn structured pruning framework for ultra-high compression rates,” arXiv preprint arXiv:1907.03141, 2019.
  • [39] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, “cudnn: Efficient primitives for deep learning,” arXiv preprint arXiv:1410.0759, 2014.
  • [40] E. J. Candes, M. B. Wakin, and S. P. Boyd, "Enhancing sparsity by reweighted ℓ1 minimization," Journal of Fourier Analysis and Applications, vol. 14, no. 5-6, pp. 877–905, 2008.
  • [41] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6450–6459.
  • [42] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, “Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 305–321.
  • [43] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
  • [44] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “Hmdb: a large video database for human motion recognition,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV).   IEEE, 2011, pp. 2556–2563.