Action recognition is one of the most active research fields in the computer vision community, with broad applications that include robotics, video surveillance, and medical care.
For action recognition, approaches fall into two categories in terms of feature extraction and representation: those based on conventional handcrafted features and those based on deep learning features. Both categories have achieved remarkable improvements recently. A migration from handcrafted-feature approaches to deep learning based methods has taken place since the emergence of AlexNet. There are three reasons for this. First, rich features extracted from different levels are beneficial. Second, a large amount of training data is used to optimize the huge number of parameters, which enhances the learning ability of the network. Third, very deep neural networks are designed to enhance the fitting ability for the classification task. Roughly, deep learning based approaches can be cast into four categories according to the network modalities: 2D-CNNs, recurrent neural networks (RNNs), 2D-3D mixed models and 3D-CNNs.
2D-CNNs and recurrent neural networks (RNNs) [5, 6, 7] were the first to be employed in action recognition. They were soon replaced by 3D-CNNs, which can seamlessly extract features in the spatial-temporal domain. It has been shown that 3D-CNNs with spatial-temporal kernels perform better than 2D-CNNs with spatial kernels for action recognition. C3D is the first method that uses a 3D network for action recognition. Other variants emerged later, and the overall evolution moved from “shallow” 3D-CNNs to “deep” 3D-CNNs.
Despite the promising performance, two problems remain to be solved in the state-of-the-art action recognition methods mentioned above.
1) Depth is beneficial to improve the performance of action classification as compared to using RGB video alone. Most existing approaches take only RGB images as inputs, and recognize human actions from the appearance and motion cues in video sequences [7, 8, 9]. Although impressive advances have been achieved, the lack of 3D structure of the objects and scenes makes those approaches struggle in scenarios with heavy occlusion and similar objects.
With the development of Microsoft Kinect, commodity range sensors make it feasible to generate depth at scale. Depth provides valuable cues including temporal correlation, emotion expression and motion patterns, which are key factors for distinguishing human actions under some circumstances. Therefore, depth information can be viewed as a vital complement to the RGB sequence for improving the performance of action recognition.
Some works have combined RGB and depth information for action recognition [11, 12, 13] and demonstrated the effectiveness of modality fusion. However, existing networks that take both RGB and depth videos as inputs suffer from one of two drawbacks: either they carry a large number of parameters and incur heavy computation, or they are light-weight but their performance is not competitive.
2) Although 3D-CNNs achieve state-of-the-art performance, they usually contain a large number of parameters, leading to high computational cost and a demand for large training data. Among the few existing 3D-CNNs for action recognition, the most representative one is the inflated 3D network (I3D). It is trained on Kinetics, and uses a two-stream inflated 3D CNN to achieve state-of-the-art performance for action recognition. The success of I3D provides some important hints. First, as the core parts, the inception 3D modules allow the network to learn more diverse and robust features. Second, with large-scale training data, deep 3D networks are more likely to achieve gains in classification accuracy.
As a milestone for action recognition, I3D possesses many superior qualities. It still has two disadvantages, however, which leave room for improvement.
A large number of parameters.
Within the I3D network, each inception module extends the 2D-CNN module of Inception V1 to a 3D-CNN module. It contains a large number of parameters and is computationally heavy. By inspecting the network structure, we find that the nine inception modules take up over 80% of the total parameters, and that the computation mainly comes from the convolutional operations with 3D kernels. In addition, a huge number of parameters in a network always comes with a very demanding memory cost.
Most recently, a trend has been to tailor a large network into a light-weight network. Two representative studies are ShuffleNet and MobileNet. However, this line of work mainly focuses on 2D-CNNs. It is worth mentioning that depth-wise convolution and group convolution play an important role in designing light-weight networks.
In fact, 3D-CNNs have a much stronger need for light-weight design for the aforementioned reasons, yet little work exists on this problem. 3D-CNNs are more complex than 2D-CNNs, and so are the light-weight operations in 3D-CNNs. Specifically, for similar depth and width, 3D-CNNs have higher computational cost and need more memory than 2D-CNNs. Fig. 1 shows the comparison between the 2D convolution and its 3D counterpart. Their main differences include the input dimensions, the kernel dimensions and the way the convolution is performed. For example, for a kernel of size 3, a 2D kernel has only 9 parameters, whereas a 3D kernel has 27 parameters. When convolving an input of size with a kernel of size 3, the FLOPs of 2D convolution and 3D convolution are and , respectively. The computational costs of 3D-CNNs are at least one order of magnitude higher than those of 2D-CNNs, that is vs. .
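The parameter counts quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names `conv_params` and `conv_macs` and the output sizes (56×56 for 2D, 16×56×56 for 3D) are assumptions made for this example rather than values from the paper, and multiply-accumulate operations are used as a rough stand-in for FLOPs.

```python
def conv_params(kernel_dims, c_in=1, c_out=1):
    """Number of weights in a bias-free convolution layer."""
    n = c_in * c_out
    for k in kernel_dims:
        n *= k
    return n


def conv_macs(kernel_dims, output_dims, c_in=1, c_out=1):
    """Multiply-accumulates: the kernel is applied once per output position."""
    n = conv_params(kernel_dims, c_in, c_out)
    for d in output_dims:
        n *= d
    return n


params_2d = conv_params((3, 3))      # 3 * 3 weights
params_3d = conv_params((3, 3, 3))   # 3 * 3 * 3 weights

# Hypothetical output sizes, chosen only for illustration.
macs_2d = conv_macs((3, 3), (56, 56))
macs_3d = conv_macs((3, 3, 3), (16, 56, 56))

print(params_2d, params_3d)
print(macs_3d / macs_2d)  # extra kernel depth (3) times extra output frames (16)
```

Under these assumptions the 3D layer costs a factor of (temporal kernel size) × (number of output frames) more than the 2D layer, which is where the order-of-magnitude gap comes from.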
Data hungry. In addition, due to its heavy structure, an enormous number of samples is required during training to achieve good performance. Although deeper 3D-CNNs generally have larger capacity, they may suffer from over-fitting if the training datasets are not sufficiently large.
However, most existing RGB-D datasets contain fewer than 10k videos. This is far fewer than RGB datasets such as ImageNet and Kinetics. As a result, the lack of training data inevitably limits the potential of deep neural networks.
To summarize, despite the great efforts and the rapid progress in the past few years, action recognition is still a challenging task. There is much room for improvement. First, we can enrich the input data by employing multimodal data, such as RGB and depth videos. Second, we can optimize the structures of conventional 3D-CNNs by adopting light-weight design.
In this paper, in order to overcome the aforementioned problems in 3D-CNNs, and being inspired by the design of I3D and light-weight 2D-CNNs, we propose a set of light-weight 3D-CNN networks for action recognition using RGB-D data as inputs. The proposed networks include an inception with spatial and temporal convolution network (IST), a shuffle spatial and temporal convolution network (SST) and a group shuffle spatial and temporal convolution network (GSST), which optimize the 3D-CNNs at the 3D structure level as well as the channel level.
Our contributions are as follows.
To reduce the number of parameters and the computational complexity of the 3D-CNNs in state-of-the-art action recognition frameworks, we propose a series of light-weight models for action recognition on RGB-D data, in which RGB and depth information are jointly used.
We evaluate the proposed light-weight models on two RGB-D datasets: the largest RGB-D human activity dataset (NTU RGB+D dataset, NTU) and a small dataset (Northwestern-UCLA Multi-view Action 3D dataset, N-UCLA). Compared to I3D, the proposed models contain far fewer parameters and have much lower computational cost. More importantly, the proposed models achieve on-par or even better accuracy on both datasets.
II Related work
II-A Action recognition
Action recognition is one of the fundamental problems in human action analysis and receives consistent attention in the computer vision community. Since action cues hide in the spatial-temporal domain, action recognition is typically composed of two ingredients. One is to extract the proper features of the appearance, and the other is to model the dynamic motions.
Conventional methods are usually based on handcrafted features such as HOG and HOF [21, 22, 23, 24]. Among the conventional methods, improved dense trajectories (iDT) achieves the best performance. Since AlexNet was introduced, deep learning has dominated many fields of computer vision, and action recognition is one of them. Deep learning based methods for action recognition have evolved from 2D-CNNs to 3D-CNNs, and from shallow 3D-CNNs to deep 3D-CNNs.
2D-CNNs [25, 26, 27] usually employ a model that is pretrained on ImageNet as the initialization. Random sampling or key-frame sampling policies are used to extract a subset of frames from the whole video sequence. To obtain the final classification score, the scores of the sampled frames are often averaged. For shallow 3D-CNNs, the studies in [29, 30] pioneered the use of 3D spatial-temporal kernels in video action recognition. The Kinetics dataset makes it possible to train deep 3D-CNNs on large-scale data.
Intertwined 3D-2D networks [31, 32] emerged as a hybrid between 2D-CNNs and 3D-CNNs. They use 3D spatial-temporal kernels to extract motion details and temporal information in the low-level layers and employ 2D kernels to conduct classification and action recognition in the high-level layers. A benefit of such 3D-2D networks, compared to 3D-CNNs, is that the number of parameters is much reduced. Similarly, pseudo-3D employs three parallel 2D residual blocks to extract the spatial and temporal information individually.
II-B RGB based action recognition
Static RGB images in sequential or stacked frames are fed into a neural network to extract appearance features and motion patterns, and 2D-CNNs or RNNs are used for action recognition. To further improve the accuracy of action recognition, Simonyan and Zisserman proposed to capture complementary information by incorporating RGB and optical flow information. Later, several methods were proposed that use RGB and stacked optical flow to form a two-stream network representing the appearance and motion information [27, 26, 34, 35]. One advantage of these approaches is that 2D-CNNs can be pretrained on ImageNet.
Recently, a few 3D-CNN structures [8, 9, 36] have been proposed and trained on large action recognition datasets including Kinetics and Sports-1M. Besides the above work, which focused on “increasing depth”, some models pay attention to improving computational efficiency. In ECO, Zolfaghari et al. proposed a sampling strategy that samples a fixed number of key frames from videos of different lengths. Once the frames are extracted, they are fed into a 2D-CNN to extract 2D features, which are stacked into 3D feature cubes. Afterwards, the 3D feature cubes are input to a 3D-CNN to extract deep spatiotemporal features and calculate classification scores. Zhu et al. proposed a key frame mining strategy, which shares a similar idea with ECO. Wu et al. proposed to train the network on compressed video to improve computational speed and save memory.
II-C Skeleton based action recognition
Skeleton based approaches capture the 3D dynamics. Conventional methods including conditional random field (CRF) and hidden Markov model (HMM) use 3D points to model the motion in the temporal domain. Dynamic temporal warping (DTWF) tries to model the spatial-temporal point transition based on nearest neighbor. It relies heavily on the metric representing similarity between features and can easily cause misalignment for repeated actions.
Recently, some deep learning based approaches have been proposed. In one line of work, local occupancy pattern (LOP) is applied to associate 3D joints, and Fourier temporal pyramid (FTP) is used to model the temporal information. Other works use a recurrent neural network (RNN) or a long short-term memory network (LSTM) to model the spatial-temporal evolution of the joints [43, 44]. A drawback of skeleton based methods is that they are often sensitive to view transfer and pose variation.
II-D Depth based action recognition
Similar to dense optical flow, scene flow calculated from depth data provides the dynamic information of the scene, and it is less sensitive than optical flow to fast motion and heavy occlusion. Specifically, some works employ 4D features, namely histograms of oriented normal vectors (HON4D) or histograms of oriented principal components (HOPC). Other works employ the depth motion map (DMM) to conduct action classification [48, 49, 50, 51]. Specifically, the depth in each frame is aggregated into one single image. Based on the aggregated depth, HOG or CNN features are extracted for action recognition.
II-E RGBD-based action recognition
With the development of Microsoft Kinect, cost-effective depth cameras make it possible to capture depth, which opens new possibilities for action recognition. Meanwhile, they also introduce some new challenges. For example, the acquired depth is usually noisy and it is hard to track the points that fall into occluded areas.
The MMDT method was proposed to fuse handcrafted features and deep features for action recognition. Specifically, the dense trajectories acquired by iDT and histograms of normal vectors (HON) from depth images are extracted to represent the motion patterns. The features extracted from the 2D-CNN are used to represent the appearance characteristics. Wang et al. proposed c-ConvNet, which employs a joint loss composed of a softmax loss and a ranking loss to learn homogeneous and heterogeneous features in RGB and depth.
II-F Light-weight 2D-CNNs
In MobileNet, depth-wise separable convolutions are employed to reduce the computational cost in the network, and pointwise convolutions are applied to combine the features of separable channels. MobileNet V2 inherits the advantage of MobileNets and introduces inverted residuals and linear bottlenecks for further improvement.
Taking the bottleneck structure as its basis, ShuffleNet uses group point-wise convolution and depth-wise convolution to reduce parameters and computational cost. Furthermore, a shuffle layer is used for information exchange among different channel groups in each shuffle unit.
The above models rely heavily on depth-wise convolution and group convolution.
Depth-wise convolution. The regular 2D convolution always performs convolutions over multiple input channels, and the filter has the same depth as the input. In contrast, in depth-wise convolution each channel is processed separately, hence the name “depth-wise”. It consists of three steps: 1) Split the input as well as the filter into channels. 2) Convolve the input and filter in corresponding channels. 3) Concatenate the output tensors. Depth-wise convolution can be viewed as a special form of the inception module.
Group convolution. Through sharing parameters among different filter groups, group convolution effectively reduces the redundant parameters. In other words, group convolution reduces parameters by splitting a single convolution layer into several independent sub-paths.
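The three depth-wise steps and the parameter savings of grouping can be made concrete with a small pure-Python sketch. It is a toy illustration, not the implementation used in MobileNet or ShuffleNet: the function names, the tiny 3×3 inputs and the 64-channel layer are all assumptions chosen for this example.

```python
def conv2d_single(channel, kernel):
    """Valid 2D cross-correlation of one channel with one kernel."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(channel) - kh + 1, len(channel[0]) - kw + 1
    return [[sum(channel[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)]
            for i in range(oh)]


def depthwise_conv2d(x, filters):
    """1) split into channels, 2) convolve channel by channel, 3) concatenate."""
    if len(x) != len(filters):
        raise ValueError("depth-wise convolution needs one filter per channel")
    return [conv2d_single(ch, k) for ch, k in zip(x, filters)]


def grouped_conv_params(c_in, c_out, kernel_dims, groups=1):
    """Weights of a bias-free convolution split into independent groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    k = 1
    for d in kernel_dims:
        k *= d
    return groups * (c_in // groups) * (c_out // groups) * k


# Toy two-channel input and one 2x2 filter per channel.
x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
     [[1, 0, 0], [0, 1, 0], [0, 0, 1]]]
filters = [[[1, 0], [0, 1]],
           [[1, 1], [1, 1]]]
y = depthwise_conv2d(x, filters)

regular = grouped_conv_params(64, 64, (3, 3))               # dense connections
grouped = grouped_conv_params(64, 64, (3, 3), groups=2)     # two sub-paths
depthwise = grouped_conv_params(64, 64, (3, 3), groups=64)  # one filter per channel
print(y, regular, grouped, depthwise)
```

With `groups` equal to the number of channels the grouped layer degenerates into a depth-wise layer, and with `groups=2` the parameter count is exactly half that of the dense layer.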
III Our method
To achieve light-weight 3D-CNNs for action recognition, we propose three light-weight models: the inception with spatial and temporal convolution network (IST), the shuffle spatial and temporal convolution network (SST) and the group shuffle spatial-temporal convolution network (GSST). Compared to the state-of-the-art method I3D, our three proposed models contain fewer parameters, have lower computation cost and achieve better performance.
Inspired by I3D, the proposed models mainly focus on two aspects of optimization, which make them better light-weight 3D networks for action recognition.
Optimization of the 3D structure. As mentioned in Section I, the additional dimension of 3D-CNNs dramatically increases the computation cost. The goal is to reduce the heavy computation burden due to the temporal dimension, while still keeping the advantage of 3D-CNNs that operate along both spatial and temporal dimensions. One effective approach is to replace the conventional 3D spatio-temporal convolutions with 2D spatial convolutions and 1D temporal convolutions.
In the following, the design principles and details of the proposed light-weight operations are explained.
III-A Separation of 3D convolution into spatial and temporal convolutions
For ease of exposition, the overall workflow of I3D and the structure of its submodule 4b are shown in Fig. 2. The submodule 4b of I3D consists of four sub-paths: one single-layer path and three two-layer paths. For clarity, the module is divided into two stages according to the depth of the four sub-paths.
In stage one, the input data is fed into three point-wise convolution layers and one max-pooling layer. In stage two, the outputs of two point-wise convolution layers and the max-pooling layer are fed into their respective subsequent convolution layers. As can be seen from Fig. 2(b), the number of FLOPs of stage two is 926.1M, which is about 4 times that of stage one and takes up 80.19% of the total FLOPs in submodule 4b.
Compared to the convolution layers in stage one, those convolution layers in stage two differ as follows, which increases the parameters significantly, from 145.9k to 590.5k.
1) The convolution layers in stage two are 3D convolution layers. With the same width, the number of parameters of a 3×3×3 3D convolution layer is 27 times that of a point-wise convolution layer. Note that, in both I3D and the proposed networks, biases are omitted, so bias parameters are not taken into account here.
2) The width of convolution layers is increased. Specifically, from left to right, the widths of the second path and third path are increased from 96 and 16 to 208 and 48, respectively.
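The stage-wise totals can be reproduced with a short bias-free parameter count. The widths 96, 208, 16 and 48 come from the text; the input width of 480, the 192-channel single-layer path and the 64-channel projection after max-pooling are assumptions taken from the standard Inception V1 configuration that I3D inflates, so this is a consistency check under those assumptions rather than the authors' own accounting.

```python
C_IN = 480  # assumed input width of module 4b (Inception V1 configuration)

# Stage one: three point-wise (1x1x1) convolutions applied to the input.
stage_one = (
    C_IN * 192    # single-layer path (assumed width)
    + C_IN * 96   # reduction layer of the second path
    + C_IN * 16   # reduction layer of the third path
)

# Stage two: two 3x3x3 convolutions plus the projection after max-pooling.
stage_two = (
    27 * 96 * 208   # second path, width 96 -> 208
    + 27 * 16 * 48  # third path, width 16 -> 48
    + C_IN * 64     # point-wise projection after max-pooling (assumed width)
)

total = stage_one + stage_two
share = 100.0 * stage_two / total

print(stage_one, stage_two, total)
print(round(share, 2))  # share of stage two in percent
```

Under these assumptions the count gives 145,920 parameters for stage one and 736,512 for the whole module, matching the figures quoted in this section; because both stages see feature maps of the same size, the stage-two share of parameters also equals its share of FLOPs.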
Based on these observations, we propose two strategies for reducing the parameters.
Decomposition of 3D convolution into sequential spatial and temporal convolution. Szegedy et al. factorized a square 2D convolution layer into two consecutive asymmetric convolution layers and achieved better results.
Thus the number of parameters is reduced while the receptive field remains unchanged. Inspired by this operation, we factorize a 3D convolution into two asymmetric convolutions. According to the dimensional order of factorization, the two alternative structures are spatial-temporal and temporal-spatial convolutions. In this paper, we adopt the former because it compresses the parameters the most.
Increasing the width of the temporal convolution. In module 4b of I3D, the width is increased in stage two. In our design, after the 3D convolution is factorized into spatial and temporal convolutions, the width can be increased in either the spatial convolution or the temporal convolution. We adopt the latter for maximal parameter compression. Fig. 3 lists four factorization structures, which differ in the order of the factorized dimensions and in the location where the width is increased. In this paper, we employ the factorization structure (e) in Fig. 3, because it has the fewest parameters and FLOPs among the four candidates.
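The ranking of the four candidates can be illustrated by counting weights for a single path. The sketch below assumes a 1×3×3 spatial kernel and a 3×1×1 temporal kernel and uses the 96 → 208 path from module 4b; the function names are hypothetical, and the numbers are for this one path only, so they are not meant to reproduce the module totals reported in the paper.

```python
def conv3d_params(c_in, c_out, kernel):
    """Weights in a bias-free 3D convolution with kernel (kt, kh, kw)."""
    kt, kh, kw = kernel
    return c_in * c_out * kt * kh * kw


SPATIAL = (1, 3, 3)   # 2D spatial convolution written as a 3D kernel
TEMPORAL = (3, 1, 1)  # 1D temporal convolution written as a 3D kernel


def factorized_params(c_in, c_out, order, widen):
    """Weights of a two-layer factorization of a 3x3x3 convolution.

    order: "spatial-temporal" or "temporal-spatial" (which layer comes first).
    widen: "first" or "second" (which layer changes the width to c_out).
    """
    if order == "spatial-temporal":
        first, second = SPATIAL, TEMPORAL
    else:
        first, second = TEMPORAL, SPATIAL
    c_mid = c_out if widen == "first" else c_in
    return conv3d_params(c_in, c_mid, first) + conv3d_params(c_mid, c_out, second)


c_in, c_out = 96, 208
full = conv3d_params(c_in, c_out, (3, 3, 3))
candidates = {
    (order, widen): factorized_params(c_in, c_out, order, widen)
    for order in ("spatial-temporal", "temporal-spatial")
    for widen in ("first", "second")
}
best = min(candidates, key=candidates.get)

print(full)
for key, value in sorted(candidates.items(), key=lambda kv: kv[1]):
    print(key, value)
```

In this toy count the cheapest option is the spatial convolution first with the width increased in the second (temporal) layer, which agrees with the choice argued for above: the expensive 3×3 spatial kernel is kept at the narrow width, and only the cheap temporal kernel pays for the wider output.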
By replacing all the 3D convolution layers in the nine modules of I3D with our new structure, the 3D inception module with spatial and temporal convolution (IST) is obtained. The module 4b of IST is shown in Fig. 4, where it is denoted as IST 4b.
Compared to I3D, the proposed IST enjoys two advantages. First, IST contains fewer parameters and has lower computational cost. Specifically, the nine inception modules in I3D take up 80.19 percent of the parameters. From Inc. 4b to IST 4b, the number of parameters in the module decreases from 736512 to 329496. With the same input, the new structure also halves the computational cost (FLOPs). Second, IST has greater nonlinear representation power due to the increase in depth. Specifically, from Inc. 4b to IST 4b, the deepest sub-path increases from two convolution layers to three.
Note that the very recent S3D method, which is concurrent with this work, shows some similarity to the design of the IST module proposed here.
III-B Channel shuffling and splitting
Clearly, IST has fewer parameters and lower computational cost than I3D. However, the parameters in IST can be further reduced. Our observation is that stage one of each module in IST contains numerous parameters that have not changed during the evolution from I3D. Taking Inc. 4b and IST 4b as an example, the number of parameters of stage two decreases from 590.5k to 183.6k, while the number of parameters of stage one remains the same, that is 145.9k.
Our work is inspired by ShuffleNet, which is composed of a series of shuffle units. In each shuffle unit, the point-wise convolution layers are replaced with group point-wise convolution layers to reduce parameters, and a shuffle layer is adopted for information flow between different groups. At the beginning of each module, we add a shuffle layer, where the input data is shuffled and then split into four groups based on the capacity of each sub-path. By adding the shuffle layer to IST, we obtain our new structure, the shuffling spatial and temporal convolution (SST). The architecture of SST 4b is shown in Fig. 5.
Channel shuffling. Using channel splitting alone in each module would split the whole network into four independent sub-paths, which may block the information flow between different sub-paths and lead to suboptimal performance. In order to overcome this drawback, channel shuffling is used for information exchange between sub-paths.
Channel splitting. In IST modules, there are no point-wise convolution layers, therefore introducing a group convolution layer does not make sense. However, the advantage of the group convolution layer is the channel-wise sparse connections. In a group convolution, each convolution operates only on the corresponding input channel group rather than the whole set of input channels. In IST, each module has four sub-paths, and each of them operates on all the input channels, which results in dense connections. In contrast, in SST, the input data is split into four groups for sparse connection between the input data and the four sub-paths.
Based on the two points described above, the implementation details are as follows.
First, the input is split into groups by the shuffle layer, and those groups are shuffled with each other while preserving the same amount of information as before shuffling. The shuffle layer used here follows the one used in ShuffleNet, except that the input data is a 3D cube in this paper.
Second, the groups are divided into four parts corresponding to four sub-paths based on their capacities. Note that, in the I3D and IST modules, the four sub-paths are different in structure and capacity.
Therefore, associating each sub-path with an equal amount of input information is not reasonable. The information assigned to each sub-path should be positively correlated with its capacity. Specifically, the number of output channels of the last layer in each sub-path is taken as an approximate estimate of its capacity. Suppose that the input is split into groups and the output channels of the four sub-paths are . The number of groups that go through sub-path can be calculated via the formula . We can then fine-tune these numbers to ensure that all of them are integers and that their sum equals the number of channels of the input.
After using channel splitting and shuffling, the parameters are decreased from 145.9k in IST 4b to 52.8k in SST 4b, nearly a third of the original amount.
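The two steps, shuffling and capacity-based splitting, can be sketched on channel indices alone. The allocation formula is not reproduced in this text, so the split below assumes a plain proportional share with largest-remainder rounding; that rule, the function names and the example widths (192 and 64 are assumed Inception V1 values, 208 and 48 appear in the text) are illustrative assumptions, not the authors' exact procedure.

```python
def channel_shuffle(channels, groups):
    """ShuffleNet-style shuffle: view the list as (groups, n), transpose, flatten."""
    if len(channels) % groups:
        raise ValueError("number of channels must be divisible by groups")
    n = len(channels) // groups
    return [channels[g * n + i] for i in range(n) for g in range(groups)]


def split_by_capacity(total, capacities):
    """Assumed rule: share `total` units in proportion to each capacity.

    Integer shares are obtained by flooring, and the leftover units go to the
    sub-paths with the largest remainders so that the shares sum to `total`.
    """
    cap_sum = sum(capacities)
    counts = [total * c // cap_sum for c in capacities]
    remainders = [total * c % cap_sum for c in capacities]
    leftover = total - sum(counts)
    order = sorted(range(len(capacities)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Shuffle eight channel indices arranged in four groups of two.
shuffled = channel_shuffle(list(range(8)), groups=4)

# Share 480 input channels among four sub-paths with assumed output widths.
shares = split_by_capacity(480, [192, 208, 48, 64])

print(shuffled)
print(shares, sum(shares))
```

Shuffling only permutes the channels, so no information is lost, and the wider sub-paths receive proportionally more input channels than the narrow ones.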
III-C Group convolution
In this paper, with the similar purpose of reducing the parameters and computation cost in 3D-CNNs for action recognition, the structure of SST is further optimized by employing group convolution, which leads to our final model, the group shuffling spatial and temporal convolution (GSST) network. The workflow of GSST and the architecture of the GSST 4b module are shown in Fig. 6.
In Fig 6(a), GSST replaces all the nine modules of SST with GSST modules. The second and the third convolutions in each module are converted into group convolutions at the beginning. The group number is set to 2 for all the group convolution layers, and the number of parameters of GSST is reduced by 50 percent compared to that of SST.
III-D Light-weight operations summary
|Modules||Inc. 4b||IST 4b||SST 4b||GSST 4b|
In this section, we briefly summarize the light-weight operations employed in this paper and analyze their effects quantitatively. Corresponding to Figs. 2, 4, 5 and 6, the numbers of parameters and the light-weight operations used in module 4b are listed in Table I.
During the evolution from Inc. 4b to GSST 4b, the number of parameters decreases significantly. First, the number of parameters decreases by 55.26 percent when the 3D convolutions are replaced with separate convolutions in the spatial and temporal domains. Second, splitting the input into four parts and adding a shuffle layer removes another 36.42 percent of the parameters. Finally, converting all the convolutions used in the inception modules into group convolutions further reduces the number of parameters by half.
III-E Hyperbolic tangent function based merging strategy
Since the properties of RGB data and depth data are very different, the recognition performance of the same model on RGB data and depth data also varies considerably. As noted in prior work, directly averaging the prediction scores of RGB data and depth data does not work well when the accuracy on depth data is far from that on RGB data or vice versa. To overcome this problem, we propose a robust merging strategy based on the hyperbolic tangent function. The hyperbolic tangent function is widely used as an activation function in networks, and we calculate the weights of the prediction results via Eq. 1, where the variable represents the recognition accuracy on RGB or depth data and ranges from 0 to 1.
After the weights are calculated from the hyperbolic tangent function, the prediction scores of RGB data and depth data are merged according to the corresponding weights. The final prediction scores based on the hyperbolic tangent function are denoted as MS 2 in Table II. The prediction scores obtained by directly averaging the prediction scores of RGB data and depth data are denoted as MS 1 in Table II.
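The difference between the two merging strategies can be sketched as follows. Plain averaging (MS 1) is as described above. For MS 2, Eq. 1 is not reproduced in this text, so the weight used here, tanh applied directly to the stream accuracy and then normalised, is purely a placeholder assumption; the function names and the example scores are likewise hypothetical.

```python
import math


def average_merge(rgb_scores, depth_scores):
    """MS 1: average the per-class prediction scores of the two streams."""
    return [(r + d) / 2.0 for r, d in zip(rgb_scores, depth_scores)]


def tanh_weighted_merge(rgb_scores, depth_scores, rgb_acc, depth_acc):
    """MS 2 (assumed form): weight each stream by tanh of its accuracy in [0, 1]."""
    w_rgb, w_depth = math.tanh(rgb_acc), math.tanh(depth_acc)
    norm = w_rgb + w_depth
    return [(w_rgb * r + w_depth * d) / norm
            for r, d in zip(rgb_scores, depth_scores)]


# Hypothetical two-class scores where the streams disagree.
rgb = [0.6, 0.4]
depth = [0.2, 0.8]

ms1 = average_merge(rgb, depth)
ms2 = tanh_weighted_merge(rgb, depth, rgb_acc=0.9, depth_acc=0.5)

print(ms1)
print(ms2)  # pulled towards the more accurate RGB stream
```

Whatever the exact form of the weight, the design intent is the same: when one modality is much less accurate than the other, its scores should contribute less than half of the merged prediction.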
In this section, the proposed models are evaluated and compared with state-of-the-art methods on two benchmark datasets, i.e., the NTU RGB+D dataset and the Northwestern-UCLA Multiview Action 3D Dataset.
The NTU RGB+D Dataset (referred to as NTU) contains a large number of video clips covering 60 types of activities, including daily actions, mutual actions and medical-related actions. This dataset records the activities of a number of distinct subjects using multiple Microsoft Kinect v.2 cameras concurrently. To our knowledge, NTU is currently the largest RGB-D dataset for action recognition. Following the standard protocol, we perform both the cross-subject and cross-view evaluations. Specifically, for cross-subject evaluation, video clips of the designated training subjects are used for training and the remainder is used for testing. For cross-view evaluation, data captured by two of the cameras are used for training and data from the remaining camera are used for testing.
N-UCLA covers 10 action categories, and each action is performed by several different subjects. Similar to NTU, N-UCLA is captured by Kinect v.1 sensors from various viewpoints. Examples from two of the views are used for training and examples from the remaining view are used for testing.
IV-B Implementation details
|2 Layer PLSTM ||S|
|Multi-view dynamic images + CNN |
|Glimpse Clouds |
|Multiple Stream Networks ||RGB+D||79.7||81.4||80.6|
|MS 1, I3D|
|MS 1, IST †|
|MS 1, SST †|
|MS 1, GSST †|
|MS 2, I3D|
|MS 2, IST †||97.6||95.3|
|MS 2, SST †||93.2|
|MS 2, GSST †|
|Enhanced viz. ||S||86.1|
|MS 1, I3D||RGB+D|
|MS 1, IST †|
|MS 1, SST †||94.9|
|MS 1, GSST †|
|MS 2, I3D|
|MS 2, IST †|
|MS 2, SST †||95.5|
|MS 2, GSST †|
We use the stochastic gradient descent (SGD) optimizer (momentum is set to ) to train all the evaluated models. The initial learning rate is set to and reduced by once the training loss is saturated. For data augmentation, we randomly perform horizontal flipping, spatial crop and temporal clip on the original videos during training. The video frames are resized to and then cropped to patches at four corners or the center. Each video is randomly clipped to continuous frames. For a video shorter than frames, we repeat it until its length reaches frames. During inference, each video is also randomly clipped to several -frame sequences and the corresponding prediction scores are averaged. The mini-batch size is set to for RGB videos and for depth videos. As the size of N-UCLA is relatively small, we use the model trained on NTU (cross-view data) as the initial model, and fine tune it on the N-UCLA dataset.
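The temporal clipping step, including the looping of short videos, can be sketched as below. The clip length is left as a parameter because its value is not given in this text; `loop_pad`, `random_temporal_clip` and the integer stand-ins for frames are hypothetical names and data used only for illustration.

```python
import random


def loop_pad(frames, clip_len):
    """Repeat a short video until it holds at least clip_len frames."""
    if not frames:
        raise ValueError("video has no frames")
    padded = list(frames)
    while len(padded) < clip_len:
        padded.extend(frames)
    return padded


def random_temporal_clip(frames, clip_len, rng=random):
    """Pick clip_len consecutive frames at a random start position."""
    padded = loop_pad(frames, clip_len)
    start = rng.randrange(len(padded) - clip_len + 1)
    return padded[start:start + clip_len]


rng = random.Random(0)  # fixed seed so the example is repeatable

long_video = list(range(100))  # frame indices stand in for images
short_video = list(range(5))

clip_a = random_temporal_clip(long_video, clip_len=16, rng=rng)
clip_b = random_temporal_clip(short_video, clip_len=8, rng=rng)

print(clip_a)
print(clip_b)  # wraps around because the video is shorter than the clip
```

At inference time the same function could be called several times per video and the resulting prediction scores averaged, as described above.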
|Conv2||64||Conv, , 64||-||-||Gconv, , 64|
|Module group 3||480||Inc. 3b - 3c||IST 3b - 3c||SST 3b - 3c||GSST 3b - 3c|
|Module group 4||832||Inc. 4b - 4f||IST 4b - 4f||SST 4b - 4f||GSST 4b - 4f|
|Module group 5||1024||Inc. 5b-5c||IST 5b - 5c||SST 5b - 5c||GSST 5b - 5c|
|Classifier||60||Conv, , 60||-||-||-|
IV-C Comparison with state-of-the-art methods
Results on NTU
On this dataset, we compare the proposed approaches with state-of-the-art methods under three input-modality settings, namely depth, RGB and RGB+depth. The results are listed in Table II.
It can be seen from Table II that the proposed models achieve the best performance in all three settings. In terms of recognition accuracy, I3D and the three proposed models outperform the others by a large margin. With significantly less computation, the accuracies of our proposed models are still slightly better than that of I3D.
Following prior work, we combine the results on the RGB and depth data by merging the prediction scores. Taking the performance difference between RGB and depth data into consideration, we merge the prediction scores using two strategies. With strategy one (denoted as MS 1 in Table III), the prediction scores are averaged. With strategy two (denoted as MS 2 in Table III), the prediction scores are merged with different weights calculated via Eq. 1.
According to the results on RGB+depth data, we can draw two conclusions.
1) Fusing the prediction scores usually improves the performance of the model. As mentioned before, RGB data and depth data carry different modalities of information. Combining the RGB stream and the depth stream is therefore beneficial for action recognition.
2) Strategy two is slightly better than simple average of the two scores.
Results on N-UCLA
We initialize the models with the weights pretrained on NTU, and then fine-tune the pretrained models on N-UCLA. Table III reports the fine-tuning results. The results on N-UCLA present a similar trend to those on NTU.
1) Experiments on RGB data achieve much better performance than those on depth data. The accuracy gaps between RGB-based and depth-based experiments vary in degree. On NTU, the gaps range from 2.9% to 3.8%, but on N-UCLA the average gap is about 18.3%.
Also, NTU covers 60 different actions and N-UCLA only contains 10 different actions. In terms of the diversity in category, conducting action recognition on N-UCLA is supposed to be much easier than on NTU.
2) The three proposed models obtain best performance on RGB+depth data. The improvements by combining the prediction results on depth data and RGB data on N-UCLA are not significant as compared to that on NTU. This may be because the recognition performance on the depth data of N-UCLA is much worse than that of NTU. Therefore, the role that depth data prediction results plays in combining RGB and depth data prediction results for recognition is weakened.
Using the first strategy to fuse results, the recognition accuracy drops for I3D, IST and SST. With the second strategy, combining the prediction scores further improves the recognition accuracy by 0.2% for IST, 0.2 for SST and 0.4 for GSST.
As can be seen in Table III, on depth data, HPM+TM shows better performance than our proposed models. However, there are two points that are worth mentioning.
1) On N-UCLA, each class has only about 97 samples, so deep CNN models overfit very easily. The shallowest of our proposed models contains over 20 3D convolution layers, whereas HPM+TM uses a pretrained 2D-CNN model with only 5 convolution layers.
V Ablation study
|Module group 3||1.222||0.493||0.402||0.201||15.483||6.340||5.067||2.543|
|Module group 4||5.894||2.354||1.647||0.824||9.350||3.799||2.597||1.305|
|Module group 5||4.754||2.103||1.333||0.666||0.941||0.421||0.262||0.132|
In this section, in order to comprehensively reveal the effects of the compression operations employed in this paper, we conduct a full comparison between I3D, IST, SST and GSST in terms of the action recognition accuracy, network structure, number of parameters, computational cost (FLOPs) and the execution time.
V-a Action recognition accuracy
In the last section, we have compared the four models against the state-of-the-art. Here, we compare the performance of the four models themselves. In Fig. 7, the recognition accuracy, parameters and computational cost of the four models are shown.
From Fig. 7(c) and Fig. 7(d), we can see that the compression operations designed for I3D are effective. From I3D to GSST, both the number of parameters and the computation cost decrease significantly at each step. After the full series of compression operations, the number of parameters is reduced by a factor of 6.93 and the computation cost by a factor of 3.60.
Surprisingly, unlike 2D-CNN compression work such as MobileNet and ShuffleNet, where reducing the network parameters and computation cost usually lowers the classification accuracy, IST, SST and GSST greatly reduce the network parameters and computation cost on NTU and N-UCLA while improving the accuracy.
On NTU, the accuracy increases by 0.4% for IST, 0.3% for SST and 0.1% for GSST. On N-UCLA, IST, SST and GSST improve the accuracy by 2.3%, 3.4% and 1.7%, respectively.
In summary, we do not observe a consistent difference in performance between IST and SST: on NTU, IST performs better than SST, whereas on N-UCLA the result is reversed. From IST and SST to GSST, however, the accuracy drops significantly. Our conjecture is that group convolution may have a negative effect on the capacity of the network, which is why we do not split each convolution into more groups.
In Table IV, the detailed structures of I3D, IST, SST and GSST are listed. Gconv represents group convolution. Max-p and Avg-p are the max pooling layer and the average pooling layer, respectively. The sizes of output data and kernels are given in the order temporal size, height and width. Conv1, Conv2 and Conv3 represent the first three convolution layers in the networks.
For IST, SST and GSST, the parts that have the same structure as the corresponding parts of I3D are shown as ‘-’ in Table IV for conciseness.
For the four models, apart from the differences in modules, the remaining differences lie in the first three convolution layers. IST, SST and GSST split Conv 2 and Conv 3 into separate spatial and temporal convolutions. GSST further splits Conv 2 and Conv 3 into group convolutions.
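The effect of these two operations on the parameter count can be sketched with simple arithmetic. The example below is only an illustration under stated assumptions: the channel widths (64 to 192), the 3×3×3 kernel, the spatial-then-temporal ordering, the intermediate width and the use of 2 groups are hypothetical choices, biases are ignored, and none of these values is taken from Table IV.

```python
def conv3d_params(c_in, c_out, kernel, groups=1):
    """Number of weights in a 3D convolution (bias ignored).

    kernel is (temporal, height, width). With group convolution each
    output channel only sees c_in / groups input channels.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    k_t, k_h, k_w = kernel
    return (c_in // groups) * c_out * k_t * k_h * k_w


# Hypothetical layer: 64 -> 192 channels with a 3x3x3 kernel.
full = conv3d_params(64, 192, (3, 3, 3))

# Separated version: a 1x3x3 spatial convolution followed by a
# 3x1x1 temporal convolution (intermediate width assumed to be 192).
separated = (conv3d_params(64, 192, (1, 3, 3))
             + conv3d_params(192, 192, (3, 1, 1)))

# Same separated layers, each additionally split into 2 groups.
grouped = (conv3d_params(64, 192, (1, 3, 3), groups=2)
           + conv3d_params(192, 192, (3, 1, 1), groups=2))

print(full, separated, grouped)
```

Under these assumptions the separated form needs two thirds of the weights of the full 3D kernel, and splitting each convolution into two groups halves that again.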
V-C The number of parameters and computation cost
In Table VI, the numbers of parameters and FLOPs in each part of the four models are listed. Fig. 8 and Fig. 9 show the distribution of computation cost and parameters in the first three convolution layers, the three Module groups and the four Max-pooling layers.
As the depth increases, the number of parameters increases and the computation cost decreases. In other words, parameters are concentrated in the later layers and computation cost in the earlier layers.
1) As more compression operations are employed, the proportion of computational cost taken by Conv 1 increases from 23.8% to 61.8%. Following other light-weight work [16, 17], we do not compress the first convolution layer. Therefore, as the total computational cost of the network decreases, the relative computational cost of the first convolution layer increases.
2) The parameter distribution differs little among the four models. Note that the first convolution layer contains only 0.5% to 2.1% of the total parameters, yet it accounts for 23.8% to 61.8% of the total computational cost.
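Why parameters and computation concentrate at opposite ends of the network can be seen from how the two quantities are counted: the weights of a convolution are reused at every output position, so its multiply-accumulate count is the weight count times the output volume. The sketch below uses two hypothetical layers (an early 7×7×7 layer on a large feature map and a late 3×3×3 layer on a small one); the shapes are illustrative assumptions and are not taken from the paper's tables.

```python
def conv3d_cost(c_in, c_out, kernel, out_size):
    """Return (parameters, multiply-accumulates) of a 3D convolution.

    kernel and out_size are (temporal, height, width). Bias terms are
    ignored, and one multiply-accumulate is counted per weight per
    output position.
    """
    k_t, k_h, k_w = kernel
    o_t, o_h, o_w = out_size
    params = c_in * c_out * k_t * k_h * k_w
    macs = params * o_t * o_h * o_w
    return params, macs


# Hypothetical early layer: few channels, large output volume.
early = conv3d_cost(3, 64, (7, 7, 7), (32, 112, 112))

# Hypothetical late layer: many channels, small output volume.
late = conv3d_cost(384, 384, (3, 3, 3), (8, 7, 7))

for name, (params, macs) in (("early", early), ("late", late)):
    print(f"{name}: {params:,} parameters, {macs / 1e9:.2f} GMACs")
```

In this example the late layer holds about 60 times more weights than the early one, yet the early layer performs far more multiply-accumulates because its output volume is roughly a thousand times larger.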
V-D Execution time
In Table VI, the execution time of the four models on different GPU devices is compared. We evaluate the four models on three GPU devices: Tesla K40c, GeForce GTX TITAN and GeForce GT 740. Since the GeForce GT 740 can take at most 4 samples during inference, we set the batch size to 4 on all three devices. The numbers listed in Table VI are the average time for one iteration. Generally speaking, GSST is about twice as fast as I3D.
Note that the improvement in execution time on GPUs depends on many factors and is less significant than the reduction in FLOPs. On other hardware such as ARM CPUs, the improvement can be more significant.
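The speed-up implied by the timings in Table VI can be checked with a few lines of arithmetic. The sketch assumes that the four time columns are ordered I3D, IST, SST, GSST, which matches the order in which the models are discussed but is not stated by the table header; only the two device rows available in the table are used.

```python
# Column order is an assumption: I3D, IST, SST, GSST.
MODELS = ["I3D", "IST", "SST", "GSST"]

# Average time (s) for one iteration at batch size 4, from Table VI.
time_cost = {
    "GeForce GTX TITAN": [0.421, 0.267, 0.242, 0.214],
    "GeForce GT 740": [3.112, 2.023, 1.802, 1.618],
}


def speedups(times):
    """Speed-up of every model relative to the first column."""
    baseline = times[0]
    return [baseline / t for t in times]


for device, times in time_cost.items():
    ratios = ", ".join(
        f"{m}: {s:.2f}x" for m, s in zip(MODELS, speedups(times))
    )
    print(f"{device} -> {ratios}")
```

Under that column assumption the last model is between 1.9 and 2.0 times faster than the first on both devices, consistent with the statement that GSST is about twice as fast as I3D.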
|Devices||Time cost (s)|
|GeForce GTX TITAN||0.421||0.267||0.242||0.214|
|GeForce GT 740||3.112||2.023||1.802||1.618|
In this paper, we have introduced 3D-CNNs for action recognition on RGB-D data and achieved good performance. To take full advantage of RGB-D data, which contains both RGB and depth information, we feed both the RGB video and the depth video into our 3D-CNN based action recognition frameworks and achieve better performance than with a single modality, i.e. RGB video or depth video alone.
The main contribution of our work is to overcome the problem that 3D-CNNs generally have a large number of parameters and a heavy computation cost. We have proposed a series of light-weight 3D architectures, namely IST, SST and GSST.
We have validated the proposed models on two RGB-D datasets: NTU, the largest available RGB-D human activity dataset, and N-UCLA, a small dataset. The proposed light-weight models achieve recognition accuracy comparable to that of I3D, the state-of-the-art method for action recognition in terms of accuracy.
The design of light-weight 3D-CNN structures is important beyond action recognition, and the designs proposed in this work may be applicable to most tasks involving 3D data.
-  J. Yu, K. Weng, G. Liang, and G. Xie, “A vision-based robotic grasping system using deep learning for 3d object recognition and pose estimation,” in Robotics and Biomimetics (ROBIO), 2013 IEEE International Conference on. IEEE, Dec. 2013, pp. 1175–1180.
-  E. Ahmed, M. Jones, and T. K. Marks, “An improved deep learning architecture for person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3908–3916.
-  R. Marks, “System and method for providing a real-time three-dimensional interactive environment,” Dec. 6 2011, US Patent 8,072,470.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
-  J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 2625–2634.
-  G. Lev, G. Sadeh, B. Klein, and L. Wolf, “Rnn fisher vectors for action recognition and image annotation,” in European Conference on Computer Vision. Springer, 2016, pp. 833–850.
-  Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, and C. G. Snoek, “Videolstm convolves, attends and flows for action recognition,” Computer Vision and Image Understanding, vol. 166, pp. 41–50, 2018.
-  K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 18–22.
-  J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017, pp. 4724–4733.
-  Z. Zhang, “Microsoft kinect sensor and its effect,” IEEE multimedia, vol. 19, no. 2, pp. 4–10, 2012.
-  J.-F. Hu, W.-S. Zheng, J. Lai, and J. Zhang, “Jointly learning heterogeneous features for rgb-d activity recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 5344–5352.
-  Y. Kong and Y. Fu, “Max-margin heterogeneous information machine for rgb-d action recognition,” International Journal of Computer Vision, vol. 123, no. 3, pp. 350–371, 2017.
-  P. Wang, W. Li, J. Wan, P. Ogunbona, and X. Liu, “Cooperative training of deep aggregation networks for rgb-d action recognition,” arXiv preprint arXiv:1801.01080, 2017.
-  A. Gaidon, Z. Harchaoui, and C. Schmid, “Temporal localization of actions with actoms,” IEEE transactions on pattern analysis and machine intelligence, p. 1, 2013.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
-  X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” arXiv preprint arXiv:1707.01083, 2017.
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
-  W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.
-  J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Mining actionlet ensemble for action recognition with depth cameras,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 1290–1297.
-  I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, “Learning realistic human actions from movies,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8.
-  N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893.
-  P. Scovanner, S. Ali, and M. Shah, “A 3-dimensional sift descriptor and its application to action recognition,” in Proceedings of the 15th ACM international conference on Multimedia. ACM, 2007, pp. 357–360.
-  H. Wang, A. Kläser, C. Schmid, and C.-L. Liu, “Action recognition by dense trajectories,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 3169–3176.
-  H. Wang and C. Schmid, “Action recognition with improved trajectories,” in Proceedings of the IEEE international conference on computer vision, 2013, pp. 3551–3558.
-  K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, 2014, pp. 568–576.
-  L. Wang, Y. Qiao, and X. Tang, “Action recognition with trajectory-pooled deep-convolutional descriptors,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4305–4314.
-  C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1933–1941.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
-  M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, “Sequential deep learning for human action recognition,” in International Workshop on Human Behavior Understanding. Springer, 2011, pp. 29–39.
-  N. Dalal, B. Triggs, and C. Schmid, “Human detection using oriented histograms of flow and appearance,” in European conference on computer vision. Springer, 2006, pp. 428–441.
-  D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6450–6459.
-  Y. Zhou, X. Sun, Z.-J. Zha, and W. Zeng, “Mict: Mixed 3d/2d convolutional tube for human action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 449–458.
-  Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal representation with pseudo-3d residual networks,” in 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 5534–5542.
-  L. Wang, Y. Xiong, Z. Wang, and Y. Qiao, “Towards good practices for very deep two-stream convnets,” arXiv preprint arXiv:1507.02159, 2015.
-  L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in European Conference on Computer Vision. Springer, 2016, pp. 20–36.
-  D. Tran, J. Ray, Z. Shou, S.-F. Chang, and M. Paluri, “Convnet architecture search for spatiotemporal feature learning,” arXiv preprint arXiv:1708.05038, 2017.
-  M. Zolfaghari, K. Singh, and T. Brox, “Eco: Efficient convolutional network for online video understanding,” arXiv preprint arXiv:1804.09066, 2018.
-  W. Zhu, J. Hu, G. Sun, X. Cao, and Y. Qiao, “A key volume mining deep framework for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1991–1999.
-  C.-Y. Wu, M. Zaheer, H. Hu, R. Manmatha, A. J. Smola, and P. Krähenbühl, “Compressed video action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6026–6035.
-  L. Han, X. Wu, W. Liang, G. Hou, and Y. Jia, “Discriminative human action recognition in the learned hierarchical manifold space,” Image and Vision Computing, vol. 28, no. 5, pp. 836–849, 2010.
-  F. Lv and R. Nevatia, “Recognition and segmentation of 3-d human action using hmm and multi-class adaboost,” in European conference on computer vision. Springer, 2006, pp. 359–372.
-  M. Müller and T. Röder, “Motion templates for automatic classification and retrieval of motion capture data,” in Proceedings of the 2006 ACM SIGGRAPH/Eurographics symposium on Computer animation. Eurographics Association, 2006, pp. 137–146.
-  J. Liu, A. Shahroudy, D. Xu, and G. Wang, “Spatio-temporal lstm with trust gates for 3d human action recognition,” in European Conference on Computer Vision. Springer, 2016, pp. 816–833.
-  J. Liu, A. Shahroudy, D. Xu, A. K. Chichung, and G. Wang, “Skeleton-based action recognition using spatio-temporal lstm network with trust gates,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
-  T. Basha, Y. Moses, and N. Kiryati, “Multi-view scene flow estimation: A view centered variational approach,” International journal of computer vision, vol. 101, no. 1, pp. 6–21, 2013.
-  H. Rahmani, A. Mahmood, D. Q. Huynh, and A. Mian, “Hopc: Histogram of oriented principal components of 3d pointclouds for action recognition,” in European conference on computer vision. Springer, 2014, pp. 742–757.
-  O. Oreifej and Z. Liu, “Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 716–723.
-  V. Veeriah, N. Zhuang, and G.-J. Qi, “Differential recurrent neural networks for action recognition,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4041–4049.
-  X. Yang, C. Zhang, and Y. Tian, “Recognizing actions using depth motion maps-based histograms of oriented gradients,” in Proceedings of the 20th ACM international conference on Multimedia. ACM, 2012, pp. 1057–1060.
-  P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, and P. O. Ogunbona, “Action recognition from depth maps using deep convolutional neural networks,” IEEE Transactions on Human-Machine Systems, vol. 46, no. 4, pp. 498–509, 2016.
-  P. Wang, W. Li, Z. Gao, C. Tang, J. Zhang, and P. Ogunbona, “Convnets-based action recognition from depth maps through virtual cameras and pseudocoloring,” in Proceedings of the 23rd ACM international conference on Multimedia. ACM, 2015, pp. 1119–1122.
-  A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “Ntu rgb+ d: A large scale dataset for 3d human activity analysis,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1010–1019.
-  J. Han, L. Shao, D. Xu, and J. Shotton, “Enhanced computer vision with microsoft kinect sensor: A review,” IEEE transactions on cybernetics, vol. 43, no. 5, pp. 1318–1334, 2013.
-  M. Asadi-Aghbolaghi, H. Bertiche, V. Roig, S. Kasaei, and S. Escalera, “Action recognition from rgb-d data: Comparison and fusion of spatio-temporal handcrafted features and deep strategies,” in Chalearn Workshop on Action, Gesture, and Emotion Recognition: Large Scale Multimodal Gesture Recognition and Real versus Fake expressed emotions (ICCV17), 2017.
-  M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation,” arXiv preprint arXiv:1801.04381, 2018.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
-  S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, “Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 305–321.
-  F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” arXiv preprint arXiv:1610.02357, 2017.
-  J. Wang, X. Nie, Y. Xia, Y. Wu, and S.-C. Zhu, “Cross-view action modeling, learning and recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2649–2656.
-  T. L. Minh, N. Inoue, and K. Shinoda, “A fine-to-coarse convolutional neural network for 3d human action recognition,” arXiv preprint arXiv:1805.11790, 2018.
-  S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” arXiv preprint arXiv:1801.07455, 2018.
-  C. Si, Y. Jing, W. Wang, L. Wang, and T. Tan, “Skeleton-based action recognition with spatial reasoning and temporal stack learning,” arXiv preprint arXiv:1805.02335, 2018.
-  P. Wang, W. Li, Z. Gao, C. Tang, and P. O. Ogunbona, “Depth pooling based large-scale 3-d action recognition with convolutional neural networks,” IEEE Transactions on Multimedia, vol. 20, no. 5, pp. 1051–1061, 2018.
-  Y. Xiao, J. Chen, Z. Cao, J. T. Zhou, and X. Bai, “Action recognition for depth video using multi-view dynamic images,” arXiv preprint arXiv:1806.11269, 2018.
-  F. Baradel, C. Wolf, J. Mille, and G. W. Taylor, “Glimpse clouds: Human activity recognition from unstructured feature points,” Computer Vision and Pattern Recognition (CVPR)(To appear), vol. 3, 2018.
-  N. Garcia, P. Morerio, and V. Murino, “Modality distillation with multiple stream networks for action recognition,” arXiv preprint arXiv:1806.07110, 2018.
-  M. Liu, H. Liu, and C. Chen, “Enhanced skeleton visualization for view invariant human action recognition,” Pattern Recognition, vol. 68, pp. 346–362, 2017.
-  R. Li and T. Zickler, “Discriminative virtual views for cross-view action recognition,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2855–2862.
-  Z. Zhang, C. Wang, B. Xiao, W. Zhou, S. Liu, and C. Shi, “Cross-view action recognition via a continuous virtual path,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2690–2697.
-  H. Rahmani and A. Mian, “3d action recognition from novel viewpoints,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1506–1515.
-  A. Gupta, J. Martinez, J. J. Little, and R. J. Woodham, “3d pose from motion for cross-view action recognition via non-linear circulant temporal encoding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 2601–2608.
-  H. Rahmani and A. Mian, “Learning a non-linear knowledge transfer model for cross-view action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2458–2466.
-  CMU motion capture database. [Online]. Available: http://mocap.cs.cmu.edu/
-  J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Learning actionlet ensemble for 3d human action recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 5, pp. 914–927, 2014.