Action Recognition for Depth Video using Multi-view Dynamic Images

06/29/2018 ∙ by Yang Xiao, et al. ∙ Huazhong University of Science & Technology ∙ Agency for Science, Technology and Research

The dynamic image is a recently emerged action representation paradigm that compactly captures temporal evolution, especially in the context of deep convolutional neural networks (CNNs). Inspired by its preliminary success on RGB videos, we propose its extension to the depth domain. To better exploit the 3D characteristics of depth video and thus improve performance, we propose multi-view dynamic images. In particular, the raw depth video is densely projected onto different imaging viewpoints by rotating a virtual camera around specific instances within the 3D space. Dynamic images are then extracted from the resulting multi-view depth videos to constitute the multi-view dynamic images. In this way, more view-tolerant representative information is involved than in the single-view counterpart. A novel CNN learning model is consequently proposed to perform feature learning on multi-view dynamic images: the dynamic images from different views share the same convolutional layers but use different fully connected layers. This model aims to enhance the tuning of the shallow convolutional layers by alleviating gradient vanishing. Furthermore, to address the effect of spatial variation, an action proposal method based on faster R-CNN is proposed, and dynamic images are extracted only from the action proposal regions. In experiments, our approach achieves state-of-the-art performance on three challenging datasets (i.e., NTU RGB-D, Northwestern-UCLA, and UWA3DII).


1 Introduction

With the recently emerged low-cost depth sensors (e.g., Microsoft Kinect), action recognition using depth has attracted increasing attention in, e.g., video surveillance, multimedia data analysis, and human-machine interaction. Compared with its RGB counterpart, depth video can provide richer three-dimensional (3D) descriptive information for action characterization. Several studies Wang et al. (2014a); Veeriah et al. (2015); Oreifej and Liu (2013); Yang et al. (2012); Wang et al. (2016); Yang and Tian (2014) have been concerned with the advantage of 3D visual cues for action recognition from different theoretical perspectives. In practical applications, a major trend is to address the depth-based action recognition problem under high intra-class and camera view variation conditions, with a large number of action classes. Accordingly, some challenging datasets (e.g., NTU RGB-D Shahroudy et al. (2016)) have recently been proposed. Unfortunately, the performance of existing approaches on these datasets is unsatisfactory. To improve performance, one may seek a more discriminative depth-based action representation paradigm.

To better reveal action characteristics, a key issue is to adequately capture dynamic motion information. To this end, most state-of-the-art RGB-based action recognition approaches Simonyan and Zisserman (2014) resort to extracting dense optical flow fields Brox and Malik (2011); for depth video, scene flow Basha et al. (2013) is required for obtaining 3D motion characteristics. Nevertheless, accurate scene flow estimation is still a challenging task involving high computational cost Basha et al. (2013); thus, it is not feasible for practical applications. An alternative method for resolving this is to capture the articulated 3D movement of human body skeleton joints Xia et al. (2012); Wang et al. (2014a); Vemulapalli et al. (2014). However, this point-based sparse description scheme may lead to information loss. Moreover, the existing human body skeleton extraction methods are still sensitive to pose and imaging viewpoint variation Oreifej and Liu (2013) and may fail in certain cases. Accordingly, a more concrete motion characterization approach for depth video is required.

Dynamic imaging Bilen et al. (2016b, a) is the most recent compact video representation paradigm, based on temporal rank pooling Fernando et al. (2017). It compresses a video or video clip into a single image while maintaining rich motion information. In the context of deep learning (i.e., deep convolutional neural networks (CNNs)), considerable success has been achieved in RGB-based action recognition Bilen et al. (2016b, a); Fernando et al. (2016) at reasonable computational cost. Compared with optical flow for motion characterization, dynamic imaging has the advantages of computational efficiency and compactness. In particular, high running efficiency is relatively easy to achieve by solving linear ranking support vector machines (SVMs) or by quickly obtaining approximate solutions Bilen et al. (2016a). Furthermore, with dynamic imaging, stacking of the optical flow field sequence (as the input of the CNN for action description) Simonyan and Zisserman (2014) can be avoided, thus reducing the complexity of the CNN. Inspired by this, the present study extends dynamic imaging to the depth domain. To the authors' knowledge, this has not been well studied. A naive approach is to directly apply dynamic imaging to depth video as in the RGB case. However, this cannot fully exploit 3D motion properties. Accordingly, it is proposed that the raw depth video be densely projected with respect to different virtual imaging viewpoints by rotating a virtual camera around specific instances within the 3D observation space. This procedure is derived from the intrinsic 3D imaging characteristics of depth data, which cannot be achieved with RGB data. Subsequently, dynamic images are extracted from the obtained multi-view depth videos, and thus multi-view dynamic images are constructed for action characterization.

The high adaptability to CNNs is another major advantage of dynamic images, which ensures strong discriminative power. In Bilen et al. (2016b, a); Fernando et al. (2016), the dynamic images from RGB channels are fed into a one-stream CNN model Jia et al. (2014) for action representation. In the present study, it is argued that this paradigm is not optimal for the proposed multi-view dynamic imaging. From the CNN learning perspective, the gradient vanishing problem Bengio et al. (1994); Glorot and Bengio (2010) during training is a critical issue for deep neural networks, particularly when the training sample size is not sufficient. As a consequence, the shallow layers (i.e., the convolutional layers) of the CNN may not be adequately tuned He et al. (2016), thus impairing its visual pattern capture capacity. Regarding the aim of the present study, the scale of the existing depth action datasets Shahroudy et al. (2016); Rahmani et al. (2016); Wang et al. (2014c) (i.e., 56000 samples at most) is still relatively small compared with the large-scale image recognition datasets (e.g., Imagenet with millions of samples) suitable for CNN training. Moreover, compared with the RGB dynamic images in Bilen et al. (2016b, a); Fernando et al. (2016), multi-view dynamic images from depth video are of higher complementarity. Hence, the motivation is to better exploit this complementarity and improve the training effectiveness of the convolutional layers when training samples are not abundant. To this end, a novel CNN learning model is proposed. In particular, the dynamic images from different imaging viewpoints share the same convolutional layers but correspond to different fully connected layers. During training, the fully connected layers of the different viewpoints are iteratively tuned, whereas the shared convolutional layers are consistently tuned during the entire training phase; thus, the training error from different viewpoints can be back-propagated to the convolutional layers more frequently, and the gradient vanishing problem can be alleviated to some degree. Finally, the output of the fully connected layers is employed as the visual feature for action recognition, being fed into an SVM after principal component analysis (PCA).

Figure 1: Technical pipeline diagram of the proposed action recognition method for depth video using multi-view dynamic images.

Generally, actions take place at different spatial locations and in varying scene conditions. Dynamic imaging can fade the background effect but is still spatially sensitive; this is the case for CNNs as well Cimpoi et al. (2015). To counter the effect of spatial variation, an action proposal method is put forth in this study. Specifically, the off-the-shelf faster R-CNN object detector Ren et al. (2015) is first used to detect humans in each frame, owing to its generality and robustness. Subsequently, the resulting human detection bounding boxes are spatial-temporally merged to construct the action proposal volume. It is noteworthy that human–human and human–object interaction information is still maintained within the action proposals. The dynamic image is subsequently extracted from the action proposal volume (not from the entire video) and fed to the CNN. The main technical pipeline of the proposed scheme is shown in Fig. 1.

The proposed action recognition approach based on multi-view dynamic images was tested on three challenging depth datasets (i.e., NTU RGB-D Shahroudy et al. (2016), Northwestern–UCLA Wang et al. (2014c), and UWA3DII Rahmani et al. (2016)). The experimental results demonstrate that the method can achieve state-of-the-art performance on all datasets. The effectiveness of the method is also investigated by an ablation study.

The main contributions of this study are:

  • Dynamic images are extended to the depth domain for action recognition. Multi-view dynamic imaging is proposed for obtaining 3D motion characteristics for action description.

  • A novel CNN learning model is proposed to enhance the training effectiveness on multi-view dynamic images.

  • An action proposal approach is put forth to counter the effect of spatial variation.

The source code and supporting material for this paper can be accessed at https://github.com/3huo/MVDI.

The rest of this paper is organized as follows. Sec. 2 introduces related work. Sec. 3 illustrates the concept of multi-view dynamic image in detail. The proposed CNN learning model for multi-view dynamic images is presented in Sec. 4. The action proposal approach is introduced in Sec. 5. Experiments and discussion are presented in Sec. 6. Sec. 7 concludes the paper.

2 Related Work

Depth-based action recognition approaches can be generally categorized into three main groups: skeleton-based, raw depth-video-based, and their combination.

Skeleton-based. Under this paradigm, 3D human body skeleton joints are first extracted from the depth frames for action characterization. Using the skeleton information, Wang et al. Wang et al. (2014a) extracted the spatial-temporal pairwise distances between the skeleton joints and mined the most discriminative joint combinations for specific action classes. Vemulapalli et al. Vemulapalli et al. (2014) and Huang et al. Huang et al. (2017) proposed extracting the discriminative Lie group action representation from the manifold learning perspective. Weng et al. Weng et al. (2017) resorted to nearest neighbor search to categorize actions. A recent research trend is to capture the spatial-temporal evolution of skeleton joints using recurrent neural networks (RNNs) Veeriah et al. (2015) and long short-term memory (LSTM) networks Shahroudy et al. (2016). Despite its noticeable progress, the skeleton-based approach still suffers from information loss and potential skeleton joint extraction failure. Moreover, CNNs cannot be readily leveraged under this paradigm to improve performance.

Raw depth-video-based. Within this group, spatial-temporal features for action description are captured from the raw depth video directly, without extracting skeletons. Oreifej et al. Oreifej and Liu (2013) proposed HON4D for action characterization by computing the histogram of oriented normal vectors in the 4D space (i.e., XYZ-T). Later, Rahmani et al. Rahmani et al. (2014) refined HON4D by encoding the histogram of oriented principal components (HOPC) within the 3D volume around each cloud point. To counter the effect of imaging noise, camera view variation, and cluttered background, Lu et al. Lu et al. (2014) proposed a binary descriptor based on pairwise comparisons among the 3D cloud points. In Yang et al. (2012), the concept of the depth motion map (DMM) was proposed to compress the depth video into a single image by aggregating the video frames using sum pooling; the histogram of oriented gradients (HOG) descriptor is then extracted from the DMMs of three orthogonal projection planes. DMMs are, in a sense, similar to dynamic images; however, they do not involve temporal evolution characteristics. HOG was replaced with FV-encoded LBP in Chen et al. (2015). Recently, in Wang et al. (2015, 2016), CNNs were applied to DMMs, as in the present study. The main differences are as follows. First, the superiority of dynamic images over DMMs for action characterization in depth video is verified here. Second, a novel CNN learning model that better handles multi-view dynamic images is constructed. Finally, the action proposal procedure is not involved in Wang et al. (2015, 2016).

Skeleton and raw depth video fusion. To better use the information from both the skeletons and the raw depth video, Wang et al. Wang et al. (2012) proposed extracting the local occupancy pattern from the 3D volume space around the skeleton joints. Following this, Yang et al. Yang and Tian (2014) chose to extract the super normal vector (SNV) descriptor along the skeleton joints. Althloothi et al. Althloothi et al. (2014) and Chaaraoui et al. Chaaraoui et al. (2013) proposed fusing skeleton and silhouette shape features.

The present study falls into the second group. Skeleton extraction is not required; thus, robustness is ensured, and richer representative depth video information is involved. To enhance the discriminative power of spatial-temporal features, dynamic imaging and a CNN model are employed. For a more complete survey on action recognition using depth data, readers are referred to Wang et al. (2018).

Obtaining multi-view dynamic images from depth video can be regarded as the most important contribution of this study, as this allows the action recognition task to be considered within the multi-view learning framework, thus enhancing performance through late feature fusion. Multi-view learning can also be applied to numerous other visual recognition problems. For instance, Yu et al. Yu et al. (2014, 2017) proposed novel sparse coding and deep metric learning approaches to fuse multimodal information and facilitate image ranking. Moreover, a deep autoencoder method Hong et al. (2015) was also proposed to integrate multimodal cues and map two-dimensional (2D) images to 3D poses. Compared with these methods, the focus of the present study is on alleviating the gradient vanishing problem in the CNN by multi-view learning.

3 Multi-View Dynamic Images

To concretely capture the motion information in depth video, the concept of dynamic image is extended from the RGB domain to the depth domain. By densely projecting the depth video with respect to multiple virtual imaging points, multi-view dynamic images are extracted involving more discriminative information. Moreover, the training sample size can be increased for better CNN tuning.

3.1 Dynamic images

To use CNNs for action characterization, Bilen et al. Bilen et al. (2016b, a) proposed the concept of dynamic images to capture spatial-temporal dynamic evolution information in a video. A major advantage of dynamic images is that they can summarize a video or video clip into a single static image. Intuitively, this can improve the efficiency of the CNN for both the off-line training and the online test phases. Let $I_1, I_2, \ldots, I_T$ denote the video frames, and let $V_t = \frac{1}{t}\sum_{\tau=1}^{t} I_\tau$ be the frame average up to time $t$. A video ranking score function at each time $t$ is then defined as

$$S(t \mid \mathbf{u}) = \langle \mathbf{u}, V_t \rangle, \qquad (1)$$

where $\mathbf{u}$ is the ranking parameter vector; $\mathbf{u}$ is learned from the specific video to reflect the ranking relationship among the video frames. The criterion is that the later frames are associated with the larger ranking scores, i.e.,

$$\forall\, q > t: \quad S(q \mid \mathbf{u}) > S(t \mid \mathbf{u}). \qquad (2)$$

The learning of $\mathbf{u}$ is subsequently formulated as a convex optimization problem based on the RankSVM framework:

$$\mathbf{u}^{*} = \operatorname*{arg\,min}_{\mathbf{u}} \; \frac{\lambda}{2}\|\mathbf{u}\|^{2} + \frac{2}{T(T-1)} \sum_{q > t} \max\bigl\{0,\, 1 - S(q \mid \mathbf{u}) + S(t \mid \mathbf{u})\bigr\}. \qquad (3)$$
(a) Hand waving
(b) Wearing shoe
(c) Kicking other person
(d) Hugging other person
Figure 2: Dynamic image samples of four actions in depth video.

In particular, the first term is the regularizer usually employed in SVMs, and the second term is the hinge loss that soft-counts the number of frame pairs incorrectly ranked by the scoring function, i.e., those that do not satisfy $S(q \mid \mathbf{u}) > S(t \mid \mathbf{u})$ despite $q > t$. Optimizing Eqn. 3 maps the video frames to a single vector $\mathbf{u}^{*}$, which encodes the dynamic evolution information of all frames. Spatially reordering $\mathbf{u}^{*}$ from 1D to 2D yields the dynamic image for video representation. Dynamic images have already demonstrated their superiority for action characterization in RGB video Bilen et al. (2016b, a); Fernando et al. (2016). In the present study, it is demonstrated that this can also be extended to depth video. Fig. 2 shows dynamic image samples of four actions (i.e., “hand waving”, “wearing shoe”, “kicking other person”, and “hugging other person”) in depth videos from the NTU RGB-D dataset Shahroudy et al. (2016). It can be observed that the discriminative dynamic motion information of the video frames is captured in a single dynamic image while the background is simultaneously suppressed. Moreover, the motion temporal order is also reflected by the gray-scale values.
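As a concrete illustration of this construction, the sketch below computes a dynamic image from a stack of depth frames. It uses the closed-form approximate rank pooling weights of Bilen et al. (2016a) rather than solving the RankSVM of Eqn. 3 exactly, and it treats the raw depth values as the per-frame features; function and variable names are illustrative.

```python
import numpy as np

def approximate_dynamic_image(frames):
    """Summarize a depth video clip into a single dynamic image.

    `frames` is an array of shape (T, H, W) holding raw depth maps.
    Instead of optimizing Eqn. 3 with a RankSVM, the closed-form
    approximate rank pooling weights of Bilen et al. (2016a) are used:
    alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}), with harmonic numbers H_t.
    """
    frames = np.asarray(frames, dtype=np.float64)
    T = frames.shape[0]
    # Harmonic numbers H_0, ..., H_T, with H_0 = 0.
    harmonics = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, T + 1))))
    t = np.arange(1, T + 1)
    alpha = 2.0 * (T - t + 1) - (T + 1) * (harmonics[T] - harmonics[t - 1])
    # A weighted temporal sum of the frames yields the dynamic image.
    dyn = np.tensordot(alpha, frames, axes=(0, 0))
    # Rescale to an 8-bit gray-scale image for CNN input.
    dyn = (dyn - dyn.min()) / (dyn.max() - dyn.min() + 1e-12)
    return (255.0 * dyn).astype(np.uint8)
```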

However, intuitively applying dynamic imaging to the depth domain cannot fully exploit the 3D visual clues contained in depth video. To address this, it is proposed that the depth video be densely projected with respect to multiple virtual imaging viewpoints in the 3D observation space. Dynamic images are then extracted from the obtained depth videos, and thus multi-view dynamic images are constructed.

Figure 3: Rotation of the virtual camera within 3D space to mimic different imaging viewpoints in depth video.

3.2 Multi-view projection on depth video

Unlike RGB video, depth video can be observed from different viewpoints. This can be achieved by rotating the virtual camera around specific instances in 3D space, as shown in Fig. 3. This is equivalent to rotating the 3D cloud points within the depth frames. Consequently, a series of synthesized depth videos can be generated by multi-view projection in the raw video. This facilitates 3D discriminative visual information mining as well as data augmentation for CNN training.

As shown in Fig. 3, to move the virtual camera from its original position $P_o$ to a new position $P_d$, the camera is first rotated by an angle $\theta$ around the Y axis (from $P_o$ to an intermediate position $P_t$), and then by an angle $\phi$ around the X axis (from $P_t$ to $P_d$). The corresponding rotation matrices for the 3D point coordinate transformation are given by

$$R_{Y}(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix} \qquad (4)$$

and

$$R_{X}(\phi) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\phi & -\sin\phi \\ 0 & \sin\phi & \cos\phi \end{bmatrix}, \qquad (5)$$

where the right-handed coordinate system is used for rotation, and the original camera viewpoint is regarded as the rotation angle origin. Consequently, the coordinates $(x, y, z)$ of a 3D point observed from the original viewpoint are transformed after the viewpoint rotation as

$$(x', y', z')^{\mathsf{T}} = R_{X}(\phi)\, R_{Y}(\theta)\, (x, y, z)^{\mathsf{T}}, \qquad (6)$$

where $x'$ and $y'$ can be regarded as the new screen coordinates and $z'$ as the corresponding depth value for the synthesized depth frames. The proposed multi-view projection approach for depth video is similar to that in Wang et al. (2015, 2016). However, it is noteworthy that it does not transform to real-world coordinates as in Wang et al. (2015, 2016), because this requires the focal length of the depth camera, which is not always available in practical applications. Thus, the proposed method is simpler and more convenient in applications while still achieving high performance. Fig. 4 shows some multi-view projection results for a specific depth frame under different $(\theta, \phi)$ combinations. In particular, (0°, 0°) corresponds to the raw depth frame. It can be observed that more representative visual information is involved in the ensemble of the synthesized depth frames. Moreover, the sample size for a specific action can be significantly increased, which helps alleviate the data hungriness problem during CNN training, as previously mentioned.

Figure 4: Multi-view projection results for a depth frame under different $(\theta, \phi)$ combinations. In particular, (0°, 0°) corresponds to the raw depth frame.

Figure 5: Dynamic images corresponding to the multi-view projection results in Fig. 4.
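For illustration, a minimal sketch of this view synthesis is given below. It rotates the points of a depth frame with the matrices of Eqns. 4–6, using pixel coordinates directly as X/Y (no camera focal length, matching the simplification above); occlusion handling (z-buffering) and hole filling are omitted, and all names are illustrative.

```python
import numpy as np

def rotate_depth_frame(depth, theta, phi):
    """Synthesize a depth frame seen from a new virtual viewpoint.

    `depth` is an (H, W) array of depth values; `theta` and `phi` are the
    rotation angles (in radians) around the Y and X axes of Eqns. 4-6.
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Center the screen coordinates so the rotation is around the frame center.
    pts = np.stack([xs - W / 2.0, ys - H / 2.0, depth], axis=-1).reshape(-1, 3)

    Ry = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(theta), 0.0, np.cos(theta)]])
    Rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(phi), -np.sin(phi)],
                   [0.0, np.sin(phi), np.cos(phi)]])
    rotated = pts @ (Rx @ Ry).T  # apply R_Y(theta), then R_X(phi), to each point

    # Re-project: rounded (x', y') become the new screen coordinates, z' the depth.
    x_new = np.round(rotated[:, 0] + W / 2.0).astype(int)
    y_new = np.round(rotated[:, 1] + H / 2.0).astype(int)
    valid = ((x_new >= 0) & (x_new < W) & (y_new >= 0) & (y_new < H)
             & (depth.reshape(-1) > 0))
    out = np.zeros_like(depth, dtype=np.float64)
    out[y_new[valid], x_new[valid]] = rotated[valid, 2]
    return out
```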

3.3 Multi-view dynamic image extraction

After the multi-view projection procedure on the depth video has been completed, dynamic images are individually extracted from the synthesized multi-view depth videos (including the raw video), as shown in Fig. 5. The ensemble of the obtained single-view dynamic images is termed multi-view dynamic images and is used for action characterization. Moreover, to involve richer temporal representative information, each single-view video is split into overlapping temporal segments, and dynamic images are extracted both from the temporal segments and from the entire single-view video, as shown in Fig. 6. In particular, each video is empirically split into four temporal segments with an overlap ratio of 0.5. Following Bilen et al. (2016b, a), the temporal segments share the same action category label as the raw single-view video. The temporal split procedure also serves as data augmentation for CNN training to improve performance.

Figure 6: Temporal split for multi-view dynamic image extraction.
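One plausible way to compute such segment boundaries is sketched below; the text only specifies four segments with an overlap ratio of 0.5 plus the whole video, so the rounding details here are illustrative.

```python
def temporal_segments(num_frames, num_segments=4, overlap=0.5):
    """Frame-index ranges for overlapping temporal segments plus the whole video.

    The segment length is chosen so that `num_segments` segments with the
    given overlap ratio roughly cover [0, num_frames).
    """
    seg_len = int(round(num_frames / (1 + (num_segments - 1) * (1 - overlap))))
    stride = max(1, int(round(seg_len * (1 - overlap))))
    segments = []
    for i in range(num_segments):
        start = min(i * stride, max(0, num_frames - seg_len))
        segments.append((start, min(start + seg_len, num_frames)))
    segments.append((0, num_frames))  # the entire single-view video
    return segments

# Example: a 97-frame video yields four ~39-frame overlapping segments plus (0, 97).
```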

4 Multi-view Dynamic Image Adaptive CNN Learning Model

After the multi-view dynamic images have been extracted, they are fed into the CNN for action characterization. Hence, an adapted CNN learning model is required to ensure performance. Two issues require attention. First, the scale of the existing depth action datasets Shahroudy et al. (2016); Rahmani et al. (2016); Wang et al. (2014c) (i.e., 56000 samples at most) is still limited; thus, the requirements of data-hungry CNN training cannot be fully met. In fact, this may aggravate the effect of the gradient vanishing problem Bengio et al. (1994); Glorot and Bengio (2010), particularly on the shallow convolutional layers that capture visual patterns. Second, even though the extracted multi-view dynamic images are of high complementarity, as shown in Fig. 5, the conventional one-stream CNN model Jia et al. (2014) employed in Bilen et al. (2016b, a); Fernando et al. (2016) cannot capture this well. In Wang et al. (2015, 2016), a multi-stream CNN model is proposed to alleviate this problem on multi-view DMMs. That is, the DMMs from different viewpoints correspond to the individual convolutional and fully connected layers. However, this model still suffers from insufficient training data. Accordingly, the primary motivation for this study is to better exploit the complementary information from different viewpoints to enhance CNN training when the training samples are insufficient.

Figure 7: Multi-view dynamic image adaptive CNN learning model.
Input: dynamic image training sample sets S_1, ..., S_K from the K virtual imaging viewpoints; K sub-CNN models M_1, ..., M_K pre-trained on Imagenet, with the same structure and parameter setting;
Output: the tuned sub-CNN models M_1, ..., M_K for action characterization;
for each training iteration do
    for the training sample set S_i corresponding to the i-th viewpoint (i = 1, ..., K) do
        if i = 1 then
            M_1 inherits the convolutional layers of M_K from the last training iteration;
            Train M_1 using S_1;
        else
            M_i inherits the convolutional layers of M_{i-1} within the current training iteration;
            Train M_i using S_i;
        end
    end
end
Return M_1, ..., M_K;
Algorithm 1: Training procedure of the proposed multi-view dynamic image adaptive CNN learning model

Thus, a novel multi-view dynamic image adaptive CNN learning model is proposed, as shown in Fig. 7. In particular, the dynamic images from different viewpoints share the same convolutional layers (specifically, the same convolutional filters) but correspond to different fully connected layers. The intuition is that the shared convolutional layers capture the discriminative fundamental visual patterns among the multi-view dynamic images, whereas the multi-stream fully connected layers reflect their complementarity. During the training phase, the fully connected layers of the different viewpoints are iteratively tuned, whereas the shared convolutional layers are consistently trained. In this learning model, each imaging viewpoint corresponds to a sub-CNN model. All the sub-CNN models have the same structure and can be trained independently in an end-to-end manner, while sharing the same convolutional layers. Let the sub-CNN models be denoted by M_1, M_2, ..., M_K. During each iteration of the training phase, M_1 is trained first. M_2 is subsequently trained by inheriting the acquired convolutional layers of M_1. This procedure is applied to the remaining sub-CNN models in the current training iteration. When the next training iteration starts, M_1 inherits the convolutional layers of M_K from the previous iteration. This recursive procedure continues until the entire training phase of the proposed CNN model is completed. Algorithm 1 shows the entire training procedure in detail. The proposed method can be regarded as a hybrid of the one-stream CNN model in Bilen et al. (2016b, a); Fernando et al. (2016) and the multi-stream model in Wang et al. (2015, 2016).

Its main advantages are as follows. First, compared with the one-stream model in Bilen et al. (2016b, a); Fernando et al. (2016), the multi-stream fully connected layers of the proposed approach maintain the complementary information among the viewpoints. It is noteworthy that the outputs of the multi-stream fully connected layers are concatenated as the visual feature for action recognition, with PCA and an SVM applied afterwards. Furthermore, compared with the multi-stream model in Wang et al. (2015, 2016), the training error from the different viewpoints can be back-propagated to the shared convolutional layers more frequently. Thus, the gradient vanishing problem can be alleviated to some degree, thereby enhancing the training effectiveness of the convolutional layers.
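To make this training scheme concrete, the following PyTorch sketch implements a shared convolutional trunk with one fully connected stream per view group and the view-cycling update of Algorithm 1. The trunk here is a small toy stack, not the CNN-F architecture actually used, and all names and hyperparameters are illustrative; because the conv trunk is a single shared module, the layer "inheritance" of Algorithm 1 is implicit.

```python
import torch
import torch.nn as nn

class MultiViewDynamicImageCNN(nn.Module):
    """Shared convolutional layers + per-view fully connected streams (Fig. 7)."""

    def __init__(self, num_views, num_classes, feat_dim=4096):
        super().__init__()
        # Convolutional layers shared by all virtual imaging viewpoints.
        self.shared_conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((6, 6)),
        )
        # One fully connected stream per view group.
        self.fc_streams = nn.ModuleList([
            nn.Sequential(nn.Flatten(),
                          nn.Linear(256 * 6 * 6, feat_dim), nn.ReLU(inplace=True),
                          nn.Dropout(),
                          nn.Linear(feat_dim, num_classes))
            for _ in range(num_views)
        ])

    def forward(self, x, view_idx):
        # All views pass through the shared trunk; only the view-specific
        # fully connected stream is used for this batch.
        return self.fc_streams[view_idx](self.shared_conv(x))


def train_one_iteration(model, view_loaders, optimizer, criterion):
    """One outer iteration of Algorithm 1: cycle through the views so that the
    error of every view is back-propagated into the shared conv layers."""
    model.train()
    for view_idx, loader in enumerate(view_loaders):
        images, labels = next(iter(loader))
        optimizer.zero_grad()
        loss = criterion(model(images, view_idx), labels)
        loss.backward()   # gradients reach shared_conv and fc_streams[view_idx]
        optimizer.step()
```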

5 Spatial-temporal Action Proposal

Figure 8: Action proposal for “walking apart from each other” from the NTU RGB-D Shahroudy et al. (2016) dataset. In particular, the red bounding boxes are the human detection results obtained with faster R-CNN, and the green cuboid is the action proposal region.

In Bilen et al. (2016b, a); Fernando et al. (2016), dynamic image extraction is performed directly on the entire video frames. However, this is not optimal for the subsequent CNN-based visual feature extraction procedure, because actions of the same category may take place at different spatial locations and in varying scene conditions. Dynamic images can fade the background effect but are still spatially sensitive. This is also the case for CNNs Cimpoi et al. (2015), particularly in the shallow convolutional layers, and hinders stable action representation. Consequently, effective action proposal is required.

Based on the principle that actions are carried out by humans, a spatial-temporal action proposal approach is put forth. In particular, human detection is first performed in each depth frame. Then, the resulting human detection bounding boxes are spatial-temporally linked and merged; that is, the minimal spatial-temporal cuboid that closely covers all bounding boxes is taken as the action proposal result, as shown in Fig. 8. Consequently, the multi-view dynamic images are extracted only from the obtained action proposal region with a certain extension (within each frame, the four sides of the action proposal bounding box are extended evenly by 30 pixels to involve more discriminative information), not from the entire depth video. Generally, human–human interaction and human–object interaction are intrinsically covered by the proposed action proposal approach. Specifically, the off-the-shelf faster R-CNN Ren et al. (2015) is employed as the human detector owing to its high object detection capacity. It is noteworthy that this study primarily focuses on action recognition, not spatial-temporal action detection in untrimmed video. Accordingly, all the training and test action video samples involved were trimmed in advance, with definite starting and ending time points. In particular, the action proposal procedure is performed over the entire duration of each video sample, both for training and testing; thus, the temporal action stride need not be considered.
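A minimal sketch of this spatial merging step is given below: the per-frame detector boxes are merged into the minimal covering region and extended by 30 pixels on each side; names are illustrative, and the temporal linking details are omitted.

```python
import numpy as np

def action_proposal(per_frame_boxes, frame_shape, margin=30):
    """Merge per-frame human detection boxes into one action proposal region.

    `per_frame_boxes` is a list over frames, each an (N, 4) array of
    (x1, y1, x2, y2) boxes from the human detector (possibly several people
    per frame). The minimal box covering all detections is extended by
    `margin` pixels on each side and clipped to the frame.
    """
    boxes = np.concatenate([b for b in per_frame_boxes if len(b) > 0], axis=0)
    H, W = frame_shape
    x1 = max(0, int(boxes[:, 0].min()) - margin)
    y1 = max(0, int(boxes[:, 1].min()) - margin)
    x2 = min(W, int(boxes[:, 2].max()) + margin)
    y2 = min(H, int(boxes[:, 3].max()) + margin)
    return x1, y1, x2, y2  # crop every frame to this region before rank pooling
```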

6 Experiments

In the experiments, the focus will be on verifying the discriminative power of multi-view dynamic images for action characterization in depth video. Three challenging action datasets were used, specifically, NTU RGB-D Shahroudy et al. (2016), Northwestern–UCLA Wang et al. (2014c), and UWA3DII Rahmani et al. (2016). Although both RGB and depth information is involved in these datasets, only the depth visual cues are taken into consideration. If not otherwise specified, the available samples in each dataset are split into the training set and the testing set according to the original principles in Shahroudy et al. (2016); Wang et al. (2014c); Rahmani et al. (2016).

Table 1: View group division according to the rotation angle value. The $(\theta, \phi)$ tuples are partitioned into five groups (Groups 1–5), with adjacent tuples assigned to the same group.

When multi-view projection is performed on the depth video, the rotation angles $(\theta, \phi)$ are set empirically to a small set of tuples, among which (0°, 0°) corresponds to the raw depth video. Furthermore, the tuples are divided into five view groups, as listed in Table 1, according to their angle values. During training, the five view groups share the same convolutional layers but correspond to individual fully connected layers, as shown in Fig. 7. The main reason for merging adjacent tuples into groups is to restrict the model size.

Arch: CNN-F
conv1: 64 filters, 11×11; st. 4, pad. 0; LRN, ×2 pool
conv2: 256 filters, 5×5; st. 1, pad. 2; LRN, ×2 pool
conv3: 256 filters, 3×3; st. 1, pad. 1
conv4: 256 filters, 3×3; st. 1, pad. 1
conv5: 256 filters, 3×3; st. 1, pad. 1; ×2 pool
fc6: 4096, dropout
fc7: 4096, dropout
fc8: 1000, softmax
Table 2: CNN-F model architecture. It contains five convolutional layers (conv1–5) and three fully connected layers (fc6–8). For each convolutional layer, the number of filters and their receptive field size are given first, followed by the convolution stride (“st.”) and spatial padding (“pad.”), and then the max-pooling down-sampling factor and whether local response normalization (LRN) Krizhevsky et al. (2012) is applied. fc6 and fc7 are regularized using dropout Krizhevsky et al. (2012), whereas the last layer acts as a softmax classifier. The activation function for all weight layers (except the softmax layer) is the rectified linear unit Krizhevsky et al. (2012).

The proposed multi-view dynamic image adaptive CNN learning model is built on the CNN-F model of Chatfield et al. (2014), which consists of five convolutional layers and three fully connected layers. The output of the fc6 layer is employed for action representation. The detailed CNN-F model structure is shown in Table 2. During training, most of the training parameters were set as in Chatfield et al. (2014), with a momentum of 0.9 and a weight decay of 0.0005. However, in the proposed model, the initial learning rate was set to 0.001 (i.e., decreased by a factor of 10) with a fixed batch size. For training, one of the five temporal segments of each depth video described in Sec. 3.3 was randomly chosen for each batch. The training procedure was executed on a single NVIDIA GTX 1080 GPU.

As the existing faster R-CNN model Ren et al. (2015) is generally trained on RGB object detection datasets, it must be re-trained for human detection in depth video, which requires sufficient training samples. Fortunately, in all three test datasets, human body skeleton information is provided per frame. The minimum bounding rectangle that covers all the skeleton joints is regarded as the ground-truth human bounding box for training. The training procedure in Ren et al. (2015) was strictly followed to train the faster R-CNN; the only difference is that in the present scenario, the input of the faster R-CNN is depth maps instead of RGB images.
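For instance, the ground-truth box of one frame can be obtained from the provided skeleton joints as follows (a sketch with illustrative names):

```python
import numpy as np

def gt_box_from_skeleton(joints_2d):
    """Minimum bounding rectangle covering all skeleton joints of one frame.

    `joints_2d` is an (N, 2) array of (x, y) joint positions projected onto
    the depth frame; the returned (x1, y1, x2, y2) box is used as the
    ground-truth human box for re-training the detector.
    """
    x1, y1 = joints_2d.min(axis=0)
    x2, y2 = joints_2d.max(axis=0)
    return float(x1), float(y1), float(x2), float(y2)
```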

LIBLINEAR was used as the SVM classifier with a linear kernel, owing to its efficiency. The penalty factor was set by 5-fold cross-validation on the training set. After PCA, the dimension of the feature vector fed to the SVM was reduced to 1000.
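A sketch of this classification stage using scikit-learn (whose LinearSVC wraps liblinear) is shown below; `X_train`, `y_train`, and `X_test` are placeholders for the concatenated fc6 features and action labels, and the C grid is illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# X_train / X_test: concatenated fc6 outputs of the per-view fc streams (placeholders).
pipeline = make_pipeline(PCA(n_components=1000), LinearSVC())
search = GridSearchCV(
    pipeline,
    param_grid={"linearsvc__C": [0.01, 0.1, 1.0, 10.0, 100.0]},  # illustrative grid
    cv=5,  # 5-fold cross-validation on the training set, as described above
)
search.fit(X_train, y_train)
predicted_actions = search.predict(X_test)
```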

The experimental results are organized as follows. The action recognition results for the three benchmark datasets are reported in Secs. 6.1–6.3. The effectiveness of the dynamic image, multi-view projection, multi-view dynamic image adaptive CNN learning model, spatial-temporal action proposal, and SVM is demonstrated in Secs. 6.4–6.8, respectively.

Figure 9: Key depth frames of four actions from three different viewpoints in the NTU RGB-D dataset.
Method Cross-subject Cross-view
Input: Skeleton data
Skeletal Quads Evangelidis et al. (2014) 38.6 41.4
LARP Vemulapalli et al. (2014) 50.1 52.8
HBRNN-L Du et al. (2015) 59.1 64.0
FTP Dynamic Skeletons Hu et al. (2015) 60.2 65.2
PA-LSTM Shahroudy et al. (2016) 63.0 70.3
ST-LSTM Liu et al. (2016) 69.2 77.7
GCA-LSTM network Liu et al. (2017) 74.4 82.8
Clips+CNN+MTLN Ke et al. (2017) 79.6 84.8
Input: Depth maps
HON4D Oreifej and Liu (2013) 30.6 7.3
SNV Yang and Tian (2014) 31.8 13.6
HOG Ohn-Bar and Trivedi (2013) 32.2 22.3
Multi-view dynamic images+CNN 84.6 87.3
Table 3: Comparison of action recognition accuracy (%) for different methods on the NTU RGB-D dataset.

6.1 NTU RGB+D dataset

NTU RGB+D is a recently proposed large-scale dataset for RGB-D human action recognition. In particular, it involves 56000 samples of 60 action classes, collected from 40 subjects. The actions can be generally divided into three categories: 40 daily actions (e.g., drinking, eating, reading), nine health-related actions (e.g., sneezing, staggering, falling down), and 11 mutual actions (e.g., punching, kicking, hugging). These actions take place under 17 different scene conditions corresponding to 17 video sequences (i.e., S001–S017). The actions were captured using three cameras with different horizontal imaging viewpoints. Fig. 9 shows some key depth frames of four actions in this dataset from three different viewpoints. Multi-modality information is provided for action characterization, including depth maps, 3D skeleton joint positions, RGB frames, and infrared sequences. The performance evaluation was performed by a cross-subject test that split the 40 subjects into training and test groups, and by a cross-view test that employed one camera for testing and the other two cameras for training. The results of the comparison between the proposed method and other state-of-the-art approaches are listed in Table 3. The following can be observed:

The proposed action recognition method consistently outperforms all the other state-of-the-art approaches in both the cross-subject and cross-view tests, regardless of whether the input was skeleton data or depth maps. This demonstrates the effectiveness of the proposed method for depth-based action recognition under complex scene and viewpoint conditions.

The proposed method significantly outperforms the other depth-map-based approaches by large margins. This may be due to the following: (1) The hand-crafted visual descriptors employed by the other approaches are not as discriminative as the proposed CNN-based feature learning in multi-view dynamic images. (2) The other methods are sensitive to viewpoint variation. However, the proposed multi-view projection in depth video can alleviate this. It should be noted that the proposed approach can even achieve better performance in the cross-view test. (3) The other methods take the entire depth maps into consideration for action characterization without effective action proposal. Thus, they are also sensitive to scene and action position variation.

Skeleton-based approaches achieve considerably better performance than depth-map-based approaches Oreifej and Liu (2013); Yang and Tian (2014); Ohn-Bar and Trivedi (2013), except for the proposed method. In fact, the most recently proposed depth-based action recognition techniques Ke et al. (2017); Liu et al. (2017) mainly resort to skeleton information. However, the present study verifies that using the depth maps only can also achieve comparable or even better performance by employing feature learning and human detection.

6.2 Northwestern-UCLA dataset

Figure 10: Key depth frames of four actions from three different viewpoints in the Northwestern–UCLA dataset.
Method Accuracy
Input: Skeleton data
HOJ3D Xia et al. (2012) 54.4
Actionlet Wang et al. (2014a) 69.9
LARP Vemulapalli et al. (2014) 74.2
Input: Depth maps+human mask
HPM+TM+external training data Rahmani and Mian (2016) 92.0
Input: Depth maps
CCD Cheng et al. (2012) 34.4
HON4D Oreifej and Liu (2013) 39.9
SNV Yang and Tian (2014) 42.8
DVV Li and Zickler (2012) 52.1
AOG Wang et al. (2014b) 53.6
HOPC Rahmani et al. (2014) 80.0
Multi-view dynamic images+CNN 84.2
Table 4: Comparison of action recognition accuracy (%) for different methods on the Northwestern–UCLA dataset.

This dataset contains 1475 video samples of 10 action categories: picking up with one hand, picking up with two hands, dropping off trash, walking around, sitting down, standing up, donning, doffing, throwing, and carrying. They were captured by three depth cameras from different viewpoints in varying scene conditions. Each action was executed by 10 subjects. Fig. 10 shows some key depth frames of four actions in this dataset from three different viewpoints. Compared with NTU RGB-D, the samples in this dataset contain considerably higher imaging noise, which makes action recognition significantly more difficult. Multi-modality information is provided for action characterization, including depth maps, 3D skeleton joint position, and RGB frames. Following Wang et al. (2014c), the samples from the first two cameras were used for training, and the samples from the third camera for testing. The results of the comparison between the proposed approach and the other state-of-the-art approaches are listed in Table 4. The following can be noted:

Except for HPM+TM Rahmani and Mian (2016), the proposed approach can still achieve better performance than the other state-of-the-art methods when the input is skeleton data or depth maps. This verifies the superiority and generality of the proposed method on different datasets.

The recognition accuracy of HPM+TM  Rahmani and Mian (2016) is higher than that of the proposed method. However, it introduces external synthetic data to train the CNN. Moreover, accurate human masks are also required to counter the effect of background and imaging noise. By contrast, the proposed approach does not require these. Thus, without external training data, the proposed method outperforms all the others.

The training sample size for this dataset is significantly smaller than that for the NTU RGB+D dataset. Nevertheless, the proposed approach can still achieve good performance (i.e., 84.2%). This demonstrates that the proposed action recognition method is applicable to both small-scale and large-scale cases.

Figure 11: Key depth frames of four actions from four different viewpoints in the UWA3DII dataset.
  Training views Mean
  Test views
Input: Skeleton data
  HOJ3D Xia et al. (2012) 15.3 28.2 17.3 27.0 14.6 13.4 15.0 12.9 22.1 13.5 20.3 12.7 17.7
  Actionlet Wang et al. (2014a) 45.0 40.4 35.1 36.9 34.7 36.0 49.5 29.3 57.1 35.4 49.0 29.3 39.8
  LARP Vemulapalli et al. (2014) 49.4 42.8 34.6 39.7 38.1 44.8 53.3 33.5 53.6 41.2 56.7 32.6 43.4
Input: Depth maps+human mask
  HPM+TM Rahmani and Mian (2016) 80.6 80.5 75.2 82.0 65.4 72.0 77.3 67.0 83.6 81.0 83.6 74.1 76.9
       Input: Depth maps
  CCD Cheng et al. (2012) 10.5 13.6 10.3 12.8 11.1 8.3 10.0 7.7 13.1 13.0 12.9 10.8 11.2
  DVV Li and Zickler (2012) 23.5 25.9 23.6 26.9 22.3 20.2 22.1 24.5 24.9 23.1 28.3 23.8 24.2
  AOG Wang et al. (2014b) 23.9 31.1 25.3 29.9 22.7 21.9 25.0 20.2 30.5 27.9 30.0 26.8 26.7
  HON4D Oreifej and Liu (2013) 31.1 23.0 21.9 10.0 36.6 32.6 47.0 22.7 36.6 16.5 41.4 26.8 28.9
  SNV Yang and Tian (2014) 31.9 25.7 23.0 13.1 38.4 34.0 43.3 24.2 36.9 20.3 38.6 29.0 29.9
  HOPC Rahmani et al. (2014) 52.7 51.8 59.0 57.5 42.8 44.2 58.1 38.4 63.2 43.8 66.3 48.0 52.2
  MVDI+CNN 77.0 59.5 68.3 57.2 57.8 72.9 80.3 51.3 76.6 69.5 78.8 67.9 68.1
Table 5: Comparison of action recognition accuracy (%) for different methods on the UWA3DII dataset. Each time, two views are used for training, and the remaining two views are for testing. In particular, “MVDI” indicates the proposed multi-view dynamic images, and “HPM+TM” runs with the external training data.

6.3 UWA3DII dataset

UWA3DII consists of 1075 action samples from 30 classes: one hand waving, one hand punching, two hands waving, two hands punching, sitting down, standing up, vibrating, falling down, holding chest, holding head, holding back, walking, irregular walking, lying down, turning around, drinking, phone answering, bending, jumping jack, running, picking up, putting down, kicking, jumping, dancing, mopping floor, sneezing, sitting down (chair), squatting, and coughing. These actions were performed by 10 subjects and captured from four different viewpoints (i.e., front, top, left, and right) under the same scene conditions. Multi-modality information is provided for action characterization, including depth maps, 3D skeleton joint positions, and RGB frames. This dataset is challenging because the action samples were acquired from varying viewpoints at different time points. Moreover, serious self-occlusion may occur. Finally, some action categories are of high similarity. Fig. 11 shows some key depth frames of four actions in this dataset from four different viewpoints. Following Rahmani et al. (2016), the samples from two viewpoints were used for training, and the remaining two for testing; cross-validation among the viewpoints was performed. The performance comparison between the proposed approach and the other state-of-the-art approaches is listed in Table 5. The following can be observed:

Regarding the different viewpoint combinations, the proposed method significantly outperforms all the other approaches in nearly all cases, the exceptions being HOPC Rahmani et al. (2014) on one training/testing view combination and HPM+TM Rahmani and Mian (2016). As previously mentioned, HPM+TM employs external training data and human masks. Without these, the proposed approach achieves the best overall performance among all the methods under comparison, clearly outperforming the second best. This demonstrates the superiority of the proposed method for depth-based action recognition under varying viewpoint conditions.

The performance of the proposed approach on this dataset (i.e., 68.1% overall) is relatively poor compared with that on the NTU RGB-D and Northwestern–UCLA datasets. This may be because the training sample size of this dataset is not sufficient for CNN training. Moreover, some actions from different categories (e.g., drinking vs. phone answering) are of high similarity. It seems that dynamic images cannot capture and emphasize fine-grained action characterization cues very well, which should be addressed in future studies.

The performance of the skeleton-based methods is indeed poor (i.e., 43.4% at most) on this dataset. This verifies that under dramatic viewpoint variation and serious self-occlusion conditions, accurate action representation using skeleton information is still a challenging task. As demonstrated in this study, using depth maps directly can alleviate this to some degree.

6.4 Comparison between dynamic images and DMMs

As previously mentioned, DMMs Yang et al. (2012) are similar to dynamic images, as they also summarize a depth video into a single image:

$$\mathrm{DMM} = \sum_{i=1}^{N-1} E_i, \quad E_i = \mathbb{1}\bigl(\left|D_{i+1} - D_i\right| > \delta\bigr), \qquad (7)$$

where $D_i$ denotes the $i$-th depth frame, $E_i$ indicates the binary motion energy map, $N$ is the number of frames, and $\delta$ is a predefined threshold ($\delta$ was set following Chen et al. (2015)). An intuitive comparison of a dynamic image and a DMM is shown in Fig. 12. In particular, they correspond to the same “Hugging other person” action in the NTU RGB-D dataset. The following can be observed:

(a) Dynamic image
(b) DMM
Figure 12: Dynamic image and DMM for “Hugging other person” in the NTU RGB-D dataset.
Dataset / Test setting DMM Dynamic image
NTU RGB-D Cross-subject 74.7 84.6
Cross-view 72.1 87.3
Northwestern-UCLA 78.4 84.2
UWA3D II 51.5 68.1
Table 6: Comparison of action recognition accuracy (%) between dynamic image and DMM on the three test datasets.

Dynamic images can reveal the motion temporal order of actions by using the gray-scale value, which cannot be achieved by a DMM.

Compared with DMMs, dynamic images better suppress the effect of background and imaging noise.

Owing to the unsuitable setting of the motion threshold (i.e., $\delta$ in Eqn. 7), DMMs tend to lose some action details of relatively low motion energy. This can be detrimental to effective action representation.
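For reference, the DMM of Eqn. 7 can be computed as follows (a sketch; the threshold value is a placeholder, not the setting of Chen et al. (2015)):

```python
import numpy as np

def depth_motion_map(frames, delta=50.0):
    """DMM of Eqn. 7: sum of binary motion-energy maps over consecutive frames.

    `frames` is a (T, H, W) depth array and `delta` is the motion threshold.
    Unlike the dynamic image, every moving frame pair contributes equally,
    so the temporal order of the motion is not preserved.
    """
    frames = np.asarray(frames, dtype=np.float64)
    energy = (np.abs(np.diff(frames, axis=0)) > delta).astype(np.float64)
    return energy.sum(axis=0)
```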

A quantitative performance comparison between dynamic images and DMMs was performed on all three test datasets; both share the same multi-view projection procedure, CNN learning model, and action proposal results. The results are listed in Table 6. It can be seen that dynamic imaging significantly outperforms DMMs in all test cases, which verifies the superiority of dynamic imaging for depth video summarization in action characterization.

View group Cross-subject Cross-view
Group 1 76.7 75.1
Group 2 78.0 88.4
Group 3 70.4 82.8
Group 4 79.8 91.4
Group 5 78.6 76.5
Group 1+2 80.2 89.5
Group 1+2+3 81.6 93.1
Group 1+2+3+4 84.0 94.3
Group 1+2+3+4+5 84.1 94.9
Table 7: Comparison of action recognition accuracy (%) among viewpoint groups in Table 1 and their combination in the S001 sequence of the NTU RGB-D dataset.

6.5 Effectiveness of multi-view projection

To verify the effectiveness of multi-view projection in depth video, the performance of the view groups in Table 1 and their combination are compared using the proposed action recognition method. The test was executed only on the S001 action sequence of the NTU RGB-D dataset, owing to the high computational cost. The results of the comparison are listed in Table 7. The following can be noted:

As the number of view groups increases, the action recognition performance is consistently enhanced in both the cross-subject and cross-view tests. It is noteworthy that the performance improvement in the cross-subject test demonstrates that multi-view projection in depth video not only alleviates the cross-view divergence problem but also introduces richer discriminative information for cross-subject action characterization. The results demonstrate the effectiveness of the multi-view projection mechanism in depth video for action recognition.

As listed in Table 1, group 3 corresponds to the raw depth view. Interestingly, it does not achieve the best performance among all the view groups. First, this demonstrates the effectiveness of the view projection method proposed in Sec. 3.2. Second, it reveals that the multi-view projected depth videos do more than merely provide additional auxiliary information.

6.6 Effectiveness of multi-view dynamic image adaptive CNN learning model

Dataset OS CNN MS CNN Our CNN
NTU RGB-D 78.8 82.5 84.6
Northwestern-UCLA 82.2 82.7 84.2
UWA3DII 57.1 66.4 68.3
Table 8: Comparison of action recognition accuracy (%) among different CNN learning models on all test datasets.

To demonstrate the superiority of the proposed multi-view dynamic image adaptive CNN learning model, it was compared with two related models. One is the multi-stream CNN learning (MS CNN) model in Wang et al. (2015, 2016), in which the view groups correspond to individual convolutional and fully connected layers. The other is the standard one-stream CNN learning (OS CNN) model Jia et al. (2014), in which the view groups are regarded as different input channels. The comparison was performed on the three test datasets with the same experimental settings; that is, PCA and SVM were also applied to MS CNN and OS CNN. For the NTU RGB-D dataset, the result of the cross-subject test is reported. The comparison results are listed in Table 8. The following can be observed:

In all test datasets, the proposed multi-view dynamic image adaptive CNN learning model consistently outperforms the other CNN models. This demonstrates the effectiveness and generality of the proposed method with multi-view dynamic images for action representation.

Among the CNN models, the one-stream model is the weakest, because its single-stream fully connected structure is not suitable for capturing the complementary information from the different virtual imaging viewpoints.

6.7 Effectiveness of spatial-temporal action proposal

Dataset / Test setting MVDI-O MVDI-AP
NTU RGB-D Cross-subject 78.6 84.6
Cross-view 75.4 87.3
Northwestern-UCLA 73.2 84.2
UWA3D II 72.6 68.1
Table 9: Comparison of action recognition accuracy (%) of the proposed action recognition method with and without spatial-temporal action proposal on the three test datasets.
(a) Human–human interaction
(b) Human–object interaction
Figure 13: Human detection results of faster R-CNN when human–human or human–object interaction occurs.

In the proposed method, spatial-temporal action proposal is performed to counter the effect of scene variation and spatial-sensitivity of the CNN. To demonstrate its effectiveness, the test cases with and without action proposal were compared in the three test datasets. The results of the comparison are listed in Table 9, where the proposed action recognition method with and without action proposal is denoted by “MVDI-AP” and “MVDI-O”, respectively. The following can be observed:

Action proposal significantly improves the performance of the method on the NTU RGB-D and Northwestern–UCLA datasets in all three cases; the performance enhancement is at least 6.0%. This verifies the feasibility and effectiveness of the spatial-temporal action proposal approach. It also demonstrates that countering the effect of scene variation and the spatial sensitivity of the CNN is essential for action recognition.

On the UWA3D II dataset, when action proposal is employed, the performance of the proposed approach drops. This may be because the action samples in this dataset were captured under similar scene conditions; thus, in this case, performing action proposal tends to cause information loss. Improving the adaptability of the action proposal will be addressed in future studies.

Furthermore, human detection using faster R-CNN plays a key role in action proposal. It was demonstrated that faster R-CNN is applicable to depth frames even when human–human or human–object interaction occurs. Fig. 13 shows some live examples.

6.8 Comparison between softmax classifier and SVM

Dataset / Test setting Softmax SVM
NTU RGB-D Cross-subject 71.8 84.6
Cross-view 73.7 87.3
Northwestern-UCLA 70.0 84.2
UWA3D II 57.0 68.1
Table 10: Comparison of action recognition accuracy (%) of softmax classifier and SVM on the three test datasets.

In the proposed action recognition approach, an SVM is employed as the classifier for deciding the action category instead of the softmax classifier in the CNN. The intuition is that the training sample size of depth actions is still insufficient to tune the CNN well in end-to-end learning; introducing the SVM can alleviate this. To verify the superiority of the SVM, it was compared with the softmax classifier on the three test datasets with the same experimental settings and CNN structure. It is noteworthy that the proposed CNN model involves multi-stream softmax classifiers corresponding to the different view groups. In softmax-based classification, for a test action sample, the outputs of the different softmax classifiers are combined by summation to acquire the final classification score. The results of the comparison between the softmax classifier and the SVM are listed in Table 10. In all cases, the SVM is significantly better than the softmax classifier (i.e., by at least 11.1%). This demonstrates that for a specific depth-based action recognition task, intuitively applying CNNs with end-to-end learning but without sufficient training samples is not the optimal choice. Alleviating this through unsupervised or low-shot learning will be addressed in future studies.
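The summation-based fusion of the per-view softmax outputs described above amounts to the following (a sketch with illustrative names):

```python
import numpy as np

def fuse_view_softmax(view_logits):
    """Predict an action class by summing the softmax scores of all view streams.

    `view_logits` is a list of (num_classes,) score vectors, one per view group.
    """
    probs = []
    for z in view_logits:
        e = np.exp(z - z.max())          # numerically stable softmax
        probs.append(e / e.sum())
    return int(np.argmax(np.sum(probs, axis=0)))
```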

Technical components Time cost (s)
Multi-view projection (CPU) 34.5
Multi-view dynamic images extraction (CPU) 5.6
Spatial-temporal action proposal (GPU) 2.4
CNN feature extraction (GPU) 8.5
SVM classification (CPU) 0.02
Overall time cost 51.02
Table 11: Average online time consumption per video corresponding to the main technical components in the proposed approach.

6.9 Time consumption of proposed approach

Herein, the online running efficiency of the proposed method is analyzed. 100 video samples were randomly selected from the NTU RGB-D dataset for testing, with an average length of 97 frames per video. The time consumption statistics (excluding I/O time) were computed on an Intel(R) Xeon(R) E5-2630 V3 machine running at 2.4 GHz (using only one core) with an Nvidia GeForce GTX 1080 GPU. The average time consumption per video for the main technical components is listed in Table 11. It can be observed that multi-view projection is the most time-consuming component. Overall, the running efficiency of the proposed approach is not high; its acceleration is a critical issue that should be addressed in future work for practical applications.

7 Conclusions

A novel action recognition approach based on dynamic images for depth video was proposed. Through multi-view projection of the depth video, multi-view dynamic images are extracted for action characterization by summarizing the depth video into static images. The key insight is to involve more discriminative information concerning the 3D characteristics of depth video. To better handle multi-view dynamic images, a novel CNN learning model was also proposed, in which different view groups share the same convolutional layers but use different fully connected layers. The main advantage of the proposed CNN model is the alleviation of the gradient vanishing problem during CNN training, particularly on the shallow convolutional layers. Spatial-temporal action proposal is performed to counter the effect of scene variation and the spatial sensitivity of the CNN. The experimental results on three test datasets demonstrated the superiority of the proposed approach.

In future work, it is planned to embed multi-view dynamic images into deeper CNN structures (e.g., ResNet He et al. (2016)) by relaxing the strong requirement on training sample size. The enhancement of the generality of the proposed spatial-temporal action proposal approach for different scene conditions will also be considered. Another direction is to resort to unsupervised deep learning, as well as weakly and semi-supervised learning, to alleviate the problem of insufficient labeled depth action video samples for training.

8 Acknowledgements

This work is jointly supported by the Natural Science Foundation of China (Grant No. 61502187, 61876211, 61602193 and 61702182), the National Key R&D Program of China (No. 2018YFB1004600), the International Science & Technology Cooperation Program of Hubei Province, China (Grant No. 2017AHB051), the HUST Interdisciplinary Innovation Team Foundation (Grant No. 2016JCTD120), and Hunan Provincial Natural Science Foundation of China (Grant 2018JJ3254). Joey Tianyi Zhou is supported by Programmatic Grant No. A1687b0033 from the Singapore government’s Research, Innovation and Enterprise 2020 plan (Advanced Manufacturing and Engineering domain).

References

  • Althloothi et al. (2014) Althloothi, S., Mahoor, M. H., Zhang, X., Voyles, R. M., 2014. Human activity recognition using multi-features and multiple kernel learning. Pattern Recognition 47 (5), 1800–1812.
  • Basha et al. (2013) Basha, T., Moses, Y., Kiryati, N., 2013. Multi-view scene flow estimation: A view centered variational approach. International Journal of Computer Vision 101 (1), 6–21.
  • Bengio et al. (1994) Bengio, Y., Simard, P., Frasconi, P., 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. on Neural Networks 5 (2), 157–166.
  • Bilen et al. (2016a) Bilen, H., Fernando, B., Gavves, E., Vedaldi, A., 2016a. Action recognition with dynamic image networks. arXiv preprint arXiv:1612.00738.
  • Bilen et al. (2016b) Bilen, H., Fernando, B., Gavves, E., Vedaldi, A., Gould, S., 2016b. Dynamic image networks for action recognition. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Brox and Malik (2011) Brox, T., Malik, J., 2011. Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Trans. on Pattern Analysis and Machine Intelligence 33 (3), 500–513.
  • Chaaraoui et al. (2013) Chaaraoui, A., Padilla-Lopez, J., Flórez-Revuelta, F., 2013. Fusion of skeletal and silhouette-based features for human action recognition with rgb-d devices. In: Proc. IEEE International Conference on Computer Vision Workshops (ICCVW). pp. 91–97.
  • Chatfield et al. (2014) Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A., 2014. Return of the devil in the details: Delving deep into convolutional nets. In: Proc. British Machine Vision Conference (BMVC).
  • Chen et al. (2015) Chen, C., Jafari, R., Kehtarnavaz, N., 2015. Action recognition from depth sequences using depth motion maps-based local binary patterns. In: Proc. IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, pp. 1092–1099.
  • Cheng et al. (2012) Cheng, Z., Qin, L., Ye, Y., Huang, Q., Tian, Q., 2012. Human daily action analysis with multi-view and color-depth data. In: Proc. International Conference on Computer Vision (ICCV). pp. 52–61.
  • Cimpoi et al. (2015) Cimpoi, M., Maji, S., Vedaldi, A., 2015. Deep filter banks for texture recognition and segmentation. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3828–3836.
  • Du et al. (2015) Du, Y., Wang, W., Wang, L., 2015. Hierarchical recurrent neural network for skeleton based action recognition. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1110–1118.
  • Evangelidis et al. (2014) Evangelidis, G., Singh, G., Horaud, R., 2014. Skeletal quads: Human action recognition using joint quadruples. In: International Conference on Pattern Recognition. pp. 4513–4518.
  • Fernando et al. (2016) Fernando, B., Anderson, P., Hutter, M., Gould, S., 2016. Discriminative hierarchical rank pooling for activity recognition. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1924–1932.
  • Fernando et al. (2017) Fernando, B., Gavves, E., Oramas, J., Ghodrati, A., Tuytelaars, T., 2017. Rank pooling for action recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence 39 (4), 773–787.
  • Glorot and Bengio (2010) Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks. In: Proc. International Conference on Artificial Intelligence and Statistics (AISTATS). pp. 249–256.
  • He et al. (2016) He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proc. IEEE conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778.
  • Hong et al. (2015) Hong, C., Yu, J., Wan, J., Tao, D., Wang, M., 2015. Multimodal deep autoencoder for human pose recovery. IEEE Trans. on Image Processing 24 (12), 5659–5670.
  • Hu et al. (2015) Hu, J. F., Zheng, W. S., Lai, J., Zhang, J., 2015. Jointly learning heterogeneous features for rgb-d activity recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5344–5352.
  • Huang et al. (2017) Huang, Z., Wan, C., Probst, T., Van Gool, L., 2017. Deep learning on lie groups for skeleton-based action recognition. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Jia et al. (2014) Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T., 2014. Caffe: Convolutional architecture for fast feature embedding. In: Proc. ACM International Conference on Multimedia (MM). pp. 675–678.
  • Ke et al. (2017) Ke, Q., Bennamoun, M., An, S., Sohel, F., Boussaid, F., 2017. A new representation of skeleton sequences for 3d action recognition. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., Hinton, G. E., 2012. Imagenet classification with deep convolutional neural networks. In: Proc. Advances in Neural Information Processing Systems (NIPS). pp. 1097–1105.
  • Li and Zickler (2012) Li, R., Zickler, T., 2012. Discriminative virtual views for cross-view action recognition. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2855–2862.
  • Liu et al. (2016) Liu, J., Shahroudy, A., Xu, D., Wang, G., 2016. Spatio-temporal lstm with trust gates for 3d human action recognition. In: Proc. European Conference on Computer Vision (ECCV). pp. 816–833.
  • Liu et al. (2017) Liu, J., Wang, G., Hu, P., Duan, L.-Y., Kot, A. C., 2017. Global context-aware attention lstm networks for 3d action recognition. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Lu et al. (2014) Lu, C., Jia, J., Tang, C.-K., 2014. Range-sample depth feature for action recognition. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 772–779.
  • Ohn-Bar and Trivedi (2013) Ohn-Bar, E., Trivedi, M. M., 2013. Joint angles similarities and HOG2 for action recognition. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 465–470.
  • Oreifej and Liu (2013) Oreifej, O., Liu, Z., 2013. Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 716–723.
  • Rahmani et al. (2016) Rahmani, H., Mahmood, A., Huynh, D., Mian, A., 2016. Histogram of oriented principal components for cross-view action recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence 38 (12), 2430–2443.
  • Rahmani et al. (2014) Rahmani, H., Mahmood, A., Huynh, D. Q., Mian, A., 2014. HOPC: Histogram of oriented principal components of 3d pointclouds for action recognition. In: Proc. European Conference on Computer Vision (ECCV). pp. 742–757.
  • Rahmani and Mian (2016) Rahmani, H., Mian, A., 2016. 3D action recognition from novel viewpoints. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Ren et al. (2015) Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In: Proc. Advances in Neural Information Processing Systems (NIPS). pp. 91–99.
  • Shahroudy et al. (2016) Shahroudy, A., Liu, J., Ng, T.-T., Wang, G., 2016. NTU RGB+D: A large scale dataset for 3d human activity analysis. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Simonyan and Zisserman (2014) Simonyan, K., Zisserman, A., 2014. Two-stream convolutional networks for action recognition in videos. In: Proc. Advances in Neural Information Processing Systems (NIPS). pp. 568–576.
  • Veeriah et al. (2015) Veeriah, V., Zhuang, N., Qi, G.-J., 2015. Differential recurrent neural networks for action recognition. In: Proc. IEEE International Conference on Computer Vision (ICCV). pp. 4041–4049.
  • Vemulapalli et al. (2014) Vemulapalli, R., Arrate, F., Chellappa, R., 2014. Human action recognition by representing 3d skeletons as points in a lie group. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 588–595.
  • Wang et al. (2012) Wang, J., Liu, Z., Chorowski, J., Chen, Z., Wu, Y., 2012. Robust 3d action recognition with random occupancy patterns. In: Proc. European Conference on Computer Vision (ECCV). pp. 872–885.
  • Wang et al. (2014a) Wang, J., Liu, Z., Wu, Y., 2014a. Learning actionlet ensemble for 3d human action recognition. In: Human Action Recognition with Depth Cameras. Springer, pp. 11–40.
  • Wang et al. (2014b) Wang, J., Nie, X., Xia, Y., Wu, Y., 2014b. Cross-view action modeling, learning, and recognition. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2649–2656.
  • Wang et al. (2014c) Wang, J., Nie, X., Xia, Y., Wu, Y., Zhu, S.-C., 2014c. Cross-view action modeling, learning, and recognition. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2649–2656.
  • Wang et al. (2015) Wang, P., Li, W., Gao, Z., Tang, C., Zhang, J., Ogunbona, P., 2015. Convnets-based action recognition from depth maps through virtual cameras and pseudocoloring. In: Proc. ACM International Conference on Multimedia (ACM MM). pp. 1119–1122.
  • Wang et al. (2016) Wang, P., Li, W., Gao, Z., Zhang, J., Tang, C., Ogunbona, P. O., 2016. Action recognition from depth maps using deep convolutional neural networks. IEEE Trans. on Human-Machine Systems 46 (4), 498–509.
  • Wang et al. (2018) Wang, P., Li, W., Ogunbona, P., Wan, J., Escalera, S., 2018. Rgb-d-based human motion recognition with deep learning: A survey. Computer Vision and Image Understanding.
  • Weng et al. (2017) Weng, J., Weng, C., Yuan, J., 2017. Spatio-temporal naive-bayes nearest-neighbor (st-nbnn) for skeleton-based action recognition. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Xia et al. (2012) Xia, L., Chen, C.-C., Aggarwal, J., 2012. View invariant human action recognition using histograms of 3d joints. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 20–27.
  • Yang and Tian (2014) Yang, X., Tian, Y., 2014. Super normal vector for activity recognition using depth sequences. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 804–811.
  • Yang et al. (2012) Yang, X., Zhang, C., Tian, Y., 2012. Recognizing actions using depth motion maps-based histograms of oriented gradients. In: Proc. ACM International Conference on Multimedia (ACM MM). pp. 1057–1060.
  • Yu et al. (2014) Yu, J., Rui, Y., Tao, D., et al., 2014. Click prediction for web image reranking using multimodal sparse coding. IEEE Trans. on Image Processing 23 (5), 2019–2032.
  • Yu et al. (2017) Yu, J., Yang, X., Gao, F., Tao, D., 2017. Deep multimodal distance metric learning using click constraints for image ranking. IEEE Trans. on Cybernetics 47 (12), 4014–4024.