Human action recognition is widely applied in public surveillance, image/video captioning, human-robot interaction [1, 2, 3], etc. Approaches for action recognition have developed from silhouettes [4, 5] and local features [6, 7, 8] to depth features [9, 10] and skeletons [11, 12]. Existing research focuses on single-view (mostly front-view) and multi-view action recognition. However, these approaches can hardly satisfy the demand of robots to recognize human actions in arbitrary views for human-robot interaction (HRI) applications. Taking a service robot at home (shown in Fig. 1) as an example, it moves freely and interacts with family members. While moving, the robot captures human actions from any viewpoint, and it is expected to understand human behaviors in arbitrary viewpoints. However, arbitrary-view human action recognition is still a challenging problem. On the one hand, view changes lead to action occlusions and pose variances. On the other hand, there are few datasets for arbitrary-view action recognition.
Several datasets have been developed for multi-view action recognition. RGB information is used for multi-view action recognition [15, 16]. With the development of depth sensors, datasets containing RGB-D information were presented, such as Multiview 3D Event, Northwestern-UCLA, UWA3D Multiview, and the NTU RGB+D action dataset [17, 18, 19, 20, 21]. For arbitrary-view recognition, it is desirable to train models on action samples captured in wide-range views. However, almost all existing datasets were captured in limited viewpoints. Taking advantage of the mocap motion in the CMU action dataset (http://mocap.cs.cmu.edu), training datasets including action samples of various viewpoints were generated and used to train classifiers for multi-view action recognition [22, 23, 24]. Nonetheless, the dataset generation suffers from expensive computational cost, and the mocap motion dataset is required to cover a large number of action categories, which is also a difficult problem. To solve the problem of lacking suitable data, we present a large-scale RGB-D action dataset which contains varying-view sequences covering the entire range of view angles. The dataset provides sufficient samples for arbitrary-view action recognition.
Many research efforts have been dedicated to multi-view action recognition. Transfer learning methods were adopted to transfer knowledge from one viewpoint to other viewpoints [25, 26, 27, 28, 5], or to transfer feature knowledge from a benchmark dataset to test datasets. Since action sequences observed from varying viewpoints easily suffer from occlusions, temporal motion is used for view-invariant action representation [21, 29, 18]. A further solution is to learn spatial relationships of joints from 3D poses to construct view-invariant representations [30, 31, 12]. However, most existing approaches can only deal with small-range view changes. For action recognition with view changes, one solution is to seek a common representation for actions in different views. Liu et al. visualized a skeleton sequence as a color image for action representation and presented the SK-CNN approach to recognize actions, which is effective at reducing the difference between views. Current solutions, however, cannot handle the recognition task on our dataset, which contains full-circle views. To cover the full circle, we separate it into four view groups and propose a View-guided Skeleton-CNN (VS-CNN) approach to recognize actions with large view changes.
In this paper, we collect a new large-scale RGB-D action dataset for arbitrary-view action recognition. The dataset contains samples captured in 8 fixed viewpoints and varying-view sequences that cover the entire range of view angles. Samples captured in fixed viewpoints provide training data for arbitrary-view recognition, and may also be used for multi-view recognition. The dataset contains 40 fitness action categories, and 118 subjects were invited to perform them. In total, 83 hours of RGB videos are collected, and the depth image sequences and skeleton sequences have frame counts similar to those of the RGB videos. Moreover, we propose a baseline, termed View-guided Skeleton-CNN (VS-CNN), to tackle these problems. The model consists of a view-group prediction module and four classifiers corresponding to four view groups. The view-group prediction module guides the training of the classifiers by separating action samples into four view groups and driving the training of the corresponding classifiers. Finally, a weighted fusion is performed on the four classifiers, and a SoftMax classifier is used to classify the fused features into action categories. Since the view groups overlap each other, the VS-CNN learns a common representation of actions in different view groups. In summary, our major contributions include:
We present a large-scale RGB-D action dataset for arbitrary-view action recognition, which includes 118 subjects and 8 fixed viewpoints. To the best of our knowledge, this is one of the first datasets containing varying-view sequences that cover the entire range of view angles.
To tackle the arbitrary-view action recognition problems, we propose the VS-CNN, which overcomes the gap of action recognition in large view ranges.
The proposed approach is extensively evaluated on our collected dataset, and the promising performance validates the efficacy of both the approach and dataset.
II Related Work
II-A Multi-view action recognition with 2D features
As in general action recognition, the crucial problem of multi-view action recognition is to learn an effective representation for actions. Many local features have been developed for action representation in 2D videos and depth sequences [33, 34, 35], and they were introduced for multi-view action recognition [36, 37, 38]. To learn effective features, Hu et al. presented the JOULE model, which explored the shared and feature-specific components from multiple feature channels, i.e., RGB and skeleton features, as heterogeneous features for action recognition. In recent years, Convolutional Neural Networks (CNNs) were introduced for 2D feature learning, and a series of effective networks were developed, e.g., ResNeXt. To include temporal information of action sequences, LRCN (Long-term Recurrent Convolutional Networks) was presented for action recognition.
Since view variance leads to human pose changes in 2D videos and depth sequences, a series of approaches were proposed to solve the problem. Liu et al. used bipartite graph partitioning to cluster vocabularies, collected from two independent viewpoints by a bag of visual words, into visual-word clusters, which bridged the semantic gap of actions between different viewpoints. Moreover, Liu et al. built a transferable dictionary pair for feature transformation between front-view and side-view actions, and a common representation was obtained in the two views. Though local features are insensitive to viewpoint changes in a small range, they suffer serious occlusions when a large view change occurs. Therefore, approaches that can handle large view changes are required.
II-B Multi-view action recognition with 3D features
3D information plays an important role in multi-view action recognition [29, 42]. By building bridges between a large collection dataset and test datasets, multi-view action recognition was realized by matching sequences of various viewpoints to data samples of the large collection dataset, reducing the gap between different viewpoints [22, 23, 24]. However, a major limitation is the expensive computational cost of dataset generation, and the mocap motion dataset is also required to have numerous action categories. Another solution was to learn spatial relationships of 3D joints for view-invariant action representation and recognition [17, 30, 31]. Moreover, Shahroudy et al. presented a Part-aware LSTM model (P-LSTM) which contained multiple parallel memory cells for body-part feature learning and one output gate for information sharing among body parts. The P-LSTM combined body-part context information and provided a global representation for action recognition. Graph models were employed to model the 3D geometric relations for multi-view recognition [18, 19]. These high-level representations produced, to some extent, a common description across viewpoints.
Moreover, some approaches transformed the 3D skeleton feature into 2D visual images, and took advantage of feature learning via CNNs to achieve higher action recognition accuracy. Kim et al. collected temporal skeleton trajectories and created frame-wise skeleton features concatenated temporally across the entire video sequence, and the Res-TCN was designed for action recognition. Liu et al. visualized skeleton motions of an action sequence as an enhanced color image, and a multi-stream CNN fusion model was used to recognize actions (SK-CNN). Yan et al. modeled spatiotemporal skeletons using Spatial-Temporal Graph Convolutional Networks (ST-GCN), which learned the importance of skeleton joints and assigned proper weights on graph convolution layers for action representation. Benefiting from effective feature extraction via CNNs, these approaches performed well with small-range view changes. In this paper, we propose the VS-CNN model to deal with action recognition over a large view range.
III Overview of Related Datasets
Several multi-view human action datasets have been released. Weinland et al. released the IXMAS dataset containing RGB videos of human actions. The dataset was captured in five fixed viewpoints and contained 11 basic action categories, each performed by 10 actors. With the Kinect v1 depth sensor, Cheng et al. presented an action dataset including the RGB and depth information of 14 daily actions. 24 persons were invited to perform each action, and the dataset was captured in 4 fixed viewpoints. Wei et al. built a multi-view 3D event dataset which included 8 event categories and 11 interacting object classes. RGB-D data of actions were captured using three stationary Kinect sensors, and 8 persons participated in the data capture. Wang et al. constructed the Northwestern-UCLA Multiview 3D event dataset which contained RGB, depth and skeleton data of 10 daily actions. Each action was performed by 10 participants, and data was captured in 3 fixed viewpoints. Rahmani et al. collected the UWA3D Multiview Activity Dataset in 4 viewpoints. The dataset contained 30 daily action categories, and each category was performed by 10 persons. Moreover, Shahroudy et al. presented a large-scale dataset, the NTU RGB+D action dataset. The dataset includes 60 daily actions, and a total of 40 persons were invited for the data collection. Using 3 Kinect sensors, the dataset was captured in 5 major viewpoints. By changing camera-to-subject distances and camera heights, action data of 80 camera views were recorded.
Almost all existing datasets captured actions in limited viewpoints, so they can hardly support research on arbitrary-view action recognition for HRI applications. In addition, few datasets include action samples captured in a very wide range of view angles, let alone continuously varying views. To provide data for arbitrary-view recognition, we simulate the HRI scenario and collect a new action dataset which contains both action samples captured in fixed viewpoints and continuously varying-view action sequences. The varying-view sequences cover the entire range of view angles, which differs significantly from existing datasets and is beneficial for the evaluation of arbitrary-view action recognition.
IV Varying-View RGB-D Action Dataset
IV-A Database Description
The action dataset is collected using Microsoft Kinect v2 sensors. We use the sensors to capture three modalities of action data, i.e., RGB videos, depth images, and skeleton sequences. For RGB videos, we record image frames with the resolution of pixels. Depth images retain the maximum resolution of Kinect v2 sensors, pixels, and 16-bit pixel values. Human skeletons contain 25 body joints per frame, and each joint is recorded as a 3D coordinate in the 3D space centered on the Kinect sensor. We show the dataset capture settings in Figure 2. Camera positions indicate the 8 fixed viewpoints, and red arrows show the moving trajectory of sensors when we capture varying-view action sequences. During the data collection, subjects always face the Kinect sensor in the front viewpoint. The dataset has been released at https://github.com/HRI-UESTC/CFM-HRI-RGB-D-action-database. We collect fixed-viewpoint data to train classifiers for arbitrary-view recognition because it is difficult to train a robust classifier using varying-view sequences, whose views change rapidly.
Table I: Action categories in the dataset.
| a0 | Punching and knee lifting | a1 | Marking time and knee lifting | a2 | Jumping jack | a3 | Squatting | a4 | Forward lunging |
| a5 | Left lunging | a6 | Left stretching | a7 | Raising hand and jumping | a8 | Left kicking | a9 | Rotation clapping |
| a10 | Front raising in turn | a11 | Pulling a chest expander | a12 | Punching | a13 | Wrist circling | a14 | Single dumbbell raising |
| a15 | Shoulder raise | a16 | Elbow circling | a17 | Dumbbell one-arm shoulder pressing | a18 | Arm circling | a19 | Dumbbell shrugging |
| a20 | Pinching back | a21 | Head anticlockwise circling | a22 | Shoulder abduction | a23 | Deltoid muscle stretching | a24 | Straight forward flexion |
| a25 | Spinal stretching | a26 | Dumbbell side bend | a27 | Standing opposite elbow-to-knee crunch | a28 | Standing body rotation | a29 | Overhead stretching |
| a30 | Upper back stretching | a31 | Knee to chest stretch | a32 | Knee circling | a33 | Alternate knee lifting | a34 | Bent over twist |
| a35 | Rope skipping | a36 | Standing toe touches | a37 | Standing Gastrocnemius Calf Stretch | a38 | Single-leg lateral hopping | a39 | High knees running |
Viewpoints For arbitrary-view action recognition, the dataset is designed to have more viewpoints and to cover a wider range of view angles. Our dataset has 8 fixed viewpoints which are evenly distributed around the subjects, as shown in Figure 2. In order to simulate a real HRI scenario, we design three capture settings. The setting A in Figure 2(a) is a circle with a radius of , and there are two settings having the rectangle shape in Figure 2(b). The sizes of settings B and C are shown with black dashes and green dashes, respectively. In addition, we capture varying-view action sequences by moving a sensor around the subject along the blue paths in Figure 2. The sensor captures actions in continuously varying views covering full-circle view angles. We set the height of all Kinect sensors to . The three capture settings involve various camera-to-subject distances. Varying-view sequences are mainly used to evaluate approaches for arbitrary-view action recognition.
Subjects In total, we invite 118 persons to take part in the dataset collection. Each person performs 10 action categories on average, and each action category is performed by 40 subjects in total. Because the action categories are fitness actions, the subjects are aged from 18 to 30. Each subject is assigned a numeric ID in the collected dataset.
Categories There are a total of 40 action categories in the dataset. We show all the categories in Table I. Among the 40 categories, 15 are performed in two situations, standing and sitting on a chair. The other 25 action categories are all performed in the standing pose. These categories are given action IDs of a0 to a39. They contain both large motions of all body parts, e.g., spinal stretching and raising hand and jumping, and small movements of one part, e.g., head anticlockwise circling. They are much more complex than actions in existing datasets, e.g., hand waving and walking. Figure 3(a) shows frame samples of 13 action categories captured in 8 fixed viewpoints and in varying-view sequences, while Figure 3(b) displays temporal frames of one action (Standing opposite elbow-to-knee crunch) in a varying-view sequence. As shown in the figure, our dataset consists of complex motions and rapidly changing poses. Captured skeletons exhibit distortions and missing joints.
Quantities When capturing actions in fixed viewpoints, each subject repeats each action times. For each side-view action capture, i.e., view 1 to view 7, a sensor is placed in the front view to capture synchronous action sequences. We use three Kinect sensors to synchronously capture two side-view actions and one front-view action. Similarly, when we capture varying-view action sequences, we synchronously capture the front-view sequences. Thus each side-view action sequence has a synchronous sequence in the front viewpoint. In total, 11 sequences are captured in 8 fixed viewpoints and 2 varying-view sequences are recorded per action category per subject. These synchronous action sequences can be used for view transformation between side views and the front viewpoint. In our dataset, one RGB video in fixed viewpoints generally lasts about seconds, and contains frames. RGB videos of varying-view sequences last about seconds, containing about frames. The total length of RGB videos in our dataset is more than 83 hours. Depth and skeleton sequences are synchronized with the RGB videos, so they have frame counts similar to those of the RGB videos.
IV-B Comparison with other datasets
We compare our action dataset with other multi-view action datasets in Table II. It shows that our dataset has more subjects and viewpoints. The variety of subjects may be used to evaluate how well recognition approaches generalize across persons. In terms of viewpoints, besides the 8 fixed viewpoints, our dataset captures varying-view sequences which cover the full-circle views, a range that the other datasets do not provide. Varying-view action sequences provide experiment samples for arbitrary-view action recognition.
Our dataset collects fitness actions because they involve both large movements and small motions, and it may be applied to robot-assisted fitness training. In terms of sample quantity, our dataset contains a large number of action samples. The number of RGB video samples is 25,600, and there are 72,709 samples in total across RGB videos, depth sequences and skeleton sequences. More importantly, each varying-view sequence is ten times the length of sequences in other datasets. We estimate the total length of RGB videos in hours for existing multi-view datasets and list them in Table II. The comparison indicates that our dataset has the longest total video duration.
Table II: Comparison with other multi-view action datasets.
| Databases | Subjects | Categories | Viewpoints | Sensors | Data | Quantity (samples, RGB video length in hours) |
| | 24 | 14 | 4 | Kinect v1 | RGB, Depth | 6844, 34 |
| Multiview 3D Event | 8 | 8 | 3 | Kinect v1 | RGB, Depth, Skeleton | 3815, |
| Northwestern-UCLA | 10 | 10 | 3 | Kinect v1 | RGB, Depth, Skeleton | 1475, |
| UWA3D Multiview | 10 | 30 | 5 | Kinect v1 | RGB, Depth, Skeleton | 1075, |
| NTU RGB+D | 40 | 60 | 5, 16 settings | Kinect v2 | RGB, Depth, Skeleton, IR | 56,880, |
| Ours | 118 | 40 | 8 fixed + varying () | Kinect v2 | RGB, Depth, Skeleton | 25,600, 83 |
We evaluate approaches in our dataset using four types of evaluations. The standard evaluation of the cross-subject recognition in  is kept in our experiment. To evaluate the recognition performance between any two viewpoints and neighbor viewpoints, we propose two types of cross-view recognition. Furthermore, we evaluate the arbitrary-view action recognition in our dataset.
Cross-subject recognition In our dataset, each subject performs 10 actions, and each action is performed by 40 subjects. For cross-subject recognition, we randomly select 51 subjects, and assign the action samples performed by these subjects to the training group. The group covers all action categories. The subject IDs selected for training are 1, 2, 6, 12, 13, 16, 21, 24, 28-31, 33, 35, 39, 41, 42, 45, 47, 50, 52, 54, 55, 57, 59, 61, 63, 64, 67, 69-71, 73, 77, 81, 84, 86-88, 90, 91, 93, 96, 99, 102-104, 107, 108, 112, 113. Action samples of the remaining subjects are put into the testing group. This separation rule is used in all cross-subject recognition experiments in this paper.
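The subject split above can be written down directly. The following is a minimal sketch, assuming each sample is a dictionary with a hypothetical `subject` key holding the numeric subject ID; the helper names are illustrative and not part of the released dataset tools.

```python
def parse_id_ranges(spec):
    """Expand a string such as "1, 2, 28-31" into a sorted list of ints."""
    ids = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-")
            ids.update(range(int(lo), int(hi) + 1))
        else:
            ids.add(int(part))
    return sorted(ids)


# Training subject IDs exactly as listed in the text.
TRAIN_SPEC = (
    "1, 2, 6, 12, 13, 16, 21, 24, 28-31, 33, 35, 39, 41, 42, 45, 47, "
    "50, 52, 54, 55, 57, 59, 61, 63, 64, 67, 69-71, 73, 77, 81, 84, "
    "86-88, 90, 91, 93, 96, 99, 102-104, 107, 108, 112, 113"
)
TRAIN_SUBJECTS = parse_id_ranges(TRAIN_SPEC)


def cross_subject_split(samples):
    """Split samples into (train, test) by subject ID.

    Assumption: every sample is a dict with a 'subject' key (hypothetical).
    """
    train_ids = set(TRAIN_SUBJECTS)
    train = [s for s in samples if s["subject"] in train_ids]
    test = [s for s in samples if s["subject"] not in train_ids]
    return train, test
```

Expanding the ranges yields 51 distinct IDs, which matches the number of training subjects stated in the text.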
Cross-view recognition I To evaluate the performance of action recognition across viewpoints, action samples in one of the 8 fixed viewpoints are used for training, and the test is performed on samples in another fixed viewpoint. Results of the cross-view recognition are reported using a confusion matrix of all viewpoints.
Cross-view recognition II This evaluation shows the recognition performance between neighboring viewpoints. Action samples in four viewpoints which form a square cross shape in Figure 2 are used as training samples, and samples in the other four viewpoints are regarded as test samples. For example, samples of the front viewpoint and viewpoints 2, 4 and 6 are assigned to the training group, and samples of viewpoints 1, 3, 5, and 7 are assigned to the test group. Conversely, training is performed using samples captured in viewpoints 1, 3, 5 and 7, and samples of the front viewpoint and viewpoints 2, 4 and 6 are used for the test.
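The two cross-view protocols can be enumerated as follows. This is a sketch under one assumption that is not stated in the text: the front viewpoint is encoded as viewpoint 0, so that the eight fixed viewpoints are 0 to 7.

```python
from itertools import permutations

FRONT = 0  # assumption: the front viewpoint is encoded as 0
VIEWPOINTS = list(range(8))  # front viewpoint plus viewpoints 1..7


def cross_view_1_pairs():
    """Cross-view I: every (train_view, test_view) pair of distinct viewpoints."""
    return list(permutations(VIEWPOINTS, 2))


def cross_view_2_splits():
    """Cross-view II: the two interleaved four-viewpoint groups, used both ways."""
    even = [FRONT, 2, 4, 6]
    odd = [1, 3, 5, 7]
    return [(even, odd), (odd, even)]
```

Cross-view I therefore covers 8 x 7 = 56 train/test pairs, while Cross-view II consists of two experiments with the roles of the interleaved groups swapped.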
Arbitrary-view recognition We evaluate arbitrary-view recognition in two ways. In the first way (Arbitrary-view I), we use action samples captured in the 8 fixed viewpoints to train classifier models, and evaluate the trained models on varying-view sequences. In the other way (Arbitrary-view II), we use varying-view sequences for training and also test the trained model on varying-view sequences. Action sequences captured in continuously varying views generally contain 2000+ frames per sequence, which is 10 times the length of action sequences captured in fixed viewpoints. To run experiments under the same conditions as the other evaluations, we split each varying-view action sequence into 10 short clips of equal length so that each clip has a length similar to that of sequences captured in fixed views. These clips are used independently in the evaluation of recognition models.
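The clipping step can be sketched as below. The text does not say how a sequence whose length is not divisible by 10 is handled; this sketch assumes that the trailing remainder frames are dropped so that all clips have exactly the same length.

```python
def split_into_clips(frames, n_clips=10):
    """Split a frame sequence into n_clips consecutive clips of equal length.

    Assumption: frames left over after the last full clip are discarded.
    """
    clip_len = len(frames) // n_clips
    if clip_len == 0:
        raise ValueError("sequence is shorter than the number of clips")
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]
```

For a sequence of 2005 frames this yields 10 clips of 200 frames each, with the last 5 frames discarded.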
V View-guided Skeleton-CNN
V-A The VS-CNN network
The architecture of the VS-CNN is shown in Figure 4. Since skeleton visualization is able to reduce, to some extent, the difference between features in various views, we employ it to generate an initial skeleton representation of actions and use the representation as the input feature of the VS-CNN. Moreover, as action samples in our dataset cover the full-circle view, it is difficult for a single traditional feature learning model to learn correct feature representations for all views. Thus, we separate the full-circle view into four view groups and design four feature learning channels which correspond to the four view groups. A view-group predictor is designed to determine the view group of each action sample, and it guides the training of the corresponding feature learning channels and classifiers by routing samples to channels according to its prediction results. We then fuse the output score features of the four channels and train a classifier to determine the final action categories.
V-A1 View-group predictor
We separate the full-circle view into four view groups, and design a view-group predictor to separate action samples automatically. The view group separation is shown in Figure 5. The front viewpoint and viewpoints 1 and 2 form the first view group, while viewpoints 2, 3, and 4 are defined as the second view group. Similarly, the third view group includes viewpoints 4, 5, and 6, and the fourth group includes viewpoints 6, 7, and the front viewpoint. These groups overlap each other so that samples belonging to overlapping views drive the training of two feature learning channels. Since neighboring feature learning channels share part of their samples during training, each channel learns a common representation for samples in neighboring view groups. In this way, we obtain a common representation of action samples over the full-circle view range. Therefore, our approach is able to overcome view variance.
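The overlapping group structure can be expressed as a small lookup. This sketch encodes the front viewpoint as 0, which is an assumption made here for illustration, and indexes the four view groups from 0 to 3.

```python
FRONT = 0  # assumption: the front viewpoint is encoded as 0

# View groups as described in the text; neighboring groups share one viewpoint.
VIEW_GROUPS = [
    {FRONT, 1, 2},
    {2, 3, 4},
    {4, 5, 6},
    {6, 7, FRONT},
]


def groups_for_viewpoint(view):
    """Return the indices of all view groups that contain the given viewpoint."""
    return [g for g, members in enumerate(VIEW_GROUPS) if view in members]
```

Viewpoints 1, 3, 5 and 7 each belong to a single group, whereas the front viewpoint and viewpoints 2, 4 and 6 each belong to two groups and therefore contribute training samples to two channels.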
The view-group predictor consists of 3 CNN layers, and a SoftMax classifier is employed to determine the probability of a sample belonging to each of the four view groups. The CNN consists of layer 1 (16 kernels, stride 1), layer 2 (32 kernels, stride 1), and layer 3 (32 kernels, stride 2). Given an action sample, the group prediction network, with its own network parameters, outputs a vector of prediction scores, one for each of the four view groups. The prediction scores are further used to calculate weights by Equation (1). These weights are used for the weighted fusion of the score features output by the four classifiers. Equation (2) is designed as the prediction loss to train the view-group prediction network, and it is computed against the ground-truth view-group labels with respect to the parameters of the view-group predictor.
V-A2 View-guided feature learning channels
Corresponding to the four view groups, we design four feature learning channels using the SK-CNN as the base network. Each feature learning channel is composed of an SK-CNN backbone and one action classifier. According to the view-group prediction, action samples are separated into four view groups and input to the corresponding feature learning channels to learn action features, and the following classifier gives a prediction score vector over action categories. Each channel has its own network parameters. We employ a Softmax classifier to predict the action categories of action samples. The cross entropy is adopted as the loss function for training the feature learning networks and Softmax classifiers. Equation (3) shows the loss function of one channel, which compares the predicted results with the ground truth of action categories with respect to the classifier parameters.
V-A3 Channel fusion recognition
With the action prediction scores output by the four channels, a weighted fusion is performed through a fully connected layer with 40 neurons. Following that, a SoftMax classifier is adopted for the final action category determination. We also use the cross entropy as the loss function for classifier training, as shown in Equation (4), which compares the predicted action categories with the ground truth with respect to the classifier parameters. Here, the weights computed from the view-group prediction are used to weight the prediction scores of the four channels.
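The fusion step can be illustrated with a simplified sketch. It is not the paper's exact layer: the learned fully connected layer is omitted (treated as identity), the fusion is assumed to be a plain weighted sum of the four score vectors, and the function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_channel_scores(channel_scores, weights):
    """Weighted sum of per-channel action scores (assumed fusion rule)."""
    n_classes = len(channel_scores[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, channel_scores))
        for c in range(n_classes)
    ]


def predict_action(channel_scores, weights):
    """Return (predicted class index, class probabilities) after fusion."""
    fused = fuse_channel_scores(channel_scores, weights)
    probs = softmax(fused)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```

With a weight vector that is close to one-hot, the fused prediction follows the channel selected by the view-group predictor; with more evenly spread weights, several channels contribute to the decision.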
V-B Training and Testing
We employ the stochastic gradient descent algorithm to minimize the loss functions and train the parameters of the VS-CNN. The training is performed in three steps. In the first step, we assign view-group labels to action samples and train the view-group predictor. In the second step, we input training samples to the trained view-group predictor and automatically separate the samples into different view groups. These separated samples are given to the corresponding feature learning channels to train the feature learning networks and classifiers with the loss function of Equation (3). In the final step, we fuse the prediction scores of the four classifiers with weights and perform end-to-end training on the full VS-CNN network. This step adjusts the parameters of the view-group predictor and the four feature learning channels, and optimizes the parameters of the final classifier. In the experiment of cross-view recognition I, since only one viewpoint is used for training, not all four channels are necessary. Thus we fix the channel weights so that only one channel is trained, ignoring the other channels.
Testing phase For testing, a sample is input to the view-group prediction module, and the predictor generates view-group scores, which are used to calculate the channel weights. Following that, the test sample is input to the four feature learning channels, and each channel classifier produces prediction scores over action categories. We fuse the prediction results by assigning the weights to them, and input the fused score feature to the final classifier to determine the recognized category for the test sample.
In addition, we modify the skeleton visualization approach by calculating the inter-frame difference of skeletons and adding this motion information to the visualized skeleton images. The modified representation further reduces the differences between views and thus helps to deal with view variation in action recognition.
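The inter-frame difference itself is straightforward to compute. The sketch below assumes a skeleton sequence is stored as a list of frames, each frame being a list of 25 `(x, y, z)` joint coordinates; how the resulting motion is rendered into the skeleton image is not specified in the text and is left out.

```python
def inter_frame_differences(sequence):
    """Per-joint displacement between consecutive skeleton frames.

    sequence: list of frames, each a list of (x, y, z) joint tuples.
    Returns a list with len(sequence) - 1 frames of (dx, dy, dz) tuples.
    """
    diffs = []
    for prev, curr in zip(sequence, sequence[1:]):
        diffs.append([
            tuple(c - p for p, c in zip(joint_prev, joint_curr))
            for joint_prev, joint_curr in zip(prev, curr)
        ])
    return diffs
```

A sequence of T frames yields T - 1 difference frames, each with the same number of joints as the input.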
V-C Analysis of weight parameters
Based on the stochastic gradient descent algorithm, the parameters of the VS-CNN network are updated following Equation (5) during training.
Observing Equation (4), the weight assigned to a channel controls the parameter update of that channel's network during training. When the weight has a large value, the parameters of the corresponding classifier network are updated. Otherwise, they are updated slowly or not at all. In other words, the weights drive one of the four classifiers to be trained at a time, which means that the view-group prediction module guides the training of the four classifiers in the VS-CNN. Therefore, the VS-CNN is able to classify action samples correctly in each view group and can deal with full-circle-view action recognition by fusing the four classifiers together.
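To make the role of the weights concrete, the following derivation is a sketch under simplifying assumptions rather than the paper's exact formulation: the fused score is taken to be a plain weighted sum, the weights are treated as constants during the update, and the symbols $w_i$, $s_i$, $\theta_i$, $z$, $L$ and $\eta$ are hypothetical notation introduced here.

```latex
\begin{align*}
z &= \sum_{i=1}^{4} w_i \, s_i(x;\theta_i), \\
\frac{\partial L}{\partial \theta_i}
  &= w_i \left(\frac{\partial s_i}{\partial \theta_i}\right)^{\!\top}
     \frac{\partial L}{\partial z}, \\
\theta_i &\leftarrow \theta_i - \eta \, w_i
     \left(\frac{\partial s_i}{\partial \theta_i}\right)^{\!\top}
     \frac{\partial L}{\partial z}.
\end{align*}
```

Under these assumptions the gradient reaching the $i$-th channel is scaled by $w_i$, so a channel whose weight is close to zero for a given sample is barely updated by that sample, which is consistent with the qualitative behavior described above.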
VI Experiments and Discussions
We evaluate existing approaches and our proposed approach on the newly collected dataset. Four types of evaluations are performed, i.e. the cross-subject recognition, the cross-view recognition I and II, and the arbitrary-view recognition.
Using RGB videos, we report evaluation results of the joint heterogeneous features learning (JOULE) model [45, 39], the ResNeXt network, C3D (Convolutional 3D Network) and LRCN (Long-term Recurrent Convolutional Network). For the LRCN network, we use two feature learning networks, resnet34 and resnet50. We also report the evaluation of depth sequences using the C3D approach. For skeleton data, we evaluate the Temporal Convolutional Neural Networks (TCN) and its modified version Res-TCN, LSTM with 3D rotated skeletons, P-LSTM, SK-CNN, and the ST-GCN for four types of evaluations. The two-layer LSTM and P-LSTM are adopted for the evaluation.
For RGB videos and depth sequences, we evenly select 20 frames in each action sequence, and evenly select 40 frames in each skeleton sequence for experiments. The proposed VS-CNN model is evaluated on our dataset, and the results of four types of evaluations are compared with related approaches. In experiments, the average recognition accuracy is recorded for the comparison of performance. In the following experiment results, refers to the front viewpoint, and refer to the viewpoint .
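Even frame selection can be sketched as follows. The text does not state the exact sampling rule; this sketch assumes that the first and last frames are always included and that the remaining indices are spaced as uniformly as integer arithmetic allows.

```python
def evenly_spaced_indices(n_frames, n_samples):
    """Pick n_samples frame indices spread evenly over [0, n_frames - 1].

    Assumption: endpoints are included; indices may repeat when
    n_frames < n_samples.
    """
    if n_frames <= 0 or n_samples <= 0:
        raise ValueError("n_frames and n_samples must be positive")
    if n_samples == 1:
        return [0]
    return [i * (n_frames - 1) // (n_samples - 1) for i in range(n_samples)]
```

For example, `evenly_spaced_indices(n, 20)` would be used for an RGB or depth sequence of `n` frames and `evenly_spaced_indices(n, 40)` for a skeleton sequence.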
|Source||Approach||Cross-subject||Cross-View I||Cross-View II||Arbitrary-view I||Arbitrary-view II|
VI-A Cross-subject recognition
The evaluation uses action samples captured in the 8 fixed viewpoints. From the 118 subjects, we select 51 subjects and assign the action samples performed by these subjects to the training group. We test on action samples of the remaining subjects and record the average recognition accuracy for each fixed viewpoint. Per-viewpoint results of all evaluated approaches are listed in Table IV. The average result over all viewpoints for each approach is shown in Table III.
Table IV shows that the accuracies of viewpoints 1 to 7 are symmetrically distributed around viewpoint 4 for all approaches. Accuracies of viewpoints 3 and 5 are generally lower than those of other viewpoints because actions in these two viewpoints suffer heavy occlusions. The front viewpoint always gets the highest accuracy. Comparing the results of RGB video, depth sequence, and skeleton data, results obtained using skeleton data are better than those of depth and RGB video, which demonstrates the superiority of skeleton data in action recognition. The RGB video and depth sequence modalities achieve comparable performance. According to Table III, the VS-CNN outperforms the other approaches. Apart from the VS-CNN, the JOULE, the ST-GCN, and the Res-TCN also perform well due to their robust action feature representations.
In addition, we show the confusion matrix of recognition results of cross-subject evaluation using the VS-CNN approach in Fig. 6. Here, only skeleton data is used for the evaluation. The write color represents recognition results with a value of 0, and the red color represents recognition results of 1. As shown in the figure, most action categories have weak confusion with other categories. Thus the dataset provides suitable categories for algorithm evaluation.
VI-B Cross-view recognition
In the Cross-view I experiment, we train the VS-CNN network using action samples from one of the 8 fixed viewpoints and test on the other 7 fixed viewpoints. To illustrate the performance, we calculate the average recognition accuracy per viewpoint and build a confusion matrix containing the recognition results of the 8 viewpoints. Figure 7 shows the cross-view confusion matrices obtained by the 9 recognition approaches. In each confusion matrix, the vertical and horizontal axes refer to training viewpoints and test viewpoints, respectively. Dark colors represent high recognition accuracies, and light colors represent lower ones. We also calculate the average over all viewpoints for each recognition approach and list the averages of the 9 approaches in Table III.
As shown in Figure 7, the results of all approaches are better at the corners of the matrices than at other positions. These results correspond to the front viewpoint and viewpoints 1 and 7. There is a angle between the viewpoint 1 and 7 and the front viewpoint. It indicates that view change within has less effect on recognition results. Comparing the 9 confusion matrices, we find that the rows and columns corresponding to viewpoints 2 and 6 have lower accuracies. The reason is that only side silhouettes are captured from these two viewpoints, so the skeletons are heavily distorted and recognition relying on them performs worse. Compared with the other approaches in Table III, the VS-CNN obtains a lower accuracy because samples from only a single viewpoint are used in both training and testing, so our model cannot exploit its advantage in this setting. All approaches perform worse in Cross-view I than in the cross-subject recognition and the other evaluations. This is reasonable because action samples from different viewpoints vary greatly.
The Cross-view II evaluation is performed in two rounds. We separate the viewpoints into two groups: one contains the front viewpoint and viewpoints 2, 4, and 6, and the other contains viewpoints 1, 3, 5, and 7. The action samples of the two groups are used as training samples and test samples in turn. The results of both rounds are summarized in Table V, which lists the average recognition accuracy for each viewpoint in the two rounds of evaluation. The final average results for all approaches are recorded in Table III.
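The construction of one cross-view matrix can be sketched as a loop over training and test viewpoints. The callback `evaluate(train_vp, test_vp)` is hypothetical and stands for training a model on one viewpoint and returning its accuracy on another; the viewpoint order and the empty diagonal are assumptions, since the text only states that testing uses the other 7 viewpoints.

```python
# Hypothetical viewpoint labels; the row and column order is an assumption.
VIEWPOINTS = ("FV", "V1", "V2", "V3", "V4", "V5", "V6", "V7")


def cross_view_matrix(evaluate, viewpoints=VIEWPOINTS):
    """Rows are training viewpoints, columns are test viewpoints."""
    matrix = []
    for train_vp in viewpoints:
        row = []
        for test_vp in viewpoints:
            # Same-viewpoint cells are left empty in this sketch.
            row.append(None if train_vp == test_vp else evaluate(train_vp, test_vp))
        matrix.append(row)
    return matrix


def mean_off_diagonal(matrix):
    """Average accuracy over all train/test pairs with different viewpoints."""
    values = [v for row in matrix for v in row if v is not None]
    return sum(values) / len(values)
```

A single number per approach, comparable to a table entry, would then be `mean_off_diagonal(cross_view_matrix(evaluate))`, again assuming that the reported average is taken over all 56 train/test pairs.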
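The two-round procedure can be sketched as a swap of the two viewpoint groups. The callback `evaluate(train_views, test_views)` is hypothetical and is assumed to return a dictionary mapping each test viewpoint to its accuracy; averaging the eight per-viewpoint values into one final number is also an assumption about how the summary figure is formed.

```python
GROUP_A = ("FV", "V2", "V4", "V6")
GROUP_B = ("V1", "V3", "V5", "V7")


def cross_view_ii(evaluate):
    """Run both rounds and return per-viewpoint accuracies and their mean."""
    per_view = {}
    # Round 1 trains on group A and tests on group B; round 2 swaps them.
    for train_views, test_views in ((GROUP_A, GROUP_B), (GROUP_B, GROUP_A)):
        per_view.update(evaluate(train_views, test_views))
    overall = sum(per_view.values()) / len(per_view)
    return per_view, overall
```

Because each viewpoint is tested exactly once across the two rounds, the result contains one accuracy per viewpoint.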
| Training | V1, V3, V5, V7 | FV, V4, V2, V6 |
Table V shows that the results obtained at the front viewpoint and viewpoints 2, 4, and 6 are slightly worse than those at the other 4 viewpoints because of noisy skeletons at viewpoints 1, 3, 5, and 7 caused by occlusions. Moreover, the results at the front viewpoint and viewpoints 1 and 7 are, as before, better than those at other viewpoints. And nearly all approaches get worse performance in the viewpoint of 2, 3 and 6, corresponding to the view angles , , .
According to Table III, the VS-CNN performs much better than the other approaches. Its Cross-view II result is only 5% lower than its cross-subject result, which suggests that our evaluation rule of using samples from the 8 fixed viewpoints to train classifiers for arbitrary-view recognition is reasonable. Although the VS-CNN performs worse than the SK-CNN at viewpoints 2, 3, and 5 in Table V, it outperforms the SK-CNN on the average over all viewpoints, as shown in Table III. In addition, recognition using RGB videos outperforms recognition using depth sequences in both cross-view evaluations because depth data has only one channel, so view variance causes heavy occlusion. Among the three modalities, i.e., RGB videos, depth sequences, and skeleton sequences, the depth modality performs worst in the cross-view evaluation. This is reasonable because depth images have lower resolution than RGB frames and are prone to heavy occlusion when the capture view changes.
VI-C Arbitrary-view recognition
In the Arbitrary-view I evaluation, we train recognition models using action samples from the 8 fixed viewpoints and test on short sections of the varying-view sequences. For the Arbitrary-view II evaluation, we divide the short sections into two halves according to subject; one half serves as the training set and the other as the test set.
In the Arbitrary-view I experiment, we evaluate all approaches on each temporally segmented section and show the recognition accuracy per section in Figure 8. Here, each varying-view sequence is segmented into 10 clips. Since each varying-view action sequence covers the entire view angle, one separated short section covers a view angle of . The figure shows that the recognition accuracy follows a “W” shape along the varying-view sequences. At the beginning and the end of the sequences, recognition accuracies are high for all approaches because the moving sensor is near the front viewpoint, so the captured data is of better quality, with no distortion and no occlusion. Along a sequence, the recognition accuracy gradually decreases to a first valley, then increases to a peak, and decreases again to a second valley. After that, it increases once more until the end of the sequence. According to the figure, the two valleys lie at the 4th section and at the 7th and 8th sections. If the position of the front view is defined as , the 4th segmented section occupies the view range of . And the 7th, 8th sections cover an angle range of . These two positions correspond to the neighborhoods of viewpoints V3 and V5 in Figure 2. In these view ranges, occlusions lead to heavy noise in the skeleton sequences and information loss in the RGB videos, so the recognition results are worse.
For the Arbitrary-view II experiment, we use sections of the full-circle view sequences to train recognition models and evaluate the trained models on segmented varying-view sections. Figure 9 shows the average recognition accuracies over the 10 sections of the varying-view sequences obtained with various recognition approaches. Since JOULE and ResNeXt perform worse in most of the above experiments, we do not list their results here. As shown in Figure 9, training on varying-view sequences yields better recognition performance than in Arbitrary-view I. Furthermore, the accuracy curves in Figure 9 are flat for most recognition approaches, which is very different from Figure 8. This is because both the training set and the test set cover the full-circle views of actions, which improves recognition performance.
In addition, we evaluate the effect of section segmentation in the Arbitrary-view II experiment. We change the number of segmented sections to 15 and evaluate all recognition approaches again. For each approach, we average the recognition accuracies over all sections of the varying-view sequences and compare them in Figure 10 with the results obtained by segmenting each sequence into 10 sections. The comparison shows that separating varying-view sequences into 10 sections is the better choice. In this case, the segmented clips have a length similar to that of the action sequences captured from fixed viewpoints, which suits our experiments.
The above evaluations show that the proposed VS-CNN network outperforms existing approaches in cross-subject, cross-view, and arbitrary-view recognition. Comparing the different types of evaluation, cross-subject recognition obtains the highest accuracy, while Cross-view I and Arbitrary-view I perform a little worse. In cross-subject recognition, the training and test samples share the same viewpoints, whereas the situation is entirely different in the other three types of evaluation, especially in Cross-view I and arbitrary-view recognition. The gap is mainly due to the mismatch between the data distributions of the training and test sets in these experiments. The comparison between Arbitrary-view I and Arbitrary-view II also shows this problem clearly. However, it is impossible to collect actions from arbitrary views in real-world HRI applications, so approaches are required that recognize arbitrary-view actions from training samples captured in limited views. Action recognition with unknown viewpoints remains a challenging problem, and we will continue working on it in future research.
In this paper, we presented a newly collected large-scale RGB-D action dataset for arbitrary-view action analysis. The dataset contains samples captured from 8 fixed viewpoints and varying-view sequences that cover the entire range of view angles. Samples captured from fixed viewpoints provide training data for arbitrary-view recognition and may also be used for multi-view recognition. The dataset contains more viewpoints, more subjects, and, in particular, varying-view sequences covering full-circle view angles, and thus provides sufficient data for multi-view and arbitrary-view action analysis. We further proposed the VS-CNN network to recognize arbitrary-view actions and evaluated it on our dataset for cross-subject, cross-view, and arbitrary-view recognition. All experiments were compared with 8 related action recognition approaches. These experiments demonstrated the superior performance of the proposed VS-CNN network.
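The relation between section index and view range can be sketched as follows. The sketch assumes that the sensor sweeps a full 360 degrees at constant angular speed, that the front view lies at 0 degrees, and that sections are numbered from 1. None of these conventions, nor the function names, are stated in the text, so the resulting numbers are illustrative only.

```python
def split_into_sections(num_frames, num_sections=10):
    """Return (start, end) frame bounds of each section, end exclusive."""
    bounds = [(k * num_frames) // num_sections for k in range(num_sections + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def section_angle_range(section_index, num_sections=10, total_angle=360.0):
    """View range in degrees covered by a 1-based section index (assumed sweep)."""
    width = total_angle / num_sections
    return (section_index - 1) * width, section_index * width


sections = split_into_sections(100)         # ten clips of 10 frames each
fourth_range = section_angle_range(4)       # (108.0, 144.0) under the assumptions
width_15 = section_angle_range(1, 15)[1]    # 24.0 degrees per section with 15 sections
```

Changing `num_sections` from 10 to 15 shortens each clip and narrows the view range it covers, which is the trade-off examined when the number of sections is varied.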
This research is supported by the Natural Science Foundation of China (NSFC) under grant No. 61673088 and grant No. 61305043. This work was partly supported by the 111 Project No. B17008.
-  C. Li, Z. Huang, Y. Yang, J. Cao, X. Sun, and H. T. Shen, “Hierarchical latent concept discovery for video event detection,” IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2149–2162, 2017.
-  Y. Shen, R. Ji, S. Zhang, W. Zuo, Y. Wang, and F. Huang, “Generative adversarial learning towards fast weakly supervised detection,” in CVPR, 2018.
-  Y. Bin, Y. Yang, F. Shen, N. Xie, H. T. Shen, and X. Li, “Describing video with attention based bidirectional lstm,” IEEE Transactions on Cybernetics, 2018, doi:10.1109/TCYB.2018.2831447.
-  F. Zhu, L. Shao, and M. Lin, “Multi-view action recognition using local similarity random forests and sensor fusion,” Pattern Recognition Letters, vol. 34, pp. 20–24, 2013.
-  Z. Gao, S. Li, Y. Zhu, C. Wang, and H. Zhang, “Collaborative sparse representation leaning model for rgbd action recognition,” Journal of Visual Communication and Image Representation, vol. 48, pp. 442–452, 2017.
-  L. Yu, Y. Yang, Z. Huang, P. Wang, J. Song, and H. T. Shen, “Web video event recognition by semantic analysis from ubiquitous documents,” IEEE Transactions on Image Processing, vol. 25, no. 12, pp. 5689–5701, 2016.
-  M. W. L. Xie, Q. Tian and B. Zhang, “Spatial pooling of heterogeneous features for image classification,” IEEE Transactions on Image Processing, vol. 23, no. 5, pp. 1994–2008, 2014.
-  H. Wu, J. Shao, X. Xu, Y. Ji, F. Shen, and H. T. Shen, “Recognition and detection of two-person interactive actions using automatically selected skeleton features,” IEEE Transactions on Human-Machine Systems, 2017, DOI:10.1109/THMS.2017.2776211.
-  J. Hu, W. Zheng, L. Ma, et al., “Real-time rgb-d activity prediction by soft regression,” in Proc. of ECCV, 2016.
-  Y. Ji, Y. Ko, A. Shimada, H. Nagahara, and R. Taniguchi, “Cooking gesture recognition using local feature and depth image,” in Proc. of ACMMM in workshop CEA, 2012.
-  C. Li, P. Wang, S. Wang, Y. Hou, and W. Li, “Skeleton-based action recognition using lstm and cnn,” CoRR, vol. abs/1707.02356, 2017.
-  Y. Ji, Y. Yang, X. Xu, and H. T. Shen, “One-shot learning based pattern transition map for action early recognition,” Signal Processing, vol. 140, pp. 364–370, 2018.
-  C. Zhang and W. Zheng, “Semi-supervised multi-view discrete hashing for fast image search,” IEEE Transactions on Image Processing, vol. 26, no. 6, pp. 2604–2617, 2017.
-  D. Weinland, R. Ronfard, and E. Boyer, “Free viewpoint action recognition using motion history volumes,” Computer Vision and Image Understanding, vol. 104, pp. 249–257, 2006.
-  D. Weinland, E. Boyer, and R. Ronfard, “Action recognition from arbitrary views using 3d exemplars,” in ICCV, 2007.
-  J. Liu, M. Shah, B. Kuipers, and S. Savarese, “Cross-view action recognition via view knowledge transfer,” in CVPR, 2011.
-  Z.Cheng, L. Qin, Y. Ye, Q. Huang, and Q. Tian, “Human daily action analysis with multi-view and color-depth data,” in ECCV, 2012.
-  P. Wei, Y. Zhao, N. Zheng, and S. Zhu, “Modeling 4d human-object interactions for event and object recognition,” in ICCV, 2013.
-  J. Wang, X. Nie, Y. Xia, Y. Wu, and S. Zhu, “Cross-view action modeling, learning, and recognition,” in CVPR, 2014.
-  H. Rahmani, A. Mahmood, D. Huynh, and A. Mian, “Histogram of oriented principal components for cross-view action recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, pp. 2430–2443, 2016.
-  A. Shahroudy, J. Liu, T. T. Ng, and G. Wang, “Ntu rgb+d: A large scale dataset for 3d human activity analysis,” in CVPR, 2016.
-  A. Gupta, J. Martinez, J. J. Little, and R. J. Woodham, “3d pose from motion for cross-view action recognition via non-linear circulant temporal encoding,” in CVPR, 2014.
-  H. Rahmani and A. Mian, “Learning a non-linear knowledge transfer model for cross-view action recognition,” in CVPR, 2015.
-  ——, “3d action recognition from novel viewpoints,” in CVPR, 2016.
-  L. Rybok, S. Friedberger, U. D. Hanebeck, and R. Stiefelhagen, “The kit robo-kitchen data set for the evaluation of view-based activity recognition systems,” in 11th IEEE-RAS International Conference on Humanoid Robots (Humanoids 2011), 2011.
-  R. Li and T. Zickler, “Discriminative virtual views for cross-view action recognition,” in CVPR, 2012.
-  Z. Zhang, C. Wang, B. Xiao, W. Zhou, S. Liu, and C. Shi, “Cross-view action recognition via a continuous virtual path,” in CVPR, 2013.
-  A. Liu, N. Xu, W. Nie, Y. Su, Y. Wong, and M. Kankanhalli, “Benchmarking a multimodal and multiview and interactive dataset for human action recognition,” IEEE Trans. Cybernetics, vol. 47, pp. 1781–1794, 2017.
-  Y. Ji, H. Cheng, Y. Zheng, and H. Li, “Learning contrastive feature distribution model for interaction recognition,” Journal of Visual Communication and Image Representation, vol. 33, pp. 340–349, Nov. 2015.
-  H. Rahmani, A. Mahmood, D. Huynh, and A. Mian, “Histogram of oriented principal components for cross-view action recognition,” in ECCV, 2014.
-  J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Learning actionlet ensemble for 3d human action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, pp. 914–927, 2014.
-  M. Liu, H. Liu, and C. Chen, “Enhanced skeleton visualization for view invariant human action recognition,” Pattern Recognition, vol. 68, pp. 346–362, 2017.
-  Y. G. Jiang, Z. Wu, J. Wang, X. Xue, and S. F. Chang, “Exploiting feature and class relationships in video categorization with regularized deep neural networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 2, pp. 352–364, 2018.
-  Y.-G. Jiang, Q. Dai, W. Liu, X. Xue, and C.-W. Ngo, “Human action recognition in unconstrained videos by explicit motion modeling,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 3781–3795, 2015.
-  M. Hu, Y. Yang, F. Shen, L. Zhang, H. T. Shen, and X. Li, “Robust web image annotation via exploring multi-facet and structural knowledge,” IEEE Transactions on Image Processing, vol. 26, no. 10, pp. 4871–4884, 2017.
-  P. Yan, S. M. Khan, and M. Shah, “Learning 4d action feature models for arbitrary view action recognition,” in CVPR, 2008.
-  Z. Cai, L. Wang, X. Peng, and Y. Qiao, “Multi-view super vector for action recognition,” in CVPR, 2014.
-  L. Zhang, Y. Yang, M. Wang, R. Hong, L. Nie, and X. Li, “Detecting densely distributed graph patterns for fine-grained image categorization,” IEEE Transactions on Image Processing, vol. 25, no. 2, pp. 553–565, 2016.
-  J. Hu, W. S. Zheng, J. Lai, and J. Zhang, “Jointly learning heterogeneous features for rgb-d activity recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, pp. 2186–2200, 2017.
-  K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?” in CVPR, 2018.
-  J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in ICCV, 2015.
-  J. Hu, W. Zheng, J. Lai, S. Gong, and T. Xiang, “Exemplar-based recognition of human-object interactions,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 4, pp. 647–660, 2016.
-  T. Kim and A. Reiter, “Interpretable 3d human action analysis with temporal convolutional networks,” in CVPRW, 2017.
-  S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in AAAI, 2018.
-  J. Hu, W. S. Zheng, J. H. Lai, and J. Zhang, “Jointly learning heterogeneous features for rgb-d activity recognition,” in CVPR, 2015.
-  D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in ICCV, 2015.
-  C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” in CVPR, 2017.