Human action recognition has many applications in security, surveillance, sports analysis, human-computer interaction and video retrieval. However, automatic human action recognition algorithms are still challenged by noise due to action-irrelevant factors such as changing camera viewpoints, clothing textures, body shapes, backgrounds and illumination conditions. In this paper, we address these challenges to perform robust human action recognition in conventional RGB videos and RGB-D videos obtained from range sensors.
A human action can be defined as a collection of sequentially organized human poses where the action is encoded in the way the human pose transitions from one pose to the other. However, for action classification, a human pose must be represented in a way that is invariant to the above conditions. Since some human poses are common between multiple actions and the space of possible human poses is much smaller compared to that of possible human actions, we first model the human pose independently and then model the actions as the temporal variations between human poses.
To suppress action irrelevant information in videos, many techniques use dense trajectories (Gupta et al, 2014; Wang et al, 2011, 2013a; Wang and Schmid, 2013) or Hankelets (Li et al, 2012) which encode only the temporal cues that are essential for action classification. Such methods have shown good performance for human action recognition in conventional videos. However, they are still sensitive to viewpoint variations and do not fully exploit the appearance (human pose) information. Dense trajectories are also noisy and contain self-occlusion artefacts.
While appearance is an important cue for action recognition, human poses appear very differently from different camera viewpoints. Research efforts have been made to model these variations. For example, synthetic 2D human poses from many viewpoints and their transitions were used for action recognition in (Lv and Nevatia, 2007). However, 3D viewpoint variations cannot be modelled accurately using 2D human poses. Spatio-temporal 3D occupancy grids built from multiple viewpoints were used in (Weinland et al, 2007) to achieve view-invariant action recognition. However, occupancy grids rely on silhouettes which are noisy in real videos. In this paper, we use full 3D human models to learn a representation of the human pose that is not only invariant to viewpoint but also to other action irrelevant factors such as background, clothing and illumination.
Our contributions can be summarized as follows. Firstly, we propose a method for generating RGB and Depth images of human poses using Computer Graphics and Generative Adversarial Network (GAN) training. We learn representative human poses by clustering real human joint/skeleton data obtained with motion capture technology (CMU MoCap database, http://mocap.cs.cmu.edu). Each representative pose skeleton is fitted with synthetic 3D human models and then placed in random scenes, given different clothes, illuminated from multiple directions and rendered from 180 camera viewpoints to generate RGB and depth images of the human poses with known labels. Figure 1 illustrates the proposed RGB data generation pipeline. Depth images are generated in a similar way except that they are devoid of texture and background. We employ GANs to minimize the gap between the distributions of synthetic and real images. Although used as an essential component of network training in this work, the proposed synthetic data generation technique is generic and can be used to produce large amounts of synthetic human poses for deep learning in general.
Secondly, we propose Human Pose Models (HPM) that are Convolutional Neural Networks and transfer human poses to a shared high level invariant space. The HPMs are trained with the images that are refined with GAN and learn to map input (RGB or Depth) images to one of the representative human poses irrespective of the camera viewpoint, clothing, human body size, background and lighting conditions. The layers prior to the Softmax label in the CNNs serve as high-level invariant human pose representations. Lastly, we propose to temporally model the invariant human pose features with the Fourier Temporal Pyramid and use an SVM for classification. The proposed methods work together to achieve robust RGB-D human action recognition under the modeled variations.
Experiments on three benchmark cross-view human action datasets show that our method outperforms existing state-of-the-art for action recognition in conventional RGB videos as well as RGB-D videos obtained from depth sensors. The proposed method improves RGB-D human action recognition accuracies by 15.4%, 11.9% and 8.4% on the UWA3D-II (Rahmani et al, 2016), NUCLA (Wang et al, 2014) and NTU (Shahroudy et al, 2016a) datasets respectively. Our method also improves RGB human action recognition accuracy by 9%.
This work is an extension of (Rahmani and Mian, 2016) where only depth image based human pose model and action recognition results were presented. However, depth images are almost always accompanied by RGB images. Therefore, we present the following extensions: (1) We present a method (the code for this method will be made public) for synthesizing realistic RGB human pose images containing variations in clothing textures, background, lighting conditions and more variations in body shapes. This method can also synthesize depth images more efficiently compared to (Rahmani and Mian, 2016). (2) We adopt Generative Adversarial Networks (GANs) to refine the synthetic RGB images and depth images so as to reduce their distribution gaps from real images and achieve improved accuracy. (3) We present a Human Pose Model HPM$_{RGB}$ for human action recognition in conventional RGB videos which has wider applications. The proposed HPM$_{RGB}$ achieves state-of-the-art human action recognition accuracy in conventional RGB videos. (4) We re-train the depth model HPM$_{3D}$ using the GoogLeNet (Szegedy et al, 2015) architecture which performs similarly to the AlexNet (Krizhevsky et al, 2012) model in (Rahmani and Mian, 2016) but with four times smaller feature dimensionality. (5) We perform additional experiments on the largest RGB-D human action dataset (NTU (Shahroudy et al, 2016a)) and report state-of-the-art results for action recognition in RGB videos and RGB-D videos in the cross-view and cross-subject settings. From here on, we refer to the Human Pose Models as HPM$_{RGB}$ and HPM$_{3D}$ for RGB and depth modalities respectively.
2 Related Work
The closest work to our method is the key pose matching technique proposed by Lv and Nevatia (2007). In their approach, actions are modelled as series of synthetic 2D human poses rendered from many viewpoints and the transition between the synthetic poses is represented by an Action Net graph. However, the rendered images are not realistic as they do not model variations in clothing, background and lighting as in our case. Moreover, our method directly learns features from the rendered images rather than hand crafting features.
Another closely related work to our method is the 3D exemplars for action recognition proposed by Weinland et al (2007). In their framework, actions are modelled with 3D occupancy grids built from multiple viewpoints. The learned 3D exemplars are then used to produce 2D images that are compared to the observations during recognition. This method essentially relies on silhouettes which may not be reliably extracted from the test videos especially under challenging background/lighting conditions.
Li et al (2012) proposed Hankelets, a viewpoint invariant representation that captures the dynamic properties of short tracklets. Hankelets do not carry any spatial information and their viewpoint invariant properties are limited. Very early attempts for view invariant human action recognition include the following methods. Yilmaz and Shah (2005) proposed action sketch, an action representation that is a sequence of the 2D contours of an action in space-time. Such a representation is not completely viewpoint invariant. Parameswaran and Chellappa (2006) used 2D projections of 3D human motion capture data as well as manually segmented real image sequences to perform viewpoint robust action recognition. Rao et al (2002) used the spatio-temporal 2D trajectory curvatures as a compact representation for view-invariant action recognition. However, the same action can result in very different 2D trajectories when observed from different viewpoints. Weinland et al (2006) proposed Motion History Volumes (MHV) as a viewpoint invariant representation for human actions. MHVs are aligned and matched using Fourier Transform. This method requires multiple calibrated and background-subtracted video cameras which is only possible in controlled environments.
View knowledge transfer methods transfer features of different viewpoints to a space where they can be directly matched to achieve viewpoint invariant action recognition. Early methods in this category learned similar features between different viewpoints. For example, Farhadi and Tabrizi (2008) represented actions with histograms of silhouettes and optical flow and learned features with maximum margin clustering that are similar in different views. Source views are then transferred to the target view before matching. Given sufficient multiview training instances, it was shown later that a hash code with shared values can be learned (Farhadi et al, 2009). Gopalan et al (2011) used domain adaptation for view transfer. Liu et al (2011) used a bipartite graph to model two view-dependent vocabularies and applied bipartite graph partitioning to co-cluster two vocabularies into visual-word clusters called bilingual-words that bridge the semantic gap across view-dependent vocabularies. More recently, Li and Zickler (2012) proposed the idea of virtual views that connect action descriptors from one view to those extracted from another view. Virtual views are learned through linear transformations of the action descriptors. Zhang et al (2013) proposed the idea of a continuous virtual path that connects actions from two different views. Points on the virtual path are virtual views obtained by linear transformations of the action descriptors. They proposed a virtual view kernel to compute similarity between two infinite-dimensional features that are concatenations of the virtual view descriptors leading to kernelized classifiers. Zheng and Jiang (2013) learned a view-invariant sparse representation for cross-view action recognition. Rahmani and Mian (2015) proposed a non-linear knowledge transfer model that mapped dense trajectory action descriptors to canonical views. However, this method does not exploit the appearance/shape features.
Deep learning has also been used for action recognition. Simonyan and Zisserman (2014) and Feichtenhofer et al (2016) proposed two stream CNN architectures using appearance and optical flow to perform action recognition. Wang et al (2015) proposed a two stream structure to combine hand-crafted features and deep learned features. They used trajectory pooling for one stream and deep learning for the second and combined the features from the two streams to form trajectory-pooled deep-convolutional descriptors. Nevertheless, the method did not explicitly address viewpoint variations. Toshev and Szegedy (2014) proposed DeepPose for human pose estimation based on Deep Neural Networks which treated pose estimation as a regression problem and represented the human pose by body joint locations. This method is able to capture the context and reason about the pose in a holistic manner. However, the scale of the dataset used for training was limited and the method does not address viewpoint variations. Pfister et al (2015) proposed a CNN architecture to estimate human poses. Their architecture directly regresses pose heat maps and combines them with optical flow. This architecture relies on neighbouring frames for pose estimations. Ji et al (2013) proposed a 3D Convolutional Neural Network (C3D) for human action recognition. They used a set of hard-wired kernels to generate multiple information channels corresponding to the gray pixel values, ($x$, $y$) gradients, and ($x$, $y$) optical flow from seven input frames. This was followed by three convolution layers whose parameters were learned through back propagation. The C3D model did not explicitly address invariance to viewpoint or other factors.
Wang et al (2016b) proposed joint trajectory maps, projections of 3D skeleton sequences to multiple 2D images, for human action recognition. Karpathy et al (2014) suggested a multi-resolution foveated architecture for speeding up CNN training for action recognition in large scale videos. Varol et al (2017a) proposed long-term temporal convolutions (LTC) and showed that LTC-CNN models with increased temporal extents improve action recognition accuracy. Tran et al (2015) treated videos as cubes and performed convolutions and pooling with 3D kernels. Recent methods (Li et al, 2016; Zhu et al, 2016; Wang and Hoai, 2016; Zhang et al, 2016; Su et al, 2016; Wang et al, 2016a) emphasize action recognition in large scale videos where the background context is also taken into account.
Shahroudy et al (2016b) divided the actions into body parts and proposed a multimodal-multipart learning method to represent their dynamics and appearances. They selected the discriminative body parts by integrating a part selection process into the learning and proposed a hierarchical mixed norm to apply sparsity between the parts, for group feature selection. This method is based on depth and skeleton data and uses LOP (local occupancy patterns) and HON4D (histogram of oriented 4D normals) as features. Yu et al (2016) proposed a Structure Preserving Projection (SPP) to represent RGB-D video data fusion. They described the gradient fields of RGB and depth data with a new Local Flux Feature (LFF), and then fused the LFFs from RGB and depth channels. With structure-preserving projection, the pairwise structure and bipartite graph structure are preserved when fusing RGB and depth information into a Hamming space, which benefits general action recognition.
Huang et al (2016) incorporated the Lie group structure into deep learning, to transform high-dimensional Lie group trajectories into temporally aligned Lie group features for skeleton-based action recognition. The incorporated learning structure generalizes the traditional neural network model to non-Euclidean Lie groups. Luo et al (2017) proposed to use a Recurrent Neural Network based Encoder-Decoder framework to learn a video representation that captures motion dependencies. The learning process is unsupervised and it focuses on encoding the sequence of atomic 3D flows in consecutive frames.
Jia et al (2014a) proposed a latent tensor transfer learning method to transfer knowledge from the source RGB-D dataset to the target RGB only dataset such that the missing depth information in the target dataset can be compensated. The learned 3D geometric information is then coupled with RGB data in a cross-modality regularization framework to align them. However, to learn the latent depth information for RGB data, an RGB-D source dataset is required to perform the transfer learning, and for different source datasets, the learned information may not be consistent which could affect the final performance. Kong and Fu (2017) proposed the max-margin heterogeneous information machine (MMHIM) to fuse RGB and depth features. The histograms of oriented gradients (HOG) and histograms of optical flow (HOF) descriptors are projected into independent shared and private feature spaces, and the features are represented in matrix forms to build a low-rank bilinear model for the classification. This method utilizes the cross-modality and private information, which are also de-noised before the final classification.
Kerola et al (2017) used spatio-temporal key points (STKP) and skeletons to represent an action as a temporal sequence of graphs, and then applied the spectral graph wavelet transform to create the action descriptors. Varol et al (2017b) recently proposed SURREAL to synthesize human pose images for the task of body segmentation and depth estimation. The generated dataset includes RGB images, together with depth maps and body parts segmentation information. They then learned a CNN model from the synthetic dataset and conducted pixel-wise classification for the real RGB pose images. This method made efforts in diversifying the synthetic data. However, it did not address the distribution gap between synthetic and real images. Moreover, this method only performs human body segmentation and depth estimation.
Our survey shows that none of the existing techniques explicitly learn invariant features through a training dataset that varies all irrelevant factors for the same human pose. This is partly because such training data is very difficult and expensive to generate. We resolve this problem by developing a method to generate such data synthetically. The proposed data generation technique is a major contribution of this work that can synthesize large amounts of data for training data-hungry deep network models. By easily introducing a variety of action irrelevant variations in the synthetic data, it is possible to learn effective models that can extract invariant information. In this work, we propose Human Pose Models for extracting such information from both RGB and depth images.
3 Generating Synthetic Training Data
The proposed synthetic data generation steps are explained in the following subsections. Unless specified, the steps are shared by RGB and depth data generation.
3.1 Learning Representative Human Poses
Since the space of possible human poses is extremely large, we learn a finite number of representative human poses in a way that is not biased by irrelevant factors such as body shapes, appearances, camera viewpoints, illumination and backgrounds. Therefore, we learn the representative poses from 3D human joints (skeletons), because each skeleton can be fitted with 3D human models of any size/shape. The CMU MoCap database is ideal for this purpose because it contains joint locations of real humans performing different actions resulting in a large number of different poses. This data consists of over 2500 motion sequences and over 200,000 human poses. We randomly sample 50,000 frames as the pose candidates and cluster them with the HDBSCAN algorithm (McInnes et al, 2017) using the skeletal distance function of Shakhnarovich (2005), which is computed from the joint locations of two skeletons. By setting the minimum cluster size to 20, the HDBSCAN algorithm outputs 339 clusters and we choose the pose with the highest HDBSCAN score in each cluster to form the representative human poses. Figure 2 shows a few of the learned representative human poses.
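The selection of one representative pose per cluster can be sketched as follows. This is only an illustration under stated assumptions: the function name `pick_representatives` is hypothetical, and the inputs are assumed to be per-frame cluster labels and per-frame membership scores such as those exposed by the `hdbscan` Python library through its `labels_` and `probabilities_` attributes (with label `-1` marking noise). The clustering itself is not reproduced here.

```python
def pick_representatives(labels, scores):
    """Return {cluster_label: index of the highest-scoring member}.

    labels: cluster label per pose candidate (-1 means noise and is skipped).
    scores: membership score per pose candidate (higher is more central).
    """
    best = {}
    for idx, (label, score) in enumerate(zip(labels, scores)):
        if label < 0:
            continue  # noise points never become representative poses
        if label not in best or score > scores[best[label]]:
            best[label] = idx
    return best


# Toy example with two clusters and one noise point.
labels = [0, 0, 1, -1, 1, 1]
scores = [0.4, 0.9, 0.7, 0.0, 1.0, 0.2]
representatives = pick_representatives(labels, scores)
```

With 339 clusters, the returned dictionary would hold 339 frame indices, one per representative pose.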
3.2 Generating 3D Human Models
The 339 representative human pose skeletons are fitted with full 3D human models. We use the open source MakeHuman (http://www.makehuman.org) software to generate 3D human models because it has three attractive properties. Firstly, the 3D human models created by MakeHuman contain information for fitting the model to the MoCap skeletons to adopt that pose. Secondly, it is possible to vary the body shape, proportion and gender properties to model humans of different shapes. Figure 3 shows the four human body shapes we used in our experiments. Thirdly, MakeHuman allows for selecting some common clothing types as shown in Fig. 4. Although MakeHuman offers limited textures for the clothing, we write a Python script to apply many different types of clothing textures obtained from the Internet.
3.3 Fitting 3D Human Models to MoCap Skeletons
MakeHuman generates 3D human models in the same canonical pose as shown in Figures 3 and 4. We use the open source Blender (http://www.blender.org) software to fit the 3D human models to the 339 representative human pose skeletons. Blender loads the 3D human model and re-targets its rigs to the selected MoCap skeleton. As a result, the 3D human model adopts the pose of the skeleton and we get the representative human poses as full 3D human models with different clothing types and body shapes. Clothing textures are varied later.
3.4 Multiview Rendering to Generate RGB Images
We place each 3D human model (with a representative pose) in different backgrounds and lighting conditions using Blender. In the following, we explain how different types of variations were introduced in the rendered images.
Camera Viewpoint: We place 180 virtual cameras on a hemisphere over the 3D human model to render RGB images. These cameras are 12 degrees apart along the latitude and longitude and each camera points to the center of the 3D human model. Figure 5 illustrates the virtual cameras positioned around a 3D human model where no background has been added yet to make the cameras obvious. Figure 6 shows a few images rendered from multiple viewpoints after adding the background and lighting.
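The camera layout can be sketched numerically. With a 12 degree spacing there are 360 / 12 = 30 azimuth positions, so 180 cameras correspond to 6 elevation rings. The sketch below is an assumption-laden illustration, not the authors' Blender script: the function name, the hemisphere radius and the choice of starting the elevation rings at 0 degrees are all hypothetical.

```python
import math


def hemisphere_cameras(radius=5.0, step_deg=12, n_rings=6, first_elev_deg=0.0):
    """Return (x, y, z) camera positions on a hemisphere centred on the model.

    radius and first_elev_deg are illustrative assumptions; only the 12 degree
    spacing and the total of 180 cameras come from the text.
    """
    cams = []
    for ring in range(n_rings):
        elev = math.radians(first_elev_deg + ring * step_deg)
        for k in range(360 // step_deg):
            azim = math.radians(k * step_deg)
            x = radius * math.cos(elev) * math.cos(azim)
            y = radius * math.cos(elev) * math.sin(azim)
            z = radius * math.sin(elev)
            cams.append((x, y, z))
    return cams


cameras = hemisphere_cameras()
print(len(cameras))  # 6 rings x 30 azimuths = 180
```

Each camera would then be oriented towards the origin, where the 3D human model is centred.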
Background and Lighting: We incorporate additional rich appearance variations in the background and lighting conditions to synthesize images that are as realistic as possible. Background variation is performed in two modes. One, we download thousands of 2D indoor scenes from Google Images and randomly select one as Blender background during image rendering. Two, we download 360° spherical High Dynamic Range Images (HDRI) from Google Images and use them as the environmental background in Blender. In the latter case, when rendering images from different viewpoints, the background changes accordingly. Figure 6 shows some illustrations. In total, we use 2000 different backgrounds that are mostly indoor scenes, building lobbies with natural lighting and a few outdoor natural scenes. We place three lamps at different locations in the scene and randomly change their energy to achieve lighting variations.
Clothing Texture: Clothing texture is varied by assigning different textures, downloaded from Google Images, to the clothing of the 3D human models. In total, we used 262 different textures of shirts and 183 textures for trousers/shorts to generate our training data. Figure 7 shows some of the clothing textures we used and Figure 8 shows some rendered images containing all types of variations.
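As a rough illustration of how one random set of appearance variations might be drawn for a single rendered image, consider the sketch below. The counts (2000 backgrounds, 262 shirt textures, 183 trouser/shorts textures, four body shapes, three lamps) come from the text; the function name, the dictionary layout and the lamp energy range are hypothetical placeholders, and the actual Blender calls are not shown.

```python
import random

N_BACKGROUNDS = 2000
N_SHIRT_TEXTURES = 262
N_TROUSER_TEXTURES = 183
N_BODY_SHAPES = 4
N_LAMPS = 3


def sample_render_config(rng, energy_range=(0.5, 2.0)):
    """Draw one random combination of action-irrelevant variations.

    energy_range is an assumed placeholder; the text only states that the
    lamp energies are changed randomly.
    """
    return {
        "background": rng.randrange(N_BACKGROUNDS),
        "shirt_texture": rng.randrange(N_SHIRT_TEXTURES),
        "trouser_texture": rng.randrange(N_TROUSER_TEXTURES),
        "body_shape": rng.randrange(N_BODY_SHAPES),
        "lamp_energies": [rng.uniform(*energy_range) for _ in range(N_LAMPS)],
    }


rng = random.Random(0)
config = sample_render_config(rng)
```

Because the pose label is independent of every sampled field, the same label is attached to the rendered image whatever configuration is drawn.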
3.5 Multiview Rendering to Generate Depth Images
Depth images simply record the distance of the human from the camera without any background, texture or lighting variation. Therefore, we only vary the camera viewpoint, clothing types (not textures) and body shapes when synthesizing depth images. The virtual cameras are deployed in a similar way to the RGB image rendering. Figure 9 shows some depth images rendered from different viewpoints. In the Blender rendering environment, the bounding box of the human model is recorded as a group of vertices, which can be converted to an ($x$, $y$) bounding box around the human in the rendered image coordinates. This bounding box is used to crop the human in the rendered depth images as well as RGB images. The cropped images are used to learn the Human Pose Models.
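The conversion from projected bounding-box vertices to an image-space crop can be sketched as below. This is a minimal illustration that assumes the vertices have already been projected to pixel coordinates; the projection through the Blender camera is not shown, and the function names and the optional `margin` parameter are hypothetical.

```python
import math


def bounding_box_2d(points, width, height, margin=0):
    """Axis-aligned box (left, top, right, bottom) enclosing projected points.

    points: iterable of (x, y) pixel coordinates; right/bottom are exclusive.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    left = max(0, math.floor(min(xs)) - margin)
    top = max(0, math.floor(min(ys)) - margin)
    right = min(width, math.floor(max(xs)) + 1 + margin)
    bottom = min(height, math.floor(max(ys)) + 1 + margin)
    return left, top, right, bottom


def crop(image, box):
    """Crop a row-major image (list of rows) to the given box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


projected = [(12.3, 40.8), (30.0, 10.2), (25.5, 55.9), (18.1, 33.3)]
box = bounding_box_2d(projected, width=64, height=64)
image = [[0] * 64 for _ in range(64)]
cropped = crop(image, box)
```

The same box can be applied to the RGB and depth renderings of a view since both share the camera.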
3.6 Efficiency of Synthetic Data Generation
To automate the data generation, we implemented a Python script (the data synthesis script will be made public) in Blender on a 3.4GHz machine with 32GB RAM. The script runs six separate processing threads and, on average, generates six synthetic pose images per second. Note that this is an off-line process and can be further parallelized since each image is rendered independently. Moreover, the script can be implemented on the cloud for efficiency without the need to upload any training data. The training data is only required for learning the model and can be deleted afterwards. For each of the 339 representative poses, we generate images from all 180 viewpoints while applying a random set of other variations (clothing, background, body shape and lighting). In total, about 700,000 synthetic RGB and depth images are generated to train the proposed HPM$_{RGB}$ and HPM$_{3D}$.
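A back-of-the-envelope check of these figures is given below. It treats 700,000 as the total image count and six images per second as a sustained rate, both of which are simplifying assumptions made only for this estimate.

```python
N_POSES = 339
N_VIEWPOINTS = 180
IMAGES_PER_SECOND = 6      # reported average throughput
TOTAL_IMAGES = 700_000     # approximate total reported in the text

# One full pass renders every pose from every viewpoint once.
images_per_pass = N_POSES * N_VIEWPOINTS            # 61020

# Number of passes (each with fresh random variations) implied by the total.
passes = TOTAL_IMAGES / images_per_pass

# Rendering time on a single machine at the reported rate.
hours = TOTAL_IMAGES / IMAGES_PER_SECOND / 3600

print(images_per_pass, round(passes, 1), round(hours, 1))
```

Under these assumptions the whole dataset corresponds to between 11 and 12 passes over all pose and viewpoint combinations and somewhat more than a day of rendering on one machine.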
4 Synthetic Data Refinement with GANs
The synthetic images are labelled with 339 different human poses and cover common variations that occur in real data. They can be used to learn Human Pose Models that transfer human poses to a high level view-invariant space. However, it is likely that the synthetic images are sampled from a distribution which is different from the distribution of real images. Given this distribution gap, the Human Pose Models learned from synthetic images may not generalize well to real images. Therefore, before learning the models, we minimize the gap between the distributions. For this purpose, we adopt the simulated and unsupervised learning framework (SimGAN) proposed by Shrivastava et al (2016). This framework uses an adversarial network structure similar to the Generative Adversarial Network (GAN) (Goodfellow et al, 2014), but the learning process is based on synthetic images instead of random noise as in the original GAN method. The SimGAN framework learns two competing networks, a refiner $R_\theta(x)$ and a discriminator $D_\phi(\tilde{x}, y)$, where $x$ is a synthetic image, $y$ is an unlabelled real image, and $\tilde{x} = R_\theta(x)$ is the refined image. The loss functions of these two networks are defined as (Shrivastava et al, 2016)
$$\mathcal{L}_R(\theta) = -\sum_i \log\big(1 - D_\phi(R_\theta(x_i))\big) + \lambda \lVert R_\theta(x_i) - x_i \rVert_1,$$
$$\mathcal{L}_D(\phi) = -\sum_i \log\big(D_\phi(\tilde{x}_i)\big) - \sum_j \log\big(1 - D_\phi(y_j)\big),$$
where $x_i$ is the $i$-th synthetic image, $\tilde{x}_i$ is its corresponding refined image, $y_j$ is the $j$-th real image, $\lVert \cdot \rVert_1$ is the $\ell_1$ norm, and $\lambda$ is the regularization factor.
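To make the roles of the two losses concrete, the sketch below evaluates them on toy numbers. It follows the SimGAN convention in which the discriminator outputs the probability that an image is a refined (synthetic) one, so the refiner is rewarded when that probability is low and is penalised by an $\ell_1$ term for drifting away from its synthetic input. The function names, the flat-list image representation and the toy values are assumptions for illustration; no network is trained here.

```python
import math


def refiner_loss(d_refined, refined, synthetic, lam):
    """Adversarial term plus lam times the L1 self-regularization term.

    d_refined: discriminator probabilities that each refined image is synthetic.
    refined, synthetic: lists of images, each image a flat list of pixel values.
    """
    adversarial = -sum(math.log(1.0 - p) for p in d_refined)
    l1 = sum(
        abs(r - s)
        for img_r, img_s in zip(refined, synthetic)
        for r, s in zip(img_r, img_s)
    )
    return adversarial + lam * l1


def discriminator_loss(d_refined, d_real):
    """Cross-entropy: refined images are labelled 1, real images 0."""
    return -sum(math.log(p) for p in d_refined) - sum(
        math.log(1.0 - q) for q in d_real
    )


# Toy batch: two synthetic/refined image pairs and one real image.
synthetic = [[0.0, 0.2], [0.3, 0.5]]
refined = [[0.1, 0.2], [0.3, 0.4]]
loss_r = refiner_loss([0.5, 0.5], refined, synthetic, lam=0.5)
loss_d = discriminator_loss([0.5, 0.5], [0.5])
```

A larger regularization factor keeps the refined image closer to the rendered one, which is what preserves the pose label during refinement.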
4.1 Implementation Details
We modify the TensorFlow implementation (https://github.com/carpedm20/) of SimGAN to make it suitable for our synthetic RGB images. For the refiner network, we extend the input data channel from 1 to 3. The input images are first convolved with 7 × 7 filters and then converted into 64 feature maps. The 64-channel feature maps are passed through multiple ResNet blocks. The setting of ResNet blocks and the structure of the discriminator network are the same as in (Shrivastava et al, 2016).
To obtain a benchmark distribution for the synthetic images, we randomly select 100,000 unlabelled real images from the NTU RGB+D Human Activity Dataset (Shahroudy et al, 2016a). Each image is cropped to get the human body as the region of interest and then resized to 224 × 224. Through adversarial learning, the SimGAN framework (Shrivastava et al, 2016) forces the distribution of synthetic images to approach this benchmark distribution. Although we use samples from the NTU RGB+D Human Activity Dataset as the benchmark to train the SimGAN network, this is not mandatory as any other dataset containing real human images can be used. This is because the SimGAN learning is an unsupervised process, which means no action labels are required. Our experiments in later sections also illustrate that the performance improvement gained from GAN-refinement has no dependence on the type of real images used for SimGAN learning.
4.2 Qualitative Analysis of GAN Refined Images
We compare the real images, raw synthetic images, and GAN-refined synthetic images, to analyse the effect of GAN refinement on our synthetic RGB and depth human pose datasets.
Figure 10 compares the real, raw synthetic and GAN-refined synthetic RGB images. One obvious difference between real and synthetic RGB images is that the synthetic ones are sharper and have more detail than the real images. This is because the synthetic images are generated under ideal conditions and the absence of realistic image noise makes them different from real images. However, with GAN learning, the refined RGB images lose some of the details (i.e. they are not as sharp) and become more realistic. The last row of Figure 10 shows the difference between the raw and refined synthetic images. Notice that the major differences (bright pixels) are at the locations of the humans and especially at their boundaries whereas the backgrounds have minimal differences (dark pixels). The reason for this is that the synthetic images are created using synthetic 3D human models but real background images, i.e. rather than building Blender scene models (walls, floors, furniture, etc.) from scratch, we used real background images for efficiency and diversity of data. Moreover, the Blender lighting function causes shading variation on the human models only, and the shading effects of backgrounds always remain the same. All these factors make the human model stand out from the background. On the other hand, the human subjects are perfectly blended with the background in the real images. The GAN refinement removes such differences in synthetic images and makes the human models blend into the background. In particular, the bright human boundaries (last row) show that the GAN refinement process is able to sense and remove the difference between the human model and background images.
Figure 11 shows a similar comparison for depth images. The most obvious difference between real and synthetic depth images is the noise along the boundary. The edges in the real depth images are not smooth whereas they are very smooth in the synthetic depth images. Other differences are not so obvious to the naked eye but nevertheless, these differences might limit the generalization ability of the Human Pose Model learned from synthetic images. The third row of Fig. 11 shows the refined synthetic depth images and the last column shows the difference between the images in more detail. Thus the GAN refinement successfully learns to model boundary noise and other non-obvious differences of real images and applies them to the synthetic depth maps narrowing down their distribution gap.
In the experiments, we will show quantitative results indicating that the Human Pose Models learned from the GAN refined synthetic images outperform those that are learned from raw synthetic images.
5 Learning the Human Pose Models
Every image in our synthetic data has a label corresponding to one of the 339 representative human poses. For a given human pose, the label remains the same irrespective of the camera viewpoint, clothing texture, body shape, background and lighting conditions. We learn CNN models that map the rendered images to their respective human pose labels. We learn HPM$_{RGB}$ and HPM$_{3D}$ for RGB and depth images independently and test three popular CNN architectures, i.e. AlexNet (Krizhevsky et al, 2012), GoogLeNet (Szegedy et al, 2015), and ResNet-50 (He et al, 2016a), to find the best architecture through controlled experiments. These CNN architectures performed well in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, 2014, and 2015 respectively, and come with an increasing number of layers. We fine tune the ILSVRC pre-trained models using our synthetic data and compare their performance on human action recognition.
5.1 Model Learning
The three pre-trained models have a last InnerProduct layer with 1000 neurons. For fine tuning, we replace the last layer with a 339 neuron layer representing the number of classes in our synthetic human pose dataset. All synthetic images are cropped to include only the human body and then resized to 256 × 256 pixels. During training, these images are re-cropped to the required input dimension for the specific network with default data augmentation, and are also mirrored with a probability of 0.5. We use the synthetic pose images from 162 randomly selected camera viewpoints for training, and the images from the remaining 18 cameras for validation.
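The viewpoint-level split and the mirroring augmentation can be sketched as follows. The split is done over camera indices rather than over individual images, so that no validation viewpoint is seen during training. The function names and the fixed random seed are illustrative assumptions, and the image is represented as a plain list of rows.

```python
import random


def split_viewpoints(n_views=180, n_val=18, seed=0):
    """Randomly hold out n_val camera viewpoints for validation."""
    rng = random.Random(seed)
    views = list(range(n_views))
    rng.shuffle(views)
    return sorted(views[n_val:]), sorted(views[:n_val])


def maybe_mirror(image_rows, rng, p=0.5):
    """Flip the image left-right with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image_rows]
    return image_rows


train_views, val_views = split_viewpoints()
augmented = maybe_mirror([[1, 2, 3], [4, 5, 6]], random.Random(1))
```

Every rendered image inherits the split of the camera it was rendered from, whatever its pose, clothing or background.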
The Caffe library (Jia et al, 2014b) is used to learn the proposed HPM$_{RGB}$ and HPM$_{3D}$ models. The initial learning rate was set to 0.01 for the last fully-connected layer and 0.001 for all other layers. We used a batch size of 100 and trained the models for 3 epochs. We decreased the learning rate by a factor of 10 after every epoch. Training was done using a single NVIDIA Tesla K40 GPU.
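The resulting step schedule is small enough to write out explicitly. The sketch below is framework-agnostic and only tabulates the learning rates per epoch; the function name is hypothetical, and in Caffe the same effect would typically be obtained through solver settings and per-layer multipliers rather than Python code.

```python
def learning_rate(epoch, base_lr, decay=0.1):
    """Learning rate after `epoch` completed epochs (divided by 10 each epoch)."""
    return base_lr * decay ** epoch


# (last-layer rate, rate for all other layers) for each of the 3 epochs.
schedule = [(learning_rate(e, 0.01), learning_rate(e, 0.001)) for e in range(3)]
for epoch, (lr_last, lr_other) in enumerate(schedule, start=1):
    print(f"epoch {epoch}: last layer {lr_last:.0e}, other layers {lr_other:.0e}")
```

The newly initialised 339-way layer therefore always trains ten times faster than the pre-trained layers.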
5.2 Extracting Features from Real Videos
To extract features from real videos, the region containing the human is first cropped from each frame and then the cropped region is resized to match the input of the network. The cropped-resized regions from each frame are passed individually through the learned Human Pose Model (HPM$_{RGB}$ for RGB frames, and HPM$_{3D}$ for depth frames), and a layer prior to the labels is used as an invariant representation of the human pose. Specifically, we use the fc7 layer for AlexNet, the pool5/7x7s1 layer for GoogLeNet and the pool5 layer for ResNet-50. This representation is unique, compact, invariant to the irrelevant factors and has the added advantage that it aligns the features between multiple images. While the pixels of one image may not correspond to the pixels of another image, the individual variables of the CNN features are aligned. Therefore, we can perform temporal analysis along the individual variables.
6 Temporal Representation and Classification
For temporal representation, we use the Fourier Temporal Pyramid (FTP) (Wang et al, 2013b) on the features extracted from the video frames. Temporal representation for the RGB and depth HPM features is done in the same way and is explained in general terms in the next paragraph.
Let $f_t^i$ denote the $t$-th frame of the $i$-th video, where $t = 1, \dots, T$ and $T$ is the total number of frames. Taking HPM with the GoogLeNet structure as an example, denote the pool5/7x7s1 layer activations of frame $f_t^i$ as $a_t^i \in \mathbb{R}^{1024}$ and the frame-wise pose features of the $i$-th video as $A^i = [a_1^i, \dots, a_T^i]$. FTP is applied on $A^i$ for temporal encoding using a pyramid of three levels where $A^i$ is divided in half at each level, giving $1 + 2 + 4 = 7$ feature groups. Short Fourier Transform is applied to each feature group, and the first four low-frequency coefficients (i.e. $4 \times 7 = 28$ per feature dimension) are used to form a spatio-temporal action descriptor $B^i \in \mathbb{R}^{28 \times 1024}$. Finally, $B^i$ is stretched to a vector in $\mathbb{R}^{28672}$ to get the final spatio-temporal representation of the $i$-th video. When the dimension of the frame-wise pose feature changes, the dimension of the spatio-temporal descriptor changes accordingly, for example, $28 \times 4096 = 114688$ for AlexNet and $28 \times 2048 = 57344$ for ResNet-50.
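The pyramid encoding can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, a plain discrete Fourier transform is applied to each segment, and keeping the magnitude of each coefficient (and zero-padding very short segments) is an assumption that the text does not specify.

```python
import cmath

def dft_low_freq(signal, num_coeffs=4):
    """Magnitudes of the first `num_coeffs` DFT coefficients of a 1-D signal."""
    n = len(signal)
    out = []
    for k in range(num_coeffs):
        if k >= n:
            out.append(0.0)  # zero-pad when the segment is too short
            continue
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        out.append(abs(coeff))
    return out

def segments(num_frames, levels=3):
    """Frame ranges of the temporal pyramid: 1 + 2 + 4 segments for 3 levels."""
    ranges = []
    for level in range(levels):
        parts = 2 ** level
        for p in range(parts):
            ranges.append((p * num_frames // parts, (p + 1) * num_frames // parts))
    return ranges

def fourier_temporal_pyramid(features, levels=3, num_coeffs=4):
    """features: list of T frames, each a list of D values. Returns a flat list."""
    dim = len(features[0])
    descriptor = []
    for start, end in segments(len(features), levels):
        for d in range(dim):
            # each feature variable is analysed along time independently
            series = [features[t][d] for t in range(start, end)]
            descriptor.extend(dft_low_freq(series, num_coeffs))
    return descriptor

# toy video: 16 frames, 2 feature variables (a ramp and a constant)
toy = [[float(t), 1.0] for t in range(16)]
toy_descriptor = fourier_temporal_pyramid(toy)
```

For a D-dimensional frame feature the output has 7 × 4 × D entries, which gives 28672 values for the 1024-dimensional GoogLeNet feature.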
Note that the FTP encodes the temporal variations of the RGB action videos in the HPM feature space. The video frames are first aligned in the HPM feature space, which makes it possible to preserve the spatial location of the features during temporal encoding with FTP. On the other hand, dense trajectories model temporal variations in the pixel space (of RGB videos) where pixels corresponding to the human body pose are not aligned. This is the main reason why dense trajectory features are encoded with Bag of Visual Words (BoVW), which facilitates direct matching of dense trajectory features from two videos. However, this process discards the spatial locations of the trajectories. Thus, similar trajectories from different locations in the frame will vote for the same bin in the BoVW feature.
An advantage of performing temporal encoding in different feature spaces is that the features are non-redundant. Thus our HPM and dense trajectories capture complementary information. Although dense trajectories cannot capture the appearance information, they are somewhat robust to viewpoint changes as shown in (Rahmani et al, 2017). Therefore, we augment our HPM features with dense trajectory features before performing classification. We use the improved dense trajectories (iDT) (Wang and Schmid, 2013) implementation which provides additional features such as HOG (Dalal and Triggs, 2005), HOF and MBH (Dalal et al, 2006). However, we only use the trajectory part and discard HOG/HOF/MBH features for two reasons. Firstly, unlike our HPMs, HOG/HOF/MBH features are not view-invariant. Secondly, our HPMs already encode the appearance information. We use the NKTM (Rahmani and Mian, 2015) codebook to encode the trajectory features, which gives one encoded BoVW descriptor per video.
We use SVM (Fan et al, 2008) for classification and report results in three settings i.e. RGB, depth and RGB-D. In the RGB setting, we represent the HPM features temporally encoded with FTP and then combine them with the trajectory BoVW features since both types of features can be extracted from RGB videos. In the depth setting, we represent the HPM features with FTP but do not combine trajectory features because trajectories cannot be reliably extracted from the depth videos. In the RGB-D setting, we combine the FTP features from both HPM models with the trajectory BoVW features.
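The three settings differ only in which per-video descriptors are combined before classification. The sketch below makes that explicit. It assumes that "combining" means concatenation and that each descriptor is L2-normalised first; neither detail is stated in the text, and the function names are hypothetical.

```python
def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm > 0 else list(vec)

def fuse(setting, ftp_rgb=None, ftp_depth=None, traj_bovw=None):
    """Concatenate the per-video descriptors used in the given setting."""
    parts = {
        "rgb": [ftp_rgb, traj_bovw],              # RGB HPM + trajectories
        "depth": [ftp_depth],                     # depth HPM only
        "rgbd": [ftp_rgb, ftp_depth, traj_bovw],  # both HPMs + trajectories
    }[setting]
    if any(p is None for p in parts):
        raise ValueError(f"missing descriptor for setting '{setting}'")
    fused = []
    for p in parts:
        fused.extend(l2_normalize(p))
    return fused
```

The fused vector would then be passed to a linear SVM such as LIBLINEAR, which is the classifier cited in the text.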
Experiments are performed on the following three benchmark datasets for cross-view human action recognition.
7.1 UWA3D Multiview Activity-II Dataset
Figure 12 shows sample frames from this dataset. The dataset (Rahmani et al, 2016) consists of 30 human actions performed by 10 subjects and recorded from 4 different viewpoints at different times using the Kinect v1 sensor. The 30 actions are: (1) one hand waving, (2) one hand punching, (3) two hands waving, (4) two hands punching, (5) sitting down, (6) standing up, (7) vibrating, (8) falling down, (9) holding chest, (10) holding head, (11) holding back, (12) walking, (13) irregular walking, (14) lying down, (15) turning around, (16) drinking, (17) phone answering, (18) bending, (19) jumping jack, (20) running, (21) picking up, (22) putting down, (23) kicking, (24) jumping, (25) dancing, (26) mopping floor, (27) sneezing, (28) sitting down (chair), (29) squatting, and (30) coughing. The four viewpoints are: (a) front, (b) left, (c) right, (d) top.
This dataset is challenging because of the large number of action classes and because the actions are not recorded simultaneously leading to intra-action differences besides viewpoint variations. The dataset also contains self-occlusions and human-object interactions in some videos.
We follow the protocol of (Rahmani et al, 2016) where videos from two views are used for training and the videos from the remaining views are individually used for testing leading to 12 different cross-view combinations in this evaluation protocol.
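The count of 12 combinations follows from choosing 2 of the 4 views for training (6 ways) and testing on each of the 2 remaining views separately. A short enumeration, with a hypothetical helper name, makes this concrete:

```python
from itertools import combinations

def cross_view_splits(views, num_train=2):
    """List (train_views, test_view) pairs: train on `num_train` views and
    test on each remaining view individually."""
    splits = []
    for train in combinations(views, num_train):
        for test in views:
            if test not in train:
                splits.append((train, test))
    return splits

uwa_splits = cross_view_splits(["front", "left", "right", "top"])
print(len(uwa_splits))  # 12
```

With three views instead of four, the same helper returns three combinations, since each pair of training views leaves exactly one test view.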
7.2 Northwestern-UCLA Multiview Dataset
This dataset (Wang et al, 2014) contains RGB-D videos captured simultaneously from three different viewpoints with the Kinect v1 sensor. Figure 13 shows sample frames of this dataset from the three viewpoints. The dataset contains RGB-D videos of 10 subjects performing 10 actions: (1) pick up with one hand, (2) pick up with two hands, (3) drop trash, (4) walk around, (5) sit down, (6) stand up, (7) donning, (8) doffing, (9) throw, and (10) carry. The three viewpoints are: (a) left, (b) front, and (c) right. This dataset is very challenging because many actions share the same “walking” pattern before and after the actual action is performed. Moreover, some actions such as “pick up with one hand” and “pick up with two hands” are hard to distinguish from different viewpoints.
We use videos captured from two views for training and the third view for testing making three possible cross-view combinations.
7.3 NTU RGB+D Human Activity Dataset
The NTU RGB+D Human Activity Dataset (Shahroudy et al, 2016a) is a large-scale RGB+D dataset for human activity analysis. This dataset was collected with the Kinect v2 sensor and includes 56,880 action samples each for RGB videos, depth videos, skeleton sequences and infra-red videos. We only use the RGB and depth parts of the dataset. There are 40 human subjects performing 60 types of actions including 50 single person actions and 10 two-person interactions. Three sensors were used to capture data simultaneously from three horizontal angles: −45°, 0° and +45°, and every action performer performed the action twice, facing the left or right sensor respectively. Moreover, the height of the sensors and their distance to the action performer were further adjusted to get more viewpoint variations. The NTU RGB+D dataset is the largest and most complex cross-view action dataset of its kind to date. Figure 14 shows RGB and depth sample frames from the NTU RGB+D dataset.
We follow the standard evaluation protocol proposed in (Shahroudy et al, 2016a), which includes cross-subject and cross-view evaluations. For cross-subject protocol, 40 subjects are split into training and testing groups, and each group consists of 20 subjects. For cross-view protocol, the videos captured by sensor C-2 and C-3 are used as training samples, and the videos captured by sensor C-1 are used as testing samples.
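The two protocols amount to partitioning the samples by camera or by subject. The sketch below assumes that sample names encode setup, camera, performer, repetition and action as `SsssCcccPpppRrrrAaaa` (for example `S001C002P003R002A013`), which should be checked against the dataset documentation. The list of training subjects is not given in the text, so it is left as a parameter.

```python
import re

# Assumed naming scheme: S<setup>C<camera>P<subject>R<repetition>A<action>
NAME_PATTERN = re.compile(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})")

def parse_sample(name):
    """Extract camera, subject and action ids from a sample name."""
    m = NAME_PATTERN.search(name)
    if m is None:
        raise ValueError(f"unrecognised sample name: {name}")
    _setup, camera, subject, _repetition, action = (int(g) for g in m.groups())
    return {"camera": camera, "subject": subject, "action": action}

def cross_view_split(names):
    """Train on cameras 2 and 3, test on camera 1."""
    train = [n for n in names if parse_sample(n)["camera"] in (2, 3)]
    test = [n for n in names if parse_sample(n)["camera"] == 1]
    return train, test

def cross_subject_split(names, train_subjects):
    """Train on the given subject ids, test on all other subjects."""
    train = [n for n in names if parse_sample(n)["subject"] in train_subjects]
    test = [n for n in names if parse_sample(n)["subject"] not in train_subjects]
    return train, test
```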
We first use raw synthetic images to train the RGB and depth HPMs with three different CNN architectures (AlexNet, GoogLeNet, and ResNet-50), and compare their performance on the UWA and NUCLA datasets. The best performing architecture is then selected and re-trained on the GAN-refined synthetic images. Next, we compare the HPM models trained on raw synthetic images to those trained on the GAN-refined synthetic images. Finally, we perform comprehensive experiments to compare our proposed models trained on GAN-refined synthetic images to existing methods on all three datasets.
8.1 HPM Performance with Different Architectures
We determine the CNN architecture that maximizes the generalization power of the HPMs for RGB and depth images. We use the raw synthetic pose images to fine tune AlexNet, GoogLeNet, and ResNet-50, and then test them on the UWA and NUCLA datasets. Since the trained model is to be used as a frame-wise feature extractor for action recognition, we take both recognition accuracy and feature dimensionality into account. Table 1 compares the average results on all possible cross-view combinations for the two datasets. The results show that for RGB videos, GoogLeNet and ResNet-50 perform much better than AlexNet. GoogLeNet also performs the best for depth videos and has the smallest feature dimensionality. Therefore, we select GoogLeNet as the network architecture for both HPMs in the remaining experiments.
8.2 Quantitative Analysis of GAN Refinement
We quantitatively evaluate the effect of GAN refinement on the UWA and NUCLA datasets by comparing the action recognition accuracies when the RGB and depth HPMs are fine tuned once on raw synthetic images and once on GAN-refined synthetic images.
Table 2 shows the average accuracies for all cross-view combinations on the respective datasets. We can see that the RGB HPM fine tuned on the GAN-refined synthetic RGB images achieves 3.3% and 1.4% improvement over the one fine tuned with raw synthetic images on the UWA and NUCLA datasets respectively. For the depth HPM, GAN-refined synthetic data also improves the recognition accuracy for the two datasets by 1% and 1.3% respectively. The improvements are achieved because the distribution gap between synthetic and real images is narrowed by GAN refinement.
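|Dataset||Training data||HPM (RGB)||HPM (depth)|
|UWA3D Multiview Activity-II||Raw synthetic images||64.7||73.8|
|UWA3D Multiview Activity-II||GAN-refined synthetic images||68.0||74.8|
|Northwestern-UCLA||Raw synthetic images||76.4||78.4|
|Northwestern-UCLA||GAN-refined synthetic images||77.8||79.7|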
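The gains quoted in the text can be checked directly from the table entries. The snippet below reads the first pair of rows as UWA3D and the second pair as NUCLA, with the two value columns as the RGB and depth HPMs; this reading is inferred from the gains stated in the text rather than from surviving column headers.

```python
accuracy = {
    ("UWA3D", "RGB"):   {"raw": 64.7, "gan": 68.0},
    ("UWA3D", "depth"): {"raw": 73.8, "gan": 74.8},
    ("NUCLA", "RGB"):   {"raw": 76.4, "gan": 77.8},
    ("NUCLA", "depth"): {"raw": 78.4, "gan": 79.7},
}

# accuracy gain (percentage points) from GAN refinement, rounded to one decimal
gains = {key: round(val["gan"] - val["raw"], 1) for key, val in accuracy.items()}

for (dataset, modality), gain in gains.items():
    print(f"{dataset} {modality}: +{gain}")
```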
Recall that the real images used as a benchmark distribution for GAN refinement are from neither the UWA nor the NUCLA dataset. We impose no dependence on the type of real images used for SimGAN learning, because it is an unsupervised process and no pose labels are required. In the remaining experiments, we use the RGB and depth HPMs fine tuned with GAN-refined synthetic images for comparison with other techniques.
8.3 Comparison on the UWA3D Multiview-II Dataset
Table 3 compares our method with the existing state-of-the-art. The proposed HPM alone achieves 68.0% average recognition accuracy for RGB videos, which is higher than the nearest RGB-only competitor R-NKTM (Rahmani et al, 2017), and for 8 out of the 12 train-test combinations, our proposed HPM features provide significant improvement in accuracy. This shows that the invariant features learned by the proposed HPM are effective.
Combining HPM and dense trajectory features (Traj) gives a significant improvement in accuracy. It improves the RGB recognition accuracy to 76.4%, which is 9% higher than the nearest RGB competitor. It is also higher than the depth only HPM. This shows that our method exploits the complementary information between the two modes of spatio-temporal representation, and enhances the recognition accuracy especially when there are large viewpoint variations. State-of-the-art RGB-D action recognition accuracy of 82.8% on the UWA dataset is achieved when we combine the RGB HPM, the depth HPM and dense trajectory features.
In Table 3, the results reported for the existing methods are achieved by using the original public models and fine tuning them on UWA3D dataset under the used protocol. In contrast, HPMs are not fine tuned to any dataset once their training on the proposed synthetic data is completed. These models are used out-of-the-box for the test data. These settings hold for all the experiments conducted in this work. Indeed, fine tuning HPMs on the real test datasets further improves the results but we avoid this step to show their generalization power.
Our data generation technique endows HPMs with inherent robustness to viewpoint variations along with robustness to changes in background, texture and clothing etc. The baseline methods lack these properties, which is one of the reasons for the improvement achieved by our approach over those methods. Note that HPMs are unique in the sense that they model individual human poses in frames instead of actions. Therefore, our synthetic data generation method, which is an essential part of HPM training, also focuses on generating human pose frames. One interesting enhancement of our data generation technique is to produce synthetic videos instead. We can then analyze the performance gain of (video-based) baseline methods trained with our synthetic data. To explore this direction, we extended our technique to generate synthetic videos and applied it to C3D (Tran et al, 2015) and LRCN (Donahue et al, 2015) methods as follows.
For transparency, we selected all ‘atomic’ action sequences from CMU MoCap. Each of these sequences presents a single action, which also serves as the label of the video clip. To generate synthetic videos, the frames in the training clips were processed according to the procedure described in Sections 3 and 4 with the following major differences. (1) No clustering was performed to learn representative poses because it was not required. (2) The parameters (i.e. camera viewpoints, clothing etc.) were kept the same within a single synthetic video but different random settings were adopted for each video. The size of the generated synthetic data was matched to our “pose” synthetic data. We took the original C3D model that is pre-trained on the large scale dataset Sports-1M (Karpathy et al, 2014) and the original LRCN model that is pre-trained on UCF-101 (Soomro et al, 2012) and fine-tuned these models using our synthetic videos. The fine-tuned models were then employed under the used protocol.
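Difference (2) can be sketched as drawing the rendering parameters once per video and reusing them for every frame of that video. The parameter names and the sizes of the clothing and background pools below are hypothetical placeholders; only the 180 camera viewpoints and the per-video sampling rule come from the text.

```python
import random

def sample_video_settings(rng, num_cameras=180, num_clothing=10, num_backgrounds=10):
    """Draw one random rendering configuration (pool sizes are placeholders)."""
    return {
        "camera": rng.randrange(num_cameras),
        "clothing": rng.randrange(num_clothing),
        "background": rng.randrange(num_backgrounds),
    }

def render_plan(frames_per_video, seed=0):
    """Return, per video, a list of (video_id, frame_index, settings) tuples."""
    rng = random.Random(seed)
    plan = []
    for video_id, num_frames in enumerate(frames_per_video):
        settings = sample_video_settings(rng)  # drawn once per video
        plan.append([(video_id, frame, settings) for frame in range(num_frames)])
    return plan

plan = render_plan([3, 2])
```

Keeping the settings fixed within a clip means that frame-to-frame changes come only from the motion, while variation across clips still covers viewpoints, clothing and backgrounds.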
We report the results of these experiments in Table 3, where the C3D and LRCN models fine-tuned on our synthetic videos are listed separately from the original models. The results demonstrate that our synthetic data can improve the performance of baseline models for multi-view action recognition. They also confirm that the proposed approach exploits the proposed data very effectively to achieve significant performance improvement over the existing methods. We provide further discussion on the role of synthetic data in the overall performance of our approach in Section 9.
8.4 Comparison on the Northwestern-UCLA Dataset
Table 4 shows comparative results on the NUCLA dataset. The proposed HPM alone achieves 77.8% average accuracy, which is 8.4% higher than the nearest RGB competitor NKTM (Rahmani and Mian, 2015). HPM+Traj further improves the average accuracy to 78.5%. Our RGB-D method, which combines both HPMs with Traj, achieves 81.3% accuracy, which is the highest accuracy reported on this dataset.
8.5 Comparison on the NTU RGB+D Dataset
Table 5 compares our method with the existing state-of-the-art on the NTU dataset. The proposed HPM uses RGB frames only and achieves 68.5% cross-subject recognition accuracy, which is comparable to the 69.2% of the best joints-based method ST-LSTM (Liu et al, 2016), even though joints have been estimated from depth data and do not contain action irrelevant noises. This demonstrates that the HPM effectively learns features that are invariant to action irrelevant noises such as background, clothing texture and lighting etc.
Comparison of RGB Results:
Note that this paper is the first to provide RGB only human action recognition results on the challenging NTU dataset (see Table 5). Our method (HPM+Traj) outperforms all others by a significant margin while using only RGB data in both cross-subject and cross-view settings. In the cross-subject setting, our method achieves 75.8% accuracy which is higher than state-of-the-art DSSCA-SSLM (Shahroudy et al, 2017) even though DSSCA-SSLM uses both RGB and depth data whereas our method HPM+Traj uses only RGB data. DSSCA-SSLM does not report cross-view results as it did not perform well in that setting (Shahroudy et al, 2017) whereas our method achieves 83.2% accuracy for the cross-view case which is 7.7% higher than the nearest competitor ST-LSTM (Liu et al, 2016) which uses Joints data that is estimated from depth images. In summary, our 2D action recognition method outperforms existing 3D action recognition methods.
Comparison of RGB-D Results:
From Table 5, we can see that our RGB-D method (both HPMs combined with Traj) achieves state-of-the-art results in both cross-subject and cross-view settings, outperforming the nearest competitors by 6% and 8.4% respectively.
Table 6 shows the computation time for the major steps of our proposed method. Using a single core of a 3.4GHz CPU and the Tesla K-40 GPU, the proposed RGB (HPM+Traj) and RGB-D (HPM+HPM+Traj) methods run at about 20 frames per second whereas the depth only method runs at about 46 frames per second.
When uncropped video frames are used to learn a neural network model, the background context is more dominant as it occupies more pixels. A recent study showed that by masking the human in the UCF-101 dataset, a 47.4% “human” action recognition accuracy could still be achieved which, using the same algorithm, is only 9.5% lower than when the humans are included (He et al, 2016b). Our HPM learns human poses rather than the background context, which is important for surveillance applications where the background is generally static and any action can be performed in the same background. Moreover, neither HPM is fine tuned on any of the datasets on which it is tested. Yet, our models outperform all existing methods by a significant margin. For applications such as robotics and video retrieval where the background context is important, our HPM models can be used to augment the background context. For optimal performance, the cropped human images must be passed through the HPMs. However, both HPMs are robust to cropping errors, as many frames in the UWA dataset (especially view 4 in Fig. 12) and the NTU dataset have cropping errors.
9.1 Comparison with existing synthetic data
One of the major contributions of this work is synthetic data generation for robust action recognition. We note that SURREAL (Synthetic hUmans foR REAL tasks) (Varol et al, 2017b) is another recent method to generate synthetic action data that can be used to train the proposed HPMs. However, there are some major differences between SURREAL and the proposed synthetic data. (1) SURREAL was originally proposed for body segmentation and depth estimation whereas our dataset aims at modeling distinctive human poses from multiple viewpoints. While both datasets provide sufficient variations in clothing, human models, backgrounds, and illuminations, our dataset systematically covers 180° of view to enable viewpoint invariance, which is not the case for SURREAL. (2) To achieve realistic viewpoint variations, our dataset uses 360° spherical High Dynamic Range Images whereas SURREAL uses the LSUN dataset (Yu et al, 2015) for backgrounds. Hence, our approach is more suitable for large viewpoint variations. (3) Finally, we use a Generative Adversarial Network to reduce the distribution gap between synthetic and real data. Our results in Table 2 already verified that this provides an additional boost to the action recognition performance.
To demonstrate the use of SURREAL with our pipeline and quantitatively analyze the advantages of the proposed dataset for robust action recognition, we compare the performance of our underlying approach using the two datasets on the UWA3D and NUCLA databases. We repeated our experiments using SURREAL as follows. First, we computed 339 representative poses from SURREAL using the HDBSCAN algorithm (McInnes et al, 2017) and used the skeletal distance function in Eq. (1) to assign frames in the dataset to these poses. HPMs are then trained on these frames using the 339 pose labels, followed by temporal encoding and classification. This pipeline is exactly the same as the one used for our data in Section 8, except that the representative poses are now computed using SURREAL.
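Assigning frames to representative poses is a nearest-neighbour labelling step. The sketch below illustrates only that step. Eq. (1) is not reproduced in this section, so `joint_distance` is a placeholder (mean Euclidean distance between matching 3D joints) and should not be read as the paper's skeletal distance; all names are hypothetical.

```python
import math

def joint_distance(pose_a, pose_b):
    """Placeholder distance: mean Euclidean distance between matching joints.
    This is NOT Eq. (1) of the paper, only a stand-in for illustration."""
    return sum(math.dist(a, b) for a, b in zip(pose_a, pose_b)) / len(pose_a)

def assign_labels(frames, representatives, distance=joint_distance):
    """Label each frame with the index of its nearest representative pose."""
    labels = []
    for pose in frames:
        dists = [distance(pose, rep) for rep in representatives]
        labels.append(dists.index(min(dists)))
    return labels

# toy example: two representative poses with two joints each
reps = [[(0, 0, 0), (1, 0, 0)], [(0, 0, 5), (1, 0, 5)]]
frames = [[(0, 0, 0.2), (1, 0, 0.1)], [(0, 0, 4.0), (1, 0, 6.0)]]
labels = assign_labels(frames, reps)
```

In the described pipeline there would be 339 representatives, and the resulting indices serve as the class labels for HPM training.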
Table 7 reports the mean recognition accuracies of HPMs on the UWA3D and NUCLA datasets when trained using the SURREAL dataset and the proposed data. The table also includes results for the most challenging viewpoints in the datasets. For UWA3D, View 4 is challenging due to the large variations in both azimuth and elevation angles (see Fig. 12). For NUCLA, View 3 is particularly challenging as compared to the other viewpoints (see Fig. 13). From the results, we can see that the proposed data is able to achieve a significant performance gain over SURREAL for these viewpoints. In our opinion, the systematic coverage of 180° of view in our data is the main reason behind this gain. Our data also achieves a consistent overall gain for both RGB and depth modalities of the test datasets.
9.2 Improvements with synthetic data
Our experiments in Section 8 demonstrate a significant performance gain over the current state-of-the-art. Whereas the contribution of the network architecture, data modalities and GAN to the overall performance is clear from the presented experiments, we further investigate the performance gain contributed by the proposed synthetic data. To that end, we first compare the HPM that has been fine-tuned with the proposed data to a baseline that uses the original GoogLeNet model, not fine-tuned with our synthetic data. To ensure a fair comparison, all the remaining steps in the proposed pipeline, including temporal encoding and classification, are kept exactly the same for the two cases. The first two rows of Table 8 compare the mean recognition accuracies of the two models for the UWA3D and NUCLA datasets. These results confirm a clear performance gain with the proposed synthetic dataset.
The last two rows of Table 8 examine the performance gain of two popular baseline methods when fine-tuned on our synthetic data. Although significant, the average improvement in the accuracies of these methods is rather small compared to that of our method on our synthetic data (first two rows). Recall that our data generation method generates synthetic “poses” rather than videos and we had to extend our method to generate synthetic videos for the sake of this experiment. Details on synthetic video generation and training of the baseline methods are already provided in Section 8.3. From the results in Table 8, we conclude that our proposed method exploits our synthetic data more effectively, and both the proposed method and our synthetic data contribute significantly to the overall performance gain.
|Method||Training||UWA3D||NUCLA|
|HPM||without synthetic data||62.8||66.7|
|HPM||with synthetic data||68.0||77.8|
|C3D||with synthetic data (accuracy gain)||2.3||2.3|
|LRCN||with synthetic data (accuracy gain)||3.5||1.2|
9.3 Role of synthetic RGB data in action recognition
Although our approach deals with RGB, depth and RGB-D data, we find it necessary to briefly discuss the broader role of synthetic RGB data in human action recognition. In contrast to depth videos, multiple large scale RGB video action datasets are available to train deep action models. Arguably, this diminishes the need for synthetic data in this domain. However, synthetic data generation methods such as ours and (Varol et al, 2017b) are able to easily ensure a wide variety of action irrelevant variations in the data, e.g. in camera viewpoints, textures and illuminations, up to any desired scale. In natural videos, such variety and scale of variations cannot be easily guaranteed even in large scale datasets. For instance, in our experiments in Sections 8.3 and 8.4 (Tables 3 and 4), both C3D and LRCN were originally pre-trained on large scale RGB video datasets, yet our RGB synthetic data was able to boost their performance. Our synthetic data method easily and efficiently captures as many variations of the exact same action as desired, a real-world analogue of which is extremely difficult to obtain.
We also tested the performance of our approach on the UCF-101 dataset (Soomro et al, 2012) to analyze the potential of synthetic data and HPMs for the standard action recognition benchmarks in the RGB domain. UCF-101 is a popular RGB-only action dataset, which includes video clips of 101 action classes. The actions cover 1) Human-Object Interaction, 2) Human Body Motion, 3) Human-Human Interaction, 4) Playing Musical Instruments, and 5) Sport Actions. Since we trained our HPMs to model human poses, the appearances of human poses in the test videos are important for a transparent analysis. Therefore, we selected 1910 videos from the dataset with 16 classes of Human Body Motion, and classified them using the proposed HPMs. Table 9 reports the performance of our approach along with the accuracy of C3D on the same subset for comparison. The table also reports the accuracy of an HPM variant for which we used twice as many pose labels and training images as used for training the standard HPM. This improved the performance of our approach, indicating the advantage of easily producible synthetic data. Notice that, whereas the performance of HPMs remains comparable to C3D, the latter is trained on millions of ‘videos’ as compared to the few hundred thousand ‘frames’ used for training our model. Moreover, our model is nearly 7 times smaller than C3D in size. These facts clearly demonstrate the usefulness of the proposed method and synthetic data generation technique for standard RGB action recognition.
|Method||Human Body Motion|
We proposed Human Pose Models for human action recognition in RGB, depth and RGB-D videos. The proposed models uniquely represent human poses irrespective of the camera viewpoint, clothing textures, background and lighting conditions. We proposed a method for synthesizing realistic RGB and depth training data for learning such models. The proposed method learns 339 representative human poses from MoCap skeleton data and then fits 3D human models to these skeletons. The human models are then rendered as RGB and depth images from 180 camera viewpoints where other variations such as body shapes, clothing textures, backgrounds and lighting conditions are applied. We adopted Generative Adversarial Networks (GAN) to reduce the distribution gap between the synthetic and real images. Thus, we were able to generate millions of realistic human pose images with known labels to train the Human Pose Models. The trained models contain complementary information between RGB and depth modalities, and also show good compatibility with the hand-crafted dense trajectory features. Experiments on three benchmark RGB-D datasets show that our method outperforms the existing state-of-the-art on the challenging problem of cross-view and cross-person human action recognition by significant margins. Both HPMs and the Python script for generating the synthetic data will be made public.
Acknowledgements. This research was sponsored by the Australian Research Council grant DP160101458. The Tesla K-40 GPU used for this research was donated by the NVIDIA Corporation.
- Dalal and Triggs (2005) Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In:
- Dalal et al (2006) Dalal N, Triggs B, Schmid C (2006) Human detection using oriented histograms of flow and appearance. In: European conference on computer vision, pp 428–441
- Donahue et al (2015) Donahue J, Anne Hendricks L, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2625–2634
- Du et al (2015) Du Y, Wang W, Wang L (2015) Hierarchical recurrent neural network for skeleton based action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1110–1118
- Evangelidis et al (2014) Evangelidis G, Singh G, Horaud R (2014) Skeletal quads: Human action recognition using joint quadruples. In: International Conference on Pattern Recognition, pp 4513–4518
- Fan et al (2008) Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9:1871–1874
- Farhadi and Tabrizi (2008) Farhadi A, Tabrizi MK (2008) Learning to recognize activities from the wrong view point. In: European Conference on Computer Vision, pp 154–166
- Farhadi et al (2009) Farhadi A, Tabrizi MK, Endres I, Forsyth D (2009) A latent model of discriminative aspect. In: IEEE International Conference on Computer Vision, pp 948–955
- Feichtenhofer et al (2016) Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1933–1941
- Gkioxari and Malik (2015) Gkioxari G, Malik J (2015) Finding action tubes. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 759–768
- Goodfellow et al (2014) Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680
- Gopalan et al (2011) Gopalan R, Li R, Chellappa R (2011) Domain adaptation for object recognition: An unsupervised approach. In: IEEE International Conference on Computer Vision, pp 999–1006
- Gupta et al (2014) Gupta A, Martinez J, Little JJ, Woodham RJ (2014) 3d pose from motion for cross-view action recognition via non-linear circulant temporal encoding. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2601–2608
- He et al (2016a) He K, Zhang X, Ren S, Sun J (2016a) Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
- He et al (2016b) He Y, Shirakabe S, Satoh Y, Kataoka H (2016b) Human action recognition without human. In: European Conference on Computer Vision Workshops, pp 11–17
- Hu et al (2015) Hu JF, Zheng WS, Lai J, Zhang J (2015) Jointly learning heterogeneous features for rgb-d activity recognition. In: IEEE conference on Computer Vision and Pattern Recognition, pp 5344–5352
- Huang et al (2016) Huang Z, Wan C, Probst T, Van Gool L (2016) Deep learning on lie groups for skeleton-based action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition
- Ji et al (2013) Ji S, Xu W, Yang M, Yu K (2013) 3d convolutional neural networks for human action recognition. IEEE transactions on Pattern Analysis and Machine Intelligence 35(1):221–231
- Jia et al (2014a) Jia C, Kong Y, Ding Z, Fu YR (2014a) Latent tensor transfer learning for rgb-d action recognition. In: ACM international conference on Multimedia, pp 87–96
- Jia et al (2014b) Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014b) Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093
- Karpathy et al (2014) Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: IEEE conference on Computer Vision and Pattern Recognition, pp 1725–1732
- Kerola et al (2017) Kerola T, Inoue N, Shinoda K (2017) Cross-view human action recognition from depth maps using spectral graph sequences. Computer Vision and Image Understanding 154:108–126
- Kong and Fu (2017) Kong Y, Fu Y (2017) Max-margin heterogeneous information machine for rgb-d action recognition. International Journal of Computer Vision 123(3):350–371
- Krizhevsky et al (2012) Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp 1097–1105
- Li et al (2012) Li B, Camps OI, Sznaier M (2012) Cross-view activity recognition using hankelets. In: IEEE conference on Computer Vision and Pattern Recognition, pp 1362–1369
- Li and Zickler (2012) Li R, Zickler T (2012) Discriminative virtual views for cross-view action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2855–2862
- Li et al (2016) Li Y, Li W, Mahadevan V, Vasconcelos N (2016) Vlad3: Encoding dynamics of deep features for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1951–1960
- Liu et al (2011) Liu J, Shah M, Kuipers B, Savarese S (2011) Cross-view action recognition via view knowledge transfer. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 3209–3216
- Liu et al (2016) Liu J, Shahroudy A, Xu D, Wang G (2016) Spatio-temporal lstm with trust gates for 3d human action recognition. In: European Conference on Computer Vision, pp 816–833
- Luo et al (2017) Luo Z, Peng B, Huang DA, Alahi A, Fei-Fei L (2017) Unsupervised learning of long-term motion dynamics for videos. In: IEEE Conference on Computer Vision and Pattern Recognition
- Lv and Nevatia (2007) Lv F, Nevatia R (2007) Single view human action recognition using key pose matching and viterbi path searching. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1–8
- McInnes et al (2017) McInnes L, Healy J, Astels S (2017) hdbscan: Hierarchical density based clustering. The Journal of Open Source Software
- Ohn-Bar and Trivedi (2013) Ohn-Bar E, Trivedi M (2013) Joint angles similarities and hog2 for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 465–470
- Oreifej and Liu (2013) Oreifej O, Liu Z (2013) Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 716–723
- Parameswaran and Chellappa (2006) Parameswaran V, Chellappa R (2006) View invariance for human action recognition. International Journal of Computer Vision 66(1):83–101
- Pfister et al (2015) Pfister T, Charles J, Zisserman A (2015) Flowing convnets for human pose estimation in videos. In: IEEE International Conference on Computer Vision, pp 1913–1921
- Rahmani and Mian (2015) Rahmani H, Mian A (2015) Learning a non-linear knowledge transfer model for cross-view action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2458–2466
- Rahmani and Mian (2016) Rahmani H, Mian A (2016) 3d action recognition from novel viewpoints. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1506–1515
- Rahmani et al (2016) Rahmani H, Mahmood A, Huynh D, Mian A (2016) Histogram of oriented principal components for cross-view action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(12):2430–2443
- Rahmani et al (2017) Rahmani H, Mian A, Shah M (2017) Learning a deep model for human action recognition from novel viewpoints. IEEE Transactions on Pattern Analysis and Machine Intelligence
- Rao et al (2002) Rao C, Yilmaz A, Shah M (2002) View-invariant representation and recognition of actions. International Journal of Computer Vision 50(2):203–226
- Shahroudy et al (2016a) Shahroudy A, Liu J, Ng TT, Wang G (2016a) Ntu rgb+d: A large scale dataset for 3d human activity analysis. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1010–1019
- Shahroudy et al (2016b) Shahroudy A, Ng TT, Yang Q, Wang G (2016b) Multimodal multipart learning for action recognition in depth videos. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(10):2123–2129
- Shahroudy et al (2017) Shahroudy A, Ng TT, Gong Y, Wang G (2017) Deep multimodal feature analysis for action recognition in rgb+d videos. IEEE Transactions on Pattern Analysis and Machine Intelligence
- Shakhnarovich (2005) Shakhnarovich G (2005) Learning task-specific similarity. PhD thesis, Massachusetts Institute of Technology
- Shrivastava et al (2016) Shrivastava A, Pfister T, Tuzel O, Susskind J, Wang W, Webb R (2016) Learning from simulated and unsupervised images through adversarial training. arXiv preprint arXiv:1612.07828
- Simonyan and Zisserman (2014) Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp 568–576
- Soomro et al (2012) Soomro K, Zamir AR, Shah M (2012) Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402
- Su et al (2016) Su B, Zhou J, Ding X, Wang H, Wu Y (2016) Hierarchical dynamic parsing and encoding for action recognition. In: European Conference on Computer Vision, pp 202–217
- Szegedy et al (2015) Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1–9
- Toshev and Szegedy (2014) Toshev A, Szegedy C (2014) Deeppose: Human pose estimation via deep neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1653–1660
- Tran et al (2015) Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3d convolutional networks. In: IEEE International Conference on Computer Vision, pp 4489–4497
- Varol et al (2017a) Varol G, Laptev I, Schmid C (2017a) Long-term temporal convolutions for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence
- Varol et al (2017b) Varol G, Romero J, Martin X, Mahmood N, Black MJ, Laptev I, Schmid C (2017b) Learning from Synthetic Humans. In: IEEE Conference on Computer Vision and Pattern Recognition
- Vemulapalli et al (2014) Vemulapalli R, Arrate F, Chellappa R (2014) Human action recognition by representing 3d skeletons as points in a lie group. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 588–595
- Wang and Schmid (2013) Wang H, Schmid C (2013) Action recognition with improved trajectories. In: IEEE International Conference on Computer Vision, pp 3551–3558
- Wang et al (2011) Wang H, Kläser A, Schmid C, Liu CL (2011) Action recognition by dense trajectories. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 3169–3176
- Wang et al (2013a) Wang H, Kläser A, Schmid C, Liu CL (2013a) Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision 103(1):60–79
- Wang et al (2013b) Wang J, Liu Z, Wu Y, Yuan J (2013b) Learning actionlet ensemble for 3d human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence
- Wang et al (2014) Wang J, Nie X, Xia Y, Wu Y, Zhu SC (2014) Cross-view action modeling, learning and recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2649–2656
- Wang et al (2015) Wang L, Qiao Y, Tang X (2015) Action recognition with trajectory-pooled deep-convolutional descriptors. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 4305–4314
- Wang et al (2016a) Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2016a) Temporal segment networks: Towards good practices for deep action recognition. In: European Conference on Computer Vision, pp 20–36
- Wang et al (2016b) Wang P, Li Z, Hou Y, Li W (2016b) Action recognition based on joint trajectory maps using convolutional neural networks. In: ACM on Multimedia Conference, pp 102–106
- Wang and Hoai (2016) Wang Y, Hoai M (2016) Improving human action recognition by non-action classification. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2698–2707
- Weinland et al (2006) Weinland D, Ronfard R, Boyer E (2006) Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding 104(2):249–257
- Weinland et al (2007) Weinland D, Boyer E, Ronfard R (2007) Action recognition from arbitrary views using 3d exemplars. In: IEEE International Conference on Computer Vision, pp 1–7
- Yang and Tian (2014) Yang X, Tian Y (2014) Super normal vector for activity recognition using depth sequences. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 804–811
- Yilmaz and Shah (2005) Yilmaz A, Shah M (2005) Actions sketch: A novel action representation. In: IEEE Conference on Computer Vision and Pattern Recognition, vol 1, pp 984–989
- Yu et al (2015) Yu F, Zhang Y, Song S, Seff A, Xiao J (2015) Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. CoRR
- Yu et al (2016) Yu M, Liu L, Shao L (2016) Structure-preserving binary representations for rgb-d action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(8):1651–1664
- Zhang et al (2016) Zhang B, Wang L, Wang Z, Qiao Y, Wang H (2016) Real-time action recognition with enhanced motion vector cnns. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2718–2726
- Zhang et al (2013) Zhang Z, Wang C, Xiao B, Zhou W, Liu S, Shi C (2013) Cross-view action recognition via a continuous virtual path. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2690–2697
- Zheng and Jiang (2013) Zheng J, Jiang Z (2013) Learning view-invariant sparse representations for cross-view action recognition. In: IEEE International Conference on Computer Vision, pp 3176–3183
- Zhu et al (2016) Zhu W, Hu J, Sun G, Cao X, Qiao Y (2016) A key volume mining deep framework for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1991–1999