Video-based human action recognition has many applications in human-computer interaction, surveillance, video indexing and retrieval. Actions or movements generate varying patterns of spatio-temporal appearances in videos that can be used as feature descriptors for action recognition. Based on this observation, several visual representations have been proposed for discriminative human action recognition such as space-time pattern templates, shape matching [2, 3, 4], spatio-temporal interest points [5, 6, 7, 8, 9, 10], and motion trajectory based representations [11, 12, 13, 14]. In particular, dense trajectory based methods [12, 13, 14] have shown impressive results for action recognition by tracking densely sampled points through optical flow fields. While these methods are effective for action recognition from a common viewpoint, their performance degrades significantly under viewpoint changes. This is because the same action appears different and results in different trajectories when observed from different viewpoints.
A practical system must recognize human actions from unknown and, more importantly, unseen viewpoints. One approach for recognizing actions across different viewpoints is to collect data from all possible views and train a separate classifier for each case. This approach does not scale well as it requires a large number of labelled samples for each view. To overcome this problem, some techniques infer 3D scene structure and use geometric transformations to achieve view invariance [24, 25, 3, 26, 27]. These methods often require robust joint estimation which is still an open problem in real-world settings. Other methods focus on view-invariant spatio-temporal features [28, 29, 30, 31, 32]. However, the discriminative power of these methods is limited by the inherent structure of their view-invariant features.
Knowledge transfer-based methods [21, 17, 34, 22, 18, 19, 20, 23] have recently become popular for cross-view action recognition. These methods find a view independent latent space in which features extracted from different views are directly comparable. For instance, Li and Zickler proposed to construct virtual views between action descriptors from source and target views. They assume that an action descriptor transforms continuously between two viewpoints and the virtual path connecting two views lies on a hyper-sphere (see Fig. 1-(a)). Thus, their method computes virtual views as a sequence of linearly transformed descriptors obtained by making a finite number of stops along the virtual path. This method requires samples from both source and target views during training to construct virtual views.
To relax the above constraint on training data, Wang et al.  used a set of discrete views during training to interpolate arbitrary unseen views at test time. They learned a separate linear transformation between different views for each human body part using a linear SVM solver as shown in Fig. 1-(b), thereby limiting the scalability and increasing the complexity of their approach.
Existing view knowledge transfer approaches are unable to capture the non-linear manifolds where realistic action videos generally lie, especially when actions are captured from different views. This is because they only seek a set of linear transformations to construct virtual views between the descriptors of action videos captured from different viewpoints. Furthermore, such methods are either not applicable or perform poorly when recognition is performed on videos acquired from unknown and, more importantly, unseen viewpoints.
In this paper, we propose a different approach to view knowledge transfer that relaxes the assumptions on the virtual path and the requirements on the training data. We approach cross-view action recognition as a non-linear knowledge transfer learning problem where knowledge from multiple views is transferred to a shared compact high-level space. Our approach consists of three phases. Figure 2 shows an overview of the first phase where a Robust Non-linear Knowledge Transfer Model (R-NKTM) is learned. The proposed R-NKTM is a deep fully-connected network with weight decay and sparsity constraints which learns to transfer action video descriptors captured from different viewpoints to a shared high-level representation. The strongest point of our technique is that we learn a single R-NKTM for mapping all action descriptors from all camera viewpoints to a shared compact space. Note that the labels used in Fig. 2 are dummy labels where every sequence is given a unique label that does not correspond to any specific action. Thus, action labels are not required during R-NKTM learning or while transferring training and test action descriptors to the shared high-level space using the R-NKTM. The second phase is training, where action descriptors from unknown views are passed through the learned R-NKTM to construct their cross-view action descriptors. Action labels of training data are now required to train the subsequent classifier. In the test phase, view-invariant descriptors of actions observed from unknown and previously unseen views are constructed by forward propagating their view dependent action descriptors through the learned R-NKTM. Any classifier can be trained on the cross-view action descriptors for classification in a view-invariant way. We used a simple linear SVM classifier to show the strength of the proposed R-NKTM.
Our R-NKTM learning scheme is based on the observation that similar actions, when observed from different viewpoints, still have a common structure that puts them apart from other actions. Thus, it should be possible to separate action related features from viewpoint related features. The main challenge is that these features cannot be linearly separated. The second challenge comes from learning a non-linear model itself which requires a large amount of training data. Our solution is to learn the R-NKTM from action trajectories of synthetic 3D human models fitted to real motion capture (mocap) data. By projecting these 3D human models to different views, we can generate a large corpus of synthetic trajectories to learn the R-NKTM. We use k-means to generate a general codebook for encoding the action trajectories. The same codebook is used to encode dense trajectories extracted from real action videos in the training and test phases.
The major contribution of our approach is that we learn a single Robust Non-linear Knowledge Transfer Model (R-NKTM) which can bring any action observed from an unknown viewpoint to its compact high-level representation. Moreover, our method encodes action trajectories using a general codebook learned from synthetic data and then uses the same codebook to encode action trajectories of real videos. Thus, new action classes from real videos can easily be added using the same learned R-NKTM and codebook. Comparison with eight existing cross-view action recognition methods on four benchmark datasets including the IXMAS, UWA3D Multiview Activity II, Northwestern-UCLA Multiview Action3D, and UCF Sports datasets shows that our method is faster and achieves higher accuracy, especially when there are large viewpoint variations.
This paper is an extension of our prior work where we transferred a given action acquired from any viewpoint to its canonical view. Knowledge of the canonical view was required for NKTM learning in that work. This is a problem because the canonical view is not only action dependent, it is also ill-defined. For example, what would be the canonical view of a person walking in a circle? Another limitation of our prior work is that cylinders were fitted to the mocap data to approximate human limbs, head and torso. The trajectories generated from such models do not accurately represent human actions. In this paper, we extend our work by removing both limitations. Firstly, we no longer require identification of the canonical view for learning the new R-NKTM and use dummy labels instead. Secondly, we fit realistic 3D human models to the mocap data and hence generate more accurate trajectories. Using 3D human models also enables us to vary, and hence model, the human body shape and size. Besides these extensions, we also perform additional experiments on two more datasets, namely the UWA3D Multiview Activity II and UCF Sports datasets. We denote our prior model by NKTM and the one proposed in this paper by R-NKTM.
2 Related work
The majority of existing literature [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 38, 39, 40, 41] deals with action recognition from a common viewpoint. While these approaches are quite successful in recognizing actions captured from similar viewpoints, their performance drops sharply as the viewpoint changes due to the inherent view dependence of the features used by these methods. To tackle this problem, geometry based methods have been proposed for cross-view action recognition. Rao et al.  introduced an action representation to capture the dramatic changes of actions using view-invariant spatio-temporal curvature of 2D trajectories. This method uses a single point (e.g. hand centroid) trajectory. Yilmaz and Shah  extended this approach by tracking the 2D points on human contours. Given the human contours for each frame of a video, they generate an action volume by computing point correspondences between consecutive contours. Maximum and minimum curvatures on the spatio-temporal action volume are used as view-invariant action descriptors. However, these methods require robust interest points detection and tracking, which are still challenging problems.
Instead of using geometry constraints, Junejo et al.  proposed Self-Similarity Matrix that is constructed by computing the pairwise similarity between any pair of frames. Hankelet  represents actions with the dynamics of short tracklets, and achieves cross-view action recognition by finding the Hankelets that are invariant to viewpoint changes. These methods perform poorly on videos acquired from viewpoints that are significantly different from those of the training videos (e.g. the top view of IXMAS dataset) [37, 20].
Recently, transfer learning approaches have been employed to address cross-view action recognition by exploring some form of statistical connections between view-dependent features extracted from different viewpoints. A notable example of this category is the work of Farhadi et al., who employed Maximum Margin Clustering to generate split-based features in the source view, then trained a classifier to predict split-based features in the target view. Liu et al. learned a cross-view bag of bilingual words using the simultaneous multiview observations of the same action. They represented the action videos by bilingual words in both views. Zheng proposed to build a transferable dictionary pair by forcing the videos of the same action to have the same sparse coefficients across different views. However, these methods require feature-to-feature correspondence at the frame-level or video-level during training, thereby limiting their applications.
Li and Zickler assume that there is a smooth virtual path connecting the source and target views. They uniformly sampled a finite number of points along this virtual path and considered each point as a virtual view, i.e. a linear transformation function. Action descriptors from both views are augmented into cross-view feature vectors by applying a finite sequence of linear transformations to each descriptor. Recently, Zhang et al. extended this approach by applying an infinite sequence of linear transformations. Although these methods can operate in the absence of feature-to-feature correspondence between source and target views, they still require samples from the target view during training.
More recently, Wang et al. proposed cross-view action recognition by discovering discriminative 3D Poselets and learning the geometric relations among different views. However, they learn a separate transformation between different views using a linear SVM solver. Thus many linear transformations are learned for mapping between different views. For action recognition from unseen views, all learned transformations are used for exhaustive matching and the results are combined with an AND-OR Graph (AOG). This method also requires 3D skeleton data for training which is not always available. Gupta et al. proposed to find the best match for each training video in large mocap sequences using a Non-linear Circulant Temporal Encoding (nCTE) method. The best matched mocap sequence and its projections on different angles are then used to generate more synthetic training data, making the process computationally expensive. Moreover, the success of this approach depends on the availability of a large mocap dataset which covers a wide range of human actions [15, 16].
Deep Learning Models: Deep learning models [42, 43, 44] can learn a hierarchy of features by constructing high-level representations from low-level ones. Due to the impressive results of such deep learning on handwritten digit recognition, image classification and object detection, several methods have been recently proposed to learn deep models for video based action recognition. Ji et al. extended the deep 2D convolutional neural network (CNN) to 3D where convolutions are performed on 3D feature maps from spatial and temporal dimensions. Simonyan and Zisserman trained two CNNs, one for RGB images and one for optical flow signals, to learn spatio-temporal features. Gkioxari and Malik extended this approach for action localization. Donahue et al. proposed an end-to-end trainable recurrent convolutional network which processes video frames with a CNN, whose outputs are passed through a recurrent neural network. None of these methods is designed for action recognition in videos acquired from unseen views. Moreover, learning deep models for the task of cross-view action recognition requires a large corpus of training data acquired from multiple views which is unavailable and very expensive to acquire and label. These limitations motivate us to propose a pipeline for generating realistic synthetic training data and subsequently learn a Robust Non-linear Knowledge Transfer Model (R-NKTM) which can transfer action videos from any view to a high level space where actions can be matched in a view-invariant way. Although learned from synthetic data, the proposed R-NKTM is able to generalize to real action videos and achieve state-of-the-art results.
3 Proposed technique
The proposed technique comprises three main stages: feature extraction, Robust Non-linear Knowledge Transfer Model (R-NKTM) learning, and cross-view action description. In the feature extraction stage, synthetic dense trajectories are first generated by fitting 3D human models to mocap sequences and projecting the resulting 3D videos on planes corresponding to different viewpoints. The 2D dense trajectories are then represented by bag-of-features. In the model learning stage, a deep fully-connected network, called R-NKTM, is learned such that it transfers the view-dependent trajectory descriptors of the same action observed from different viewpoints to a shared high-level virtual view. In the third stage, the dense trajectory descriptors of real action videos are passed through the learned R-NKTM to construct cross-view action descriptors. Details of each stage are given below.
3.1 Feature extraction
Dense trajectories have been shown to be effective for action recognition [14, 12, 13, 15]. Our motivation for using dense trajectories is that they can be easily extracted from conventional videos as well as the synthetic 3D videos generated from mocap data.
3.1.1 Dense trajectories from videos
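To extract trajectories from videos, Wang et al. [13, 12] proposed to sample dense points from each frame and track them using displacement information from a dense optical flow field. The shape of a trajectory encodes the local motion pattern. Given a trajectory of length $L$, a sequence $S = (\Delta P_t, \ldots, \Delta P_{t+L-1})$ of displacement vectors $\Delta P_t = P_{t+1} - P_t = (x_{t+1} - x_t,\; y_{t+1} - y_t)$ is formed and normalized as:

$$S' = \frac{(\Delta P_t, \ldots, \Delta P_{t+L-1})}{\sum_{j=t}^{t+L-1} \lVert \Delta P_j \rVert} \tag{1}$$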
The descriptor encodes the shape of the trajectory. To embed appearance and motion information, a spatio-temporal volume aligned with the trajectory is subdivided into a spatio-temporal grid and HOG, HOF and MBH descriptors are computed in each cell of the grid. The bag-of-features approach is then employed to construct a histogram of visual word occurrences for each descriptor (trajectory shape, HOG, HOF, MBH) separately. The final descriptor is a concatenation of these four histograms. However, it is important to note that unlike [13, 12] we only use the trajectory descriptors since their extraction using multiple viewpoints and scales is computationally efficient as shown in Section 3.1.2. The same process, on the other hand, is computationally very expensive for the remaining three descriptors i.e. HOG, HOF, and MBH. Moreover, using trajectories only is also robust to changes in visual appearance due to clothing and lighting conditions.
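The trajectory-shape normalization described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the normalization of Wang et al. (division by the sum of displacement magnitudes); the function name `trajectory_shape` and the handling of static points are assumptions made here, not part of the original method.

```python
import math

def trajectory_shape(points):
    """Return normalized displacement vectors for one tracked point.

    points: list of (x, y) positions over consecutive frames; L + 1
    positions yield L displacement vectors.
    """
    # Displacement between consecutive positions.
    deltas = [(x1 - x0, y1 - y0)
              for (x0, y0), (x1, y1) in zip(points, points[1:])]
    # Sum of displacement magnitudes, used as the normalizer.
    total = sum(math.hypot(dx, dy) for dx, dy in deltas)
    if total == 0.0:
        # Assumption: a static point gets an all-zero descriptor.
        return [(0.0, 0.0) for _ in deltas]
    return [(dx / total, dy / total) for dx, dy in deltas]

# Hypothetical three-frame track.
track = [(0.0, 0.0), (3.0, 4.0), (3.0, 9.0)]
shape = trajectory_shape(track)
```

Because the descriptor depends only on point positions, the same routine applies to points tracked by optical flow in real videos and to projected model points in synthetic sequences.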
3.1.2 Dense trajectories from mocap sequences
Figure 2 gives an overview of the steps involved in generating synthetic dense trajectories using different human body shapes performing a large number of actions rendered from numerous viewpoints. Details are below.
3D Human body models: There are different ways to generate 3D human models. For example, Bogo et al. developed the FAUST dataset containing full 3D human body scans of individuals in poses. However, the skeleton data is not provided for these scans. Another way to generate a 3D human model is to use the open source MakeHuman software which can synthesize different realistic 3D human shapes in a predefined pose and also provide the joint positions, which can be used for generating human models in different poses. We use this technique for generating the 3D human models in our work.
Fitting 3D human models to mocap sequences: Several approaches [53, 54] have been proposed in the literature to fit a 3D human model to the motion capture skeleton data of a human subject. For instance, the SCAPE method learns pose and body-shape deformation models from the training scans of different human bodies in a few poses. Given a set of markers, SCAPE constructs a full mesh which is consistent with the SCAPE models, best matches the given markers and maintains realistic muscle deformations. This method takes approximately minutes to generate each frame. Another example is the MoSh method which estimates an accurate 3D body shape directly from the mocap skeleton without the use of 3D human scans. MoSh is also able to estimate soft-tissue motions from mocap data and subsequently use them to produce animations with subtlety and realism. MoSh requires about minutes to estimate a subject’s shape. However, these methods are computationally expensive to apply on a large corpus of mocap sequences. Thus, we use the open source Blender package to fit 3D human models to mocap data. Given a 3D human model generated by the MakeHuman software and a mocap sequence, Blender normalizes the mocap skeleton data with respect to the skeleton data of the human model and then fits the model to the normalized mocap data. This process results in a synthetic but realistic full 3D human body video corresponding to a mocap sequence.
Projection from multiple viewpoints: We deploy synthetic cameras (at distinct latitudes and longitudes) on a sphere surrounding the subject performing an action, as shown in Fig. 3. Given a perspective camera and a frame of a synthetic full 3D human body sequence, we deal with self-occlusions by removing points that are not visible from the given camera viewpoint. First, we perform back-face culling by removing 3D points whose normals face away from the camera. Then, the hidden point removal technique is applied on the remaining 3D points. This gives us a set of visible 3D points corresponding to the given viewpoint. The visible 3D points are projected to the image plane using perspective projection, resulting in a 2D pointcloud. We repeat this process for all cameras and all frames of the synthetic full 3D human body sequence, so that one sequence of 2D pointclouds is generated per camera for each synthetic full 3D human action sequence corresponding to a mocap sequence.
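The culling and projection steps above can be illustrated with a small sketch. It assumes points and normals are already expressed in the camera coordinate frame, with the camera at the origin looking along the positive z axis; the function names and the focal length are hypothetical, and the hidden point removal step is deliberately omitted.

```python
def backface_cull(points, normals, camera=(0.0, 0.0, 0.0)):
    """Keep the 3D points whose surface normals face the camera."""
    visible = []
    for p, n in zip(points, normals):
        # Vector from the surface point towards the camera centre.
        to_cam = tuple(c - q for c, q in zip(camera, p))
        # A positive dot product means the normal points at the camera.
        if sum(a * b for a, b in zip(n, to_cam)) > 0.0:
            visible.append(p)
    return visible

def project(points, focal=1.0):
    """Pinhole perspective projection onto the image plane z = focal."""
    return [(focal * x / z, focal * y / z) for x, y, z in points if z > 0.0]

# Two hypothetical surface points: the first faces the camera, the second does not.
pts = [(1.0, 2.0, 2.0), (1.0, 1.0, 4.0)]
nrm = [(0.0, 0.0, -1.0), (0.0, 0.0, 1.0)]
cloud_2d = project(backface_cull(pts, nrm))
```

Running this per camera and per frame gives one 2D pointcloud sequence for each viewpoint.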
Dense trajectory extraction: Since we already have dense correspondence between the 3D human models in each pose, it is straightforward to extract trajectory features from their projected sequence of 2D pointclouds by simply connecting them in time over a fixed horizon of frames. A sequence of normalized displacement vectors is calculated for each point using Eq. (1). Note that we use the same trajectory length for both synthetic and real videos. We represent each video (synthetic or real) by a set of motion trajectory descriptors. We construct a codebook by clustering the trajectory descriptors with k-means. It is important to note that clustering is performed only over the synthetic trajectory descriptors to learn the codebook. Thus, unlike existing cross-view action recognition techniques [15, 16, 23, 17, 20], the codebook we learn does not use the trajectory descriptors of real videos from the IXMAS, UWA3DII or Northwestern-UCLA datasets. We call this the general codebook. We consider each cluster as a codeword that represents a specific motion pattern shared by the trajectory descriptors in that cluster. One codeword is assigned to each trajectory descriptor based on the minimum Euclidean distance. The resulting histograms of codeword occurrences are used as trajectory descriptors. Real action videos are encoded with the same codebook. Recall that unlike dense trajectory-based methods [12, 13] which use HOF, HOG, and MBH descriptors along with trajectories, our method only uses trajectory descriptors.
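The codebook construction and encoding steps can be sketched as follows. This is a toy, pure-Python version of Lloyd's k-means with hard assignment, written only to make the data flow concrete; the function names, the random initialization and the tiny 2D descriptors are assumptions, and a real system would use an optimized k-means on full-length trajectory descriptors.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(codebook, d):
    """Index of the codeword closest to descriptor d (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda c: sq_dist(codebook[c], d))

def kmeans(descriptors, k, iters=20, seed=0):
    """Learn k codewords from the (synthetic) trajectory descriptors."""
    rng = random.Random(seed)
    codebook = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest(codebook, d)].append(d)
        for c, g in enumerate(groups):
            if g:  # keep the old centre if a cluster becomes empty
                codebook[c] = [sum(col) / len(g) for col in zip(*g)]
    return codebook

def encode(codebook, descriptors):
    """Histogram of codeword occurrences for the trajectories of one video."""
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[nearest(codebook, d)] += 1
    return hist

# The codebook is learned from synthetic descriptors only ...
synthetic = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
codebook = kmeans(synthetic, k=2)
# ... and reused unchanged to encode descriptors from a real video.
real_video = [(0.0, 0.2), (9.0, 10.0), (10.0, 10.0)]
hist = encode(codebook, real_video)
```

Keeping `kmeans` and `encode` separate mirrors the text: the clustering never sees real-video descriptors, so the same codebook can encode any new dataset.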
3.2 Non-linear Knowledge Transfer Model
Besides the limitations of employing linear transformation functions between views, existing cross-view action recognition methods [21, 15, 18, 20, 22, 23] are either not applicable to unseen views or require augmented training samples which cover a wide range of human actions. Moreover, these methods do not scale well to new data and need to repeat the computationally expensive model learning process when a new action class is to be added. To simultaneously overcome these problems, we propose a Robust Non-linear Knowledge Transfer Model (R-NKTM) that learns to transfer the action trajectory descriptors from all possible views to a shared compact high-level virtual view. Our R-NKTM is learned using synthetic training data and is able to generalize to real data without the need for retraining or fine-tuning, thereby increasing its scalability.
As depicted in Fig. 4, our R-NKTM is a deep network consisting of $M$ fully-connected layers followed by a softmax layer, with $p^{(m)}$ units in the $m$-th fully-connected layer, where $m = 1, 2, \ldots, M$. For a given training sample $x_i^j$, where $x_i^j$ is the $i$-th sample in the $j$-th view, the output of the first layer is $h^{(1)} = f(W^{(1)} x_i^j + b^{(1)})$, where $W^{(1)}$ is a weight matrix to be learned in the first layer, $b^{(1)}$ is a bias vector, and $f(\cdot)$ is a non-linear activation function. The rectified linear unit (ReLU) does not suffer from the gradient vanishing problem like the sigmoid and hyperbolic tangent functions do. Moreover, it has been shown that deep networks can be trained efficiently using the ReLU function even without the need for pre-training. Finally, ReLU generates sparse representations with true zeros that are suitable for exploiting sparsity in the data, which is the case for histograms of codeword occurrences. Therefore, we use ReLU as the activation function in our proposed model.

The output of the first layer is used as the input of the second layer. The output of the second layer is computed as:

$$h^{(2)} = f(W^{(2)} h^{(1)} + b^{(2)}) \tag{2}$$

where $W^{(2)}$, $b^{(2)}$, and $f(\cdot)$ are the weight matrix, bias, and non-linear activation function of the second layer, respectively. Similarly, the output of the second layer is used as the input of the third layer and the output of the third layer is computed as

$$h^{(3)} = f(W^{(3)} h^{(2)} + b^{(3)}) \tag{3}$$

where $W^{(3)}$, $b^{(3)}$, and $f(\cdot)$ are the weight matrix, bias, and non-linear activation function of the third layer, respectively. The output of the last fully-connected layer is computed as

$$F(x_i^j) = h^{(M)} = f(W^{(M)} h^{(M-1)} + b^{(M)}) \tag{4}$$

where $F(\cdot)$ is a non-linear transformation function determined by the parameters $W^{(m)}$ and $b^{(m)}$, $m = 1, \ldots, M$. The output of the last fully-connected layer is passed through a softmax layer to find the appropriate class label.
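The layer-by-layer computation described above can be sketched as a plain forward pass. This is an illustrative sketch only: it assumes ReLU is applied after every fully-connected layer, represents each weight matrix as a list of rows, and uses tiny hypothetical weights rather than learned ones.

```python
import math

def relu(v):
    return [max(0.0, a) for a in v]

def affine(W, b, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(a - m) for a in z]
    s = sum(e)
    return [a / s for a in e]

def forward(layers, x):
    """Return the output of every fully-connected layer and the class probabilities."""
    outputs = []
    h = x
    for W, b in layers:
        h = relu(affine(W, b, h))  # output of one layer feeds the next
        outputs.append(h)
    return outputs, softmax(h)

# Hypothetical two-layer network acting on a 2-bin histogram.
layers = [([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
          ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
outputs, probs = forward(layers, [2.0, 1.0])
```

Returning the per-layer outputs alongside the class probabilities is convenient because the intermediate activations are what the later description stage reuses.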
We use this structure to find a shared high-level space among all possible views. Specifically, in our problem, the inputs to the R-NKTM are synthetic trajectory descriptors corresponding to mocap sequences over different views, while the output is their dummy class labels. Since we use the CMU mocap dataset  consisting of action sequences, the last fully-connected layer has units whose outputs are given to the softmax layer. The basic idea of this R-NKTM is that regardless of the input view of an unknown action (recall that we do not use the action labels of the mocap sequences), we encourage the output class label of the R-NKTM to be the same for all views of the given action. We explain this idea in the following.
Assume that there is a virtual path which connects any view to a single shared high-level virtual view. Therefore, there are different virtual paths connecting input views to the shared virtual view as shown in Fig. 4. We consider each virtual path as a set of non-linear transformations of action descriptors. Moreover, assume that the videos of the same action over different views share the same high-level feature representation. Given these two assumptions, our objective is to find this shared high-level virtual view and the intermediate virtual views connecting the input views to the shared virtual view.
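The learning of the proposed R-NKTM is carried out by updating its parameters $\theta = \{\theta_W, \theta_b\}$, where $\theta_W = \{W^{(1)}, \ldots, W^{(M)}\}$ and $\theta_b = \{b^{(1)}, \ldots, b^{(M)}\}$, for minimizing the following objective function over all samples of the input views:

$$E_1(\theta) = \frac{1}{nN} \sum_{j=1}^{n} \sum_{i=1}^{N} \ell\big(F(x_i^j; \theta),\, y_i\big) \tag{5}$$

where $F(x_i^j; \theta)$ is the output of the last fully-connected layer for the $i$-th sample in the $j$-th view, $n$ is the number of viewpoints, $N$ is the number of samples in the mocap dataset, $y_i$ denotes the class label of the $i$-th mocap sequence, i.e. $y_i \in \{1, \ldots, N\}$, and $\ell$ denotes the softmax loss function.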
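The role of the dummy labels in this objective can be made concrete with a short sketch. It assumes the loss is averaged over all views and sequences (the averaging constant is an assumption made here) and that sequence $i$ simply carries dummy label $i$ in every view; the names `softmax_loss` and `objective` are hypothetical.

```python
import math

def softmax_loss(scores, label):
    """Cross-entropy of softmax(scores) against an integer class label."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]

def objective(scores_by_view):
    """Average softmax loss over all views and sequences.

    scores_by_view[j][i] holds the network scores for sequence i seen from
    view j; every view of sequence i shares the same dummy label i.
    """
    total, count = 0.0, 0
    for view in scores_by_view:
        for i, scores in enumerate(view):
            total += softmax_loss(scores, i)
            count += 1
    return total / count

# Two views of two sequences: consistent scores give a lower objective.
good = [[[5.0, 0.0], [0.0, 5.0]], [[4.0, 0.0], [0.0, 4.0]]]
bad = [[[0.0, 5.0], [5.0, 0.0]], [[0.0, 4.0], [4.0, 0.0]]]
```

Since the label depends only on the sequence index and not on the view index, minimizing this quantity pushes all views of one sequence towards the same output.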
Due to the high flexibility of the proposed R-NKTM (e.g. the number of units in each layer), appropriate settings in the configuration of the R-NKTM are needed to ensure that it learns the underlying data structure. Since the input descriptors are high dimensional, we discard the redundant information in the input data by mapping it to a compact, high-level and low dimensional representation. This operation is performed by the fully-connected layers of the R-NKTM.
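To avoid over-fitting and improve generalization of the R-NKTM, we add weight decay and sparsity regularization terms to the training criterion, i.e. the loss function (5) [59, 60]. Large weights cause highly curved non-smooth mappings. Weight decay keeps the weights small and hence the mappings smooth to reduce over-fitting. Similarly, sparsity helps in selecting the most relevant features to improve generalization. The regularized objective is

$$E(\theta) = E_1(\theta) + \lambda_w E_w(\theta) + \lambda_s E_s(\theta) \tag{6}$$

where $E_1(\theta)$ is the softmax-loss objective (5), and $\lambda_w$ and $\lambda_s$ are the weight decay and sparsity parameters respectively. The penalty $E_w$ tends to decrease the magnitude of the weights $W^{(m)}$:

$$E_w(\theta) = \sum_{m} \lVert W^{(m)} \rVert_F^2 \tag{7}$$

where $\lVert W^{(m)} \rVert_F$ returns the Frobenius norm of the weight matrix of the $m$-th layer. Let

$$\hat{\rho}_t^{(m)} = \frac{1}{nN} \sum_{j=1}^{n} \sum_{i=1}^{N} h_t^{(m)}(x_i^j)$$

be the mean activation of the $t$-th unit of the $m$-th layer (averaged over all the training samples $x_i^j$, with $n$ viewpoints and $N$ mocap sequences). The penalty $E_s$ forces the $\hat{\rho}_t^{(m)}$ to be as close as possible to a sparsity target $\rho$ and is defined in terms of the Kullback-Leibler (KL) divergence between a Bernoulli random variable with mean $\hat{\rho}_t^{(m)}$ and a Bernoulli random variable with mean $\rho$ as

$$E_s(\theta) = \sum_{m} \sum_{t} \mathrm{KL}\big(\rho \,\|\, \hat{\rho}_t^{(m)}\big) = \sum_{m} \sum_{t} \left[ \rho \log \frac{\rho}{\hat{\rho}_t^{(m)}} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_t^{(m)}} \right] \tag{8}$$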
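A small sketch may help make the two regularizers concrete. It assumes the usual squared Frobenius norm for weight decay and the standard Bernoulli KL form for the sparsity penalty; clamping the mean activation into the open interval (0, 1) is an assumption added here so the logarithms stay defined for ReLU outputs, and all function names are hypothetical.

```python
import math

def weight_decay(weights):
    """Sum of squared Frobenius norms over all layer weight matrices."""
    return sum(w * w for W in weights for row in W for w in row)

def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))

def sparsity_penalty(activations, rho, eps=1e-8):
    """Sparsity penalty for one layer.

    activations: one list of unit activations per training sample.
    """
    n = len(activations)
    penalty = 0.0
    for unit in zip(*activations):  # iterate over units, not samples
        # Mean activation of this unit, clamped into (0, 1) (assumption).
        rho_hat = min(max(sum(unit) / n, eps), 1.0 - eps)
        penalty += kl_bernoulli(rho, rho_hat)
    return penalty

# Hypothetical layer with two units observed on two samples.
W = [[[1.0, 2.0], [3.0, 4.0]]]
acts = [[0.05, 0.5], [0.05, 0.5]]
reg = weight_decay(W), sparsity_penalty(acts, rho=0.05)
```

The penalty is zero for a unit whose mean activation equals the target and grows as the mean drifts away from it, which is what drives most units towards inactivity.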
The reasons for using these two regularization terms are twofold. Firstly, not all features are equally important. Secondly, sparsity forces the R-NKTM to find a compact, shared and high-level virtual view by selecting only the most critical features. A dense representation may not learn a good model because almost any change in the input layer modifies most of the entries in the output layer.
Our goal is to solve the optimization problem in (6) as a function of the weights $\theta_W$ and the biases $\theta_b$. Therefore, we use stochastic gradient descent through back-propagation to minimize this function over all training samples in the mocap data.
Figure 5 visualizes the output features of the learned R-NKTM layers for four mocap actions that were not used during learning. In each case, a 3D human model was fitted to the mocap sequence and projected from viewpoints. Dense trajectories of each view were calculated to get descriptors which were then individually passed through the learned R-NKTM. Figure 5 shows the outputs of each layer as an image. As expected, the outputs of the shared virtual view are very similar for all views. Note that we drop the outputs of the last fully-connected and softmax layers because they are the class scores which correspond to dummy labels.
3.3 Cross-View Action Description
So far we have learned an R-NKTM whose input is a synthetic trajectory descriptor corresponding to a mocap sequence fitted with a 3D human model and observed from any arbitrary view. The output of the model is the class label which is the same for all views of the sequence. However, our aim is to extract cross-view action descriptors from real videos acquired from any arbitrary view.
Given a real human action video, the view-dependent descriptor is constructed by extracting dense trajectories from multiple spatial scales of the given video and then building the histogram of codeword occurrences using the learned general codebook as discussed in Section 3.1. Recall that the R-NKTM learns to find a shared high-level virtual view, and that the intermediate virtual views lie on the virtual path connecting the input view and the shared virtual view. This means that we have a set of non-linear transformation functions which transfer the view-dependent action trajectory descriptor from an unknown view to the shared high-level virtual view. Recall that we remove the last fully-connected and softmax layers because these layers correspond to dummy labels which do not provide any useful information for representing real videos.
We describe an action video as alterations of its view-dependent descriptor along the virtual path. The cross-view action descriptor is constructed by concatenating the transformed features along the virtual path into a long feature vector. This new descriptor implicitly incorporates the non-linear changes from the unknown input view to the shared high-level virtual view. Since the feature vector contains all the virtual views from the source to the shared view, it is more robust to viewpoint variations. To perform cross-view action recognition on any real action video dataset, we use the samples with their corresponding labels from a source view, i.e. training data, and extract their cross-view action descriptors. Then, we train a linear SVM classifier to classify these actions. For a given sample at test time (i.e. a sample from the target view), we simply extract its cross-view descriptor and feed it to the trained SVM classifier to find its label. Figure 6 shows an overview of the proposed method for extracting cross-view action descriptors from real videos.
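The descriptor construction above amounts to a forward pass that keeps and concatenates the intermediate outputs. The sketch below assumes ReLU layers, skips the last fully-connected layer because it only produces dummy-label scores, and does not include the raw input histogram in the concatenation; whether the input itself is included is not stated in the text, so that choice and all names are assumptions.

```python
def relu_layer(W, b, x):
    """One fully-connected layer with ReLU; W is a list of rows."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def cross_view_descriptor(layers, x):
    """Concatenate layer outputs along the virtual path.

    layers: (W, b) pairs of the learned fully-connected layers. The final
    pair yields dummy-label scores and is therefore not used.
    """
    descriptor = []
    h = x
    for W, b in layers[:-1]:
        h = relu_layer(W, b, h)
        descriptor.extend(h)
    return descriptor

# Hypothetical three-layer model; only the first two layers contribute.
layers = [([[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]),
          ([[1.0, 1.0]], [0.5]),
          ([[2.0]], [0.0])]
desc = cross_view_descriptor(layers, [1.0, 2.0])
```

Descriptors computed this way for labelled source-view videos would be the training input of the linear SVM, and the same function would be applied unchanged to target-view videos at test time.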
4 Experiments
We evaluate our proposed method on four benchmark datasets including the INRIA Xmas Motion Acquisition Sequences (IXMAS), UWA3D Multiview Activity3DII (UWA3DII), Northwestern-UCLA Multiview Action3D (N-UCLA), and UCF Sports datasets. We compare our performance to the state-of-the-art action recognition methods including Dense Trajectories (DT), Hankelets, Discriminative Virtual Views (DVV), Continuous Virtual Path (CVP), Non-linear Circulant Temporal Encoding (nCTE), AND-OR Graph (AOG), Long-term Recurrent Convolutional Network (LRCN), and Action Tube.
We report action recognition results of our method for unseen and unknown views, i.e. unlike DVV and CVP we assume that no videos, labels or correspondences from the target view are available at training time. More importantly, unlike existing techniques [23, 22, 15, 21, 49, 50] we learn our R-NKTM and build the codebook using only synthetic motion trajectories generated from mocap sequences. Therefore, the R-NKTM and the codebook are general and can be used for cross-view action recognition on any action video without the need for retraining or fine-tuning. More precisely, we use the same learned R-NKTM to evaluate our algorithm on the IXMAS, UWA3DII and N-UCLA datasets. In contrast, nCTE, DVV, CVP and AOG need to learn different models to transfer knowledge across views for different datasets, and Action Tube and LRCN require fine-tuning of a pre-trained model for each action video dataset.
In addition to the accuracy of our method, we report the recognition accuracy of the NKTM proposed in our prior work . As shown in Fig. 7, the view knowledge transfer model in  uses a different architecture consisting of units at the input/output layers and units at the two hidden layers. Moreover, it learns to transfer actions observed from unknown viewpoints to their canonical view.
4.1 Implementation Details
For a fair comparison, we pass the dense trajectory descriptors, instead of spatio-temporal interest point descriptors, to DVV  and CVP . Moreover, we use virtual views, each with a -dimensional features. The baseline results are obtained using publicly available implementations of DT , Hankelets , nCTE , DVV , LRCN  and Action Tube  or from the original papers.
Dense Trajectories Extraction: To generate synthetic dense trajectory descriptors from multiple viewpoints, we use the CMU Motion Capture dataset  which contains over mocap sequences of different subjects performing a variety of daily-life actions. We remove the short sequences containing less than frames since dense trajectories require minimum frames. The remaining mocap sequences are used for generating synthetic training data to learn the R-NKTM. Each sequence is treated as a different action and given a unique dummy label. We can generate as many different views from the 3D videos as we desire. Using azimuthal angle , and zenith angle , we generate () camera viewpoints and project the 3D videos. Dense trajectories are then extracted from the 2D projections and clustered into clusters using -means to make the general codebook. From real videos, we extract dense trajectories using the method by Wang et al. . We take the length of each trajectory for both mocap and video sequences. As recommended by , we use spatial scales spaced by a factor of and the dense sampling step size for video samples.
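The viewpoint sampling described above can be sketched as follows. The angle grids, the orthographic camera model, and all function names are assumptions made for illustration; the values used in the experiments are not reproduced here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(a):
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)

def camera_directions(azimuths_deg, zeniths_deg):
    # One unit viewing direction per (azimuth, zenith) pair.
    dirs = []
    for phi in zeniths_deg:
        for theta in azimuths_deg:
            t, p = math.radians(theta), math.radians(phi)
            dirs.append((math.sin(p) * math.cos(t),
                         math.sin(p) * math.sin(t),
                         math.cos(p)))
    return dirs

def project(point, d):
    # Orthographic projection onto the plane perpendicular to direction d.
    up = (0.0, 0.0, 1.0) if abs(d[2]) < 0.999 else (0.0, 1.0, 0.0)
    u = normalize(cross(up, d))
    v = cross(d, u)
    return (dot(point, u), dot(point, v))

def project_trajectory(points_3d, d):
    return [project(p, d) for p in points_3d]

# Hypothetical grids: 6 azimuths x 3 zeniths = 18 viewpoints.
directions = camera_directions(range(0, 360, 60), (30, 60, 90))
track_3d = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.1), (0.2, 0.1, 1.2)]
track_2d = project_trajectory(track_3d, directions[0])
```

Each projected track would then be turned into a trajectory descriptor and quantized against the k-means codebook.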
R-NKTM Configuration: We used multi-resolution search  to find optimal hyper-parameter values such as weight decay, sparsity and units per layer. The idea is to test some values from a larger parameter range, select a few best configurations and then test again with smaller steps around these values. To optimize the number of R-NKTM layers, we tested networks with increasing number of layers  and stopped where the performance peaked on our validation data. We used a momentum of , weight decay , sparsity parameter , and sparsity target .
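The coarse-to-fine idea behind the multi-resolution search can be sketched for a single hyper-parameter as follows. The function name, its default arguments, and the toy `score` function are assumptions for illustration; in practice `score` would train a network with the candidate value and return its validation accuracy.

```python
def multi_resolution_search(score, low, high, points=5, keep=2, rounds=3):
    # Start from a coarse, evenly spaced grid over [low, high].
    step = (high - low) / (points - 1)
    candidates = [low + i * step for i in range(points)]
    best = max(candidates, key=score)
    for _ in range(rounds):
        # Keep the few best configurations ...
        top = sorted(candidates, key=score, reverse=True)[:keep]
        # ... and test again with a smaller step around them.
        step /= 2.0
        candidates = sorted({min(high, max(low, c + k * step))
                             for c in top for k in (-1, 0, 1)})
        best = max(candidates + [best], key=score)
    return best

# Toy objective with its optimum at 0.3 (stand-in for validation accuracy).
best_value = multi_resolution_search(lambda x: -(x - 0.3) ** 2, 0.0, 1.0)
```

Searching several hyper-parameters jointly would replace the one-dimensional grid with a product of grids; the refinement loop stays the same.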
4.2 IXMAS Dataset
This dataset  consists of synchronized videos observed from different views including four side views and a top view. It contains daily-life actions including check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick, and pick up. Each action was performed three times by subjects. Figure 8 shows examples from this dataset.
We follow the same evaluation protocol as in [23, 28, 15] and verify our algorithm on all possible pairwise view combinations. In each experiment, we use all videos from one camera as training samples and then evaluate the recognition accuracy on the video samples from the remaining cameras. Comparison of the recognition accuracy for possible combinations of training and test cameras is shown in Table I.
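The pairwise protocol can be sketched as a loop over ordered (source, target) camera pairs. The camera names, the toy one-dimensional features, and the nearest-neighbour classifier below are placeholders assumed for illustration only; they stand in for the cross-view descriptors and the linear SVM.

```python
from itertools import permutations

def fit_1nn(train):
    # Placeholder "training": memorize the labelled samples.
    return list(train)

def predict_1nn(model, x):
    # Label of the closest stored sample.
    return min(model, key=lambda s: abs(s[0] - x))[1]

def evaluate_pairwise(data, fit, predict):
    # data maps a camera name to a list of (feature, label) samples.
    results = {}
    for src, tgt in permutations(sorted(data), 2):
        model = fit(data[src])       # train on one camera only
        test = data[tgt]             # test on a different camera
        hits = sum(predict(model, x) == y for x, y in test)
        results[(src, tgt)] = hits / len(test)
    return results

# Toy data for five cameras (four side views and a top view).
data = {
    f"cam{i}": [(0.0 + 0.01 * i, "wave"), (1.0 + 0.01 * i, "kick")]
    for i in range(5)
}
results = evaluate_pairwise(data, fit_1nn, predict_1nn)
mean_accuracy = sum(results.values()) / len(results)
```

With five cameras this yields 5 x 4 = 20 ordered source-target pairs, each reported as one table entry.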
R-NKTM achieves better recognition accuracy than the NKTM , which requires defining the same canonical view for all actions. Moreover, the proposed R-NKTM outperforms the state-of-the-art methods on most view pairs and achieves average recognition accuracy which is about higher than the nearest competitor nCTE . It is interesting to note that our R-NKTM can perform much better (about on average) than the nearest competitor nCTE  when camera is considered as either source or target view (see Table II). As shown in Fig. 8, camera captured videos from the top view, so the appearance of these videos is completely different from the videos captured from the side views (i.e. camera to ). Hence, we believe that the recognition results on camera are the most important for evaluating cross-view action recognition. Moreover, some actions such as check watch, cross arms, and scratch head are not available in the mocap dataset. However, our R-NKTM achieves average accuracy on these three actions which is about higher than nCTE . This demonstrates that the proposed R-NKTM is able to transfer knowledge across views without requiring all action classes in the learning phase.
Among the knowledge transfer based methods, DVV  and CVP  did not perform well. The deep learning based methods such as LRCN  and Action Tube  achieve low accuracy because they were originally proposed for action recognition from a common viewpoint. DT  achieves a high overall recognition accuracy because the motion trajectories of action videos captured from the side views are similar. However, its average accuracy when camera is considered as either source or target view, is over lower than our proposed method.
4.3 UWA3D Multiview Activity II Dataset
This dataset  consists of a variety of daily-life human actions performed by subjects with different scales. It includes action classes: one hand waving, one hand punching, two hand waving, two hand punching, sitting down, standing up, vibrating, falling down, holding chest, holding head, holding back, walking, irregular walking, lying down, turning around, drinking, phone answering, bending, jumping jack, running, picking up, putting down, kicking, jumping, dancing, mopping floor, sneezing, sitting down (chair), squatting, and coughing. Each subject performed actions times. Each time the action was captured from a different viewpoint (front, top, left and right side views). Video acquisition from multiple views was not synchronous, so there are variations in the actions besides the viewpoints. This dataset is challenging because of varying viewpoints, self-occlusion and high similarity among actions. For instance, the actions drinking and phone answering have very similar motion, but the location of the hand in these two actions is slightly different. Also, actions like holding head and holding back have self-occlusion. Moreover, in the top view, the lower part of the body was not properly captured because of occlusion. Figure 10 shows four sample actions observed from viewpoints.
We follow  and use the samples from two views as training data, and the samples from the remaining views as test data. Table III summarizes our results. The proposed R-NKTM significantly outperforms NKTM  and the state-of-the-art methods on all view pairs. The overall accuracy of the view knowledge transfer based methods such as DVV  and CVP  is low because motion and appearance of many actions look very similar across view changes.
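The train/test splits implied by this protocol can be enumerated as follows. This is a sketch under the assumption that each remaining view is tested separately; the function name is hypothetical and the view names are taken from the dataset description.

```python
from itertools import combinations

def two_view_training_splits(views):
    # Every unordered pair of training views, tested on each remaining view.
    splits = []
    for train in combinations(views, 2):
        for test in views:
            if test not in train:
                splits.append((train, test))
    return splits

views = ["front", "top", "left", "right"]
splits = two_view_training_splits(views)
```

With four views this gives 6 training pairs and 2 test views per pair, i.e. 12 settings.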
It is interesting to note that our method achieves average recognition accuracy which is about higher than the nearest competitor nCTE  when view is considered as the test view. As shown in Fig. 10, view is the top view which is challenging because the lower part of the subject’s body was not fully captured by the camera.
Figure 11 compares the class specific action recognition accuracies of R-NKTM and NKTM . The proposed R-NKTM achieves better recognition accuracy on most action classes. The easiest action to identify is jumping jack with an average accuracy of and the hardest is phone answering with an average accuracy of . These results are not surprising, since jumping jack is one of the activities with the most discriminative trajectories while phone answering is confused with drinking because the motion of these actions is very similar.
It is important to note that for many actions in the UWA3D Multiview ActivityII dataset such as holding chest, holding head, holding back, sneezing and coughing, there are no similar actions in the CMU mocap dataset. However, our method still achieves high recognition accuracies for these actions. This demonstrates the effectiveness and generalization ability of our proposed model for representing human actions from unseen and unknown views in a view-invariant space.
4.4 N-UCLA Multiview Action3D Dataset
This dataset  contains RGB, depth and skeleton data captured simultaneously by Kinect cameras. The dataset consists of action categories including pick up with one hand, pick up with two hands, drop trash, walk around, sit down, stand up, donning, doffing, throw, and carry. Each action was performed by subjects from to times. Fig. 12 shows some examples. This dataset is very challenging because the subjects performed some walking within most actions and the motion of some actions such as carry and walk around is very similar. Moreover, most activities involve human-object interactions.
We follow  and use the samples from the first two cameras for training and samples from the remaining camera for testing. The comparison of the recognition accuracy is shown in Table IV. The proposed R-NKTM again outperforms the NKTM  and achieves the highest recognition accuracy.
Figure 13 compares the per-class recognition accuracy of our proposed R-NKTM and NKTM . Our method achieves higher accuracy than NKTM  for most action classes. Note that a search for some actions such as donning, doffing and drop trash returns no results on the CMU mocap dataset  used to learn our R-NKTM. However, our method still achieves average recognition accuracy on these three actions which is about higher than nCTE . Moreover, walk around and carry have maximum confusion with each other because the motion of these actions is very similar.
4.5 UCF Sports Dataset
While the focus of the proposed approach is on action recognition from unknown and unseen views, we also evaluate its performance for recognizing actions from previously seen views to have a baseline and to show that our method performs equally well when the viewpoint of the test action is not novel. The evaluation is performed on the UCF Sports dataset  containing videos from sports broadcasts in a wide range of scenes. As recommended in , we use the Leave-One-Out (LOO) cross-validation scheme. We compare our proposed method to the dense trajectory based method (DT) . We choose DT  as our baseline because it is most relevant to our work as it employs dense trajectory descriptors. As shown in Table VI, using only trajectory descriptors, our method achieves higher accuracy than DT . However, combining HOG, HOF, and MBH descriptors with the trajectory descriptor significantly increases the recognition accuracy of DT  by . Similarly, adding these features to our cross-view action descriptor significantly improves the accuracy of our method to which is about higher than DT .
Combining the view-dependent HOG, HOF and MBH descriptors with our cross-view descriptor also improves the recognition accuracy for the multiview case, especially when the difference between the viewpoints is not large. Table V shows comparative results of combined descriptors and the cross-view trajectory-only descriptors on the IXMAS dataset. The accuracy of most source-target combinations from side views has improved by using the combined features. This is because the appearance of these views is quite similar.
|R-NKTM (Traj. only)||92.7||80.3||83.9||55.2||95.5||80.6||86.4||47.0||82.7||83.6||83.6||75.5||85.8||85.2||84.9||44.2||56.0||53.0||79.0||52.4||74.1|
4.6 Effects of Concatenating Virtual Views
We evaluate the intermediate performance of our cross-view descriptor by sequentially adding the virtual views. Figures 14 and 15 show the recognition accuracy on the IXMAS and UWA3DII datasets respectively for all possible source-target view pairs. For most source-target view pairs of the IXMAS dataset, the accuracy increases as more virtual views are added to the cross-view action descriptor. The maximum incremental gain is obtained when camera (top view) is used as training or test view. The minimum gain is for view pair because the viewpoints of these cameras are very similar. Thus the raw trajectory descriptors already achieve high accuracy. Fig. 15 shows that for all source-target view pairs of the UWA3DII dataset, the recognition accuracy increases by adding virtual views to the descriptor.
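The sequential concatenation used in this experiment can be sketched as follows. The function name and the toy view sizes are assumptions for illustration: the first element stands for the raw trajectory descriptor and the following ones for successive virtual views.

```python
def incremental_descriptors(virtual_views):
    # Return one descriptor per step: the first k views concatenated.
    prefixes, current = [], []
    for view in virtual_views:
        current = current + list(view)  # new list, earlier prefixes unchanged
        prefixes.append(current)
    return prefixes

# Toy example with three "views" of sizes 3, 2 and 1.
toy_views = [[0.2, 0.5, 0.3], [0.7, 0.1], [0.9]]
steps = incremental_descriptors(toy_views)
```

Training and testing the classifier once per entry of `steps` would trace accuracy as a function of the number of concatenated views.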
4.7 Computation Time
It is interesting to note that our technique outperforms the current cross-view action recognition methods on the IXMAS , UWA3DII  and N-UCLA  datasets by transferring knowledge across views using the same R-NKTM learned without supervision (without real action labels). Therefore, compared to existing cross-view action recognition techniques, the proposed R-NKTM is more general and can be used in on-line action recognition systems. More precisely, the cost of adding a new action class using our approach in an on-line system is equal to that of SVM training. In contrast, adding a class is computationally expensive for most existing techniques, especially for our nearest competitors [21, 15], as shown in Table VII. For instance, nCTE  must perform computationally expensive spatio-temporal matching for each video sample of the new action class. Similarly, AOG  needs to retrain the AND/OR structure and tune its parameters. Table VII compares the computational complexity of the proposed method with AOG  and nCTE . Compared to AOG  and nCTE , the training time of the proposed method for adding a new action class is negligible. Moreover, the test time of the proposed method is much shorter than that of AOG  and comparable to that of nCTE . However, nCTE  requires GB memory to store the augmented samples whereas our model requires MB memory to store the learned R-NKTM and the general codebook.
We presented an algorithm for unsupervised learning of a Robust Non-linear Knowledge Transfer Model (R-NKTM) for cross-view action recognition. We call it unsupervised because the labels used to learn the R-NKTM are just dummy labels and do not correspond to actions that we want to recognize. The proposed R-NKTM is scalable as it needs to be trained only once using synthetic data and generalizes well to real data. We presented a pipeline for generating a large corpus of synthetic training data required for deep learning. The proposed method generates realistic 3D videos by fitting 3D human models to real motion capture data. The 3D videos are projected onto 2D planes corresponding to a large number of viewing directions and their dense trajectories are calculated. Using this approach, the dense trajectories are realistic and easy to compute since the correspondence between the 3D human poses is known a priori. A general codebook is learned from these trajectories using k-means and then used to represent the synthetic trajectories for R-NKTM learning as well as the trajectories extracted from real videos during training and testing. The major strength of the proposed R-NKTM is that a single model is learned to transform any action from any viewpoint to its respective high-level representation. Moreover, action labels or knowledge of the viewing angles are not required for R-NKTM learning or R-NKTM based representation of real video data. To represent actions in real video sequences, their dense trajectories are coded with the general codebook and forward propagated through the R-NKTM. A simple linear SVM classifier was used to show the strength of our model. Experiments on benchmark multiview datasets show that the proposed approach outperforms existing state-of-the-art methods.
This research was supported by ARC Discovery Grants DP110102399 and DP160101458.
-  M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, “Actions as space-time shapes,” in ICCV, 2005.
-  Z. Lin, Z. Jiang, and L. Davis, “Recognizing actions by shape-motion prototype trees,” in ICCV, 2009.
-  F. Lv and R. Nevatia, “Single view human action recognition using key pose matching and viterbi path searching,” in CVPR, 2007.
-  S. Xiang, F. Nie, Y. Song, and C. Zhang, “Contour graph based human tracking and action sequence recognition,” Pattern Recognition, 2008.
-  P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, “Behavior recognition via sparse spatio-temporal features,” in ICCV, 2005.
-  I. Laptev, “On space-time interest points,” IJCV, 2005.
-  G. Willems, T. Tuytelaars, and L. Van Gool, “An efficient dense and scale-invariant spatio-temporal interest point detector,” in ECCV, 2008.
-  H. Rahmani, A. Mahmood, D. Q Huynh, and A. Mian, “HOPC: Histogram of oriented principal components of 3D pointclouds for action recognition,” in ECCV, 2014.
-  J. Liu and M. Shah, “Learning human actions via information maximization,” in CVPR, 2008.
-  J. Liu, Y. Yang, and M. Shah, “Learning semantic visual vocabularies using diffusion distance,” in CVPR, 2009.
-  S. Wu, O. Oreifej, and M. Shah, “Action recognition in videos acquired by a moving camera using motion decomposition of lagrangian particle trajectories,” in ICCV, 2011.
-  H. Wang, A. Kläser, C. Schmid, and C. Liu, “Dense trajectories and motion boundary descriptors for action recognition,” IJCV, 2013.
-  H. Wang, A. Kläser, C. Schmid, and C. Liu, “Action recognition by dense trajectories,” in CVPR, 2011.
-  H. Wang and C. Schmid, “Action recognition with improved trajectories,” in ICCV, 2013.
-  A. Gupta, J. Martinez, J. J. Little, and R. J. Woodham, “3D pose from motion for cross-view action recognition via non-linear circulant temporal encoding,” in CVPR, 2014.
-  A. Gupta, A. Shafaei, J. J. Little, and R. J. Woodham, “Unlabelled 3D motion examples improve cross-view action recognition,” in BMVC, 2014.
-  R. Gopalan, R. Li, and R. Chellappa, “Domain adaptation for object recognition: An unsupervised approach,” in ICCV, 2011.
-  A. Farhadi and M. K. Tabrizi, “Learning to recognize activities from the wrong view point,” in ECCV, 2008.
-  A. Farhadi, M. K. Tabrizi, I. Endres, and D. A. Forsyth, “A latent model of discriminative aspect,” in ICCV, 2009.
-  J. Liu, M. Shah, B. Kuipers, and S. Savarese, “Cross-view action recognition via view knowledge transfer,” in CVPR, 2011.
-  J. Wang, X. Nie, Y. Xia, Y. Wu, and S. Zhu, “Cross-view action modeling, learning and recognition,” in CVPR, 2014.
-  Z. Zhang, C. Wang, B. Xiao, W. Zhou, S. Liu, and C. Shi, “Cross-view action recognition via a continuous virtual path,” in CVPR, 2013.
-  R. Li and T. Zickler, “Discriminative virtual views for cross-view action recognition,” in CVPR, 2012.
-  A. Yilmaz and M. Shah, “Action sketch: a novel action representation,” in CVPR, 2005.
-  T. Syeda-Mahmood, A. Vasilescu, and S. Sethi, “Action recognition from arbitrary views using 3D exemplars,” in ICCV, 2007.
-  D. Gavrila and L. Davis, “3D model-based tracking of humans in action: a multi-view approach,” in CVPR, 1996.
-  T. Darrell, I. Essa, and A. Pentland, “Task-specific gesture analysis in real-time using interpolated views,” PAMI, 1996.
-  B. Li, O. Camps, and M. Sznaier, “Cross-view activity recognition using hankelets,” in CVPR, 2012.
-  V. Parameswaran and R. Chellappa, “View invariance for human action recognition,” IJCV, 2006.
-  C. Rao, A. Yilmaz, and M. Shah, “View-invariant representation and recognition of actions,” IJCV, 2002.
-  D. Weinland, R. Ronfard, and E. Boyer, “Free viewpoint action recognition using motion history volumes,” CVIU, 2006.
-  I. N. Junejo, E. Dexter, I. Laptev, and P. Pérez, “Cross-view action recognition from temporal self-similarities,” in ECCV, 2008.
-  X. Yang and Y. Tian, “A survey of vision-based methods for action representation, segmentation and recognition,” CVIU, 2011.
-  J. Zheng and Z. Jiang, “Learning view-invariant sparse representations for cross-view action recognition,” in ICCV, 2013.
-  H. Rahmani, A. Mahmood, D. Q Huynh, and A. Mian, “Histogram of oriented principal components for cross-view action recognition,” arXiv:1409.6813, 2015.
-  M. Rodriguez, J. Ahmed, and M. Shah, “Action MACH: a spatio-temporal maximum average correlation height filter for action recognition,” in CVPR, 2008.
-  H. Rahmani and A. Mian, “Learning a non-linear knowledge transfer model for cross-view action recognition,” in CVPR, 2015.
-  H. Rahmani, A. Mahmood, D. Q. Huynh, and A. Mian, “Real time action recognition using histograms of depth gradients and random decision forests,” in WACV, 2014.
-  H. Rahmani, D. Q. Huynh, A. Mahmood, and A. Mian, “Discriminative human action classification using locality-constrained linear coding,” Pattern Recognition Letters, 2015.
-  A. Shahroudy, T.-T. Ng, Q. Yang, and G. Wang, “Multimodal multipart learning for action recognition in depth videos,” PAMI, 2016.
-  H. Rahmani, A. Mahmood, D. Huynh, and A. Mian, “Action classification with locality-constrained linear coding,” in ICPR, 2014.
-  G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, 2006.
-  G. E. Hinton, S. Osindero, and Y. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, 2006.
-  Y. Bengio, “Learning deep architectures for AI,” Foundations and Trends in Machine Learning, 2009.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014.
-  S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,” PAMI, 2013.
-  K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in NIPS, 2014.
-  G. Gkioxari and J. Malik, “Finding action tubes,” in CVPR, 2015.
-  J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in CVPR, 2015.
-  F. Bogo, J. Romero, M. Loper, and M. J. Black, “FAUST: Dataset and evaluation for 3D mesh registration,” in CVPR, 2014.
-  “MakeHuman: an open source 3D computer graphics software,” http://www.makehuman.org/.
-  M. Loper, N. Mahmood, and M. J. Black, “MoSh: motion and shape capture from sparse markers,” ACM Transactions on Graphics, 2014.
-  D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis, “SCAPE: Shape completion and animation of people,” ACM Transactions on Graphics, 2005.
-  “Blender: a 3D modelling and rendering package,” http://www.blender.org/.
-  S. Katz, A. Tal, and R. Basri, “Direct visibility of point sets,” ACM Transactions on Graphics, 2007.
-  X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in International Conference on Artificial Intelligence and Statistics, 2011.
-  CMU Motion Capture Database, http://mocap.cs.cmu.edu/.
-  Y. Bengio, “Practical recommendations for gradient-based training of deep architectures,” Neural Networks: Tricks of the Trade, 2012.
-  G. Hinton, “A practical guide to training restricted Boltzmann machines,” Neural Networks: Tricks of the Trade, 2012.
-  S. Lawrence, C. L. Giles, and A. C. Tsoi, “What size neural network gives optimal generalization? Convergence properties of backpropagation,” Tech. Rep., 1996.
-  H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, “An empirical evaluation of deep architectures on problems with many factors of variation,” in ICML, 2007, pp. 473–480.