Human action recognition is an active research topic with several applications in surveillance and security baptista2018anticipating , healthcare and assisted living baptista2017flexible ; baptista2017video , and human-computer interaction song2012continuous . Nevertheless, due to large differences within the same class of actions, viewpoint variations, occlusions and changes in lighting conditions, action recognition remains a challenging problem.
Consequently, there is a wide variety of action recognition approaches in the literature. One way to categorize them is by the image area on which features are computed: global approaches, where the entire image is used to generate features weinland:inria-00544629 ; 910878 , and local approaches, where specific regions of interest are selected to generate features. One of the most popular approaches belonging to the second category is Dense Trajectories wang:2011:inria-00583818:1 , in which every action is represented by a set of motion trajectories along which features are aligned and encoded using the Bag-of-Words (BoW) model Li:2005:BHM:1068508.1069129 .
Approaches based on Dense Trajectories are particularly effective when the amount of motion is high Koperski14 . This is mainly because images in a video are densely sampled and tracked for generating the trajectories. However, Dense Trajectories, by construction, include trajectories of points that are irrelevant for action recognition, caused for instance by background motion or noise. Furthermore, Dense Trajectories are typically generated using optical flow, which fails to describe motion that is radial with respect to the image plane. Therefore, taking advantage of the availability of RGB-D cameras, we propose to redefine Dense Trajectories by giving them a local description power. This is achieved by clustering Dense Trajectories around human body joints provided by RGB-D sensors; we refer to the resulting trajectories as Localized Trajectories henceforth.
The proposed approach offers two main advantages. First, since we only consider trajectories that are localized around human body joints, our approach is more robust to large irrelevant motion estimates. As a consequence, actions which have similar motion patterns, but involve different body parts, are more easily distinguished. Second, our approach allows the description of the relationship “action-motion-joint”, i.e., an action is associated with both a type of motion and a joint location, in contrast to classical Dense Trajectories, which are described by the relationship “action-motion”, where an action is associated with a type of motion only. This is done by generating features around the Localized Trajectories based on the concept of local BoWs lazebnik2006beyond . One codebook is therefore constructed per group of Localized Trajectories. Each codebook corresponds to a specific body joint.
For a better description of radial motion, we further propose to explore Localized Trajectories using the three modalities provided by RGB-D cameras. Specifically, we introduce the 3D Localized Trajectories concept, which requires the estimation of scene flow, the displacement vector field in 3D, instead of optical flow. Coupling 3D Trajectories and the corresponding motion descriptors with Localized Trajectories offers richer localized motion information, in both lateral and radial directions, allowing a better discrimination of actions. However, scene flow estimation is generally noisier, resulting in a less accurate temporal tracking of points. Thus, we propose to construct local codebooks by sampling trajectory-aligned features based on confidence and ambiguity metrics Wang12 .
This paper is an extended version of papadopoulos2017enhanced . Compared to our previous work, the main contribution is the generalization of the proposed Localized Trajectories to 3D using RGB-D data. This extension is combined with a novel codebook construction scheme, suitable for tackling noisy feature samples. Moreover, an extensive comparison with state-of-the-art approaches is presented, along with evaluation on multiple datasets and novel discussions and analysis.
In summary, the contributions of this paper are listed as follows:
A novel 2D Localized Trajectories concept is introduced, which utilizes body pose information in order to spatially group similar trajectories together.
Localized Trajectories are extended from 2D to 3D thanks to the availability of depth data, which are directly used for 3D motion estimation.
A novel feature selection concept for a robust codebook construction is introduced.
An extensive experimental evaluation on several RGB-D datasets is presented to validate the discriminative power of the proposed approach.
The remainder of the paper is organized as follows: in Section 2, a literature review of related works is given, followed by a detailed overview of background material in Section 3. The proposed approach is described in Section 4 and Section 5. In Section 6, descriptions of different datasets, experimental setups, and results are presented. Finally, Section 7 concludes the paper and provides a perspective on future research directions.
2 Related Work
In this section, we present some of the established action recognition approaches in the literature. First, we focus on representations inspired by Dense Trajectories, which are directly related to our work. Then, we give a general overview of RGB-D based action recognition approaches.
2.1 Dense Trajectories Related Approaches
Initially introduced by Wang et al. wang:2011:inria-00583818:1 , Dense Trajectories are classically generated by computing motion and texture features around motion trajectories. Due to their popularity, many researchers have extended this original formulation in order to enhance their performance wang2013action ; Koperski14 ; wang2015action ; jiang2012trajectory ; ni2015motion .
As a first attempt, Wang et al. wang2013action proposed to reinforce Dense Trajectories by using the Random Sampling Consensus (RANSAC) algorithm to reduce the noise caused by motion. In addition to that, they have replaced the Bag-of-Visual-Words representation with Fisher Vectors.
Then, Koperski et al. Koperski14 suggested enriching motion trajectories using depth information. They proposed a model grouping the videos into two types: videos with a high level of motion and videos with a low amount of motion. For the first group, an extension of the Trajectory Shape Descriptor wang:2011:inria-00583818:1 which includes depth information has been used, while for the second group a novel descriptor called Speeded Up Robust Features (SURF) has been introduced in order to generate local depth patterns.
To further improve the accuracy of recognition, Wang et al. wang2015action proposed to use deep-learned features in place of hand-crafted ones such as the Trajectory Shape Descriptor (TSD) wang:2011:inria-00583818:1 , Histogram of Oriented Gradients (HOG) dalal2005histograms , Histogram of Optical Flow (HOF) CRHV:CVPR09 , and Motion Boundary Histogram (MBH) wang:2011:inria-00583818:1 .
On the other hand, in jiang2012trajectory , a novel approach to encode relations between motion trajectories has been presented. Global and local reference points have been used to compute Dense Trajectories, offering robustness to camera motion.
Finally, Ni et al. ni2015motion proposed to focus on the trajectory groups which contribute most to a specific action by defining an optimization problem. In the same direction, Jhuang et al. jhuang2013towards proposed the extraction of features around joint trajectories, increasing the discriminative power of the original Dense Trajectories approach wang:2011:inria-00583818:1 .
Although all the aforementioned methods have shown their effectiveness, they lack locality information related to the human body. This piece of information is crucial when actions include similar motion patterns performed by different body parts. For this reason, we propose a novel dense trajectory-based approach that takes into consideration the local spatial distribution of motion with respect to the human body.
2.2 Action Recognition From RGB-D Data
With the recent availability of affordable RGB-D cameras, considerable effort has been devoted to action recognition using both RGB and depth modalities. For a more comprehensive overview of the state of the art, we refer the reader to a recent survey survey , where RGB-D based action recognition methods are grouped into two categories according to the nature of the descriptor, namely learned representations deep1 ; deep2 ; deep3 and hand-crafted representations Wang12 ; oreifej2013hon4d ; wang2012robust . Since this work is concerned with the description of actions using Dense Trajectories, we mainly focus on hand-crafted approaches. These can in turn be classified into depth-based, skeleton-based and hybrid approaches.
The first class of methods directly extracts human motion information from depth maps Xia ; HOG3D ; HOG2 ; foggia2013 ; shukla ; oreifej2013hon4d ; SNVpami ; Slama ; rahmani . The second group gathers approaches which make use of the 3D skeletons extracted from depth maps. During the past few years, a wide range of methods have been designed using this high-level modality zanfir ; yang2012eigenjoints ; LieGroup ; devanne20153 ; amor2016action ; Demisse_2018_CVPR_Workshops ; ghorbel2018kinematic .
Compared to depth-based descriptors, skeleton-based descriptors require less computation time, are easier to manipulate and can better discriminate local motions. However, they are more sensitive to noise since they depend heavily on the quality of the skeleton. Thus, to reinforce action recognition, a third class of methods, called hybrid, combines multiple modalities. These approaches usually exploit the skeleton information to compute local features using RGB and/or depth images. These local RGB-D based features have shown noteworthy potential Wang12 ; wang2012robust ; li2010action . Inspired by this concept, which aims at computing local depth-based and RGB-based features around specific joints, we propose to adapt the same idea to Dense Trajectories, which have been proven to be one of the most powerful action representations.
3 Background: Dense Trajectories for Action Recognition
Dense Trajectories were initially introduced by Wang et al. wang:2011:inria-00583818:1 . They are constructed by densely tracking sampled points over an RGB video stream and constructing representative features around the detected trajectories. As mentioned in Section 1, Dense Trajectories have been proven to be very effective in action recognition. They mainly owe their success to the fact that they incorporate low-level motion information. Below, we give an overview of the Dense Trajectories approach.
Let be a sequence of images. Subsequently, representative points are sampled from each image grid with a constant stepping size – we denote each sampling grid position at frame as . The point is then estimated in the next frame using a motion field , derived by optical flow estimation farneback2003two such that:
where is a median filter kernel at the position . As a result, large motion changes between subsequent frames are smoothed. Furthermore, to avoid drifting, trajectories longer than the assigned fixed length are rejected. Applying (1) on frames results in a smoothed trajectory estimation of the point . We denote the dense trajectory as:
with , , the first frame of the sequence and the total number of generated trajectories.
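As a minimal sketch of the tracking step in (1), the following Python snippet advances a sampled point through a list of per-frame flow fields, smoothing each displacement with a median taken over a small window centred on the rounded point position. The function names, the component-wise median, the window radius and the nested-list layout of the flow field are illustrative assumptions; they are not taken from the implementation of wang:2011:inria-00583818:1 .

```python
from statistics import median


def median_filtered_flow(flow, x, y, radius=1):
    """Component-wise median of the flow vectors in a window around (x, y).

    `flow` is a nested list indexed as flow[row][col] -> (u, v).
    The window is centred on the rounded position and clipped to the image.
    """
    height, width = len(flow), len(flow[0])
    cx, cy = int(round(x)), int(round(y))
    us, vs = [], []
    for row in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for col in range(max(0, cx - radius), min(width, cx + radius + 1)):
            u, v = flow[row][col]
            us.append(u)
            vs.append(v)
    return median(us), median(vs)


def track_point(flows, start, length):
    """Track one sampled point for at most `length` frames.

    `flows[t]` is the flow field between frame t and frame t + 1.
    Tracking stops early if the point leaves the image.
    """
    x, y = start
    trajectory = [(x, y)]
    for flow in flows[:length]:
        height, width = len(flow), len(flow[0])
        if not (0 <= round(x) < width and 0 <= round(y) < height):
            break  # the point drifted outside the image grid
        u, v = median_filtered_flow(flow, x, y)
        x, y = x + u, y + v
        trajectory.append((x, y))
    return trajectory
```

Capping the number of tracked frames with `length` mirrors the fixed trajectory length used to limit drift.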
The set of trajectories generated in (2) is used to construct descriptors aligned along a spatio-temporal volume. In wang:2011:inria-00583818:1 , four types of descriptors are used: TSD wang:2011:inria-00583818:1 , HOG dalal2005histograms , HOF CRHV:CVPR09 , and MBH wang:2011:inria-00583818:1 . Each of the above descriptors is designed to capture distinctive spatio-temporal features of the occurring motion. As a final step, all of the descriptors are aggregated and encoded using BoWs: one codebook of visual words per descriptor is constructed using K-means clustering, so that the final features are represented by a unified histogram of word appearances.
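The BoW encoding step can be illustrated with a short, dependency-free Python sketch: a codebook is learned with plain Lloyd's K-means and a set of descriptors is then encoded as a normalised histogram of word occurrences. The function names, the deterministic initialisation from the first k descriptors, the fixed iteration count and the L1 normalisation are assumptions made for illustration only.

```python
import math


def nearest_word(descriptor, codebook):
    """Index of the codebook word closest (Euclidean) to the descriptor."""
    return min(range(len(codebook)),
               key=lambda k: math.dist(descriptor, codebook[k]))


def kmeans_codebook(descriptors, k, iterations=10):
    """Plain Lloyd's K-means; the first k descriptors seed the words."""
    codebook = [list(d) for d in descriptors[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest_word(d, codebook)].append(d)
        for j, group in enumerate(groups):
            if group:  # keep the previous word if a cluster becomes empty
                codebook[j] = [sum(c) / len(group) for c in zip(*group)]
    return codebook


def bow_histogram(descriptors, codebook):
    """L1-normalised histogram of visual-word occurrences."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_word(d, codebook)] += 1
    total = sum(counts) or 1  # avoid division by zero for empty input
    return [c / total for c in counts]
```

In practice one such codebook would be built per descriptor type, and the resulting histograms would be concatenated.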
One of the main drawbacks of Dense Trajectories is that points on the image grid are sampled uniformly, which potentially leads to the inclusion of a significant amount of noise. Furthermore, the generated Dense Trajectories do not take into account the spatial structure of the human body. Thus, actions with similar motion patterns can potentially be confused during classification.
4 Localized Trajectories for Action Recognition
To make Dense Trajectories more robust to irrelevant information, we propose a reformulation called Localized Trajectories. The main idea is to give Dense Trajectories a local description in order to: 1) track the motion in specific and relevant spatial regions of the human body, namely around the joints; and 2) remove redundant and irrelevant motion information, which can negatively affect the classifier performance.
To that end, the pose information through estimated 3D skeletons is used as prior information to estimate an optimal clustering configuration.
Let us consider the human skeleton extracted from RGB-D cameras composed of joints and let us denote the trajectory of each skeleton joint as . Note that we assume that the joints are always well detected. We use the distance proposed by Raptis et al. Raptis12 to group Dense Trajectories of an action around joints. Given a pair of dense and joint trajectories, respectively, and , which co-exist in the temporal range , the spatio-temporal distance between two given trajectories is expressed using (3) as follows:
such that is the spatial distance and is the velocity difference between trajectories and . Then, an affinity matrix is computed between every pair of trajectories using (3) as:
where the measure penalizes trajectories with significant variation in spatial location and velocity. After a hierarchical clustering procedure based on the affinity score Raptis12 , a membership indicator function specifies the cluster of joint each trajectory belongs to.
Furthermore, trajectories whose distance exceeds a certain threshold are rejected. This condition ensures that irrelevant and noise-induced trajectories, e.g., those caused by background motion, are not considered.
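The grouping and rejection steps can be sketched as follows. This is a simplified illustration rather than the procedure of Raptis12 : the distance below is an assumed stand-in (mean spatial gap plus mean velocity gap) and the hierarchical clustering is replaced by a plain nearest-joint rule. All function names and the threshold value used in the usage note are hypothetical.

```python
import math


def trajectory_distance(traj, joint):
    """Assumed stand-in distance: mean spatial gap plus mean velocity gap.

    Both inputs are equal-length lists of (x, y) positions that cover
    the same temporal range.
    """
    spatial = [math.dist(p, q) for p, q in zip(traj, joint)]
    vel_t = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(traj, traj[1:])]
    vel_j = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(joint, joint[1:])]
    velocity = [math.dist(v, w) for v, w in zip(vel_t, vel_j)]
    mean_velocity = sum(velocity) / len(velocity) if velocity else 0.0
    return sum(spatial) / len(spatial) + mean_velocity


def assign_to_joints(trajectories, joints, threshold):
    """Label each trajectory with its nearest joint, or None if rejected."""
    labels = []
    for traj in trajectories:
        distances = [trajectory_distance(traj, j) for j in joints]
        best = min(range(len(joints)), key=distances.__getitem__)
        # Trajectories farther than the threshold from every joint
        # (e.g. background motion) are discarded.
        labels.append(best if distances[best] <= threshold else None)
    return labels
```

For example, with one static joint at (0, 0), another at (10, 0) and a threshold of 5, a trajectory hovering around (1, 0) is assigned to the first joint, while a trajectory around (100, 100) is rejected.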
Feature Representation: As discussed in wang:2011:inria-00583818:1 , features can be computed along each trajectory and BoWs can be used to aggregate and encode the information. In such a case, however, a descriptor associated with each trajectory carries no locality information. In contrast, we propose to exclusively assign trajectories and their corresponding descriptors to trajectory clusters. The main advantage of such a construction is that every trajectory-aligned descriptor not only captures the spatio-temporal characteristics of the trajectory but also carries its location. Thus, we construct a local codebook for each trajectory group . During feature encoding, one histogram is constructed per joint cluster and per descriptor denoted by :
The subscripts of the individual histograms identify the type of descriptors. Finally, an action video is represented by the concatenation of the individual joint histograms in a final histogram , as follows:
The general overview of our approach is illustrated in Fig. 1.
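The construction of the final representation from per-joint histograms can be sketched as below. The snippet assumes that each descriptor already carries the label of its joint cluster (or None if its trajectory was rejected) and that one codebook per joint is available; for brevity a single descriptor type is used. The function names and data layout are hypothetical.

```python
import math


def encode(descriptors, codebook):
    """L1-normalised histogram of nearest-word assignments."""
    counts = [0] * len(codebook)
    for d in descriptors:
        word = min(range(len(codebook)),
                   key=lambda k: math.dist(d, codebook[k]))
        counts[word] += 1
    total = sum(counts) or 1  # an empty cluster yields an all-zero histogram
    return [c / total for c in counts]


def video_representation(descriptors, joint_labels, codebooks):
    """Concatenate one local BoW histogram per joint cluster.

    codebooks[j] is the local codebook of joint j and joint_labels[i]
    is the joint cluster of descriptors[i] (None for rejected ones).
    """
    final = []
    for j, codebook in enumerate(codebooks):
        local = [d for d, label in zip(descriptors, joint_labels)
                 if label == j]
        final.extend(encode(local, codebook))
    return final
```

Because the position of each block in the concatenated vector is tied to a joint, two actions with the same motion pattern on different body parts produce different representations.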
5 3D Trajectories and Aligned Descriptors
Dense Trajectories, generated via optical flow, offer adequate performance when used for tracking movements that are lateral to the image plane. However, they struggle to track motion that happens radially, because such motion appears subtle in the 2D image plane. Consequently, in this section, we propose to extend Localized Trajectories to RGB-D video streams by replacing optical flow with scene flow. The generated 3D trajectories are suitable for tracking motion in both lateral and radial directions, as illustrated in Fig. 2.
5.1 Scene Flow Estimation Using RGB-D Data
To generalize the concept of Dense Trajectories from 2D to 3D, we propose to make use of the 3D extension of optical flow, called scene flow. Thanks to the emergence of RGB-D cameras, numerous approaches have been proposed to estimate scene flow from depth maps, e.g. the Primal-Dual Framework for Real-Time Dense RGB-D Scene Flow (PD-Flow) algorithm jaimez2015primal , the Dense semi-rigid scene flow estimation quiroga2014dense and the Layered RGBD scene flow estimation sun2015layered .
The scene flow is linearly dependent on the depth motion field , where is the range flow. It is computed by mapping to the 3D world coordinate system as below:
where and are the camera focal lengths, and are the 3D world coordinates of a specific point. On the other hand, the depth motion fields are estimated as the solution of a global variational problem, defined as:
where is a data term defined as the combined measure of the photometric and geometric inconsistency of successive depth and intensity images and is defined as a regularizer term.
We choose PD-Flow jaimez2015primal to estimate a dense scene flow field from an RGB-D video stream, since it has been shown to be one of the fastest and most accurate algorithms.
5.2 3D Localized Trajectories
To estimate the 3D trajectories using the scene flow, we start by uniformly sampling points from the 2D image grid. In this context, we define pixel coordinates as . Similar to wang:2011:inria-00583818:1 , we reject points belonging to homogeneous areas. Next, each of the sampled points is mapped to a standard 3D world coordinate system using the inverse of the intrinsic camera parameter matrix as described below:
where , are the image plane central point coordinates, and are the respective x and y components of the focal length and is the depth value. Subsequently, trajectories of the mapped 3D points are estimated using (1), except that the motion field is now based on an estimated scene flow. The estimated 3D Dense Trajectories are denoted as:
where is the scene flow field. Correspondence between estimated 3D points, with scene flow, and image pixels is derived by solving (10) in terms of .
The above procedure is repeated until each of the 3D trajectories reaches the fixed temporal length we have set. Similar to wang:2011:inria-00583818:1 , trajectories with sudden displacements or small overall spatial length are considered irrelevant and are removed.
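The mapping between pixels and 3D points described above can be sketched with the standard pinhole camera model. The snippet back-projects a pixel with known depth, advances the 3D point by a scene flow vector and projects it back to the image to recover the pixel correspondence. The function names and the intrinsic parameter values used in the usage note are illustrative assumptions.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) with depth value `depth` to a 3D point (X, Y, Z)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def project(point, fx, fy, cx, cy):
    """Map a 3D point (X, Y, Z) with Z > 0 back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def advance(point, scene_flow):
    """Move a 3D point by the scene flow vector estimated at its location."""
    return tuple(p + f for p, f in zip(point, scene_flow))
```

For instance, with assumed intrinsics fx = fy = 500 and principal point (320, 240), a point that moves purely towards the camera changes its depth from frame to frame, which is exactly the radial component that optical flow alone captures poorly.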
In depth maps, texture information is not present. Thus, in our case, only motion descriptors are considered. Three types of descriptors are used: 3D Trajectory Shape Descriptor (3DTSD), Histogram of Scene Flow holte2012local (HSF), and 3D Motion Boundary Histogram (3DMBH). 3DTSD is based on the original idea of the TSD for Dense Trajectories wang:2011:inria-00583818:1 . For each trajectory, the normalized displacement vector is computed. The HSF descriptor captures the orientation and the magnitude of the local scene flow field. For a spatio-temporal volume aligned around a 3D trajectory, the orientation of the 3D displacement is calculated using the azimuth and elevation angles formed by consecutive points as:
For the histogram construction, the 4D space is quantized into a fixed number of bins. Similarly, the 3DMBH is based on the same idea as HSF. First, the derivative of the scene flow field is computed and, then, for every pair of coordinates, the orientation angle is estimated.
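A minimal sketch of an orientation histogram built from consecutive 3D trajectory points is given below. The axis convention for azimuth and elevation, the number of bins and the magnitude weighting of the votes are assumptions made for illustration; the exact quantization used for HSF and 3DMBH may differ.

```python
import math


def orientation(p, q):
    """Azimuth, elevation and magnitude of the displacement from p to q."""
    dx, dy, dz = (b - a for a, b in zip(p, q))
    azimuth = math.atan2(dy, dx)                    # range [-pi, pi]
    elevation = math.atan2(dz, math.hypot(dx, dy))  # range [-pi/2, pi/2]
    magnitude = math.sqrt(dx * dx + dy * dy + dz * dz)
    return azimuth, elevation, magnitude


def orientation_histogram(points, az_bins=8, el_bins=4):
    """2D histogram over (azimuth, elevation), weighted by magnitude."""
    hist = [[0.0] * el_bins for _ in range(az_bins)]
    for p, q in zip(points, points[1:]):
        az, el, mag = orientation(p, q)
        if mag == 0.0:
            continue  # a static point has no defined orientation
        i = min(int((az + math.pi) / (2 * math.pi) * az_bins), az_bins - 1)
        j = min(int((el + math.pi / 2) / math.pi * el_bins), el_bins - 1)
        hist[i][j] += mag
    return hist
```

A purely radial displacement (along the depth axis) falls into the highest or lowest elevation bin, so lateral and radial motion end up in different histogram cells.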
3D Trajectories are adapted to 3D Localized Trajectories by following the procedure described in Section 4. As before, we propose to enhance the discriminative power of 3D Trajectories by grouping them around 3D body joints. Hence, (3), (4) and (5) are adapted accordingly to incorporate all three dimensions of 3D trajectories and 3D joint trajectories . Then, during feature encoding, every histogram of joint clusters defined in (6) is modified to include the descriptors used in this context, becoming:
5.3 Feature Selection for Codebook Construction
While 3D Trajectories are advantageous in capturing radial motion, they are noticeably noisier than Dense Trajectories, due to the scene flow estimation. As a result, the quality of the codebooks is degraded, adversely affecting the general performance of the proposed approach. To improve it, we propose to select features according to two probabilistic metrics of the classifier, confidence and ambiguity. Confidence is the classifier's ability to quantify the reliability of its predictions, while ambiguity indicates the number of classes the classifier outputs for every prediction. The goal is to encode trajectory features using codebooks constructed by sampling features from a training set , which maximize the classifier confidence and minimize the ambiguity. The confidence and ambiguity metrics are defined as:
is the posterior probability of label given feature .
6 Experimental Evaluation
In this section, we evaluate the proposed approaches on challenging datasets: MSRDailyActivity3D Wang12 , Online RGB-D Action (ORGBD) yu2014discriminative , G3D Gaming Action bloom2016hierarchical , Watch-n-Patch Wu_2015_CVPR and KARD datasets gaglio2015human . First, a brief description of each dataset is given followed by description of the experimental setups. Then, the obtained results are reported and extensively analyzed.
Datasets and Experimental Settings: The first dataset used for the experimental evaluation is MSRDailyActivity3D Wang12 . In this dataset, actors perform daily activities, which in some cases involve human-object interaction. The dataset is captured by the Kinect v1 device and therefore provides RGB, depth and skeleton modalities. A distinctive characteristic of this dataset is that every actor repeats each action twice in both sitting and standing positions. For the experiments, we follow a cross-splitting protocol as in Wang12 , where half of the subjects are used for training and the rest for testing.
The second dataset is called Online RGB-D Action (ORGBD) yu2014discriminative . It can be used for both action recognition and action detection and includes common types of human-object interaction related to the living room environment. Three sets of video sequences are collected using a Kinect sensor. Thus, RGB, depth and skeleton modalities are available. The first set is captured in the context of action recognition in the Same Environment, whereas the second set is acquired for cross-environment action recognition and the third for on-line action detection. The splitting protocol requires two fold cross-validation for the same-environment scenario, whereas, for cross-environment action recognition, training and testing sets should include different environments yu2014discriminative .
One challenging dataset used for the evaluation is the G3D Gaming Action Dataset bloom2016hierarchical . This Kinect-acquired dataset can be used for both action recognition and temporal action detection. It consists of subjects performing gaming actions, which are grouped into gaming scenarios: fighting, playing golf, playing tennis, bowling, first person shooter, driving a car and miscellaneous. The first actors are used for training and the rest are used for testing bloom2016hierarchical .
The Watch-n-Patch Wu_2015_CVPR dataset, which was introduced by Cornell University, is also utilized. This dataset includes types of actions ( in an office and in a kitchen) which involve interactions with types of objects. subjects perform - actions in each of the videos. The dataset was recorded using a Kinect v2 camera. This dataset distinguishes itself by a high intra-class variability, since the subjects perform different combinations of actions and order them differently each time. For the experiments, we use the splitting protocol proposed in Wu_2015_CVPR , where, for every environment, almost half of the videos are used for training and the rest for testing.
The last dataset used for evaluation is called Kinect Activity Recognition Dataset (KARD) gaglio2015human . It contains action classes which are performed by subjects ( males and female) where half of them are used for training and the other half for testing, as proposed in gaglio2015human . The dataset was captured by a Kinect device and consequently contains the three RGB-D modalities: RGB images, depth maps and 3D skeletons.
Implementation Details: For extracting Dense Trajectories and features from videos, we use the implementation provided by the authors of wang:2011:inria-00583818:1 (available at https://lear.inrialpes.fr/people/wang/dense_trajectories). The trajectory temporal length is fixed to frames. The features are computed on a spatio-temporal volume of aligned on the trajectory, as suggested in wang:2011:inria-00583818:1 . This volume is further divided into cells, where the histograms of the descriptors are computed. In the case of 3D trajectories, we use the same parameters for the spatio-temporal volume. The number of histogram bins for the 2D trajectories is set to for HOG and MBH descriptors and for HOF descriptor, whereas for 3D trajectories case we use -bin histograms for every descriptor. The distance threshold for each trajectory is set to . Moreover, a linear SVM is employed for classification.
For each one of the aforementioned datasets, we report the obtained recognition accuracy using the proposed Localized Trajectories and compare it to the classical Dense Trajectories and recent state-of-the-art approaches. In the following, we denote the original dense trajectory approach wang:2011:inria-00583818:1 by Dense Trajectories. We refer to the 2D proposed approach as 2D Localized Trajectories. Similarly, the proposed 3D extension of the classical and the local Dense Trajectories are respectively called 3D Dense Trajectories and 3D Localized Trajectories.
The number of skeleton joints defines the number of clusters. Accordingly, in MSRDailyActivity3D, ORGBD and G3D datasets, the skeletons are composed of joints, while in Watch-n-Patch and KARD datasets, they are respectively formed by and joints. We also empirically choose trajectories per video in order to construct the codebooks, and words per cluster and per descriptor for every dataset.
Table 1: Recognition accuracy on the MSRDailyActivity3D dataset.

| Method | Accuracy |
|---|---|
| Dynamic Temporal Warping muller2006motion | 54.0% |
| Local HON4D oreifej2013hon4d | 80.0% |
| Moving Pose zanfir2013moving | 73.8% |
| 3D Trajectories Koperski14 | 72.0% |
| Skeleton only Wang12 | 68.0% |
| Skeleton & LoP Wang12 | 85.8% |
| Dense Trajectories wang:2011:inria-00583818:1 | 64.4% |
| 3D Dense Trajectories | 48.8% |
| 2D Localized Trajectories | 74.4% |
| 3D Localized Trajectories | 76.3% |

Table 2: Recognition accuracy on the G3D dataset.

| Method | Accuracy |
|---|---|
| Dynamic Time Warping leightley2014exemplar | 86.3% |
| Weighted Graph Matching xiao2015motion | 89.2% |
| Adaptive Graph Kernels li20163d | 84.8% |
| LPP & BoW fotiadou2014activity | 87.5% |
| Spatial Graph Kernels kishore2018spatial | 95.7% |
| Dense Trajectories wang:2011:inria-00583818:1 | 80.1% |
| 2D Localized Trajectories | 87.8% |
6.1 Performance of 2D Localized Dense Trajectories
In this subsection, an analysis of the obtained results is provided. First, we compare the performance of our approach against Dense Trajectories and other state-of-the-art methods. Later, we discuss some of the limitations of 2D Localized Trajectories.
6.1.1 2D Localized Dense Trajectories vs Dense Trajectories
Since the aim of this work is to improve the discriminative power of classical Dense Trajectories, we start by comparing the proposed 2D Localized Trajectories with them. On all five benchmarks, 2D Localized Trajectories are more accurate. As reported in Table 1, Table 2, Table 3, Table 4 and Table 5, they improve the accuracy over the classical Dense Trajectories wang:2011:inria-00583818:1 by 10.0 percentage points on MSRDailyActivity3D, 7.7 on G3D, 3.1 on ORGBD (same-environment setting), 16.0 on ORGBD (cross-environment setting), 2.3 and 25.3 on Watch-n-Patch (office and kitchen environments, respectively), and 0.4 on KARD.
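The gains of 2D Localized Trajectories over Dense Trajectories can be recomputed directly from the accuracies listed in Tables 1 to 5. The short Python snippet below does this arithmetic; the dictionary keys are labels chosen here for readability, and the values are copied from the tables.

```python
# (Dense Trajectories, 2D Localized Trajectories) accuracies in percent,
# copied from Tables 1 to 5.
accuracies = {
    "MSRDailyActivity3D": (64.4, 74.4),
    "G3D": (80.1, 87.8),
    "ORGBD same-environment": (64.3, 67.4),
    "ORGBD cross-environment": (43.8, 59.8),
    "Watch-n-Patch office": (68.8, 71.1),
    "Watch-n-Patch kitchen": (56.2, 81.5),
    "KARD": (97.8, 98.2),
}

# Absolute gain in percentage points, rounded to one decimal place.
gains = {
    name: round(localized - dense, 1)
    for name, (dense, localized) in accuracies.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain} percentage points")
```

The largest gain appears in the Watch-n-Patch kitchen environment and the smallest on KARD, where Dense Trajectories already reach a high accuracy.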
The reported results reflect the ability of 2D Localized Trajectories to distinguish actions with similar motion patterns that are performed by different body parts. This is shown in various cases when comparing confusion matrices obtained for 2D Localized Trajectories and Dense Trajectories. For instance, in the confusion matrices of G3D dataset in Fig. 4, 2D Localized Trajectories boost the performance of the following action pairs: Punch Right-Punch Left and Kick Right-Kick Left. Also, in the same dataset, the recognition accuracy of both Tennis Swing Backhand and Throwing Bowling Ball activities which include similar motion shapes is improved by and , respectively. Furthermore, the accuracy of Drinking and Reading Book classes in ORGBD dataset is increased by and , respectively (see Fig. 5).
Another example of this enhancement can be the pair of actions Defend and Aim & Fire Gun in G3D dataset. The motion shapes of both action classes are similar, since both of them include arm raising. Nevertheless, the first is performed using both arms and the second by using only one arm. As we can see in Fig. 4, the performance obtained for the action Defend is improved by and the confusion with the action Aim & Fire Gun is reduced by . In addition, in the same dataset, actions Wave and Clap have similar lateral motion and using the classical Dense Trajectories made their distinction challenging. However, with the use of 2D Localized Trajectories, motion trajectories were assigned to only one hand cluster in Wave action and to both hands in Clap action, reducing the confusion between these classes. This results in an accuracy boost of in Wave class, as it is shown in Fig. 4.
Moreover, in scenarios with full-body motion, such as the kitchen environment in Watch-n-Patch dataset, 2D Localized Trajectories outperform the Dense Trajectories approach as shown in Fig. 6. Clusters isolate specific motion of body parts, therefore motion patterns related to the action can be identified more effectively.
6.1.2 Comparison with 3D-Based State-of-the-Art Approaches
Our 2D Localized Trajectories approach has shown competitive performance compared to 3D-based state-of-the-art approaches. In ORGBD dataset, we achieve the second best performance in Same Environment setting (Table 3). We match the state-of-the-art results of Wang12 in Cross Environment settings and, at the same time, increase the mean accuracy by 16.0 percentage points over the Dense Trajectories.
In Watch-n-Patch dataset, the 2D Localized Trajectories improved the performance of the Dense Trajectories by 2.3 percentage points in the office environment and by 25.3 percentage points in the kitchen environment, as reported in Table 4. The discriminative power of our approach boosts the performance of every action class, especially in the kitchen environment, as can be observed in Fig. 6. On this dataset, we compare our work only with Dense Trajectories. To the best of our knowledge, there is no work in the literature reporting offline action recognition accuracy on it, since this dataset was initially acquired for action detection.
In the case of the G3D dataset, the results in Table 2 indicate that the 2D Localized Trajectories approach performs adequately compared to state-of-the-art 3D approaches, despite the fact that the dataset includes a significant amount of radial motion. Our method was the third best performing, without utilizing depth or 3D skeleton modalities.
In KARD dataset, our approach based on the 2D Localized Trajectories outperforms almost all state-of-the-art approaches, with a score of 98.2% (Table 5). The only exception is JTMI & LBP & FLD ahmed2016joint , which reaches a slightly higher score, with a difference of only 0.3 percentage points.
The 2D Localized Trajectories approach offers the second largest improvement on MSRDailyActivity3D dataset, by 10.0 percentage points compared to Dense Trajectories, as reported in Table 1. Apart from that, its performance was slightly inferior to that of other state-of-the-art approaches, since it came third in average accuracy, behind Local HON4D oreifej2013hon4d and Skeleton & LoP Wang12 .
6.1.3 Limitations of 2D Localized Dense Trajectories
Despite its strong performance, the 2D Localized Trajectories action representation suffers from two limitations. First, the approach performs poorly when the amount of motion is small. This behaviour is inherited from the Dense Trajectories approach and is clearly visible in action classes such as Call Cellphone in both MSRDailyActivity3D and ORGBD, as shown in Fig. 3 and Fig. 5, respectively, and Write on a Paper in MSRDailyActivity3D. Nonetheless, the Sit Still class achieves adequate performance with 2D Localized Trajectories, even though it is an action class with almost no motion.
Second, the 2D Localized Trajectories approach does not capture radial motion sufficiently. Action classes such as Playing the guitar in the MSRDailyActivity3D dataset include a notable amount of radial motion, and the accuracy is consequently low, as shown in Fig. 3a and Fig. 3b. For that reason, as mentioned earlier, the proposed 3D Localized Trajectories are a good alternative for addressing these two issues. The performance of the 3D Localized Trajectories is reported in the next section.
Table 3: Mean accuracy on the ORGBD dataset.
|Method||Same Env.||Cross Env.|
|Moving Pose zanfir2013moving||38.4%||28.5%|
|DSTIP & DCSF xia2013spatio||61.7%||21.5%|
|Skeleton & LoP Wang12||66.0%||59.8%|
|Pairwise joint distance yu2014discriminative||63.3%||–|
|Dense Trajectories wang:2011:inria-00583818:1||64.3%||43.8%|
|2D Localized Trajectories||67.4%||59.8%|
Table 4: Accuracy on the Watch-n-Patch dataset.
|Method||Accuracy|
|Dense Trajectories (office) wang:2011:inria-00583818:1||68.8%|
|Dense Trajectories (kitchen) wang:2011:inria-00583818:1||56.2%|
|2D Localized Trajectories (office)||71.1%|
|2D Localized Trajectories (kitchen)||81.5%|
Accuracy on the KARD dataset.
|Method||Accuracy|
|JTMI & LBP & FLD ahmed2016joint||98.5%|
|JTMI & Gabor features tian2002evaluation||96.0%|
|Dense Trajectories wang:2011:inria-00583818:1||97.8%|
|2D Localized Trajectories||98.2%|
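The gains over Dense Trajectories implied by the ORGBD, Watch-n-Patch and KARD tables above can be checked with a few lines of arithmetic. The snippet below only transcribes the reported accuracies; the variable names are ours, and the differences are expressed in absolute percentage points.

```python
# Accuracies (%) transcribed from the tables above, as
# (Dense Trajectories, 2D Localized Trajectories) pairs.
reported = {
    "ORGBD, same env.": (64.3, 67.4),
    "ORGBD, cross env.": (43.8, 59.8),
    "Watch-n-Patch, office": (68.8, 71.1),
    "Watch-n-Patch, kitchen": (56.2, 81.5),
    "KARD": (97.8, 98.2),
}

# Absolute gain in percentage points, rounded to one decimal
# to hide floating-point noise.
gains = {
    name: round(localized - dense, 1)
    for name, (dense, localized) in reported.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain} percentage points")
```

The largest gains appear where Dense Trajectories are weakest (ORGBD cross environment and the Watch-n-Patch kitchen), while the gain on KARD is small because the baseline is already close to saturation.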
6.2 Performance of 3D Localized Trajectories
The proposed 3D Localized Trajectories approach was evaluated on the MSRDailyActivity3D dataset. The results reported in Fig. 1 show that it improves on the accuracy of both Dense Trajectories and 2D Localized Trajectories.
The improvement stems mainly from the inclusion of depth information in the 3D trajectories, which helps in distinguishing actions that are performed radially with respect to the camera. This is particularly visible in the confusion matrix of the MSRDailyActivity3D dataset in Fig. 3, where actions such as play game and play guitar are discriminated more effectively using 3D information. For both actions, the accuracy obtained with 3D Localized Trajectories is significantly higher than with Dense Trajectories or 2D Localized Trajectories.
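Why radial motion is hard to see in the image plane can be illustrated with a small numerical sketch. It assumes an ideal pinhole camera; the focal length, the point coordinates and the function name `project` are hypothetical values chosen for illustration and are not taken from the experiments.

```python
import numpy as np

def project(point, focal=525.0):
    """Pinhole projection of a 3D point (X, Y, Z), in metres, to pixels.

    `focal` is a hypothetical focal length in pixels.
    """
    x, y, z = point
    return np.array([focal * x / z, focal * y / z])

# A point 2 m in front of the camera, slightly off the optical axis.
start = np.array([0.1, 0.0, 2.0])

# Two displacements of identical 3D length (0.2 m):
lateral = start + np.array([0.2, 0.0, 0.0])  # parallel to the image plane
radial = start + np.array([0.0, 0.0, 0.2])   # along the camera axis

# Length of the resulting 2D displacement in the image, in pixels.
flow_lateral = np.linalg.norm(project(lateral) - project(start))
flow_radial = np.linalg.norm(project(radial) - project(start))

print(f"lateral motion: {flow_lateral:.2f} px")
print(f"radial motion:  {flow_radial:.2f} px")
```

Under these assumptions the lateral displacement spans tens of pixels while the radial one spans only a few, although both cover the same distance in 3D. A trajectory built from 2D displacements alone therefore carries little information about the radial motion, whereas a 3D trajectory retains the change in depth.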
These promising results highlight the potential of our first attempt to generalize Dense Trajectories to 3D and open up new perspectives. Indeed, many components of this 3D concept can be reinforced to increase its effectiveness. For example, 3D trajectories are slightly noisier than Dense Trajectories, mainly because depth sensors introduce additional noise. This noise causes a significant number of background points to appear to move radially, creating many irrelevant 3D trajectories. Most importantly, the scene flow estimation is not optimal, since it relies on two different modalities that are often misaligned. This is reflected in the performance of the 3D Trajectories (without locality), whose accuracy is notably lower than that of Dense Trajectories, as shown in Table 1. Nevertheless, clustering trajectories around body joints still removes a significant amount of noisy and irrelevant trajectories in the 3D Localized Trajectories case.
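The filtering effect of clustering trajectories around body joints can be sketched as follows. This is a simplified illustration under our own assumptions, not the exact procedure used in the experiments: each trajectory is assigned to the joint with the smallest mean distance over time, and trajectories farther than a hypothetical `radius` from every joint are discarded as background. The function name and the toy data are illustrative.

```python
import numpy as np

def localize_trajectories(trajs, joints, radius):
    """Assign trajectories to their nearest joint and flag distant ones.

    trajs:  (N, T, D) array, N trajectories of T points in D dimensions.
    joints: (T, J, D) array, J joint positions for the same T frames.
    radius: hypothetical distance threshold (same unit as the points).
    Returns (labels, keep): nearest-joint index and a boolean mask.
    """
    # Distance between every trajectory point and every joint
    # at the same frame: shape (N, T, J).
    diff = trajs[:, :, None, :] - joints[None, :, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    mean_dist = dist.mean(axis=1)           # (N, J), averaged over time
    labels = mean_dist.argmin(axis=1)       # nearest joint per trajectory
    keep = mean_dist.min(axis=1) <= radius  # drop far-away trajectories
    return labels, keep

# Toy example: two static joints observed over three frames.
joints = np.tile(np.array([[0.0, 0.0], [10.0, 0.0]]), (3, 1, 1))
trajs = np.stack([
    np.tile([1.0, 0.0], (3, 1)),    # close to joint 0
    np.tile([9.0, 1.0], (3, 1)),    # close to joint 1
    np.tile([50.0, 50.0], (3, 1)),  # background point
])

labels, keep = localize_trajectories(trajs, joints, radius=3.0)
print(labels.tolist(), keep.tolist())
```

Because the distance is computed in whatever space the points live in, the same sketch applies to 2D image coordinates and to 3D points alike.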
6.3 Global BoW vs. Local BoW
To experimentally motivate the use of local BoWs, we compare the results obtained for 2D Localized Trajectories using a global BoW and local BoWs. The experiments are conducted on the cross-environment scenario of the ORGBD dataset. With a global BoW, the mean accuracy is notably lower than that of the 2D Localized Trajectories approach with local BoWs. The results suggest that trajectory clustering combined with local BoWs contributes significantly to the local discriminative power of the overall approach. They also suggest that the local encoding is more effective, since the codebooks are constructed from features that are specific to the motion of each body part.
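The difference between the two encodings can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the implementation behind the reported results: the toy k-means, the function names, the codebook size `k`, the number of joints and the random descriptors are all hypothetical. A global BoW quantizes every descriptor against a single codebook, whereas local BoWs use one codebook per body joint and concatenate the per-joint histograms.

```python
import numpy as np

def nearest(X, centroids):
    """Index of the closest centroid for every row of X."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(X, k, iters=20, seed=0):
    """Toy k-means returning k centroids (illustrative, not optimized)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = nearest(X, centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def histogram(X, centroids):
    """L1-normalized histogram of codeword assignments."""
    k = len(centroids)
    if len(X) == 0:
        return np.zeros(k)
    counts = np.bincount(nearest(X, centroids), minlength=k).astype(float)
    return counts / counts.sum()

def encode_global(desc, codebook):
    """One histogram over all descriptors of a video."""
    return histogram(desc, codebook)

def encode_local(desc, joint_ids, codebooks):
    """One histogram per joint, concatenated into a single vector."""
    return np.concatenate([
        histogram(desc[joint_ids == j], cb) for j, cb in enumerate(codebooks)
    ])

# Toy data: random descriptors, each tagged with the joint it belongs to.
rng = np.random.default_rng(1)
n_joints, k, dim = 3, 4, 8
train_desc = rng.normal(size=(300, dim))
train_joint = np.arange(300) % n_joints
video_desc = rng.normal(size=(60, dim))
video_joint = np.arange(60) % n_joints

global_codebook = kmeans(train_desc, k)
local_codebooks = [
    kmeans(train_desc[train_joint == j], k) for j in range(n_joints)
]

h_global = encode_global(video_desc, global_codebook)
h_local = encode_local(video_desc, video_joint, local_codebooks)
print(h_global.shape, h_local.shape)
```

The local representation is `n_joints` times longer than the global one, and each of its blocks is built only from descriptors attached to one joint, which is what makes the codewords specific to the motion of a body part.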
In this paper, we proposed to address two major shortcomings of the original Dense Trajectories approach using additional modalities provided by RGB-D cameras: the lack of locality information and the ineffectiveness in describing radial motion. Our contribution is two-fold. First, we enhanced the discriminative power and locality-awareness of Dense Trajectories by clustering them around human body joints. This method is coupled with the local Bag-of-Words concept, which further strengthens the framework. Second, we constructed 3D Localized Trajectories for action recognition. For this purpose, we used a) scene flow instead of optical flow for the generation of the 3D Trajectories and b) a 4D extension of the originally used spatio-temporal descriptors. The reported results show the robustness of the two proposed representations on various challenging datasets. As future work, we intend to develop an automatic way of choosing the optimal parameters. In addition, we intend to estimate 3D Trajectories that are more reliable and more robust to noise directly from point cloud data, in order to enhance our current approach and extend it to view-invariant action recognition.
This work was funded by the European Union’s Horizon 2020 research and innovation project STARR under grant agreement No. 689947, and by the National Research Fund (FNR), Luxembourg, under the project C15/IS/10415355/3DACT/Björn Ottersten. Moreover, the experiments presented in this paper were carried out using the HPC facilities of the University of Luxembourg VBCG_HPCS14 (see https://hpc.uni.lu).
- (1) R. Baptista, M. Antunes, D. Aouada, B. Ottersten, Anticipating suspicious actions using a small dataset of action templates, in: 13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISAPP), 2018.
- (2) R. Baptista, M. Antunes, A. E. R. Shabayek, D. Aouada, B. Ottersten, Flexible feedback system for posture monitoring and correction, in: Image Information Processing (ICIIP), 2017 Fourth International Conference on, IEEE, 2017, pp. 1–6.
- (3) R. Baptista, M. Goncalves Almeida Antunes, D. Aouada, B. Ottersten, Video-based feedback for assisting physical activity, in: 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISAPP), 2017.
- (4) Y. Song, D. Demirdjian, R. Davis, Continuous body and hand gesture recognition for natural human-computer interaction, ACM Transactions on Interactive Intelligent Systems (TiiS) 2 (1) (2012) 5.
- (5) D. Weinland, R. Ronfard, E. Boyer, Free viewpoint action recognition using motion history volumes, Computer Vision and Image Understanding 104 (2-3) (2006) 249–257.
- (6) A. F. Bobick, J. W. Davis, The recognition of human movement using temporal templates, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (3) (2001) 257–267. doi:10.1109/34.910878.
- (7) H. Wang, A. Kläser, C. Schmid, C.-L. Liu, Action Recognition by Dense Trajectories, in: IEEE Conference on Computer Vision & Pattern Recognition, Colorado Springs, United States, 2011, pp. 3169–3176.
- (8) F.-F. Li, P. Perona, A bayesian hierarchical model for learning natural scene categories, in: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Volume 2, CVPR ’05, IEEE Computer Society, Washington, DC, USA, 2005, pp. 524–531.
- (9) M. Koperski, P. Bilinski, F. Bremond, 3D Trajectories for Action Recognition, in: IEEE International Conference on Image Processing, Paris, France, 2014.
- (10) S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2006, pp. 2169–2178.
- (11) J. Wang, Z. Liu, Y. Wu, J. Yuan, Mining Actionlet Ensemble for Action Recognition with Depth Cameras, in: IEEE Conference on Computer Vision & Pattern Recognition, Providence, Rhode Island, United States, 2012.
- (12) K. Papadopoulos, M. Antunes, D. Aouada, B. Ottersten, Enhanced trajectory-based action recognition using human pose, in: Image Processing (ICIP), 2017 IEEE International Conference on, IEEE, 2017, pp. 1807–1811.
- (13) H. Wang, C. Schmid, Action recognition with improved trajectories, in: Proceedings of the IEEE international conference on computer vision, 2013, pp. 3551–3558.
- (14) L. Wang, Y. Qiao, X. Tang, Action recognition with trajectory-pooled deep-convolutional descriptors, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4305–4314.
- (15) Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, C.-W. Ngo, Trajectory-based modeling of human actions with motion reference points, in: European Conference on Computer Vision, Springer, 2012, pp. 425–438.
- (16) B. Ni, P. Moulin, X. Yang, S. Yan, Motion part regularization: Improving action recognition via trajectory selection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3698–3706.
- (17) N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, Vol. 1, IEEE, 2005, pp. 886–893.
- (18) R. Chaudhry, A. Ravichandran, G. D. Hager, R. Vidal, Histograms of oriented optical flow and binet-cauchy kernels on nonlinear dynamical systems for the recognition of human actions, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1932–1939.
- (19) H. Jhuang, J. Gall, S. Zuffi, C. Schmid, M. J. Black, Towards understanding action recognition, in: Proceedings of the IEEE international conference on computer vision, 2013, pp. 3192–3199.
- (20) F. Zhu, L. Shao, J. Xie, Y. Fang, From handcrafted to learned representations for human action recognition: A survey, Image and Vision Computing 55 (2016) 42–52.
- (21) Q. Ke, M. Bennamoun, S. An, F. Sohel, F. Boussaid, A new representation of skeleton sequences for 3d action recognition, in: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, IEEE, 2017, pp. 4570–4579.
- (22) S. Song, C. Lan, J. Xing, W. Zeng, J. Liu, An end-to-end spatio-temporal attention model for human action recognition from skeleton data, in: AAAI, Vol. 1, 2017, pp. 4263–4270.
- (23) Q. Ke, S. An, M. Bennamoun, F. Sohel, F. Boussaid, Skeletonnet: Mining deep part features for 3-D action recognition, Signal Processing Letters, IEEE 24 (6) (2017) 731–735.
- (24) O. Oreifej, Z. Liu, Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 716–723.
- (25) J. Wang, Z. Liu, J. Chorowski, Z. Chen, Y. Wu, Robust 3d action recognition with random occupancy patterns, in: Computer vision–ECCV 2012, Springer, 2012, pp. 872–885.
- (26) L. Xia, J. Aggarwal, Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera, in: Conference on Computer Vision and Pattern Recognition, IEEE, 2013, pp. 2834–2841.
- (27) A. Klaser, M. Marszałek, C. Schmid, A spatio-temporal descriptor based on 3D-gradients, in: British Machine Vision Conference, 2008, pp. 275–1.
- (28) E. Ohn-Bar, M. M. Trivedi, Joint angles similarities and HOG2 for action recognition, in: Conference on Computer Vision and Pattern Recognition Workshops, IEEE, 2013, pp. 465–470.
- (29) P. Foggia, G. Percannella, A. Saggese, M. Vento, Recognizing human actions by a bag of visual words, in: International Conference on Systems, Man, and Cybernetics, IEEE, 2013, pp. 2910–2915.
- (30) P. Shukla, K. K. Biswas, P. K. Kalra, Action recognition using temporal bag-of-words from depth maps, in: International Conference on Machine Vision Applications, IEEE, 2013, pp. 41–44.
- (31) X. Yang, Y. Tian, Super normal vector for human activity recognition with depth cameras, IEEE transactions on pattern analysis and machine intelligence 39 (5) (2017) 1028–1039.
- (32) R. Slama, H. Wannous, M. Daoudi, Grassmannian representation of motion depth for 3D human gesture and action recognition, in: International Conference on Pattern Recognition, IEEE, 2014, pp. 3499–3504.
- (33) H. Rahmani, A. Mahmood, D. Huynh, A. Mian, Histogram of oriented principal components for cross-view action recognition, Transactions on Pattern Analysis and Machine Intelligence, IEEE 38 (12) (2016) 2430–2443.
- (34) M. Zanfir, M. Leordeanu, C. Sminchisescu, The moving pose: An efficient 3D kinematics descriptor for low-latency action recognition and detection, in: International Conference on Computer Vision, IEEE, 2013, pp. 2752–2759.
- (35) X. Yang, Y. L. Tian, Eigenjoints-based action recognition using naive-bayes-nearest-neighbor, in: Computer vision and pattern recognition workshops (CVPRW), 2012 IEEE computer society conference on, IEEE, 2012, pp. 14–19.
- (36) R. Vemulapalli, F. Arrate, R. Chellappa, Human action recognition by representing 3D skeletons as points in a Lie group, in: Conference on Computer Vision and Pattern Recognition, IEEE, 2014, pp. 588–595.
- (37) M. Devanne, H. Wannous, S. Berretti, P. Pala, M. Daoudi, A. Del Bimbo, 3-d human action recognition by shape analysis of motion trajectories on riemannian manifold, IEEE transactions on cybernetics 45 (7) (2015) 1340–1352.
- (38) B. B. Amor, J. Su, A. Srivastava, Action recognition using rate-invariant analysis of skeletal shape trajectories, IEEE transactions on pattern analysis and machine intelligence 38 (1) (2016) 1–13.
- (39) G. Demisse, K. Papadopoulos, D. Aouada, B. Ottersten, Pose encoding for robust skeleton-based action recognition, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2018.
- (40) E. Ghorbel, R. Boutteau, J. Boonaert, X. Savatier, S. Lecoeuche, Kinematic spline curves: A temporal invariant descriptor for fast action recognition, Image and Vision Computing 77 (2018) 60–71.
- (41) W. Li, Z. Zhang, Z. Liu, Action recognition based on a bag of 3d points, in: Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, IEEE, 2010, pp. 9–14.
- (42) G. Farnebäck, Two-frame motion estimation based on polynomial expansion, in: Scandinavian conference on Image analysis, Springer, 2003, pp. 363–370.
- (43) M. Raptis, I. Kokkinos, S. Soatto, Discovering Discriminative Action Parts from Mid-Level Video Representations, in: IEEE Conference on Computer Vision & Pattern Recognition, 2012.
- (44) M. Jaimez, M. Souiai, J. Gonzalez-Jimenez, D. Cremers, A primal-dual framework for real-time dense rgb-d scene flow, in: Robotics and Automation (ICRA), 2015 IEEE International Conference on, IEEE, 2015, pp. 98–104.
- (45) J. Quiroga, T. Brox, F. Devernay, J. Crowley, Dense semi-rigid scene flow estimation from rgbd images, in: European Conference on Computer Vision, Springer, 2014, pp. 567–582.
- (46) D. Sun, E. B. Sudderth, H. Pfister, Layered rgbd scene flow estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 548–556.
- (47) M. B. Holte, B. Chakraborty, J. Gonzalez, T. B. Moeslund, A local 3-d motion descriptor for multi-view human action recognition from 4-d spatio-temporal interest points, IEEE Journal of Selected Topics in Signal Processing 6 (5) (2012) 553–565.
- (48) G. Yu, Z. Liu, J. Yuan, Discriminative orderlet mining for real-time recognition of human-object interaction, in: Asian Conference on Computer Vision, Springer, 2014, pp. 50–65.
- (49) V. Bloom, V. Argyriou, D. Makris, Hierarchical transfer learning for online recognition of compound actions, Computer Vision and Image Understanding 144 (2016) 62–72.
- (50) C. Wu, J. Zhang, S. Savarese, A. Saxena, Watch-n-patch: Unsupervised understanding of actions and relations, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- (51) S. Gaglio, G. L. Re, M. Morana, Human activity recognition process using 3-d posture data, IEEE Trans. Human-Machine Systems 45 (5) (2015) 586–597.
- (52) M. Müller, T. Röder, Motion templates for automatic classification and retrieval of motion capture data, in: Proceedings of the 2006 ACM SIGGRAPH/Eurographics symposium on Computer animation, Eurographics Association, 2006, pp. 137–146.
- (53) M. Zanfir, M. Leordeanu, C. Sminchisescu, The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection, in: Proceedings of the IEEE international conference on computer vision, 2013, pp. 2752–2759.
- (54) D. Leightley, B. Li, J. S. McPhee, M. H. Yap, J. Darby, Exemplar-based human action recognition with template matching from a stream of motion capture, in: International Conference Image Analysis and Recognition, Springer, 2014, pp. 12–20.
- (55) Q. Xiao, Y. Wang, H. Wang, Motion retrieval using weighted graph matching, Soft Computing 19 (1) (2015) 133–144.
- (56) M. Li, H. Leung, Z. Liu, L. Zhou, 3d human motion retrieval using graph kernels based on adaptive graph construction, Computers & Graphics 54 (2016) 104–112.
- (57) M. Barnachon, S. Bouakaz, B. Boufama, E. Guillou, Ongoing human action recognition with motion capture, Pattern Recognition 47 (1) (2014) 238–247.
- (58) E. Fotiadou, N. Nikolaidis, Activity-based methods for person recognition in motion capture sequences, Pattern Recognition Letters 49 (2014) 48–54.
- (59) P. Kishore, P. S. Kameswari, K. Niharika, M. Tanuja, M. Bindu, D. A. Kumar, E. K. Kumar, M. T. Kiran, Spatial joint features for 3d human skeletal action recognition system using spatial graph kernels, International Journal of Engineering & Technology 7 (1.1) (2018) 489–493.
- (60) F. Ahmed, P. P. Paul, M. L. Gavrilova, Joint-triplet motion image and local binary pattern for 3d action recognition using kinect, in: Proceedings of the 29th International Conference on Computer Animation and Social Agents, ACM, 2016, pp. 111–119.
- (61) L. Xia, J. Aggarwal, Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2834–2841.
- (62) Y.-l. Tian, T. Kanade, J. F. Cohn, Evaluation of gabor-wavelet-based facial action unit recognition in image sequences of increasing complexity, in: Automatic Face and Gesture Recognition, 2002. Proceedings. Fifth IEEE International Conference on, IEEE, 2002, pp. 229–234.
- (63) L. Xia, C.-C. Chen, J. K. Aggarwal, View invariant human action recognition using histograms of 3d joints, in: Computer vision and pattern recognition workshops (CVPRW), 2012 IEEE computer society conference on, IEEE, 2012, pp. 20–27.
- (64) S. Varrette, P. Bouvry, H. Cartiaux, F. Georgatos, Management of an academic hpc cluster: The ul experience, in: Proc. of the 2014 Intl. Conf. on High Performance Computing & Simulation (HPCS 2014), IEEE, Bologna, Italy, 2014, pp. 959–967.