Simultaneous Feature and Body-Part Learning for Real-Time Robot Awareness of Human Behaviors

02/24/2017 ∙ by Fei Han, et al. ∙ Arizona State University mines 0

Robot awareness of human actions is an essential research problem in robotics with many important real-world applications, including human-robot collaboration and teaming. Over the past few years, depth sensors have become a standard device widely used by intelligent robots for 3D perception, which can also offer human skeletal data in 3D space. Several methods based on skeletal data were designed to enable robot awareness of human actions with satisfactory accuracy. However, previous methods treated all body parts and features as equally important, without the capability to identify discriminative body parts and features. In this paper, we propose a novel simultaneous Feature And Body-part Learning (FABL) approach that simultaneously identifies discriminative body parts and features, and efficiently integrates all available information together to enable real-time robot awareness of human behaviors. We formulate FABL as a regression-like optimization problem with structured sparsity-inducing norms to model interrelationships of body parts and features. We also develop an optimization algorithm to solve the formulated problem, which possesses a theoretical guarantee to find the optimal solution. To evaluate FABL, three experiments were performed using public benchmark datasets, including the MSR Action3D and CAD-60 datasets, as well as a Baxter robot in practical assistive living applications. Experimental results show that our FABL approach obtains a high recognition accuracy with a processing speed on the order of $10^4$ Hz, which makes FABL a promising method to enable real-time robot awareness of human behaviors in practical robotics applications.




I Introduction

In a wide variety of human-centered robotics applications, including human-robot teaming, human-robot collaboration, and robot-assisted living, robot awareness of human actions (or behaviors) is essential for intelligent robots to understand humans, make situationally appropriate decisions, and interact with and assist people. However, robot awareness of human behaviors in real-world environments is challenging because of significant variations in human motion, the diversity of human appearance, and vision difficulties such as illumination variations and occlusion. When implemented on robots, additional challenges are encountered, such as uncertainty in movement and dynamic backgrounds. Most importantly, timely robot planning and decision making require real-time performance.

Although human action understanding has been researched in the robotics and computer vision communities, most previous techniques are based on local spatio-temporal visual features [1, 2], which are generally incapable of dealing with the challenges introduced by robotics applications (e.g., real-time performance). With the emergence of affordable structured-light or time-of-flight depth sensing technologies, color-depth cameras have become a standard 3D visual sensing device for modern indoor robots. The skeletal data of humans acquired from such sensors, as shown in Fig. 1(a), makes real-time robot awareness of human behaviors possible [3], and also provides benefits in comparison to local features, including invariance to viewpoint, human body scale, and motion speed [4, 5].

(a) Skeletal data
(b) Discriminative features and body parts
Fig. 1: A motivating example of the FABL approach, which simultaneously learns discriminative skeleton joints and multimodal heterogeneous features to enable real-time robot awareness of human behaviors.

Because of these advantages, skeleton-based action understanding methods have attracted increasing attention, and many skeletal features and representations have been implemented during the last few years (see [4] and references therein), e.g., the joint rotation matrix [6] and BIPOD [5]. However, most existing methods apply only one type of skeletal feature [7, 6], while others simply concatenate several types of skeletal features into a single bigger vector to encode human actions [8, 9]. The problem of autonomously learning the importance of skeletal features and optimally integrating the multimodal features (different human activity representations extracted from skeletal data) has not yet been well addressed for real-time robot awareness of human behaviors. Recently, methods based on body parts (represented as joints in skeletal data) instead of complete skeleton data were studied to improve action recognition accuracy [5, 10, 11]. To remove joints that are irrelevant for specific behaviors, these methods use only a selected subset of skeletal joints. Although these methods obtained promising accuracy, the selection is manual, based on fixed criteria, and is not robust across scenarios. Furthermore, the question of how to integrate multimodal skeletal features into body-part methods has not been well answered.

In this paper, we introduce a novel Feature And Body-part Learning (FABL) method to enable real-time robot awareness of human behaviors, through learning discriminative skeletal features and body parts simultaneously in the same optimization framework. For learning the importance of body parts, our approach is inspired by the insight that typically only a subset of body parts is discriminative for recognizing an action. For example, as demonstrated in Fig. 1(b), only the waving arm and hand are important for the action of “hand waving.” Our FABL method is able to select discriminative body parts automatically for different behaviors. Simultaneously, FABL learns the importance of heterogeneous skeletal features, and integrates multimodal features to build a more discriminative representation to enable robot awareness of human behaviors. Classification is seamlessly integrated in the FABL approach (i.e., no external classifier is required), which further increases processing efficiency, resulting in high-speed performance that is suitable for applications with real-time requirements.

The contributions of this paper are twofold:

  • We propose a novel formulation and the FABL approach to perform simultaneous learning of discriminative body parts and skeletal features for real-time robot awareness of human behaviors.

  • We develop a new optimization algorithm to efficiently solve the formulated robot learning problem, which has a theoretical guarantee to converge to the global optimal solution.

We make the code that implements our FABL approach available at:

The remainder of this paper is structured as follows. Related work is described in Section II. Then, our FABL approach is detailed in Sections III and IV. Experimental results are presented in Section V. After discussing several attributes of the proposed FABL method in Section VI, we conclude this paper in Section VII.

II Related Work

In this section, we conduct a review of techniques to understand human actions using skeletal data, including both complete skeletal data and partial body parts.

II-A Behavior Understanding Based on Skeletal Data

Methods using 3D skeletal data to identify human actions attracted increasing attention after the release of affordable structured-light 3D sensing technology [4]. A widely applied representation for human action understanding is based on skeletal joint displacements. Chen and Koskela [12] implemented a feature extraction method based on the normalized pairwise relative positions of skeletal joints, and actions were classified by multiple extreme learning machines. Wei et al. [13] implemented a hierarchical graph to represent spatio-temporal joint positions and displacements, where the differences in skeletal joint positions between two successive frames were defined as features. Besides joint displacements, many methods based on joint orientations were also implemented. Sung et al. [6] computed the orientation matrix of each joint with respect to the camera, then transformed the matrix to obtain the joint orientation with respect to the human torso, showing that their representation was invariant to the sensor’s location. Another popular category of skeleton-based methods directly uses raw joint position information for human action understanding. Wei et al. [14] developed wavelet features to represent a sequence of 3D skeletal joints, and a concurrent action detection model to understand human behaviors.

Most of the previous skeleton-based methods utilized only one category of skeleton-based features. Several recent studies indicate that recognition accuracy can be improved by combining multiple skeletal features. A feature construction approach was introduced in [7] that concatenates static posture, movement, and offset values into a single bigger feature vector, and utilizes a naive Bayes classifier to perform multi-class action classification. Yu et al. [15] used three categories of skeletal features, including pairwise joint distance, spatial joint coordinate, and temporal variation of joint locations, to construct a mixed representation. A similar skeleton-based representation was implemented in [16], incorporating pairwise joint distances and temporal joint location changes. However, most previous techniques simply concatenated different categories of features without considering the importance of each skeletal feature category. The research problem of how to autonomously learn and fuse heterogeneous skeletal features for real-time robot awareness of human actions has not yet been well studied.

The proposed FABL approach addresses this problem by integrating heterogeneous multimodal skeletal features through learning the importance of each feature category, along with learning discriminative body parts, to accurately interpret human actions.

II-B Representation Based on Body Parts

Skeletal human representations based on body part models have been widely studied in the past few years. Because these mid-level body part models can partially take into account the physical structure of the human body, they can yield improved discrimination power to represent humans [5].

Wang et al. [17] implemented a method that decomposed a body model into five parts, including left/right arms/legs and the torso, each consisting of a set of joints, to represent human behaviors in space and time dimensions. A spatial-temporal And-Or graph model was implemented in [18] to represent humans at three levels including poses, spatiotemporal-parts, and parts. The hierarchical human body structure captures the geometric and appearance variation of humans at each frame. A deep neural network was introduced in [19] to create a body part model and investigate the correlation of body parts, which can automatically obtain mid-level features that are more descriptive than low-level features extracted from individual human skeleton joints. Several methods were also proposed to select more descriptive human body joints [2, 10, 11, 20, 21, 22, 23].

Bio-inspired body part models are also commonly applied to extract mid-level features for skeleton-based representation construction, which are typically based on body kinematics or human anatomy. Chaudhry et al. [24] implemented bio-inspired mid-level features to represent human activities based on 3D skeleton data, by leveraging the findings in the research area of static shape encoding in the primate cortex’s neural pathway. By showing different 3D shapes to primates and measuring their neural responses, the primates’ internal shape representation was estimated, which was then used to extract body parts to create skeleton-based representations. Zhang and Parker [5] proposed a new bio-inspired predictive orientation decomposition representation, which was inspired by biological research in human anatomy. This approach decomposed a body model into five body parts, and projected 3D human skeleton trajectories onto three anatomical planes. Through estimating future skeleton trajectories, this method is able to predict future human motions.

Despite the promising results obtained by the methods based on body parts, which manually partition the body model into several body parts or select a set of skeletal joints according to predefined criteria, previous techniques did not model how the discriminative power of human joints differs; they simply included or excluded certain joints. In this paper, we introduce a new approach to automatically learn discriminative skeletal joints without predefined manual selection criteria.

III The FABL Approach

In this section, we describe our FABL method that simultaneously learns discriminative skeletal features and body parts to enable real-time robot awareness of human behaviors.

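Notation. In this paper, we denote matrices using boldface capital letters, and vectors using boldface lowercase letters. We represent the $\ell_1$-norm of a vector $\mathbf{v} \in \mathbb{R}^{n}$ using $\|\mathbf{v}\|_1 = \sum_{i=1}^{n} |v_i|$, and the $\ell_2$-norm of $\mathbf{v}$ as $\|\mathbf{v}\|_2 = \sqrt{\mathbf{v}^\top \mathbf{v}}$. Given a matrix $\mathbf{M} = \{m_{ij}\} \in \mathbb{R}^{n \times z}$, we refer to its $i$-th row as $\mathbf{m}^i$ and the $j$-th column as $\mathbf{m}_j$. We denote the Frobenius norm of the matrix as $\|\mathbf{M}\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{z} m_{ij}^2}$.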

III-A Problem Formulation

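Given a collection of $n$ data instances, the skeletal matrix is denoted as $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$, where $\mathbf{x}_i \in \mathbb{R}^{d}$ is the vector of all skeletal features for the $i$-th data instance. When heterogeneous skeletal features are used, each vector consists of $m$ modalities such that $d = \sum_{q=1}^{m} d_q$. Within each modality, the skeletal features are further divided into $p$ partitions, and each partition contains features from a skeleton joint. Then, we formulate robot awareness of human behaviors as a problem of dividing $\mathbf{X}$ into $c$ behavior categories through exploiting all available information from heterogeneous feature modalities and skeleton joints, using a regression-like classification objective as follows:

$$\min_{\mathbf{W}, \mathbf{b}} \; \left\| \mathbf{X}^\top \mathbf{W} + \mathbf{1}_n \mathbf{b}^\top - \mathbf{Y} \right\|_F^2 \tag{1}$$

where $\mathbf{1}_n \in \mathbb{R}^{n}$ is the constant vector of all $1$’s, $\mathbf{b} \in \mathbb{R}^{c}$ is the intercept vector, $\mathbf{Y} = [\mathbf{y}^1, \dots, \mathbf{y}^n]^\top \in \mathbb{R}^{n \times c}$ denotes the behavior category indicator matrix, and $\mathbf{y}^i \in \mathbb{R}^{c}$ denotes the category indicator vector for the feature vector $\mathbf{x}_i$, with $y_{ij}$ indicating how likely $\mathbf{x}_i$ belongs to the $j$-th category. The label matrix $\mathbf{Y}$ of the data instances is given in the training phase. Then, the value of $\mathbf{b}$ in Eq. (1) can be calculated by $\mathbf{b} = \frac{1}{n} \left( \mathbf{Y}^\top \mathbf{1}_n - \mathbf{W}^\top \mathbf{X} \mathbf{1}_n \right)$.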
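As a small illustration of the regression-like objective and its closed-form intercept, the following sketch evaluates both on toy data. It is not the authors' implementation: the function names `intercept` and `objective`, the nested-list data layout, and the toy values are assumptions made for this example, and the intercept formula is the one obtained by setting the derivative of the squared loss with respect to the intercept to zero.

```python
# Toy sketch (hypothetical names and data): regression-like loss with intercept.
# X is d x n (features by instances), W is d x c, Y is n x c.

def intercept(X, W, Y):
    """Closed-form intercept: b = (Y^T 1 - W^T X 1) / n."""
    d, n, c = len(X), len(X[0]), len(W[0])
    b = []
    for k in range(c):
        label_sum = sum(Y[i][k] for i in range(n))
        pred_sum = sum(W[f][k] * X[f][i] for f in range(d) for i in range(n))
        b.append((label_sum - pred_sum) / n)
    return b

def objective(X, W, b, Y):
    """Squared Frobenius norm of X^T W + 1 b^T - Y."""
    d, n, c = len(X), len(X[0]), len(W[0])
    total = 0.0
    for i in range(n):
        for k in range(c):
            score = sum(X[f][i] * W[f][k] for f in range(d)) + b[k]
            total += (score - Y[i][k]) ** 2
    return total

# Two features, four instances, two behavior categories (one-hot labels).
X = [[0.0, 1.0, 2.0, 3.0],
     [1.0, 0.0, 1.0, 0.0]]
Y = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
W = [[-0.4, 0.4],
     [0.1, -0.1]]

b = intercept(X, W, Y)
loss = objective(X, W, b, Y)
```

For a fixed weight matrix, any other intercept gives a larger loss, which is why the intercept can be eliminated from the optimization.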

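The solution of the optimization problem in Eq. (1) is the parameter matrix $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_c] \in \mathbb{R}^{d \times c}$, whose column $\mathbf{w}_i$ contains the weights of each feature modality and skeletal joint with respect to the $i$-th behavior category. The parameter matrix is denoted as:

$$\mathbf{W} = \begin{bmatrix} \mathbf{w}_1^1 & \cdots & \mathbf{w}_c^1 \\ \vdots & \ddots & \vdots \\ \mathbf{w}_1^m & \cdots & \mathbf{w}_c^m \end{bmatrix} \in \mathbb{R}^{d \times c} \tag{2}$$

where $\mathbf{w}_i^q \in \mathbb{R}^{d_q}$ indicates the weights of the $q$-th modality including all skeleton joints with respect to the $i$-th behavior category, which is denoted as $\mathbf{w}_i^q = [(\mathbf{w}_i^{q1})^\top, \dots, (\mathbf{w}_i^{qp})^\top]^\top$, and $\mathbf{w}_i^{qj} \in \mathbb{R}^{d_{qj}}$ represents the weights of the $j$-th skeleton joint within the $q$-th modality with respect to the $i$-th human behavior category, where $d_{qj}$ is the dimension of features that are obtained from the $j$-th skeleton joint in the $q$-th modality, satisfying $d_q = \sum_{j=1}^{p} d_{qj}$, and $p$ is the number of skeleton joints in each modality. An illustration of the weight matrix is presented in Fig. 2.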

Fig. 2: Illustration of the structured sparsity-inducing norms introduced in our FABL method. Given the parameter matrix $\mathbf{W}$, we arrange each column vector $\mathbf{w}_i$ of the $i$-th action category into a matrix, where rows represent modalities and columns denote skeletal joints. We model the interrelationships of the feature modalities using the $M$-norm regularization term, and the interrelationships of the skeletal joints using the $B$-norm regularization term to identify the representative joints.

III-B Learning of Discriminative Body Parts

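For specific behaviors, a small set of body parts (represented as joints in human skeletal data) is more discriminative than others. For example, in the behavior of hand waving as depicted in Fig. 1(b), the forearm and hand joints are more discriminative. Such discriminative human skeletal joints are typically not shared by all behavior categories (e.g., the joints to recognize waving and kicking are substantially different). To learn discriminative body parts, we introduce a new joint-based group $\ell_1$-norm (named $B$-norm) as a regularizer of the problem in Eq. (1). The $B$-norm is mathematically defined as $\|\mathbf{W}\|_B = \sum_{i=1}^{c} \sum_{j=1}^{p} \|\bar{\mathbf{w}}_i^j\|_2$, where $\bar{\mathbf{w}}_i^j$ denotes the weights of the $j$-th human skeletal joint with respect to the $i$-th behavior category for all feature modalities, which is expressed as $\bar{\mathbf{w}}_i^j = [(\mathbf{w}_i^{1j})^\top, \dots, (\mathbf{w}_i^{mj})^\top]^\top$, and $\bar{\mathbf{w}}_i^j \in \mathbb{R}^{\sum_{q=1}^{m} d_{qj}}$. Then, we can rewrite the objective function as:

$$\min_{\mathbf{W}, \mathbf{b}} \; \left\| \mathbf{X}^\top \mathbf{W} + \mathbf{1}_n \mathbf{b}^\top - \mathbf{Y} \right\|_F^2 + \gamma_2 \|\mathbf{W}\|_B \tag{3}$$

where $\gamma_2$ is a trade-off hyperparameter.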

The $B$-norm applies the $\ell_2$-norm within each skeletal joint and the $\ell_1$-norm between the joints, which enforces sparsity among different joints. For example, if the skeletal features obtained from a human skeleton joint are not discriminative for a specific behavior category, the objective in Eq. (3) will assign zeros (in the ideal case; in practice, usually very small values) to them for this behavior category; otherwise, their weights have large values. As shown in Fig. 2, the $B$-norm regularization term captures the interrelationship among body parts, and estimates the importance of each body part to identify certain human behaviors.

III-C Learning of Multimodal Skeletal Features

When heterogeneous multimodal features are available, it is well accepted that different types of skeletal features show varying performance on recognizing different behaviors [4]. That is, the features from a specific modality can be more or less discriminative for recognizing specific human behaviors. For example, compared to pose features, motion features are generally less helpful to identify a still human behavior such as sitting. To integrate multiple feature modalities and model their interrelationships, we introduce another group $\ell_1$-norm ($M$-norm) as a new regularizer in Eq. (3), which is defined as $\|\mathbf{W}\|_M = \sum_{i=1}^{c} \sum_{q=1}^{m} \|\mathbf{w}_i^q\|_2$. Then, incorporating both multi-feature and multi-joint group sparsity-inducing norms, the final objective function becomes:

$$\min_{\mathbf{W}, \mathbf{b}} \; \left\| \mathbf{X}^\top \mathbf{W} + \mathbf{1}_n \mathbf{b}^\top - \mathbf{Y} \right\|_F^2 + \gamma_1 \|\mathbf{W}\|_M + \gamma_2 \|\mathbf{W}\|_B \tag{4}$$

where $\gamma_1$ and $\gamma_2$ are trade-off hyperparameters.

The $M$-norm uses the $\ell_2$-norm within each feature modality and the $\ell_1$-norm between these modalities, which enforces the sparsity of these modalities. For example, if a modality is not discriminative enough to recognize a certain behavior category, the objective in Eq. (4) will assign zeros (in the ideal case; in practice, usually very small values) to the features within this modality with respect to the behavior category; otherwise, their weights are large. As demonstrated in Fig. 2, the proposed $M$-norm regularization term captures the interrelationship between feature modalities and estimates their importance to recognize certain behaviors.
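To make the two group norms concrete, the sketch below evaluates a modality-wise and a joint-wise group norm on a toy weight layout. The nested-list layout `W[i][q][j]` (category, modality, joint) and the function names `modality_norm` and `joint_norm` are assumptions chosen for this illustration only.

```python
import math

def modality_norm(W):
    """Sum, over categories and modalities, of the l2-norm of each modality block."""
    total = 0.0
    for w in W:                      # w[q][j]: weights of joint j in modality q
        for modality in w:
            total += math.sqrt(sum(v * v for joint in modality for v in joint))
    return total

def joint_norm(W):
    """Sum, over categories and joints, of the l2-norm of each joint block
    (the joint's weights gathered across all modalities)."""
    total = 0.0
    for w in W:
        num_joints = len(w[0])
        for j in range(num_joints):
            total += math.sqrt(sum(v * v for modality in w for v in modality[j]))
    return total

# One behavior category, two modalities, two joints.
W = [[
    [[3.0, 0.0], [4.0, 0.0]],   # modality 0: weights for joint 0 and joint 1
    [[0.0], [0.0]],             # modality 1: weights for joint 0 and joint 1
]]

m_value = modality_norm(W)   # sqrt(3^2 + 4^2) + 0 = 5
j_value = joint_norm(W)      # 3 + 4 = 7
```

The same weights give different penalties under the two groupings: here all weight sits in one modality but is spread over two joints, so the modality-wise penalty is smaller than the joint-wise one.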

III-D Human Behavior Understanding

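After solving the optimization problem in Eq. (4) during the training phase (the solution is detailed in Section IV), we can obtain the optimal weight matrix $\mathbf{W}^* = [\mathbf{w}_1^*, \dots, \mathbf{w}_c^*]$ and the corresponding intercept $\mathbf{b}^*$. Then, in the testing phase, given a new data instance $\mathbf{x}$, its behavior category is decided by:

$$y(\mathbf{x}) = \arg\max_{i \in \{1, \dots, c\}} \; \left( \mathbf{w}_i^{*\top} \mathbf{x} + b_i^* \right) \tag{5}$$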

An advantage of our formulation utilizing the regression-like objective function is that classification is integrated with feature learning; thus, we do not require additional classifiers (e.g., SVMs). This significantly improves processing efficiency, resulting in high-speed recognition of human behaviors that can benefit real-time human-centered robotics applications.
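The integrated classification step amounts to one matrix-vector product followed by picking the largest score. A minimal sketch, assuming a learned weight matrix and intercept are already available (the name `predict`, the inclusion of the intercept in the score, and the toy numbers are assumptions of this example):

```python
def predict(W, b, x):
    """Return the index of the category with the largest linear score, plus all scores.

    W is d x c (one column per behavior category), b has length c, x has length d.
    """
    num_categories = len(b)
    scores = [
        sum(W[f][k] * x[f] for f in range(len(x))) + b[k]
        for k in range(num_categories)
    ]
    best = max(range(num_categories), key=lambda k: scores[k])
    return best, scores

# Two features, three behavior categories (toy values).
W = [[1.0, -1.0, 0.0],
     [0.0, 0.5, 2.0]]
b = [0.0, 0.1, -0.5]

label, scores = predict(W, b, [0.2, 0.9])
```

Because prediction is a single linear map, its cost grows only with the number of features times the number of categories, which is consistent with the high processing speed reported for the approach.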

IV Optimization Algorithm

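Since the objective in Eq. (4) comprises two non-smooth regularization terms, the $M$-norm and the $B$-norm, it is difficult to solve in general. To this end, we implement a new iterative algorithm to solve the optimization problem in Eq. (4). The proposed optimization solver has a theoretical guarantee to find the optimal solution.

To learn the value of the weight matrix $\mathbf{W}$, we compute the derivative of the objective with respect to each column $\mathbf{w}_i$ ($1 \le i \le c$) and set it to the zero vector. With $\mathbf{X}$ and $\mathbf{Y}$ centered so that the intercept $\mathbf{b}$ drops out, we obtain

$$\mathbf{X}\mathbf{X}^\top \mathbf{w}_i - \mathbf{X}\mathbf{y}_i + \gamma_1 \mathbf{D}^i \mathbf{w}_i + \gamma_2 \tilde{\mathbf{D}}^i \mathbf{w}_i = \mathbf{0} \tag{6}$$

where $\mathbf{y}_i$ is the $i$-th column of $\mathbf{Y}$, $\mathbf{D}^i$ is a block diagonal matrix with the $q$-th diagonal block as $\frac{1}{2\|\mathbf{w}_i^q\|_2}\mathbf{I}_{d_q}$, $\mathbf{w}_i^q$ is the $q$-th segment of $\mathbf{w}_i$ consisting of the weights of the $q$-th feature modality, $\tilde{\mathbf{D}}^i$ is a diagonal matrix with the $j$-th diagonal block as $\frac{1}{2\|\bar{\mathbf{w}}_i^j\|_2}\mathbf{I}_{\bar{d}_j}$, $\bar{\mathbf{w}}_i^j$ is the $j$-th segment of $\mathbf{w}_i$ (of dimension $\bar{d}_j$) including the weights of skeletal features calculated from the $j$-th skeleton joint, and $\mathbf{I}_k$ is the identity matrix of size $k$. Thus we have

$$\mathbf{w}_i = \left( \mathbf{X}\mathbf{X}^\top + \gamma_1 \mathbf{D}^i + \gamma_2 \tilde{\mathbf{D}}^i \right)^{-1} \mathbf{X}\mathbf{y}_i \tag{7}$$

Both $\mathbf{D}^i$ and $\tilde{\mathbf{D}}^i$ are dependent on $\mathbf{W}$ and thus also unknown variables. An iterative algorithm is implemented to solve this problem, which is described in Algorithm 1.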

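Before analyzing convergence of Algorithm 1, we describe a lemma from [25] as follows.

Lemma 1

Given vectors $\mathbf{v}$ and $\tilde{\mathbf{v}}$ with $\mathbf{v} \neq \mathbf{0}$, the following inequality holds:

$$\|\tilde{\mathbf{v}}\|_2 - \frac{\|\tilde{\mathbf{v}}\|_2^2}{2\|\mathbf{v}\|_2} \le \|\mathbf{v}\|_2 - \frac{\|\mathbf{v}\|_2^2}{2\|\mathbf{v}\|_2} \tag{8}$$

Theorem 1

Algorithm 1 converges to the optimal solution to the optimization problem in Eq. (4).

According to Step 3 of Algorithm 1, we know

$$\mathbf{W}(t+1) = \arg\min_{\mathbf{W}} \; \|\mathbf{X}^\top \mathbf{W} - \mathbf{Y}\|_F^2 + \gamma_1 \sum_{i=1}^{c} \mathbf{w}_i^\top \mathbf{D}^i(t+1)\, \mathbf{w}_i + \gamma_2 \sum_{i=1}^{c} \mathbf{w}_i^\top \tilde{\mathbf{D}}^i(t+1)\, \mathbf{w}_i \tag{9}$$

Then, we can derive that

$$\mathcal{J}(t+1) + \gamma_1 \sum_{i=1}^{c} \mathbf{w}_i(t+1)^\top \mathbf{D}^i(t+1)\, \mathbf{w}_i(t+1) + \gamma_2 \sum_{i=1}^{c} \mathbf{w}_i(t+1)^\top \tilde{\mathbf{D}}^i(t+1)\, \mathbf{w}_i(t+1) \le \mathcal{J}(t) + \gamma_1 \sum_{i=1}^{c} \mathbf{w}_i(t)^\top \mathbf{D}^i(t+1)\, \mathbf{w}_i(t) + \gamma_2 \sum_{i=1}^{c} \mathbf{w}_i(t)^\top \tilde{\mathbf{D}}^i(t+1)\, \mathbf{w}_i(t) \tag{10}$$

where $\mathcal{J}(t) = \|\mathbf{X}^\top \mathbf{W}(t) - \mathbf{Y}\|_F^2$.

After substituting the definition of $\mathbf{D}^i(t+1)$ and $\tilde{\mathbf{D}}^i(t+1)$, whose diagonal blocks are computed from $\mathbf{W}(t)$, we obtain

$$\mathcal{J}(t+1) + \gamma_1 \sum_{i=1}^{c} \sum_{q=1}^{m} \frac{\|\mathbf{w}_i^q(t+1)\|_2^2}{2\|\mathbf{w}_i^q(t)\|_2} + \gamma_2 \sum_{i=1}^{c} \sum_{j=1}^{p} \frac{\|\bar{\mathbf{w}}_i^j(t+1)\|_2^2}{2\|\bar{\mathbf{w}}_i^j(t)\|_2} \le \mathcal{J}(t) + \gamma_1 \sum_{i=1}^{c} \sum_{q=1}^{m} \frac{\|\mathbf{w}_i^q(t)\|_2^2}{2\|\mathbf{w}_i^q(t)\|_2} + \gamma_2 \sum_{i=1}^{c} \sum_{j=1}^{p} \frac{\|\bar{\mathbf{w}}_i^j(t)\|_2^2}{2\|\bar{\mathbf{w}}_i^j(t)\|_2} \tag{11}$$

From Lemma 1, we can derive

$$\sum_{i=1}^{c} \sum_{q=1}^{m} \left( \|\mathbf{w}_i^q(t+1)\|_2 - \frac{\|\mathbf{w}_i^q(t+1)\|_2^2}{2\|\mathbf{w}_i^q(t)\|_2} \right) \le \sum_{i=1}^{c} \sum_{q=1}^{m} \left( \|\mathbf{w}_i^q(t)\|_2 - \frac{\|\mathbf{w}_i^q(t)\|_2^2}{2\|\mathbf{w}_i^q(t)\|_2} \right) \tag{12}$$

$$\sum_{i=1}^{c} \sum_{j=1}^{p} \left( \|\bar{\mathbf{w}}_i^j(t+1)\|_2 - \frac{\|\bar{\mathbf{w}}_i^j(t+1)\|_2^2}{2\|\bar{\mathbf{w}}_i^j(t)\|_2} \right) \le \sum_{i=1}^{c} \sum_{j=1}^{p} \left( \|\bar{\mathbf{w}}_i^j(t)\|_2 - \frac{\|\bar{\mathbf{w}}_i^j(t)\|_2^2}{2\|\bar{\mathbf{w}}_i^j(t)\|_2} \right) \tag{13}$$

Adding Eqs. (11)-(13) on both sides, with Eqs. (12) and (13) weighted by $\gamma_1$ and $\gamma_2$ respectively, we obtain

$$\mathcal{J}(t+1) + \gamma_1 \sum_{i=1}^{c} \sum_{q=1}^{m} \|\mathbf{w}_i^q(t+1)\|_2 + \gamma_2 \sum_{i=1}^{c} \sum_{j=1}^{p} \|\bar{\mathbf{w}}_i^j(t+1)\|_2 \le \mathcal{J}(t) + \gamma_1 \sum_{i=1}^{c} \sum_{q=1}^{m} \|\mathbf{w}_i^q(t)\|_2 + \gamma_2 \sum_{i=1}^{c} \sum_{j=1}^{p} \|\bar{\mathbf{w}}_i^j(t)\|_2$$

Therefore, Algorithm 1 decreases the objective value in each iteration. Since the optimization problem defined in Eq. (4) is convex and the objective is lower-bounded by zero by the definition of matrix and vector norms, the algorithm converges to the optimum.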

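Input: $\mathbf{X}$ and $\mathbf{Y}$
1 Let $t = 1$. Initialize $\mathbf{W}(t)$ by solving $\min_{\mathbf{W}} \|\mathbf{X}^\top \mathbf{W} - \mathbf{Y}\|_F^2$.
  while not converged do
2     Calculate the block diagonal matrices $\mathbf{D}^i(t+1)$ ($1 \le i \le c$), where the $q$-th diagonal block of $\mathbf{D}^i(t+1)$ is $\frac{1}{2\|\mathbf{w}_i^q(t)\|_2}\mathbf{I}_{d_q}$.
      Calculate the block diagonal matrices $\tilde{\mathbf{D}}^i(t+1)$ ($1 \le i \le c$), where the $j$-th diagonal block of $\tilde{\mathbf{D}}^i(t+1)$ is $\frac{1}{2\|\bar{\mathbf{w}}_i^j(t)\|_2}\mathbf{I}_{\bar{d}_j}$.
3     For each $\mathbf{w}_i$ ($1 \le i \le c$), $\mathbf{w}_i(t+1) = \left( \mathbf{X}\mathbf{X}^\top + \gamma_1 \mathbf{D}^i(t+1) + \gamma_2 \tilde{\mathbf{D}}^i(t+1) \right)^{-1} \mathbf{X}\mathbf{y}_i$.
      $t = t + 1$.
  end
Output: $\mathbf{W}(t)$
Algorithm 1 An iterative algorithm to solve the problem in Eq. (4)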
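The iterative scheme can be sketched in a few dozen lines of dependency-free Python. This is an illustrative reading of the algorithm under stated assumptions, not the authors' code: the intercept is omitted (equivalently, the data are assumed centered), a fixed iteration count replaces the convergence test, group norms are clamped from below by a small `eps` to avoid division by zero, the initial least-squares solve uses a tiny ridge term, and all names (`solve`, `reweight`, `fit`, the toy data) are hypothetical.

```python
import math

def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def group_norms(w, groups):
    """l2-norm of each group of feature indices in the weight vector w."""
    return [math.sqrt(sum(w[f] ** 2 for f in g)) for g in groups]

def reweight(w, groups, eps):
    """Diagonal entries 1 / (2 * ||w_g||_2), shared by all features of group g."""
    diag = [0.0] * len(w)
    for g, norm in zip(groups, group_norms(w, groups)):
        for f in g:
            diag[f] = 1.0 / (2.0 * max(norm, eps))
    return diag

def objective(X, Y, W, modalities, joints, g1, g2):
    """Squared loss plus the two group-sparsity penalties (W holds one vector per category)."""
    d, n = len(X), len(X[0])
    total = 0.0
    for k, w in enumerate(W):
        for i in range(n):
            total += (sum(X[f][i] * w[f] for f in range(d)) - Y[i][k]) ** 2
        total += g1 * sum(group_norms(w, modalities))
        total += g2 * sum(group_norms(w, joints))
    return total

def fit(X, Y, modalities, joints, g1, g2, iters=20, eps=1e-8):
    d, n, c = len(X), len(X[0]), len(Y[0])
    XXt = [[sum(X[a][i] * X[b][i] for i in range(n)) for b in range(d)] for a in range(d)]
    Xy = [[sum(X[f][i] * Y[i][k] for i in range(n)) for f in range(d)] for k in range(c)]

    def solve_with_diagonal(diag, k):
        A = [[XXt[a][b] + (diag[a] if a == b else 0.0) for b in range(d)] for a in range(d)]
        return solve(A, Xy[k])

    # Initialization: least squares (a tiny ridge keeps the system invertible).
    W = [solve_with_diagonal([1e-6] * d, k) for k in range(c)]
    history = [objective(X, Y, W, modalities, joints, g1, g2)]
    for _ in range(iters):
        new_W = []
        for k in range(c):
            dm = reweight(W[k], modalities, eps)   # modality-wise reweighting
            dj = reweight(W[k], joints, eps)       # joint-wise reweighting
            diag = [g1 * dm[f] + g2 * dj[f] for f in range(d)]
            new_W.append(solve_with_diagonal(diag, k))
        W = new_W
        history.append(objective(X, Y, W, modalities, joints, g1, g2))
    return W, history

# Four features laid out as index = 2 * modality + joint (2 modalities x 2 joints).
modalities = [[0, 1], [2, 3]]
joints = [[0, 2], [1, 3]]
X = [[1.0, 0.9, 1.1, 0.0, 0.1, -0.1],
     [0.2, -0.1, 0.0, 0.1, 0.0, -0.2],
     [0.8, 1.0, 1.2, 0.1, -0.1, 0.0],
     [0.0, 0.1, -0.1, 0.2, 0.1, 0.0]]
Y = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3

W, history = fit(X, Y, modalities, joints, g1=0.1, g2=0.1)
```

Recording the objective after every update makes the monotone decrease claimed by the convergence analysis easy to check empirically on small problems.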

V Experiments

To quantitatively assess the performance of the proposed FABL method, we conduct experiments using public benchmark datasets. Furthermore, to evaluate the benefits of our FABL method in real-world robotics applications, we deploy FABL on a Baxter robot to perform online, real-time behavior recognition for human-robot interaction.

V-A Implementation

Our FABL approach is implemented using a combination of Matlab and C++ on a Linux machine with an i7 3.4GHz CPU and 16GB memory. The Matlab code is used to validate our approach on two public datasets: MSR Action3D Dataset [26] and Cornell Activity Dataset [6], while the C++ program is employed for validation on a Baxter robot in a real-world “serving drinks” task.

We intentionally designed and applied four simple skeletal features to emphasize the performance gain resulting from our FABL method rather than from sophisticated features. These simple skeletal features include: (1) spatial joint displacement, which is the 3D coordinate difference of each body part with respect to the torso: $\mathbf{p} - \mathbf{p}_c$, where $\mathbf{p} = (x, y, z)$ represents the coordinates of each skeletal joint, and $\mathbf{p}_c = (x_c, y_c, z_c)$ denotes the coordinates of the center torso joint in skeletal data, (2) temporal joint displacement, which is defined as the temporal location difference of the same body joint in the current frame with respect to the previous frame: $\mathbf{p}^{t} - \mathbf{p}^{t-1}$, where $\mathbf{p}^{t}$ is the joint location at time $t$, (3) long-term temporal joint displacement, defined as the temporal 3D location difference between the current frame and the initial frame: $\mathbf{p}^{t} - \mathbf{p}^{0}$, where $\mathbf{p}^{0}$ is the coordinates of a joint in the initial frame, and (4) spatial joint distance, which is defined as the geometrical distance of a joint to the torso center joint: $\|\mathbf{p} - \mathbf{p}_c\|_2$. Then, we compute a histogram of each feature type to build a vector that is used as a feature modality in our experiment.
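The four per-joint quantities described above can be computed directly from a sequence of joint positions. The sketch below is illustrative only: the function name `skeletal_features`, the dictionary keys, the choice of joint index 0 as the torso, and the toy frames are assumptions, and the subsequent histogram step is not shown.

```python
import math

def skeletal_features(frames, torso=0):
    """Compute four simple per-joint features for every frame after the first.

    frames[t][j] is the (x, y, z) position of joint j at frame t;
    `torso` is the index of the torso center joint (an assumption of this sketch).
    """
    def sub(a, b):
        return tuple(ai - bi for ai, bi in zip(a, b))

    features = []
    for t in range(1, len(frames)):
        cur, prev, first = frames[t], frames[t - 1], frames[0]
        per_joint = []
        for j, p in enumerate(cur):
            spatial = sub(p, cur[torso])
            per_joint.append({
                "spatial_displacement": spatial,              # joint minus torso
                "temporal_displacement": sub(p, prev[j]),     # current minus previous frame
                "long_term_displacement": sub(p, first[j]),   # current minus initial frame
                "spatial_distance": math.sqrt(sum(v * v for v in spatial)),
            })
        features.append(per_joint)
    return features

# Three frames, two joints: joint 0 is the torso, joint 1 is a hand.
frames = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)],
    [(1.0, 0.0, 0.0), (4.0, 4.0, 0.0)],
]
feats = skeletal_features(frames)
hand = feats[-1][1]   # features of the hand joint in the last frame
```

Each feature type would then be summarized over a sequence (for example by a histogram, as the text states) to form one modality of the input vector.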

V-B Results on MSR Action3D Dataset

We evaluate the performance of the proposed approach to recognize human behaviors when interacting with structured-light cameras, using the MSR Action3D benchmark dataset [26]. This dataset contains 20 categories of human actions, each performed three times by 7 subjects. The skeleton sequence of “high arm wave” is shown in Fig. 3.

Fig. 3: The MSR Action3D dataset is utilized in the experiment to evaluate the proposed FABL approach, which contains 20 activities recorded using Kinect, which are (M1) high arm wave, (M2) horizontal arm wave, (M3) hammer, (M4) hand catch, (M5) forward punch, (M6) high throw, (M7) draw x, (M8) draw tick, (M9) draw circle, (M10) hand clap, (M11) two hand wave, (M12) side boxing, (M13) bend, (M14) forward kick, (M15) side kick, (M16) jogging, (M17) tennis swing, (M18) tennis serve, (M19) golf swing, and (M20) pick up & throw. This figure shows a sample skeleton sequence of the action (M1) high arm waving in the dataset

We evaluate the recognition performance using a challenging subject-wise setting. That is, the training dataset does not contain any data instances from the subjects who participate in testing. When combining both structured sparsity-inducing norms to perform simultaneous feature and skeletal joint learning, our FABL method obtains an accuracy of 91.67%. The confusion matrix obtained by our method is shown in Fig. 4(a), which demonstrates that our FABL approach recognizes most of the behaviors well. The actions that are not well identified are (M4) hand catch and (M7) draw x, the latter of which is always misclassified as (M8) draw tick or (M9) draw circle, which have similar, small motions.

(a) MSR Action3D dataset
(b) CAD-60 dataset
Fig. 4: Confusion matrices obtained by our FABL method over the MSR Action3D and CAD-60 datasets. The behavior category labels M1-M16 and C1-C14 are described in Fig. 3 and Fig. 5, respectively.

We compare with two baseline methods: feature-learning-only ($\gamma_2 = 0$) and body-part-learning-only ($\gamma_1 = 0$). As presented in Table I, the feature-learning-only method obtains an average recognition accuracy of 85.00%, while the body-part-learning-only method obtains an average accuracy of 86.67%. This indicates that FABL outperforms baseline approaches using a single norm for regularization. In addition, we compare our FABL method with previous activity recognition techniques based on skeleton features. FABL achieves the highest recognition accuracy among the methods listed in Table I on the MSR Action3D dataset, while retaining high-speed performance.

| Reference | Method | Accuracy |
|---|---|---|
| Ofli et al. [11] | Sequence of Most Informative Joints | 41.18% |
| Wang et al. [10] | Dynamic Temporal Warping | 54.0% |
| Ellis et al. [27] | Joints Distance + Key Poses | 65.7% |
| Li et al. [26] | Action Graph | 74.7% |
| Xia et al. [28] | HOJ3D | 78% |
| Yang and Tian [9] | EigenJoints | 83.3% |
| Wang et al. [2] | Actionlet Ensemble | 88.2% |
| Ben Amor et al. [29] | Skeleton Trajectories | 89% |
| Our Methods | Feature Learning Only | 85.00% |
| Our Methods | Body-Part Learning Only | 86.67% |
| Our Methods | FABL | 91.67% |

TABLE I: Comparison of average accuracy with previous skeleton-based methods on the MSR Action3D dataset

V-C Results on Cornell Activity Dataset

The Cornell Activity Dataset 60 (CAD-60) [6] is a widely applied benchmark for human activity recognition in robotics applications. This dataset includes color-depth and skeleton information of twelve daily activities as well as two motions, “still” and “random,” recorded by a Kinect sensor in various environments, including office, kitchen, bedroom, bathroom, and living room. Each activity is performed by four subjects, two male and two female (one subject is left-handed). The skeleton data in each frame contains 15 joints, as shown in Fig. 5. We evaluate FABL’s performance in a subject-wise cross-validation setup [30], where actions performed by new subjects are used for testing.
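A subject-wise split keeps every instance of the test subject out of training. The sketch below shows one common way to build such folds, leave-one-subject-out; whether the protocol of [30] matches this exactly is an assumption, and the function name `subject_wise_folds` and the toy subject labels are hypothetical.

```python
def subject_wise_folds(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

# Subject label of each recorded instance (toy example with four subjects).
subject_ids = [1, 1, 2, 2, 3, 4, 4]
folds = list(subject_wise_folds(subject_ids))
```

Averaging the accuracy over the folds gives a single subject-wise score, and no subject ever appears on both sides of a split.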

Fig. 5: The CAD-60 dataset contains 14 behaviors, including (C1) standing still, (C2) talking on the phone, (C3) writing on whiteboard, (C4) drinking water, (C5) rinsing mouth with water, (C6) brushing teeth, (C7) wearing contact lenses, (C8) talking on couch, (C9) relaxing on couch, (C10) cooking (chopping), (C11) cooking (stirring), (C12) opening pill container, (C13) working on computer, (C14) random. RGB images are depicted in the top row, and the depth images with the human skeleton in yellow are shown in the bottom row.

As demonstrated in Table II, the FABL method using both regularization terms obtains an average accuracy of 83.93%, and its detailed confusion matrix is presented in Fig. 4(b), which indicates that most of the activities can be well classified by our approach.

| Reference | Method | Accuracy |
|---|---|---|
| Ni et al. [30] | Order-Preserving Sparse Coding | 65.32% |
| Piyathilaka and Kodagoda [31] | Hidden Markov Model | 78.38% |
| Wang et al. [2] | Skeleton-based Actionlet Ensemble | 74.70% |
| Zhang and Tian [32] | Bag of Features | 80.77% |
| Our Methods | Feature Learning Only | 78.57% |
| Our Methods | Body-Part Learning Only | 79.46% |
| Our Methods | FABL | 83.93% |

TABLE II: Comparison of average recognition accuracy with previous skeleton-based methods on the CAD-60 dataset

We implemented two baseline techniques under the same formulation. First, we set $\gamma_2 = 0$ to evaluate the performance of the feature learning scheme, and obtain an accuracy of 78.57%. Then, $\gamma_1$ is set to zero to evaluate the performance of the body-part learning scheme, and we obtain an average accuracy of 79.46%. Both baseline methods perform worse than the full FABL approach using both regularization terms. Moreover, we implemented a third baseline method with no regularization terms, which obtains an accuracy of 76.79% and performs worse than the methods with the regularization terms. In addition, we compare our FABL method with previous state-of-the-art skeleton-based techniques for activity recognition, as reported in Table II, which shows that our FABL method outperforms these skeleton-based techniques on the CAD-60 dataset.

V-D Behavior Recognition for Human-Robot Interaction

Besides using public benchmark datasets to evaluate and compare our FABL method’s accuracy, we also implemented and deployed the method on a physical robot to validate its performance in real-world robotics applications. The robot employed in this experiment is a Baxter robot, as shown in Fig. 6(a), which uses a structured-light sensor for onboard 3D perception and the same workstation (Intel i7 3.4GHz CPU and 16GB memory) for onboard control and data processing.

(a) Baxter performing “serving drinks”
(b) Confusion matrix
Fig. 6: We evaluate our FABL approach using a Baxter robot to recognize behaviors for real-time human-robot interaction. The tasks focus on the robot-assisted living application such as “serving drinks” as shown in Fig. 6(a). The confusion matrix is illustrated in Fig. 6(b).

In this experiment, the task focuses on the robot-assisted living application, where the Baxter robot needs to recognize the activities of a subject and perform a collection of predefined robot actions, such as “serving drinks,” as demonstrated in Fig. 6(a), in response to the subject’s activity. We define six robot actions, including fetching a drinking bottle with one hand, fetching an empty cup with the other hand, pouring the drinks into the cup, putting back the bottle, serving the drinking cup to the subject, and finally putting back the cup. Each robot action is triggered by a specific command gesture performed by a subject in front of the robot, which must be recognized by the Baxter robot. The skeleton data is captured onboard and in real time using ROS and the OpenNI package.

Eight human behavior categories are defined and used to interact with the robot, including lifting up left/right arms, pouring with left/right hands, serving with left/right hands, and putting down left/right arms. We specifically distinguish between the left side and the right side, because this is critical to take into account human preference in practical, real-world scenarios. Two human subjects with different body scales and motion patterns are involved in this experiment. Each subject performs each of the eight behaviors 20 times. Actions by one subject were used for training, while the other subject’s actions were used for testing. Ground truth is manually recorded and compared with the recognition results obtained by the robot for quantitative evaluation. After extracting multimodal features from training instances, our method computes the optimal weight matrix $\mathbf{W}$ by Algorithm 1 using the training data. Then, the learned FABL approach is deployed on the robot for online, onboard behavior recognition to enable real-time human-robot interaction.

Similar to the experiments using public datasets, we also quantitatively assess FABL’s performance and compare it with baseline and existing skeleton-based techniques. The average accuracy obtained by the complete FABL method with both regularization terms is 77.19%. The confusion matrix obtained by our FABL approach is shown in Fig. 6(b). For comparison, the baseline technique based only on feature learning ($\gamma_2 = 0$) obtains an accuracy of 76.56%, while the baseline based only on body part learning ($\gamma_1 = 0$) obtains an average recognition accuracy of 76.25%. In addition, we compare our FABL method with several previous skeleton-based recognition techniques and present the results in Table III. We can observe that the FABL method obtains better performance than the baselines and the previous methods compared. Since only one subject’s actions were used to train the FABL model, the recognition accuracy was not as high as that obtained on the public benchmark datasets. More training data is expected to improve the testing performance.

| Reference | Method | Accuracy |
|---|---|---|
| [27] | Relative Angles and Distances | 15.00% |
| [2] | Histogram of Joint Position Differences | 48.13% |
| [8] | Histogram of Oriented Displacements | 51.25% |
| Our Methods | Feature Learning Only | 76.56% |
| Our Methods | Body-Part Learning Only | 76.25% |
| Our Methods | FABL | 77.19% |

TABLE III: Comparison of average recognition accuracy with previous methods for real-time human-robot interaction
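
The average accuracy and the confusion matrix reported in this experiment can be computed from the manually recorded ground truth and the robot's predictions. The following is a minimal sketch in Python, not the authors' evaluation code; the function names and the toy labels are assumptions made for illustration only.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count predictions; rows index the ground-truth class, columns the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def average_accuracy(matrix):
    """Fraction of all test instances that fall on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_accuracy(matrix):
    """Recall of each ground-truth class (diagonal entry divided by its row sum)."""
    result = []
    for i, row in enumerate(matrix):
        row_total = sum(row)
        result.append(row[i] / row_total if row_total else 0.0)
    return result


if __name__ == "__main__":
    # Toy labels for three classes (illustrative values, not data from the paper).
    truth = [0, 0, 1, 1, 2, 2]
    predicted = [0, 1, 1, 1, 2, 0]
    cm = confusion_matrix(truth, predicted, num_classes=3)
    print(cm)
    print(average_accuracy(cm))
    print(per_class_accuracy(cm))
```

In the Baxter experiment the same computation would use eight classes, one per behavior category.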

VI Discussion

High-Speed Processing. Because our FABL approach integrates feature learning and classification in the same formulation, and because its regression-like objective function is efficient to evaluate, FABL is able to achieve high-speed processing. To validate this advantage, we perform additional experiments on the MSR Action3D and CAD-60 datasets using unoptimized Matlab implementations, and on the physical Baxter robot using a C++ implementation. The runtime results on all datasets are presented in Table IV, which shows that our FABL approach achieves a high processing speed on the order of $10^4$ Hz. This indicates the promise of our FABL approach for identifying human behaviors in real-time robotics applications.
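
To illustrate why a regression-like formulation is fast at test time, the sketch below classifies an observation with a single pass over a learned weight matrix and converts the measured time per observation into a processing speed in Hz. It assumes that prediction is an argmax over the linear scores, which is a common rule for regression-based classifiers but is stated here as an assumption rather than taken from the paper. All function names and the toy weights are hypothetical.

```python
import time


def predict(weights, features):
    """Return the index of the class with the largest linear score.

    weights: one row per feature dimension, one column per class.
    features: one value per feature dimension.
    """
    num_classes = len(weights[0])
    scores = [0.0] * num_classes
    for w_row, x in zip(weights, features):
        for c in range(num_classes):
            scores[c] += w_row[c] * x
    return max(range(num_classes), key=lambda c: scores[c])


def processing_speed_hz(seconds_per_observation):
    """Processing speed is the reciprocal of the time per observation."""
    return 1.0 / seconds_per_observation


def time_per_observation(weights, observations):
    """Average wall-clock seconds spent classifying one observation."""
    start = time.perf_counter()
    for x in observations:
        predict(weights, x)
    elapsed = time.perf_counter() - start
    return elapsed / len(observations)


if __name__ == "__main__":
    # Toy weight matrix: 2 feature dimensions, 2 classes (illustrative only).
    toy_weights = [[1.0, 0.0], [0.0, 1.0]]
    toy_observations = [[0.2, 0.9]] * 1000
    seconds = time_per_observation(toy_weights, toy_observations)
    if seconds > 0:
        print(f"{seconds:.2e} s per observation, {processing_speed_hz(seconds):.2e} Hz")
```

Under this relationship, a speed on the order of $10^4$ Hz corresponds to roughly $10^{-4}$ s per observation.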

| Runtime | MSR Action3D | CAD-60 | Baxter |
|---|---|---|---|
| Processing speed (Hz) | | | |
| Time per observation (s) | | | |

TABLE IV: Runtime Analysis Over Different Datasets

Generalizability. FABL is a general approach that can work with different body kinematic models obtained from a variety of sensing devices and skeleton generation packages, including the OpenNI package in ROS, Microsoft SDKs, and MoCap systems. Given any kinematic body model from these devices, we can downsample the body model into 15 body parts and apply FABL to automatically identify the most representative parts. In this way, FABL can achieve cross-training [5], i.e., a model trained on the kinematic body model from one device can be directly applied to the body model from a different device, which can significantly reduce design effort.
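
As an illustration of the downsampling step, the sketch below reduces a 20-joint skeleton to a fixed 15-part layout by selecting and renaming joints. The paper does not list which 15 parts are used, so both the part layout (head, neck, torso, and the shoulders, elbows, hands, hips, knees, and feet on each side) and the mapping from 20-joint names are assumptions chosen for illustration.

```python
# Hypothetical 15-part layout; the order fixes the layout of the output list.
BODY_PARTS_15 = (
    "head", "neck", "torso",
    "left_shoulder", "left_elbow", "left_hand",
    "right_shoulder", "right_elbow", "right_hand",
    "left_hip", "left_knee", "left_foot",
    "right_hip", "right_knee", "right_foot",
)

# Assumed mapping from a 20-joint naming scheme to the 15 parts above.
# Joints not listed here (hip center, wrists, ankles) are simply dropped.
JOINTS20_TO_PART = {
    "Head": "head",
    "ShoulderCenter": "neck",
    "Spine": "torso",
    "ShoulderLeft": "left_shoulder",
    "ElbowLeft": "left_elbow",
    "HandLeft": "left_hand",
    "ShoulderRight": "right_shoulder",
    "ElbowRight": "right_elbow",
    "HandRight": "right_hand",
    "HipLeft": "left_hip",
    "KneeLeft": "left_knee",
    "FootLeft": "left_foot",
    "HipRight": "right_hip",
    "KneeRight": "right_knee",
    "FootRight": "right_foot",
}


def downsample(skeleton, mapping=JOINTS20_TO_PART, parts=BODY_PARTS_15):
    """Map a dict of joint name -> (x, y, z) onto the fixed 15-part order."""
    by_part = {mapping[name]: xyz for name, xyz in skeleton.items() if name in mapping}
    missing = [part for part in parts if part not in by_part]
    if missing:
        raise KeyError(f"skeleton is missing body parts: {missing}")
    return [by_part[part] for part in parts]
```

Supporting another device under this scheme would only require a new mapping dictionary, while the learned model keeps operating on the same 15-part representation.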

Hyperparameter Selection. The two regularization hyperparameters control the effect of feature learning and the strength of body-part learning, respectively. Their optimal values can be determined using cross-validation during the training process. In general, we observe that nonzero, moderate values of both hyperparameters usually result in satisfactory recognition accuracy, which shows that both regularization terms are necessary. When the hyperparameter values become too large, the performance decreases, because the loss function that models the recognition error receives too little weight. When the hyperparameters take values that are too small, the approach cannot capture the interrelationships of feature modalities and body parts well, which also decreases the recognition accuracy.
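
The cross-validation procedure mentioned above can be sketched as a grid search over the two regularization hyperparameters. This is a generic illustration, not the authors' procedure: the names `reg_feature` and `reg_body_part`, the callback `train_and_score`, and the grid values are all hypothetical, and the callback stands in for training FABL on the training folds and measuring accuracy on the held-out fold.

```python
from itertools import product


def k_fold_indices(num_samples, k):
    """Split range(num_samples) into k contiguous folds of near-equal size."""
    folds = []
    start = 0
    for i in range(k):
        size = num_samples // k + (1 if i < num_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def grid_search(train_and_score, num_samples, grid_feature, grid_body_part, k=5):
    """Return (best mean accuracy, reg_feature, reg_body_part).

    train_and_score(train_idx, val_idx, reg_feature, reg_body_part) must
    return the validation accuracy of a model trained on train_idx.
    """
    folds = k_fold_indices(num_samples, k)
    best = None
    for reg_feature, reg_body_part in product(grid_feature, grid_body_part):
        scores = []
        for i, val_idx in enumerate(folds):
            # Train on every fold except the held-out one.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(train_and_score(train_idx, val_idx, reg_feature, reg_body_part))
        mean_score = sum(scores) / len(scores)
        if best is None or mean_score > best[0]:
            best = (mean_score, reg_feature, reg_body_part)
    return best


if __name__ == "__main__":
    # Stand-in scorer that peaks at intermediate values (illustrative only).
    def toy_scorer(train_idx, val_idx, reg_feature, reg_body_part):
        return 1.0 - abs(reg_feature - 0.1) - abs(reg_body_part - 1.0)

    print(grid_search(toy_scorer, 160, [0.01, 0.1, 1.0], [0.1, 1.0, 10.0]))
```

A logarithmically spaced grid is a common choice here, since accuracy typically degrades at both very small and very large regularization strengths.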

VII Conclusion

In this paper, we introduce a novel FABL approach that simultaneously learns discriminative feature modalities and body parts to perform high-speed human behavior recognition. The proposed FABL method automatically identifies discriminative feature modalities and important body parts using two structured sparsity-inducing norms that model their interrelationships. Our FABL approach formulates behavior recognition as a regression-like optimization problem, which is solved by an efficient iterative algorithm with a theoretical guarantee of finding the optimal solution. To evaluate the performance of the proposed FABL method, we perform empirical studies using two public benchmark datasets and a physical Baxter robot. The experimental results indicate that FABL outperforms existing skeleton-based methods. More importantly, our FABL approach achieves a high processing speed on the order of $10^4$ Hz, which can enable realistic, self-contained, intelligent robots to recognize human behaviors and interact with humans in real time.


  • [1] H. Zhang, W. Zhou, C. Reardon, and L. E. Parker, “Simplex-based 3D spatio-temporal feature description for action recognition,” in CVPR, 2014.
  • [2] J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Learning actionlet ensemble for 3D human action recognition,” TPAMI, vol. 36, no. 5, pp. 914–927, 2014.
  • [3] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, “Real-time human pose recognition in parts from single depth images,” in CVPR, 2011.
  • [4] F. Han, B. Reily, W. Hoff, and H. Zhang, “Space-time representation of people based on 3D skeletal data: A review,” CVIU, 2017, to appear.
  • [5] H. Zhang and L. E. Parker, “Bio-inspired predictive orientation decomposition of skeleton trajectories for real-time human activity prediction,” in ICRA, 2015.
  • [6] J. Sung, C. Ponce, B. Selman, and A. Saxena, “Unstructured human activity detection from RGBD images,” in ICRA, 2012.
  • [7] X. Yang and Y. Tian, “EigenJoints-based action recognition using Naïve-Bayes-Nearest-Neighbor,” in CVPRW, 2012.
  • [8] M. A. Gowayyed, M. Torki, M. E. Hussein, and M. El-Saban, “Histogram of oriented displacements (HOD): describing trajectories of human joints for action recognition,” in IJCAI, 2013.
  • [9] X. Yang and Y. Tian, “Effective 3D action recognition using EigenJoints,” JVCIR, vol. 25, no. 1, pp. 2–11, 2014.
  • [10] J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Mining actionlet ensemble for action recognition with depth cameras,” in CVPR, 2012.
  • [11] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy, “Sequence of the most informative joints (SMIJ): A new representation for human skeletal action recognition,” JVCIR, vol. 25, no. 1, pp. 24–38, 2014.
  • [12] X. Chen and M. Koskela, “Online RGB-D gesture recognition with extreme learning machines,” in ICMI, 2013.
  • [13] P. Wei, Y. Zhao, N. Zheng, and S.-C. Zhu, “Modeling 4D human-object interactions for event and object recognition,” in ICCV, 2013.
  • [14] P. Wei, N. Zheng, Y. Zhao, and S.-C. Zhu, “Concurrent action detection with structural prediction,” in AAAI, 2013.
  • [15] G. Yu, Z. Liu, and J. Yuan, “Discriminative orderlet mining for real-time recognition of human-object interaction,” in ACCV, 2014.
  • [16] S. Z. Masood, C. Ellis, A. Nagaraja, M. F. Tappen, J. J. LaViola Jr., and R. Sukthankar, “Measuring and reducing observational latency when recognizing actions,” in ICCV, 2011.
  • [17] C. Wang, Y. Wang, and A. L. Yuille, “An approach to pose-based action recognition,” in CVPR, 2013.
  • [18] B. X. Nie, C. Xiong, and S.-C. Zhu, “Joint action recognition and pose estimation from video,” in CVPR, 2015.
  • [19] Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in CVPR, 2015.
  • [20] A. A. Chaaraoui, J. R. Padilla-López, P. Climent-Pérez, and F. Flórez-Revuelta, “Evolutionary joint selection to improve human action recognition with RGB-D devices,” ESA, vol. 41, no. 3, pp. 786–794, 2014.
  • [21] M. Reyes, G. Domínguez, and S. Escalera, “Feature weighting in dynamic time warping for gesture recognition in depth data,” in ICCVW, 2011.
  • [22] O. Patsadu, C. Nukoolkit, and B. Watanapa, “Human gesture recognition using Kinect camera,” in IJCCSSE, 2012.
  • [23] D.-A. Huang and K. M. Kitani, “Action-reaction: Forecasting the dynamics of human interaction,” in ECCV, 2014.
  • [24] R. Chaudhry, F. Ofli, G. Kurillo, R. Bajcsy, and R. Vidal, “Bio-inspired dynamic 3D discriminative skeletal features for human action recognition,” in CVPR, 2013.
  • [25] F. Nie, H. Huang, X. Cai, and C. H. Ding, “Efficient and robust feature selection via joint $\ell_{2,1}$-norms minimization,” in NIPS, 2010.
  • [26] W. Li, Z. Zhang, and Z. Liu, “Action recognition based on a bag of 3D points,” in CVPRW, 2010.
  • [27] C. Ellis, S. Z. Masood, M. F. Tappen, J. J. LaViola Jr., and R. Sukthankar, “Exploring the trade-off between accuracy and observational latency in action recognition,” IJCV, vol. 101, no. 3, pp. 420–436, 2013.
  • [28] L. Xia, C.-C. Chen, and J. Aggarwal, “View invariant human action recognition using histograms of 3D joints,” in CVPRW, 2012.
  • [29] B. Ben Amor, J. Su, and A. Srivastava, “Action recognition using rate-invariant analysis of skeletal shape trajectories,” TPAMI, vol. 38, no. 1, pp. 1–13, 2016.
  • [30] B. Ni, P. Moulin, and S. Yan, “Order-preserving sparse coding for sequence classification,” in ECCV, 2012.
  • [31] L. Piyathilaka and S. Kodagoda, “Gaussian mixture based HMM for human daily activity recognition using 3D skeleton features,” in CIEA, 2013.
  • [32] C. Zhang and Y. Tian, “RGB-D camera-based daily living activity recognition,” CVIP, vol. 2, no. 4, p. 12, 2012.