An important aspect of human-robot interaction is the recognition of trigger events that require a response on the robot side [laptev2007retrieving, turaga2008machine, fan2009recognition]. Such an event is usually an unexpected action or motion of the human subject. These events may trigger additional learning of new motions or an emergency response in case of accidents. In this paper, we address a common event: the detection of humans falling down due to tripping or health conditions.
Since most of the human body can be viewed as an articulated system of rigid bones connected by joints, human action can be expressed as the movement of the skeleton [lie2019fully]. Existing skeleton-based event detection methods fall into two categories: 2D skeleton-based [lie2018human, avola20192, zheng2019fall] and 3D skeleton-based [min2018support, wu2019skeleton, zhang2016automatic]. Compared to the 2D skeleton, the 3D skeleton carries more spatial information, at the cost of higher computation time and manual labeling effort. Moreover, extracting a 3D skeleton from monocular images remains an ill-posed inverse problem [zheng2020deep].
The emergence of Microsoft Kinect [kinect] and RealSense [realsense] cameras made multidimensional observation of human events feasible without high processing loads on the system. However, the noise of the depth measurement in these cameras has a significant influence on event detection. To solve this problem, we apply a gradual filtering process to skeleton sequences extracted from RGB images using a lightweight deep learning toolbox with aligned depth information.
In addition to detecting the event, learning and establishing a structural representation of the action is also essential and challenging. Different actions, such as lying down and falling down, may have the same start and end positions and similar pose transformations and rotations. However, their latent temporal features are entirely different. Modeling the latent spatio-temporal structure of actions is one of the most widely used techniques for action recognition and representation [rabiner1989tutorial, wang2010hidden, tang2012learning]. A latent spatio-temporal structure has two parts: action units with spatial information, and a temporal model. The action units are the ordered constituent elements of an action. The temporal feature defines the step length from the previous state to the next state [qi2018learning]. For the fall-down event, the temporal feature is the sharp height change of the skeleton [ma2014depth].
For latent action unit extraction, the Sparse Coding Dictionary (SCD) is a well-known approach [chiang2013multi, ben2018coding, mairal2010online], which approximates a given video sequence $X$ by the product of a low-rank dictionary $D$ and its coefficient matrix $A$. Online Dictionary Learning (ODL) is one of the most successful SCD methods and is widely used in action recognition. Because fall event detection is just one extreme case of action recognition, we consider the ODL algorithm as the baseline method in this work. Its cost can be expressed as a regularized least-squares problem:
$$\min_{D, A} \sum_{j=1}^{k} \frac{1}{2} \| X_j - D_j A_j \|_F^2 + \lambda \| A_j \|_1 \quad (1)$$
where $X_j$, $D_j$ and $A_j$ are the $j$-th sub-sequence, its action unit and its coefficient matrix, $\|\cdot\|_F$ denotes the Frobenius norm, $k$ is the number of action units and $\lambda$ is the regularization parameter. Unfortunately, in the presence of outliers, Eq (1) provides a poor estimate of $D$ and $A$ [yang2020graduated]. The performance is even worse for 3D skeleton-based human fall event detection, because the 3D skeleton has more outlier sources, such as skeleton estimation and depth measurement.
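The sensitivity of a least-squares dictionary cost to outliers can be illustrated with a small sketch. It assumes the standard ODL form of the cost (half the squared Frobenius reconstruction error plus an L1 penalty on the coefficients); the names `X`, `D`, `A` and `lam`, as well as the toy data, are illustrative assumptions and not taken from the paper.

```python
import numpy as np

def odl_cost(X, D, A, lam):
    """Assumed ODL cost: 0.5 * ||X - D A||_F^2 + lam * ||A||_1."""
    residual = X - D @ A
    return 0.5 * np.linalg.norm(residual, "fro") ** 2 + lam * np.abs(A).sum()

# Toy data: 6-dimensional "skeleton" columns, 3 atoms, 4 frames.
rng = np.random.default_rng(0)
D = rng.standard_normal((6, 3))
A = np.zeros((3, 4))
A[0, :2] = 1.0    # frames 0-1 use atom 0
A[2, 2:] = -0.5   # frames 2-3 use atom 2
X = D @ A         # noise-free sequence, so the reconstruction error is zero

clean_cost = odl_cost(X, D, A, lam=0.1)

X_outlier = X.copy()
X_outlier[:, 1] += 10.0   # corrupt a single frame
outlier_cost = odl_cost(X_outlier, D, A, lam=0.1)

print(clean_cost, outlier_cost)
```

A single corrupted frame dominates the cost because its error enters quadratically, which is the motivation for down-weighting such frames during training.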
In this paper, we present an approach that improves event detection latency and temporal resolution, demonstrated on the example of fall detection. We separate the fall event into five latent action atoms: “standing”, “bending knee”, “opening arm”, “knee landing” and “arm supporting”.
Overall, the technical contributions of the paper are:
We propose a novel Gradual Online Dictionary Learning (GODL) method that uses Graduated Non-convexity (GNC) with the Geman-McClure (GM) cost function to decrease the weight of outliers during training.
We demonstrate that our approach robustly extracts action units and detects fall events from training data with different outlier ratios.
We compare our results with other well-performing dictionary learning approaches on the NTU RGB+D dataset [Shahroudy_2016_NTURGBD] and achieve the best performance in terms of precision and accuracy.
The rest of the paper is organized as follows: in Section II, we briefly review existing approaches to latent action unit extraction and the Sparse Coding Dictionary. Section III introduces the Gradual Sparse Coding Dictionary. Section IV reports experimental results and discussions. Section V concludes the paper.
II Related Work
We review previous work in three related research streams: fall-down event detection, spatio-temporal latent action unit extraction, and global minimization with a robust cost.
II-A Fall-Down Event Detection
With the rapid development of motion capture technologies, e.g., single RGB camera systems [rougier2011robust, de2017home, huang2018video, mirmahboub2012automatic, tra2013human], fall event detection has recently received growing attention because of its importance in the health-care area.
For 3D event detection, RGB-D cameras, e.g., Microsoft Kinect and Intel RealSense, provide a significant advantage over standard cameras [wei2019learning]. Nghiem et al. [nghiem2012head] proposed a method to detect falls based on the speed of the head and the body centroid and their distance to the ground. Stone et al. [stone2014fall] used Microsoft Kinect to obtain a person’s vertical state from depth image frames based on ground segmentation. A fall is detected by analyzing the velocity from the initial state until the person is on the ground. In contrast to using depth images directly, Volkhardt et al. [volkhardt2013fallen] segmented and classified the point cloud obtained from depth images to detect fall events.
Since depth-based methods are sensitive to errors in shape and depth [wei2019learning], many researchers prefer 3D skeleton-based methods. Tran [le2014analysis] computed three states (distance, angle, velocity) from Kinect’s 3D skeleton and applied a support vector machine (SVM) to classify the falling-down action. Kong et al. [kong2018privacy] applied the Fast Fourier Transform (FFT) to classify a 3D fall event skeleton dataset. However, 3D skeleton estimation using a monocular camera is an ill-posed inverse problem [zheng2020deep].
II-B Spatio-Temporal Latent Action Unit Extraction
Based on sparse coding and dictionary learning, a falling-down action can be represented as a linear combination of dictionary elements (latent action units). Since Mairal et al. [mairal2010online] proposed the Online Dictionary Learning algorithm, it has attracted a lot of attention because of its robustness [chiang2013multi, ferrari2017dictionary, qi2018learning, wilson2014dictionary]. Ramirez et al. [ramirez2010classification] proposed a classic Dictionary Learning method with Structured Incoherence (DLSI), which considers the incoherence between different dictionaries as part of the cost and allows atoms to be shared between dictionaries. In contrast to sharing atoms, Yang et al. [yang2011fisher] presented Fisher Discrimination Dictionary Learning (FDDL), which uses the discriminative information in both the reconstruction error and the sparse coding coefficients to maximize the distance between dictionaries. In other words, a training sample should only be approximated by the dictionary generated from its own cluster. Kong et al. [kong2012dictionary] separated the dictionary into particularity and commonality parts and proposed a novel dictionary learning method, COPAR. Following a similar idea, Vu et al. [vu2016learning] developed Low-Rank Shared Dictionary Learning (LRSDL), which, based on FDDL, extracts a bias matrix shared by all dictionaries. However, its performance is limited for action recognition because each action unit should have a different action space. The results are discussed in the evaluation section.
Recently, spatio-temporal deep convolutional networks [plizzari2020spatial, wen2019graph, yan2018spatial, chen2020afnet, li2020spatio] have been widely applied to action recognition. The common principle of these works is to use several consecutive frames to generate temporal information around feature joints. However, the appropriate size of the temporal block differs between actions and is difficult to choose. Besides, some events have a strict sequence: a fall, for example, starts from standing (or sitting) and ends on the ground. Most deep learning networks cannot identify this sequence because they sum over all temporal blocks.
II-C Global Minimization with Robust Cost
Global minimization of ODL is NP-hard with respect to both outliers and the choice of regularization parameters. RANSAC [fischler1981random] is a widely used approach, but it does not guarantee optimality and its computation time increases exponentially with the outlier rate [yang2020graduated]. Graduated Non-convexity has also been successfully applied in computer vision tasks to optimize robust costs [nielsen1995surface][rangarajan1990generalized]. However, for lack of non-minimal solvers, the use of GNC in spatial perception has been limited. Zhou et al. [zhou2016fast] proposed a fast global registration method, which combines the least-squares cost with a weight function through the Black-Rangarajan duality. Yang et al. [yang2020graduated] applied this method to 3D point cloud registration and pose graph estimation.
Inspired by the successful works mentioned above, we propose a novel dictionary learning method that robustly extracts spatial and temporal latent action units under the noise of depth images and the uncertainty of 2D human pose estimation.
In this section, we first briefly introduce the setting of GODL and then present our framework.
III-A Task Definition
Formally, let $X = [x_1, \dots, x_n]$ denote a fall-down 3D pose sequence, where $x_i$ is the $i$-th column vector of skeleton joints. We assume that the sequence $X$ is segmented into $k$ sub-sequences and that each sub-sequence $X_j$ corresponds to an action unit $D_j$. The dictionary can then be expressed as $D = [D_1, \dots, D_k]$ and the coefficient matrix as $A = [A_1, \dots, A_k]$.
III-B Preprocessing of Data
An overview of the fall event detection training process is shown in Fig 1. RGB images are fed into OpenPose to obtain 2D skeleton joints. At the same time, depth frames are aligned with the RGB images. 3D skeleton joints are obtained by projecting each pixel position into 3D space using the aligned depth value. To compensate for the fact that a person can fall at different positions in image coordinates, a normalization function is applied that brings all skeletons to the same magnitude while keeping the ratio between the coordinate directions. To balance the influence of spatial and temporal information, we use a weight parameter. A K-means based clustering method then segments a sequence into clusters.
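The projection and normalization steps can be sketched as follows. The back-projection uses the standard pinhole camera model; the intrinsics `fx`, `fy`, `cx`, `cy` and the toy joints are hypothetical values. The normalization shown here (centering on the joint mean and dividing by one common scale) is only one possible choice that keeps the ratio between directions, and is an assumption rather than the function used in the paper.

```python
import numpy as np

def deproject(pixels, depth, fx, fy, cx, cy):
    """Back-project (u, v) pixel joints with depth z into 3D camera coordinates."""
    pixels = np.asarray(pixels, dtype=float)
    z = np.asarray(depth, dtype=float)
    x = (pixels[:, 0] - cx) * z / fx
    y = (pixels[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def normalize_skeleton(joints):
    """Assumed normalization: center on the joint mean, divide by one common scale."""
    centered = joints - joints.mean(axis=0)
    scale = np.abs(centered).max()
    return centered / scale if scale > 0 else centered

# Three hypothetical 2D joints, all at 2 m depth.
pixels = [(320, 240), (320, 340), (420, 240)]
depth = [2.0, 2.0, 2.0]
joints = deproject(pixels, depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
normalized = normalize_skeleton(joints)

# The same pose observed 50 pixels to the right yields the same normalized skeleton.
shifted_pixels = [(u + 50, v) for u, v in pixels]
shifted = deproject(shifted_pixels, depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
normalized_shifted = normalize_skeleton(shifted)
```

Using a single scale for all three axes, instead of one scale per axis, is what preserves the proportions of the skeleton.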
III-C Training Phase: Gradual Online Dictionary Learning
For each sub-sequence, we apply GODL to iteratively update the coefficient matrix and its action unit matrix until the cost converges or the maximum number of iterations is reached. The general framework of GODL is described in Algorithm 1. The main idea is to let the iteration process filter outliers automatically, so that the latent action units are learned from inliers.
Graduated non-convexity is a popular method for optimizing general non-convex cost functions such as the Geman-McClure (GM) function. In the form used for GNC [yang2020graduated], the GM function reads:
$$\rho_\mu(r) = \frac{\mu \bar{c}^2 r^2}{\mu \bar{c}^2 + r^2} \quad (2)$$
where $\bar{c}$ is a given constant that is the maximum accepted error of inliers, $\mu$ determines the shape of the GM function, and $r$ is the Frobenius norm of the error between the training sequence $X$ and the approximation model $DA$:
$$r = \| X - D A \|_F \quad (3)$$
At each outer iteration, we update $\mu$ and optimize Eq (4). The solution obtained at each iteration is used as the initial guess for the next iteration. The final solution is obtained once the original non-convex function is recovered ($\mu = 1$).
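The role of the shape parameter can be seen in a short numerical sketch. It assumes the GNC surrogate of the GM function as given in the GNC literature [yang2020graduated]; the names `r`, `c_bar` and `mu` are illustrative.

```python
def gm_surrogate(r, c_bar, mu):
    """Assumed GNC surrogate of the Geman-McClure cost; mu = 1 gives the GM function."""
    r2 = r * r
    return mu * c_bar**2 * r2 / (mu * c_bar**2 + r2)

c_bar = 1.0
residuals = (0.5, 1.0, 5.0, 50.0)
for mu in (1000.0, 10.0, 1.0):
    row = [round(gm_surrogate(r, c_bar, mu), 3) for r in residuals]
    print(f"mu={mu:7.1f}: {row}")
```

For large `mu` the surrogate behaves almost like the quadratic cost `r**2`, which is convex and easy to optimize. As `mu` decreases towards 1, the cost of large residuals saturates below `c_bar**2`, so outliers lose their influence.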
We use the Black-Rangarajan duality to combine the GNC-GM function with the weighted ODL cost. The resulting objective is the sum of a weighted cost and a penalty term on the weights. With simplified expressions for these terms, Eq (6) can be rewritten in a compact form.
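As a reference point, the generic Black-Rangarajan dual of the GM cost from the GNC literature [yang2020graduated] can be worked through for a single residual. This is a sketch under assumed notation ($r_i$ for the residual of training sample $i$, $w_i \in [0, 1]$ for its weight, $\bar{c}$ for the maximum accepted inlier error, $\mu$ for the shape parameter) and does not include the sparsity regularizer of ODL, so it is not necessarily the exact expression used in this paper.

```latex
\begin{align}
% dual problem for one residual: weighted cost plus penalty on the weight
\min_{w_i \in [0,1]} \; & w_i r_i^2 + \Phi_{\rho_\mu}(w_i),
\qquad \Phi_{\rho_\mu}(w_i) = \mu \bar{c}^2 \left( \sqrt{w_i} - 1 \right)^2 \\
% stationarity in w_i gives the closed-form weight
& r_i^2 + \mu \bar{c}^2 \left( 1 - \frac{1}{\sqrt{w_i}} \right) = 0
\;\Longrightarrow\;
w_i^\star = \left( \frac{\mu \bar{c}^2}{r_i^2 + \mu \bar{c}^2} \right)^2 \\
% substituting the optimal weight recovers the GM surrogate
& w_i^\star r_i^2 + \Phi_{\rho_\mu}(w_i^\star)
= \frac{\mu \bar{c}^2 r_i^2}{\mu \bar{c}^2 + r_i^2} = \rho_\mu(r_i)
\end{align}
```

The last line shows why alternating between a weighted least-squares step and a closed-form weight step minimizes the robust cost: for the optimal weight, the dual objective equals the GM surrogate itself.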
At the first inner iteration, all weights are set to 1. During the inner iterations, the weighted ODL is first optimized with fixed weights $w$, and then we optimize over $w$ with a fixed ODL cost. At a particular inner iteration within a weighted sub-sequence, we perform the following:
1) Dictionary Learning: minimize Eq (5) with respect to the action unit $D$ and the coefficient matrix $A$ with fixed $w$. This problem is the original ODL, but applied to the weighted training sequence.
In the ODL optimization, we first update the coefficient matrix $A$ with a fixed action unit $D$ (Sparse Coding). We then assign the weight $w$ to the training sequence $X$ and the coefficient matrix $A$, and update the action unit $D$ with the fixed weighted coefficient matrix and the weighted input matrix (Dictionary Learning):
Assign Weight: multiply the columns of $X$ and $A$ by the weight $w$ (column-wise product).
Sparse Coding: we use the Lasso-FISTA algorithm to update $A$ with fixed $D$, see [mairal2010online].
Dictionary Learning: minimize the weighted reconstruction error with respect to $D$ with fixed $A$.
2) Weight update: minimize Eq (5) with respect to the weight $w$ with fixed dictionary matrix $D$ and coefficients $A$. For the GM function, this step has the closed-form solution given in [yang2020graduated]:
$$w = \left( \frac{\mu \bar{c}^2}{r^2 + \mu \bar{c}^2} \right)^2$$
where $\mu$ is the shape parameter of the GM function, $\bar{c}$ is the maximum accepted error of inliers, and $r$ is the Frobenius norm of the error between the training sequence and the approximation model, see Eq (3).
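One inner iteration of the weighted dictionary learning step can be sketched with NumPy as below. This is a minimal illustration, not the authors' implementation. It makes three labeled assumptions: the columns are scaled by the square root of the weights so that each squared residual is multiplied by its weight, the dictionary is refitted by regularized least squares, and the atoms are projected onto the unit ball. All names and the toy data are hypothetical.

```python
import numpy as np

def soft_threshold(Z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def fista_lasso(X, D, lam, n_iter=100):
    """Solve min_A 0.5 * ||X - D A||_F^2 + lam * ||A||_1 with FISTA."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    A = np.zeros((D.shape[1], X.shape[1]))
    Y, t = A.copy(), 1.0
    for _ in range(n_iter):
        A_next = soft_threshold(Y - D.T @ (D @ Y - X) / L, lam / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Y = A_next + ((t - 1.0) / t_next) * (A_next - A)
        A, t = A_next, t_next
    return A

def weighted_odl_step(X, D, w, lam):
    """Sparse coding, column weighting, then a weighted dictionary refit."""
    A = fista_lasso(X, D, lam)             # sparse coding with fixed D
    s = np.sqrt(w)                         # assumption: sqrt so residual^2 is scaled by w
    Xw, Aw = X * s, A * s                  # column-wise weighting of X and A
    G = Aw @ Aw.T + 1e-8 * np.eye(D.shape[1])
    D_new = np.linalg.solve(G, Aw @ Xw.T).T            # least-squares refit of the atoms
    D_new = D_new / np.maximum(np.linalg.norm(D_new, axis=0), 1.0)  # atoms in unit ball
    r = np.linalg.norm(X - D_new @ A, axis=0)          # per-column residuals
    return D_new, A, r

def cosine_to(direction, D):
    """Absolute cosine between a reference direction and the first atom."""
    return abs(direction @ D[:, 0]) / np.linalg.norm(D[:, 0])

# Four inlier frames along direction d and one strong outlier frame.
d = np.array([1.0, 0.0, 0.0])
X = np.column_stack([c * d for c in (1.0, 2.0, 3.0, 4.0)] + [np.array([0.0, 20.0, 0.0])])
D0 = np.array([[1.0], [1.0], [0.0]]) / np.sqrt(2.0)

D_uniform, _, _ = weighted_odl_step(X, D0, np.ones(5), lam=0.1)
D_robust, _, r_robust = weighted_odl_step(X, D0, np.array([1.0, 1.0, 1.0, 1.0, 1e-4]), lam=0.1)
print(cosine_to(d, D_uniform), cosine_to(d, D_robust))
```

With uniform weights the single atom is pulled towards the outlier frame, whereas a small weight on that frame keeps the atom aligned with the inlier direction. The residuals `r` returned by the step are the input of the subsequent weight update.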
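In the implementation, we start with the initialization $\mu_0 = 2 r_{\max}^2 / \bar{c}^2$, where $r_{\max}$ is the largest residual of the initial solution. At each outer iteration, we update $\mu \leftarrow \mu / 1.4$ and stop when $\mu$ is below 1, see [yang2020graduated].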
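The outer loop can be illustrated on a deliberately simple problem. The sketch below estimates a robust mean, which serves as a stand-in for the weighted ODL step. The schedule for `mu` (initial value from the largest residual, division by 1.4, stop below 1) and the closed-form weight update follow the GNC-GM recipe of [yang2020graduated]. The function name, the data and the threshold `c_bar` are hypothetical.

```python
def gnc_gm_robust_mean(samples, c_bar, mu_factor=1.4, max_outer=100):
    """GNC with the GM cost for a robust mean (stand-in for the weighted ODL step)."""
    weights = [1.0] * len(samples)            # all weights start at 1
    estimate = sum(samples) / len(samples)    # plain least-squares initial guess
    r_max_sq = max((x - estimate) ** 2 for x in samples)
    mu = 2.0 * r_max_sq / c_bar**2            # initial shape parameter
    for _ in range(max_outer):
        if mu < 1.0:                          # original GM function recovered
            break
        # Step 1: weighted least squares with fixed weights.
        estimate = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
        # Step 2: closed-form weight update with fixed estimate.
        weights = [
            (mu * c_bar**2 / ((x - estimate) ** 2 + mu * c_bar**2)) ** 2
            for x in samples
        ]
        mu /= mu_factor
    # Final refit with the last weights.
    estimate = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return estimate, weights

samples = [1.0, 1.1, 0.9, 1.05, 0.95, 10.0]   # five inliers and one outlier
estimate, weights = gnc_gm_robust_mean(samples, c_bar=0.5)
print(round(estimate, 3), [round(w, 3) for w in weights])
```

Starting from a nearly quadratic cost and sharpening it gradually lets the estimate drift away from the outlier before the outlier's weight collapses, which is the behavior GODL relies on when it filters corrupted skeletons.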
III-D Inference Phase
In the inference phase, we assume that the error between a sub-sequence and its model is normally distributed. Hence, the measured error $e$ between real-time skeleton frames of a fall-down action and the action unit model should fall within the confidence interval:
$$\bar{e} - \alpha \sigma \le e \le \bar{e} + \alpha \sigma$$
where $\bar{e}$ is the mean error of the training set, $\sigma$ is the standard deviation of the error, and $\alpha$ is an acceptance parameter.
Since the fall event has a strict order of sub-actions, from “standing” to “on the ground”, the detection of each sub-action is performed only after the previous sub-action has been detected.
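The acceptance test and the ordered detection can be sketched as a small state machine. The interval is taken as the training mean plus or minus an acceptance factor times the standard deviation. The class name, the use of the sample standard deviation and all numeric values are assumptions for illustration.

```python
from statistics import mean, stdev

ACTION_UNITS = ["standing", "bending knee", "opening arm", "knee landing", "arm supporting"]

def fit_interval(train_errors, alpha):
    """Acceptance interval: mean +/- alpha * (sample) standard deviation."""
    m, s = mean(train_errors), stdev(train_errors)
    return m - alpha * s, m + alpha * s

class SequentialDetector:
    """Tests action unit k only after action unit k-1 has been accepted."""

    def __init__(self, intervals):
        self.intervals = intervals
        self.stage = 0   # index of the next action unit to be detected

    def update(self, errors):
        """errors[k] is the error of the current frames w.r.t. action unit k."""
        if self.stage < len(self.intervals):
            low, high = self.intervals[self.stage]
            if low <= errors[self.stage] <= high:
                self.stage += 1
        return self.stage == len(self.intervals)   # True once all units passed

# Hypothetical training errors, reused for all five action units.
intervals = [fit_interval([0.4, 0.5, 0.6], alpha=2.0)] * len(ACTION_UNITS)
detector = SequentialDetector(intervals)

# A match for the third unit is ignored while "standing" has not been seen.
out_of_order = detector.update([5.0, 5.0, 0.5, 5.0, 5.0])
stage_after_out_of_order = detector.stage

# Matches arriving in the correct order complete the fall event.
results = []
for k in range(len(ACTION_UNITS)):
    errors = [5.0] * len(ACTION_UNITS)
    errors[k] = 0.5
    results.append(detector.update(errors))
```

Because the current stage is available at every frame, such a detector can report which step of the fall is in progress instead of only a final label.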
In addition to action unit extraction, the temporal feature of a fall is important as well. A fall is defined as an event that results in a person moving from a higher to a lower level, typically rapidly and without control. It follows from this definition that a fall is a rapid change of a person’s height within a very short time. To measure the height change, we do not need all the skeleton information. We only need the skeleton information in the $y$-direction, where $y$ is the skeleton coordinate along the y axis, $h$ is the height of the skeleton and $T$ is the width of the time interval that is shifted from the beginning of the video to its end. Since the first action unit is “standing”, we define its height as the initial value. The height change of a fall event inside a time interval must meet two conditions, whose thresholds are obtained through experiments.
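A sliding-window check of the height change could look like the sketch below. The two conditions used here are hypothetical stand-ins, since the exact conditions are not recoverable from the text: the height at the end of the window must fall below a fraction `max_ratio` of the initial standing height, and the drop inside the window must exceed a fraction `min_drop` of that height. The function names, the height definition and all thresholds are assumptions.

```python
def skeleton_height(y_coords):
    """Assumed height definition: vertical extent of the joints along the y axis."""
    return max(y_coords) - min(y_coords)

def detect_rapid_drop(heights, window, max_ratio, min_drop):
    """Return the first window start with a fast, large height drop, else None."""
    h0 = heights[0]   # initial height, taken from the "standing" unit
    for start in range(len(heights) - window + 1):
        h_start = heights[start]
        h_end = heights[start + window - 1]
        low_enough = h_end <= max_ratio * h0              # hypothetical condition 1
        fast_enough = (h_start - h_end) >= min_drop * h0  # hypothetical condition 2
        if low_enough and fast_enough:
            return start
    return None

# A fall: the height collapses within four frames.
fall = [1.7] * 11 + [1.4, 1.0, 0.6, 0.4] + [0.4] * 5
# Lying down: the same total height change, spread over twenty frames.
lie_down = [1.7 - 0.065 * i for i in range(21)]

fall_start = detect_rapid_drop(fall, window=5, max_ratio=0.4, min_drop=0.5)
lie_start = detect_rapid_drop(lie_down, window=5, max_ratio=0.4, min_drop=0.5)
print(fall_start, lie_start)
```

Both sequences end at the same height, so only the speed condition separates the fall from lying down, which is the temporal cue described above.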
IV Experiments and Results
In this section, we present the experimental results on the NTU RGB+D dataset [Shahroudy_2016_NTURGBD]. First, we introduce the dataset used for training and evaluation. Second, we show how the weight parameter evolves with an increasing number of iterations in the training phase and how the dimension of the action units influences the prediction performance. Finally, we compare our method with other well-performing dictionary learning methods on the same dataset.
The NTU RGB+D dataset [Shahroudy_2016_NTURGBD] is one of the largest datasets for human action recognition. It contains 60 action classes and 56,880 video samples together with their depth image frames. From the first setups, we successfully generate fall-down 3D skeleton examples, part of which are used for training and the rest for testing. For both training and testing, we randomly select subjects and camera views, including two side views, two diagonal views and a front view.
To distinguish the fall-down event from similar actions, we also generate sitting-down and ground-lift 3D skeleton examples and merge them into the test dataset; the remaining skeleton examples are taken from other actions.
As suggested for the dataset [Shahroudy_2016_NTURGBD], we use the cross-subject (CS) and cross-view (CV) criteria to compare our model with deep learning methods. In the CS evaluation, the subjects are split into training and testing groups. The IDs of the training subjects are 2, 3, 5, 7 and 8. The result of the CS evaluation is shown in Table I. In the CV evaluation, the samples of camera views 2 and 3 are used for training. Camera views 2 and 3 include one front view and two side views. The samples of camera view 1, which contains diagonal views, are used for testing.
IV-B Validating the effectiveness of GODL
To show that GODL is resistant to outliers, we record the change of the skeleton weights during training of the first action unit, see Fig 2 (a). It shows the weight trend of six skeleton examples in the first sub-sequence “standing” over the iterations of GODL. The weights of the two outlier skeletons decrease steeply, while the weights of the four inlier skeletons change more slowly. At the end of the iterations, the outliers are assigned low weights, whereas the inliers keep high weights. Fig 2 (b) shows the histogram of the weights and their cumulative distribution at the last iteration. Most skeletons have a high weight and therefore a greater impact on the cost function; these skeletons are considered inliers, and the remaining skeletons with lower weights are outliers.
Since the dimension of each action unit influences the prediction performance, we measure the recall under different settings and select the optimal dimension for each action unit. The dimension of a dictionary depends strongly on the complexity of its action unit. For example, the first three dictionaries have lower dimensions than the last two, because the action units “standing”, “bending knee” and “opening arm” are much simpler than “knee landing” and “arm supporting”. Before the optimal point is reached, the recall increases with the dimension, because the dictionary is too small to represent the action space. Beyond the optimal point, the recall decreases with the dimension because of overfitting. Fig 2 (c) presents the selection process for the dimension of each unit and the resulting optimal combination of dimensions.
IV-C Evaluation of fall-down using action unit and temporal structure
Table I: Accuracy (%), recall (%) and precision (%) of the compared dictionary learning methods. The best results of each class are in bold.
The evaluation results can be found in Table I. Compared to the other four state-of-the-art dictionary learning methods, our GODL model achieves the best accuracy and precision. In terms of recall, FDDL [yang2011fisher] and LRSDL [vu2016learning] both achieve good performance. However, their precision is lower than that of our method. Compared to the baseline ODL [mairal2010online], our method performs better in all aspects.
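For reference, the three reported metrics follow from the confusion counts of a binary fall / no-fall classifier, as in the sketch below. The labels are made-up examples and not results from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, recall and precision for binary labels (1 = fall, 0 = no fall)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # share of falls that are found
    precision = tp / (tp + fp) if tp + fp else 0.0     # share of alarms that are falls
    return accuracy, recall, precision

# Made-up labels: 4 falls and 6 other actions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
accuracy, recall, precision = binary_metrics(y_true, y_pred)
print(accuracy, recall, precision)
```

A method can reach a high recall by raising many alarms, which lowers its precision. This is why recall and precision are reported together with accuracy.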
To demonstrate the robustness of our method, we deliberately add noise to the training data and compare the performance of our method with the other methods. Fig 3 (a) shows the accuracy, precision and recall of the methods under different noise ratios. Although some methods surpass ours in recall and precision, our method stays at the same level as the noise increases and achieves the highest accuracy. This indicates that our method is more robust than the other four methods.
In addition to the dictionary learning methods, we also compare our result with state-of-the-art deep learning based fall-down detection methods. The comparison is shown in Table II. The deep learning based methods have slightly higher precision. However, our method yields more stable predictions. Besides, our method encodes spatio-temporal information, which makes it more explainable than the end-to-end deep learning methods. Compared to other spatio-temporal methods, our method can determine which step of the fall-down action is in progress, see Fig 3 (b). This is not possible with end-to-end deep learning methods, because they can only detect the fall-down action after the fall. For more real-time evaluation, please check the attached video material.
Table II: Accuracy (%) under the CS and CV criteria, compared with the deep learning based methods Biomechanic RNN [Xu_Zhou_2018] and Thinning DNN [Thinning_DNN]. The best results of each class are in bold.
In this paper, we have proposed a novel event detection method based on GODL, a robust method for extracting latent action units, and demonstrated it on the example of fall-down detection. The experiments were carried out on a public dataset. The proposed method outperforms existing well-performing dictionary learning methods in both robustness and average accuracy.
Compared to end-to-end deep learning methods, our method includes spatio-temporal information, which makes it more explainable. Compared to other spatio-temporal methods, our method can determine which step of the fall-down action is in progress, instead of only detecting a single fall-down action. In other words, our method contains implicit information about the progress of the action. Therefore, our method can not only detect a fall, but also has the potential to predict it early enough to help prevent it. This is very useful in scenarios such as health care. Most importantly, we attempt to approximate the action space with a more mathematical method.
In the future, we plan to apply the proposed method to the recognition of different actions on larger datasets.
We gratefully acknowledge the funding of the Lighthouse Initiative Geriatronics by StMWi Bayern (Project X, grant no. 5140951) and LongLeif GaPa GmbH (Project Y, grant no. 5140953).