Capturing and understanding human behavior through vision and RGB-D data is a popular and challenging research topic. Since the area is very broad, the proposed systems are often specialized according to the context, the specific application goals and the type of observed human activity. Depending on the task, the temporal granularity and the semantic abstraction level, different terms have been adopted to describe body movements, such as posture, gesture, action, interaction, behavior, and so on. In this paper, we focus on the recognition of dynamic body gestures for explicit Human-Computer Interaction (HCI), which can be defined as follows:
Dynamic: the target gesture requires a movement; thus we neglect static postures (e.g., sitting, reading a book);
Body: the target gesture is potentially performed using the whole body; thus we discard overly local gestures such as finger movements or facial expressions;
Gestures: a gesture is a well-defined and time-limited body movement; continuous actions such as running or walking are not considered;
Explicit HCI: we focus on gestures provided by a user who has spontaneously decided to interact with the system; thus, the gesture recognition implies a corresponding reaction or feedback at the end of each gesture.
Different methods have been proposed in the past to address this problem; currently, the most common solutions adopted in real-time applications combine a 3D sensor with a classifier based on Dynamic Time Warping (DTW) or a Hidden Markov Model (HMM) working on body joint positions. Even if more complex solutions have been proposed, DTW- and HMM-based systems are more widespread thanks to their simplicity, low computational requirements and scalability; moreover, even a limited training set is enough to reach good performance. Although the recent release of cheap 3D sensors, like Microsoft Kinect, brings new opportunities for detecting stable body points, a gesture recognition solution that is efficient, effective and simple at the same time is still far from being accomplished. Here we propose a new framework, which uses an on-line double-stage Multiple Stream Discrete Hidden Markov Model (double-stage MSD-HMM) to classify gestures. In the HCI context we address, users stand in front of the acquisition device and have the (intentional) need to exchange information with the system through their natural body language. Unlike other works that use standard HMMs and focus on defining and detecting ad hoc feature sets to improve performance, we focus our attention on the whole pattern recognition flow, from the new HMM architectural setting to its implementation details, in order to obtain a real-time framework with good performance in both speed and classification accuracy.
II Related Works
Thanks to the spread of low-cost 3D sensors, research on gesture recognition from RGB-D data has attracted growing interest. In particular, the availability of reasonably accurate streams of 3D body joint positions [Shotton11] allows body posture or gesture recognition directly from skeleton data, without the need to work on the source depth map. However, 3D joint positions can be affected by noise and estimation errors in the presence of occlusions, and depth maps intrinsically include richer information. Neural networks and deep architectures are able to extract good features and to correctly classify actions and gestures from depth maps with impressive performance when large annotated datasets are available [WangLGZTO15]. Hybrid methods merge skeleton and depth features in order to get the advantages of both and increase system performance. To this aim, solutions based on Random Forests [Zhu_2013_CVPR_Workshops] or multi-kernel learning [Althloothi2014, Wu12] have been proposed. However, although methods based on skeleton data only may be affected by errors and noise, their computational requirements are usually lower, which makes them more suitable for real-time systems. Thus, in this work we focus on a solution based on 3D joint positions only. Following the same assumption, Evangelidis et al. [Evangelidis-ICPR-2014] propose a local, compact and view-invariant skeletal feature, the skeletal quad. Chaudhry et al. [Chaudhry_2013_CVPR_Workshops] propose a bio-inspired method to create discriminative 3D skeletal features. In [Eweiwi2015] action recognition is performed through skeletal pose-based features, built on joint location, velocity and correlation data. Xia et al. [xia2012view] use histograms of 3D joint locations to perform classification through discrete HMMs. HMMs have been widely used due to their low computational load and parallelization capabilities. [Wu2014] also suggests an approach based on HMMs in which the state emission probability is modelled with neural networks, which to some extent nullifies the advantages of HMMs. Finally, very few works [AleksicK06, 08sch16] have proposed discrete weighted multiple stream HMMs, and only for facial expression and handwriting recognition respectively, not for gesture recognition.
III Offline Gesture Classification
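The core part of the proposal is a probabilistic solution for the classification of a pre-segmented clip containing a single gesture (offline classification). Given a set of $C$ gesture classes $\Lambda = \{\lambda_1, \dots, \lambda_C\}$, we aim at finding the class $\lambda^*$ which maximizes the probability $P(\lambda \mid O)$, where $O = \{o_1, \dots, o_T\}$ is the entire sequence of frame-wise observations (i.e. the features). A set of HMMs, each trained on a specific gesture, is the typical solution adopted for this kind of classification problem [rabiner]. The classification of an observation sequence $O$ is carried out by selecting the model $\lambda^*$ whose likelihood is highest. If the classes are a priori equally likely, this solution is also optimal in a Bayesian sense:

$$\lambda^* = \arg\max_{1 \le c \le C} P(O \mid \lambda_c) \tag{1}$$

If the decoding of the internal state sequence is not required, the standard recursive forward algorithm for HMMs, with the three well-known initialization, induction and termination equations, can be applied:

$$\begin{aligned}
\alpha_1(j) &= \pi_j \, b_j(o_1) \\
\alpha_{t+1}(j) &= \Big[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \Big] \, b_j(o_{t+1}) \\
P(O \mid \lambda) &= \sum_{j=1}^{N} \alpha_T(j)
\end{aligned} \tag{2}$$

where $\pi_j = P(q_1 = S_j)$ is the initial state distribution, $\alpha_t(j) = P(o_1, \dots, o_t, q_t = S_j \mid \lambda)$ is the forward variable, matrix $A = \{a_{ij}\}$ describes the transition for hidden states $S = \{S_1, \dots, S_N\}$ ($q_t$ is the current state), and $b_j(o_t)$ depends on the type of the observation and defines the emission state probabilities.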
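The forward recursion and the maximum-likelihood decision can be sketched in a few lines. The following is a minimal illustration for a single-stream discrete HMM, not the authors' C++ implementation; the function names, the array layout and the per-frame rescaling (a standard remedy against numerical underflow) are assumptions introduced here.

```python
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """Scaled forward algorithm for a discrete HMM (illustrative sketch).

    pi:  (N,) initial state distribution
    A:   (N, N) transitions, A[i, j] = P(q_{t+1} = j | q_t = i)
    B:   (N, V) emissions, B[j, v] = P(o_t = v | q_t = j)
    obs: sequence of integer symbols in [0, V)
    Returns log P(obs | model).
    """
    alpha = pi * B[:, obs[0]]                      # initialization
    log_lik = 0.0
    for o in obs[1:]:
        scale = alpha.sum()                        # rescale to avoid underflow
        log_lik += np.log(scale)
        alpha = ((alpha / scale) @ A) * B[:, o]    # induction
    return log_lik + np.log(alpha.sum())           # termination

def classify(models, obs):
    """Return the index of the model with the highest likelihood.

    models is a list of (pi, A, B) tuples; equal class priors are assumed.
    """
    scores = [forward_log_likelihood(pi, A, B, obs) for pi, A, B in models]
    return int(np.argmax(scores))
```

Working with the sum of log scale factors instead of the raw product keeps the likelihood representable even for long clips, and it leaves the argmax over models unchanged.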
A common solution is based on Gaussian Mixture Models (GMMs) learned on a feature set composed of continuous values [VezzaniIcip09]. The term $b_j(o_t)$ of Eq. 2 would be approximated as:

$$b_j(o_t) = \sum_{l=1}^{M} c_{jl} \, \mathcal{N}(o_t \mid \mu_{jl}, \Sigma_{jl}) \tag{3}$$

where $M$ is the number of Gaussian components per state, $\mu_{jl}$ and $\Sigma_{jl}$ are the Gaussian parameters, and $c_{jl}$ are the mixture weights.
III-A Feature Set
We assume that the body joint 3D coordinates are already available as input [Shotton11]. In this work, we exploit a simple feature set directly derived from the joint stream, discriminative enough to obtain reasonable classification rates. Additional features may be included without changing the classification schema, but at the cost of higher computational complexity and, due to the curse of dimensionality, lower overall efficacy. Thus, only nine features are extracted for each selected body joint. Given the sequence of 3D positions of the joint $i$, we define the feature set $f_i^t$ at frame $t$ as:

$$f_i^t = \{ x_i^t, y_i^t, z_i^t, \dot{x}_i^t, \dot{y}_i^t, \dot{z}_i^t, \ddot{x}_i^t, \ddot{y}_i^t, \ddot{z}_i^t \} \tag{4}$$

where $x_i^t, y_i^t, z_i^t$ are the joint positions with respect to a reference joint selected to make the feature set position invariant; $\dot{x}_i^t, \dot{y}_i^t, \dot{z}_i^t$ and $\ddot{x}_i^t, \ddot{y}_i^t, \ddot{z}_i^t$ are the speed and acceleration components of the joint respectively. This approach is inspired by [Eweiwi2015, Zhu_2013_CVPR_Workshops]. A linear normalization is applied to map the feature values to a fixed range. Selecting a subset of $G$ body joints (or the whole set), the complete feature set for the frame $t$ is obtained as a concatenation of $9G$ features. Thanks to the limited dependencies among the features, fast and parallel computation is possible.
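A possible way to compute such a nine-dimensional descriptor per joint is sketched below. Several details are assumptions made only for illustration: speed and acceleration are approximated with backward finite differences (zero at the first frame), the normalization bounds are estimated from training data, and the target range is taken to be [0, 1]. The function names are hypothetical.

```python
import numpy as np

def joint_features(positions, ref_idx, joint_idx):
    """Per-frame features for the selected joints (illustrative sketch).

    positions: (T, J, 3) array of 3D joint coordinates
    ref_idx:   index of the reference joint (position invariance)
    joint_idx: list of selected joint indices
    Returns a (T, 9 * len(joint_idx)) array laid out per joint as
    [x, y, z, vx, vy, vz, ax, ay, az].
    """
    positions = np.asarray(positions, dtype=float)
    rel = positions[:, joint_idx, :] - positions[:, [ref_idx], :]
    # Assumption: backward differences stand in for speed and acceleration.
    vel = np.diff(rel, axis=0, prepend=rel[:1])
    acc = np.diff(vel, axis=0, prepend=vel[:1])
    feats = np.concatenate([rel, vel, acc], axis=2)   # (T, G, 9)
    return feats.reshape(len(positions), -1)

def fit_minmax(train_feats):
    """Estimate per-feature offset and span from training features."""
    lo = train_feats.min(axis=0)
    hi = train_feats.max(axis=0)
    return lo, np.where(hi > lo, hi - lo, 1.0)        # guard constant features

def normalize(feats, lo, span):
    """Linear normalization to [0, 1] (assumed range), clipped at the bounds."""
    return np.clip((feats - lo) / span, 0.0, 1.0)
```

Each joint is processed independently of the others, which is what makes the per-joint computation easy to run in parallel.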
III-B Multiple Stream Discrete HMM
The computation of the exponential terms included in Eq. 3 is time-consuming. Moreover, the Gaussian parameters may lead to degenerate cases when learned from few examples. In particular, in the case of constant features the corresponding variances become zero. Even if some practical tricks have been proposed for these cases [rabiner], overall performance generally decreases when few input examples are provided in the learning stage. For these reasons, we propose to adopt discrete HMMs. All continuous observations are linearly quantized to $L$ discrete values. The adoption of discrete distributions to model the output probability solves the above-mentioned numerical issues. In addition, to improve the generalization capabilities of the HMM with limited samples for each state, we adopt a set of $K$ independent distributions, one for each feature item, called streams; the term $b_j(o_t)$ of Eq. 3 is replaced with the following one:

$$b_j(o_t) = \prod_{k=1}^{K} \left[ h_j^k(o_t^k) \right]^{w_k} \tag{5}$$

where $h_j^k(o_t^k)$ is the emission probability of the discrete value $o_t^k$ in the state $S_j$ and $w_k$ are weighting terms.
The observation of each stream is thus statistically independent from the others given the current state $q_t$.
The weight coefficients $w_k$ may be used to take into account the different classification capability of each joint and each feature component (see Eq. 4).
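The quantization step and a weighted multi-stream emission term can be illustrated as follows. This is a sketch under stated assumptions: features are taken to be already normalized to [0, 1], the streams are combined as a weighted product of per-stream discrete probabilities (evaluated in the log domain), and all table entries are strictly positive, for instance after smoothing. The names `quantize` and `log_emission` are hypothetical.

```python
import numpy as np

def quantize(feats, levels):
    """Linearly quantize values in [0, 1] to integers in [0, levels)."""
    q = (np.asarray(feats, dtype=float) * levels).astype(int)
    return np.minimum(q, levels - 1)      # the value 1.0 falls in the last bin

def log_emission(H, symbols, weights):
    """Weighted multi-stream log emission for every hidden state.

    H:       (N, K, L) tables, H[j, k, v] = P(stream k emits v | state j),
             assumed strictly positive
    symbols: (K,) quantized observation of one frame
    weights: (K,) stream weights
    Returns an (N,) array with sum_k weights[k] * log H[j, k, symbols[k]].
    """
    K = H.shape[1]
    per_stream = np.log(H[:, np.arange(K), symbols])   # (N, K)
    return per_stream @ np.asarray(weights, dtype=float)
```

Evaluating the emission as a table lookup followed by a weighted sum avoids exponentials at run time, in contrast with the Gaussian mixture term.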
IV Double-Stage Classification
The previously described approach does not take into account that some parts of the body may be more significant than others for a given gesture, since the same feature set is provided as input to all the HMMs. For example, the gesture “hello” can be better recognized if the hand joints are analyzed alone, without other distracting joints. However, the responses of HMMs working on different feature sets are not comparable. Following this observation, a double-stage classification system based on MSD-HMM is proposed, as outlined in Fig. 1. Gestures are grouped into sets depending on the subset of body joints involved. The first classification stage recognizes the gesture group (i.e., discovers which part of the human body is most involved in the gesture), while the second stage provides the final classification among the gestures in the selected group. The MSD-HMMs of the first stage work on a global subset of joints extracted from the whole body, while the MSD-HMMs of the second stage are specific to the body part involved. In particular, the global subset of joints (used to compute the features of Eq. 4) contains the left and right foot, the left and right hand and the head. Other joints, like shoulders or knees, are instead used in the second stage. The gesture groups of the second stage correspond to the partition reported on the left part of Fig. 1. Four local areas, i.e., Right Upper Part (RUP), Left Upper Part (LUP), Right Lower Part (RLP) and Left Lower Part (LLP), are first defined. Then, four additional macro-areas are created as combinations of local areas, i.e., Upper Part (UP), Bottom Part (BP), Right Part (RP) and Left Part (LP). Eight gesture clusters are correspondingly defined based on the main body part involved. A different set of stream weights (see Eq. 5) is computed for each HMM using the average motion of each joint: the higher the motion of a body part in a gesture, the higher the corresponding stream weights.
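The two-stage decision and a motion-based weighting can be sketched as below. This is only an illustration of the idea: the scoring callables stand for the log-likelihood of a trained model and are hypothetical, motion is measured here per stream as the mean absolute frame-to-frame change, and the weights are rescaled to have mean 1. None of these choices is confirmed by the text.

```python
import numpy as np

def motion_weights(train_clips, eps=1e-6):
    """Stream weights proportional to the average motion (assumed definition).

    train_clips: list of (T, K) feature arrays of one gesture class.
    Returns (K,) weights with mean 1; streams that move more weigh more.
    """
    motion = np.mean(
        [np.abs(np.diff(c, axis=0)).mean(axis=0) for c in train_clips], axis=0
    ) + eps
    return motion / motion.mean()

def classify_two_stage(clip, stage1, stage2, group_of):
    """Two-stage classification (illustrative sketch).

    stage1:   dict gesture -> scoring function on global-joint features
    stage2:   dict group -> dict gesture -> scoring function on group features
    group_of: dict gesture -> group name (e.g. "RUP", "LP")
    Returns (group, gesture).
    """
    # Stage 1: the best-scoring model only selects the body-part group.
    best_global = max(stage1, key=lambda g: stage1[g](clip))
    group = group_of[best_global]
    # Stage 2: models specialized on that group take the final decision.
    candidates = stage2[group]
    return group, max(candidates, key=lambda g: candidates[g](clip))
```

Scores are only compared among models that share a feature set, within stage 1 or within one group of stage 2, which sidesteps the comparability issue mentioned above.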
V Online Temporal Segmentation and Classification of Gestures
Unlike offline gesture classification, online recognition requires estimating the most likely gesture currently performed by the monitored subject, given only the observations available so far. The observed sequence may contain several gesture instances, and the current gesture may still be in progress. The proposed system is able to detect gestures on-line, overcoming classical static approaches that rely on strong prior hypotheses. For example, common sliding window methods implicitly apply a strong constraint on the average and maximum duration of each gesture. Since HCI is often characterized by noise and a high variability in gesture duration, the proposed solution is not based on fixed-length rigid windows. Using HMMs, the temporal evolution of a gesture is correlated with the hidden state probability. In particular, when a left-right transition model [rabiner] is adopted, the first and last states of the chain can be exploited to detect the temporal boundaries of each gesture. The following rules are applied to the first-stage classification layer of Fig. 1:
Beginning detection: the beginning of a gesture candidate is detected by analyzing the first hidden state of each HMM. The adoption of a left-right transition model is mandatory. The most likely state of each HMM will be the first one during non-gesture times, while the following states will be activated once the gesture starts. A voting mechanism is exploited, counting how many HMMs satisfy the following condition:
where is the total number of hidden states in HMMs and is defined in Eq. 5. If a large number of models satisfy Eq. 6, a gesture is being performed. The threshold on checks the probability to be still at the instant in the first hidden state.
End detection: if a gesture is currently being performed (as detected by the previous rule), the most likely gesture is computed at each frame as in Eq. 1. A probability distribution analysis of the last state of the corresponding HMM is performed in order to detect the end of the gesture:
where is the total number of hidden states and is a threshold on , the probability to be at the instant in the last (-th) state after observations .
Reliability check: this rule is applied to filter out false candidates (for example, invalid or incomplete gestures). The sequence of hidden state probabilities obtained with the selected HMM on the observation window is analyzed. Starting from the set of hidden states, the subset is defined as follows:
The following checks are performed to validate the candidate gesture:
where is the cardinality of the set . The previous checks guarantee that at least of hidden states (including the second-to-last) are visited. A state is defined as visited if there is an instant characterized by a high probability () to be in that state.
Once the first stage classification layer provides a valid temporal segmentation and the corresponding simultaneous gesture classification, the second-stage layer is exploited to refine the classification.
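The three rules can be pictured with the sketch below, which works on normalized per-frame state probabilities. It is not the authors' formulation: the exact conditions and threshold values are not reproduced here, so the inequality directions, the default thresholds (`th_begin`, `th_end`, `th_visit`), the vote ratio and the minimum fraction of visited states are all placeholders chosen for illustration.

```python
import numpy as np

def filter_step(alpha, A, b):
    """One normalized forward step: P(q_t = j | o_1..o_t) for each state j.

    alpha: (N,) state probabilities at the previous frame
    A:     (N, N) left-right transition matrix
    b:     (N,) emission probabilities of the current observation
    """
    alpha = (alpha @ A) * b
    return alpha / alpha.sum()

def gesture_started(first_state_probs, th_begin=0.5, min_votes=0.5):
    """Voting across models (assumed rule): a model votes for a start when
    the probability of still being in its first state drops below th_begin."""
    votes = np.mean(np.asarray(first_state_probs) < th_begin)
    return bool(votes >= min_votes)

def gesture_ended(last_state_prob, th_end=0.9):
    """Assumed rule: the gesture ends when the last state becomes very likely."""
    return bool(last_state_prob > th_end)

def reliable(state_probs, th_visit=0.5, min_fraction=0.5):
    """Reliability check on a (T, N) window of state probabilities.

    A state counts as visited if its probability exceeds th_visit at some
    frame; the candidate is kept when the second-to-last state and at least
    min_fraction of all states were visited.
    """
    visited = (np.asarray(state_probs) > th_visit).any(axis=0)
    return bool(visited[-2] and visited.mean() >= min_fraction)
```

In use, `filter_step` would be called once per frame for every first-stage model, and the three predicates would drive a small start, end and validate loop over the incoming stream.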
VI Experimental Results
We test our system on three well-known public datasets, MSRAction3D, MSRDailyActivity3D and UTKinect-Action, as well as on a new custom dataset (Kinteract Dataset) that we developed for HCI.
VI-A MSRAction3D Dataset
The MSRAction3D dataset [Li10] contains 20 action classes performed by 10 subjects: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two hand wave, side-boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing and pickup & throw. In total there are 567 action instances. This dataset fits our method because it involves no human-object interaction and contains only 3D joint positions. It is one of the most used datasets for the human action recognition task, but some issues about the validation methods are reported in [Padilla-Lopez14]: the total number of samples used for training and testing is not clear, because some 3D joint coordinates are null or highly noisy. To avoid ambiguity, the complete list of valid videos we used is publicly available at http://imagelab.ing.unimore.it/hci. The validation phase has been carried out following the original proposal by Li et al. [Li10]. Three tests are performed: in the first two tests, 1/3 and 2/3 of the samples, respectively, are used for training and the rest for testing; in the third test, half of the subjects are used for training (subjects 1, 3, 5, 7, 9) and the remainder for testing (subjects 2, 4, 6, 8, 10).
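The cross-subject protocol of the third test amounts to a split by subject parity. A minimal sketch, assuming each sample is stored as a hypothetical `(subject_id, action_id, data)` tuple:

```python
def cross_subject_split(samples):
    """Split samples by subject: odd subject ids train, even ids test.

    samples: iterable of (subject_id, action_id, data) tuples
             (a hypothetical layout chosen for this example).
    Returns (train, test) lists.
    """
    samples = list(samples)
    train = [s for s in samples if s[0] % 2 == 1]   # subjects 1, 3, 5, 7, 9
    test = [s for s in samples if s[0] % 2 == 0]    # subjects 2, 4, 6, 8, 10
    return train, test
```

Since no subject appears on both sides of the split, the reported accuracy reflects generalization to unseen performers.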
VI-B UTKinect-Action Dataset
The UTKinect-Action dataset [xia2012view] contains 10 action classes performed by 10 subjects: walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands, clap hands; each subject performed every action twice. There are 200 action sequences. This dataset is challenging due to high intra-class variations and variations in the view point. Moreover, because of the low frame rate and the short duration of some actions, some sequences are very short. The validation method proposed in [xia2012view] has been adopted, using a leave-one-sequence-out cross-validation.
VI-C MSRDailyActivity3D Dataset
The MSRDailyActivity3D dataset [Wu12] contains daily actions captured by a Kinect device. There are 16 activity classes performed twice by 10 subjects: drink, eat, read book, call cellphone, write on a paper, use laptop, use vacuum cleaner, cheer up, sit still, toss paper, play game, lay down on sofa, walk, play guitar, stand up, sit down. Generally, each subject performs the same activity once standing and once sitting on the sofa. There is a total of 320 long activity sequences. Since our proposed method is specifically conceived for dynamic gestures (see Sec. I), still actions (e.g., read book, sit still, play game and lay down on sofa) have been removed during the evaluation. We follow the cross-subject test proposed in [Wu12, Zanfir].
VI-D Kinteract Dataset
In addition, we collected a new, publicly available dataset, which has been explicitly designed and created for HCI. Ten common types of HCI gestures have been defined: zoom in, zoom out, scroll up, scroll down, slide left, slide right, rotate, back, ok, exit. For example, these gestures can be used to control an application using the Kinect sensor as a natural interface. Gestures are performed by 10 subjects, for a total of 168 instances, and are acquired while the subject stands in front of a stationary Kinect device. Only the upper part of the body (from shoulders to hands) is involved. Each gesture is performed with the same arm by all the subjects, regardless of whether they are left- or right-handed. This dataset highlights the advantages of our solution in a real-world context.
The output of the first-stage HMMs can be directly used for classification (using argmax instead of max in Fig. 1). We report the corresponding performance as a reference for the final double-stage one. The number of HMM hidden states is set to 8. The number of quantization levels (see Sec. III-A) has been empirically selected on the MSRAction3D dataset. As reported in Fig. 3, the best value is . Finally, all stream weights are initialized to 1 by default.
Table I contains an internal comparison of the system on the MSRAction3D dataset, based on the cross-subject evaluation method. The complete system is compared with baseline solutions where the feature normalization (FN) and the stream weight (WMS) steps are not performed.
|System Part||First Stage||Second Stage|
|Base + FN||0.845||0.873|
|Base + FN + WMS||0.861||0.905|
|FN: Feature Normalization; WMS: Weighted Multiple Streams|
Table II reports the performance of state-of-the-art methods on the same dataset; only skeleton-based methods are listed. Results show that our method is in line with the state of the art. The approach of [Eweiwi2015] provides better results but is much more expensive in terms of feature extraction and classification time, as described below.
|HMM + DBM [Wu2014]||-||-||0.820|
|HMM + GMM (our implementation)||0.861||0.929||0.825|
|Actionlet Ensemble (skeleton data) [Wu12]||-||-||0.882|
|Skeletal Quads [Evangelidis-ICPR-2014]||-||-||0.898|
Results on the UTKinect-Action dataset are reported in Table III: we ran the experiment 20 times and report the mean accuracy; the best accuracy obtained in a single run is 0.965. Table IV reports results for the MSRDailyActivity3D dataset. Even if the comparison is not completely fair, since we considered a subset of gestures and we only use skeleton data, the reported performance confirms the generalization capability of our method as well as its efficacy on long and generic actions. The proposed system has an overall accuracy of 0.974 on the Kinteract dataset in a cross-subject test. Finally, we test the online temporal segmentation. Instances belonging to the MSRAction3D and Kinteract datasets are merged into a continuous stream. The corresponding results are reported in Table V.
|FSFJ3D (skeleton data)[Zhu_2013_CVPR_Workshops]||0.879|
The threshold used in the analysis of the state probability distribution is (see Eq. 6). A temporal segment is valid only if it has an Intersection over Union (IoU) with a ground truth instance larger than the overlap threshold (); the ground truth instance with the greatest IoU defines the action class that has to be recognized. Results show that our temporal segmentation method can be used in a real-world framework. We implemented the proposed system in C++ and tested it on a computer with an Intel i7-4790 (3.60 GHz) and 16 GB of RAM. The framework is able to extract and calculate features, perform feature quantization, evaluate stream weights and classify a single action with an average time of (tested on the MSRAction3D dataset with 20 HMMs for each stage). Performance results with respect to the number of HMMs (for each stage) are reported in Fig. 4. The total number of HMMs of each stage corresponds to the potential number of recognizable gesture classes. Results are normalized by the mean gesture length (41 frames). The complete online system runs at about 80.35 frames per second when trained on the MSRAction3D dataset. Its use in real-time systems is thus guaranteed, unlike other recent state-of-the-art solutions [BMVC.23.124:abbreviated, Fanello2013]; in particular, the feature extraction and classification time for a single action in [Eweiwi2015] is about 300 ms.
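The IoU-based validation of a detected segment can be written compactly. The sketch below assumes that segments are frame intervals with an exclusive end and that ground truth instances are hypothetical `(start, end, label)` tuples; the overlap threshold `min_iou` is a parameter whose value is not specified here.

```python
def temporal_iou(a, b):
    """Intersection over Union of two frame intervals (start, end),
    with the end frame excluded."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_segment(detected, ground_truth, min_iou):
    """Return the label of the ground truth instance with the greatest IoU,
    or None when that IoU is not larger than min_iou.

    detected:     (start, end)
    ground_truth: list of (start, end, label) tuples
    """
    best = max(ground_truth,
               key=lambda g: temporal_iou(detected, g[:2]),
               default=None)
    if best is None or temporal_iou(detected, best[:2]) <= min_iou:
        return None
    return best[2]
```

A detection matched this way counts towards the detection rate, and it counts towards the recognition rate only if the predicted class also equals the returned label.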
|Moving Pose [Zanfir]||0.738|
|Actionlet Ensemble [Wu12]||0.845|
|Dataset||Detection Rate||Recognition Rate|
In this paper, we investigate and improve the use of an on-line double-stage Multiple Stream Discrete HMM (MSD-HMM) in the action and gesture recognition fields, where HMMs are widely used due to their implementation simplicity, low computational requirements, scalability and high parallelism. We also demonstrate that HMMs can be successfully used for gesture classification with good performance even with a limited training set. Thanks to a double-stage classification based on MSD-HMM, our system can both classify gestures quickly and perform online temporal segmentation, with strong generalization capability. Results are in line with the state of the art on three public and challenging datasets. Before this work, HMM-based methods were not able to compete with other state-of-the-art methods in the gesture classification task. Moreover, a new gesture dataset, the Kinteract dataset, explicitly designed and created for HCI, is introduced.