What's the point? Frame-wise Pointing Gesture Recognition with Latent-Dynamic Conditional Random Fields

10/20/2015
by   Christian Wittner, et al.
KIT



Abstract

We use Latent-Dynamic Conditional Random Fields to perform skeleton-based pointing gesture classification at each time instance of a video sequence, where we achieve a frame-wise pointing accuracy of roughly 83%. Subsequently, we determine continuous time sequences of arbitrary length that form individual pointing gestures and this way reliably detect pointing gestures at a false positive detection rate of 0.63%.

1 Introduction

Pointing gestures are a fundamental aspect of non-verbal human-human interaction, where they are often used to direct the conversation partner’s attention towards objects and regions of interest – an essential means to achieve a joint focus of attention. As a consequence, reliable detection and interpretation of pointing gestures is an important aspect of natural, intuitive Human-Computer Interaction (HCI) and Human-Robot Interaction (HRI).

In this paper, we use Latent-Dynamic Conditional Random Fields (LDCRFs) to perform pointing gesture detection and frame-wise segmentation based on skeleton data such as the joint data provided by a Kinect. To this end, we use the LDCRF to label each frame of a video sequence and subsequently determine continuous time sequences of arbitrary length that form individual pointing gestures. An important advantage of this approach is that we can detect pointing gestures while they are being performed and do not have to wait for a person to perform and complete the whole pointing action. This enables us to react to a pointing gesture as it is performed – an important aspect considering that natural HRI is our target application (see, e.g., [1, 2]). For example, this way our robot is able to direct its head toward the coarse target area, thus providing visual feedback for the pointing person, while the pointing person is still adjusting and/or fine-tuning the pointing direction. We evaluate the performance of our pointing gesture detection method on a novel dataset: we used a Microsoft Kinect to record a diverse set of pointing gestures performed by 18 subjects, which enables us to evaluate person-independent pointing gesture detection.

Figure 1: Pointing gestures are an essential aspect of human communication and are frequently used in various situations such as speeches.

2 Related Work

One of the first systems to recognize pointing gestures was Bolt’s “Put that there” system [3], which enabled users to point at and interact with objects while giving voice commands. However, the system required the user to wear a Polhemus Remote Object Position Attitude Measurement System (ROPAMS) device on the wrist. Systems that avoid specialized wearable devices nowadays mostly use the data provided by stereo or depth cameras as a basis for (pointing) gesture recognition. Here, stereo cameras or depth cameras (e.g., the Microsoft Kinect or the Mesa SwissRanger) are used to acquire depth image sequences, which allows for simpler gesture recognition in situations with, for example, occlusions and/or difficult lighting conditions.

Gesture recognition for Human-Computer Interaction has been an active research area for many years. Accordingly, there exist several surveys that provide a detailed overview of early and recent research developments [4, 5, 6, 7, 8, 9]. In the following, we focus on depth-based sequential gesture recognition approaches and, for example, do not discuss non-sequential approaches (e.g., [10, 11, 12, 13, 14, 15]).

Hidden Markov Models (HMMs) and stochastic grammars are widely used models for speech and gesture recognition. The Hidden Markov Model is a generative model that includes hidden state structure. Gesture recognition applications using Hidden Markov Models can be found in Chen et al. [16], Yang et al. [17], Zafrulla et al. [18], and Vogler et al. [19]. The popular approach by Nickel and Stiefelhagen [20] uses a stereo camera to detect pointing gestures. A color-based detection of hands and face is combined with a clustering of the found regions based on depth information, enabling the tracking of 3D skin color clusters. Hidden Markov Models are trained on different states (begin, hold, end) of sample pointing gestures and used to detect the occurrence of pointing gestures. An additional head orientation feature is used to improve recall and precision of pointing gestures. The authors claim to achieve 90% accuracy identifying targets using the head-hand line, but no accuracy for the pointing gesture detection has been reported.

In Park et al. [21], face and hand tracking is performed using 3D particle filters after initially having detected those body parts based on skin color. To account for small pointing gestures (where the pointing arm is not fully extended), the hand positions are mapped onto an imaginary hemisphere centered around the shoulder of the pointing arm before estimating the pointing direction. This is done by a first-stage HMM that is used to retrieve more accurate hand positions. In a second stage, these hand positions are fed into three HMMs (for the three states Non-Gesture, Move-To, and Point-To) in order to detect a pointing gesture. In case of a pointing gesture, the pointing direction is estimated.

Droeschel et al. [22] focus on pointing gestures where the person does not look into the target direction. Body features are extracted from the depth and amplitude images delivered by a Time-of-Flight camera. HMMs are used to detect a pointing gesture which is segmented into the three phases “preparation”, “hold” and “retraction”. The HMMs are trained with features such as the distance from the head to the hand, the angle between the arm and the vertical body and the velocity of the hand. To estimate pointing directions, a pointing direction model is trained using Gaussian Process Regression, which leads to a better accuracy than simple criteria like head-hand, shoulder-hand or elbow-hand lines (see [20]).

Despite their popularity, HMMs have some considerable limitations. An HMM is a generative model that assumes a joint probability over observation and label sequences. The model needs to enumerate all possible observation sequences, in which each observation is required to be an atomic entity. Thus, representing long-range dependencies between observations or interacting features is computationally intractable.

The need for a richer representation of observations (e.g., with overlapping features) has led to the development of Maximum Entropy Markov Models (MEMMs) [23]. A Maximum Entropy Markov Model (MEMM) is a model for sequence labeling that combines features of HMMs and Maximum Entropy models. It is a discriminative model that extends a standard maximum entropy classifier by assuming that the values to be learned are connected in a Markov chain rather than being conditionally independent of each other. It represents the probability of reaching a state given an observation and the previous state. Sung et al. [24] implemented detection and recognition of unstructured human activity in unstructured environments using hierarchical MEMMs. Sung et al. use different features describing body pose, hand position, and motion extracted from a skeletal representation of the observed person.

A major shortcoming of MEMMs and other discriminative Markov models based on directed graphical models is the “label bias problem”, which can lead to poorer performance compared to HMMs. This problem has been addressed by Lafferty et al.’s Conditional Random Fields (CRFs) [25]. A conditional model like the Conditional Random Field specifies the probabilities of label sequences for a given observation sequence. Features of the observation sequence do not need to be independent; they may represent attributes at different levels of granularity and combine several properties of the same observation. Past and future observations may be considered to determine the probability of transitions between labels. Lafferty et al. [25] and Sminchisescu et al. [26] demonstrate how CRFs outperform both HMMs and MEMMs. While Lafferty et al. [25] use synthetic data and real Part-Of-Speech (POS) tagging data, Sminchisescu et al. [26] apply CRFs to the recognition of human motions. For this purpose, Sminchisescu et al. use a combination of 2D features from image silhouettes and 3D features. They show that the CRF’s performance based on 3D joint angle features is more accurate with long-range dependencies and that CRFs improve the recognition performance over MEMMs and HMMs.

CRFs can model the transitions between gestures (extrinsic dynamics) but are not able to represent internal sub-structure. This led to the development of Hidden Conditional Random Fields, which incorporate hidden state variables that model the sub-structure of a gesture sequence. Wang et al. [27] combine the two main advantages of current approaches to gesture recognition: the ability of CRFs to use long-range dependencies and the ability of HMMs to model latent structure. One single joint model is trained for all gestures to be classified, and hidden states are shared between those gestures. According to Wang et al. [27], the Hidden Conditional Random Field (HCRF) model outperforms HMMs and CRFs in the classification of arm gestures. However, since HCRFs are trained on pre-segmented sequences of data, they only capture the internal structure but not the dynamics between gesture labels. To overcome this limitation, Morency et al. [28] introduced Latent-Dynamic Conditional Random Fields (LDCRFs), which combine the strengths of CRFs and HCRFs. LDCRFs are able to capture extrinsic dynamics as well as intrinsic substructure and can operate directly on unsegmented data. The performance of LDCRFs was tested on head and eye gesture data from three different datasets and the results were compared to state-of-the-art generative and discriminative modeling techniques such as Support Vector Machines (SVMs), HMMs, CRFs, and HCRFs. The results show that LDCRFs outperform all other methods.

3 Method

In the following, we use the Microsoft Kinect and OpenNI’s NITE [29] module to obtain a skeleton of the person in front of the Kinect. We then use the joint data over time to detect the occurrence of pointing gestures. Since the Kinect provides relatively stable joint tracks (e.g., it is quite robust against illumination changes), we focus on pointing gesture detection and do not address, e.g., noisy body part detections. However, our trained model can be applied to other sensing methods (e.g., stereo cameras) if comparably stable joint tracks are provided.

3.1 CRF

Conditional Random Fields (CRFs) are a framework for building probabilistic models to segment and label sequence data [25]. For this purpose, CRFs define an undirected graph $G = (V, E)$ with the random variable $\mathbf{x} = (x_1, \dots, x_T)$ representing the input observation sequence and $\mathbf{y} = (y_1, \dots, y_T)$ being a random variable over corresponding label sequences. All components $y_t$ are from the label alphabet $\mathcal{Y}$ = {“background”, “rise left”, “point left”, “fall left”, “rise right”, “point right”, “fall right”, “other”}. $V$ contains a vertex for each random variable $y_t$.

In contrast to HMMs, CRFs do not model the output label sequence and input data sequence as a joint probability $P(\mathbf{y}, \mathbf{x})$, but the conditional probability of a label sequence $\mathbf{y}$ given a sequence of input data $\mathbf{x}$ and the model parameters $\theta$

$$P(\mathbf{y} \mid \mathbf{x}, \theta) = \frac{1}{Z(\mathbf{x}, \theta)} \exp\Big( \sum_{t} \sum_{k} \theta_k\, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big) \quad (1)$$

with the normalizing partition function

$$Z(\mathbf{x}, \theta) = \sum_{\mathbf{y}'} \exp\Big( \sum_{t} \sum_{k} \theta_k\, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big) \quad (2)$$

summing, for each $\mathbf{x}$, over all corresponding label sequences $\mathbf{y}'$ of the same length.

The feature functions $f_k$ can be broken down into state functions $s_k$ and transition functions $t_k$

$$\sum_{k} \theta_k\, f_k(y_{t-1}, y_t, \mathbf{x}, t) = \sum_{k} \mu_k\, s_k(y_t, \mathbf{x}, t) + \sum_{k} \lambda_k\, t_k(y_{t-1}, y_t, \mathbf{x}, t) \quad (3)$$

State functions $s_k$ model the conditioning of the labels $y_t$ on the input features $\mathbf{x}$, while the transition functions $t_k$ model the relations between the labels with respect to the input features $\mathbf{x}$. Due to computational constraints, a transition feature function can only have two label node values as input parameters.

Figure 2: CRF (left) and LDCRF (right) structure illustration with labels $y_t$, features $x_t$, and latent variables $h_t$. The boxes represent transition functions (green: intrinsic; black: extrinsic).

For our approach, the unconstrained access to any input feature $x_j$ in the sequence $\mathbf{x}$, before or after a given frame $t$, is one of the major advantages that CRFs provide in contrast to HMMs (see Sec. 3.4).
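To make Eqs. (1)–(3) concrete, the following is a minimal numpy sketch (not the authors' implementation) of how the conditional probability of a label sequence is evaluated for a linear-chain CRF. The state and transition score tables are assumed to be precomputed from the features; the partition function $Z(\mathbf{x})$ is computed with the forward algorithm in log space.

```python
import numpy as np

def crf_log_prob(y, node_scores, trans_scores):
    """Log P(y | x) for a linear-chain CRF (Eqs. 1-3); a minimal sketch.

    node_scores[t, l]  = sum_k mu_k * s_k(y_t = l, x, t)        (state functions)
    trans_scores[a, b] = sum_k lambda_k * t_k(y_{t-1}=a, y_t=b)  (transition functions)
    Both score tables are assumed to be precomputed from the features.
    """
    T, L = node_scores.shape

    # Unnormalized log-score of the given label sequence y.
    score = node_scores[0, y[0]]
    for t in range(1, T):
        score += trans_scores[y[t - 1], y[t]] + node_scores[t, y[t]]

    # log Z(x): forward algorithm in log space (Eq. 2).
    alpha = node_scores[0].copy()
    for t in range(1, T):
        m = alpha.max()
        # alpha'[b] = node[t, b] + log sum_a exp(alpha[a] + trans[a, b])
        alpha = node_scores[t] + m + np.log(np.exp(alpha - m) @ np.exp(trans_scores))
    log_Z = np.logaddexp.reduce(alpha)

    return score - log_Z

# Usage example with illustrative random scores: 8 labels, 100 frames.
rng = np.random.default_rng(0)
node = rng.normal(size=(100, 8))
trans = rng.normal(size=(8, 8))
print(crf_log_prob(rng.integers(0, 8, size=100), node, trans))
```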

3.2 LDCRF

Latent-Dynamic Conditional Random Fields (LDCRFs), as introduced by Morency et al. [28] (please note that LDCRFs were first introduced as Frame-based Hidden-state Conditional Fields in Morency’s PhD thesis [30]), extend the structure of CRFs by hidden states to model intrinsic structure. To this end, a set of latent variables $\mathbf{h} = (h_1, \dots, h_T)$ is introduced; here, we use three hidden states per label $y \in \mathcal{Y}$, collected in a disjoint set $\mathcal{H}_y$. The probability of each label in the graph is substituted by the chain of probabilities of its hidden states

$$P(\mathbf{y} \mid \mathbf{x}, \theta) = \sum_{\mathbf{h}\,:\,\forall t,\ h_t \in \mathcal{H}_{y_t}} P(\mathbf{h} \mid \mathbf{x}, \theta) \quad (4)$$

The random field (see Eq. 1) is now built upon the hidden states

$$P(\mathbf{h} \mid \mathbf{x}, \theta) = \frac{1}{Z(\mathbf{x}, \theta)} \exp\Big( \sum_{t} \sum_{k} \theta_k\, f_k(h_{t-1}, h_t, \mathbf{x}, t) \Big) \quad (5)$$

Similar to CRFs, transition functions model relations between two hidden states. As shown in Fig. 2, these functions can now, in addition to the extrinsic structure (black boxes) relating observable class nodes, also model the intrinsic structure of a class (green boxes).

Interestingly, HMMs model intrinsic structure with hidden states as well, but they need a separate model for each label class $y$. Thus, they calculate a probability for each sequence from each trained HMM model, but those probabilities are unrelated. In contrast, LDCRFs seamlessly combine those and output a meaningful probability for each label (i.e., $P(y_t \mid \mathbf{x})$).
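To illustrate Eq. (4): once the hidden-state marginals have been computed (e.g., via forward-backward on the hidden-state chain of Eq. (5)), the frame-wise label probabilities follow by summing over each label's disjoint set of hidden states. This is a hedged sketch, assuming the hidden states are grouped label by label; the choice of three hidden states per label follows Sec. 3.2.

```python
import numpy as np

# Illustrative setup: 8 labels, 3 disjoint hidden states per label (Sec. 3.2).
LABELS = ["background", "rise left", "point left", "fall left",
          "rise right", "point right", "fall right", "other"]
N_HIDDEN_PER_LABEL = 3

def label_marginals(hidden_marginals):
    """Frame-wise P(y_t | x) from hidden-state marginals (Eq. 4); a sketch.

    hidden_marginals: array of shape (T, len(LABELS) * N_HIDDEN_PER_LABEL)
    holding P(h_t = h | x) for every frame t, assumed to be grouped
    label by label (three consecutive hidden states per label).
    """
    T, H = hidden_marginals.shape
    assert H == len(LABELS) * N_HIDDEN_PER_LABEL
    # Because the hidden-state sets are disjoint, summing the marginals of
    # each label's hidden states yields that label's probability per frame.
    return hidden_marginals.reshape(T, len(LABELS), N_HIDDEN_PER_LABEL).sum(axis=2)
```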

3.3 Features

Many features have been proposed and evaluated for gesture recognition (see, e.g., [31]). In preliminary experiments, we evaluated several features and feature combinations (e.g., the features that Nickel and Stiefelhagen proposed [20]). We achieved the best results for pointing gesture recognition based on LDCRFs and NITE’s [29] head, shoulder, elbow, and hand data with the following feature combination: the torso-relative positions of shoulder, elbow, and hand, the angle between the vertical y-axis and the shoulder-hand line, the height difference of the two hands, and the distance of the hand to the corresponding shoulder representing the arm extension. To abstract from different body dimensions, we normalize each distance-based feature by the shoulder width. Furthermore, to have a completely angular representation independent of body dimensions, we add each hand’s polar coordinates as features. The position of each hand in polar coordinates is represented by the azimuth angle (in the floor plane) and the elevation angle with respect to the corresponding shoulder. Here, we only use the angles and omit the radius to be body-size independent.
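The exact implementation is not given in the paper; the following sketch shows how the described feature combination could be computed from NITE-style 3D joint positions. The joint key names and the coordinate convention (x right, y up, z depth) are assumptions for illustration.

```python
import numpy as np

def pointing_features(joints):
    """Per-frame feature vector from 3D joint positions; a sketch.

    joints: dict of 3D positions (numpy arrays of shape (3,)); assumed keys:
    'torso', 'left_shoulder', 'left_elbow', 'left_hand',
    'right_shoulder', 'right_elbow', 'right_hand'.
    """
    shoulder_width = np.linalg.norm(
        joints["left_shoulder"] - joints["right_shoulder"])
    feats = []
    for side in ("left", "right"):
        shoulder = joints[f"{side}_shoulder"]
        elbow = joints[f"{side}_elbow"]
        hand = joints[f"{side}_hand"]

        # Torso-relative positions, normalized by the shoulder width
        # to abstract from different body dimensions.
        for j in (shoulder, elbow, hand):
            feats.extend((j - joints["torso"]) / shoulder_width)

        # Angle between the vertical y-axis and the shoulder-hand line.
        sh = hand - shoulder
        feats.append(np.arccos(sh[1] / (np.linalg.norm(sh) + 1e-9)))

        # Arm extension: shoulder-hand distance, normalized.
        feats.append(np.linalg.norm(sh) / shoulder_width)

        # Hand in polar coordinates w.r.t. the shoulder: azimuth (floor
        # plane, x-z) and elevation; the radius is omitted on purpose.
        feats.append(np.arctan2(sh[2], sh[0]))                    # azimuth
        feats.append(np.arctan2(sh[1], np.hypot(sh[0], sh[2])))   # elevation

    # Height difference between the two hands, normalized.
    feats.append((joints["left_hand"][1] - joints["right_hand"][1])
                 / shoulder_width)
    return np.asarray(feats)
```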

3.4 History

We use the CRF’s ability to establish long-range relations between input frames, which is the result of each feature function’s ability to access the whole input sequence $\mathbf{x}$ (see Eq. (3)). We choose to establish state functions that, in addition to the current frame’s features $x_t$, also use the input features $x_{t-\delta}$ of previous frames, where all of our features described in Sec. 3.3 are used as state functions. This way we can represent dynamics in feature changes over an extended period. This is useful because, for example, at the beginning of a pointing sequence the arm rises quickly and slows down shortly before arriving at the pointing posture. Taking into account deltas over shorter as well as larger time intervals captures this dynamic. We use a non-equidistant history to maintain the same level of computational complexity while covering a larger dynamic range compared to an equidistant history (e.g., [28]). Furthermore, we only make use of previous features (i.e., only $x_{t-\delta}$ with $\delta \geq 0$), since one of our system’s design goals is to keep the latency as low as possible. Thus, in contrast to, e.g., Morency et al. [28], we predict the current label based only on previous observations, not waiting for additional future input frames.
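A sketch of such a non-equidistant history is shown below. The concrete offsets are not stated in the paper and are purely illustrative; the point is that the offsets grow with distance from the current frame and only past frames are used, which keeps the latency low.

```python
import numpy as np

def history_features(frame_feats, t, offsets=(1, 2, 4, 8, 16)):
    """Augment the features of frame t with deltas to earlier frames; a sketch.

    frame_feats: array (T, D) of per-frame features (Sec. 3.3).
    offsets: non-equidistant history of past offsets (illustrative values;
    the exact offsets used in the paper are not given here).
    """
    current = frame_feats[t]
    # Only past frames are used; near the sequence start, clamp to frame 0.
    deltas = [current - frame_feats[max(t - d, 0)] for d in offsets]
    return np.concatenate([current, *deltas])
```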

4 Evaluation

4.1 Dataset

Figure 3: Example of a person pointing at target 3. Left (top to bottom): the camera image, the depth image, and the skeleton. Right: the skeleton and target object locations. The different lines represent different methods to calculate the pointing gesture’s direction, such as the frequently used head-hand line.

We recorded a novel evaluation dataset that consists of 990 pointing gestures: 18 persons (age range 23 to 63; six female and twelve male) performed 55 pointing gestures toward 22 targets. The subjects were positioned approximately m away from the Kinect, which is at the upper end of Kinect’s depth resolution sweet spot and allows us to record a full skeleton even of the tallest subjects. For each camera frame, we recorded the 11 bit depth image, the RGB image, and a 15 joint skeleton as is provided by the NITE framework, see Fig. 3. Apart from the pointing gestures, all subjects performed 10 other gestures (e.g., “hand waving”, “shoulder shrug”, or “come closer”) that we use as other/negative samples for training and testing. Additionally, we determined each person’s dominant eye.

We manually annotated every video frame in the dataset with one of 8 labels: “background” describes idle phases between gestures (e.g., standing upright with hanging arms). “left rise”, “left point”, and “left fall” (analogously, “right rise”, “right point”, and “right fall”) describe the typical pointing behavior of first raising the arm, the actual pointing and fine-tuning of the gesture, and finally lowering the arm again. “other” is used as a label for the various other gestures.

4.2 Results and Discussion

4.2.1 Frame-wise classification

Background Left Rise Left Point Left Fall Right Rise Right Point Right Fall Other
Background 83.44 2.67 0.43 3.29 2.08 0.26 1.83 5.99
Left Rise 6.34 78.14 9.60 1.43 0.48 0.00 0.03 3.99
Left Point 0.03 4.28 87.22 7.09 0.36 0.12 0.15 0.76
Left Fall 5.66 0.52 5.34 85.15 0.07 0.11 0.05 3.11
Right Rise 2.98 1.83 0.07 0.23 79.02 10.03 0.60 5.24
Right Point 0.00 1.12 0.74 0.53 8.79 79.55 5.08 4.18
Right Fall 6.36 0.30 0.08 0.07 0.94 5.49 80.20 6.55
Other 5.22 3.96 1.53 3.12 2.27 3.16 2.23 78.51
(a) LDCRF
Background Left Rise Left Point Left Fall Right Rise Right Point Right Fall Other
Background 74.92 0.61 6.91 1.61 0.86 7.97 1.26 5.85
Left Rise 5.81 82.01 7.38 0.33 0.30 0.99 0.50 2.69
Left Point 10.10 3.22 71.85 1.40 0.15 5.81 0.13 7.34
Left Fall 8.29 0.00 4.46 85.23 0.78 0.56 0.28 0.40
Right Rise 3.06 0.00 0.27 0.97 89.16 6.20 0.00 0.33
Right Point 15.52 1.64 5.19 1.16 2.69 65.70 0.86 7.23
Right Fall 7.89 1.05 0.33 0.03 0.08 3.62 86.91 0.10
Other 18.24 8.17 22.17 8.62 5.23 14.58 4.63 18.35
(b) CRF
Table 1: Leave-one-subject-out evaluation results with LDCRF (a) and CRF (b). Confusion matrices in %; rows are ground-truth labels, columns are predicted labels.
Background Left Rise Left Point Left Fall Right Rise Right Point Right Fall Other
Background 23.52 15.65 0.05 4.82 20.82 0.00 5.43 29.71
Left Rise 0.78 98.56 0.66 0.00 0.00
Left Point 0.24 8.86 80.33 10.57 0.00
Left Fall 2.61 0.38 0.07 96.94 0.00
Right Rise 2.61 95.40 0.60 0.49 0.89
Right Point 0.56 9.08 78.81 11.49 0.06
Right Fall 7.35 0.16 0.20 92.13 0.16
Other 63.81 8.11 2.74 5.27 7.41 3.08 3.85 5.73
Table 2: Leave-one-subject-out evaluation results with the HMM baseline [20] (confusion matrix in %; rows are ground-truth labels, columns are predicted labels). Please note that Nickel and Stiefelhagen’s approach [20] is not multiclass and does not distinguish between left/right pointing. Accordingly, we do not record mistakes of, e.g., a detected “left rise” on a sequence in which the person points with his/her right arm; for rows of left-arm (right-arm) classes, only the columns Background, Left (Right) Rise, Point, Fall, and Other are reported.

The frame-wise classification results are depicted as confusion matrices in Tab. 1. On average, the LDCRF correctly classifies roughly 83% of the frames labeled as “left/right point”, which mark the holding phase of each gesture with the arm pointing at the target, see Tab. 1(a). It is important to note that the most common misclassifications are: first, “rise” and “fall” are misclassified as either “background” or “point”; second, “point” is misclassified as either “rise” or “fall”. In our opinion, such mistakes are not critical, because during the transition phases from one state into the other (e.g., from “rise” to “point”) there is substantial label ambiguity over several frames between these classes, and these misclassifications almost exclusively occur during these transition phases.

Furthermore, we can see that the LDCRF provides a substantially better performance than the CRF, compare Tab. 1(a) and Tab. 1(b). Most importantly, we can see that the CRF often misclassifies frames from “other” gestures as being parts of pointing gestures, which in practice could lead to a significant amount of false positive pointing gesture detections. This is an important aspect for our intended use in Human-Robot Interaction, because it is less disturbing for an interaction partner to repeat a pointing gesture than to have the robot react to a falsely detected pointing gesture. The fact that the LDCRF rarely makes such mistakes is most likely due to its ability to learn and model the intrinsic structure of pointing gesture phases.

4.2.2 Sequence detection and segmentation

Prediction Class Type LDCRF HMM
Exact (End) TP 20.32 0.21
Detection longer TP 23.89 0.31
Detection shorter TP 29.26 67.26
No detection (control) TN 9.47 10.53
Phantom detection FP 0.63 0.83
Overlapping detection FN 16.42 0.00
Missed detection FN 0.00 20.75
Total True T* 82.95 78.31
Total False F* 17.05 21.58
Table 3: Sequence detection results (in %).
Figure 4: Sequence evaluation state machine.

As we have seen in the frame-wise classification, most of the pointing misclassifications are non-critical confusions between “rise”, “point”, “fall”, and “background”. Accordingly, we can use a window to suppress such misclassifications and obtain a continuous pointing detection. For this purpose, we use a simple median window to filter the frame-wise detections and eliminate small discontinuities. To evaluate the resulting continuous blocks of pointing gesture detections, we use a special state machine that is able to distinguish between different types of detection behaviors, see Fig. 4 and Tab. 3.
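A sketch of this grouping step is given below. The paper only states that a simple median window is used, so the label ids, the window size, and the use of a standard median filter over the predicted label ids are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import median_filter

# Label ids for rise/point/fall of both arms (assumed id assignment).
POINT_LABELS = {1, 2, 3, 4, 5, 6}

def pointing_segments(frame_labels, window=15):
    """Group frame-wise labels into continuous pointing detections; a sketch.

    frame_labels: one integer label id per frame. The median is taken over
    the predicted label ids, following the paper's description of median
    filtering; the window size is illustrative.
    """
    smoothed = median_filter(np.asarray(frame_labels), size=window)
    is_pointing = np.isin(smoothed, list(POINT_LABELS))

    segments, start = [], None
    for t, flag in enumerate(is_pointing):
        if flag and start is None:
            start = t                        # a pointing block begins
        elif not flag and start is not None:
            segments.append((start, t - 1))  # the block ended at frame t-1
            start = None
    if start is not None:
        segments.append((start, len(is_pointing) - 1))
    return segments
```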

As can be seen in Tab. 3, the resulting system exhibits a false positive rate of 0.63%. Furthermore, the system detects all pointing gestures, but unfortunately – due to the simple filtering mechanism – it merges two immediately subsequent pointing gestures into a single pointing gesture detection in 16.42% of the cases. However, we expect that we can easily improve on these numbers given a more elaborate frame grouping mechanism than median filtering of the predicted labels. In 20.32% of the cases, the end frame of the predicted and annotated pointing sequence is an exact match. The predicted pointing segment is slightly longer than the annotation (i.e., an overlap of “rise” and/or “fall” with “background”) in 23.89% of the detections and slightly shorter in 29.26%. This can be explained with the label ambiguity that we already addressed in Sec. 4.2.1.
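The exact evaluation state machine (Fig. 4) is not reproduced in the text; the following is a simplified sketch of how predicted and annotated pointing segments could be assigned to the outcome categories of Tab. 3. The end-frame tolerance and the matching rules are assumptions, and the "no detection" control category for negative sequences is omitted.

```python
def categorize(pred_segments, gt_segments, tolerance=5):
    """Assign detections to the outcome categories of Tab. 3; a simplified
    sketch, not the authors' state machine. Segments are (start, end) tuples
    of frame indices, assumed to be sorted by start frame.
    """
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    results, used_preds = [], set()
    for gt in gt_segments:
        hits = [i for i, p in enumerate(pred_segments) if overlaps(p, gt)]
        if not hits:
            results.append("missed detection")       # FN: no overlap at all
            continue
        i = hits[0]
        if i in used_preds:
            # The same prediction already matched a previous gesture,
            # i.e., two subsequent gestures were merged into one detection.
            results.append("overlapping detection")  # FN
            continue
        used_preds.add(i)
        p = pred_segments[i]
        if abs(p[1] - gt[1]) <= tolerance:
            results.append("exact (end)")            # TP: end frames match
        elif (p[1] - p[0]) > (gt[1] - gt[0]):
            results.append("detection longer")       # TP
        else:
            results.append("detection shorter")      # TP
    # Predictions that overlap no annotated gesture at all are phantoms.
    for i, p in enumerate(pred_segments):
        if i not in used_preds and not any(overlaps(p, gt) for gt in gt_segments):
            results.append("phantom detection")      # FP
    return results
```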

4.2.3 Baseline

To serve as a baseline, we implemented Nickel and Stiefelhagen’s [20] HMM-based pointing gesture detection system, which – despite its age – is still a popular method. Interestingly, the HMM performs poorly with our feature set (see Sec. 3.3) and, analogously, the LDCRF performs poorly with Nickel and Stiefelhagen’s feature set. Consequently, we report the HMM results based on Nickel and Stiefelhagen’s [20] original set of features. As we can see in Tab. 2, the HMM correctly detects roughly 96% of the frames that belong to rise or fall. However, this comes at the cost of a substantial amount of false rise and fall detections, especially for “background” frames but also for “other” and “point”. As a consequence, the ability to detect “point” frames is substantially lower than for the LDCRF, see Tab. 1(a). If we consider sequence detection and segmentation based on the HMM’s frame-wise detections, see Tab. 3, we notice that the HMM misses entire pointing sequences (missed detections in 20.75% of the cases). Furthermore, we can see that the false positive rate of the LDCRF is better than the HMM’s, with 0.63% and 0.83%, respectively.

5 Conclusion and Future Work

We presented how we use LDCRFs on depth-based skeleton joint data to perform person-independent pointing gesture detection. We have shown that LDCRFs outperform traditional CRFs as well as HMMs for pointing gesture detection and the labeling of individual frames. Based on the labeled frames of a video sequence, we can use filtering over time to suppress false detections and determine the onset and end of actual pointing gestures, thus segmenting pointing gestures of arbitrary length in video sequences. This way, we were able to reliably and efficiently detect pointing gestures with a very low false positive rate. To evaluate our approach, we recorded a novel dataset.

We leave two important aspects as future work: First, we intend to improve the pointing sequence extraction based on the frame-wise labeled video frames, specifically to avoid misclassifying two successive pointing gestures as one pointing gesture. Second, we want to determine the optimal point in time of a pointing gesture at which to determine the target object, because preliminary experiments on our dataset have shown that it has a drastic influence on the ability to correctly determine the pointed-at object (i.e., an improvement of the correct classification rate by up to roughly %).

References

  •  1. Schauerte B, Stiefelhagen R. Look at this! Learning to Guide Visual Saliency in Human-Robot Interaction. In: Proc. Int. Conf. Intelligent Robots and Systems; 2014. .
  •  2. Schauerte B, Fink GA. Focusing Computational Visual Attention in Multi-Modal Human-Robot Interaction. In: Proc. Int. Conf. Multimodal Interfaces; 2010. .
  •  3. Bolt RA. “Put-that-there”. In: Proc. Ann. Conf. Computer graphics and interactive techniques (SIGGRAPH); 1980. .
  •  4. Suarez J, Murphy RR. Hand gesture recognition with depth images: A review. In: Proc. Int. Workshop on Robot and Human Interactive Communication; 2012. .
  •  5. Mitra S, Acharya T. Gesture recognition: A survey. IEEE Trans Systems, Man and Cybernetics Part C: Applications and Reviews. 2007;37:311–324.
  •  6. Wachs JP, Kölsch M, Stern H, Edan Y. Vision-based hand-gesture applications. Communications of the ACM. 2011;54:60.
  •  7. Gavrila DM. The Visual Analysis of Human Movement: A Survey. Computer Vision and Image Understanding. 1999;73:82–98.
  •  8. Pavlovic VI, Sharma R, Huang TS. Visual interpretation of hand gestures for human-computer interaction: A review. IEEE Trans Pattern Analysis and Machine Intelligence. 1997;19:677–695.
  •  9. Jaimes A, Sebe N. Multimodal human-computer interaction: A survey. Computer Vision and Image Understanding. 2007;108:116–134.
  •  10. Jojic N, Brumitt B, Meyers B, Harris S, Huang T. Detection and estimation of pointing gestures in dense disparity maps. In: Proc. Int. Conf. Automatic Face and Gesture Recognition; 2000. .
  •  11. Feris R, Turk M, Raskar R, Tan KH, Ohashi G. Recognition of isolated fingerspelling gestures using depth edges. In: Real-Time Vision for Human-Computer Interaction; 2005. p. 43–56.
  •  12. Van Den Bergh M, Carton D, De Nijs R, Mitsou N, Landsiedel C, Kuehnlenz K, et al. Real-time 3D hand gesture interaction with a robot for understanding directions from humans. In: Proc. Int. Workshop on Robot and Human Interactive Communication; 2011. .
  •  13. Keskin C, Kıraç F, Kara YE, Akarun L. Real time hand pose estimation using depth sensors. In: ICCV Workshops; 2011. .
  •  14. Biswas KK, Basu SK. Gesture recognition using Microsoft Kinect®. In: Proc. Int. Conf. Automation, Robotics and Applications; 2011. .
  •  15. Ramey A, González-Pacheco V, Salichs MA. Integration of a low-cost RGB-D sensor in a social robot for gesture recognition. In: Proc. Int. Conf. Human-robot interaction; 2011. .
  •  16. Chen FS, Fu CM, Huang CL. Hand gesture recognition using a real-time tracking method and hidden Markov models. Image and Vision Computing. 2003;21(8):745–758.
  •  17. Yang C, Jang Y, Beh J, Han D, Ko H. Gesture recognition using depth-based hand tracking for contactless controller application. In: Proc. Int. Conf. Consumer Electronics (ICCE); 2012. .
  •  18. Zafrulla Z, Brashear H, Starner T, Hamilton H, Presti P. American sign language recognition with the kinect. In: Proc. Int. Conf. Multimodal Interfaces; 2011. .
  •  19. Vogler C, Metaxas D. ASL recognition based on a coupling between HMMs and 3D motion analysis. In: Proc. Int. Conf. Computer Vision; 1998. .
  •  20. Nickel K, Stiefelhagen R. Pointing Gesture Recognition based on 3D-Tracking of Face, Hands and Head Orientation. In: Proc. Int. Conf. Multimodal interfaces; 2003. p. 140–146.
  •  21. Park CB, Roh MC, Lee SW. Real-time 3D pointing gesture recognition in mobile space. In: Proc. Int. Conf. Automatic Face & Gesture Recognition; 2008. .
  •  22. Droeschel D, Stückler J, Behnke S. Learning to interpret pointing gestures with a time-of-flight camera. In: Proc. Int. Conf. Human-Robot Interaction; 2011. .
  •  23. McCallum A, Freitag D, Pereira FC. Maximum Entropy Markov Models for Information Extraction and Segmentation. In: ICML; 2000. .
  •  24. Sung J, Ponce C, Selman B, Saxena A. Unstructured human activity detection from rgbd images. In: Proc. Int. Conf. Robotics and Automation; 2012. .
  •  25. Lafferty J, McCallum A, Pereira FCN. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: Proc. Int. Conf. Machine Learning; 2001. .
  •  26. Sminchisescu C, Kanaujia A, Metaxas D. Conditional models for contextual human motion recognition. Computer Vision and Image Understanding. 2006;104:210–220.
  •  27. Wang SB, Quattoni A, Morency LP, Demirdjian D, Darrell T. Hidden Conditional Random Fields for Gesture Recognition. In: Proc. Int. Conf. Computer Vision and Pattern Recognition; 2006. .
  •  28. Morency LP, Quattoni A, Darrell T. Latent-dynamic discriminative models for continuous gesture recognition. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR); 2007. p. 1–8.
  •  29. Shotton J, Sharp T, Kipman A, Fitzgibbon A, Finocchio M, Blake A, et al. Real-time human pose recognition in parts from single depth images. Communications of the ACM. 2013;56(1):116–124.
  •  30. Morency LP. Context-based visual feedback recognition. Massachusetts Institute of Technology; 2006.
  •  31. Richarz J, Fink GA. Feature representations for the recognition of 3D emblematic gestures. In: Human Behavior Understanding. Springer; 2010. p. 113–124.