A New GNG Graph-Based Hand Gesture Recognition Approach

09/08/2019 · Narges Mirehi et al., Shahid Beheshti University

Hand Gesture Recognition (HGR) is of major importance for Human-Computer Interaction (HCI) applications. In this paper, we present a new hand gesture recognition approach called GNG-IEMD. In this approach, we first use a Growing Neural Gas (GNG) graph to model the image and then extract features from this graph. These features are not geometric or pixel-based, so they do not depend on scale, rotation, or articulation. The dissimilarity between hand gestures is measured with a novel Improved Earth Mover's Distance (IEMD) metric. We evaluate the performance of the proposed approach on challenging public datasets including NTU Hand Digits, HKU, HKU multi-angle, and UESTC-ASL and compare the results with state-of-the-art approaches. The experimental results demonstrate the effectiveness of the proposed approach.







1 Introduction

Nonverbal communication, which involves communication through hand gestures, body, and facial movements, accounts for about 65% of all human communication Hogan (2003). Among body, arm, and facial movements (body language), hand gestures are the most important part of nonverbal communication. One of the objectives of intelligent systems is to facilitate natural human-computer interaction. Among the various modes of human-computer interaction, hand gestures are a natural and effective means of communication with a significant ability to exchange information.

However, hand gesture recognition is known as a difficult problem in computer vision due to the variety in the shape, size, and direction of hands and fingers in different hand images. The problem can generally be divided into two categories: static and dynamic. Dynamic gesture recognition examines spatial-temporal characteristics, while static recognition focuses on the internal information of a single image. The study of static gesture recognition is an essential part of hand gesture recognition because hand shapes carry specific information without any movement Li et al. (2018).

Most previous approaches have developed hand gesture recognition systems using a combination of preprocessing and machine learning methods. Most of these studies extract pixel-based features and classify hand gestures using machine learning methods Dong et al. (2015); Li et al. (2018). The study of hand gesture recognition using meaningful shape features is important because such features improve stability against articulation, scale, rotation, and noise. Hence, in recent decades, researchers have tried to recognize hand gestures using significant features extracted from the shape of the image and its boundary Ren et al. (2011); Wang et al. (2017). Skeleton-based, geometric, and graph-based methods are the most well-known methods in this area Ren et al. (2011, 2013); Wang and Yang (2013); Wang et al. (2017).
Skeleton-based methods have attractive properties such as invariance to scale and rotation, since they capture the topological and geometrical information of skeletal branches. The main limitation of skeleton-based methods is low stability against contour noise. In fact, small variations or noise on the boundary of the object can cause redundant branches in the skeleton and significant changes in its topological structure.

Geometric-based methods study geometric properties of the image, such as the Euclidean distance and angle of fingers with respect to the center of the palm; they describe the important information of the object with a summary vector and ignore the redundant information of the pixels. These approaches may be influenced by articulation and viewpoint. Most of them are based on the hand contour, which is often distorted due to the low resolution and precision of current depth cameras. In other words, their performance may be reduced by orientation changes and noise on the contour Wang et al. (2017).

Graphs are robust against rotation, articulation, and noise. The inherent properties of a graph do not depend on its representation, so graphs can be used as effective tools for image representation. Hence, various graph-based methods have been presented for hand gesture recognition, but their structure depends on the local information of the pixels, and the loss of some pixel information, such as noise and small inner holes, reduces their performance Li and Wachs (2014); Triesch and von der Malsburg (2002); Wang et al. (2015, 2017). Recently, a new graph-based method that uses GNG to construct the graph and linear discriminant analysis (LDA) was introduced for hand gesture recognition Mirehi et al. (2019).

In this paper, we use an idea similar to Mirehi et al. (2019) to form a GNG graph for a given image; then, we introduce new topological features for hand gesture recognition. The new features capture the convexity and concavity of boundaries more precisely. We also introduce an improved version of the Earth Mover's Distance to measure the dissimilarity between feature vectors. This leads to higher accuracy on different datasets. We evaluate the proposed approach on challenging datasets including the NTU Hand Digits, HKU, HKU multi-angle, and UESTC-ASL datasets.

The rest of the paper is organized as follows: Section 2 reviews the related work briefly. Section 3 presents the basic steps of the proposed method and hand gesture recognition approach. The results of the experimental study and a comparison with state-of-the-art approaches are presented in Section 4. Finally, Section 5 concludes this paper.

2 Related works

In this section, we briefly review the state-of-the-art approaches. Different skin color methods have been used for hand detection and segmentation. The main decision in building a skin color model is the choice of color space. However, variations in skin color and background objects with color distributions similar to human skin can confuse these methods Rautaray and Agrawal (2015). Most current methods use Kinect sensors to collect data and to detect and segment hand information Maqueda et al. (2015); Cheng et al. (2016); Wang et al. (2017); Li et al. (2018). Depth cameras facilitate the hand segmentation process compared to skin-based models, especially when the background has a similar texture Plouffe and Cretu (2016); Ren et al. (2011, 2013). In these approaches, the hand of the user is considered the nearest object in the scene to the camera, and segmentation is performed by specifying a threshold value. We use the same approach for hand detection and segmentation in this study.

For more precision in hand detection and segmentation, some approaches applied both the depth map and the skeleton tracking provided by Kinect Presti and La Cascia (2016); Zafrulla et al. (2011); Wang et al. (2015). Although these methods may provide more accuracy in hand prediction, they suffer from configuration complexity.

Various hand features can be used for hand recognition. They can be broadly grouped into two categories: pixel-based features Maqueda et al. (2015); Zhang et al. (2013) and shape-based features Bai and Latecki (2008); Belongie et al. (2002); Stergiopoulou and Papamarkos (2009). Shape-based features include geometry-, graph-, and skeleton-based features.

Belongie et al. introduced a shape context descriptor by computing a log-polar histogram of the relative positions of contour points Belongie et al. (2002). Fritzke presented an incremental network that learns the topological structure of input vectors by a simple Hebb-like rule Fritzke (1995).

Stergiopoulou and Papamarkos applied a GNG graph for image representation and considered limited geometric features such as the distance and angle between neurons Stergiopoulou and Papamarkos (2009).

The skeleton of an object can be considered another source of shape information for hand gesture recognition Bai and Latecki (2008). Noisy and distorted contours have a significantly negative effect on extracting the correct skeleton. Zhang et al. used local features for hand gesture recognition: they computed Histograms of Oriented Gradients (HOG) of the 3D point distribution in color images Zhang et al. (2013).

In Maqueda et al. (2015), a Volumetric Spatiograms of Local Binary Patterns (VS-LBP) method was employed for hand gesture recognition. Despite the appropriateness of the results, these approaches depend on the local information of the pixels, which reduces their stability against pixel distortions Zhang et al. (2013); Maqueda et al. (2015). Ren et al. in Ren et al. (2011) and Ren et al. (2013) proposed a contour-based method using a Finger Earth Mover's Distance (FEMD) and a template matching approach. Contours are often distorted due to the low resolution and precision of current depth cameras, which affects the accuracy of contour-based approaches. Wang et al. presented a color-depth Superpixel Graph Earth Mover's Distance (SP-EMD) constructed by segmenting pixels into superpixels of almost the same size. They applied the Earth Mover's Distance (EMD) to measure the similarity of hand gestures and defined the cost over the centroids of superpixels based on their depth information and location, which can be influenced by camera conditions and the variety of hand shapes Wang et al. (2015). Wang et al. extended the previous method based on a Canonical Superpixel-Graph to reduce the hand shape variation problem Wang et al. (2017). In another study, a super-pixel based finger Earth Mover's Distance (SPFEMD) approach was proposed, which considered only the superpixels of fingers and used template matching Wang et al. (2019). An Image-to-Class Dynamic Time Warping (I2C-DTW) approach for both 3D static and trajectory hand gesture recognition was introduced in Cheng et al. (2016) by computing the Image-to-Class distance for hand gesture classification.

In Plouffe and Cretu (2016), a K-curvature algorithm, which is based on the change in the slope angle of the tangent line, was employed for localizing the fingertips on the contour extracted from depth data, and dynamic time warping (DTW) was applied for gesture recognition. This approach depends on the precision and resolution of the depth data.

Moreover, various deep learning approaches have been proposed for hand gesture recognition Cheok et al. (2019); Farooq and Won (2015); Li et al. (2018); Núñez et al. (2018). Li et al. provided a deep CNN framework for hand gesture recognition using four-channel RGB-D images Li et al. (2018). Its disadvantage is the dependence on lighting conditions. Núñez et al. proposed a combination of a CNN and a Long Short-Term Memory (LSTM) network based on human skeleton kinematics for the hand gesture recognition problem Núñez et al. (2018).

3 Proposed approach

In the previous work Mirehi et al. (2019), Growing Neural Gas (GNG) graphs were constructed from binary images, the bulges of the hand including fingers and wrist were computed, and afterward topological and geometrical features were extracted from the GNG graph. The hand gestures were first categorized by the number of bulges and then classified according to the defined features by LDA. In the current work, we extend the method of Mirehi et al. (2019) in the following directions.

1) The defined features are extended to recognize the hand gesture, and new features describing the shape of the boundary, such as concavity, convexity, and overall bulge shape, are introduced.

2) A new Earth Mover's Distance (EMD) is introduced to measure the dissimilarity of feature vectors extracted from GNG graphs.

3) Hand gestures are classified by a k-NN classifier according to the Earth Mover's Distance.

4) The proposed approach is evaluated on more challenging datasets including HKU, HKU multi-angle, and UESTC-ASL.
We first briefly describe the overall framework and then describe each step in detail. The main steps of the proposed approach are:
1. Segmentation of the hand gesture image.
We use simple thresholding on the depth map for hand segmentation: we consider the hand of the user as the nearest object in the scene to the camera and segment the hand region by thresholding the depth. This is a simple and effective method for hand segmentation, applied in many approaches Ren et al. (2011, 2013).
2. Constructing the GNG graph of the binary image.
3. Computing the outer boundary of the graph by computational geometry approaches.
4. Extracting the topological and geometrical features of the graph.
5. Measuring the dissimilarity between the hand gestures using a new Earth Mover's Distance.
6. Classifying the hand gestures by k-NN algorithm.
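As a concrete illustration of step 1, the depth-based segmentation can be sketched in a few lines of Python. The `margin` value and the toy depth map are illustrative assumptions, not values from the paper:

```python
import numpy as np

def segment_hand(depth_map, margin=150):
    """Segment the hand as the nearest object to the camera.

    A minimal sketch of the thresholding step: every pixel whose depth
    lies within `margin` depth units of the closest measured pixel is
    kept. The margin is an illustrative assumption.
    """
    valid = depth_map > 0                 # 0 often marks missing depth
    nearest = depth_map[valid].min()      # depth of the closest object
    return valid & (depth_map <= nearest + margin)

# Toy 4x4 depth map: hand at depth ~800, background at ~2000, one hole (0).
depth = np.array([[800, 810, 2000, 2000],
                  [805, 820, 2000,    0],
                  [2000, 2000, 2000, 2000],
                  [2000, 2000, 2000, 2000]])
mask = segment_hand(depth)
print(mask.sum())  # -> 4 (the four near pixels form the hand region)
```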

3.1 Constructing the GNG graph

Various approaches can be applied to construct a graph for an image. Our target graph should provide the following properties:

  • The vertices should be distributed almost uniformly within the image, and the edges should have almost equal length.

  • The number of vertices should be constant; in other words, the graph should not depend on the scale of the image.

  • The graph should ignore the holes and cracks inside the image and be robust against the noise on the image contour.

We choose the GNG graph because it satisfies these properties well. The GNG algorithm induces a low-dimensional subspace of the input data space while learning the topological structure of the data distribution Fritzke (1995). Moreover, the GNG algorithm is able to follow the behavior of vertices under changing dynamic conditions and can be extended to 3D online representation and object tracking Fink et al. (2015); Orts-Escolano et al. (2016); Sun et al. (2017). In the following, we describe the GNG algorithm briefly.

3.2 The GNG graph

Growing Neural Gas is an unsupervised learning algorithm Fritzke (1995). The algorithm starts with two vertices located at random positions and then updates the locations of vertices by comparing their distances to the input data at each step. An error is assigned to the vertex closest to the presented input, indicating the distance between them. The two vertices closest to the presented input are connected by a zero-age edge, and old edges are removed from the graph. Eventually, new vertices are added between the vertices with high error values. The algorithm repeats until a stopping criterion is met. The details are explained in Fritzke (1995). The principal parameters of the algorithm include:

  • N: the number of vertices.

  • ε_b and ε_n: the vertex closest to the input signal and its topological neighbors are moved towards the input by fractions ε_b and ε_n of the distance, respectively.

  • a_max: edges older than a_max are removed in every step.

  • λ: a new vertex is inserted after every λ input signals.

  • α: after a new vertex is inserted between the two vertices with the largest error, their error values are decreased by a factor of α.

  • d: all error variables are decreased in every step by a factor of d.

We tested different values for the GNG parameters and set N = 300; the values of the remaining parameters were tuned empirically.
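The GNG learning loop can be sketched as follows. This is a minimal Python implementation following Fritzke's formulation; the default parameter values and the ring-shaped toy data are illustrative assumptions, not the values used in our experiments:

```python
import numpy as np

def grow_gng(data, max_nodes=50, eps_b=0.2, eps_n=0.006,
             a_max=50, lam=100, alpha=0.5, d=0.995, seed=0):
    """Minimal Growing Neural Gas sketch (illustrative parameters).

    Returns node positions and the edge set of the learned graph.
    """
    rng = np.random.default_rng(seed)
    nodes = [rng.random(2), rng.random(2)]   # two random initial vertices
    error = [0.0, 0.0]
    edges = {}                                # (i, j) with i < j -> age

    def key(i, j):
        return (min(i, j), max(i, j))

    for step, x in enumerate(data, 1):
        dist = [float(np.linalg.norm(x - w)) for w in nodes]
        s1, s2 = np.argsort(dist)[:2]         # winner and runner-up
        error[s1] += dist[s1] ** 2
        nodes[s1] = nodes[s1] + eps_b * (x - nodes[s1])
        for e in list(edges):                 # age edges incident to winner
            if s1 in e:
                edges[e] += 1
                other = e[1] if e[0] == s1 else e[0]
                nodes[other] = nodes[other] + eps_n * (x - nodes[other])
        edges[key(s1, s2)] = 0                # refresh winner-runner edge
        edges = {e: a for e, a in edges.items() if a <= a_max}
        if step % lam == 0 and len(nodes) < max_nodes:
            q = int(np.argmax(error))         # vertex with largest error
            nbrs = [e[1] if e[0] == q else e[0] for e in edges if q in e]
            if nbrs:                          # insert between q and worst neighbor
                f = max(nbrs, key=lambda v: error[v])
                nodes.append((nodes[q] + nodes[f]) / 2)
                error[q] *= alpha
                error[f] *= alpha
                error.append(error[q])
                r = len(nodes) - 1
                edges.pop(key(q, f), None)
                edges[key(q, r)] = 0
                edges[key(f, r)] = 0
        error = [e * d for e in error]        # global error decay
    return np.array(nodes), set(edges)

# Sample points from a ring-shaped region and learn its topology.
rng = np.random.default_rng(1)
ang = rng.uniform(0, 2 * np.pi, 2000)
pts = np.c_[np.cos(ang), np.sin(ang)] * rng.uniform(0.8, 1.0, (2000, 1))
nodes, edges = grow_gng(pts, max_nodes=30)
print(len(nodes), len(edges))
```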

3.3 Extracting the outer boundary

The outer boundary of a GNG graph can be computed by an algorithm similar to the convex hull algorithm. The algorithm selects the leftmost vertex of the graph and walks clockwise around the graph until it returns to the initial vertex. More details are described in Mirehi et al. (2019). The GNG boundary is an approximation of the contour of the image, so, unlike pixel-based boundaries, it is not sensitive to noise on the contour. Figure 1 shows an example of a GNG graph and its boundary.

Figure 1: (a) The GNG graph of a hand gesture image, (b) the outer boundary of the GNG graph is specified with red color.
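The boundary walk can be sketched as follows, assuming the graph is given by vertex positions and an adjacency list. The turn rule (rotate clockwise from the reversed incoming direction) and the toy square graph are illustrative, not the paper's exact implementation:

```python
import math

def outer_boundary(pos, adj):
    """Walk the outer boundary of an embedded graph clockwise.

    Start at the leftmost vertex and, at every vertex, take the edge
    reached first when rotating clockwise from the reversed incoming
    direction; stop when the walk returns to the start vertex.
    """
    start = min(pos, key=lambda v: (pos[v][0], pos[v][1]))
    boundary, cur, theta_in = [start], start, 0.0  # pretend we arrived moving right
    while True:
        theta_rev = theta_in + math.pi

        def turn(n):
            theta_n = math.atan2(pos[n][1] - pos[cur][1], pos[n][0] - pos[cur][0])
            t = (theta_rev - theta_n) % (2 * math.pi)   # clockwise turn amount
            return t if t > 1e-9 else 2 * math.pi       # going straight back last
        nxt = min(adj[cur], key=turn)
        theta_in = math.atan2(pos[nxt][1] - pos[cur][1], pos[nxt][0] - pos[cur][0])
        cur = nxt
        if cur == start:
            return boundary
        boundary.append(cur)

# Unit square with a central vertex: the walk skips the interior vertex.
pos = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1), 4: (0.5, 0.5)}
adj = {0: [1, 3, 4], 1: [0, 2, 4], 2: [1, 3, 4], 3: [0, 2, 4], 4: [0, 1, 2, 3]}
print(outer_boundary(pos, adj))  # -> [0, 3, 2, 1]
```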

3.4 Topological features

In this section, we introduce meaningful features that capture the topological and geometrical properties of the graph. We first find the peaks and troughs of the boundary and then define the features using them.

Let G be the GNG graph and H be the spanning subgraph of G consisting of the boundary edges; note that G and H have the same vertex set. The adjacency matrices of G and H are denoted A_G and A_H, respectively. For each peak on the boundary, there are two vertices connecting it to the rest of the image. We select these vertices as the basic vertices of the bulge and the subgraph inside this peak as the bulge itself. The distance between the basic vertices of a bulge in G is less than a multiple of their distance in H.

Figure 2: An example of a bulge.

There is a standard for hand and body measurement Klein (2012). According to this, the length of the middle finger (fingertip to knuckle) is at least 5.5 times its width. The length of the little finger is not smaller than half the length of the middle finger.

The experimental study indicates that the distance between the basic vertices of a finger is at most 2 in a GNG graph of a hand with 300 vertices, so we consider the distance between the basic vertices of one finger to be 2 and the length of fingers to be greater than 4. To find the bulges, we compute the matrix M = A_G − A_H. Non-zero elements of M describe the edges of G that do not belong to the graph H; therefore, M^k gives the number of walks of length k between vertices that avoid the boundary edges of H Bondy and Murty (2008).

The pairs of vertices whose corresponding entry in M^2 is non-zero are candidates for the basic vertices of a finger. Furthermore, pairs of vertices whose corresponding entry in a higher power of M is non-zero are candidates for the basic vertices of sticking fingers.
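A sketch of this candidate search, assuming the two adjacency matrices are available as numpy arrays. The toy graph (an 8-cycle boundary with one interior chord standing in for a bulge's basic vertices) and the thresholds are illustrative assumptions:

```python
import numpy as np
from itertools import combinations

def bulge_candidates(A_G, A_H, k, min_boundary_dist):
    """Find candidate basic-vertex pairs of a bulge.

    M = A_G - A_H keeps only the non-boundary edges of G (assuming the
    boundary H is a subgraph of G), and M^k counts walks of length k
    that avoid the boundary. A pair joined by such a short interior
    walk but far apart along the boundary encloses a bulge.
    """
    M = A_G - A_H
    Mk = np.linalg.matrix_power(M, k)
    n = len(A_G)
    # all-pairs boundary distances (small n, so Floyd-Warshall is fine)
    D = np.where(A_H > 0, 1, np.inf)
    np.fill_diagonal(D, 0)
    for m in range(n):
        D = np.minimum(D, D[:, [m]] + D[[m], :])
    return [(i, j) for i, j in combinations(range(n), 2)
            if Mk[i, j] != 0 and D[i, j] >= min_boundary_dist]

# Toy example: an 8-cycle boundary with one interior chord (0, 3).
n = 8
A_H = np.zeros((n, n), int)
for i in range(n):
    A_H[i, (i + 1) % n] = A_H[(i + 1) % n, i] = 1
A_G = A_H.copy()
A_G[0, 3] = A_G[3, 0] = 1
print(bulge_candidates(A_G, A_H, k=1, min_boundary_dist=3))  # -> [(0, 3)]
```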

To find the wrist, we consider the distance between the basic vertices of the wrist in G to be 6 or 7. The shape of the wrist is close to a rectangle, so the distance between the basic vertices of the wrist in H is at least 11. In a similar way, the corresponding powers of the matrix M are used for finding the basic vertices of the wrist. Among the candidate pairs of basic vertices for fingers and the wrist, we select the pair with the largest distance in H. Figure 3 displays an example of bulges (fingers and wrist) from the GNG graph of a hand gesture. The geometrical and topological features are defined as follows.

Figure 3: The GNG graph of a hand gesture and its bulges including fingers and wrist.
  • The ratio of distances between fingers and the wrist (f_1 and f_2)
    Feature f_1 measures the ratio of distances between the first finger and the wrist, and feature f_2 is the corresponding ratio for the last finger. There are counterclockwise and clockwise paths from the basic vertices of a finger to the basic vertices of the wrist in H (we replace every other finger along the way with the path between its basic vertices). The lengths of the counterclockwise and clockwise paths are denoted l_ccw and l_cw, respectively. Feature f_1 for the first finger is defined as the ratio of these two path lengths, and feature f_2 is defined analogously for the last finger.

    The features f_1 and f_2 illustrate the relative location of the fingers with respect to the wrist. In Figure 4-a, the paths indicating l_ccw and l_cw for the thumb are shown with black edges.

  • Distances between bulges (f_3)
    For consecutive bulges, f_3 measures the distance between them in the graph (see Figure 4-b).

  • The length of bulges (f_4)
    This feature measures the length of a bulge: for a given bulge, f_4 is the length of the boundary path in H between the basic vertices of the bulge. This path for the middle finger, index finger, and thumb is marked with black edges in Figure 4-c, with lengths 17, 17, and 13, respectively.

  • The width of bulges (f_5)
    This feature is the distance between the pair of basic vertices of a bulge (see Figure 4-d).

  • The number of GNG vertices in a bulge (f_6)
    This feature counts the GNG vertices that lie inside a bulge (see Figure 4-e).

  • The number of GNG vertices inside the convex hull of the region between bulges (f_7)
    This feature measures the number of vertices inside the convex hull of the shortest path between two consecutive bulges; in effect, it measures the convexity or concavity of the region between them. In Figure 4-f, the convex hull of the shortest path between two fingers is shown in black.

  • The aspect ratio of the OMBB of a bulge (f_8)
    Given a bulge, this feature computes the ratio of the width to the length of the oriented minimum bounding box (OMBB) of the bulge. Figure 4-g shows the OMBBs of the index finger and thumb.

  • The aspect ratio of the OMBB of the region between bulges (f_9)
    This feature finds the OMBB of the shortest path between two bulges and computes the ratio of its width to its length (see Figure 4-h).

Figure 4: features in a GNG graph of a hand gesture.
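The OMBB-based features reduce to computing the aspect ratio of an oriented minimum bounding box. A brute-force Python sketch: since the minimum-area box is aligned with some convex hull edge, trying the direction of every point pair (a superset of the hull edges) finds it exactly, if inefficiently. The quadratic enumeration is an illustrative shortcut, not the paper's implementation:

```python
import numpy as np
from itertools import combinations

def ombb_aspect_ratio(points):
    """Aspect ratio (width / length) of the oriented minimum bounding box."""
    pts = np.asarray(points, float)
    best_area, best_ratio = np.inf, None
    for a, b in combinations(pts, 2):
        delta = b - a
        norm = np.hypot(*delta)
        if norm == 0:
            continue
        c, s = delta / norm
        rot = pts @ np.array([[c, -s], [s, c]])   # rotate candidate edge onto x-axis
        w, h = rot.max(0) - rot.min(0)
        if w * h < best_area:
            best_area = w * h
            lo, hi = sorted((w, h))
            best_ratio = lo / hi
    return best_ratio

# A 1x3 rectangle rotated by 30 degrees still has aspect ratio 1/3.
theta = np.radians(30)
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
rect = np.array([[0, 0], [3, 0], [3, 1], [0, 1]]) @ R.T
print(round(ombb_aspect_ratio(rect), 6))  # -> 0.333333
```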

3.5 Improved Earth Mover's Distance (IEMD)

We define a new Earth Mover's Distance to measure the dissimilarity between the extracted features of the GNG graphs.

The Earth Mover's Distance (EMD) is a measure of the distance between two probability distributions over a region Rubner et al. (2000). Different researchers have presented various forms of the EMD depending on their application Wang et al. (2015, 2017). We compute the IEMD of two GNG graphs as follows. The first GNG graph is represented as a signature P with m clusters, where each cluster p_i has a weight w_i, and Q is the signature of the second GNG graph with n clusters q_j and weights u_j. We consider every computed bulge as a cluster. The weight of a bulge is a vector of length 7 containing the principal defined features of the bulge and its relationships with the other bulges, taken in clockwise order.


Since the basic vertices of the wrist are not necessarily located at the wrist in the image and might lie on the forearm, we use the ratios of path distances between the first and last fingers and the wrist, instead of the direct distance to the wrist's basic vertices, to describe the relative location of the wrist.

The maximum number of bulges occurs when all fingers are open; in this case, the number of bulges equals 6, including the five fingers and the wrist. To improve partial matching and reduce mismatching, we insert virtual clusters into the signatures so that each contains exactly 6 clusters: if a signature has fewer than 6 clusters, virtual clusters with zero weights are inserted into it. For two bulges, one from each signature, a matching cost c_ij is defined.

We expect a good matching to preserve the order of bulges; in this case, the cost is computed as the difference of the weights of the two bulges. Otherwise, a penalty term is added to the cost. We need to find a flow F = [f_ij] between the two signatures, where f_ij is the flow between the i-th cluster of the first signature and the j-th cluster of the second,

with the following constraints:

    f_ij >= 0,    sum_j f_ij <= w_i,    sum_i f_ij <= u_j,    sum_i sum_j f_ij = min(sum_i w_i, sum_j u_j),

where w_i and u_j denote the weights of the i-th cluster of the first signature and the j-th cluster of the second. Then, IEMD is defined as the total matching cost normalized by the total flow,

    IEMD = (sum_i sum_j c_ij f_ij) / (sum_i sum_j f_ij).
We use IEMD to measure the dissimilarity between two hand gestures. This measurement does not depend on the location and depth of the pixels Wang et al. (2015, 2017), which results in more stability against hand shape variations. Moreover, IEMD can be applied to measure dissimilarity in other approaches.
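A simplified sketch of this dissimilarity computation with scalar cluster weights: both signatures are padded to 6 clusters with zero-weight virtual clusters, and the standard EMD transportation problem is solved as a linear program. The cost matrix below is an illustrative stand-in for the bulge cost with its order penalty, and SciPy's `linprog` is used for the optimization:

```python
import numpy as np
from scipy.optimize import linprog

def iemd(wp, wq, cost):
    """EMD between two padded signatures, solved as a transportation LP.

    `wp`, `wq`: cluster weights (zeros act as virtual clusters);
    `cost`: pairwise matching-cost matrix.
    """
    m, n = len(wp), len(wq)
    c = np.asarray(cost, float).ravel()
    A_ub, b_ub = [], []
    for i in range(m):                       # row sums <= wp[i]
        row = np.zeros((m, n))
        row[i, :] = 1
        A_ub.append(row.ravel())
        b_ub.append(wp[i])
    for j in range(n):                       # column sums <= wq[j]
        col = np.zeros((m, n))
        col[:, j] = 1
        A_ub.append(col.ravel())
        b_ub.append(wq[j])
    total = min(sum(wp), sum(wq))            # ship as much mass as possible
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  A_eq=[np.ones(m * n)], b_eq=[total], bounds=(0, None))
    return res.fun / total                   # normalize by total flow

# Two 6-cluster signatures (the trailing zeros play the role of virtual clusters).
wp = [0.5, 0.3, 0.2, 0.0, 0.0, 0.0]
wq = [0.4, 0.4, 0.2, 0.0, 0.0, 0.0]
cost = np.abs(np.subtract.outer(range(6), range(6))).astype(float)
print(round(iemd(wp, wq, cost), 3))  # -> 0.1
```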

3.6 Hand gesture recognition

Finally, we use the k-nearest neighbors (k-NN) algorithm for hand gesture classification. In this algorithm, the value of k is chosen according to the data; in our experiments, k is set to 3, and the class of a hand gesture is predicted by the majority vote of its three nearest neighbors in the training set. The size and variety of the training sets can affect accuracy. In order to comprehensively evaluate the proposed approach, the training sets are chosen in two ways: using half of the data for training and the other half for testing (h-h), and leave-p-subject-out (l-p-o). In the (l-p-o) validation protocol, if the dataset includes N subjects, N−p subjects are chosen for training and the remaining p subjects are used for testing; this procedure is repeated for every combination of p subjects, and the average accuracy is reported. We choose leave-one-subject-out (l-o-o) and leave-4-subject-out (l-4-o) cross-validation, which are the more common protocols for evaluating approaches.
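The classification step can be sketched as a k-NN vote under an arbitrary dissimilarity function (IEMD in our case). The scalar stand-in features and class labels below are illustrative assumptions:

```python
from collections import Counter

def knn_predict(dissimilarity, train, labels, query, k=3):
    """Classify a sample by majority vote of its k nearest neighbors.

    `dissimilarity` is any distance function between feature signatures;
    here a plain absolute difference on scalar stand-in features.
    """
    ranked = sorted(zip(train, labels),
                    key=lambda t: dissimilarity(query, t[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data: one scalar "feature" per training gesture, two classes.
train = [0.1, 0.2, 0.15, 0.9, 1.0, 0.95]
labels = ['open', 'open', 'open', 'fist', 'fist', 'fist']
print(knn_predict(lambda a, b: abs(a - b), train, labels, 0.18))  # -> open
```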

4 Experimental study

In this section, we evaluate and compare our approach with state-of-the-art approaches such as Thresholding Decomposition+FEMD Ren et al. (2011), skeleton matching Ren et al. (2013), Hand dominant line Wang and Yang (2013), H3DF Zhang et al. (2013), VS-LBP Maqueda et al. (2015), SP-EMD Wang et al. (2015), CSG-EMD Wang et al. (2017), and GNG+LDA Mirehi et al. (2019) on different datasets: the NTU Hand Digits, HKU, HKU multi-angle, and UESTC-ASL datasets. First, we introduce these datasets briefly.

4.1 Datasets

4.1.1 NTU Hand Digits dataset

The NTU Hand Digits dataset is collected with Kinect and includes 1000 color images and their depth maps with cluttered backgrounds. It contains 10 hand gestures for the decimal digits 0-9, performed by 10 subjects with 10 samples per gesture. The subjects pose with variations in orientation, articulation, and scale Ren et al. (2011). Figure 5 shows some of these images.

Figure 5: Some samples from NTU Hand Digits dataset.

4.1.2 HKU dataset

The HKU dataset is captured using Kinect from 5 subjects. It consists of 1000 joint color-depth images with 10 gestures labeled 0 to 9; each subject has performed each gesture in 20 different poses Wang et al. (2015). The hand motions include large in-plane rotation and moderate out-of-plane rotation. Figure 6 shows gesture samples.

Figure 6: The gesture samples of HKU dataset.

4.1.3 HKU multi-angle dataset

The HKU multi-angle hand gesture dataset is an extension of the HKU dataset with challenging samples from 4 different viewing angles (approximately 0, 10 and 20) with 5 subjects. The HKU multi-angle dataset includes 2000 color images for testing Wang et al. (2015). (The dataset downloaded from the link reported in Wang et al. (2015) contains 2000 images, while the authors indicated that the dataset includes 3000 images.) Figure 7 shows gesture samples.

Figure 7: The gesture samples of HKU multi-angle dataset.

4.1.4 UESTC-ASL dataset

The UESTC-ASL dataset consists of 1100 color images of ASL digit gestures collected by Kinect. Gestures 1 to 10 are performed 11 times by 10 subjects in different orientations, depths, and scales Cheng et al. (2016). Figure 8 shows some samples of the UESTC-ASL dataset. Due to the high similarity among ASL digit gestures and the small inter-class variations, this dataset is really challenging.

Figure 8: The sample gestures of UESTC-ASL dataset Cheng et al. (2016).

4.2 Mean accuracy

We test the proposed approach on the NTU Hand Digits, HKU, HKU multi-angle, and UESTC-ASL datasets on a 3 GHz CPU with a Matlab implementation. The experimental results and comparisons with state-of-the-art approaches are reported in Tables 1-4.

Table 1 shows the results and the comparison on the NTU Hand Digits dataset. The proposed approach is compared with well-known approaches such as Thresholding Decomposition Ren et al. (2011), skeleton matching Ren et al. (2013), Hand dominant line Wang and Yang (2013), H3DF Zhang et al. (2013), VS-LBP Maqueda et al. (2015), CSG-EMD Wang et al. (2017), and GNG+LDA Mirehi et al. (2019).

Approaches based on (h-h) Mean accuracy
Skeleton matching Ren et al. (2013) 78.6
Near-convex Decomposition+FEMD Ren et al. (2011) 93.9
Hand dominant line + SVM Wang and Yang (2013) 97.1
VS-LBP + SVM Maqueda et al. (2015) 97.3
GNG+LDA Mirehi et al. (2019) 98.68
Approaches based on Deep learning Mean accuracy
Deep network + RGB-D images Li et al. (2018) 98.5
Approaches based on (l-o-o) Mean accuracy
Thresholding Decomposition+FEMD Ren et al. (2011) 95
Shape context without bending cost Ren et al. (2013) 97
Shape context with bending cost Ren et al. (2013) 95.7
Skeleton matching Ren et al. (2013) 96
Hand dominant line + SVM Wang and Yang (2013) 91.1
H3DF Zhang et al. (2013) 95.5
VS-LBP + SVM Maqueda et al. (2015) 95.9
CSG-EMD (shape only) Wang et al. (2017) 99.6
CSG-EMD Wang et al. (2017) 99.7
GNG+LDA Mirehi et al. (2019) 98.6
Approaches based on (l-4-o) Mean accuracy
Thresholding Decomposition+FEMD Ren et al. (2011) 91.025
Shape context without bending cost Ren et al. (2013) 92.2
Shape context with bending cost Ren et al. (2013) 85.375
Skeleton matching Ren et al. (2013) 90.475
SP-EMD Wang et al. (2017) (shape only) 96.5
SP-EMD Wang et al. (2017) 97.2

Table 1: The comparison of performance on NTU Hand Digits dataset

As can be seen, the proposed approach (GNG-IEMD) achieves the highest mean accuracies of 99.9%, 99.7%, and 99.3% in the (h-h), (l-o-o), and (l-4-o) cross-validation protocols, respectively. Although GNG-IEMD uses only binary images while other approaches utilize color and depth information Wang et al. (2015, 2017); Li et al. (2018), our results are superior. One of the reasons for this is the use of graph distances instead of Euclidean distances. Moreover, as presented in Table 1, the mean accuracies in (l-4-o) CV and (l-o-o) CV do not differ significantly, in contrast with other approaches, which indicates the insensitivity of the approach to the training data. Figures 9(a), 9(b), and 9(c) show the confusion matrices of hand gestures on this dataset. In a few cases, gestures with the same number of fingers have been mismatched. The reason might be the similarity in the topology of the gestures and inaccuracy in segmentation.

(a) NTU dataset (h-h) CV
(b) NTU dataset (l-o-o) CV
(c) NTU dataset (l-4-o) CV
(d) HKU dataset (l-o-o) CV
(e) HKU dataset (l-4-o) CV
(f) HKU multi-angle dataset (l-o-o) CV
(g) HKU multi-angle dataset (l-4-o) CV
(h) UESTC-ASL dataset (h-h).
(i) UESTC-ASL dataset (I2I).
Figure 9: Confusion matrices of GNG-IEMD on the NTU, HKU, HKU multi-angle, and UESTC-ASL datasets.

The results on the HKU dataset are presented in Table 2. It can be seen that GNG-IEMD achieves the best recognition rates in both (l-o-o) CV and (l-4-o) CV among the compared approaches. Another substantial point is that it has the smallest difference between the (l-o-o) and (l-4-o) recognition rates, which indicates the independence of our approach from the user and the training data. The confused cases are shown in Figures 9(d) and 9(e).

Approaches (l-o-o CV) (l-4-o CV)
Thresholding Decomposition+FEMD Ren et al. (2011) 95 91
Skeleton matching Ren et al. (2013) 96 90.5
SP-EMD (shape only) Wang et al. (2015) 98.7 96.1
SP-EMD Wang et al. (2015) 99.2 97.3
CSG-EMD (shape only) Wang et al. (2017) 99.4 97.4
CSG-EMD Wang et al. (2017) 99.4 97.9
GNG-IEMD 99.6 98.65
Table 2: The comparison of performance on HKU dataset
Approaches (l-o-o CV) (l-4-o CV)
Thresholding Decomposition+FEMD Ren et al. (2011) 96.2 89.7
Skeleton matching Ren et al. (2013) 95.1 90.3
SP-EMD (shape only) Wang et al. (2015) 95.3 92.5
SP-EMD Wang et al. (2015) 97.8 94.7
CSG-EMD (shape only) Wang et al. (2017) 96.1 93.7
CSG-EMD Wang et al. (2017) 97.9 95.6
GNG-IEMD 96.9 96.4
Table 3: The comparison of performance on HKU multi-angle hand gesture dataset

The evaluation on the HKU multi-angle dataset is shown in Table 3. The recognition accuracies of the proposed approach in both (l-o-o) and (l-4-o) CV are appropriate, which shows the stability of our approach against rotation. Note that GNG-IEMD does not use depth and color data for recognition. Figures 9(f) and 9(g) show the confusion matrices.

Input sign Thresholding Decomposition+FEMD Ren et al. (2011) I2I-DTW Cheng et al. (2016) I2C-DTW Cheng et al. (2016) GNG-IEMD ( I2I) GNG-IEMD (h-h)
1 100 92 100 96 96
2 100 99 100 100 100
3 95 99 95 85 98
4 100 96 100 100 100
5 100 98 100 100 100
6 59 58 80 53 76
7 80 66 77 80 94
8 64 73 90 78 84
9 57 44 72 47 82
10 73 93 91 100 100

Mean 82.8 81.8 90.5 83.9 93
Table 4: The performance on UESTC-ASL dataset

The number of studies on the UESTC-ASL dataset is limited Ren et al. (2011); Cheng et al. (2016). We compare GNG-IEMD with these approaches in Table 4. In Cheng et al. (2016), two Dynamic Time Warping approaches, I2I-DTW and I2C-DTW, as well as FEMD Ren et al. (2011), were evaluated on UESTC-ASL. In the Image-to-Image Dynamic Time Warping (I2I-DTW) approach, the distance between the testing sample and all training samples is computed, while the Image-to-Class Dynamic Time Warping (I2C-DTW) approach searches for the minimal warping path between a test sample and a training class's compositional features Cheng et al. (2016). We evaluated the proposed approach on UESTC-ASL in the (I2I) and (h-h) settings. In (I2I), we randomly choose one image of each subject for training and use the remaining images for testing. Our approach achieves a high accuracy of 83.9% in (I2I), while Thresholding Decomposition+FEMD and I2I-DTW achieved 82.8% and 81.8%, respectively. Also, our result in (h-h) is 93%, which is considerable. The mismatched cases can be seen in Figures 9(h) and 9(i).

Because of the small inter-class variations in the UESTC-ASL dataset, the hand gestures look very similar from some viewpoints. In particular, gestures with three extended fingers are easily confused, such as gestures 6 and 9, and also gestures 3 and 7.

4.3 Sensitivity Analysis

The variety in finger shapes and hand poses makes hand shape variation a significant challenge. Researchers have introduced shape-based features to address this issue. We introduce graph-based features that extract stable topological properties under these variations.

The GNG graphs do not depend on the size of the image, so the features are scale-independent. Since the graphs are robust to rotation and articulation, their properties are not influenced by these transformations. In Mirehi et al. (2019), different experiments were performed to evaluate GNG graphs against scale and rotation, and the results confirmed the stability of GNG graphs in these cases. The outer boundary of a GNG graph is a coarse estimate of the boundary of the object. It describes the overall shape of the object without depending heavily on the object's boundary pixels. This improves stability against noise, which is an unavoidable and challenging problem in hand gesture recognition. Although simple thresholding for segmentation produces images with considerable boundary noise, our approach achieves a higher recognition rate than the state of the art. Sensitivity to noise and to the GNG parameters was evaluated in Mirehi et al. (2019).
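For concreteness, the growth process of a GNG graph can be sketched as follows. This is a minimal version of the algorithm of Fritzke (1995) on 2-D points; it omits the removal of isolated nodes, and the parameter values (`eps_b`, `eps_n`, `a_max`, `lam`, `alpha`, `d`) are illustrative defaults, not the ones used in our experiments:

```python
import random

def grow_neural_gas(points, max_nodes=20, lam=25, eps_b=0.2, eps_n=0.006,
                    a_max=50, alpha=0.5, d=0.995, iterations=2000, seed=0):
    """Minimal Growing Neural Gas (Fritzke, 1995) on 2-D points.
    Returns node positions and the edge set; the outer boundary of the
    resulting graph coarsely traces the shape of the sampled object."""
    rng = random.Random(seed)
    nodes = [list(rng.choice(points)) for _ in range(2)]  # two seed nodes
    errors = [0.0, 0.0]
    edges = {}  # frozenset({i, j}) -> age

    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    for t in range(1, iterations + 1):
        x = rng.choice(points)
        # find the two nearest nodes to the input signal
        order = sorted(range(len(nodes)), key=lambda i: dist2(x, nodes[i]))
        s1, s2 = order[0], order[1]
        errors[s1] += dist2(x, nodes[s1])
        # move the winner and its topological neighbours toward the input
        nodes[s1][0] += eps_b * (x[0] - nodes[s1][0])
        nodes[s1][1] += eps_b * (x[1] - nodes[s1][1])
        for e in list(edges):
            if s1 in e:
                j = next(i for i in e if i != s1)
                nodes[j][0] += eps_n * (x[0] - nodes[j][0])
                nodes[j][1] += eps_n * (x[1] - nodes[j][1])
                edges[e] += 1  # age edges emanating from the winner
        edges[frozenset((s1, s2))] = 0          # refresh winner-pair edge
        edges = {e: a for e, a in edges.items() if a <= a_max}  # drop stale
        # periodically insert a node near the largest accumulated error
        if t % lam == 0 and len(nodes) < max_nodes:
            q = max(range(len(nodes)), key=lambda i: errors[i])
            nbrs = [next(i for i in e if i != q) for e in edges if q in e]
            if nbrs:
                f = max(nbrs, key=lambda i: errors[i])
                r = len(nodes)
                nodes.append([(nodes[q][0] + nodes[f][0]) / 2,
                              (nodes[q][1] + nodes[f][1]) / 2])
                edges.pop(frozenset((q, f)), None)
                edges[frozenset((q, r))] = 0
                edges[frozenset((f, r))] = 0
                errors[q] *= alpha
                errors[f] *= alpha
                errors.append(errors[q])
        errors = [e * d for e in errors]  # global error decay
    return nodes, set(edges)
```

Because every node update is a convex combination of input points, the graph always stays inside the sampled region, which is why its size does not depend on the image resolution.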

5 Conclusion

In this paper, we proposed a new graph-based approach for hand gesture recognition (GNG-IEMD) with less dependency on pixels than existing approaches. The hand image is modeled by a GNG graph, the topological and geometrical features of the graph are extracted, and the dissimilarity between hand gestures is measured by an Improved Earth Mover's Distance. Both boundary and interior information are utilized. The boundary of the GNG graph models the contour of the image and captures the overall shape of the hand; hence, the proposed approach is not sensitive to noise on the contour. To test the performance of GNG-IEMD experimentally, we selected four well-known real-life datasets of ASL digits: NTU Hand Digits, HKU, HKU multi-angle, and UESTC-ASL. We applied GNG-IEMD to them and compared the results with the state of the art. Our results demonstrate the strong performance of our approach.
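As background for the dissimilarity measure, the classical Earth Mover's Distance between two 1-D histograms of equal total mass reduces to accumulating the flow between prefix sums. The sketch below shows only this baseline EMD; the IEMD used in this paper is an improved variant and is not reproduced here:

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two 1-D histograms with equal
    total mass: the mass carried past each bin boundary, summed."""
    assert len(p) == len(q)
    total = 0.0
    carry = 0.0  # mass that must still be moved past the current bin
    for pi, qi in zip(p, q):
        carry += pi - qi
        total += abs(carry)
    return total
```

For example, moving one unit of mass across two bins costs 2: `emd_1d([1, 0, 0], [0, 0, 1])` returns 2.0, while identical histograms have distance 0.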


  • Bai and Latecki (2008) Bai X, Latecki LJ (2008) Path similarity skeleton graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(7):1282–1292
  • Belongie et al. (2002) Belongie S, Malik J, Puzicha J (2002) Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(4):509–522
  • Bondy and Murty (2008) Bondy A, Murty MR (2008) Graph Theory, vol 244. Springer-Verlag London
  • Cheng et al. (2016) Cheng H, Dai Z, Liu Z, Zhao Y (2016) An image-to-class dynamic time warping approach for both 3d static and trajectory hand gesture recognition. Pattern Recognition 55:137–147
  • Cheok et al. (2019) Cheok MJ, Omar Z, Jaward MH (2019) A review of hand gesture and sign language recognition techniques. International Journal of Machine Learning and Cybernetics 10(1):131–153
  • Dong et al. (2015) Dong C, Leu MC, Yin Z (2015) American sign language alphabet recognition using microsoft kinect. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 44–52
  • Farooq and Won (2015) Farooq A, Won CS (2015) A survey of human action recognition approaches that use an rgb-d sensor. IEIE Transactions on Smart Processing & Computing 4(4):281–290
  • Fink et al. (2015) Fink O, Zio E, Weidmann U (2015) Novelty detection by multivariate kernel density estimation and growing neural gas algorithm. Mechanical Systems and Signal Processing 50:427–436
  • Fritzke (1995) Fritzke B (1995) A growing neural gas network learns topologies. In: Advances in neural information processing systems, pp 625–632
  • Hogan (2003) Hogan K (2003) Can’t get through: eight barriers to communication. Pelican Publishing
  • Klein (2012) Klein HA (2012) The science of measurement: A historical survey. Courier Corporation
  • Li et al. (2018) Li Y, Wang X, Liu W, Feng B (2018) Deep attention network for joint hand gesture localization and recognition using static rgb-d images. Information Sciences 441:66–78
  • Li and Wachs (2014) Li YT, Wachs JP (2014) Hegm: A hierarchical elastic graph matching for hand gesture recognition. Pattern Recognition 47(1):80–88
  • Maqueda et al. (2015) Maqueda AI, del Blanco CR, Jaureguizar F, García N (2015) Human–computer interaction based on visual hand-gesture recognition using volumetric spatiograms of local binary patterns. Computer Vision and Image Understanding 141:126–137
  • Mirehi et al. (2019) Mirehi N, Tahmasbi M, Targhi AT (2019) Hand gesture recognition using topological features. Multimedia Tools and Applications 78(10):13361–13386
  • Núñez et al. (2018) Núñez JC, Cabido R, Pantrigo JJ, Montemayor AS, Vélez JF (2018) Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition. Pattern Recognition 76:80–94
  • Orts-Escolano et al. (2016) Orts-Escolano S, Garcia-Rodriguez J, Morell V, Cazorla M, Perez JAS, Garcia-Garcia A (2016) 3d surface reconstruction of noisy point clouds using growing neural gas: 3d object/scene reconstruction. Neural Processing Letters 43(2):401–423
  • Plouffe and Cretu (2016) Plouffe G, Cretu AM (2016) Static and dynamic hand gesture recognition in depth data using dynamic time warping. IEEE Transactions on Instrumentation and Measurement 65(2):305–316
  • Presti and La Cascia (2016) Presti LL, La Cascia M (2016) 3d skeleton-based human action classification: A survey. Pattern Recognition 53:130–147
  • Rautaray and Agrawal (2015) Rautaray SS, Agrawal A (2015) Vision based hand gesture recognition for human computer interaction: a survey. Artificial Intelligence Review 43(1):1–54
  • Ren et al. (2011) Ren Z, Yuan J, Zhang Z (2011) Robust hand gesture recognition based on finger-earth mover’s distance with a commodity depth camera. In: Proceedings of the 19th ACM international conference on Multimedia, ACM, pp 1093–1096
  • Ren et al. (2013) Ren Z, Yuan J, Meng J, Zhang Z (2013) Robust part-based hand gesture recognition using kinect sensor. IEEE Transactions on Multimedia 15(5):1110–1120
  • Rubner et al. (2000) Rubner Y, Tomasi C, Guibas LJ (2000) The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision 40(2):99–121
  • Stergiopoulou and Papamarkos (2009) Stergiopoulou E, Papamarkos N (2009) Hand gesture recognition using a neural network shape fitting technique. Engineering Applications of Artificial Intelligence 22(8):1141–1158
  • Sun et al. (2017) Sun Q, Liu H, Harada T (2017) Online growing neural gas for anomaly detection in changing surveillance scenes. Pattern Recognition 64:187–201
  • Triesch and von der Malsburg (2002) Triesch J, von der Malsburg C (2002) Classification of hand postures against complex backgrounds using elastic graph matching. Image and Vision Computing 20(13-14):937–943
  • Wang et al. (2015) Wang C, Liu Z, Chan SC (2015) Superpixel-based hand gesture recognition with kinect depth camera. IEEE Transactions on Multimedia 17(1):29–39
  • Wang et al. (2017) Wang C, Liu Z, Zhu M, Zhao J, Chan SC (2017) A hand gesture recognition system based on canonical superpixel-graph. Signal Processing: Image Communication 58:87–98
  • Wang and Yang (2013) Wang Y, Yang R (2013) Real-time hand posture recognition based on hand dominant line using kinect. In: 2013 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), IEEE, pp 1–4
  • Wang et al. (2019) Wang Y, Jung C, Yun I, Kim J (2019) Spfemd: Super-pixel based finger earth mover’s distance for hand gesture recognition. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp 4085–4089
  • Zafrulla et al. (2011) Zafrulla Z, Brashear H, Starner T, Hamilton H, Presti P (2011) American sign language recognition with the kinect. In: Proceedings of the 13th international conference on multimodal interfaces, ACM, pp 279–286
  • Zhang et al. (2013) Zhang C, Yang X, Tian Y (2013) Histogram of 3d facets: A characteristic descriptor for hand gesture recognition. In: 2013 10th IEEE international conference and workshops on automatic face and gesture recognition (FG), IEEE, pp 1–8