1 Introduction
Nonverbal communication, which involves communication through hand gestures, body movements, and facial expressions, accounts for about 65% of all human communication Hogan (2003). Among body, arm, and facial movements (body language), hand gestures are the most important part of nonverbal communication. One of the objectives of intelligent systems is to facilitate natural human-computer interaction. Among the various modes of human-computer interaction, hand gestures are a natural and effective means of communication with a significant ability to exchange information.
However, hand gesture recognition is known to be a difficult problem in computer vision due to the variety in the shape, size, and direction of hands and fingers across different hand images. The problem can be broadly divided into two categories: static and dynamic. The recognition of dynamic gestures examines spatio-temporal characteristics, while static recognition focuses on the internal information of a single image. The study of static gesture recognition is an essential part of hand gesture recognition because hand shapes carry specific information without any movement
Li et al. (2018). Most previous approaches have developed hand gesture recognition systems using a combination of preprocessing and machine learning methods. Most of these studies extract pixel-based features and classify hand gestures using machine learning methods
Dong et al. (2015); Li et al. (2018). The study of hand gesture recognition using meaningful shape features is important because such features improve stability against articulation, scale, rotation, and noise. Hence, in recent decades, researchers have been trying to recognize hand gestures using significant features extracted from the shape of the image and its boundary
Ren et al. (2011); Wang et al. (2017). Skeleton-, geometry-, and graph-based methods are the most well-known methods in this area Ren et al. (2011, 2013); Wang and Yang (2013); Wang et al. (2017). Skeleton-based methods have attractive properties, such as invariance to scale and rotation, obtained by capturing the topological and geometrical information of skeletal branches. The main limitation of skeleton-based methods is their low stability against contour noise: small variations or noise on the boundary of the object can cause redundant branches in the skeleton and significant changes in the structure of its topology.
Geometry-based methods study geometric properties of the image, such as the Euclidean distances and angles of the fingers with respect to the center of the palm, and describe the important information of the object with a compact vector while ignoring redundant pixel information. These approaches may be influenced by articulation and viewpoint. Most of them are based on the hand contour, which is often distorted due to the low resolution and precision of current depth cameras. In other words, their performance may be reduced by orientation changes and contour noise
Wang et al. (2017). Graphs are robust against rotation, articulation, and noise. The inherent properties of a graph do not depend on its representation, so graphs can be used as an effective tool for image representation. Hence, various graph-based methods have been presented for hand gesture recognition, but their structure depends on the local information of the pixels, and the loss of some pixel information, such as noise and small inner holes, reduces their performance Li and Wachs (2014); Triesch and von der Malsburg (2002); Wang et al. (2015, 2017). Recently, a graph-based method that uses Growing Neural Gas (GNG) to construct the graph and linear discriminant analysis (LDA) for classification was introduced for hand gesture recognition Mirehi et al. (2019).
In this paper, we use an idea similar to Mirehi et al. (2019) to form a GNG graph for a given image; then, we introduce new topological features for hand gesture recognition. The new features capture the convexity and concavity of boundaries more precisely. We also introduce an improved version of the Earth Mover's Distance to measure the dissimilarity between feature vectors. This leads to higher accuracy on different datasets. We evaluate the proposed approach on challenging datasets, including the NTU Hand Digits, HKU, HKU multi-angle, and UESTC-ASL datasets.
The rest of the paper is organized as follows: Section 2 briefly reviews the related work. Section 3 presents the basic steps of the proposed method and the hand gesture recognition approach. The results of the experimental study and a comparison with state-of-the-art approaches are presented in Section 4. Finally, Section 5 concludes the paper.
2 Related works
In this section, we briefly review the state-of-the-art approaches. Different skin color methods have been used for hand detection and segmentation. The main decision in building a skin color model is the choice of color space. However, variations in skin color and background objects with a color distribution similar to human skin can confuse these methods Rautaray and Agrawal (2015). Most current methods use Kinect sensors to collect data and to detect and segment the hand Maqueda et al. (2015); Cheng et al. (2016); Wang et al. (2017); Li et al. (2018). Depth cameras facilitate the hand segmentation process compared to skin-based models, especially when the background has a similar texture Plouffe and Cretu (2016); Ren et al. (2011, 2013). In these approaches, the hand of the user is considered the nearest object in the scene to the camera, and segmentation is performed by specifying a threshold value. We use the same approach for hand detection and segmentation in this study.
For more precision in hand detection and segmentation, some approaches apply both the depth map and the skeleton tracking provided by Kinect Presti and La Cascia (2016); Zafrulla et al. (2011); Wang et al. (2015). Although these methods may provide more accurate hand detection, they suffer from configuration complexity.
Various hand features can be used for hand gesture recognition. They can be roughly grouped into two categories: pixel-based features Maqueda et al. (2015); Zhang et al. (2013) and shape-based features Bai and Latecki (2008); Belongie et al. (2002); Stergiopoulou and Papamarkos (2009). Shape-based features comprise geometry-, graph-, and skeleton-based features.
Belongie et al. introduced a shape context descriptor by computing a log-polar histogram of the relative positions of contour points Belongie et al. (2002).
Fritzke presented an incremental network that learns the topological structure of input vectors by a simple Hebb-like rule Fritzke (1995).
Stergiopoulou and Papamarkos applied the GNG graph for image representation and considered limited geometric features, such as the distance and angle between neurons Stergiopoulou and Papamarkos (2009). The skeleton of an object can be considered another source of shape information for hand gesture recognition Bai and Latecki (2008). Noisy and distorted contours have a significantly negative effect on extracting the correct skeleton. Zhang et al. used local features for hand gesture recognition: they computed Histograms of Oriented Gradients (HOG) of the 3D point distribution in color images Zhang et al. (2013).
In Maqueda et al. (2015), a Volumetric Spatiograms of Local Binary Patterns (VS-LBP) method was employed for hand gesture recognition. Despite their appropriate results, these approaches depend on the local information of the pixels, which reduces their stability against pixel distortions Zhang et al. (2013); Maqueda et al. (2015). Ren et al. in Ren et al. (2011) and Ren et al. (2013) proposed a contour-based method using a Finger Earth Mover's Distance (FEMD) and a template matching approach. Contours are often distorted due to the low resolution and precision of current depth cameras, which affects the accuracy of contour-based approaches. Wang et al. presented a color-depth Superpixel Earth Mover's Distance (SP-EMD), constructed by segmenting the image into superpixels of roughly equal size. They applied the Earth Mover's Distance (EMD) to measure the similarity of hand gestures and defined the cost on the centroids of superpixels based on their depth information and location, which can be influenced by camera conditions and the variety of hand shapes Wang et al. (2015). Wang et al. extended the previous method based on a Canonical Superpixel Graph to reduce the hand shape variation problem Wang et al. (2017). In another study, a superpixel-based finger Earth Mover's Distance (SP-FEMD) approach was proposed, which considered only the superpixels of fingers and used template matching Wang et al. (2019). An Image-to-Class Dynamic Time Warping (I2C-DTW) approach for both 3D static and trajectory hand gesture recognition was introduced in Cheng et al. (2016) by computing the image-to-class distance for hand gesture classification.
In Plouffe and Cretu (2016), a K-curvature algorithm, based on the change in the slope angle of the tangent line, was employed to localize the fingertips on the contour extracted from depth data, and dynamic time warping (DTW) was applied for gesture recognition. This approach depends on the precision and resolution of the depth data.
Moreover, various deep learning approaches have been proposed for developing hand gesture recognition systems Cheok et al. (2019); Farooq and Won (2015); Li et al. (2018); Núñez et al. (2018). Li et al. provided a deep CNN framework for hand gesture recognition using the four-channel RGB-D (RGB plus depth) image Li et al. (2018). Its main drawback is its dependence on lighting conditions. Núñez et al. proposed a combination of a CNN and a Long Short-Term Memory (LSTM) network based on human skeleton kinematics for the hand gesture recognition problem Núñez et al. (2018).
3 Proposed approach
In the previous work Mirehi et al. (2019), Growing Neural Gas (GNG) graphs were constructed from binary images and the bulges of the hand, including the fingers and the wrist, were computed; afterward, topological and geometrical features were extracted from the GNG graph. The hand gestures were first categorized based on the number of bulges and then classified according to the defined features by LDA. In the current work, we extend the method of Mirehi et al. (2019) in the following directions.
1) The previously defined features are extended for hand gesture recognition, and new features describing the shape of the boundary, such as concavity, convexity, and overall bulge shape, are defined.
2) A new Earth Mover's Distance (EMD) is introduced to measure the dissimilarity of feature vectors extracted from GNG graphs.
3) Hand gestures are classified by a k-NN classifier according to this Earth Mover's Distance.
4) The proposed approach is evaluated on more challenging datasets, including HKU, HKU multi-angle, and UESTC-ASL.
We first briefly describe the overall framework and then describe each step in detail.
The main steps of the proposed approach contain:
1. Segmentation of the hand gesture image.
We use simple thresholding on the depth map for hand segmentation.
The hand of the user is considered the nearest object in the scene to the camera, and the hand region is segmented by thresholding on the depth.
This is a simple and effective method for hand segmentation, which is applied in many approaches Ren et al. (2011, 2013).
2. Constructing the GNG graph of the binary image.
3. Computing the outer boundary of the graph by computational geometry approaches.
4. Extracting the topological and geometrical features of the graph.
5. Measuring the dissimilarity between the hand gestures using a new Earth Mover's Distance.
6. Classifying the hand gestures by the k-NN algorithm.
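Step 1 above, the depth-based segmentation, can be sketched as follows. This is a minimal sketch, not the paper's implementation; in particular, the `band` interval kept beyond the nearest pixel is an illustrative assumption.

```python
def segment_hand(depth, band=150):
    """Segment the hand as the nearest object to the camera.

    depth : 2D list of depth values in millimetres (0 = missing reading).
    band  : assumed depth interval (mm) kept beyond the nearest valid pixel.
    Returns a binary mask (1 = hand pixel).
    """
    valid = [d for row in depth for d in row if d > 0]
    if not valid:
        return [[0] * len(depth[0]) for _ in depth]
    near = min(valid)          # nearest object in the scene = the hand
    thr = near + band          # keep pixels within the band behind it
    return [[1 if 0 < d <= thr else 0 for d in row] for row in depth]
```

The binary mask produced here is the input to the GNG construction in step 2.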
3.1 Constructing the GNG graph
Various approaches can be applied to construct a graph for an image. Our target graph should provide the following properties:
- The vertices should be distributed almost uniformly within the image, and the edge lengths should be almost equal.
- The number of vertices should be constant; in other words, the graph should not depend on the scale of the image.
- The graph should ignore holes and cracks inside the image and be robust against noise on the image contour.
We choose the GNG graph because it satisfies these properties very well. The GNG algorithm learns a low-dimensional subspace of the input data space while learning the topological structure of the data distribution Fritzke (1995). Moreover, the GNG algorithm can follow the behavior of vertices under changing dynamic conditions and can be extended to 3D online representation and object tracking Fink et al. (2015); Orts-Escolano et al. (2016); Sun et al. (2017). In the following, we describe the GNG algorithm briefly.
3.2 The GNG graph
Growing Neural Gas is an unsupervised learning algorithm Fritzke (1995). The algorithm starts with two vertices at random locations and then updates the locations of vertices by comparing their distances to the input data in each step. An error value, indicating the distance to the input, is accumulated at the vertex closest to the compared input data. The two vertices closest to the compared input data are connected by a zero-age edge, and edges that are too old are removed from the graph. Periodically, a new vertex is added between the vertices with the highest error values. The algorithm is repeated until a stopping criterion is met. The details are explained in Fritzke (1995). The principal parameters of the algorithm are:
- N: the number of vertices.
- εb and εn: the vertex closest to the input data and its neighbors are moved towards the input data by fractions εb and εn, respectively.
- a_max: edges older than a_max are removed in every step.
- λ: the number of input data processed between vertex insertions.
- α: the error values of the two vertices with the highest error are decreased by a factor of α after a new vertex is inserted between them.
- d: all error variables are decreased in every step by a factor of d.
We tested different values for the GNG parameters and fixed suitable values empirically; in particular, we set the number of vertices to N = 300.
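A minimal sketch of the GNG learning loop described above, following Fritzke (1995). The parameter defaults are illustrative placeholders, not the tuned values used in the paper, and the removal of isolated vertices is omitted for brevity.

```python
import random

def gng(data, n_max=300, eps_b=0.1, eps_n=0.01, a_max=50,
        lam=100, alpha=0.5, d=0.995, steps=5000, seed=0):
    """Sketch of Growing Neural Gas on 2D points; returns (nodes, edges)."""
    rnd = random.Random(seed)
    nodes = [list(rnd.choice(data)), list(rnd.choice(data))]  # two random starts
    err = [0.0, 0.0]
    edges = {}                                 # (i, j) with i < j -> age

    def key(i, j):
        return (min(i, j), max(i, j))

    for t in range(1, steps + 1):
        x = rnd.choice(data)
        order = sorted(range(len(nodes)),
                       key=lambda i: (nodes[i][0] - x[0]) ** 2 +
                                     (nodes[i][1] - x[1]) ** 2)
        s1, s2 = order[0], order[1]            # two closest vertices
        err[s1] += (nodes[s1][0] - x[0]) ** 2 + (nodes[s1][1] - x[1]) ** 2
        for k in range(2):                     # move winner towards the input
            nodes[s1][k] += eps_b * (x[k] - nodes[s1][k])
        for e in list(edges):                  # move neighbours, age their edges
            if s1 in e:
                n = e[1] if e[0] == s1 else e[0]
                for k in range(2):
                    nodes[n][k] += eps_n * (x[k] - nodes[n][k])
                edges[e] += 1
        edges[key(s1, s2)] = 0                 # zero-age edge between winners
        edges = {e: a for e, a in edges.items() if a <= a_max}
        if t % lam == 0 and len(nodes) < n_max:
            q = max(range(len(nodes)), key=lambda i: err[i])
            nbrs = [e[1] if e[0] == q else e[0] for e in edges if q in e]
            if nbrs:                           # insert between the worst pair
                f = max(nbrs, key=lambda i: err[i])
                nodes.append([(nodes[q][k] + nodes[f][k]) / 2 for k in range(2)])
                r = len(nodes) - 1
                edges.pop(key(q, f), None)
                edges[key(q, r)] = 0
                edges[key(f, r)] = 0
                err[q] *= alpha
                err[f] *= alpha
                err.append(err[q])
        err = [e * d for e in err]             # global error decay
    return nodes, set(edges)
```

Run on the foreground pixels of the binary hand mask, this yields the vertex positions and edges of the GNG graph used in the rest of the pipeline.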
3.3 Extracting the outer boundary
The outer boundary of a GNG graph can be computed by an algorithm similar to the convex hull algorithm: it selects the leftmost vertex of the graph and walks clockwise around the graph until it returns to the initial vertex. More details are described in Mirehi et al. (2019). The GNG boundary is an approximation of the contour of the image, so, unlike pixel-based boundaries, it is not sensitive to noise on the contour. Figure 1 shows an example of a GNG graph and its boundary.
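The boundary walk can be sketched as follows, assuming the graph comes with its planar embedding (vertex positions). The turn-selection rule is a plausible reconstruction of the clockwise walk, not the paper's exact algorithm.

```python
import math

def outer_boundary(pos, adj):
    """Walk clockwise around an embedded graph and collect its outer boundary.

    pos : dict vertex -> (x, y) position.
    adj : dict vertex -> set of neighbouring vertices.
    At each vertex the walk takes the most counter-clockwise neighbour
    relative to the edge it arrived on, which keeps it on the outer face.
    Assumes a connected embedding; stops when the walk returns to the start.
    """
    start = min(pos, key=lambda v: (pos[v][0], pos[v][1]))  # leftmost vertex
    boundary = [start]
    prev, cur = None, start
    while True:
        if prev is None:
            back = 3 * math.pi / 2          # pretend we arrived moving upward
        else:
            back = math.atan2(pos[prev][1] - pos[cur][1],
                              pos[prev][0] - pos[cur][0])
        best, best_turn = None, -1.0
        for w in adj[cur]:
            if w == prev and len(adj[cur]) > 1:
                continue                    # do not backtrack unless forced
            th = math.atan2(pos[w][1] - pos[cur][1], pos[w][0] - pos[cur][0])
            turn = (th - back) % (2 * math.pi)
            if turn > best_turn:
                best, best_turn = w, turn
        if best == start:
            return boundary                 # walked all the way around
        boundary.append(best)
        prev, cur = cur, best
```

On a square of four boundary vertices with one interior vertex connected to all of them, the walk visits only the four outer vertices.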
3.4 Topological features
In this section, we introduce meaningful features that capture the topological and geometrical properties of the graph. We first find the peaks and troughs on the boundary and then define the features using them.
Let G be the GNG graph and H be the spanning subgraph of G consisting of the boundary edges; note that G and H have the same vertex set. The adjacency matrices of G and H are denoted by A_G and A_H, respectively. For each peak on the boundary, there are two vertices connecting it to the rest of the image. We select these vertices as the basic vertices of the bulge and the subgraph inside the peak as the bulge itself. The distance between the basic vertices of a bulge in G is less than a multiple of their distance in H.
There is a standard for hand and body measurement Klein (2012). According to this standard, the length of the middle finger (fingertip to knuckle) is at least 5.5 times its width, and the length of the little finger is not smaller than half the length of the middle finger.
The experimental study indicates that the distance between the basic vertices of a finger is at most 2 in a GNG graph of a hand with 300 vertices, so we take the distance between the basic vertices of one finger to be 2 and the length of a finger to be greater than 4. To find the bulges, we compute the matrix B = A_G − A_H. Nonzero elements of B describe the edges of G that do not belong to the graph H; therefore, the (i, j) entry of B^k gives the number of walks of length k between vertices i and j that avoid boundary edges Bondy and Murty (2008).
The pairs of vertices whose corresponding entry in B^2 is nonzero are candidates for the basic vertices of a finger. Furthermore, the pairs of vertices whose corresponding entries in a higher power of B (matching the wider base of joined fingers) are nonzero are candidates for the basic vertices of sticking fingers.
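The candidate search with adjacency matrices can be sketched as follows; the matrices in the usage example are a small hypothetical graph, and the walk length is passed as a parameter rather than fixed to the finger or wrist case.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matpow(a, p):
    """p-th power (p >= 1) of a square matrix by repeated multiplication."""
    r = a
    for _ in range(p - 1):
        r = matmul(r, a)
    return r

def bulge_candidates(A_G, A_H, length):
    """Pairs of vertices joined by a walk of the given length that avoids
    boundary edges: entry (i, j) of (A_G - A_H)^length counts such walks,
    so nonzero off-diagonal entries mark candidate basic vertices."""
    n = len(A_G)
    B = [[A_G[i][j] - A_H[i][j] for j in range(n)] for i in range(n)]
    P = matpow(B, length)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if P[i][j] != 0]
```

For a 4-cycle boundary 0-1-2-3 with an interior vertex 4 attached to vertices 0 and 2, the only pair joined by an interior walk of length 2 is (0, 2).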
To find the wrist, we take the distance between the basic vertices of the wrist in G to be 6 or 7. The shape of the wrist is close to a rectangle, so the distance between the basic vertices of the wrist in H is at least 11. In a similar way, the corresponding powers of B are used for finding the basic vertices of the wrist. Among the candidate pairs of basic vertices for fingers and the wrist, we select the pair with the largest distance. Figure 3 displays an example of bulges (fingers and wrist) from the GNG of a hand gesture. The geometrical and topological features are defined as follows.
- The ratio of distances between fingers and the wrist (f1 and f2)
Feature f1 measures the ratio of distances between the first finger and the wrist, and feature f2 is the corresponding ratio for the last finger. There are counterclockwise and clockwise paths in H from the basic vertices of a finger to the basic vertices of the wrist (other fingers are replaced by the path between their basic vertices). Feature f1 is defined as the ratio of the lengths of the counterclockwise and clockwise paths for the first finger, and feature f2 is defined analogously for the last finger. Features f1 and f2 describe the relative location of the fingers with respect to the wrist. In Figure 4a, the paths defining f1 and f2 for the thumb are shown with black edges.
- Distances between bulges (f3)
For two consecutive bulges, f3 is the distance between them in H (see Figure 4b).
- The length of bulges (f4)
This feature measures the length of a bulge: for a given bulge, f4 is the length of the path between its basic vertices in H. This path is marked with black edges for the middle finger, index finger, and thumb in Figure 4c, with lengths 17, 17, and 13, respectively.
- The width of bulges (f5)
This feature is the distance between the pair of basic vertices of a bulge (see Figure 4d).
- The number of GNG vertices in a bulge (f6)
This feature counts the GNG vertices that lie inside a bulge (see Figure 4e).
- The number of GNG vertices inside the convex hull of the region between bulges (f7)
This feature counts the vertices inside the convex hull of the shortest path between two consecutive bulges; in effect, it measures the convexity or concavity of the region between them. In Figure 4f, the convex hull of the shortest path between two fingers is shown in black.
- The aspect ratio of the OMBB of a bulge (f8)
Given a bulge, this feature computes the ratio of the width to the length of the oriented minimum bounding box (OMBB) of the bulge. Figure 4g shows the OMBBs of the index finger and thumb.
- The aspect ratio of the OMBB of the region between bulges (f9)
This feature finds the OMBB of the shortest path between two consecutive bulges and computes the ratio of its width to its length (see Figure 4h).
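The convexity/concavity feature, the number of vertices strictly inside the convex hull of the path between two consecutive bulges, can be sketched with a standard monotone-chain hull and a point-in-polygon test; this is an illustrative implementation, not the paper's code.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in CCW order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside_hull(p, hull):
    """True if p lies strictly inside a CCW convex polygon."""
    n = len(hull)
    for i in range(n):
        a, b = hull[i], hull[(i + 1) % n]
        if (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) <= 0:
            return False
    return True

def concavity_count(path_points, graph_points):
    """Count graph vertices strictly inside the convex hull of the shortest
    path between two consecutive bulges (the convexity/concavity feature)."""
    hull = convex_hull(path_points)
    return sum(1 for p in graph_points
               if p not in path_points and inside_hull(p, hull))
```

A concave region between two fingers yields a positive count, while a convex region yields zero.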
3.5 Improved Earth Mover's Distance (IEMD)
We define a new Earth Mover's Distance to measure the dissimilarity between the extracted features of the GNG graphs.
The Earth Mover's Distance (EMD) is a measure of the distance between two probability distributions over a region Rubner et al. (2000). Different researchers have presented various forms of the EMD depending on their application Wang et al. (2015, 2017). We compute the IEMD of two GNG graphs as follows. The first GNG graph is represented by a signature S = {(c_1, w_1), ..., (c_m, w_m)} with m clusters, where c_i denotes the i-th cluster and w_i is its weight; S' = {(c'_1, w'_1), ..., (c'_n, w'_n)} denotes the signature of the second GNG graph with n clusters. We consider every computed bulge as a cluster. The weight of a bulge is a vector of length 7 that contains the principal defined features of the bulges, taken in clockwise order, together with their relationships to the other bulges.
Since the basic vertices of the wrist are not necessarily located at the wrist of the image and might lie on the forearm, we use the finger-to-wrist ratio features instead of the direct distance to describe the relative location of the wrist.
The maximum number of bulges occurs when all fingers are open; in this case, the number of bulges equals 6, including five fingers and the wrist. To improve partial matching and reduce mismatching, we insert virtual clusters into each signature so that its number of clusters equals 6: if m is less than 6, 6 − m virtual clusters with zero weights are inserted into the first signature, and similarly 6 − n virtual clusters with zero weights are inserted into the second signature. For two bulges c_i and c'_j, the cost c_ij is defined as

c_ij = ||w_i − w'_j|| + P_ij,   (2)

where P_ij = 0 when matching c_i to c'_j preserves the clockwise order of the bulges, and P_ij is a penalty term otherwise.
We expect that a good matching preserves the order of bulges; in this case, the cost is computed as the difference of the weights of these bulges. Otherwise, a penalty term is added to the cost. We need to find a flow F = [f_ij] between the two signatures, where f_ij is the flow between c_i and c'_j, subject to the standard EMD constraints Rubner et al. (2000):

f_ij ≥ 0,  Σ_j f_ij ≤ w_i,  Σ_i f_ij ≤ w'_j,  Σ_i Σ_j f_ij = min(Σ_i w_i, Σ_j w'_j),   (3)

where w_i and w'_j denote the total weights of the clusters. Then, IEMD is defined as

IEMD = (Σ_i Σ_j c_ij f_ij) / (Σ_i Σ_j f_ij).   (4)
We use IEMD to measure the dissimilarity between two hand gestures. Unlike Wang et al. (2015, 2017), this measure does not depend on the location and depth of the pixels, which results in more stability against hand shape variations. Moreover, IEMD can be applied to measure dissimilarity in other approaches.
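A simplified sketch of the IEMD computation: both signatures are padded with zero-weight virtual clusters to 6 clusters, and the optimal flow is approximated by the best one-to-one assignment with a per-inversion order penalty. The L1 cost, the constant penalty, and the assignment-based flow are simplifying assumptions, not the paper's exact formulation.

```python
from itertools import permutations

N_CLUSTERS = 6   # five fingers plus the wrist, the maximum number of bulges

def pad(signature, dim=7):
    """Append zero-weight virtual clusters so the signature has 6 clusters."""
    return signature + [[0.0] * dim] * (N_CLUSTERS - len(signature))

def iemd(sig1, sig2, penalty=1.0):
    """Approximate IEMD between two signatures (lists of weight vectors).

    Each matching is scored by the summed L1 differences of the matched
    weight vectors plus `penalty` for every order inversion; the best of
    the 6! one-to-one matchings is returned, normalised by the cluster count.
    """
    s1, s2 = pad(sig1), pad(sig2)
    best = float("inf")
    for perm in permutations(range(N_CLUSTERS)):
        l1 = sum(sum(abs(a - b) for a, b in zip(s1[i], s2[perm[i]]))
                 for i in range(N_CLUSTERS))
        inv = sum(1 for i in range(N_CLUSTERS)
                  for j in range(i + 1, N_CLUSTERS)
                  if perm[i] > perm[j])       # order violations
        best = min(best, l1 + penalty * inv)
    return best / N_CLUSTERS
```

Identical signatures give a distance of zero, and the distance grows with the feature differences of the matched bulges.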
3.6 Hand gesture recognition
Finally, we use the k-nearest neighbors (k-NN) algorithm for hand gesture classification. In this algorithm, the value of k is chosen according to the data; in our experiments, k is set to 3, and the class of a hand gesture is predicted from its three nearest neighbors in the training set. The size and variety of the training set can affect accuracy. In order to evaluate the proposed approach comprehensively, the training sets are chosen as follows: using half of the data for training and the other half for testing (hh), and leave-p-subject-out (lpo). In the (lpo) validation protocol, if the dataset includes N subjects, N − p subjects are chosen for training and the remaining p subjects are used for testing. This procedure is repeated for every combination of p subjects, and the average accuracy is reported. We choose leave-one-subject-out (loo) and leave-4-subject-out (l4o) cross-validation, which are the more common protocols for evaluating approaches.
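The classification step can be sketched as a k-NN with an arbitrary dissimilarity function (IEMD in the paper); the training data here are hypothetical scalars used only to illustrate the voting.

```python
from collections import Counter

def knn_classify(query, train, dist, k=3):
    """Classify `query` by majority vote of its k nearest training samples.

    train : list of (feature, label) pairs.
    dist  : dissimilarity function (IEMD in the proposed approach).
    k     : number of neighbours; k = 3 as in the experiments.
    """
    ranked = sorted(train, key=lambda fl: dist(query, fl[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because the classifier only needs a dissimilarity function, swapping IEMD for another measure requires no other change.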
4 Experimental study
In this section, we evaluate and compare our approach with state-of-the-art approaches such as Ren et al. (2011), skeleton matching Ren et al. (2013), Hand dominant line Wang and Yang (2013), H3DF Zhang et al. (2013), VS-LBP Maqueda et al. (2015), SP-EMD Wang et al. (2015), CSG-EMD Wang et al. (2017), and GNG+LDA Mirehi et al. (2019) on different datasets. These datasets are the NTU Hand Digits, HKU, HKU multi-angle, and UESTC-ASL datasets. First, we introduce these datasets briefly.
4.1 Datasets
4.1.1 NTU Hand Digits dataset
The NTU Hand Digits dataset is collected with Kinect and includes 1000 color images and their depth maps against cluttered backgrounds. It contains 10 hand gestures for the decimal digits 0-9, performed by 10 subjects with 10 samples per gesture. The subjects perform the gestures with variations in orientation, articulation, and scale Ren et al. (2011). Figure 5 shows some of these images.
4.1.2 HKU dataset
The HKU dataset was captured using Kinect from 5 subjects. It consists of 1000 joint color-depth images with 10 gestures labeled 0 to 9; each subject performed each gesture in 20 different poses Wang et al. (2015). The hand motions include large in-plane rotation and moderate out-of-plane rotation. Gesture samples are shown in Figure 6.
4.1.3 HKU multiangle dataset
The HKU multi-angle hand gesture dataset is an extension of the HKU dataset with challenging samples from 4 different viewing angles (approximately 0, 10, and 20 degrees) performed by 5 subjects. The HKU multi-angle dataset includes 2000 color images for testing Wang et al. (2015). (The dataset downloaded from the link reported in Wang et al. (2015) contains 2000 images, although the authors indicated that the dataset includes 3000 images.) Gesture samples are shown in Figure 7.
4.1.4 UESTCASL dataset
The UESTC-ASL dataset consists of 1100 color images of ASL digit gestures collected with Kinect. Gestures 1 to 10 are performed 11 times by 10 subjects in different orientations, depths, and scales Cheng et al. (2016). Figure 8 shows some samples of the UESTC-ASL dataset. Due to the high similarity among the ASL digit gestures and the small inter-class variations, this dataset is particularly challenging.
4.2 Mean accuracy
We test the proposed approach on the NTU Hand Digits, HKU, HKU multi-angle, and UESTC-ASL datasets on a 3 GHz CPU with a Matlab implementation. The experimental results and the comparison with the state-of-the-art approaches are reported in Tables 1-4.
Table 1 shows the results and the comparison on the NTU Hand Digits dataset. The proposed approach is compared with well-known approaches such as Thresholding Decomposition Ren et al. (2011), skeleton matching Ren et al. (2013), Hand dominant line Wang and Yang (2013), H3DF Zhang et al. (2013), VS-LBP Maqueda et al. (2015), CSG-EMD Wang et al. (2017), and GNG+LDA Mirehi et al. (2019).
Approaches based on (hh)  Mean accuracy
Skeleton matching Ren et al. (2013)  78.6
Near-convex Decomposition+FEMD Ren et al. (2011)  93.9
Hand dominant line + SVM Wang and Yang (2013)  97.1
VS-LBP + SVM Maqueda et al. (2015)  97.3
GNG+LDA Mirehi et al. (2019)  98.68
GNG-IEMD  99.7
Approaches based on Deep learning  Mean accuracy
Deep network + RGB-D images Li et al. (2018)  98.5
Approaches based on (loo)  Mean accuracy
Thresholding Decomposition+FEMD Ren et al. (2011)  95
Shape context without bending cost Ren et al. (2013)  97
Shape context with bending cost Ren et al. (2013)  95.7
Skeleton matching Ren et al. (2013)  96
Hand dominant line + SVM Wang and Yang (2013)  91.1
H3DF Zhang et al. (2013)  95.5
VS-LBP + SVM Maqueda et al. (2015)  95.9
CSG-EMD (shape only) Wang et al. (2017)  99.6
CSG-EMD Wang et al. (2017)  99.7
GNG+LDA Mirehi et al. (2019)  98.6
GNG-IEMD  99.9
Approaches based on (l4o)  Mean accuracy
Thresholding Decomposition+FEMD Ren et al. (2011)  91.025
Shape context without bending cost Ren et al. (2013)  92.2
Shape context with bending cost Ren et al. (2013)  85.375
Skeleton matching Ren et al. (2013)  90.475
SP-EMD (shape only) Wang et al. (2017)  96.5
SP-EMD Wang et al. (2017)  97.2
GNG-IEMD  99.3

As can be seen, the proposed approach (GNG-IEMD) achieves the highest mean accuracies of 99.7%, 99.9%, and 99.3% in the (hh), (loo), and (l4o) cross-validation protocols, respectively. Although GNG-IEMD uses only binary images while other methods utilize color and depth information Wang et al. (2015, 2017); Li et al. (2018), our results are stronger. One reason for this is the use of graph distances instead of Euclidean distances. Moreover, as presented in Table 1, the mean accuracies in (l4o) CV and (loo) CV do not differ significantly compared with other approaches, which indicates the insensitivity of the approach to the training data. Figures 8(b), 8(c), and 8(a) show the confusion matrices of the hand gestures on this dataset. In a few cases, gestures with the same number of fingers have been mismatched. The reason might be the similarity in the topology of the gestures and inaccuracy in segmentation.
The results on the HKU dataset are presented in Table 2. It can be seen that GNG-IEMD achieves considerable recognition rates in both (loo) CV and (l4o) CV compared with the other approaches. Another substantial point is the small difference between the (loo) and (l4o) recognition rates, which indicates the independence of our approach from the user and the training data. The confused cases are shown in Figures 8(e) and 8(d).
Approaches  (loo CV)  (l4o CV)
Thresholding Decomposition+FEMD Ren et al. (2011)  95  91
Skeleton matching Ren et al. (2013)  96  90.5
SP-EMD (shape only) Wang et al. (2015)  98.7  96.1
SP-EMD Wang et al. (2015)  99.2  97.3
CSG-EMD (shape only) Wang et al. (2017)  99.4  97.4
CSG-EMD Wang et al. (2017)  99.4  97.9
GNG-IEMD  99.6  98.65
Approaches  (loo CV)  (l4o CV)
Thresholding Decomposition+FEMD Ren et al. (2011)  96.2  89.7
Skeleton matching Ren et al. (2013)  95.1  90.3
SP-EMD (shape only) Wang et al. (2015)  95.3  92.5
SP-EMD Wang et al. (2015)  97.8  94.7
CSG-EMD (shape only) Wang et al. (2017)  96.1  93.7
CSG-EMD Wang et al. (2017)  97.9  95.6
GNG-IEMD  96.9  96.4
The evaluation on the HKU multi-angle dataset is shown in Table 3. The recognition accuracies of the proposed approach in both (loo) and (l4o) CV are appropriate, which shows the stability of our approach against rotation. Note that GNG-IEMD does not use depth or color data for recognition. Figures 8(f) and 8(g) show the confusion matrices.
Input sign  Thresholding Decomposition+FEMD Ren et al. (2011)  I2I-DTW Cheng et al. (2016)  I2C-DTW Cheng et al. (2016)  GNG-IEMD (I2I)  GNG-IEMD (hh)
1  100  92  100  96  96
2  100  99  100  100  100
3  95  99  95  85  98
4  100  96  100  100  100
5  100  98  100  100  100
6  59  58  80  53  76
7  80  66  77  80  94
8  64  73  90  78  84
9  57  44  72  47  82
10  73  93  91  100  100
Average  82.8  81.8  90.5  83.9  93
The number of studies on the UESTC-ASL dataset is limited Ren et al. (2011); Cheng et al. (2016). We compare GNG-IEMD with these approaches in Table 4. In Cheng et al. (2016), two Dynamic Time Warping approaches, I2I-DTW and I2C-DTW, as well as FEMD Ren et al. (2011), were evaluated on UESTC-ASL. In the Image-to-Image Dynamic Time Warping (I2I-DTW) approach, the distance between the testing sample and all training samples is computed, while the Image-to-Class Dynamic Time Warping (I2C-DTW) approach searches for the minimal warping path between a test sample and a training sample's compositional features Cheng et al. (2016). We evaluate the proposed approach on UESTC-ASL in (I2I) and (hh). In (I2I), we randomly choose one image of each subject for training and use the remaining images for testing. Our approach achieves an accuracy of 83.9% in (I2I), while Thresholding Decomposition+FEMD and I2I-DTW achieve 82.8% and 81.8%, respectively. Our result in (hh) is 93%, which is considerable. The mismatched cases can be seen in Figures 8(i) and 8(h).
Because of the small inter-class variations in the UESTC-ASL dataset, the hand gestures look very similar from some viewpoints. In particular, gestures with three fingers are similar, such as gestures 6 and 9, and also gestures 3 and 7.
4.3 Sensitivity Analysis
The variety in the shapes of fingers and hand poses has made the hand shape variation problem a significant challenge. Researchers have introduced shape-based features to address this issue. We introduce graph-based features that extract stable topological information under these variations.
We use GNG graphs, which do not depend on the size of the image, so the features are independent of scale. Since graphs are robust with respect to rotation and articulation, the properties of the GNG graphs are not influenced by rotation and articulation. In Mirehi et al. (2019), different experiments were performed to evaluate GNG graphs against scale and rotation, and the results confirmed the stability of GNG graphs in these cases. The outer boundary of a GNG graph is a coarse estimate of the boundary of the object. It describes the overall shape of the object without a strong dependency on the boundary pixels. This leads to improved stability against noise, which is an unavoidable and challenging problem in hand gesture recognition. Although using simple thresholding for segmentation yields images with substantial boundary noise, our approach achieves a higher recognition rate than the state-of-the-art. Sensitivity to noise and GNG parameters was evaluated in Mirehi et al. (2019).
5 Conclusion
In this paper, we proposed a new graph-based approach for hand gesture recognition (GNG-IEMD) with less dependency on pixels compared to existing approaches. The hand image is modeled by a GNG graph, the topological and geometrical features of the graph are extracted, and then the dissimilarity of hand gestures is measured by an Improved Earth Mover's Distance. Both the boundary and the interior information are utilized in recognition. The boundary of the GNG graph models the contour of the image and shows the overall shape of the hand; hence, the proposed approach is not sensitive to noise on the contour. To test the performance of GNG-IEMD experimentally, we selected four well-known real-life datasets: NTU Hand Digits, HKU, HKU multi-angle, and UESTC-ASL of ASL digits. We applied GNG-IEMD to them and compared the results with the state-of-the-art. The results show the superior performance of our approach.
References
Bai X, Latecki LJ (2008) Path similarity skeleton graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(7):1282–1292
Belongie S, Malik J, Puzicha J (2002) Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(4):509–522
Bondy A, Murty MR (2008) Graph Theory, vol 244. Springer-Verlag London
Cheng H, Dai Z, Liu Z, Zhao Y (2016) An image-to-class dynamic time warping approach for both 3D static and trajectory hand gesture recognition. Pattern Recognition 55:137–147
Cheok MJ, Omar Z, Jaward MH (2019) A review of hand gesture and sign language recognition techniques. International Journal of Machine Learning and Cybernetics 10(1):131–153
Dong C, Leu MC, Yin Z (2015) American Sign Language alphabet recognition using Microsoft Kinect. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 44–52
Farooq A, Won CS (2015) A survey of human action recognition approaches that use an RGB-D sensor. IEIE Transactions on Smart Processing & Computing 4(4):281–290
Fink O, Zio E, Weidmann U (2015) Novelty detection by multivariate kernel density estimation and growing neural gas algorithm. Mechanical Systems and Signal Processing 50:427–436
Fritzke B (1995) A growing neural gas network learns topologies. In: Advances in Neural Information Processing Systems, pp 625–632
Hogan K (2003) Can't get through: eight barriers to communication. Pelican Publishing
Klein HA (2012) The science of measurement: a historical survey. Courier Corporation
Li Y, Wang X, Liu W, Feng B (2018) Deep attention network for joint hand gesture localization and recognition using static RGB-D images. Information Sciences 441:66–78
Li YT, Wachs JP (2014) HEGM: a hierarchical elastic graph matching for hand gesture recognition. Pattern Recognition 47(1):80–88
Maqueda AI, del Blanco CR, Jaureguizar F, García N (2015) Human–computer interaction based on visual hand-gesture recognition using volumetric spatiograms of local binary patterns. Computer Vision and Image Understanding 141:126–137
Mirehi N, Tahmasbi M, Targhi AT (2019) Hand gesture recognition using topological features. Multimedia Tools and Applications 78(10):13361–13386
Núñez JC, Cabido R, Pantrigo JJ, Montemayor AS, Vélez JF (2018) Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition. Pattern Recognition 76:80–94
Orts-Escolano S, Garcia-Rodriguez J, Morell V, Cazorla M, Perez JAS, Garcia-Garcia A (2016) 3D surface reconstruction of noisy point clouds using growing neural gas: 3D object/scene reconstruction. Neural Processing Letters 43(2):401–423
Plouffe G, Cretu AM (2016) Static and dynamic hand gesture recognition in depth data using dynamic time warping. IEEE Transactions on Instrumentation and Measurement 65(2):305–316
Presti LL, La Cascia M (2016) 3D skeleton-based human action classification: a survey. Pattern Recognition 53:130–147
Rautaray SS, Agrawal A (2015) Vision based hand gesture recognition for human computer interaction: a survey. Artificial Intelligence Review 43(1):1–54
Ren Z, Yuan J, Zhang Z (2011) Robust hand gesture recognition based on finger-earth mover's distance with a commodity depth camera. In: Proceedings of the 19th ACM International Conference on Multimedia, ACM, pp 1093–1096
Ren Z, Yuan J, Meng J, Zhang Z (2013) Robust part-based hand gesture recognition using Kinect sensor. IEEE Transactions on Multimedia 15(5):1110–1120
Rubner Y, Tomasi C, Guibas LJ (2000) The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40(2):99–121
Stergiopoulou E, Papamarkos N (2009) Hand gesture recognition using a neural network shape fitting technique. Engineering Applications of Artificial Intelligence 22(8):1141–1158
Sun Q, Liu H, Harada T (2017) Online growing neural gas for anomaly detection in changing surveillance scenes. Pattern Recognition 64:187–201
Triesch J, von der Malsburg C (2002) Classification of hand postures against complex backgrounds using elastic graph matching. Image and Vision Computing 20(13-14):937–943
Wang C, Liu Z, Chan SC (2015) Superpixel-based hand gesture recognition with Kinect depth camera. IEEE Transactions on Multimedia 17(1):29–39
Wang C, Liu Z, Zhu M, Zhao J, Chan SC (2017) A hand gesture recognition system based on canonical superpixel-graph. Signal Processing: Image Communication 58:87–98
Wang Y, Yang R (2013) Real-time hand posture recognition based on hand dominant line using Kinect. In: 2013 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), IEEE, pp 1–4
Wang Y, Jung C, Yun I, Kim J (2019) SPFEMD: superpixel-based finger earth mover's distance for hand gesture recognition. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp 4085–4089
Zafrulla Z, Brashear H, Starner T, Hamilton H, Presti P (2011) American Sign Language recognition with the Kinect. In: Proceedings of the 13th International Conference on Multimodal Interfaces, ACM, pp 279–286
Zhang C, Yang X, Tian Y (2013) Histogram of 3D facets: a characteristic descriptor for hand gesture recognition. In: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), IEEE, pp 1–8