
Position and Rotation Invariant Sign Language Recognition from 3D Point Cloud Data with Recurrent Neural Networks

by   Prasun Roy, et al.
IIT Roorkee

Sign language is a gesture based symbolic communication medium among speech and hearing impaired people. It also serves as a communication bridge between non-impaired population and impaired population. Unfortunately, in most situations a non-impaired person is not well conversant in such symbolic languages which restricts natural information flow between these two categories of population. Therefore, an automated translation mechanism can be greatly useful that can seamlessly translate sign language into natural language. In this paper, we attempt to perform recognition on 30 basic Indian sign gestures. Gestures are represented as temporal sequences of 3D depth maps each consisting of 3D coordinates of 20 body joints. A recurrent neural network (RNN) is employed as classifier. To improve performance of the classifier, we use geometric transformation for alignment correction of depth frames. In our experiments the model achieves 84.81



1 Introduction

Sign language generally involves manual and non-manual gestures [zafrulla2011american]. Manual sign gestures include upper body movements, hand and finger movements whereas non-manual sign gestures include facial expressions, eye movement [von2008significance]. A Sign Language Recognition (SLR) system is a form of machine translation scheme which translates sign language into natural language and vice versa. Therefore, such systems act as a bidirectional communication channel between hearing and speech impaired population and non-impaired population. Similar to their natural language counterparts, different forms of sign language exist due to independent geographical and social context; such as American [zafrulla2011american], Australian [potter2013leap], Chinese [chai2013sign], Greek [ong2012sign], Indian [kumar2017coupled], [KUMAR_Inf], Spanish [incertis2006hand] etc.

With the rapid development of cost-efficient depth sensors in recent years, such as Kinect [kinect, vidalon2016brazilian, aliyu2016arabie] and LEAP Motion [leapmotion, naglot2016real, bird2020british], building real-time SLR systems has become feasible. These devices include specialized sensor units along with software development kits (SDK) to construct a 3D point cloud of the environment. The SLR system is then developed on top of this 3D point cloud [Mittal19]. For example, Chong et al. [chong2018american] developed an American Sign Language (ASL) alphabet and digit recognition system using a Leap Motion sensor. Recognition was carried out by an SVM and a deep neural network using features extracted from finger and hand motions. Likewise, a modified LSTM framework was proposed for continuous Indian Sign Language (ISL) in [Mittal19], where the authors used a confidence threshold to break continuous sign sentences into isolated sign words.

A number of frameworks have been proposed in the literature to improve recognition performance. These include multimodal systems that fuse multiple devices [KUMAR_Inf] and decision- and feature-level fusion [xiao2019multimodal]. A multimodal SLR system using Kinect and Leap Motion is proposed in [kumar2017coupled], where the authors modelled the incoming data sequences using a coupled HMM. A similar work can be found in [KUMAR_Inf], where the authors fused facial expressions of the signer with hand and finger movements using a Bayesian framework. Nevertheless, the general pipeline for such SLR systems consists of sensor hardware and an SDK for data acquisition, followed by SLR software routines for analysis, feature extraction and recognition from the captured data, as shown in Fig. 1. In an automated end-to-end scenario, a signer performs sign gestures in front of the sensor with six degrees of freedom, and the system produces the final recognition results after processing the point cloud data received by the sensor. This approach suffers from self-occlusion and a distorted view if the signer performs a gesture while rotated about the vertical axis with respect to the sensor plane. This geometric orientation problem can be addressed by applying an affine transformation [kumar2018position] to the point cloud data acquired by the sensor.

In this paper we use 3D point cloud data acquired by a Kinect v1 sensor to estimate 20 primary body joints of a signer. The movements of these body joints are recorded as frame-wise temporal information while performing sign gestures. A long short-term memory (LSTM) based RNN is then used as a discriminator to classify sign gestures. The main contributions of the paper are as follows:

  • Firstly, the recorded frames are processed with an affine transformation for rotation and position invariance.

  • Secondly, the gesture sequences are modelled using a deep LSTM framework to achieve state-of-the-art results. Finally, the performance is compared with existing approaches.

The rest of this paper is organized as follows. Section 2 presents the details of the proposed work. Experimental results are discussed in Section 3. Finally, Section 4 concludes the paper.

Figure 1: General workflow of SLR systems.

2 Proposed Work

The core attributes of any sign language are largely influenced by diverse geographical and social contexts. Given such wide contextual variation, a general-purpose SLR system is unlikely, which also makes it difficult to compare such systems. A valid comparison should therefore involve systems for a specific sign language with a predefined vocabulary. In this study, we perform experiments on a dataset [kumar2018position] of Indian sign language containing 2700 sign gestures uniformly distributed over 30 distinct categories. After the 3D point cloud data is acquired by the Kinect sensor, the associated SDK constructs a skeletal structure by estimating the spatial coordinates of 20 major body joints from the point cloud. The skeleton then undergoes a geometric transformation to minimize the effects of self-occlusion due to translation and rotation. The resulting temporal depth frames are then fed to a sequential model that assigns class probabilities.

2.1 Skeletal structure

The Kinect SDK constructs a skeletal structure of the body by estimating the spatial coordinates of 20 major body joints from the 3D point cloud data acquired by a Kinect v1 sensor. The sensor employs a generic camera to capture an RGB frame along with an IR sensor to create a depth map. The distance of each pixel of the RGB frame from the center of the sensor is retrieved by querying the depth frame at the spatial location of that pixel in the RGB frame. A visual representation of the enumerated body joints of the Kinect v1 sensor is shown in Fig. 2.

Figure 2: Enumerated body joints of Kinect v1 sensor.
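As an illustration of how such skeletal data might be laid out for a sequence model, the sketch below stores one gesture as an array of shape (T, 20, 3), i.e. T frames of 20 joints with three coordinates each, and flattens every frame into a 60-dimensional vector. The array layout, the helper name `flatten_gesture` and the frame count are assumptions made for this example, not details taken from the dataset.

```python
import numpy as np

NUM_JOINTS = 20  # joints per frame reported by the Kinect v1 skeleton (from the text)
NUM_COORDS = 3   # x, y, z coordinates per joint


def flatten_gesture(frames: np.ndarray) -> np.ndarray:
    """Turn a (T, 20, 3) gesture into a (T, 60) sequence of per-frame vectors.

    The (T, 20, 3) layout is an assumption of this sketch.
    """
    frames = np.asarray(frames, dtype=np.float64)
    if frames.ndim != 3 or frames.shape[1:] != (NUM_JOINTS, NUM_COORDS):
        raise ValueError(
            f"expected shape (T, {NUM_JOINTS}, {NUM_COORDS}), got {frames.shape}"
        )
    return frames.reshape(frames.shape[0], NUM_JOINTS * NUM_COORDS)


# Hypothetical gesture: 45 frames of random joint positions.
gesture = np.random.default_rng(0).normal(size=(45, NUM_JOINTS, NUM_COORDS))
sequence = flatten_gesture(gesture)
print(sequence.shape)  # (45, 60)
```

Keeping the joint axis separate until the last moment makes the geometric corrections of the following subsections easy to apply joint by joint before the frames are flattened.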

2.2 Position invariance

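A position invariant transformation helps to minimize recognition error due to translation of the point cloud in the sensor plane. We assume that a signer performs gestures on the $xz$ plane and that translation of the point cloud occurs when the spine midpoint $S = (x_s, y_s, z_s)$ is shifted from the origin $O = (0, 0, 0)$ of the coordinate system. Therefore, the translation vector $T$ is given by,

$$T = O - S = (-x_s, -y_s, -z_s) \tag{1}$$

Considering a point $P = (x, y, z)$ in 3D, the spatial coordinate of the translated point $P_T$ is given by,

$$P_T = P + T = (x - x_s, y - y_s, z - z_s) \tag{2}$$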
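The translation step described above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes one frame is a (20, 3) array and that the spine midpoint sits at row index 1; both the layout and the index are assumptions of the example and should be adapted to the actual joint ordering. Whether the correction is applied per frame or once per gesture is not stated in the text; the sketch works on a single frame.

```python
import numpy as np

SPINE = 1  # assumed row index of the spine midpoint joint


def center_on_spine(frame: np.ndarray, spine_index: int = SPINE) -> np.ndarray:
    """Translate all joints so that the spine midpoint lands on the origin."""
    frame = np.asarray(frame, dtype=np.float64)
    translation = -frame[spine_index]  # vector from the spine midpoint to the origin
    return frame + translation         # the same shift is applied to every joint


# Hypothetical frame with the signer standing away from the origin.
frame = np.random.default_rng(1).normal(loc=[0.4, 0.1, 2.0], size=(20, 3))
centered = center_on_spine(frame)
print(np.allclose(centered[SPINE], 0.0))  # True
```

Because every joint receives the same shift, distances between joints are unchanged and only the absolute position of the skeleton is removed.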


2.3 Rotation invariance

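A rotation invariant transformation is used as a heuristic to improve performance of the SLR system by minimizing recognition error due to rotation in the $xz$ plane around the $y$ axis. This geometric transformation aims to align the body plane parallel to the Kinect sensor. We assume the spatial plane through the spine midpoint $S$, left shoulder $L$ and right shoulder $R$ to be the body plane $B$. The unit vector $\hat{n}$ along the normal to plane $B$ is given by,

$$\hat{n} = \frac{(L - S) \times (R - S)}{\lVert (L - S) \times (R - S) \rVert} \tag{3}$$

The angle of rotation $\theta$ is estimated as the angle between the projection $\hat{n}_{xz}$ of $\hat{n}$ on the $xz$ plane and the unit vector $\hat{k}$ along the $z$ axis, from the dot product of vectors $\hat{n}_{xz}$ and $\hat{k}$, as follows,

$$\theta = \cos^{-1}\left(\frac{\hat{n}_{xz} \cdot \hat{k}}{\lVert \hat{n}_{xz} \rVert \, \lVert \hat{k} \rVert}\right) \tag{4}$$

Considering a point $P = (x, y, z)$ in 3D and rotation by $\theta$ around the $y$ axis, the spatial coordinate of the rotated point $P_R$ is given by,

$$P_R = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} \tag{5}$$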
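The rotation correction can be sketched as follows, assuming the frame is a (20, 3) array that has already been centred on the spine midpoint. The joint indices, the function names and the choice of the normal with non-negative $z$ component are assumptions of this example. Note also that the sketch computes a signed angle with `arctan2` instead of the arccosine of the dot product described above: the arccosine alone gives only the magnitude of the angle, and the sign is needed to know in which direction to rotate.

```python
import numpy as np

# Assumed joint indices; adjust to the actual joint ordering of the data.
SPINE, SHOULDER_LEFT, SHOULDER_RIGHT = 1, 4, 8


def body_normal(frame: np.ndarray) -> np.ndarray:
    """Unit normal of the plane through spine midpoint and both shoulders."""
    s, l, r = frame[SPINE], frame[SHOULDER_LEFT], frame[SHOULDER_RIGHT]
    n = np.cross(l - s, r - s)
    length = np.linalg.norm(n)
    if length == 0.0:
        raise ValueError("spine and shoulders are collinear; no body plane")
    n = n / length
    # Assumed convention: of the two opposite normals, keep the one with n_z >= 0,
    # so the correction is the smallest rotation that squares up the body plane.
    return n if n[2] >= 0 else -n


def rotation_about_y(theta: float) -> np.ndarray:
    """Standard 3x3 rotation matrix about the y axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def align_to_sensor(frame: np.ndarray) -> np.ndarray:
    """Rotate the skeleton about y so the body normal's xz projection points along +z."""
    frame = np.asarray(frame, dtype=np.float64)
    n = body_normal(frame)
    theta = np.arctan2(-n[0], n[2])  # signed angle that cancels the x component
    return frame @ rotation_about_y(theta).T  # apply the rotation to each joint row
```

For example, a skeleton whose shoulders lie in the $xy$ plane and which is then turned by 30 degrees about the $y$ axis is mapped back to its original pose by `align_to_sensor`.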


2.4 Sequence modelling

Recurrent Neural Networks (RNN) are particularly useful in modelling temporal data. However, they suffer from vanishing and exploding gradient problems, which prevent them from learning long-term dependencies with gradient descent [bengio1994learning]. Long Short Term Memory (LSTM) networks [hochreiter1997long] address these issues by introducing a few modifications to the generic RNN architecture. The architecture of a generic LSTM network is shown in Fig. 3, which depicts the input vector, output vector and hidden state vector at the $t$-th time step.

Figure 3: Architecture of a generic LSTM network.

The general architecture of an individual LSTM cell is shown in Fig. 4. An LSTM cell consists of a forget gate layer, a memory layer and a selection layer. These constituent layers are constructed using point-wise operations (addition and multiplication) and vector operations (concatenation and copy) along with activation functions (sigmoid and tanh).

Figure 4: Architecture of a generic LSTM cell.

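At first, the input vector $x_t$ of the current time step and the hidden state vector $h_{t-1}$ of the previous time step are concatenated and passed through a sigmoid ($\sigma$) activation layer (forget gate layer). The resulting vector $f_t$ thus contains values in the range $[0, 1]$ and determines which values of the previous cell state $C_{t-1}$ need to be dropped.

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{6}$$

Next, another sigmoid activation layer (input gate layer) is used to determine the values that need to be updated. Also, a $\tanh$ activation layer is used to generate new candidate values that can be added to the cell state.

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{7}$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{8}$$

Next, the cell state is updated from $C_{t-1}$ to $C_t$ using the forget gate layer output $f_t$, the input gate layer output $i_t$ and the candidate vector $\tilde{C}_t$.

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{9}$$

Finally, another sigmoid activation layer is used to select the parts of the cell state that need to be included in the final output. Also, a $\tanh$ activation layer is used to map the cell state into the range $[-1, 1]$.

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{10}$$

$$h_t = o_t \odot \tanh(C_t) \tag{11}$$

In all of the above equations (6) - (11), $W$ and $b$ signify the weight matrix and bias vector of the respective layer.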
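To make the gate computations concrete, the following NumPy sketch runs one forward step of a standard LSTM cell: forget gate, input gate, candidate values, cell state update, output gate and new hidden state, in that order. The parameter dictionary, the helper `init_params` and the sizes used are assumptions of this example; the network actually trained in this work is not specified at this level of detail.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of a standard LSTM cell; returns (h_t, c_t)."""
    z = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # input gate
    c_hat = np.tanh(params["W_c"] @ z + params["b_c"])    # candidate values
    c_t = f_t * c_prev + i_t * c_hat                      # updated cell state
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # output (selection) gate
    h_t = o_t * np.tanh(c_t)                              # new hidden state
    return h_t, c_t


def init_params(input_size: int, hidden_size: int, seed: int = 0) -> dict:
    """Hypothetical helper: small random weights and zero biases for the four layers."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in "fico":
        params[f"W_{gate}"] = rng.normal(
            scale=0.1, size=(hidden_size, hidden_size + input_size)
        )
        params[f"b_{gate}"] = np.zeros(hidden_size)
    return params


# Run the cell over a hypothetical gesture of 5 frames with 60 features each.
params = init_params(input_size=60, hidden_size=8)
h, c = np.zeros(8), np.zeros(8)
for x in np.random.default_rng(1).normal(size=(5, 60)):
    h, c = lstm_step(x, h, c, params)
print(h.shape)  # (8,)
```

Since the hidden state is a sigmoid output multiplied by a tanh output, each of its entries stays strictly between -1 and 1 regardless of the input scale.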

3 Results

This section describes the dataset, followed by the experimental protocol and the test results.

3.1 Dataset

In this study, we perform experiments on a dataset [kumar2018position] of Indian sign language containing 2700 sign gestures uniformly distributed over 30 distinct categories. The dataset contains 16 sign gestures performed with a single hand and 14 sign gestures performed with both hands. The sign gestures are performed by 10 different signers, and each sign has been recorded 9 times by every signer. This leads to a total of $30 \times 10 \times 9 = 2700$ recorded gestures. The variation in the recorded dataset is increased by performing signs at random positions in the $xz$ plane with a rotation around the $y$ axis. Specifically, each of the 30 sign gestures is performed by all 10 signers with a random translation at three different rotation angles (3 times each) to constitute a dataset of 2700 samples.

3.2 Experimental protocol and test results

In this work, a user-independent training and validation approach has been adopted, which does not require a specific signer to be registered beforehand. For a valid and direct comparison with the only known work [kumar2018position] using a Hidden Markov Model (HMM) [baum1966statistical] on this dataset, we perform Leave-One-Out Cross Validation (LOOCV) over signers. In this approach, each training round uses sequentially recorded temporal data from 9 signers while holding out the data of the remaining signer for cross-validation. This scheme is repeated for all 10 possible combinations, and the average over all rounds is taken as the final validation score. The system was trained separately for single-handed and double-handed gestures. A combined model involving both single- and double-handed gestures was also trained. The results are depicted in Fig. 5, where accuracies of 82.44% and 86.02% were recorded for single-handed and double-handed gestures, respectively. For the combined model, an accuracy of 84.81% was recorded, a drop relative to the double-handed model.
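The leave-one-signer-out protocol can be sketched as an index generator. The example below assumes that every sample carries an integer signer label and that samples are grouped by signer; the function names and the label array are hypothetical and only illustrate how the 10 folds and the final averaged score arise.

```python
import numpy as np


def leave_one_signer_out(signer_ids):
    """Yield (held_out_signer, train_indices, validation_indices) for each signer."""
    signer_ids = np.asarray(signer_ids)
    for held_out in np.unique(signer_ids):
        val_mask = signer_ids == held_out
        yield held_out, np.flatnonzero(~val_mask), np.flatnonzero(val_mask)


def mean_accuracy(fold_accuracies) -> float:
    """Final validation score: the plain average over all folds."""
    return float(np.mean(fold_accuracies))


# 10 signers x 30 signs x 9 repetitions = 2700 samples (counts from the text).
signer_ids = np.repeat(np.arange(10), 30 * 9)
folds = list(leave_one_signer_out(signer_ids))
print(len(folds), len(folds[0][1]), len(folds[0][2]))  # 10 2430 270
```

Splitting by signer rather than by sample keeps every gesture of the held-out signer out of training, which is what makes the reported score user-independent.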

Figure 5: Sign gesture recognition using the proposed LSTM framework.

A comparative analysis of the proposed system has also been performed against the SVM classifier proposed in [kumar2018position]. The authors trained the SVM classifier with mean and standard deviation features for each gesture. The results are depicted in Fig. 6, where the proposed system outperforms the SVM in all training scenarios, with margins of 10.69%, 8.25% and 13.9% in single-handed, double-handed and combined gesture recognition, respectively.
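For intuition about the SVM baseline, the sketch below shows one plausible way to compute mean and standard deviation features from a gesture, assuming each gesture is a (T, D) array of flattened frames and that the statistics are taken per coordinate over time. The exact feature definition used in [kumar2018position] is not given here, so the function `mean_std_features` is an assumption of this example.

```python
import numpy as np


def mean_std_features(sequence: np.ndarray) -> np.ndarray:
    """Summarize a (T, D) gesture as a fixed-length vector of 2 * D statistics."""
    sequence = np.asarray(sequence, dtype=np.float64)
    if sequence.ndim != 2 or sequence.shape[0] == 0:
        raise ValueError("expected a non-empty (T, D) array")
    # Per-coordinate mean and (population) standard deviation over the time axis.
    return np.concatenate([sequence.mean(axis=0), sequence.std(axis=0)])


# Hypothetical gesture: 45 frames of 60 features.
gesture = np.random.default_rng(0).normal(size=(45, 60))
print(mean_std_features(gesture).shape)  # (120,)
```

Such summary statistics are unchanged when the frames are reordered, so a classifier trained on them cannot use the temporal order of a gesture, whereas a sequence model such as an LSTM can.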

Figure 6: Comparative performance analysis of the proposed LSTM-based SLR system with SVM classifier.

3.3 Comparative Analysis

A comparison of recognition rates with the state-of-the-art technique is also performed. Kumar et al. [kumar2018position] used an HMM-based sequence classification technique for developing their SLR system. The authors extracted a number of handcrafted features from the gesture series, including gesture direction, velocity and curvature. The results were calculated with and without HMM dynamic features. The comparison is presented in Table 1, where the proposed LSTM-based SLR system outperforms the HMM in all three cases.

| Type of sign gestures | HMM without dynamic features | HMM with dynamic features | LSTM (our approach) |
|---|---|---|---|
| Single-handed | 75.29% | 81.29% | 82.44% |
| Double-handed | 78.56% | 84.81% | 86.02% |
| Combined | 73.54% | 83.77% | 84.81% |
Table 1: Comparison of recognition rates

4 Conclusion

In this paper, we have presented a position and rotation invariant, user-independent sign language recognition system using an LSTM network. The network is trained on a dataset of 2700 sign gestures uniformly distributed over 30 distinct words from the Indian sign language vocabulary and performed by 10 signers. Due to the temporal nature of the recorded sequential point cloud data, an LSTM-based recurrent neural network has been employed as the classifier. The final validation score is estimated as the average accuracy over 10-fold cross-validation. The results are compared with previous benchmarks using HMM under similar experimental protocols. We have shown experimentally that the LSTM-based approach improves the recognition rate over the previous HMM-based approach. In the future, the present approach may be further improved by more complex sequence models such as bidirectional LSTM and attention networks. Additionally, constructing a larger dataset with more variations and including image frame sequences along with depth frames may also be worth exploring.