Evaluation of Deep Learning based Pose Estimation for Sign Language Recognition

02/29/2016 ∙ by Srujana Gattupalli, et al. ∙ The University of Texas at Arlington

Human body pose estimation and hand detection are two important tasks for systems that perform computer vision-based sign language recognition (SLR). However, both tasks are challenging, especially when the input is color video with no depth information. Many algorithms have been proposed in the literature for these tasks, and some of the most successful recent algorithms are based on deep learning. In this paper, we introduce a dataset for human pose estimation for the SLR domain. We evaluate the performance of two deep learning based pose estimation methods by performing user-independent experiments on our dataset. We also perform transfer learning, and we obtain results that demonstrate that transfer learning can improve pose estimation accuracy. The dataset and the results from these methods can serve as a useful baseline for future work.




1 Introduction

Sign language communication is multi-modal. Some information is conveyed via manual features, such as hand motion and hand shape. Another information channel consists of facial features, such as lip movement, eye gaze, and facial expressions. A third channel is body posture, which can add to the meaning of a sign, or indicate change of subject in dialogues or stories.

As pointed out in [4], little work has been done to exploit body posture, a non-manual feature that can aid SLR in recognizing certain positional signs, e.g. “bruise” or “tattoo”, which involve pointing to or performing a movement at a particular body location. As also pointed out in [4], body posture can help differentiate between sign language dialogues and stories, since the signer's body position changes when addressing different people.

Our work on evaluating human pose estimation techniques aims to contribute towards American Sign Language (ASL) recognition, and hence we focus on localizing only upper body joints. Deep learning methods have had considerable success in recent years on classification problems as well as on localization and detection. The main benefit of using convolutional neural networks (CNNs) for our problem is that they do not require hand-crafted features from the programmer, and are therefore less prone to human error in selecting appropriate features. Also, CNNs are holistic: they take the entire image as input and can hence capture context that is too complex for conventional techniques to model. Fast and accurate applications of CNNs can be implemented with the help of GPUs and large amounts of available data. For our ASL recognition domain, we do not have a large amount of data, so we improve the accuracy of the deep learning methods using a technique called transfer learning.

In this study, we introduce an RGB ASL image dataset (ASLID) and then investigate deep learning approaches on ASLID, performing body posture recognition by estimating the positions of key upper body joints. Our ultimate goal is to perform sign language recognition by efficiently obtaining upper body pose estimates over long video sequences, and to be able to perform multi-view pose estimation aided by this monocular pose estimation. The rest of the paper is organized as follows: in Section 2 we discuss related work; in Section 3 we introduce our dataset, ASLID; in Section 4 we explain transfer learning and how we apply it to pose estimation; in Section 5 we explain our experimental setup, training details and evaluation protocol. Results for the existing methods evaluated in this paper are given in Section 6.

Figure 1: Ground truth training dataset variance

Figure 2: Example image annotations from ASLID

2 Related Work

Many techniques have been proposed for vision-based human pose estimation. Several recent approaches use CNNs to address the task of joint localization for human body pose estimation. In [14], Tompson et al. address the challenging problem of articulated human pose estimation in monocular images by capturing geometric relations between body parts: they first detect heatmaps of the body joints using a CNN architecture, and then apply a graphical model to validate these predictions. In [17], Yang and Ramanan create a tree-like model using local deformable joint parts and solve it using a linear SVM to achieve good results for pose estimation. In [3], Chen and Yuille further extend this by using a graphical model and employing deep CNNs to learn the conditional probabilities of the presence of joints and their spatial relationships. Toshev and Szegedy [16] propose a deep learning based, AlexNet-like CNN model for human pose estimation, in which they localize body joints as the solution to a regression problem and then improve the estimation precision by using a cascade of these pose regressors.

In [6, 15], Jain, Tompson et al. perform pose estimation using a model that combines a CNN, which regresses over joint heatmaps, with a Markov Random Field (MRF). A relevant work on pose estimation specific to the SLR domain is [2], where the authors estimate joint locations over frames of sign language videos by first performing background subtraction and then predicting joints as a regression problem solved using a random forest. The work by Pfister et al. [10] is also relevant to our SLR domain. In that work, the authors use a deep CNN to regress over heatmaps of body joints, and improve performance by using temporal information from consecutive frames.

Figure 3: Ground truth test dataset variance

3 ASLID Dataset

We present an American Sign Language Image Dataset (ASLID), with images extracted from the Gallaudet Dictionary Videos [12] and the American Sign Language Lexicon Video Dataset (ASLLVD) [1]. We provide annotations for upper body joint locations to perform body joint recognition. We have divided our dataset into training and testing sets to support user-independent experiments. Our training set consists of 808 ASLID images from different signs, performed by six different ASL signers. For the test set we have 479 ASLID images from two ASL signers from ASLLVD videos. The training and testing sets differ in users, signs and background colors. We provide annotations for seven key upper body joint locations, namely left hand (LH), left elbow (LE), left shoulder (LS), head (H), right shoulder (RS), right elbow (RE) and right hand (RH).

Figure 4: Loss for training and testing for Method(1)

Figure 2 shows examples of annotated images from ASLID. Variations in the range of training and testing poses are shown in the ground truth scatter plots in Figures 1 and 3.

Figure 5: Example visualizations of Pose Estimation by Method(1)
Figure 6: Example visualizations of Pose Estimation by Method(5)

Our dataset and the code to display annotations are available from:

4 Transfer Learning

Transfer learning is a way to improve the performance of a learning algorithm by utilizing knowledge acquired from a previously solved similar problem [9]. As pointed out in [18], initializing a network with transferred weights, even ones obtained from a different task, can improve performance compared to random initialization. [18] further points out that transfer learning is more effective when the difference between the original task and the target task is smaller. In our case, the original task is human body pose estimation and the target task is ASL-specific upper body pose estimation, which are relatively similar. Hence, transfer learning helps in fine-tuning the pose estimator so as to obtain better joint localization estimates for the ASL domain. In this paper we transfer the learned parameters from one method to use as initial parameters before training another method (see Table 1). In Method 3, we train the DeepPose network [16] on the FLIC dataset, and in Method 4 we transfer its learned parameters as initial weights for training on the ASLID training set. This yields a significant improvement in performance, which we attribute mainly to the transferred weights. As pointed out in [18], this is also a way to avoid overfitting during training, even when the target dataset is smaller than the original dataset.

| Method | Training Model and Dataset | Test Dataset | Number of training images | Number of test images | Number of joints trained and tested |
|---|---|---|---|---|---|
| 1 | Method [16] trained on ASLID training set | ASLID test set | 808 | 479 | 7 |
| 2 | Method [16] trained on Chalearn training set | ASLID test set | 433 | 479 | 7 |
| 3 | Method [16] trained on FLIC dataset | ASLID test set | 17,378 | 479 | 7 |
| 4 | Method [16] trained on ASLID dataset, started with weights from Method 3 | ASLID test set | 17,378 + 808 | 479 | 7 |
| 5 | Method [10] pre-trained on FLIC dataset | ASLID test set | 4.5K | 479 | 7 |
| 6 | Method [16] trained on ASLID training set (hands and face) | ASLID test set | 808 | 479 | 3 |
Table 1: Deep Learning based Pose Estimation Experiment Details
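
The weight transfer from Method 3 to Method 4 described in this section can be illustrated with a minimal sketch. It is not the code used in the paper: the parameter layout (a dictionary of nested lists), the layer names and shapes, and the helper names `shape_of`, `random_init` and `transfer_weights` are all hypothetical, chosen only to show the idea of starting the target network from source weights instead of random ones.

```python
import copy
import random


def shape_of(param):
    """Return the shape of a non-empty nested-list parameter as a tuple."""
    shape = []
    while isinstance(param, list):
        shape.append(len(param))
        param = param[0]
    return tuple(shape)


def random_init(layer_shapes, seed=0):
    """Randomly initialise every layer (the baseline without transfer)."""
    rng = random.Random(seed)

    def build(shape):
        if len(shape) == 1:
            return [rng.uniform(-0.01, 0.01) for _ in range(shape[0])]
        return [build(shape[1:]) for _ in range(shape[0])]

    return {name: build(shape) for name, shape in layer_shapes.items()}


def transfer_weights(source_params, target_params):
    """Copy source weights into the target wherever name and shape match.

    Returns the new parameter dict and the list of transferred layer names.
    """
    result = copy.deepcopy(target_params)
    transferred = []
    for name, weights in source_params.items():
        if name in result and shape_of(result[name]) == shape_of(weights):
            result[name] = copy.deepcopy(weights)
            transferred.append(name)
    return result, transferred


# Hypothetical layer layout: 7 joints * 2 coordinates = 14 regression outputs.
layer_shapes = {"conv1": (4, 3), "fc_out": (14, 4)}
source = random_init(layer_shapes, seed=1)  # stands in for the FLIC-trained model
target = random_init(layer_shapes, seed=2)  # fresh network for ASLID
initial, transferred = transfer_weights(source, target)
print("layers initialised from the source model:", transferred)
```

Training on the target dataset would then start from `initial` rather than from `target`. Because both networks here predict the same seven joints, every layer can be transferred; with a different output size, the shape check would leave the final layer randomly initialised.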

5 Experiments

We evaluate performance of deep learning based pose estimators on static frames from ASL videos, by conducting user-independent experiments on images from the ASLID dataset. In this section we describe the experimental details and evaluation protocol. In the next section we show the results and present comparisons of methods. The Caffe [7] and Chainer [13] frameworks were used for implementation.

5.1 Pose Estimation Methods

The method proposed by Toshev et al. [16] uses deep neural networks to capture the context of body joints. We have trained the method of [16] on our ASLID training images and obtained results on our test set. We also use the model trained by Pfister [10] on the FLIC dataset to obtain results on our dataset. In using the model of [10], we only use the heatmap regression and spatial fusion parts of the method. We do not use optical flow, as we conduct pose estimation on static images, where flow information is not available. We compare the ASLID pose estimation results of models trained on other popular datasets (FLIC [14, 11] and Chalearn [5]) with the results obtained by training on our dataset. Details of our experiments can be found in Table 1.

5.2 Training Details

Toshev and Szegedy[16] proposed a deep learning based method, which localizes body joints by solving a regression problem, and further improves on estimation precision by using a cascade of these pose regressors. Their work demonstrates that a general deep learning based network originally formed for a classification problem can be fine-tuned and used to solve localization and detection problems. We have trained the model of Toshev et al.[16] on our dataset. We have performed user-independent evaluations using our ASLID training set and also on the benchmark Chalearn[5] and FLIC[11] datasets.

In this paper we also apply a pre-trained model to our dataset and compare its results with those of a model trained on our training set. This comparison also creates a baseline for measuring the improvement achieved by other methods on the dataset. The pre-trained model which we have used for our evaluations was trained by Pfister [10]. The method by Pfister et al. is interesting because it regresses over a heatmap of body joints instead of a single centre coordinate of a body joint location, and further improves performance by using temporal information from consecutive frames.
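
To make the difference between heatmap regression and direct coordinate regression concrete, here is a minimal sketch of reading a joint location out of a predicted heatmap. Taking the highest-scoring cell is an assumption (a common choice), not a detail stated in this paper, and the function name `heatmap_to_joint` and the toy values are hypothetical.

```python
def heatmap_to_joint(heatmap):
    """Return (x, y) of the highest-scoring cell in a 2-D heatmap.

    `heatmap` is a list of rows; x is the column index and y the row index.
    """
    best_score = float("-inf")
    best_xy = (0, 0)
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best_score:
                best_score = score
                best_xy = (x, y)
    return best_xy


# Toy 4x5 heatmap for one joint, with its peak at column 3, row 1.
toy_heatmap = [
    [0.0, 0.1, 0.1, 0.2, 0.0],
    [0.0, 0.2, 0.5, 0.9, 0.3],
    [0.0, 0.1, 0.3, 0.4, 0.1],
    [0.0, 0.0, 0.1, 0.1, 0.0],
]
print(heatmap_to_joint(toy_heatmap))  # (3, 1)
```

A coordinate regressor would instead output the pair (x, y) directly, without the intermediate per-pixel confidence map.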

We have trained the network for Methods 1 to 3 for 100 epochs, for the 7 key upper body joint locations. For Method 4, we have trained the network for 30 epochs and for Method 5 we use weights from Method 4 and train the model on ASLID training set for 10 epochs. The training and testing loss for Method 1 is shown in Figure 4. To improve accuracy of hand detection on ASLID, we performed experiments by training with DeepPose [16] only on the head and hand joint locations for 10 epochs (Method 6).

5.3 Evaluation Protocol

We apply a quantitative evaluation measure similar to the one applied in [8] for measuring hand detection accuracy. We extend this measure to joint detection for the seven upper body joint locations. An estimate is considered correct if the distance between the detected joint and the ground truth is less than a threshold ($t$). The average face size in our test dataset is a reasonable threshold, and we also calculate results using different threshold values. Given a detected joint location estimate $d_n$ for an image $n$, a ground truth joint location $g_n$, and a threshold $t$, we define our joint-wise accuracy over the $N$ test images as:

$$A(t) = \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}\left[\, \lVert d_n - g_n \rVert < t \,\right] \qquad (1)$$

We display overall accuracy for corresponding joints as a mean of left and right joint detection accuracy for shoulders, elbows and hands.
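
The protocol above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the authors' evaluation code: joint locations are assumed to be (x, y) pixel coordinates, distance is assumed to be Euclidean, and the function names and sample coordinates are made up for the example.

```python
import math


def joint_accuracy(detections, ground_truth, threshold):
    """Fraction of images whose detected joint lies within `threshold`
    (Euclidean distance, strictly less than) of the ground-truth joint."""
    if len(detections) != len(ground_truth):
        raise ValueError("detections and ground_truth must have equal length")
    correct = sum(
        1
        for (dx, dy), (gx, gy) in zip(detections, ground_truth)
        if math.hypot(dx - gx, dy - gy) < threshold
    )
    return correct / len(detections)


def paired_accuracy(left_acc, right_acc):
    """Mean of left and right accuracies, as reported for paired joints."""
    return (left_acc + right_acc) / 2.0


# Made-up (x, y) coordinates for four test images; not data from ASLID.
lh_detected = [(10, 10), (50, 52), (30, 80), (100, 100)]
lh_truth = [(12, 11), (50, 50), (60, 80), (101, 99)]
rh_detected = [(200, 10), (150, 52), (130, 80), (90, 40)]
rh_truth = [(200, 12), (150, 90), (131, 80), (140, 40)]

lh_acc = joint_accuracy(lh_detected, lh_truth, threshold=25)
rh_acc = joint_accuracy(rh_detected, rh_truth, threshold=25)
hand_acc = paired_accuracy(lh_acc, rh_acc)
print(lh_acc, rh_acc, hand_acc)  # 0.75 0.5 0.625
```

In this toy example three of the four left-hand detections and two of the four right-hand detections fall within the threshold, so the reported hand accuracy is the mean of 0.75 and 0.5.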

6 Results

To measure accuracy, we use the evaluation protocol described above. For Methods 1 to 5, the average face width is 25 for the cropped and resized test dataset images. Our results for upper body joint locations are shown in the graphs in Figures 7, 8, 9 and 10. Example visualizations of some results on the ASLID test set are shown in Figures 5 and 6. Here $t$ is the threshold value for measuring accuracy according to equation (1). The results demonstrate good pose estimation accuracy on the sign language dataset.
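
Accuracy-versus-threshold results of the kind plotted in the figures can be produced by evaluating the same criterion at several thresholds. The sketch below assumes the per-image distances between detected and ground-truth joints have already been computed; the function name `accuracy_curve`, the error values and the chosen thresholds are hypothetical and are not results from the paper.

```python
def accuracy_curve(distances, thresholds):
    """Map each threshold to the fraction of distances strictly below it."""
    total = len(distances)
    return {t: sum(1 for d in distances if d < t) / total for t in thresholds}


# Made-up per-image errors for one joint; not results from the paper.
errors = [3.0, 8.5, 12.0, 19.0, 26.0, 31.5, 40.0, 55.0]
curve = accuracy_curve(errors, thresholds=[10, 25, 50])
for t, acc in curve.items():
    print(t, acc)
```

Accuracy can only stay the same or rise as the threshold grows, which is why such curves are non-decreasing.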

In Method 4, we train the DeepPose network [16] on the FLIC dataset and transfer its learned weights as initial weights for training on ASLID. The results show a large improvement over Method 3 and demonstrate that transfer learning can improve the pose estimation performance of a method through knowledge transferred from another trained model. In our experiments, we have also conducted training on 3 joint locations (H, LH, RH) to improve hand localization accuracy. We measure accuracy for i=3 using equation (1). Here, the average face width is 50 for the cropped and resized test dataset images. The resulting accuracy for hands and face is shown in Figure 11.

We try six different methods on our dataset and compare the results of these experiments. As the figures show, Methods 4 and 5 work very well at detecting the body joints.

Figure 7: Head detection results for Methods 1 to 5
Figure 8: Hand detection results for Methods 1 to 5
Figure 9: Shoulder detection results for Methods 1 to 5
Figure 10: Elbow detection results for Methods 1 to 5
Figure 11: Results for Method 6

7 Conclusion and Future Work

The work in this paper focuses on pose estimation in static images. An interesting direction is to extend the work to pose estimation on ASL videos, so as to provide useful features for ASL recognition. Our aim is to improve sign recognition accuracy, by performing motion analysis on features from an efficient automatic human pose tracker, that detects and tracks upper body joint positions over continuous sign language video sequences.

In summary, this paper has presented a new image dataset for pose estimation, aimed towards applications in sign language recognition. This dataset can be used to measure the performance of existing methods as well as methods to be developed in the future. In this paper we have selected two state-of-the-art deep learning based methods for human pose estimation, and we have measured their accuracy on our dataset. This measurement creates a baseline for other methods in this domain.

8 Acknowledgments

This work was partially supported by National Science Foundation grants IIS-1055062 and CNS-1338118. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors, and do not necessarily reflect the views of the National Science Foundation.

References


  • [1] V. Athitsos, C. Neidle, S. Sclaroff, J. P. Nash, A. Stefan, Q. Yuan, and A. Thangali. The american sign language lexicon video dataset. In CVPR Workshops 2008, Anchorage, AK, USA, 23-28 June, 2008, pages 1–8, 2008.
  • [2] J. Charles, T. Pfister, M. Everingham, and A. Zisserman. Automatic and efficient human pose estimation for sign language videos. International Journal of Computer Vision, 95:180–197, 2011.
  • [3] X. Chen and A. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Advances in Neural Information Processing Systems (NIPS), 2014.
  • [4] H. Cooper, B. Holt, and R. Bowden. Sign language recognition. In Visual Analysis of Humans - Looking at People, pages 539–562. 2011.
  • [5] I. Guyon, V. Athitsos, P. Jangyodsuk, B. Hamner, and H. J. Escalante. Chalearn gesture challenge: Design and first results. In CVPRW, 2012 IEEE Computer Society Conference on, pages 1–6. IEEE, 2012.
  • [6] A. Jain, J. Tompson, M. Andriluka, G. W. Taylor, and C. Bregler. Learning human pose estimation features with convolutional networks. CoRR, abs/1312.7302, 2013.
  • [7] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
  • [8] L. Karlinsky, M. Dinerstein, D. Harari, and S. Ullman. The chains model for detecting parts by their context. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 25–32, June 2010.
  • [9] J. Lu, V. Behbood, P. Hao, H. Zuo, S. Xue, and G. Zhang. Transfer learning using computational intelligence: a survey. Knowledge-Based Systems, 80:14–23, 2015.
  • [10] T. Pfister, J. Charles, and A. Zisserman. Flowing convnets for human pose estimation in videos. In IEEE International Conference on Computer Vision, 2015.
  • [11] B. Sapp and B. Taskar. Modec: Multimodal decomposable models for human pose estimation. In Proc. CVPR, 2013.
  • [12] R. Tennant. American sign language handshape dictionary. Gallaudet University Press, Washington, D.C., 2010.
  • [13] S. Tokui, K. Oono, S. Hido, and J. Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015.
  • [14] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in neural information processing systems, pages 1799–1807, 2014.
  • [15] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 1799–1807, 2014.
  • [16] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In CVPR, 2014.
  • [17] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1385–1392. IEEE, 2011.
  • [18] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.