Sign language communication is multi-modal. Some information is conveyed via manual features, such as hand motion and hand shape. Another information channel consists of facial features, such as lip movement, eye gaze, and facial expressions. A third channel is body posture, which can add to the meaning of a sign, or indicate change of subject in dialogues or stories.
As pointed out in , not much work has been done to utilize the non-manual feature of body posture, which can benefit sign language recognition (SLR) by aiding recognition of certain positional signs, e.g. “bruise” or “tattoo”, that involve pointing to or performing a movement on a certain body location. As also pointed out in , body posture can help differentiate between sign language dialogues and stories, by observing changes in body position while the signer addresses different people.
Our work on evaluating human pose estimation techniques aims to contribute towards American Sign Language (ASL) recognition, and hence we focus on localizing only upper body joints. Deep learning methods have been very successful in recent years, achieving good performance on classification problems as well as on localization and detection. The main benefit of using convolutional neural networks (CNNs) for our problem is that they do not require hand-crafted features from the programmer and are therefore less prone to human error in selecting appropriate features. CNNs are also holistic: they take the entire image as input and can hence capture context that is too complex for conventional techniques to model. Fast and accurate CNN applications can be implemented with the help of GPUs and large amounts of available data. For our ASL recognition domain we do not have a large amount of data, so we improve the accuracy of the deep learning methods using a technique called transfer learning.
In this study, we introduce an RGB ASL image dataset (ASLID) and then investigate deep learning approaches on ASLID, performing body posture recognition by estimating the positions of key upper body joints. Our ultimate goal is to perform sign language recognition by efficiently obtaining upper body pose estimates over long video sequences, and to be able to perform multi-view pose estimation aided by this monocular pose estimation. The rest of the paper is organized as follows: in Section 2 we discuss related work; in Section 3 we introduce our dataset, ASLID; in Section 4 we explain transfer learning and how we apply it to pose estimation; in Section 5 we explain our experimental setup, training details and evaluation protocol. Results for the existing methods evaluated in this paper are given in Section 6.
2 Related Work
Many techniques have been proposed for vision-based human pose estimation. Several recent approaches use CNNs to address the task of joint localization for human body pose estimation. Tompson et al. address the challenging problem of articulated human pose estimation in monocular images by capturing geometric relations between body parts: they first produce heatmaps for the body joints using a CNN architecture, and then apply a graphical model to validate these predictions. Yang and Ramanan create a tree-like model using local deformable joint parts and solve it using a linear SVM to achieve good results for pose estimation. Chen and Yuille further extend this by using a graphical model and employing deep CNNs to learn the conditional probabilities of the presence of joints and their spatial relationships. Toshev and Szegedy propose a deep learning based, AlexNet-like CNN model for human pose estimation that localizes body joints as the solution to a regression problem, and then improves the estimation precision by using a cascade of these pose regressors.
In [6, 15], Jain, Tompson et al. perform pose estimation using a model that combines a CNN, which regresses over joint heatmaps, with a Markov Random Field (MRF). A relevant work on pose estimation specific to the SLR domain is performed in , where the authors estimate joint locations over frames of sign language videos by first performing background subtraction and then predicting joints as a regression problem solved using a random forest. The work by Pfister et al. is also relevant to our SLR domain. In that work, the authors use a deep CNN to regress over heatmaps of body joints, and improve performance by using temporal information from consecutive frames.
3 ASLID Dataset
We present an American Sign Language Image Dataset (ASLID), with images extracted from Gallaudet Dictionary Videos and the American Sign Language Lexicon Video Dataset (ASLLVD). We provide annotations for upper body joint locations to support body joint recognition. We have divided our dataset into training and testing sets, to help conduct user-independent experiments. Our training set consists of 808 ASLID images from different signs, performed by six different ASL signers. For the test set we have 479 ASLID images from two ASL signers from ASLLVD videos. The training and testing sets differ in their users, signs and background colors. We provide annotations for seven key upper body joint locations, namely left hand (LH), left elbow (LE), left shoulder (LS), head (H), right shoulder (RS), right elbow (RE) and right hand (RH).
Our dataset and code to display annotations are available from:
4 Transfer Learning
Transfer learning is a way to improve the performance of a learning algorithm by utilizing knowledge acquired from a previously solved, similar problem. As pointed out in , initializing a network with transfer-learned weights, even ones obtained from a different task, can improve performance compared to initializing the network with random weights. It is further pointed out in  that transfer learning is more effective when the difference between the original task and the target task is smaller. In our case, the original task is human body pose estimation and the target task is ASL-specific upper body pose estimation, which are relatively similar. Hence, transfer learning helps in fine-tuning the pose estimator, so as to obtain better joint localization estimates for the ASL domain. In this paper we transfer the parameters learned by one method and use them as initial parameters before training with another method (see Table 1). In Method 3, we train the DeepPose network on the FLIC dataset, and we transfer its learned parameters as initial weights for training on the ASLID training set in Method 4. This shows a significant improvement in performance, mainly attributed to the transferred weights. As pointed out in , this is also a way to avoid overfitting during training, even when the target dataset is smaller than the original dataset.
| Method | Training model and dataset | Test dataset | Number of training images | Number of test images | Number of joints trained and tested |
|---|---|---|---|---|---|
| 1 | Method trained on ASLID training set | ASLID test set | 808 | 479 | 7 |
| 2 | Method trained on Chalearn training set | ASLID test set | 433 | 479 | 7 |
| 3 | Method trained on FLIC dataset | ASLID test set | 17,378 | 479 | 7 |
| 4 | Method trained on ASLID dataset started with weights from Method 3 | ASLID test set | 17,378 + 808 | 479 | 7 |
| 5 | Method pre-trained on FLIC dataset | ASLID test set | 4.5K | 479 | 7 |
| 6 | Method trained on ASLID training set (hands and face) | ASLID test set | 808 | 479 | 3 |
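The weight transfer between Methods 3 and 4 can be illustrated with a minimal sketch. It is not the authors' Caffe or Chainer code: the layer names, layer sizes and the helper `init_from_pretrained` are hypothetical, and weights are stored as plain nested lists to keep the example self-contained. It only shows the mechanics of starting a target network from source weights wherever the layers match, before any fine-tuning takes place.

```python
import copy
import random


def shape_of(weights):
    """Return the (rows, cols) shape of a weight matrix stored as nested lists."""
    return (len(weights), len(weights[0]))


def random_layer(rows, cols, rng):
    """Small random initial weights, as used when no pre-trained model exists."""
    return [[rng.gauss(0.0, 0.01) for _ in range(cols)] for _ in range(rows)]


def init_from_pretrained(target, source):
    """Copy each source layer whose name and shape match a target layer.

    Layers without a match keep their random initialization.
    Returns the names of the layers that were transferred.
    """
    transferred = []
    for name, weights in source.items():
        if name in target and shape_of(target[name]) == shape_of(weights):
            target[name] = copy.deepcopy(weights)
            transferred.append(name)
    return transferred


rng = random.Random(0)

# Hypothetical layers: "fc_joints" has 14 outputs (x and y for 7 joints).
source_model = {  # stands in for a network trained on the source dataset
    "conv1": random_layer(4, 3, rng),
    "fc_joints": random_layer(14, 4, rng),
}
target_model = {  # stands in for a freshly initialized target network
    "conv1": random_layer(4, 3, rng),
    "fc_joints": random_layer(14, 4, rng),
}

transferred = init_from_pretrained(target_model, source_model)
print("layers initialized from the source model:", transferred)
```

After this step the target network would be trained as usual on the smaller target dataset; only its starting point differs from random initialization.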
5 Experiments
We evaluate the performance of deep learning based pose estimators on static frames from ASL videos, by conducting user-independent experiments on images from the ASLID dataset. In this section we describe the experimental details and evaluation protocol. In the next section we show the results and present comparisons of the methods. The Caffe and Chainer frameworks were used for implementation.
5.1 Pose Estimation Methods
The method proposed by Toshev et al. uses deep neural networks to capture the context of body joints. We have trained this method on our ASLID training images and obtained results on our test set. We also use the model trained by Pfister on the FLIC dataset to obtain results on our dataset. In using Pfister's model, we only use the heatmap regression and spatial fusion parts of the method. We do not use optical flow, as we conduct pose estimation on static images, where flow information is not available. We compare the results of ASLID pose estimation by models trained on other popular datasets (FLIC [14, 11] and Chalearn) with the results obtained by training on our dataset. Details of our experiments can be found in Table 1.
5.2 Training Details
Toshev and Szegedy proposed a deep learning based method that localizes body joints by solving a regression problem, and further improves estimation precision by using a cascade of these pose regressors. Their work demonstrates that a general deep network originally designed for a classification problem can be fine-tuned and used to solve localization and detection problems. We have trained the model of Toshev et al. on our dataset. We have performed user-independent evaluations using our ASLID training set and also the benchmark Chalearn and FLIC datasets.
In this paper we also apply a pre-trained model to our dataset and compare its results with those of a model trained on our training set. This comparison also creates a baseline for measuring the improvement of other methods on the dataset. The pre-trained model that we use for our evaluations was trained by Pfister. The method by Pfister et al. is interesting because it regresses over heatmaps of body joints instead of a single center coordinate per joint, and it further improves performance by using temporal information from consecutive frames.
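To make the difference between heatmap output and direct coordinate output concrete, the sketch below reads a single joint location off a toy heatmap by taking its highest-scoring cell. This is a common way to decode heatmaps and is an assumption here, not necessarily the exact post-processing used by Pfister et al.; the function name and the toy values are illustrative only.

```python
def heatmap_to_joint(heatmap):
    """Return the (x, y) position of the highest-scoring cell in a 2D heatmap.

    The heatmap is a list of rows, so y indexes rows and x indexes columns.
    """
    best_x, best_y, best_score = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best_score:
                best_x, best_y, best_score = x, y, score
    return best_x, best_y


# Toy 3x4 heatmap for one joint; the peak (0.7) is at column 2, row 1.
toy_heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.2, 0.7, 0.1],
    [0.0, 0.1, 0.2, 0.0],
]

joint_xy = heatmap_to_joint(toy_heatmap)
print("estimated joint location (x, y):", joint_xy)
```

A coordinate regressor would instead output the two numbers directly, without the intermediate per-pixel confidence map.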
We have trained the network for Methods 1 to 3 for 100 epochs, for the 7 key upper body joint locations. For Method 4, we have trained the network for 30 epochs, and for Method 5 we use weights from Method 4 and train the model on the ASLID training set for 10 epochs. The training and testing loss for Method 1 is shown in Figure 4. To improve the accuracy of hand detection on ASLID, we performed experiments by training DeepPose only on the head and hand joint locations for 10 epochs (Method 6).
5.3 Evaluation Protocol
We apply a quantitative evaluation measure similar to the one applied in  for measuring hand detection accuracy, and we extend it to joint detection for the seven upper body joint locations. Given a detected joint location estimate for an image, a ground truth joint location, and a threshold, the estimate is counted as correct if the distance between the detected joint and the ground truth is less than the threshold; this criterion defines our joint-wise accuracy, equation (1). The average face size of our test dataset is a reasonable threshold, and we also calculate results using different threshold values.
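One way to write the joint-wise accuracy described above is sketched here. The notation is chosen for illustration and is an assumption: $\hat{p}_{n,j}$ is the detected location of joint $j$ in test image $n$, $p_{n,j}$ is the corresponding ground truth location, $T$ is the threshold, and $N$ is the number of test images. The Euclidean norm and the averaging over test images are likewise assumptions, since the text only speaks of a distance and a threshold.

```latex
\[
A_j(T) \;=\; \frac{1}{N} \sum_{n=1}^{N}
\mathbb{1}\!\left[\, \lVert \hat{p}_{n,j} - p_{n,j} \rVert_2 < T \,\right]
\]
```

Here $\mathbb{1}[\cdot]$ equals 1 when the condition holds and 0 otherwise, so $A_j(T)$ is the fraction of test images in which joint $j$ is detected within distance $T$ of the ground truth.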
For shoulders, elbows and hands, we report the overall accuracy as the mean of the left and right joint detection accuracies.
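The evaluation protocol can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the distance is taken to be Euclidean, joint locations are (x, y) pixel pairs keyed by joint abbreviation, and the function names and toy coordinates are made up for the example.

```python
import math


def joint_accuracy(predictions, ground_truth, joint, threshold):
    """Fraction of images whose predicted `joint` lies within `threshold`
    (Euclidean distance, an assumption) of the ground truth location."""
    hits = 0
    for pred, truth in zip(predictions, ground_truth):
        (px, py), (gx, gy) = pred[joint], truth[joint]
        if math.hypot(px - gx, py - gy) < threshold:
            hits += 1
    return hits / len(predictions)


def paired_accuracy(predictions, ground_truth, left, right, threshold):
    """Mean of the left and right joint accuracies, e.g. for the two hands."""
    left_acc = joint_accuracy(predictions, ground_truth, left, threshold)
    right_acc = joint_accuracy(predictions, ground_truth, right, threshold)
    return 0.5 * (left_acc + right_acc)


# Toy annotations for two test images (made-up pixel coordinates).
ground_truth = [
    {"LH": (10, 50), "RH": (90, 50)},
    {"LH": (12, 48), "RH": (88, 52)},
]
predictions = [
    {"LH": (13, 54), "RH": (90, 80)},   # LH off by 5 px, RH off by 30 px
    {"LH": (12, 49), "RH": (100, 52)},  # LH off by 1 px, RH off by 12 px
]

# Accuracy for hands at a threshold of 25, then over several thresholds.
hand_acc = paired_accuracy(predictions, ground_truth, "LH", "RH", 25)
curve = {t: paired_accuracy(predictions, ground_truth, "LH", "RH", t)
         for t in (5, 10, 25, 40)}
print(hand_acc)
print(curve)
```

Sweeping the threshold as in `curve` gives one accuracy value per threshold, which is the kind of data an accuracy-versus-threshold plot is drawn from.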
6 Results
To measure accuracy, we use the evaluation protocol described above. For Methods 1 to 5, the average face width is 25 for the cropped and resized test dataset images. Our results for upper body joint locations are shown as graphs in Figures 7, 8, 9 and 10. Example visualizations of some results on the ASLID test set are shown in Figures 5 and 6. In these results, the threshold is the value used for measuring accuracy according to equation (1). The results demonstrate good pose estimation accuracy on the sign language dataset.
In Method 4, we take the DeepPose network trained on the FLIC dataset and use its learned weights as initial weights for training on ASLID. The results show a large improvement over Method 3 and demonstrate that transfer learning can improve the pose estimation performance of a method through knowledge transferred from another trained model. In our experiments, we have also conducted training on 3 joint locations (H, LH, RH) to improve hand localization accuracy. We measure accuracy for i=3 using equation (1). Here, the average face width is 50 for the cropped and resized test dataset images. The resulting accuracy for hands and face is shown in Figure 11.
We try six different methods on our dataset and compare the results of the experiments. As the figures show, Methods 4 and 5 work very well at detecting the body joints.
7 Conclusion and Future Work
The work in this paper focuses on pose estimation in static images. An interesting direction is to extend the work to pose estimation on ASL videos, so as to provide useful features for ASL recognition. Our aim is to improve sign recognition accuracy by performing motion analysis on features from an efficient automatic human pose tracker that detects and tracks upper body joint positions over continuous sign language video sequences.
In summary, this paper has presented a new image dataset for pose estimation, aimed at applications in sign language recognition. This dataset can be used to measure the performance of existing methods as well as of methods to be developed in the future. We have selected two state-of-the-art deep learning based methods for human pose estimation and measured their accuracy on our dataset. This measurement creates a baseline for other methods in this domain.
This work was partially supported by National Science Foundation grants IIS-1055062 and CNS-1338118. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors, and do not necessarily reflect the views of the National Science Foundation.
- [1] V. Athitsos, C. Neidle, S. Sclaroff, J. P. Nash, A. Stefan, Q. Yuan, and A. Thangali. The American Sign Language lexicon video dataset. In CVPR Workshops 2008, Anchorage, AK, USA, 23-28 June, 2008, pages 1–8, 2008.
- [2] J. Charles, T. Pfister, M. Everingham, and A. Zisserman. Automatic and efficient human pose estimation for sign language videos. International Journal of Computer Vision, 95:180–197, 2011.
- [3] X. Chen and A. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Advances in Neural Information Processing Systems (NIPS), 2014.
- [4] H. Cooper, B. Holt, and R. Bowden. Sign language recognition. In Visual Analysis of Humans - Looking at People, pages 539–562. 2011.
- [5] I. Guyon, V. Athitsos, P. Jangyodsuk, B. Hamner, and H. J. Escalante. Chalearn gesture challenge: Design and first results. In CVPRW, 2012 IEEE Computer Society Conference on, pages 1–6. IEEE, 2012.
- [6] A. Jain, J. Tompson, M. Andriluka, G. W. Taylor, and C. Bregler. Learning human pose estimation features with convolutional networks. CoRR, abs/1312.7302, 2013.
- [7] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
- [8] L. Karlinsky, M. Dinerstein, D. Harari, and S. Ullman. The chains model for detecting parts by their context. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 25–32, June 2010.
- [9] J. Lu, V. Behbood, P. Hao, H. Zuo, S. Xue, and G. Zhang. Transfer learning using computational intelligence: a survey. Knowledge-Based Systems, 80:14–23, 2015.
- [10] T. Pfister, J. Charles, and A. Zisserman. Flowing convnets for human pose estimation in videos. In IEEE International Conference on Computer Vision, 2015.
- [11] B. Sapp and B. Taskar. Modec: Multimodal decomposable models for human pose estimation. In Proc. CVPR, 2013.
- [12] R. Tennant. American sign language handshape dictionary. Gallaudet University Press, Washington, D.C., 2010.
- [13] S. Tokui, K. Oono, S. Hido, and J. Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015.
- [14] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems, pages 1799–1807, 2014.
- [15] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 1799–1807, 2014.
- [16] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In CVPR, 2014.
- [17] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1385–1392. IEEE, 2011.
- [18] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.