Hand Segmentation for Hand-Object Interaction from Depth map

03/08/2016 · by Byeongkeun Kang, et al.

Hand-object interaction is important for many applications such as augmented reality, medical applications, and human-robot interaction. To understand hand-object interaction, hand segmentation is a necessary pre-processing step. However, current methods are based on color information, which is not robust to objects with skin color, skin pigment differences, and lighting condition variations. Therefore, we propose the first hand segmentation method for hand-object interaction using only a depth map. The proposed method includes a randomized decision forest (RDF), bilateral filtering, decision adjustment, and post-processing. We demonstrate the effectiveness of the method by testing on five objects. The method achieves average F_1 scores of 0.8409 and 0.8163 for the same objects and new objects, respectively. Also, the method takes less than 10 ms to process each frame.


1 Introduction

Recently, with the expansion of virtual reality (VR), augmented reality (AR), robotics, and intelligent vehicles, the development of new interaction technologies has become essential, since these applications require more natural interaction methods than input devices provide. For these applications, much research has been conducted on topics such as gesture recognition and hand pose estimation. However, most technologies focus on understanding interactions that do not involve touching or handling real-world objects, even though understanding interactions with objects is important in many applications. We believe this is because hand segmentation is much more difficult during hand-object interaction. Thus, we present a framework for hand segmentation for hand-object interaction.

1.1 Related work

Hand segmentation has been studied for many applications such as hand pose estimation [1, 2, 3, 4, 5, 6], hand tracking [7, 8, 9], and gesture/sign/grasp recognition [10, 11]. Among color image-based methods, skin color-based methods have been popular [12, 13, 14, 10, 15, 16]. For hand-object interaction, Oikonomidis et al. and Romero et al. segmented hands by thresholding skin color in HSV space [7, 4, 5, 8]. Wang et al. processed hand segmentation using a learned probabilistic model constructed from the color histogram of the first frame [6]. Tzionas et al. applied skin color-based segmentation using a Gaussian mixture model [17]. However, skin color-based segmentation has limitations when interacting with skin-colored objects, when segmenting hands from other body parts, and under skin pigment differences and lighting condition variations. An alternative is wearing a glove of a specific color [18].

Among depth map-based methods, popular approaches use a wrist band [11, 9, 3] or a random decision forest (RDF) [1, 19, 2]. Although the method using a black wristband is uncomplicated and effective, it is inconvenient. Moreover, it cannot separate hands from objects during hand-object interaction since it segments by finding connected components. Tompson et al. [1] and Sharp et al. [2] proposed RDF-based methods built on [19]. Although their purposes differ slightly from that of the proposed method, they are the most relevant methods.

In this paper, we propose a hand segmentation method for hand-object interaction that uses only a depth map, in order to avoid the limitations of skin color-based methods. We present a two-stage RDF method to achieve high accuracy efficiently.

Figure 1: Random decision forest. Red, black, and green circles represent root nodes, split nodes, and leaf nodes, respectively.

2 Method

We propose a two-stage RDF for hand segmentation during hand-object interaction. In our two-stage RDF, the first RDF detects hands by processing the entire depth map. Then, the second RDF segments hands at the pixel level by applying the RDF within the detected region. This cascaded architecture is designed so that the second RDF can focus on separating hands from objects and from nearby body parts such as the arm.
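As a rough illustration of how the cascade could be wired together, the sketch below thresholds the first-stage probability map, extracts connected hand regions, and runs the second RDF only inside those regions. The classifier objects and their predict_proba method are placeholders, not the authors' implementation; extracting regions via connected components is likewise an assumption here.

```python
import numpy as np
from scipy import ndimage

def two_stage_segmentation(depth_map, rdf_stage1, rdf_stage2,
                           boundary1=0.50, boundary2=0.50):
    """Illustrative wiring of the two-stage cascade (not the authors' code).
    rdf_stage1 / rdf_stage2 are assumed to return per-pixel hand probabilities."""
    # Stage 1: hand probability over the entire depth map, then detect hand regions.
    prob1 = rdf_stage1.predict_proba(depth_map)            # H x W values in [0, 1]
    labels, n_regions = ndimage.label(prob1 > boundary1)   # connected components
    regions = ndimage.find_objects(labels)                 # bounding-box slices

    # Stage 2: per-pixel segmentation only inside the detected regions.
    segmentation = np.zeros(depth_map.shape, dtype=bool)
    for region in regions:
        prob2 = rdf_stage2.predict_proba(depth_map[region])
        segmentation[region] = prob2 > boundary2
    return segmentation
```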

An RDF consists of a collection of decision trees, as shown in Fig. 1. Each decision tree is composed of a root node, splitting nodes, and leaf nodes. Given input data at the root node, the data is passed down to child nodes according to the split function at each splitting node until it reaches a leaf node. In this paper, the input data is the location of each pixel on a depth map. The split function uses the feature of the depth difference between two relative points on the depth map, as in [19]. At a leaf node, a conditional probability distribution is learned in the training stage, and the learned probability is used in the testing stage. For more details about RDFs, we refer the readers to [20, 21, 22].

2.1 Training

Given a training dataset, the algorithm randomly selects a set of depth maps and then randomly samples a set of data points in the region of interest (ROI) on the selected depth maps. The ROI is the entire region of the depth maps in the first stage, and it is the regions detected by the first RDF in the second stage (see Fig. 2). The sampled set of data points and the corresponding depth maps are the inputs to the training of a decision tree.
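A minimal sketch of this sampling step is given below; the function name, the random seed, and the numbers of maps and points per map are illustrative choices, not the paper's values.

```python
import numpy as np

def sample_training_points(depth_maps, roi_masks, n_maps, n_points_per_map, seed=0):
    """Illustrative sampling of (map index, y, x) training points for one tree.
    roi_masks covers the whole image in stage 1 and the detected regions in stage 2."""
    rng = np.random.default_rng(seed)
    chosen_maps = rng.choice(len(depth_maps), size=n_maps, replace=False)
    samples = []
    for i in chosen_maps:
        ys, xs = np.nonzero(roi_masks[i])             # candidate pixels inside the ROI
        picked = rng.choice(len(ys), size=min(n_points_per_map, len(ys)), replace=False)
        samples.extend((int(i), int(y), int(x)) for y, x in zip(ys[picked], xs[picked]))
    return samples
```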

Using these inputs, the algorithm learns a split function at each splitting node and a conditional probability distribution at each leaf node. First, learning the split function includes learning a feature $f$ and a criterion $\tau$. We use the feature of the depth difference between two relative points in [19] as follows:

$$f(x; I) = d_I\!\left(x + \frac{u}{d_I(x)}\right) - d_I\!\left(x + \frac{v}{d_I(x)}\right) \qquad (1)$$

where $d_I(x)$ denotes the depth at a pixel $x$ on a depth map $I$; $u$ and $v$ represent offset vectors for each relative point. Then, the criterion $\tau$ decides whether the data is split to the left child or the right child:

$$x \rightarrow \begin{cases} \text{left child}, & \text{if } f(x; I) < \tau \\ \text{right child}, & \text{otherwise} \end{cases} \qquad (2)$$

Thus, the algorithm learns two offset vectors $(u, v)$ and a criterion $\tau$ at each splitting node.
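The feature in Eq. (1) and the split test in Eq. (2) can be sketched as follows. Assigning a large background depth to offsets that fall outside the image (or onto invalid depth) follows the common convention of [19] and is an assumption here.

```python
import numpy as np

BACKGROUND_DEPTH = 1e6  # large depth for out-of-image or invalid pixels (assumption)

def depth_at(depth_map, y, x):
    h, w = depth_map.shape
    if 0 <= y < h and 0 <= x < w and depth_map[y, x] > 0:
        return float(depth_map[y, x])
    return BACKGROUND_DEPTH

def depth_difference_feature(depth_map, y, x, u, v):
    """Eq. (1): depth difference between two offset points, with the offsets
    normalized by the depth at the reference pixel as in [19]."""
    d = depth_at(depth_map, y, x)
    y1, x1 = y + int(round(u[0] / d)), x + int(round(u[1] / d))
    y2, x2 = y + int(round(v[0] / d)), x + int(round(v[1] / d))
    return depth_at(depth_map, y1, x1) - depth_at(depth_map, y2, x2)

def split_left(depth_map, y, x, u, v, tau):
    """Eq. (2): send the data point to the left child if the feature is below tau."""
    return depth_difference_feature(depth_map, y, x, u, v) < tau
```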

Figure 2: Detection of hands using the RDF in the first stage.

Since the goal is separating the data points of different classes into different child nodes, the objective function is designed to evaluate the separation achieved by the candidate offset vectors and criterion as follows:

$$E = \sum_{s \in \{l, r\}} \frac{N_s}{N} \left( - \sum_{c} p(c \mid s) \log p(c \mid s) \right) \qquad (3)$$

where $s$ and $c$ are indexes for child nodes and for classes, respectively; $N_s$ denotes the number of data points in the child node $s$ (with $N = N_l + N_r$); $p(c \mid s)$ is the estimated probability of being the class $c$ at the child node $s$.

To learn the offsets and the criterion, the algorithm randomly generates a set of candidates $(u, v, \tau)$ and selects the candidate with the minimum loss:

$$(u^*, v^*, \tau^*) = \operatorname*{argmin}_{(u, v, \tau)} E(u, v, \tau) \qquad (4)$$
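A sketch of this node-learning step is given below: Eq. (3) as a child-size-weighted entropy and Eq. (4) as a search over randomly generated candidates, reusing split_left from the sketch above. The candidate offset and threshold ranges are illustrative only, not the paper's settings.

```python
import numpy as np

def weighted_entropy(labels_left, labels_right, n_classes):
    """Eq. (3): child-size-weighted Shannon entropy of the class labels."""
    total = len(labels_left) + len(labels_right)
    e = 0.0
    for labels in (labels_left, labels_right):
        if len(labels) == 0:
            continue
        p = np.bincount(labels, minlength=n_classes) / len(labels)
        p = p[p > 0]
        e += (len(labels) / total) * -(p * np.log(p)).sum()
    return e

def learn_split(samples, labels, depth_maps, n_classes, n_candidates, rng):
    """Eq. (4): pick the random candidate (u, v, tau) with the minimum loss.
    samples: list of (map index, y, x); labels: integer class ids as an array."""
    best = None
    for _ in range(n_candidates):
        u = rng.uniform(-100, 100, size=2)       # illustrative offset range
        v = rng.uniform(-100, 100, size=2)
        tau = rng.uniform(-200, 200)             # illustrative threshold range
        goes_left = np.array([
            split_left(depth_maps[i], y, x, u, v, tau) for (i, y, x) in samples
        ])
        loss = weighted_entropy(labels[goes_left], labels[~goes_left], n_classes)
        if best is None or loss < best[0]:
            best = (loss, u, v, tau)
    return best
```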

Learning a split function at each splitting node is repeated until the node satisfies the condition for a leaf node. The condition is based on (1) the maximum depth of the tree, (2) the class probability distribution at the node, and (3) the amount of training data at the node. Specifically, the algorithm avoids creating too many splitting nodes by limiting the maximum depth of the tree and by terminating when a child node has a high probability for a single class or when the amount of remaining training data is too small.

At each leaf node, the algorithm stores the learned conditional probability (the probability of being each class given that a data point reaches that leaf node) for the prediction in the testing stage.

2.2 Testing

Using the learned RDF, the algorithm predicts the probability of each class $c$ for a new data point $x$. The new data is passed down to child nodes using the learned split function at each splitting node until it reaches a leaf node, where the learned conditional probability is loaded. These steps are repeated for all trees in the forest $F$. Then, the probabilities are averaged to predict the probability of the class $c$ for the new data $x$:

$$p(c \mid x) = \frac{1}{T} \sum_{t=1}^{T} p_t(c \mid x) \qquad (5)$$

where $T$ is the number of trees in the learned forest $F$.
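In code, the per-tree predictions of Eq. (5) could be averaged as below; each tree is assumed to expose a hypothetical leaf_distribution method that walks the learned split functions down to a leaf and returns that leaf's class distribution.

```python
import numpy as np

def predict_pixel(forest, depth_map, y, x):
    """Eq. (5): average the leaf-node class distributions over all trees."""
    probs = [tree.leaf_distribution(depth_map, y, x) for tree in forest]
    return np.mean(probs, axis=0)  # p(c | x) for the pixel (y, x)
```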

In the first stage, the first RDF is applied to the entire depth map to compute a probability map. Then, the probability map is used to detect hands, as shown in Fig. 2. In the second stage, the second RDF processes only the data points in the detected regions to predict the probability of each class. The proposed two-stage RDF improves both accuracy and efficiency by letting each stage focus on its own task.

Decision boundaries are exhaustively searched with a step size of 0.01 using the predicted probability maps of the validation dataset, as shown in Fig. 3. Although 0.5 is the most typical boundary for a probability map, we found that it is not the best parameter. The selected boundaries are shown in Table 1.
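A sketch of this exhaustive boundary search on the validation set is given below, assuming per-pixel probability maps and boolean ground-truth hand masks; selecting the boundary that maximizes the F_1 score is an assumption consistent with Fig. 3 and Table 1.

```python
import numpy as np

def search_decision_boundary(prob_maps, gt_masks, step=0.01):
    """Exhaustively test boundaries in (0, 1) with the given step and keep the
    one with the best F1 score over the validation set."""
    best_boundary, best_f1 = 0.5, -1.0
    for boundary in np.arange(step, 1.0, step):
        tp = fp = fn = 0
        for prob, gt in zip(prob_maps, gt_masks):
            pred = prob > boundary
            tp += np.logical_and(pred, gt).sum()
            fp += np.logical_and(pred, ~gt).sum()
            fn += np.logical_and(~pred, gt).sum()
        if tp == 0:
            continue
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_boundary, best_f1 = float(boundary), f1
    return best_boundary, best_f1
```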

Figure 3: Scores depending on the decision boundary on the validation dataset. (a) Score of the RDF in the first stage. (b) Score of the two-stage RDF with filtering in Section 2.3.

2.3 Modified bilateral filter

Before classifying a data point to a class, a modified bilateral filter is applied to the predicted probability to make it more robust. Since the probability is predicted for each pixel independently, it is stabilized by averaging the probabilities of data points that are close in spatial distance and similar in intensity on the depth map.

Unlike a typical bilateral filter whose weights are based on the input image (in this case, the probability map) [23], the weights in the modified bilateral filter are based on a separate image, the depth map. The filtering is defined as follows:

$$\tilde{p}(c \mid x) = \frac{1}{W_x} \sum_{y \in \Omega_x} G_d\big(D(x) - D(y)\big)\, G_s\big(\lVert x - y \rVert\big)\, p(c \mid y) \qquad (6)$$

where $\Omega_x$ is the set of pixels within the filter's radius and within the pre-defined depth difference; $W_x$ is the normalization term, $W_x = \sum_{y \in \Omega_x} G_d\big(D(x) - D(y)\big)\, G_s\big(\lVert x - y \rVert\big)$; $G_d$ and $G_s$ are Gaussian functions for the depth difference and for the spatial distance from the data point $x$, respectively. The parameters in the filter were selected based on experiments on the validation dataset; both standard deviations ($\sigma_d$ and $\sigma_s$) are set to 100, and only pixels within the pre-defined maximum depth difference are considered.
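A direct, loop-based sketch of Eq. (6) follows. The standard deviations default to the value 100 reported above; the filter radius and maximum depth difference are left as parameters since their values are not recoverable from the text, and the loop form is for clarity rather than speed.

```python
import numpy as np

def modified_bilateral_filter(prob_map, depth_map, radius, max_depth_diff,
                              sigma_d=100.0, sigma_s=100.0):
    """Eq. (6): smooth the probability map with weights computed from the depth map
    (a depth-guided joint bilateral filter)."""
    h, w = prob_map.shape
    out = np.zeros_like(prob_map, dtype=float)
    for y in range(h):
        for x in range(w):
            num, norm = 0.0, 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if not (0 <= yy < h and 0 <= xx < w):
                        continue
                    depth_diff = depth_map[y, x] - depth_map[yy, xx]
                    if abs(depth_diff) > max_depth_diff:
                        continue  # outside the pre-defined depth difference
                    weight = (np.exp(-depth_diff**2 / (2 * sigma_d**2)) *
                              np.exp(-(dy**2 + dx**2) / (2 * sigma_s**2)))
                    num += weight * prob_map[yy, xx]
                    norm += weight
            out[y, x] = num / norm if norm > 0 else prob_map[y, x]
    return out
```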

Figure 4: Analysis of accuracy and efficiency.

Table 1: Quantitative comparison. The two boundaries for the proposed method are for each stage. Scores are scaled by a factor of 100; processing time is in ms.

Method | Boundary | Filter | Precision | Recall | F_1 score | Processing time (ms)
RDF [1, 19] | 0.50 | - | 38.1 | 91.2 | 53.7 | 6.7
RDF [1, 19] + proposed in Sec. 2.2 | 0.78 | - | 54.5 | 72.7 | 62.3 | 6.7
FCN-32s [24, 25] | - | - | 70.0 | 68.6 | 69.3 | 376
FCN-16s [24, 25] | - | - | 68.0 | 72.2 | 70.1 | 376
FCN-8s [24, 25] | - | - | 70.4 | 74.4 | 72.3 | 377
Proposed method | 0.50, 0.50 | - | 59.2 | 77.4 | 67.1 | 8.9
Proposed method | 0.50, 0.52 | - | 60.8 | 75.1 | 67.2 | 8.9
Proposed method | 0.50, 0.52 | 11×11 | 62.9 | 75.6 | 68.7 | 10.7

3 Experimental Evaluations

3.1 Dataset

We collected a new dataset (https://github.com/byeongkeun-kang/HOI-dataset) using a Microsoft Kinect v2 [26]. The newly collected dataset consists of 27,525 pairs of depth maps and ground truth labels from 6 people (3 males and 3 females) interacting with 21 different objects. The dataset also includes cases with one hand and with both hands in a scene. It is separated into 19,470 pairs for training, 2,706 pairs for validation, and 5,349 pairs for testing.

Figure 5: Visual comparison. (a) Ground truth label. (b) Result using RDF [1, 19]. (c) Result using RDF [1, 19] with the proposed method in Section 2.2. (d) Result using FCN-8s [24, 25]. (e) Result using the proposed method. The results and ground truth label are visualized using different color channels for better visualization.

3.2 Results

The proposed method is analyzed by demonstrating the results on the dataset in Section 3.1. For the quantitative comparison of accuracy, we measure the $F_1$ score, precision, and recall as follows:

$$\text{precision} = \frac{tp}{tp + fp}, \qquad \text{recall} = \frac{tp}{tp + fn}, \qquad F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (7)$$

where $tp$, $fp$, and $fn$ represent the numbers of true positives, false positives, and false negatives, respectively. For the comparison of efficiency, we measure the processing time using a machine with an Intel i7-4790K CPU and an Nvidia GeForce GTX 770.
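For reference, Eq. (7) translated to code for a single pair of boolean masks; the zero-division handling is our addition.

```python
import numpy as np

def precision_recall_f1(pred_mask, gt_mask):
    """Eq. (7): precision, recall, and F1 from binary prediction and ground-truth masks."""
    tp = np.logical_and(pred_mask, gt_mask).sum()
    fp = np.logical_and(pred_mask, ~gt_mask).sum()
    fn = np.logical_and(~pred_mask, gt_mask).sum()
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1
```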

The proposed method is compared with the RDF-based method in [1, 19] and the fully convolutional networks (FCN) in [24, 25] using only depth maps. The proposed method is not compared with color-based methods since the characteristics of depth sensors and color imaging sensors are quite different. For example, a depth map captured by a depth sensor does not vary with lighting conditions, whereas a captured color image varies considerably with lighting. Thus, the choice of capturing environment would bias a comparison between results using depth maps and results using color images. Hence, we only compare the proposed method with the state-of-the-art methods that can operate using only depth maps.

Table 1 and Fig. 5 show quantitative and visual results. The scores in Table 1 are scaled by a factor of 100. The quantitative results show that the proposed method achieves about 25% and 8% relative improvements in $F_1$ score compared to the RDF-based method [1, 19] and to its combination with the proposed processing in Section 2.2, respectively. Compared to the deep learning-based methods [24, 25], the proposed method achieves about 7% lower accuracy but requires about 42 times less processing time; thus, the deep learning-based methods cannot be used in real-time applications. Fig. 4 shows the comparison of methods in accuracy and efficiency. The proposed method achieves high accuracy with a short processing time.

4 Conclusion

In this paper, we presented a two-stage RDF method for hand segmentation for hand-object interaction using only a depth map. The two stages consist of detecting the region of interest and segmenting hands. The proposed method achieves high accuracy with a short processing time compared to the state-of-the-art methods.

References

  • [1] J. Tompson, M. Stein, Y. Lecun, and K. Perlin, “Real-time continuous pose recovery of human hands using convolutional networks,” ACM Trans. Graph., 2014.
  • [2] T. Sharp, C. Keskin, D. Robertson, J. Taylor, J. Shotton, D. Kim, C. Rhemann, I. Leichter, A. Vinnikov, Y. Wei, D. Freedman, P. Kohli, E. Krupka, A. Fitzgibbon, and S. Izadi, “Accurate, robust, and flexible real-time hand tracking,” in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 2015.
  • [3] C. Qian, X. Sun, Y. Wei, X. Tang, and J. Sun, “Realtime and robust hand tracking from depth,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, June 2014.
  • [4] J. Romero, H. Kjellstrom, and D. Kragic, “Hands in action: real-time 3d reconstruction of hands in interaction with objects,” in Robotics and Automation (ICRA), 2010 IEEE International Conference on, May 2010.
  • [5] J. Romero, H. Kjellstrom, C. H. Ek, and D. Kragic, “Non-parametric hand pose estimation with object context,” Image Vision Comput., vol. 31, no. 8, Aug. 2013.
  • [6] Y. Wang, J. Min, J. Zhang, Y. Liu, F. Xu, Q. Dai, and J. Chai, “Video-based hand manipulation capture through composite motion control,” ACM Trans. Graph., vol. 32, no. 4, July 2013.
  • [7] I. Oikonomidis, N. Kyriazis, and A.A. Argyros, “Full dof tracking of a hand interacting with an object by modeling occlusions and physical constraints,” in Computer Vision (ICCV), 2011 IEEE International Conference on, 2011.
  • [8] A. A. Argyros and M. I. A. Lourakis, “Real-time tracking of multiple skin-colored objects with a possibly moving camera,” in Computer Vision - ECCV, 2004.
  • [9] B. Kang, Y. Lee, and T. Nguyen, “Efficient hand articulations tracking using adaptive hand model and depth map,” in Advances in Visual Computing, Dec. 2015.
  • [10] M. Cai, K.M. Kitani, and Y. Sato, “A scalable approach for understanding the visual structures of hand grasps,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on, May 2015.
  • [11] B. Kang, S. Tripathi, and T. Nguyen, “Real-time sign language fingerspelling recognition using convolutional neural networks from depth map,” in Pattern Recognition, 2015 3rd IAPR Asian Conference on, Nov. 2015.
  • [12] M. J. Jones and J. M. Rehg, “Statistical color models with application to skin detection,” International Journal of Computer Vision, vol. 46, Jan. 2002.
  • [13] R. Khan, A. Hanbury, J. Stöttinger, and A. Bais, “Color based skin classification,” Pattern Recognition Letters, vol. 33, Jan. 2012.
  • [14] C. Li and K.M. Kitani, “Pixel-level hand detection in ego-centric videos,” in Computer Vision and Pattern Recognition, 2013 IEEE Conference on, June 2013.
  • [15] S. L. Phung, A. Bouzerdoum, and D. Chai, “Skin segmentation using color pixel classification: analysis and comparison,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 27, 2005.
  • [16] P. Kakumanu, S. Makrogiannis, and N. Bourbakis, “A survey of skin-color modeling and detection methods,” Pattern Recognition, vol. 40, Mar. 2007.
  • [17] D. Tzionas and J. Gall, “3d object reconstruction from hand-object interactions,” in International Conference on Computer Vision (ICCV), Dec. 2015.
  • [18] R. Y. Wang and J. Popović, “Real-time hand-tracking with a color glove,” ACM Trans. Graph., vol. 28, 2009.
  • [19] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, “Real-time human pose recognition in parts from single depth images,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, June 2011.
  • [20] A. Criminisi and J. Shotton, “Decision forests for computer vision and medical image analysis,” Advances in Computer Vision and Pattern Recognition, 2013.
  • [21] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, 2001.
  • [22] T. Ho, “Random decision forests,” in Proceedings of the Third International Conference on Document Analysis and Recognition, 1995.
  • [23] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in Computer Vision, Sixth International Conference on, Jan. 1998.
  • [24] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [25] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
  • [26] B. Kang, Y. Lee, and T. Q. Nguyen, “Depth adaptive deep neural network for semantic segmentation,” IEEE Transactions on Multimedia, 2018.