HOI-dataset
Hand-object interaction is important for many applications such as augmented reality, medical applications, and human-robot interaction. To understand hand-object interaction, hand segmentation is a necessary pre-processing step. However, current methods are based on color information, which is not robust to objects with skin-like colors, differences in skin pigment, and variations in lighting conditions. Therefore, we propose the first hand segmentation method for hand-object interaction that uses only a depth map. The proposed method consists of a randomized decision forest (RDF), bilateral filtering, decision adjustment, and post-processing. We demonstrate the effectiveness of the method by testing on five objects. The method achieves average F1 scores of 0.8409 and 0.8163 for previously seen and new objects, respectively, and takes less than 10 ms to process each frame.
Recently, with the expansion of virtual reality (VR), augmented reality (AR), robotics, and intelligent vehicles, the development of new interaction technologies has become essential, since these applications require natural interaction methods rather than input devices. Much research has been conducted for these applications, such as gesture recognition and hand pose estimation. However, most of this work focuses on interactions that do not involve touching or handling real-world objects, even though understanding interactions with objects is important in many applications. We believe this is because hand segmentation is much more difficult during hand-object interaction. Thus, we present a framework of hand segmentation for hand-object interaction.
Hand segmentation has been studied for many applications such as hand pose estimation [1, 2, 3, 4, 5, 6], hand tracking [7, 8, 9], and gesture/sign/grasp recognition [10, 11]. Among color image-based methods, skin color-based approaches have been popular [12, 13, 14, 10, 15, 16]. For hand-object interaction, Oikonomidis et al. and Romero et al. segmented hands by thresholding skin color in HSV space [7, 4, 5, 8]. Wang et al. performed hand segmentation using a learned probabilistic model, where the model is constructed from the color histogram of the first frame [6]. Tzionas et al. applied skin color-based segmentation using a Gaussian mixture model [17]. However, skin color-based segmentation has limitations: it fails when interacting with objects of skin-like color and when segmenting hands from other body parts, and it is sensitive to skin pigment differences and lighting variations. An alternative is to wear a glove of a specific color [18]. Among depth map-based methods, popular approaches use a wrist band [11, 9, 3] or a random decision forest (RDF) [1, 19, 2]. Although the method using a black wristband is uncomplicated and effective, it is inconvenient. Moreover, it cannot separate hands from objects during hand-object interaction since it performs segmentation by finding connected components. Tompson et al. [1] and Sharp et al. [2] proposed RDF-based methods based on [19]. Although their purposes differ slightly from ours, these are the most relevant methods.
In this paper, we propose a hand segmentation method for hand-object interaction using only a depth map, avoiding the limitations of skin color-based methods. We present a two-stage RDF method that achieves high accuracy efficiently.
We propose a two-stage RDF for hand segmentation during hand-object interaction. In our two-stage RDF, the first RDF detects hands by processing the entire depth map. Then, the second RDF segments hands at the pixel level by applying the RDF within the detected region. This cascaded architecture lets the second RDF focus on separating hands from objects and from close body parts such as the arm.
An RDF consists of a collection of decision trees as shown in Fig. 1. Each decision tree is composed of a root node, splitting nodes, and leaf nodes. Given input data at the root node, it is routed to child nodes by the split function at each splitting node until it reaches a leaf node. In this paper, the input data is the location of each pixel on a depth map, and the split function uses the depth difference between two points offset from the pixel, as in [19]. At each leaf node, a conditional probability distribution is learned in the training stage, and the learned probability is used in the testing stage. For more details about RDFs, we refer the readers to [20, 21, 22].

Given a training dataset $\mathcal{D}$, the algorithm randomly selects a set of depth maps and then randomly samples a set of data points $\mathcal{S}$ in the region of interest (ROI) on the selected depth maps. The ROI is the entire depth map in the first stage, and the region detected by the first RDF in the second stage (see Fig. 2). The sampled set of data points $\mathcal{S}$ and the corresponding depth maps are the inputs to the training of a decision tree.
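The sampling step above can be sketched as follows. The map and pixel counts and the boolean-mask ROI representation are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sample_training_pixels(depth_maps, rois, num_maps=5, pixels_per_map=1000, rng=None):
    """Randomly pick depth maps, then sample pixel locations inside each ROI mask."""
    rng = np.random.default_rng(rng)
    chosen = rng.choice(len(depth_maps), size=min(num_maps, len(depth_maps)), replace=False)
    samples = []  # (map index, y, x) triples
    for i in chosen:
        ys, xs = np.nonzero(rois[i])  # ROI given as a boolean mask
        if len(ys) == 0:
            continue
        idx = rng.choice(len(ys), size=min(pixels_per_map, len(ys)), replace=False)
        samples.extend((i, ys[j], xs[j]) for j in idx)
    return samples
```

In the first stage the ROI mask covers the whole image; in the second stage it would be the detection output of the first RDF.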
Using the inputs ($\mathcal{S}$ and the corresponding depth maps), the algorithm learns a split function at each splitting node and a conditional probability distribution at each leaf node. First, learning the split function involves learning a feature $f$ and a criterion $\tau$. We use the feature of the depth difference between two relative points, as in [19]:

$$ f(d, x; u, v) = d\!\left(x + \frac{u}{d(x)}\right) - d\!\left(x + \frac{v}{d(x)}\right) \tag{1} $$

where $d(x)$ denotes the depth at a pixel $x$ on a depth map $d$, and $u$ and $v$ represent offset vectors for the two relative points (normalized by $d(x)$ so that the feature is invariant to the distance from the camera). Then, the criterion $\tau$ decides whether to send the data point to the left child or the right child:

$$ x \rightarrow \begin{cases} \text{left child} & \text{if } f(d, x; u, v) < \tau \\ \text{right child} & \text{otherwise} \end{cases} \tag{2} $$

Thus, the algorithm learns two offset vectors $(u, v)$ and a criterion $\tau$ at each splitting node.
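Eqs. (1)-(2) can be sketched as follows. Returning a large constant depth for probes that fall outside the image (or on missing depth) is a common convention from [19], not a detail specified here.

```python
import numpy as np

LARGE_DEPTH = 1e6  # value returned for out-of-image probes, a common convention

def depth_at(depth, p):
    """Depth at pixel p = (y, x), or LARGE_DEPTH if outside the image or invalid."""
    y, x = int(round(p[0])), int(round(p[1]))
    h, w = depth.shape
    if 0 <= y < h and 0 <= x < w and depth[y, x] > 0:
        return float(depth[y, x])
    return LARGE_DEPTH

def split_feature(depth, x, u, v):
    """Eq. (1): depth difference at two offsets, normalized by the depth at x."""
    d_x = depth_at(depth, x)
    probe_u = (x[0] + u[0] / d_x, x[1] + u[1] / d_x)
    probe_v = (x[0] + v[0] / d_x, x[1] + v[1] / d_x)
    return depth_at(depth, probe_u) - depth_at(depth, probe_v)

def route(depth, x, u, v, tau):
    """Eq. (2): send the pixel to the left child if the feature is below tau."""
    return 'left' if split_feature(depth, x, u, v) < tau else 'right'
```

On a flat surface the two probes see the same depth, so the feature is zero; near a depth discontinuity (e.g. a hand boundary) it becomes large, which is what makes this feature discriminative.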
Since the goal is to separate data points of different classes into different child nodes, the objective function evaluates the separation achieved by the learned offset vectors and criterion:

$$ L(u, v, \tau) = \sum_{s \in \{l, r\}} \frac{n_s}{n} \left( -\sum_{k} p_s(k) \log p_s(k) \right) \tag{3} $$

where $s$ and $k$ are indexes for the child nodes and for the classes, respectively; $n_s$ denotes the number of data points in the child node $s$ (with $n = n_l + n_r$); and $p_s(k)$ is the estimated probability of being class $k$ at the child node $s$.

To learn the offsets and criterion, the algorithm randomly generates a set of candidates $(u, v, \tau)$ and selects the candidate with the minimum loss:

$$ (u^*, v^*, \tau^*) = \operatorname*{argmin}_{(u, v, \tau)} L(u, v, \tau) \tag{4} $$
Learning a split function at each splitting node is repeated until the node satisfies the condition for a leaf node. The condition is based on (1) the maximum depth of the tree, (2) the probability distribution $p(k)$ at the node, and (3) the amount of training data remaining at the node. Specifically, the algorithm avoids creating too many splitting nodes by limiting the maximum depth of the tree and by terminating when a child node has a high probability for a single class or when the amount of remaining training data is too small.

At each leaf node $l$, the algorithm stores the conditional probability $p(k \mid l)$ (the probability of being each class $k$ given that a data point reaches the node $l$) for prediction in the testing stage.
Using the learned RDF, the algorithm predicts the probability of each class $k$ for new data $x$. The new data is routed to child nodes using the learned split function at each splitting node until it reaches a leaf node, where the learned conditional probability $p_t(k \mid x)$ is loaded. These steps are repeated for all trees in the forest $\mathcal{F}$, and the probabilities are averaged to predict the probability of class $k$ for the new data $x$:

$$ p(k \mid x) = \frac{1}{T} \sum_{t=1}^{T} p_t(k \mid x) \tag{5} $$

where $T$ is the number of trees in the learned forest $\mathcal{F}$.
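The per-pixel prediction in Eq. (5) reduces to descending each tree and averaging the leaf distributions. A minimal sketch, with trees represented as nested dicts (a hypothetical representation, not the paper's data structure):

```python
import numpy as np

def predict_pixel(forest, depth, x):
    """Eq. (5): average the leaf class distributions p_t(k|x) over all T trees."""
    probs = []
    for tree in forest:
        node = tree
        while 'leaf' not in node:             # descend until a leaf node
            branch = node['split'](depth, x)  # returns 'left' or 'right'
            node = node[branch]
        probs.append(node['leaf'])            # stored class distribution at the leaf
    return np.mean(probs, axis=0)
```

With two single-split trees whose leaves disagree, the forest output is the average of the two leaf distributions.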
In the first stage, the first RDF is applied to the entire depth map to compute a probability map, which is used to detect hands as shown in Fig. 2. In the second stage, the second RDF processes the data points in the detected regions to predict the probability of each class. The proposed two-stage RDF improves both accuracy and efficiency by letting each stage focus on its own task.
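The two-stage flow can be sketched as below. For simplicity the sketch thresholds the stage-one probability map into a single bounding box; the paper's detection step (Fig. 2) may differ.

```python
import numpy as np

def two_stage_segment(depth, rdf1_predict, rdf2_predict, detect_threshold=0.5):
    """Stage 1: hand-probability map over the whole image -> detected region.
       Stage 2: per-pixel segmentation inside the detected region only."""
    prob1 = rdf1_predict(depth)          # H x W hand-probability map
    mask = prob1 > detect_threshold
    seg = np.zeros_like(prob1)
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return seg                       # no hand detected
    # one bounding box around all detections, for simplicity of the sketch
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    seg[y0:y1, x0:x1] = rdf2_predict(depth[y0:y1, x0:x1])
    return seg
```

The efficiency gain comes from the second (more expensive, per-pixel) RDF running only on the cropped region rather than the full frame.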
Decision boundaries are exhaustively searched with a step size of 0.01 using the predicted probability maps of the validation dataset, as shown in Fig. 3. Although the most typical boundary for a probability map is 0.5, we found that it is not the best parameter. The selected boundaries are shown in Table 2.3.
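The boundary search can be sketched as a grid search maximizing validation F1. The helper `f1_at`, which scores one candidate boundary against ground-truth masks, is a hypothetical name.

```python
import numpy as np

def f1_at(prob_maps, gt_masks, boundary):
    """F1 score of thresholding the probability maps at the given boundary."""
    tp = fp = fn = 0
    for prob, gt in zip(prob_maps, gt_masks):
        pred = prob > boundary
        tp += np.sum(pred & gt)
        fp += np.sum(pred & ~gt)
        fn += np.sum(~pred & gt)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def search_boundary(prob_maps, gt_masks, step=0.01):
    """Exhaustive search over boundaries with the paper's step size of 0.01."""
    candidates = np.arange(step, 1.0, step)
    scores = [f1_at(prob_maps, gt_masks, b) for b in candidates]
    return candidates[int(np.argmax(scores))]
```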
Before classifying a data point $x$ to a class $k$, a modified bilateral filter is applied to the predicted probability map to make the probabilities more robust. Since the probability is predicted for each pixel independently, it is stabilized by averaging the probabilities of data points that are close in distance and similar in depth.

Unlike a typical bilateral filter, whose weights are based on the input image (in this case, the probability map) [23], the weights in the modified bilateral filter are based on a separate image, the depth map. The filtering is defined as follows:

$$ \tilde{p}(k \mid x) = \frac{1}{W(x)} \sum_{x_i \in \Omega} g_d\big(|d(x_i) - d(x)|\big)\, g_s\big(\lVert x_i - x \rVert\big)\, p(k \mid x_i) \tag{6} $$

where $\Omega$ is the set of pixels within the filter's radius and within a pre-defined maximum depth difference; $W(x) = \sum_{x_i \in \Omega} g_d\big(|d(x_i) - d(x)|\big)\, g_s\big(\lVert x_i - x \rVert\big)$ is the normalization term; and $g_d$ and $g_s$ are the Gaussian functions for the depth difference and for the spatial distance from the data point $x$, respectively. The parameters in the filter were selected based on experiments using the validation dataset: both standard deviations ($\sigma_d$ and $\sigma_s$) are 100.
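Eq. (6) can be sketched as a direct (unoptimized) implementation. The window radius and the depth-difference cutoff below are illustrative parameters, not values from the paper.

```python
import numpy as np

def modified_bilateral(prob, depth, radius=5, sigma_d=100.0, sigma_s=100.0, max_dd=None):
    """Smooth a probability map; the weights come from the depth map, not from prob."""
    h, w = prob.shape
    out = np.zeros_like(prob)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            dd = depth[y0:y1, x0:x1] - depth[y, x]   # depth difference to center
            yy, xx = np.mgrid[y0:y1, x0:x1]
            ss = (yy - y) ** 2 + (xx - x) ** 2       # squared spatial distance
            wgt = np.exp(-dd ** 2 / (2 * sigma_d ** 2)) * np.exp(-ss / (2 * sigma_s ** 2))
            if max_dd is not None:
                wgt[np.abs(dd) > max_dd] = 0.0       # pre-defined depth-difference cutoff
            out[y, x] = np.sum(wgt * prob[y0:y1, x0:x1]) / np.sum(wgt)
    return out
```

Because the weights depend on depth rather than on the probabilities themselves, smoothing does not leak probability mass across depth discontinuities such as the boundary between a hand and a distant background.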
| Method | Boundary | Filter | Precision | Recall | F1 score | Time (ms) |
|---|---|---|---|---|---|---|
| RDF [1, 19] | 0.50 | - | 38.1 | 91.2 | 53.7 | 6.7 |
| RDF [1, 19] + Proposed in Sec. 2.2 | 0.78 | - | 54.5 | 72.7 | 62.3 | 6.7 |
| FCN-32s [24, 25] | - | - | 70.0 | 68.6 | 69.3 | 376 |
| FCN-16s [24, 25] | - | - | 68.0 | 72.2 | 70.1 | 376 |
| FCN-8s [24, 25] | - | - | 70.4 | 74.4 | 72.3 | 377 |
| Proposed method | 0.50, 0.50 | - | 59.2 | 77.4 | 67.1 | 8.9 |
| Proposed method | 0.50, 0.52 | - | 60.8 | 75.1 | 67.2 | 8.9 |
| Proposed method | 0.50, 0.52 | 11×11 | 62.9 | 75.6 | 68.7 | 10.7 |
We collected a new dataset (https://github.com/byeongkeun-kang/HOI-dataset) using a Microsoft Kinect v2 [26]. The dataset consists of 27,525 pairs of depth maps and ground truth labels from 6 people (3 males and 3 females) interacting with 21 different objects. It includes cases of one hand and of both hands in a scene, and is divided into 19,470 pairs for training, 2,706 pairs for validation, and 5,349 pairs for testing.
The proposed method is analyzed by demonstrating results on the dataset described in Section 3.1. For the quantitative comparison of accuracy, we measure the F1 score, precision, and recall as follows:

$$ \text{precision} = \frac{tp}{tp + fp}, \qquad \text{recall} = \frac{tp}{tp + fn}, \qquad F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{7} $$

where $tp$, $fp$, and $fn$ represent true positives, false positives, and false negatives, respectively. For the comparison of efficiency, we measure the processing time on a machine with an Intel i7-4790K CPU and an Nvidia GeForce GTX 770 GPU.
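The metrics in Eq. (7) can be computed directly from binary masks:

```python
import numpy as np

def segmentation_scores(pred, gt):
    """Precision, recall, and F1 (Eq. (7)) for boolean prediction and ground-truth masks."""
    tp = np.sum(pred & gt)    # predicted hand, actually hand
    fp = np.sum(pred & ~gt)   # predicted hand, actually background
    fn = np.sum(~pred & gt)   # missed hand pixels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```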
The proposed method is compared with the RDF-based method in [1, 19] and with the fully convolutional networks (FCN) in [24, 25], using only a depth map. We do not compare with color-based methods since the characteristics of depth sensors and color imaging sensors are quite different: a captured depth map does not vary with lighting conditions, whereas a captured color image varies considerably, so the choice of capture environment would bias any comparison between depth- and color-based results. Hence, we only compare the proposed method with state-of-the-art methods that can operate on depth maps alone.
Table 2.3 and Fig. 5 show quantitative and visual results. The scores in Table 2.3 are scaled by a factor of 100. The quantitative results show that the proposed method achieves about 25% and 8% relative improvement in F1 score compared to the RDF-based method [1, 19] and to its combination with the decision adjustment proposed in Section 2.2, respectively. Compared to the deep learning-based methods [24, 25], the proposed method achieves about 7% lower accuracy but processes each frame about 42 times faster; at roughly 376 ms per frame, the deep learning-based methods cannot be used in real-time applications. Fig. 4 compares the methods in accuracy and efficiency: the proposed method achieves high accuracy with a short processing time.

In this paper, we presented a two-stage RDF method for hand segmentation for hand-object interaction using only a depth map. The two stages consist of detecting the region of interest and segmenting the hands. The proposed method achieves high accuracy in a short processing time compared to the state-of-the-art methods.