hand_object_contact_prediction
Code/data of the paper "Hand-Object Contact Prediction via Motion-Based Pseudo-Labeling and Guided Progressive Label Correction" (BMVC2021)
Every hand-object interaction begins with contact. Although predicting the contact state between hands and objects is useful in understanding hand-object interactions, prior work on hand-object analysis has assumed that the interacting hands and objects are already known, and contact itself has not been studied in detail. In this study, we introduce a video-based method for predicting contact between a hand and an object. Specifically, given a video and a pair of hand and object tracks, we predict a binary contact state (contact or no-contact) for each frame. However, annotating a large number of hand-object tracks and contact labels is costly. To overcome this difficulty, we propose a semi-supervised framework consisting of (i) automatic collection of training data with motion-based pseudo-labels and (ii) guided progressive label correction (gPLC), which corrects noisy pseudo-labels with a small amount of trusted data. We validated our framework's effectiveness on a newly built benchmark dataset for hand-object contact prediction and showed superior performance against existing baseline methods. Code and data are available at https://github.com/takumayagi/hand_object_contact_prediction.
Recognizing how hands interact with objects is crucial to understanding how we interact with the world. Hand-object interaction analysis contributes to several fields such as action prediction [Dessalene et al.(2021)Dessalene, Devaraj, Maynord, Fermuller, and Aloimonos], rehabilitation [Likitlersuang et al.(2019)Likitlersuang, Sumitro, Cao, Visée, Kalsi-Ryan, and Zariffa], robotics [Rajeswaran et al.()Rajeswaran, Kumar, Gupta, Vezzani, Schulman, Todorov, and Levine], and virtual reality [Han et al.(2020)Han, Liu, Cabezas, Twigg, Zhang, Petkau, Yu, Tai, Akbay, Wang, et al.].
Every hand-object interaction begins with contact. In determining which hand-object pairs are interacting, it is important to infer when hands and objects are in contact. However, despite its importance, finding the beginning and the end of hand-object interaction has not received much attention. For instance, prior works on action recognition (e.g., [Feichtenhofer et al.(2019)Feichtenhofer, Fan, Malik, and He]) attempt to recognize different types of hand-object interaction at the video clip level, i.e., recognizing one action for each input video clip. Other works on action localization (e.g., [Lin et al.(2019)Lin, Liu, Li, Ding, and Wen]) can be used to detect hand-object interactions, but the localized action segments do not necessarily correspond to the beginning and the end of contact between hands and objects. Contact between a hand and an object has been studied in the context of 3D reconstruction of hand-object interaction [Hasson et al.(2019)Hasson, Varol, Tzionas, Kalevatykh, Black, Laptev, and Schmid, Cao et al.(2021)Cao, Radosavovic, Kanazawa, and Malik]. However, these works assumed that hands and objects are already interacting with each other; only the moments when hands and objects are in a stable grasp were targeted for analysis.
In this work, we tackle the task of predicting contact between a hand and an object from visual input. This prediction is not trivial: for example, even if the hand area and the bounding box of an object overlap, it does not necessarily mean that the hand and the object are in contact (see Figure 1). In determining whether a hand and an object are in contact, it is essential to consider the spatiotemporal relationship between them. While some methods claim that the hand contact state can be classified by looking at hand shape [Shan et al.(2020)Shan, Geng, Shu, and Fouhey, Narasimhaswamy et al.(2020)Narasimhaswamy, Nguyen, and Hoai], they do not explicitly predict the contact state between a specific pair of a hand and an object, limiting their utility.

We propose a video-based method for predicting binary contact states (contact or no-contact) between a hand and an object in every frame. We assume tracks of hands and objects specified by bounding boxes (hand-object tracks) as input. However, annotating a large number of hand-object tracks and their contact states can become too costly. To overcome this difficulty, we propose a semi-supervised framework consisting of (i) automatic training data collection with motion-based pseudo-labels and (ii) guided progressive label correction (gPLC), which corrects noisy pseudo-labels with a small amount of trusted data.
Given unlabeled videos, we apply off-the-shelf detection and tracking models to form a set of hand-object tracks. Then we assign pseudo-contact state labels to each track by looking at its motion pattern. Specifically, we assign a contact label when a hand and an object are moving in the same direction and a no-contact label when a hand is moving alone.
While the generated pseudo-labels provide valuable information for determining the contact state with various types of objects when training a prediction model, they also contain errors that hurt the model's performance. To alleviate this problem, we correct those errors under the guidance of an additional model trained on a small amount of trusted data. In gPLC, we train two networks, one with the noisy pseudo-labels and one with the trusted labels. During training, we iteratively correct the noisy pseudo-labels based on both networks' confidence scores. The small-scale trusted data guides which labels to correct and yields reliable training labels for the automatically extracted hand-object tracks.
Since there was no benchmark suitable for this task, we newly annotated contact states for various types of interactions appearing in the EPIC-KITCHENS dataset [Damen et al.(2018)Damen, Doughty, Maria Farinella, Fidler, Furnari, Kazakos, Moltisanti, Munro, Perrett, Price, et al., Damen et al.(2021)Damen, Doughty, Farinella, Furnari, Ma, Kazakos, Moltisanti, Munro, Perrett, Price, and Wray], which includes in-the-wild cooking activities. We show that our prediction model achieves superior performance against frame-based models [Shan et al.(2020)Shan, Geng, Shu, and Fouhey, Narasimhaswamy et al.(2020)Narasimhaswamy, Nguyen, and Hoai], and that the performance is further boosted by using motion-based pseudo-labels along with the proposed gPLC scheme.
Our contributions include: (1) a video-based method for predicting contact between a hand and an object that leverages temporal context; (2) a semi-supervised framework of automatic pseudo-contact-state label collection and guided label correction to compensate for the lack of annotations; (3) an evaluation on newly collected annotations over a real-world dataset.
Reconstruction of the spatial configuration of hands and their interacting objects plays a crucial role in understanding hand-object interaction. 2D segmentation [Narasimhaswamy and Vazir(2019)] and 3D pose/mesh estimation [Romero et al.(2010)Romero, Kjellström, and Kragic, Tzionas and Gall(2015), Hasson et al.(2019)Hasson, Varol, Tzionas, Kalevatykh, Black, Laptev, and Schmid, Hampali et al.(2020)Hampali, Rad, Oberweger, and Lepetit, Liu et al.(2021)Liu, Jiang, Xu, Liu, and Wang] of hand-object interaction have been studied actively in recent years. However, these methods assume that (1) 3D CAD models exist for initialization (except [Hasson et al.(2019)Hasson, Varol, Tzionas, Kalevatykh, Black, Laptev, and Schmid]) and (2) the hand is interacting with the object, making them inapplicable when the hand and the object are not interacting with each other. While multiple datasets exist for hand-object interaction analysis [Garcia-Hernando et al.(2018)Garcia-Hernando, Yuan, Baek, and Kim, Narasimhaswamy and Vazir(2019), Brahmbhatt et al.(2019)Brahmbhatt, Ham, Kemp, and Hays, Shan et al.(2020)Shan, Geng, Shu, and Fouhey], none focuses on the entire process of interaction, including the beginning and termination of contact. It is worth mentioning DexYCB [Chao et al.(2021)Chao, Yang, Xiang, Molchanov, Handa, Tremblay, Narang, Van Wyk, Iqbal, Birchfield, et al.], which captured sequences of picking up an object. However, the performed action was very simple, and its analysis focused on 3D pose estimation rather than contact modeling between hands and objects. We study the stage preceding the hand-object reconstruction problem: whether the hand is in contact with the object or not.
Contact prediction is a difficult problem because contact cannot be directly observed due to occlusions. To avoid using intrusive hand-mounted sensors, contact and force prediction from visual input has been studied [Pham et al.(2017)Pham, Kyriazis, Argyros, and Kheddar, Akizuki and Aoki(2019), Taheri et al.(2020)Taheri, Ghorbani, Black, and Tzionas, Ehsani et al.(2020)Ehsani, Tulsiani, Gupta, Farhadi, and Gupta]. For example, Pham et al. [Pham et al.(2017)Pham, Kyriazis, Argyros, and Kheddar] presented an RNN-based force estimation method trained on force and kinematics measurements from force transducers. These methods require a careful sensor setup, making them hard to apply in an unconstrained environment. Instead of precise force measurement, a few methods study contact state classification (e.g., no contact, self contact, other-person contact, object contact) from an image [Shan et al.(2020)Shan, Geng, Shu, and Fouhey, Narasimhaswamy et al.(2020)Narasimhaswamy, Nguyen, and Hoai]. Shan et al. [Shan et al.(2020)Shan, Geng, Shu, and Fouhey] collected a large-scale dataset of hand-object interaction along with annotated bounding boxes of hands and objects in contact, and trained a network that detects hands and their contact state from appearance. Narasimhaswamy et al. [Narasimhaswamy et al.(2020)Narasimhaswamy, Nguyen, and Hoai] extended the task to multi-class prediction. While their formulation is simple, these methods do not explicitly take the relationship between hands and objects into account and are prone to false-positive predictions. To balance utility and convenience, we take a middle way between the two approaches: binary contact state prediction between a hand and an object specified by bounding boxes.
Since dense labels are often costly to collect, methods that learn from large unlabeled data have been studied. While learning from weak cues has been studied in object recognition [Pathak et al.(2017)Pathak, Girshick, Dollár, Darrell, and Hariharan] and instance segmentation [Pathak et al.(2018)Pathak, Shentu, Chen, Agrawal, Darrell, Levine, and Malik], it has not been well studied in sequence prediction tasks. Generated pseudo-labels typically include noise that hurts the model's performance. Various approaches such as loss correction [Patrini et al.(2017)Patrini, Rozza, Krishna Menon, Nock, and Qu, Hendrycks et al.(2018)Hendrycks, Mazeika, Wilson, and Gimpel], label correction [Tanaka et al.(2018)Tanaka, Ikami, Yamasaki, and Aizawa, Zhang et al.(2021)Zhang, Zheng, Wu, Goswami, and Chen], sample selection [Jiang et al.(2018)Jiang, Zhou, Leung, Li, and Fei-Fei], and co-teaching [Han et al.(2018)Han, Yao, Yu, Niu, Xu, Hu, Tsang, and Sugiyama, Li et al.(2020)Li, Socher, and Hoi] have been proposed to deal with noisy labels. However, most methods assume feature-independent label noise, which is over-simplified, and only a few works study realistic feature-dependent label noise [Chen et al.(2021)Chen, Ye, Chen, Zhao, and Heng, Zhang et al.(2021)Zhang, Zheng, Wu, Goswami, and Chen]. Zhang et al. [Zhang et al.(2021)Zhang, Zheng, Wu, Goswami, and Chen] proposed progressive label correction (PLC), which iteratively corrects labels based on the network's confidence score, with theoretical guarantees against feature-dependent noise patterns. Inspired by PLC [Zhang et al.(2021)Zhang, Zheng, Wu, Goswami, and Chen], we propose gPLC, which iteratively corrects noisy labels using not only the prediction model but also a clean model trained on small-scale trusted labels.
In contrast to prior works [Shan et al.(2020)Shan, Geng, Shu, and Fouhey, Narasimhaswamy et al.(2020)Narasimhaswamy, Nguyen, and Hoai], we formalize the hand-object contact prediction problem as predicting the contact states between a hand and a specific object appearing in an image sequence. We assume a sequence of video frames, hand instance masks, and target object bounding boxes as inputs, forming a hand-object track.
Our goal is to predict a sequence of binary contact states ("no contact" or "contact") given a hand-object track. If any physical contact between the hand and the object exists, the binary contact state of that frame is set to "contact", otherwise "no contact". Although we do not explicitly model two-handed manipulation, we consider the presence of another hand as side information (see Section 3.3 for details).
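To make this input/output representation concrete, the following is a minimal sketch of how a hand-object track and its per-frame labels could be stored; the field names are illustrative, not those of the released code.

```python
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class HandObjectTrack:
    """One hand-object pair tracked over T frames."""
    frames: np.ndarray        # (T, H, W, 3) RGB frames
    hand_masks: np.ndarray    # (T, H, W) binary hand instance masks
    object_boxes: np.ndarray  # (T, 4) object bounding boxes (x1, y1, x2, y2)
    # Per-frame contact state: 1 = contact, 0 = no contact, None = unlabeled
    # (pseudo-labels are assigned only to frames with clear motion evidence).
    labels: List[Optional[int]]
```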
However, collecting a large number of hand-object tracks and contact states for training can become too costly. We deal with this problem by automatic pseudo-label collection based on motion analysis and a semi-supervised label correction scheme.
We automatically detect hand-object tracks and assign pseudo-labels to them based on two critical assumptions: (i) when a hand and an object are in contact, they exhibit similar motion patterns; (ii) when a hand and an object are not in contact, the hand moves while the object remains static (see Figure 2 left for an illustration). Because these assumptions are simple yet applicable regardless of object appearance and motion direction, we can use the resulting motion-based pseudo-labels for training to achieve generalization to novel objects.
Given a video clip, we first use the hand-object detection model [Shan et al.(2020)Shan, Geng, Shu, and Fouhey] to detect bounding boxes of hands and candidate objects appearing in each frame. Note that the contact state of a detected object is unknown; the detector simply returns objects that overlap with hands. For each hand detection, we further apply a hand segmentation model trained on the EGTEA dataset [Li et al.(2018)Li, Liu, and Rehg] to obtain segmentation masks.
Next, we associate adjacent detections using a Kalman Filter-based tracker
[Bewley et al.(2016)Bewley, Ge, Ott, Ramos, and Upcroft]. However, since [Shan et al.(2020)Shan, Geng, Shu, and Fouhey] does not detect objects away from the hand, we extrapolate each object track one second before and after its detected span using a visual tracker [Li et al.(2019)Li, Wu, Wang, Zhang, Xing, and Yan], producing extended hand and object tracks. Finally, we construct a hand-object track by looking for pairs of hand and object tracks that have a spatial overlap between the hand mask and the object bounding box.

We find contact (and no-contact) moments by looking at the correlation between hand and object motion. First, we estimate optical flow between adjacent frames. Since we are interested in the relative movement of hands and objects against the background, we obtain a background motion-compensated optical flow field and its magnitude by homography estimation. Specifically, we sample flow vectors outside the detected bounding boxes as matches between frames and estimate the homography using RANSAC [Fischler and Bolles(1981)].
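A minimal sketch of this background-motion compensation step, assuming a dense flow field from an off-the-shelf estimator and using OpenCV's RANSAC homography fitting; the function and variable names are our own, not the released code's.

```python
import cv2
import numpy as np

def compensate_background_motion(flow, boxes, ransac_thresh=3.0, n_samples=2000):
    """Remove camera-induced motion from a dense flow field via a RANSAC homography.

    flow:  (H, W, 2) optical flow between adjacent frames
    boxes: list of (x1, y1, x2, y2) detected hand/object boxes to exclude
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    pts_src = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float32)
    pts_dst = pts_src + flow.reshape(-1, 2).astype(np.float32)

    # Sample correspondences outside all detected boxes (background only).
    bg = np.ones((h, w), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        bg[int(y1):int(y2), int(x1):int(x2)] = False
    idx = np.flatnonzero(bg.reshape(-1))
    idx = np.random.choice(idx, size=min(n_samples, idx.size), replace=False)

    # Robustly fit the homography induced by camera motion.
    H, _ = cv2.findHomography(pts_src[idx], pts_dst[idx], cv2.RANSAC, ransac_thresh)
    if H is None:  # estimation failed; fall back to the raw flow
        return flow, np.linalg.norm(flow, axis=-1)

    # Residual motion = observed flow minus the flow explained by camera motion.
    warped = cv2.perspectiveTransform(pts_src.reshape(-1, 1, 2), H).reshape(-1, 2)
    residual = (pts_dst - warped).reshape(h, w, 2)
    return residual, np.linalg.norm(residual, axis=-1)
```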
Let the foreground moving region be the set of pixels whose compensated flow magnitude is larger than a certain threshold. For the hand region mask and the object region mask, we calculate the ratio of the moving region within each region. We assign a contact label to a frame if both the hand's and the object's moving ratios are above certain thresholds. Similarly, we assign a no-contact label if the hand's moving ratio is above its threshold but the object's is below its threshold. However, this procedure may wrongly assign contact labels if the motion directions of the hand and the object are different (e.g., the object is handled by the other hand). Thus we also calculate the cosine similarity between the average motion vectors of the hand and object regions and assign a contact label only if it is above a threshold, otherwise a no-contact label. To deal with errors in flow estimation, we cancel the assignment if the background motion ratio (computed over the region other than the hand and the object) is above a threshold.

The above procedure assigns labels to hand-moving frames, but it does not assign labels when the hand is moving slowly or is still. To assign labels to those frames as well, we extend the assigned contact states while the relationship between the hand and the object does not change from the frame at which the pseudo-label was assigned (see Figure 2 right).
To track the hand-object distance, we find point trajectories from the hand and object regions that satisfy forward-backward consistency [Sundaram et al.(2010)Sundaram, Brox, and Keutzer]. We then calculate the distance between each hand-object point pair and compare the average distance in each frame. We extend the last contact state while the average distance stays within a certain range of its value at the starting frame. Figure 3 shows an example of the generated pseudo-labels.
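For illustration, a rough sketch of the frame-wise assignment rule and the extension step described above. All threshold names and default values are placeholders rather than the paper's tuned settings, and the point trajectories are assumed to be precomputed.

```python
import numpy as np

def assign_pseudo_label(comp_flow, magnitude, hand_mask, obj_mask,
                        move_thresh=1.0, ratio_thresh=0.5,
                        cos_thresh=0.5, bg_thresh=0.2):
    """Return 1 (contact), 0 (no contact), or None (no label) for one frame."""
    moving = magnitude > move_thresh                  # foreground moving region
    bg_mask = ~(hand_mask | obj_mask)

    r_hand = (moving & hand_mask).sum() / max(hand_mask.sum(), 1)
    r_obj = (moving & obj_mask).sum() / max(obj_mask.sum(), 1)
    r_bg = (moving & bg_mask).sum() / max(bg_mask.sum(), 1)

    if r_bg > bg_thresh:          # too much background motion: likely flow error
        return None
    if r_hand > ratio_thresh and r_obj > ratio_thresh:
        # Both move: contact only if they move in a similar direction.
        v_hand = comp_flow[hand_mask].mean(axis=0)
        v_obj = comp_flow[obj_mask].mean(axis=0)
        cos = np.dot(v_hand, v_obj) / (np.linalg.norm(v_hand) * np.linalg.norm(v_obj) + 1e-8)
        return 1 if cos > cos_thresh else 0
    if r_hand > ratio_thresh and r_obj <= ratio_thresh:
        return 0                  # hand moves alone while the object stays still
    return None

def extend_contact_state(hand_pts, obj_pts, start_t, start_label, rel_tol=0.1):
    """Propagate a label forward while the hand-object distance stays stable.

    hand_pts: (T, N, 2), obj_pts: (T, M, 2) forward-backward consistent trajectories.
    """
    def mean_pair_distance(t):
        d = np.linalg.norm(hand_pts[t, :, None, :] - obj_pts[t, None, :, :], axis=-1)
        return d.mean()

    labels = {start_t: start_label}
    d0 = mean_pair_distance(start_t)
    for t in range(start_t + 1, hand_pts.shape[0]):
        if abs(mean_pair_distance(t) - d0) > rel_tol * max(d0, 1.0):
            break                 # relative configuration changed; stop extending
        labels[t] = start_label
    return labels
```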
While the generated pseudo-labels contain useful information for determining contact states, they also contain errors induced by irregular motion patterns. The model may overfit to this noise if we simply train it on these labels. To extract reliable labels from the noisy pseudo-labels, we propose a semi-supervised procedure called guided progressive label correction (gPLC), which works with a small number of trusted labels (see Figure 5 for an overview).
We assume a noisy dataset with generated pseudo-labels and a trusted dataset with manually annotated labels. We train two identical networks, called the noisy model and the clean model. The noisy model is trained on both the noisy and trusted datasets, while the clean model is trained on the trusted dataset and a clean dataset introduced later. We perform label correction on the generated pseudo-labels in the noisy dataset using the predictions of both models.
As training proceeds, the noisy model starts to give high-confidence predictions on some samples. Similar to PLC [Zhang et al.(2021)Zhang, Zheng, Wu, Goswami, and Chen], we try to correct labels on which the noisy model gives high confidence. Note that we correct labels in a frame-wise manner, assuming the output contact probability is produced per frame. In gPLC, we correct a label only when the noisy model has high confidence and its prediction does not contradict the clean model's prediction. Because the pseudo-labels are generated from motion cues, their decision boundary may differ from that of the optimal classifier, so label correction by the noisy model alone would not converge to the desired decision boundary. Therefore, we guide the correction process with the clean model. Starting with a strict confidence threshold, we iteratively correct labels during training; when the number of corrected labels becomes small enough, we loosen the threshold and continue the same procedure. However, since the clean model is trained on small-scale data, it risks overfitting to the trusted dataset. To prevent this, we iteratively add samples that receive high-confidence predictions to another dataset, called the clean dataset, and feed them to the clean model so that its training data also grows over time. Initially the clean dataset contains no labels, but high-confidence labels are added over time. See Algorithm 1 for details. In our implementation, the noisy and clean models are pre-trained on the noisy and trusted datasets, respectively, before starting the gPLC iterations.

Contact states are predicted by an RNN-based model that takes RGB images, optical flow, and mask information as input (see Figure 5). For each modality, we crop the input to the union of the hand region and the object bounding box. The foreground mask is a four-channel binary mask encoding the presence of the target hand instance mask, the target object bounding box, other detected hand instance masks, and other detected object bounding boxes; the latter two channels prevent confusion when the target hand or object interacts with other entities. The RGB and flow images are fed into separate encoder branches, concatenated in the middle, and passed to another encoder. Both encoders consist of several convolutional blocks, each composed of a 3×3 convolution followed by a ReLU, a LayerNorm layer [Ba et al.(2016)Ba, Kiros, and Hinton], and a 2×2 max-pooling layer. The foreground mask encoder consists of three convolutional layers, each followed by a ReLU, producing a 1×1 feature map that encodes the positional relationship between the target hand, the target object, and the other hands and objects. After concatenating the features extracted from the foreground mask, the contact probability is computed through four bi-directional LSTM layers and a three-layer MLP.
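A simplified PyTorch sketch of this architecture follows. Channel widths, strides, the use of GroupNorm as a size-agnostic stand-in for LayerNorm, and the final pooling are our own choices for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 3x3 conv -> ReLU -> normalization -> 2x2 max-pooling.
    # GroupNorm with one group is used here as a size-agnostic stand-in for
    # LayerNorm over the feature map.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
                         nn.GroupNorm(1, out_ch), nn.MaxPool2d(2))

class ContactNet(nn.Module):
    """RGB + flow + foreground-mask encoders, BiLSTM, per-frame contact probability."""
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.rgb_enc = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
        self.flow_enc = nn.Sequential(conv_block(2, 32), conv_block(32, 64))
        self.joint_enc = nn.Sequential(conv_block(128, 128), conv_block(128, feat_dim),
                                       nn.AdaptiveAvgPool2d(1))
        # Four-channel foreground mask -> 1x1 positional feature.
        self.mask_enc = nn.Sequential(
            nn.Conv2d(4, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.lstm = nn.LSTM(feat_dim + 32, hidden, num_layers=4,
                            bidirectional=True, batch_first=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, rgb, flow, fg_mask):
        # rgb: (T, 3, H, W), flow: (T, 2, H, W), fg_mask: (T, 4, H, W)
        x = torch.cat([self.rgb_enc(rgb), self.flow_enc(flow)], dim=1)
        x = self.joint_enc(x).flatten(1)              # (T, feat_dim)
        m = self.mask_enc(fg_mask).flatten(1)         # (T, 32)
        seq = torch.cat([x, m], dim=1).unsqueeze(0)   # (1, T, feat_dim + 32)
        out, _ = self.lstm(seq)
        return torch.sigmoid(self.head(out)).squeeze(0).squeeze(-1)  # (T,) probabilities
```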
We train the network with a standard binary cross-entropy loss, weighted by the ratio of contact to no-contact labels in the training data. We do not propagate errors for unlabeled frames.
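To make the training procedure concrete, the following is a heavily simplified sketch of one gPLC correction round and the masked, weighted loss, assuming a ContactNet-style model as above. The correction criterion, thresholds, and the rule for growing the clean dataset are simplifications of Algorithm 1 rather than a faithful reproduction.

```python
import torch
import torch.nn.functional as F

def masked_weighted_bce(probs, labels, pos_weight=1.0):
    """Binary cross-entropy over labeled frames only (-1 marks unlabeled frames)."""
    mask = labels >= 0
    if not mask.any():
        return probs.sum() * 0.0
    w = torch.ones_like(labels[mask])
    w[labels[mask] > 0.5] = pos_weight   # re-weight by label ratio
    return F.binary_cross_entropy(probs[mask], labels[mask], weight=w)

def gplc_round(noisy_model, clean_model, noisy_set, clean_set,
               correct_thresh=0.95, add_thresh=0.95):
    """One round of label correction over the noisy (pseudo-labeled) set.

    Each sample is a dict with 'inputs' (rgb, flow, mask tensors) and per-frame
    'labels' (float tensor, 1 = contact, 0 = no contact, -1 = unlabeled).
    """
    n_corrected = 0
    for sample in noisy_set:
        with torch.no_grad():
            p_noisy = noisy_model(*sample['inputs'])   # (T,) contact probabilities
            p_clean = clean_model(*sample['inputs'])
        labels = sample['labels']
        labeled = labels >= 0
        pred = (p_noisy > 0.5).float()
        # Flip a pseudo-label only if the noisy model is confident it is wrong
        # and the clean model (trained on trusted data) does not contradict it.
        confident = torch.maximum(p_noisy, 1.0 - p_noisy) > correct_thresh
        agrees = (p_clean > 0.5).float() == pred
        flip = labeled & confident & agrees & (pred != labels)
        labels[flip] = pred[flip]
        n_corrected += int(flip.sum())
        # Grow the clean dataset with high-confidence frames so the clean model
        # does not overfit to the small trusted set (one reading of the paper).
        conf = torch.maximum(p_noisy, 1.0 - p_noisy) > add_thresh
        if conf.any():
            clean_set.append({'inputs': sample['inputs'],
                              'labels': torch.where(conf, pred,
                                                    torch.full_like(labels, -1.0))})
    # When n_corrected becomes small, loosen correct_thresh and run another round.
    return n_corrected
```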
Since there was no benchmark suitable for our task, we newly annotated hand-object tracks and contact states for videos in the EPIC-KITCHENS dataset [Damen et al.(2018)Damen, Doughty, Maria Farinella, Fidler, Furnari, Kazakos, Moltisanti, Munro, Perrett, Price, et al.]. We collected tracks with various objects (e.g., container, pan, knife, sink). The annotation amounts to 1,200 tracks (67,000 frames) in total, split into a training set (240 tracks), a validation set (260 tracks), and a test set (700 tracks). For the noisy dataset, we generated 96,000 tracks with motion-based pseudo-labels.
We used FlowNet2 [Ilg et al.(2017)Ilg, Mayer, Saikia, Keuper, Dosovitskiy, and Brox] for optical flow estimation. We used Adam [Kingma and Ba(2015)]
for optimization with a learning rate of 3e-4. We trained the network for 500,000 iterations with a batch size of one and selected the best model by frame accuracy on the validation set. The hyperparameter values were set empirically.

We prepared several metrics to evaluate performance. Frame Accuracy: frame-wise accuracy balanced by the ground-truth label ratio. Boundary Score: F-measure of boundary detection; we perform bipartite matching between ground-truth and predicted boundaries [Perazzi et al.(2016)Perazzi, Pont-Tuset, McWilliams, Van Gool, Gross, and Sorkine-Hornung] and count a predicted boundary as correct if it lies within six frames of a ground-truth boundary. Peripheral Accuracy: frame-wise accuracy within six frames of a ground-truth boundary. Edit Score: segmental metric using the Levenshtein distance between segments [Lea et al.(2016)Lea, Reiter, Vidal, and Hager], treating both contact and no-contact segments as foreground. Correct Track Ratio: the ratio of tracks with frame accuracy above 0.9 and a boundary score of 1.0.
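As one concrete reading of the frame-level metric, a minimal sketch of label-ratio-balanced frame accuracy; the boundary, edit, and track-level metrics involve boundary matching and edit distance and are omitted here.

```python
import numpy as np

def balanced_frame_accuracy(pred, gt):
    """Frame accuracy averaged over the two ground-truth classes (1 = contact)."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    per_class = [(pred[gt == c] == c).mean() for c in (0, 1) if (gt == c).any()]
    return float(np.mean(per_class))
```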
We evaluated several baseline methods: Fixed: always predicts "contact"; IoU: computes the mask IoU between the input hand mask and the object bounding box and predicts contact if the score is larger than zero, otherwise no-contact; ContactHands [Narasimhaswamy et al.(2020)Narasimhaswamy, Nguyen, and Hoai]: predicts contact if the detected hand's contact state is "object"; Shan-Contact [Shan et al.(2020)Shan, Geng, Shu, and Fouhey]: predicts contact if the corresponding hand's contact state prediction is "portable"; Shan-Bbox [Shan et al.(2020)Shan, Geng, Shu, and Fouhey]: predicts contact if there is enough overlap between the detected object bounding box and the input object bounding box; Shan-Full [Shan et al.(2020)Shan, Geng, Shu, and Fouhey]: combines the predictions of Shan-Contact and Shan-Bbox; Supervised: our proposed prediction model trained on trusted data alone. Note that for the Shan-* baselines we used the 100k+ego pre-trained model provided by the authors, which was trained on egocentric video datasets including EPIC-KITCHENS.
Method | Frame Acc. | Boundary Score | Peripheral Acc. | Edit Score | Correct Ratio |
---|---|---|---|---|---|
Fixed | 0.500 | 0.394 | 0.534 | 0.429 | 0.166 |
IoU | 0.642 | 0.505 | 0.613 | 0.678 | 0.259 |
ContactHands [Narasimhaswamy et al.(2020)Narasimhaswamy, Nguyen, and Hoai] | 0.555 | 0.440 | 0.596 | 0.468 | 0.136 |
Shan-Contact [Shan et al.(2020)Shan, Geng, Shu, and Fouhey] | 0.608 | 0.516 | 0.656 | 0.507 | 0.180 |
Shan-Bbox [Shan et al.(2020)Shan, Geng, Shu, and Fouhey] | 0.688 | 0.435 | 0.639 | 0.631 | 0.189 |
Shan-Full [Shan et al.(2020)Shan, Geng, Shu, and Fouhey] | 0.746 | 0.477 | 0.687 | 0.583 | 0.193 |
Supervised (train) | 0.770 | 0.563 | 0.649 | 0.718 | 0.394 |
Supervised (train+val) | 0.816 | 0.636 | 0.695 | 0.793 | 0.487 |
Proposed | 0.836 | 0.681 | 0.730 | 0.793 | 0.519 |
Method | Frame Acc. | Boundary Score | Peripheral Acc. | Edit Score | Correct Ratio |
---|---|---|---|---|---|
Noisy Label only | 0.780 | 0.569 | 0.703 | 0.687 | 0.344 |
Noisy + Trusted Label | 0.811 | 0.624 | 0.708 | 0.759 | 0.453 |
Noisy + Trusted Label w/ PLC [Zhang et al.(2021)Zhang, Zheng, Wu, Goswami, and Chen] | 0.821 | 0.636 | 0.730 | 0.768 | 0.480 |
Pseudo-Labeling [Lee et al.(2013)] | 0.784 | 0.590 | 0.703 | 0.737 | 0.417 |
RGB | 0.787 | 0.546 | 0.681 | 0.709 | 0.363 |
Flow | 0.833 | 0.672 | 0.725 | 0.789 | 0.519 |
Proposed (RGB+Flow) | 0.836 | 0.681 | 0.730 | 0.793 | 0.519 |
We report the performance in Table 1. Our proposed method consistently outperforms the baseline models on all metrics, achieving double the correct track ratio of IoU, which relies on the overlap between the hand and object bounding boxes. The frame-based methods (ContactHands, Shan-*) performed equal to or worse than IoU, producing many false-positive predictions. These results suggest that previous methods claiming contact state prediction fail to infer physical contact between a specific hand and object. While Supervised performed well, gPLC further boosted the performance, especially the boundary score, by leveraging diverse motion-based cues with label correction.
Figure 6 shows qualitative results. As shown in the top row, our method distinguishes contact and no-contact states by looking at the interaction between hands and objects, while the baseline methods yield false-positive predictions based on box overlaps. The middle row shows a typical no-contact case of a hand hovering above an object; our model, trained on motion-based pseudo-labels, avoids producing a false-positive prediction.
To show the effectiveness of the proposed gPLC, we report ablations against other robust-learning and semi-supervised learning methods (see Table 2, top). As expected, training on motion-based pseudo-labels alone performed worse due to labeling errors. Joint training on noisy and trusted labels gives a marginal gain over the supervised model, but the boundary score remains low since the model overfits to the pseudo-label noise. We also applied the existing label correction method [Zhang et al.(2021)Zhang, Zheng, Wu, Goswami, and Chen] to a single network with fine-tuning on trusted labels, but its performance was almost equal to joint training, suggesting that label correction with a single network does not yield good corrections. We also tried typical pseudo-labeling [Lee et al.(2013)] without motion-based labels; it showed only a marginal improvement over the supervised baseline, suggesting that our motion-based pseudo-labels are necessary for better generalization.

The bottom of Table 2 reports ablations over the input modalities. Using RGB images alone hurts the boundary score, suggesting the difficulty of detecting contact state changes without motion information. In contrast, the optical flow-based model achieved nearly the same performance as the full model, indicating that motion information is crucial for accurate prediction.
While our method better predicts contact states by utilizing the rich supervision from motion-based pseudo-labels, we observed several failure patterns. As shown in the bottom of Figure 6, our method often missed contacts when a person briefly touched an object without producing apparent object motion. We also observed failures due to unfamiliar grasps, complex in-hand motion, and failures in determining object regions (see the supplementary material for more results). These errors indicate a limitation of the motion-based pseudo-labels, which assign labels only when clear joint motion is observed. To better handle subtle or complex hand motions, additional supervision or rules for such patterns may be required.
Accuracy of noisy labels when initialized by corrupted ground-truth labels. Horizontal axis shows elapsed epochs (“0” denotes initial labels). Vertical axis shows frame accuracy (solid) and boundary score (dashed).
To understand the behavior of gPLC, we measured how gPLC corrects labels during training. We added the validation set to the training data with two patterns of initial labels: (i) labels randomly corrupted from the ground truth (with three different corruption ratios) and (ii) motion-based pseudo-labels. We trained the full model and measured the accuracy of the labels at every epoch.
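For reference, the label corruption used in this analysis could be simulated along these lines; this is a sketch, and the paper's exact corruption procedure is not specified here.

```python
import numpy as np

def corrupt_labels(labels, ratio, seed=0):
    """Randomly flip a fraction `ratio` of binary contact labels."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    flip = rng.random(labels.shape) < ratio
    labels[flip] = 1 - labels[flip]
    return labels
```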
First, gPLC succeeded in correcting randomly corrupted labels even with a high corruption ratio of 0.5 (see Figure 8). However, with a small corruption ratio of 0.1, gPLC made some wrong corrections, meaning that both the noisy model and the clean model predicted incorrectly. The improved boundary scores show that gPLC can iteratively suppress inconsistent boundary errors. In the more realistic case of motion-based pseudo-labels, pseudo-labels were assigned to around 44% of the total frames, with an initial mean frame accuracy of 91.4% on the labeled frames. While gPLC reduced the error rate by 20%, PLC wrongly flipped contact states, which may have harmed its final performance (see Figure 8). Overall, these results indicate that gPLC effectively corrects noisy labels during training.
We have presented a simple yet effective method for predicting the contact state between hands and objects. We have introduced a semi-supervised framework of motion-based pseudo-label generation and guided progressive label correction that corrects noisy pseudo-labels with the guidance of a small amount of trusted data. We have newly collected annotations for evaluation and have shown the effectiveness of our framework against several baseline methods.
This work was supported by JST AIP Acceleration Research Grant Number JPMJCR20U1, JSPS KAKENHI Grant Number JP20H04205 and JP21J11626, Japan. TY was supported by Masason Foundation. We are grateful for the insightful suggestions and feedback from the anonymous reviewers.