Reconstructing Hand-Object Interactions in the Wild

12/17/2020
by   Zhe Cao, et al.

In this work we explore reconstructing hand-object interactions in the wild. The core challenge of this problem is the lack of appropriate 3D labeled data. To overcome this issue, we propose an optimization-based procedure which does not require direct 3D supervision. The general strategy we adopt is to exploit all available related data (2D bounding boxes, 2D hand keypoints, 2D instance masks, 3D object models, 3D in-the-lab MoCap) to provide constraints for the 3D reconstruction. Rather than optimizing the hand and object individually, we optimize them jointly which allows us to impose additional constraints based on hand-object contact, collision, and occlusion. Our method produces compelling reconstructions on the challenging in-the-wild data from the EPIC Kitchens and the 100 Days of Hands datasets, across a range of object categories. Quantitatively, we demonstrate that our approach compares favorably to existing approaches in the lab settings where ground truth 3D annotations are available.


1 Introduction

Our hands are the primary way we interact with objects in the world. In turn, we designed our world with hands in mind. Therefore, understanding hand-object interactions is an important ingredient for building agents that perceive and act in the world. For example, it can allow them to learn object affordances [Gibson1977], infer human intents [Meltzoff1995], and learn manipulation skills from humans [Radosavovic2020, Mandikal2020].

What does it mean to understand hand-object interactions? We argue that fully capturing the richness of hand-object interactions requires 3D understanding. In general, recovering 3D from a single RGB image is an underconstrained problem. In the case of hand-object interactions, the problem is even more challenging due to heavy occlusions and varied viewpoints that arise naturally when hands are interacting with objects [Nakamura2017].

Figure 2: Availability of labeled data. Our community has made substantial progress in hand-object reconstruction. However, the gap between the in-the-lab datasets with 3D annotations (left) and the unlabeled in-the-wild data (right) remains considerable. This setting motivates our work.

Overall, there has already been considerable progress toward building better models for hand-object reconstruction [hasson19_obman, hampali2020honnotate, hasson20_handobjectconsist]. Due to the difficulty of obtaining annotations for this problem, the focus has largely been on synthetic and in-the-lab datasets. However, the gap between the in-the-lab and in-the-wild settings remains large (Figure 2). Thus, methods trained on existing datasets do not typically generalize to in-the-wild settings.

In this paper, we make a step toward the more challenging setting and explore reconstructing hand-object interactions in the wild. We argue that manual 3D annotation of in-the-wild data is challenging and may not even be necessary. The general strategy we adopt is to exploit all available related data (2D bounding boxes, 2D instance masks, 2D keypoints, 3D object models, and 3D in-the-lab MoCap) to provide constraints for 3D reconstruction. This general philosophy has proven successful for 2D recognition [Radosavovic2018] and human 3D reconstruction [Kanazawa2018a].

To tackle this setting, we employ a multi-step optimization procedure whose components build on prior work [Kulon_2020_CVPR, Sun2018, jiang2020mpshape, zhang2020phosa, GRAB:2020]. Our method consists of four steps: (1) optimizing the hand pose using 2D hand keypoints; (2) optimizing the object pose using differentiable rendering; (3) jointly optimizing the hand-object arrangement in 3D; and (4) refining the pose estimation results using a refinement network that captures 3D contact priors.

Our method produces compelling reconstructions on the challenging in-the-wild data from the EPIC Kitchens [Damen2018] and the 100 Days of Hands datasets [Shan2020], across a range of object categories (Figure 1). Quantitatively, our approach compares favorably to existing approaches in the lab settings where ground truth 3D annotations are available.

In addition to the supervision already embodied in the collection of the images we use, our pipeline has an additional step which requires human supervision, namely, the selection of the 3D CAD model of the object being manipulated. This extra labelling step is a practical constraint for creating large datasets of hand-object interactions using our techniques, but we expect advances in recognition of 3D CAD models from 2D images to help automate this step.

In summary, our key contributions are: (1) we present an optimization-based procedure that leverages all available related data for reconstructing hand-object interactions in the wild; (2) we show promising results on challenging real-world images; (3) we curate a set of images with corresponding instance categories and 3D CAD models.

2 Related Work

3D hand pose estimation. Many previous works on hand pose estimation directly predict 3D joint locations from either depth [sharp2015accurate, sridhar2013interactive, tagliasacchi2015robust, tzionas2016capturing, ye2016spatial, yuan2018depth, moon2018v2v] or RGB [romero2010hands, mueller2018ganerated, cai2018weakly, yang2019disentangling, zimmermann2017learning] images. While this representation of sparse hand joints is effective for modeling hand action and motion, it does not model the hand shape or hand-object contacts. In contrast, some previous works predict 3D hand joint rotations and shape parameters for a 3D parametric hand model such as MANO [romero2017embodied]. Fitting-based approaches [SMPL-X:2019, xiang2019monocular, Kulon_2020_CVPR] fit such parametric models to 2D keypoint detections to optimize the 3D hand parameters. Learning-based approaches [zhou2020monocular, rong2020frankmocap] use deep networks to directly predict the MANO hand parameters from an RGB image. More recently, [Kulon2019, Kulon_2020_CVPR] use mesh convolutions to directly predict the 3D hand mesh. We use a learning-based method [rong2020frankmocap] to obtain an initial hand pose estimate and further improve the result by imposing constraints on 2D hand keypoints and 3D hand-object contact priors.

3D object pose estimation. There are many existing works on estimating 3D object pose from a single image. Some approaches [factored3dTulsiani17, kundu20183d, Gkioxari2019, kuo2020mask2cad] use neural networks to predict the object shape, translation, and global orientation in camera coordinates. These methods are trained on a limited set of object categories and have difficulty generalizing to new objects. Other approaches [lim2013parsing, Michel2017, Xiang2018, zhang2020phosa, Sun2018] assume a known object CAD model and focus on 6DOF object pose prediction. In this work, we take a fitting-based approach similar to [Kanazawa2018b, Sun2018, Sahasrabudhe2019, zhang2020phosa] and further refine the object pose based on its arrangement with the hand.

Figure 3: Reconstruction procedure. To reconstruct hand-object interactions in the wild, we leverage all available related data (2D keypoints, 2D instance masks, 3D object models, 3D in-the-lab MoCap) through an optimization-based procedure that consists of four steps: (a) hand pose estimation by 2D keypoints fitting, (b) object pose estimation via differentiable rendering, (c) joint optimization for spatial arrangement, and (d) pose refinement using 3D contact priors.

3D hand and object pose estimation. Early approaches to modeling the hand and object require multi-view images [oikonomidis2011full] or RGB-D sequences [tzionas2016capturing] as input. Recently, [hasson19_obman] proposed a deep model to reconstruct hand and object meshes from a monocular RGB image; to train the model with 3D supervision, they collect a synthetic dataset from simulation. [tekin2019h] designs a neural network to jointly predict the 3D hand pose and 3D object bounding box, with a focus on egocentric scenarios. Given an object CAD model as input, [hasson20_handobjectconsist] leverages photometric consistency across temporal frames as an additional signal for training with a sparse set of annotated data. All of these approaches were trained and tested on in-the-lab or synthetic datasets. In contrast, our method does not require direct 3D supervision and can be applied to in-the-wild images.

3D hand-object datasets. Early datasets of hand grasping require manual annotation [sridhar2016real] or depth tracking [tzionas2016capturing] to obtain ground truth, resulting in limited dataset size. To avoid this manual effort, [hernando2018cvpr] uses a motion capture system with magnetic sensors to collect hand pose annotations for RGB-D video sequences. [hasson19_obman] introduces a synthetic dataset of hands grasping objects collected from simulation. [zimmermann2019freihand, hampali2020honnotate] introduce large-scale datasets with 3D annotations optimized from multi-view setups. Some recent datasets [Brahmbhatt_2019_CVPR, Brahmbhatt_2020_ECCV, GRAB:2020] also provide annotations for the hand-object contact area in addition to the 3D hand and object poses. [Brahmbhatt_2020_ECCV] collects large-scale RGB-D image sequences of hand-object interaction that are automatically annotated using multi-view setups and a thermal sensor. [GRAB:2020] captures whole-body human grasping of objects with accurate 3D annotations using a marker-based MoCap system. Due to the requirement of specific camera setups, all of these datasets are collected in lab settings. In this work, we utilize 3D hand-object contact priors learned from these datasets to further refine our reconstructions on in-the-wild images.

Hand-object data in the wild. Datasets like EPIC [Damen2018], 100 Days of Hands [Shan2020], Something-Something [Goyal2017], Something-Else [Materzynska2020], and MECCANO [ragusa2020meccano] contain natural hand-object interactions in different scenarios. However, they do not provide 3D annotations. The goal of our work is to reconstruct hand-object interactions in such scenarios without requiring direct 3D supervision.

Human-object interactions. A number of previous works study different aspects of human-object interactions in 2D. Examples include detection [Gkioxari2018, Gupta2019b], video recognition [Xiao2019], interaction hotspots [interaction-hotspots], and affordances [ego-topo]. Recently, Zhang et al. [zhang2020phosa] propose a method to infer the 3D spatial arrangement of humans and objects. In this work, we focus on understanding hand-object interactions in 3D.

Figure 4: Intermediate results. Top row: input images. 2nd row: results from individually optimizing hand and object. 3rd row: results from joint optimization (two viewpoints per example). Bottom row: results after the refinement.

3 Reconstructing Hand-Object Interactions

In this section, we describe our optimization-based procedure for reconstructing hand-object interactions from a single RGB image. Our pipeline, shown in Figure 3, involves four steps: we first optimize the object pose and hand pose individually, then we optimize their 3D arrangement jointly, and finally we refine the pose estimation result based on 3D contact priors. We show the intermediate results from each step in Figure 4.

We motivate and describe our modular design for each stage in the rest of the section. We note that while our method can be applied to multiple hands, for brevity, we describe the optimization procedure for a single hand-object pair in the rest of the section.

3.1 Hand Pose Estimation

The first step of our pipeline involves hand pose estimation (Figure 3a). We represent the hand using a parametric model defined by MANO [romero2017embodied]:

$V_h = \mathcal{M}(\theta, \beta)$   (1)

where $\theta$ are the hand pose parameters and $\beta$ are the shape parameters of the hand model.

Given a single RGB image, we use [Shan2020] to detect the 2D hand bounding box. Taking the cropped hand patch, we use FrankMocap [rong2020frankmocap] to estimate a weak-perspective camera $(s, t_x, t_y)$ and initial 3D hand parameters $\theta$ and $\beta$. We further optimize the hand pose with a fitting procedure that uses 2D hand keypoints from OpenPose [8765346, simon2017hand].

The hand pose optimization objective is to minimize the difference between the detected 2D keypoints and the 2D projection of the 3D hand keypoints [Kulon_2020_CVPR, Bogo2016]:

$E_{hand} = \sum_i \lVert x_i - \Pi(J_i(\theta, \beta)) \rVert_2^2 + \lambda_{\beta} \lVert \beta \rVert_2^2$   (2)

consisting of a 2D keypoint distance term and a regularization term on the hand shape $\beta$, where $x_i$ denotes the $i$-th detected 2D keypoint, $J_i(\theta, \beta)$ the corresponding 3D hand joint, and $\Pi$ the camera projection.

We convert the weak-perspective camera to a perspective camera by assuming a fixed focal length $f$. The depth of the hand is approximated by the focal length divided by the camera scale, $t_z = f / s$. The final hand vertices are obtained by:

$V_{hand} = \mathcal{M}(\theta, \beta) + [t_x, t_y, f / s]$   (3)
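For concreteness, the following is a minimal PyTorch sketch of this fitting step (Eq. 2) together with the weak-to-perspective conversion (Eq. 3). Here mano_layer stands in for a MANO forward pass returning vertices and joints, keypoints_2d and conf are the OpenPose detections and their confidences, (s, tx, ty) is the FrankMocap weak-perspective camera, and the focal length, learning rate, and iteration count are illustrative assumptions rather than our exact settings.

import torch

def project(points_3d, f, cx, cy):
    # Perspective projection with fixed focal length f and principal point (cx, cy).
    x, y, z = points_3d[..., 0], points_3d[..., 1], points_3d[..., 2]
    return torch.stack([f * x / z + cx, f * y / z + cy], dim=-1)

def fit_hand(mano_layer, theta, beta, keypoints_2d, conf, s, tx, ty,
             f=1000.0, cx=0.0, cy=0.0, lambda_beta=1e-3, iters=200):
    # mano_layer(theta, beta) -> (vertices (778, 3), joints (21, 3)) is assumed.
    theta = theta.clone().requires_grad_(True)
    beta = beta.clone().requires_grad_(True)
    # Approximate the hand depth from the weak-perspective scale: t_z = f / s (Eq. 3).
    trans = torch.tensor([tx, ty, f / s])
    opt = torch.optim.Adam([theta, beta], lr=1e-2)
    for _ in range(iters):
        verts, joints = mano_layer(theta, beta)
        proj = project(joints + trans, f, cx, cy)             # 2D projection of the 3D joints
        loss_kp = (conf[:, None] * (proj - keypoints_2d) ** 2).sum()
        loss = loss_kp + lambda_beta * (beta ** 2).sum()      # Eq. 2 (confidence-weighted, see Sec. 3.5)
        opt.zero_grad(); loss.backward(); opt.step()
    return theta.detach(), beta.detach(), trans               # V_hand = M(theta, beta) + trans (Eq. 3)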

3.2 Object Pose Estimation

In the next step of our method (Figure 3b), we recover the object pose using a differentiable neural mesh renderer [kato2018renderer], similar to [Kanazawa2018b, Sahasrabudhe2019, zhang2020phosa]. Given an input image, we first estimate the 2D instance masks using PointRend [kirillov2020pointrend]. We further estimate the object scale $s_o$, 3D rotation $R_o$, and translation $t_o$ by fitting the corresponding 3D object model to the 2D object mask $M_o$. The object parameters are then optimized by minimizing the difference between the rendered object mask and the estimated object mask $M_o$:

$E_{object} = \lVert \mathcal{R}(s_o V_{cad} R_o^{\top} + t_o) - M_o \rVert_2^2$   (4)

where $\mathcal{R}(\cdot)$ denotes the differentiable operation that renders a 3D mesh into a 2D mask. The final object vertices can be obtained by:

$V_{object} = s_o V_{cad} R_o^{\top} + t_o$   (5)

The optimization is under-constrained: multiple object poses can produce 2D masks similar to the ground truth. We therefore randomly initialize the object rotation multiple times and select the result with the lowest loss after optimization. We note that more sophisticated objectives are possible. For example, [zhang2020phosa] uses a silhouette loss with a distance transform for optimizing the object pose. In our experiments, we find that a simple mask difference loss achieves better results on smaller objects, such as pens, spoons, and forks.
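To illustrate this step, the sketch below fits the object scale, rotation, and translation by minimizing the squared mask difference under several random rotation initializations and keeps the lowest-loss result. The render_mask callable is a placeholder for the differentiable silhouette renderer, and the number of initializations, learning rate, and iteration count are illustrative choices, not our exact configuration.

import torch

def axis_angle_to_matrix(r):
    # Rodrigues' formula, built with torch.stack so that gradients flow through r.
    angle = r.norm() + 1e-8
    axis = r / angle
    zero = torch.zeros((), dtype=r.dtype)
    K = torch.stack([
        torch.stack([zero, -axis[2], axis[1]]),
        torch.stack([axis[2], zero, -axis[0]]),
        torch.stack([-axis[1], axis[0], zero]),
    ])
    return torch.eye(3) + torch.sin(angle) * K + (1 - torch.cos(angle)) * (K @ K)

def fit_object(verts_cad, target_mask, render_mask, n_inits=8, iters=100):
    best = None
    for _ in range(n_inits):
        r = ((torch.rand(3) * 2 - 1) * 3.14159).requires_grad_(True)  # random rotation initialization
        log_s = torch.zeros(1, requires_grad=True)                    # scale, optimized in log space
        t = torch.zeros(3, requires_grad=True)                        # zero-initialized translation (Sec. 3.5)
        opt = torch.optim.Adam([r, log_s, t], lr=1e-2)
        for _ in range(iters):
            R = axis_angle_to_matrix(r)
            verts = torch.exp(log_s) * verts_cad @ R.T + t            # Eq. 5
            loss = ((render_mask(verts) - target_mask) ** 2).mean()   # Eq. 4
            opt.zero_grad(); loss.backward(); opt.step()
        if best is None or loss.item() < best[0]:                     # keep the lowest-loss hypothesis
            best = (loss.item(), axis_angle_to_matrix(r).detach(), torch.exp(log_s).item(), t.detach())
    return best                                                       # (loss, rotation, scale, translation)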

3.3 Joint Optimization

So far we described how we optimize hand pose and object pose individually. To obtain the final reconstruction, we optimize their 3D arrangement jointly.

Naively putting the hand and object together may produce implausible results (Figure 4, row 2): the hand and object may be far apart in 3D or interpenetrate each other. Obtaining a coherent 3D arrangement of the hand and object in one shot is challenging due to depth and scale ambiguity given only 2D input. For example, a large object far from the camera can produce the same 2D rendering as a smaller object closer to the camera.

To resolve this ambiguity, we perform one more step that optimizes the hand and object jointly (Figure 3c). The scale of the human hand varies much less than the object scale and can thus serve as a reference to constrain the object scale. Moreover, by optimizing the hand and object together, we can impose additional constraints based on their depth ordering, hand-object contact, and interpenetration.

Here, we first define the overall objective for jointly optimizing the hand and the object:

$E_{joint} = \lambda_{depth} L_{depth} + \lambda_{inter} L_{inter} + \lambda_{pen} L_{pen}$   (6)

In the following, we describe each loss term in detail. We note that including additional constraints to encode more domain knowledge is possible. However, we find these simple constraints to work reasonably well in practice.
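A minimal sketch of how such a weighted objective can be assembled and minimized is shown below; the individual loss terms referenced here are defined in the remainder of this subsection, and the parameter list, weights, and optimizer settings are illustrative rather than our exact configuration.

import torch

def joint_optimize(params, loss_fns, weights, iters=200, lr=1e-2):
    # params:   tensors optimized jointly (e.g., object rotation, scale, translation,
    #           and the hand translation), each created with requires_grad=True.
    # loss_fns: dict mapping a name to a callable that returns a scalar loss
    #           for the current hand and object meshes.
    # weights:  dict mapping the same names to scalar weights (the lambdas in Eq. 6).
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(iters):
        total = sum(weights[name] * fn() for name, fn in loss_fns.items())  # Eq. 6
        opt.zero_grad()
        total.backward()
        opt.step()
    return params

# Illustrative usage:
# joint_optimize([r_obj, log_s_obj, t_obj, t_hand],
#                {"depth": depth_term, "inter": inter_term, "pen": pen_term},
#                {"depth": 1.0, "inter": 1.0, "pen": 1.0})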

Figure 5: Qualitative results, EPIC. Results of our reconstruction procedure on images from the EPIC Kitchens [Damen2020] dataset. Overall, the reconstructions have reasonably high quality, though as expected there are still imperfections (bottle example).

Ordinal depth loss. One common error in the reconstruction is a wrong depth ordering of the hand and object, e.g., a bottle lying behind the human hand, resulting in an infeasible grasp. To encourage correct depth ordering, we add the ordinal depth loss term introduced by Jiang et al. [jiang2020mpshape]. With the instance mask predictions from [kirillov2020pointrend], we first obtain a joint segmentation mask $S$ for the hand and object, where $S(p)$ stores the category label of pixel $p$: background, hand, or object. We also obtain a rendered mask $\hat{S}$ by rendering the 3D hand and object into 2D, along with rendered depth maps $D_h$ and $D_o$ for the hand and the object, respectively. Finally, we calculate the ordinal depth loss over all pixels with a wrong depth ordering:

$L_{depth} = \sum_{p \in \mathcal{P}} \log\left(1 + \exp\big(D_{S(p)}(p) - D_{\hat{S}(p)}(p)\big)\right)$   (7)

where $\mathcal{P}$ denotes the set of pixels whose category label differs between the segmentation mask from PointRend and the rendered mask, and $D_{S(p)}$ and $D_{\hat{S}(p)}$ are the rendered depth maps of the categories given by the segmentation mask and the rendered mask at pixel $p$.
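A sketch of this loss, following the formulation of [jiang2020mpshape], is given below; it assumes integer label maps with 0 for background, 1 for hand, and 2 for object, which is an illustrative encoding rather than our exact one.

import torch

def ordinal_depth_loss(seg_mask, rend_mask, depth_hand, depth_obj):
    # seg_mask:  (H, W) labels from PointRend (0 = background, 1 = hand, 2 = object).
    # rend_mask: (H, W) labels obtained by rendering the current 3D hand and object.
    # depth_hand, depth_obj: (H, W) rendered depth maps of the hand and the object.
    loss = depth_hand.new_zeros(())
    # Segmentation says hand but the rendering shows the object: the object is wrongly
    # rendered in front, so penalize depth_hand > depth_obj at those pixels.
    wrong = (seg_mask == 1) & (rend_mask == 2)
    loss = loss + torch.log1p(torch.exp(depth_hand[wrong] - depth_obj[wrong])).sum()
    # Symmetric case: segmentation says object but the rendering shows the hand.
    wrong = (seg_mask == 2) & (rend_mask == 1)
    loss = loss + torch.log1p(torch.exp(depth_obj[wrong] - depth_hand[wrong])).sum()
    return loss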

Interaction loss. A reconstructed hand and object with correct depth ordering can still be far apart in 3D space; however, when the hand is interacting with an object, their distance should be small. To pull the interacting pair closer in 3D, we define an interaction loss based on their distance:

$L_{inter} = \lVert C(V_{hand}) - C(V_{object}) \rVert_2$   (8)

where $C(\cdot)$ denotes the operation that averages all vertices of a mesh, i.e., its 3D center location, similar to [zhang2020phosa]. We find this simple loss term helpful in correcting the object scale by moving the object closer to the hand.
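In code, this term reduces to the distance between the two mesh centroids (a minimal sketch):

import torch

def interaction_loss(hand_verts, obj_verts):
    # Eq. 8: distance between the 3D centers (vertex averages) of the hand and object meshes.
    return torch.norm(hand_verts.mean(dim=0) - obj_verts.mean(dim=0))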

Interpenetration loss. Using the interaction loss alone may result in implausible artifacts, e.g., the hand colliding with the object. To resolve this issue, we add an interpenetration loss term that penalizes object vertices lying inside the hand mesh. We use the Signed Distance Field (SDF) of the hand mesh to check whether any object vertex is inside the hand: we first compute a tight box around the hand and voxelize it into a 3D grid that stores the SDF values. Following [jiang2020mpshape], we use a modified SDF function for the hand mesh:

$\tilde{\phi}(x) = \max(-\mathrm{SDF}(x), 0)$   (9)

For each voxel cell in the 3D grid, $\tilde{\phi}$ takes positive values proportional to the distance from the hand surface if the cell is inside the hand mesh, and is 0 outside of the hand mesh. The interpenetration loss can then be calculated as:

$L_{pen} = \sum_{v \in V_{object}} \tilde{\phi}(v)$   (10)

where $\tilde{\phi}(v)$ samples the SDF value of each object vertex from the 3D hand grid in a differentiable way.
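The sketch below evaluates Eq. 10 by sampling a precomputed SDF grid of the hand with trilinear interpolation via torch.nn.functional.grid_sample; the grid layout and bounding-box parameterization are assumptions about one possible implementation, not the CUDA kernel of [jiang2020mpshape] that we actually use.

import torch
import torch.nn.functional as F

def interpenetration_loss(sdf_grid, obj_verts, box_min, box_max):
    # sdf_grid: (D, H, W) modified SDF of the hand (Eq. 9), positive inside the hand and 0 outside,
    #           laid out so that W indexes x, H indexes y, and D indexes z.
    # obj_verts: (N, 3) object vertices; box_min, box_max: (3,) corners of the hand bounding box.
    norm = 2.0 * (obj_verts - box_min) / (box_max - box_min) - 1.0   # normalize vertices to [-1, 1]
    grid = norm[None, None, None, :, :]                              # (1, 1, 1, N, 3), (x, y, z) order
    vol = sdf_grid[None, None]                                       # (1, 1, D, H, W)
    phi = F.grid_sample(vol, grid, align_corners=True)               # trilinear sampling -> (1, 1, 1, 1, N)
    return phi.clamp(min=0).sum()                                    # Eq. 10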

Figure 6: Object category mismatch. We show example EPIC Kitchens [Damen2018] images with predictions from a PointRend [kirillov2020pointrend] model trained on COCO [Lin2014]. We note that while the category labels are incorrect, the instance segmentation masks are very reasonable. We leverage such masks for reconstructing object categories that are not in COCO.

3.4 Pose Refinement

A physically plausible hand-object reconstruction should not only be collision-free, but also have enough hand-object contact area to support the action. However, the interaction loss described in Section 3.3 does not take into account the fine-grained hand-object contact. To further refine the 3D reconstruction, we impose constraints on the hand-object contact as the final step of our procedure (Figure 3d).

Addressing this issue would be easy if we had per-vertex contact annotations for both the hand and object, as we could simply pull the annotated contact regions closer together. However, obtaining such annotations for a large collection of in-the-wild images is challenging. As a more practical solution, we learn 3D contact priors from a large-scale hand MoCap dataset [GRAB:2020]. These priors capture the regions of an object that a person is likely to touch or grab; for example, a person is more likely to hold a mug by its handle.

Given the hand mesh and object mesh obtained from the joint optimization, we want to update the hand pose so that it has more reasonable contact with the object. Inspired by RefineNet from [GRAB:2020], we train a small network to perform hand pose refinement.

The input to the network consists of the initial hand parameters and the distance field from the hand vertices to the object vertices. For each hand vertex $v_i^h$, we compute the distance to its nearest object vertex:

$d_i = \min_j \lVert v_i^h - v_j^o \rVert_2$   (11)

The network then refines the hand parameters in an iterative fashion: after each iteration, the distance field between the hand and object is updated and used as input to the next step. The training data is obtained by perturbing ground-truth hand pose parameters and translations to simulate noisy input estimates. As shown in Figure 4, the results after refinement (4th row) reconstruct more realistic interactions between the hand and object than the previous step (3rd row).
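The distance field of Eq. 11 and the iterative refinement loop can be sketched as follows; refine_net is a placeholder for the small RefineNet-style network described above, and its exact inputs and outputs are assumptions for illustration.

import torch

def hand_object_distance_field(hand_verts, obj_verts):
    # Eq. 11: for each hand vertex, the distance to its nearest object vertex.
    d = torch.cdist(hand_verts, obj_verts)        # (778, N_obj) pairwise distances
    return d.min(dim=1).values                    # (778,)

def refine_hand(refine_net, mano_layer, theta, beta, trans, obj_verts, n_iters=3):
    # Iteratively refine the hand pose conditioned on the current hand-object distance field.
    for _ in range(n_iters):
        hand_verts, _ = mano_layer(theta, beta)
        dists = hand_object_distance_field(hand_verts + trans, obj_verts)
        theta, trans = refine_net(theta, trans, dists)   # predicts updated hand pose and translation
    return theta, trans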

3.5 Implementation Details

To make the optimization more efficient, we preprocess our object CAD models to have 800 faces. When optimizing the object pose, we use multiple initializations for the object rotation, sampled uniformly over the space of 3D rotations; the object translation is initialized to zero. We select the object pose with the lowest optimization loss. When optimizing the hand pose, we weight the 2D keypoint loss from Eq. 2 by the confidence score of each predicted hand joint.

For the joint optimization of hand and object, we use a fixed set of default loss weights. We note that in some cases the weights need to be adjusted further, for example, increasing the interaction loss weight to bring the object closer to the hand or increasing the interpenetration loss weight to reduce collisions. When computing the interpenetration loss with the SDF, we adopt the CUDA implementation of SDF from [jiang2020mpshape] and voxelize the hand mesh into a 3D grid.

For the refinement network, we use the object vertices to compute the distance field from the hand to the object and perform 3 refinement iterations. In practice, it takes 1 minute to run our method on an image using an RTX 2080 Ti GPU, including 18 seconds for optimizing the object pose and 30 seconds for the joint optimization.

object | COCO categories
spatula | fork, toothbrush, scissors, spoon, chair
kitchen knife | knife, scissors, tennis racket, bird
wooden spoon | knife, spoon
plate | bowl, frisbee, toy, remote, mouse, toilet
cereal box | book, cake
frying pan | bowl, spoon
screwdriver | scissors, knife, cell phone
rubiks cube | book, keyboard
Table 1: Category correspondences. We show example mappings between the true object categories (observed in an image) and the sets of predicted COCO categories.
Figure 7: Qualitative results, 100DOH. Results of our reconstruction procedure on images from the 100 Days of Hands [Shan2020] dataset. Our method produces reconstructions of reasonably high-quality across a range of viewpoints, activities, and objects.

4 Data collection

In the previous section, we described our reconstruction procedure assuming access to unlabeled in-the-wild images, 2D instance masks, 3D object models, and object-mask correspondences. Here, we discuss our procedure for acquiring each one in turn.

In-the-wild images. As a source of unlabeled in-the-wild data, we use frames from the EPIC Kitchens [Damen2018, Damen2020] and the 100 Days of Hands [Shan2020] datasets, noting that we do not exploit any temporal information. These datasets contain a range of interesting hand-object interaction scenarios with varied objects and viewpoints (both first- and third-person).

3D object models. Our procedure for collecting 3D object models is semi-automatic and relies on a human in the loop. In particular, we started with a set of 3D object models we expected to see in these datasets (e.g., forks, spoons, plates) and kept growing the set over time as we observed new objects (e.g., microphone, motor oil, wood block). The two primary object sources we used are the YCB dataset [calli2015ycb] from the robotics community and the Free3D online platform. We collected 78 object models with both within- and across-category variation (and their corresponding COCO candidate categories, discussed next). We note, however, that a number of objects are still outside the scope of our method (see Figure 8 for discussion).

2D instance masks. Modern 2D recognition models trained on large labeled datasets are now accurate enough to be applied to real-world data and yield reasonable predictions [Radosavovic2018]. However, in our case we require instance masks for a range of object categories that are not present in the available labeled datasets (e.g., the COCO categories). Thus, we cannot expect the available models to categorize the objects correctly in our setting. We observe, however, that even when the predicted categories are incorrect, the instance masks are still quite reasonable for a variety of objects. For example, the models do not know what a spatula is called but are able to segment it (Figure 6). Consequently, we can still leverage the available models to obtain 2D instance masks for our use case. Specifically, we use a PointRend model [kirillov2020pointrend] trained on COCO [Lin2014].

Object-mask correspondences. Given an input image, we first manually select the appropriate 3D model for the object of interest. Next, we need to determine the corresponding 2D instance mask to use for reconstruction. One way to do this would be to perform exhaustive matching against all detected instances (typically up to 100) and select the one that results in the lowest loss. Alternatively, we could manually inspect and select the appropriate instance mask. In practice, we observe that each object typically gets recognized under a small set of COCO categories. Thus, we establish correspondences between the 3D object models and sets of candidate COCO categories (see Table 1 for examples). At reconstruction time, we use the precomputed mappings to select the candidate object instances to use for reconstruction. In cases where the object of interest is not detected at all, the reconstruction fails. We expect that this step can be improved further as models trained on larger-vocabulary datasets [Gupta2019a] become available.
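Concretely, the precomputed mapping acts as a simple filter over the detector output, as in the sketch below (the detection format and names are illustrative):

def candidate_masks(detections, object_name, category_map):
    # detections:   list of dicts from PointRend, e.g. {"category": "fork", "mask": ..., "score": ...}.
    # category_map: mapping from a 3D model name to its set of candidate COCO categories (Table 1),
    #               e.g. {"spatula": {"fork", "toothbrush", "scissors", "spoon", "chair"}, ...}.
    allowed = category_map[object_name]
    return [d["mask"] for d in detections if d["category"] in allowed]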

5 Experiments

In this section, we evaluate our method in two settings: in-the-wild and in-the-lab. First, we show qualitative results in the wild. Next, we perform quantitative comparisons to existing approaches in the lab settings where ground truth 3D annotations are available. Finally, we present ablation studies and discuss the failure modes of our approach.

5.1 Results in the wild

We consider two datasets as the source of in-the-wild data for hand-object reconstruction: EPIC Kitchens [Damen2018] and 100 Days of Hands [Shan2020]. Due to the lack of appropriate ground truth annotations, quantitative evaluation in this setting is challenging. We present qualitative results here and supplement them with quantitative evaluation in the lab settings where ground truth annotations are available (next subsection). We try to showcase examples that work fairly well, have minor imperfections, and represent clear failure cases. We show reconstructions for images from the two aforementioned datasets in Figure 5, Figure 7, and Figure 9. Overall, our approach produces promising reconstructions across a variety of viewpoints, activities, and objects.

5.2 Results in the lab

Dataset. For quantitative evaluation of our method, we use the HO3D dataset [hampali2020honnotate], which contains video sequences of hands interacting with everyday objects. All images have 3D annotations for both the hand and the object. The dataset consists of 77,558 frames, 68 sequences, 10 persons, and 10 objects. The object CAD models are from the YCB dataset [calli2015ycb]. Although our method does not require extra training, to allow a fair comparison to existing work that requires 3D supervision, we split the dataset into training and testing sets based on object categories. We use 3 categories, i.e., banana, cracker box, and cleanser bottle, for evaluation. We do not use the official testing set from HO3D [hampali2020honnotate] because it does not include evaluation of object pose. In all of the experiments, we use the default optimization hyperparameters for our method (Section 3.5).

Metric | [hasson20_handobjectconsist]* | [hasson20_handobjectconsist]† | Ours
Hand joint MAE (mm) | 8.9 | 18.8 | 9.5
Hand verts MAE (mm) | 9.0 | 18.3 | 9.8
F@15mm | 63.8 | 30.1 | 52.3
F@30mm | 83.6 | 59.7 | 78.6
Chamfer distance | 18.4 | 29.3 | 20.5
Table 2: Quantitative comparison on the HO3D dataset. [hasson20_handobjectconsist]* denotes the authors' released model (some images from our testing set were used for its training due to the different split). [hasson20_handobjectconsist]† denotes the model retrained by us with the new split. MAE is the mean average error in mm. F@15mm denotes the F-score at a distance threshold of 15 mm.
Metric | Bottle | Box | Banana
Hand joint MAE (mm) | 9.0 | 9.8 | 10.1
Hand verts MAE (mm) | 9.1 | 9.8 | 10.9
F@15mm | 54.8 | 49.6 | 54.7
F@30mm | 79.1 | 82.9 | 74.1
Chamfer distance | 19.8 | 19.9 | 21.6
Table 3: Consistent performance across categories.

Evaluation metric. For evaluating hand pose, we report the mean average error (MAE) over 21 hand joints and 778 hand mesh vertices respectively. The error measures the Euclidean distance between predicted and ground truth hand joints/vertices. Following [hampali2020honnotate], we calculate the error after aligning the position of the hand root joint and global scale with the ground-truth.
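One plausible implementation of this root-aligned, scale-normalized error is sketched below; the particular scale alignment (ratio of mean distances to the root) is our assumption for illustration and may differ from the official HO3D evaluation.

import torch

def aligned_mae(pred, gt, root_idx=0):
    # pred, gt: (K, 3) predicted and ground-truth hand joints (or vertices), in mm.
    pred = pred - pred[root_idx]                                          # align the root joint
    gt = gt - gt[root_idx]
    scale = gt.norm(dim=-1).mean() / (pred.norm(dim=-1).mean() + 1e-8)    # align the global scale
    return (scale * pred - gt).norm(dim=-1).mean()                        # mean Euclidean error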

For evaluating the object pose, we calculate the Chamfer distance between the ground-truth object vertices and the predicted object vertices (obtained by rotating the input CAD model with the predicted object pose). We remove the global translation of the object for this evaluation. We also report the F-score [manning2008introduction] between the predicted object vertices and the ground truth at different distance thresholds $d$. For two point clouds, the F-score is calculated as the harmonic mean of recall (the fraction of ground-truth points within $d$ of a predicted point) and precision (the fraction of predicted points within $d$ of a ground-truth point).
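This metric translates directly into a few lines of code (a minimal sketch over two point clouds):

import torch

def f_score(pred, gt, thresh):
    # pred: (N, 3) predicted object vertices; gt: (M, 3) ground-truth vertices; thresh in the same units.
    d = torch.cdist(pred, gt)                                    # (N, M) pairwise distances
    precision = (d.min(dim=1).values < thresh).float().mean()    # predicted points near some GT point
    recall = (d.min(dim=0).values < thresh).float().mean()       # GT points near some predicted point
    return 2 * precision * recall / (precision + recall + 1e-8)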

Top-k poses | CF Dist | F@15mm | F@30mm
k = 1 | 20.1 | 52.9 | 79.3
k = 3 | 15.9 | 64.3 | 87.5
k = 5 | 13.7 | 70.6 | 91.4
k = 10 | 12.0 | 76.4 | 94.3
Table 4: Ablations on the number of object pose hypotheses. We report the Chamfer distance (CF Dist) and the F-score between the predicted object mesh and the ground-truth object mesh. With more object pose hypotheses, we obtain better reconstruction results.
Setting | HO Distance (mm) | Collision Score
Individual results | 414.8 | 0
+ Interaction loss | 71.5 | 39.8
+ Ordinal depth loss | 75.2 | 17.6
+ Penetration loss | 76.4 | 7.7
+ Refinement | 75.8 | 6.5
Table 5: Ablations on loss terms and pose refinement. From top to bottom, we add each component one by one (cumulative) and evaluate the predictions in terms of the distance between the hand and the object and the collision score.

Quantitative comparison. In Table 2, we perform a quantitative evaluation of our object and hand pose predictions on the HO3D dataset [hampali2020honnotate]. We compare against the state-of-the-art approach [hasson20_handobjectconsist], which takes the same input of a monocular RGB image and a known object CAD model and uses a feed-forward neural network to predict the 3D hand pose and object pose. With the new split based on object categories, we retrain their model using their released code and report its evaluation results (denoted as [hasson20_handobjectconsist]† in Table 2). For reference, we also report the evaluation results of their released model, which splits the data based on images (denoted as [hasson20_handobjectconsist]* in Table 2). Due to the split difference, some testing images from the new split were used to train this reference model. In comparison, we directly test our optimization-based method on this dataset without extra tuning.

As shown in the table, our method achieves performance close to the reference model, with a hand joint error of 9.5 mm vs. 8.9 mm and a Chamfer distance of 20.5 vs. 18.4. The model retrained with the new split performs worse than the reference model, with a larger hand joint error (18.8 mm vs. 8.9 mm) and a lower F-score (30.1 vs. 63.8 at the 15 mm distance threshold). The reasons are twofold: (1) the reference model was trained on part of the images in the new split and thus performs better; (2) the model retrained with the category-based split has difficulty generalizing to new object categories. In Table 3, we evaluate our method on the different object categories without further tuning on this dataset, and we do not observe a large performance difference across categories.

Figure 8: Failure cases. We show representative failure cases of our reconstruction procedure on the EPIC Kitchens and the 100 Days of Hands datasets. We observe several failure modes due to the failure of the individual steps in our procedure: hand pose estimation (first column), object pose estimation (columns 2-4), and the joint optimization (last column).

5.3 Ablation Studies

In this section, we perform ablation studies to better understand different components of our method. We sample a subset of images from the test set for this analysis.

Number of object pose hypotheses. In Table 4, we experiment with keeping multiple hypotheses for the object pose. When optimizing the object pose, we start from multiple rotation initializations sampled uniformly over the space of 3D rotations. After the optimization, instead of using only the top result with the lowest loss, we can pass the top-k results as input to the later stage, yielding different final hand-object reconstructions after jointly optimizing the hand-object arrangement. In the table, we report the object pose error for the result that best matches the ground truth. As we increase k, the Chamfer distance between the predicted object mesh and the ground-truth object mesh decreases substantially, from 20.1 to 12.0. This indicates that having more object pose hypotheses helps resolve some of the ambiguity of the 2D mask loss.

Optimization loss terms. In Table 5, we evaluate the influence of each loss term in the joint optimization and of RefineNet on the reconstruction result. We report the distance between the estimated hand center and object center. The collision score is calculated as the sum of all SDF values following Equations 9 and 10: the deeper the object intersects the hand, the larger the collision score. A good reconstruction should have both a small collision score and a small hand-object distance. From the table, we observe that the hand and object reconstructed by the individual optimizations are far apart, resulting in a large center distance (414.8 mm) and zero collision. Adding the coarse interaction loss quickly decreases their distance to 71.5 mm but yields a large collision score (39.8). Adding the ordinal depth and penetration losses further reduces the collision score to 7.7 while keeping a similar hand-object distance (76.4 mm). RefineNet makes small adjustments to the final result and slightly reduces both the collision score (7.7 to 6.5) and the hand-object distance (76.4 mm to 75.8 mm). These findings align with our visualization in Figure 4.

5.4 Failure Cases

While our method shows promising results, there is still significant room for improvement. In Figure 8, we show representative failure modes of our method in the hope that they help inform future work. In the first column, we observe that hand pose estimation fails when the hand is heavily occluded or cropped by the image boundary. In columns 2-4, we see incorrect object pose predictions due to the ambiguity of the 2D mask loss used to optimize the object pose. In the last column, we observe examples where the joint optimization converges to an undesirable local minimum. To resolve some of the ambiguity in this task, we experimented with multiple object pose hypotheses and observed large improvements, as shown in Table 4. We believe that semi-automated procedures that utilize multiple hypotheses could be a promising direction for obtaining large collections of hand-object reconstructions in the wild.

6 Conclusion

In this work we explore reconstructing hand-object interactions in the wild. We show that by leveraging all available related data, through an optimization-based procedure, we can obtain promising reconstructions for the challenging in-the-wild data. We hope that our work will attract more attention to this exciting and practical setting.

Figure 9: Additional qualitative results. Our procedure produces promising results across a range of scenarios and objects.

References