HO-3D: A Multi-User, Multi-Object Dataset for Joint 3D Hand-Object Pose Estimation

07/02/2019 ∙ by Shreyas Hampali, et al. ∙ 4

We propose a new dataset for 3D hand+object pose estimation from color images, together with a method for efficiently annotating this dataset, and a 3D pose prediction method based on this dataset. The current lack of training data makes 3D hand+object pose estimation very challenging. This lack is due to the complexity of labeling many real images with both 3D poses, and of generating synthetic images with varied, realistic interactions. Moreover, even if synthetic images could be used for training, annotated real images are still needed for validation. To tackle this challenge, we capture sequences with a simple setup made of a single RGB-D camera. We also use a color camera imaging the sequences from a side view, but only for validation. We introduce a novel method based on global optimization that exploits depth, color, and temporal constraints for efficiently annotating the sequences, which we use to train another novel method that predicts both the 3D poses of the hand and the object from a single color image. Our hope is to encourage other researchers to develop better annotation methods for our dataset: One can then apply such methods to easily annotate sequences captured with a single RGB-D camera and create additional training data, thus solving one of the main problems of 3D hand+object pose estimation.




1 Introduction

[Figure 1 panel titles: "Our proposed HO-3D dataset" and "Existing datasets"]
Figure 1: We introduce HO-3D, a novel large-scale dataset of diverse hand-object interaction with 3D annotations of hand and object pose. On the left we show the aligned 3D models of the hand and object using our annotated poses. HO-3D contains 8 objects with 8 users, totaling over 25,000 frames. In comparison, existing datasets have several limitations: The 3D objects are very simple, the interaction is not realistic, the images are corrupted by sensors, and/or the dataset contains only a limited number of samples.

Methods for 3D pose estimation of rigid objects and hands from monocular images have made significant progress recently, thanks to the development of Deep Learning, and the creation of large datasets or the use of synthetic images for training [75, 60, 91, 47, 42, 88]. However, these recent methods still fail when a hand interacts with an object, mostly because of large mutual occlusions, and of the absence of datasets specific to 3D pose estimation for hand+object interaction. Breaking this limit is highly desirable though, as being able to obtain accurate estimates for the hand and the object 3D poses would be very useful in augmented reality applications, or for learning by imitation in robotics, for example.

Several pioneering works have already considered this problem, sometimes with impressive success [72, 30, 81]. These works typically rely on tracking algorithms to exploit temporal constraints, often also considering physical constraints between the hand and the object for improving the pose estimates. While these temporal and physical constraints remain relevant, we would like to also benefit from the power of data-driven methods for 3D hand+object pose estimation from a single image: Being able to estimate these poses from a single frame would avoid the manual initialization and drift of tracking algorithms. A data-driven approach, however, requires real images annotated with the 3D poses of the object and the hand, or synthetic images, or both. Unfortunately, creating annotated data for the hand+object problem is very challenging. Both common options for creating 3D annotations, annotating real images and generating synthetic images, raise problems:

Annotating real images. One approach is to rely on some algorithm for automated annotation, since manual annotation would be prohibitive. This is actually the approach of current benchmarks for 3D hand pose estimation [76, 59, 74, 88], where the “ground truth” annotations are obtained automatically with a tracking algorithm. These annotations are usually taken for granted and used for training and evaluation, but are actually noisy [48]. Another approach is to use sensors attached to the hand as in [17] (bottom right image of Fig. 1). This can directly provide the 3D poses. However, it is not an attractive alternative, as the sensors would be visible in the images and would thus bias the learning algorithm.

Generating synthetic images. Relying on synthetic images is attractive, as the 3D poses are known perfectly. Realistic rendering and domain transfer can be used to train 3D pose estimation on synthetic images [43, 61, 91]. However, generating realistic and natural 3D poses for the hand manipulating an object is difficult. Maybe realistic poses could be captured using sensors, and used for rendering synthetic images, but real images with accurate 3D annotations would still be needed for a proper evaluation.

We therefore propose a new way to approach the problem of creating training and evaluation 3D annotations: Instead of simply introducing a dataset with annotations that have to be taken for granted but are likely to be noisy (as is the case for previous datasets), we propose a dataset for which the annotations can be evaluated and improved.

To do so, as illustrated in Fig. 1, our first two contributions are a dataset and an annotation method that labels sequences captured by a single RGB-D camera with the 3D poses of the hand and the object visible in the images. This method is not data-driven, but optimizes globally all the 3D poses of the hand and the object over the sequence by enforcing temporal constraints, physical constraints between the hand and the object, and exploiting depth and color data. In addition to the RGB-D camera, we use a second RGB camera registered with respect to the RGB-D camera, to capture the hand and the object from a side view. We use this camera for validating the 3D poses recovered by our annotation method by manually annotating some of the frames it captures.

The advantage of this approach is three-fold: a) Our global optimization scheme introduces constraints between past and future frames, and therefore can perform better than a frame-by-frame tracking approach, as our experiments show. b) The 3D poses estimated by our annotation method can be quantitatively evaluated, as we annotated the hand joints and the object poses in some images of the side camera which is registered with respect to the RGB-D camera. This means that future annotation methods can be evaluated and used to generate better 3D annotations. c) Another advantage is in our very simple setup: The side camera is useful for validating the 3D annotation method, but not required for generating the annotations. In practice, one can then easily generate 3D annotations for training by capturing additional sequences with a simple RGB-D camera.

Our third contribution, in addition to the dataset and our method to annotate it, is a method for predicting the 3D poses of a hand and an object from a new monocular color image (the depth information is not used by this method), and trained on the 3D annotations obtained with our annotation method. This validates the fact that the 3D poses estimated by our annotation method can actually be used in a data-driven method for hand+object pose estimation. This second method is based on two CNNs that both take the RGB image as input, and together predict the 3D poses of the hand and the object. For the object pose, 2D projections of the object bounding box are predicted since they are sufficient to compute the 3D pose [60, 78], and for the hand pose, the 3D joint coordinates are predicted relative to the object centroid.

To summarize, we introduce the following contributions:

  • a dataset made of RGB-D sequences of 8 different people manipulating different objects, and manual annotations in side views for evaluation of the 3D poses.

  • an annotation method that estimates the 3D poses of a hand and an object for all the frames of an RGB-D sequence, by global optimization.

  • a 3D pose estimation method that estimates the 3D poses of a hand and an object given a single RGB frame, using our annotation method and our dataset as training data.

In the remainder of this paper, we first discuss previous work related to hand+object pose estimation. We then describe our dataset, our annotation method, and our pose estimation method from a single color image. Finally, we evaluate these two methods.

2 Related Work

The literature on hand and/or object pose estimation is extremely broad, and we review some of the relevant works.

2.1 3D Object Pose Estimation

Estimating the 3D pose of an object from a color image remains one of the fundamental problems of Computer Vision [24, 60, 78, 75, 33, 23]. [24] extends SSD [34] to predict objects' 2D bounding boxes and a rough estimate of the viewpoint and in-plane rotation. It then further refines the pose. [33] uses an iterative pose refinement. It predicts a relative pose transformation by matching the rendered image given the initial estimated pose against the observed image. [60, 78] estimate the 3D object pose by first predicting the 2D projection of the object corners and then using a PnP algorithm, but they are sensitive to partial occlusions. Also, many works rely on RGB-D data to handle occlusion [6, 39, 89, 13, 25]. However, these methods are not demonstrated on articulated objects such as hands.

2.2 3D Hand Pose Estimation

Single image hand pose estimation is also a very popular problem in Computer Vision, and approaches can be divided into discriminative and generative approaches. Discriminative approaches directly predict the joint locations from RGB or RGB-D images. Recent works based on Deep Networks [79, 86, 90, 45, 20, 47, 12, 42, 91] show remarkable performance, compared to previous discriminative methods based on Random Forests, for example [26, 77, 29]. However, discriminative methods perform poorly in case of partial occlusion.

Generative approaches take advantage of a hand model and its kinematic structure to generate hand pose hypotheses that are physically plausible. [50, 1, 73, 59, 68, 83] introduce approximate hand models to improve computational time when fitting the hand model to the input images. [38, 59, 49, 11] propose various similarity functions that are minimized to find a fit between the input image and the hand model parameters. [50, 49, 59, 68, 87] introduce Particle Swarm Optimization to perform the optimization efficiently, despite the high dimensionality of the pose vector. [43, 54] predict 2D joint locations and use a kinematic 3D hand model to lift these predictions to 3D. Generative approaches are usually accurate, and can be made robust to partial occlusions, as these previous methods demonstrated. Generative approaches typically rely on some prior on the hand pose, which may require manual initialization or result in drift when tracking.

Our work is related to both discriminative and generative approaches: We use a generative approach within a global optimization to generate the pose annotations, and we train a discriminative method from these data, to predict the hand and the object poses together. This way, the prediction is robust to mutual occlusions, while benefiting from the robustness of discriminative methods.

2.3 Synthetic Training Images for 3D Pose

Being able to train discriminative methods on synthetic data is valuable as it is difficult to acquire annotations for real images [91]. For example, [24] trains object detection and 3D pose estimation on synthetic color images. However, because of the domain gap between synthetic and real images, this results in sub-optimal performance, and an extensive refinement of the network predictions is required. [75] uses a domain randomization technique, which randomly changes the texture of the target object to make pose estimation robust to the change of appearance of the objects.

Generative Adversarial Networks (GANs) [18] and Variational Autoencoders (VAEs) [27] can be used to align the distributions for the extracted features from the different domains [16, 41, 51]. However, the extracted features do not carry enough information for accurate 3D pose estimation methods [62, 4]. Therefore, a sophisticated GAN is used by [43], but this still requires renderings of high-quality synthetic color images. [61] introduced a domain adaptation technique to decrease the number of real annotated images.

While using synthetic images remains attractive for many problems, creating the virtual scenes can also be expensive and time consuming. Generating animated, realistic hand grasps of various objects, as would be required to solve the problem considered in this paper, remains challenging. Being able to use real sequences for training thus also has its advantages.

2.4 Joint Hand+Object Pose Estimation

Hand+object interaction recovery is also a popular problem, because of its applications in the robotics and Computer Vision communities. In the context of egocentric action recognition, many works consider hand+object interaction [14, 15, 58, 2, 37, 70, 40]. These works do not estimate the hand pose, but [17] uses the hand pose estimation method of [87] to reach better action recognition. However, the 3D hand pose itself remains inaccurate.

[28] uses a coarse hand pose estimation to retrieve the 3D pose and shape of hand-held objects. [19] first segments the object and then given this information, estimates the 3D locations of the hand joints. But, they only consider a specific type of object and do not estimate the object pose.

[84, 52, 1, 50] propose using calibrated multi-view camera systems, but this results in a complex setup that we would like to avoid in order to make our approach broadly usable. [53, 82] propose generative methods to track finger contact points for in-hand RGB-D object shape scanning. [66, 65] propose discriminative approaches. However, these approaches can only retrieve poses that are similar to those of the training dataset, as they are based on nearest neighbor matching.

In [71], hand+object interaction in grasping tasks is considered, to predict human intention. [72] proposes a 3D articulated Gaussian mixture alignment strategy for hand+object tracking from a single RGB-D camera. [55, 56] consider sensing from vision to estimate contact forces during hand+object interactions using a single RGB-D camera, and then estimate the hand and the object pose. However, these methods are limited to small occlusions. The method of [21] is robust under occlusions, as it uses part-based trackers, but does not explicitly track the object. [30, 81] propose to use a physics simulator and a 3D renderer for frame-to-frame tracking of hand and objects from RGB-D. [31] uses an ensemble of Collaborative Trackers for multi-object and multiple hand tracking from RGB-D images. The accuracy of these methods seems to be qualitatively high, but as the establishment of ground truth in real-world acquisition is known to be hard, they evaluate the proposed method on synthetic datasets, or by measuring the standard deviation of the difference in hand/object poses during a grasping scenario.

The recent [80] considers the problem of tracking a deformable object in interaction with a hand, by optimizing an energy function on the appearance and the kinematics of the hand, together with hand+object contact configurations. However, it is evaluated quantitatively only on synthetic images, which also shows the difficulty of evaluation of real data for hand+object pose estimation. In addition, they only consider scenarios where a hand is visible from top view, which makes the hand poses very limited, and not occluded by any object.

Our annotation method is closely related to these previous works. The main difference is that since pose estimation can be done offline in our case, we can rely on global optimization over the sequence, rather than on frame-by-frame tracking as is usually the case.

2.5 Hand+Object Datasets

Several datasets for hand+object interactions have already been proposed. Many works provide egocentric RGB or RGB-D sequences for action recognition [7, 8, 15, 2, 64, 36, 82]. However, they focus on grasp and action labels and do not provide 3D poses. [10, 63, 42, 80] synthetically generate datasets with 3D hand pose annotations, but fine interaction between a hand and an object remains difficult to generate accurately.

[81, 83] captured sequences in the context of hand+hand and hand+object interaction, with 2D hand annotations only. [44] collected a dataset of real RGB images of hands holding objects. They also provide 2D joint annotations of pairs of non-occluded and occluded hands, by removing the object from the grasp of the subject, while maintaining their hand in the same pose. [19] proposes two datasets, a hand+object segmentation dataset and a hand+object pose estimation dataset. However, for both datasets, the background pixels have been set to zero, and the training images only consist of a hand interacting with a tennis ball. They provide hand pose annotations and object positions, by manually labeling the joints and using a generative method to refine the joint positions.

[72] proposed an RGB-D dataset of a hand manipulating a cube, which contains manual ground truth for both fingertip positions and 3D poses of the cube. [57] collected a dataset where they measure motion and force under different object-grasp configurations using sensors, but do not provide 3D poses. In contrast to these previous works, [17] provides a dataset of hand and object with 3D annotations for both hand joints and object pose. They used a motion capture system made of magnetic sensors attached to the user's hand and to the object in order to obtain hand 3D pose annotations in RGB-D video sequences. However, this changes the appearance of the hand in color images, as the sensors and the tape attaching them are visible.

As illustrated in Fig. 1, our HO-3D dataset is the first dataset providing both 3D hand joints and 3D object pose annotations for real images, while the hand and the object are heavily occluded by each other.

3 3D Annotation Method

We describe below our method for annotating a sequence of RGB-D images capturing a hand interacting with an object. Each RGB-D image is made of a depth map and a color image. We use the MANO hand model [67], which is a parametric 3D model of the hand, and we assume we have a 3D model of the object.

We aim to estimate the 3D poses for both the hand and the object in all the images of the sequence, where the hand pose consists of 21 DoF (3 joints for each of the fingers) plus 6 DoF for global rotation and translation, and the object pose consists of 6 DoF for global rotation and translation.

Hand model.

In addition to the pose parameters $\theta$, the hand model has shape parameters $\beta$ that are fixed for a given person. The MANO hand model framework provides a differentiable function that will allow us to adjust the pose and shape parameters of the generated hand:

$$M(\beta, \theta) = W\big(T_P(\beta, \theta),\, J(\beta),\, \theta,\, \mathcal{W}\big),$$

where $W$ is a Linear Blend Skinning (LBS) function [32] with blend weights $\mathcal{W}$, $T_P$ is a template mesh that is rigged with a kinematic tree of $K$ joints, and $J(\beta)$ denotes the 3D joint locations.

To reduce the artifacts of LBS, the template mesh function $T_P$ is obtained by adjusting a mean shape $\bar{T}$ with corrective shape and pose blend shapes $S_n$ and $P_n$:

$$T_P(\beta, \theta) = \bar{T} + \sum_{n=1}^{|\beta|} \beta_n S_n + \sum_{n=1}^{9K} \big(R_n(\theta) - R_n(\theta^*)\big) P_n,$$

where $R_n(\theta)$ gives the $n$-th element of the concatenation of rotation matrices for pose $\theta$, and $\theta^*$ denotes the rest pose (i.e., the pose of a flat open hand). The model parameters were learned by scanning hands of different users and registering them to a template mesh model [67].

Object model.

We use objects from the YCB-Video dataset [85] as their corresponding 3D models are available and of good quality.

3.1 Cost Function

As mentioned in the introduction, our method is based on a global optimization over the sequence. We minimize an energy over the hand and object poses of all the frames. This energy is a weighted sum of four terms: a depth residual term, a silhouette discrepancy term, a physical plausibility term, and a temporal consistency term. The weighting factors take the same values for all the sequences. We detail each of these terms below.
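The equations of this section are missing from this copy of the text. A form consistent with the description above is sketched below. The notation is an assumption: $\theta_t$ and $\omega_t$ stand for the hand and object poses at frame $t$, $N$ for the number of frames, $E_D$, $E_S$, $E_P$, $E_T$ for the four terms, and $\lambda_S$, $\lambda_P$, $\lambda_T$ for the weighting factors. Which terms carry a weight is also an assumption.

```latex
\begin{align}
\{\hat{\theta}_t, \hat{\omega}_t\}_{t=1}^{N}
  &= \arg\min_{\{\theta_t, \omega_t\}_{t=1}^{N}} \; \sum_{t=1}^{N} E_t, \\
E_t &= E_D(\theta_t, \omega_t)
     + \lambda_S \, E_S(\theta_t, \omega_t)
     + \lambda_P \, E_P(\theta_t, \omega_t)
     + \lambda_T \, E_T(\theta_t, \omega_t, \theta_{t-1}, \omega_{t-1}).
\end{align}
```

Only the temporal term couples neighboring frames; the other three terms can be evaluated for each frame independently.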

Depth residual term.

This term is defined as the difference between a captured depth map and a depth rendering of the hand and the object under their current estimated poses. The depth rendering merges the rendered depth map of the hand and the rendered depth map of the object with an operator applied at each pixel location, so that the two depth maps are combined consistently. The difference is measured with the Tukey function, a robust estimator that is similar to the squared loss close to 0 and constant after a threshold, which makes the term robust to noise in the captured depth maps.

We use OpenDR [35] to render the depth maps and compute the derivatives of during optimization.
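The depth residual can be illustrated with a minimal, non-differentiable sketch in plain Python. It is not the OpenDR-based implementation. The per-pixel minimum used to merge the two renderings, the threshold `c` (in the same unit as the depth values), the use of `math.inf` for pixels not covered by a model, and the choice to skip those pixels are all assumptions made for illustration.

```python
import math

def tukey(r, c):
    """Tukey biweight loss: close to r^2 / 2 near 0, constant c^2 / 6 for |r| >= c."""
    if abs(r) >= c:
        return c * c / 6.0
    u = 1.0 - (r / c) ** 2
    return (c * c / 6.0) * (1.0 - u ** 3)

def merge_depth(hand_depth, object_depth):
    """Per-pixel minimum of two rendered depth maps: the nearest surface wins.
    Pixels not covered by a model are expected to hold math.inf."""
    return [[min(h, o) for h, o in zip(h_row, o_row)]
            for h_row, o_row in zip(hand_depth, object_depth)]

def depth_residual(captured, hand_depth, object_depth, c=30.0):
    """Sum of Tukey losses between captured and rendered depth,
    over the pixels covered by the hand or the object rendering."""
    rendered = merge_depth(hand_depth, object_depth)
    total = 0.0
    for cap_row, ren_row in zip(captured, rendered):
        for d, r in zip(cap_row, ren_row):
            if math.isinf(r):  # background pixel: skipped in this sketch
                continue
            total += tukey(d - r, c)
    return total

# Toy 2x2 example (hypothetical depth values).
INF = math.inf
hand = [[500.0, INF], [INF, INF]]
obj = [[520.0, 600.0], [INF, INF]]
captured = [[500.0, 605.0], [900.0, 900.0]]
print(merge_depth(hand, obj))
print(depth_residual(captured, hand, obj))
```

Because the loss saturates at `c * c / 6`, a pixel with a very large residual (for example a sensor dropout) contributes a bounded amount to the sum.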

Silhouette discrepancy term.

To further guide the optimization, we compare the silhouettes of the hand and the object models under the current estimated poses and their masks extracted from the color image. To do so, we use a segmentation of the color images into hand, object and background, obtained with DeepLabv3 [9]. To train DeepLabv3 on such data, we create synthetic images by adding images of hands to images of objects at random locations and scales. We use the object masks provided by [85]. The segmented hands were obtained using an RGB-D camera, applying simple depth thresholding, and projecting the depth mask to the color image. We also use additional synthetic hand images from the RHD dataset [91]. Samples of segmented images are shown in Fig. 2. Note that such training images are sufficient for learning to extract an approximate mask, but not the 3D poses of the hand and object, as our experiments show.

This term thus encourages the rendered silhouettes of the hand and the object to align with the hand and object masks extracted from the color image. The silhouettes of the hand and the object are extracted from the renderings with operators that are implemented with OpenDR in a differentiable way.
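As a rough illustration of such a term, the sketch below compares a rendered silhouette with a segmentation mask pixel by pixel. The sum of squared differences is an assumption, since the exact formula is not available in this copy, and the function names are hypothetical.

```python
def mask_discrepancy(rendered_mask, target_mask):
    """Sum of squared per-pixel differences between two masks with values in [0, 1]."""
    return sum((r - m) ** 2
               for r_row, m_row in zip(rendered_mask, target_mask)
               for r, m in zip(r_row, m_row))

def silhouette_term(hand_sil, hand_mask, object_sil, object_mask):
    """Hand and object discrepancies added together."""
    return (mask_discrepancy(hand_sil, hand_mask)
            + mask_discrepancy(object_sil, object_mask))

# Toy 2x2 masks: the rendered hand covers one pixel too many.
hand_sil = [[0, 1], [1, 1]]
hand_mask = [[0, 1], [0, 1]]
object_sil = [[1, 0], [0, 0]]
object_mask = [[1, 0], [0, 0]]
print(silhouette_term(hand_sil, hand_mask, object_sil, object_mask))
```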

Figure 2: Example of hand and object segmentation obtained with DeepLabV3. Left: input image; Center: object mask; Right: hand mask.

Physical plausibility term.

During optimization, the hand model might interpenetrate the object model, which is physically not possible. To avoid this, we add a repulsion term that pushes the object and the hand apart if they interpenetrate each other. We use the dot-product between the hand joint locations and the surface normals at the closest vertex to each joint on the 3D object surface: If the dot-product is negative, it means the hand penetrates the object, and we add a residual that leads to a gradient that pushes the hand outside the object mesh. The residuals use exponential weights.
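The penetration test can be sketched as follows. This is an illustration under assumptions: the joint position is taken relative to its closest object vertex before the dot-product, normals are assumed to be unit length and to point outward, and the residual is simply the penetration depth. The exponential weighting mentioned above is not reproduced because its formula is missing from this copy.

```python
import math

def closest_vertex(point, vertices):
    """Index of the object vertex closest to the given 3D point."""
    return min(range(len(vertices)), key=lambda i: math.dist(point, vertices[i]))

def penetration_residuals(joints, vertices, normals):
    """One residual per hand joint: 0 if the joint is outside the object,
    the penetration depth along the surface normal otherwise."""
    residuals = []
    for joint in joints:
        i = closest_vertex(joint, vertices)
        offset = [j - v for j, v in zip(joint, vertices[i])]
        signed = sum(o * n for o, n in zip(offset, normals[i]))
        residuals.append(max(0.0, -signed))  # negative dot-product: inside
    return residuals

# Toy object: two vertices of a surface whose outward normal is +z.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
joints = [(0.0, 0.0, 0.5), (1.0, 0.0, -0.2)]
print(penetration_residuals(joints, vertices, normals))
```

The brute-force nearest-vertex search is linear in the number of vertices; a spatial index would be the usual choice for a full object mesh.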

Temporal consistency term.

The previous terms were applied to each frame independently. The temporal consistency term allows us to constrain together the poses for all the frames. We apply a simple 0-th order motion model on both the hand and object poses, which penalizes changes of pose between consecutive frames.

Since we optimize a sum of these terms over the sequence, this effectively constrains all the poses together.
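The formula of this term is missing from this copy. A 0-th order (constant pose) motion model can be sketched as a penalty on the difference between the poses of consecutive frames. The notation ($\theta_t$ and $\omega_t$ for the hand and object poses at frame $t$) and the use of a squared Euclidean norm are assumptions:

```latex
\begin{equation}
E_T(\theta_t, \omega_t, \theta_{t-1}, \omega_{t-1})
  = \lVert \theta_t - \theta_{t-1} \rVert^2 + \lVert \omega_t - \omega_{t-1} \rVert^2 .
\end{equation}
```

Each pose appears in the terms of frames $t$ and $t+1$, so a change in one frame propagates to its neighbors through the sum over the sequence.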

3.2 Global Optimization

Optimizing Eq. (3.1) from a random initialization would fail, as it is a highly non-convex problem with many parameters to estimate: Each frame has 27 parameters, and our sequences are made of about 1000 frames each. We therefore proceed in several stages.

We first focus on a segment of the sequence. In practice, we use a segment of 150 frames in the middle of the sequence. We use a manual initialization of the object pose for a single frame of the sequence. We start by computing a first estimate of the object poses only, since the object poses have fewer degrees of freedom and are therefore easier to recover. Using the manual initialization in a single frame, we greedily optimize Eq. (3.1) one frame at a time, in forward and backward directions, over the object pose only.

We then obtain a first estimate of the hand poses for the frames of the segment. This is done by manually providing a coarse estimate of the hand pose for one frame of the segment, and optimizing Eq. (3.1) over the hand poses. At this stage, the hand poses are constrained to be related to the object poses by a constant rigid motion. We therefore only estimate 6 degrees of freedom during this stage. We initialize the hand poses for the frames outside the segment, by using the same rigid motion between the hand and the object for these frames.

Finally, we optimize Eq. (3.1) over all the parameters of all the frames, i.e., over all the hand and object poses, using the initial estimates for each frame. We resort to a greedy approach. We iterate over the full sequence and pick one frame at a time. We optimize the poses of the hand and the object for this frame using Dogleg [46], a quasi-Newton least-squares optimizer, and update the pose in the sequence. We then pick the next frame, update the poses, and so on. We move through the sequence in a forward manner, and then backwards. We iterate these forward and backward passes several times (in practice twice). This final refinement stage typically improves the accuracy of the poses by about 7%.
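The forward and backward sweeps can be illustrated on a toy problem. In the sketch below, each frame has a single scalar "pose", the data term is a squared distance to a per-frame observation, and the temporal term is a squared difference between neighbors, so that the per-frame minimizer has a closed form. This stands in for the per-frame Dogleg step and is not the actual energy; `lam` and all function names are hypothetical.

```python
def energy(poses, observations, lam):
    """Toy energy: data term per frame plus a 0-th order temporal term."""
    data = sum((p - z) ** 2 for p, z in zip(poses, observations))
    temporal = sum((b - a) ** 2 for a, b in zip(poses, poses[1:]))
    return data + lam * temporal

def refine_frame(t, poses, observations, lam):
    """Exact minimizer of the toy energy over frame t, neighbors held fixed."""
    neighbors = [poses[s] for s in (t - 1, t + 1) if 0 <= s < len(poses)]
    return (observations[t] + lam * sum(neighbors)) / (1.0 + lam * len(neighbors))

def sweep_refine(initial, observations, lam=1.0, passes=2):
    """Greedy refinement: one frame at a time, forward then backward, repeated."""
    poses = list(initial)
    n = len(poses)
    for _ in range(passes):
        order = list(range(n)) + list(range(n - 1, -1, -1))
        for t in order:
            poses[t] = refine_frame(t, poses, observations, lam)
    return poses

# Toy sequence with one outlier observation in the middle.
observations = [0.0, 0.0, 10.0, 0.0, 0.0]
refined = sweep_refine(observations, observations)
print(energy(observations, observations, 1.0), energy(refined, observations, 1.0))
```

Since every update minimizes the energy over one frame exactly, the total energy never increases during a sweep, which is the property that makes this greedy scheme safe to iterate.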

4 Monocular 3D Hand+Object Pose Estimation Method

Figure 3: Our baseline method for predicting 3D hand+object poses. We use the segmentation of Section 3.1 to first localize the object and hand, and crop an image patch around their centroid. The object patch is fed to a CNN that predicts the 2D projections of the object bounding box corners together with the hand wrist joint 2D location. The hand patch is fed to another CNN that predicts the 3D joint locations of the hand with respect to the wrist joint and the depth of the wrist joint. The 3D joint locations are then translated using the wrist joint depth, its 2D location and the camera intrinsics.

For establishing a baseline on our proposed dataset, we propose a simple discriminative CNN-based method that jointly predicts the 3D pose of the hand and the object from a single color image. Fig. 3 shows an overview of this method.

Network model.

We train a first network to predict the 2D bounding box projections for the object and the 2D wrist joint location of the hand. This network takes a crop centered on the object location. The 3D pose of the object is calculated using PnP [22] from the 2D projections and the corresponding 3D bounding box.

A second network takes a cropped image centered on the hand and predicts the 3D joint locations relative to the wrist joint location together with the depth of the wrist joint. We take the hand and object 2D locations as the centroids of the hand and object segmentations presented in Section 3.1.
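The last step, recovering absolute 3D joint locations from the two network outputs, amounts to back-projecting the wrist with a pinhole camera model and translating the relative joints. The sketch below assumes intrinsics given as `(fx, fy, cx, cy)` without lens distortion; the function names and the numeric values are hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """3D camera-frame point for pixel (u, v) at the given depth (pinhole model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def absolute_joints(relative_joints, wrist_uv, wrist_depth, intrinsics):
    """Translate wrist-relative 3D joints by the back-projected wrist location."""
    fx, fy, cx, cy = intrinsics
    wx, wy, wz = backproject(wrist_uv[0], wrist_uv[1], wrist_depth, fx, fy, cx, cy)
    return [(wx + dx, wy + dy, wz + dz) for dx, dy, dz in relative_joints]

# Hypothetical intrinsics and predictions (meters and pixels).
intrinsics = (600.0, 600.0, 320.0, 240.0)
relative = [(0.0, 0.0, 0.0), (0.02, -0.01, 0.03)]
joints_3d = absolute_joints(relative, wrist_uv=(380.0, 240.0), wrist_depth=0.5,
                            intrinsics=intrinsics)
print(joints_3d)
```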

For training the networks, we optimize one loss for estimating the 2D projections of the object bounding box corners and of the wrist joint, and a second loss for estimating the relative 3D hand joint locations. Each network takes as input a patch cropped from the image and centered around the corresponding 2D location. We use the VGG network architecture [69] for both networks, remove the conv5 layer, and modify the last fully connected layer to match the corresponding predictions.

Training data.

We use our dataset together with synthetically generated training data. We use 14 out of 15 sequences for training the network and the remaining sequence for evaluation. The synthetic data for training is generated using the MANO model. As the model contains only geometric information, we add color information to the model by mapping the color of registered hand scan data [67] to the hand vertices [3]. For generating the synthetic data, we use the rigid motion between the hand and object from our proposed dataset and render the hand and object together under different poses.

Additionally, we translate the hand along the object surface to generate more realistic data. In total, we use 20k synthetic and 15k real samples to train the network.

For the hand-object localization, we use the segmentation method introduced in Section 3.1. We then crop a window around the 2D object location, and add a random translation to augment the data and account for localization inaccuracies.
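The localization and cropping step can be sketched as below. The window size and the jitter range are hypothetical values, since the original numbers are not available in this copy, and clipping the window to the image bounds is left out.

```python
import random

def mask_centroid(mask):
    """Centroid (x, y) of the nonzero pixels of a 2D mask, or None if it is empty."""
    xs = ys = count = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                xs += x
                ys += y
                count += 1
    return (xs / count, ys / count) if count else None

def crop_box(center, size, jitter=0, rng=random):
    """(left, top, right, bottom) of a square window of side `size` around `center`,
    shifted by a random integer translation in [-jitter, jitter] on each axis."""
    dx = rng.randint(-jitter, jitter)
    dy = rng.randint(-jitter, jitter)
    left = int(round(center[0] + dx - size / 2))
    top = int(round(center[1] + dy - size / 2))
    return (left, top, left + size, top + size)

mask = [[0, 1, 1],
        [0, 1, 1]]
print(mask_centroid(mask))
print(crop_box((100, 80), size=64))
print(crop_box((100, 80), size=64, jitter=10, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible across runs.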

5 Benchmarking the Dataset

Figure 4: Qualitative comparison between manual annotations and our annotations using the side view camera. Manual annotations are in grayscale, our automatic annotations in color. Top: Hand comparison. Bottom: Object comparison.

In this section, we evaluate both our annotation method and our baseline for hand+object pose prediction from a color image.

5.1 Evaluation of the Annotation Method

We used our 3D pose annotation method to annotate 15 sequences, totaling about 15,000 frames captured with 8 different users and 3 different objects from the YCB dataset. In each sequence, the object is grasped differently by the person. The image size is the same for both the RGB-D camera and the side view camera. We use two different metrics to evaluate the accuracy of our annotation method: The first metric is based on manual annotations in the side view camera, and the second one is based on depth maps. We detail them below.

PCK metric: Percentage of Correct Keypoints in a Side View.

We use a second RGB camera registered with respect to the RGB-D camera used to capture the sequences. This camera has a viewpoint that is approximately orthogonal to the RGB-D camera. We use the first camera to obtain the pose annotations and the second camera for validation only. We manually annotate the hand pose with the 2D joint locations of visible joints in the second camera view for 10 randomly chosen frames of 15 different sequences, totaling over 150 frames. Further, we annotated the object pose by providing the 2D corner locations of visible selected points in the second camera view totaling over 100 frames.

To evaluate the PCK metric, we project the estimated 3D pose annotations on the second camera, and compute the 2D distances between the manual annotations and the projected automatic annotations. Fig. 5 shows the PCK metric for varying distance thresholds, i.e., the percentage of 3D points that have an error smaller than a threshold. The final stage of the global optimization discussed in Section 3.2 increases the accuracy of annotations by about 7% by refining the poses. Qualitative results before and after the final stage of global optimization are shown in Fig. 6. More qualitative examples on different objects, persons and poses are shown in Fig. 4.
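A minimal sketch of the PCK computation, assuming the projected annotations and the manual annotations are given as matching lists of 2D points in pixels (the function name and the sample values are hypothetical):

```python
import math

def pck(projected, annotated, thresholds):
    """Fraction of keypoints whose 2D error is smaller than each threshold."""
    errors = [math.dist(p, a) for p, a in zip(projected, annotated)]
    return [sum(e < t for e in errors) / len(errors) for t in thresholds]

# Three keypoints with errors of 0, 5 and 10 pixels.
projected = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
annotated = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
print(pck(projected, annotated, thresholds=[1.0, 6.0, 20.0]))
```

Evaluating the function over a range of thresholds gives the kind of curve plotted in Fig. 5.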

Figure 5: PCK metric (before and after final refinement stage) for evaluating the proposed pose annotation method on 15 sequences with varying distance thresholds. Left: Hand. Right: Object.
Figure 6: Qualitative results of the proposed annotation method. Each column shows the 3D models of the hand and object superimposed on the images from the side view camera before and after refinement stage, respectively. The final refinement still improves the estimated poses.
Figure 7: Reconstruction error. We plot the fraction of depth residuals smaller than a threshold. A large fraction of errors are well below 10 mm, which shows that the articulated models can explain the captured data accurately.
Figure 8: Evaluation of our baseline for hand pose estimation. We plot the PCK metric for the hand and the 2D reprojection error metric for the object over the distance threshold. The Hand_Only and Object_Only curves show the performance of our regressor when trained on the synthetic images used for the image segmentation task. Hand_HO-3D and Object_3D depict the performance when trained on the HO-3D dataset. All the regressors are tested on a HO-3D sequence.

Reconstruction error.

As an additional evaluation, we rely on a reconstruction error. To do so, we compute the difference between the captured depth maps and the depth maps generated by rendering the models of the hand and the object under the estimated poses. Fig. 7 shows the evaluation for three different objects: Mustard bottle, Cracker box, and Sugar box. The majority of all errors, around 90%, are below 10 mm. The remaining errors can be attributed to depth noise from the sensor and to the wrist and forearm, which are not present in our model.
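The reconstruction error above amounts to a per-pixel comparison of two depth maps. The sketch below shows one way to compute the fraction of depth residuals below a threshold. It is an illustration only: the function name `residual_fraction` is hypothetical, depth maps are represented as nested lists in millimetres, and treating a depth of 0 as "no measurement" is an assumption about the data format rather than something stated in the text.

```python
def residual_fraction(captured_mm, rendered_mm, threshold_mm):
    """Fraction of valid pixels whose absolute depth residual is below threshold_mm.

    A pixel counts as valid only if both maps have a positive depth there
    (assumption: 0 marks a missing measurement or an empty rendered pixel).
    """
    residuals = [
        abs(c - r)
        for row_c, row_r in zip(captured_mm, rendered_mm)
        for c, r in zip(row_c, row_r)
        if c > 0 and r > 0
    ]
    if not residuals:
        raise ValueError("no pixel is valid in both depth maps")
    return sum(res < threshold_mm for res in residuals) / len(residuals)

# Toy 2x3 depth maps in millimetres, for illustration only.
captured = [[500, 502, 0], [510, 530, 498]]
rendered = [[501, 500, 499], [509, 505, 0]]

fraction_below_10mm = residual_fraction(captured, rendered, threshold_mm=10)
```

Evaluating `residual_fraction` over a range of thresholds gives a curve of the kind plotted in Fig. 7.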

5.2 Evaluation of the 3D Hand+Object Pose Estimation Baseline

To evaluate baseline methods on our dataset, we need to establish an evaluation protocol. Our proposed HO-3D dataset contains 15 sequences of 8 users with 3 objects. We use 14 sequences for training and the remaining sequence for testing.

For the evaluation, we follow established error metrics from the fields of hand and object pose estimation, respectively. The hand pose is evaluated by computing the Euclidean distance between the ground truth 3D joint locations and the predicted joint locations. Specifically, we employ the PCK plot [91] and the fraction of frames with maximum error below a threshold [76]. The object pose is evaluated using the 2D reprojection error metric [5].
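The two hand metrics can be sketched as below. This is a minimal illustration, not the reference implementation of [91] or [76]: the function names and the data layout (a list of frames, each a pair of predicted and ground truth joint lists) are assumptions made for this example.

```python
import math

def joint_errors(predicted, ground_truth):
    """Euclidean distance between each predicted joint and its ground truth joint."""
    return [
        math.sqrt(sum((p - g) ** 2 for p, g in zip(pj, gj)))
        for pj, gj in zip(predicted, ground_truth)
    ]

def pck_3d(frames, threshold):
    """Fraction of all joints, over all frames, with 3D error below the threshold."""
    errors = [e for pred, gt in frames for e in joint_errors(pred, gt)]
    return sum(e < threshold for e in errors) / len(errors)

def frames_within(frames, threshold):
    """Fraction of frames whose worst joint error is below the threshold."""
    worst = [max(joint_errors(pred, gt)) for pred, gt in frames]
    return sum(w < threshold for w in worst) / len(worst)

# Two toy frames with two joints each (millimetres), for illustration only.
gt_joints = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
frames = [
    ([(3.0, 4.0, 0.0), (10.0, 0.0, 12.0)], gt_joints),   # joint errors: 5 and 12
    ([(0.0, 0.0, 2.0), (10.0, 30.0, 40.0)], gt_joints),  # joint errors: 2 and 50
]

pck_at_20 = pck_3d(frames, threshold=20.0)
frames_at_20 = frames_within(frames, threshold=20.0)
```

The second metric is the stricter one: a single badly estimated joint makes the whole frame count as a failure, whereas it only removes one joint from the PCK.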

We evaluate the proposed hand+object pose estimation method on our dataset and plot the metrics in Fig. 8 for both hand and object. Qualitative results are shown in Fig. 9.

Figure 9: Qualitative results of the proposed pose annotation method. The annotations from our dataset are shown in green, and the predicted poses are shown in blue. For the hand, we plot the skeleton with the 3D joint positions projected to the image, and for the object we plot the 3D bounding box projected to the image. The points in the second row show the predicted and estimated 2D locations of the wrist joint.

5.3 Using the Segmentation Data to Train the Hand+Object Predictor

As outlined in Section 3, one might use the data we used for training the segmentation network to train the pose estimator. To evaluate this option, we trained DeepPrior++ [47] on the BigHand dataset [88] and ran it on the real RGB-D frames of hand-only poses that we used for training the segmentation network. We then used these predictions as labels for the real frames when training the hand+object predictor. Fig. 8 shows the PCK metric for the hand joint locations predicted using this segmentation dataset. As this training data contains neither realistic occlusions of the hand or objects nor many realistic grasp poses, the performance of the hand pose estimator trained on it is inferior. Thus, such images are suitable only for segmentation and not for pose prediction.

6 Conclusion

We introduced the HO-3D dataset for 3D hand+object pose estimation and evaluation on real images, together with an annotation method, an evaluation strategy for this method, and a baseline method for predicting the 3D pose of the hand and the object from a single color image. We will make our dataset, annotations, and code public, to encourage other authors to develop better annotation methods and pose prediction methods that can be evaluated on real images.

7 Acknowledgment

This work was supported by the Christian Doppler Laboratory for Semantic 3D Computer Vision, funded in part by Qualcomm Inc.


  • [1] L. Ballan, A. Taneja, J. Gall, L. Van Gool, and M. Pollefeys. Motion Capture of Hands in Action Using Discriminative Salient Points. In ECCV, 2012.
  • [2] S. Bambach, S. Lee, D. J. Crandall, and C. Yu. Lending a Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions. In ICCV, 2015.
  • [3] A. Boukhayma, R. de Bem, and P. H. Torr. 3D Hand Shape and Pose from Images in the Wild. In arXiv Preprint, 2019.
  • [4] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain Separation Networks. In NIPS, 2016.
  • [5] E. Brachmann, F. Michel, A. Krull, M. M. Yang, S. Gumhold, and C. Rother. Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image. In CVPR, 2016.
  • [6] A. G. Buch, L. Kiforenko, and D. Kraft. Rotational Subgroup Voting and Pose Clustering for Robust 3D Object Recognition. In ICCV, 2017.
  • [7] I. M. Bullock, T. Feix, and A. M. Dollar. The Yale Human Grasping Dataset: Grasp, Object, and Task Data in Household and Machine Shop Environments. The International Journal of Robotics Research, 34(3):251–255, 2015.
  • [8] M. Cai, K. M. Kitani, and Y. Sato. A Scalable Approach for Understanding the Visual Structures of Hand Grasps. In ICRA, 2015.
  • [9] L. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking Atrous Convolution for Semantic Image Segmentation. CoRR, abs/1706.05587, 2017.
  • [10] C. Choi, S. Ho Yoon, C.-N. Chen, and K. Ramani. Robust Hand Pose Estimation During the Interaction with an Unknown Object. In ICCV, 2017.
  • [11] M. de La Gorce, D. J. Fleet, and N. Paragios. Model-Based 3D Hand Pose Estimation from Monocular Video. PAMI, 33(9):1793–1805, 2011.
  • [12] X. Deng, S. Yang, Y. Zhang, P. Tan, L. Chang, and H. Wang. Hand3D: Hand Pose Estimation Using 3D Neural Network. In arXiv Preprint, 2017.
  • [13] A. Doumanoglou, V. Balntas, R. Kouskouridas, and T. Kim. Siamese Regression Networks with Efficient Mid-Level Feature Extraction for 3D Object Pose Estimation. In arXiv Preprint, 2016.
  • [14] A. Fathi, A. Farhadi, and J. M. Rehg. Understanding Egocentric Activities. In ICCV, 2011.
  • [15] A. Fathi, X. Ren, and J. M. Rehg. Learning to Recognize Objects in Egocentric Activities. In CVPR, 2011.
  • [16] Y. Ganin and V. Lempitsky. Unsupervised Domain Adaptation by Backpropagation. In ICML, 2015.
  • [17] G. Garcia-Hernando, S. Yuan, S. Baek, and T.-K. Kim. First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations. In CVPR, 2018.
  • [18] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Networks. In NIPS, 2014.
  • [19] D. Goudie and A. Galata. 3D Hand-Object Pose Estimation from Depth with Convolutional Neural Networks. In IEEE International Conference on Automatic Face & Gesture Recognition, 2017.
  • [20] H. Guo, G. Wang, X. Chen, C. Zhang, F. Qiao, and H. Yang. Region Ensemble Network: Improving Convolutional Network for Hand Pose Estimation. In ICIP, 2017.
  • [21] H. Hamer, K. Schindler, E. Koller-Meier, and L. Van Gool. Tracking a Hand Manipulating an Object. In ICCV, 2009.
  • [22] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.
  • [23] O. H. Jafari, S. K. Mustikovela, K. Pertsch, E. Brachmann, and C. Rother. iPose: Instance-Aware 6D Pose Estimation of Partly Occluded Objects. In ACCV, 2018.
  • [24] W. Kehl, F. Manhardt, F. Tombari, S. Ilic, and N. Navab. SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again. In ICCV, 2017.
  • [25] W. Kehl, F. Milletari, F. Tombari, S. Ilic, and N. Navab. Deep Learning of Local RGB-D Patches for 3D Object Detection and 6D Pose Estimation. In ECCV, 2016.
  • [26] C. Keskin, F. Kıraç, Y. E. Kara, and L. Akarun. Hand Pose Estimation and Hand Shape Classification Using Multi-Layered Randomized Decision Forests. In ECCV, 2012.
  • [27] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. In ICLR, 2014.
  • [28] M. Kokic, D. Kragic, and J. Bohg. Learning to Estimate Pose and Shape of Hand-Held Objects from RGB Images. arXiv preprint arXiv:1903.03340, 2019.
  • [29] A. Kuznetsova, L. Leal-Taixé, and B. Rosenhahn. Real-Time Sign Language Recognition Using a Consumer Depth Camera. In ICCV, pages 83–90, 2013.
  • [30] N. Kyriazis and A. Argyros. Physically Plausible 3D Scene Tracking: The Single Actor Hypothesis. In CVPR, 2013.
  • [31] N. Kyriazis and A. Argyros. Scalable 3D Tracking of Multiple Interacting Objects. In CVPR, 2014.
  • [32] J. P. Lewis, M. Cordner, and N. Fong. Pose Space Deformation: A Unified Approach to Shape Interpolation and Skeleton-Driven Deformation. In SIGGRAPH, pages 165–172, 2000.
  • [33] Y. Li, G. Wang, X. Ji, Y. Xiang, and D. Fox. DeepIM: Deep Iterative Matching for 6D Pose Estimation. In ECCV, 2018.
  • [34] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single Shot Multibox Detector. CoRR, abs/1512.02325, 2016.
  • [35] M. M. Loper and M. J. Black. OpenDR: An Approximate Differentiable Renderer. In ECCV, 2014.
  • [36] R. Luo, O. Sener, and S. Savarese. Scene Semantic Reconstruction from Egocentric RGB-D-Thermal Videos. In International Conference on 3D Vision, pages 593–602, 2017.
  • [37] M. Ma, H. Fan, and K. M. Kitani. Going Deeper into First-Person Activity Recognition. In CVPR, 2016.
  • [38] S. Melax, L. Keselman, and S. Orsten. Dynamics Based 3D Skeletal Hand Tracking. In Proceedings of Graphics Interface, 2013.
  • [39] C. Mitash, A. Boularias, and K. E. Bekris. Improving 6D Pose Estimation of Objects in Clutter via Physics-Aware Monte Carlo Tree Search. In ICRA, 2018.
  • [40] M. Moghimi, P. Azagra, L. Montesano, A. C. Murillo, and S. Belongie. Experiments on an RGB-D Wearable Vision System for Egocentric Activity Recognition. In CVPR, 2014.
  • [41] K. Muandet, D. Balduzzi, and B. Schölkopf. Domain Generalization via Invariant Feature Representation. In ICML, 2013.
  • [42] F. Mueller, D. Mehta, O. Sotnychenko, S. Sridhar, D. Casas, and C. Theobalt. Real-Time Hand Tracking Under Occlusion from an Egocentric RGB-D Sensor. In ICCV, 2017.
  • [43] F. Müller, F. Bernard, O. Sotnychenko, D. Mehta, S. Sridhar, D. Casas, and C. Theobalt. Ganerated Hands for Real-Time 3D Hand Tracking from Monocular RGB. In CVPR, 2018.
  • [44] B. Myanganbayar, C. Mata, G. Dekel, B. Katz, G. Ben-yosef, and A. Barbu. Partially Occluded Hands: A Challenging New Dataset for Single-Image Hand Pose Estimation. In ACCV, 2018.
  • [45] N. Neverova, C. Wolf, F. Nebout, and G. Taylor. Hand Pose Estimation through Semi-Supervised and Weakly-Supervised Learning. CVIU, 2017.
  • [46] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.
  • [47] M. Oberweger and V. Lepetit. DeepPrior++: Improving Fast and Accurate 3D Hand Pose Estimation. In ICCV, 2017.
  • [48] M. Oberweger, G. Riegler, P. Wohlhart, and V. Lepetit. Efficiently Creating 3D Training Data for Fine Hand Pose Estimation. In CVPR, 2016.
  • [49] I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Efficient Model-Based 3D Tracking of Hand Articulations Using Kinect. In BMVC, 2011.
  • [50] I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Full DoF Tracking of a Hand Interacting with an Object by Modeling Occlusions and Physical Constraints. In ICCV, 2011.
  • [51] S. Pan, I. Tsang, J. Kwok, and Q. Yang. Domain Adaptation via Transfer Component Analysis. In IJCAI, 2009.
  • [52] P. Panteleris and A. Argyros. Back to RGB: 3D Tracking of Hands and Hand-Object Interactions Based on Short-Baseline Stereo. In ICCV, 2017.
  • [53] P. Panteleris, N. Kyriazis, and A. A. Argyros. 3D Tracking of Human Hands in Interaction with Unknown Objects. In BMVC, 2015.
  • [54] P. Panteleris, I. Oikonomidis, and A. Argyros. Using a Single RGB Frame for Real Time 3D Hand Pose Estimation in the Wild. In WACV, 2018.
  • [55] T.-H. Pham, A. Kheddar, A. Qammaz, and A. Argyros. Capturing and Reproducing Hand-Object Interactions through Vision-Based Force Sensing. In Object Understanding for Interaction, 2015.
  • [56] T.-H. Pham, A. Kheddar, A. Qammaz, and A. A. Argyros. Towards Force Sensing from Vision: Observing Hand-Object Interactions to Infer Manipulation Forces. In CVPR, 2015.
  • [57] T.-H. Pham, N. Kyriazis, A. A. Argyros, and A. Kheddar. Hand-Object Contact Force Estimation from Markerless Visual Tracking. PAMI, 40(12):2883–2896, 2018.
  • [58] H. Pirsiavash and D. Ramanan. Detecting Activities of Daily Living in First-Person Camera Views. In CVPR, pages 2847–2854, 2012.
  • [59] C. Qian, X. Sun, Y. Wei, X. Tang, and J. Sun. Realtime and Robust Hand Tracking from Depth. In CVPR, 2014.
  • [60] M. Rad and V. Lepetit. BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects Without Using Depth. In ICCV, 2017.
  • [61] M. Rad, M. Oberweger, and V. Lepetit. Domain Transfer for 3D Pose Estimation from Color Images Without Manual Annotations. In ACCV, 2018.
  • [62] M. Rad, M. Oberweger, and V. Lepetit. Feature Mapping for Learning Fast and Accurate 3D Pose Inference from Synthetic Images. In CVPR, 2018.
  • [63] G. Rogez, M. Khademi, J. Supančič III, J. M. M. Montiel, and D. Ramanan. 3D Hand Pose Detection in Egocentric RGB-D Images. In ECCV, 2014.
  • [64] G. Rogez, J. S. Supancic, and D. Ramanan. Understanding Everyday Hands in Action from RGB-D Images. In ICCV, 2015.
  • [65] J. Romero, H. Kjellström, C. H. Ek, and D. Kragic. Non-Parametric Hand Pose Estimation with Object Context. Image and Vision Computing, 31(8):555–564, 2013.
  • [66] J. Romero, H. Kjellström, and D. Kragic. Hands in Action: Real-Time 3D Reconstruction of Hands in Interaction with Objects. In ICRA, 2010.
  • [67] J. Romero, D. Tzionas, and M. J. Black. Embodied Hands: Modeling and Capturing Hands and Bodies Together. TOG, 36(6):245, 2017.
  • [68] T. Sharp, C. Keskin, D. Robertson, J. Taylor, J. Shotton, D. Kim, C. Rhemann, I. Leichter, A. Vinnikov, Y. Wei, et al. Accurate, Robust, and Flexible Real-Time Hand Tracking. In CHI, pages 3633–3642, 2015.
  • [69] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR, abs/1409.1556, 2014.
  • [70] S. Singh, C. Arora, and C. Jawahar. First Person Action Recognition Using Deep Learned Descriptors. In CVPR, 2016.
  • [71] D. Song, N. Kyriazis, I. Oikonomidis, C. Papazov, A. Argyros, D. Burschka, and D. Kragic. Predicting Human Intention in Visual Observations of Hand/Object Interactions. In IEEE International Conference on Robotics and Automation, 2013.
  • [72] S. Sridhar, F. Mueller, M. Zollhoefer, D. Casas, A. Oulasvirta, and C. Theobalt. Real-Time Joint Tracking of a Hand Manipulating an Object from RGB-D Input. In ECCV, 2016.
  • [73] S. Sridhar, A. Oulasvirta, and C. Theobalt. Interactive Markerless Articulated Hand Motion Tracking Using RGB and Depth Data. In ICCV, 2013.
  • [74] X. Sun, Y. Wei, S. Liang, X. Tang, and J. Sun. Cascaded Hand Pose Regression. In CVPR, 2015.
  • [75] M. Sundermeyer, Z.-C. Marton, M. Durner, M. Brucker, and R. Triebel. Implicit 3D Orientation Learning for 6D Object Detection from RGB Images. In ECCV, 2018.
  • [76] D. Tang, H. J. Chang, A. Tejani, and T.-K. Kim. Latent Regression Forest: Structured Estimation of 3D Articulated Hand Posture. In CVPR, 2014.
  • [77] D. Tang, T.-H. Yu, and T.-K. Kim. Real-Time Articulated Hand Pose Estimation Using Semi-Supervised Transductive Regression Forests. In ICCV, pages 3224–3231, 2013.
  • [78] B. Tekin, S. N. Sinha, and P. Fua. Real-Time Seamless Single Shot 6D Object Pose Prediction. In CVPR, 2018.
  • [79] J. Tompson, M. Stein, Y. LeCun, and K. Perlin. Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks. TOG, 33, 2014.
  • [80] A. Tsoli and A. A. Argyros. Joint 3D Tracking of a Deformable Object in Interaction with a Hand. In ECCV, 2018.
  • [81] D. Tzionas, L. Ballan, A. Srikantha, P. Aponte, M. Pollefeys, and J. Gall. Capturing Hands in Action Using Discriminative Salient Points and Physics Simulation. IJCV, 118(2):172–193, 2016.
  • [82] D. Tzionas and J. Gall. 3D Object Reconstruction from Hand-Object Interactions. In ICCV, 2015.
  • [83] D. Tzionas, A. Srikantha, P. Aponte, and J. Gall. Capturing Hand Motion with an RGB-D Sensor, Fusing a Generative Model with Salient Points. In German Conference on Pattern Recognition, 2014.
  • [84] R. Wang, S. Paris, and J. Popović. 6D Hands: Markerless Hand-Tracking for Computer Aided Design. In ACM Symposium on User Interface Software and Technology, pages 549–558, 2011.
  • [85] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox. PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. In RSS, 2018.
  • [86] C. Xu, L. N. Govindarajan, Y. Zhang, and L. Cheng. Lie-X: Depth Image Based Articulated Object Pose Estimation, Tracking, and Action Recognition on Lie Groups. IJCV, 2016.
  • [87] Q. Ye, S. Yuan, and T.-K. Kim. Spatial Attention Deep Net with Partial PSO for Hierarchical Hybrid Hand Pose Estimation. In ECCV, 2016.
  • [88] S. Yuan, Q. Ye, B. Stenger, S. Jain, and T.-K. Kim. BigHand2.2M Benchmark: Hand Pose Data Set and State of the Art Analysis. In CVPR, 2017.
  • [89] H. Zhang and Q. Cao. Combined Holistic and Local Patches for Recovering 6D Object Pose. In ICCV, 2017.
  • [90] X. Zhou, Q. Wan, W. Zhang, X. Xue, and Y. Wei. Model-Based Deep Hand Pose Estimation. IJCAI, 2016.
  • [91] C. Zimmermann and T. Brox. Learning to Estimate 3D Hand Pose from Single RGB Images. In ICCV, 2017.