A Python package that provides evaluation and visualization tools for the DexYCB dataset
We introduce DexYCB, a new dataset for capturing hand grasping of objects. We first compare DexYCB with a related one through cross-dataset evaluation. We then present a thorough benchmark of state-of-the-art approaches on three relevant tasks: 2D object and keypoint detection, 6D object pose estimation, and 3D hand pose estimation. Finally, we evaluate a new robotics-relevant task: generating safe robot grasps in human-to-robot object handover. Dataset and code are available at https://dex-ycb.github.io.
3D object pose estimation and 3D hand pose estimation are two important yet unsolved vision problems. Traditionally, these two problems have been addressed separately, yet many critical applications need both capabilities working together [36, 6, 13]. For example, in robotics, reliable motion capture of hand manipulation of objects is crucial both for learning from human demonstration and for fluent and safe human-robot interaction.
State-of-the-art methods rely on deep learning and thus require large datasets with labeled hand or object poses for training. Many datasets [48, 49, 16, 17] have been introduced in both domains and have facilitated progress on these two problems in parallel. However, since they were introduced for either task separately, many of them do not contain interaction of hands and objects, i.e., static objects without humans in the scene, or bare hands not interacting with objects. In the presence of interactions, the challenge of solving the two tasks together not only doubles but multiplies, due to the motion of objects and the mutual occlusions incurred by the interaction. Networks trained on either kind of dataset will thus not generalize well to interaction scenarios.
Creating a dataset with accurate 3D pose of hands and objects is challenging for the same reasons. As a result, prior works have attempted to capture accurate hand motion with specialized gloves, magnetic sensors [48, 8], or marker-based mocap systems [3, 35]. While these can achieve unparalleled accuracy, the hand-attached devices may be intrusive and thus bias the naturalness of hand motion. They also change the appearance of hands and thus may cause issues with generalization.
Due to the challenge of acquiring real 3D poses, there has been increasing interest in using synthetic datasets to train pose estimation models. The success has been most notable in object pose estimation. Using 3D scanned object models and photorealistic rendering, prior work [16, 34, 38, 5, 17] has generated synthetic scenes of objects with high fidelity in appearance, so models trained only on synthetic data can transfer to real images. Nonetheless, synthesizing hand-object interactions remains challenging. One problem is synthesizing realistic grasp poses for generic objects. Furthermore, synthesizing natural-looking human motion is still an active research area in graphics.
In this paper, we focus on marker-less data collection of real hand interaction with objects. We take inspiration from recent work and build a multi-camera setup that records interactions synchronously from multiple views. Compared to that work, we instrument the setup with more cameras and configure them to capture a larger workspace that allows our human subjects to interact freely with objects. In addition, our pose labeling process utilizes human annotation rather than automatic labeling, and we crowdsource the annotation so that we can efficiently scale up data labeling. Given the setup, we construct a large-scale dataset that captures a simple yet ubiquitous task: grasping objects from a table. The dataset, DexYCB, consists of 582K RGB-D frames over 1,000 sequences of 10 subjects grasping 20 different objects from 8 views (Fig. 1).
Our contributions are threefold. First, we introduce a new dataset for capturing hand grasping of objects, and empirically demonstrate its strength over a related one through cross-dataset evaluation. Second, we present a thorough benchmark of state-of-the-art approaches on three relevant tasks: 2D object and keypoint detection, 6D object pose estimation, and 3D hand pose estimation. To the best of our knowledge, our dataset is the first that allows joint evaluation of these three tasks. Finally, we demonstrate the importance of joint hand and object pose estimation on a new robotics-relevant task: generating safe robot grasps for human-to-robot object handover.
In order to construct the dataset, we built a multi-camera setup for capturing human hands interacting with objects. A key design choice was to enable a sizable capture space in which a human subject can freely interact and perform tasks with multiple objects. Our multi-camera setup is shown in Fig. 2. We use 8 RGB-D cameras (RealSense D415) and mount them such that collectively they capture a tabletop workspace with minimal blind spots. The cameras are extrinsically calibrated and temporally synchronized. For data collection, we stream and record all 8 views together at 30 fps, with both color and depth.
Given the setup, we record videos of hands grasping objects. We use 20 objects from the YCB-Video dataset, and record multiple trials from 10 subjects. For each trial, we select a target object with 2 to 4 other objects and place them on the table. We ask the subject to start from a relaxed pose, pick up the target object, and hold it in the air. We also ask some subjects to pretend to hand over the object to someone across from them. We record for 3 seconds, which is sufficient to contain the full course of action. For each target object, we repeat the trial 5 times, each time with a random set of accompanying objects and placements. We ask the subject to perform the pick-up with the right hand in the first two trials, and with the left hand in the third and fourth trials. In the fifth trial, we randomize the choice. We rotate the target among all 20 objects. This gives us 100 trials per subject, and 1,000 trials in total for all subjects.
To acquire accurate ground-truth 3D pose for hands and objects, our approach (detailed in Sec. 2.3) relies on 2D keypoint annotations for hands and objects in each view. To ensure accuracy, we label the required keypoints in the RGB sequences fully through human annotation. Our annotation tool is based on VATIC for efficient annotation of videos. We set up annotation tasks on Amazon Mechanical Turk (MTurk) and label every view in all sequences.
For hands, we adopt 21 pre-defined hand joints as our keypoints (3 joints plus 1 tip for each finger and the wrist). We explicitly ask the annotators to label and track these joints throughout a given video sequence. The annotators are also asked to mark a keypoint as invisible in a given frame when it is occluded.
Pre-defining keypoints exhaustively for every object would be laborious and does not scale as the number of objects increases. Our approach (Sec. 2.3) explicitly addresses this issue by allowing user-defined keypoints. Specifically, given a video sequence in a particular view, we first ask the annotator to find 2 distinctive landmark points that are easily identified and trackable on a designated object, and we ask them to label and track these points throughout the sequence. We explicitly ask the annotators to find keypoints that are visible most of the time, and mark a keypoint as invisible whenever it is occluded.
To represent 3D hand pose, we use the popular MANO hand model. The model represents a right or left hand with a deformable triangular mesh of 778 vertices. The mesh is parameterized by two low-dimensional embeddings (θ, β), where θ accounts for variations in pose (i.e., articulation) and β for variations in shape. We use a PyTorch implementation that provides MANO as a differentiable layer mapping (θ, β) to the mesh together with the 3D positions of the 21 hand joints defined in the keypoint annotation. We pre-calibrate the hand shape β for each subject and fix it throughout that subject's sequences.
Since our objects from YCB-Video also come with texture-mapped 3D mesh models, we use the standard 6D pose representation [16, 17] for 3D object pose: the pose of each object is represented by a rigid transform [R|t], composed of a 3D rotation matrix R and a 3D translation vector t.
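For illustration (this is not the paper's code), the [R|t] representation is commonly packed into a 4×4 homogeneous matrix so that poses compose by matrix multiplication; a minimal numpy sketch with hypothetical helper names:

```python
import numpy as np

def to_homogeneous(R, t):
    """Pack a 3x3 rotation R and 3-vector translation t into a 4x4 pose."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def transform_points(T, pts):
    """Apply a 4x4 pose to an (N, 3) array of model points."""
    return pts @ T[:3, :3].T + T[:3, 3]
```

Transforming a model point into camera or world coordinates is then a single call to `transform_points`.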
To solve for hand and object pose, we formulate an optimization problem similar to [50, 11] by leveraging depth, keypoint annotations from all views, and multi-view geometry. For a given sequence with hands and objects, we denote the overall pose at a given time frame by q, comprising the MANO parameters of each hand and the 6D pose of each object. We define the pose in world coordinates, where the extrinsics of each camera are known. At each time frame, we solve for the pose by minimizing an energy function that combines a depth term, E_depth, and a keypoint term, E_kpt.
The depth term measures how well the models under the current pose explain the observed depth data. Let P be the total point cloud merged from all views after transforming to world coordinates, with |P| denoting the number of points. Given a pose q, we denote the collection of all hand and object meshes as M(q). We define the depth term as the average squared signed distance of the points in P from M(q),

E_depth(q) = (1/|P|) Σ_{p∈P} d(p, M(q))²,
where d(p, M) calculates the signed distance of a 3D point p from a triangular mesh M, in mm. While d is differentiable, computing it and its gradients is computationally expensive for large point clouds and meshes with a huge number of vertices. Therefore, we use an efficient point-parallel GPU implementation.
The keypoint term measures the reprojection error of the keypoints on the models against the annotated keypoints, and decomposes into a hand term and an object term: E_kpt = E_hand + E_obj.
For hands, let x_{i,j} be the 3D position of joint j of hand i in world coordinates, y_{i,j,c} the annotation of the same joint in the image coordinates of view c, and v_{i,j,c} ∈ {0, 1} its visibility indicator. The energy term sums the visibility-masked squared reprojection errors,

E_hand(q) = Σ_c Σ_i Σ_j v_{i,j,c} ||π_c(x_{i,j}) − y_{i,j,c}||²,

where π_c(·) returns the projection of a 3D point onto the image plane of view c.
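A minimal numpy sketch of this kind of visibility-masked reprojection energy, assuming a simple pinhole model with per-view intrinsics K and world-to-camera extrinsics (function names and interfaces are ours, not the paper's):

```python
import numpy as np

def project(K, X_cam):
    """Pinhole projection of (N, 3) camera-frame points with 3x3 intrinsics K."""
    uv = X_cam @ K.T                       # (N, 3) homogeneous pixels
    return uv[:, :2] / uv[:, 2:3]

def keypoint_energy(K_per_view, extrinsics, X_world, kp_2d, visible):
    """Visibility-masked squared reprojection error, summed over views.

    K_per_view: list of 3x3 intrinsics; extrinsics: list of 4x4 world->camera
    transforms; X_world: (J, 3) model joints; kp_2d: (C, J, 2) annotations;
    visible: (C, J) 0/1 mask."""
    e = 0.0
    for c, (K, T) in enumerate(zip(K_per_view, extrinsics)):
        X_cam = X_world @ T[:3, :3].T + T[:3, 3]
        err = project(K, X_cam) - kp_2d[c]
        e += np.sum(visible[c] * np.sum(err ** 2, axis=1))
    return e
```

Masking by visibility means occluded annotations contribute nothing to the objective, exactly as intended by the visibility indicator above.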
For objects, recall that we did not pre-define keypoints for annotation, but rather asked annotators to select distinctive points to track. Here, we assume an accurate initial pose is given at the first frame where an object's keypoint is labeled visible. We then map the selected keypoint to a vertex on the object's 3D model by back-projecting the keypoint's position onto the object's visible surface, and fix that mapping afterwards. Let z_{k,c} be the 3D position in world coordinates of the keypoint selected for object k in view c. Similar to Eq. (4), the energy term E_obj sums the visibility-masked squared reprojection errors of the mapped vertices over views.
To ensure an accurate initial pose for keypoint mapping, we initialize the pose in each time frame with the solved pose from the previous frame. We initialize the pose in the first frame by running PoseCNN on each view and manually selecting an accurate pose for each object.
To minimize Eq. (1), we use the Adam optimizer with a learning rate of 0.01. For each time frame, we initialize the pose with the solved pose from the last time frame and run the optimizer for 100 iterations.
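For illustration, a plain numpy Adam loop with the stated settings (learning rate 0.01, 100 iterations); this is a generic sketch of the optimizer's update rule on a toy objective, not the paper's implementation:

```python
import numpy as np

def adam_minimize(grad_fn, x0, lr=0.01, iters=100, b1=0.9, b2=0.999, eps=1e-8):
    """Run Adam for a fixed number of iterations, mirroring the settings
    in the text (lr 0.01, 100 iterations)."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)                   # first-moment estimate
    v = np.zeros_like(x)                   # second-moment estimate
    for t in range(1, iters + 1):
        g = grad_fn(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return x
```

Note that Adam's effective step size is bounded by roughly the learning rate per iteration, so with 100 iterations the warm start from the previous frame's pose matters: the optimizer only needs to track a small inter-frame motion.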
Most datasets address instance-level 6D object pose estimation, where 3D models are given a priori. The recent BOP challenge [16, 17] has curated a strong line-up of such datasets on which participants evaluate. Yet objects in these datasets are mostly static, without human interaction. A recent dataset was introduced for category-level 6D pose estimation, but its scenes are also static.
|dataset||visual modality||real image||marker-less||hand-obj||3D hand||3D obj||resolution||#frames||#sub||#obj||#views||motion||#seq||dynamic grasp||label|
|Rendered Hand Pose||RGB-D||–||–||320×320||44K||20||–||1||–||–||synthetic|
|GANerated Hands||RGB||–||✓||256×256||331K||–||–||1||–||–||synthetic|
|GRAB||–||–||✓||✓||✓||–||1,624K||10||51||–||✓||1,335||✓||mocap|
|Mueller et al.||depth||–||✓||–||640×480||80K||5||–||4||✓||11||–||synthetic|
|Simon et al.||RGB||✓||✓||✓||✓||1920×1080||15K||–||–||31||✓||–||✓||automatic|
We present a summary of related 3D hand pose datasets in Tab. 1. Some address pose estimation with depth only, e.g., BigHand2.2M. These datasets can be large but lack color images and capture only bare hands without interactions. Others address hand-hand interactions, e.g., Mueller et al. and InterHand2.6M. While not focused on interaction with objects, these datasets address another challenging scenario and are orthogonal to our work. Below we review datasets with hand-object interactions.
Synthetic data has been increasingly used for hand pose estimation [25, 49, 23, 14]. A common downside is the appearance gap to real images. To bridge this gap, GANerated Hands was introduced by translating synthetic images to real ones via GANs. Nonetheless, other challenges remain. One is synthesizing realistic grasp poses for objects; ObMan was introduced to address this challenge using heuristic metrics. Besides pose, it is also challenging to synthesize realistic motion. Consequently, these datasets offer only static images, not videos. Our dataset captures real videos with real grasp poses and motion. Furthermore, our real motion data can help bootstrap synthetic data generation.
FPHA captured hand interaction with objects in first-person view by attaching magnetic sensors to the hands. This offers a flexible and portable solution, but the attached sensors may hinder natural hand motion and also bias the hand appearance. ContactPose captured grasp poses by tracking objects with mocap markers and recovering hand pose from thermal imagery of the object's surface. While the dataset is large in scale, the thermal-based approach can only capture static grasp poses, not motions. GRAB captured full-body motion together with hand-object interactions using marker-based mocap. It provides high-fidelity capture of interactions but does not come with any visual modality. Our dataset is marker-less, captures dynamic grasp motions, and provides RGB-D sequences from multiple views.
Our DexYCB dataset falls in this category. The challenge is to acquire accurate 3D pose. Some datasets like Dexter+Object and EgoDexter rely on manual annotation and are thus limited in scale. To acquire 3D pose at scale, others rely on automatic or semi-automatic approaches [30, 50]. While these datasets capture hand-object interactions, they do not offer 3D pose for objects.
Most similar to ours is the recent HO-3D dataset. HO-3D also captures hands interacting with objects from multiple views and provides both 3D hand and object pose. Nonetheless, the two datasets differ both quantitatively and qualitatively. Quantitatively (Tab. 1), DexYCB captures interactions with more objects (20 versus 10), from more views (8 versus 1 to 5), and is one order of magnitude larger in number of frames (582K versus 78K, counting frames over all views) and number of sequences (1,000 versus 27, counting the same motion across different views as one sequence). Qualitatively (Fig. 8), we highlight three differences. First, DexYCB captures full grasping processes (i.e., from hand approaching, opening fingers, and contact, to holding the object stably) in short segments, whereas HO-3D captures longer sequences with the object in hand most of the time. Among the 27 sequences of HO-3D, we found 17 in which the hand is always rigidly attached to the object, only rotating it from the wrist with no finger articulation. Second, DexYCB captures a full tabletop workspace with 3D pose of all the objects on the table, whereas in HO-3D one object is held close to the camera each time and only the held object is labeled with pose. Finally, regarding annotation, DexYCB leverages keypoints fully labeled by humans through crowdsourcing, while HO-3D relies on a fully automatic labeling pipeline.
To further assess the merit of our DexYCB dataset, we perform cross-dataset evaluation following prior work [50, 29, 47]. We focus specifically on HO-3D due to its relevance, and evaluate generalizability between HO-3D and DexYCB on single-image 3D hand pose estimation. For DexYCB, we generate a train/val/test split following our benchmark setup ("S0" in Sec. 5.1). For HO-3D, we hold out 6 of its 55 training sequences for testing. We use Spurr et al. (winner of the HANDS 2019 Challenge) as our method, and train the model separately on three training sets: the training set of HO-3D, the training set of DexYCB, and the combined training set from both datasets. Finally, we evaluate the three trained models on the test sets of HO-3D and DexYCB.
Tab. 2 shows the results in mean per joint position error (MPJPE) under two different alignment methods (details in Sec. 5.4). We experiment with two different backbones: ResNet50 and HRNet32. Unsurprisingly, when training on a single dataset, the model generalizes best to the corresponding test set, and the error increases when tested on the other dataset. When we evaluate the DexYCB-trained model on HO-3D, we observe a consistent increase by a factor of 1.4 to 1.9 (e.g., for ResNet50, from 18.05 to 31.76 mm on root-relative MPJPE). However, when we evaluate the HO-3D-trained model on DexYCB, the increase is considerably larger, by a factor of 3.4 to 3.7 (e.g., for ResNet50, from 12.97 to 48.30 mm on root-relative MPJPE). This suggests that models trained on DexYCB generalize better than those trained on HO-3D. Furthermore, when we train on the combined training set and evaluate on HO-3D, we further reduce the error compared to training on HO-3D alone (e.g., from 18.05 to 15.79 mm). However, when tested on DexYCB, the error rises slightly compared to training only on DexYCB (e.g., from 12.97 to 13.36 mm). We conclude that DexYCB complements HO-3D better than vice versa.
(values: root-relative / Procrustes MPJPE in mm)
|test \ train||HO-3D||DexYCB||HO-3D + DexYCB|
|ResNet50:|
|HO-3D||18.05 / 10.66||31.76 / 15.23||15.79 / 9.51|
|DexYCB||48.30 / 24.23||12.97 / 7.18||13.36 / 7.27|
|HRNet32:|
|HO-3D||17.46 / 10.44||33.11 / 15.51||15.89 / 9.00|
|DexYCB||46.38 / 23.94||12.39 / 6.79||12.48 / 6.87|
We benchmark three tasks on our DexYCB dataset: 2D object and keypoint detection, 6D object pose estimation, and 3D hand pose estimation. For each task we select representative approaches and analyze their performance.
To evaluate different scenarios, we generate train/val/test splits in four different ways (referred to as “setup”):
S0 (default). The train split contains all 10 subjects, all 8 camera views, and all 20 grasped objects; only the sequences are disjoint from those in the val/test splits.
S1 (unseen subjects). The dataset is split by subjects (train/val/test: 7/1/2).
S2 (unseen views). The dataset is split by camera views (train/val/test: 6/1/1).
S3 (unseen grasping). The dataset is split by grasped objects (train/val/test: 15/2/3). Objects grasped in the test split are never grasped in the train/val splits, but may appear static on the table. This way the training set still contains examples of each object.
We evaluate object and keypoint detection using the COCO evaluation protocol. For object detection, we consider 20 object classes and 1 hand class. We collect ground truths by rendering a segmentation mask for each instance in each camera view. For keypoints, we consider 21 hand joints and collect the ground truths by reprojecting each 3D joint to each camera image.
Mask R-CNN has been the de facto standard for object and keypoint detection, and SOLOv2 is a state-of-the-art method for COCO instance segmentation. For both we use a ResNet50-FPN backbone pre-trained on ImageNet and finetuned on DexYCB.
Tab. 3 shows the results for object detection in average precision (AP). First, Mask R-CNN and SOLOv2 perform similarly in mean average precision (mAP). Mask R-CNN has a slight edge on bounding boxes (e.g., 75.76 versus 75.13 mAP on S0) while SOLOv2 has a slight edge on segmentation (e.g., 71.56 versus 69.58 mAP on S0). This is because Mask R-CNN predicts bounding boxes first and uses them to generate segmentations, so errors can accumulate in segmentation; SOLOv2, in contrast, directly predicts segmentations and derives bounding boxes from them. Second, the AP for the hand class is lower than for objects (e.g., for Mask R-CNN, 71.85 for hand versus 75.76 mAP on S0 bbox). This suggests that hands are more difficult to detect than objects, possibly due to their larger variability in position. Finally, performance varies across setups. For example, mAP on S1 is lower than on S0 (e.g., for Mask R-CNN, 72.69 versus 75.76 on bbox), which can be attributed to the challenge introduced by unseen subjects. Tab. 4 shows the AP for keypoint detection with Mask R-CNN; we observe a similar ordering of AP across setups.
We evaluate single-view 6D pose estimation following the BOP challenge protocol [16, 17]. The task asks for a 6D pose estimate for each object instance given the number of objects and instances in each image. The evaluation computes recall using three different pose-error functions (VSD, MSSD, MSPD), and the final score is the average of the three recall values (AR).
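As an illustration of one of these pose-error functions, here is a numpy sketch of MSSD (maximum symmetry-aware surface distance); the interface and symmetry handling are our simplification, not the official BOP toolkit:

```python
import numpy as np

def mssd(R_est, t_est, R_gt, t_gt, verts, sym_Rs=(np.eye(3),)):
    """Maximum Symmetry-aware Surface Distance: for each global symmetry S
    of the object, compose S with the ground-truth pose, take the maximum
    vertex displacement against the estimate, and report the minimum over
    symmetries. verts is an (N, 3) array of model vertices."""
    est = verts @ R_est.T + t_est
    best = np.inf
    for S in sym_Rs:
        gt = (verts @ S.T) @ R_gt.T + t_gt
        best = min(best, np.max(np.linalg.norm(est - gt, axis=1)))
    return best
```

Taking the minimum over the object's symmetry transformations is what prevents penalizing pose estimates that differ from the ground truth only by an indistinguishable symmetry (e.g., a 180° flip of a box).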
We first analyze the challenge brought by hand grasping by comparing static and grasped objects. We use PoseCNN (RGB) as a baseline and retrain the model on DexYCB. Tab. 5 shows the AR on S0 evaluated separately on the full test set, static objects only, and grasped objects only. Unsurprisingly, the AR drops significantly when only the grasped objects are considered, confirming the added challenge of object pose estimation under hand interaction. To focus on this regime, we evaluate only on the grasped objects in the remaining experiments.
We next evaluate on the four setups using the same baseline. Results are shown in Tab. 6 (left). On S1 (unseen subjects), we observe a drop in AR compared to S0 (e.g., 38.26 versus 41.65 on all), as in object detection. This can be attributed to the influence of interaction with unseen hands as well as different grasping styles. The S3 (unseen grasping) column shows the AR for the only three objects grasped in the test set (i.e., "009_gelatin_box", "021_bleach_cleanser", "036_wood_block"). Again, the AR is lower than on S0, and the drop is especially significant for smaller objects like "009_gelatin_box" (33.07 versus 46.62). Surprisingly, AR on S2 (unseen views) does not drop but rather increases slightly (e.g., 45.18 versus 41.65 on all). This suggests that our multi-camera setup covers the scene densely enough that training on certain views transfers well to the others.
Finally, we benchmark five representative approaches: PoseCNN, DeepIM, DOPE, PoseRBPF, and CosyPose (winner of the BOP challenge 2020). All of them take RGB input. For PoseCNN, we also include a variant with post-process depth refinement; for DeepIM and PoseRBPF, we also include their RGB-D variants. For CosyPose, we use its single-view version. For DeepIM and CosyPose, we initialize the pose from the output of PoseCNN (RGB). We retrain all the models on DexYCB except for DOPE and PoseRBPF, which are trained solely on synthetic images. For DOPE, we train models for an additional 7 objects besides the 5 provided. Tab. 6 (right) compares the results on the most challenging setup, S1 (unseen subjects). First, we see a clear edge for methods that also use depth over their respective RGB-only variants (e.g., for PoseCNN, from 38.26 to 43.27 AR on all). Second, refinement-based methods like DeepIM and CosyPose improve significantly upon their initial pose input (e.g., for DeepIM RGB-D, from 38.26 to 57.54 AR on all). Finally, CosyPose achieves the best overall AR among the RGB-based methods (57.43 AR on all).
The task is to estimate the 3D positions of 21 hand joints from a single image. We evaluate with two metrics: mean per joint position error (MPJPE) in mm and percentage of correct keypoints (PCK). Besides computing the error in absolute 3D position, we also report errors after aligning the predictions with the ground truths in post-processing [49, 50, 11]. We consider two alignment methods: root-relative and Procrustes. The former removes ambiguity in translation by replacing the root (wrist) location with the ground truth. The latter removes ambiguity in translation, rotation, and scale, thus focusing specifically on articulation. For PCK we report the area under the curve (AUC) over the error range with 100 steps.
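The metric and its two alignment variants can be sketched in numpy as follows (a generic MPJPE/Procrustes implementation under our own function names, not the benchmark's exact code):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error between (J, 3) arrays."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

def root_relative(joints, root=0):
    """Translate joints so the root (wrist) joint sits at the origin,
    removing the translation ambiguity."""
    return joints - joints[root]

def procrustes_align(pred, gt):
    """Align pred onto gt with the best similarity transform (scale,
    rotation, translation), removing all three ambiguities."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g
    U, s, Vt = np.linalg.svd(P.T @ G)
    d = np.sign(np.linalg.det(U @ Vt))      # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    scale = (s * [1.0, 1.0, d]).sum() / (P ** 2).sum()
    return scale * P @ R + mu_g
```

Procrustes error is always at most the root-relative error, which is why the aligned columns in the results tables shrink from left to right.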
We benchmark one RGB-based and one depth-based approach. For RGB, we select a supervised version of Spurr et al., which won the HANDS 2019 Challenge. We experiment with the original ResNet50 backbone as well as a stronger HRNet32 backbone, both initialized with ImageNet pre-trained models and finetuned on DexYCB. For depth, we select A2J and retrain the model on DexYCB.
(values: MPJPE in mm, PCK AUC in parentheses)
|setup||method||absolute||root-relative||Procrustes|
|S0||Spurr et al. + ResNet50||53.92 (0.307)||17.71 (0.683)||7.12 (0.858)|
|||Spurr et al. + HRNet32||52.26 (0.328)||17.34 (0.698)||6.83 (0.864)|
|||A2J||27.53 (0.612)||23.93 (0.588)||12.07 (0.760)|
|S1||Spurr et al. + ResNet50||70.23 (0.240)||22.71 (0.601)||8.43 (0.832)|
|||Spurr et al. + HRNet32||70.10 (0.248)||22.26 (0.615)||7.98 (0.841)|
|||A2J||29.09 (0.584)||25.57 (0.562)||12.95 (0.743)|
|S2||Spurr et al. + ResNet50||83.46 (0.208)||23.15 (0.566)||8.11 (0.838)|
|||Spurr et al. + HRNet32||80.63 (0.217)||25.49 (0.530)||8.21 (0.836)|
|||A2J||23.44 (0.576)||27.65 (0.540)||13.42 (0.733)|
|S3||Spurr et al. + ResNet50||58.69 (0.281)||19.41 (0.665)||7.56 (0.849)|
|||Spurr et al. + HRNet32||55.39 (0.311)||18.44 (0.686)||7.06 (0.859)|
|||A2J||30.99 (0.608)||24.92 (0.581)||12.15 (0.759)|
Tab. 7 shows the results. For the RGB-based method, we see that estimating absolute 3D position solely from RGB is difficult (e.g., 53.92 mm absolute MPJPE with ResNet50 on S0). The stronger HRNet32 backbone reduces the errors over ResNet50, but only marginally (e.g., 52.26 versus 53.92 mm absolute MPJPE on S0). As with object pose, the errors on S1 (unseen subjects) increase from S0 (e.g., 70.10 versus 52.26 mm absolute MPJPE for HRNet32) due to the impact of unseen hands and grasping styles. However, unlike the trend in Tab. 6 (left), the errors on S2 (unseen views) also increase pronouncedly, even surpassing the other setups (e.g., 80.63 mm absolute MPJPE for HRNet32). This is because, unlike the objects, the subjects always face the same direction from a fixed position, which further constrains the hand poses observed in each view and makes cross-view generalization more challenging. Unsurprisingly, A2J outperforms the RGB-based method on absolute MPJPE thanks to its depth input, but falls short on estimating articulation, as shown by the errors after alignment (e.g., 12.07 versus 6.83 mm Procrustes MPJPE for HRNet32 on S0).
Given an RGB-D image of a person holding an object, the goal is to generate a diverse set of robot grasps that take over the object without pinching the person's hand (which we refer to as "safe" handovers). Diversity of grasps is important since not all grasps are kinematically feasible to execute. We assume a parallel-jaw Franka Panda gripper and represent each grasp as a 6D pose in SE(3).
We first sample 100 grasps for each YCB object using farthest point sampling from a diverse set of grasps pre-generated for that object. This ensures a dense coverage of the pose space (Fig. 15). For each image, we transform these grasps from the object frame to the camera frame using the ground-truth object pose, and remove those that collide with the ground-truth object and hand meshes. This yields a reference set of successful grasps.
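Farthest point sampling can be sketched as follows; for simplicity this version operates on grasp translations only (the actual sampling is over full grasp poses), and the greedy seed is the first grasp rather than a random one:

```python
import numpy as np

def farthest_point_sample(pts, k, start=0):
    """Greedy farthest point sampling over an (N, d) array: repeatedly
    pick the point farthest from everything selected so far."""
    idx = [start]
    d = np.linalg.norm(pts - pts[start], axis=1)   # distance to selected set
    for _ in range(k - 1):
        nxt = int(np.argmax(d))
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(pts - pts[nxt], axis=1))
    return np.array(idx)
```

Because each new pick maximizes the distance to the already-selected set, the 100 sampled grasps spread out over the pose space instead of clustering.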
Given a set of predicted grasps, we evaluate its diversity by computing the coverage of the reference set, defined as the percentage of reference grasps having at least one matched predicted grasp that collides with neither the object nor the hand. Specifically, two grasps are considered matched if the distance between their translations and the angular distance between their orientations (represented as quaternions) are both below fixed thresholds.
One could potentially achieve high coverage by sampling grasps exhaustively. Therefore we also compute precision, defined as the percentage of predicted grasps that have at least one matched successful grasp in the reference set.
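A numpy sketch of the two metrics; the matching thresholds below are illustrative placeholders (not the paper's values), grasps are (translation, quaternion) pairs, and collision filtering is assumed to have already been applied to the predicted grasps:

```python
import numpy as np

def quat_angle(q1, q2):
    """Rotation angle (rad) between two unit quaternions."""
    return 2.0 * np.arccos(np.clip(abs(float(np.dot(q1, q2))), 0.0, 1.0))

def matched(g1, g2, tau_t, tau_q):
    """Two grasps match if both translation and orientation are close."""
    return (np.linalg.norm(g1[0] - g2[0]) <= tau_t
            and quat_angle(g1[1], g2[1]) <= tau_q)

def coverage_precision(pred, ref, tau_t=0.02, tau_q=np.radians(30)):
    """Coverage: fraction of reference grasps matched by some prediction.
    Precision: fraction of predictions matched by some reference grasp."""
    cov = np.mean([any(matched(r, p, tau_t, tau_q) for p in pred) for r in ref])
    prec = np.mean([any(matched(p, r, tau_t, tau_q) for r in ref) for p in pred])
    return cov, prec
```

Coverage rewards spreading predictions over the whole reference set, while precision penalizes exhaustive sampling, so the two metrics trade off against each other as in a precision-recall curve.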
We experiment with a simple baseline that only requires hand segmentation and 6D object pose. As when constructing the reference set, we transform the 100 grasps to the camera frame, but using the estimated object pose, and then remove those that collide with the hand point cloud obtained from the hand segmentation and the depth image. Specifically, a grasp is considered colliding if any pair of points from the gripper point cloud and the hand point cloud is closer than a threshold. The gripper point cloud is obtained from a set of points pre-sampled on the gripper surface. We use the hand segmentation results from Mask R-CNN (Sec. 5.2).
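This point-cloud collision test can be sketched as a brute-force pairwise distance check (the threshold value here is an illustrative placeholder, not the paper's):

```python
import numpy as np

def in_collision(gripper_pts, hand_pts, tau=0.005):
    """A grasp collides if any gripper point comes within tau (meters)
    of the hand point cloud. Both inputs are (N, 3) arrays."""
    d2 = np.sum((gripper_pts[:, None, :] - hand_pts[None, :, :]) ** 2, axis=-1)
    return bool(np.sqrt(d2.min()) < tau)
```

The broadcasted check is O(N·M) per grasp; for large clouds a k-d tree query over the hand points would serve the same purpose more efficiently.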
We evaluate grasps generated with different object pose methods at different thresholds and show the precision-coverage curves on S1 in Fig. 4. We see that better object pose estimation leads to better grasp generation. Fig. 15 shows qualitative examples of the predicted grasps. Most of the failure grasps (red and gray) are due to inaccurate object pose. Some are hand-colliding grasps caused by a missed hand detection when the hand is partially occluded by the object (e.g., "003_cracker_box"). This can potentially be addressed by model-based approaches that directly predict the full hand shape.
We have introduced DexYCB for capturing hand grasping of objects. We have shown its merits, presented a thorough benchmark of current approaches on three relevant tasks, and evaluated a new robotics-relevant task. We envision our dataset will drive progress on these crucial fronts.
PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. In RSS, 2018.
Here we provide details on optimizing the depth term when solving for 3D hand and object pose (Sec. 2.3). Below we first describe our efficient forward computation based on a point-parallel GPU implementation. We then show that the signed distance is differentiable and derive its gradient.
Recall that P denotes the entire point cloud, where p ∈ P is a single 3D point, and M is the triangular mesh of our model. Without loss of generality, we assume that the model contains a single rigid object. Recall that the depth term is defined as

E_depth(q) = (1/|P|) Σ_{p∈P} d(p, M(q))²,     (7)

where d(p, M) calculates the signed distance of a 3D point p from the mesh M. One naive way to compute this value is to exhaustively compute the distance from the query point to every triangular face of the mesh and take the minimum. To perform a more efficient computation, we leverage the bounding volume hierarchy (BVH) technique from graphics. Specifically, we build an axis-aligned bounding box tree (AABB tree) indexing the triangular faces of the mesh at the start of each forward computation of Eq. 7. The AABB tree allows us to efficiently traverse the triangular faces and compute distances from the query without visiting all faces exhaustively. Furthermore, since Eq. 7 repeats the same computation for every query point p ∈ P, we achieve further speedup by parallelizing over query points on the GPU.
|thumb tip||index tip||middle tip||ring tip||pinky tip|
|4.27 ± 3.42||4.99 ± 3.67||4.33 ± 3.53||4.03 ± 3.56||3.90 ± 3.45|
Now let us turn to the backward pass of Eq. 7, since we optimize with a gradient-based method. During the forward computation, for a query point p, once we find its closest point p̂ on the mesh, we represent p̂ using barycentric coordinates:

p̂ = w1 v1 + w2 v2 + w3 v3,     (8)

where v1, v2, v3 denote the three vertices of the triangular face on which p̂ lies, and w1 + w2 + w3 = 1 with wi ≥ 0. We can now write the signed distance as

d(p, M) = s ||p − p̂||,

where s ∈ {−1, +1} indicates whether p lies outside or inside the mesh. We can thus derive the gradient with respect to the closest point as

∂d/∂p̂ = −s (p − p̂) / ||p − p̂||.

Let v_k denote the k-th vertex of the mesh. Using the multivariable chain rule and the barycentric representation of p̂ from Eq. 8, we get

∂d/∂v_k = w_k ∂d/∂p̂ if v_k is a vertex of the face containing p̂, and 0 otherwise.

Since we can also compute ∂v_k/∂q for rigid objects in Eq. 11, we obtain the gradient of E_depth with respect to the pose q. In fact, we can compute this gradient as long as v_k(q) is differentiable. This holds true for the YCB objects (rigid) and also the MANO hand model (deformable); therefore we can optimize Eq. 7 with both our object and hand models. Like the forward computation, the backward pass is parallelized on the GPU.
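A numpy sketch of the interior case of this computation, with a finite-difference check of the vertex gradient; for simplicity it uses the unsigned distance (s = +1) and assumes the closest point falls strictly inside the triangle, so the barycentric weights can be held fixed at the optimum:

```python
import numpy as np

def closest_on_triangle_interior(p, v):
    """Barycentric weights of the point on triangle v (3x3, one vertex per
    row) closest to p, assuming the projection falls inside the triangle."""
    A = np.stack([v[1] - v[0], v[2] - v[0]], axis=1)   # 3x2 edge basis
    uv = np.linalg.lstsq(A, p - v[0], rcond=None)[0]
    return np.array([1.0 - uv[0] - uv[1], uv[0], uv[1]])

def distance_and_vertex_grad(p, v):
    """Unsigned distance d = ||p - p_hat|| with p_hat = sum_k w_k v_k,
    and its gradient w.r.t. each vertex: dd/dv_k = -w_k (p - p_hat) / d."""
    w = closest_on_triangle_interior(p, v)
    p_hat = w @ v
    diff = p - p_hat
    d = float(np.linalg.norm(diff))
    grads = -np.outer(w, diff / d)                     # row k = dd/dv_k
    return d, grads
```

Holding the weights fixed is justified by the envelope theorem: at the optimum, moving p̂ within the triangle plane does not change the distance to first order, so only the term through the fixed-weight combination survives.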
For a better visualization of the data and annotations, please see the supplementary video: https://youtu.be/Q4wyBaZeBw0.
To analyze the accuracy of the 3D annotations in DexYCB, we compute the reprojection error of ground-truth 3D keypoints against human-labeled 2D keypoints in each image. Tab. 8 reports the reprojection errors of the five finger tips. The mean errors are all below 5 pixels. As shown in Fig. 6, large errors are often produced by a fast-moving hand (top) and occasionally by ambiguous annotations (bottom: ring finger labeled as pinky).
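Concretely, given the camera intrinsics, the reported reprojection error amounts to a pinhole projection followed by a pixel-distance measurement; a minimal sketch with illustrative names:

```python
import numpy as np

def reprojection_error(K, points_3d, keypoints_2d):
    """Per-keypoint pixel error between 3D keypoints projected with the
    camera intrinsics K (3x3) and human-labeled 2D keypoints.
    points_3d: (N, 3) in the camera frame; keypoints_2d: (N, 2)."""
    proj = points_3d @ K.T                 # pinhole projection
    uv = proj[:, :2] / proj[:, 2:3]        # perspective divide
    return np.linalg.norm(uv - keypoints_2d, axis=1)
```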
DexYCB pushes hand datasets to an unprecedented scale in the diversity of grasps. To analyze the diversity of hand pose, we extract the PCA representation of the MANO hand model for each hand pose and visualize the distribution in 2D using the first two PCA coefficients. Fig. 7 compares the distribution of hand pose in DexYCB with HO-3D. While DexYCB increases the number of objects as well as the number of grasps per object over prior datasets, its main edge lies in the diversity of captured grasps, as shown by the distribution of hand pose. This diversity marks an essential challenge for pose estimation in hand-object interactions and for downstream applications like object handover to robots.
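MANO already parameterizes articulation in a PCA basis, so its first two coefficients can be read off directly; for a generic set of pose vectors, the same kind of 2D diversity projection can be sketched as follows (a generic sketch, not the paper's code):

```python
import numpy as np

def project_to_2d_pca(poses):
    # Project pose vectors (N, D) onto their first two principal
    # components, for a 2D scatter of pose diversity.
    X = poses - poses.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T
```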
The main focus of DexYCB is the diversity of grasps, not the diversity of backgrounds. Capturing hand-object interaction is already challenging in controlled settings. Besides, DexYCB is no worse than prior datasets in this respect: FreiHAND used a green-screen background, and while HO-3D used more natural backgrounds, its diversity is not much greater, since HO-3D contains only two background scenes (Fig. 8).
Tab. 9 shows the statistics of the four evaluation setups: S0 (default), S1 (unseen subjects), S2 (unseen views), and S3 (unseen grasping). The first row (all) represents the full dataset, and each sub-table below shows the statistics of one particular setup, divided into train/val/test splits. For each split, we list the number of subjects (“#sub”), objects (“#obj”), views (“#view”), sequences (“#seq”), and image samples (“#image”).
We also list the number of object annotations (“#obj anno”) in each split, including the full object set (“all”) and the subset with only the grasped objects (“grasped”). For 6D object pose estimation, we follow the BOP challenge and speed up the evaluation by evaluating only on subsampled keyframes, using a subsampling factor of 4. The column “6D eval” lists the number of object annotations in the keyframes of the test split. For object handover, the test images need to capture a person with an object in hand, ready for handover. We therefore use the last frame of each video as the test samples, since during capture the subjects are instructed to hand the object over to a person across from them at the end. The column “handover eval” lists the number of in-hand object annotations in the test split for the handover evaluation (this also equals the number of last frames over all videos, since each such image contains exactly one in-hand object).
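Both test-set constructions above amount to simple frame selection; a sketch with illustrative names (not the package's API):

```python
def select_eval_frames(video_frames, factor=4):
    """video_frames: dict mapping a video id to its ordered frame list.
    Returns BOP-style subsampled keyframes (every `factor`-th frame)
    for 6D pose evaluation, and the last frame per video for the
    handover evaluation."""
    keyframes = {v: f[::factor] for v, f in video_frames.items()}
    handover = {v: f[-1] for v, f in video_frames.items()}
    return keyframes, handover
```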
Finally, we list the number of hand annotations (“#hand anno”) in each split, including the full hand set (“all”) and the subsets with only right (“right”) and left (“left”) hands.
Fig. 11 shows qualitative results of 2D object detection and keypoint detection with Mask R-CNN (Detectron2) [15, 43] on the S0 setup. We highlight some failure examples in the last two rows. Many of the failure cases involve false object detections due to occlusion, either by other objects (e.g., “036_wood_block” in row 5 and column 2) or by the interacting hand (e.g., “021_bleach_cleanser” in row 6 and column 1). We also see inaccurately detected hand keypoints when the hand is interacting with objects (e.g., row 6 and column 4).
Fig. 12 shows qualitative results of 6D object pose estimation on the S1 setup. The first row shows the input RGB images, and the following rows show the estimated poses from each representative approach. We render the object models given the estimated poses and overlay them on a darkened input image. We can see the challenge when the object is severely occluded by the hand (e.g., the leftmost example of PoseCNN (RGB)). We also see that refinement-based approaches like DeepIM and CosyPose are able to improve upon their coarse input estimate (i.e., PoseCNN (RGB)) and generate more accurate final predictions (e.g., the second example from the left of DeepIM (RGB)).
(Methods: PoseCNN (RGB, + depth ref.), DeepIM (RGB, RGB-D), PoseRBPF (RGB, RGB-D), CosyPose (RGB).)
Fig. 13 shows qualitative results of 3D hand pose estimation using Spurr et al.'s method (HRNet32) on the S0 setup. The method is able to generate sensible articulated poses even under object occlusion. This is consistent with the quantitative results reported in Tab. 7, where the mean per-joint position error after Procrustes alignment is less than 1 cm (6.83 mm). As that table suggests, the major source of error comes from translation, rotation, and scale (the “Absolute” error) rather than articulation. Nonetheless, we still see errors in local articulation when objects are in close contact (e.g., the middle finger in row 2 and column 1) and when the fingers are largely occluded by the held object (e.g., the index finger in row 6 and column 4).
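For reference, the Procrustes alignment used to factor out absolute-pose error before measuring articulation error can be sketched with the Umeyama similarity alignment (a generic sketch, not the benchmark's actual code):

```python
import numpy as np

def procrustes_align(pred, gt):
    # Align pred (N, 3) onto gt (N, 3) with the best similarity
    # transform (scale, rotation, translation), via Umeyama's method.
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(G.T @ P)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt                                   # optimal rotation
    s = np.trace(np.diag(S) @ D) / (P ** 2).sum()    # optimal scale
    return s * P @ R.T + mu_g

def mpjpe(pred, gt):
    # Mean per-joint position error.
    return np.linalg.norm(pred - gt, axis=1).mean()
```

Reporting `mpjpe` before and after `procrustes_align` separates the absolute-pose error from the residual articulation error.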
For 6D object pose estimation, since Tab. 6 reports benchmark results only on S1, we now include the results on the other three setups. Tab. 10, 11, and 12 show the results in AR on S0, S2, and S3, respectively. Overall, we observe a similar trend as on S1 (see Sec. 5.3), with DeepIM (RGB-D) and CosyPose being the two most competitive approaches in estimation accuracy.
(Tabs. 11 and 12 list AR for PoseCNN (RGB, + depth ref.), DeepIM (RGB, RGB-D), PoseRBPF (RGB, RGB-D), and CosyPose (RGB) on S2 (unseen views) and S3 (unseen grasping), respectively.)
Fig. 14 visualizes the 100 grasps for each object used in our evaluation. These grasps are sampled from the pre-generated grasps for YCB objects in . Note that here we only show the grasps for 18 out of 20 objects in DexYCB, since the remaining two objects (“002_master_chef_can” and “036_wood_block”) do not have any feasible grasps using the gripper of choice (i.e., Franka Panda).
Fig. 15 shows additional qualitative results of the predicted grasps for human-to-robot object handover. Interestingly, larger objects like “003_cracker_box” (row 1 and column 1) are less tolerant to errors in pose estimation, since most successful grasps require the gripper to be fully open and barely fitting around the object. Therefore a slight error in the estimated pose will cause the gripper to collide with the object. At the same time, errors in object pose estimation may cause the gripper to miss the grasp, especially on smaller objects (e.g., gray grasps for “037_scissors” in row 4 and column 3). On the other hand, a hand that is missed or only partially detected due to occlusion may lead to a potential pinch by the gripper (e.g., red grasps for “061_foam_brick” in row 5 and column 3).
Since Fig. 4 shows the results of grasp generation only on S1, we now include the results on the other three setups. Fig. 9 shows the precision-coverage curves on S0, S2, and S3, respectively. Overall, as on S1, we see that more accurate object pose estimation leads to better performance in grasp generation.
It is worth noting that although DeepIM (RGB-D) and CosyPose achieve very close performance on 6D object pose (e.g., 57.54 versus 57.43 AR in Tab. 6), the latter performs significantly better on grasp generation (e.g., see Fig. 4). This is because DeepIM (RGB-D) maintains a competitive average recall by outperforming CosyPose on a smaller set of objects (e.g., 7 on S1) but with larger margins, whereas CosyPose shows an edge over DeepIM (RGB-D) on more objects (e.g., 13 on S1). This shows that higher performance on 6D pose metrics like AR does not necessarily translate to higher performance on downstream robotics tasks like object handover.
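The phenomenon can be illustrated with toy per-object AR numbers (hypothetical values, not taken from Tab. 6): two methods can tie on mean AR while one wins on far fewer objects.

```python
import numpy as np

def summarize(ar_a, ar_b):
    # Mean AR and per-object win counts for two methods.
    ar_a, ar_b = np.asarray(ar_a, float), np.asarray(ar_b, float)
    return (ar_a.mean(), ar_b.mean(),
            int((ar_a > ar_b).sum()), int((ar_b > ar_a).sum()))

# Toy example: A wins 1 object by a large margin, B wins 2 objects by
# small margins, yet the mean AR is identical.
mean_a, mean_b, wins_a, wins_b = summarize([90, 10, 10], [40, 35, 35])
```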
Since DexYCB was captured in a controlled lab environment with a constant background, models trained on the RGB images of DexYCB are not expected to generalize well to in-the-wild images. We tested a DexYCB-trained model on COCO images and observed the expected drop in performance. Fig. 10 shows qualitative examples of 2D hand and keypoint detection: the detector frequently produces false positives (top left), false negatives (top middle), and inaccurate keypoint detections (top right). Combining DexYCB with other in-the-wild datasets for training would be interesting follow-up work.