
DexTransfer: Real World Multi-fingered Dexterous Grasping with Minimal Human Demonstrations

by Zoey Qiuyu Chen, et al.

Teaching a multi-fingered dexterous robot to grasp objects in the real world has been a challenging problem due to its high-dimensional state and action space. We propose a robot-learning system that can take a small number of human demonstrations and learn to grasp unseen object poses given partially occluded observations. Our system leverages a small motion capture dataset and generates a large dataset of diverse, successful trajectories for a multi-fingered robot gripper. By adding domain randomization, we show that our dataset provides robust grasping trajectories that can be transferred to a policy learner. We train a dexterous grasping policy that takes the point clouds of the object as input and predicts continuous actions to grasp objects from different initial robot states. We evaluate the effectiveness of our system on a 22-DoF floating Allegro hand in simulation and a 23-DoF Allegro robot hand with a KUKA arm in the real world. The policy learned from our dataset can generalize well to unseen object poses in both simulation and the real world.





I Introduction

Since our world is largely designed for human hands, anthropomorphic robotic hands are likely to be an important element of robotic systems in human-centric environments. To build general-purpose dexterous manipulation systems, we require methods that can train policies that generalize widely with relatively little burden placed on a human supervisor. In this work, our goal is to build a system that leverages a small amount of human supervision to learn a robust controller for dexterous manipulation tasks, which can be reliably deployed on a real robot.

We propose a demonstration-guided data-augmentation system (Fig. 2) that generates a large dataset of diverse, successful trajectories for a robot gripper in simulation. Our data-augmentation pipeline combines motion retargeting with local gradient-free trajectory refinement and augmentation to obtain a large variety of successful data relatively cheaply. This data can be used to learn policies mapping from point clouds to actions, which can be transferred to real-world scenarios to grasp objects in novel poses.

While human demonstrations play a crucial role in training manipulation policies, devising a scalable method to collect them is still a key bottleneck. Prior work has collected demonstrations kinesthetically [45], through VR interfaces [35], or using motion capture (mocap) solutions [10, 15, 40, 33]. Some recent efforts have also explored collecting demonstrations by harvesting internet video [23]. Although there have been some innovations for parallel-jaw grippers [19, 38, 43], acquiring demonstrations for multi-fingered dexterous manipulators remains challenging. Consequently, the collected demonstration datasets are often limited in size. Our work leverages existing mocap datasets of human grasping [4] to guide the generation of a large, diverse grasping dataset for training dexterous manipulators.

II DexTransfer: A System for Real World Dexterous Grasping with Minimal Human Demonstrations

Fig. 2: Overview of the proposed framework DexTransfer. The mocap data is first retargeted to a dexterous gripper in simulation. This motion reference is then refined and augmented into a large and diverse set of successful trajectories, which are used to learn a policy that succeeds on unseen object poses and initial hand poses. The learned policy is eventually transferred to a real robot system.

Our work aims to build a system that can leverage a small number of human demonstrations to learn grasping policies for dexterous grippers that are robust across a variety of object poses in the real world. To accomplish this, our system consists of three major phases, (1) trajectory retargeting, (2) trajectory refinement, and (3) policy learning, performed in sequence to acquire dexterous grasping policies as illustrated in Figure 2. The system starts with human demonstrations, which are first retargeted to an actual robot gripper in simulation as described in Section II-A. Since the retargeted trajectories are not guaranteed to be dynamically successful, the system then generates trajectory refinements that lead to robust and successful grasps as described in Section II-B, along with performing directed trajectory augmentation to increase dataset size and diversity. Lastly, these successful grasp trajectories are used for policy learning directly from sensory observations in Section II-C.

II-A Trajectory Retargeting

Our system takes a set of human demonstrations provided via a hand motion capture system. Precisely, we assume a demonstration dataset in which each demonstration is a trajectory of 3D hand poses and 3D object poses. These demonstrations, when directly mapped to a robot hand using standard Inverse Kinematics [11], are typically not functional due to the mismatch between human and robot actuator shape and kinematics.

To address this issue, we follow DexPilot [13] and formulate the retargeting objective as a non-linear optimization problem with a general cost function that is able to retarget human data to various robot hands of different morphologies while preserving the original demonstration behaviors.


Eq. 1 is the same as the cost function defined in [13]: it matches the displacement vectors between hand joints of the robot and the human hand, scaled by the ratio between the Allegro hand and a human hand. To move the robot hand towards the object in a similar way as the demonstration, we introduce Eq. 2, which matches the displacements between the object center and each fingertip. We also add Eq. 3, which penalizes the minimum geodesic distance in SO(3) between the orientations of the human palm and the robot palm.
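The structure of such a retargeting objective can be sketched as follows. This is an illustrative toy, not the exact cost from [13]: `toy_fk` is a made-up stand-in for the Allegro hand's forward kinematics, and all weights and scales are invented for the example.

```python
import numpy as np

def toy_fk(q):
    # Made-up forward kinematics: the first 12 DoF pretend to give 4
    # fingertip positions; DoF 12 is a palm yaw. A real system would
    # evaluate the robot's actual kinematic chain.
    tips = q[:12].reshape(4, 3)
    c, s = np.cos(q[12]), np.sin(q[12])
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return tips, R

def geodesic_so3(R1, R2):
    # Minimum geodesic distance in SO(3) between two rotation matrices.
    cos_angle = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))

def retarget_cost(q, human_joint_disp, human_tip_to_obj, obj_center, R_human,
                  scale=1.6, w=(1.0, 1.0, 0.5)):
    tips, R_robot = toy_fk(q)
    # Eq. 1 analogue: match displacement vectors between hand joints,
    # scaled from human to robot hand size.
    robot_disp = tips - tips.mean(axis=0)
    e1 = np.sum((robot_disp - scale * human_joint_disp) ** 2)
    # Eq. 2 analogue: match displacements between object center and fingertips.
    e2 = np.sum(((tips - obj_center) - human_tip_to_obj) ** 2)
    # Eq. 3 analogue: penalize palm orientation mismatch in SO(3).
    e3 = geodesic_so3(R_robot, R_human) ** 2
    return w[0] * e1 + w[1] * e2 + w[2] * e3
```

In practice this kind of objective is minimized per frame with a nonlinear solver; the sketch only shows how the three terms combine.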

The output of retargeting is a set of demonstrations represented by the joint positions of the robot, with each trajectory consisting of finger positions, end-effector positions, and object poses/point clouds. These roughly capture the human trajectories while being consistent with the robot kinematics. We parameterize the action space as target joint positions and run an underlying PID controller, making it trivial to define the action at each step simply as the next joint position in the trajectory.
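With target joint positions as the action space, extracting actions from a retargeted trajectory reduces to shifting it by one step; a minimal sketch (the function name is ours):

```python
import numpy as np

def actions_from_trajectory(joint_positions):
    # With a PID controller tracking target joint positions, the action at
    # step t is simply the next joint position in the trajectory: a_t = q_{t+1}.
    q = np.asarray(joint_positions)
    return q[1:]  # the final state has no successor, hence no action

traj = np.array([[0.0, 0.1],
                 [0.1, 0.2],
                 [0.2, 0.3]])
actions = actions_from_trajectory(traj)  # actions[0] is commanded from traj[0]
```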

II-B Trajectory Refinement

Retargeted trajectories simply try to match kinematic poses without considering the dynamics of the world, and in particular do not account for contact forces. For contact-rich manipulation tasks like dexterous grasping, this can lead to catastrophic failures in trajectory execution. Trajectory refinement takes the (potentially unsuccessful) retargeted trajectories from Section II-A and refines them into a diverse set of successful trajectories on a variety of different objects. It consists of (1) generating a large set of nominal trajectories from retargeted trajectories via template matching, (2) perturbing and refining these nominal trajectories to be dynamically successful at task completion, and (3) augmenting refined trajectories to be diverse across object poses, configurations, and initial hand states. We describe each of these components in detail.

Template Matching for Nominal Trajectory Generation Typically we are only given a handful of demonstrations by a human supervisor. For the data-driven policy learning methods described in Section II-C this is hardly sufficient, and we need to generate a significantly larger dataset to learn general-purpose policies. The first insight we leverage is that for rigid bodies, when we apply a transformation to the object pose in SE(3), the same transformation can (in most cases) also be applied to the end-effector trajectory to yield a sensible trajectory for the new object pose. For each initial object pose encountered in a particular demonstration, we can estimate the rigid transform to the object pose in a different demonstration and use it to transform the first demonstration's end-effector trajectory into the second object's frame of reference, and vice versa.

This transformation provides a drastically larger number of effective trajectories over a variety of different object poses. (In our work, we restrict the transformations to be between a set of stable canonical poses, computed using trimesh, and apply after-the-fact rotations to further increase diversity.) Note that these trajectories do not have to be perfect or successful when executed in the environment, as they are refined in the following phase of the pipeline, but the large diversity of such “nominal” trajectories helps with policy learning in Section II-C. We denote the trajectories obtained after the template-matching phase as the nominal trajectories.
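The template-matching step can be sketched with homogeneous transforms; the function names are illustrative, and poses are 4x4 matrices:

```python
import numpy as np

def relative_transform(T_obj_a, T_obj_b):
    # Rigid transform mapping object pose a onto object pose b: T_b @ inv(T_a).
    return T_obj_b @ np.linalg.inv(T_obj_a)

def transform_trajectory(ee_poses, T):
    # Apply the same object-pose transform to every end-effector pose.
    return np.array([T @ P for P in ee_poses])

# Example: the object in demo B is demo A's object rotated 90 degrees about z.
Rz = np.array([[0.0, -1.0, 0.0, 0.0],
               [1.0,  0.0, 0.0, 0.0],
               [0.0,  0.0, 1.0, 0.0],
               [0.0,  0.0, 0.0, 1.0]])
T_a = np.eye(4)
T_b = Rz @ T_a
T = relative_transform(T_a, T_b)
new_traj = transform_trajectory([np.eye(4)], T)  # end effector follows the rotation
```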

Refinement via Correlated Sampling

Retargeted trajectories are not successful at dynamically solving the task, as they match the kinematics but not the dynamics of the desired behavior (including the desired contact forces). We found that directly replaying the actions of a trajectory open-loop typically results in close to zero successes. This suggests that to generate an appropriate dataset for supervised learning, we need to refine the original retargeted trajectories to be dynamically successful in simulation.

Our key insight here is that a simple technique based on perturbation with rejection sampling can be effective in refining trajectories retargeted from demonstrations. When considering how to refine trajectories, we perturb the motion in the neighborhood of retargeted trajectories so as to find dynamically successful trajectories that grasp and lift the object. The most naive approach would be to add independent Gaussian noise to the various actuators to create perturbations around the nominal trajectory in search of a more successful set of controls. However, in a high-dimensional control space such as that of multi-fingered dexterous grasping, this quickly becomes ineffective. Instead, we perform perturbations in a more directed “synergy” space, which allows for coordinated perturbations of several fingers so as to open and close them in a coordinated fashion, rather than simply perturbing joints independently.

Input: Retargeted trajectories
Output: Refined trajectories

  while more refined trajectories are needed do
     Sample a correlated perturbation in the synergy space
     Compute the perturbed action sequence
     Execute the open-loop action sequence to obtain a trajectory
     If the trajectory passes the stability and success checks, add it to the refined set
  end while
Algorithm 1 Correlated Sampling for Refinement

To be more precise, at every step of the trajectory the nominal control of a retargeted trajectory is perturbed by sampling and applying correlated perturbations. These correlated perturbations are sampled by first drawing a scalar parameter from a uniform distribution, and then using this parameter to choose a per-joint perturbation that interpolates between minimum and maximum perturbation values for each joint, which is then added to the nominal control to obtain the perturbed action. This sampling scheme allows for correlated perturbation of the fingers, “opening” and “closing” them coherently, rather than doing uniformly random exploration independently per joint, which makes refinement significantly less effective.

To generate successfully refined trajectories, we generate perturbations in the control as described above, simulate trajectories via standard forward simulation, and then reject trajectories that do not meet the desired stability criteria (the object does not fall under perturbation and randomization of parameters like mass and friction, and is lifted high off the table) and success criteria (the object is grasped and lifted). Our choice of rejection condition during rejection sampling allows us to select for the most robust behaviors, as described in Algorithm 1.
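Algorithm 1 can be sketched as follows; `simulate` and `passes_checks` stand in for the physics simulator and the stability/success criteria, and the perturbation bounds are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def correlated_perturbation(lo, hi):
    # Draw a single scalar u ~ U(0, 1) and interpolate the per-joint bounds
    # with it, so fingers open/close together ("synergy" space) instead of
    # each joint being perturbed independently.
    u = rng.uniform()
    return lo + u * (hi - lo)

def refine(nominal_actions, simulate, passes_checks, lo, hi,
           n_needed=5, max_tries=1000):
    refined = []
    for _ in range(max_tries):
        if len(refined) >= n_needed:
            break
        perturbed = nominal_actions + correlated_perturbation(lo, hi)
        traj = simulate(perturbed)      # open-loop rollout in simulation
        if passes_checks(traj):         # rejection sampling
            refined.append(perturbed)
    return refined
```

A fresh perturbation is drawn per candidate trajectory here; drawing one per time step, as the text describes, is a one-line change inside the loop.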

Object-Centric Augmentation. While refinement yields dynamically consistent trajectories that successfully apply contact forces to grasp objects, it still covers only a somewhat narrow diversity of object, hand, and arm poses, since it performs very local refinements of nominal trajectories. Given that most household objects have multiple plausible grasp positions that differ simply by a translation offset, we can perform a simple augmentation scheme: propose random translation offsets to refined trajectories and retain those translationally perturbed trajectories that still successfully accomplish the grasping task. This allows us to synthesize a larger dataset of different grasping trajectories beyond the human demonstrations.

While refinement and augmentation do generate a variety of different grasping trajectories, the number of distinct initial positions of the end effector is still somewhat minimal. This makes policy generalization to novel end-effector positions challenging. Given the nature of the grasping problem, we can synthetically generate a much larger dataset of trajectories with varying initial hand poses: we generate a variety of initial hand positions and interpolate in free space to the closest trajectory amongst the existing human-provided demonstrations, from where that trajectory can subsequently be executed. We refer to this step as data funneling.
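Data funneling can be sketched as straight-line interpolation from a sampled initial hand pose to the start of the nearest existing trajectory; the straight-line motion and the function name are simplifying assumptions of this sketch:

```python
import numpy as np

def funnel(init_pose, trajectories, n_steps=10):
    # Find the trajectory whose first pose is closest to the sampled
    # initial hand pose, interpolate to it in free space, then append it.
    starts = np.array([t[0] for t in trajectories])
    nearest = int(np.argmin(np.linalg.norm(starts - init_pose, axis=1)))
    alphas = np.linspace(0.0, 1.0, n_steps, endpoint=False)
    approach = np.array([(1 - a) * init_pose + a * starts[nearest]
                         for a in alphas])
    return np.vstack([approach, trajectories[nearest]])
```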

Given the combination of template matching, refinement with correlated sampling, and object-centric augmentation with funneling, we finally arrive at a diverse dataset of dynamically successful trajectories that apply appropriate contact forces and grasp from a variety of configurations. This is then used for policy learning as described below.

master chef can    cracker box      sugar box tomato soup can   mustard bottle    tuna fish can
Human Demo 26 25 25 25 25 28
Normalized Trajectory 78 141 158 56 75 122
Refined Trajectory 182 100 313 146 82 200
Augmented Trajectory 286 462 1018 294 46 139
   pudding box     gelatin box potted meat can    wood block    foam brick        bowl
Human Demo 23 27 23 26 26 27
Normalized Trajectory 87 118 128 102 104 79
Refined Trajectory 40 52 96 55 79 25
Augmented Trajectory 110 248 713 846 380 179
        mug  bleach cleaner    power drill     large marker     pitcher base        sum
Human Demo 23 23 23 26 25 426
Normalized Trajectory 101 56 64 78 74 1621
Refined Trajectory 122 102 45 43 34 2566
Augmented Trajectory 109 181 123 9 38 5181
TABLE I: Allegro Hand Dataset across 17 YCB objects. We transform a small set of mocap data into a large dataset of successful trajectories.

II-C Policy Learning from Point Cloud Observations

Fig. 3: Network architecture for the point-cloud-conditioned policy. The Scene Encoder consists of three PointNet++ SA modules followed by two fully-connected layers. The Kinematics Encoder consists of three residual modules. The Fusion Layer takes the concatenated features and feeds them into one linear layer followed by two residual modules. The network has three branches to predict palm translation, rotation, and joint angles; each branch consists of three residual modules.

Given a robust and diverse dataset of robot trajectories, we train a policy to directly predict the action to execute from the current observation, via a standard maximum-likelihood supervised learning objective. Our network consists of a Scene Encoder and a Kinematics Encoder. The Scene Encoder is based on PointNet++ [32]; it takes point clouds of the object as the current observation and outputs scene features. The Kinematics Encoder takes the past 5 frames of motion (to deal with partial observability and variable timing), each including the robot finger joints, keypoints, and object-center shift, and outputs kinematics features. The scene and kinematics features are then concatenated and fed into the Fusion Layer to compute combined features for the current observation. The network splits into three branches to predict translation (3-dim), rotation (3-dim), and finger joints (N-dim, N=16 for Allegro). We found that representing observations and actions relative to the robot palm coordinate frame is crucial for generalizing to unseen object poses; thus all observations and keypoint positions are represented in the current palm coordinates. The predictions of the network are used to compute a loss, minimized over the policy parameters via standard stochastic gradient methods, in which the rotation term is the minimum geodesic distance in SO(3).
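The data flow through the network can be sketched with random-weight stand-ins for the learned modules; `mlp` here replaces both the PointNet++ SA modules and the residual blocks, so this only illustrates shapes and routing, not the actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, sizes):
    # Random-weight tanh MLP: a stand-in for the learned layers.
    for n_out in sizes:
        W = rng.standard_normal((x.shape[-1], n_out)) * 0.1
        x = np.tanh(x @ W)
    return x

def policy(points, history, n_joints=16):
    # Scene encoder: per-point features max-pooled over the cloud
    # (a crude PointNet-style stand-in for the PointNet++ SA modules).
    scene_feat = mlp(points, [64, 128]).max(axis=0)
    # Kinematics encoder over the flattened 5-frame motion history
    # (finger joints, keypoints, object-center shift), in palm coordinates.
    kin_feat = mlp(history.reshape(-1), [64, 64])
    fused = mlp(np.concatenate([scene_feat, kin_feat]), [128])
    # Three branches: palm translation (3), rotation (3), finger joints (N).
    return mlp(fused, [3]), mlp(fused, [3]), mlp(fused, [n_joints])

trans, rot, joints = policy(rng.standard_normal((512, 3)),
                            rng.standard_normal((5, 22)))
```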

III Experiments

In this section, we aim to answer the following questions:

  1. Can DexTransfer enable us to leverage a small number of human demonstrations to obtain a large number of dynamically successful trajectories in sim across a variety of objects?

  2. Can we understand which elements of the proposed system are most crucial to enabling robust dexterous grasping performance?

  3. Do the resulting policies transfer to the real world on an Allegro robot hand?

We show video experiments for both simulation and the real world in the supplementary material, and further results can be found on the supplementary website.

III-A Dataset Generation in Simulation

We extracted human demonstrations from the DexYCB dataset [4], a dataset with 1,000 mocap sequences of humans picking up YCB objects [3] from a table. Each sequence contains a single human subject picking up one specific object with a single hand. The hand pose is captured using the MANO hand model [36]. Given the raw dataset, we curated a subset with 17 objects picked up by a human with the right hand, resulting in a total of 426 demonstration sequences with varying object configurations. Table I reports the accumulated number of trajectories after each processing stage for the Allegro hand, in a per-object breakdown. Despite mocap tracking error and the kinematic differences between the robot and the human demonstrator, DexTransfer is able to generate more than tenfold the amount of successful data for the Allegro robot.

Fig. 4: Qualitative results of policies on unseen poses of various objects in both simulation and real world. We can see that the hand approaches the objects from a variety of approach angles and grasp poses and can show interesting grasping strategies.

III-B Policy Learning on Augmented Dataset

In this section, we show how we can leverage the augmented dataset to learn policies that perform well at dexterously grasping objects in unseen poses and configurations. In particular, we follow the procedure described in Section II-C and learn a per-object policy that is able to dexterously grasp a particular object in a variety of different poses. At test time, we randomize object 6D poses and the initial positions of the robot, and evaluate each policy 20 times on its object. Similar to the setup in [39], if the robot drops the object, we let it retry, additionally capturing the rate of success within three attempts; a test episode ends as soon as a success is reached. Success is defined as the object being lifted more than 10 cm above the table. In Table II, we show the success rate of the robot after each attempt. The policy is able to recover and succeed in later attempts because it is trained in a closed-loop fashion. All experiments are trained with the same batch size and the Adam optimizer with the same learning rate. We use NVIDIA Apex [7] for mixed-precision training throughout our experiments, and pyrender [26] for fast online depth rendering during training. As Table II (DexTransfer (ours)) shows, the policy is able to grasp most objects in unseen poses and configurations with a high success rate.

Overall, these results indicate that the augmented dataset, combined with policy learning via supervised learning, allows us to acquire dexterous grasping behavior that generalizes both across unseen poses and across different objects. We additionally show qualitative results of the Allegro hand grasping objects in different poses at different frames in Fig. 4.

III-C Understanding Different Elements of the Refinement Pipeline

master chef can    cracker box      sugar box tomato soup can   mustard bottle    tuna fish can
  1     2     3   1     2     3   1     2     3   1     2     3   1     2     3   1     2     3
Heuristic 0.00  0.00  0.00 0.00  0.00  0.00 0.25  0.30  0.35 0.05  0.05  0.05 0.00  0.00  0.00 0.20  0.20  0.20
Nearest Neighbor 0.20  0.20  0.20 0.10  0.10  0.10 0.10  0.10  0.10 0.10  0.10  0.10 0.25  0.25  0.25 0.25  0.25  0.25
No Funneling 0.05  0.05  0.05 0.05  0.05  0.05 0.10  0.15  0.15 0.00  0.00  0.00 0.05  0.05  0.05 0.05  0.10  0.10
No Randomization 0.65  0.80  0.80 0.25  0.25  0.25 0.35  0.35  0.35 0.25  0.25  0.25 0.50  0.55  0.55 0.40  0.55  0.55
No Augmentation 0.65  0.65  0.65 0.20  0.20  0.20 0.60  0.65  0.65 0.55  0.60  0.60 0.50  0.50  0.50 0.40  0.50  0.55
DexTransfer (ours) 0.70  0.80  0.80 0.25  0.30  0.35 0.65  0.70  0.80 0.70  0.85  0.85 0.75  0.85  0.85 0.70  0.80  0.85
   pudding box     gelatin box potted meat can    wood block    foam brick        bowl
  1     2     3   1     2     3   1     2     3   1     2     3   1     2     3   1     2     3
Heuristic 0.20  0.20  0.20 0.05  0.05  0.05 0.30  0.30  0.30 0.00  0.00  0.05 0.15  0.15  0.15 0.00  0.00  0.00
Nearest Neighbor 0.20  0.25  0.25 0.05  0.05  0.05 0.10  0.10  0.10 0.10  0.10  0.10 0.25  0.25  0.25 0.25  0.25  0.25
No Funneling 0.05  0.05  0.05 0.00  0.00  0.00 0.05  0.05  0.05 0.00  0.05  0.05 0.00  0.00  0.00 0.25  0.25  0.25
No Randomization 0.30  0.40  0.45 0.15  0.20  0.25 0.55  0.60  0.60 0.20  0.20  0.25 0.20  0.30  0.30 0.35  0.40  0.50
No Augmentation 0.50  0.75  0.80 0.40  0.50  0.50 0.75  0.80  0.80 0.60  0.75  0.75 0.60  0.75  0.75 0.65  0.75  0.75
DexTransfer (ours) 0.40  0.55  0.70 0.50  0.70  0.80 0.80  0.90  0.90 0.55  0.75  0.75 0.80  0.80  0.85 0.85  0.90  0.95
        mug  bleach cleaner    power drill     large marker     pitcher base       average
  1     2     3   1     2     3   1     2     3   1     2     3   1     2     3   1     2     3
Heuristic 0.10  0.15  0.15 0.00  0.00  0.00 0.50  0.55  0.55 0.10  0.10  0.10 0.00  0.00  0.00 0.11  0.12  0.13
Nearest Neighbor 0.15  0.15  0.15 0.05  0.05  0.05 0.10  0.10  0.10 0.15  0.20  0.20 0.00  0.00  0.00 0.14  0.15  0.15
No Funneling 0.05  0.25  0.30 0.00  0.05  0.05 0.00  0.00  0.05 0.00  0.00  0.00 0.00  0.00  0.00 0.04  0.07  0.07
No Randomization 0.50  0.65  0.65 0.20  0.25  0.25 0.25  0.25  0.25 0.40  0.40  0.40 0.05  0.05  0.05 0.33  0.38  0.40
No Augmentation 0.50  0.65  0.70 0.10  0.20  0.25 0.15  0.20  0.25 0.90  0.90  0.90 0.05  0.05  0.05 0.47  0.55  0.57
DexTransfer (ours) 0.85  0.85  0.85 0.40  0.50  0.50 0.40  0.45  0.45 0.90  0.90  0.90 0.25  0.35  0.40 0.61  0.70  0.74
TABLE II: Ablation Study in Simulation

In this section, we aim to understand which elements of our system are crucial to performance by evaluating the system as each component (domain randomization, augmentation, and data funneling) is removed. In addition, we compare our approach with a heuristic-based approach and a nearest-neighbor approach. The results are shown in Table II and described below.

Policy without Domain Randomization We show that adding domain randomization to the dataset significantly improves the policy's ability to handle covariate shift at inference time. The key insight is that a single trajectory could be successful by chance, without any guarantee of robustness to noise and different physics parameters. We observe a large performance drop without domain randomization when evaluating policies on the 17 objects, as shown in the "No Randomization" row of Table II.

Policy Without Augmentation We found that without augmentation, the policy performs similarly on smaller objects but worse on bigger objects like the bleach cleaner, power drill, and pitcher base. On average, performance drops by more than 10 points (Table II).

Policy Without Data Funneling Without data funneling, the policy fails almost completely to grasp unseen poses from a novel initial robot state, because the refined trajectories alone cannot cover the entire space of initial states. Without data funneling, the average success rate is below 10% (Table II).

Heuristic Approach We compare our approach with a heuristic approach that drives the robot along a linear path to a fixed distance from the center of the object point cloud, and grasps the object using pre-defined finger poses. This policy only achieves an 11% success rate, showing that learning to adjust both the position and the orientation of the robot given the observation of the scene is crucial for success.

Nearest Neighbor To illustrate the difference between the training and test sets, we run the following baseline: for each of the 20 test samples, we find the closest sample in the training set and execute its corresponding actions. We observe only a 14% success rate on average, suggesting that our policy learns to adapt to novel poses given the observation of the object.

III-D Real Robot Experiments

Lastly, we test policy transfer to the real world on a physical robot for a subset of the trained policies.

Fig. 5: Real experiments with a 23-DoF Kuka Allegro robot tested on 5 objects. Each object is evaluated 25 times.

Robot System We deploy our policies to a robotic platform that has 23 actuators across a KUKA LBR iiwa 7 R800 robot arm and a Wonik Robotics Allegro robotic hand. We place three RGB-D cameras around the robot to provide necessary point cloud information. Specifically, unseen object instance segmentation [42] is applied to the image data to extract a segmented point cloud of a single object on the table, which is used as input to the trained policies.

Control The policies require additional control machinery for real-world deployment. First, the policy generates palm pose targets, an elevated action space for the robot. These pose targets are passed to an underlying, manually derived geometric fabrics policy that generates high-frequency joint position, velocity, and acceleration targets across all joints of the arm. Geometric fabrics is a provably stable, second-order policy that reaches end-effector targets while resolving arm redundancy, controlling manipulator posture, avoiding joint limits and joint speed limits, avoiding robot self-collision, and avoiding excessive collision between the robot and the table. The design and tuning of this policy are exactly the same as reported in [41].

Finally, the target joint positions generated by fabrics at 30 Hz are upsampled to 1000 Hz via polynomial interpolation and fed to an underlying gravity-compensated joint PD controller. The trained policies also generate finger joint position commands at 6 Hz, which are fed directly to a PD controller that drives the joints of the Allegro hand. Altogether, the actions of the trained policies are ultimately converted into target joint drive torques that generate motion on the physical robot.
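The rate conversion from low-rate targets to the high-rate joint loop can be sketched with simple linear interpolation (the real system uses polynomial interpolation; the function name and example rates are illustrative):

```python
import numpy as np

def upsample_targets(targets, src_hz, dst_hz):
    # Interpolate low-rate joint position targets (shape T x J) onto the
    # high-rate controller's time grid, one joint column at a time.
    targets = np.asarray(targets)
    t_src = np.arange(len(targets)) / src_hz
    t_dst = np.arange(0.0, t_src[-1] + 1e-9, 1.0 / dst_hz)
    cols = [np.interp(t_dst, t_src, targets[:, j])
            for j in range(targets.shape[1])]
    return np.stack(cols, axis=1)

# e.g. two consecutive 30 Hz targets for a 2-joint system -> 1000 Hz targets
coarse = np.array([[0.0, 1.0], [0.3, 0.7]])
fine = upsample_targets(coarse, src_hz=30, dst_hz=1000)
```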

Evaluation We tested policies on a subset of objects: master chef can, potted meat can, mug, cracker box, and wood block. In the real world, the wood block is too heavy for the Allegro hand to lift, so we replace it with a foam block of the same dimensions. To reduce reflections from the object surfaces and obtain more reliable segmentation masks, we wrap non-sticky tape around the cracker box and the potted meat can. We test our policies 25 times on each object and report success rates in Figure 5; qualitative results are shown in Fig. 4. While the success rate in the real world is lower than in simulation, the policies complete the task reliably on every object, in many cases with a success rate approaching the simulation results.

IV Conclusions

In this work, we proposed a new system for learning to grasp various objects with a multi-fingered dexterous manipulator, leveraging a small number of human-provided demonstrations. Our system retargets the provided demonstrations into simulation, applies targeted data augmentation and refinement, and then applies large-scale supervised learning to learn control policies that operate directly on point-cloud inputs. These policies, when trained with sufficient augmentation, transfer to the real world and can be used to grasp various objects with a high success rate. We show the efficacy of the system in both simulation and the real world, with careful analysis of the impact of the system's design decisions.

There are several directions for future work. For instance, it would be interesting to go beyond grasping to more dexterous problems like in-hand manipulation, continuing to leverage the same general system for simulation-driven data augmentation. It would also be interesting to pretrain policies in the simulator and then finetune them directly in the real world in new domains. Additionally, exploring more sophisticated refinement algorithms such as CMA-ES [14] in place of rejection sampling would be a worthwhile exercise.


  • [1] Y. Bai and C. K. Liu (2014) Dexterous manipulation using both palm and fingers. In ICRA, Cited by: Appendix A, Appendix A.
  • [2] S. Brahmbhatt, A. Handa, J. Hays, and D. Fox (2019) ContactGrasp: functional multi-finger grasp synthesis from contact. In IROS, Cited by: Appendix A.
  • [3] B. Calli, A. Singh, A. Walsman, S. Srinivasa, P. Abbeel, and A. M. Dollar (2015) The YCB object and model set: towards common benchmarks for manipulation research. In 2015 International Conference on Advanced Robotics (ICAR), Cited by: §III-A.
  • [4] Y. Chao, W. Yang, Y. Xiang, P. Molchanov, A. Handa, J. Tremblay, Y. S. Narang, K. Van Wyk, U. Iqbal, S. Birchfield, J. Kautz, and D. Fox (2021) DexYCB: a benchmark for capturing hand grasping of objects. In CVPR, Cited by: Appendix A, §I, §III-A.
  • [5] T. Chen, J. Xu, and P. Agrawal (2021) A system for general in-hand object re-orientation. In CoRL, Cited by: Appendix A, Appendix A.
  • [6] S. Christen, M. Kocabas, E. Aksan, J. Hwangbo, J. Song, and O. Hilliges (2021) D-Grasp: physically plausible dynamic grasp synthesis for hand-object interactions. arXiv preprint arXiv:2112.03028. Cited by: Appendix A.
  • [7] NVIDIA Apex: a PyTorch extension with tools for easy mixed precision and distributed training in PyTorch. Cited by: §III-B.
  • [8] E. Coumans and Y. Bai (2016–2021) PyBullet: a Python module for physics simulation for games, robotics and machine learning. Cited by: §C-A.
  • [9] G. Garcia-Hernando, E. Johns, and T. Kim (2020) Physics-based dexterous manipulations with estimated hand poses and residual reinforcement learning. In IROS, Cited by: Appendix A.
  • [10] G. Garcia-Hernando, S. Yuan, S. Baek, and T. Kim (2018) First-person hand action benchmark with RGB-D videos and 3D hand pose annotations. In CVPR, Cited by: Appendix A, §I.
  • [11] M. Gleicher (1998) Retargetting motion to new characters. In SIGGRAPH, Cited by: §II-A.
  • [12] A. Gupta, C. Eppner, S. Levine, and P. Abbeel (2016) Learning dexterous manipulation for a soft robotic hand from human demonstrations. In IROS, Cited by: Appendix A, Appendix A.
  • [13] A. Handa, K. Van Wyk, W. Yang, J. Liang, Y. Chao, Q. Wan, S. Birchfield, N. Ratliff, and D. Fox (2020) DexPilot: vision-based teleoperation of dexterous robotic hand-arm system. In ICRA, Cited by: Appendix A, §II-A, §II-A.
  • [14] N. Hansen (2016) The CMA evolution strategy: A tutorial. CoRR abs/1604.00772. External Links: Link, 1604.00772 Cited by: §IV.
  • [15] Y. Hasson, G. Varol, D. Tzionas, I. Kalevatykh, M. J. Black, I. Laptev, and C. Schmid (2019) Learning joint reconstruction of hands and manipulated objects. In CVPR, Cited by: Appendix A, §I.
  • [16] D. Jain, A. Li, S. Li, A. Rajeswaran, V. Kumar, and E. Todorov (2019) Learning deep visuomotor policies for dexterous hand manipulation. In ICRA, Cited by: Appendix A, Appendix A.
  • [17] R. Jeong, J. T. Springenberg, J. Kay, D. Zheng, A. Galashov, N. Heess, and F. Nori (2020) Learning dexterous manipulation from suboptimal experts. In CoRL, Cited by: Appendix A, Appendix A.
  • [18] K. Karunratanakul, J. Yang, Y. Zhang, M. J. Black, K. Muandet, and S. Tang (2020) Grasping field: learning implicit representations for human grasps. In 3DV, Cited by: Appendix A.
  • [19] M. Kokic, D. Kragic, and J. Bohg (2020) Learning task-oriented grasping from human activity datasets. RA-L 5 (2), pp. 3352–3359. Cited by: Appendix A, §I.
  • [20] V. Kumar, Y. Tassa, T. Erez, and E. Todorov (2014) Real-time behaviour synthesis for dynamic hand-manipulation. In ICRA, Cited by: Appendix A, Appendix A.
  • [21] V. Kumar, E. Todorov, and S. Levine (2016) Optimal control with learned local models: application to dexterous manipulation. In ICRA, Cited by: Appendix A, Appendix A.
  • [22] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg (2017) Dex-Net 2.0: deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In RSS, Cited by: Appendix A.
  • [23] P. Mandikal and K. Grauman (2021) DexVIP: learning dexterous grasping with human hand pose priors from video. In CoRL, Cited by: Appendix A, Appendix A, Appendix A, §I.
  • [24] P. Mandikal and K. Grauman (2021) Learning dexterous grasping with object-centric visual affordances. In ICRA, Cited by: Appendix A, Appendix A.
  • [25] M. T. Mason and J. K. Salisbury (1985-05) Robot hands and the mechanics of manipulation. MIT Press, Cambridge, MA. Cited by: Appendix A.
  • [26] M. Matl (2019) Pyrender. Cited by: §III-B.
  • [27] A. T. Miller and P. K. Allen (2004) GraspIt! a versatile simulator for robotic grasping. RAM 11 (4), pp. 110–122. Cited by: Appendix A.
  • [28] A. Mousavian, C. Eppner, and D. Fox (2019) 6-DOF GraspNet: variational grasp generation for object manipulation. In ICCV, Cited by: Appendix A.
  • [29] A. Nagabandi, K. Konolige, S. Levine, and V. Kumar (2019) Deep dynamics models for learning dexterous manipulation. In CoRL, Cited by: Appendix A, Appendix A.
  • [30] OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schneider, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba, and L. Zhang (2019) Solving Rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113. Cited by: Appendix A, Appendix A.
  • [31] OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Józefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba (2018) Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177. Cited by: Appendix A, Appendix A.
  • [32] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In NIPS, Cited by: §II-C.
  • [33] Y. Qin, Y. Wu, S. Liu, H. Jiang, R. Yang, Y. Fu, and X. Wang (2021) DexMV: imitation learning for dexterous manipulation from human videos. arXiv preprint arXiv:2108.05877. Cited by: Appendix A, Appendix A, Appendix A, §I.
  • [34] I. Radosavovic, X. Wang, L. Pinto, and J. Malik (2021) State-only imitation learning for dexterous manipulation. In IROS, Cited by: Appendix A, Appendix A.
  • [35] A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine (2018) Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. In RSS, Cited by: Appendix A, Appendix A, Appendix A, §I.
  • [36] J. Romero, D. Tzionas, and M. J. Black (2017) Embodied hands: modeling and capturing hands and bodies together. In SIGGRAPH Asia, Cited by: §III-A.
  • [37] J. K. Salisbury and J. J. Craig (1982) Articulated hands: force control and kinematic issues. IJRR 1 (1), pp. 4–17. Cited by: Appendix A.
  • [38] S. Song, A. Zeng, J. Lee, and T. Funkhouser (2020) Grasping in the wild: learning 6DoF closed-loop grasping from low-cost demonstrations. RA-L 5 (3), pp. 4978–4985. Cited by: Appendix A, §I.
  • [39] M. Sundermeyer, A. Mousavian, R. Triebel, and D. Fox (2021) Contact-graspnet: efficient 6-dof grasp generation in cluttered scenes. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 13438–13444. Cited by: §III-B.
  • [40] O. Taheri, N. Ghorbani, M. J. Black, and D. Tzionas (2020) GRAB: a dataset of whole-body human grasping of objects. In ECCV, Cited by: Appendix A, Appendix A, §I.
  • [41] K. Van Wyk, M. Xie, A. Li, M. A. Rana, B. Babich, B. Peele, Q. Wan, I. Akinola, B. Sundaralingam, D. Fox, B. Boots, and N. Ratliff (2022) Geometric fabrics: generalizing classical mechanics to capture the physics of behavior. RA-L. Cited by: §III-D.
  • [42] Y. Xiang, C. Xie, A. Mousavian, and D. Fox (2020) Learning RGB-D feature embeddings for unseen object instance segmentation. In CoRL, Cited by: §III-D.
  • [43] S. Young, D. Gandhi, S. Tulsiani, A. Gupta, P. Abbeel, and L. Pinto (2020) Visual imitation made easy. In CoRL, Cited by: Appendix A, §I.
  • [44] H. Zhang, Y. Ye, T. Shiratori, and T. Komura (2021) ManipNet: neural manipulation synthesis with a hand-object spatial representation. In SIGGRAPH, Cited by: Appendix A.
  • [45] H. Zhu, A. Gupta, A. Rajeswaran, S. Levine, and V. Kumar (2019) Dexterous manipulation with deep reinforcement learning: efficient, general, and low-cost. In ICRA, Cited by: Appendix A, Appendix A, Appendix A, §I.

Appendix A Related Work

Manipulation with Dexterous Hands

Dexterous manipulation has a long history dating back to [37, 25]. More recently, the field has studied the use of dexterous hands in a variety of task domains, ranging from object grasping [24, 23], in-hand manipulation [1, 21, 31, 30, 5], relocation [20, 33], stacking [17], interaction with environmental props [12, 45], and tool use [35, 16, 29, 34], to smart teleoperation [13, 9]. Our work addresses the task of object grasping, the key first step of gaining hold of an object before any downstream manipulation task. Grasping has conventionally been approached with model-based planning [27]. Notably, for simpler end effectors like parallel-jaw grippers, remarkable progress has recently been achieved through end-to-end learning-based approaches that directly predict grasp position from raw visual input, such as a depth image [22] or point cloud [28]. Our work is in line with this direction but tackles the more complex frontier of dexterous hands.

To control a dexterous hand, prior work has investigated a range of approaches, including planning with analytical models [1] and online trajectory optimization [20]. However, these methods assume accurate dynamics models and robust state estimates, which are difficult to obtain in complex real-world manipulation. To overcome this limitation, recent work has turned to learning-based approaches, particularly deep reinforcement learning (RL). Both model-based RL [21, 12, 29] and model-free RL [31, 30, 17, 5] have been investigated. Despite this progress, training deep RL models remains notoriously challenging due to high sample complexity and tedious reward engineering. Although this issue has been mitigated by incorporating human demonstrations [35, 45, 33, 34, 23], these methods still face a major scalability challenge: training a single general policy that handles diverse objects and scene configurations is still largely beyond reach [33, 5]. Rather than using deep RL, we adopt the common supervised learning paradigm and focus on bootstrapping a large dexterous grasping dataset from a small set of human demonstrations. Our work bears some similarity to [16, 5], where an expert policy is first trained with RL in a privileged state space, followed by training a high-dimensional student policy on expert-generated data via behavioral cloning. Instead of training RL experts, our method bootstraps from minimal human demonstrations. Moreover, unlike many prior works that evaluate only in simulation [35, 16, 17, 24, 34, 23, 5], we evaluate our model on a real-world robot platform.

Fig. 6: Refined successful trajectories from human demonstrations to diverse robot grippers
Fig. 7: Illustration of trajectory refinement procedure combining refinement, augmentation, and data funneling to get an extended dataset of trajectories for supervised learning. Left: Only a few human demonstrations are provided and retargeted to a robot gripper to generate nominal trajectories. Middle: grasping trajectories generated by policies on unseen object poses and initial hand poses via refinement and augmentation. Right: Extending the set of feasible object poses and novel hand states via data funneling.
object            vertices          1 cam             4 cams            4 cams+contact
master chef can   0.90/0.90/0.90    0.25/0.30/0.30    0.70/0.80/0.80    0.60/0.60/0.60
cracker box       0.45/0.50/0.50    0.15/0.15/0.15    0.20/0.30/0.35    0.25/0.35/0.35
sugar box         0.70/0.85/0.85    0.35/0.40/0.40    0.65/0.70/0.80    0.55/0.65/0.75
tomato soup can   0.80/0.85/0.85    0.25/0.25/0.25    0.70/0.85/0.85    0.65/0.75/0.80
mustard bottle    0.55/0.65/0.65    0.25/0.25/0.25    0.75/0.85/0.85    0.75/0.75/0.75
tuna fish can     0.60/0.75/0.80    0.15/0.15/0.20    0.70/0.80/0.85    0.90/1.00/1.00
pudding box       0.45/0.75/0.75    0.15/0.15/0.15    0.40/0.55/0.70    0.55/0.60/0.65
gelatin box       0.50/0.60/0.75    0.10/0.10/0.20    0.50/0.70/0.80    0.75/0.75/0.80
potted meat can   0.80/0.85/0.90    0.25/0.30/0.30    0.80/0.90/0.90    0.70/0.85/0.85
wood block        0.90/0.90/0.90    0.35/0.35/0.35    0.55/0.75/0.75    0.60/0.80/0.80
foam brick        0.85/0.95/0.95    0.35/0.35/0.35    0.80/0.80/0.85    0.75/0.80/0.85
bowl              0.70/0.95/0.95    0.35/0.35/0.35    0.80/0.95/0.95    0.90/1.00/1.00
mug               0.75/0.80/0.80    0.35/0.40/0.40    0.85/0.85/0.85    0.90/0.90/0.90
bleach cleaner    0.60/0.70/0.70    0.35/0.35/0.35    0.40/0.45/0.55    0.40/0.60/0.60
power drill       0.45/0.55/0.55    0.15/0.15/0.15    0.40/0.45/0.45    0.50/0.50/0.55
large marker      0.85/0.85/0.85    0.50/0.50/0.50    0.90/0.90/0.90    0.95/0.95/0.95
pitcher base      0.25/0.40/0.55    0.05/0.05/0.05    0.25/0.35/0.40    0.35/0.35/0.35
average           0.65/0.75/0.79    0.25/0.26/0.27    0.61/0.70/0.74    0.65/0.72/0.74
TABLE III: Effect of occluded point clouds on policy learning. Each cell reports the success rate after 1 / 2 / 3 successive grasp attempts.

Learning from Human Demonstration

While human demonstration plays a crucial role in training manipulation policies, devising a scalable means to collect demonstrations is still a key bottleneck. Prior work has collected demonstrations kinesthetically [45], through VR interfaces [35], or using motion capture (mocap) solutions [10, 15, 40, 33]. Although there have been some innovations for parallel-jaw grippers [19, 38, 43], acquiring demonstrations for dexterous manipulators remains challenging. Consequently, the collected demonstration datasets are often limited in size. Our work leverages existing mocap datasets of human grasping [4] to guide the generation of a large, diverse grasping dataset for training dexterous manipulators. Apart from our work, some recent efforts have also explored collecting demonstrations by harvesting internet video [23].

Multi-Finger Grasp Synthesis

Our work connects closely to the task of multi-finger grasp synthesis. Prior work has investigated grasp synthesis for virtual human hands [40, 18] and multi-finger robot manipulators [2]. However, these methods are concerned only with static grasp poses and do not consider motion. Some very recent work has attempted to bridge this gap by synthesizing trajectories for human-like hands [44, 6]. Our work can potentially capitalize on their grasp trajectories by using them as additional human demonstrations.

Appendix B DexTransfer

B-A Generalization on Different Grippers

We further investigate whether DexTransfer generalizes to other grippers. In addition to the Allegro hand, we also generate datasets for the Shadow hand, the MANO hand, the Robotiq 3-finger gripper, and the Kinova gripper. Fig. 6 shows qualitative results of poses refined from human demonstrations. This suggests that DexTransfer can transfer human demonstrations to various robot grippers despite the large differences among their kinematic configurations.

Fig. 8: Examples of policies on unseen poses in real world
Fig. 9: Single policy across multiple objects

B-B Details on Trajectory Refinement

Fig. 7 shows details of trajectory refinement and augmentation in DexTransfer. For each nominal trajectory, we sample a scalar interpolation parameter from a uniform distribution, together with maximum and minimum palm-pose perturbations. We then use this parameter to interpolate between the minimum and maximum perturbation values over the corresponding time steps, and add the result to the nominal actions to obtain perturbed actions. We apply domain randomization and, via rejection sampling, keep only refined actions that succeed under 10 different physics settings (object mass, friction, noise, and a perturbation test).
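The interpolate-then-reject procedure above can be sketched as follows; `sample_physics` and `rollout_success` are hypothetical stand-ins for the simulator's randomization and rollout evaluation, not functions from the paper's codebase:

```python
import numpy as np

def perturb_and_filter(nominal, min_pert, max_pert, sample_physics,
                       rollout_success, n_physics=10, rng=None):
    """Sketch of one refinement step: interpolate a perturbation schedule
    between its min/max bounds, add it to the nominal trajectory, and accept
    the result only if it succeeds under every sampled physics setting."""
    if rng is None:
        rng = np.random.default_rng(0)
    t = rng.uniform()                           # scalar interpolation parameter
    pert = (1.0 - t) * min_pert + t * max_pert  # per-step perturbation schedule
    candidate = nominal + pert
    for _ in range(n_physics):                  # domain randomization
        if not rollout_success(candidate, sample_physics(rng)):
            return None                         # rejection sampling: discard
    return candidate
```

Repeating this over many sampled parameters and initial states is what funnels a few retargeted demonstrations into a large, diverse dataset.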

Appendix C Experiment

C-a Dataset Generation Details

In real-world experiments, we found that the Allegro hand is not strong enough to lift heavy objects such as the power drill or bleach cleaner, so we reduce all object masses below a fixed threshold. During data randomization, we fix the Allegro hand's torque limits to values close to the real robot, and randomize the object mass and friction coefficient. We inject noise into each candidate trajectory. In addition, we add a short burst of high-frequency perturbation to the palm at the end of each grasping trajectory to further weed out unstable grasping poses. We use PyBullet [8] to generate the dataset, and the robot is controlled at 12 Hz in simulation.
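A minimal sketch of the randomization and noise-injection steps; the sampling ranges below are illustrative placeholders, since the paper's exact values are omitted in the text:

```python
import numpy as np

def sample_physics(rng):
    """Toy domain-randomization sampler. Ranges are illustrative, not the
    paper's exact values."""
    return {
        "mass_scale": rng.uniform(0.5, 1.5),         # scale on nominal object mass
        "lateral_friction": rng.uniform(0.5, 1.2),   # object friction coefficient
        "action_noise_std": rng.uniform(0.0, 0.01),  # per-step action noise
    }

def add_trajectory_noise(traj, std, rng):
    """Inject i.i.d. Gaussian noise into a candidate trajectory."""
    return traj + rng.normal(0.0, std, size=traj.shape)
```

In a PyBullet pipeline these sampled values would be applied per rollout (e.g. via the simulator's dynamics-modification API) before testing whether a candidate trajectory still succeeds.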

C-B Ablation on Input Representation

Fig. 10: Examples of policies on unseen poses in simulation

We aim to understand the impact that learning on noisy or occluded point clouds has on policy learning. In particular, we consider the impact of acquiring the point cloud from a single camera versus multiple cameras, as well as adding the ability to sense binary contacts to cope with partial observability. Specifically, we compare the following scenarios for point-cloud acquisition (as detailed in Table III):

Full Vertices (Oracle): This assumes no occlusion during robot movement, so the state is fully observable. Note that this is not realistic for real-robot experiments, but it is useful as an upper bound on the performance of the partially observable case. On average, it achieves 65%, 75%, and 79% success rates after one, two, and three successive trials (Table III).

One Camera: If we use only one camera to render the scene, severe occlusion occurs when the robot is close to the object. We randomize the camera position during training. We found that the policy performs poorly in this scenario, achieving on average only 25%, 26%, and 27% success rates.

Four Cameras: Two cameras are placed above the object and two at its sides. We found that with four cameras, the policy achieves performance similar to sampling from the full vertices: 61%, 70%, and 74%.
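Conceptually, each camera's depth image is back-projected through a pinhole model and the per-camera clouds are concatenated into one observation; a minimal numpy sketch (the intrinsics and extrinsics below are hypothetical, not the paper's calibration):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy, cam_to_world):
    """Back-project a depth image into a world-frame point cloud
    using the pinhole camera model."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
    pts = pts[depth.reshape(-1) > 0]        # drop invalid (zero-depth) pixels
    return (pts @ cam_to_world.T)[:, :3]    # homogeneous transform to world frame

def fuse_views(clouds):
    """Concatenate per-camera clouds into a single observation."""
    return np.concatenate(clouds, axis=0)
```

With more viewpoints the fused cloud covers more of the object's surface, which is why four cameras recover most of the oracle's performance while a single camera does not.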

Binary Contact: Since occlusion is inevitable in real-world experiments, we investigate whether adding extra per-finger binary-contact information can help the policy under partial observability. For each fingertip link, we set a binary contact variable to 1 if the contact force exceeds 0.5 N, and 0 otherwise. We concatenate this binary contact feature with the robot poses as input to the Kinematics Encoder. We found that this extra information helps close the performance gap between four-camera rendering and full vertices. In particular, it improves performance on smaller objects, e.g., the tuna fish can, pudding box, gelatin box, and large marker, where occlusion occurs most often, as shown in Table III. Interestingly, it performs slightly worse on some objects, e.g., the master chef can and sugar box. We hypothesize that binary contact can sometimes mislead the network into believing a grasp is stable enough to lift, and that a richer, non-binary contact representation could mitigate this issue.
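The binary-contact feature described above can be sketched as follows; the 0.5 N threshold is from the text, while the exact concatenation layout and function names are illustrative:

```python
import numpy as np

CONTACT_THRESHOLD_N = 0.5  # contact if fingertip force exceeds 0.5 N

def contact_features(fingertip_forces):
    """Binarize per-fingertip contact forces (1 = in contact, 0 = free)."""
    f = np.asarray(fingertip_forces, dtype=float)
    return (f > CONTACT_THRESHOLD_N).astype(np.float32)

def kinematics_input(robot_pose, fingertip_forces):
    """Concatenate the robot pose with binary contacts, as fed to the
    Kinematics Encoder (the encoder network itself is not shown here)."""
    return np.concatenate([np.asarray(robot_pose, dtype=np.float32),
                           contact_features(fingertip_forces)])
```

The thresholding discards force magnitude, which is consistent with the hypothesis above that a richer, non-binary contact signal might avoid misleading the network about grasp stability.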

C-C Single Policy across Multiple Objects

We also show that we can train a single policy that grasps multiple objects (Fig. 9). We group objects with similar shapes into cuboids and cylinders, with 5 objects per group. Cuboid objects: cracker box, sugar box, gelatin box, wood block, and foam brick. Cylindrical objects: master chef can, tomato soup can, tuna fish can, mug, and large marker. In addition, to evaluate how performance drops when training across multiple diverse objects, we select 5 objects with diverse shapes that yielded high performance when training per-object policies: master chef can, potted meat can, bowl, mug, and large marker. We compute the average success rate of the per-object policies as an upper bound and measure how much performance drops when training a single combined policy.

C-D Qualitative Results

We show more qualitative results in both simulation (Fig. 10) and real-robot experiments (Fig. 8). Our policy learns to smoothly adjust its palm pose and fingers while approaching the object. Further results can be found on the supplementary website.