Dexterous manipulation with multi-fingered hands represents a grand challenge in robotics: the versatility of the human hand is as yet unrivaled by the capabilities of robotic systems, and bridging this gap will enable more general and capable robots. Although some real-world tasks can be accomplished with simple parallel jaw grippers, there are countless tasks in which dexterity in the form of redundant degrees of freedom is critical. In fact, dexterous manipulation is defined[okamura2000overview] as being object-centric, with the goal of controlling object movement through precise control of forces and motions – something that is not possible without the ability to simultaneously impact the object from multiple directions. Through added controllability and stability, multi-fingered hands enable useful fine motor skills that are necessary for deliberate interaction with objects. For example, using only two fingers to attempt common tasks such as opening the lid of a jar, hitting a nail with a hammer, or writing on paper with a pencil would quickly encounter the challenges of slippage, complex contact forces, and underactuation. Success in such settings requires a sufficiently dexterous hand, as well as an intelligent policy that can endow such a hand with the appropriate control strategy.
The principal challenges in dexterous manipulation stem from the need to coordinate numerous joints and impart complex forces onto the object of interest. The need to repeatedly establish and break contacts presents an especially difficult problem for analytic approaches, which require accurate models of the physics of the system. Learning offers a promising data-driven alternative. Model-free reinforcement learning (RL) methods can learn policies that achieve good performance on complex tasks[van2015learning, levine2016end, rajeswaran2017learning]; however, we will show that these state-of-the-art algorithms struggle when a high degree of flexibility is required, such as moving a pencil to follow arbitrary user-specified strokes. Here, complex contact dynamics and high chances of task failure make the overall skill much more difficult. Model-free methods also require large amounts of data, making them difficult to use in the real world. Model-based RL methods, on the other hand, can be much more efficient, but have not yet been scaled up to such complex tasks. In this work, we aim to push the boundary on this task complexity; consider, for instance, the task of rotating two Baoding balls around the palm of your hand (Figure 1). We will discuss how model-based RL methods can solve such tasks, both in simulation and on a real-world robot.
Algorithmically, we present a technique that combines elements of recently-developed uncertainty-aware neural network models with state-of-the-art gradient-free trajectory optimization. While the individual components of our method are based heavily on prior work, we show that their combination is both novel and critical. Our approach, based on deep model-based RL, challenges the general machine learning community’s notion that models are difficult to learn and do not yet deliver control results that are as impressive as model-free methods. In this work, we push forward the empirical results of model-based RL, in both simulation and the real world, on a suite of dexterous manipulation tasks starting with a 9-DoF three-fingered hand[zhu2019dexterous] rotating a valve, and scaling up to a 24-DoF anthropomorphic hand executing handwriting and manipulating free-floating objects (Figure 2). These realistic tasks require not only learning about interactions between the robot and objects in the world, but also effective planning to find precise and coordinated maneuvers while avoiding task failure (e.g., dropping objects). To the best of our knowledge, our paper demonstrates for the first time that deep neural network models can indeed enable sample-efficient and autonomous discovery of fine motor skills with high-dimensional manipulators, including a real-world dexterous hand trained entirely using just 4 hours of real-world data.
2 Related Work
In recent years, the mechanical and hardware development of multi-fingered robotic hands has significantly advanced [butterfass2001dlr, xu2016design], but our manipulation capabilities have not scaled similarly. Prior work in this area has explored a wide spectrum of manipulation strategies: [sundaralingam2018geometric]'s optimizer used geometric meshes of the object to plan finger-gaiting maneuvers, [andrews2013goal]'s multi-phase planner used offline simulations and a reduced basis of hand poses to accelerate the parameter search for in-hand rotation of a sphere, [dogar2010push]'s motion planning algorithm used the mechanics of pushing to funnel an object into a stable grasp state, and [bai2014dexterous]'s controller used the conservation of mechanical energy to control the tilt of a palm and roll objects to desired positions on the hand. These manipulation techniques have thus far struggled to scale to more complex tasks or sophisticated manipulators, in simulation as well as the real world, perhaps due to their need for precise characterization of the system and its environment. Reasoning through contact models and motion cones [chavan2018hand, kolbert2016experimental], for example, requires computation time that scales exponentially with the number of contacts, and has thus been limited to simpler manipulators and more controlled tasks. In this work, we aim to significantly scale up the complexity of feasible tasks, while also minimizing such task-specific formulations.
More recent work in deep RL has studied this problem through the use of data-driven learning to make sense of observed phenomena [andrychowicz2018learning, van2015learning]. These methods, while powerful, require large amounts of system interaction to learn successful control policies, making them difficult to apply in the real world. Some work [rajeswaran2017learning, zhu2019dexterous] has used expert demonstrations to improve sample efficiency. In contrast, our method is sample efficient without requiring any expert demonstrations, and is still able to leverage data-driven learning techniques to acquire challenging dexterous manipulation skills.
Model-based RL has the potential to provide both efficient and flexible learning. In fact, methods that assume perfect knowledge of system dynamics can achieve very impressive manipulation behaviors [mordatch2012contact, lowrey2018plan] using generally applicable learning and control techniques. Other work has focused on learning these models using high-capacity function approximators [deisenroth2013survey, lenz2015deepmpc, levine2016end, nagabandi2017roach, nagabandi2018mbrl, williams2017information, chua2018pets] and probabilistic dynamics models [Deisenroth2011_ICML, ko2008gp, deisenroth2012toward, doerr2017mbpid]. Our method combines components from multiple prior works, including uncertainty estimation [Deisenroth2011_ICML, chua2018pets, kurutach2018model], deep models and model-predictive control (MPC) [nagabandi2017roach], and stochastic optimization for planning [williams15mppi]. Model-based RL methods, including recent work in uncertainty estimation [malik2019calibrated] and in combining policy networks with online planning [wang2019exploring], have unfortunately mostly been studied on lower-dimensional (and often simulated) benchmark tasks, and scaling them to higher-dimensional tasks such as dexterous manipulation has proven to be a challenge. As illustrated in our evaluation, the particular synthesis of ideas in this work allows model-based RL to push forward the task complexity of achievable dexterous manipulation skills, and to extend this progress to a real-world robotic hand.
3 Deep Models for Dexterous Manipulation
In order to leverage the benefits of autonomous learning from data-driven methods while also enabling efficient and flexible task execution, we extend deep model-based RL approaches to the domain of dexterous manipulation. Our method of online planning with deep dynamics models (PDDM) builds on prior work that uses MPC with deep models [nagabandi2018mbrl] and ensembles for model uncertainty estimation [chua2018pets]. However, as we illustrate in our experiments, the particular combination of components is critical for the success of our method on complex dexterous manipulation tasks and allows it to substantially outperform these prior model-based algorithms as well as strong model-free baselines.
3.1 Model-Based Reinforcement Learning
We first define the model-based RL problem. Consider a Markov decision process (MDP) with a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, and a state transition distribution $p(s' \mid s, a)$ describing the result of taking action $a$ from state $s$. The task is specified by a bounded reward function $r(s, a)$, and the goal of RL is to select actions in such a way as to maximize the expected sum of rewards over a trajectory. Model-based RL aims to solve this problem by first learning an approximate model $\hat{p}_\theta(s' \mid s, a)$, parameterized by $\theta$, that approximates the unknown transition distribution of the underlying system dynamics. The parameters $\theta$ can be learned to maximize the log-likelihood of observed data $\mathcal{D}$, and the learned model's predictions can then be used to either learn a policy or, as we do in this work, to perform online planning to select optimal actions.
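As a minimal illustration of this model-fitting step, the sketch below fits an approximate dynamics model on observed transitions by maximizing log-likelihood, which for a fixed-covariance Gaussian reduces to minimizing mean squared error. A linear model and a synthetic linear system stand in for the neural network and the real robot (both are assumptions for the demo, not the paper's setup):

```python
import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    """Least-squares fit of s' ~ W @ [s; a; 1] on observed transitions."""
    inputs = np.hstack([states, actions, np.ones((len(states), 1))])
    W, *_ = np.linalg.lstsq(inputs, next_states, rcond=None)
    return W

def predict(W, s, a):
    return np.hstack([s, a, [1.0]]) @ W

# Synthetic transitions from a known linear system (assumed for the demo).
rng = np.random.default_rng(0)
A, B = np.array([[0.9, 0.1], [0.0, 0.8]]), np.array([[0.0], [1.0]])
S = rng.normal(size=(500, 2))
U = rng.normal(size=(500, 1))
S_next = S @ A.T + U @ B.T

W = fit_linear_dynamics(S, U, S_next)
err = np.abs(predict(W, S[0], U[0]) - S_next[0]).max()
```

With noiseless linear data, the least-squares fit recovers the dynamics essentially exactly; with a neural network model, this same loss is instead minimized by stochastic gradient descent.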
3.2 Learning the Dynamics
We use a deep neural network model to represent $\hat{p}_\theta(s' \mid s, a)$, such that our model has enough capacity to capture the complex interactions involved in dexterous manipulation. We use a parameterization of the form $\hat{p}_\theta(s' \mid s, a) = \mathcal{N}(f_\theta(s, a), \Sigma)$, where the mean $f_\theta(s, a)$ is given by a neural network, and the covariance $\Sigma$ of the conditional Gaussian distribution can also be learned (although we found this to be unnecessary for good results). As prior work has indicated, capturing epistemic uncertainty in the network weights is indeed important in model-based RL, especially with high-capacity models that are liable to overfit to the training set and extrapolate erroneously outside of it. A simple and inexpensive way to do this is to employ bootstrap ensembles [chua2018pets], which approximate the posterior $p(\theta \mid \mathcal{D})$ with a set of $M$ models, each with parameters $\theta_m$. For deep models, prior work has observed that bootstrap resampling is unnecessary, and it is sufficient to simply initialize each model with a different random initialization and use different batches of data at each training step [chua2018pets]. We note that this supervised learning setup makes more efficient use of the data than counterpart model-free methods, since we receive dense training signal from every state transition and can use all data (even off-policy data) to make training progress.
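The ensemble recipe above can be sketched as follows, with linear models standing in for the neural networks (an assumption for brevity). Each member is fit on a different random batch of the data; the spread of predictions across members then serves as an inexpensive uncertainty signal, growing as queries move away from the training distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))                       # inputs concentrated near 0
Y = 2.0 * X + 0.5 + rng.normal(0.0, 0.1, X.shape)    # noisy linear targets

def fit_member(rng):
    # Each ensemble member sees a different random batch of the data.
    idx = rng.choice(len(X), size=200, replace=False)
    A = np.hstack([X[idx], np.ones((200, 1))])
    W, *_ = np.linalg.lstsq(A, Y[idx], rcond=None)
    return W

members = [fit_member(rng) for _ in range(5)]

def ensemble_predict(x):
    """Mean prediction and disagreement (std) across ensemble members."""
    preds = [np.array([x, 1.0]) @ W for W in members]
    return np.mean(preds), np.std(preds)

_, spread_near = ensemble_predict(0.0)    # in-distribution query
_, spread_far = ensemble_predict(50.0)    # far-extrapolation query
```

The disagreement is small near the training data and grows far from it, which is exactly the overconfidence signal that a single model cannot provide.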
3.3 Online Planning for Closed-Loop Control
In our method, we use online planning with MPC to select actions via our model predictions. At each time step, our method performs a short-horizon trajectory optimization, using the model to predict the outcomes of different action sequences. We can use a variety of gradient-free optimizers to address this optimization; we describe a few particular choices below, each of which builds on the previous ones, with the last and most sophisticated optimizer being the one that is used by our PDDM algorithm.
Random Shooting: The simplest gradient-free optimizer generates $N$ independent random action sequences $\{A_0, \dots, A_{N-1}\}$, where each sequence $A_k = (a_0^{(k)}, \dots, a_{H-1}^{(k)})$ is of length $H$. Given a reward function $r(s, a)$ that defines the task, and given future state predictions $\hat{s}_{t+1}$ from the learned dynamics model, the optimal action sequence is selected to be the one with the highest predicted reward: $A^* = \arg\max_{A_k} \sum_{t=0}^{H-1} r(\hat{s}_t, a_t^{(k)})$. This approach has been shown to achieve success on continuous control tasks with learned models [nagabandi2018mbrl], but it has numerous drawbacks: it scales poorly with both the planning horizon and the action dimension, and it is often insufficient for achieving high task performance, since a sequence of actions sampled at random rarely leads directly to meaningful behavior.
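A runnable sketch of this random-shooting search is below. The dynamics and reward are toy stand-ins (assumptions for the demo), not the paper's learned model: a 1-D point that the planner should drive toward zero.

```python
import numpy as np

def random_shooting(s0, dynamics, reward, N=256, H=10, rng=None):
    """Sample N random H-step sequences, return the highest-reward one."""
    rng = rng or np.random.default_rng(0)
    seqs = rng.uniform(-1.0, 1.0, size=(N, H))   # N candidate action sequences
    returns = np.zeros(N)
    for k in range(N):
        s = s0
        for t in range(H):
            s = dynamics(s, seqs[k, t])          # predicted next state
            returns[k] += reward(s, seqs[k, t])  # predicted reward
    return seqs[np.argmax(returns)]              # best predicted sequence

# Toy stand-ins for the learned model and task reward.
dyn = lambda s, a: s + 0.1 * a
rew = lambda s, a: -s ** 2

best = random_shooting(1.0, dyn, rew)
s = 1.0
for a in best:           # roll the chosen plan forward
    s = dyn(s, a)
```

Even this naive search moves the toy state toward the goal, but the number of samples needed grows quickly with horizon and action dimension, motivating the refinements below.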
Iterative Random-Shooting with Refinement: To address these issues, much prior work [botev2013cross] has instead taken a cross-entropy method (CEM) approach, which begins as the random-shooting approach, but then repeats this sampling for multiple iterations $m \in \{0, \dots, M-1\}$ at each time step. The top $K$ highest-scoring action sequences from each iteration are used to update and refine the mean and variance of the sampling distribution for the next iteration, as follows:
$$\mu_{m+1} = \frac{1}{K} \sum_{A_k \in \mathcal{E}_m} A_k, \qquad \sigma_{m+1}^2 = \frac{1}{K} \sum_{A_k \in \mathcal{E}_m} \left(A_k - \mu_{m+1}\right)^2,$$
where $\mathcal{E}_m$ denotes the elite set of the $K$ best sequences at iteration $m$. After $M$ iterations, the optimal actions are selected to be the resulting mean $\mu_M$ of the action distribution.
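The CEM refinement loop can be sketched in a few lines. The quadratic objective below is an assumption for the demo (it plays the role of the predicted return of an action vector):

```python
import numpy as np

def cem(objective, dim, iters=20, pop=200, elites=20, rng=None):
    """Iteratively refit a Gaussian to the top-scoring samples."""
    rng = rng or np.random.default_rng(0)
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, dim))
        scores = np.array([objective(x) for x in samples])
        top = samples[np.argsort(scores)[-elites:]]      # elite set
        mu = top.mean(axis=0)                            # refit mean
        sigma = top.std(axis=0) + 1e-6                   # refit std (floored)
    return mu

# Toy objective: highest reward at a known target action vector.
target = np.array([0.3, -0.7])
mu = cem(lambda a: -np.sum((a - target) ** 2), dim=2)
```

After a few iterations the sampling distribution collapses onto the optimum; the hard top-K cutoff, however, discards information from all non-elite samples, which the reward-weighted update below avoids.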
Filtering and Reward-Weighted Refinement:
While CEM is a stronger optimizer than random shooting, it still scales poorly with dimensionality and is hard to apply when both coordination and precision are required. PDDM instead uses a stronger optimizer that considers covariances between time steps and uses a softer update rule that more effectively integrates a larger number of samples into the distribution update. As derived in recent model-predictive path integral work [williams15mppi, lowrey2018plan], this general update rule takes the following form for each time step $t$, reward-weighting factor $\gamma$, and predicted reward $R_k$ from each of the $N$ predicted trajectories:
$$\mu_t = \frac{\sum_{k=0}^{N-1} e^{\gamma R_k} \, a_t^{(k)}}{\sum_{k=0}^{N-1} e^{\gamma R_k}}.$$
Rather than sampling our action samples from a random policy or from iteratively refined Gaussians, we instead apply a filtering technique to explicitly produce smoother candidate action sequences. Given the iteratively updated mean $\mu_t$ from above, we generate $N$ action sequences $a_t^{(k)} = n_t^{(k)} + \mu_t$, where each noise sample $n_t^{(k)}$ is generated from i.i.d. samples $u_t^{(k)} \sim \mathcal{N}(0, \Sigma)$ using filtering coefficient $\beta$ as follows:
$$n_t^{(k)} = \beta \, u_t^{(k)} + (1 - \beta) \, n_{t-1}^{(k)}, \qquad n_{t<0} = 0.$$
By coupling time steps to each other, this filtering also reduces the effective degrees of freedom or dimensionality of the search space, thus allowing for better scaling with dimensionality.
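The two pieces above, time-correlated noise and the softmax reward-weighted mean update, can be sketched together (the notation mirrors the text; the per-step reward and the hyperparameter values are assumptions for the demo):

```python
import numpy as np

def filtered_noise(H, K, beta=0.7, sigma=0.3, rng=None):
    """n_t = beta * u_t + (1 - beta) * n_{t-1}, with u_t ~ N(0, sigma^2)."""
    rng = rng or np.random.default_rng(0)
    u = rng.normal(0.0, sigma, size=(K, H))
    n = np.zeros_like(u)
    n[:, 0] = u[:, 0]
    for t in range(1, H):
        n[:, t] = beta * u[:, t] + (1.0 - beta) * n[:, t - 1]
    return n

def reward_weighted_mean(action_seqs, returns, gamma=10.0):
    """Softmax-weighted combination of all sampled sequences."""
    w = np.exp(gamma * (returns - returns.max()))  # subtract max for stability
    return (w[:, None] * action_seqs).sum(axis=0) / w.sum()

H, K = 5, 64
mu = np.zeros(H)
candidates = mu + filtered_noise(H, K)              # K smooth candidate plans
returns = -np.abs(candidates - 0.5).sum(axis=1)     # toy reward: stay near 0.5
mu = reward_weighted_mean(candidates, returns)      # soft update of the mean
```

Unlike the hard elite cutoff of CEM, every sample contributes to the update in proportion to its exponentiated reward, with $\gamma$ controlling how sharply the update concentrates on the best samples.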
Each of the optimization methods described above can be combined with our ensemble-based dynamics models to produce a complete model-based RL method. In our approach, the reward for each action sequence is calculated as the mean predicted reward across all models of the ensemble, allowing model disagreement to affect the chosen actions. After using the predicted rewards to optimize for an $H$-step candidate action sequence, we employ MPC, where the agent executes only the first action $a_0$, receives updated state information $s_{t+1}$, and then replans at the following time step. This closed-loop procedure of replanning with updated information at every time step helps to mitigate model inaccuracies by preventing accumulated model error. Note that this control procedure also allows us to easily swap in new reward functions or goals at run time, independent of the trained model. Overall, the full procedure of PDDM involves iteratively performing actions in the real world (through online planning with the learned model), and then using those observations to update that learned model.
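The closed-loop MPC procedure can be sketched as follows: at every step, replan an H-step sequence from the current state but execute only its first action. The dynamics, reward, and planner here are toy stand-ins (a small random-shooting search rather than the full PDDM optimizer):

```python
import numpy as np

def make_planner(dynamics, reward, N=128, H=8, rng=None):
    rng = rng or np.random.default_rng(1)
    def plan(s0):
        seqs = rng.uniform(-1.0, 1.0, size=(N, H))
        returns = np.zeros(N)
        for k in range(N):
            s = s0
            for t in range(H):
                s = dynamics(s, seqs[k, t])
                returns[k] += reward(s)
        return seqs[np.argmax(returns)]
    return plan

def mpc_rollout(s0, dynamics, planner, steps=30):
    s = s0
    for _ in range(steps):
        a_seq = planner(s)          # replan from the current state
        s = dynamics(s, a_seq[0])   # execute only the first action a_0
    return s

# Toy task: drive a 1-D point toward the origin.
dyn = lambda s, a: s + 0.1 * a
rew = lambda s: -s ** 2
final = mpc_rollout(1.0, dyn, make_planner(dyn, rew))
```

Because the planner is re-run from the observed state at every step, errors in any single plan are corrected at the next step rather than compounding over the full episode.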
4 Experimental Evaluation
Our evaluation aims to address the following questions: (1) Can our method autonomously learn to accomplish a variety of complex dexterous manipulation tasks? (2) What is the effect of the various design decisions in our method? (3) How does our method’s performance as well as sample efficiency compare to that of other state-of-the-art algorithms? (4) How general and versatile is the learned model? (5) Can we apply these lessons learned from simulation to enable a 24-DoF humanoid hand to manipulate free-floating objects in the real world?
4.1 Task Suite
Some of the main challenges in dexterous manipulation involve the high dimensionality of the hand, the prevalence of complex contact dynamics that must be utilized and balanced to manipulate free floating objects, and the potential for failure. We identified a set of tasks (Figure 2) that specifically highlight these challenges by requiring delicate, precise, and coordinated movement. With the final goal in mind of real-world experiments on a 24-DoF hand, we first set up these selected dexterous manipulation tasks in MuJoCo [mujoco] and conducted experiments in simulation on two robotic platforms: a 9-DoF three-fingered hand, and a 24-DoF five-fingered hand. We show in Figure 3
the successful executions of PDDM on these tasks, which required 1-2 hours' worth of data for the simulated tasks and 2-4 hours' worth for the real-world experiments. We include implementation details, reward functions, and hyperparameters in the appendix, but first provide brief task overviews below.
Valve Turning: This starter task for looking into manipulation challenges uses a 9-DoF hand (D'Claw) [zhu2019dexterous] to turn a 1-DoF valve to arbitrary target locations. Here, the fingers must come into contact with the object and coordinate themselves to make continued progress toward the target.
In-Hand Reorientation: Next, we look into the manipulation of free-floating objects, where most maneuvers lead to task failures of dropping objects, and thus sharp discontinuities in the dynamics. Successful manipulation of free-floating objects, such as this task of reorienting a cube into a goal pose, requires careful stabilization strategies as well as re-grasping strategies. Note that these challenges exist even in cases where system dynamics are fully known, let alone with learned models.
Handwriting: In addition to free-floating objects, the requirement of precision is a further challenge when planning with approximate models. Handwriting, in particular, requires precise maneuvering of all joints in order to control the movement of the pencil tip and enable legible writing. Furthermore, the previously mentioned challenges of local minima (i.e., holding the pencil still), abundant terminal states (i.e., dropping the pencil), and the need for simultaneous coordination of numerous joints still apply. This task can also test flexibility of learned behaviors, by requiring the agent to follow arbitrary writing patterns as opposed to only a specific stroke.
Baoding Balls: While manipulation of one free object is already a challenge, sharing the compact workspace with other objects exacerbates the challenge and truly tests the dexterity of the manipulator. We examine this challenge with Baoding balls, where the goal is to rotate two balls around the palm without dropping them. The objects influence the dynamics of not only the hand, but also each other; inconsistent movement of either one knocks the other out of the hand, leading to failure.
4.2 Ablations and Analysis of Design Decisions
In our first set of experiments (Fig. 4), we evaluate the impact of the design decisions for our model and our online planning method. We use the Baoding balls task for these experiments, though we observed similar trends on other tasks. In the first plot, we see that a sufficiently large architecture is crucial, indicating that the model must have enough capacity to represent the complex dynamical system. In the second plot, we see that the use of ensembles is helpful, especially earlier in training when non-ensembled models can overfit badly and thus exhibit overconfident and harmful behavior. This suggests that ensembles are an enabling factor in using sufficiently high-capacity models. In the third plot, we see that there is not much difference between resetting model weights randomly at each training iteration versus warmstarting them from their previous values.
In the fourth plot, we see that using a planning horizon that is either too long or too short can be detrimental: Short horizons lead to greedy planning, while long horizons suffer from compounding errors in the predictions. In the fifth plot, we study the type of planning algorithm and see that PDDM, with action smoothing and soft updates, greatly outperforms the others. In the final plot, we study the effect of the reward-weighting variable, showing that medium values provide the best balance of dimensionality reduction and smooth integration of action samples versus loss of control authority. Here, too soft of a weighting leads to minimal movement of the hand, and too hard of a weighting leads to aggressive behaviors that frequently drop the objects.
4.3 Comparisons to State-of-the-Art Methods
In this section, we compare our method to the following state-of-the-art model-based and model-free RL algorithms: Nagabandi et al. [nagabandi2018mbrl] learn a deterministic neural network model, combined with a random-shooting MPC controller; PETS [chua2018pets] combines uncertainty-aware deep network dynamics models with sampling-based uncertainty propagation; NPG [kakade2002natural] is a model-free natural policy gradient method, and has been used in prior work on learning manipulation skills [rajeswaran2017learning]; SAC [haarnoja2018soft] is an off-policy model-free RL algorithm; MBPO [janner2019trust] is a recent hybrid approach that uses data from its model to accelerate policy learning. On our suite of dexterous manipulation tasks, PDDM consistently outperforms prior methods both in terms of learning speed and final performance, even solving tasks that prior methods cannot.
We first experiment with a three-fingered hand rotating a valve, with starting and goal positions chosen randomly from a fixed range. On this simpler task, we confirm that most of the prior methods do in fact succeed, and we see that even here, policy gradient approaches such as NPG require prohibitively large amounts of data (note the log scale of Figure 6).
Next, we scale up our method to a 24-DoF five-fingered hand reorienting a free-floating object to arbitrary goal configurations (Figure 6). First, we prescribe two possible goals of either left or right 90° rotations; here, SAC behaves similarly to our method (actually attaining a higher final reward), and NPG is slow as expected, but does achieve the same high reward after substantially more steps. However, when we increase the number of possible goals to 8 different options (90° and 45° rotations in the left, right, up, and down directions), we see that our method still succeeds, but the model-free approaches get stuck in local optima and are unable to fully achieve even the previously attainable goals. This inability to effectively address a "multi-task" or "multi-goal" setup is a known drawback of model-free approaches, and it is particularly pronounced in goal-conditioned tasks that require flexibility. These additional goals do not make the task harder for PDDM, because even when learning 90° rotations, it is building a model of its interactions rather than specifically learning to reach those angles.
To further scale up task complexity, we experiment with handwriting, where the base of the hand is fixed and all writing must be done through coordinated movement of the fingers and the wrist. We perform two variations of this task: (a) the agent is trained and tested on writing a single fixed trajectory, and (b) the agent is trained to follow arbitrary trajectories but is evaluated on the fixed trajectory from (a). Although even a fixed writing trajectory is challenging, writing arbitrary trajectories requires a degree of flexibility that is exceptionally difficult for prior methods. We see in Figure 8 that prior model-based approaches do not actually solve this task (values below the grey line correspond to holding the pencil still near the middle of the paper). Our method, SAC, and NPG solve the task for a single fixed trajectory, but the model-free methods become stuck in a local optimum when asked to write arbitrary trajectories. Even SAC, which has a higher-entropy action distribution and therefore achieves better exploration, is unable to extract the finer underlying skill, since the landscape of successful behaviors is quite narrow.
This task is particularly challenging due to the inter-object interactions, which can lead to drastically discontinuous dynamics and frequent failures from dropping the objects. We were unable to get the other model-based or model-free methods to succeed at this task (Figure 8), but PDDM solves it using just 100,000 data points, or 2.7 hours' worth of data. Additionally, we can employ the model that was trained on 100-step rollouts to then run for much longer (1000 steps) at test time. The model learned for this task can also be repurposed, without additional training, to perform a variety of related tasks (see video): moving a single ball to a goal location in the hand, posing the hand, and performing clockwise rotations instead of the learned counter-clockwise ones.
4.4 Learning Real-World Manipulation of Baoding Balls
Finally, we present an evaluation of PDDM on a real-world anthropomorphic robotic hand. We use the 24-DoF Shadow Hand to manipulate Baoding balls, and we train our method entirely with real-world experience, without any simulation or prior knowledge of the system.
Hardware setup: In order to run this experiment in the real world (Figure 9), we use a camera tracker to produce 3D position estimates for the Baoding balls. We employ a dilated CNN model for object tracking, modeled after KeypointNet [Suwajanakorn:2018], which takes a 280x180 RGB stereo image pair as input from a calibrated 12 cm baseline camera rig. Additional details on the tracker are provided in Appendix A. As shown in the supplementary video, we also implement an automated reset mechanism, which consists of a ramp that funnels the dropped Baoding balls to a specific position and then triggers a preprogrammed 7-DoF Franka-Emika arm to use its parallel jaw gripper to pick them up and return them to the Shadow Hand's palm. The planner commands the hand at 10 Hz, and these commands are communicated via a zero-order hold to the low-level position controller that operates at 1 kHz. The episode terminates when the task horizon of 10 seconds has elapsed or when the hand drops either ball, at which point a reset request is issued again. Numerous sources of delay in real robotic systems, in addition to the underactuated nature of the real Shadow Hand, make the task considerably harder in the real world.
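The rate conversion mentioned above can be sketched as a zero-order hold: each 10 Hz planner command is held constant until the next update when forwarded to the 1 kHz position controller (the rates come from the text; the command values themselves are made up for illustration):

```python
import numpy as np

def zero_order_hold(commands, repeat):
    """Repeat each low-rate command `repeat` times at the high rate."""
    return np.repeat(commands, repeat, axis=0)

cmds_10hz = np.array([0.1, 0.3, -0.2])        # three planner commands
cmds_1khz = zero_order_hold(cmds_10hz, 100)   # 1000 Hz / 10 Hz = 100
```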
Results: Our method learns to rotate the two Baoding balls without dropping them after under 2 hours of real-world training; success rates for both the easier and the more challenging rotation variants are reported in Figure 9. An example trajectory of the robot rotating the Baoding balls using our method is shown in Figure 1, and videos on the project website 111https://sites.google.com/view/pddm/ illustrate task progress through various stages of training. Qualitatively, we note that performance improves fastest during the first 1.5 hours of training; after this, the system must learn the more complex transition of transferring control of a Baoding ball from the pinky to the thumb (with a period in between where the hand has only indirect control of the ball through wrist movement). These results illustrate that, although the real-world version of this task is substantially more challenging than its simulated counterpart, our method can learn to perform it with considerable proficiency using a modest amount of real-world training data.
5 Discussion
We presented a method for using deep model-based RL to learn dexterous manipulation skills with multi-fingered hands. We demonstrated results on challenging non-prehensile manipulation tasks, including controlling free-floating objects, agile finger gaits for repositioning objects in the hand, and precise control of a pencil to write user-specified strokes. As we show in our experiments, our method achieves substantially better results than prior deep model-based RL methods, and it also has advantages over model-free RL: it requires substantially less training data and results in a model that can be flexibly reused to perform a wide variety of user-specified tasks. In direct comparisons, our approach substantially outperforms state-of-the-art model-free RL methods on tasks that demand high flexibility, such as writing user-specified characters. In addition to analyzing the approach on our simulated suite of tasks using 1-2 hours worth of training data, we demonstrate PDDM on a real-world 24 DoF anthropomorphic hand, showing successful in-hand manipulation of objects using just 4 hours worth of entirely real-world interactions. Promising directions for future work include studying methods for planning at different levels of abstraction in order to succeed at sparse-reward or long-horizon tasks, as well as studying the effective integration of additional sensing modalities such as vision and touch into these models in order to better understand the world.
Appendix A Method Overview
At a high level, this method (Figure 10, Algorithm 1) of online planning with deep dynamics models involves an iterative procedure of (a) running a controller to perform action selection using predictions from a trained predictive dynamics model, and (b) training a dynamics model to fit that collected data. With recent improvements in both modeling procedures as well as control schemes using these high-capacity learned models, we are able to demonstrate efficient and autonomous learning of complex dexterous manipulation tasks.
Appendix B Experiment Details
We implement the dynamics model as a neural network with 2 fully-connected hidden layers of size 500 with ReLU nonlinearities and a final fully-connected output layer. For all tasks and environments, we train this same model architecture with a standard mean squared error (MSE) supervised learning loss, using the Adam optimizer [Kingma2014_ICLR] with a fixed learning rate. Each epoch of model training, as mentioned in Algorithm 1, consists of a single pass through the dataset, taking a gradient step for every sampled batch of 500 data points. The other relevant hyperparameters are listed in Table 1, referenced by task. For each of these tasks, we normalize the action space and command the hand at a fixed per-task control frequency. We list reward functions and other relevant task details in Table 2.
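The architecture described above can be sketched as follows, with plain numpy standing in for the deep-learning framework (only the forward pass and the MSE loss are shown; the input and output dimensions below are assumptions for illustration):

```python
import numpy as np

def init_mlp(in_dim, out_dim, hidden=500, rng=None):
    """2 hidden layers of width 500, plus a linear output layer."""
    rng = rng or np.random.default_rng(0)
    dims = [in_dim, hidden, hidden, out_dim]
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(dims[:-1], dims[1:])]

def forward(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)   # ReLU on hidden layers only
    return x

def mse_loss(params, s_a, s_next):
    """Standard MSE supervised loss on predicted next states."""
    pred = forward(params, s_a)
    return float(np.mean((pred - s_next) ** 2))

# Assumed dims for illustration: 30-D (state, action) input, 24-D state output.
params = init_mlp(in_dim=30, out_dim=24)
batch = np.random.default_rng(1).normal(size=(500, 30))
out = forward(params, batch)
```

In the actual implementation, this loss is minimized with Adam over minibatches of 500 transitions, as described in the text.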
[Table 2 columns: per-task horizon (sec), observation dimension, action dimension, and reward function]
Appendix C Tracker Details
In order to run this experiment in the real world, we use a camera tracker to produce 3D position estimates for the Baoding balls. The tracker provides low-latency, robust, and accurate 3D position estimates. To enable this tracking, we employ a dilated CNN modeled after the one in KeypointNet [Suwajanakorn:2018]. The input to the system is a 280x180 RGB stereo pair (no explicit depth) from a calibrated 12 cm baseline camera rig. The output is a spatial softmax for the 2D location and depth of the center of each sphere in the camera frame. Standard pinhole camera equations convert the 2D location and depth into 3D points in the camera frame, and an additional calibration converts these into the Shadow Hand's coordinate system. The model is trained in simulation and fine-tuned on real-world data. Our semi-automated process of composing static scenes with the spheres, moving the stereo rig, and using VSLAM algorithms to label the images using the relative poses of the camera views substantially decreased the amount of hand-labeling required: we hand-labeled only about 100 images across 25 videos, generating over 10,000 training images. We observe average tracking errors of 5 mm and a latency of 20 ms, split evenly between image capture and model inference.
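The "standard pinhole camera equations" step above can be sketched as follows: a pixel location (u, v) plus a depth z is back-projected to a 3-D point in the camera frame. The intrinsics below (focal lengths and principal point) are assumed values for illustration, not the rig's actual calibration:

```python
import numpy as np

def pixel_depth_to_camera(u, v, z, fx, fy, cx, cy):
    """Back-project pixel (u, v) at depth z into the camera frame."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Example with assumed intrinsics for a 280x180 image.
p = pixel_depth_to_camera(u=140.0, v=90.0, z=0.4,
                          fx=220.0, fy=220.0, cx=140.0, cy=90.0)
```

A pixel at the principal point back-projects straight along the optical axis; a fixed rigid transform (the "additional calibration" in the text) then maps such points into the hand's coordinate system.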