Reusable neural skill embeddings for vision-guided whole body movement and object manipulation

by   Josh Merel, et al.

Both in simulation settings and robotics, there is an ambition to produce flexible control systems that can enable complex bodies to perform dynamic locomotion and natural object manipulation. In previous work, we developed a framework to train locomotor skills and reuse these skills for whole-body visuomotor tasks. Here, we extend this line of work to tasks involving whole body movement as well as visually guided manipulation of objects. This setting poses novel challenges in terms of task specification, exploration, and generalization. We develop an integrated approach consisting of a flexible motor primitive module, demonstrations, an instructed training regime as well as curricula in the form of task variations. We demonstrate the utility of our approach for solving challenging whole body tasks that require joint locomotion and manipulation, and characterize its behavioral robustness. We also provide a high-level overview video (V1).




1 Introduction

Controlling the whole body of a humanoid is a challenging problem. Reinforcement learning has made great strides in learning from scratch in many game domains (mnih2015human; silver2018general); however, complex motor control problems involving physical bodies remain difficult even in simulation. Although there have been some successes discovering locomotion from scratch for reasonably sophisticated bodies (heess2017emergence), generation of more complex behaviors, especially whole body movements that include object interactions, has remained largely out of reach. These settings require the coordination of a complex, high-dimensional body to achieve a task goal, and satisfactory performance is often a complicated intersection of task success and additional constraints, e.g. with respect to the naturalness, robustness, or energy-efficiency of the movements. Unlike problems with clear single-task performance objectives, these criteria can be hard to formalize e.g. as a reward function, and even where this is possible, the discovery of good solutions through RL can be difficult.

Such settings may benefit from additional prior knowledge, for instance, in the form of demonstrations or skills transferred from other tasks. These can help with the discovery of rewarded behavior (e.g. heess2016learning) as well as constrain the nature of solutions that emerge (e.g. peng2018deepmimic; merel2018neural). As a particular challenge, we consider the case of constructing a humanoid controller with object manipulation skills that can be flexibly deployed in different tasks. While locomotion skills are a function of the body’s pose and its relation to the ground only, manipulation skills are inherently tied to objects in the environment; yet we want manipulation skills to be general enough to apply not just to a single scene with a particular object, but equally to novel configurations of objects. This setting requires movement abstractions that enable the flexible reuse of skills in novel situations and thus forces us to confront a fundamental trade-off between specificity of the behavior versus generality across skills. Narrow, stereotyped skills, e.g. from demonstrations, can serve as useful initializations of behavior in settings where the controller only needs to reproduce essentially one movement pattern, ensuring the rapid discovery of solutions as well as naturalistic movements. Settings requiring unknown compositions of motor skills benefit from exploration in a space of motor skills that, while informed by demonstrations, are not fully determined by them. Such generalization admits a broader range of target tasks, but also increases the space of movements that needs to be searched, and can make it possible for the solution to deviate from essential characteristics of the demonstrations in undesirable ways.

In this paper we develop an integrated learning approach for humanoid whole-body manipulation and locomotion in simulation that allows us to strike a satisfactory balance. It consists of the following components: (1) a general purpose low-level motor skill module that is derived from motion capture demonstrations, yet is scene agnostic and can therefore be deployed in many scenarios; (2) a hierarchical control scheme, consisting of a high-level task policy that operates from egocentric vision, possesses memory, and interfaces with the motor module; (3) a training procedure involving a broad distribution of task variations to achieve generalization to a number of different environmental conditions; and lastly, (4) training using a phased task, in which the task policy is trained to solve stages of the task using simple rewards, which, together with the use of demonstrations, greatly facilitates exploration and allows us to learn complex multi-step tasks while minimizing the need for complicated shaping rewards.

We apply our approach to two challenging tasks, both involving a humanoid interacting bimanually with large objects such as boxes and medicine balls. The two tasks are an instructed box manipulation task in which the simulated character needs to move boxes between shelves (a highly simplified “warehouse” setting) and a ball catching and tossing task (“toss”). Both tasks are solved either from task features or egocentric vision by the same motor module (albeit different task policies) demonstrating the possibility of general and reusable motor skills that can be deployed in rather diverse settings. The results demonstrate the flexibility and generality of the approach, which achieves significant generalization beyond the raw demonstrations that the system was bootstrapped from, and constitute another step towards general learning schemes for sophisticated whole-body motor control in physical environments in simulation.

2 Related work

Research in robotics as well as control of simulated characters has often focused on locomotion and manipulation as separate problems. Robust quadrupedal or bipedal locomotion, even without object interaction, is already a challenging problem. Significant progress has been made recently for simulated and robotic quadrupeds (peng2016terrain; hwangbo2019learning; lee2019robust) as well as simulated humanoids (heess2017emergence; merel2018hierarchical; merel2018neural). Separately, manipulation is often studied in settings involving a virtual robotic arm that is disconnected from a body (rajeswaran2017learning) or attached to a table such that the manipulation problem is isolated from the challenge of moving a body (zhu2018reinforcement; lynch2019learning). There is a wide range of these kinds of tabletop manipulation settings for “pick-and-place” robotics tasks (mahler2017dex; levine2018learning; bousmalis2018using).

As of yet, there has been limited research that attempts to handle whole body movement and manipulation jointly (sentis2005synthesis; otani2017adaptive). There are other isolated cases of whole body humanoid movement involving objects, but a shortcut is often taken involving forming fixed attachments between the hands of the body and the manipulated object in order to simplify the problem (mordatch2012discovery; peng2019mcp).

Demonstrations can be readily obtained for many simple real or simulated robotics systems, for instance through teleoperation or via a human operator physically guiding the pose of the robot. The classical approach for learning from such demonstrations amounts to using the demonstration to essentially initialize the policy, and learning how to deviate from the demonstration to solve the task at hand (smart2002effective; schaal2003computational). It has long been recognized that given a small number of demonstrations, it is not sufficient to try to directly mimic the demonstration, as there will be some discrepancies when recapitulating the movement and the movement should generalize to states other than those exactly witnessed in a demonstration (atkeson1997robot; schaal1997learning). A fairly direct approach involves fitting the demonstrations to a parametric form and using RL to modulate the parameters of the fitted model (guenter2007reinforcement; peters2008reinforcement; kober2009policy; pastor2011skill). Slightly more indirect approaches consist of using the demonstrations to learn local models from which a policy can be derived (coates2008learning) or using the demonstrations to infer the objective for the policy through inverse optimal control (ng2000algorithms; ho2016generative; englert2018learning). In Deep RL settings involving a replay buffer, and when the demonstrations include actions and reward on the task to be solved, it is possible to fill the replay buffer with teleoperation demonstrations (vevcerik2017leveraging). There are yet other approaches in which matching demonstrations and solving tasks both serve as rewards when training a policy (kumar2016learning; peng2018deepmimic; merel2017learning; zhu2018reinforcement).
Additionally, as an alternative to demonstrations, but serving the same basic role, it is possible to design controllers that incorporate domain knowledge for some tasks and to then use learning to refine the behavior around this initial, engineered policy. This scheme has been applied to efforts in robotics for tossing objects (zeng2019tossingbot) and catching objects (kim2014catching).

The commonality across this broad class of existing approaches for learning from demonstrations is that they are well suited primarily for settings in which there is a single variety of movement that needs to be reproduced and where the demonstrations are well aligned with the behavior required to solve the task. While these approaches have been successful in various appropriate cases, we don’t see them as viable for more complex tasks that require composition and arbitrary re-sequencing of motor skills. Ideally we wish for this skill space to serve both as a generic “initialization” of the policy as well as a set of constraints on the behavior; yet we also want the skill space to be multipotent, in the sense that it can be leveraged for multiple distinct classes of target task, rather than serve only for a narrow range of movements. While some work has aimed to build motor skill modules from unstructured demonstrations (jenkins2003automated; niekum2012learning), work to date has not aimed towards learning motor skills with generic policies at a scale usable for Deep RL. Generalization beyond individual trajectories has only just begun to be solved, for instance through general purpose skill modules (merel2018neural), striking a balance between realism of the movements and the degree to which new movements can be synthesized from a finite set of demonstrations.

3 Approach

Figure 1: Three stage overview for producing and reusing skills. Stage 1 involves training a large set of separate, single-behavior “expert” policies. For each motion capture clip trajectory (depicted by a curve), we produce a policy that tracks that trajectory. Stage 2 involves distilling the experts produced in stage 1 into a single inverse model architecture. The inverse model receives the state for a few future steps (from $s_{t+1}$ to $s_{t+K}$, for a fixed look-ahead $K$ in our setting), an encoder embeds this into a latent intention, and the decoder produces the action that will achieve the transition from $s_t$ to $s_{t+1}$. Stage 3 involves training only the task policy to reuse the frozen low-level skill module, using the learned embedding space to communicate what to do.

In this work, we develop an approach for skill transfer and learning from demonstrations in the setting of visually-guided humanoid control with object interactions. By “skill transfer”, we refer to a system which contains basic motor competency from previous learning on a source distribution of demonstrations or tasks, such that these motor skills can be leveraged on new tasks. We propose and evaluate an approach which involves taking unlabelled skills (without rewards) and creating a motor module or low-level controller that can reproduce and interpolate between these skills. This module can be seen as forming a skill embedding space and it can be used for multiple object-interaction tasks.

The general workflow for producing the skill module and reusing it is depicted in Figure 1. It consists of three stages. Firstly, expert policies are generated which are capable of robustly tracking individual motion capture clips in the presence of noise. The second stage consists of distilling these policies into a single conditional policy, or inverse model, which maps the state at the current timestep ($s_t$) and the desired state at timesteps in the near future ($s_{t+1}, \ldots, s_{t+K}$) to the first action ($a_t$) of a sequence of actions that would result in that desired future. As explained in more detail in Section 3.3, this architecture is separated into an encoder and decoder, which communicate via a multi-dimensional, continuous random variable that reflects short term motor intention. The decoder can also be interpreted as a conditional policy that is trained via a form of behavioral cloning. This training procedure follows the approach described in (merel2018neural), and we refer to this architecture as “neural probabilistic motor primitives” (NPMP). Finally, the third stage amounts to reusing the NPMP decoder as a low-level controller in the context of new tasks, by treating the learned motor intention space as an action space for a new controller. For a new task, a high-level task policy that receives observations appropriate for the target task is trained, for example, by model-free RL. The low-level skills are not trained and the task policy outputs “actions” corresponding to latent variables that serve as “commands” to the now fixed low-level module. This means that the low-level module transforms the initial noise distribution of an untrained task policy into “colored” noise that reflects the coordinated movement statistics of the motion capture data. This constrains movement exploration and the solutions that can be found by the RL algorithm to the manifold of human-like behavior that can be produced by the motor module.
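The third stage can be illustrated with a toy control step. The following is a minimal sketch, not the authors' implementation: the dimensions, the linear-tanh decoder, and the names `low_level_decoder` and `task_policy` are all assumptions made for illustration. It shows only the interface: the task policy emits a latent command, and a frozen decoder turns that command plus proprioception into a motor action.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; the real body has far more observations and actuators.
PROPRIO_DIM, LATENT_DIM, ACTION_DIM, TASK_OBS_DIM = 8, 4, 3, 5

# Frozen low-level decoder weights: a stand-in for the pretrained NPMP decoder.
W_dec = rng.normal(size=(ACTION_DIM, PROPRIO_DIM + LATENT_DIM))


def low_level_decoder(proprio, z):
    """Map (egocentric proprioception, latent command) to a bounded action."""
    return np.tanh(W_dec @ np.concatenate([proprio, z]))


def task_policy(task_obs, theta, noise_scale=0.1):
    """Trainable high-level policy: outputs a latent command, not an actuator action."""
    mean = theta @ task_obs
    return mean + noise_scale * rng.normal(size=mean.shape)


# Only theta would be updated by RL; W_dec stays fixed during task training.
theta = np.zeros((LATENT_DIM, TASK_OBS_DIM))
task_obs = rng.normal(size=TASK_OBS_DIM)
proprio = rng.normal(size=PROPRIO_DIM)

z = task_policy(task_obs, theta)         # "action" of the task policy
action = low_level_decoder(proprio, z)   # action actually sent to the body
```

Because the decoder sits between the task policy and the actuators, even an untrained `theta` produces actions shaped by the decoder rather than raw white noise, which is the "colored noise" effect described above.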

The particular challenges of the manipulation tasks considered in this work mean that several additional elements of the training process are critical. Manipulation requires directed interaction with elements of the environment, and these are difficult to discover by chance and learn even when the search is restricted to the space of movements expressed by the skill module. We address this by choosing a suitable distribution of initial configurations of the body and objects, along with variations for object masses and sizes. Taken together, these initializations and variations facilitate learning by placing the body in configurations that vary in difficulty and distance from reward, serving as an organic curriculum. Finally, we also found that while the skill module is somewhat robust to variations in which expert demonstrations are included, there is a trade-off between specificity and generality of skills. Additional details of these elements of our proposed training process will be elaborated when presenting the tasks, and we will demonstrate the relevance through ablations (see Results, Section 4.3).

All simulations are performed using the physics simulator MuJoCo (todorov2012mujoco). The humanoid body has 56 actuated degrees of freedom, and is a version of a body that was developed and employed in previous work (merel2017learning; merel2018hierarchical; merel2018neural). The standard size body is publicly available as part of the DeepMind Control codebase (tassa2018deepmind).

3.1 Demonstrations for skills

Figure 2: (A) We implemented virtual analogs of the objects that were tracked with motion capture. (B) & (C) Frames of motion capture for box interaction and ball tossing along with prop and humanoid body set to those poses in the physics simulator.

We collected motion capture data of a person performing bimanual, whole-body box manipulation movements, ball tossing, and various locomotor behaviors with and without objects in-hand. When objects were involved, we also collected motion capture for the objects. To go from point-cloud, raw motion capture data to body-specific movements, we implemented simultaneous tracking and calibration (wu2013stac), which solves a joint optimization problem over body pose and marker position. See Supplementary Section A for further implementation details. Figure 2 shows a visualization of the virtualized props and humanoid set to poses from the motion capture. We measured the lengths of the body segments of the person whose motion data we collected and re-sized the corresponding segments of the virtual character to approximately match. This ensures that the positions of the hands relative to tracked objects in the virtual environment are similar to those in the real setting. This precision in body dimensions also made the mapping from point cloud to body poses more robust. Nevertheless, the dimensions of the virtual humanoid are still only approximations to the human actor and the dynamic properties differ substantially.

The dataset used in this paper consists of a single subject interacting with 8 objects. The objects were two “large” balls, two “small” balls, two “large” boxes, and two “small” boxes, where one object of each size was 3 kg and the other was 10 kg. We also considered interactions at 3 heights, “floor-height”, “torso-height”, and “head-height”. For each object, at each height, we collected two repeats of behavior consisting of the actor approaching a pedestal on which an object is resting, picking it up, walking around with the object in hand, returning to the pedestal, placing the object back on the pedestal, and then backing away from the pedestal. In total this was 48 clips (8 objects × 2 repeats × 3 heights), each of which was generally no less than 10 seconds and no longer than just over 20 seconds. Other less structured behavior was captured, including walking around with no object (“walking”) as well as playing catch with the balls with a second person (“ball-tossing”; one person and the ball were tracked). In total, we use a little less than 20 min of data (1130 sec). For representative examples, see videos of motion capture playback: box interaction video (V2) and ball tossing video (V3).

3.2 Single-clip tracking for object manipulation

Figure 3: (A) Tracking performance and corresponding filmstrip is shown for a representative warehouse expert clip. We initialize the expert at timepoints throughout the clip and the policy controls the behavior to the end of the clip. Time within a rollout since the initialization is depicted via intensity. The policy is robust in that it controls the body (interacting with the box) to remain on track. (B) Representative summary of the expert tracking performance for rollouts as a function of time since initialization for all medium height experts. Rollouts lasting up to 3 seconds show only limited accumulated tracking error, indicating experts are well-tracked. (C) Performance and filmstrip for a ball-tossing expert indicating good tracking performance until the ball is released from the hands, at which point the performance deteriorates due largely to the loss of control over the ball. Despite the inability to control the ball to match the reference ball trajectory perfectly, visually, the expert looks reasonable through the release of the ball.

To produce expert policies, we use a tracking objective and train time-indexed policies to reproduce the movements observed via motion capture (peng2018deepmimic; merel2018hierarchical), here including the position of the object. Similarly to merel2018hierarchical, we provide the agent a normalized tracking reward ($r_{\text{track}}$) that reflects how well the body and object in the virtual environment match the reference:

$$r_{\text{track}} = \exp\left(-\beta \, E_{\text{total}} / w_{\text{total}}\right)$$

where $w_{\text{total}}$ is the sum of the per energy-term weights and $\beta$ is a sharpness parameter (held fixed throughout). The energy term is a sum of tracking terms, each of which corresponds to a distance between the pose of the physically simulated body and a reference trajectory derived from motion capture:

$$E_{\text{total}} = w_{\text{qpos}} E_{\text{qpos}} + w_{\text{qvel}} E_{\text{qvel}} + w_{\text{ori}} E_{\text{ori}} + w_{\text{app}} E_{\text{app}} + w_{\text{vel}} E_{\text{vel}} + w_{\text{gyro}} E_{\text{gyro}} + w_{\text{obj}} E_{\text{obj}}$$

with terms for tracking the reference joint angles ($E_{\text{qpos}}$), joint velocities ($E_{\text{qvel}}$), root quaternion ($E_{\text{ori}}$), body-frame vectors from the root to appendages (hands, feet, head; $E_{\text{app}}$), translational velocity ($E_{\text{vel}}$), root rotational velocities ($E_{\text{gyro}}$) and object position ($E_{\text{obj}}$). See Supplementary Section B for more specific details. Note that to encourage robustness of the controller, we train in the presence of moderate action noise: noise is sampled from a Gaussian independently per actuator, with a fixed standard deviation $\sigma$ defined relative to the range of the actions.
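As a small illustration of how such a reward behaves, the sketch below computes a normalized tracking reward from per-term energies. It assumes the exponentiated, weight-normalized form $\exp(-\beta E_{\text{total}} / w_{\text{total}})$; the term names, the weights, and the value of `beta` in the usage lines are hypothetical placeholders, not values taken from the paper.

```python
import math


def tracking_reward(energies, weights, beta):
    """Normalized tracking reward in (0, 1].

    energies: dict mapping term name -> non-negative tracking error.
    weights:  dict mapping term name -> weight for that term.
    beta:     sharpness; larger values penalize errors more steeply.
    """
    w_total = sum(weights.values())
    e_total = sum(weights[name] * energies[name] for name in weights)
    return math.exp(-beta * e_total / w_total)


# Hypothetical weights and errors, for illustration only.
weights = {"qpos": 1.0, "qvel": 1.0, "obj": 1.0}
perfect = {"qpos": 0.0, "qvel": 0.0, "obj": 0.0}
ball_lost = {"qpos": 0.0, "qvel": 0.0, "obj": 0.9}

r_perfect = tracking_reward(perfect, weights, beta=5.0)
r_ball_lost = tracking_reward(ball_lost, weights, beta=5.0)
```

Dividing by the total weight keeps the reward scale comparable when terms are added or reweighted. The second example also shows how a single term, here the object position, can pull the reward down even when the body itself is tracked perfectly.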

We produced experts for the behaviors discussed in the previous section (object pick-up, carrying, and put-down at different heights and object sizes/weights, walking without an object, and ball tossing). To assess performance of the individual tracking controllers, we perform rollouts starting from different points along the trajectory and see how well the policy rollouts align with the motion capture reference. In general we find that tracking performance is good for “warehouse” behavior experts (Figure 3A) with only a small falloff as a function of duration of rollout (Figure 3B). For “toss” behavior experts, performance sometimes shows a sharp fall-off after tossing the ball (Figure 3C). However, this performance decline is primarily due to the object tracking term when the ball is no longer directly controlled and does not reflect failure of body tracking, as visually discernible from Figure 3D. As a data augmentation, we also produced “mime” experts for which the expert was trained to track the human reference movements involving object interactions, but for which the object was not present in the virtual environment; that is, the expert produces movements as if it is interacting with an object, but it is just “miming” the movements. We found that this better balanced the expert data: without it, the overwhelming majority of the data involved grasping an object, and thus the low-level skill module was strongly biased to close the hands of the humanoid together.

3.3 Training the motor module for locomotion and manipulation

The single-behavior expert policies track individual motion capture trajectories but do not directly generalize to new tasks or even configurations of the environment. To ensure reusability of the skills we therefore followed (merel2018neural) and distilled expert behaviors into a single module with suitable architecture (the “neural probabilistic motor primitives”, or NPMP). Unlike (merel2018neural) we are interested in manipulation skills which depend strongly on the environment, not just the body controlled by the agent. To enable the motor module to be usable in various environments, we therefore choose the following factorization: during training, we give the encoder access to the state both of the humanoid as well as the object used in the expert trajectory. The decoder, however, only directly receives egocentric humanoid proprioceptive information. By construction, the decoder will be reusable as a policy that only requires egocentric observations of the humanoid body which are consistent across environments. When reusing the skill module, any awareness of the objects in the scene must be passed to the low-level controller via the latent variable produced by the task policy.

The training procedure for the motor module follows the approach presented in merel2018neural. We train the model in a supervised fashion to model state-action sequences (trajectories) generated by executing the various single-skill expert policies while adding independent Gaussian noise to the actions.

Specifically we maximize the following evidence lower bound (ELBO):

$$\mathbb{E}_{q}\left[\sum_{t=1}^{T} \log \pi(a_t \mid s_t, z_t) + \beta \left( \log p_z(z_t \mid z_{t-1}) - \log q(z_t \mid z_{t-1}, s_{t+1:t+K}) \right)\right]$$

where $\pi(a_t \mid s_t, z_t)$ corresponds to the decoder, which is a policy conditioned on the latent variable $z_t$ in addition to the current state $s_t$. $q(z_t \mid z_{t-1}, s_{t+1:t+K})$ corresponds to the encoder which produces latent embeddings based on short snippets into the future. $\beta$ controls the weight of the autoregressive prior $p_z(z_t \mid z_{t-1})$ over the latent variables which regularizes the skill embedding $z_t$. As noted in merel2018neural, the NPMP can be used for one-shot imitation and this may provide insight into which movements are best captured in the skill space; see Supplementary Section C for this analysis.
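To make the structure of this objective concrete, the following sketch evaluates a single-timestep, single-sample ELBO term. It is an illustration under stated assumptions rather than the authors' training code: the decoder and encoder are taken to be diagonal Gaussians, and the autoregressive prior is assumed to be an AR(1) process with unit marginal variance, $z_t = \alpha z_{t-1} + \sqrt{1-\alpha^2}\,\epsilon$. The function names and all numeric values are hypothetical.

```python
import numpy as np


def gaussian_log_prob(x, mean, std):
    """Log density of a diagonal Gaussian, summed over dimensions."""
    var = np.asarray(std, dtype=float) ** 2
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * var) - 0.5 * (x - mean) ** 2 / var))


def elbo_step(a_t, a_mean, a_std, z_t, z_prev, q_mean, q_std, alpha, beta):
    """Single-sample ELBO contribution for one timestep.

    a_mean, a_std: decoder output for (s_t, z_t), the action distribution.
    q_mean, q_std: encoder output from which z_t was sampled.
    alpha:         AR(1) coefficient of the assumed prior.
    beta:          weight on the prior/posterior regularization term.
    """
    recon = gaussian_log_prob(a_t, a_mean, a_std)                  # log pi(a_t | s_t, z_t)
    prior_std = np.sqrt(1.0 - alpha ** 2)
    log_prior = gaussian_log_prob(z_t, alpha * z_prev, prior_std)  # log p_z(z_t | z_{t-1})
    log_post = gaussian_log_prob(z_t, q_mean, q_std)               # log q(z_t | ...)
    return recon + beta * (log_prior - log_post)


# Illustrative values only.
alpha = 0.95
z_prev = np.array([0.5, -0.2])
z_t = np.array([0.4, 0.0])
a_t = np.array([0.1, -0.3, 0.2])

# An encoder that differs from the prior incurs a regularization cost.
value = elbo_step(a_t, a_t, 1.0, z_t, z_prev,
                  q_mean=np.array([0.4, 0.0]), q_std=0.05, alpha=alpha, beta=0.5)
```

The reconstruction term is a behavioral cloning loss on the expert action, while the second term pulls the encoder toward a temporally smooth latent trajectory. If the encoder output coincides with the prior, the regularization term vanishes and only the cloning term remains.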

3.4 Training task policies that reuse low-level skills

To train the task policy that reuses the low-level skill module, we use a model-free distributed RL setup with a single learner and many actors (1000 here), similar to IMPALA (espeholt2018impala). The value-function critic was trained using off-policy correction via V-trace. The policy was updated using a variant of MPO (abdolmaleki2018maximum), with the E-step modified to use the empirical returns and the value-function, instead of the Q-function (song2019v). The task policies consist of image and proprioceptive preprocessing networks, a small ResNet (he2016deep) and MLP respectively, followed by an LSTM layer (hochreiter1997long) that branches into a value function network and a policy network. See Figure 4 for a schematic of the task policy architecture. This architecture is similar to other work (merel2018hierarchical; merel2018neural); the exact details are not critical for the results obtained.
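For readers unfamiliar with V-trace, the sketch below computes V-trace value targets for a single trajectory, following the standard definition in espeholt2018impala. It is a plain NumPy illustration, not the distributed implementation used here, and the default discount and clipping thresholds are placeholder assumptions.

```python
import numpy as np


def vtrace_targets(rewards, values, bootstrap_value, log_rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace targets v_s for one trajectory of length T.

    rewards, values, log_rhos: float arrays of shape [T].
    log_rhos: log(pi(a|x) / mu(a|x)), the learner/actor importance weights.
    bootstrap_value: V(x_T), the value estimate after the last step.
    """
    rhos = np.exp(log_rhos)
    clipped_rhos = np.minimum(rho_bar, rhos)   # truncated weights for the TD error
    cs = np.minimum(c_bar, rhos)               # truncated weights for the trace
    values_tp1 = np.append(values[1:], bootstrap_value)
    deltas = clipped_rhos * (rewards + gamma * values_tp1 - values)

    # Backward recursion: v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1})).
    corrections = np.zeros_like(values)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        corrections[t] = acc
    return values + corrections


rewards = np.array([1.0, 1.0, 1.0])
values = np.array([0.5, 0.2, 0.1])
on_policy = vtrace_targets(rewards, values, bootstrap_value=0.0,
                           log_rhos=np.zeros(3), gamma=0.9)
```

When actor and learner policies agree (all importance weights equal to one), the targets reduce to ordinary n-step returns; the clipping only matters when actors lag behind the learner, which is the usual situation with many asynchronous actors.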

Figure 4: When training a high-level task policy to reuse the low-level controller, up to three streams of input are potentially available. Whichever of the egocentric image input, task instruction input, and proprioception streams are available are each passed through a preprocessor network. A value function and policy branch from a shared LSTM, and the policy also receives skip connections for the task and proprioception input streams. The policy output here refers to high-level actions that serve as inputs to the low-level controller.

4 Results

4.1 Core tasks

In this work, we defined two challenging object interaction tasks, and we show that the low-level skill module can be used to solve either of these, when a high-level, task-specific policy is trained to reuse the skills on each task. Our two core tasks are a proto-warehouse task (“warehouse”) and a ball tossing task (“toss”). The warehouse task involves going to a box that is on a pedestal, picking up the box, bringing it to another pedestal, putting the box down, and repeating. To make the task unambiguous, we provide the agent with a task “phase” or “instruction” that indicates which of these four phases of the task the agent is presently in. This phase also provides a natural way of providing sub-goals, insofar as sparse rewards are provided after each phase of the task has been completed. In addition, we provide the agent with the position (relative to itself) of the target pedestal (either to which it must go in order to pick up a box, or to put down a box). This active pedestal is highlighted in the videos (and this highlighting is available to vision-based agents). See Supplementary Section D for further details about the task specification.

Our second task consists of catching a ball and then tossing it into a bucket. In this task, the ball is always thrown towards the humanoid initially. The task is terminated with a negative reward if the ball touches the ground, which incentivizes the agent to learn to catch the ball to avoid dropping it. A small shaping reward encourages the agent to bring the ball towards the bucket, and a sparse positive reward is provided if the ball is deposited into the bucket. See Supplementary Section E for details.

Figure 5: (A) “Warehouse” task involving instructed movements of indicated boxes from one indicated pedestal to another. (B) “Toss” task involving catching a ball thrown towards the humanoid and tossing it into a bucket on the ground. Tasks can be performed either from state features or from egocentric vision.

Both tasks are generated procedurally and with several task parameters sampled from a distribution on a per-episode basis. For the warehouse, the variations apply to the pedestal heights, box dimensions, and box masses. For the tossing task, the variations apply to the ball size, mass, the trajectory of the ball thrown towards the humanoid, and the position of the box. In both tasks, mass variations are depicted visually by object color (darker is heavier). In the warehouse task, we also initialize episodes in the various phases of the task and sample initial poses of the body from the motion capture data. These task variations and initializations are important for the successful training of the task policies as we will show in Section 4.3. We either provide visual information (an egocentric camera mounted on the head of the humanoid) or we use state features which consist of the position of the prop relative to the humanoid as well as the orientation of the prop, and we compare performance using these different features.
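A per-episode sampler for the warehouse variations could look like the sketch below. The numeric ranges, the field names, and the `WarehouseEpisode` container are hypothetical; only the set of quantities being varied (pedestal height, box dimensions, box mass, and the starting phase among the four task phases) comes from the description above.

```python
import random
from dataclasses import dataclass


@dataclass
class WarehouseEpisode:
    pedestal_height: float  # metres (hypothetical range)
    box_size: float         # edge length in metres (hypothetical range)
    box_mass: float         # kilograms (hypothetical range)
    start_phase: int        # which of the four task phases the episode begins in


def sample_warehouse_episode(rng):
    """Draw one set of task parameters; called once per episode."""
    return WarehouseEpisode(
        pedestal_height=rng.uniform(0.3, 1.4),
        box_size=rng.uniform(0.25, 0.45),
        box_mass=rng.uniform(3.0, 10.0),
        start_phase=rng.randrange(4),
    )


rng = random.Random(0)
episodes = [sample_warehouse_episode(rng) for _ in range(100)]
```

Starting some episodes in later phases means the agent sometimes begins close to a reward (for example, already holding the box near the target pedestal), which is one way the variations can act as an implicit curriculum.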

4.2 Performance on tasks

Figure 6: Performance for “warehouse” task: (A) Representative learning curves (best of 3 seeds) comparing vision-based and state-based performance on the warehouse task, as a function of learner update steps. (B) For the trained vision-based policy, heatmap overlain on top-down view visualizes probability of successful pickup as a function of initial location. (C) Representative filmstrip of behavior in the warehouse task from egocentric and side view.

Figure 7: Performance for “toss” task: (A) Representative learning curves (best of 3 seeds) comparing vision-based and state-based performance on the toss task, as a function of learner update steps. (B) For the trained state-based policy, heatmap indicates the episode return as a function of initial ball velocity. (C) Representative filmstrip of behavior in the toss task from egocentric and side view.

We trained task policies that operate from state and visual observations on both tasks and found that successful reuse was possible using either observation type. Note, however, that comparable experiments from vision require longer wall-clock time due to rendering requirements. On the warehouse task, visual information seemed to improve learning (Figure 6A), whereas state information was better on the toss task (Figure 7A); however, policies using either feature set trained to a reasonable performance level. For representative performance and behavior of the vision based policies, see the representative “warehouse” task video (V4) and “toss” task video (V5).

Note that without reusable motor skills, an alternative is to learn the task from scratch. This is difficult, as rewards in these tasks are sparse and therefore do not strongly shape behavior. Critically, in these tasks it would be very difficult to design dense rewards that incentivize the right kind of structured behavior. That said, it did turn out to be possible to learn from scratch on the toss task from state information. After training for an order of magnitude longer (100e9 learning steps), the early terminations with penalty and shaping reward were sufficient to produce a policy that could solve the toss task, albeit without actually catching the ball, essentially using its back as a paddle. This same behavior was consistently learned across multiple seeds; see video (V6). For the warehouse task, training from scratch did not yield behavior that could solve a whole cycle of the task, but some progress was made for certain initial conditions; see video (V7). Note that for experiments from vision (slower than experiments from state), training a policy for 150e9 steps took roughly 3 weeks of wall-clock time, so it was not feasible to systematically explore training from scratch for significantly longer intervals.

For the warehouse task, we provide an additional evaluative visualization of the final performance to provide a clearer sense of how well it works. For any stage of the behavior, we can take a trained agent and assess how reliably it can perform a given behavior from different initial positions. We defined a grid of initial x-y locations in the plane. For each location we initialized the humanoid there 10 times, randomizing over orientation, body configuration (sampled from motion capture), and initial velocity. We then computed the fraction of trials for which the humanoid was able to successfully pick up a prop. We visualize a top down view, with the agent aiming to pick up the prop located on the pedestal on the right side of the top down view, with the heatmap of success probability overlain (Figure 6B). The agent is generally robust to initial position of the humanoid, with some limited fraction of initializations that are too close to the pedestal failing, presumably due to initial poses or velocities that make it especially difficult.
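The grid evaluation described above can be sketched in a few lines. This is only an illustrative outline: `run_trial` stands in for a rollout of the trained policy from a given start location, and `toy_trial`, its pedestal location, and its 0.5 m failure radius are hypothetical placeholders rather than anything measured in our experiments.

```python
import random


def success_grid(run_trial, xs, ys, repeats=10, seed=0):
    """Estimate pickup success probability for each (x, y) start location."""
    rng = random.Random(seed)
    grid = {}
    for x in xs:
        for y in ys:
            # Each repeat is expected to randomize orientation, body pose and
            # initial velocity internally, using the shared random generator.
            successes = sum(bool(run_trial(x, y, rng)) for _ in range(repeats))
            grid[(x, y)] = successes / repeats
    return grid


def toy_trial(x, y, rng):
    """Hypothetical stand-in for a policy rollout (not the real simulation).

    It fails whenever the start is within 0.5 m of a pedestal assumed to sit
    at (3, 0); a real rollout would use `rng` to randomize the initial state.
    """
    distance = ((x - 3.0) ** 2 + y ** 2) ** 0.5
    return distance > 0.5


heatmap = success_grid(toy_trial, xs=[0.0, 1.0, 2.0, 3.0], ys=[-1.0, 0.0, 1.0])
```

The resulting dictionary maps each grid cell to a success fraction, which is the quantity rendered as a heatmap in Figure 6B.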

For the toss task, we similarly wanted to provide a statistical description of the core behavior of the trained agent. Again, we discretized the space of initial ball velocities (both towards the humanoid and horizontally relative to the agent). Consistent with training, we computed an initial vertical velocity such that the ball would be approximately at shoulder height when near the initial position of the humanoid. We initialized the ball velocity for 10 repeats in each velocity bin, randomizing over other variations. The heatmap depicted in Figure 7B indicates the “strike zone” of parameters for which the agent is able to catch the ball. Naturally, for initial velocities that are too horizontal it is simply not possible to catch the ball, and the probability of success falls off to zero (indicated by episode return of , corresponding to the ball hitting the ground).

We also remark that visual “quality” does not entirely align with performing the task optimally. Throughout the course of our research, we noticed that slightly worse optimizers or termination partway through training resulted in policies that were more conservative, scored fewer points, but might subjectively be considered to look more humanlike (less hurried). This is consistent with the humanlikeness of the movements being determined largely by the low-level controller, but performance of the task becoming increasingly hectic as the task-policy ultimately controls the body to move faster and with extreme movements to achieve more reward. For an example of a policy trained for less time, see video (V8).

4.3 Task variations

Figure 8: (A) Performance on the warehouse task as a function of using NPMPs trained with different ratios of expert data. “Mixed” uses natural proportions of all of our warehouse and ball tossing data, “no toss” omits all tossing data, and “toss++” uses ball toss experts upsampled by a factor of two. (B) Comparison of performance on the warehouse task under the default setting involving initializations at “all” phases of the task versus training with initializations restricted to either the “pickup” or “walk” phases. (C) We trained the task policy to perform the warehouse task with either the baseline variation in box sizes (blue) or only a smaller range of large boxes (orange) and then evaluated performance only interacting with large boxes. Variation on the wider distribution during training improved performance on the evaluation tasks.

In addition to demonstrating performance on the core tasks, a few themes emerged in developing our approach, for which we provide illustrative examples. In particular, some trends that we observed include: (1) the ratio of expert skills in the NPMP matters, (2) the initializations at different phases of the task matter for the warehouse task, but are not required for the toss task, and (3) more extreme variations benefit from a curriculum via variations (a similar result is reported in heess2017emergence).

First, we consider how important the relative ratios of different skills are in the NPMP. In extreme cases, this is trivially important. For example, in previous work (merel2018neural), we generated an NPMP with diverse locomotion skills, but without object interactions, and this controller cannot transfer to the warehouse task. A more nuanced question is how the relative quantities of ball tossing behavior versus warehouse behavior affect the ability of the NPMP to learn the two tasks. For illustration, we trained three NPMPs: one that only had access to warehouse experts, one that had both warehouse experts and ball toss experts in proportion to how much was collected (more of the motion capture data was warehouse than toss demonstrations), and one that trained on twice as much data from toss experts; that is, when training the NPMP, we recorded twice as many trajectories from ball toss experts as we did for other experts, thereby over-representing these experts in the NPMP training data. Note that in the toss-upsampled NPMP, toss experts were over-represented relative to our motion capture, but there was still more warehouse data than toss data even with this upsampling. We observed that while the toss-upsampled NPMP learned an arguably slightly more aesthetically satisfying toss behavior (no meaningful change in performance), it was more difficult for the toss-upsampled NPMP to learn the warehouse task. In figure 8A, we show comparisons of these different NPMPs on the warehouse task. While the toss-upsampled NPMP was ultimately able to learn the warehouse task, its performance was consistently lower and less robust across other hyperparameter settings.

In addition to the balance of expert data, we also examined the need to initialize the behavior in different phases of the warehouse task. In the warehouse task, the training episodes are initialized in all phases of the task in poses sampled from motion capture, forming a curriculum over variations. In the toss task, we did not initialize episodes in different task phases. To illustrate the need for starting the warehouse task in the various phases, we ran comparisons involving starting only in the pickup or walk phases of the task and found that neither of these settings was able to learn to solve the task (see figure 8B).

We also explored both decreasing and increasing the range of procedural variations across episodes. Based on previous work (heess2017emergence), it was our starting intuition to design the task with a sensible range of variations to facilitate learning – this meant that our initial distribution of variations basically worked. However, we also attempted to train the task policy to perform the task with only large boxes. We probed the original task policy trained on variable box size on variants of the warehouse task that only included larger boxes, and we see that training with variations improves performance on the probe task (figure 8C). Essentially, it was much more difficult to learn to solve this task without the variations in box size – no policy fully solved the task. For representative failure mode behavior, see video (V9).

Finally, we considered training on a wider distribution than our standard range of pedestal heights and this tended to work – this indicates that a broader, continuous task distribution could allow a policy to perform a wide range of movements, so long as exploration and learning are guided from some examples that are initially achievable. See a representative video (V10) showing performance when trained on this broader range of pedestal heights, including pedestals that are quite low to the ground as well as higher up.

5 Discussion

In this work, we demonstrated an approach for transfer of motor skills involving whole body humanoid movement and object interaction. We showed that a relatively small set of demonstration data can be used to provide a fairly generic low-level motor skill space that can be leveraged to improve exploration and learning on various tasks that can be solved via movements similar to those in the expert demonstrations. Importantly, a single skill module is multipotent, permitting reuse on multiple transfer tasks.

The multipotency of our approach to leveraging demonstrations and motor skills differentiates this approach from much preceding research. Instead of having to stick close to demonstrations of a single object interaction, we provide a large set of unlabeled demonstrations and generalize skills automatically from these. One open question that we believe will be important in future efforts involves how to best trade off the specificity of exploration provided by staying close to a narrow set of demonstrations versus generality obtained through leveraging more diverse demonstrations. Ideally, we would like to move in the direction of greater breadth of skills available, while more intelligently identifying which of the many available skills is likely to be relevant for a task at hand and exploring through preferential execution of those skills. As we showed, there is still some sensitivity to the relative ratios of the various skills in the space. It may be fundamental that there is some trade-off between exploration guidance and generality, and it would be interesting to better understand this in high-dimensional control settings.

There are a few additional caveats concerning the present approach, which are also related to the difficulty of exploration in complicated tasks. Training of the task policies is currently performed by model-free RL and is not data-efficient (requiring many environment interactions and learner updates). To a large extent, this slowness arises from dithering exploration at the level of the task policy. While we use low-level skills to structure exploration, this may not be sufficient for performing object interactions with high-dimensional bodies, especially in the presence of only sparse rewards. We are optimistic that future improvements will enable more intelligent exploration strategies. For example, the motif of using skills for exploration could potentially be repeated hierarchically to structure behavior in a task-directed or temporally abstract fashion. Another limitation is that, for the warehouse task, we leverage a curriculum via informative motion capture initializations, which expose the agent to favorable states that it may not have discovered on its own. It is interesting to note that the use of initializations is not required for the ball toss, where a shaping reward is more readily available, indicating that shaping rewards can, in some cases, be an adequate strategy.

Taken together, these limitations restrict the present approach to simulation settings; however, there is a growing literature on approaches involving transfer of policies trained in simulation to real world systems (sim-to-real) (rusu2017sim; sadeghi2016cad2rl; tobin2017domain; andrychowicz2018learning; zhu2018reinforcement; tan2018sim; hwangbo2019learning; xie2019iterative). While we believe it is important to develop powerful control methods even for simulated settings, sim-to-real may offer a path to translate these results into real world applications in the future.


We thank Tim Lillicrap for constructive input at the outset of the project, Vicky Langston for help coordinating the motion capture acquisition, Thomas Rothörl for assistance during our studio visit, and Audiomotion Studios for services related to motion capture collection and clean up. We also thank others at DeepMind for input and support throughout the project.



Appendix A Simultaneous tracking and calibration

Simultaneous tracking and calibration (STAC) is an algorithm for inferring joint angles of a body from point-cloud data when it is not known in advance precisely where the markers are on the body. The relevant variables include the body, which has a pose ($q$), as well as marker positions that are fixed to it ($x_{\mathrm{marker}}$). We observe via motion capture the sensor readings ($x_{\mathrm{sensor}}$), which should be equal to the positions of the markers at each timestep, up to negligible noise. STAC makes the assumption that the markers are rigidly attached to the body with fixed offsets ($x_{\mathrm{offset}}$); if those offsets are known, a forward kinematics call ($f_{\mathrm{kin}}(q, x_{\mathrm{offset}})$) allows us to compute the positions at which we expect sensor readings.

For known marker offsets, the pose of the body ($q$) can be inferred by optimizing (per frame):

$$q_t^{*} = \arg\min_{q} \left\| x_{\mathrm{sensor},t} - f_{\mathrm{kin}}(q, x_{\mathrm{offset}}) \right\|^2$$

We additionally know that the marker offsets should be the same at every timestep (assuming rigid attachment of markers). So similarly, if the pose of the body is known, the marker offsets can be inferred by optimizing:

$$x_{\mathrm{offset}}^{*} = \arg\min_{x_{\mathrm{offset}}} \sum_t \left\| x_{\mathrm{sensor},t} - f_{\mathrm{kin}}(q_t, x_{\mathrm{offset}}) \right\|^2$$

So overall, in order to perform joint optimization over unknown marker offsets and poses, we alternate between these optimization steps. We initialize the pose of the body to the null pose (approximately a t-pose) and roughly initialize the marker offsets by placing markers on the body part (without tuning placement precisely). The first optimization is of the pose, using the initial, coarsely placed markers (per frame). We then optimize the marker positions using frames sampled at a regular interval throughout a range-of-motion video (V11). We then re-optimize the joint angles per frame. We found that further alternation was not required and the marker offsets that were found using the range-of-motion clip worked well for all other clips.

In practice we also use a small regularization term, encouraging joint angles to be near the null pose, and we also warm-start the per-frame optimization at the inferred pose from the preceding timestep.
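To make the alternation concrete, the following is a deliberately simplified sketch, not our implementation. It assumes a single rigid body rotating in the plane, so the "pose" is one angle and both sub-problems have closed-form least-squares solutions; the real procedure operates on an articulated humanoid with numerical optimization, and the regularization and warm-starting mentioned above are omitted here. All function names and numbers are hypothetical.

```python
import math


def rotate(theta, p):
    """Rotate a 2D point p by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])


def fit_pose(sensors, offsets):
    """Angle minimizing sum_i ||sensor_i - R(theta) offset_i||^2 for one frame."""
    cross = sum(o[0] * x[1] - o[1] * x[0] for o, x in zip(offsets, sensors))
    dot = sum(o[0] * x[0] + o[1] * x[1] for o, x in zip(offsets, sensors))
    return math.atan2(cross, dot)


def fit_offsets(frames, thetas):
    """Offsets minimizing the summed error over frames, given the poses."""
    offsets = []
    for i in range(len(frames[0])):
        # Map each reading back into the body frame, then average over time.
        back = [rotate(-th, frame[i]) for frame, th in zip(frames, thetas)]
        offsets.append((sum(p[0] for p in back) / len(back),
                        sum(p[1] for p in back) / len(back)))
    return offsets


def residual(frames, thetas, offsets):
    """Total squared distance between readings and predicted marker positions."""
    total = 0.0
    for frame, th in zip(frames, thetas):
        for x, o in zip(frame, offsets):
            p = rotate(th, o)
            total += (x[0] - p[0]) ** 2 + (x[1] - p[1]) ** 2
    return total


def stac(frames, initial_offsets):
    thetas = [fit_pose(f, initial_offsets) for f in frames]  # pose, coarse markers
    offsets = fit_offsets(frames, thetas)                    # refine marker offsets
    thetas = [fit_pose(f, offsets) for f in frames]          # re-optimize pose
    return thetas, offsets


# Synthetic "range-of-motion" data generated from known offsets and angles.
true_offsets = [(1.0, 0.0), (0.0, 0.5), (-0.7, -0.2)]
frames = [[rotate(0.1 * t, o) for o in true_offsets] for t in range(20)]
coarse = [(0.9, 0.1), (0.1, 0.4), (-0.6, -0.3)]  # roughly placed markers
thetas, offsets = stac(frames, coarse)
```

In this toy setting the recovered offsets and angles explain the data up to a shared rotation, so only the residual (not the individual values) is expected to match the ground truth.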

Appendix B Single-clip tracking objective

In the main text, we described that the tracking reward arises from a weighted sum of terms that score how well different features of the reference are being tracked. More specifically, these objectives are:

where represents the pose or velocity and represents the reference value. The and are 3D Cartesian vectors from the root to the various appendages (head, hands, feet) or object (box or ball) in the root frame. The root is located in the pelvis of the humanoid. is in the global reference frame. In this work, for the body terms, we used coefficients , , , , , . These have been used in previous work [merel2018hierarchical]. The object tracking term coefficient, new to this work, was tuned to relatively strongly enforce object tracking. The same values were used for all clips, despite the diversity of behaviors, indicating the relative robustness of this approach.

Appendix C One-shot imitation evaluation

One-shot imitation involves providing the trained NPMP with a state-sequence and asking it to generate a sequence of actions that would reproduce that movement. In asking the trained NPMP to perform one-shot imitation, we get a glimpse into which skills it is able to perform well, and we can assess this performance for overlapping subcategories of clips. Note that one-shot imitation is not actually the objective that the NPMP was trained to perform, and one-shot imitation is difficult due to object interactions (see figure A.1). Both walking behavior and ball toss behavior are better captured than the pickup and putdown interactions with boxes. This presumably reflects the fact that, in terms of timesteps of data, there are fewer moments at which the difficult box interactions are being performed. As such, these quantifications may leave a misleading impression that one-shot behavior is worse than it is. To complement these quantitative metrics, we also provide a video (V12) showing a representative assortment of one-shot behavior, which shows that while the object interactions can be difficult, movements are broadly sensible.

Figure A.1: Here we depict, by behavior category, the one-shot performance of the trained NPMP. Note that relative performance exceeding that of the expert can occur if an expert is imperfect and the one-shot imitation is effectively denoised.

Appendix D Instructed warehouse task details

The warehouse task encourages moving a box from one pedestal to another, and repeating this process. The environment consists of a flat ground with four pedestals and two boxes that can be moved freely (note that we varied the number of boxes and pedestals as well, not reported in this paper, and results were similar). The humanoid walker can be controlled by an agent. The distance of each pedestal from the origin is individually drawn from a uniform distribution between 2.5 and 3.5 meters, and the pedestals are at equispaced angles around the origin. The height of each pedestal is set randomly between 0.45 and 0.75 meters. The size of each box is taken from one of our motion capture trajectories, but with a random multiplicative variation of between 0.75 and 1.25 applied. The mass of each box is also individually drawn from a uniform distribution between 2 kg and 7 kg (the real boxes were either 3 kg or 10 kg). The size and mass of each box are not provided as observations to the agent.
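A minimal sketch of this per-episode sampling is given below. The numeric ranges are those stated above; the function name, the dictionary layout, the use of uniform sampling for pedestal height and box scale, and the use of a single scale factor per box are assumptions made for illustration.

```python
import math
import random


def sample_warehouse_variations(rng, num_pedestals=4, num_boxes=2):
    """Draw one episode's procedural variations (hypothetical data layout)."""
    pedestals = []
    for k in range(num_pedestals):
        angle = 2.0 * math.pi * k / num_pedestals  # equispaced around origin
        distance = rng.uniform(2.5, 3.5)           # meters from the origin
        pedestals.append({
            "x": distance * math.cos(angle),
            "y": distance * math.sin(angle),
            "height": rng.uniform(0.45, 0.75),     # meters (uniform assumed)
        })
    boxes = []
    for _ in range(num_boxes):
        boxes.append({
            "size_scale": rng.uniform(0.75, 1.25),  # multiplies a mocap box size
            "mass": rng.uniform(2.0, 7.0),          # kilograms
        })
    return {"pedestals": pedestals, "boxes": boxes}


variations = sample_warehouse_variations(random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampled episode reproducible for a given seed.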

This task can be logically divided into four phases: walking empty-handed to a pedestal (GOTO), lifting the box from a pedestal (LIFT), carrying the box to a different pedestal (CARRY), and putting it down on the target pedestal (PUTDOWN). In our current work, we provide the agent with an observation that tells it which of these four phases it should be pursuing at a given timestep, as a one-hot vector. The positions of the focal pedestal and focal box relative to the walker are also provided as observations, where the focal box is the box that needs to be moved, and the focal pedestal depends on the phase of the task: in GOTO and LIFT it is the pedestal on which the box is initially placed, while in CARRY and PUTDOWN it is the target pedestal.

Each of the four phases has well-defined success criteria, as detailed in the table below (an empty cell indicates that a particular type of criterion is not used to determine success of a phase):

Phase | Walker position | Walker/box contact | Pedestal/box contact
GOTO | within 0.65 meters of focal pedestal | |
LIFT | | at least one contact point with each hand | zero contact points
CARRY | within 0.65 meters of focal pedestal | at least one contact point with each hand |
PUTDOWN | | no contact whatsoever | at least 4 contact points

At each timestep, the task logic determines whether the agent has successfully completed its current phase. If it has, a reward of 1.0 is given at that timestep only, and the task is advanced to the next phase. The phase transition is determined by a simple state machine, and the task repeats indefinitely up to a final episode duration (15s simulated time), at which point the episode is terminated with bootstrapping. While there is no prespecified maximum score, obtaining an undiscounted return greater than 10 within a 15s episode requires moving through the phases rapidly. Note that the episode is terminated with a failure (no bootstrapping) if either the walker falls (contact between a non-foot geom of the walker and the ground plane) or if a box is dropped (contact between one of the boxes and the ground plane).
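The phase logic can be sketched as a small state machine. This is an illustrative outline rather than the task code: the observation keys are hypothetical names, and the assumption that PUTDOWN cycles back to GOTO follows from the task repeating but is not spelled out above.

```python
PHASES = ("GOTO", "LIFT", "CARRY", "PUTDOWN")


def phase_succeeded(phase, obs):
    """Check the success criteria from the table (keys are hypothetical)."""
    near_pedestal = obs["walker_to_pedestal"] <= 0.65  # meters
    both_hands = (obs["left_hand_contacts"] >= 1
                  and obs["right_hand_contacts"] >= 1)
    if phase == "GOTO":
        return near_pedestal
    if phase == "LIFT":
        return both_hands and obs["pedestal_box_contacts"] == 0
    if phase == "CARRY":
        return near_pedestal and both_hands
    if phase == "PUTDOWN":
        released = obs["walker_box_contacts"] == 0
        return released and obs["pedestal_box_contacts"] >= 4
    raise ValueError(f"unknown phase: {phase}")


def step_task(phase, obs):
    """Return (next_phase, reward) for one timestep."""
    if phase_succeeded(phase, obs):
        # Reward is given only on the timestep at which the phase completes.
        next_phase = PHASES[(PHASES.index(phase) + 1) % len(PHASES)]
        return next_phase, 1.0
    return phase, 0.0
```

Failure terminations (a fallen walker or a dropped box) and the 15s time limit would be handled outside this function.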

At the beginning of each episode, after randomly sampling the variations described above, one of the four phases is picked as the initial phase for the episode from a uniform distribution. A motion capture trajectory is picked at random, and a random timestep from the segment corresponding to the initial phase within that trajectory is selected. The joint configuration of the walker, the position of the box relative to the walker, and both the walker’s and box’s velocities are synchronized to the state from this motion capture timestep. If the episode begins in either the LIFT or PUTDOWN phase, the displacement of the walker from the focal pedestal is also synchronized; otherwise we apply a random translation and rotation around the z-axis (i.e. yaw) to the walker and prop together as a rigid body.

Appendix E Ball toss task details

The toss task encourages catching a ball and subsequently throwing it into a bucket. The initial pose of the walker is randomly sampled from a range of motion capture poses related to ball tossing. The ball size and mass are procedurally randomized, and the angle and velocity of the ball are also procedurally randomized such that the ball is generally “thrown” towards the humanoid. The ball always starts behind the bucket, which is a few meters from the humanoid. The procedural ball trajectories are sampled by selecting a random velocity towards the humanoid as well as a target position (horizontally and vertically relative to the origin of the humanoid initial positions, within a strike zone). Initial horizontal and vertical velocities that satisfy the desired target conditions are computed, and at the initial timestep of the episode, the 3D components of the ball velocity are initialized accordingly. For robustness, random angular velocity is also applied to the ball at the initial timestep.
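The velocity computation can be illustrated with elementary projectile motion. The sketch below assumes drag-free ballistic flight under constant gravity, which may differ from the simulated dynamics; the function name, argument names, and example numbers are hypothetical.

```python
def launch_velocity(start, target, approach_speed, gravity=9.81):
    """Velocity so a ballistic ball launched at `start` passes through `target`.

    `approach_speed` is the chosen horizontal speed towards the humanoid.
    Returns (vx, vy, vz, flight_time).
    """
    dx, dy, dz = (t - s for s, t in zip(start, target))
    horizontal_distance = (dx * dx + dy * dy) ** 0.5
    flight_time = horizontal_distance / approach_speed
    vx = dx / flight_time
    vy = dy / flight_time
    # Solve z0 + vz * T - 0.5 * g * T^2 = z_target for vz.
    vz = dz / flight_time + 0.5 * gravity * flight_time
    return vx, vy, vz, flight_time


# Example: ball starts 5 m away at 1.0 m height and should arrive at 1.4 m,
# slightly to the side of the humanoid's origin.
start = (5.0, 0.0, 1.0)
target = (0.0, 0.2, 1.4)
vx, vy, vz, flight_time = launch_velocity(start, target, approach_speed=4.0)
```

Sampling `approach_speed` and `target` per episode then yields a family of trajectories that all pass through the strike zone.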

The incentives of the task are specified through rewards and termination logic. The primary element of the task is that if the ball touches the ground or if the humanoid falls (contact between a non-foot geom of the walker and the ground plane), the episode terminates with a negative reward. This strongly disincentivizes letting the ball fall to the ground and encourages the humanoid to remain standing. Even reliably achieving this level of performance over the range of procedural ball trajectories is difficult. In addition, once the ball reaches the humanoid, a shaping reward is activated that corresponds to a small positive per-timestep reward inversely related to the distance between the ball and the bucket (in the x-y plane, neglecting vertical height). This reward encourages the humanoid, after catching the ball, to walk somewhat towards the bucket. Finally, if the ball is in the bucket, there is a moderate per-timestep reward encouraging dropping the ball into the bucket; this final reward is sparse in the sense that it is achieved if and only if there is a contact between the bottom of the bucket and the ball. Once the agent has learned to drop the ball into the bucket, it learns to do this earlier (i.e. throw the ball) to achieve the reward as soon as possible.
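The structure of these incentives can be sketched as follows. Only the ordering of the logic is taken from the description above; the reward magnitudes, the `1 / (1 + d)` form of the shaping term, and the choice to add the shaping and bucket rewards together are all assumptions made for illustration.

```python
def toss_step(ball_on_ground, walker_fallen, ball_reached_walker,
              ball_bucket_distance_xy, ball_in_bucket,
              fail_penalty=-1.0, shaping_scale=0.1, bucket_reward=1.0):
    """Return (reward, terminated) for one timestep of the toss task.

    All default magnitudes are placeholders, not the values used in training.
    """
    if ball_on_ground or walker_fallen:
        # Failure: terminate the episode with a negative reward.
        return fail_penalty, True
    reward = 0.0
    if ball_reached_walker:
        # Small shaping reward that grows as the ball nears the bucket (x-y only).
        reward += shaping_scale / (1.0 + ball_bucket_distance_xy)
    if ball_in_bucket:
        # Sparse reward: only while the ball touches the bottom of the bucket.
        reward += bucket_reward
    return reward, False
```

For instance, with these placeholder values a caught ball held 1 m from the bucket yields a reward of 0.05 per timestep, while a ball resting in the bucket yields 1.1.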

Appendix F Supplementary video captions

V1: Overview video summarizing highlights of the paper.

V2: Kinematic playback of a motion capture clip of a box interaction.

V3: Kinematic playback of a motion capture clip of ball tossing.

V4: A representative illustration of the behavior of a successfully trained vision-based policy on the “warehouse” task.

V5: A representative illustration of the behavior of a successfully trained vision-based policy on the “toss” task.

V6: A representative illustration of the behavior learned by a policy trained from scratch on the “toss” task (from state).

V7: A representative illustration of the behavior learned by a policy trained from scratch on the “warehouse” task (from vision).

V8: A representative illustration of the behavior of a partially trained vision-based policy on the “warehouse” task.

V9: A representative illustration of the behavior learned on the “warehouse” task when training variations only include large boxes.

V10: A representative illustration of the behavior of a successfully trained vision-based policy on the “warehouse” task when trained on a wider range of pedestal heights.

V11: Kinematic playback of a range-of-motion motion capture clip used to calibrate STAC.

V12: Examples of one-shot imitation of object interaction behaviors.