Learning Deep Parameterized Skills from Demonstration for Re-targetable Visuomotor Control

by Jonathan Chang, et al.

Robots need to learn skills that can not only generalize across similar problems but also be directed to a specific goal. Previous methods either train a new skill for every different goal or do not infer the specific target in the presence of multiple goals from visual data. We introduce an end-to-end method that represents targetable visuomotor skills as a goal-parameterized neural network policy. By training on an informative subset of available goals with the associated target parameters, we are able to learn a policy that can zero-shot generalize to previously unseen goals. We evaluate our method in a representative 2D simulation of a button-grid and on both button-pressing and peg-insertion tasks on two different physical arms. We demonstrate that our model trained on 33% of the possible goals is able to generalize to more than 90% of the targets in the scene for both simulation and robot experiments. We also successfully learn a mapping from target pixel coordinates to a robot policy to complete a specified goal.



I Introduction

Recent advances have allowed for deep learning of visuomotor skills from scratch. However, because such skills are based on strictly visual input, they are insufficiently flexible for tasks with multiple potential goals visible in the same scene. For example, an intelligent robot should be able to identify and flip the correct light switch even in the presence of other light switches in the scene; it should be able to pick up the white queen on a chessboard even when there are other pieces, or press a specific button on a remote control. To achieve this behavior, the agent must learn a targetable skill: one that can be instructed to pick up that piece, or flip that switch, by taking as additional input a parameter vector that disambiguates its goal. An important question is then how to build targetable visuomotor skills in a sample-efficient manner.

Fig. 1: An overview of our pipeline. A deep neural network is trained with example goal parameters and raw sensory data, and is evaluated on settings with unseen goal parameters. This figure shows the use of row/column indices on the button panel as the goal parameterization.

While effective at learning an end-to-end visuomotor policy with a single goal from raw pixel data, current methodologies cannot be instructed either to a specific goal or towards a target that was not seen in training. Some approaches to solve this have involved either training a unique policy for every possible goal, or creating a learning algorithm that could quickly converge to the desired policy given a few training examples [9, 11, 12]. However, the amount of training data needed becomes prohibitive for related tasks such as pressing an elevator button, where one would need to train a separate skill for each button. Additionally, even though few-shot learning and meta-learning algorithms have drastically reduced the amount of training data needed, these algorithms require the robotic system to have a separate policy to target a specific goal. Alternatively, da Silva et al. [4] introduced parameterized skills that allowed robot agents to target specific goals with a single policy; however, these methods were not end-to-end and used a separate module to infer the goal from raw pixel data.

To address these shortcomings, we present a method to train targetable visuomotor skills from expert demonstrations. Our model consists of three modules: vision, auxiliary task, and control. The vision module takes as input an RGB and depth image and outputs a visual encoding. The auxiliary task module takes the visual encoding and the goal-parameter and tries to predict the final pose of the end-effector. Finally, the control module infers the next time step’s linear velocity from the visual encoding, goal-parameter, and the predicted final state of the end-effector. By training on an informative subset of goal-parameters, we are able to train one policy that can execute its learned visuomotor skill to previously-unseen goals.

We evaluate our method via a series of experiments in simulation and on two different robotic arms. In simulation, we use a two-dimensional grid representing a button panel to study our method’s ability to generalize to novel goals. We show that conditioning the policy on the goal allows the agent to generalize better in this multi-goal setting. Moreover, we demonstrate that conditioning on an informative representation of the goal further improves performance. Finally, we extend our algorithm such that it can generalize to novel orientations of the goals in the scene. We then evaluate our method on two real-world robotic tasks that require significant precision in a continuous space. The KUKA LBR iiwa-7 and the MELFA RV-4FL arms were trained to solve button-pushing and peg-insertion tasks, respectively, on grid-based goals. In all of our experiments, our method was able to target specific goals and generalize to more than 90% of goals after being trained on a third of the goals within the grid. Additionally, we demonstrate a goal representation that allows our model to generalize to non-grid-based goals with no additional training data. Our open-source code is available on GitHub.


Fig. 2: Architecture for the goal-parameterized deep imitation learning network. The vertical arrows indicate concatenation of the layer outputs and the other vectors.

II Related Work

We present a method that combines parameterized skill learning [4, 5] and end-to-end deep visuomotor skill learning from demonstration [18, 26]. Parameterized skill learning as described by da Silva et al. [4] aims to learn a mapping from a given task parameter vector to a policy [4, 5, 6, 9, 16, 17, 21, 22, 24]. All of these methods show zero-shot generalization properties to unseen goal parameters and goals in settings with hand-designed compact state descriptors. Unlike prior work, we focus on directly learning from raw pixels while preserving the agent’s ability to generalize from a drastically reduced number of demonstrations to similar tasks with different goals by sharing data across tasks. Other approaches achieve similar generalization through coupled dictionary learning [15], gating functions [20], and modular sub-networks [7]. Different from our model, these methodologies either did not learn an end-to-end policy from raw sensory data or did not demonstrate their parameterization capabilities outside of the simulation domain. Our work aims to learn a single policy for a parameterized skill such that it is sample efficient, is end-to-end, and can be instructed to specific goals.

There has also been work in imitation learning that takes raw sensory data from human demonstrations and learns robot motor controls [2, 3, 10, 18, 19, 23, 25, 26]. Our model specifically takes inspiration from the behavioral cloning algorithm proposed by Zhang et al. [26] which learns to map input RGB/depth sensor data to subsequent linear and angular velocities for the end-effector. While these methods could generalize to unseen variations of the task when there was one possible goal in the visual frame, we found that they failed to generalize in the presence of multiple targets. For example, the behavioral cloning algorithm presented in Zhang et al. [26] would require 30 minutes worth of demonstrations for each button resulting in 270 minutes worth of demonstrations for nine buttons.

III Learning Deep Parameterized Skills

This section presents a behavioral cloning algorithm that learns parameterized neural network policies. Given a set of skills, each specific instance of a skill is described by a vector of goal parameters, and the dataset consists of sets of state observations, corresponding goal parameters, and controls collected for all skill instances. Our neural network policy, defined by its learnable weights, learns control policies that depend on the goal parameters at every time step. Our neural network architecture closely follows that of Zhang et al. [26] with the following modifications.

III-A Task Parameterization

As proposed by da Silva et al. [4], a parameterized skill is a multi-task policy that maps input task parameters to end-effector controls. We learn the mapping from the set of task parameters to the multi-task policy parameters for the family of skills. The vector of task parameters for a given skill is injected into our policy as an additional input. We chose the task parameters to be the parameters that describe the goal for our tasks.

III-B Neural Network Policies

The neural network takes raw sensory data and a goal parameter as input and outputs robot motions. More formally, at each time step the inputs include (1) the RGB image, (2) the depth image, and (3) the positions of the end-effector for the 5 most recent steps. Furthermore, the network also takes as input (4) the task parameters.

Given these inputs, the neural network outputs the current control, expressed as the linear velocity of the end-effector.
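The four inputs listed above could be bundled as in the following minimal sketch. The names `PolicyInput` and `PositionHistory`, the untyped image fields, and the rule of padding the history with the first observed position are illustrative assumptions, not details given in the text.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Position = Tuple[float, float, float]


@dataclass
class PolicyInput:
    rgb: object                 # (1) RGB image for the current time step
    depth: object               # (2) depth image for the current time step
    positions: List[Position]   # (3) 5 most recent end-effector positions
    goal: Sequence[float]       # (4) task (goal) parameters


class PositionHistory:
    """Keeps a fixed-length window of end-effector positions."""

    def __init__(self, length: int = 5):
        self.buffer = deque(maxlen=length)

    def push(self, position: Position) -> List[Position]:
        if not self.buffer:
            # Assumed padding rule: repeat the first position until full.
            self.buffer.extend([position] * self.buffer.maxlen)
        else:
            # deque(maxlen=...) drops the oldest entry automatically.
            self.buffer.append(position)
        return list(self.buffer)
```

A fixed-length deque keeps the input size constant from the first time step onward, which is convenient when the history is concatenated with the other vectors.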

The neural network architecture can be decomposed into three modules. The first module consists of convolutional layers with a spatial-softmax layer [10, 18] to extract spatial feature points from the image to generate the current state encoding (Eq. 1). Every convolutional layer is followed by a layer of rectified linear units.
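The spatial-softmax operation can be illustrated with a minimal pure-Python sketch for a single feature channel. The function name `spatial_softmax` and the choice of normalizing image coordinates to [-1, 1] are assumptions made for illustration; the text does not specify them.

```python
import math


def spatial_softmax(feature_map):
    """Return the expected (x, y) location of one feature channel.

    feature_map: list of rows (height x width) of raw activations.
    Coordinates are normalized to [-1, 1] (an assumed convention).
    """
    height, width = len(feature_map), len(feature_map[0])
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(max(row) for row in feature_map)
    weights = [[math.exp(v - peak) for v in row] for row in feature_map]
    total = sum(sum(row) for row in weights)

    def coord(i, n):
        # Map index i in [0, n - 1] to [-1, 1]; a single pixel maps to 0.
        return 0.0 if n == 1 else 2.0 * i / (n - 1) - 1.0

    x = sum(w * coord(j, width)
            for row in weights for j, w in enumerate(row)) / total
    y = sum(w * coord(i, height)
            for i, row in enumerate(weights) for w in row) / total
    return x, y
```

Each channel is thus reduced to one feature point, the softmax-weighted average of pixel coordinates, so a sharply peaked activation yields a point near the peak.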


This is followed by a small fully connected network that takes as input the state encoding and the goal parameter to predict the auxiliary task (Eq. 2).

Finally, given the extracted state encoding, end-effector positions, goal parameters, and auxiliary prediction, we use a fully-connected network to predict the current time step’s controls (Eq. 3).


For all of the fully-connected layers, we add a layer of rectified linear units.
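As a compact summary of the three modules, the data flow can be written as below. The symbols are assumptions chosen for illustration and may differ from the authors' notation: $I_t$ and $D_t$ denote the RGB and depth images, $p_{t-4:t}$ the five most recent end-effector positions, $\tau$ the goal parameter, $s_t$ the state encoding, $\hat{a}_t$ the auxiliary prediction, and $u_t$ the output linear velocity.

```latex
\begin{align}
s_t       &= f_{\text{vision}}(I_t, D_t) \\
\hat{a}_t &= f_{\text{aux}}(s_t, \tau) \\
u_t       &= f_{\text{control}}(s_t,\; p_{t-4:t},\; \tau,\; \hat{a}_t)
\end{align}
```

Note that $\tau$ enters both the auxiliary module and the control module, which matches the concatenations described in the text.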

III-C Auxiliary Prediction Tasks

Our network includes an auxiliary prediction task as another means of self-supervision, resembling the approaches shown by Zhang et al. [26]. Similar to their findings, we found that leveraging the extra self-supervisory signals resulted in increased data efficiency. As mentioned in [26], there could be multiple auxiliary tasks, but we found that for our tested tasks, one auxiliary task module was sufficient. We added a module of two fully-connected layers after the spatial-softmax layer (Eq. 2) and fed the final layer of these modules back into the control module (Eq. 3) as shown in Figure 2.

The auxiliary tasks that we chose were limited to using the information that could be inferred from the dataset. For our experiments, we had an auxiliary module for predicting the final end-effector pose for the task. We also found that generalization across novel goal parameters improved when the goal parameters were fed into this auxiliary prediction. All of these auxiliary prediction tasks were trained concurrently with the rest of the network.

III-D Loss Functions

The loss function used for learning is a modification of a commonly used behavioral cloning loss also used in Zhang et al. [26]. Our algorithm uses ℓ1 and ℓ2 losses to fit the training trajectories using visual and proprioceptive input. Furthermore, we add an arc-cosine loss to enforce consistency between the directions of the output and target velocities. Finally, for our auxiliary prediction task, we use an ℓ2 loss. The loss function for the whole algorithm is a weighted sum of these losses.

The loss weights were tuned for each experiment, with separate settings for the button-pressing and peg-insertion tasks. Policies were optimized using NovoGrad [13] with a learning rate of 0.0005 and batch size of 64. Training was done with randomly sampled batches from the dataset.
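The combined objective can be sketched for a single training example as follows. This is an illustration under stated assumptions: the ℓ2 terms are taken to be squared differences, the function names and the `weights` dictionary keys are hypothetical, and the weight values must be supplied by the caller.

```python
import math


def l1_loss(pred, target):
    """Sum of absolute differences between prediction and target."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def l2_loss(pred, target):
    """Sum of squared differences (squared l2 norm; an assumed form)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def arccos_loss(pred, target, eps=1e-8):
    """Angle in radians between predicted and target velocity directions."""
    dot = sum(p * t for p, t in zip(pred, target))
    norms = (math.sqrt(sum(p * p for p in pred))
             * math.sqrt(sum(t * t for t in target)))
    # Clamp to the valid domain of acos to guard against rounding error.
    cosine = max(-1.0, min(1.0, dot / (norms + eps)))
    return math.acos(cosine)


def total_loss(pred_u, target_u, pred_aux, target_aux, weights):
    """Weighted sum of the control losses and the auxiliary loss."""
    return (weights["l1"] * l1_loss(pred_u, target_u)
            + weights["l2"] * l2_loss(pred_u, target_u)
            + weights["arccos"] * arccos_loss(pred_u, target_u)
            + weights["aux"] * l2_loss(pred_aux, target_aux))
```

The arc-cosine term depends only on direction, so it penalizes a velocity pointing the wrong way even when its magnitude error is small.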

IV Evaluation

This section evaluates our method’s ability to learn targetable visuomotor skills in both simulated and physical domains.

Fig. 3: The first and second row show the RGB and depth image inputs respectively for the (a) 2D, (b) button pressing, and (c) peg insertion experiments.

IV-A 2D Button Simulation

We experimented on a two-dimensional representation of the robot button-pressing task. We used a 3x3 grid of blue squares representing buttons and a black circle representing the agent, shown in Figure 3a. We designed the simulation such that the agent would occlude the squares when it overlapped them.

We collected 100 trajectories for each square, where the agent began at a random position along the right and top edges of the scene. As shown in Figure 3a, the depth image input was a black screen. For this experiment, we used the row/column index and later the pixel coordinates of the square as the goal parameter. For example, the top-left square, the bottom-left square, and so on were each assigned their own row/column index and their own pixel coordinates.

We trained our network on random subsets of the goal parameters corresponding to the nine squares to test how well our model generalized. For example, one length-four subset consists of the goal parameters corresponding to the four corner squares in our grid. For every random subset, we trained our network for 50 epochs and evaluated the performance of our model on 100 trials for each button. The trajectories for the trials were generated by moving the agent in one of eight cardinal directions towards the goal at each time step. A trial was counted as a success only if the agent slowed to a complete stop at the correct blue square.
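A demonstrator that moves in one of eight directions toward the goal can be sketched as below. The unit step size, the integer pixel grid, and the function names are assumptions for illustration only.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def expert_step(agent, goal):
    """One move toward the goal: -1, 0, or +1 along each axis.

    The nine possible outputs are the eight compass directions plus
    (0, 0), which is returned once the agent sits on the goal.
    """
    return sign(goal[0] - agent[0]), sign(goal[1] - agent[1])


def rollout(start, goal, max_steps=1000):
    """Follow expert_step until the agent stops on the goal."""
    path = [start]
    position = start
    for _ in range(max_steps):
        step = expert_step(position, goal)
        if step == (0, 0):  # the agent has come to a complete stop
            break
        position = (position[0] + step[0], position[1] + step[1])
        path.append(position)
    return path
```

For instance, `rollout((5, 0), (2, 2))` moves diagonally while both coordinates differ and then straight along the remaining axis.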

Fig. 4: Results from the 2D simulation domain. Lines are median across random subsets, and shading shows the range across subsets.
Fig. 5: Various button orientations the model generalized to after being trained on the original 3x3 grid using goal pixel location as the goal parameter. Panels: (a) Column, (b) Scramble, (c) Shift, (d) Clump.

We conducted an ablation study to evaluate the effect of our model’s goal parameterization. Our ablation of the goal parameter was equivalent to the architecture shown in Figure 2 without the goal-conditioning module. In addition, we evaluated a version of our algorithm that uses an unstructured representation for the goal parameter to study how structure in the choice of representation affects our performance. Specifically, we used a one-hot vector where a goal corresponded to a randomly-chosen index of a nine-dimensional vector. Finally, we tested another version of our model that used the specific button’s pixel location within the image as the goal parameter. For our ablation, one-hot, and pixel-based versions, we repeated the above experiment with the specified model changes. The results are displayed in Figure 4.
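The three goal representations compared here can be sketched as follows. The zero-based row-major indexing, the 90-pixel square image, and the placement of buttons at cell centers are assumptions made for illustration; the exact values used in the experiments are not reproduced.

```python
import random

GRID = 3  # buttons per side of the panel


def row_col(index):
    """Structured goal: (row, column) of a button, row-major, zero-based."""
    return divmod(index, GRID)


def pixel_center(index, image_size=90):
    """Structured goal: (x, y) pixel location of a button's center.

    Assumes buttons sit at the centers of equal cells in a square image.
    """
    row, col = divmod(index, GRID)
    cell = image_size / GRID
    return ((col + 0.5) * cell, (row + 0.5) * cell)


def one_hot_codes(seed=0):
    """Unstructured goal: each button gets a randomly chosen slot."""
    slots = list(range(GRID * GRID))
    random.Random(seed).shuffle(slots)
    return [[1.0 if k == slot else 0.0 for k in range(GRID * GRID)]
            for slot in slots]
```

Neighboring buttons have nearby row/column and pixel values, whereas the one-hot codes carry no such geometric relationship between goals.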

Our experiments show that injecting the goal parameter defined by either the relative row/column indices or the pixel coordinates expands the functionality of the original behavioral cloning algorithm. As Figure 4 shows, the ablated algorithm achieves only a low success rate when trained on a single square, and its performance generally degrades with more squares in the training set. We also see that the one-hot vector representation of the goal parameter only allows the algorithm to successfully reach squares seen in training. The addition of our structured goal parameter, either as the relative row/column index or as the pixel location, allowed the network to learn the mapping from the goal parameter to the location of a square, and thus enabled it not only to select squares it had seen during training, but also to generalize to novel squares based on novel inputs.

Additionally, we found that given our choice of goal parameter, we were able to consistently achieve perfect performance after having trained on roughly 7/9 of the possible targets, regardless of the combination of goal parameters in the training set. We also found that given optimal selection of goal parameters in the training set, we were able to achieve full generalization for the whole button panel with only a third of the goal parameters represented in the training data. Our experiments show that our methodology is able to learn the entire space of goals after seeing roughly a third of the possible goals, provided that the training set represents an informative subset of the entire goal-parameter space.
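To make the notion of training subsets concrete, the sketch below enumerates all size-k subsets of the nine row/column goals. The coverage check is only one hypothetical heuristic for what might make a subset informative (every row index and every column index appears at least once); it is not a criterion stated in the text.

```python
from itertools import combinations

# The nine goals of a 3x3 grid, as zero-based (row, column) pairs.
GOALS = [(r, c) for r in range(3) for c in range(3)]


def training_subsets(k):
    """All subsets of the nine goals that contain exactly k goals."""
    return list(combinations(GOALS, k))


def covers_all_rows_and_cols(subset):
    """Hypothetical heuristic: each row and column index is represented."""
    rows = {r for r, _ in subset}
    cols = {c for _, c in subset}
    return len(rows) == 3 and len(cols) == 3


three_goal_subsets = training_subsets(3)
covering = [s for s in three_goal_subsets if covers_all_rows_and_cols(s)]
```

Under this heuristic only a small share of the three-goal subsets qualify, which illustrates why the choice of subset can matter as much as its size.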

Finally, we found that using a button’s pixel location as the goal parameter allowed the model to generalize well. As shown in Figure 4, this enabled us to achieve perfect generalization to the grid with only two goals in the training set. We also found that parameterizing by the pixel location allowed the model to generalize to arbitrary locations in the scene even when the visual input was drastically changed. That is, we were able to transfer our learned skill to target buttons in a variety of previously-unseen orientations as shown in Figure 5. We demonstrate this behavior further in our supplementary video.

IV-B Robot Button-Pressing Task

Fig. 6: (a) Button pressing: the KUKA LBR iiwa-7 with the button panel. (b) Peg insertion: the MELFA RV-4FL robot with the peg and the hole grid.

In this experiment, we show that our method can work robustly on a real-world robotic task. We used a KUKA LBR iiwa-7 equipped with a Schunk gripper to press buttons on a 3D, 3x3 button panel. Similar to Section IV-A, we parameterized our button-grid with a row/column tuple of the button’s location on the grid. For training, we collected 100 trials of the robot’s end-effector beginning at a random position and following a straight line with noise to the specified button. The end-effector’s final position was varied with noise drawn from a Gaussian distribution such that the robot would press the button differently each time.
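A demonstration of this kind, a straight line with noise whose end point is perturbed by Gaussian noise, could be generated roughly as follows. The step count, the two standard deviations, and the function name `noisy_line` are hypothetical placeholders; the text does not report these values.

```python
import random


def noisy_line(start, goal, steps=50, path_sigma=0.002,
               goal_sigma=0.005, seed=None):
    """Straight-line path from start to a Gaussian-perturbed goal.

    All numeric defaults are illustrative placeholders.
    """
    rng = random.Random(seed)
    # Perturb the final position so each press lands slightly differently.
    target = [g + rng.gauss(0.0, goal_sigma) for g in goal]
    path = []
    for k in range(steps + 1):
        alpha = k / steps
        point = [(1.0 - alpha) * s + alpha * t
                 for s, t in zip(start, target)]
        if 0 < k < steps:
            # Jitter interior points only; the end points stay exact.
            point = [p + rng.gauss(0.0, path_sigma) for p in point]
        path.append(point)
    return path
```

Passing a fixed `seed` makes a trajectory reproducible, while leaving it unset yields a different demonstration on every call.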

For this experiment, we used specific subsets of buttons within a section of the grid as training data. We chose combinations that had been found to generalize well in our two-dimensional simulation. We found that we achieved good performance within 100 epochs of training. We evaluated for three attempts on each button and deemed an attempt successful if the robot pressed the button. Results are displayed in Figure 7.

The performance of the robot on the task was similar to the average performance with the row/column parameterization in our 2D simulation. The robot always successfully pushed buttons seen during training. After having been trained on just three buttons, the robot successfully pressed all but two of the previously-unseen buttons. Interestingly, the average performance of the robot stayed the same when trained on three to five buttons because the robot failed to press exactly two unseen buttons in each of these cases. However, we qualitatively observed that the robot did get progressively closer to succeeding when trained on more buttons, but was still not close enough to successfully press. When trained on six buttons or more, the robot achieved a 100% success rate on all buttons in the grid.

Fig. 7: Results from our experiments on the MELFA and KUKA robots.
Fig. 8: Various goal orientations the model generalized to after being trained on the original 3x3 button grid using goal pixel location as the goal parameter. Panels: (a) Column, (b) Scramble, (c) Shift, (d) Clump.

We also evaluated our model with the goal parameter set to the pixel coordinates of the goal in the input RGB image. We found that our model exhibited similar generalization properties with the added functionality of being able to handle arbitrary locations of the button board. As shown in Figure 8, we were able to generalize to various different orientations of goals as well as to goals that were visually different from our original button board. This suggests that our model is also able to generalize to any button board that fits within its scene, successfully learning a mapping from raw pixel coordinates to policy parameters. We demonstrate this behavior further in our supplementary video.

Fig. 9: Plots of the trajectories that were (a) used in training and (b) seen in testing for the peg-insertion task by the MELFA RV-4FL. The 3x3 grid on the xy plane represents the various goals for this task.

IV-C Robot Peg-Insertion Task

We further evaluated our method on a real-world robotic task that required significant precision. As pictured in Figure 6(b), we used a MELFA RV-4FL robotic arm to perform peg-insertions in a 3x3 grid of holes. We performed an experiment similar to the one in Section IV-B, with a different robot, task setting, and choice of goal subsets. The collected trajectories are straight lines from random start positions, uniformly sampled over the area above the hole grid, to a waypoint at a random height directly above the hole, followed by a straight path downward into the hole to complete the insertion. Figure 9(a) shows an example three-goal subset that we trained on. During data collection, 60 such trajectories were collected for each hole, for a total of 540 insertions. Results of successful insertions for different training combinations of holes are displayed in Figure 7.

Our method performed remarkably well on this task. The robot generalized to all nine holes with a 100% success rate after having seen insertions performed on only three holes during training. All executions were performed in stiffness control mode; however, the control outputs were precise enough that compliant motion was not necessary. We found that the output control trajectories to new goals were very smooth, as shown in Figure 9(b).

           2D    Button-Pressing    Peg-Insertion
Indices    3     6                  3
Pixel      2     4                  -
TABLE I: Lowest number of goal parameters, either row/column indices or pixel coordinates, at which we observed perfect 100% generalization to the 3x3 grid of goals in our experiments.

The differences between our performance on the peg-insertion and button-pressing tasks can likely be explained by two factors: the noise in the training trajectories and the subsets of the goals that the robots were trained on. As shown in Figure 9(a), the training trajectories for our peg-insertion experiments had no noise on their final positions because the insertion tolerance was too small to allow much noise. However, the training trajectories for our button-pressing experiment had significant noise on the end-effector’s final position. This could have lowered the precision of the model, leading to near-misses for buttons that were not seen during training. In addition, the two tasks did not use the same subsets of goals for training in every case. It is possible that some of the subsets used for the peg-insertion task were more informative than those used for the button-pressing task.

V Conclusion and Future Work

We introduce a method that uses a neural network to learn deep parameterized visuomotor skills. We show that our method is able to learn to perform new instances of a task that were not seen at training time and demonstrate this via experiments on different tasks in a 2D simulation and on two different physical robots. We empirically study our method’s generalization and dependence on user-specified goal parameters and show that it is able to generalize to all possible instances of various tasks after having seen at most six out of nine instances at training, provided a diverse subset of goals. This is summarized in Table I. We also show that depending on the choice of goal parameters, our model is able to generalize to different orientations of multiple goals in the scene with no additional training data.

We hope to investigate various architectural changes to improve our method in the future. Of particular promise is the use of Reinforcement Learning (RL) to precisely learn stopping conditions and intermediate fine-grained movements. Additionally, we hope to extend our idea of goal parameterization to other frameworks for learning from demonstration such as inverse reinforcement learning [1] or generative adversarial imitation learning [14] (GAIL). Recent work [8] has successfully extended goal-parameterization to GAIL and shown encouraging results in simulation. We hope such methods will enable us to represent more complex parameterized skills.

Finally, we hope to investigate different ways of parameterizing the task itself. For instance, we might use natural language commands or even multi-modal encodings as our goal parameter. Studying such different parameterizations could help us formulate additional conditions or guidelines for selecting ‘good’ representations and values of goal parameters to train on.