Reinforcement learning algorithms have proven effective for learning complex skills in simulation environments , , and . However, practical robotic reinforcement learning for complex motion skills remains a challenging and unsolved problem, due to the large number of samples needed to train most algorithms and the expense of obtaining those samples from real robots. Most existing approaches to robotic reinforcement learning either fail to generalize between different tasks and among variations of single tasks, or only generalize at the cost of collecting impractical amounts of real robot experience. With recent advancements in robotic simulation, and the widespread availability of large computational resources, a popular family of methods seeking to address this challenge has emerged, known as “sim-to-real” methods. These methods seek to offload most training time from real robots to offline simulations, which are trivially parallelizable and much cheaper to operate. Our method combines this “sim-to-real” schema with representation learning and model-predictive control (MPC) to make transfer more robust, and to significantly decrease the number of simulation samples needed to train policies which achieve families of related tasks.
The key insight behind our method is that the simulation used in the pre-training step of a simulation-to-real method can also be used online as a tool for foresight. It allows us to predict the behavior of a known policy on an unseen task. When combined with a latent-conditioned policy, where the latent actuates variations of useful policy behavior (e.g. skills), this simulation-as-foresight tool allows our method to use what the robot has already learned to do (e.g. the pre-trained policy) to bootstrap online policies for tasks it has never seen before. That is, given a latent space of useful behaviors, and a simulation which predicts the rewards for those behaviors on a new task, we can reduce the adaptation problem to intelligently choosing a sequence of latent skills which maximize rewards for the new task.
Most simulation-to-real approaches so far have focused on addressing the “reality gap” problem. The reality gap problem is the domain shift performance loss induced by differences in dynamics and perception between the simulation (policy training) and real (policy execution) environments. Training a policy only in a flawed simulation generally yields control behavior which is not adaptable to even small variations in the environment dynamics. Furthermore, simulating the physics behind many practical robotic problems (e.g. sliding friction and contact forces) is an open problem in applied mathematics, meaning it is not possible to construct a completely accurate simulation for many important robotic tasks . Rather than attempt to create an explicit alignment between simulation and reality , or randomize our simulation training to a sufficient degree to learn a policy which generalizes to nearby dynamics , our method seeks to learn a sufficient policy in simulation, and adapt it quickly to the real world online during real robot execution.
Our proposed approach is based on four key components: reinforcement learning with policy gradients (RL) , variational inference , model-predictive control (MPC), and physics simulation. We use variational inference to learn a low-dimensional latent space of skills which are useful for tasks, and RL to simultaneously learn a single policy which is conditioned on these latent skills. The precise long-horizon behavior of the policy for a given latent skill is difficult to predict, so we use MPC and an online simulation to evaluate latent skill plans in simulation before executing them on the real robot.
III Related Work
Learning skill representations to aid in generalization has been proposed in works old and new. Previous works proposed frameworks such as Associative Skill Memories  and probabilistic movement primitives  to acquire a set of reusable skills. Our approach is built upon , which learns an embedding space of skills with reinforcement learning and variational inference, and , which shows that these learned skills are transferable and composable on real robots. While  noted that predicting the behavior of latent skills is an obstacle to using this method, our approach addresses the problem by using model-predictive control to successfully complete unseen tasks with no fine-tuning on the real robot. Exploration is a key problem in robot learning, and our method uses latent skill representations to address this problem. Using learned latent spaces to make exploration more tractable is also studied in  and . Our method exploits a latent space for task-oriented exploration: it uses model-predictive control and simulation to choose latent skills which are locally-optimal for completing unseen tasks, then executes those latent skills on the real robot.
Using reinforcement learning with model-predictive control has been explored previously. Kamthe et al.  proposed using MPC to increase the data efficiency of reinforcement learning algorithms by training probabilistic transition models for planning. In our work, we take a different approach by exploiting our learned latent space and simulation directly to find policies for novel tasks online, rather than learning and then solving a model.
Simulation-to-real transfer learning approaches include randomizing the dynamic parameters of the simulation , and varying the visual appearance of the environment , both of which scale the amount of computation needed to learn a transfer policy linearly or quadratically. Other strategies, such as that of Barrett et al. , reuse models trained in simulation to make sim-to-real transfer more efficient, similar to our method; however, this work requires an explicit pre-defined mapping between seen and unseen tasks. Sæmundsson et al.  use meta-learning and learned representations to generalize from pre-trained seen tasks to unseen tasks; however, their approach requires that the unseen tasks be very similar to the pre-trained tasks, and is few-shot rather than zero-shot. Our method is zero-shot with respect to real environment samples, and can be used to learn unseen tasks which are significantly out-of-distribution, as well as for composing learned skills in the time domain to achieve unseen tasks which are more complex than the underlying pre-trained task set.
The authors of  learn an implicit skill representation by clustering trajectories of states and rewards in a latent space. In contrast, we focus on MPC-based planning in the latent space to achieve unseen robotic tasks online with a real robot, while their analysis focuses on the machine learning behind this family of methods and uses simulation experiments.
IV-A Skill Embedding Algorithm
In our multi-task RL setting, we pre-define a set of low-level skills, each identified by an ID and accompanied by a per-skill reward function.
In parallel with learning the joint low-level skill policy as in conventional RL, we learn an embedding function which parameterizes the low-level skill library using a latent variable. Note that the true skill identity is hidden from the policy behind the embedding function. Rather than reveal the skill ID to the policy, once per rollout we feed the skill ID, encoded as a one-hot vector, through the stochastic embedding function to produce a latent vector. We feed this same latent vector to the policy for the entire rollout, so that all steps in a trajectory are correlated with the same value of the latent.
To aid in learning the embedding function, we learn an inference function which, given a state-only trajectory window of fixed length, predicts the latent vector which was fed to the low-level skill policy when it produced that trajectory. This allows us to define an augmented reward which encourages the policy to produce distinct trajectories for different latent vectors. We learn the inference function in parallel with the policy and embedding functions, as shown in Eq. IV-A.
We add a policy entropy bonus, which ensures that the policy does not collapse to a single solution for each skill. For a detailed derivation, refer to .
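As a concrete illustration, the augmented reward can be sketched as below. The weighting coefficients and the diagonal-Gaussian form of the inference distribution are illustrative assumptions for this sketch, not the exact formulation of Eq. IV-A:

```python
import numpy as np

def gaussian_log_prob(z, mean, log_std):
    """Log-density of a diagonal Gaussian at z."""
    var = np.exp(2.0 * log_std)
    return -0.5 * np.sum((z - mean) ** 2 / var + 2.0 * log_std + np.log(2.0 * np.pi))

def augmented_reward(task_reward, z, q_mean, q_log_std, policy_entropy,
                     alpha_1=1.0, alpha_2=0.1, alpha_3=0.01):
    """Sketch of the reward used to train the latent-conditioned policy.

    task_reward       : environment reward for the active skill
    z                 : latent vector fed to the policy this rollout
    q_mean, q_log_std : inference network's prediction of z from a state window
    policy_entropy    : entropy bonus term for the policy
    The alpha_i weights are placeholders, not our trained values.
    """
    # The inference term rewards trajectories from which z is identifiable.
    identifiability = gaussian_log_prob(z, q_mean, q_log_std)
    return alpha_1 * task_reward + alpha_2 * identifiability + alpha_3 * policy_entropy
```

A trajectory whose inference prediction matches the true latent receives a higher augmented reward than one whose prediction is far off, which is what drives the policy to produce distinct trajectories per latent.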
IV-B Skill Embedding Criterion
In order for the learned latent space to be useful for completing unseen tasks, we seek to constrain the embedding distribution to satisfy two important properties:
High entropy: Each task should induce a distribution over latent vectors which is as wide as possible, corresponding to many variations of a single skill.
Identifiability: Given an arbitrary trajectory window, the inference network should be able to predict with high confidence the latent vector fed to the policy to produce that trajectory.
When applied together, these properties ensure that during training the policy is trained to encode high-reward controllers for many parameterizations of a skill (high entropy), while simultaneously ensuring that each of these latent parameterizations corresponds to a distinct variation of that skill. This dual constraint is key to using model-predictive control or other composition methods in the latent space, as discussed in Sec. IV-C.
We train the policy and embedding networks using Proximal Policy Optimization , though our method may be used with any parametric reinforcement learning algorithm. We use the MuJoCo physics engine  to implement our Sawyer robot simulation environments. We represent the policy, embedding, and inference functions using multivariate Gaussian distributions whose mean and diagonal covariance are parameterized by the output of a multi-layer perceptron. The policy and embedding distributions are jointly optimized by the reinforcement learning algorithm, while we train the inference distribution using supervised learning with a simple cross-entropy loss.
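A minimal sketch of such a Gaussian-parameterizing perceptron follows; the network sizes, weight initialization, and function names are arbitrary placeholders for illustration, not our trained architecture:

```python
import numpy as np

def mlp_gaussian(x, W1, b1, W_mean, b_mean, W_logstd, b_logstd):
    """One-hidden-layer perceptron whose outputs parameterize the mean and
    diagonal log-std of a multivariate Gaussian. The policy, embedding, and
    inference functions all share this general form."""
    h = np.tanh(x @ W1 + b1)
    return h @ W_mean + b_mean, h @ W_logstd + b_logstd

def sample_gaussian(mean, log_std, rng):
    """Draw one reparameterized sample from the diagonal Gaussian."""
    return mean + np.exp(log_std) * rng.standard_normal(mean.shape)

# Example: embed a one-hot skill ID into a 2-D latent vector.
rng = np.random.default_rng(0)
n_skills, hidden, latent_dim = 4, 8, 2
W1 = 0.1 * rng.standard_normal((n_skills, hidden))
W_mean = 0.1 * rng.standard_normal((hidden, latent_dim))
W_logstd = 0.1 * rng.standard_normal((hidden, latent_dim))
one_hot = np.eye(n_skills)[2]  # skill ID 2, encoded as a one-hot vector
mean, log_std = mlp_gaussian(one_hot, W1, np.zeros(hidden),
                             W_mean, np.zeros(latent_dim),
                             W_logstd, np.zeros(latent_dim))
z = sample_gaussian(mean, log_std, rng)  # held fixed for the whole rollout
```

Because the embedding is stochastic, repeated rollouts of the same skill ID explore a neighborhood of latent vectors, which is what produces the wide per-skill latent distribution described above.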
IV-C Using Model Predictive Control for Zero-Shot Adaptation
To achieve unseen tasks on a real robot with no additional training, we freeze the multi-skill policy learned in Sec. IV-A, and use a new algorithm which we refer to as a “composer.” The composer achieves unseen tasks by choosing new sequences of latent skill vectors to feed to the frozen skill policy. Exploring in this smaller space is faster and more sample-efficient, because it encodes high-level properties of tasks and their relations. Each skill latent induces a different pre-learned behavior, and our method reduces the adaptation problem to choosing sequences of these pre-learned behaviors, continuously parameterized by the skill embedding, to achieve new tasks.
Note that we use the simulation itself to evaluate the future outcome of the next action. For each step, we set the state of the simulation environment to the observed state of the real environment. This equips our robot with the ability to predict the behavior of different skill latents. Since our robot is trained in a simulation-to-real framework, we can reuse the simulation from the pre-training step as a tool for foresight when adapting to unseen tasks. This allows us to select a latent skill online which is locally-optimal for a task, even if that task was not seen during training. We show that this scheme allows us to perform zero-shot task execution and composition for families of related tasks. This is in contrast to existing methods, which have mostly focused on direct alignment between simulation and reality, or on data augmentation to generalize the policy by brute force. Despite much work on simulation-to-real methods, neither of these approaches has demonstrated the adaptation ability needed for general-purpose robots in the real world. We believe our method provides a third path towards simulation-to-real adaptation that warrants exploration, as a higher-level complement to these effective-but-limited existing low-level approaches.
We are given a reward function for the new task, the real environment in which we attempt this task, and the RL discount factor. We use the simulation environment, frozen skill embedding, and latent-conditioned skill policy, all trained in Sec. IV-A, to apply model-predictive control in the latent space as follows (Algorithm 1).
We first sample candidate latents from the frozen skill embedding distribution, and observe the state of the real environment.
For each candidate latent, we set the initial state of the simulation to the observed real state. For a fixed horizon of time steps, we sample the frozen policy, conditioned on the candidate latent, and execute the actions in the simulation environment, yielding a total discounted reward for each candidate latent. We then choose the candidate latent which achieved the highest reward, and use it to condition and sample the frozen policy to control the real environment for a horizon of time steps.
We repeat this MPC process to choose and execute new latents in sequence, until the task has been achieved.
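One step of this latent-space MPC loop can be sketched as follows. Here `sample_latent`, `simulate`, and `reward_fn` are hypothetical stand-ins for the frozen embedding, the pre-training simulator, and the new task's reward function, respectively, and the toy point-mass dynamics below exist only to make the sketch self-contained:

```python
import numpy as np

def mpc_choose_latent(real_state, sample_latent, simulate, reward_fn,
                      n_candidates, horizon, gamma=0.99):
    """One MPC step in the latent space (a sketch of Algorithm 1).

    sample_latent() draws a candidate latent from the frozen embedding;
    simulate(state, z, horizon) sets the simulator to the observed real
    state, runs the frozen latent-conditioned policy for `horizon` steps,
    and returns the visited states.
    """
    best_z, best_return = None, -np.inf
    for _ in range(n_candidates):
        z = sample_latent()
        states = simulate(real_state, z, horizon)
        ret = sum(gamma ** t * reward_fn(s) for t, s in enumerate(states))
        if ret > best_return:
            best_z, best_return = z, ret
    return best_z  # condition the frozen policy on this latent in the real env

# Toy example: state is a 2-D position and a "latent" is a velocity command.
goal = np.array([1.0, 0.0])
candidates = iter([np.array([0.0, 1.0]), np.array([1.0, 0.0]),
                   np.array([-1.0, 0.0])])
z_star = mpc_choose_latent(
    real_state=np.zeros(2),
    sample_latent=lambda: next(candidates),
    simulate=lambda s, z, h: [s + 0.1 * (t + 1) * z for t in range(h)],
    reward_fn=lambda s: -np.linalg.norm(s - goal),
    n_candidates=3, horizon=10)
# z_star is the command pointing at the goal: [1.0, 0.0]
```

In the real system, the chosen latent is executed for only a few steps before the loop repeats from the newly observed real state, which is what lets locally-useful skills be chained into a sequence.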
The choice of MPC horizon has a significant effect on the performance of our approach. Since our latent variable encodes a skill which only partially completes the task, executing a single skill for too long unnecessarily penalizes a locally-useful skill for not being globally optimal. Hence, we set the MPC horizon to no more than twice the number of steps for which a latent is actuated in the real environment.
We evaluate our approach by completing two sequencing tasks on a Sawyer robot: drawing a sequence of points and pushing a box along a sequential path. For each of the experiments, the robot must complete an overall task by sequencing skills learned during the embedding learning process. Sequencing skills poses a challenge to conventional RL algorithms due to the sparsity of rewards in sequencing tasks . Because the agent only receives a reward for completing several correct complex actions in a row, exploration under these sequencing tasks is very difficult for conventional RL algorithms. By reusing the skills we have consolidated in the embedding space, we show that a high-level controller can effectively compose these skills in order to achieve such difficult sequencing tasks.
V-A Sawyer: Drawing a Sequence of Points
In this experiment, we ask the Sawyer robot to move its end-effector to a sequence of points in 3D space. We first learn a low-level policy that receives an observation with the robot’s seven joint angles as well as the Cartesian position of the robot’s gripper, and outputs incremental joint positions (up to 0.04 rad) as actions. We use the Euclidean distance between the gripper position and the current target as the cost function. We trained the policy and the embedding network on eight goal positions in simulation, forming the corners of a 3D cuboid enclosing the workspace. Then, we use model-predictive control to choose a sequence of latent vectors which allows the robot to draw an unseen shape. For both simulation and real robot experiments, we attempted two unseen tasks: drawing a rectangle in 3D space (Figs. 5 and 7) and drawing a triangle in 3D space (Figs. 6 and 8).
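The cost function for this task is simply distance-to-target; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def reach_reward(gripper_pos, target_pos):
    """Negative Euclidean distance between the gripper and the current
    target point; the reward is maximized (zero) when the end-effector
    sits exactly on the target."""
    return -np.linalg.norm(np.asarray(gripper_pos) - np.asarray(target_pos))
```

During MPC, this same function scores the simulated rollouts for each candidate latent, with the target swapped to each successive waypoint of the unseen shape.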
V-B Sawyer: Pushing the Box through a Sequence of Waypoints
In this experiment, we test our approach on a task that requires contact between the Sawyer robot and an object. We ask the robot to push a box along a sequence of points in the table plane. We choose the Euclidean distance between the position of the box and the current target position as the reward function. The policy receives a state observation with the relative position vector between the robot’s gripper and the box’s centroid, and outputs bounded incremental gripper movements as actions.
We first pre-train a policy in simulation to push the box to four goal locations relative to its starting position: up, down, left, and right. We then use model-predictive control to choose latent vectors and feed them, along with the state observation, to the frozen multi-task policy which controls the robot.
For both simulation and real robot experiments, we use the simulation as a model of the environment. In the simulation experiments, we use the model-predictive controller to push the box to three points. In the real robot experiments, we ask the Sawyer robot to complete two unseen tasks: pushing up-then-left and pushing left-then-down.
VI-A Sawyer Drawing
In the unseen drawing experiments, we sampled vectors from the skill latent distribution, and for each of them performed an MPC optimization over a fixed simulation horizon. We then executed the latent with the highest predicted reward on the target robot. In simulation experiments, the Sawyer robot successfully drew a rectangle by sequencing 54 latents (Fig. 2) and a triangle by sequencing 56 latents (Fig. 3). In the real robot experiments, the Sawyer robot successfully completed the unseen rectangle-drawing task by choosing 62 latents (Fig. 4) in 2 minutes of real time, and completed the unseen triangle-drawing task by choosing 53 latents (Fig. 5) in less than 2 minutes.
VI-B Sawyer Pusher Sequencing
In the pusher sequencing experiments, we sample vectors from the latent distribution. We use an MPC optimization with a fixed simulation horizon, and execute each chosen latent in the environment for a fixed number of steps. In simulation experiments, the robot completed the unseen up-left task in less than 30 seconds of equivalent real time and the unseen right-up-left task in less than 40 seconds of equivalent real time. In the real robot experiments, the robot successfully completed the unseen left-down task by choosing 3 latents over approximately 1 minute of real time, and the unseen up-left push task by choosing 8 latents in about 1.5 minutes of real time.
These experimental results show that our learned skills are composable to complete new tasks. In comparison with performing a search as done in , our approach is faster in wall-clock time because we perform the model prediction in simulation instead of on the real robot. Note that our approach can exploit the continuous space of latents, whereas previous search methods only use an artificial discretization of the continuous latent space. In the unseen box-pushing real robot experiment (Fig. 7, right), the Sawyer robot pushes the box towards the bottom-right of the workspace to fix an error it made earlier in the task. This intelligent reactive behavior was never explicitly trained during pre-training in simulation. This shows that, by sampling from our latent space, the model-predictive controller successfully selects skills that were not pre-defined during the training process.
In this work, we combine task representation learning, simulation-to-real training, and model-predictive control to efficiently acquire policies for unseen tasks with no additional training. Our experiments show that applying model-predictive control to these learned skill representations can be a very efficient method for online learning of tasks. The tasks we demonstrated are more complex than the underlying pre-trained skills used to achieve them, and the behaviors exhibited by our robot while executing unseen tasks were more adaptive than demanded by the simple reward functions we used. Our method provides a partial escape from the reality gap problem in simulation-to-real methods, by mixing simulation-based long-range foresight with locally-correct online behavior.
For future work, we plan to apply our model-predictive controller as an exploration strategy to learn a composer policy that uses the latent space as its action space. We look forward to efficiently learning policies on real robots with guided exploration in our latent space.
The authors would like to thank Angel Gonzalez Garcia, Jonathon Shen, and Chang Su for their work on the garage (https://github.com/rlworkgroup/garage) reinforcement learning for robotics framework, on which the software for this work was based. We also want to thank the authors of multiworld (https://github.com/vitchyr/multiworld) for providing a well-tuned Sawyer Block Pushing simulation environment. This research was supported in part by National Science Foundation grants IIS-1205249, IIS-1017134, EECS-0926052, the Office of Naval Research, the Okawa Foundation, and the Max-Planck-Society. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding organizations.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
-  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016. [Online]. Available: https://doi.org/10.1038/nature16961
-  A. Yahya, A. Li, M. Kalakrishnan, Y. Chebotar, and S. Levine, “Collective robot reinforcement learning with distributed asynchronous guided policy search,” CoRR, vol. abs/1610.00673, 2016. [Online]. Available: http://arxiv.org/abs/1610.00673
-  N. Jakobi, P. Husbands, and I. Harvey, “Noise and the reality gap: The use of simulation in evolutionary robotics,” in Advances in Artificial Life, F. Morán, A. Moreno, J. J. Merelo, and P. Chacón, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 1995, pp. 704–720.
-  A. A. Visser, N. Dijkshoorn, M. V. D. Veen, and R. Jurriaans, “Closing the gap between simulation and reality in the sensor and motion models of an autonomous AR.Drone.” [Online].
-  X. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization,” CoRR, vol. abs/1710.06537, 2017. [Online]. Available: http://arxiv.org/abs/1710.06537
-  J. Schulman, P. Abbeel, and X. Chen, “Equivalence between policy gradients and soft q-learning,” CoRR, vol. abs/1704.06440, 2017. [Online]. Available: http://arxiv.org/abs/1704.06440
-  D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” CoRR, vol. abs/1312.6114, 2013. [Online]. Available: http://arxiv.org/abs/1312.6114
-  P. Pastor, M. Kalakrishnan, L. Righetti, and S. Schaal, “Towards associative skill memories,” in Humanoids, Nov 2012.
-  E. Rueckert, J. Mundo, A. Paraschos, J. Peters, and G. Neumann, “Extracting low-dimensional control variables for movement primitives,” in ICRA, May 2015.
-  K. Hausman, J. Springenberg, Z. Wang, N. Heess, and M. Riedmiller, “Learning an embedding space for transferable robot skills,” in ICLR, 2018. [Online]. Available: https://openreview.net/forum?id=rk07ZXZRb
-  R. C. Julian, E. Heiden, Z. He, H. Zhang, S. Schaal, J. Lim, G. S. Sukhatme, and K. Hausman, “Scaling simulation-to-real transfer by learning composable robot skills,” in International Symposium on Experimental Robotics. Springer, 2018. [Online]. Available: https://ryanjulian.me/iser_2018.pdf
-  S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in ICRA. IEEE, 2017.
-  B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine, “Diversity is all you need: Learning skills without a reward function,” Feb. 2018. [Online]. Available: http://arxiv.org/abs/1802.06070
-  S. Kamthe and M. P. Deisenroth, “Data-efficient reinforcement learning with probabilistic model predictive control,” CoRR, vol. abs/1706.06491, 2017. [Online]. Available: http://arxiv.org/abs/1706.06491
-  F. Sadeghi and S. Levine, “CAD2RL: Real single-image flight without a single real image,” in RSS, 2017.
-  S. Barrett, M. E. Taylor, and P. Stone, “Transfer learning for reinforcement learning on a physical robot,” in Ninth International Conference on Autonomous Agents and Multiagent Systems - Adaptive Learning Agents Workshop (AAMAS - ALA), May 2010. [Online]. Available: http://www.cs.utexas.edu/users/ai-lab/?AAMASWS10-barrett
-  S. Sæmundsson, K. Hofmann, and M. P. Deisenroth, “Meta reinforcement learning with latent variable gaussian processes,” CoRR, vol. abs/1803.07551, 2018. [Online]. Available: http://arxiv.org/abs/1803.07551
-  J. D. Co-Reyes, Y. Liu, A. Gupta, B. Eysenbach, P. Abbeel, and S. Levine, “Self-Consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings,” Jun. 2018. [Online]. Available: http://arxiv.org/abs/1806.02813
-  J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347
-  E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in IROS, 2012.
-  M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” in NIPS, 2017.