1 Introduction
Motivation
The constructivist hypothesis proposes that humans learn to perform new behaviors by using what they already know cooper1993paradigm . It holds that, rather than learning new behaviors from scratch, humans leverage their prior experience across behaviors, generalizing and composing previously-learned behaviors into new ones drescher1991made .
Whether we can make robots learn this efficiently is an open question. Much recent work on robot learning has focused on "deep" reinforcement learning (RL), inspired by the achievements of deep RL in continuous control lillicrap2015ddpg and game-play domains mnih2015dqn . While recent attempts at deep RL for robotics are encouraging levine2016end ; chebotarhausmanzhang17icml ; gu2017deep , performance and generality on real robots remain challenging. A major obstacle to the widespread deployment of deep RL on real robots is data efficiency: most deep RL algorithms require millions of samples to converge duan2016benchmark . Learning from scratch with these algorithms on a real robot is therefore a resource-intensive endeavor, e.g. requiring multiple robots to learn in parallel for weeks levine2018robotarmy . One promising approach is to train deep RL algorithms entirely in faster-than-real-time simulation, and to transfer the learned policies to a real robot.
Problem Statement
Our contribution is a method for exploiting hierarchy while retaining the flexibility and expressiveness of end-to-end RL approaches.
Consider the illustrative example of block stacking. One approach is to learn a single monolithic policy which, given any arrangement of blocks on a table, grasps, moves, and stacks each block to form a tower. This formulation is succinct, but requires learning a single sophisticated policy. We observe that block stacking, like many other practical robotics tasks, decomposes easily into a few reusable primitive skills (e.g. locate and grasp a block, move a grasped block over the stack location, place a grasped block on top of a stack). We therefore divide the problem into two parts: learning to perform and mix the skills in general, and learning to combine these skills into particular policies which achieve high-level tasks.
Related Work
Our approach builds on the work of Hausman et al. hausman2018learning , which learns a latent space that parameterizes a set of motion skills and shows that those skills are temporally composable via interpolation between coordinates in the latent space. In addition to learning reusable skills, we present a method which learns to compose them to achieve high-level tasks, and an approach for transferring compositions of those skills from simulation to real robots. Similar latent-space methods have recently been used for better exploration abhishekmeta ; diversityisallyouneed and hierarchical RL heess2016modulate ; tuomas18latenthrl ; coreyes2018self .
Our work is related to parameter-space meta-learning methods, which seek to learn a single shared policy that generalizes easily to all skills in a set, but which do not address skill sequencing specifically. Similarly, unlike recurrent meta-learning methods, which implicitly address sequencing of a family of sub-skills to achieve goals, our method addresses generalization of single skills while providing an explicit representation of the relationship between skills. We show that this explicit representation allows us to combine our method with many algorithms for robot autonomy, such as optimal control, search-based planning, and manual programming, in addition to learning-based methods. Furthermore, our method can be used to augment most existing reinforcement learning algorithms, rather than requiring the formulation of an entirely new family of algorithms to achieve its goals.
Previous works proposed frameworks such as Associative Skill Memories pastor2012asm and probabilistic movement primitives rueckert2015movprim to acquire a set of reusable skills. Other approaches introduce particular model architectures for multi-task learning, such as Progressive Neural Networks rusu2016progressive or Attention Networks rajendran2017adaapt .
Common approaches to simulation-to-real transfer learning include randomizing the dynamic parameters of the simulation peng2017simreal and varying the visual appearance of the environment sadeghi2017cadrl . Another approach is explicit alignment: given a mapping of common features between the source and target domains, domain-invariant state representations tzeng2015adaption , or priors on the relevance of input features kroemer2016mlp , can further improve transferability. Our method can leverage these techniques to improve the stability of the transfer learning process in two ways: (1) by training transferable skills which generalize to nearby skills from the start, and (2) by intentionally learning composable parameterizations of those skills, allowing them to be easily combined before or after transfer.
2 Technical Approach
Our work synthesizes two recent methods in deep RL, pre-training in simulation and learning composable motion policies, to make deep reinforcement learning more practical for real robots. Our strategy splits the learning process into a two-level hierarchy (Fig. 1), with low-level skill policies learned in simulation, and high-level task policies learned or planned either in simulation or on the real robot, using the imperfectly-transferred low-level skills.
Skill Embedding Learning Algorithm
In our multi-task RL setting, we predefine a set of low-level skills with IDs and accompanying per-skill reward functions. In parallel with learning the joint low-level skill policy, as in conventional RL, we learn an embedding function which parameterizes the low-level skill library using a latent variable. Note that the true skill identity is hidden from the policy behind the embedding function. Rather than reveal the skill ID to the policy, once per rollout we feed the skill ID, encoded as a one-hot vector, through the stochastic embedding function to produce a latent vector. We feed this same latent vector to the policy for the entire rollout, so that all steps in a trajectory are correlated with the same latent value.
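The once-per-rollout latent sampling can be sketched as follows. The linear-Gaussian embedding, toy policy, and point-mass dynamics here are illustrative stand-ins for the learned networks, not the implementation used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_SKILLS, LATENT_DIM, STATE_DIM = 4, 2, 2

# Linear-Gaussian stand-ins for the learned embedding network's parameters.
W_mu = rng.normal(size=(LATENT_DIM, NUM_SKILLS))
W_logstd = rng.normal(scale=0.1, size=(LATENT_DIM, NUM_SKILLS))

def embed(skill_onehot):
    """Stochastic embedding: map a one-hot skill ID to a Gaussian over
    latent vectors and sample a latent z from it."""
    mu = W_mu @ skill_onehot
    std = np.exp(W_logstd @ skill_onehot)
    return rng.normal(mu, std)

def policy(state, z):
    """Toy latent-conditioned policy: the action depends on the state and z,
    but never on the raw skill ID."""
    return np.tanh(state.sum() + z.sum()) * np.ones(STATE_DIM)

def rollout(skill_id, horizon=10):
    """Sample z once per rollout, then feed the SAME z at every step so all
    steps of the trajectory are correlated with one latent value."""
    z = embed(np.eye(NUM_SKILLS)[skill_id])  # drawn once, held for the episode
    state = np.zeros(STATE_DIM)
    trajectory = []
    for _ in range(horizon):
        action = policy(state, z)
        state = state + 0.1 * action  # trivial point-mass dynamics
        trajectory.append((state.copy(), z))
    return trajectory
```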
To aid in learning the embedding function, we learn an inference function which, given a window of states from a trajectory, predicts the latent vector which was fed to the low-level skill policy when it produced that trajectory. This allows us to define an augmented reward which encourages the policy to produce distinct trajectories for different latent vectors. We learn the inference function in parallel with the policy and embedding functions, as shown in Eq. 2.
We also add a policy entropy bonus, which ensures that the policy does not collapse to a single solution for each low-level skill and instead encodes a variety of solutions. All of the above reward augmentations arise naturally from applying a variational lower bound to an entropy-regularized, multi-task RL formulation which uses latent variables as the task context input to the policy. For a detailed derivation, refer to hausman2018learning .
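A minimal sketch of this reward augmentation follows, with a linear-Gaussian stand-in for the learned inference function and illustrative weighting coefficients (the actual networks and coefficients differ):

```python
import numpy as np

rng = np.random.default_rng(1)
WINDOW, STATE_DIM, LATENT_DIM = 3, 2, 2

# Fixed linear map standing in for the learned inference network q(z | window).
W_inf = rng.normal(scale=0.1, size=(LATENT_DIM, WINDOW * STATE_DIM))

def inference_logprob(window, z, std=1.0):
    """Log-probability the inference function assigns to latent z, given a
    window of recent states (unit-variance Gaussian for simplicity)."""
    mu = W_inf @ np.asarray(window, dtype=float).ravel()
    return float(np.sum(-0.5 * ((z - mu) / std) ** 2
                        - np.log(std) - 0.5 * np.log(2.0 * np.pi)))

def augmented_reward(env_reward, window, z, policy_entropy,
                     alpha=0.01, beta=0.01):
    """Augmented reward: the task reward, plus a bonus for the inference
    function recovering z from the trajectory window (pushes trajectories
    for different latents apart), plus a policy entropy bonus (keeps several
    solutions alive). alpha and beta are illustrative weights."""
    return (env_reward
            + alpha * inference_logprob(window, z)
            + beta * policy_entropy)
```

Higher policy entropy and easier latent recovery both raise the augmented reward, which is exactly the pressure toward diverse, latent-distinguishable trajectories described above.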
The full robot training and transfer method consists of three stages.
Stage 1: Pre-Training in Simulation while Learning Skill Embeddings
We begin by training in simulation a multi-task policy for all low-level skills, along with a composable parameterization of that skill library (i.e. a skill embedding). This stage may be performed using any deep RL algorithm, together with the modified policy architecture and loss function described above. Our implementation uses Proximal Policy Optimization schulman2017ppo and the MuJoCo physics engine todorov2012mujoco .
The intuition behind our pre-training process is as follows. The policy obtains an additional reward if the inference function is able to predict the latent vector which was sampled from the embedding function at the beginning of the rollout. This is only possible if, for every latent vector, the policy produces a distinct trajectory of states, so that the inference function can easily predict the source latent vector. Adding these criteria to the RL reward encourages the policy to explore and encode a set of diverse policies that can perform each low-level skill in various ways, parameterized by the latent vector.
Stage 2: Learning Hierarchical Policies
In the second stage, we learn a high-level "composer" policy, represented in general by a probability distribution over the latent vector. The composer actuates the low-level policy by choosing a latent vector at each time step, composing the previously-learned skills. This hierarchical organization admits our novel approach to transfer learning: by freezing the low-level skill policy and embedding functions, and exploring only in the pre-learned latent space to acquire new tasks, we can transfer a multitude of high-level task policies derived from the low-level skills.
This stage can be performed directly on the real robot or in simulation. As we show in Sec. 3, composer policies may treat the latent space as either discrete or continuous, and may be found using learning, search-based planning, or even manual sequencing and interpolation. To succeed, the composer policy must explore the latent space of pre-learned skills and learn to exploit the behaviors the low-level policy exhibits when stimulated with different latent vectors. We hypothesize that this is possible because of the richness and diversity of low-level skill variations learned in simulation, which the composer policy can exploit by actuating the skill embedding.
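The control flow of this hierarchy can be sketched as follows; the composer and low-level policy are placeholders for components that would be learned or planned, and the environment interface is an assumption for illustration:

```python
import numpy as np

def run_composed(composer, low_level_policy, env, horizon=200):
    """Hierarchy sketch: at each time step a high-level composer chooses a
    latent vector z, and the frozen, pre-trained low-level policy maps
    (state, z) to a motor command. All exploration for the new task happens
    in the latent space; the low-level weights are never updated."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(horizon):
        z = composer(state)                  # high level: pick a skill latent
        action = low_level_policy(state, z)  # low level: frozen skill policy
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```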
Stage 3: Transfer and Execution
Lastly, we transfer the low-level skill policy, embedding function, and high-level composer policies to a real robot, and execute the entire system to perform high-level tasks.
3 Experiments
Point Environment
Before experimenting on complex robotics problems, we evaluate our approach in a point-mass environment. Its low-dimensional state and action spaces and its high interpretability make this environment our most basic test bed; we use it to verify the principles of our method and to tune its hyperparameters before deploying it in more complex experiments. Fig. 2 portrays a multi-task instantiation of this environment with four goals (skills). At each time step, the policy receives the point's position as its state and chooses a two-dimensional velocity vector as its action. The policy receives a negative reward equal to the distance between the point and the goal position.
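A minimal sketch of such a point environment (the goal placement, step size, and completion tolerance are illustrative assumptions, not our exact settings):

```python
import numpy as np

class PointEnv:
    """Multi-task point environment sketch: the state is the point's 2D
    position, the action a 2D velocity, and the reward the negative
    distance to the active goal."""

    def __init__(self, goal):
        self.goal = np.asarray(goal, dtype=float)
        self.pos = np.zeros(2)

    def reset(self):
        self.pos = np.zeros(2)
        return self.pos.copy()

    def step(self, velocity, dt=0.1):
        # Integrate the chosen velocity for one time step.
        self.pos = self.pos + dt * np.asarray(velocity, dtype=float)
        reward = -float(np.linalg.norm(self.pos - self.goal))
        done = -reward < 0.05  # close enough to the goal
        return self.pos.copy(), reward, done
```

Instantiating one environment per goal yields the four-skill multi-task setting shown in Fig. 2.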
After 15,000 time steps, the embedding network learns a multi-modal embedding distribution representing the four tasks (Fig. 2). Introducing entropy regularization hausman2018learning alters the trajectories significantly: instead of steering to the goal position in a straight line, the entropy-regularized policy encodes a distribution over possible solutions, with each latent vector producing a different one. This illustrates that our approach learns multiple distinct solutions for the same skill, and that those solutions are addressable via the latent vector input.
Sawyer Experiment: Reaching
We ask the Sawyer robot to move its gripper to within a small tolerance of a goal point in 3D space. The policy receives a state observation containing the robot's seven joint angles plus the Cartesian position of the robot's gripper, and chooses bounded incremental joint movements as actions.
We trained the low-level policy on eight goal positions in simulation, forming a 3D cuboid enclosing a volume in front of the robot (Fig. 4). The composer policies feed latent vectors to the pre-trained low-level skill policy to achieve high-level tasks, such as reaching previously-unvisited points (Fig. 5).
Composition Experiments
All Sawyer composition experiments use the same low-level skill policy, pre-trained in simulation. We experimented both with composition methods which transfer the low-level skills directly to the robot (direct), and with methods which use the low-level policy for a second stage of pre-training in simulation before transfer (sim2real).
Task interpolation in the latent space (direct)
We evaluate the embedding function to obtain the mean latent vector for each of the eight pre-training tasks, then feed linear interpolations of these means to the latent input of the low-level skill policy, transferred directly to the real robot. For a pair of latents, our experiment feeds the first latent for an initial period, then a linear blend of the two, and finally the second latent. We observe that linear interpolation in latent space induces an implicit motion plan between the two points, despite the fact that pre-training never experienced this state trajectory. In one experiment, we used this method iteratively to control the Sawyer robot to draw a U-shaped path around the workspace (Fig. 5.1).
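The interpolation schedule can be sketched as follows; splitting the rollout into equal thirds (hold, blend, hold) is an assumption for illustration, not the exact timing used on the robot:

```python
import numpy as np

def interpolation_schedule(z_a, z_b, total_steps):
    """Piecewise-linear latent schedule between two skills' mean latents:
    hold z_a for the first third of the rollout, blend linearly from z_a to
    z_b over the middle third, then hold z_b for the final third."""
    schedule = []
    for t in range(total_steps):
        frac = t / (total_steps - 1)
        if frac < 1 / 3:
            alpha = 0.0                     # hold the first skill's latent
        elif frac < 2 / 3:
            alpha = (frac - 1 / 3) * 3.0    # linear blend between latents
        else:
            alpha = 1.0                     # hold the second skill's latent
        schedule.append((1 - alpha) * np.asarray(z_a)
                        + alpha * np.asarray(z_b))
    return np.stack(schedule)
```

Feeding each entry of the schedule to the latent input of the frozen low-level policy produces the implicit motion plan between the two goals described above.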
End-to-end learning in the latent space (sim2real)
Using DDPG lillicrap2015ddpg , an off-policy reinforcement learning algorithm, we trained a composer policy to modulate the latent vector to reach a previously-unseen point. We then transferred the composer and low-level policies to the real robot. The policy achieved a gripper distance error within the task-completion threshold defined by our reward function (Fig. 5.2).
Search-based planning in the latent space (sim2real and direct)
We used Uniform Cost Search in the latent space to find a motion plan (i.e. a sequence of latent vectors) for moving the robot's gripper along a triangular trajectory. Our search space treats the latent vector corresponding to each skill ID as a discrete option. We execute a plan by feeding each latent in the sequence to the low-level policy for a fixed duration, during which the low-level policy executes in closed loop.
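A sketch of this planner follows; the skill latents, dynamics stand-in, tolerance, and depth bound are illustrative assumptions, with each discrete option representing one pre-trained skill's mean latent executed for a fixed duration:

```python
import heapq
import numpy as np

def latent_plan(start, goal, skill_latents, simulate_skill,
                tol=0.05, max_depth=6):
    """Uniform Cost Search over sequences of discrete latent options.
    skill_latents maps a skill name to its mean latent; simulate_skill rolls
    the low-level policy forward under that latent for a fixed duration and
    returns the resulting state. Path cost is accumulated state displacement."""
    frontier = [(0.0, 0, tuple(np.asarray(start, float)), [])]
    tie = 1  # tiebreaker so heapq never compares states or plans
    while frontier:
        cost, _, state, plan = heapq.heappop(frontier)
        s = np.array(state)
        if np.linalg.norm(s - goal) < tol:
            return plan  # cheapest plan reaching the goal
        if len(plan) >= max_depth:
            continue
        for name, z in skill_latents.items():
            nxt = simulate_skill(s, z)
            step_cost = float(np.linalg.norm(nxt - s))
            heapq.heappush(frontier,
                           (cost + step_cost, tie, tuple(nxt), plan + [name]))
            tie += 1
    return None  # no plan within the depth bound
```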
In simulation, this strategy quickly found a plan for tracing a triangle, and that plan transferred successfully to the real robot (Fig. 5.3). We replicated this experiment directly on the real robot, with no intermediate simulation stage; finding a motion plan for the triangle-tracing task required only a modest amount of real-robot execution time.
Sawyer Experiment: Box Pushing
We ask the Sawyer robot to push a box to a goal location relative to its starting position, defined by a 2D displacement vector in the table plane. The policy receives a state observation containing the robot's seven joint angles plus a relative Cartesian position vector between the robot's gripper and the box's centroid, and chooses bounded incremental joint movements as actions. In the real experimental environment, we track the position of the box using motion capture and merge this observation with proprioceptive joint angles from the robot.
Composition Experiments
Task interpolation in the latent space (direct)
We evaluated the embedding function to obtain the mean latent vector for each of the four pre-trained pushing skills (i.e. up, down, left, and right of the start position). We then fed the mean of the latent vectors of adjacent skills while executing the pre-trained policy directly on the robot (Fig. 8).
We find that, in general, this strategy induces the policy to move the block to a position between the two pre-trained skill goals. However, the magnitude and direction of the block movement were not easily predictable from the pre-trained goal locations, and this behavior was not reliable for half of the synthetic goals we tried.
Search-based planning in the latent space (sim2real)
Similar to the search-based composer in the reaching experiment, we used Uniform Cost Search in the latent space to find a motion plan (a sequence of latent vectors) for pushing the block to unseen goal locations (Fig. 9). We found that search-based planning was able to find a latent-space plan to push the block to any location within the convex hull formed by the four pre-trained goal locations. Additionally, our planner was able to push blocks to some targets significantly outside this area. Unfortunately, we were not able to reliably transfer these composed policies to the robot.
We attribute these transfer failures partially to the non-uniform geometry of the embedding space, and partially to the difficulty of transferring contact-based motion policies learned in simulation; we discuss these results further in Sec. 4.
4 Main Experimental Insights
The point environment experiments verify the principles of our method, and the single-skill Sawyer experiments demonstrate its applicability to real robotics tasks. Recall that all Sawyer skill policies used only joint-space control to actuate the robot, meaning that the skill policies and composer needed to learn how to use the robot to achieve task-space goals without colliding the robot with the world or itself.
The Sawyer composition experiments provide the most insight into the potential of latent skill decomposition methods for scaling simulation-to-real transfer in robotics. The method allows us to reduce a complex control problem (joint-space control to achieve task-space objectives) to a simpler one: control in latent skill-space to achieve task-space objectives.
We found that the method performs best on new skills which are interpolations of existing skills. We pre-trained on just eight reaching skills with full end-to-end learning in simulation, and all skills were always trained starting from the same initial position. Despite this narrow initialization, our method learned a latent representation which allowed later algorithms to quickly find policies that reach virtually any goal inside the manifold of the pre-training goals. Composed policies were also able to induce non-colliding joint-space motion plans between pre-trained goals (Fig. 5).
Secondly, a major strength of the method is its ability to combine with a variety of existing, well-characterized algorithms for robotic autonomy. In addition to model-free reinforcement learning, we successfully used manual programming (interpolation) and search-based planning in the latent space to quickly reach both goals and sequences of goals that were unseen during pre-training (Figs. 5, 8, and 9). Interestingly, we found that the latent space is useful for control not only in its continuous form, but also via a discrete approximation formed by the mean latent vectors of the pre-training skills. This opens the method to combination with a large array of efficient discrete planning and optimization algorithms for sequencing low-level skills to achieve long-horizon, high-level goals.
Conversely, algorithms which operate on fully continuous spaces can exploit the continuous latent space. We find that a DDPG-based composer with access only to a discrete latent space (formed from the latent means of the eight pre-trained reaching skills and interpolations of those skills) is significantly outperformed by a DDPG composer that leverages the entire embedding space as its action space (Fig. 10). This implies that the embedding function contains information on how to achieve skills beyond the instantiations on which the skill policy was pre-trained.
The method in its current form has two major challenges.
The first is the difficulty of the simulation-to-real transfer problem, even in the single-skill domain. We found in the Sawyer box-pushing experiment (Fig. 7) that our ability to train transferable policies was limited by our simulation environment's ability to accurately model friction, a well-known weakness of physics simulators for robotics. A more subtle challenge is evident in Fig. 4, which shows that our reaching policy did not transfer with complete accuracy to the real robot, despite reaching being a free-space motion task. We speculate that this is a consequence of the policy overfitting to the latent input during pre-training in simulation: if the skill latent vector provides all the information the policy needs to execute an open-loop trajectory to the goal, the policy is unlikely to learn closed-loop behavior.
The second major challenge is constraining the properties of the latent space and reliably training good embedding functions, which we found somewhat unstable and hard to tune. The present algorithm formulation places few constraints on the algebraic and geometric relationships between different skill embeddings. This leads to counterintuitive results, such as the mean of two pushing policies pushing in the expected direction but with unpredictable magnitude (Fig. 8), or the latent vector which induces a reach directly between two goals (e.g. A and B) actually residing much closer to the latent vector for goal A than to that for goal B (Fig. 5). This lack of constraints also makes it harder for composing algorithms to plan or learn in the latent space.
5 Conclusion
Our experiments illustrate the promise and challenges of applying state-of-the-art deep reinforcement learning to real robotics problems. For instance, our policies were able to learn and generalize task-space control, and even motion planning skills, starting from joint-space control, with no hand-engineering for those use cases. At the same time, the training and transfer process requires careful engineering and some hand-tuning. In the case of simulation-to-real techniques, our performance is also bounded by our ability to build and execute reasonable simulations of our robot and its environment, which is not a trivial task.
In future work, we plan to study how to learn skill embedding functions more reliably, with constraints which make them even more amenable to algorithmic composition, and to further explore how to learn and plan effectively in latent space. We also plan to combine our method with other transfer learning techniques, such as dynamics randomization peng2017simreal , to improve the transfer quality of embedded skill policies. Our hope is to refine the method into an easily-applicable recipe for skill learning and reuse. We also look forward to further exploring the relationship between our method and meta-learning techniques, and to combining our method with techniques for learning representations of the observation space.
6 Acknowledgements
The authors would like to thank Angel Gonzalez Garcia, Jonathon Shen, and Chang Su for their work on the garage (https://github.com/rlworkgroup/garage) reinforcement learning framework for robotics, on which the software for this work was based. This research was supported in part by National Science Foundation grants IIS-1205249, IIS-1017134, and EECS-0926052, the Office of Naval Research, the Okawa Foundation, and the Max Planck Society. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding organizations.
Bibliography
[1] P.A. Cooper. Paradigm shifts in designed instruction: From behaviorism to cognitivism to constructivism. Educational Technology, 33(5):12–19, 1993.

[2] G.L. Drescher. Made-up Minds: A Constructivist Approach to Artificial Intelligence. MIT Press, 1991.
[3] T.P. Lillicrap, J.J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
[4] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[5] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. JMLR, 17(1):1334–1373, 2016.
[6] Y. Chebotar, K. Hausman, M. Zhang, G. Sukhatme, S. Schaal, and S. Levine. Combining model-based and model-free updates for trajectory-centric reinforcement learning. In ICML, 2017.
[7] S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In ICRA, 2017.
[8] Y. Duan, X. Chen, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In ICML, 2016.

[9] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. IJRR, 37(4-5):421–436, 2018.
[10] K. Hausman, J.T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller. Learning an embedding space for transferable robot skills. In ICLR, 2018.
[11] A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, and S. Levine. Meta-reinforcement learning of structured exploration strategies. CoRR, abs/1802.07245, 2018.
 [12] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine. Diversity is all you need: Learning skills without a reward function. CoRR, abs/1802.06070, 2018.
 [13] N. Heess, G. Wayne, Y. Tassa, T. Lillicrap, M. Riedmiller, and D. Silver. Learning and transfer of modulated locomotor controllers. CoRR, abs/1610.05182, 2016.
 [14] T. Haarnoja, K. Hartikainen, P. Abbeel, and S. Levine. Latent space policies for hierarchical reinforcement learning. CoRR, abs/1804.02808, 2018.

[15] J.D. Co-Reyes, Y. Liu, A. Gupta, B. Eysenbach, P. Abbeel, and S. Levine. Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings. ArXiv e-prints, June 2018.
[16] P. Pastor, M. Kalakrishnan, L. Righetti, and S. Schaal. Towards associative skill memories. In Humanoids, Nov 2012.
[17] E. Rueckert, J. Mundo, A. Paraschos, J. Peters, and G. Neumann. Extracting low-dimensional control variables for movement primitives. In ICRA, May 2015.
 [18] A.A. Rusu, N.C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. CoRR, abs/1606.04671, 2016.
 [19] J. Rajendran, P. Prasanna, B. Ravindran, and M.M. Khapra. ADAAPT: A deep architecture for adaptive policy transfer from multiple sources. CoRR, abs/1510.02879, 2015.
[20] X.B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. CoRR, abs/1710.06537, 2017.
[21] F. Sadeghi and S. Levine. CAD2RL: Real single-image flight without a single real image. In RSS, 2017.
 [22] E. Tzeng, C. Devin, J. Hoffman, C. Finn, X. Peng, S. Levine, K. Saenko, and T. Darrell. Towards adapting deep visuomotor representations from simulated to real environments. CoRR, abs/1511.07111, 2015.
[23] O. Kroemer and G.S. Sukhatme. Learning relevant features for manipulation skills using meta-level priors. CoRR, abs/1605.04439, 2016.
[24] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
[25] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In IROS, 2012.