
Composing Ensembles of Policies with Deep Reinforcement Learning

Composition of elementary skills into complex behaviors to solve challenging problems is one of the key elements toward building intelligent machines. To date, there has been plenty of work on learning new policies or skills but almost no focus on composing them to perform complex decision-making. In this paper, we propose a policy ensemble composition framework that takes the robot's primitive policies and learns to compose them concurrently or sequentially through reinforcement learning. We evaluate our method in problems where traditional approaches either fail or exhibit high sample complexity to find a solution. We show that our method not only solves the problems that require both task and motion planning but also exhibits high data efficiency, which is currently one of the main limitations of reinforcement learning.



1 Introduction

Reinforcement learning (RL) (Sutton & Barto, 2018) is emerging as one of the promising methods to solve complex decision-making tasks by maximizing human-defined rewards (Silver et al., 2016; Qureshi et al., 2017, 2018; Levine et al., 2016). Currently, the primary focus of RL research is on learning new behaviors (Lillicrap et al., 2015; Schulman et al., 2015, 2017; Haarnoja et al., 2018b) rather than composing already acquired skills for a combinatorial outburst in the robot’s abilities (Lake et al., 2017).

In this paper, we propose a novel policy ensemble composition method (supplementary material and videos are available online) that takes a basic, low-level, and easily obtainable set of the robot’s policies and learns a composite model through standard or hierarchical RL (Lillicrap et al., 2015; Schulman et al., 2015, 2017; Haarnoja et al., 2018b; Dayan & Hinton, 1993; Vezhnevets et al., 2017; Florensa et al., 2017; Nachum et al., 2018). Our composite model combines low-level policies into complex strategies to solve challenging problems. We evaluate our method in challenging environments where learning a single policy is either not possible or does not yield satisfactory results. Our approach has the following salient features. First, it does not assume any structure on the low-level policies. These policies could be entirely agnostic of high-level tasks, and our method can still compose them to solve complex problems. Second, we do not make any assumption on how the low-level policies are obtained. They could be human-defined decision functions or behaviors learned via imitation or RL. Third, our method is trainable through both standard and hierarchical RL algorithms and demonstrates considerable improvements in data efficiency and overall performance.

Figure 1: Composition as Markov Decision Process. Fig. (a) represents the graphical model for a simple MDP. Fig. (b) is the augmented graphical model that integrates composition of sub-level policies. Fig. (c) is the new MDP with the augmented state-space.

2 Related Work

In this section, we present a brief overview of existing solutions to the problem of compositionality and their limitations. We highlight that in the past, the research in learning new primitive skills such as Dynamic Movement Primitives (DMPs) (Schaal et al., 2005) received more attention than building methods to compose already acquired skills. We also mention hierarchical reinforcement learning methods to distinguish them from compositional reinforcement learning methods.

2.1 Composition of control objectives

Compositionality of past skills to acquire new skills has hardly been considered in RL, and to the best of the authors’ knowledge, there exist only a few approaches that address this problem, and only to a limited extent. For instance, Todorov (2009) and Haarnoja et al. (2018a) addressed the problem of compositionality by combining the independent rewards and Q-functions, respectively, of human-defined sub-tasks of the given complex problem. Their methods extract the composite policy by merely maximizing the averaged reward/action-value functions of the individual sub-level tasks. In a similar vein, Sahni et al. (2017) propose a temporal-logic based composition of low-level policies that solve sub-tasks, defined by human experts, of the given task. In our work, we aim for the composition of general-purpose skills, policies, or motor primitives that are oblivious of complicated high-level tasks and can be reused by the agent wherever and whenever needed, even in different domains or tasks, during operation. Another recent approach (Hausman et al., 2018) uses entropy regularization to learn diverse transferable skills that are composed sequentially to solve the same or a slightly different task. In contrast, our method does not aim to learn composable skills but uses primitive motor skills for composition. Furthermore, we address the challenge of the concurrent composition of skills rather than just sequential execution.

2.2 Dynamic Movement Primitives

Dynamic Movement Primitives (DMPs) (Ijspeert et al., 2002) correspond to compact, parameterized, and modular representations of robot skills. The ultimate goal of DMP research is to obtain elemental movements that could be combined to obtain complex skills. A lot of research in this area revolves around formulating and learning such DMPs (Schaal et al., 2005; Ijspeert et al., 2013; Paraschos et al., 2013; Tamosiunaite et al., 2011; Matsubara et al., 2011). However, there are again only a few approaches that address the challenge of composing such DMPs in an efficient, scalable manner. To date, such DMPs are usually combined through human-defined heuristics, mixture models, or learning from demonstrations (Konidaris et al., 2012; Muelling et al., 2010; Arie et al., 2012). There also exist some techniques that determine a model of primitive robot actions, from which the composite policies are extracted via planning (Veeraraghavan & Veloso, 2008; Zoliner et al., 2005). However, such methods tend to be less data efficient, as learning the model of robot actions requires a vast amount of interactive experience (Kaelbling & Lozano-Pérez, 2017). Unlike the methods mentioned above, we do not aim to learn composable representations or models of low-level policies. Instead, we strive for a compositional approach that can blend any black-box elemental controllers, policies, or even DMPs through conventional RL techniques.

2.3 Hierarchical Reinforcement Learning

Hierarchical Reinforcement Learning (HRL) commonly corresponds to a two-level policy structure that learns to solve complex tasks either by decomposing them into subtasks (Nachum et al., 2018) or through temporal abstraction using the options framework. The task decomposition methods learn a high-level policy, by maximizing the task reward, to generate intermediate goals for the low-level policy, which solves the overall complex task by achieving them. These methods outperform traditional RL in problems with sub-optimal reward functions requiring complex decision making and task planning (Dayan & Hinton, 1993; Vezhnevets et al., 2017; Nachum et al., 2018). The options framework (Sutton et al., 1999; Precup, 2000) learns a set of sub-level policies (options), their termination functions, and a high-level policy over options to solve the given task. Prior work on the option-based HRL formulation required pre-defined heuristics to determine options. Recent advancements led to the option-critic algorithm (Bacon et al., 2017), which concurrently learns the high-level policy over options and the underlying options with their termination functions. Despite being an exciting step, the option-critic algorithm requires regularization (Vezhnevets et al., 2016; Harb et al., 2018), or else it ends up discovering options for every time step or a single option for the entire task.

In practice, HRL methods tend to exhibit high sample complexity and therefore, require a huge number of interactions with the real environment. The primary objective of our composition model is to improve the sample complexity of conventional RL and HRL by exploiting prior knowledge or skills. In our experiments, we show that our method is trainable through HRL such as (Nachum et al., 2018) and exhibits significant improvement in data-efficiency and performance compared to traditional HRL methods.

3 Background

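We consider a standard RL formulation (Fig. 1 (a)) based on a Markov Decision Process (MDP) defined by a tuple $\{\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}\}$, where $\mathcal{S}$ and $\mathcal{A}$ represent the state and action space, $\mathcal{P}$ is the set of transition probabilities, and $\mathcal{R}$ denotes the reward function. At time $t$, the agent observes a state $s_t \in \mathcal{S}$ and performs an action $a_t \in \mathcal{A}$. The agent’s action $a_t$ transitions the environment state from $s_t$ to $s_{t+1}$ with respect to the transition probability $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ and leads to a reward $r_t \in \mathcal{R}$.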

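For compositionality, we extend the standard RL framework by assuming that the agent has access to a finite set of primitive policies $\Pi = \{\pi_i\}_{i=0}^{N}$ that could correspond to the agent’s skills, controllers, or motor primitives. Our composition model is agnostic to the structure of the primitive policy functions, but for the sake of this work, we assume that each of the sub-policies $\{\pi_i\}_{i=0}^{N}$ solves the MDP defined by a tuple $\{\hat{\mathcal{S}}, \mathcal{A}, \hat{\mathcal{P}}, \hat{\mathcal{R}}^i\}$. Therefore, $\hat{\mathcal{S}}$, $\hat{\mathcal{P}}$, and $\hat{\mathcal{R}}^i$ are the state-space, transition probabilities, and rewards of the primitive policy $\pi_i$, respectively. Each of the primitive policies $\pi_i : \hat{\mathcal{S}} \times \mathcal{A} \to [0, 1]$, $\forall i \in [0, 1, \dots, N]$, takes a state $\hat{s} \in \hat{\mathcal{S}}$ and outputs a distribution over the agent’s action space $\mathcal{A}$. We define our composition model as a composite policy $\pi^c_\theta : \mathcal{S} \times \mathcal{A}^{N+1} \times \mathcal{A} \to [0, 1]$, parameterized by $\theta$, that outputs a distribution over the action space conditioned on the environment’s current state $s \in \mathcal{S}$ and the primitive policies’ outputs $\{\hat{a}_i \in \mathcal{A}\}_{i=0}^{N} \sim \Pi$. The state space of the composite model is $\mathcal{S} = [\hat{\mathcal{S}}, \mathcal{U}]$. The space $\mathcal{U}$ could include any task-specific information such as target locations. Hence, in our framework, the state inputs to the primitive policies and the composite policy need not be the same.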

In the remainder of this section, we show that our composition model solves an MDP problem. To avoid clutter, we assume that both the primitive policy ensemble and the composite policy have the same state space $\mathcal{S}$, i.e., $\hat{\mathcal{S}} = \mathcal{S}$. The composition model samples an action from a distribution parameterized by the actions of the sub-level policies and the state of the environment. We can augment the naive graphical model in Fig. 1 (a) to incorporate the outputs of the sub-policies to determine the composite actions, as shown in Fig. 1 (b). It can be seen that by defining a new state space as $\mathcal{S}^c = [\mathcal{S}, \mathcal{A}^{N+1}]$, where $\mathcal{A}^{N+1}$ are the outputs of the sub-level policies, we can construct a new MDP, as shown in Fig. 1 (c), to represent our composite model. This new MDP is defined as $\{\mathcal{S}^c, \mathcal{A}, \mathcal{P}^c, \mathcal{R}\}$, where $\mathcal{S}^c$ is the new composite state-space, $\mathcal{A}$ is the action-space, $\mathcal{P}^c$ is the transition probability function, and $\mathcal{R}$ is the reward function for the given task.
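The augmented state-space construction can be illustrated with a minimal sketch, assuming states and actions are plain lists of floats and that each primitive policy is a callable returning one sampled action. The names `augment_state`, `move_right`, and `move_up` are hypothetical and only serve the illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical interface: a primitive policy maps a state to one sampled action.
Policy = Callable[[Sequence[float]], Vector]


def augment_state(state: Sequence[float], primitives: Sequence[Policy]) -> Vector:
    """Build the composite state [s, a_0, ..., a_N] by appending every primitive's action."""
    augmented = list(state)
    for policy in primitives:
        augmented.extend(policy(state))
    return augmented


# Two toy deterministic primitives standing in for learned skills.
def move_right(state: Sequence[float]) -> Vector:
    return [1.0, 0.0]


def move_up(state: Sequence[float]) -> Vector:
    return [0.0, 1.0]


composite_state = augment_state([0.5, -0.2, 0.1], [move_right, move_up])
```

The composite policy then acts on `composite_state`, so the primitive outputs are treated as part of the observation rather than as separate decision variables.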

4 Policy Ensemble Composition

In this section, we present our policy ensemble composition framework, shown in Fig. 2. Our composition model consists of i) the encoder network that takes the outputs of primitive policies and embeds them into latent spaces; ii) the decoder network that takes the current state $s$ of the environment and the latent embeddings from the encoder network to parameterize the attention network; iii) the attention network that outputs the probability distribution over the primitive low-level policies representing their mixture weights. The remainder of the section explains the individual models of our composition framework and the overall training procedure.

Figure 2: Policy ensemble composition model that takes the state information $s$ and a set of primitive policies’ outputs $\{\hat{a}_i\}_{i=0}^{N}$ to compute a composite action $a^c$.

4.1 Encoder Network

Our encoder is a bidirectional recurrent neural network (BRNN) that consists of Long Short-Term Memory units (Hochreiter & Schmidhuber, 1997). The encoder takes the outputs of the policy ensemble $\{\hat{a}_i\}_{i=0}^{N}$ and transforms them into latent states of the forward and backward RNN, denoted as $\{h^f_i\}$ and $\{h^b_i\}$, respectively, where $h^f_i, h^b_i \in \mathbb{R}^d$. The final states of the forward and backward RNN correspond to their last hidden states, denoted as $h^f$ and $h^b$, respectively, in Fig. 2.
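As a rough illustration of the bidirectional encoding, the sketch below runs one forward and one backward pass over the sequence of primitive actions. It deliberately swaps the LSTM units for a plain tanh recurrent cell to stay short, so it is not the encoder used in the paper; the function names and weight layout are assumptions made for this example.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def rnn_pass(inputs: List[Vector], w_in: Matrix, w_h: Matrix) -> List[Vector]:
    """Run a vanilla tanh RNN (a stand-in for an LSTM) and return one hidden state per input."""
    hidden = [0.0] * len(w_h)
    states = []
    for x in inputs:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[k], x))
                + sum(w * hj for w, hj in zip(w_h[k], hidden))
            )
            for k in range(len(w_h))
        ]
        states.append(hidden)
    return states


def bidirectional_encode(
    actions: List[Vector], w_in_f: Matrix, w_h_f: Matrix, w_in_b: Matrix, w_h_b: Matrix
) -> Tuple[List[Vector], List[Vector]]:
    """Return forward and backward hidden states, both indexed by primitive."""
    forward = rnn_pass(actions, w_in_f, w_h_f)
    backward = rnn_pass(list(reversed(actions)), w_in_b, w_h_b)
    backward.reverse()  # realign backward states with the original ordering
    return forward, backward


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
primitive_actions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
h_forward, h_backward = bidirectional_encode(primitive_actions, identity, zeros, identity, zeros)
```

Because the recurrence consumes one action per step, the same weights handle any number of primitives, which is the property the architecture relies on for skill sets of variable length.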

4.2 Decoder Network

Our decoder is a simple feed-forward neural network that takes the last hidden states of the forward and backward encoder network, i.e., $h^f$ and $h^b$, and the current state $s$ of the environment to map them into a latent space $h$. The state input to the decoder network is defined as $s : [\hat{s}, u]$, where $\hat{s} \in \hat{\mathcal{S}}$ is the state input to the low-level policy ensemble and $u \in \mathcal{U}$ could be any additional information related to the given task, e.g., goal position of the target to be reached by the agent.

4.3 Attention Network

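The composition weights $\{w_i\}_{i=0}^{N}$ (see Fig. 2) are determined by the attention network as follows:

$$q_i = W^{T} \tanh\left(W_f h^f_i + W_b h^b_i + W_d h\right); \quad \forall i \in [0, N] \tag{1}$$

where $W_f, W_b, W_d \in \mathbb{R}^{d \times d}$ and $W \in \mathbb{R}^{d}$. The weights $\{w_i\}_{i=0}^{N}$ for the composite policy are computed using gumbel-softmax, denoted as $\mathrm{softmax}(q/T)$, where $T$ is the temperature term (Jang et al., 2016).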
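The following sketch shows one way such mixture weights could be computed, assuming an additive (tanh) scoring of each primitive's forward and backward encoder states against the decoder output, followed by a temperature-scaled softmax with optional Gumbel noise. All function and parameter names here are hypothetical, and the weight matrices are toy values.

```python
import math
import random
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def attention_scores(
    h_f: List[Vector], h_b: List[Vector], h: Vector,
    w_f: Matrix, w_b: Matrix, w_d: Matrix, w: Vector,
) -> Vector:
    """Assumed scoring form: q_i = w^T tanh(W_f h^f_i + W_b h^b_i + W_d h)."""
    d_term = matvec(w_d, h)
    scores = []
    for hf_i, hb_i in zip(h_f, h_b):
        f_term = matvec(w_f, hf_i)
        b_term = matvec(w_b, hb_i)
        hidden = [math.tanh(f + b + d) for f, b, d in zip(f_term, b_term, d_term)]
        scores.append(sum(wk * hk for wk, hk in zip(w, hidden)))
    return scores


def gumbel_softmax(q: Vector, temperature: float, rng: Optional[random.Random] = None) -> Vector:
    """Return softmax((q + g) / T); the Gumbel noise g is skipped when rng is None."""
    if rng is not None:
        q = [qi - math.log(-math.log(rng.uniform(1e-9, 1.0 - 1e-9))) for qi in q]
    scaled = [qi / temperature for qi in q]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


identity = [[1.0, 0.0], [0.0, 1.0]]
enc_forward = [[0.1, 0.2], [0.1, 0.2]]
enc_backward = [[0.0, 0.3], [0.0, 0.3]]
scores = attention_scores(enc_forward, enc_backward, [0.5, -0.5], identity, identity, identity, [1.0, 1.0])
weights = gumbel_softmax(scores, temperature=0.5)
noisy_weights = gumbel_softmax(scores, temperature=0.5, rng=random.Random(0))
```

A lower temperature pushes the weights toward a one-hot selection of a single primitive (sequential composition), while a higher temperature yields softer blends (concurrent composition).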

4.4 Composite policy

Given the primitive policy ensemble $\{\hat{a}_i\}_{i=0}^{N}$, the composite action is the weighted sum of all primitive policies’ outputs, i.e., $a^c = \sum_{i=0}^{N} w_i \hat{a}_i$. Since we consider the primitive policies to be Gaussian distributions, the output of each primitive policy is parameterized by mean $\mu$ and variance $\sigma$, i.e., $\{\hat{a}_i \sim \mathcal{N}(\mu_i, \sigma_i)\}_{i=0}^{N}$. Hence, the composite policy can be represented as $\pi^c_\theta = \mathcal{N}(\mu^c, \sigma^c)$, where $\mathcal{N}$ denotes a Gaussian distribution, $\mu^c = \sum_{i} w_i \mu_i$, and $\sigma^c = \sum_{i} w_i \sigma_i$. Given the mixture weights, other types of primitive policies, such as DMPs (Schaal et al., 2005), can also be composed together by the weighted combination of their normalized outputs.
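A minimal sketch of this weighted combination is given below, assuming diagonal Gaussian primitives and assuming that both the mean and the variance parameters are mixed with the same weights. The helper name `compose_gaussians` is hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def compose_gaussians(
    weights: Sequence[float], means: Sequence[Vector], variances: Sequence[Vector]
) -> Tuple[Vector, Vector]:
    """Mix per-primitive Gaussian parameters into one composite mean and variance."""
    dim = len(means[0])
    mu_c = [sum(w * mu[k] for w, mu in zip(weights, means)) for k in range(dim)]
    var_c = [sum(w * var[k] for w, var in zip(weights, variances)) for k in range(dim)]
    return mu_c, var_c


# Blend a "move right" and a "move up" primitive with weights 0.75 and 0.25.
mu_c, var_c = compose_gaussians(
    weights=[0.75, 0.25],
    means=[[1.0, 0.0], [0.0, 1.0]],
    variances=[[0.2, 0.2], [0.4, 0.4]],
)
```

With one-hot weights the composite policy reduces to a single primitive, so sequential skill selection is a special case of the same computation.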

4.5 Composite model training objective

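The general objective of RL methods is to maximize the cumulative expected reward, i.e., $J(\pi^c_\theta) = \mathbb{E}_{\pi^c_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$, where $\gamma \in (0, 1]$ is a discount factor. We consider policy gradient methods to update the parameters $\theta$ of our composite model, i.e., $\theta \leftarrow \theta + \eta \nabla_\theta J(\pi^c_\theta)$, where $\eta$ is the learning rate. Our composite policy can be trained through both standard RL and HRL methods, as described in the following subsections.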
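For a single rollout, the discounted cumulative reward inside this objective can be computed as in the short sketch below. The function name is hypothetical, and the expectation over rollouts is left out.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma^t * r_t, accumulated backwards to avoid explicit powers."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


episode_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```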

4.5.1 Standard Reinforcement learning

In standard RL, the policy gradients are determined by either on-policy or off-policy updates (Lillicrap et al., 2015; Schulman et al., 2015, 2017; Haarnoja et al., 2018b), and any of them could be used to train our composite model. However, in this paper, we consider the off-policy soft actor-critic (SAC) method (Haarnoja et al., 2018b) for the training of our policy function. SAC maximizes the expected entropy $\mathcal{H}(\cdot)$ in addition to the expected reward, i.e.,

$$J(\pi^c_\theta) = \sum_{t=0}^{T} \mathbb{E}_{\pi^c_\theta}\left[r(s_t, a_t) + \lambda \mathcal{H}\left(\pi^c_\theta(\cdot \mid s_t, \{\hat{a}_i\}_{i=0}^{N})\right)\right] \tag{2}$$

where $\lambda$ is a hyperparameter. We use SAC as it encourages exploration and has been shown to capture the underlying multiple modes of an optimal behavior. Since there is no direct method to estimate a low-variance gradient of Eq. (2), we use an off-policy value function-based optimization algorithm (for details refer to Appendix A.1 of the supplementary material).
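To make the entropy bonus concrete, the sketch below evaluates an entropy-regularized return for one rollout, assuming a diagonal Gaussian policy whose per-step variances are known. The names `gaussian_entropy`, `soft_return`, and `entropy_weight` are hypothetical, and the sketch evaluates the objective only; it is not a training procedure.

```python
import math
from typing import Sequence


def gaussian_entropy(variances: Sequence[float]) -> float:
    """Differential entropy of a diagonal Gaussian: sum_k 0.5 * log(2 * pi * e * var_k)."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * v) for v in variances)


def soft_return(
    rewards: Sequence[float],
    variances_per_step: Sequence[Sequence[float]],
    entropy_weight: float,
) -> float:
    """Sum of reward plus weighted policy entropy over one rollout."""
    return sum(
        r + entropy_weight * gaussian_entropy(v)
        for r, v in zip(rewards, variances_per_step)
    )


plain = soft_return([1.0, 2.0], [[1.0], [1.0]], entropy_weight=0.0)
regularized = soft_return([1.0, 2.0], [[1.0], [1.0]], entropy_weight=0.1)
```

Setting `entropy_weight` to zero recovers the ordinary return, while a positive value rewards policies that keep their action distribution broad.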

4.5.2 Hierarchical Reinforcement Learning

In HRL, there are currently two streams - task decomposition through sub-goals (Nachum et al., 2018) and option framework (Bacon et al., 2017) that learns temporal abstractions. In the options framework, the options can be composite policies that are acquired with their termination functions. In task decomposition methods that generate sub-goal through high-level policy, the low-level policy can be replaced with our composite policy. In our work, we use the latter approach (Nachum et al., 2018), known as HIRO algorithm, to train our policy function.

Like standard HIRO, we use a two-level policy structure. At each time step $t$, the high-level policy $\pi^{hi}_{\theta'}$, with parameters $\theta'$, observes a state $s_t$ and takes an action by generating a goal $g_t \in \mathcal{S}$ in the state-space for the composite low-level policy $\pi^{c:low}_\theta$ to achieve. The policy $\pi^{c:low}_\theta$ takes the state $s_t$, the goal $g_t$, and the primitive actions $\{\hat{a}_i\}_{i=0}^{N}$ to predict a composite action $a^c_t$ through which the agent interacts with the environment. The high-level policy is trained to maximize the expected task rewards given by the environment, whereas the composite low-level policy is trained to maximize the expected intrinsic reward defined as the negative of the distance between the current and goal states, i.e., $r(s_t, g_t, a_t, s_{t+1}) = -\lVert s_t + g_t - s_{t+1} \rVert_2$. To conform with the HIRO settings, we perform off-policy correction of the high-level policy experiences, and we train both the high- and low-level policies via the TD3 algorithm (Fujimoto et al., 2018) (for details refer to Appendix A.2 of the supplementary material).
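The intrinsic reward for the low-level policy can be sketched as follows, assuming (as in HIRO) that the goal is expressed as a desired offset from the current state, so the target position is `state + goal`. The function name is hypothetical.

```python
import math
from typing import Sequence


def intrinsic_reward(
    state: Sequence[float], goal: Sequence[float], next_state: Sequence[float]
) -> float:
    """Negative Euclidean distance between the reached state and the target state + goal."""
    return -math.sqrt(sum((s + g - n) ** 2 for s, g, n in zip(state, goal, next_state)))


missed = intrinsic_reward([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])   # agent did not move
reached = intrinsic_reward([0.0, 0.0], [3.0, 4.0], [3.0, 4.0])  # agent hit the sub-goal
```

The reward is largest (zero) when the low-level policy lands exactly on the sub-goal, independent of the task reward seen by the high-level policy.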

(a) Ant Random Goal
(b) Ant Cross Maze
(c) Pusher
(d) HalfCheetah Hurdle
(e) Ant Maze
(f) Ant Push
(g) Ant Fall
Figure 3: Benchmark control and manipulation tasks requiring an agent to reach or move the object to the given targets (shown in red for pusher and green for rest).

5 Experiments and Results

We evaluate and compare our method against standard RL and HRL approaches in challenging environments (shown in Fig. 3) that require complex task and motion planning. The implementation details of all presented methods and environment settings are provided in Appendix C of the supplementary material. We also perform an ablative study in which we remove different components of our composite model to highlight their importance.

We consider the following seven environments for our analysis: (1) Pusher: A simple manipulator has to push an object to a given target location. (2) Ant Random Goal: In this environment, a quadruped Ant is trained to reach a randomly sampled goal location in a confined circular region. (3) Ant Cross Maze: The cross-maze contains three target locations. The task for the quadruped Ant is to reach any of the three given targets by navigating through a 3D maze without collision. (4) HalfCheetah Hurdle: In this problem, the task for the HalfCheetah is to run and jump over three barriers to reach the given target location. (5) Ant Maze: A $\supset$-shaped maze poses a challenging navigation task for the quadruped Ant. In this task, the agent is given random targets all along the maze to reach during training. However, during evaluation, we test the agent on reaching the farthest end of the maze. (6) Ant Push: A challenging environment that requires both task and motion planning. The environment contains a movable block, and the goal region is located behind that block. The task for the agent is to reach the target by first moving to the left of the maze so that it can move up and right to push the block out of the way. (7) Ant Fall: A navigation task where the target is located across a rift in front of the agent’s initial position. There is also a movable block, so the agent has to move to the right, push the block forward to fill the gap, walk across, and move to the left to reach the target location.

In all tasks, we also acquire primitive skills of the agent for our composite policy. For Ant, we use four basic policies for moving left, right, up, and down. The pusher uses two primitive policies that are to push an object to the left and down. In HalfCheetah hurdle environment, the low-level policies include jumping and running forward. Furthermore, in all environments, except pusher, the primitive policies were agnostic of high-level target locations that were therefore provided separately to our composite model via decoder network.

Methods Environments
Ant Random Goal Ant Cross Maze Pusher HalfCheetah Hurdle
Our Method
Table 1: Performance comparison of our model against SAC (Haarnoja et al., 2018b), TRPO (Schulman et al., 2015), and PPO (Schulman et al., 2017) on benchmark control tasks in terms of the distance (lower is better) of an agent from the given target. The mean final distances with standard deviations over ten trials are reported. We also normalize the reported values by the agent’s initial distance from the goal, so values close to 1 or higher indicate failure. It can be seen that our method (shown in bold) accomplishes the tasks by reaching the goals, whereas other methods fail, except for SAC in the simple Pusher and Ant Random Goal environments.
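The normalized distance metric described in the caption can be sketched as below. The positions in the example are made-up illustrative values rather than results from the paper, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import math
import statistics
from typing import Sequence, Tuple


def normalized_distance(
    final_pos: Sequence[float], initial_pos: Sequence[float], goal: Sequence[float]
) -> float:
    """Final distance to the goal divided by the initial distance; about 1 or more means failure."""
    return math.dist(final_pos, goal) / math.dist(initial_pos, goal)


def summarize(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over trials (the estimator is an assumption)."""
    return statistics.mean(values), statistics.stdev(values)


# Illustrative trials only: (final position, initial position, goal).
trials = [
    ((3.0, 0.0), (0.0, 0.0), (4.0, 0.0)),
    ((3.5, 0.0), (0.0, 0.0), (4.0, 0.0)),
]
scores = [normalized_distance(f, i, g) for f, i, g in trials]
mean_score, std_score = summarize(scores)
```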

5.1 Comparative study

In our comparative studies, we divide our test environments into two groups. The first group includes the Pusher, Ant Random Goal, Ant Cross Maze, and HalfCheetah Hurdle environments, whereas the second group comprises the remaining environments (Ant Maze, Ant Push, and Ant Fall) that require task and motion planning under weak reward signals.

Figure 4: Comparison results of our method against several standard RL methods, averaged over ten trials, in a set of difficult tasks. The vertical and horizontal axes represent the distance of the agent/object from the target and the environment steps in millions, respectively. Note that our composition framework learns to solve the tasks with high sample efficiency, whereas other benchmark methods either fail or perform poorly.

In the first group of settings, we compare our composite model trained with SAC (Haarnoja et al., 2018b) against the standard Gaussian policies obtained using SAC (Haarnoja et al., 2018b), PPO (Schulman et al., 2017), and TRPO (Schulman et al., 2015). We exclude HRL methods in these cases for two reasons. First, the environment rewards sufficiently represent the underlying task, whereas HRL approaches are applicable in cases that have a weak reward signal or require task and motion planning. Second, HRL methods usually need a large number of training steps, generally many more than traditional RL methods. Table 1 presents the mean and standard deviation of the agent’s final distance from the given targets at the end of an evaluation rollout over the ten trials. Fig. 4 shows the mean learning performance over all trials during the three million training steps. In this set of problems, TRPO and PPO entirely fail to reach the goal, and SAC performs reasonably well only in the simple Ant Random Goal and Pusher environments and fails in the other cases. Our composite policy obtained using SAC successfully solves all tasks and exhibits high data-efficiency by learning in merely a few thousand training steps.

Figure 5: Performance comparison of our composition model trained with HIRO against standard HIRO in three challenging environments with a standard Ant of 150 units torque limit. We report the mean and standard error, over ten trials, of the agent’s final distances from the given goals, normalized by their initial distance, over 7 million steps.

In our second group of environments, we use distance-based rewards that are weak, as greedily following them does not lead to solving the problem. Furthermore, in these environments, policies trained with standard RL, including our composite policy, failed to solve the problem even after 20 million training steps. Therefore, we trained our composite policy with HIRO (Nachum et al., 2018) and compared its performance against the standard HIRO formulation (Nachum et al., 2018). We also tried to include the option-critic framework (Bacon et al., 2017), but we were unable to obtain any considerable results with its online implementation despite several attempts at parameter tuning. One of the reasons option-critic fails is that it relies purely on task rewards to learn, which makes it inapplicable to cases with weak reward signals (Nachum et al., 2018). Furthermore, unlike HIRO, which used a modified Ant with a joint torque limit of 30 units, we use the standard MuJoCo Ant, which has a torque limit of 150 units; this makes learning even harder, as the Ant is now more prone to instability.

Fig. 5 shows the learning performance, averaged over ten trials, during 7 million steps. In these problems, the composite policy with HIRO outperforms standard HIRO (Nachum et al., 2018) by a significant margin, which demonstrates the utility of solving RL tasks through composition by leveraging basic pre-acquired skills. HIRO performs poorly with the standard Ant as it imposes a harder control problem: the agent must also learn to balance the Ant to prevent it from flipping over due to high torques. We were able to replicate the results of HIRO (Nachum et al., 2018) on their modified Ant (torque limit 30), and our composition model also gave comparatively better results on the modified Ant than on the standard Ant. However, we use the standard Ant for consistency across all Ant environments presented in this paper. In the Ant Fall environment, the composition model struggles to perform well, which we believe is because the low-level policies were trained in a 2D planar space rather than a 3D space with an elevation, which slightly changes the underlying state-space.

Figure 6: Ablative Study: Performance comparison, averaged over ten trials, of our composite model against its ablated variations that lack the attention model or both the attention model and the bidirectional RNN (AttBRNN) in three different environments.

5.2 Ablative study

We remove the attention network, and both the attention network and the BRNN (AttBRNN), from our composition model to highlight their importance in the proposed architecture for solving complex problems. We train all models with SAC (Haarnoja et al., 2018b). The first model is our composite policy without attention, in which the decoder network takes the state information and the last hidden states of the encoder (BRNN) to directly output actions rather than mixture weights. The second model is without both the attention network and the BRNN; it is a feed-forward neural network that takes the state information and the primitive actions and predicts the action to interact with the environment. Fig. 6 shows the mean performance comparison, over ten trials, of our composite model against its ablated versions on the Ant Random Goal, Ant Cross Maze, and Pusher environments. We exclude the remaining test environments from this study, as the ablated models completely failed to perform or show any progress. Note that the better performance of our method compared to the ablated versions highlights the merits of our architecture design. Intuitively, the BRNN allows the dynamic composition of a skill set of variable length, and the attention network bypasses the complex transformation of action embeddings (model without attention) or of actions and state information (AttBRNN model) directly to the action space.

6 Conclusions and Future work

We present a novel policy ensemble composition method that concurrently or sequentially composes primitive motor policies of the robot through RL to solve challenging problems. Our experiments show that composition is vital for solving problems that require task and motion planning, where reinforcement learning and hierarchical reinforcement learning methods either fail or need a massive number of interactions with the environment for learning.

In our future work, we plan to extend our composition framework to acquire new skills that do not exist in the given primitive skillset but are necessary to solve the given problem. We also aim towards a system that learns hierarchies of composition models by combining primitive policies into complex policies, which would further be composed together for a combinatorial outburst in the robot’s skillset. In our supplementary material (Appendix B), we provide a detailed discussion with preliminary results to concretely illuminate our future research direction.


Appendix A Composite Model Training Algorithms

A.1 Training with Soft Actor-Critic

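In this section, we briefly describe the procedure to train our composition model using SAC (Haarnoja et al., 2018b). Although any RL method can be used to optimize our model, we use SAC as it is reported to perform better than other training methods. Our composite policy is a tractable function $\pi^c_\theta(a_t \mid s_t, \{\hat{a}_i\}_{i=0}^{N})$ parameterized by $\theta$. The composite policy update through SAC requires the approximation of Q- and value-functions. The parameterized value- and Q-functions are denoted as $V_\psi(s_t)$ with parameters $\psi$, and $Q_\phi(s_t, a_t)$ with parameters $\phi$, respectively. Since the SAC algorithm builds on soft-policy iteration, the soft value-function $V_\psi$ and soft Q-function $Q_\phi$ are learned by minimizing the squared residual error $J_V(\psi)$ and the squared Bellman error $J_Q(\phi)$, respectively, i.e.,

$$J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[\tfrac{1}{2}\left(V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi^c_\theta}\left[Q_\phi(s_t, a_t) - \log \pi^c_\theta(a_t \mid s_t, \{\hat{a}_i\}_{i=0}^{N})\right]\right)^2\right] \tag{3}$$

$$J_Q(\phi) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[\tfrac{1}{2}\left(Q_\phi(s_t, a_t) - \hat{Q}(s_t, a_t)\right)^2\right] \tag{4}$$

where $\mathcal{D}$ is a replay buffer, and $\hat{Q}$ is the Bellman target computed as follows:

$$\hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim \mathcal{P}}\left[V_{\bar{\psi}}(s_{t+1})\right] \tag{5}$$

The function $V_{\bar{\psi}}$ is the target value function with parameters $\bar{\psi}$. The parameters $\bar{\psi}$ are the moving average of the parameters $\psi$ computed as $\bar{\psi} \leftarrow \tau \psi + (1 - \tau)\bar{\psi}$, where $\tau$ is the smoothing coefficient. Finally, the policy parameters are updated by minimizing the following expected KL-divergence:

$$J_\pi(\theta) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[D_{\mathrm{KL}}\left(\pi^c_\theta(\cdot \mid s_t, \{\hat{a}_i\}_{i=0}^{N}) \,\Big\|\, \frac{\exp\left(Q_\phi(s_t, \cdot)\right)}{Z_\phi(s_t)}\right)\right] \tag{6}$$

where $Z_\phi(s_t)$ is a partition function that normalizes the distribution. Since, just like in SAC, our Q-function is differentiable, the above cost function can be optimized through a simple reparameterization trick; see Haarnoja et al. (2018b) for details. Like SAC, we also maintain two Q-functions that are trained independently, and we use the minimum of the two Q-functions to compute Eqn. 3 and Eqn. 6. This way of using two Q-functions has been shown to alleviate the positive bias problem in the policy improvement step. The overall training procedure is summarized in Algorithm 1.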
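Three of the ingredients above are simple enough to sketch directly: the Bellman target built from the target value function, the moving-average update of the target parameters, and the minimum over two Q-estimates. The sketch below works on plain floats and flat parameter lists; the function names are hypothetical and no function approximator is involved.

```python
from typing import List, Sequence


def bellman_target(reward: float, gamma: float, next_target_value: float) -> float:
    """Bellman target: reward plus discounted target-value estimate of the next state."""
    return reward + gamma * next_target_value


def soft_update(target_params: Sequence[float], params: Sequence[float], tau: float) -> List[float]:
    """Moving average of parameters: tau * params + (1 - tau) * target_params."""
    return [tau * p + (1.0 - tau) * t for p, t in zip(params, target_params)]


def min_double_q(q1: float, q2: float) -> float:
    """Use the smaller of two independently trained Q-estimates to limit positive bias."""
    return min(q1, q2)


target = bellman_target(reward=1.0, gamma=0.5, next_target_value=2.0)
new_target_params = soft_update(target_params=[0.0, 4.0], params=[1.0, 4.0], tau=0.1)
conservative_q = min_double_q(3.2, 2.9)
```

A small `tau` makes the target value function change slowly, which keeps the regression target for the Q-function stable between gradient steps.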

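Algorithm 1: Composition model training using SAC
Input: Primitive policies $\Pi = \{\pi_i\}_{i=0}^{N}$
Initialize parameter vectors $\phi$, $\psi$, $\bar{\psi}$, $\theta$
for each iteration do
       for each environment step do
             Compute primitive policies state $\hat{s}_t$
             Sample primitive actions $\{\hat{a}_i\}_{i=0}^{N}$
             Sample composite action $a_t$
             Sample next state $s_{t+1}$
       for each gradient step do
             Update value function (parameters $\psi$)
             Update Q-function (parameters $\phi$)
             Update policy (parameters $\theta$)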

A.2 Training with HIRO

In this section, we outline the algorithm to train the composite policy through HIRO, which employs a two-level policy structure. The high-level policy generates the sub-goals for the low-level composite policy to achieve. The low-level composite policy also has access to the primitive policy actions. Like HIRO, we use the TD3 algorithm (Fujimoto et al., 2018) to train both the high-level and low-level policies with their corresponding Q-functions, $Q^{hi}_{\phi'}$ and $Q^{low}_{\phi}$, respectively. The low-level policy $\pi^{c:low}_\theta$, with parameters $\theta$, is trained to maximize the Q-values from the low-level Q-function $Q^{low}_{\phi}$ for the given state-goal pairs. The Q-function parameters $\phi$ are optimized by minimizing the temporal-difference error for the given transitions, i.e.,

$$\left(Q^{low}_{\phi}(s_t, g_t, a_t) - r(s_t, g_t, a_t, s_{t+1}) - \gamma\, Q^{low}_{\phi}(s_{t+1}, g_{t+1}, a_{t+1})\right)^2 \tag{7}$$

where $r(s_t, g_t, a_t, s_{t+1}) = -\lVert s_t + g_t - s_{t+1} \rVert_2$ and $a_{t+1} \sim \pi^{c:low}_\theta(s_{t+1}, g_{t+1}, \{\hat{a}_i\}_{i=0}^{N})$.

The high-level policy $\pi^{hi}_{\theta'}$, with parameters $\theta'$, is trained to maximize the values of $Q^{hi}_{\phi'}$. The Q-function parameters $\phi'$ are trained by minimizing the following loss for the given transitions:

$$\left(Q^{hi}_{\phi'}(s_t, g_t) - \sum_{k=t}^{t+c-1} R_k - \gamma\, Q^{hi}_{\phi'}\left(s_{t+c}, \pi^{hi}_{\theta'}(s_{t+c})\right)\right)^2 \tag{8}$$

During training, the continuous adaptation of the low-level policy poses a non-stationary problem for the high-level policy. To mitigate the changing behavior of the low-level policy, Nachum et al. (2018) introduced off-policy correction of the high-level actions. During correction, the high-level policy action $g_t$ is usually re-labeled with $\tilde{g}_t$ that would induce the same low-level policy behavior as was previously induced by the original high-level action $g_t$ (for details, refer to Nachum et al. (2018)). Algorithm 2 presents the procedure to train our composite policy with HIRO.
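The re-labeling step can be sketched roughly as follows, in the spirit of the off-policy correction of Nachum et al. (2018): several candidate goals are scored by how well the current low-level policy would reproduce the stored low-level actions, and the best-scoring candidate replaces the original goal. The deterministic `low_policy` callable, the candidate count, and the noise scale are assumptions made for this illustration.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical interface: deterministic low-level policy mapping (state, goal) to an action.
LowPolicy = Callable[[Sequence[float], Sequence[float]], Vector]


def relabel_goal(
    states: Sequence[Vector],   # s_t, ..., s_{t+c}  (c + 1 entries)
    actions: Sequence[Vector],  # a_t, ..., a_{t+c-1}  (c entries)
    original_goal: Vector,
    low_policy: LowPolicy,
    rng: random.Random,
    num_samples: int = 8,
    noise_std: float = 0.5,
) -> Vector:
    """Pick the candidate goal under which the current low-level policy best matches the stored actions."""
    displacement = [end - start for start, end in zip(states[0], states[-1])]
    candidates = [list(original_goal), displacement]
    candidates += [[rng.gauss(d, noise_std) for d in displacement] for _ in range(num_samples)]

    def score(goal: Vector) -> float:
        total, g = 0.0, list(goal)
        for i, action in enumerate(actions):
            predicted = low_policy(states[i], g)
            total -= sum((a - p) ** 2 for a, p in zip(action, predicted))
            # Goal transition: keep the absolute target fixed as the state changes.
            g = [s + gi - n for s, gi, n in zip(states[i], g, states[i + 1])]
        return total

    return max(candidates, key=score)


# Toy low-level policy that simply outputs the remaining offset to the target.
def toward_goal(state: Sequence[float], goal: Sequence[float]) -> Vector:
    return list(goal)


trajectory = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
stored_actions = [[2.0, 0.0], [1.0, 0.0]]
relabeled = relabel_goal(trajectory, stored_actions, [5.0, 5.0], toward_goal, random.Random(0))
```

In this toy case the stale goal `[5.0, 5.0]` is replaced by the observed displacement `[2.0, 0.0]`, since that goal reproduces the stored actions exactly.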

Algorithm 2: Composition model training using HIRO
Input: Primitive policies $\Pi = \{\pi_i\}_{i=0}^{N}$
Initialize parameter vectors $\phi$, $\phi'$, $\theta$, $\theta'$
for each iteration do
       for each environment step do
             Compute primitive policies state $\hat{s}_t$
             Sample primitive actions $\{\hat{a}_i\}_{i=0}^{N}$
             Sample high-level action $g_t$
             Sample composite action $a_t$
             Sample next state $s_{t+1}$
       for each gradient step do
             Sample mini-batch with c-step transitions
             Compute rewards for low-level policy
             Update the low-level composite policy w.r.t. its Q-function using (Nachum et al., 2018)
             Update the high-level policy w.r.t. its Q-function using (Nachum et al., 2018)

Appendix B Future Research Directions & Discussion

In this section, we present preliminary results to highlight the potential research avenues to further extend our method.

Figure 7: New skill acquisition task. The composite model has access to a primitive policy for moving right and a trainable policy function. Our method trained the new function to move in the upward direction to reach the given goal.

b.1 New skills acquisition

Apart from compositionality, another research problem in the way of building intelligent machines is the autonomous acquisition of new skills that the system lacks and whose absence prevents it from solving the given task (Lake et al., 2017). In this section, we demonstrate that our composition model holds the potential for building such a system, one that autonomously acquires missing skills and composes them with the existing skill set to solve the given problem.

Fig. 7 shows a simple ant-maze environment in which the 3D quadruped Ant has to reach the target, indicated as a green region. In this problem, we equip our composition model with a single primitive policy for walking to the right and a trainable, parametric policy function. The trainable policy function takes the state without goal information and outputs the parameters of a Gaussian distribution over the robot’s actions. Note that the composite model requires the skill of walking upward, in addition to moving to the right, to reach the given goal. To determine whether our composition model can teach the new parametric policy function to move upward, we trained the composition model together with the new policy using shared Q- and value-functions. Once trained, we observed that the parametric policy function had learned to walk in an upward direction with a slight right-turning behavior.

The plot in Fig. 7 shows the performance of our composition model and of a standard RL policy obtained with SAC in this environment. The vertical axis indicates the distance from the target location, averaged over five randomly seeded trials. The horizontal axis shows the number of environment steps in millions. Our method converged faster than the standard RL policy. This suggests that our approach also exploits the existing skills and therefore learns faster than a standard policy function that has to solve the entire task from scratch. Furthermore, in the current setting, we do not impose any constraint on the new policy function that would make it learn only the missing skills rather than replicate existing skills or solve the given task entirely by itself. We leave the formulation of such a constraint for the autonomous acquisition of new, missing skills to future work.

b.2 Hierarchy of composition models

In this section, we discuss how our method can be extended to create hierarchies of composition models, in which primitive policies are combined into complex policies that can in turn be composed into even more complex policies. Fig. 8 presents an example of such a model. The composition hierarchy begins by combining basic primitive skills into composite models and continues level by level until the overall composite function is built. Such a tree/hierarchy can be constructed through stage-wise training of composite models, similar to Boosting algorithms (Schapire, 1999). We performed a simple experiment in the Ant Cross Maze environment where the primitive left, right, up, and down policies were composed to create three composite models for reaching the left, right, and top target locations. These composite models were further combined to solve the original Ant Cross Maze task. We observed that our hierarchical composite model converged significantly faster than the single-layer composite model. However, we leave further experimentation and the development of a stage-wise training algorithm, similar to Boosting (Schapire, 1999), to future work.
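The nesting idea can be illustrated with a toy sketch in which a composite policy is itself a policy and can therefore serve as a building block one level up. Everything here is an assumption made for illustration: the `compose` helper, the constant-direction primitives, and the hand-written scoring functions stand in for the learned primitives and the learned attention-based composition, and the "state" is simply a 2D offset to the goal.

```python
import numpy as np


def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()


def compose(policies, score_fn):
    """Return a policy that mixes sub-policy actions with softmax(score_fn(state))."""
    def composite(state):
        actions = np.stack([p(state) for p in policies])  # (k, action_dim)
        weights = softmax(score_fn(state))                 # (k,)
        return weights @ actions
    return composite


# Level 0: hypothetical primitives that always move in one direction.
left = lambda s: np.array([-1.0, 0.0])
right = lambda s: np.array([1.0, 0.0])
up = lambda s: np.array([0.0, 1.0])
down = lambda s: np.array([0.0, -1.0])
primitives = [left, right, up, down]

# Level 1: three composites, each favouring one primitive (fixed logits here;
# in the paper these weights would be learned).
reach_left = compose(primitives, lambda s: np.array([5.0, 0.0, 0.0, 0.0]))
reach_right = compose(primitives, lambda s: np.array([0.0, 5.0, 0.0, 0.0]))
reach_top = compose(primitives, lambda s: np.array([0.0, 0.0, 5.0, 0.0]))

# Level 2: a composite over composites, scored by the goal offset (x, y).
cross_maze = compose(
    [reach_left, reach_right, reach_top],
    lambda s: 5.0 * np.array([-s[0], s[0], s[1]]),
)

print(cross_maze(np.array([0.0, 2.0])))   # goal above: action points mostly up
print(cross_maze(np.array([-2.0, 0.0])))  # goal to the left: action points mostly left
```

Because `compose` returns an ordinary callable, additional layers can be stacked without changing the interface, which is the property a stage-wise training scheme would rely on.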

Figure 8: Viewing the composition model as an MDP, we can construct a hierarchy of compositions by integrating multiple composition policies. In the above figure, node and could be actions sampled from sub-level or composite policies. Node are compositions of the policies in the base layer, and node is the overall composite action at the third layer.

Appendix C Implementation details

c.1 Environment Details

In this section, we present the environment details, including reward functions, primitive policies, and state space information. The reward functions are presented in Table 3 together with the overall reward scaling values.

c.1.1 Ant environments

In these environments, we use an 8-DoF four-legged Ant with a torque limit of 150 units. The primitive policies for moving left, right, down, and up are shared across all these tasks. In these environments, the information in the state corresponds to the target location. We first introduce the notation used to define the reward function. Let , , , and denote the xy-position of the robot’s torso, the xy-position of the goal, the joint torques, and the contact cost, respectively. The scaling factors are defined as . The reward function for the following environments is defined as with reward scaling of 5 units:


Ant Random Goal: In this environment, the ant has to navigate to any randomly sampled target within the confined circular region of radius 5 units. The goal radius is defined to be 0.25 units. The reward function coefficients , , , , and are , , , , and , respectively.

Ant Cross Maze: In this environment, the ant has to navigate through the 3D maze to reach a target sampled from the three possible targets. The goal radius is defined to be 1.0 units. The reward function parameters are the same as for the random-goal ant environment.

Simple Ant Maze: The Ant has to navigate through the maze to reach the single target, shown in Fig. 7. The goal radius and reward function parameters were defined as in the ant-maze and random-goal ant environments.

For the remaining environments (Ant Maze, Ant Push, and Ant Fall), we use the following reward function with no reward scaling:


where coefficients , , and are set to be , , and .

Ant Maze: In this environment, we place the Ant in a -shaped maze for a navigation task between given start and goal configurations. The goal radius is defined to be 5 units. During training, the goal is uniformly sampled from space, and the Ant initial location is always fixed at . During testing, the agent is evaluated to reach the farthest end of the maze located at within L2 distance of 5.

Ant Push: In this environment, the Ant is initially located at coordinate, the moveable block is at , and the goal is at . The agent is trained to reach randomly sampled targets whereas during testing, we evaluate the agent to reach the goal at within L2 distance of 5.

Ant Fall: In this environment, the Ant has to navigate in a 3D maze. The initial agent location is , and a movable block is at at the same elevation as the Ant. There is a rift in the region . To reach the target on the other side of the rift, the Ant must push the block down into the rift, and then step on it to get to the goal position.
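The test-time success criterion stated above for Ant Maze and Ant Push (reaching the goal within an L2 distance of 5), as well as the goal radii of the other Ant environments, amounts to a simple distance check. The sketch below assumes 2D xy-positions; the function name and the coordinates in the usage lines are hypothetical and are not the values used in the experiments.

```python
import numpy as np


def reached_goal(position, goal, radius):
    """Return True if `position` lies within `radius` (L2 distance) of `goal`."""
    diff = np.asarray(position, dtype=float) - np.asarray(goal, dtype=float)
    return bool(np.linalg.norm(diff) <= radius)


# Hypothetical coordinates, for illustration only.
print(reached_goal([3.0, 4.0], [0.0, 0.0], radius=5.0))   # distance is exactly 5
print(reached_goal([0.0, 0.3], [0.0, 0.0], radius=0.25))  # outside a 0.25 radius
```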

Learning rate - -
Discount factor () 0.99 0.99 0.99 0.99
Nonlinearity in feedforward networks ReLU ReLU ReLU ReLU
Minibatch samples size 256 128 - -
Replay buffer size - -
Batch-size - - 1000 1000
Target parameters smoothing coefficient () 0.005 0.005 - -
Target parameters update interval 1 2 - -
Gradient steps 1 1 0.01 0.01
Gumbel-softmax temperature 0.5 0.5 - -
Table 2: Hyperparameters
Model | Architecture | Hidden units
Composition-HIRO | High-level Policy: Three layer feed forward network | 300
Composition-HIRO | Encoder Network: Bidirectional RNN with LSTMs | 128
Composition-HIRO | Decoder Network (Single layer feed forward network) | 128
Composition-HIRO | Attention Network | 128
Composition-SAC | Encoder Network: Bidirectional RNN with LSTMs | 128
Composition-SAC | Decoder Network (Single layer feed forward network) | 128
Composition-SAC | Attention Network | 128
HIRO | High-level Policy: Three layer feed forward network | 300
HIRO | Low-level Policy: Three layer feed forward network | 300
Standard RL policy | Two layer feed forward network | 256
Table 3: Network Architectures

c.1.2 Pusher

In the pusher environment, a simple manipulator has to move an object to the target location. The primitive policies push the object to the bottom and to the left. In this environment, the state information for both the primitive policies and the composite policy includes the goal location. Therefore, , in this case, is null. The reward function is given as:


where , , , and are xy-position of object, xy-position of goal, xy-position of arm, and joint-torques. The coefficients , , and are , , and , respectively.

c.1.3 Halfcheetah-hurdle

In the halfcheetah-hurdle environment, a 2D cheetah has to jump over three hurdles to reach the target. In this environment, the information in the state corresponds to the x-position of the next nearest hurdle in front of the agent as well as the distance from that hurdle. The reward function is defined as:


where , , , and are the xy-position of the robot torso, the xy-position of the goal, the velocity along the z-axis, and the velocity along the x-axis, respectively. The function returns a count of the hurdles in front of the robot. The indicator function returns 1 if the agent has reached the target and 0 otherwise. The function is a collision checker that returns 1 if the agent collides with a hurdle and 0 otherwise. The reward function coefficients , , , , , and are , , , , and , respectively.
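Two of the quantities described for this environment, the count of hurdles in front of the robot and the goal-conditioning information (x-position of the next hurdle and distance to it), can be sketched as plain helper functions. The names, the tolerance-based goal indicator, and the hurdle positions below are hypothetical; the collision checker is left out because it depends on the simulator, and the helpers are not combined into a reward because the coefficients are not available here.

```python
def hurdles_ahead(robot_x, hurdle_xs):
    """Count the hurdles whose x-position is in front of the robot."""
    return sum(1 for hx in hurdle_xs if hx > robot_x)


def next_hurdle_info(robot_x, hurdle_xs):
    """Return (x-position of the nearest hurdle ahead, distance to it), or None."""
    ahead = [hx for hx in hurdle_xs if hx > robot_x]
    if not ahead:
        return None
    nearest = min(ahead)
    return nearest, nearest - robot_x


def goal_indicator(robot_x, goal_x, tol):
    """Return 1 if the robot is within `tol` of the goal along x, else 0.

    The tolerance-based definition of "reached" is an assumption.
    """
    return 1 if abs(goal_x - robot_x) <= tol else 0


# Hypothetical hurdle layout, for illustration only.
hurdles = [2.0, 4.0, 6.0]
print(hurdles_ahead(2.5, hurdles))     # two hurdles remain in front
print(next_hurdle_info(2.5, hurdles))  # nearest hurdle at x=4.0, 1.5 away
```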

c.2 Hyperparameters and Network Architectures

Table 2 summarizes the hyperparameters used to train policies with SAC (Haarnoja et al., 2018b), TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017), and HIRO (Nachum et al., 2018).
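Two entries of Table 2 are easier to interpret with their update rules in view: the target parameters smoothing coefficient (0.005) and the Gumbel-softmax temperature (0.5). The sketch below gives generic NumPy versions of a soft (Polyak) target update and a Gumbel-softmax sample. The function names are hypothetical, and how these pieces are wired into the composition model is not shown; this is not the authors' code.

```python
import numpy as np


def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


def gumbel_softmax(logits, temperature, rng):
    """Draw a relaxed one-hot sample; lower temperatures give sharper samples."""
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(u))
    y = (logits + gumbel_noise) / temperature
    y = y - np.max(y)  # subtract the max for numerical stability
    e = np.exp(y)
    return e / e.sum()


rng = np.random.default_rng(0)

# Soft update with the smoothing coefficient listed in Table 2.
target = [np.zeros(3)]
online = [np.ones(3)]
target = soft_update(target, online, tau=0.005)
print(target[0])  # each entry moves 0.5% of the way toward the online value

# Relaxed sample over three hypothetical logits with temperature 0.5.
weights = gumbel_softmax(np.array([1.0, 2.0, 3.0]), temperature=0.5, rng=rng)
print(weights, weights.sum())
```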

Table 3 summarizes the network architectures. The standard RL policy structure corresponds to the simple SAC, TRPO, and PPO policies. The rightmost column shows the hidden units per layer.