Data-Efficient Hierarchical Reinforcement Learning

05/21/2018 ∙ by Ofir Nachum, et al. ∙ Google

Hierarchical reinforcement learning (HRL) is a promising approach to extend traditional reinforcement learning (RL) methods to solve more complex tasks. Yet, the majority of current HRL methods require careful task-specific design and on-policy training, making them difficult to apply in real-world scenarios. In this paper, we study how we can develop HRL algorithms that are general, in that they do not make onerous additional assumptions beyond standard RL algorithms, and efficient, in the sense that they can be used with modest numbers of interaction samples, making them suitable for real-world problems such as robotic control. For generality, we develop a scheme where lower-level controllers are supervised with goals that are learned and proposed automatically by the higher-level controllers. To address efficiency, we propose to use off-policy experience for both higher and lower-level training. This poses a considerable challenge, since changes to the lower-level behaviors change the action space for the higher-level policy, and we introduce an off-policy correction to remedy this challenge. This allows us to take advantage of recent advances in off-policy model-free RL to learn both higher- and lower-level policies using substantially fewer environment interactions than on-policy algorithms. We term the resulting HRL agent HIRO and find that it is generally applicable and highly sample-efficient. Our experiments show that HIRO can be used to learn highly complex behaviors for simulated robots, such as pushing objects and utilizing them to reach target locations, learning from only a few million samples, equivalent to a few days of real-time interaction. In comparisons with a number of prior HRL methods, we find that our approach substantially outperforms previous state-of-the-art techniques.


1 Introduction

Deep reinforcement learning (RL) has made significant progress on a range of continuous control tasks, such as locomotion skills schulman2015trust ; ddpg ; heess2017emergence , learning dexterous manipulation behaviors rajeswaran , and training robot arms for simple manipulation tasks gu2017deep ; vevcerik2017leveraging . However, most of these behaviors are inherently atomic: they require performing some simple skill, either episodically or cyclically, and rarely involve complex multi-level reasoning, such as utilizing a variety of locomotion behaviors to accomplish complex goals that require movement, object interaction, and discrete decision-making.

Hierarchical reinforcement learning (HRL), in which multiple layers of policies are trained to perform decision-making and control at successively higher levels of temporal and behavioral abstraction, has long held the promise to learn such difficult tasks dayan1993feudal ; parr1998reinforcement ; sutton1999between ; barto2003recent . By having a hierarchy of policies, of which only the lowest applies actions to the environment, one is able to train the higher levels to plan over a longer time scale. Moreover, if the high-level actions correspond to semantically different low-level behavior, standard exploration techniques may be applied to more appropriately explore a complex environment. Still, there is a large gap between the basic definition of HRL and the promise it holds to successfully solve complex environments. To achieve the benefits of HRL, there are a number of questions that one must suitably answer: How should one train the lower-level policy to induce semantically distinct behavior? How should the high-level policy actions be defined? How should the multiple policies be trained without incurring an inordinate amount of experience collection? Previous work has attempted to answer these questions in a variety of ways and has provided encouraging successes vezhnevets2017feudal ; florensa2017stochastic ; frans2017meta ; heess2016learning ; sigaud2018policy . However, many of these methods lack generality, requiring some degree of manual task-specific design, and often require expensive on-policy training that is unable to benefit from advances in off-policy model-free RL, which in recent years has drastically brought down sample complexity requirements td3 ; sac ; barth2018distributed .

Figure 1: The Ant Gather task along with the three hierarchical navigation tasks we consider: Ant Maze, Ant Push, and Ant Fall. The ant (magenta rectangle) is rewarded for approaching the target location (green arrow). A successful policy must perform a complex sequence of directional movement and, in some cases, interact with objects in its environment (red blocks); e.g., pushing aside an obstacle (second from right) or using a block as a bridge (right). In our HRL method, a higher-level policy periodically produces goal states (corresponding to desired positions and orientations of the ant and its limbs), which the lower-level policy is rewarded to match (blue arrow).

For generality, we propose to take advantage of the state observation provided by the environment to the agent, which in locomotion tasks can include the position and orientation of the agent and its limbs. We let the high-level actions be goal states and reward the lower-level policy for performing actions which yield it an observation close to matching the desired goal. In this way, our HRL setup does not require a manual or multi-task design and is fully general.

This idea of a higher-level policy commanding a lower-level policy to match observations to a goal state has been proposed before dayan1993feudal ; vezhnevets2017feudal . Unlike previous work, which represented goals and rewarded matching observations within a learned embedding space, we use the state observations in their raw form. This significantly simplifies the learning, and in our experiments, we observe substantial benefits for this simpler approach.

While these goal-proposing methods are very general, they require training with on-policy RL algorithms, which are generally less efficient than off-policy methods gu2016q ; tpcl . On-policy training has been attractive in the past since, outside of discrete control, off-policy methods have been plagued with instability gu2016q , which is amplified when training multiple policies jointly, as in HRL. Other than instability, off-policy training poses another challenge that is unique to HRL. Since the lower-level policy is changing underneath the higher-level policy, a sample observed for a certain high-level action in the past may not yield the same low-level behavior in the future, and thus not be a valid experience for training. This amounts to a non-stationary problem for the higher-level policy. We remedy this issue by introducing an off-policy correction, which re-labels an experience in the past with a high-level action chosen to maximize the probability of the past lower-level actions. In this way, we are able to use past experience for training the higher-level policy, taking advantage of progress made in recent years to provide stable, robust, and general off-policy RL methods td3 ; tpcl ; barth2018distributed .

In summary, we introduce a method to train a multi-level HRL agent that stands out from previous methods by being both generally applicable and data-efficient. Our method achieves generality by training the lower-level policy to reach goal states learned and instructed by the higher-levels. In contrast to prior work that operates in this goal-setting model, we use states as goals directly, which allows for simple and fast training of the lower layer. Moreover, by using off-policy training with our novel off-policy correction, our method is extremely sample-efficient. We evaluate our method on several difficult environments. These environments require the ability to perform exploratory navigation as well as complex sequences of interaction with objects in the environment (see Figure 1). While these tasks are unsolvable by existing non-HRL methods, we find that our HRL setup can learn successful policies. When compared to other published HRL methods, we also observe the superiority of our method, in terms of both final performance and speed of learning. In only a few million experience samples, our agents are able to adequately solve previously unapproachable tasks.

2 Background

We adopt the standard continuous control RL setting, in which an agent interacts with an environment over periods of time according to a behavior policy $\mu$. At each time step $t$, the environment produces a state observation $s_t \in \mathbb{R}^{d_s}$. The agent then samples an action $a_t \sim \mu(s_t)$, $a_t \in \mathbb{R}^{d_a}$, and applies the action to the environment. The environment then yields a reward $R_t$ sampled from an unknown reward function $R(s_t, a_t)$ and either terminates the episode at state $s_T$ or transitions to a new state $s_{t+1}$ sampled from an unknown transition function $f(s_t, a_t)$. The agent’s goal is to maximize the expected future discounted reward $\mathbb{E}_{s_{0:T}, a_{0:T-1}, R_{0:T-1}}\left[\sum_{i=0}^{T-1} \gamma^i R_i\right]$, where $0 \le \gamma < 1$ is a user-specified discount factor. A well-performing RL algorithm will learn a good behavior policy $\mu$ from (ideally a small number of) interactions with the environment.
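As a small illustration of the objective above, the discounted return of a single finished episode can be computed from its reward sequence. This is only a sketch; the function name `discounted_return` and the example rewards are our own illustrative choices.

```python
def discounted_return(rewards, gamma):
    """Return sum_i gamma**i * rewards[i] for one finite episode."""
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

The expectation in the objective is then an average of this quantity over episodes sampled with the behavior policy.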

2.1 Off-Policy Temporal Difference Learning

Temporal difference learning is a powerful paradigm in RL, in which a policy may be learned efficiently from state-action-reward transition tuples collected from interactions with the environment. In our HRL method, we utilize the TD3 learning algorithm td3 , a variant of the popular DDPG algorithm for continuous control ddpg .

In DDPG, a deterministic neural network policy $\mu_\phi$ is learned along with its corresponding state-action Q-function $Q_\theta$ by performing gradient updates on parameter sets $\phi$ and $\theta$. The Q-function represents the future value of taking a specific action $a_t$ starting from a state $s_t$. Accordingly, it is trained to minimize the average Bellman error over all sampled transitions, which is given by

$$E(s_t, a_t, s_{t+1}) = \left(Q_\theta(s_t, a_t) - R_t - \gamma Q_\theta(s_{t+1}, \mu_\phi(s_{t+1}))\right)^2. \tag{1}$$

The policy is then trained to yield actions which maximize the Q-value at each state. That is, $\mu_\phi$ is trained to maximize $Q_\theta(s_t, \mu_\phi(s_t))$ over all $s_t$ collected from interactions with the environment.
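The Bellman error for a single transition can be sketched as follows. This is a minimal illustration under our own assumptions, not a training implementation: the Q-function and policy are plain Python callables, `toy_q` and `toy_policy` are hypothetical stand-ins for the learned networks, and practical DDPG implementations additionally compute the bootstrap term with slowly updated target networks.

```python
def bellman_error(q, policy, transition, gamma):
    """Squared one-step Bellman error for a deterministic policy."""
    state, action, reward, next_state = transition
    # Bootstrap from the value of the policy's own action at the next state.
    target = reward + gamma * q(next_state, policy(next_state))
    return (q(state, action) - target) ** 2


# Hypothetical scalar stand-ins for the learned Q-function and policy.
def toy_q(state, action):
    return state + action

def toy_policy(state):
    return 2.0 * state

error = bellman_error(toy_q, toy_policy, (1.0, 0.5, 1.0, 2.0), gamma=0.5)
print(error)  # (1.5 - (1.0 + 0.5 * 6.0)) ** 2 = 6.25
```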

We note that although DDPG trains a deterministic policy $\mu_\phi$, its behavior policy, which is used to collect experience during training, is augmented with Gaussian (or Ornstein-Uhlenbeck) noise ddpg . Therefore, actions are collected as $a_t \sim N(\mu_\phi(s_t), \sigma)$ for fixed standard deviation $\sigma$, which we will shorten as $a_t \sim \mu_\phi(s_t)$. We will take advantage of the fact that the behavior policy is stochastic for the off-policy correction in our HRL method. TD3 td3 makes several modifications to DDPG’s learning algorithm to yield a more robust and stable procedure. Its main modification is using an ensemble over Q-value models and adding noise to the policy when computing the target value in Equation 1.
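The two TD3 modifications mentioned above can be sketched for scalar actions as follows. The sketch assumes an ensemble of two Q-functions combined by taking their minimum, and clipped Gaussian noise on the target action; the default values of `noise_std`, `noise_clip`, and `max_action` are illustrative assumptions, and all functions here are hypothetical stand-ins rather than the learned models.

```python
import random


def clip(x, low, high):
    return max(low, min(high, x))


def td3_target(q1, q2, policy, reward, next_state, gamma,
               noise_std=0.2, noise_clip=0.5, max_action=1.0, rng=random):
    """Target value with a noisy target action and the minimum of two Q estimates."""
    # Perturb the target action with clipped Gaussian noise.
    noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    next_action = clip(policy(next_state) + noise, -max_action, max_action)
    # Use the more pessimistic of the two Q estimates.
    q_min = min(q1(next_state, next_action), q2(next_state, next_action))
    return reward + gamma * q_min


# Hypothetical scalar stand-ins for the two Q-functions and the policy.
def toy_q1(state, action):
    return state + action

def toy_q2(state, action):
    return state + action + 1.0

def toy_policy(state):
    return 0.5

# With zero noise the target is deterministic: 1.0 + 0.5 * (2.0 + 0.5)
noiseless = td3_target(toy_q1, toy_q2, toy_policy, reward=1.0, next_state=2.0,
                       gamma=0.5, noise_std=0.0)
print(noiseless)  # 2.25
```

This value would replace the term $R_t + \gamma Q_\theta(s_{t+1}, \mu_\phi(s_{t+1}))$ when forming the Bellman error.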

3 General and Efficient Hierarchical Reinforcement Learning

In this section, we present our framework for learning hierarchical policies, HIRO: HIerarchical Reinforcement learning with Off-policy correction. We make use of parameterized reward functions to specify a potentially infinite set of lower-level policies, each of which is trained to match its observed states to a desired goal. The higher-level policy chooses these goals for temporally extended periods, and uses an off-policy correction to enable it to use past experience collected from previous, different instantiations of the lower-level policy.

3.1 Hierarchy of Two Policies

1. Collect experience $s_t, g_t, a_t, R_t, \ldots$.
2. Train $\mu^{lo}$ with experience transitions $(s_t, g_t, a_t, r_t, s_{t+1}, g_{t+1})$ using $g_t$ as additional state observation and reward given by goal-conditioned function $r_t = r(s_t, g_t, a_t, s_{t+1}) = -\|s_t + g_t - s_{t+1}\|_2$.
3. Train $\mu^{hi}$ on temporally-extended experience $(s_t, \tilde{g}_t, \sum R_{t:t+c-1}, s_{t+c})$, where $\tilde{g}_t$ is re-labelled high-level action to maximize probability of past low-level actions $a_{t:t+c-1}$.
4. Repeat.

Figure 2: The design and basic training of HIRO. The lower-level policy interacts directly with the environment. The higher-level policy instructs the lower-level policy via high-level actions, or goals, $g_t \in \mathbb{R}^{d_s}$, which it samples anew every $c$ steps. On intermediate steps, a fixed goal transition function $h$ determines the next step’s goal. The goal simply instructs the lower-level policy to reach specific states, which allows the lower-level policy to easily learn from prior off-policy experience.

We extend the standard RL setup to a hierarchical two-layer structure, with a lower-level policy $\mu^{lo}$ and a higher-level policy $\mu^{hi}$ (see Figure 2). The higher-level policy operates at a coarser layer of abstraction and sets goals to the lower-level policy, which correspond directly to states that the lower-level policy attempts to reach. At each time step $t$, the environment provides an observation state $s_t$. The higher-level policy observes the state and produces a high-level action (or goal) $g_t \in \mathbb{R}^{d_s}$ by either sampling from its policy $g_t \sim \mu^{hi}$ when $t \equiv 0 \pmod{c}$, or otherwise using a fixed goal transition function $g_t = h(s_{t-1}, g_{t-1}, s_t)$ (which in the simplest case can be a pass-through function, although we will consider a slight variation in our specific design). This provides temporal abstraction, since high-level decisions via $\mu^{hi}$ are made only every $c$ steps. The lower-level policy $\mu^{lo}$ observes the state $s_t$ and goal $g_t$ and produces a low-level atomic action $a_t \sim \mu^{lo}(s_t, g_t)$, which is applied to the environment. The environment then yields a reward $R_t$ sampled from an unknown reward function $R(s_t, a_t)$ and transitions to a new state $s_{t+1}$ sampled from an unknown transition function $f(s_t, a_t)$.

The higher-level controller provides the lower-level with an intrinsic reward $r_t = r(s_t, g_t, a_t, s_{t+1})$, using a fixed parameterized reward function $r$. The lower-level policy will store the experience $(s_t, g_t, a_t, r_t, s_{t+1}, h(s_t, g_t, s_{t+1}))$ for off-policy training. The higher-level policy collects the environment rewards $R_t$ and, every $c$ time steps, stores the higher-level transition $(s_{t:t+c-1}, g_{t:t+c-1}, a_{t:t+c-1}, R_{t:t+c-1}, s_{t+c})$ for off-policy training.
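The interaction loop described in this section can be sketched as below. This is a simplified illustration under our own assumptions, not the implementation used in the experiments: states, goals, and actions are scalars, every `toy_*` function is a hypothetical stand-in, and `c` denotes the number of steps between higher-level decisions. A trailing segment shorter than `c` steps is simply dropped.

```python
def collect_episode(env_step, high_policy, low_policy, goal_transition,
                    intrinsic_reward, initial_state, num_steps, c):
    """Roll out a two-level policy and return (low_level, high_level) transitions."""
    low_level, high_level = [], []
    state, goal, segment = initial_state, None, None
    for t in range(num_steps):
        if t % c == 0:
            # The higher level proposes a fresh goal every c steps.
            goal = high_policy(state)
            segment = {"states": [], "goals": [], "actions": [], "rewards": []}
        action = low_policy(state, goal)
        next_state, env_reward = env_step(state, action)
        next_goal = goal_transition(state, goal, next_state)
        reward = intrinsic_reward(state, goal, action, next_state)
        low_level.append((state, goal, action, reward, next_state, next_goal))
        segment["states"].append(state)
        segment["goals"].append(goal)
        segment["actions"].append(action)
        segment["rewards"].append(env_reward)
        if (t + 1) % c == 0:
            # A full c-step segment becomes one higher-level transition.
            segment["final_state"] = next_state
            high_level.append(segment)
        state, goal = next_state, next_goal
    return low_level, high_level


# Toy 1-D stand-ins, purely illustrative.
def toy_env_step(state, action):
    next_state = state + action
    return next_state, -abs(10.0 - next_state)  # environment reward: approach x = 10

def toy_high_policy(state):
    return 3.0  # always request a relative change of +3

def toy_low_policy(state, goal):
    return max(-1.0, min(1.0, goal))  # move toward the goal, at most 1 per step

def toy_goal_transition(state, goal, next_state):
    return state + goal - next_state

def toy_intrinsic_reward(state, goal, action, next_state):
    return -abs(state + goal - next_state)

low, high = collect_episode(toy_env_step, toy_high_policy, toy_low_policy,
                            toy_goal_transition, toy_intrinsic_reward,
                            initial_state=0.0, num_steps=6, c=3)
print(len(low), len(high), high[0]["goals"])  # 6 2 [3.0, 2.0, 1.0]
```

Storing the full state, goal, and action sequences in each higher-level transition, rather than only the first goal and the summed reward, is what later makes it possible to re-label the goal.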

3.2 Parameterized Rewards

Figure 3: An example of a higher-level policy producing goals in terms of desired observations, which in this task correspond to positions and orientations of all of the joints of a quadrupedal robot (including root position). The lower-level policy has direct control of the agent (pink), and is rewarded for matching the position and orientation of its torso and each limb to the goal (blue rectangle, raised for visibility). In this way, the two-layer policy can perform a complex task involving a sequence of movements and interactions; e.g. pushing a block aside to reach a target (green).

Our higher-level policy produces goals indicating desired relative changes in state observations. That is, at step $t$, the higher-level policy produces a goal $g_t$, indicating its desire for the lower-level agent to take actions that yield it an observation $s_{t+c}$ that is close to $s_t + g_t$. Although some state dimensions (e.g., the position of the quadrupedal robot in Figure 3) are more natural as goal subspaces, we chose this more generic goal representation to make it broadly applicable, without any manual design of goal spaces, primitives, or controllable dimensions. This makes our method general and easy to apply to new problem settings. To maintain the same absolute position of the goal regardless of state change, the goal transition model $h$ is defined as

$$h(s_t, g_t, s_{t+1}) = s_t + g_t - s_{t+1}. \tag{2}$$

We define the intrinsic reward as a parameterized reward function based on the distance between the current observation and the goal observation:

$$r(s_t, g_t, a_t, s_{t+1}) = -\|s_t + g_t - s_{t+1}\|_2. \tag{3}$$

This rewards the lower-level policy for taking actions that yield observations $s_{t+1}$ that are close to the desired value $s_t + g_t$. In our evaluations on simulated ant locomotion, we use all positional observations as the representation for $g_t$, without distinguishing between the root position or the joints, making for a generic and broadly applicable choice of goal space. The reward and transition function are computed only with respect to these positional observations. See Figure 3 for an example of the goals chosen during a successful navigation of a complex environment.
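The goal transition and intrinsic reward can be sketched in a few lines. The sketch assumes goals are relative offsets, so that the absolute target is the state plus the goal, and that distance is Euclidean; states and goals are plain lists of floats, and the function names are ours. The action argument of the reward is omitted here because it does not enter the distance.

```python
import math


def goal_transition(state, goal, next_state):
    """Re-express a relative goal from the new state so the absolute target is unchanged."""
    return [s + g - s2 for s, g, s2 in zip(state, goal, next_state)]


def intrinsic_reward(state, goal, next_state):
    """Negative Euclidean distance between the next state and the target state + goal."""
    return -math.sqrt(sum((s + g - s2) ** 2
                          for s, g, s2 in zip(state, goal, next_state)))


state, goal, next_state = [0.0, 0.0], [3.0, 4.0], [1.0, 1.0]
next_goal = goal_transition(state, goal, next_state)
print(next_goal)  # [2.0, 3.0]
# The absolute target is the same before and after the step.
print([s + g for s, g in zip(next_state, next_goal)])  # [3.0, 4.0]
```

Because the reward depends only on observed states and the goal, it can be recomputed for any stored transition, which is what lets the lower-level policy learn from off-policy experience.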

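The lower-level policy may be trained using standard methods by simply incorporating $g_t$ as an additional input into the value and policy models. For example, in DDPG, the equivalent objective to Equation 1 in terms of lower-level Q-value function $Q^{lo}_\theta$ is to minimize the error

$$\left(Q^{lo}_\theta(s_t, g_t, a_t) - r(s_t, g_t, a_t, s_{t+1}) - \gamma Q^{lo}_\theta(s_{t+1}, g_{t+1}, \mu^{lo}_\phi(s_{t+1}, g_{t+1}))\right)^2 \tag{4}$$

for all transitions $(s_t, g_t, a_t, s_{t+1}, g_{t+1})$. The policy $\mu^{lo}_\phi$ would be trained to maximize the Q-value $Q^{lo}_\theta(s_t, g_t, \mu^{lo}_\phi(s_t, g_t))$ for all sampled state-goal tuples $(s_t, g_t)$.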

Parameterized rewards are not a new concept, and have been studied previously uvf ; held2017automatic . They are a natural choice for a generally applicable HRL method and have therefore appeared as components of other HRL methods vezhnevets2017feudal ; kulkarni2016hierarchical ; plappert2018multi ; levy2017hierarchical . A significant distinction between our method and these prior approaches is that we directly use the state observation as the goal, and changes in the state observation as the action space for the higher-level policy, in contrast to prior methods that must train the goal representation. This allows the lower-level policy to begin receiving reward signals immediately, even before the lower-level policy has figured out how to reach the goal and before the task’s extrinsic reward provides any meaningful supervision. In our experiments (Section 5), we find that this produces substantially better results.

3.3 Off-Policy Corrections for Higher-Level Training

While a number of prior works have proposed two-level HRL architectures that involve some sort of goal setting, such designs in previous work generally require on-policy training vezhnevets2017feudal . This is because the changing behavior of the lower-level policy creates a non-stationary problem for the higher-level policy, and old off-policy experience may exhibit different transitions conditioned on the same goals. However, for HRL methods to be applicable to real-world settings, they must be sample-efficient, and off-policy algorithms (often based on some variant of Q-function learning) generally exhibit substantially better sample efficiency than on-policy actor-critic or policy gradient variants. In this section, we describe how we address the challenge of off-policy training of the higher-level policy.

We would like to take the higher-level transition tuples $(s_{t:t+c-1}, g_{t:t+c-1}, a_{t:t+c-1}, R_{t:t+c-1}, s_{t+c})$, where $x_{t:t+c-1}$ denotes the sequence $x_t, \ldots, x_{t+c-1}$, which are collected by the higher-level policy and convert them to state-action-reward transitions $(s_t, g_t, \sum R_{t:t+c-1}, s_{t+c})$ that can be pushed into the replay buffer of any standard off-policy RL algorithm. However, since transitions obtained from past lower-level controllers do not accurately reflect the actions (and therefore resultant states $s_{t+1:t+c}$) that would occur if the same goal were used with the current lower-level controller, we must introduce a correction that translates old transitions into ones that agree with the current lower-level controller.

Our main observation is that the goal $g_t$ of a past high-level transition $(s_t, g_t, \sum R_{t:t+c-1}, s_{t+c})$ may be changed to make the actual observed action sequence more likely to have happened with respect to the current instantiation of $\mu^{lo}$. The high-level action $g_t$ which in the past induced a low-level behavior $a_{t:t+c-1} \sim \mu^{lo}(s_{t:t+c-1}, g_{t:t+c-1})$ may be re-labeled to a goal $\tilde{g}_t$ which is likely to induce the same low-level behavior with the current instantiation of the lower-level policy. Thus, we propose to remedy the off-policy issue by re-labeling the high-level transition $(s_t, g_t, \sum R_{t:t+c-1}, s_{t+c})$ with a different high-level action $\tilde{g}_t$ chosen to maximize the probability $\mu^{lo}(a_{t:t+c-1} \mid s_{t:t+c-1}, \tilde{g}_{t:t+c-1})$, where the intermediate goals $\tilde{g}_{t+1:t+c-1}$ are computed using the fixed goal transition function $h$. In effect, each time we modify the low-level policy $\mu^{lo}$, we would like to answer the question: for which goals would this new controller have taken the same actions as the old one?

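Most RL algorithms will use random action-space exploration to select actions, which means that the behavior policy (even for deterministic algorithms such as DDPG ddpg ) is stochastic and the log probability $\log \mu^{lo}(a_{t:t+c-1} \mid s_{t:t+c-1}, \tilde{g}_{t:t+c-1})$ may be computed as

$$\log \mu^{lo}(a_{t:t+c-1} \mid s_{t:t+c-1}, \tilde{g}_{t:t+c-1}) \propto -\frac{1}{2} \sum_{i=t}^{t+c-1} \|a_i - \mu^{lo}(s_i, \tilde{g}_i)\|_2^2 + \text{const}. \tag{5}$$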

To approximately maximize this quantity in practice, we compute this log probability for a number of goals $\tilde{g}_t$, and choose the maximal goal to re-label the experience. In our implementation, we calculate the quantity on eight candidate goals sampled randomly from a Gaussian centered at $s_{t+c} - s_t$. We also include the original goal $g_t$ and a goal corresponding to the difference $s_{t+c} - s_t$ in the candidate set, to have a total of 10 candidates. This provides a suitably diverse set of $\tilde{g}_t$ to approximately solve the $\arg\max$ of Equation 5, while also biasing the result to be closer to candidates which we believe to be appropriate given our knowledge of the problem (see additional implementation details in the Appendix). Our approach here is only an approximation, and we elaborate on possible alternative off-policy corrections in the Appendix.
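The candidate-based re-labeling can be sketched as follows. This is an illustrative approximation under stated assumptions rather than the implementation used in the experiments: the candidate Gaussian is assumed to be centered on the observed state change (final state minus first state), its standard deviation `sample_std` is a hypothetical parameter, `low_policy` returns the noise-free action of the current lower-level policy, and the goal transition is assumed to keep the absolute target fixed.

```python
import random


def action_log_prob(low_policy, states, actions, goal, goal_transition):
    """Unnormalized log-probability of the stored actions given a candidate first goal."""
    total = 0.0
    for i, (state, action) in enumerate(zip(states, actions)):
        predicted = low_policy(state, goal)
        total -= 0.5 * sum((a - p) ** 2 for a, p in zip(action, predicted))
        if i + 1 < len(states):
            # Intermediate goals follow the fixed goal transition function.
            goal = goal_transition(state, goal, states[i + 1])
    return total


def relabel_goal(low_policy, states, final_state, actions, original_goal,
                 goal_transition, num_samples=8, sample_std=1.0, rng=random):
    """Pick the candidate goal under which the stored actions are most likely."""
    diff = [e - s for s, e in zip(states[0], final_state)]
    candidates = [list(original_goal), diff]
    for _ in range(num_samples):
        candidates.append([rng.gauss(d, sample_std) for d in diff])
    return max(candidates, key=lambda g: action_log_prob(
        low_policy, states, actions, g, goal_transition))


# Toy 1-D check: the current policy outputs an action equal to its goal.
def toy_low_policy(state, goal):
    return list(goal)

def toy_goal_transition(state, goal, next_state):
    return [s + g - s2 for s, g, s2 in zip(state, goal, next_state)]

states = [[0.0], [1.0], [2.0]]
actions = [[3.0], [2.0], [1.0]]
new_goal = relabel_goal(toy_low_policy, states, [3.0], actions, [0.5],
                        toy_goal_transition, rng=random.Random(0))
print(new_goal)  # [3.0]
```

In this toy case the stored actions are exactly what the current policy would output for the goal `[3.0]`, so that candidate attains the maximal score of zero and replaces the stale goal `[0.5]`.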

4 Related Work

Discovering meaningful and effective hierarchies of policies is a long standing research problem in RL dayan1993feudal ; parr1998reinforcement ; sutton1999between ; dietterich2000hierarchical ; bacon2017option . Classically, the work on HRL focused on discrete state domains, where state visitation and transition statistics can be used to construct heuristic sub-goals for low-level policies stolle2002learning ; mannor2004dynamic ; chentanez2005intrinsically . The options framework sutton1999between ; precup2000temporal , a popular formulation for HRL, proposes a termination policy for each sub-policy (option). While the traditional options framework relies on prior knowledge for designing options, bacon2017option recently derived an actor-critic algorithm for learning them jointly with the higher-level policy. This option-critic architecture bacon2017option is an important step toward end-to-end HRL; however, such approaches are often prone to learning either a sub-policy that terminates every time step, or one effective sub-policy that runs through the whole episode. In practice, regularizers are essential to learn multiple effective and temporally abstracted sub-policies bacon2017option ; harb2017waiting ; vezhnevets2016strategic .

To guarantee learning useful sub-policies, recent work has studied approaches that provide auxiliary rewards for the low-level policies chentanez2005intrinsically ; heess2016learning ; kulkarni2016hierarchical ; tessler2017deep ; florensa2017stochastic . These approaches rely on hand-crafted rewards based on prior domain knowledge konidaris2007building ; heess2016learning ; kulkarni2016hierarchical ; tessler2017deep or diversity-encouraging rewards like mutual information daniel2012hierarchical ; florensa2017stochastic . A number of works have suggested that semantically distinct behavior can be induced by training on a set of diverse tasks, and have suggested pre-training the lower-level policy on such tasks heess2016learning ; florensa2017stochastic , or training the multi-level hierarchical policy in a multi-task setup frans2017meta ; sigaud2018policy . However, having access to a collection of suitably similar tasks is a luxury which is not always available and may require hand-design. Our method uses a generic reward that is specified with respect to the state space, and therefore avoids designing various rewards or multiple tasks.

Another difference from most HRL work florensa2017stochastic ; frans2017meta is that we use off-policy learning, leading to significant improvements in sample efficiency. In end-to-end HRL, off-policy RL creates a non-stationary problem for the higher-level policy, since the lower-level is constantly changing. We are aware of only one recent work which applies HRL in an off-policy setting levy2017hierarchical . As in our work, the authors devise a hierarchical structure in which a lower-level policy is trained to reach observations directed by a higher-level policy. The multiple layers of policies are trained jointly in an off-policy manner, while ignoring the non-stationarity problem which we realize is a key issue for off-policy HRL. Accordingly, we derive and test an off-policy correction in the context of HRL, and empirically show that this technique is crucial to successfully train hierarchical policies on complex tasks.

Our work is related to FeUdal Networks (FuN) vezhnevets2017feudal , originally inspired from feudal RL dayan1993feudal . FuN also makes use of goals and a parameterized lower-level reward. Unlike our method, FuN represents the goals and computes the rewards in terms of a learned state representation. In our experiments, we found this technique to under-perform compared to our approach, which uses the state in its raw form. We find that this has a number of benefits. For one, the lower-level policies can immediately begin receiving intrinsic rewards for reaching goals even before the higher-level policy receives a meaningful supervision signal from the task reward. Additionally, the representation is generic and simple to obtain. Goal-conditioned value functions mahadevan2007proto ; sutton2011horde ; uvf ; andrychowicz2017hindsight ; pong2018temporal are actively explored outside the context of HRL. Continued progress in this field may be used to further improve HRL methods.

5 Experiments

In our experiments, we compare the HIRO method to prior techniques, and ablate the various components to understand their importance. Our experiments are conducted on a set of challenging environments that require a combination of locomotion and object manipulation. Visualizations of these environments are shown in Figure 1. See the Appendix for more details on each environment.

Ant Gather. The ant gather task is a standard task introduced in duan2016benchmarking . A simulated ant must navigate to gather apples while avoiding bombs, which are randomly placed in the environment at the beginning of each episode. The ant receives a positive reward for each apple and a negative reward for each bomb.

Ant Maze. For the first difficult navigation task we adapted the maze environment introduced in duan2016benchmarking . In this environment an ant must navigate to various locations in a ‘’-shaped corridor. We increase the default size of the maze so that the corridor is of width . In our evaluation, we assess the success rate of the policy when attempting to reach the end of the maze.

Ant Push. In this task we introduce a movable block which the agent can interact with. A greedy agent would move forward, unknowingly pushing the movable block until it blocks its path to the target. To successfully reach the target, the ant must first move to the left around the block and then push the block right, clearing the path towards the target location.

Ant Fall. This task extends the navigation to three dimensions. The ant is placed on a raised platform, with the target location directly in front of it but separated by a chasm which it cannot traverse by itself. Luckily, a movable block is provided on its right. To successfully reach the target, the ant must first walk to the right, push the block into the chasm, and then safely cross.

5.1 Comparative Analysis

The primary comparisons to previous HRL methods are done with respect to FeUdal Networks (FuN) vezhnevets2017feudal , stochastic neural networks for HRL (SNN4HRL) florensa2017stochastic , and VIME houthooft2016vime (see Table 1, and Appendix for more details). As these algorithms often come with problem-specific design choices, we modify each for fairer comparisons. In terms of problem assumptions, our work is closest to that of FuN which is applicable to any single task without specific sub-policy reward engineering. MLSH frans2017meta is another promising recent work for HRL; however, since it relies on learning meaningful sub-policies through experiencing multiple, diverse, hand-designed tasks, we do not include explicit comparisons. We leave exploring our method in the context of multi-task learning for future work.

| | Ant Gather | Ant Maze | Ant Push | Ant Fall |
|---|---|---|---|---|
| HIRO | 3.02 ± 1.49 | 0.99 ± 0.01 | 0.92 ± 0.04 | 0.66 ± 0.07 |
| FuN representation | | | | |
| FuN transition PG | | | | |
| FuN cos similarity | | | | |
| FuN | | | | |
| SNN4HRL | | | | |
| VIME | | | | |

Table 1: Performance of the best policy obtained in 10M steps of training, averaged over 10 randomly seeded trials with standard error. Comparisons are to variants of FuN vezhnevets2017feudal , SNN4HRL florensa2017stochastic , and VIME houthooft2016vime . Even after extensive hyper-parameter searches, we were unable to achieve competitive performance from the baselines on any of our tasks. In the Appendix, we include the only competitive result we could achieve: VIME on Ant Gather trained for a much longer amount of time.
Figure 4: Results of our method and a number of variants on a set of difficult tasks (panels: Ant Gather, Ant Maze, Ant Push, Ant Fall). Each plot shows average reward (for Ant Gather) or average success rate (for the rest; see Appendix) over 10 randomly seeded trials, with x-axis in millions of environment steps. We find that HIRO can perform well across all tasks. We also note that HIRO learns rapidly; on the complex navigation tasks it requires only a few million environment steps (a few days in real-world interaction time) to achieve good performance. Our method is only out-performed on Ant Gather by a variant that pre-trains the lower-level policy (thus not needing an off-policy correction).

FeUdal Network (FuN). Unlike SNN4HRL or VIME, the official open-source code for FuN was not available at the time of submission, and therefore we aimed to replicate key design choices of FuN from our algorithm implementation. FuN vezhnevets2017feudal primarily proposes four components: (1) transition policy gradient, (2) directional cosine similarity rewards, (3) goals specified with respect to a learned representation, and (4) dilated RNN. Since our tasks are low-dimensional and fully observed, we do not include design choice (4). For each of (1), (2), and (3), we apply an equivalent modification of our HRL method and evaluate its performance on the same tasks. We also evaluate all modifications together as an approximation to the entire FuN paradigm. Results in Table 1 show that on our tasks, the FuN modifications do not learn well, and other than Ant Gather are significantly out-performed by HIRO. In particular, it is worth noting that the use of learned representations, rather than observation goals, leads to almost no improvement on the tasks. This suggests that the choice of using goal observations as lower-level goals significantly improves HRL performance, by providing a strong supervision signal to the lower-level policy right from the beginning of training.

Stochastic Neural Networks for HRL (SNN4HRL). SNN4HRL florensa2017stochastic initially trains the low-level policy with a proxy reward to encourage learning useful diverse exploration policies, and then the high-level policy is trained in the tasks of interest while the low-level is fixed. While SNN4HRL can perform better than FuN, it is still far behind our proposed HRL method.

Variational Information Maximizing Exploration (VIME). VIME houthooft2016vime is not an HRL method but is used as a strong baseline in SNN4HRL. As discussed in florensa2017stochastic and matched by our results, for the benchmark’s short horizon task of length 500, it performs approximately the same as SNN4HRL.

Option-Critic Architecture. We extended the option-critic architecture implementation bacon2017option for continuous actions and attempted a number of alternative variants besides the naïve modification of the original. No versions yielded reasonable performance in our tasks, and so we omit it from the results. This is possibly due to difficulty in continuous control tasks, but most importantly the option-critic sub-policies rely solely on the external reward, making learning gait policies difficult.

5.2 Ablative Analysis

In Figure 4 we present results of our proposed HRL method (“HIRO”) compared with a number of variants to understand the importance of various design choices:

With lower-level re-labelling. We evaluate the benefit of recent proposals andrychowicz2017hindsight ; tdm to increase the amount of data available to an agent trained using a parameterized reward (the lower-level policy in our setup) by re-labeling experiences with randomly sampled goals. This allows the lower-level policy to use experience collected with respect to a specific goal to learn behavior with respect to any alternative goal. Our results show that this technique can provide an initial speed-up in training; however, its performance is quick to plateau. We hypothesize that re-labeling goals randomly may make lower-level training more difficult, since the policy must learn to satisfy not only the goals provided by the higher-level agent, but almost any conceivable goal. The benefit of re-labeling goals will require more research, and we encourage future work to investigate better ways to harness its benefits.

With pre-training. In this variant we evaluate a simpler method to avoid the non-stationarity issue in higher-level off-policy training. Rather than correct for past experiences, we instead pre-train the lower-level policy for 2M steps (using goals sampled from a Gaussian) before freezing it and training the higher-level policy alone (this variant also has the advantage of allowing the higher-level policy to learn with respect to a deterministic, non-exploratory lower-level policy). In the harder navigation tasks, we find that pre-training is detrimental. This is understandable, as these tasks require specialization in different low-level behavior for different stages of the navigation. By allowing the lower-level policy to continually learn as new parts of the environment are encountered, we are able to learn a lower-level policy which is better able to satisfy the desired goals of the higher-level policy. In contrast, in the simpler and mostly homogeneous Ant Gather task, the advantage of pre-training is significant. This suggests that our off-policy correction is still not perfect, and there is potentially significant benefit to be obtained by improving it.

No off-policy correction. We assess the advantage of including the off-policy correction compared to training off-policy naïvely, ignoring the non-stationarity issue. Interestingly, training an HRL policy this way can do quite well. However, in the harder tasks (Ant Push, Ant Fall) the issue becomes difficult to ignore. Accordingly, we observe a significant benefit from using the off-policy correction.

No HRL. Finally, we evaluate the ability of a single non-HRL policy to learn in these environments. This variant makes almost no progress on the tasks compared to our HRL method.

6 Conclusion

We have presented a method for training a two-layer hierarchical policy. Our approach is general, using learned goals to pass instructions from the higher-level policy to the lower-level one. Moreover, we have described a method by which both policies may be trained in an off-policy manner concurrently for highly sample-efficient learning. Our experiments show that our method outperforms prior HRL algorithms and can solve exceedingly complex tasks that combine locomotion and rudimentary object interaction. We note that our results are still far from perfect, and there is much work left for future research to improve the stability and performance of HRL methods on these tasks.

7 Acknowledgments

We thank Ben Eysenbach and others on the Google Brain team for insightful comments and discussions.

References

Appendix A Discussion on Alternative Off-Policy Corrections for High-Level Actions

Through our experiments, we found that our proposed maximum likelihood-based action relabeling works well empirically; however, we also tried other variants of off-policy correction schemes. While none of the methods below worked as well as ours in the tested domains based on preliminary experiments, we summarize them below as a reference for further future work on off-policy correction for HRL.

The experience replay stores sampled from following a low-level policy . is low-level action and is high-level action (or goal for the low-level policy). We want to estimate the following objective for the current low-level policy , where represents the target network,

(6)
(7)
(8)
(9)

We remind the reader that is computed using a deterministic dynamics from using for .

Direct Importance Correction. A naïve approach is to directly use the unbiased estimator based on importance weighting defined by the expectation in Eq. 9,

(10)
(11)
(12)

For the continuous action domains in our paper, we found this estimator, while unbiased, has very high variance, and does not work well in practice.
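The variance problem can be illustrated with a toy sketch. It assumes that the importance weight is a product of per-step likelihood ratios between the current and the behavior lower-level policies over the steps covered by one high-level action; the 1-D Gaussian policies, the mean shift, and all function names are hypothetical and only serve to show how the spread of the weights grows with the number of steps.

```python
import math
import random


def gaussian_log_prob(x, mean, std):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def log_importance_weight(actions, new_means, old_means, std):
    """Sum of per-step log ratios: log pi_new(a_i) - log pi_old(a_i)."""
    return sum(
        gaussian_log_prob(a, m_new, std) - gaussian_log_prob(a, m_old, std)
        for a, m_new, m_old in zip(actions, new_means, old_means)
    )


def sample_weights(num_steps, mean_shift, std, num_samples, seed=0):
    """Draw actions from the old policy and return the resulting importance weights."""
    rng = random.Random(seed)
    old_means = [0.0] * num_steps
    new_means = [mean_shift] * num_steps
    weights = []
    for _ in range(num_samples):
        actions = [rng.gauss(m, std) for m in old_means]
        weights.append(math.exp(log_importance_weight(actions, new_means, old_means, std)))
    return weights


def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Same per-step policy change, but the weight is a product over 1 step vs. 10 steps.
short_weights = sample_weights(num_steps=1, mean_shift=0.5, std=1.0, num_samples=5000)
long_weights = sample_weights(num_steps=10, mean_shift=0.5, std=1.0, num_samples=5000)
print(variance(short_weights), variance(long_weights))
```

In this toy setting the second moment of the weight grows exponentially with the number of steps, which is consistent with the high variance observed in practice.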

Importance-Based Action Relabeling. Instead of computing the high-variance importance weight for the sample goal , we may also try to find a new goal such that the importance weight is approximately 1. This leads to the action relabeling objective as used in our method,

(13)
(14)

where the new goal can be found by minimizing loss functions such as,

(15)
(16)

Since there is no guarantee that exists to make the loss function go to 0, this estimator is still biased. However, we could expect that the bias may be reduced.

Model-Based Relabeling. What we need to ensure for off-policy correction is that is consistent with the dynamics of MDP transition and current low-level policy . If we can approximate either the high-level forward dynamics or the inverse model , then we may directly do model-based prediction to relabel for either or . While the action relabeling TD objective is given as Eq. 14, the state relabeling objective is given by,

(17)
(18)

The question is how to get or . While we can fit parametric functions on samples of data, this is often as difficult as a fully model-based approach. We may instead make use of the fact that the low-level policy is trying to reach the given goal states. Assuming the low-level policy eventually completes the given goals, we may use the following forms,

(19)
(20)

This resembles the transition policy gradient in FuN [48], where the high-level policy is trained by assuming the low-level policy approximately completes the assigned goals. Empirically, we did not observe this to outperform our approach on the tested domains.

Appendix B Environment Details

Environments use the MuJoCo simulator [45] with and frame skip set to .

B.1 Gather

We use the Gather environment provided by Rllab with a simulated ant agent. The ant is equivalent to the standard Rllab Ant, except that its gear range is reduced from to . In addition to observing , , and the current time step , the agent also observes depth readings as defined by the standard Gather environment. We set the activity range to and the sensor span to , which matches the settings in [10].

Each episode is terminated either when the ant falls or at 500 steps.

The reward used is the default reward (number of apples minus number of bombs).

B.2 Navigation

We devise three navigation tasks to evaluate our method. In each navigation task, we create an environment of blocks, some movable and some with fixed position. We use the same ant agent used in Gather. The agent observes , , the current time step , and the target location. Its actions correspond to torques applied to joints. At the beginning of each episode, the environment samples a target position and the agent is provided a reward at each step corresponding to negative L2 distance from the target: . In one of the navigation tasks (Falling), the L2 distance is measured with respect to 3 coordinates: , and . Each episode is 500 steps long (i.e., the episode does not terminate when the ant falls).
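The per-step reward and the per-task success criterion can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, and the success threshold is left as a parameter because its value is not given here. The same functions cover the 2-coordinate case and the 3-coordinate case used in the Falling task.

```python
import math


def l2_distance(position, target):
    """Euclidean distance between two points of equal dimension (2-D or 3-D)."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(position, target)))


def navigation_reward(position, target):
    """Per-step reward: negative L2 distance from the agent to the target."""
    return -l2_distance(position, target)


def is_success(final_position, target, threshold):
    """Success test applied to the last step of an episode.

    threshold is a hypothetical parameter; substitute the task's actual value.
    """
    return l2_distance(final_position, target) <= threshold


# Toy usage with made-up coordinates.
print(navigation_reward((3.0, 4.0), (0.0, 0.0)))            # 2-D case
print(navigation_reward((1.0, 2.0, 2.0), (0.0, 0.0, 0.0)))  # 3-D case
```

Because the reward is dense and always non-positive, an agent that never approaches the target still receives an informative signal at every step.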

We describe the specifics of each navigation task below.

B.2.1 Maze

In this task, immovable blocks are placed to confine the agent to a “”-shaped corridor. That is, blocks are placed everywhere except at . The agent is initialized at position . At each episode, a target position is sampled uniformly at random from .

At evaluation time, we evaluate the agent only on its ability to reach . We define a “success” as being within an L2 distance of from the target on the ultimate step of the episode.

B.2.2 Push

In this task, immovable blocks are placed everywhere except at . A movable block is placed at . The agent is initialized at position . At each episode, the target position is fixed to . Therefore, the agent must first move to the left, then push the movable block to the right, and then navigate to the target unimpeded.

At evaluation time, we evaluate the agent on its ability to reach . We define a “success” as being within an L2 distance of from the target on the ultimate step of the episode.

B.2.3 Fall

In this task, the agent is initialized on a platform of height . Immovable blocks are placed everywhere except at . The raised platform is absent in the region . A movable block is placed at . The agent is initialized at position . At each episode, the target position is fixed to . Therefore, to cross the chasm, the agent must first push the movable block into the chasm and walk on top of it before navigating to the target.

At evaluation time, we evaluate the agent on its ability to reach . We define a “success” as being within an L2 distance of from the target on the ultimate step of the episode.

Appendix C Implementation Details

C.1 Network Structure

We use the same basic network structure as proposed by the TD3 algorithm [12], with the only difference being that we use layers of size rather than .

The output of the lower-level actor network (activated by ) is scaled to the range of the low-level actions, which is .

The output of the higher-level actor network is scaled to an approximated range of high-level actions: for the desired relative ; for the desired relative ; for the desired relative torso orientations; and the remaining limb angle ranges are available from the ant.xml file.

C.2 Training Parameters

  • Discount for both controllers.

  • Adam optimizer; actor learning rate ; critic learning rate .

  • Soft update targets for both controllers.

  • Replay buffer of size 200,000 for both controllers.

  • Lower-level train step and target update performed every 1 environment step.

  • Higher-level train step and target update performed every 10 environment steps.

  • Reward scaling of for lower-level; for higher-level.

  • Lower-level exploration is Gaussian noise with .

  • Higher-level exploration is Gaussian noise with .
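The update schedule implied by the list above can be sketched as follows. Only the replay buffer size and the two update periods are taken from the list; the soft-update coefficient `tau` is a parameter because its value is not given here, and the function names are hypothetical.

```python
REPLAY_BUFFER_SIZE = 200_000   # per controller, from the list above
LOW_LEVEL_TRAIN_PERIOD = 1     # lower-level train step every environment step
HIGH_LEVEL_TRAIN_PERIOD = 10   # higher-level train step every 10 environment steps


def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward its online counterpart.

    tau is an assumed placeholder; the actual coefficient is not stated here.
    """
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_params, online_params)]


def count_train_steps(num_env_steps):
    """Count how many train steps each controller performs in num_env_steps."""
    low_steps = 0
    high_steps = 0
    for step in range(1, num_env_steps + 1):
        if step % LOW_LEVEL_TRAIN_PERIOD == 0:
            low_steps += 1
        if step % HIGH_LEVEL_TRAIN_PERIOD == 0:
            high_steps += 1
    return low_steps, high_steps


print(count_train_steps(100))
```

Under this schedule the lower-level controller receives ten gradient updates for every higher-level update.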

C.3 Off-Policy Correction

Given a high-level experience transition , we select 10 candidate to maximize the log-probability of the lower-level actions. One is taken to be the original ; another to be ; and the remaining eight are sampled randomly from a Gaussian centered at with standard deviation (and subsequently clipped to lie within the high-level action range).
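A minimal sketch of this candidate selection is given below, under several stated assumptions. The lower-level policy is taken to be deterministic with Gaussian exploration noise, so that the log-probability of an action is proportional to the negative squared error to the policy output. Goals are assumed to be relative offsets with the same dimension as the state, updated between steps by a fixed transition. The second candidate and the Gaussian center are both assumed to be the observed displacement between the first and last states of the segment. All names (`policy_mean`, `relabel_goal`, and so on) are hypothetical.

```python
import random


def goal_transition(state, goal, next_state):
    """Assumed fixed goal transition for relative goals: s + g - s'."""
    return [s + g - n for s, g, n in zip(state, goal, next_state)]


def action_log_likelihood(policy_mean, states, actions, goal):
    """Unnormalized log-probability of the observed low-level actions under a goal.

    states has one more entry than actions (it includes the final state).
    """
    total = 0.0
    current_goal = list(goal)
    for i, action in enumerate(actions):
        mean = policy_mean(states[i], current_goal)
        total -= 0.5 * sum((a - m) ** 2 for a, m in zip(action, mean))
        current_goal = goal_transition(states[i], current_goal, states[i + 1])
    return total


def relabel_goal(policy_mean, states, actions, original_goal, std, low, high, rng,
                 num_random=8):
    """Pick, among 10 candidates, the goal that best explains the stored actions."""
    displacement = [end - start for start, end in zip(states[0], states[-1])]
    candidates = [list(original_goal), displacement]
    for _ in range(num_random):
        sample = [rng.gauss(d, std) for d in displacement]
        # Clip each coordinate to the (assumed) high-level action range.
        candidates.append([min(max(x, lo), hi) for x, lo, hi in zip(sample, low, high)])
    return max(candidates,
               key=lambda g: action_log_likelihood(policy_mean, states, actions, g))


# Toy usage: a hypothetical policy whose action simply equals the current goal.
toy_policy = lambda state, goal: list(goal)
states = [[0.0], [1.0], [2.0]]
actions = [[2.0], [1.0]]  # exactly what toy_policy would output for goal [2.0]
new_goal = relabel_goal(toy_policy, states, actions, original_goal=[5.0],
                        std=0.5, low=[-10.0], high=[10.0], rng=random.Random(0))
print(new_goal)
```

In the toy example the stored goal no longer explains the stored actions, so the selection falls back to the displacement candidate, which reproduces them exactly.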

C.4 Evaluation

Learned hierarchical policies are evaluated every 50,000 training steps by averaging performance over 50 random episodes.

Appendix D Benchmark Details

D.1 FuN

FuN [48] primarily proposes four components: (1) transition policy gradient, (2) directional cosine similarity rewards, (3) goals specified with respect to a learned representation, and (4) dilated RNN. Since our tasks are low-dimensional and fully observed, we do not include design choice (4). For each of (1), (2), and (3), we apply an equivalent modification of our HRL method and evaluate its performance on the same tasks. For representation learning, we augment our method with a two-hidden-layer feed-forward neural network for embedding the observations before passing them to the lower and higher-level policies. The higher-level policy specifies high-level actions and rewards low-level behavior with respect to this representation. For the transition policy gradient, we modify our off-policy correction to instead replace a goal with a goal sampled from a Gaussian centered at , with standard deviation set to . This is analogous to FuN’s transition policy gradient, which trains the higher-level policy under the assumption that its state transitions are distributed symmetrically around its proposed goals. For directional rewards, we replace our relative position parameterized reward function with a cosine similarity reward function equivalent to that used in FuN.
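A directional reward of this kind can be sketched as follows. This is an assumed form for illustration: the reward is the cosine similarity between the observed state change and the goal direction, and the function names and the small constant `eps` guarding against division by zero are hypothetical.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between u and v; eps avoids division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def directional_reward(state, next_state, goal):
    """Reward the direction of movement only, ignoring how far the agent moved."""
    delta = [n - s for s, n in zip(state, next_state)]
    return cosine_similarity(delta, goal)


# Toy usage: moving along, against, and orthogonally to the goal direction.
print(directional_reward([0.0, 0.0], [1.0, 0.0], [2.0, 0.0]))
print(directional_reward([0.0, 0.0], [-1.0, 0.0], [2.0, 0.0]))
print(directional_reward([0.0, 0.0], [0.0, 1.0], [2.0, 0.0]))
```

Unlike a relative-position reward, this signal is invariant to the magnitude of the goal, so it conveys which way to move but not how far.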

Ant Gather
Figure 5: Performance of HIRO compared to VIME and SNN4HRL, averaged over 10 trials with x-axis in millions of experience samples. After a hyper-parameter search on the baselines, we were only able to get competitive performance with HIRO from VIME on Ant Gather, with a significantly higher amount of experience. On the other tasks, we were unable to achieve good baseline performance, even with more experience. The SNN4HRL curve does not include 25M transitions used in pre-training.

D.2 SNN4HRL

SNN4HRL [10] trains the hierarchical policy stage-wise. It first uses a proxy reward with a mutual information (MI) bonus to learn a mixture of low-level skills encouraging diverse movements, and then trains a high-level policy that controls switching among these skills to optimize for the task reward. We imported our additional environments into the official open-source code with minimal modifications and followed a benchmark setup similar to that in [10]. All policies are trained with TRPO [39] with step size 0.01 and discount 0.99. All neural networks (the SNN, the Latent Regressor Network and the Manager Network) have 2 layers of 32 hidden units as done in [10]. (While the policy network sizes are significantly smaller than those used for our method, we observed no significant improvements with larger network sizes; this observation conforms with prior results that on-policy policy gradient methods can perform well on MuJoCo benchmark tasks with very small networks [9, 37].) We report the final results averaged over 10 random seeds, where we ran the low-level training using 5 seeds, and ran the high-level training using 2 seeds per trained low-level policy.

For the skill training, the mesh density used to grid the (x, y) space and give the MI bonus is 10 divisions/unit, and the mutual information bonus coefficient is set to . Bilinear integration is used for the SNN. The number of skills trained is 6. The batch size and the maximum path length for the pre-train task are 50,000 and 500 respectively. To guarantee learning effective skills, we trained the policy with 500 updates, or 25M transitions. The visualization of learned skill primitives is shown in Figure 6, where the learned 6 sub-policies successfully cover different exploration directions.

For the high-level policy learning, we fixed the low-level policies as done in SNN4HRL and trained a policy that outputs a categorical action every 10 time steps that chooses which skill to execute. (In both SNN4HRL [10] and VIME [21], results are primarily reported and compared on SwimmerMaze and SwimmerGather, and therefore the experimental results are different.) For fair comparisons, we experimented with both sparse and dense rewards for the maze environments, and searched over batch sizes for transitions. We observed that the dense rewards did not help SNN4HRL significantly, since the policy often quickly converges to a local optimum. We found the batch size of is too noisy, and the batch size of is unnecessarily sample intensive, so the high-level policy is trained using batch size of , the default value in their paper, for 300 updates, or 15M transitions. The combined training sample size of 40M is considerably more than the 10M used for our method; however, our method still outperforms these SNN4HRL results substantially.

D.3 Variational Information Maximizing Exploration

Variational Information Maximizing Exploration (VIME) [21], while not an HRL algorithm, exhibits good performance on prior benchmark maze and gather tasks, and is also used as a strong baseline in SNN4HRL [10]. We ran the algorithm using the default settings in the official open-source implementation. A batch size of 50,000 is used. We report the average performance across 5 seeds after running the algorithm for 300 updates, or 15M transitions. Only the Gather task required more samples to converge to the final performance, and required 25M+ transitions to reach the same performance as what our method reached in a few million transitions.
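The sample budgets quoted for the two baselines follow from simple arithmetic, which the short check below makes explicit. The variable names are hypothetical; the 15M high-level SNN4HRL budget is taken as reported rather than recomputed, and all other figures come from the batch sizes, update counts, and seed counts stated in this appendix.

```python
BATCH_SIZE = 50_000  # transitions per update (SNN4HRL skill pre-training and VIME)

# SNN4HRL skill pre-training: 500 updates.
snn4hrl_pretrain_transitions = 500 * BATCH_SIZE

# SNN4HRL high-level training: reported as 15M transitions (300 updates).
snn4hrl_high_level_transitions = 15_000_000

# Combined SNN4HRL budget.
snn4hrl_total_transitions = snn4hrl_pretrain_transitions + snn4hrl_high_level_transitions

# VIME: 300 updates.
vime_transitions = 300 * BATCH_SIZE

# SNN4HRL seeds: 5 low-level seeds, each with 2 high-level seeds.
snn4hrl_num_seeds = 5 * 2

print(snn4hrl_pretrain_transitions, snn4hrl_total_transitions, vime_transitions)
```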

Figure 6: Visitation plots for 2 random seeds for the low-level SNN policy in the SNN4HRL benchmark. All 6 policies diversify in different exploration directions.

D.4 Option-Critic Architecture

We also experimented with continuous-action variants of the option-critic architecture [2]. The option-policy for option is parameterized as a Gaussian, whose mean is output from a neural network taking in and , and variance is chosen to be global and diagonal. We first tested naively extending the official open-source implementation for continuous action, and then tried modifying the learning procedure such that the critic learns the state-option-action value function instead of the state-option value function in the original implementation. This creates slight changes for the value and policy training objectives, while the loss for termination policy is basically kept the same. Concretely, for the first variant, we trained and the option-policy with the following gradients,

(21)
(22)
(23)

where represents the target network, and and represent using off-policy and on-policy transition samples respectively. For simplicity of explanation, we assume that the reward only depends on states, but similar arguments can be made for the general case. There are two practical problems with this objective. First, the policy gradient, which relies on a score function estimate, could have high variance, especially with respect to a continuous policy . We experimented with several choices of baselines , including and . The second problem is that the off-policy learning for does not use the action taken and only relies on . This effectively creates the same non-stationarity problem with respect to the high-level policy as our method, since it ignores that for the same and , the next state can be different due to changing . To counter both problems, we also explored another variant of the option-critic implementation at the expense of potentially more computation and network parameters, which conforms more closely with the policy gradient theorems in the original paper. Specifically, we trained and the option-policy with the following gradients,

(24)
(25)
(26)
(27)

In this implementation, we observe that the off-policy learning for can effectively utilize both and , removing the non-stationarity problem, and the policy gradient can be estimated with lower variance using the reparametrization trick [22] through the critic directly. Furthermore, since the policy gradient no longer requires a next-state estimate, off-policy state samples may also be used along with enumeration over all ,

(28)

Making similar approximations for the termination policy, this enables a fully off-policy actor-critic algorithm like DDPG [27] for the option-critic architecture.

While we tried these modifications, we could not make the option-critic implementation work reasonably on our domains. The main difficulty is likely because the low-level option-policies are learned using only the external task reward, a limitation in a direct end-to-end hierarchical policy structure. While in our experiments we could not show substantial successes, the algorithm may work better with more sophisticated modifications to the policy evaluation or policy improvement routines based on recent advances [30, 49, 14, 16, 12], and we leave further comparisons for future work.