Diversity-Driven Extensible Hierarchical Reinforcement Learning

Yuhang Song et al. · 11/10/2018

Hierarchical reinforcement learning (HRL) has recently shown promising advances in speeding up learning, improving exploration, and discovering intertask transferable skills. Most recent works focus on HRL with two levels, i.e., a master policy manipulates subpolicies, which in turn manipulate primitive actions. However, HRL with multiple levels is usually needed in many real-world scenarios, whose ultimate goals are highly abstract, while their actions are very primitive. Therefore, in this paper, we propose a diversity-driven extensible HRL (DEHRL), where an extensible and scalable framework is built and learned levelwise to realize HRL with multiple levels. DEHRL follows a popular assumption: diverse subpolicies are useful, i.e., subpolicies are believed to be more useful if they are more diverse. However, existing implementations of this diversity assumption usually have their own drawbacks, which make them inapplicable to HRL with multiple levels. Consequently, we further propose a novel diversity-driven solution to achieve this assumption in DEHRL. Experimental studies evaluate DEHRL with five baselines from four perspectives in two domains; the results show that DEHRL outperforms the state-of-the-art baselines in all four aspects.


1 Introduction

Hierarchical reinforcement learning (HRL) recombines sequences of basic actions to form subpolicies [Sutton, Precup, and Singh1999, Parr and Russell1998, Dietterich2000]. It can be used to speed up learning [Bacon, Harb, and Precup2017], improve exploration to solve tasks with sparse extrinsic rewards (i.e., rewards generated by the environment) [Şimşek and Barto2004], or learn meta-skills that can be transferred to new problems [Frans et al.2018]. Although most previous approaches to HRL require hand-crafted subgoals to pretrain subpolicies [Heess et al.2016] or extrinsic rewards as supervisory signals [Vezhnevets et al.2016], the recent ones seek to discover subpolicies without manual subgoals or pretraining. Most of them work in a top-down fashion, such as [Xu et al.2018a, Bacon, Harb, and Precup2017], where a given agent first explores until it accomplishes a trajectory that reaches a positive extrinsic reward. Then, it tries to recombine the basic actions in the trajectory to build useful or reasonable subpolicies.

However, such top-down solutions are not practical in some situations, where the extrinsic rewards are sparse, and the action space is large (called sparse extrinsic reward problems); this is because positive extrinsic rewards are almost impossible to be reached by exploration using basic actions in such scenarios. Therefore, more recent works focus on discovering “useful” subpolicies in a bottom-up fashion [Lakshminarayanan et al.2016, Kompella et al.2017], which are capable of discovering subpolicies before reaching a positive extrinsic reward. In addition, the bottom-up strategy can discover subpolicies that better facilitate learning for an unseen problem [Frans et al.2018]. This is also called meta-learning [Finn, Abbeel, and Levine2017], or more precisely meta-reinforcement-learning [Al-Shedivat et al.2018], where subpolicies shared across different tasks are called meta-subpolicies (or meta-skills).

Figure 1: Playing OverCooked with HRL of three levels.

However, none of the above methods shows the capability to build extensible HRL with multiple levels, i.e., building subpolicies upon subpolicies, which is usually needed in many real-world scenarios, whose ultimate goals are highly abstract, while their basic actions are very primitive. We take the game OverCooked (shown in Fig. 1) as an example. The ultimate goal of OverCooked is to let an agent fetch multiple ingredients (green box) in a particular sequence according to a to-pick list (yellow box), which is shuffled in every episode. However, the basic action of the agent is so primitive (thus called primitive action in the following) that it can only move one of its four legs towards one of the four directions at each step (marked by red arrows), and the agent’s body moves only when all four legs have been moved in the same direction. Consequently, although we can simplify the task by introducing subpolicies that learn to move the body towards different directions with four steps of primitive actions, the ultimate goal is still difficult to reach, because the to-pick list changes every episode.

Fortunately, this problem can be easily overcome if HRL has multiple levels: by defining the previous subpolicies as subpolicies at level $0$, HRL with multiple levels can build subpolicies at level $1$ that learn to fetch different ingredients based on the subpolicies at level $0$; obviously, a policy based on the subpolicies at level $1$ is capable of reaching the ultimate goal far more easily than one based on the subpolicies at level $0$.

Motivated by the above observation, in this work, we propose a diversity-driven extensible HRL (DEHRL) approach, which is constructed and trained levelwise (i.e., each level shares the same structure and is trained with exactly the same algorithm), making the HRL framework extensible to build higher levels. DEHRL follows a popular diversity assumption: diverse subpolicies are useful, i.e., subpolicies are believed to be more useful if they are more diverse. Therefore, the objective of DEHRL at each level is to learn corresponding subpolicies that are as diverse as possible, thus called diversity-driven.

However, existing implementations of this diversity assumption usually have their own drawbacks, which make them inapplicable to HRL with multiple levels. For example, (i) the implementation in [Daniel, Neumann, and Peters2012] works in a top-down fashion; (ii) the one in [Haarnoja et al.2018] cannot operate different layers at different temporal scales to solve temporally delayed reward tasks; and (iii) the implementation in [Gregor, Rezende, and Wierstra2016, Florensa, Duan, and Abbeel2017] is not extensible to higher levels.

Consequently, we further propose a novel diversity-driven solution to achieve this assumption in DEHRL: We first introduce a predictor at each level to dynamically predict the resulting state of each subpolicy. Then, the diversity assumption is achieved by giving higher intrinsic rewards to subpolicies that result in more diverse states; consequently, the subpolicies in DEHRL converge to taking actions that result in the most diverse states. Here, intrinsic rewards are rewards generated by the agent itself.

We summarize the contributions of this paper as follows:

  • We propose a diversity-driven extensible hierarchical reinforcement learning (DEHRL) approach. To our knowledge, DEHRL is the first learning algorithm that is built and learned levelwise with verified scalability, so that HRL with multiple levels can be realized end-to-end without human-designed extrinsic rewards.

  • We further propose a new diversity-driven solution to implement and achieve the widely adopted diversity assumption in HRL with multiple levels.

  • Experimental studies evaluate DEHRL with five baselines from four perspectives in two domains. The results show that, compared to the baselines, DEHRL achieves the following advantages: (i) DEHRL can discover useful subpolicies more effectively, (ii) DEHRL can solve the sparse extrinsic reward problem more efficiently, (iii) DEHRL can learn better intertask-transferable meta-subpolicies, and (iv) DEHRL has good portability.

2 Diversity-Driven Extensible HRL

This section introduces the new DEHRL framework as well as the integrated diversity-driven solution. The structure of DEHRL is shown in Fig. 2, where each level $l$ contains a policy (denoted $\pi^{l}$), a predictor (denoted $f^{\,l}$), and an estimator. The policy and the predictor are two deep neural networks (i.e., parameterized functions), with $\theta^{l}_{\pi}$ and $\theta^{l}_{f}$ denoting their trainable parameters, while the estimator only contains operations without trainable parameters. For any two neighboring levels (e.g., the upper level $l+1$ and the lower level $l$), three connections are shown in Fig. 2:

  • The policy $\pi^{l+1}$ at the upper level produces the action $a^{l+1}$, which is treated as an input to the policy $\pi^{l}$ at the lower level;

  • The predictor $f^{\,l+1}$ at the upper level makes several predictions, which are passed to the estimator at the lower level;

  • Using the predictions from the upper level, the estimator at the lower level generates an intrinsic reward $b^{l}$ to train the policy $\pi^{l}$ at the lower level (a minimal sketch of this level-wise interface is given below).
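To make this level-wise wiring concrete, below is a minimal Python sketch (not the released implementation) of how two neighboring levels could interact; the names Level and run_two_levels, the callable interfaces, and the gym-style env.step signature are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable

# Minimal sketch of the level-wise interface in DEHRL (illustrative names only).
# policy(s, a_upper)       -> action sampled from pi^l conditioned on the upper action
# predictor(s, a)          -> predicted state after T^{l+1} steps for a candidate action a
# estimator(s_real, preds) -> intrinsic reward b^l from the real resulting state and the
#                             predictions for the *other* upper-level actions

@dataclass
class Level:
    n_actions: int          # n^l: size of this level's action space
    temporal_scale: int     # T^l: this level acts every T^l environment steps
    policy: Callable        # pi^l(s, a_upper) -> a^l
    predictor: Callable     # f^l(s, a^l) -> predicted resulting state
    estimator: Callable     # distance-based intrinsic-reward computation

def run_two_levels(env, upper: Level, lower: Level, steps: int):
    # Assumes a gym-style env with reset() and step() -> (s, r, done, info).
    s = env.reset()
    a_upper, s_at_selection = None, None
    for t in range(steps):
        if t % upper.temporal_scale == 0:
            a_upper = upper.policy(s, None)     # top level here: no conditioning action
            s_at_selection = s                  # remember the state when a_upper was chosen
        a_lower = lower.policy(s, a_upper)      # lower policy conditioned on a_upper
        s, r_ext, done, _ = env.step(a_lower)
        if (t + 1) % upper.temporal_scale == 0:
            # Predictions for all upper-level actions OTHER than the selected one.
            other_preds = [upper.predictor(s_at_selection, a)
                           for a in range(upper.n_actions) if a != a_upper]
            b_lower = lower.estimator(s, other_preds)   # intrinsic reward for the lower policy
            # r_ext + b_lower would then be fed to the lower level's PPO update.
        if done:
            s = env.reset()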

2.1 Policy

As shown in Fig. 2, the policies at different levels act at different frequencies, i.e., the policy $\pi^{l}$ samples an action every $T^{l}$ steps. Note that $T^{l+1}$ is always an integer multiple of $T^{l}$, and $T^{0}$ always equals $1$, so the time complexity of the proposed framework does not grow linearly as the level goes higher. The $T^{l}$ for $l>0$ are hyper-parameters. At level $l$, the policy $\pi^{l}$ takes as input the current state $s$ and the action $a^{l+1}$ from the upper level $l+1$, so that the output of $\pi^{l}$ is conditional on $a^{l+1}$. Note that $a^{l+1} \in \{1, \dots, n^{l+1}\}$, where $n^{l}$ denotes the size of the output action space of $\pi^{l}$. Thus, $n^{0}$ should be set to the size of the action space of the environment, since the policy $\pi^{0}$ directly takes actions in the environment, while the $n^{l}$ for $l>0$ are hyper-parameters. The policy $\pi^{l}$ takes as input both $s$ and $a^{l+1}$ to integrate multiple subpolicies into one model; a similar idea is presented in [Florensa, Duan, and Abbeel2017]. The detailed network structure of the policy is presented in the arXiv release (https://arxiv.org/abs/1811.04324). Then, the policy produces the action $a^{l}$ by sampling from a parameterized categorical distribution:

$a^{l} \sim \pi^{l}\big(a^{l} \mid s,\, a^{l+1};\, \theta^{l}_{\pi}\big)$   (1)
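As an illustration of such a conditional categorical policy, the following PyTorch sketch keeps one policy head and one value head per upper-level action, which is one way to integrate the subpolicies into a single model; the flat state encoder, layer sizes, and class name LevelPolicy are placeholders and differ from the architecture in Table 8.

import torch
import torch.nn as nn
from torch.distributions import Categorical

class LevelPolicy(nn.Module):
    """Sketch of pi^l: a categorical policy conditioned on the upper-level action."""
    def __init__(self, state_dim: int, n_upper: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        # One policy (and value) head per possible upper-level action a^{l+1}.
        self.pi_heads = nn.ModuleList([nn.Linear(hidden, n_actions) for _ in range(n_upper)])
        self.v_heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_upper)])

    def forward(self, state, a_upper: int):
        h = self.encoder(state)
        logits = self.pi_heads[a_upper](h)      # logits of the categorical distribution
        dist = Categorical(logits=logits)
        a = dist.sample()                       # a^l ~ pi^l(. | s, a^{l+1})
        return a, dist.log_prob(a), self.v_heads[a_upper](h)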

The reward used to train $\pi^{l}$ combines the extrinsic reward $r$ from the environment and the intrinsic reward $b^{l}$ generated by the estimator at level $l$. When facing games with very sparse extrinsic rewards, where $r$ is absent most of the time, $b^{l}$ will guide the policy at this level to learn diverse subpolicies, so that the upper-level policy may reach the sparse positive extrinsic reward more easily. The policy is trained with the PPO algorithm [Schulman et al.2017], but our framework does not restrict the choice of the policy training algorithm. The following denotes the loss for training the policy $\pi^{l}$:

$L\big(\theta^{l}_{\pi}\big) = L_{\mathrm{PPO}}\big(\pi^{l}(\cdot \mid s,\, a^{l+1};\, \theta^{l}_{\pi}),\; r + \alpha\, b^{l}\big)$   (2)

where writing the loss as a function of $\theta^{l}_{\pi}$ means that the gradients of this loss are only passed to the parameters in $\theta^{l}_{\pi}$, and $\alpha$ is a hyper-parameter set to $1$ at all times (the section on the estimator below will introduce a normalization of the intrinsic reward $b^{l}$, which frees $\alpha$ from careful tuning).
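As a hedged illustration, the reward for one decision of $\pi^{l}$ could be assembled as follows; the accumulation of the extrinsic rewards over the $T^{l}$ steps of that decision is an assumption of this sketch, the paper only states that the two reward sources are combined.

def level_reward(extrinsic_rewards, b_intrinsic, alpha=1.0):
    """Sketch: reward fed to the PPO update of pi^l for one of its decisions.

    extrinsic_rewards -- environment rewards collected during the T^l steps for
                         which this decision was active (assumption of this sketch).
    b_intrinsic       -- normalized intrinsic reward b^l from the estimator.
    alpha             -- fixed to 1, as in the text; the normalization in (8)
                         makes careful tuning unnecessary.
    """
    return sum(extrinsic_rewards) + alpha * b_intrinsic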

Figure 2: The framework of DEHRL.

2.2 Predictor

As shown in Fig. 2, the predictor at level $l+1$ (i.e., $f^{\,l+1}$) takes as input the current state $s$ and the action taken by the policy at level $l+1$ (i.e., $a^{l+1}$) as a one-hot vector. The integration of $s$ and $a^{l+1}$ is accomplished by approximated multiplicative interaction [Oh et al.2015], so that any prediction made by the predictor is conditioned on the action input $a^{l+1}$. The predictor makes two predictions, denoted $\hat{s}^{\,l+1}$ and $\hat{b}^{\,l}$, respectively. Thus, the forward function of $f^{\,l+1}$ is:

$\big(\hat{s}^{\,l+1},\; \hat{b}^{\,l}\big) = f^{\,l+1}\big(s,\, a^{l+1};\, \theta^{l+1}_{f}\big)$   (3)
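The following PyTorch sketch illustrates such a two-headed predictor, with an element-wise product standing in for the approximated multiplicative interaction; it assumes 84×84 gray-scale observations and replaces the deconvolutional decoder of Table 9 with a flat linear head for brevity, so the class name LevelPredictor and all layer sizes are illustrative only.

import torch
import torch.nn as nn

class LevelPredictor(nn.Module):
    """Sketch of f^{l+1}: predicts the resulting state s_hat and the expected
    lower-level intrinsic reward b_hat, conditioned on the one-hot action a^{l+1}."""
    def __init__(self, n_actions: int, hidden: int = 256):
        super().__init__()
        self.state_enc = nn.Sequential(            # for 84x84 single-channel inputs
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.LeakyReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.LeakyReLU(),
            nn.Flatten(), nn.Linear(32 * 9 * 9, hidden))
        self.action_enc = nn.Linear(n_actions, hidden)
        self.state_head = nn.Sequential(nn.Linear(hidden, 84 * 84), nn.Sigmoid())
        self.bounty_head = nn.Linear(hidden, 1)

    def forward(self, state, action_onehot):
        # Approximated multiplicative interaction: element-wise product of the
        # state and action embeddings, so every prediction depends on the action.
        h = self.state_enc(state) * self.action_enc(action_onehot)
        s_hat = self.state_head(h).view(-1, 1, 84, 84)   # predicted state after T^{l+1} steps
        b_hat = self.bounty_head(h)                      # predicted intrinsic reward at level l
        return s_hat, b_hat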

The detailed network structure of the predictor is given in the arXiv release. The first prediction $\hat{s}^{\,l+1}$ in (3) is trained to predict the state after $T^{l+1}$ steps with the following loss function:

$L_{\hat{s}}\big(\theta^{l+1}_{f}\big) = \mathrm{MSE}\big(\hat{s}^{\,l+1},\; s'\big)$   (4)

where MSE is the mean square error, $s'$ denotes the real state returned by the environment after $T^{l+1}$ steps, and writing the loss as a function of $\theta^{l+1}_{f}$ indicates that the gradients of this loss are only passed to the parameters in $\theta^{l+1}_{f}$. The second prediction $\hat{b}^{\,l}$ in (3) is trained to approximate the intrinsic reward $b^{l}$ at the lower level $l$, with the loss function

$L_{\hat{b}}\big(\theta^{l+1}_{f}\big) = \mathrm{MSE}\big(\hat{b}^{\,l},\; b^{l}\big)$   (5)

where, again, the gradients of this loss are only passed to the parameters in $\theta^{l+1}_{f}$. The next section about the estimator will show that the intrinsic reward $b^{l}$ is also related to the action $a^{l}$ sampled according to the policy $\pi^{l}$ at the lower level $l$. Since $a^{l}$ is not fed into the predictor $f^{\,l+1}$, the prediction $\hat{b}^{\,l}$ is actually an estimate of the expectation of $b^{l}$ under the current $\pi^{l}$:

$\hat{b}^{\,l} \approx \mathbb{E}_{a^{l} \sim \pi^{l}(\cdot \mid s,\, a^{l+1})}\big[\, b^{l}\, \big]$   (6)

The above two predictions will be used in the estimator, described in the following section.

The predictor is active at the same frequency as the policy at its level. Each time the predictor is active, it produces several predictions that are fed to the estimator at level $l$, including $\hat{b}^{\,l}$ and $\{\hat{s}^{\,l+1}_{\bar{a}}\}_{\bar{a} \neq a^{l+1}}$, where $\hat{s}^{\,l+1}_{\bar{a}}$ denotes the prediction obtained when feeding the predictor with an action $\bar{a}$ other than the selected $a^{l+1}$.
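A minimal sketch of how these additional predictions could be obtained is to query the predictor once with every non-selected one-hot action; the helper name predict_all_other_states and the predictor interface follow the earlier sketch and are not part of the released code.

import torch
import torch.nn.functional as F

def predict_all_other_states(predictor, state, a_selected: int, n_actions: int):
    """Sketch: query f^{l+1} for every candidate action other than the selected
    a^{l+1} and return their predicted resulting states."""
    others = [a for a in range(n_actions) if a != a_selected]
    onehots = F.one_hot(torch.tensor(others), num_classes=n_actions).float()
    states = state.expand(len(others), *state.shape[1:])   # repeat the current state
    with torch.no_grad():                                   # these predictions only feed the estimator
        s_hat, _ = predictor(states, onehots)
    return others, s_hat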

2.3 Estimator

As shown in Fig. 2, taking as input $\hat{b}^{\,l}$, the predictions $\{\hat{s}^{\,l+1}_{\bar{a}}\}_{\bar{a} \neq a^{l+1}}$, and the real resulting state, the estimator produces the intrinsic reward $b^{l}$, which is used to train the policy $\pi^{l}$, as described in the policy section. The design of the estimator is motivated as follows:

  • If the currently selected subpolicy for the upper-level action $a^{l+1}$ differs from the subpolicies for the other actions $\bar{a} \neq a^{l+1}$, then the intrinsic reward $b^{l}$ should be high;

  • The above difference can be measured via the distance between the state resulting from the subpolicy for $a^{l+1}$ and the states resulting from the subpolicies for $\bar{a} \neq a^{l+1}$. Note that since $a^{l+1}$ is selected every $T^{l+1}$ steps, these resulting states are the ones observed $T^{l+1}$ steps later.

In the above motivation, the resulting state of the selected subpolicy is the real state returned by the environment after $T^{l+1}$ steps (i.e., $s'$), while the resulting states of the other subpolicies are those predicted by the predictor at the upper level (i.e., $\hat{s}^{\,l+1}_{\bar{a}}$), as described in the last section. Thus, the intrinsic reward is computed as follows:

$b^{l} = \sum_{\bar{a} \neq a^{l+1}} D\big(s',\; \hat{s}^{\,l+1}_{\bar{a}}\big)$   (7)

where $D(\cdot,\cdot)$ is the distance chosen to measure the difference between states. In practice, we combine the L1 distance and the distance between the centers of mass of the states, to capture information on color changes as well as on objects moving. A more advanced way to measure the above distance would be to match features across states and to measure the movements of the matched features, or to integrate the inverse model in [Pathak et al.2017] to capture the action-related feature changes. However, such advanced ways are not investigated here, as they are beyond the scope of this paper. Equation (7) gives a high intrinsic reward if $s'$ is far from all of the $\hat{s}^{\,l+1}_{\bar{a}}$ overall. In practice, we find that punishing $s'$ for being too close to the single state in $\{\hat{s}^{\,l+1}_{\bar{a}}\}$ that is closest to it is a much better choice. Thus, we replace the sum in (7) with the minimum, and find that this consistently gives the best intrinsic reward estimation.
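The following is a minimal NumPy sketch of such a distance and of the min-based bounty; the equal weighting of the L1 term and the center-of-mass term is an assumption of this sketch, not the tuned combination of the released code.

import numpy as np

def state_distance(s_a: np.ndarray, s_b: np.ndarray) -> float:
    """Sketch of D(., .): L1 distance plus the distance between the centers of
    mass of two gray-scale frames (the relative weighting is an assumption)."""
    l1 = np.abs(s_a - s_b).mean()
    def center_of_mass(img):
        total = img.sum() + 1e-8
        ys, xs = np.indices(img.shape)
        return np.array([(ys * img).sum() / total, (xs * img).sum() / total])
    com = np.linalg.norm(center_of_mass(s_a) - center_of_mass(s_b))
    return float(l1 + com)

def intrinsic_reward(s_real: np.ndarray, predicted_others) -> float:
    """Raw bounty b^l: distance of the real resulting state from the CLOSEST
    predicted state of the other subpolicies (the min variant of Eq. (7))."""
    return min(state_distance(s_real, s_hat) for s_hat in predicted_others)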

Estimating the intrinsic reward with distances between high-dimensional states comes with the problem that the changes in distance that we want the intrinsic reward to capture are extremely small compared to the mean of the distances. Thus, we use the estimate of the expectation of the intrinsic reward (i.e., $\hat{b}^{\,l}$, described in the last section) to normalize $b^{l}$:

$b^{l} \leftarrow b^{l} \,/\, \hat{b}^{\,l}$   (8)

In practice, this normalization gives a stable algorithm without the need to tune $\alpha$ according to the chosen distance measure or the convergence status of the predictor at the upper level. We jointly optimize the loss functions in (2), (4), and (5).
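A sketch of this normalization, under our reading of (8) as a division by the predicted expectation (the released code is authoritative for the exact form):

def normalized_bounty(b_raw: float, b_hat: float, eps: float = 1e-6) -> float:
    """Sketch of Eq. (8): scale the raw bounty by the predicted expectation b_hat,
    so that small relative changes in the state distances become visible and the
    weight alpha in Eq. (2) needs no per-domain tuning."""
    return b_raw / max(b_hat, eps)

The small eps guard is only there to keep the sketch well-defined before the predictor has converged.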

3 Experiments

level $l$ | 0 | 1 | 2
$n^l$ (number of subpolicies) | 16 | 5 | 5
$T^l$ (temporal scale) | 1 | 1×4 | 1×4×12

Table 1: The settings of DEHRL.

We conduct experiments to evaluate DEHRL and six baselines based on two games, OverCooked (shown in Fig. 1) and MineCraft. The important hyper-parameters of DEHRL are summarized in Table 1, while other details (e.g., neural network architectures and hyper-parameters of the policy training algorithm) are provided in the arXiv release. Easy-to-run code has been released to further clarify the details and facilitate future research (https://github.com/YuhangSong/DEHRL). An evaluation on more domains (such as MontezumaRevenge) can also be found in this repository.

Figure 3: State examples of OverCooked.

3.1 Subpolicy Discovery

We first evaluate DEHRL in OverCooked to see if it discovers diverse subpolicies more efficiently, compared to the state-of-the-art baselines towards option discovery [Florensa, Duan, and Abbeel2017, Bacon, Harb, and Precup2017].

As shown in Fig. 1, an agent in OverCooked can move one of its four legs towards one of the four directions at each step, so its action space contains $16$ actions. Only after all four legs have been moved in the same direction does the body of the agent move in that direction, and all four legs are then reset. There are four different ingredients at the corners of the kitchen (marked by the green box). An ingredient is automatically picked up when the agent reaches it. The lower-left corner shows a list of ingredients that the chef needs to pick in sequence to complete a dish (marked by the red box), called the to-pick list.
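To make the movement rule concrete, here is a minimal sketch of the leg/body dynamics described above; the class OverCookedLegs and its action encoding are illustrative and not the released environment.

import numpy as np

class OverCookedLegs:
    """Sketch of the movement rule: 16 actions = (leg, direction); the body moves
    one cell only when all four legs point in the same direction, after which the
    legs are reset."""
    DIRS = np.array([(0, 1), (0, -1), (1, 0), (-1, 0)])   # right, left, down, up

    def __init__(self):
        self.body = np.array([0, 0])
        self.legs = [None, None, None, None]              # None = leg not yet moved

    def step(self, action: int):
        leg, direction = divmod(action, 4)                # decode one of the 16 actions
        self.legs[leg] = direction
        if all(d is not None for d in self.legs) and len(set(self.legs)) == 1:
            self.body = self.body + self.DIRS[direction]  # all legs agree: move the body
            self.legs = [None, None, None, None]          # ...and reset the legs
        return self.body.copy()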

reward-level | goal-type: any (easy) | goal-type: fix (middle) | goal-type: random (hard)
1 (easy) | Get any ingredient. | Get a particular ingredient. | Get the first ingredient shown in the shuffled* to-pick list.
2 (hard) | Get 4 ingredients in any order. | Get 4 ingredients in a particular order. | Get 4 ingredients in order according to the shuffled* to-pick list.
  • *The to-pick list is shuffled every episode.

Table 2: The different settings of extrinsic rewards in OverCooked.

3.1.1 Without Extrinsic Rewards.

We first aim to discover useful subpolicies without extrinsic rewards. Since the agent has four legs, and every leg has five possible states (staying, or having been moved towards one of the four directions), there are in total $5^4 = 625$ possible states for every $4$ steps. As shown in Fig. 3, five of them are the most useful states (i.e., the ones that are most diverse from each other): those in which all four legs share the same state, making the body of the chef move towards one of the four directions or stay still.

Consequently, a good implementation of the diversity assumption should be able to learn subpolicies at level $0$ that result in the five most useful states (called the five useful subpolicies) efficiently and comprehensively. Therefore, given $n^{1}=5$ (i.e., discovering only five subpolicies) and a training budget of several million steps, the five subpolicies learned by DEHRL at level $0$ are exactly the five useful subpolicies. Furthermore, SNN [Florensa, Duan, and Abbeel2017] is a state-of-the-art implementation of the diversity assumption, which is thus tested as a baseline under the same setting. However, only one of the five useful subpolicies is discovered by SNN. We then repeat the experiments three times with different training seeds, and the results are the same. Furthermore, we loosen the restriction by setting $n^{1}$ to a larger value (i.e., discovering more than five subpolicies) and training for more steps. Unsurprisingly, the five useful subpolicies are always included in the subpolicies discovered by DEHRL; however, the subpolicies discovered by SNN still contain only one useful subpolicy.

The superior performance of DEHRL comes from the diversity-driven solution, which gives higher intrinsic rewards to subpolicies that result in more diverse states; consequently, subpolicies in DEHRL converge to taking actions that result in the most diverse states. The failure of SNN may be because the objective of SNN is to maximize mutual information, so it only guarantees to discover subpolicies resulting in different states, but these different states are not guaranteed to be the most diverse from each other. Similar failures are found for other state-of-the-art solutions (e.g., [Gregor, Rezende, and Wierstra2016]); we thus omit the analysis due to space limits.

As for finding useful subpolicies at higher levels, due to the failures at level $0$, none of the state-of-the-art solutions can generate useful subpolicies at higher levels. However, useful subpolicies can be learned by DEHRL at higher levels. Fig. 4 visualizes five subpolicies learned by DEHRL at level 1, where four of them (marked by a green box) result in getting the four different ingredients; these are the useful subpolicies at level $1$.

Figure 4: Subpolicies learned at level 1 in DEHRL.

3.1.2 With Extrinsic Rewards.

Although DEHRL can work in a bottom-up fashion where no extrinsic reward is required to discover useful subpolicies, DEHRL also performs very well in scenarios where extrinsic rewards are given. Therefore, we compare DEHRL with two state-of-the-art top-down methods, option-critic [Bacon, Harb, and Precup2017] and FeUdal [Vezhnevets et al.2017], for which extrinsic rewards are essential. As shown in Table 2, six different extrinsic reward settings are given to OverCooked, resulting in different difficulties.

To measure the performance quantitatively, two metrics, the final performance score and the learning speed score, both based on the reward per episode, are adopted from [Schulman et al.2017]. Generally, the higher the reward per episode, the better the solution. Specifically, the final performance score averages the reward per episode over the last 100 episodes of training, to measure the performance at the final stage, while the learning speed score averages the extrinsic reward per episode over the entire training period, to quantify the learning efficiency.
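A minimal sketch of these two scores as described above; the values reported in Table 3 and Table 11 appear to be further rescaled to [0, 1], which this sketch does not attempt to reproduce.

import numpy as np

def final_performance_score(episode_rewards) -> float:
    """Sketch: average reward per episode over the last 100 training episodes."""
    return float(np.mean(episode_rewards[-100:]))

def learning_speed_score(episode_rewards) -> float:
    """Sketch: average reward per episode over the entire training period."""
    return float(np.mean(episode_rewards))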

The results for the final performance score and the learning speed score are shown in Table 3 and Fig. 5, respectively. In Table 3, we find that DEHRL can solve the problems in all six settings, while option-critic can only solve the two easier ones. The failure of option-critic is because the extrinsic reward becomes sparser in the last four harder cases. Besides, Table 3 shows that FeUdal fails when it is extended to 3 levels: its key idea of the “transition policy gradient” does not work well for multi-level structures, so it hardly converges in this setting. Consequently, we conclude that DEHRL also achieves a better performance than the state-of-the-art baselines when extrinsic rewards are given.

reward-level / goal-type | 1 / any | 1 / fix | 1 / random | 2 / any | 2 / fix | 2 / random
DEHRL | 1.00 | 1.00 | 1.00 | 0.95 | 0.93 | 0.81
Option-critic | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00
FeUdal | 1.00 | 1.00 | 0.93 | 0.00 | 0.00 | 0.00
PPO | 0.98 | 0.97 | 0.56 | 0.00 | 0.00 | 0.00
State Novelty | 1.00 | 0.96 | 0.95 | 0.00 | 0.00 | 0.00
Transition Novelty | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00

Table 3: Final performance score of DEHRL and baselines on OverCooked with six different extrinsic reward settings.

Figure 5: Learning speed score of DEHRL and the baselines on OverCooked with six different extrinsic reward settings.

3.2 Solving the Sparse Extrinsic Reward Problem

As option-critic fails in the sparse extrinsic reward problem, to further illustrate the advantage of our method in solving this problem, we compare DEHRL with two state-of-the-art methods with better exploration strategies, namely, state novelty [Şimşek and Barto2004] and transition novelty [Pathak et al.2017]. In addition, as previously mentioned, our framework is based on the PPO algorithm [Schulman et al.2017], so we include PPO as a baseline as well. The evaluation is again based on the final performance scores and the learning speed scores, shown in Table 3 and Fig. 5, respectively. The results show that, unlike option-critic, the three new baselines are all able to solve the task in the third setting. However, they all still fail in the last three settings.

3.3 Meta HRL

HRL has recently shown a promising ability to learn meta-subpolicies that better facilitate an adaptive behavior for new problems [Solway et al.2014, Frans et al.2018]. We compare DEHRL against the state-of-the-art MLSH framework in [Frans et al.2018] to investigate such a performance.

We first test both DEHRL and MLSH in OverCooked with reward-level=1 and goal-type=random. In order to make MLSH work properly, instead of changing the goal every episode, as originally designed for goal-type=random, the goal for MLSH is changed every 5M steps; for a fair comparison, the top-level hierarchy of DEHRL is also reset every 5M steps (same as MLSH). The episode extrinsic reward curves over 20M steps (the goal is changed five times) are shown in Fig. 6 (upper part). As expected, the episode extrinsic reward drops every time the goal is changed, since the top-level hierarchies of both methods are re-initialized. The speed at which the episode extrinsic reward increases after each reset measures how well a method learns meta-subpolicies that facilitate an adaptive behavior for a new goal. Consequently, we find that DEHRL and MLSH have a similar meta-learning performance under this setting.

Figure 6: Meta HRL performance of DEHRL and MLSH in OverCooked with reward-level=1 (upper) and reward-level=2 (lower).

In addition, we repeat the above experiment with reward-level=2, and the results are shown in Fig. 6 (lower part). We find that DEHRL produces a better meta-learning ability than MLSH. This is because DEHRL is capable of learning subpolicies at level $1$ that fetch the four different ingredients, while MLSH can only learn subpolicies that move the body towards the four different directions (similar to the subpolicies at level $0$ of DEHRL). Obviously, based on the better intertask-transferable subpolicies learned at level $1$, DEHRL resolves a new goal more easily and quickly than MLSH. Thus, this finding shows that DEHRL can learn meta-subpolicies at higher levels, which are usually better intertask-transferable than those learned by the baseline.

Figure 7: Worlds built by playing MineCraft without extrinsic reward.

3.4 Application of DEHRL in MineCraft

To show the portability of DEHRL, we further apply DEHRL in a popular video game called MineCraft, where the agent has much freedom to act and build.

In our settings, the agent has a first-person view via raw pixel input. At the beginning of each episode, the world is empty except for one layer of GRASS blocks that can be broken. We allow the agent to play 1000 steps in each episode; then the world is reset. At each step, eleven actions are available, i.e., moving towards four directions, rotating the view towards four directions, breaking a block, building a block (only one kind of block, i.e., BRICK, can be built, and any block except STONE can be broken), and jumping. Due to space limits, more detailed settings and more visualizations of the agent and this game are provided in the arXiv release.

Since the typical use of DEHRL is based on the intrinsic reward only, the existing work [Tessler et al.2017], which requires human-designed extrinsic reward signals to train subpolicies, is not applicable as a baseline. Consequently, we compare the performance of DEHRL with three different numbers of levels in MineCraft against a framework with a random policy. Fig. 7 shows the building results, where we measure the performance by world complexity. As we can see, DEHRL builds more complex worlds than the random policy. Furthermore, as the number of levels increases, DEHRL tends to build more complex worlds.

The complexity of the worlds is quantified by Valid Operation, which is computed by the following equation:

Valid Operation = Valid Builds + Valid Breaks,

where Valid Builds is the number of blocks that have been built and not broken at the end of an episode, and Valid Breaks is the number of blocks originally in the world that have been broken. Consequently, blocks that are built but broken later are counted neither into Valid Builds nor into Valid Breaks. The quantitative results in Fig. 7 are consistent with the qualitative impression described above.
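Under this reading of Valid Operation as the sum of Valid Builds and Valid Breaks, it could be computed from an episode's block events as follows; the event format used here is an illustrative assumption.

def valid_operation(events) -> int:
    """Sketch: count Valid Builds (agent-built blocks still standing at episode end)
    plus Valid Breaks (originally-present blocks that were broken). A block that is
    built and later broken contributes to neither count.

    `events` is an illustrative, ordered list of ("build" | "break", position) tuples.
    """
    built = set()            # positions of agent-built blocks currently standing
    broken_original = set()  # positions of original blocks that were broken
    for kind, pos in events:
        if kind == "build":
            built.add(pos)
        elif kind == "break":
            if pos in built:
                built.remove(pos)          # breaking an agent-built block cancels the build
            else:
                broken_original.add(pos)   # breaking a block originally in the world
    return len(built) + len(broken_original)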

Finally, since the predicted intrinsic reward $\hat{b}$ is an indication of the diversity of the current subpolicies, we plot $\hat{b}$ averaged over all levels and visualize the worlds built by DEHRL at different points of the curve in Fig. 8, so that the relationship between $\hat{b}$ and the built worlds is further illustrated.

Figure 8: Predicted intrinsic reward.

4 Related Work

Discovering diverse subpolicies in HRL. The diversity assumption is prevailing in recent works on option discovery. Among them, SAC-LSP [Haarnoja et al.2018] is the most recent work, but whether it can operate different layers at different temporal scales is an open problem. HiREPS [Daniel, Neumann, and Peters2012] is also a popular approach, but it works in a top-down fashion. Thus, it is not clear whether these two methods can be applied to sparse extrinsic reward tasks. In contrast, SNN [Florensa, Duan, and Abbeel2017] is designed to handle sparse extrinsic reward tasks and achieves the diversity assumption explicitly via information-maximizing statistics. Moreover, it seems promising to apply SNN to HRL with multiple levels. However, SNN suffers from various failures when the set of possible future states is enormous, making it impractical in domains with a large action space and unable to further learn higher-level subpolicies. Similar failure cases are observed in [Gregor, Rezende, and Wierstra2016].

Extensible HRL. Recently, there have been some attempts at increasing the number of levels in HRL. Such works include MAXQ [Dietterich2000], which requires completely searching the subtree of each subtask, leading to high computational costs. In contrast, AMDP [Gopalan et al.2017] explores only the relevant branches. However, AMDP concentrates on the planning problem. Deeper levels are also supported in [Silver and Ciosek2012], but its scalability is not clear. DDO [Fox et al.2017] and DDCO [Krishnan et al.2017] discover higher-level subpolicies from demonstration trajectories. However, our work focuses on learning such subpolicies purely end-to-end, without human-designed extrinsic rewards. Other works [Rasmussen, Voelker, and Eliasmith2017, Song et al.2018] also involve a modular structure that supports deeper-level HRL. However, there is no guarantee or verification of whether such structures can learn useful or diverse subpolicies at different temporal scales.

Meta HRL. Neuroscience research [Solway et al.2014] proposes that the optimal hierarchy is the one that best facilitates an adaptive behavior in the face of new problems. Its idea is accomplished with a verified scalability in MLSH [Frans et al.2018], where meta HRL is proposed. However, MLSH keeps reinitializing the policy at the top level, once the environment resets the goal. This brings several drawbacks, such as requiring auxiliary information from the environment about when the goal has been changed. In contrast, our method does not introduce such a restriction. Regardless of the difference, we compare with MLSH in our experiments under the settings of MLSH, where auxiliary information on goal resetting is provided. As such, the meta HRL ability of our approach is investigated.

Improved exploration with predictive models. Since we introduce the transition model to generate intrinsic rewards, our method is also related to RL improvements with predictive models, typically introducing sample models [Fu, Co-Reyes, and Levine2017], generative models [Song et al.2017], or deterministic models [Pathak et al.2017] as transition models to predict future states. However, the transition model in our DEHRL is introduced to encourage developing diverse subpolicies, while those in the above works are introduced to improve the exploration. Our method is compared with the above state novelty [Şimşek and Barto2004] and transition novelty [Pathak et al.2017] in our experiments.

5 Summary and Outlook

We have proposed DEHRL towards building extensible HRL that learns useful subpolicies over multiple levels efficiently. There are several interesting directions to explore further. One of them is to develop algorithms that generate or dynamically adjust the settings of $n^{l}$ and $T^{l}$. Furthermore, measuring the distance between states is another important direction to explore, where better representations of states may lead to improvements. Finally, DEHRL may be a promising solution for visual tasks [Xu et al.2018b] with diverse representations and mixed reward functions.

5.0.1 Acknowledgments.

This work was supported by the State Scholarship Fund awarded by China Scholarship Council and by the Alan Turing Institute under the UK EPSRC grant EP/N510129/1.

6 Supplementary Material

Hyperparameter | Value
Horizon (T) | 128
Adam stepsize |
Learning rate |
Number of epochs | 4
Minibatch size |
Discount (γ) | 0.99
GAE parameter (λ) | 0.95
Number of actors | 8
Clipping parameter (ε) |
VF coefficient (c₁) | 0.5
Entropy coefficient (c₂) | 0.01

Table 4: PPO hyperparameters used for DEHRL at each level on OverCooked.
Hyperparameter | Value
Horizon (T) | 128
Adam stepsize |
Learning rate |
Number of epochs | 4
Minibatch size |
Discount (γ) | 0.99
GAE parameter (λ) | 0.95
Number of actors | 1
Clipping parameter (ε) |
VF coefficient (c₁) | 0.5
Entropy coefficient (c₂) | 0.01

Table 5: PPO hyperparameters used for DEHRL at each level on MineCraft.
level $l$ | 0 | 1 | 2
$n^l$ | 16 | 5 | 5
$T^l$ | 1 | 1×4 | 1×4×12

Table 6: DEHRL settings on OverCooked.
level $l$ | 0 | 1 | 2 | 3 | 4 | 5
$n^l$ | 11 | 8 | 8 | 8 | 8 | 8
$T^l$ | 1 | | | | |

Table 7: DEHRL settings on MineCraft.
Input 1: current state ($s$), as a gray-scaled image; Input 2: action from level $l+1$ ($a^{l+1}$), as a one-hot vector
Conv: kernel size , number of features 16, stride 4
lRELU
Conv: kernel size , number of features 32, stride 2
lRELU
Conv: kernel size , number of features 16, stride 1
lRELU
Flatten
FC: number of features 256
lRELU
Output 1: multiple policy functions, one for each $a^{l+1}$; Output 2: multiple value functions, one for each $a^{l+1}$

Table 8: Network architecture of the policy at each level ($\pi^{l}$).
Input 1: current state ($s$), as a gray-scaled image; Input 2: action from level $l+1$ ($a^{l+1}$), as a one-hot vector
State branch (Input 1):
  Conv: kernel size , number of features 16, stride 4
  BatchNorm
  lRELU
  Conv: kernel size , number of features 32, stride 2
  BatchNorm
  lRELU
  Conv: kernel size , number of features 16, stride 1
  BatchNorm
  lRELU
  Flatten
  FC: number of features 256
Action branch (Input 2):
  FC: number of features 256
Merge (approximated multiplicative interaction):
  Dot-multiply
  FC: number of features 256
Bounty head:
  FC: number of features 1
  Output 2: predicted bounty at the lower level ($\hat{b}^{\,l}$)
State head:
  FC: number of features
  Reshape
  DeConv: kernel size , number of features 32, stride 1
  BatchNorm
  lRELU
  DeConv: kernel size , number of features 16, stride 2
  BatchNorm
  lRELU
  DeConv: kernel size , number of features 1, stride 4
  Sigmoid
  Output 1: predicted state after $T^{l+1}$ steps ($\hat{s}^{\,l+1}$)

Table 9: Network architecture of the predictor at each level ($f^{\,l+1}$).
reward-level / goal-type | 1 / any | 1 / fix | 1 / random | 2 / any | 2 / fix | 2 / random
DEHRL | 1.00 | 1.00 | 1.00 | 0.95 | 0.93 | 0.81
Option-critic [Bacon, Harb, and Precup2017] | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00
PPO [Schulman et al.2017] | 0.98 | 0.97 | 0.56 | 0.00 | 0.00 | 0.00
State Novelty [Şimşek and Barto2004] | 1.00 | 0.96 | 0.95 | 0.00 | 0.00 | 0.00
Transition Novelty [Pathak et al.2017] | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00

Table 10: Final performance score of DEHRL, Option-critic [Bacon, Harb, and Precup2017], PPO [Schulman et al.2017], State Novelty [Şimşek and Barto2004], and Transition Novelty [Pathak et al.2017] on OverCooked with 6 settings.
reward-level / goal-type | 1 / any | 1 / fix | 1 / random | 2 / any | 2 / fix | 2 / random
DEHRL | 0.92 | 0.72 | 0.71 | 0.51 | 0.43 | 0.13
Option-critic [Bacon, Harb, and Precup2017] | 0.92 | 0.82 | 0.00 | 0.00 | 0.00 | 0.00
PPO [Schulman et al.2017] | 0.81 | 0.57 | 0.06 | 0.00 | 0.00 | 0.00
State Novelty [Şimşek and Barto2004] | 0.96 | 0.64 | 0.12 | 0.00 | 0.00 | 0.00
Transition Novelty [Pathak et al.2017] | 0.95 | 0.77 | 0.42 | 0.00 | 0.00 | 0.00

Table 11: Learning speed score of DEHRL, Option-critic [Bacon, Harb, and Precup2017], PPO [Schulman et al.2017], State Novelty [Şimşek and Barto2004], and Transition Novelty [Pathak et al.2017] on OverCooked with 6 settings.
Figure 9: Episode reward curves of DEHRL on all 6 settings of OverCooked. Different colors indicate runs with different training seeds. The lighter color indicates the original curve and the darker color indicates the filtered curve.
(a) Predicted states by predictor at level 1
(b) Predicted states by predictor at level 2
Figure 10: Predicted states by predictor at each level on OverCooked.
(a) Start state (b) Break a block (c) Build a block (d) Jump on a block
Figure 11: Example states in MineCraft.

6.1 Hyperparameters

The policy at each level is trained with the Proximal Policy Optimization (PPO) algorithm [Schulman et al.2017]. Detailed settings of the hyper-parameters are shown in Tables 4 and 5 for OverCooked and MineCraft, respectively. Detailed settings of the DEHRL framework for OverCooked and MineCraft are shown in Tables 6 and 7, respectively.

6.2 Neural Network Details

The details of the network architectures for the policy and the predictor at each level are shown in Tables 8 and 9, respectively. A fully connected layer is denoted as FC and a flatten layer as Flatten. We use leaky rectified linear units (denoted as lRELU) [Maas, Hannun, and Ng2013] as the nonlinearity applied to all hidden layers in our networks. Batch normalization [Ioffe and Szegedy2015] (denoted as BatchNorm) is applied after the hidden convolutional layers (denoted as Conv) in the predictor. For the predictor at each level, the integration of the two inputs, i.e., state and action, is accomplished by approximated multiplicative interaction [Oh et al.2015] (the dot-multiply operation in Table 9), so that any predictions made by the predictor are conditioned on the action input. Deconvolutional layers (denoted as DeConv) [Zeiler, Taylor, and Fergus2011] are applied for predicting the state after $T^{l+1}$ steps.

6.2.1 Performance on OverCooked

Here we include a comparison of DEHRL against Option-critic [Bacon, Harb, and Precup2017], PPO [Schulman et al.2017], State Novelty [Şimşek and Barto2004], and Transition Novelty [Pathak et al.2017] on OverCooked with 6 settings. Table 10 shows the final performance scores and Table 11 shows the learning speed scores.

Besides, there is an interesting question to answer for SNN [Florensa, Duan, and Abbeel2017]: if SNN is guaranteed to learn different subpolicies, will it learn the 5 useful ones when provided with enough subpolicy models (i.e., a larger number of subpolicies)? We train this setting for 200M steps with 3 trials using different training seeds. Surprisingly, the best trial learns only 2 useful subpolicies. The reason is that a larger number of subpolicies makes the estimation of the mutual information easily inaccurate, since the mutual information has to be estimated for every one of the subpolicies.

Figure 9 shows the learning curves of DEHRL with three different training seeds.

Figure 10 shows the predicted states of the predictor at each level in DEHRL. Since the observations and the predictions are small gray-scaled images, it would be hard to obtain a clear visualization of the predictions. Thus, just for better visualization, the current observation is subtracted from the predictions in Figure 10 to remove the unchanged parts.

6.2.2 Performance on MineCraft

Figure 11 shows the example states observed by the agent in MineCraft.

References