Learning World Graphs to Accelerate Hierarchical Reinforcement Learning

07/01/2019 ∙ by Wenling Shang, et al. ∙ University of Amsterdam

In many real-world scenarios, an autonomous agent often encounters various tasks within a single complex environment. We propose to build a graph abstraction over the environment structure to accelerate the learning of these tasks. Here, nodes are important points of interest (pivotal states) and edges represent feasible traversals between them. Our approach has two stages. First, we jointly train a latent pivotal state model and a curiosity-driven goal-conditioned policy in a task-agnostic manner. Second, provided with the information from the world graph, a high-level Manager quickly finds solutions to new tasks and expresses subgoals in reference to pivotal states to a low-level Worker. The Worker can then also leverage the graph to easily traverse to the pivotal states of interest, even across long distances, and explore non-locally. We perform a thorough ablation study to evaluate our approach on a suite of challenging maze tasks, demonstrating significant advantages of the proposed framework over baselines that lack world graph knowledge, in terms of both performance and efficiency.




1 Introduction

Many real world scenarios require an autonomous agent to play different roles within a single complex environment. For example, Mars rovers carry out scientific objectives ranging from searching for rocks to calibrating orbiting instruments Mars . Intuitively, a good understanding of the high-level structure of its operational environment would help an agent accomplish its downstream tasks. In reality, however, both acquiring such world knowledge and effectively applying it to solve tasks are often challenging. To address these challenges, we propose a generic two-stage framework that first learns high-level world structure in the form of a simple directed weighted graph graph and then integrates it into a hierarchical policy model.

In the initial stage, we alternate between exploring the world and updating a descriptor of the world in a graph format graph , referred to as the world graph (Figure 1), in an unsupervised fashion. The nodes, termed pivotal states, are the states most critical to recovering action trajectories chatzigiorgaki2009real ; jayaraman2018time ; ghosh2018learning . In particular, given a set of trajectories, we optimize a fully differentiable recurrent variational auto-encoder chung2015recurrent ; gregor2015draw ; kingma2013auto with binary latent variables nalisnick2016stick . Each binary latent variable is designated to a state, and the prior distribution learned conditioned on that state indicates whether it belongs to the set of pivotal states. Wide-ranging and meaningful training trajectories are therefore essential to the success of the latent model. Existing world-descriptor learning frameworks often use random ha2018world or curiosity-driven trajectories azar2019world . Our exploring agent collects trajectories from both random walks and a simultaneously learned curiosity-driven goal-conditioned policy ghosh2018learning ; nair2018visual . During training, exploration is also initiated from the current set of pivotal states, similar to the "cells" in Go-Explore ecoffet2019go , except that ours are learned by the latent model instead of chosen by heuristics. The edges of the graph, extrapolated from both the trajectories and the goal-conditioned policy, correspond to actionable transitions between close-by pivotal states. Finally, the goal-conditioned policy can be used to further promote transfer learning on downstream tasks taylor2009transfer .

At first glance, the world graph seems suited to model-based RL littman1996algorithms ; kaiser2019model , but our method emphasizes connections among neighboring pivotal states rather than transitions between arbitrary pairs, which is usually a much harder problem gu2016continuous . Therefore, in the next stage, we propose a hierarchical reinforcement learning kulkarni2016hierarchical ; marthi2005concurrent (HRL) approach that incorporates the world graph to solve specific downstream tasks. Concretely, within the paradigm of goal-conditioned HRL dayan1993feudal ; nachum2018data ; vezhnevets2017feudal ; levy2017hierarchical , our approach innovates on how the high-level Manager provides goals and how the low-level Worker navigates. Instead of sending out a single objective, the Manager first selects a pivotal state from the world graph and then specifies a final goal within a nearby neighborhood of the selected pivotal state. We refer to this sequential selection as the Wide-then-Narrow (WN) instruction. This construction lets us utilize the information from the learned graph descriptor and form passages between pivotal states through graph traversal techniques bertsekas1995dynamic , thanks to which the Worker can focus on local objectives. Lastly, as mentioned above, the goal-conditioned policy derived from learning the world graph can serve as an initialization for the Manager and Worker, allowing fast skill transfer to new tasks, as demonstrated by our experiments.

In summary, our main contributions are:


  • A complete two-stage framework for 1) unsupervised world graph discovery and 2) accelerated HRL by integrating the graph.

  • The first stage proposes an unsupervised module to learn world graphs, including a novel recurrent differentiable binary latent model and a curiosity-driven goal-conditioned policy.

  • The second stage proposes a general HRL scheme with novel components such as the Wide-then-Narrow instruction, navigation via world graph traversal and skill transfer from goal-conditioned policy models.

  • Quantitative and qualitative empirical findings over a complex 2D maze domain show that our proposed framework 1) produces a graph descriptor representative of the world and 2) improves both sample efficiency and final performance in solving downstream tasks by a large margin over baselines that lack the descriptor.

Figure 1: Top Left: Overall pipeline of our proposed 2-stage framework. Top Right (world graph discovery): a subgraph exemplifies how to forge edges and traverse between pivotal states (in blue). Bottom (Hierarchical RL): an example rollout from our proposed HRL policy with Wide-then-Narrow Manager instructions and world graph traversals, solving a challenging Door-Key task.

2 Environment

For ease of clear exposition and scientific control, we choose finite, fully observable yet complex 2D mazes gym_minigrid as our testbeds, i.e. the state and action spaces, as well as their transitions, are finite. More involved environments can introduce interfering factors that shadow the effects of the proposed method, e.g. the need for a well-calibrated latent goal space higgins2017scan ; zach2019hierarchical ; nachum2018near . Section 7 briefly speculates on extensions of our framework to other environments as future directions. We employ 3 mazes of small, medium and large sizes with varying compositions (see the Appendix for visualization). Despite being finite and fully observable, these mazes still pose significant challenges, especially when the environment becomes large, involves stochasticity, provides only sparse reward or requires more complicated logic. The maze states received by the agent take the form of bird's-eye view matrix representations. More details on preprocessing are available in the Appendix.

Figure 2: Our recurrent latent model with differentiable binary latent units for discovering pivotal states. A prior network (left) learns a state-conditioned prior as a Beta distribution. An inference encoder learns an approximate posterior in the HardKuma distribution basting2019interpretable , inferred from the states and actions. A generation decoder reconstructs the action sequence from the states selected by the binary latent variables. During training, we sample from the posterior using the reparametrization trick kingma2013auto .

3 World Graph Discovery

We envision a simple directed weighted graph graph to capture the high-level structure of the world. Its nodes are a set of points of interest, termed pivotal states, and its edges represent transitions between them. Drawing intuition from unsupervised sequence segmentation chatzigiorgaki2009real ; jayaraman2018time and imitation learning abbeel2004apprenticeship ; hussein2017imitation , we define the pivotal states as the states most critical to recovering action sequences generated by some agent, indicating that these states lead to the most information gain azar2019world . In other words, given a trajectory, we learn to identify the subset of states most critical for accurately inferring the action sequence taken along the full trajectory.

Supposing state-action trajectories are available, we formulate a recurrent variational inference model (Section 3.1) blei2017variational ; chung2015recurrent ; gregor2015draw ; kingma2013auto , treating the action sequences as evidence and, for each state in the sequence, inferring a binary latent variable that controls whether to keep that state for action reconstruction. We learn a prior over each latent variable conditioned on its designated state, as opposed to using a fixed prior or conditioning on the surrounding trajectory, and use the prior mean as the criterion for including the state among the pivotal states.

A meaningful set of pivotal states can only be learned from meaningful trajectories; hence, we develop a procedure that alternates between updating the latent model and improving trajectory collection. When collecting training trajectories, we place the agent at a state from the current iteration's set of pivotal states, which is possible since the agent can straightforwardly document and reuse the paths from its initial position to those states. This naturally allows the exploration starting points to expand as the agent discovers more of its environment. Random-walk trajectories tend to be noisy and thus potentially irrelevant to real tasks. We instead take inspiration from prior work on actionable representations ghosh2018learning and learn a goal-conditioned policy for navigating between close-by states, reusing the observed trajectories for unsupervised learning (Section 3.2). To ensure broad state coverage and diverse trajectories, we add a curiosity reward derived from the unsupervised action-reconstruction error when learning this policy. The latent model is then updated with the new trajectories. This cycle repeats until the action-reconstruction accuracy plateaus. To form the edges of the graph, we again use both random trajectories and the goal-conditioned policy (Section 3.3). Lastly, the implicit knowledge of the world embedded in the goal-conditioned policy can be further transferred to downstream tasks through weight initialization, discussed later in Section 4.4.

The pseudo-code for world graph discovery, implementation details, a visualization of how the pivotal states progress over training, and the final sets obtained from different rollout policies are provided in the Appendix. The following sections describe each component of the proposed process in detail.

3.1 Recurrent Variational Model with Differentiable Binary Latent Variables

We propose a recurrent variational model with differentiable binary latent variables to discover pivotal states (Figure 2). Given a trajectory of states $s_{1:T}$ and actions $a_{1:T}$, we treat the action sequence as evidence in order to infer a sequence of binary latent variables $z_{1:T}$. The evidence lower bound is

$$\log p(a_{1:T} \mid s_{1:T}) \ge \mathbb{E}_{q(z_{1:T} \mid a_{1:T}, s_{1:T})}\big[\log p(a_{1:T} \mid s_{1:T}, z_{1:T})\big] - D_{\mathrm{KL}}\big(q(z_{1:T} \mid a_{1:T}, s_{1:T}) \,\|\, p(z_{1:T} \mid s_{1:T})\big).$$

The objective is to reconstruct the action sequence given only the states where $z_t = 1$, with the boundary states always given. To ensure differentiability, we opt for a continuous relaxation of the discrete binary latent variables, learning a Beta distribution as the prior for each $z_t$ russo2018tutorial . Moreover, we learn the prior for each $z_t$ conditioned on its associated state (Figure 2). The prior mean for each state signifies on average how necessary that state is for action reconstruction. The KL-divergence term above between the approximate posterior and the learned prior also encourages similar trajectories to pick the same states for action reconstruction. We define the set of pivotal states as the top 20 states ranked by their learned prior means.

The approximate posteriors follow the Hard Kumaraswamy distribution basting2019interpretable , which resembles the Beta distribution but lies outside the exponential family. This choice allows us to sample exact 0's and 1's without sacrificing differentiability, accomplished via the stretch-and-rectify procedure basting2019interpretable ; louizos2017learning . The simple CDF of the Kuma distribution also makes the reparameterization trick easily applicable kingma2013auto ; rezende2014stochastic ; maddison2016concrete . Lastly, the KL-divergence between the Kuma and Beta distributions can be approximated in closed form nalisnick2016stick . We fix one of the Beta shape parameters to ease optimization, since the Kuma and Beta distributions coincide when a shape parameter equals 1.
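The stretch-and-rectify sampling can be sketched in a few lines. In this minimal sketch, the function name is ours, and the stretch bounds l = -0.1 and r = 1.1 are common illustrative choices, not necessarily the paper's exact values:

```python
import random

def sample_hardkuma(a, b, n, l=-0.1, r=1.1, seed=0):
    """Stretch-and-rectify sampling from a Hard Kumaraswamy distribution.

    1. Sample Kumaraswamy(a, b) via its inverse CDF (reparameterizable).
    2. Stretch the support from (0, 1) to (l, r) with l < 0 < 1 < r.
    3. Rectify by clamping to [0, 1], creating point masses at exactly 0 and 1.
    """
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        u = rng.uniform(1e-6, 1.0 - 1e-6)
        # Inverse CDF of Kumaraswamy(a, b): F(x) = 1 - (1 - x^a)^b
        k = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)
        stretched = l + (r - l) * k
        out.append(min(1.0, max(0.0, stretched)))
    return out

samples = sample_hardkuma(a=0.5, b=0.5, n=10000)
```

The rectification step is what yields exact 0's and 1's with nonzero probability, letting the model make hard keep/drop decisions on states while gradients still flow through the interior samples.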

So far, nothing constrains the model from selecting all states to reconstruct the actions. To introduce a selection bottleneck, we impose a regularization on the expected L0 norm of the latent variables to promote sparsity at a targeted value louizos2017learning ; basting2019interpretable . In other words, this objective constrains the expected number of activated latent variables given a sequence of a given length. Another similarly constructed transition regularization encourages isolated activations by constraining the expected number of transitions between 0's and 1's along the sequence. Note that both expectations have closed forms for HardKuma.


Lagrangian Relaxation.

The overall optimization objective combines action-sequence reconstruction, the KL-divergence, and the two regularization terms. We tune the objective weights using Lagrangian relaxation higgins2017beta ; basting2019interpretable ; bertsekas1999nonlinear , treating the Lagrangian multipliers as learnable parameters and alternating optimization between the multipliers and the model parameters. We observe that, as long as their initialization is within a reasonable range, the multipliers converge to a local optimum autonomously.

Our finalized latent model allows efficient and stable mini-batch training. Alternative designs, such as a Poisson prior kipf2018compositional for the latent space or a Transformer vaswani2017attention for sequence modeling, are possibilities for future investigation. More details on the latent model can be found in the Appendix.
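The alternating scheme can be illustrated on a toy constrained problem; everything below (the objective f(x) = x², the constraint x = 1, step sizes) is our own illustrative choice, not the paper's actual objective:

```python
def lagrangian_descent_ascent(steps=2000, lr_theta=0.1, lr_lambda=0.1):
    """Toy Lagrangian relaxation: minimize f(x) = x^2 subject to x = 1.

    L(x, lam) = x^2 + lam * (x - 1). We alternate gradient descent on the
    "model parameter" x with gradient ascent on the multiplier lam, mirroring
    how objective weights are treated as learnable parameters in the text.
    """
    x, lam = 0.0, 0.0
    for _ in range(steps):
        x -= lr_theta * (2.0 * x + lam)   # descent step on dL/dx
        lam += lr_lambda * (x - 1.0)      # ascent step on dL/dlam enforces x -> 1
    return x, lam

x, lam = lagrangian_descent_ascent()
# x converges to the constrained optimum 1, and lam approaches -2,
# the multiplier at which the two gradients balance.
```

The multiplier acts like an automatically tuned penalty weight: it grows in magnitude exactly until the constraint is satisfied, which is why reasonable initializations converge autonomously.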

3.2 Curiosity-Driven Goal-Conditioned Agent

A goal-conditioned policy is trained to reach a given goal state from the current state ghosh2018roman2018learning . For large state spaces, training a goal-conditioned policy to navigate between arbitrary pairs of states is non-trivial. However, our use-cases (trajectory generation for unsupervised learning and navigation between nearby pivotal states in downstream tasks) only require the policy to reach goals over a short range. We train such an A2C-based goal-conditioned policy by sampling goals from the end points of random walks of reasonable length from a given starting state. Inspired by the success of intrinsic motivation methods, in particular curiosity burda2018large ; achiam2017surprise ; pathak2017curiosity ; azar2019world , we leverage the readily available action-reconstruction errors from the generative decoder as intrinsic reward signals to boost exploration when training the policy. Pseudo-code describing this method is in the Appendix.
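The reward shaping described above can be sketched as a single function; the function name, the bonus values and the curiosity weight are illustrative assumptions, not the paper's tuned constants:

```python
def curiosity_reward(reached_goal, recon_error, beta=0.1,
                     goal_bonus=1.0, step_penalty=-0.01):
    """Shaped per-step reward for training a goal-conditioned policy.

    The intrinsic term reuses the action-reconstruction error of the latent
    model: states the decoder reconstructs poorly are novel, so visiting
    them earns a curiosity bonus that widens exploration.
    """
    extrinsic = step_penalty + (goal_bonus if reached_goal else 0.0)
    intrinsic = beta * recon_error
    return extrinsic + intrinsic
```

Because the reconstruction error shrinks as the latent model improves, the curiosity bonus naturally anneals over the alternating training cycle.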

3.3 Edge Connections

The last crucial step in completing the world graph is building the edge connections. After finalizing the set of pivotal states, we perform random walks from them to discover the underlying adjacency matrix graph connecting individual pivotal states. More precisely, we claim a directed edge between two pivotal states if there exists a random-walk trajectory from one to the other that does not intersect a third pivotal state. We then collect the shortest such actionable paths as the edge paths, each further refined by the goal-conditioned policy when feasible. The action-sequence length of the edge path between adjacent pivotal states defines the weight of that edge. Traversals between pivotal states are planned based on the weight information using dynamic programming sutton1998introduction ; feng2004dynamic . In deterministic environments, the agent can simply follow the action sequence attached to each edge to transition between pivotal states. When the environment is stochastic, the agent instead traverses by following the goal-conditioned policy (see Section 4.3). Planning in this case could potentially be improved by probabilistically or functionally encoding the edge weights yamaguchi2016neural ; ross2014introduction , which we leave for future work.
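The edge-claiming rule above can be sketched directly over state sequences; the function name and the input format (plain lists of visited states) are simplifying assumptions, since the paper's trajectories also carry actions:

```python
def build_world_graph_edges(trajectories, pivotal):
    """Extract directed, weighted edges between pivotal states.

    trajectories: list of state sequences from random walks.
    pivotal: set of pivotal states.
    A directed edge u -> v is claimed when some trajectory segment runs from
    u to v without visiting a third pivotal state; its weight is the length
    of the shortest such segment (the edge path).
    """
    edges = {}  # (u, v) -> shortest segment length seen so far
    for states in trajectories:
        last_pivot, last_idx = None, None
        for i, s in enumerate(states):
            if s not in pivotal:
                continue
            if last_pivot is not None and s != last_pivot:
                key, length = (last_pivot, s), i - last_idx
                if length < edges.get(key, float("inf")):
                    edges[key] = length
            # Restart the segment at every pivotal visit, so segments between
            # consecutive pivotal states never contain a third pivotal state.
            last_pivot, last_idx = s, i
    return edges
```

Restarting the segment at every pivotal visit enforces the "no third pivotal state" condition for free, and keeping the minimum length per ordered pair yields the shortest edge path.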

Figure 3: Left: a general configuration of a Feudal Network; the Manager and Worker are both A2C-LSTMs operating at different temporal resolutions. Right: the proposed Wide-then-Narrow Manager instruction, where the Manager first outputs a wide goal from a pre-defined set of candidate states, e.g. the pivotal states, and then zooms its attention to a close-up area around the wide goal to narrow down the final subgoal.

4 Accelerated Hierarchical Reinforcement Learning

We now introduce a hierarchical reinforcement learning kulkarni2016hierarchical ; marthi2005concurrent (HRL) framework that leverages the world graph to accelerate learning downstream tasks. This framework has three core features:


  • the Manager uses two-step “Wide-then-Narrow” goal descriptions (Section 4.2),

  • the Worker traverses the learned world graph at appropriate times (Section 4.3),

  • the goal-conditioned policy learned in the graph discovery stage is used for weight initialization for the Worker and Manager (Section 4.4).

We show that our method learns to solve new tasks significantly faster and better compared to related baselines (Section 4.1). For implementation details, see the Appendix.

4.1 Preliminaries and Hierarchical Reinforcement Learning

Formally, we consider a Markov Decision Process where, at time t, an agent in state s_t executes an action a_t via a policy and receives a reward r_t. The agent's goal is to maximize its cumulative expected return under the environment's transition and initial-state distributions. To solve this problem, we consider a model-free, on-policy baseline, the advantage actor-critic (A2C) wu2016training ; pane2016actor ; mnih2016asynchronous , and its hierarchical extension, the Feudal Network (FN) dayan1993feudal ; vezhnevets2017feudal (Figure 3). At its core, A2C models both a value function, trained by regressing over estimated n-step discounted returns with a discount rate, and a policy guided by advantage-based policy gradients schulman2015high . The policy entropy is regularized to encourage exploration. A2C's hierarchical extension FN consists of a high-level controller (the "Manager"), which learns to propose subgoals, and a low-level controller (the "Worker"), which learns to complete them. The Manager receives rewards from the environment, while the Worker receives rewards from the Manager for reaching its subgoals. The high- and low-level policy models are distinct and operate at different temporal resolutions: the Manager only outputs a new subgoal when the Worker either completes its current one or exceeds the subgoal horizon. In this work, we mainly consider finite, discrete and fully observable mazes. As such, FN can use any state as a subgoal, and the Manager policy can emit a probability vector with one entry per state, although our framework supports more general subgoal definitions. More implementation details and pseudo-code for our baselines are in the Appendix.
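The critic's regression target can be made concrete with a minimal sketch of n-step discounted returns and advantages; the function names and the toy numbers in the usage note are ours:

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted n-step returns used as regression targets for the A2C critic.

    rewards: r_t for t = 0..n-1 from a rollout; bootstrap_value: V(s_n).
    Computed backwards via R_t = r_t + gamma * R_{t+1}, with R_n = V(s_n).
    """
    returns = []
    running = bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

def advantages(returns, values):
    """Advantage estimates A_t = R_t - V(s_t) weighting the policy gradient."""
    return [R - v for R, v in zip(returns, values)]
```

For example, with rewards [1, 0, 1], a bootstrap value of 0.5 and gamma = 0.9, the returns are [2.1745, 1.305, 1.45], and subtracting the critic's value estimates gives the advantages that scale the policy gradient.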

4.2 World Graph Nodes for Wide-then-Narrow Manager Instructions

To connect the learned graph to the HRL framework, the Manager needs the ability to designate any state as a subgoal while still using the abstraction provided by the graph. To that end, we structure the Manager's output in a Wide-then-Narrow (WN) format:

  1. Given a pre-defined set of candidate states, the Manager uses a "wide" policy to output a "wide" subgoal from that set. This work proposes that the wide goals come from the learned pivotal states.

  2. Next, the Manager zooms its attention to a local area around the wide goal. A "narrow" policy then selects a final "narrow" goal from this neighborhood, using both global and local information. Both goals are passed to the Worker, which is rewarded for reaching the narrow goal within the horizon.
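The two-step selection can be sketched as follows; the uniform narrow policy and all names here are simplifying assumptions (the actual narrow policy is a learned network conditioned on global and local views, not a uniform draw):

```python
import random

def wide_then_narrow(wide_probs, neighborhoods, seed=0):
    """Sample a (wide, narrow) goal pair in the WN instruction format.

    wide_probs: dict pivotal_state -> probability from the wide policy.
    neighborhoods: dict pivotal_state -> list of nearby candidate states.
    The narrow policy is stubbed as uniform over the chosen neighborhood.
    """
    rng = random.Random(seed)
    states, probs = zip(*sorted(wide_probs.items()))
    g_wide = rng.choices(states, weights=probs, k=1)[0]   # pick a pivotal state
    g_narrow = rng.choice(neighborhoods[g_wide])          # refine within its area
    return g_wide, g_narrow
```

Factoring the subgoal this way keeps each decision small: the wide policy chooses among a handful of pivotal states, and the narrow policy only ranks the local neighborhood, instead of one flat distribution over every state in the maze.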

Using the WN goal format, the policy gradient for the factored Manager policy becomes

$$\nabla_\theta J = \mathbb{E}\big[A^M_t\big(\nabla_\theta \log \pi^{w}(g_w \mid s_t) + \nabla_\theta \log \pi^{n}(g_n \mid s_t, g_w)\big)\big],$$

where $A^M_t$ is the Manager's advantage at time t. Since the size of the joint action space scales with the number of states, the exact entropy of the Manager policy can easily become intractable (see the Appendix). Thus, in practice, we resort to an effective approximate alternative.

4.3 Using World Graph Edges for Traversal

With pivotal states serving as wide-goals, we can effectively take advantage of the edges in the world graph through graph traversals:

  1. When to Traverse: when the Worker is given a wide-narrow goal pair, it can traverse the world graph whenever it encounters a pivotal state that has a feasible connection to the wide goal.

  2. Planning: we estimate the optimal traversal route to the wide goal based on the edge weights. Here we use classic dynamic-programming planning methods sutton1998introduction ; feng2004dynamic , although other (learned) methods can be applied.

  3. Execution: once the route is planned, in deterministic environments the agent simply follows the action sequences stored on the edge paths. In stochastic environments, we either disallow the agent from following a route that has become blocked and expect the Manager to adapt accordingly (e.g. in Door-Key), or rely on the goal-conditioned policy to navigate between pivotal states (e.g. in MultiGoal-Stochastic).

During learning, the goal-conditioned policy can be simultaneously fine-tuned to adapt to task-specific environment stochasticity. When traversing under this policy, if the agent fails to reach the next target pivotal state within a certain time limit, it simply abandons its current pursuit. The benefit of world graph traversal is that it allows the Manager to assign more task-relevant goals that may be far from the agent's position yet easily reachable by leveraging the connectivity knowledge of the world. In this way, we speed up learning by focusing low-level exploration on only the relevant parts of the world, i.e., around the highly task-relevant pivotal states.
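The planning step can be sketched with a standard shortest-path routine over the weighted edges; Dijkstra's algorithm here stands in for the dynamic-programming planner cited above, and the edge-dictionary format is our own assumption:

```python
import heapq

def plan_traversal(edges, start, goal):
    """Shortest pivotal-state route over the weighted world graph.

    edges: dict (u, v) -> edge weight (action-sequence length of the edge path).
    Returns the list of pivotal states from start to goal, or None if the
    goal is unreachable.
    """
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
    dist, prev = {start: 0}, {}
    heap = [(0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        return None
    route, node = [goal], goal
    while node != start:
        node = prev[node]
        route.append(node)
    return list(reversed(route))
```

Because edge weights are action-sequence lengths, the returned route minimizes the number of primitive actions the Worker spends in transit, and each hop on the route maps to a stored edge path (or a short goal-conditioned rollout in stochastic environments).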

4.4 Knowledge Transfer through Goal-Conditioned Policy Initialization

Lastly, we transfer the implicit knowledge of the world acquired by the goal-conditioned policy during world graph discovery to the subsequent HRL training. Transferring and generalizing skills between RL tasks often leads to performance gains taylor2009transfer ; barreto2017successor , and goal-conditioned policies have been shown to capture the underlying structure of the environment well ghosh2018learning . Additionally, optimizing a multi-network system like HRL co2018self is sensitive to weight initialization mishkin2015all ; le2015simple , due to its complexity and lack of clear supervision. Therefore, taking inspiration from the prevailing pre-training procedures in computer vision russakovsky2015imagenet ; donahue2014decaf and NLP devlin2018bert ; radford2019language , we achieve implicit skill transfer and improved optimization by initializing the task-specific Worker and Manager with the weights of the goal-conditioned policy. Our empirical results provide strong evidence that such initialization is an essential basis for solving challenging RL tasks later on, analogous to similar practices in other domains.

[Table 1: the numeric entries were lost in extraction. The rows compare A2C, FN, FN with goal-conditioned initialization, and WN variants (with and without initialization and traversal) across small, medium and large mazes on MultiGoal Dense Reward, MultiGoal Sparse Reward and Stochastic MultiGoal, plus a Door-Key block with init/traversal columns; "Fail" marks configurations whose validation reward never rose above 0.]
Table 1: Top: results on MultiGoal and its sparse/stochastic versions (average reward after 100k training iterations). Bottom: results on Door-Key (average success rate with standard deviation), omitting suboptimal models for clearer presentation. Note that WN with neither init nor traversal fails across tasks and is thus not displayed here. For full results, including standard deviations, see the Appendix.

5 Experiments

We validate the effectiveness and assess the impact of each proposed component in a thorough ablation study on a set of 4 challenging maze tasks with different reward structures, levels of stochasticity and logic. Furthermore, we evaluate each task in three different mazes of increasing sizes (small, medium and large). Implementation details, snippets of the tasks and mazes are in the Appendix.

In all tasks, every action taken by the agent receives a negative reward penalty. The other specifics of each task are:


  • In MultiGoal, the agent needs to collect 5 randomly spawned balls and exit from a designated exit point. Reaching each ball or the exit point gives a positive reward.

  • Its sparse version, MultiGoal-Sparse, only gives a single reward proportional to the number of balls collected upon exiting.

  • Its stochastic version, MultiGoal-Stochastic, spawns a lava block at a random location each time step that immediately terminates the episode with a negative reward if stepped on.

  • Door-Key is a much more difficult task that adds new actions ("pick" and "open") and new objects to the environment (additional walls, doors, keys). The agent needs to pick up the key, open the door and reach the exit point on the other side, receiving a positive reward for opening the door and for exiting.

Control Experiments

We ablate each proposed component and compare against the non-hierarchical and hierarchical baselines, A2C and FN. The proposed components are always added on top of FN.


  • First, we test initializing the Manager and Worker with the weights of the goal-conditioned policy.

  • Next, we evaluate WN with 3 different sets of candidate states for the Manager to pick from: all valid states, uniformly sampled random states, and the learned pivotal states. The random and pivotal sets are of the same size, and their edge connections are obtained in the same way (Section 3.3). (The edge connections for the all-states set are a trivial case excluded here, as every state is one step away from its adjacent states. Also, note that guaranteed state access is not normally available to the goal-conditioned policy when forming edge connections, but we grant all pre-requisites for the fairest comparison possible.)

  • Finally, we enable traversal on top of WN. When traversal is done through the goal-conditioned policy, we also refine it alongside HRL training (if it was used for initialization) or learn one from scratch (if not).

We inherit most hyperparameters from the training of the goal-conditioned policy in the world graph discovery stage, as the Manager and the Worker share a similar architecture with it. The hyperparameters of the goal-conditioned policy in turn follow those from shang2018stochastic . For details, see the Appendix. We follow a rigorous evaluation protocol acknowledging the variability of outcomes in deep reinforcement learning henderson2018deep : each experiment is repeated with 3 training seeds wu2017scalable ; ostrovski2017count , 10 additional validation seeds are used to pick the best model, and that model is then tested on 100 testing seeds. Means of selected testing results are in Table 1. We omit results for experimental setups that consistently fail on all tasks, meaning training either never takes off or validation rewards never rise above 0. See the Appendix for the full report, including standard deviations over the 3 seeds.

5.1 Empirical Analysis

Initialization with the Goal-Conditioned Policy

Table 1 and Figure 4 show that initialization with the goal-conditioned policy is crucial across all tasks, especially for the hierarchical models; e.g. a randomly initialized A2C outperforms a randomly initialized FN on small-maze MultiGoal. Models trained from scratch fail on almost all tasks within the maximal number of training iterations, unless coupled with traversal, which is still inferior to using the initialization.

Comparing A2C, FN and WN suggests WN is a highly effective way to structure Manager subgoals. For example, on small MultiGoal, WN surpasses FN by a large margin. We posit that the Manager tends to select from a certain smaller subset of pivotal states, simplifying the learning of transitions between them for the Worker. As a result, the Worker can focus on solving local objectives. The same reasoning conceivably explains why traversal does not yield performance gains on small and medium MultiGoal; for instance, on small MultiGoal, WN without traversal scores slightly higher than with traversal. Once mazes become large, however, the Worker struggles to master long traversals on its own and starts to fail the tasks.

World Graph Traversal

In the cases described above, the addition of world graph traversal plays an essential role, e.g. for large MultiGoal. As conjectured in Section 4.3, this phenomenon can be explained by the much-expanded exploration range and by lifting the burden of learning long-distance transitions off the Worker. Moreover, Figure 4 confirms another conjecture from Section 4.3: traversal speeds up convergence, more evidently so in larger mazes. Lastly, in Door-Key, the agent needs to plan and execute a particular combination of actions. The huge discrepancy on medium Door-Key between using traversal and not suggests that traversal indeed improves long-horizon planning.

Benefit of Learned Pivotal States

Comparing the learned pivotal states against uniformly sampled ones reveals the quality of the pivotal states identified by the latent model. Overall, the learned set either performs better (particularly in non-deterministic environments) or comparably, but with much less variance between different seeds. If one is lucky enough to pick a set of random states suitable for a task, it can deliver great results, but the opposite is equally possible. In addition, edge formation over the random set still depends on the products of learning the world graph, so using the pivotal states with their coupled goal-conditioned policy is preferable to random states.

Figure 4: Validation curves during training (mean and standard deviation of reward over 3 seeds) for MultiGoal. Left: comparison between learned pivotal states and uniformly sampled states, with and without traversal; all models here use WN and goal-conditioned initialization. Observe that (1) traversal evidently speeds up convergence and (2) random states carry higher variance and slightly inferior performance compared to pivotal states. Right: comparison with and without goal-conditioned initialization; all models use WN. Initialization shows a clear advantage.

6 Related Works

Pivotal state discovery is related to unsupervised sequence segmentation chatzigiorgaki2009real ; jayaraman2018time ; pertsch2019keyin ; blei2001topic ; chan2016latent and to option or sub-task discovery in the context of RL jinnai2019discovering ; bacon2017option ; niekum2013incremental ; fox2017multi ; leon2016options ; kroemer2015towards ; kipf2018compositional . Among them, both kipf2018compositional and pertsch2019keyin employ sequential variational models to infer either task boundaries or key frames of demonstrations in an unsupervised manner, followed by applications to hierarchical RL or planning. Beyond technical differences, rather than finding turning points of individual sequences, our module aims to identify a set of landmark states that together represent the world well. Also, our training examples come from carefully orchestrated exploration rather than human demonstrations.

Understanding the world is central to planning, control and (model-based) RL. In robotics, one often needs to locate or navigate itself by interpreting a map of the world lowry2015visual ; thrun1998learning ; angeli2008fast . Our exploration strategy borrows the high-level insight from robotics active localization, where robots are actively guided to investigate unfamiliar regions by humans fox1998active ; li2016active . Another direction in this area is to learn a world model azar2019world ; ha2018world ; guo2018neural that generates latent states tian2017latent ; haarnoja2018latent ; racaniere2017imagination . If the dynamics of the world are also learned, then it can be applied to planning mnih2016strategic ; hafner2018learning or model-based RL gregor2018temporal ; kaiser2019model . Although involving generative modeling, our framework differentiates itself through the functionality of our binary latent variables—indicators of whether a state, regardless of its representation, is a pivotal state.

The policy-learning phase in our framework uses the paradigm of goal-conditioned HRL levy2017hierarchical ; dayan1993feudal ; nachum2018data ; vezhnevets2017feudal . In addition, the WN mechanism borrows ideas from attentive object understanding fritz2004attentive ; ba2014multiple ; you2016image in vision. World graph traversal is inspired by classic optimal planning in Markov Decision Processes with dynamic programming bertsekas1995dynamic ; feng2004dynamic ; weiss1960dynamic ; sutton1998introduction . Lastly, initialization with goal-conditioned policies to transfer knowledge is inspired by transfer learning donahue2014decaf ; taylor2009transfer and skill generalization in RL barreto2017successor ; hausman2018learning ; ghosh2018learning .

Concurrently, eysenbach2019search also proposes to plan a sequence of subgoals leading to a final destination according to a graph abstraction of the world obtained via a goal-conditioned policy. However, under a different problem setup and use-case assumptions, the nodes in eysenbach2019search are not learned pivotal states but come directly from a replay buffer; similarly, their graph is task-specific, whereas ours is designed to assist a variety of downstream tasks.

7 Conclusion and Future Work

We propose a general two-stage framework that learns a concise world abstraction, a simple directed graph with weighted edges, to facilitate HRL on a variety of downstream tasks. Our thorough ablation studies on several challenging maze tasks show the clear advantage of each proposed component of our framework.

The framework can be extended to other types of environments, such as partially observable and high-dimensional ones, through, e.g., probabilistic or differentiable planning kaelbling1998planning ; yamaguchi2016neural ; ross2014introduction , latent embedding of (belief) states guo2018neural and goals nachum2018near . Other directions are to adapt our framework to evolving or constantly changing environments, through, e.g., meta-learning finn2017model , and to include off-policy methods for better sample efficiency. Finally, the learned world graphs can potentially be applied beyond HRL, e.g., to multi-task RL hessel2018multi , to structured exploration using pivotal states as checkpoints ecoffet2019go , or to multi-agent settings bucsoniu2010multi ; hu1998multiagent .