Mapping State Space using Landmarks for Universal Goal Reaching

08/15/2019 ∙ by Zhiao Huang, et al. ∙ University of California, San Diego

An agent that has well understood the environment should be able to apply its skills for any given goals, leading to the fundamental problem of learning the Universal Value Function Approximator (UVFA). A UVFA learns to predict the cumulative rewards between all state-goal pairs. However, empirically, the value function for long-range goals is hard to estimate and may consequently result in a failed policy. This presents challenges to the learning process and the capacity of neural networks. We propose a method to address this issue in large MDPs with sparse rewards, in which exploration and routing across remote states are both extremely challenging. Our method explicitly models the environment in a hierarchical manner, with a high-level dynamic landmark-based map abstracting the visited state space, and a low-level value network to derive precise local decisions. We use farthest point sampling to select landmark states from past experience, which improves exploration compared with simple uniform sampling. Experimentally, we show that our method enables the agent to reach long-range goals at the early training stage and achieves better performance than standard RL algorithms on a number of challenging tasks.




1 Introduction

Reinforcement learning (RL) allows training agents for planning and control tasks using feedback from the environment. While significant progress has been made in the standard setting of achieving a goal known at training time, e.g., to reach a given flag as in MountainCar Moore (1990), very limited effort has been devoted to the setting where goals at evaluation are unknown at training time. For example, when a robot walks in an environment, the destination may vary from time to time. Tasks of this kind are ubiquitous and of crucial importance in practice. We call them Universal Markov Decision Process (UMDP) problems following the convention of Levy et al. (2018).

Pioneering work handles UMDP problems by learning a Universal Value Function Approximator (UVFA). In particular, Schaul et al. (2015) proposed to approximate a goal-conditioned value function $V(s, g)$ (where $s$ is the current state and $g$ is the goal) by a multi-layer perceptron (MLP), and Andrychowicz et al. (2017) proposed a framework called hindsight experience replay (HER) to smartly reuse past experience to fit the universal value function by TD-loss. However, for complicated policies with long horizons, the UVFA learned by networks is often not good enough. This is because UVFA has to memorize the cumulative reward between all the state-goal pairs, which is a daunting job. In fact, the cardinality of state-goal pairs grows as a high-order polynomial in the horizon of goals.

While the general UMDP problem is extremely difficult, we consider a family of UMDP problems whose state space is a low-dimension manifold in the ambient space. Most control problems are of this type and geometric control theory has been developed in the literature Bullo and Lewis (2004). Our approach is inspired by manifold learning, e.g., Landmark MDS De Silva and Tenenbaum (2004). We abstract the state space as a small-scale map, whose nodes are landmark states selected from the experience replay buffer, and edges connect nearby nodes with weights extracted from the learned local UVFA. A network is still used to fit the local UVFA accurately. The map allows us to run high-level planning using pairwise shortest path algorithm, and the local UVFA network allows us to derive an accurate local decision. For a long-term goal, we first use the local UVFA network to direct to a nearby landmark, then route among landmarks using the map towards the goal, and finally reach the goal from the last landmark using the local UVFA network.

Our method has improved sample efficiency over a purely network-learned UVFA. There are three main reasons. First, the UVFA estimator in our framework only needs to work well for local value estimation. The network does not need to memorize values for faraway goals, so its load is alleviated. Second, for long-range state-goal pairs, the map allows propagating accurate local value estimations in a way that neural networks cannot achieve. Consider the extreme case of a long-range state-goal pair never experienced before. A network can only guess the value by extrapolation, which is known to be unreliable. Our map, however, can reasonably approximate the value as long as there is a path through landmarks to connect them. Lastly, the map provides a strong exploration ability and can help to obtain rewards significantly earlier, especially in the sparse reward setting. This is because we choose the landmarks from the replay buffer using a farthest-point sampling strategy, which tends to select states that are closer to the boundary of the visited space. In experiments, we compare our method with baselines on several challenging environments and find that it outperforms them.

Our contributions are as follows. First, we propose a sample-based method to map the visited state space using landmarks. Such a graph-like map is a powerful representation of the environment that maintains both local connectivity and global topology. Second, our framework simultaneously maps the visited state space and executes the planning strategy, with the help of a locally accurate value function approximator and the landmark-based map. It is a simple but effective way to improve the estimation accuracy of long-range value functions, and it induces a successful policy at the early stage of training.

2 Related work

Variants of goal-conditioned decision-making problems have been studied in the literature Sutton et al. (2011); Mao et al. (2018); Schaul et al. (2015); Pong et al. (2018). We focus on the goal-reaching task, where the goal is a subset of the state space. The agent receives meaningful rewards if and only if it has reached the goal, which brings significant challenges to existing RL algorithms. A significant recent approach along this line is Hindsight Experience Replay (HER) by Andrychowicz et al. (2017). They proposed to relabel the reached states as goals to improve data efficiency. However, they used only a single neural network to represent the value, learned by DDPG Lillicrap et al. (2015). This makes it hard to model long-range distances. Our method overcomes the issue by using a sample-based map to represent the global structure of the environment. The map allows rewards to be propagated to distant states more efficiently. It also allows the decision-making for long action sequences to be factorized into a high-level planning problem and a low-level control problem.

Model-based reinforcement learning algorithms usually need to learn a local forward model of the environment, and then solve the multi-step planning problem with the learned model Hafner et al. (2018); Oh et al. (2017); Silver et al. (2016); Henaff et al. (2017); Srinivas et al. (2018); Yu et al. (2019). These methods rely on learning an accurate local model and require extra effort to generalize to long horizons Ke et al. (2018). In comparison, we learn a model of the environment in a hierarchical manner, with a network-based local model and a graph-based global model (map). Unlike previous works that fit forward dynamics in local models, our local model distills local cumulative rewards from environment dynamics. In addition, our global model, as a small graph-based map that abstracts the large state space, supports reward propagation at long range. One can compare our framework with Value Iteration Networks (VIN) Tamar et al. (2016). VIN focused on the 2D navigation problem. Given a predefined map of known nodes, edges, and weights, it runs the value iteration algorithm by ingeniously simulating the process through a convolutional neural network LeCun et al. (1998). In contrast, we construct the map based upon the learned local model.

Sample-Based Motion Planning (SBMP) has been widely studied in the robotics context Hart et al. (1968); LaValle (1998); Kavraki et al. (1994). The traditional motion planning algorithm requires knowledge of the model. Recent work has combined deep learning and deep reinforcement learning for motion planning Ichter and Pavone (2018); Qureshi and Yip (2018); Klamt and Behnke (2019); Faust et al. (2018). In particular, PRM-RL addressed the 2D navigation problem by combining a high-level shortest path-based planner and a low-level RL algorithm. To connect nearby landmarks, it leveraged a physics engine, which depends on sophisticated domain knowledge and limits its use in other general RL tasks. In the general RL context, our work shows that one can combine a high-level planner and a learned local model to solve RL problems more efficiently. Some recent works also utilize the graph structure to perform planning Savinov et al. (2018); Zhang et al. (2018). However, unlike our approach, which discovers the graph structure in the process of achieving goals, both Savinov et al. (2018) and Zhang et al. (2018) require supervised learning to build the graph. Specifically, Savinov et al. (2018) need to learn a Siamese network to judge whether two states are connected, and Zhang et al. (2018) need to learn the state-attribute mapping from human annotation.

Our method is also related to hierarchical RL research Levy et al. (2018); Kulkarni et al. (2016); Nachum et al. (2018). The sampled landmark points can be considered as sub-goals. Levy et al. (2018) and Nachum et al. (2018) also used a HER-like relabeling technique to make the training more efficient. These works address more general RL problems without assuming much problem structure. Our work differs from previous work in how the high-level policy is obtained. In their methods, the agent has to learn the high-level policy as another RL problem. In contrast, we exploit the structure of our universal goal reaching problem and find the high-level policy by solving a pairwise shortest path problem in a small-scale graph, which is more data-efficient.

3 Background

Universal Markov Decision Process (UMDP) extends an MDP with a set of goals $\mathcal{G}$. A UMDP has reward function $R: \mathcal{S} \times \mathcal{A} \times \mathcal{G} \to \mathbb{R}$, where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space. Every episode starts with a goal $g$ selected from $\mathcal{G}$ by the environment, which stays fixed for the whole episode. We aim to find a goal-conditioned policy $\pi: \mathcal{S} \times \mathcal{G} \to \mathcal{A}$ to maximize the expected cumulative future return $V_\pi(s_0, g) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, g)\right]$, which is called the goal-conditioned value, or universal value. Universal Value Function Approximators (UVFA) Schaul et al. (2015) use a neural network to model $V(s, g) \approx V_{\pi^*}(s, g)$, where $\pi^*$ is the optimal policy, and apply the Bellman equation to train it in a bootstrapping way. Usually, the reward in a UMDP is too sparse to train the network easily. For a given goal, the agent can receive non-trivial rewards only when it can reach the goal. This brings a challenge to the learning process.

Hindsight Experience Replay (HER) Andrychowicz et al. (2017) proposes goal relabeling to train UVFA in the sparse reward setting. The key insight of HER is to “turn failure into success”, i.e., to make a failed trajectory a successful one by replacing the original failed goals with the goals it has achieved. This strategy gives more feedback to the agent and improves the data efficiency for sparse reward environments. Our framework relies on HER to train an accurate low-level policy.

4 Universal Goal Reaching

Problem Definition:

Our universal goal reaching problem refers to a family of UMDP tasks. The state space of our UMDP is a low-dimensional manifold in the ambient space. Many useful planning problems in practice are of this kind. Example universal goal reaching environments include labyrinth walking (e.g., AntMaze Duan et al. (2016)) and robot arm control (e.g., FetchReach Plappert et al. (2018)). Their states can only transition within a low-dimensional neighborhood constrained by the degrees of freedom of the actions.

Following the notation in Sec 3, we assume a goal $g$ in the goal space $\mathcal{G}$, which is a subset of the state space $\mathcal{S}$. For example, in a labyrinth walking game with continuous locomotion, the goal can be to reach a specific location in the maze at any velocity. Then, if the state $s$ is a vector consisting of the location and velocity, a convenient way to represent the goal $g$ would be a vector that only contains the dimensions of location, i.e., the goal space is a projection of the state space.

The universal goal reaching problem has a specific transition probability and reward structure. At every time step, the agent moves into a local neighborhood based on the metric in the state space, which might be perturbed by random noise. It also receives some negative penalty (usually a constant, e.g., $-1$ in the experiments) unless it has arrived at the vicinity of the goal. A reward is received if the goal is reached. To maximize the accumulated reward, the agent has to reach the goal in the fewest steps. Usually the only non-trivial reward appears rarely, so the universal goal reaching problem falls in the category of sparse reward environments, which are hard-exploration problems for RL.

A Graph View:

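Assume that a policy $\pi$ takes at most $T$ steps to move from $s$ to $g$ and that the absolute value of the reward $r_k$ at each step is bounded by $R_{\max}$. Let $w_\pi(s, g)$ be the expected total reward along the trajectory, and $d_\pi(s, g) = -w_\pi(s, g)$ for all $s, g$. We can prove that the gap between $V_\pi(s, g)$ and $w_\pi(s, g)$ vanishes as $\gamma \to 1$ for bounded $T$. (When $\gamma \to 1$, we can approximate $\gamma^k$ by its first-order Taylor expansion $1 + k(\gamma - 1)$.)

Thus, when $\gamma \approx 1$ and $T$ is bounded, UVFA can be approximated as:

$$V_\pi(s, g) \approx w_\pi(s, g) = -d_\pi(s, g). \qquad (1)$$

In this case, it is easy to show that the value iteration based on the Bellman equation implies $d_{\pi^*}(s, g) = \min_a \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[-R(s, a, g) + d_{\pi^*}(s', g)\right]$, where $P$ is the transition probability and the minimizing action is the one taken by the optimal policy $\pi^*$.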

The relationship allows us to view the MDP as a directed graph, whose nodes are the state set , and edges are sampled according to the transition probability in the MDP. The general value iteration for RL problems is exactly the shortest path algorithm in terms of on this directed graph. Besides, because the nodes form a low-dimensional manifold, nodes that are far away in the state space can only be reached by a long path.

The MDP of our universal goal reaching problem is a large-scale directed graph whose nodes are in a low-dimensional manifold. This structure allows us to estimate the all-pair shortest paths accurately by a landmark based coarsening of the graph.
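The approximation of the value by an undiscounted distance can be checked with a short calculation. The following is only a sketch under assumptions that the text does not spell out: steps are indexed $k = 0, \dots, T-1$, rewards after arrival are treated as zero, $|r_k| \le R_{\max}$, and the symbols $w_\pi$, $T$, and $R_{\max}$ are notation chosen here for illustration. It uses the elementary inequality $1 - \gamma^k \le k(1 - \gamma)$ for $\gamma \in [0, 1]$.

```latex
\begin{align*}
\left| V_\pi(s,g) - w_\pi(s,g) \right|
  &= \left| \mathbb{E}\Big[ \sum_{k=0}^{T-1} (\gamma^k - 1)\, r_k \Big] \right| \\
  &\le \sum_{k=0}^{T-1} (1 - \gamma^k)\, R_{\max} \\
  &\le (1-\gamma)\, R_{\max} \sum_{k=0}^{T-1} k
   \;=\; \frac{(1-\gamma)\, T (T-1)}{2}\, R_{\max}.
\end{align*}
```

Under these assumptions the gap shrinks linearly in $1 - \gamma$ but grows quadratically in $T$, which is one way to see why the distance interpretation is most trustworthy for short, local trajectories.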

5 Approach

Figure 1: An illustration of our framework. The agent is trying to reach the other side of the maze by planning on a landmark-based map. The landmarks are selected from its past experience, and the edges between the landmarks are formed by a UVFA.

In this paper, we choose deep RL algorithms such as DQN and DDPG for discrete and continuous action spaces, respectively. UVFA Schaul et al. (2015) is a goal-conditioned extension of the original DQN, while HER (Sec 3) can produce more informative feedback for UVFA learning. Our algorithm is thus based upon HER, and the extension of this approach to DDPG is also straightforward.

5.1 Basic Idea

Our approach aims at addressing the fundamental challenges in UVFA learning. As characterized in the previous section, the UVFA estimation solves a pairwise shortest path problem, and the underlying graph has a node space of high cardinality. Note that UVFA has to memorize the distance between every state-goal pair, through trajectory samples from the starting state to the goal. For analysis purposes, we assume the state space has dimension $n$ and contains a ball of radius $\rho$. Then the lower bound on the number of state-goal pairs is on the order of $\rho^{2n}$ (this is the volume of the state-goal pair set $\mathcal{S} \times \mathcal{G}$, where $\mathcal{S}$ and $\mathcal{G}$ are the state and goal sets, assuming $\mathcal{G} = \mathcal{S}$ for analysis only, and $\times$ is the Cartesian product), a high-order polynomial.

The large set of state-goal pairs poses two challenges. First, it takes a long time to sample enough state-goal pairs. In particular, at the early stage, only a few state-goal samples have been collected, so learning from them requires heavy extrapolation by networks, which is well known to be unreliable. Second, memorizing all the experiences is too difficult even for large networks.

We propose a map that abstracts the visited state space by landmarks and the edges that connect them. This abstraction is reasonable due to the underlying structure of our graph, a low-dimensional manifold Goldberg and Harrelson (2005). We also learn local UVFA networks that only need to be accurate in the neighborhood of landmarks. As illustrated in Figure 1, an ant robot is put in a “U” maze to reach a given position. It should learn to model the maze as a small-scale map based on its past experiences.

This solution addresses the challenges. The UVFA network only needs to remember experiences in a local neighborhood, so the training procedure has much lower sample complexity. The map decomposes a long path into short pieces, each of which is estimated by an accurate local network.

Our framework contains three components: a value function approximator trained with hindsight experience replay, a map that is supported by sampled landmarks, and a planner that can find the optimal path with the map. We will introduce them in Sec 5.2, Sec 5.3, and Sec 5.4, respectively.

5.2 Learning a Local UVFA with HER

Specifically, we define the following reward function for the goal reaching problem:

$$r_t = R(s_t, a_t, g) = \begin{cases} 0 & \text{if } \|s'_t - g\| \le \delta \\ -1 & \text{otherwise} \end{cases}$$

Here $s'_t$ is the next observation after taking action $a_t$, and $\delta$ is the threshold that defines the vicinity of the goal. We first learn a UVFA based on HER, which has proven efficient for UVFA learning. HER smartly generates more feedback for the agent by replacing some unachievable goals with those achieved in the near future. HER thus allows the agent to obtain denser rewards before it can eventually reach goals that are far away.

In experiments (see Sec 6.3), we find that the agent trained with HER does master the skill of reaching goals of increasing difficulty in a curriculum-like way. However, the agent can seldom reach the most difficult goals consistently, while the success rate of reaching easier goals remains stable. These observations suggest that HER’s value and policy are locally reliable.

To increase the agent’s ability to reach nearby goals and get a better local value estimation at the early stage, we change the replacement strategy in HER, ensuring that the replaced goals are sampled from the near future within a fixed number of steps.
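The near-future relabeling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the tuple layout of a transition, the window size `max_future_steps`, and the threshold `delta` are all assumptions made for the example, the 0 / -1 reward follows the common sparse goal-reaching convention, and states are assumed to already be expressed in goal space.

```python
import random


def sparse_reward(achieved_goal, goal, delta=0.5):
    """Return 0 when the achieved goal is within `delta` of the goal, else -1."""
    dist = sum((a - b) ** 2 for a, b in zip(achieved_goal, goal)) ** 0.5
    return 0.0 if dist <= delta else -1.0


def relabel_near_future(trajectory, max_future_steps=3, rng=random):
    """Relabel each transition with a goal achieved at most `max_future_steps` later.

    `trajectory` is a list of (state, action, next_state) tuples whose states
    are tuples of floats in goal space (an assumption of this sketch).
    """
    relabeled = []
    last = len(trajectory) - 1
    for t, (s, a, s_next) in enumerate(trajectory):
        # Sample a step index from the window [t, t + max_future_steps],
        # clipped to the end of the trajectory.
        future = rng.randint(t, min(t + max_future_steps, last))
        new_goal = trajectory[future][2]  # state reached at that later step
        reward = sparse_reward(s_next, new_goal)
        relabeled.append((s, a, s_next, new_goal, reward))
    return relabeled
```

Restricting the window keeps the relabeled goals close to the current state, which matches the intent of training a value function that is accurate locally rather than globally.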

The UVFA trained in this step will be used for two purposes: (1) to estimate the distance between two local states belonging to the same landmark, or between two nearby landmarks; and (2) to decide whether two states are close enough so that we can trust the distance estimation from the network. Although the learned UVFA is imperfect globally, it is enough for the two local usages.

5.3 Building a Map by Sampling Landmarks

After training the UVFA, we obtain a distance estimation $d(s, g)$ (if the algorithm returns a $Q$ function, we calculate the value by selecting the optimal action, $V(s, g) = \max_a Q(s, a, g)$, and convert $V$ to $d$ by Eq. 1), a policy $\pi(s, g)$ for any state-goal pair $(s, g)$, and a replay buffer $B$ that contains all the past experiences. We build a landmark-based map to abstract the state space based on these experiences. The pseudo-code for the algorithm is shown in Algorithm 1.

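Algorithm 1 Planning with State-space Mapping (Planner)
Input: state $s$, goal $g$, UVFA distance estimate $d(\cdot, \cdot)$, clip_value $\tau$
Output: next subgoal $g_{next}$
    Sample transitions $T$ from replay buffer $B$
    $V \leftarrow \mathrm{FPS}(\text{states in } T) \cup \{g\}$        // Farthest point sampling to find landmarks
    $W_{i,j} \leftarrow \infty$ for all $i, j$        // Initialize map as graph
    for each pair $(v_i, v_j) \in V \times V$ do
        if $d(v_i, v_j) < \tau$ then
            $W_{i,j} \leftarrow d(v_i, v_j)$
    $D \leftarrow \mathrm{Bellman\_Ford}(W)$        // Calculate pairwise distance
    $g_{next} \leftarrow \arg\min_{v_i \in V} \; d(s, v_i) + D_{v_i, g}$
    return $g_{next}$

Landmark Sampling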

The replay buffer stores visited states. Instead of localizing a few important states that play a key role in connecting the environment, we seek to sample many states to cover the visited state space.

Limited by computation budget, we first uniformly sample a big set of states from the replay buffer, and then use the farthest point sampling (FPS) algorithm Arthur and Vassilvitskii (2007) to select landmarks to support the explored state space. The metric for FPS can either be the Euclidean distance between the original state representation or the pairwise value estimated by the agent.

We compare different sampling strategies in Section 6.3, and demonstrate the advantage of FPS in abstracting the visited state space and exploration.
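Farthest point sampling itself is simple to state in code. The sketch below is a generic greedy FPS over a list of points and is meant only as an illustration; the function name, the `start` index, and the default Euclidean metric (`math.dist`) are choices made here. Any distance function, for instance one derived from a learned value estimate, could be passed in as `dist`.

```python
import math


def farthest_point_sampling(points, k, dist=math.dist, start=0):
    """Greedily pick up to k indices; each new pick is farthest from those chosen."""
    if not points or k <= 0:
        return []
    k = min(k, len(points))
    chosen = [start]
    # min_dist[i] is the distance from the nearest chosen landmark to points[i].
    min_dist = [dist(points[start], p) for p in points]
    while len(chosen) < k:
        # The next landmark is the point that is worst covered so far.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], dist(points[nxt], p))
    return chosen
```

Because each new landmark maximizes its distance to the landmarks already selected, the selection tends to reach toward the boundary of the sampled set before filling in the interior.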

Connecting Nearby Landmarks

We first connect landmarks that have a reliable distance estimation from the UVFA and assign the UVFA-estimated distance between them as the weight of the connecting edge.

Since UVFA is accurate locally but unreliable for the long-term future, we choose to connect only nearby landmarks. The UVFA is able to return a distance between any pair of landmarks $(l_i, l_j)$, so we connect the pairs with distance below a preset threshold $\tau$, which should be chosen so that all the edges are reliable and the whole graph is connected.

With these two steps, we have built a directed weighted graph which can approximate the visited state space. This graph is our map to be used for high-level planning. Such a map induces a new environment, where the action is to choose to move to another landmark. The details can be found in Algorithm 1.
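The edge-construction step can be sketched as a thresholded distance matrix. This is an illustrative sketch rather than the authors' code: `build_landmark_graph` and `distance_fn` are hypothetical names, and `distance_fn(a, b)` stands in for whatever estimate of the distance from `a` to `b` the learned UVFA provides.

```python
import math


def build_landmark_graph(landmarks, distance_fn, threshold):
    """Return a dense matrix W of edge weights between landmarks.

    W[i][j] is the estimated distance from landmark i to landmark j when it is
    below `threshold`, and math.inf when the estimate is not trusted.
    """
    n = len(landmarks)
    W = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        W[i][i] = 0.0
        for j in range(n):
            if i == j:
                continue
            d = distance_fn(landmarks[i], landmarks[j])
            if d < threshold:  # keep only short, reliable edges
                W[i][j] = d
    return W
```

The matrix is not symmetrized, so the graph stays directed, which matters whenever moving from `a` to `b` is not as easy as moving from `b` to `a`.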

5.4 Planning with the Map

We can now leverage the map and the local UVFA network to estimate the distance between any state-goal pairs, which induces a reliable policy for the agent to reach the goal.

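For a given pair $(s, g)$, we can plan the optimal path between them by selecting a series of landmarks $l_1, \dots, l_k$, so that the approximated distance will be $\bar{d}(s, g) = \min_{l_1, \dots, l_k} \left[ d(s, l_1) + \sum_{i=1}^{k-1} d(l_i, l_{i+1}) + d(l_k, g) \right]$. The policy from $s$ to $g$ can then be approximated as: $\bar{\pi}(s, g) = \pi(s, l_1) + \sum_{i=1}^{k-1} \pi(l_i, l_{i+1}) + \pi(l_k, g)$. Here the summation of $\pi$ is the concatenation of the corresponding action sequences.

In our implementation, we run the shortest path algorithm to solve the above minimization problem. To speed up the pipeline, we first calculate the pairwise distances between each landmark and the goal when the episode starts. When the agent is at state $s$, we can choose the next subgoal by finding $g_{next} = \arg\min_{l_i} \left[ d(s, l_i) + \bar{d}(l_i, g) \right]$.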
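The planning step can be illustrated with a small, self-contained sketch. Two caveats apply: Floyd-Warshall is swapped in here for the shortest path routine purely for brevity (the paper's Algorithm 1 names Bellman-Ford), and the names `all_pairs_shortest_paths`, `next_subgoal`, `distance_fn`, and `threshold` are hypothetical. The goal is assumed to be one of the landmarks, identified by `goal_index`.

```python
import math


def all_pairs_shortest_paths(W):
    """Floyd-Warshall on a dense weight matrix (math.inf marks a missing edge)."""
    n = len(W)
    D = [row[:] for row in W]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D


def next_subgoal(state, goal_index, landmarks, D, distance_fn, threshold):
    """Pick the landmark i minimizing distance_fn(state, l_i) + D[i][goal_index].

    Landmarks whose local estimate is not below `threshold` are skipped, since
    long-range estimates are not trusted. Returns (index, cost), or
    (None, math.inf) when no trusted landmark can reach the goal.
    """
    best, best_cost = None, math.inf
    for i, landmark in enumerate(landmarks):
        local = distance_fn(state, landmark)
        if local >= threshold:
            continue
        cost = local + D[i][goal_index]
        if cost < best_cost:
            best, best_cost = i, cost
    return best, best_cost
```

Since the matrix `D` depends only on the landmarks and the goal, it can be computed once per episode, leaving only the cheap `next_subgoal` scan to run at every step.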

6 Experiments

6.1 FourRoom: An Illustrative Example

(a) FourRoom
(b) Mean Distortion Error
(c) Max Distance Estimation
(d) Success Rate
Figure 2: The results on the FourRoom environment. Figure 2(a) shows the sampled landmarks and the planned path based on our algorithm. Figures 2(b), 2(c), and 2(d) show different evaluation metrics of value estimation and the success rate of reaching the goal.

We first demonstrate the merits of our method in the FourRoom environment, where the action space is discrete. The environment is visualized in Figure 2(a). There are walls separating the space into four rooms, with narrow openings to connect them. For this discrete environment, we use DQN Mnih et al. (2013) with HER Andrychowicz et al. (2017) to learn the Q value. Here, we use the one-hot representation of the x-y position as the input of the network. The initial states and the goals are randomly sampled during training.

We first get $V(s, g)$ from the learned Q-value by the equation $V(s, g) = \max_a Q(s, a, g)$, and convert $V(s, g)$ to the pairwise distance $d(s, g)$ based on Eq. 1. To evaluate the accuracy of distance estimation, we further calculate the ground truth distance $d_{gt}(s, g)$ by running a shortest path algorithm on the underlying ground-truth graph of the maze. Then we adopt the mean distortion error (MDE) as the evaluation metric: $\frac{|d(s, g) - d_{gt}(s, g)|}{d_{gt}(s, g)}$, averaged over state-goal pairs.
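A distortion metric of this kind can be computed as sketched below. The definition used here, the mean relative error against the ground-truth distance, is an assumption, since the exact formula is not recoverable from the text; the grid encoding (`'.'` for free cells, anything else for walls) and the function names are also illustrative.

```python
from collections import deque


def bfs_distances(grid, start):
    """Shortest step counts from `start` to every reachable free cell ('.')."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == '.' and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist


def mean_distortion_error(estimated, ground_truth):
    """Average of |d_est - d_gt| / d_gt over entries with d_gt > 0."""
    ratios = [abs(estimated[key] - gt) / gt
              for key, gt in ground_truth.items() if gt > 0]
    return sum(ratios) / len(ratios)
```

As a usage example, an estimator that reports half of every true distance would score 0.5 under this definition, regardless of how far apart the pairs are.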

Results are shown in Figure 2(b). Our method has a much lower MDE at the very beginning of training, which means that the estimated value is more accurate.

To better evaluate our advantage for distant goals, we first convert predicted values to the corresponding distances, and then plot the maximal distance during training. From Figure 2(c), we can observe that the planning module has a larger output range than DQN. We guess that this comes from the max-operation in the Bellman equation, which pushes DQN to overestimate the Q value, or in other words, to underestimate the distance for distant goals. However, the planner can still use piece-wise correct estimations to approximate the real distance to the goal.

We also compare our method with DQN on the success rate of reaching the goal, and the results are shown in Figure 2(d).

6.2 Continuous Control

In this section, we will compare our method with HER on challenging classic control tasks and MuJoCo Todorov et al. (2012) goal-reaching environments.

6.2.1 Environment Description

(a) 2DReach
(b) 2DPush
(c) BlockedFetchReach
(d) FetchPush
(e) PointMaze
(f) AntMaze
(g) Complex AntMaze
(h) Acrobot
Figure 3: The environments we use for continuous control experiments.

2DReach A green point in a 2D U-maze aims to reach the goal represented by a red point, as shown in Figure 3(a). The size of the maze is . The state space and the goal space are both in this 2D maze. At each step, the agent can move within as in x and y directions.

2DPush The green point A now needs to push a blue point B to a given goal (red point) lying in the same U-maze as 2DReach, as shown in Figure 3(b). Once A has reached B, B will follow the movement of A. In this environment, the state is a 4-dim vector that contains the locations of both A and B.

BlockedFetchReach & FetchPush We need to control a gripper to either reach a location in 3D space or push an object on the table to a specific location, as shown in Figure 3(c) and Figure 3(d). Since the original FetchReach implemented in OpenAI gym Brockman et al. (2016) is very easy to solve, we further add some blocks to increase the difficulty. We call this new environment BlockedFetchReach.

PointMaze & AntMaze As shown in Figure 3(e) and Figure 3(f), a point mass or an ant is put in a U-maze. Both agents are trained to reach a random goal from a random location and tested under the most difficult setting, which is to reach the other side of the maze within 500 steps. The states of the point and the ant are 7-dim and 30-dim, including positions and velocities.

Complex AntMaze As shown in Figure 3(g), an ant is put in a complex maze. It is trained to reach a random goal from a random location and tested under the most difficult setting, which is to reach the farthest goal (indicated as the red point) within 1500 steps.

Acrobot As shown in Figure 3(h), an acrobot includes two joints and two links. Goals are states in which the end-effector is above the black line at specific joint angles and velocities. The states and goals are both 6-dim vectors including joint angles and velocities.

6.2.2 Experiment Result

The results compared with HER are shown in Figure 4. Our method trains UVFA with planner and HER. It is evaluated under the test setting, using the model and replay buffer at corresponding training steps.

In the 2DReach and 2DPush tasks (shown in Figure 4(b)), we can see that our method achieves better performance. When applied to control tasks, namely the BlockedFetchReach and FetchPush environments, the results still show that our performance is better than HER, but the improvement is less remarkable. We guess this comes from the strict time limit of the two environments, which is only . We observe that pure HER can eventually learn well when the task horizon is not very long.

We expect that building maps would be more helpful for long-range goals, which is evidenced in the environments with longer episode length. Here we choose PointMaze and AntMaze with scale . For training, the agent is born at a random position to reach a random goal in the maze. For testing, the agent should reach the other side of the “U-Maze” within 500 steps. For these two environments, the performance of planning is significantly better and remains stable, while HER can hardly learn a reliable policy. Results are shown in Figure 4(e) and Figure 4(f).

We also evaluate our method on classic control and on a more complex navigation + locomotion task. Here we choose Complex AntMaze and Acrobot, and the results are shown in Figure 4(g) and Figure 4(h). The advantage over the baseline demonstrates that our method is applicable to complicated navigation tasks as well as general MDPs.

6.2.3 Comparison with HRL

We compare our method with HRL algorithms on large AntMaze (size ), as shown in Table 1. We choose to compare with HIRO Nachum et al. (2018), which is the SOTA HRL algorithm on AntMaze, and HAC Levy et al. (2018), which also uses hindsight experience replay. We test these algorithms with their published code, under both the sparse reward setting and the dense reward setting.

In the sparse reward setting, our algorithm works well and reaches the goal at a very early stage (Ours Sparse in Table 1). In contrast, neither HAC nor HIRO is able to reach the goal in 2M steps. HIRO doesn’t use HER to replace the unachievable goals, which makes this setting very challenging for the algorithm.

In the dense reward setting, the map planner obtains a high success rate at a very early stage, shown as Ours Dense in Table 1. Compared with HIRO Dense, we can see that a planner can reach distant goals sooner, since we don’t need to train a high-level policy to propose subgoals for the low-level agent.

HAC introduced several complex hyper-parameters, and we couldn’t make it work well in either setting.

Method        0.5M   0.75M  1M     1.25M  1.5M   1.75M  2M
Ours Sparse   0.0    0.03   0.3    0.4    0.45   0.5    0.5
HIRO Sparse   0.0    0.0    0.0    0.0    0.0    0.0    0.0
Ours Dense    0.0    0.09   0.45   0.5    0.7    0.8    0.9
HIRO Dense    0.0    0.0    0.0    0.1    0.4    0.6    0.8
Table 1: Success Rate on Large AntMaze at different training steps.

(a) 2DReach
(b) 2DPush
(c) BlockedFetchReach
(d) FetchPush
(e) PointMaze
(f) AntMaze
(g) Complex AntMaze
(h) Acrobot
Figure 4: Experiments on the continuous control environments. The red curve indicates the performance of our method at different training steps.
(a) Multi-level AntMaze
(b) Average Steps
(c) Success Rate
Figure 5: AntMaze of multi-level difficulty. Figure 5(b) and Figure 5(c) show the average steps and the success rate for reaching goals of different levels, respectively.

6.3 Ablation Study

We study some key factors that affect our algorithm on AntMaze.

Choice of Clip Range and Landmarks There are two main hyper-parameters for the planner: the number of landmarks and the edge clipping threshold $\tau$. Figure 6(a) shows the evaluation result of the model trained after 0.8M steps in AntMaze. We see that our method is generally robust under different choices of hyper-parameters. Here $\tau$ is the negative distance between landmarks. If it’s too small, the landmarks will be isolated and can’t form a connected graph. The same problem arises when there are not enough landmarks.

The Local Accuracy of HER We evaluate our model trained between 0 and 2.5M steps, for goals of different difficulties. We manually define the difficulty level of goals, as shown in Figure 5(a). The goal’s difficulty increases from Level 1 to Level 6. We plot the success rate as well as the average steps to reach these goals. We find that, for the easier goals, the agent takes less time and fewer steps to master the skill. The success rate and average steps also remain more stable during the training process, indicating that our base model is more reliable and stable in the local area.

Landmark Sampling Strategy Comparison Our landmarks are dynamically sampled from the replay buffer by the iterative FPS algorithm using distances estimated by UVFA, and are updated at the beginning of every episode. FPS tends to find states at the boundary of the visited space, which implicitly helps exploration. We test FPS and uniform sampling in fixed-start AntMaze (the ant is born at a fixed position to reach the other side of the maze for both training and testing). Figure 6(b) shows that FPS has a much higher success rate than uniform sampling. Figure 6(c) shows the landmark-based graph at four training stages. Through FPS, landmarks expand gradually towards the goal (red dot), even though they only cover a small proportion of states at the beginning.

(a) Hyperparameters of the planner
(b) FPS vs. Uniform Sampling
(c) Landmark-based Map
Figure 6: Figure 6(a) shows how performance depends on the number of landmarks and the clip range in the planner. Figure 6(b) shows that FPS outperforms uniform sampling. Figure 6(c) is the landmark-based map at different training steps constructed by FPS.

7 Conclusion

Learning a structured model and combining it with RL algorithms is important for reasoning and planning over long horizons. We propose a sample-based method to dynamically map the visited state space and demonstrate its empirical advantage in routing and exploration in several challenging RL tasks. Experimentally, we showed that this approach solves long-range goal-reaching problems better than model-free methods and hierarchical RL methods on a number of challenging games, even if the goal-conditioned model is only locally accurate. However, our method also has limitations. First, we empirically observe that some parameters, particularly the threshold used to check whether we have reached the vicinity of a goal, need hand-tuning. Second, a good state embedding is still important for the learning efficiency of our approach, since we do not include a heavy component for learning state embeddings. Third, we find that in some environments whose intrinsic dimension is very high, especially when the topological structure is hard to abstract, a sample-based method is not enough to represent the visited state space. Finally, in environments where it is hard to obtain a reliable and generalizable local policy, this approach will also suffer from accumulated error.

References


  • [1] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba (2017) Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058. Cited by: §1, §2, §3, §6.1.
  • [2] D. Arthur and S. Vassilvitskii (2007) k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035. Cited by: §5.3.
  • [3] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym. External Links: arXiv:1606.01540 Cited by: §6.2.1.
  • [4] F. Bullo and A. D. Lewis (2004) Geometric control of mechanical systems. Texts in Applied Mathematics, Vol. 49, Springer Verlag, New York-Heidelberg-Berlin. External Links: ISBN 0-387-22195-6 Cited by: §1.
  • [5] V. De Silva and J. B. Tenenbaum (2004) Sparse multidimensional scaling using landmark points. Technical report. Cited by: §1.
  • [6] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel (2016) Benchmarking deep reinforcement learning for continuous control. CoRR abs/1604.06778. External Links: Link, 1604.06778 Cited by: §4.
  • [7] A. Faust, K. Oslund, O. Ramirez, A. Francis, L. Tapia, M. Fiser, and J. Davidson (2018) PRM-RL: long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 5113–5120. Cited by: §2.
  • [8] A. V. Goldberg and C. Harrelson (2005) Computing the shortest path: a search meets graph theory. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 156–165. Cited by: §5.1.
  • [9] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson (2018) Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551. Cited by: §2.
  • [10] P. E. Hart, N. J. Nilsson, and B. Raphael (1968) A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics 4 (2), pp. 100–107. Cited by: §2.
  • [11] M. Henaff, W. F. Whitney, and Y. LeCun (2017) Model-based planning with discrete and continuous actions. arXiv preprint arXiv:1705.07177. Cited by: §2.
  • [12] B. Ichter and M. Pavone (2018) Robot motion planning in learned latent spaces. CoRR abs/1807.10366. External Links: Link, 1807.10366 Cited by: §2.
  • [13] L. Kavraki, P. Svestka, and M. H. Overmars (1994) Probabilistic roadmaps for path planning in high-dimensional configuration spaces. Vol. 1994, Unknown Publisher. Cited by: §2.
  • [14] N. R. Ke, A. Singh, A. Touati, A. Goyal, Y. Bengio, D. Parikh, and D. Batra (2018) Modeling the long term future in model-based reinforcement learning. Cited by: §2.
  • [15] T. Klamt and S. Behnke (2019) Towards learning abstract representations for locomotion planning in high-dimensional state spaces. arXiv preprint arXiv:1903.02308. Cited by: §2.
  • [16] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum (2016) Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, pp. 3675–3683. Cited by: §2.
  • [17] S. M. LaValle (1998) Rapidly-exploring random trees: a new tool for path planning. Cited by: §2.
  • [18] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §2.
  • [19] A. Levy, R. Platt, and K. Saenko (2018) Hierarchical reinforcement learning with hindsight. arXiv preprint arXiv:1805.08180. Cited by: §1, §2, §6.2.3.
  • [20] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §2.
  • [21] J. Mao, H. Dong, and J. J. Lim (2018) Universal agent for disentangling environments and tasks. Cited by: §2.
  • [22] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller (2013) Playing Atari with deep reinforcement learning. CoRR abs/1312.5602. External Links: Link, 1312.5602 Cited by: §6.1.
  • [23] A. W. Moore (1990) Efficient memory-based learning for robot control. Technical report. Cited by: §1.
  • [24] O. Nachum, S. S. Gu, H. Lee, and S. Levine (2018) Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3303–3313. Cited by: §2, §6.2.3.
  • [25] J. Oh, S. Singh, and H. Lee (2017) Value prediction network. In NIPS, Cited by: §2.
  • [26] M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, V. Kumar, and W. Zaremba (2018) Multi-goal reinforcement learning: challenging robotics environments and request for research. CoRR abs/1802.09464. External Links: Link, 1802.09464 Cited by: §4.
  • [27] V. Pong, S. Gu, M. Dalal, and S. Levine (2018) Temporal difference models: model-free deep RL for model-based control. CoRR abs/1802.09081. External Links: Link, 1802.09081 Cited by: §2.
  • [28] A. H. Qureshi and M. C. Yip (2018) Deeply informed neural sampling for robot motion planning. CoRR abs/1809.10252. External Links: Link, 1809.10252 Cited by: §2.
  • [29] N. Savinov, A. Dosovitskiy, and V. Koltun (2018) Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653. Cited by: §2.
  • [30] T. Schaul, D. Horgan, K. Gregor, and D. Silver (2015) Universal value function approximators. In International conference on machine learning, pp. 1312–1320. Cited by: §1, §2, §3, §5.
  • [31] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016-01) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. External Links: Document, ISSN 0028-0836 Cited by: §2.
  • [32] A. Srinivas, A. Jabri, P. Abbeel, S. Levine, and C. Finn (2018) Universal planning networks. CoRR abs/1804.00645. Cited by: §2.
  • [33] R. S. Sutton, J. Modayil, M. Delp, T. Degris, P. M. Pilarski, A. White, and D. Precup (2011) Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pp. 761–768. Cited by: §2.
  • [34] A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel (2016) Value iteration networks. In Advances in Neural Information Processing Systems, pp. 2154–2162. Cited by: §2.
  • [35] E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §6.2.
  • [36] T. Yu, G. Shevchuk, D. Sadigh, and C. Finn (2019) Unsupervised visuomotor control through distributional planning networks. CoRR abs/1902.05542. External Links: Link, 1902.05542 Cited by: §2.
  • [37] A. Zhang, A. Lerer, S. Sukhbaatar, R. Fergus, and A. Szlam (2018) Composable planning with attributes. arXiv preprint arXiv:1803.00512. Cited by: §2.