1 Introduction
Reinforcement learning (RL) trains agents for planning and control tasks using feedback from the environment. While significant progress has been made in the standard setting where the goal is known at training time, e.g., reaching a given flag as in MountainCar Moore (1990), much less effort has been devoted to the setting where the goals used at evaluation are unknown at training time. For example, when a robot walks in an environment, the destination may vary from time to time. Tasks of this kind are ubiquitous and of crucial importance in practice. We call them Universal Markov Decision Process (UMDP) problems, following the convention of
Levy et al. (2018).

Pioneering work handles UMDP problems by learning a Universal Value Function Approximator (UVFA). In particular, Schaul et al. (2015) proposed to approximate a goal-conditioned value function $V(s, g)$ ($s$ is the current state and $g$ is the goal) by a multilayer perceptron (MLP), and Andrychowicz et al. (2017) proposed a framework called hindsight experience replay (HER) that smartly reuses past experience to fit the universal value function with a TD loss. However, for complicated policies over long horizons, the UVFA learned by a network is often not good enough. This is because the UVFA has to memorize the cumulative reward between all state-goal pairs, which is a daunting job. In fact, the cardinality of state-goal pairs grows as a high-order polynomial in the horizon of goals.

While the general UMDP problem is extremely difficult, we consider a family of UMDP problems whose state space is a low-dimensional manifold in the ambient space. Most control problems are of this type, and geometric control theory has been developed for them in the literature Bullo and Lewis (2004). Our approach is inspired by manifold learning, e.g., Landmark MDS De Silva and Tenenbaum (2004). We abstract the state space as a small-scale map whose nodes are landmark states selected from the experience replay buffer, and whose edges connect nearby nodes with weights extracted from the learned local UVFA. A network is still used to fit the local UVFA accurately. The map allows us to run high-level planning using a pairwise shortest path algorithm, and the local UVFA network allows us to make accurate local decisions. For a long-term goal, we first use the local UVFA network to direct the agent to a nearby landmark, then route among landmarks using the map towards the goal, and finally reach the goal from the last landmark using the local UVFA network.
Our method has improved sample efficiency over a purely network-learned UVFA, for three main reasons. First, the UVFA estimator in our framework only needs to work well for local value estimation. The network does not need to remember values for faraway goals, so its load is alleviated. Second, for long-range state-goal pairs, the map propagates accurate local value estimates in a way that neural networks cannot. Consider the extreme case of a long-range state-goal pair never experienced before. A network can only guess the value by extrapolation, which is known to be unreliable. Our map, however, can reasonably approximate the value as long as there is a path through landmarks connecting them. Lastly, the map provides strong exploration ability and can help obtain rewards significantly earlier, especially in the sparse reward setting. This is because we choose the landmarks from the replay buffer using a farthest-point sampling strategy, which tends to select states closer to the boundary of the visited space. In experiments, we compare our method on several challenging environments and outperform the baselines.
Our contributions are twofold. First, we propose a sample-based method to map the visited state space using landmarks. Such a graph-like map is a powerful representation of the environment, maintaining both local connectivity and global topology. Second, our framework simultaneously maps the visited state space and executes the planning strategy, with the help of a locally accurate value function approximator and the landmark-based map. It is a simple but effective way to improve the estimation accuracy of long-range value functions, and it induces a successful policy at an early stage of training.
2 Related work
Variants of goal-conditioned decision-making problems have been studied in the literature Sutton et al. (2011); Mao et al. (2018); Schaul et al. (2015); Pong et al. (2018). We focus on the goal-reaching task, where the goal is a subset of the state space. The agent receives a meaningful reward if and only if it has reached the goal, which brings significant challenges to existing RL algorithms. A significant recent approach along this line is Hindsight Experience Replay (HER) Andrychowicz et al. (2017), which relabels reached states as goals to improve data efficiency. However, HER uses only a single neural network to represent the value, learned by DDPG Lillicrap et al. (2015), which makes it hard to model long-range distances. Our method overcomes this issue by using a sample-based map to represent the global structure of the environment. The map propagates rewards to distant states more efficiently, and it factorizes the decision-making for long action sequences into a high-level planning problem and a low-level control problem.
Model-based reinforcement learning algorithms usually learn a local forward model of the environment and then solve the multi-step planning problem with the learned model Hafner et al. (2018); Oh et al. (2017); Silver et al. (2016); Henaff et al. (2017); Srinivas et al. (2018); Yu et al. (2019). These methods rely on learning an accurate local model and require extra effort to generalize to long horizons Ke et al. (2018). In comparison, we learn a model of the environment in a hierarchical manner, with a network-based local model and a graph-based global model (the map). Unlike previous work that fits forward dynamics in local models, our local model distills local cumulative rewards from the environment dynamics. In addition, our global model, a small graph-based map that abstracts the large state space, supports reward propagation over long ranges. One can compare our framework with Value Iteration Networks (VIN) Tamar et al. (2016). VIN focuses on the 2D navigation problem: given a predefined map of known nodes, edges, and weights, it runs the value iteration algorithm by ingeniously simulating the process with a convolutional neural network LeCun et al. (1998). In contrast, we construct the map based upon the learned local model.

Sample-Based Motion Planning (SBMP) has been widely studied in the robotics context Hart et al. (1968); LaValle (1998); Kavraki et al. (1994). Traditional motion planning algorithms require knowledge of the model. Recent work has combined deep learning and deep reinforcement learning with motion planning Ichter and Pavone (2018); Qureshi and Yip (2018); Klamt and Behnke (2019); Faust et al. (2018). In particular, PRM-RL Faust et al. (2018) addressed the 2D navigation problem by combining a high-level shortest-path-based planner and a low-level RL algorithm. To connect nearby landmarks, it leveraged a physics engine, which depends on sophisticated domain knowledge and limits its usage in other general RL tasks. In the general RL context, our work shows that one can combine a high-level planner and a learned local model to solve RL problems more efficiently. Some recent work also utilizes a graph structure to perform planning Savinov et al. (2018); Zhang et al. (2018). However, unlike our approach, which discovers the graph structure in the process of achieving goals, both Savinov et al. (2018) and Zhang et al. (2018) require supervised learning to build the graph. Specifically, Savinov et al. (2018) learn a Siamese network to judge whether two states are connected, and Zhang et al. (2018) learn a state-attribute mapping from human annotation.

Our method is also related to hierarchical RL research Levy et al. (2018); Kulkarni et al. (2016); Nachum et al. (2018). The sampled landmark points can be considered as subgoals. Levy et al. (2018); Nachum et al. (2018) also use a HER-like relabeling technique to make training more efficient. These works attack more general RL problems without assuming much problem structure. Our work differs from previous work in how the high-level policy is obtained. In their methods, the agent has to learn the high-level policy as another RL problem. In contrast, we exploit the structure of our universal goal reaching problem and find the high-level policy by solving a pairwise shortest path problem in a small-scale graph, and are thus more data-efficient.
3 Background
A Universal Markov Decision Process (UMDP) extends an MDP with a set of goals $\mathcal{G}$. A UMDP has a reward function $r: \mathcal{S} \times \mathcal{A} \times \mathcal{G} \to \mathbb{R}$, where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space. Every episode starts with a goal $g$ selected from $\mathcal{G}$ by the environment, which stays fixed for the whole episode. We aim to find a goal-conditioned policy $\pi(a \mid s, g)$ that maximizes the expected cumulative future return $V^{\pi}(s, g) = \mathbb{E}\big[\sum_{t} \gamma^{t} r(s_t, a_t, g)\big]$, called the goal-conditioned value, or universal value. Universal Value Function Approximators (UVFA) Schaul et al. (2015) use a neural network to model $V^{\pi^*}(s, g)$, where $\pi^*$ is the optimal policy, and apply the Bellman equation to train it in a bootstrapping way. Usually, the reward in a UMDP is too sparse to train the network: for a given goal, the agent receives a non-trivial reward only when it reaches the goal. This brings a challenge to the learning process.
Hindsight Experience Replay (HER) Andrychowicz et al. (2017) proposes goal relabeling to train a UVFA in the sparse reward setting. The key insight of HER is to "turn failure into success", i.e., to make a failed trajectory successful by replacing the original unreached goals with goals the agent actually achieved. This strategy gives more feedback to the agent and improves data efficiency in sparse reward environments. Our framework relies on HER to train an accurate low-level policy.
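For illustration, the following is a minimal sketch of HER-style "future" relabeling for one recorded trajectory. The buffer layout, the success threshold delta, and the parameter k_future are assumptions made for this sketch, not the exact implementation.

import numpy as np

def relabel_with_her(trajectory, k_future=4, delta=0.05, rng=np.random):
    """trajectory: list of (state, action, next_state, goal) tuples (numpy arrays).
    Returns extra transitions whose goals are states achieved later in the episode."""
    relabeled = []
    T = len(trajectory)
    for t, (s, a, s_next, _) in enumerate(trajectory):
        # sample a few goals from states reached later in the same episode
        future_idx = rng.randint(t, T, size=k_future)
        for i in future_idx:
            # here we assume the goal space equals the state space for simplicity
            new_goal = trajectory[i][2]
            reached = np.linalg.norm(s_next - new_goal) <= delta
            reward = 0.0 if reached else -1.0    # sparse goal-reaching reward
            relabeled.append((s, a, s_next, new_goal, reward, reached))
    return relabeled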
4 Universal Goal Reaching
Problem Definition:
Our universal goal reaching problem refers to a family of UMDP tasks whose state space is a low-dimensional manifold in the ambient space. Many useful planning problems in practice are of this kind. Example universal goal reaching environments include labyrinth walking (e.g., AntMaze Duan et al. (2016)) and robot arm control (e.g., FetchReach Plappert et al. (2018)). Their states can only transit within a low-dimensional neighborhood constrained by the degrees of freedom of the actions.
Following the notation in Sec 3, we assume that a goal $g$ lies in a goal space $\mathcal{G}$ which is a subset of the state space $\mathcal{S}$. For example, in a labyrinth walking game with continuous locomotion, the goal can be to reach a specific location in the maze at any velocity. Then, if the state $s$ is a vector consisting of the location and velocity, a convenient way to represent the goal $g$ would be a vector containing only the location dimensions, i.e., the goal space is a projection of the state space.

The universal goal reaching problem has a specific transition probability and reward structure. At every time step, the agent moves within a local neighborhood defined by the metric of the state space, possibly perturbed by random noise. It also receives a negative penalty (usually a constant, e.g., $-1$ in the experiments) unless it has arrived at the vicinity of the goal, and a reward of $0$ once the goal is reached. To maximize the accumulated reward, the agent therefore has to reach the goal in the fewest steps. Since the only non-trivial reward appears rarely, the universal goal reaching problem falls into the category of sparse reward environments, which are hard-exploration problems for RL.

A Graph View:
Assume that a policy $\pi$ takes at most $T$ steps to move from $s$ to $g$, and that the absolute value of the reward at each step is bounded by $\epsilon$. Let $V^{\pi}(s,g)$ be the expected total discounted reward along the trajectory, with rewards satisfying $|r_t| \le \epsilon$ for all $t$. We can then bound the gap to the undiscounted return (when $\gamma \to 1$, we approximate $\gamma^{t}$ by its first-order Taylor expansion $1 + t(\gamma - 1)$):

$$\Big| V^{\pi}(s,g) - \mathbb{E}\Big[\sum_{t=0}^{T-1} r_t\Big] \Big| \;\le\; \epsilon \sum_{t=0}^{T-1} \big(1 - \gamma^{t}\big) \;\approx\; \frac{\epsilon \, T (T-1)(1-\gamma)}{2}.$$

Thus, when $\gamma \to 1$ and the per-step reward is the constant $-1$ until the goal is reached, the UVFA can be approximated as

$$V^{\pi}(s, g) \;\approx\; -\, d^{\pi}(s, g), \qquad (1)$$

where $d^{\pi}(s, g)$ denotes the expected number of steps the policy $\pi$ takes to move from $s$ to $g$. In this case, it is easy to show that value iteration based on the Bellman equation implies $d^{\pi^*}(s, g) = 1 + \mathbb{E}_{s' \sim p^{\pi^*}(\cdot \mid s)}\big[d^{\pi^*}(s', g)\big]$, where $p^{\pi^*}$ is the transition probability under the optimal policy $\pi^*$.
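As a quick sanity check of Eq. 1 (an illustrative calculation, not part of the derivation above): with a constant per-step reward of $-1$ and $\gamma = 0.99$, a trajectory that needs $d = 10$ steps gives

$$V^{\pi}(s,g) \;=\; -\sum_{t=0}^{9} 0.99^{t} \;\approx\; -9.56 \;\approx\; -10 \;=\; -\,d^{\pi}(s,g),$$

and the gap vanishes as $\gamma \to 1$.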
This relationship allows us to view the MDP as a directed graph whose nodes are the states in $\mathcal{S}$ and whose edges are sampled according to the transition probability of the MDP. General value iteration for RL problems is then exactly a shortest path algorithm in terms of $d$ on this directed graph. Moreover, because the nodes form a low-dimensional manifold, nodes that are far away in the state space can only be reached by a long path.

In summary, the MDP of our universal goal reaching problem is a large-scale directed graph whose nodes lie on a low-dimensional manifold. This structure allows us to estimate all-pair shortest paths accurately by a landmark-based coarsening of the graph.
5 Approach
In this paper, we adopt deep RL algorithms such as DQN and DDPG for discrete and continuous action spaces, respectively. UVFA Schaul et al. (2015) is a goal-conditioned extension of the original DQN, while HER (Sec 3) produces more informative feedback for UVFA learning. Our algorithm is thus based upon HER, and extending the approach to DDPG is straightforward.
5.1 Basic Idea
Our approach aims at addressing the fundamental challenges in UVFA learning. As characterized in the previous section, UVFA estimation solves a pairwise shortest path problem, and the underlying graph has a node space of high cardinality. Note that a UVFA has to memorize the distance between every state-goal pair, through trajectory samples from the starting state to the goal. For analysis purposes, we assume the state space has dimension $n$ and contains a ball of radius $R$. Then a lower bound on the number of state-goal pairs is of the order $R^{2n}$ (the volume of the state-goal pair set $\mathcal{S} \times \mathcal{G}$ is $|\mathcal{S}| \cdot |\mathcal{G}|$, where $\mathcal{S}$ and $\mathcal{G}$ are the state and goal sets, assuming $\mathcal{G} = \mathcal{S}$ for analysis only, and $\times$ is the Cartesian product), a high-order polynomial.
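To make the scaling concrete (an illustrative calculation under the stated assumption $\mathcal{G} = \mathcal{S}$): a ball of radius $R$ in $\mathbb{R}^{n}$ has volume proportional to $R^{n}$, so

$$|\mathcal{S} \times \mathcal{G}| \;\propto\; R^{n} \cdot R^{n} \;=\; R^{2n}.$$

For example, doubling the radius of the explored region in an $n = 3$ dimensional state space multiplies the number of state-goal pairs to be memorized by $2^{6} = 64$.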
The large set of state-goal pairs poses a challenge. First, it takes a long time to sample enough state-goal pairs. In particular, at the early stage only a few state-goal samples have been collected, so learning from them requires heavy extrapolation by the network, which is well known to be unreliable. Second, memorizing all the experiences is too difficult even for a large network.
We propose a map that abstracts the visited state space with landmarks and edges connecting them. This abstraction is reasonable due to the underlying structure of our graph, a low-dimensional manifold Goldberg and Harrelson (2005). We also learn a local UVFA network that only needs to be accurate in the neighborhood of the landmarks. As illustrated in Figure 1, an ant robot is put in a "U" maze to reach a given position; it should learn to model the maze as a small-scale map based on its past experiences.

This design addresses the challenges above. The UVFA network only needs to remember experiences in a local neighborhood, so the training procedure requires much lower sample complexity. The map decomposes a long path into piecewise short ones, each of which comes from an accurate local network.
5.2 Learning a Local UVFA with HER
Specifically, we define the following reward function for the goal reaching problem:

$$r_{g}(s, a) \;=\; \begin{cases} \;\;\,0, & \text{if } \|s' - g\| \le \delta, \\ -1, & \text{otherwise,} \end{cases}$$

where $s'$ is the next observation after taking action $a$ and $\delta$ is the tolerance for reaching the goal. We first learn a UVFA with HER, which has proven its efficiency for UVFA training. HER smartly generates more feedback for the agent by replacing unachieved goals with goals achieved in the near future, and thus allows the agent to obtain denser rewards before it can eventually reach goals that are far away.
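For concreteness, a sketch of the one-step bootstrapped target that a goal-conditioned critic would be regressed onto under this reward (standard DDPG/DQN-style bootstrapping; the callables critic and actor are assumed learned function approximators, and the names are placeholders):

def td_target(reward, reached, next_state, goal, critic, actor, gamma=0.98):
    """One-step TD target for a goal-conditioned critic."""
    if reached:
        return reward                      # no bootstrapping once the goal is reached
    next_action = actor(next_state, goal)  # greedy action of the current policy
    return reward + gamma * critic(next_state, next_action, goal)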
In experiments (see Sec 6.3), we find that the agent trained with HER does master the skill of reaching goals of increasing difficulty in a curriculum fashion. However, the agent can seldom reach the most difficult goals consistently, while the success rate for easier goals remains stable. These observations indicate that HER's value and policy are locally reliable.

To increase the agent's ability to reach nearby goals and to get a better local value estimate at the early stage, we change the replacement strategy in HER, ensuring that the replaced goals are sampled from the near future within a fixed number of steps.
The UVFA trained in this step is used for two purposes: (1) to estimate the distance between two local states belonging to the same landmark, or between two nearby landmarks; and (2) to decide whether two states are close enough that we can trust the distance estimate from the network. Although the learned UVFA is imperfect globally, it suffices for these two local usages.
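A minimal sketch of these two usages, assuming a learned goal-conditioned Q-function and the value-to-distance conversion of Eq. 1; the names q_function and the threshold max_local_dist are illustrative assumptions:

def estimated_distance(q_function, state, goal, actions):
    """Distance estimate d(s, g) = -max_a Q(s, a, g), following Eq. 1."""
    values = [q_function(state, a, goal) for a in actions]
    return -max(values)

def is_local(q_function, state, goal, actions, max_local_dist=10.0):
    """Trust the network's estimate only when the pair looks local."""
    return estimated_distance(q_function, state, goal, actions) <= max_local_dist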
5.3 Building a Map by Sampling Landmarks
After training the UVFA, we obtain a distance estimate $d(s, g)$ (if the algorithm returns a Q-function, we compute the value by selecting the optimal action, $V(s,g) = \max_{a} Q(s, a, g)$, and convert it to $d(s, g)$ via Eq. 1), a policy $\pi(a \mid s, g)$ for any state-goal pair, and a replay buffer that contains all the past experiences. We then build a landmark-based map to abstract the state space based on these experiences. The pseudocode is shown in Algorithm 1.
Landmark Sampling
The replay buffer stores the visited states. Instead of localizing a few important states that play a key role in connecting the environment, we seek to sample many states that cover the visited state space. Limited by the computation budget, we first uniformly sample a large set of states from the replay buffer, and then use the farthest point sampling (FPS) algorithm Arthur and Vassilvitskii (2007) to select landmarks that support the explored state space. The metric for FPS can be either the Euclidean distance between the original state representations or the pairwise value estimated by the agent.
We compare different sampling strategies in Section 6.3 and demonstrate the advantage of FPS in abstracting the visited state space and in exploration.
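A minimal sketch of farthest point sampling over a candidate set is given below (here with Euclidean distances; the UVFA-based metric would simply replace the norm):

import numpy as np

def farthest_point_sampling(candidates, n_landmarks, rng=np.random):
    """Greedily pick n_landmarks points so that each new pick is farthest
    from all points chosen so far."""
    candidates = np.asarray(candidates)
    chosen = [rng.randint(len(candidates))]            # start from a random state
    min_dist = np.linalg.norm(candidates - candidates[chosen[0]], axis=1)
    for _ in range(n_landmarks - 1):
        next_idx = int(np.argmax(min_dist))            # farthest from the chosen set
        chosen.append(next_idx)
        new_dist = np.linalg.norm(candidates - candidates[next_idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return candidates[chosen]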
Connecting Nearby Landmarks
We first connect landmarks that have a reliable distance estimate from the UVFA and assign the UVFA-estimated distance between them as the weight of the connecting edge. Since the UVFA is accurate locally but unreliable for the long-term future, we only connect nearby landmarks: the UVFA returns a distance for any pair of landmarks, and we keep only the pairs whose distance is below a preset clipping threshold, chosen so that all edges are reliable while the whole graph stays connected.

With these two steps, we obtain a directed weighted graph that approximates the visited state space. This graph is our map, to be used for high-level planning. The map induces a new environment in which an action is to move to another landmark. The details can be found in Algorithm 1.
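Putting the two steps together, a sketch of the map construction follows; the clip threshold and the distance function are the ones described above, and the adjacency-dictionary representation is an illustrative choice:

def build_landmark_map(landmarks, distance_fn, clip_threshold):
    """Connect landmark pairs whose estimated distance is below the threshold.
    Returns a weighted, directed adjacency dictionary: edges[i][j] = d(l_i, l_j)."""
    edges = {i: {} for i in range(len(landmarks))}
    for i, l_i in enumerate(landmarks):
        for j, l_j in enumerate(landmarks):
            if i == j:
                continue
            d = distance_fn(l_i, l_j)          # UVFA-estimated distance, Eq. 1
            if d <= clip_threshold:            # keep only locally reliable edges
                edges[i][j] = d
    return edges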
5.4 Planning with the Map
We can now leverage the map and the local UVFA network to estimate the distance between any state-goal pair, which induces a reliable policy for the agent to reach the goal.

For a given pair $(s, g)$, we plan a path by selecting a sequence of landmarks $l_1, l_2, \ldots, l_k$ that minimizes the approximated distance $d(s, l_1) + \sum_{i=1}^{k-1} d(l_i, l_{i+1}) + d(l_k, g)$. The policy from $s$ to $g$ is then approximated as $\pi(\cdot \mid s, l_1) \oplus \pi(\cdot \mid l_1, l_2) \oplus \cdots \oplus \pi(\cdot \mid l_k, g)$, where $\oplus$ denotes concatenation of the corresponding action sequences.

In our implementation, we run a shortest path algorithm to solve the above minimization problem. To speed up the pipeline, we pre-compute the distance from each landmark to the goal when the episode starts. When the agent is at state $s$, we choose the next subgoal by finding the landmark $l^{*} = \arg\min_{l} \big[ d(s, l) + d_{\text{map}}(l, g) \big]$, where $d_{\text{map}}(l, g)$ is the shortest-path distance from $l$ to the goal on the map.
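A sketch of this planner: Dijkstra's algorithm gives the landmark-to-goal distances once per episode, and the next subgoal is then chosen greedily. The helper names and the graph representation follow the sketch above and are illustrative assumptions.

import heapq

def landmark_to_goal_distances(edges, direct_to_goal):
    """edges[i][j] = d(l_i, l_j) on the landmark map; direct_to_goal[i] = d(l_i, g)
    from the local UVFA (infinity if the goal is not near landmark i).
    Returns dist[i] = shortest estimated distance from landmark i to the goal."""
    # reverse the graph so we can run Dijkstra "backwards" from the goal
    reverse = {i: {} for i in edges}
    for i, nbrs in edges.items():
        for j, w in nbrs.items():
            reverse[j][i] = w
    dist = dict(direct_to_goal)
    heap = [(d, i) for i, d in dist.items() if d < float("inf")]
    heapq.heapify(heap)
    while heap:
        d, j = heapq.heappop(heap)
        if d > dist.get(j, float("inf")):
            continue
        for i, w in reverse[j].items():            # relax edge l_i -> l_j
            if d + w < dist.get(i, float("inf")):
                dist[i] = d + w
                heapq.heappush(heap, (dist[i], i))
    return dist

def choose_subgoal(state, landmarks, distance_fn, dist_to_goal):
    """Pick the landmark minimizing d(s, l) + shortest-path distance from l to g."""
    scores = [distance_fn(state, l) + dist_to_goal[i] for i, l in enumerate(landmarks)]
    best = min(range(len(scores)), key=scores.__getitem__)
    return landmarks[best]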
6 Experiments
6.1 FourRoom: An Illustrative Example
Figure 1: (b), (c), and (d) are different evaluation metrics of value estimation and the success rate of reaching the goal.
We first demonstrate the merits of our method in the FourRoom environment, where the action space is discrete. The environment is visualized in Figure 1(a): walls separate the space into four rooms, with narrow openings connecting them. For this discrete environment, we use DQN Mnih et al. (2013) with HER Andrychowicz et al. (2017) to learn the Q-value, using the one-hot representation of the x-y position as the network input. The initial states and the goals are randomly sampled during training.
We first obtain $V$ from the learned Q-value via $V(s, g) = \max_{a} Q(s, a, g)$ and convert it to a pairwise distance using Eq. 1. To evaluate the accuracy of the distance estimate, we compute the ground-truth distance by running a shortest path algorithm on the underlying ground-truth graph of the maze, and adopt the mean distortion error (MDE) between the estimated and ground-truth distances as the evaluation metric.
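One natural instantiation of such a distortion error (an illustrative definition, stated here as an assumption rather than the exact formula used in the experiments) is the relative deviation averaged over the $N$ evaluated state-goal pairs:

$$\text{MDE} \;=\; \frac{1}{N} \sum_{(s, g)} \frac{\big| d_{\text{est}}(s, g) - d_{\text{gt}}(s, g) \big|}{d_{\text{gt}}(s, g)}.$$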
Results are shown in Figure 1(b). Our method has a much lower MDE at the very beginning stage, which means that the estimated value is more accurate.
To better evaluate our superiority for distant goals, we first convert predicted values to the corresponding distances and then plot the maximal distance during training. From Figure 1(c), we observe that the planning module has a larger output range than DQN. We conjecture that this comes from the max operation in the Bellman equation, which pushes DQN to overestimate the Q-value, or in other words, to underestimate the distance for distant goals. The planner, however, can still use piecewise correct estimates to approximate the real distance to the goal.
We also compare our method with DQN on success reaching rate, and their performances are shown in Figure 1(d).
6.2 Continuous Control
In this section, we will compare our method with HER on challenging classic control tasks and MuJoCo Todorov et al. (2012) goalreaching environments.
6.2.1 Environment Description
2DReach A green point in a 2D U-maze aims to reach a goal represented by a red point, as shown in Figure 2(a). The state space and the goal space are both this 2D maze. At each step, the agent moves a bounded distance along the x and y directions.
2DPush The green point A now needs to push a blue point B to a given goal (red point) lying in the same U-maze as 2DReach, as shown in Figure 2(b). Once A has reached B, B follows the movement of A. In this environment, the state is a vector that contains the locations of both A and B.
BlockedFetchReach & FetchPush We control a gripper to either reach a location in 3D space or push an object on the table to a specific location, as shown in Figure 2(c) and Figure 2(d). Since the original FetchReach implemented in OpenAI Gym Brockman et al. (2016) is very easy to solve, we add some blocks to increase the difficulty and call the new environment BlockedFetchReach.
PointMaze & AntMaze As shown in Figure 2(e) and Figure 2(f), a point mass or an ant is placed in a U-maze. Both agents are trained to reach a random goal from a random location and tested under the most difficult setting: reaching the other side of the maze within 500 steps. The states of the point and the ant are 7-dim and 30-dim, respectively, including positions and velocities.
Complex AntMaze As shown in Figure 2(g), an ant is placed in a complex maze. It is trained to reach a random goal from a random location and tested under the most difficult setting: reaching the farthest goal (indicated as the red point) within 1500 steps.
Acrobot As shown in Figure 2(h), the acrobot has two joints and two links. Goals are states in which the end-effector is above the black line at specific joint angles and velocities. The states and goals are both 6-dim vectors including joint angles and velocities.
6.2.2 Experiment Result
The results compared with HER are shown in Figure 4. Our method trains the UVFA with the planner and HER, and is evaluated under the test setting using the model and replay buffer at the corresponding training steps.
In the 2DReach and 2DPush tasks (shown in Figure 3(b)), our method achieves better performance. For the BlockedFetchReach and FetchPush control environments, our performance is still better than HER, but the improvement is less remarkable. We attribute this to the strict time limit of the two environments, which is very short; we observe that pure HER can eventually learn well when the task horizon is not very long.
We expect that building maps is most helpful for long-range goals, which is evidenced in environments with longer episode lengths. Here we choose PointMaze and AntMaze. For training, the agent is born at a random position and must reach a random goal in the maze; for testing, the agent should reach the other side of the U-maze within 500 steps. For these two environments, the performance of planning is significantly better and remains stable, while HER can hardly learn a reliable policy. Results are shown in Figure 3(e) and Figure 3(f).
We also evaluate our method on classic control and on a more complex navigation-plus-locomotion task: Complex AntMaze and Acrobot, with results shown in Figure 3(h) and Figure 3(g). The advantage over the baseline demonstrates that our method is applicable to complicated navigation tasks as well as more general MDPs.
6.2.3 Comparison with HRL
We compare our method with HRL algorithms on the large AntMaze, as shown in Table 1. We compare against HIRO Nachum et al. (2018), the state-of-the-art HRL algorithm on AntMaze, and HAC Levy et al. (2018), which also uses hindsight experience replay. We test these algorithms with the published code (HIRO: https://github.com/tensorflow/models/tree/master/research/efficient-hrl; HAC: https://github.com/andrew-j-levy/Hierarchical-Actor-Critc-HAC), under both the sparse reward setting and the dense reward setting.
In the sparse reward setting, our algorithm works well and reaches the goal at a very early stage ("Ours Sparse" in Table 1). In contrast, neither HAC nor HIRO is able to reach the goal within 2M steps. HIRO does not use HER to replace unachievable goals, which makes this setting very challenging for the algorithm.
In the dense reward setting, the map planner obtains a high success rate at a very early stage ("Ours Dense" in Table 1). Compared with "HIRO Dense", the planner reaches distant goals sooner, since we do not need to train a high-level policy to propose subgoals for the low-level agent.
HAC introduces several complex hyperparameters, and we could not make it work well in either setting.
Table 1: Success rate on the large AntMaze at different numbers of training steps.

               0.5M   0.75M   1M     1.25M   1.5M   1.75M   2M
Ours Sparse    0.0    0.03    0.3    0.4     0.45   0.5     0.5
HIRO Sparse    0.0    0.0     0.0    0.0     0.0    0.0     0.0
Ours Dense     0.0    0.09    0.45   0.5     0.7    0.8     0.9
HIRO Dense     0.0    0.0     0.0    0.1     0.4    0.6     0.8
6.3 Ablation Study
We study some key factors that affect our algorithm on AntMaze.
Choice of Clip Range and Landmarks There are two main hyperparameters for the planner: the number of landmarks and the edge clipping threshold. Figure 5(a) shows the evaluation result of the model trained after 0.8M steps in AntMaze. Our method is generally robust under different choices of these hyperparameters. Note that the clipping threshold is effectively a bound on the estimated (negative-value) distance between connected landmarks: if it is too small, the landmarks become isolated and cannot form a connected graph. The same problem arises when there are too few landmarks.
The Local Accuracy of HER We evaluate models trained between 0 and 2.5M steps on goals of different difficulties. We manually define the difficulty level of goals, as shown in Figure 4(a); difficulty increases from Level 1 to Level 6. We plot the success rate as well as the average number of steps needed to reach these goals. We find that, for easier goals, the agent takes less time and fewer steps to master the skill, and the success rate and average steps remain more stable during training, indicating that our base model is more reliable and stable in the local area.
Landmark Sampling Strategy Comparison Our landmarks are dynamically sampled from the replay buffer by the iterative FPS algorithm using distances estimated by the UVFA, and are updated at the beginning of every episode. FPS tends to find states at the boundary of the visited space, which implicitly helps exploration. We test FPS and uniform sampling on fixed-start AntMaze (the ant is born at a fixed position and must reach the other side of the maze for both training and testing). Figure 5(b) shows that FPS has a much higher success rate than uniform sampling. Figure 5(c) shows the landmark-based graph at four training stages: through FPS, the landmarks expand gradually towards the goal (red dot), even though they cover only a small proportion of the states at the beginning.
7 Conclusion
Learning a structured model and combining it with RL algorithms is important for reasoning and planning over long horizons. We propose a sample-based method to dynamically map the visited state space and demonstrate its empirical advantage in routing and exploration on several challenging RL tasks. Experimentally, we show that this approach solves long-range goal reaching problems better than model-free methods and hierarchical RL methods, even when the goal-conditioned model is only locally accurate. However, our method also has limitations. First, we empirically observe that some parameters, particularly the threshold for checking whether the agent has reached the vicinity of a goal, need hand-tuning. Second, a good state embedding is still important for the learning efficiency of our approach, since we do not include a substantial state-embedding learning component. Third, in some environments whose intrinsic dimension is very high, especially when the topological structure is hard to abstract, a sample-based method is not sufficient to represent the visited state space; and in environments where a reliable and generalizable local policy is hard to obtain, the approach also suffers from accumulated error.
References
[1] Andrychowicz et al. (2017). Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058.
[2] Arthur and Vassilvitskii (2007). k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035.
[3] Brockman et al. (2016). OpenAI Gym. arXiv:1606.01540.
[4] Bullo and Lewis (2004). Geometric Control of Mechanical Systems. Texts in Applied Mathematics, Vol. 49, Springer Verlag, New York-Heidelberg-Berlin. ISBN 0-387-22195-6.
[5] De Silva and Tenenbaum (2004). Sparse multidimensional scaling using landmark points. Technical report.
[6] Duan et al. (2016). Benchmarking deep reinforcement learning for continuous control. CoRR abs/1604.06778.
[7] Faust et al. (2018). PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 5113–5120.
[8] Goldberg and Harrelson (2005). Computing the shortest path: A* search meets graph theory. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 156–165.
[9] Hafner et al. (2018). Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551.
[10] Hart et al. (1968). A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics 4(2), pp. 100–107.
[11] Henaff et al. (2017). Model-based planning with discrete and continuous actions. arXiv preprint arXiv:1705.07177.
[12] Ichter and Pavone (2018). Robot motion planning in learned latent spaces. CoRR abs/1807.10366.
[13] Kavraki et al. (1994). Probabilistic roadmaps for path planning in high-dimensional configuration spaces. Vol. 1994, Unknown Publisher.
[14] Ke et al. (2018). Modeling the long term future in model-based reinforcement learning.
[15] Klamt and Behnke (2019). Towards learning abstract representations for locomotion planning in high-dimensional state spaces. arXiv preprint arXiv:1903.02308.
[16] Kulkarni et al. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 3675–3683.
[17] LaValle (1998). Rapidly-exploring random trees: A new tool for path planning.
[18] LeCun et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), pp. 2278–2324.
[19] Levy et al. (2018). Hierarchical reinforcement learning with hindsight. arXiv preprint arXiv:1805.08180.
[20] Lillicrap et al. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
[21] Mao et al. (2018). Universal agent for disentangling environments and tasks.
[22] Mnih et al. (2013). Playing Atari with deep reinforcement learning. CoRR abs/1312.5602.
[23] Moore (1990). Efficient memory-based learning for robot control. Technical report.
[24] Nachum et al. (2018). Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3303–3313.
[25] Oh et al. (2017). Value prediction network. In NIPS.
[26] Plappert et al. (2018). Multi-goal reinforcement learning: Challenging robotics environments and request for research. CoRR abs/1802.09464.
[27] Pong et al. (2018). Temporal difference models: Model-free deep RL for model-based control. CoRR abs/1802.09081.
[28] Qureshi and Yip (2018). Deeply informed neural sampling for robot motion planning. CoRR abs/1809.10252.
[29] Savinov et al. (2018). Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653.
[30] Schaul et al. (2015). Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320.
[31] Silver et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), pp. 484–489.
[32] Srinivas et al. (2018). Universal planning networks. CoRR abs/1804.00645.
[33] Sutton et al. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, pp. 761–768.
[34] Tamar et al. (2016). Value iteration networks. In Advances in Neural Information Processing Systems, pp. 2154–2162.
[35] Todorov et al. (2012). MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033.
[36] Yu et al. (2019). Unsupervised visuomotor control through distributional planning networks. CoRR abs/1902.05542.
[37] Zhang et al. (2018). Composable planning with attributes. arXiv preprint arXiv:1803.00512.