Reinforcement learning (RL) allows agents to be trained for planning and control tasks using feedback from the environment. While significant progress has been made in the standard setting of achieving a goal known at training time, e.g., reaching a given flag as in MountainCar Moore (1990), very limited effort has been devoted to the setting in which the goals at evaluation time are unknown at training time. For example, when a robot walks in an environment, the destination may vary from time to time. Tasks of this kind are ubiquitous and of crucial importance in practice. We call them Universal Markov Decision Process (UMDP) problems, following the convention of Levy et al. (2018).
Pioneering work handles UMDP problems by learning a Universal Value Function Approximator (UVFA). In particular, Schaul et al. (2015) proposed to approximate a goal-conditioned value function $V(s,g)$, where $s$ is the current state and $g$ is the goal, by a multi-layer perceptron (MLP), and Andrychowicz et al. (2017) proposed a framework called hindsight experience replay (HER) that reuses past experience to fit the universal value function with a TD loss. However, for complicated policies with long horizons, the UVFA learned by networks is often not accurate enough. This is because the UVFA has to memorize the cumulative reward between all state-goal pairs, which is a daunting job. In fact, the cardinality of state-goal pairs grows as a high-order polynomial in the horizon of goals.
While the general UMDP problem is extremely difficult, we consider a family of UMDP problems whose state space is a low-dimensional manifold in the ambient space. Most control problems are of this type, and geometric control theory has been developed for them in the literature Bullo and Lewis (2004). Our approach is inspired by manifold learning, e.g., Landmark MDS De Silva and Tenenbaum (2004). We abstract the state space as a small-scale map whose nodes are landmark states selected from the experience replay buffer and whose edges connect nearby nodes, with weights extracted from the learned local UVFA. A network is still used to fit the local UVFA accurately. The map allows us to run high-level planning using a pairwise shortest-path algorithm, and the local UVFA network allows us to derive an accurate local decision. For a long-term goal, we first use the local UVFA network to move to a nearby landmark, then route among landmarks using the map towards the goal, and finally reach the goal from the last landmark using the local UVFA network.
Our method is more sample-efficient than a UVFA learned purely by a network, for three main reasons. First, the UVFA estimator in our framework only needs to work well for local value estimation. The network does not need to remember values for faraway goals, so its load is alleviated. Second, for long-range state-goal pairs, the map propagates accurate local value estimates in a way that neural networks cannot. Consider the extreme case of a long-range state-goal pair that has never been experienced before. A network can only guess the value by extrapolation, which is known to be unreliable. Our map, however, can reasonably approximate the value as long as there is a path through landmarks that connects the pair. Lastly, the map provides a strong exploration ability and can help to obtain rewards significantly earlier, especially in the sparse reward setting. This is because we choose the landmarks from the replay buffer using a farthest-point sampling strategy, which tends to select states that are closer to the boundary of the visited space. In experiments, we compare our method with baselines on several challenging environments and find that it outperforms them.
Our contributions are as follows. First, we propose a sample-based method to map the visited state space using landmarks. Such a graph-like map is a powerful representation of the environment that maintains both local connectivity and global topology. Second, our framework simultaneously maps the visited state space and executes the planning strategy, with the help of a locally accurate value function approximator and the landmark-based map. It is a simple but effective way to improve the estimation accuracy of long-range value functions, and it induces a successful policy at an early stage of training.
2 Related work
Variants of goal-conditioned decision-making problems have been studied in the literature Sutton et al. (2011); Mao et al. (2018); Schaul et al. (2015); Pong et al. (2018). We focus on the goal-reaching task, where the goal is a subset of the state space. The agent receives meaningful rewards if and only if it has reached the goal, which brings significant challenges to existing RL algorithms. A significant recent approach along this line is Hindsight Experience Replay (HER) by Andrychowicz et al. (2017). They proposed to relabel the reached states as goals to improve data efficiency. However, they used only a single neural network to represent the value, learned by DDPG Lillicrap et al. (2015). This makes it hard to model long-range distances. Our method overcomes the issue by using a sample-based map to represent the global structure of the environment. The map allows rewards to be propagated to distant states more efficiently. It also allows the decision-making for long action sequences to be factorized into a high-level planning problem and a low-level control problem.
Model-based reinforcement learning algorithms usually need to learn a local forward model of the environment and then solve the multi-step planning problem with the learned model Hafner et al. (2018); Oh et al. (2017); Silver et al. (2016); Henaff et al. (2017); Srinivas et al. (2018); Yu et al. (2019). These methods rely on learning an accurate local model and require extra effort to generalize to long horizons Ke et al. (2018). In comparison, we learn a model of the environment in a hierarchical manner, with a network-based local model and a graph-based global model (map). Unlike previous works that fit forward dynamics in local models, our local model distills local cumulative rewards from the environment dynamics. In addition, our global model, a small graph-based map that abstracts the large state space, supports reward propagation at long range. One can compare our framework with Value Iteration Networks (VIN) Tamar et al. (2016). VIN focused on the 2D navigation problem. Given a predefined map of known nodes, edges, and weights, it runs the value iteration algorithm by simulating the process through a convolutional neural network LeCun et al. (1998). In contrast, we construct the map based upon the learned local model.
Traditional motion planning algorithms require knowledge of the model. Recent work has combined deep learning and deep reinforcement learning for motion planning Ichter and Pavone (2018); Qureshi and Yip (2018); Klamt and Behnke (2019); Faust et al. (2018). In particular, PRM-RL addressed the 2D navigation problem by combining a high-level shortest-path-based planner and a low-level RL algorithm. To connect nearby landmarks, it leveraged a physics engine, which depends on sophisticated domain knowledge and limits its use in other general RL tasks. In the general RL context, our work shows that one can combine a high-level planner and a learned local model to solve RL problems more efficiently. Some recent work also utilizes graph structure to perform planning Savinov et al. (2018); Zhang et al. (2018). However, unlike our approach, which discovers the graph structure in the process of achieving goals, both Savinov et al. (2018) and Zhang et al. (2018) require supervised learning to build the graph. Specifically, Savinov et al. (2018) need to learn a Siamese network to judge whether two states are connected, and Zhang et al. (2018) need to learn the state-attribute mapping from human annotation.
Our method is also related to hierarchical RL research Levy et al. (2018); Kulkarni et al. (2016); Nachum et al. (2018). The sampled landmark points can be considered as sub-goals. Levy et al. (2018) and Nachum et al. (2018) also used a HER-like relabeling technique to make training more efficient. These works address more general RL problems without assuming much problem structure. Our work differs from previous work in how the high-level policy is obtained. In their methods, the agent has to learn the high-level policy as another RL problem. In contrast, we exploit the structure of our universal goal reaching problem and find the high-level policy by solving a pairwise shortest path problem in a small-scale graph, which is more data-efficient.
Universal Markov Decision Process (UMDP) extends an MDP with a set of goals $\mathcal{G}$. A UMDP has a reward function $R: \mathcal{S} \times \mathcal{A} \times \mathcal{G} \to \mathbb{R}$, where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space. Every episode starts with a goal selected from $\mathcal{G}$ by the environment, which is fixed for the whole episode. We aim to find a goal-conditioned policy $\pi: \mathcal{S} \times \mathcal{G} \to \mathcal{A}$ that maximizes the expected cumulative future return $V_{\pi}(s_0, g) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, g)\right]$, which is called the goal-conditioned value, or universal value. Universal Value Function Approximators (UVFA) Schaul et al. (2015) use a neural network to model $V(s,g) \approx V_{\pi^*}(s,g)$, where $\pi^*$ is the optimal policy, and apply the Bellman equation to train it in a bootstrapping way. Usually, the reward in a UMDP is too sparse to train the network easily. For a given goal, the agent can receive non-trivial rewards only when it reaches the goal. This brings a challenge to the learning process.
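The bootstrapped Bellman update mentioned above can be illustrated with a minimal tabular sketch. This is not the authors' network-based implementation: the dictionary `Q`, the helper names `td_target` and `q_update`, and the learning rate `alpha` are hypothetical, chosen only to show how the goal enters the update as an extra index.

```python
def td_target(reward, next_q_values, gamma=0.98, done=False):
    """One-step bootstrapped target: r + gamma * max_b Q(s', b, g)."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def q_update(Q, s, a, g, reward, s_next, actions,
             alpha=0.5, gamma=0.98, done=False):
    """Move Q(s, a, g) towards the TD target; Q is a dict keyed by (s, a, g)."""
    next_q = [Q.get((s_next, b, g), 0.0) for b in actions]
    target = td_target(reward, next_q, gamma, done)
    old = Q.get((s, a, g), 0.0)
    Q[(s, a, g)] = old + alpha * (target - old)
    return Q[(s, a, g)]
```

A UVFA replaces the table with a network that takes the state and the goal as joint input, but the target has the same form.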
Hindsight Experience Replay (HER) Andrychowicz et al. (2017) proposes goal relabeling to train a UVFA in the sparse reward setting. The key insight of HER is to “turn failure to success”, i.e., to make a failed trajectory a successful one by replacing the original, unreached goals with goals the trajectory has actually achieved. This strategy gives more feedback to the agent and improves data efficiency in sparse reward environments. Our framework relies on HER to train an accurate low-level policy.
4 Universal Goal Reaching
Our universal goal reaching problem refers to a family of UMDP tasks. The state space of our UMDP is a low-dimensional manifold in the ambient space. Many useful planning problems in practice are of this kind. Example universal goal reaching environments include labyrinth walking (e.g., AntMaze Duan et al. (2016)) and robot arm control (e.g., FetchReach Plappert et al. (2018)). Their states can only transition within a low-dimensional neighborhood constrained by the degrees of freedom of the actions.
Following the notation in Sec 3, we assume that a goal $g$ lies in a goal space $\mathcal{G}$ that is a subset of the state space $\mathcal{S}$. For example, in a labyrinth walking game with continuous locomotion, the goal can be to reach a specific location in the maze at any velocity. Then, if the state $s$ is a vector consisting of the location and velocity, a convenient way to represent the goal $g$ would be a vector that only contains the location dimensions, i.e., the goal space is a projection of the state space.
The universal goal reaching problem has a specific transition probability and reward structure. At every time step, the agent moves into a local neighborhood defined by the metric in the state space, possibly perturbed by random noise. It also receives a negative penalty (usually a constant, e.g., $-1$ in the experiments) unless it has arrived in the vicinity of the goal, in which case it receives a reward of $0$. To maximize the accumulated reward, the agent has to reach the goal in the fewest steps. Because the only non-trivial reward appears rarely, the universal goal reaching problem falls into the category of sparse-reward environments, which are hard-exploration problems for RL.
A Graph View:
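Assume that a policy $\pi$ takes at most $T$ steps to move from $s$ to $g$ and that the absolute value of the reward $r_k$ at each step is bounded by $R_{\max}$. Let $w_\pi(s,g)$ be the expected total reward along the trajectory, with $w_\pi(g,g) = 0$ for all $g$. We can prove (when $\gamma \approx 1$, we can approximate $\gamma^T$ by its first-order Taylor expansion $1 + T(\gamma - 1)$):
$$\left| V_\pi(s,g) - w_\pi(s,g) \right| \le (1 - \gamma^T)\, T R_{\max} \approx (1 - \gamma)\, T^2 R_{\max}.$$
Thus, when $\gamma \approx 1$ and $T \ll 1/(1-\gamma)$, the UVFA can be approximated as:
$$V_\pi(s,g) \approx w_\pi(s,g) = -d_\pi(s,g), \qquad (1)$$
where $d_\pi(s,g)$ denotes the expected accumulated cost (distance) from $s$ to $g$ under $\pi$.
In this case, it is easy to show that value iteration based on the Bellman equation implies $d_{\pi^*}(s,g) \approx \mathbb{E}_{s' \sim P_{\pi^*}(\cdot \mid s)}\left[ -r + d_{\pi^*}(s', g) \right]$, where $r$ is the one-step reward and $P_{\pi^*}$ is the transition probability of the optimal policy $\pi^*$.
The relationship allows us to view the MDP as a directed graph whose nodes are the state set $\mathcal{S}$ and whose edges are sampled according to the transition probability in the MDP. The general value iteration for RL problems is exactly the shortest path algorithm in terms of $d_\pi(s,g)$ on this directed graph. Besides, because the nodes form a low-dimensional manifold, nodes that are far away in the state space can only be reached by a long path.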
The MDP of our universal goal reaching problem is a large-scale directed graph whose nodes are in a low-dimensional manifold. This structure allows us to estimate the all-pair shortest paths accurately by a landmark based coarsening of the graph.
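To make the graph view concrete, the following sketch checks it on a toy deterministic graph. It assumes a reward of -1 per step, a reward of 0 at the goal, and no discounting; under those assumptions the negated value from value iteration equals the hop count to the goal found by breadth-first search. The function names and the toy graph are illustrative, not taken from the paper.

```python
from collections import deque


def value_iteration(edges, goal, n_nodes):
    """Undiscounted value iteration with reward -1 per step and 0 at the goal."""
    V = [float("-inf")] * n_nodes
    V[goal] = 0.0
    for _ in range(n_nodes):  # n_nodes sweeps suffice on a graph with n_nodes nodes
        for s in range(n_nodes):
            succ = edges.get(s, [])
            if s != goal and succ:
                V[s] = max(V[s], max(-1.0 + V[t] for t in succ))
    return V


def bfs_distance(edges, goal, n_nodes):
    """Hop count from every node to the goal, via BFS on the reversed graph."""
    reverse = {}
    for s, succ in edges.items():
        for t in succ:
            reverse.setdefault(t, []).append(s)
    dist = [float("inf")] * n_nodes
    dist[goal] = 0
    queue = deque([goal])
    while queue:
        t = queue.popleft()
        for s in reverse.get(t, []):
            if dist[s] == float("inf"):
                dist[s] = dist[t] + 1
                queue.append(s)
    return dist


# Toy directed graph: a 4-cycle plus one shortcut edge, and an isolated node 4.
edges = {0: [1], 1: [2, 3], 2: [3], 3: [0]}
values = value_iteration(edges, goal=3, n_nodes=5)
distances = bfs_distance(edges, goal=3, n_nodes=5)
```

Node 4 has no path to the goal, so its value stays at negative infinity and its distance at infinity.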
In this paper, we choose deep RL algorithms such as DQN and DDPG for discrete and continuous action spaces, respectively. UVFA Schaul et al. (2015) is a goal-conditioned extension of the original DQN, while HER (Sec 3) can produce more informative feedback for UVFA learning. Our algorithm is thus based upon HER, and the extension of this approach to DDPG is straightforward.
5.1 Basic Idea
Our approach aims at addressing the fundamental challenges in UVFA learning. As characterized in the previous section, UVFA estimation solves a pairwise shortest path problem, and the underlying graph has a node space of high cardinality. Note that the UVFA has to memorize the distance between every state-goal pair, through trajectory samples from the starting state to the goal. For analysis purposes, we assume the state space has dimension $n$ and contains a ball of radius $\rho$. Then the lower bound on the number of state-goal pairs is of the order of $\rho^{2n}$ (the volume of the state-goal pair set is $\mathrm{vol}(\mathcal{S} \times \mathcal{G})$, where $\mathcal{S}$ and $\mathcal{G}$ are the state and goal sets, assuming $\mathcal{G} = \mathcal{S}$ for analysis only, and $\times$ is the Cartesian product), a high-order polynomial.
The large set of state-goal pairs poses two challenges. First, it takes a long time to sample enough state-goal pairs. In particular, at the early stage only a few state-goal samples have been collected, so learning from them requires heavy extrapolation by networks, which is well known to be unreliable. Second, memorizing all the experiences is too difficult even for large networks.
We propose a map that abstracts the visited state space by landmarks and by edges that connect them. This abstraction is reasonable due to the underlying structure of our graph, a low-dimensional manifold Goldberg and Harrelson (2005). We also learn local UVFA networks that only need to be accurate in the neighborhood of landmarks. As illustrated in Figure 1, an ant robot is put in a “U” Maze to reach a given position. It should learn to model the maze as a small-scale map based on its past experiences.
This solution addresses the challenges. The UVFA network only needs to remember experiences in a local neighborhood, so the training procedure has much lower sample complexity. The map decomposes a long path into short pieces, each of which is estimated by an accurate local network.
5.2 Learning a Local UVFA with HER
Specifically, we define the following reward function for the goal reaching problem:
$$r_t = R(s_t, a_t, g) = \begin{cases} 0, & \|s'_t - g\| \le \delta, \\ -1, & \text{otherwise}. \end{cases}$$
Here $s'_t$ is the next observation after taking action $a_t$, and $\delta$ is the threshold that decides whether the vicinity of the goal has been reached. We first learn a UVFA based on HER, which has proven efficient for UVFA learning. HER generates more feedback for the agent by replacing some unachievable goals with those achieved in the near future. HER thus allows the agent to obtain denser rewards before it can eventually reach goals that are far away.
In experiments (see Sec 6.3), we find that the agent trained with HER does master the skill of reaching goals of increasing difficulty, in a curriculum-like way. However, the agent can seldom reach the most difficult goals consistently, while the success rate of reaching easier goals remains stable. These observations suggest that HER’s value and policy are locally reliable.
To increase the agent’s ability to reach nearby goals and get a better local value estimation at the early stage, we change the replacement strategy in HER, ensuring that the replaced goals are sampled from the near future within a fixed number of steps.
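A minimal sketch of this near-future relabeling is given below. It assumes that goals live in the same space as states, that the reward is 0 within a Euclidean threshold and a constant penalty otherwise, and that the window length `max_future` plays the role of the fixed number of steps. All names and default values are hypothetical.

```python
import math
import random


def sparse_reward(next_state, goal, threshold=0.5, penalty=-1.0):
    """Return 0 when next_state is within `threshold` of goal, else a constant penalty."""
    return 0.0 if math.dist(next_state, goal) <= threshold else penalty


def relabel_near_future(trajectory, max_future=3, rng=random):
    """Relabel each transition with a goal achieved at most `max_future` steps later.

    trajectory: list of (state, action, next_state) tuples.
    Returns a list of (state, action, next_state, goal, reward) tuples.
    """
    relabeled = []
    last_index = len(trajectory) - 1
    for t, (state, action, next_state) in enumerate(trajectory):
        k = rng.randint(t, min(last_index, t + max_future))  # inclusive bounds
        goal = trajectory[k][2]  # state actually reached at step k
        reward = sparse_reward(next_state, goal)
        relabeled.append((state, action, next_state, goal, reward))
    return relabeled
```

With `max_future=0` every transition is relabeled with its own outcome and therefore receives reward 0; larger windows trade that density for longer-range supervision.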
The UVFA trained in this step will be used for two purposes: (1) to estimate the distance between two local states belonging to the same landmark, or between two nearby landmarks; and (2) to decide whether two states are close enough so that we can trust the distance estimation from the network. Although the learned UVFA is imperfect globally, it is enough for the two local usages.
5.3 Building a Map by Sampling Landmarks
After training the UVFA, we obtain a distance estimation $d(s,g)$ (if the algorithm returns a $Q$-function, we compute the value $V(s,g) = \max_a Q(s,a,g)$ by selecting the optimal action and convert $V$ to $d$ by Eq. 1), a policy $\pi$ for any state-goal pair $(s,g)$, and a replay buffer $\mathcal{B}$ that contains all the past experiences. We build a landmark-based map to abstract the state space based on these experiences. The pseudo-code for the algorithm is shown in Algorithm 1.
The replay buffer stores visited states. Rather than localizing a few important states that play a key role in connecting the environment, we seek to sample many states to cover the visited state space.
Limited by computation budget, we first uniformly sample a big set of states from the replay buffer, and then use the farthest point sampling (FPS) algorithm Arthur and Vassilvitskii (2007) to select landmarks to support the explored state space. The metric for FPS can either be the Euclidean distance between the original state representation or the pairwise value estimated by the agent.
We compare different sampling strategies in Section 6.3, and demonstrate the advantage of FPS in abstracting the visited state space and exploration.
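Farthest point sampling itself is simple to state in code. The sketch below is a generic greedy version over a list of candidate states. The `dist` argument defaults to Euclidean distance but could be any pairwise estimate, such as a learned one. The function name, signature, and starting index are illustrative assumptions rather than the paper's implementation.

```python
import math


def farthest_point_sampling(points, k, dist=math.dist, start=0):
    """Greedily pick up to k indices so each new point is farthest from those chosen.

    points: list of coordinate tuples; dist(p, q) returns the distance from p to q.
    """
    chosen = [start]
    # min_d[i] is the distance from point i to its nearest chosen point so far.
    min_d = [dist(p, points[start]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=min_d.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            d = dist(p, points[nxt])
            if d < min_d[i]:
                min_d[i] = d
    return chosen


candidates = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0), (0.5, 0.0)]
landmark_ids = farthest_point_sampling(candidates, k=3)
```

In this toy example the second pick is the far end of the segment and the third is its midpoint, which shows how the procedure spreads landmarks towards the boundary before filling the interior.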
Connecting Nearby Landmarks
We first connect landmarks that have a reliable distance estimation from the UVFA and assign the UVFA-estimated distance between them as the weight of the connecting edge.
Since the UVFA is accurate locally but unreliable over long horizons, we choose to connect only nearby landmarks. The UVFA is able to return a distance between any pair of landmarks $(l_i, l_j)$, so we connect the pairs whose distance is below a preset threshold $\tau$, which should be chosen so that all the edges are reliable and the whole graph is connected.
With these two steps, we have built a directed weighted graph that approximates the visited state space. This graph is our map, to be used for high-level planning. Such a map induces a new environment, in which the action is to choose another landmark to move to. The details can be found in Algorithm 1.
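The edge-construction step can be sketched as follows, assuming a callable `distance_fn(a, b)` that stands in for the learned distance estimate and a dense weight matrix as the graph representation. Both are illustrative choices, and the helper name is hypothetical.

```python
def build_landmark_graph(landmarks, distance_fn, threshold):
    """Return a weight matrix W where W[i][j] is the estimated distance from
    landmark i to landmark j if it is at most `threshold`, and infinity otherwise."""
    n = len(landmarks)
    inf = float("inf")
    W = [[inf] * n for _ in range(n)]
    for i in range(n):
        W[i][i] = 0.0
        for j in range(n):
            if i == j:
                continue
            d = distance_fn(landmarks[i], landmarks[j])
            if d <= threshold:  # keep only edges the local estimator can be trusted on
                W[i][j] = d
    return W


# Toy 1-D example in which absolute difference plays the role of the estimate.
toy_landmarks = [0.0, 1.0, 2.0, 5.0]
toy_graph = build_landmark_graph(toy_landmarks, lambda a, b: abs(a - b), threshold=1.5)
```

Because the distance estimate need not be symmetric, the matrix is filled for both orderings of each pair, which keeps the graph directed.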
5.4 Planning with the Map
We can now leverage the map and the local UVFA network to estimate the distance between any state-goal pairs, which induces a reliable policy for the agent to reach the goal.
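For a given pair $(s, g)$, we can plan the optimal path from $s$ to $g$ by selecting a series of landmarks $l_1, \dots, l_k$, so that the approximated distance is $\bar{d}(s,g) = \min_{l_1, \dots, l_k} \left[ d(s, l_1) + \sum_{i=1}^{k-1} d(l_i, l_{i+1}) + d(l_k, g) \right]$. The policy from $s$ to $g$ can then be approximated as $\bar{\pi}(s,g) = \pi(s, l_1) + \sum_{i=1}^{k-1} \pi(l_i, l_{i+1}) + \pi(l_k, g)$. Here the summation of $\pi$ denotes the concatenation of the corresponding action sequences.
In our implementation, we run a shortest path algorithm to solve the above minimization problem. To speed up the pipeline, we first calculate the pairwise distances $\bar{d}(l_i, g)$ between each landmark $l_i$ and the goal $g$ when an episode starts. When the agent is at state $s$, we can choose the next subgoal by finding $g_{\text{next}} = \arg\min_{l_i} \left[ d(s, l_i) + \bar{d}(l_i, g) \right]$.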
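A compact sketch of this planning step is shown below. It uses Floyd-Warshall for the all-pairs distances among landmarks and then picks the landmark that minimizes the local distance from the current state plus the map distance to the goal. The choice of Floyd-Warshall, the restriction of the first and last hops to the trusted threshold, and every name here are assumptions made for illustration.

```python
def all_pairs_shortest_paths(W):
    """Floyd-Warshall on a dense weight matrix (infinity marks a missing edge)."""
    n = len(W)
    D = [row[:] for row in W]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D


def next_subgoal(state, goal, landmarks, D, distance_fn, threshold):
    """Return (index, cost) of the landmark minimizing d(state, l_i) + d_bar(l_i, goal)."""
    inf = float("inf")
    n = len(landmarks)
    # Map distance from each landmark to the goal; the final hop must be local.
    to_goal = [inf] * n
    for i in range(n):
        for j in range(n):
            last_hop = distance_fn(landmarks[j], goal)
            if last_hop <= threshold:
                to_goal[i] = min(to_goal[i], D[i][j] + last_hop)
    best_index, best_cost = None, inf
    for i in range(n):
        first_hop = distance_fn(state, landmarks[i])
        if first_hop <= threshold and first_hop + to_goal[i] < best_cost:
            best_index, best_cost = i, first_hop + to_goal[i]
    return best_index, best_cost
```

The `to_goal` list only depends on the goal, so in an episodic setting it could be computed once at the start of the episode and reused at every step.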
6.1 FourRoom: An Illustrative Example
Figure 1 (caption, partially recovered): (a) the FourRoom environment; (b)-(d) different evaluation metrics for value estimation and the success rate of reaching the goal.
We first demonstrate the merits of our method in the FourRoom environment, where the action space is discrete. The environment is visualized in Figure 1(a). There are walls separating the space into four rooms, with narrow openings to connect them. For this discrete environment, we use DQN Mnih et al. (2013) with HER Andrychowicz et al. (2017) to learn the Q value. Here, we use the one-hot representation of the x-y position as the input of the network. The initial states and the goals are randomly sampled during training.
We first obtain $V(s,g)$ from the learned Q-value by the equation $V(s,g) = \max_a Q(s,a,g)$, and convert $V(s,g)$ to the pairwise distance $\hat{d}(s,g)$ based on Eq. 1. To evaluate the accuracy of distance estimation, we further calculate the ground-truth distance $d(s,g)$ by running a shortest path algorithm on the underlying ground-truth graph of the maze. Then we adopt the mean distortion error (MDE) as the evaluation metric: $\mathrm{MDE} = \mathbb{E}_{(s,g)}\left[ \frac{|\hat{d}(s,g) - d(s,g)|}{d(s,g)} \right]$.
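As an illustration, a distortion metric of this kind can be computed as below. The sketch assumes that MDE is the mean relative absolute error between estimated and ground-truth distances and that pairs with zero true distance are skipped; the exact definition used in the experiments may differ, and the function name is hypothetical.

```python
def mean_distortion_error(estimated, truth):
    """Mean of |d_hat - d| / d over pairs with a positive ground-truth distance."""
    ratios = [abs(e - t) / t for e, t in zip(estimated, truth) if t > 0]
    if not ratios:
        raise ValueError("no pairs with a positive ground-truth distance")
    return sum(ratios) / len(ratios)


# Two state-goal pairs: one estimated exactly, one underestimated by 20%.
example_mde = mean_distortion_error([2.0, 4.0], [2.0, 5.0])
```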
Results are shown in Figure 1(b). Our method has a much lower MDE at the very beginning stage, which means that the estimated value is more accurate.
To better evaluate our advantage for distant goals, we first convert predicted values to the corresponding distances and then plot the maximal distance during training. From Figure 1(c), we can observe that the planning module has a larger output range than DQN. We conjecture that this comes from the max operation in the Bellman equation, which pushes DQN to overestimate the Q value or, in other words, to underestimate the distance for distant goals. However, the planner can still use piecewise-correct estimations to approximate the real distance to the goal.
We also compare our method with DQN on success reaching rate, and their performances are shown in Figure 1(d).
6.2 Continuous Control
In this section, we will compare our method with HER on challenging classic control tasks and MuJoCo Todorov et al. (2012) goal-reaching environments.
6.2.1 Environment Description
2DReach A green point in a 2D U-maze aims to reach the goal represented by a red point, as shown in Figure 2(a). The size of the maze is . The state space and the goal space are both in this 2D maze. At each step, the agent can move within as in x and y directions.
2DPush The green point A now needs to push a blue point B to a given goal (red point) lying in the same U-maze as 2DReach, as shown in Figure 2(b). Once A has reached B, B will follow the movement of A. In this environment, the state is a 4-dim vector that contains the locations of both A and B.
BlockedFetchReach & FetchPush We need to control a gripper to either reach a location in 3D space or push an object on the table to a specific location, as shown in Figure 2(c) and Figure 2(d). Since the original FetchReach implemented in OpenAI gym Brockman et al. (2016) is very easy to solve, we further add some blocks to increase the difficulty. We call this new environment BlockedFetchReach.
PointMaze & AntMaze As shown in Figure 2(e) and Figure 2(f), a point mass or an ant is put in a U-maze. Both agents are trained to reach a random goal from a random location and tested under the most difficult setting to reach the other side of maze within 500 steps. The states of point and ant are 7-dim and 30-dim, including positions and velocities.
Complex AntMaze As shown in Figure 2(g), an ant is put in a complex maze. It is trained to reach a random goal from a random location and tested under the most difficult setting to reach the farthest goal (indicated as the red point) within 1500 steps.
Acrobot As shown in Figure 2(h), an acrobot includes two joints and two links. Goals are states in which the end-effector is above the black line at specific joint angles and velocities. The states and goals are both 6-dim vectors including joint angles and velocities.
6.2.2 Experiment Result
The results compared with HER are shown in Figure 4. Our method trains UVFA with planner and HER. It is evaluated under the test setting, using the model and replay buffer at corresponding training steps.
In the 2DReach and 2DPush task (shown in Figure 3(b)), we can see our method achieves better performance. When incorporating with control tasks, for BlockedFetchReach and FetchPush environments, the results still show that our performance is better than HER, but the improvement is not so remarkable. We guess this comes from the strict time limit of the two environments, which is only . We observe that pure HER can finally learn well, when the task horizon is not very long.
We expect that building maps would be more helpful for long-range goals, which is evidenced in the environments with longer episode length. Here we choose PointMaze and AntMaze with scale . For training, the agent is born at a random position to reach a random goal in the maze. For testing, the agent should reach the other side of the “U-Maze” within 500 steps. For these two environments, the performance of planning is significantly better and remains stable, while HER can hardly learn a reliable policy. Results are shown in Figure 3(e) and Figure 3(f).
We also evaluate our method on classic control and on a more complex task that combines navigation and locomotion. Here we choose Complex AntMaze and Acrobot, and the results are shown in Figure 3(h) and Figure 3(g). The advantage over the baseline demonstrates that our method is applicable to complicated navigation tasks as well as general MDPs.
6.2.3 Comparison with HRL
We compare our method with HRL algorithms on large AntMaze (size ), as shown in Table 1. We choose to compare with HIRO Nachum et al. (2018), which is the SOTA HRL algorithm on AntMaze, and HAC Levy et al. (2018), which also uses hindsight experience replay. We test these algorithms with the published code (HIRO: https://github.com/tensorflow/models/tree/master/research/efficient-hrl; HAC: https://github.com/andrew-j-levy/Hierarchical-Actor-Critc-HAC-), under both the sparse reward setting and the dense reward setting.
In the sparse reward setting, our algorithm works well and reaches the goal at a very early stage (Ours sparse in Table 1). In contrast, neither HAC nor HIRO is able to reach the goal within 2M steps. HIRO does not use HER to replace unachievable goals, which makes this setting very challenging for the algorithm.
In the dense reward setting, the map planner obtains a high success rate at a very early stage, shown as Ours dense in Table 1. Compared with HIRO dense, the planner can reach distant goals sooner, since we do not need to train a high-level policy to propose subgoals for the low-level agent.
HAC introduces several complex hyper-parameters, and we could not make it work well in either setting.
6.3 Ablation Study
We study some key factors that affect our algorithm on AntMaze.
Choice of Clip Range and Landmarks There are two main hyper-parameters for the planner: the number of landmarks and the edge clipping threshold $\tau$. Figure 5(a) shows the evaluation result of the model trained after 0.8M steps in AntMaze. We see that our method is generally robust under different choices of hyper-parameters. Here $\tau$ is the negative distance between landmarks. If it is too small, the landmarks will be isolated and cannot form a connected graph. The same problem arises when there are not enough landmarks.
The Local Accuracy of HER We evaluate our model trained between 0 and 2.5M steps, for goals of different difficulties. We manually define the difficulty level of goals, as shown in Figure 4(a). The goal difficulty increases from Level 1 to Level 6. We plot the success rate as well as the average number of steps to reach these goals. We find that, for the easier goals, the agent takes less time and fewer steps to master the skill. The success rate and average steps also remain more stable during the training process, indicating that our base model is more reliable and stable in the local area.
Landmark Sampling Strategy Comparison Our landmarks are dynamically sampled from the replay buffer by the iterative FPS algorithm using distances estimated by the UVFA, and they are updated at the beginning of every episode. FPS tends to find states at the boundary of the visited space, which implicitly helps exploration. We test FPS and uniform sampling in fix-start AntMaze (the ant is born at a fixed position and must reach the other side of the maze for both training and testing). Figure 5(b) shows that FPS has a much higher success rate than uniform sampling. Figure 5(c) shows the landmark-based graph at four training stages. Through FPS, the landmarks expand gradually towards the goal (red dot), even though they cover only a small proportion of states at the beginning.
Learning a structured model and combining it with RL algorithms is important for reasoning and planning over long horizons. We propose a sample-based method to dynamically map the visited state space and demonstrate its empirical advantage in routing and exploration in several challenging RL tasks. Experimentally, we showed that this approach can solve long-range goal reaching problems better than model-free methods and hierarchical RL methods on a number of challenging games, even if the goal-conditioned model is only locally accurate. However, our method also has limitations. First, we empirically observe that some parameters, particularly the threshold used to check whether we have reached the vicinity of a goal, need hand-tuning. Second, a good state embedding is still important for the learning efficiency of our approach, since we do not include a heavy component for learning state embeddings. Third, we find that in some environments whose intrinsic dimension is very high, especially when the topological structure is hard to abstract, a sample-based method is not enough to represent the visited state space. Finally, for environments in which it is hard to obtain a reliable and generalizable local policy, this approach will also suffer from accumulated error.
- Andrychowicz et al. (2017) Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058. Cited by: §1, §2, §3, §6.1.
- Arthur and Vassilvitskii (2007) K-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035. Cited by: §5.3.
- Brockman et al. (2016) OpenAI gym. Cited by: §6.2.1.
- Bullo and Lewis (2004) Geometric control of mechanical systems. Texts in Applied Mathematics, Vol. 49, Springer Verlag, New York-Heidelberg-Berlin. Cited by: §1.
- De Silva and Tenenbaum (2004) Sparse multidimensional scaling using landmark points. Technical report. Cited by: §1.
- Duan et al. (2016) Benchmarking deep reinforcement learning for continuous control. CoRR abs/1604.06778. Cited by: §4.
- Faust et al. (2018) PRM-RL: long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 5113–5120. Cited by: §2.
- Goldberg and Harrelson (2005) Computing the shortest path: A* search meets graph theory. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 156–165. Cited by: §5.1.
- Hafner et al. (2018) Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551. Cited by: §2.
- Hart et al. (1968) A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics 4 (2), pp. 100–107. Cited by: §2.
- Henaff et al. (2017) Model-based planning with discrete and continuous actions. arXiv preprint arXiv:1705.07177. Cited by: §2.
- Ichter and Pavone (2018) Robot motion planning in learned latent spaces. CoRR abs/1807.10366. Cited by: §2.
- Kavraki et al. (1994) Probabilistic roadmaps for path planning in high-dimensional configuration spaces. Vol. 1994, Unknown Publisher. Cited by: §2.
- Ke et al. (2018) Modeling the long term future in model-based reinforcement learning. Cited by: §2.
- Klamt and Behnke (2019) Towards learning abstract representations for locomotion planning in high-dimensional state spaces. arXiv preprint arXiv:1903.02308. Cited by: §2.
- Kulkarni et al. (2016) Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 3675–3683. Cited by: §2.
- LaValle (1998) Rapidly-exploring random trees: a new tool for path planning. Cited by: §2.
- LeCun et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §2.
- Levy et al. (2018) Hierarchical reinforcement learning with hindsight. arXiv preprint arXiv:1805.08180. Cited by: §1, §2, §6.2.3.
- Lillicrap et al. (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §2.
- Mao et al. (2018) Universal agent for disentangling environments and tasks. Cited by: §2.
- Mnih et al. (2013) Playing Atari with deep reinforcement learning. CoRR abs/1312.5602. Cited by: §6.1.
- Moore (1990) Efficient memory-based learning for robot control. Technical report. Cited by: §1.
- Nachum et al. (2018) Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3303–3313. Cited by: §2, §6.2.3.
- Oh et al. (2017) Value prediction network. In NIPS. Cited by: §2.
- Plappert et al. (2018) Multi-goal reinforcement learning: challenging robotics environments and request for research. CoRR abs/1802.09464. Cited by: §4.
- Pong et al. (2018) Temporal difference models: model-free deep RL for model-based control. CoRR abs/1802.09081. Cited by: §2.
- Qureshi and Yip (2018) Deeply informed neural sampling for robot motion planning. CoRR abs/1809.10252. Cited by: §2.
- Savinov et al. (2018) Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653. Cited by: §2.
- Schaul et al. (2015) Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320. Cited by: §1, §2, §3, §5.
- Silver et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: §2.
- Srinivas et al. (2018) Universal planning networks. CoRR abs/1804.00645. Cited by: §2.
- Sutton et al. (2011) Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pp. 761–768. Cited by: §2.
- Tamar et al. (2016) Value iteration networks. In Advances in Neural Information Processing Systems, pp. 2154–2162. Cited by: §2.
- Todorov et al. (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §6.2.
- Yu et al. (2019) Unsupervised visuomotor control through distributional planning networks. CoRR abs/1902.05542. Cited by: §2.
- Zhang et al. (2018) Composable planning with attributes. arXiv preprint arXiv:1803.00512. Cited by: §2.