Prioritizing Starting States for Reinforcement Learning

11/27/2018 ∙ by Arash Tavakoli, et al. ∙ Imperial College London

Online, off-policy reinforcement learning algorithms are able to use an experience memory to remember and replay past experiences. In prior work, this approach was used to stabilize training by breaking the temporal correlations of the updates and avoiding the rapid forgetting of possibly rare experiences. In this work, we propose a conceptually simple framework that uses an experience memory to help exploration by prioritizing the starting states from which the agent starts acting in the environment, importantly, in a fashion that is also compatible with on-policy algorithms. Given the capacity to restart the agent in states corresponding to its past observations, we achieve this objective by (i) enabling the agent to restart in states belonging to significant past experiences (e.g., near goals), and (ii) promoting faster coverage of the state space by starting from a more diverse set of states. With a good measure of priority for identifying significant past transitions, we expect case (i) to be especially helpful for exploration in certain problems (e.g., sparse reward tasks), while we hypothesize that case (ii) will generally be beneficial, even without any prioritization. We show empirically that our approach improves learning performance for both off-policy and on-policy deep reinforcement learning methods, with the most notable improvement in a task with a significantly sparse reward signal.

1 Introduction

Online reinforcement learning (RL) algorithms have demonstrated an impressive potential for tackling a wide range of complex tasks, with the majority of their success primarily being in simulated environments (Mnih et al., 2015; Lillicrap et al., 2016; Jaderberg et al., 2017; Silver et al., 2017). Scaling up RL algorithms to learn control policies for real practical systems (e.g., robotic manipulation), nevertheless, is often more difficult due to the sample inefficiency of these algorithms. While richer, realistic environments facilitate the transfer of learned policies to reality (Tan et al., 2018), they are accompanied by increased cost of simulation. To be able to explore and learn faster in such simulated environments is therefore an important step towards bringing the application of RL to real systems.

Experience replay (Lin, 1992) has recently gained popularity in off-policy deep RL algorithms, such as Deep Q-Networks (DQN) (Mnih et al., 2015), as a means to improve sample efficiency over their on-policy counterparts. On-policy algorithms are often sample inefficient because past transitions must be discarded soon, if not immediately, after they are experienced. Moreover, sample efficiency is also tied to the ability to explore efficiently in complex domains. Yet, developing a generic exploration method that can easily be adapted to any RL algorithm remains an open problem.

In this paper, we propose a conceptually simple and easily extendable framework that can, in principle, be applied to any existing on-policy or off-policy RL algorithm. Our approach is to prioritize over starting states from which an agent starts acting in the environment. By starting from significant regions that the agent has already encountered in its past experience, our approach can help improve exploration and, thus, sample efficiency in complex simulated domains.

Given the capacity to reset a simulator's state to those corresponding to the agent's past observations, we draw inspiration from the idea of a restart distribution (Kakade and Langford, 2002) and propose a practical procedure for creating and adapting such a distribution based on the agent's past experiences. We maintain a restart distribution through a buffer from which an agent can draw starting states. By enabling the agent to restart from important regions of the environment, rather than from a fixed reset state or a randomly selected state from a designated set (as in most OpenAI Gym domains), we aim to explore faster, and we show that this approach is particularly effective in sparse reward tasks. We achieve this by prioritizing over the agent's past observations to identify important regions of the environment from which to restart the agent. This approach goes hand in hand with diversifying the starting states, which can also help exploration through faster coverage of the state space.

In this work, we present the following variants for prioritizing starting states; a minimal buffer interface covering all three is sketched after the list. In each variant, we consider maintaining a fixed proportion between sampling from our restart distribution (or states buffer) and from the environment's starting state distribution.

  • Our simplest variant is to store and sample states randomly from a uniform distribution, in the same fashion as experience replay was used to handle transitions in (Mnih et al., 2015). That is, we neither prioritize over the buffered states nor select which states are stored in memory. In this case, we hypothesize that any performance improvement should mainly be due to the diversification of the starting states, which in turn helps exploration through faster coverage of the state space.

  • We next consider sampling starting states from regions with high expected learning progress as measured by the temporal-difference (TD) error, similar to prioritized transitions in prioritized experience replay (Schaul et al., 2016).

  • Finally, we consider a prioritization that can be specifically adapted for challenging sparse reward domains, where we use an episodic memory that only stores states from trajectories that obtain higher returns than those previously experienced. This is similar in spirit to the criterion used by Oh et al. (2018) for selecting trajectories to imitate. We show that such a trajectory-based buffer can be crucial in achieving robust learning from even a single good experience, which is especially beneficial for on-policy methods that cannot straightforwardly adopt experience replay to reuse past transitions.

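As an illustration only, a minimal buffer interface covering the uniform and prioritized variants might look as follows; the class, method, and argument names are our own, and the trajectory-centric variant additionally requires trajectory-level bookkeeping (sketched in Section 5.1).

```python
import random
import numpy as np


class StartingStateBuffer:
    """Minimal sketch of a starting-state buffer (illustrative, not a reference implementation).

    Stores restorable simulator states encountered by the agent and supports
    either uniform or priority-proportional sampling of restart states.
    """

    def __init__(self, capacity=20000, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha          # priority exponent of the proportional scheme
        self.states = []            # restorable simulator snapshots
        self.priorities = []        # e.g., |TD error| or episodic return

    def __len__(self):
        return len(self.states)

    def add(self, sim_state, priority=1.0):
        if len(self.states) >= self.capacity:   # simple FIFO eviction
            self.states.pop(0)
            self.priorities.pop(0)
        self.states.append(sim_state)
        self.priorities.append(priority)

    def sample(self, prioritized=False):
        if not prioritized:
            return random.choice(self.states)   # uniform variant
        p = np.asarray(self.priorities, dtype=np.float64) ** self.alpha
        p /= p.sum()
        return self.states[np.random.choice(len(self.states), p=p)]
```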
We evaluate our proposed framework on a wide range of domains, and demonstrate faster learning due to exploration through diversifying and prioritizing starting states. We demonstrate improved performance on numerous domains consisting of standard continuous control benchmarks with dense rewards, a video game with sparse rewards, and a simulated robotic manipulation task with a significantly sparse reward signal. In our study, we always evaluate the agent’s performance based on the original starting state distribution, as we assume that to be the actual performance metric.

2 Related Work

The idea of directly influencing the distribution of starting states to learn good policies has drawn attention in the past. Kakade and Langford (2002) studied the notion of exploiting access to a generative model (Kearns et al., 2002) of the environment to allow training on a restart distribution (i.e., a fixed, proposal starting state distribution) different from that of the environment. If properly chosen, such a distribution is proven to improve learning performance on the original starting state distribution. Nevertheless, no practical procedure is given for choosing this new distribution, beyond the suggestion to use one that is more uniform over the state space. Also, enabling any such distribution assumes a priori knowledge of what constitutes a valid state. In our work, we provide a practical procedure for creating and adapting the starting state distribution over the course of training without such knowledge.

To improve learning of model-free RL algorithms, Popov et al. (2017) proposed to use expert demonstrations by modifying the starting state distribution to be uniform among the states visited by the provided trajectories. More recently, and concurrent to our work, Salimans and Chen (2018) reported achieving high levels of performance on the infamous Atari game of Montezuma’s Revenge by merely resetting a standard deep RL agent from manually designated starting states taken from a single expert demonstration. These works resemble our trajectory-centric states buffer, with the main difference being that, in our case, the agent progressively updates its best trajectories in the buffer and samples starts from them, thereby not relying on expert demonstrations, or manually-designated starts as is the case in the latter.

Recent work on curricula for RL presents a method for adaptive generation of curricula in the form of starting state distributions that start close to the goal state and gradually move away as the agent progresses (Florensa et al., 2017). This method considers a specific class of goal-oriented tasks with clear goal states and assumes a priori access to such states. Contrary to this work, our framework is generally not limited to domains with clear goal states and does not require any prior knowledge of the task. Nevertheless, a similar curriculum-like behavior could potentially emerge when using our approach with an appropriate priority measure, whereby a first encounter of a goal state would bias the future starting state distributions towards the goal.

Furthermore, in this work we provide an alternative perspective on how past experience can be harvested to assist learning, and remarkably, in a fashion that is compatible with both off-policy and on-policy learning algorithms. This is in contrast to the perspective of replaying past experience to improve the performance of RL agents (Lin, 1992; Mnih et al., 2015; Schaul et al., 2016), an approach that cannot straightforwardly be adopted by on-policy methods.

3 Background

3.1 Preliminaries

We consider the RL framework (Sutton and Barto, 1998) in which a learning agent interacts with a stochastic environment over a sequence of discrete time steps in the standard fashion: at each time step, the agent chooses an action based on its current state, to which the environment responds with a reward and the next state. We model the environment as a Markov decision process (MDP) which comprises: a state space $\mathcal{S}$, an action space $\mathcal{A}$, a starting state distribution with density $\rho_0(s_0)$, a state transition kernel $P(s_{t+1} \mid s_t, a_t)$, and an expected immediate reward function $r(s, a)$, defined for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$.

In general, the agent’s decision-making procedure is characterized by a stochastic policy $\pi(a \mid s)$. In the case of parameterized policies, such as those represented by artificial neural networks, we denote the policy by $\pi_{\theta}(a \mid s)$, where $\theta$ is the vector of policy parameters and where, typically, the number of parameters is far smaller than the number of states. The agent uses its policy to interact with the MDP to sample a trajectory $\tau = (s_0, a_0, r_1, s_1, \ldots, s_T)$, where $T$ is the trajectory’s horizon, which is in general a random variable. Throughout this paper we assume that $T$ is finite, and that terminations could occur either due to terminal states in episodic tasks (i.e., concrete episodes) or due to an arbitrary condition, such as timeouts, as could be the case for both continuing and episodic tasks (i.e., partial episodes).

3.2 Assumptions

Several of our discussions in this paper are considered under the more generic assumption of learning from partial episodes and, therefore, are only relevant to bootstrapping methods. This includes, for instance, any algorithm that uses TD learning, such as Sarsa, Q-learning, and most actor-critic methods. Nevertheless, the main proposition of this paper applies also to Monte Carlo (non-bootstrapping) methods, in which case the episodes are strictly concrete.

We assume access to the capacity to restart the agent in states corresponding to its past observations—generally the case given a natural and common type of simulator of the environment. As in (Kearns et al., 2002), our assumption is considerably weaker than having knowledge of the environment’s model. However, similar to (Kakade and Langford, 2002), it is a stronger assumption than having only irreversible experience, in which the agent must follow a single trajectory, with no ability to reset to obtain another trajectory from another state. Note that our assumption on reversibility is weaker than having knowledge of the MDP’s starting state distribution.

Lastly, we assume throughout this paper that the agent’s observations are not aliased. While such aliasing is generally problematic for RL agents, in our case it can further hurt performance of the agent under the original starting state distribution, as it can bias the agent’s policy towards behaviors that may be less suitable for the more frequently-occurring underlying true state.

4 Prioritized Starting States

In this work, assuming access to the capacity to restart the agent in states corresponding to its past observations (as stated in Section 3.2), we propose using a starting state buffer from which the agent can prioritize and draw states to start from in the next episode. We consider several variants of prioritization of starting states and present a generic and practical procedure for continual evolution of the starting state distribution for improved sample efficiency. Our main contribution is a flexible framework for gradually increasing the diversity of the starting states by storing the agent’s previously encountered states in a buffer and enabling prioritized sampling of starting states.

4.1 Motivation

In control problems it is known that even finding an optimal partial policy, i.e., a policy that is optimal for the relevant states but can specify arbitrary actions for the irrelevant ones (the states that are unreachable from any of the MDP’s designated start states under any optimal policy), using an on-policy trajectory-sampling control method in general requires exploring all state-action pairs an ‘infinite’ number of times (Sutton and Barto, 1998). Similarly, a known problem with policy gradient methods is that the original performance metric, which is what we ultimately seek to optimize, is insensitive to policy improvement at unlikely states, despite the fact that policy improvement at these unlikely states might be necessary for the agent to achieve near-optimal performance (Kakade and Langford, 2002). For such cases, the diversification of starting states can indeed help better explore the state space beyond the originally reachable states.

In environments with sparse rewards, where the odds of stumbling upon an informative experience could be significantly low, it is critical for the agent to be able to maximally utilize its good experiences. Using the proposed approach, such trajectories can be recorded and used for sampling starting states, effectively increasing the chances of experiencing success. Moreover, a curriculum of starting states can be generated in this way from past experience by prioritizing the stored states. Employing the past experiences in this manner is especially important in on-policy learning methods which cannot straightforwardly use experience replay and which currently discard all recent experiences immediately after performing a single or multiple iterative updates.

4.2 Role of starting states in the performance objective

In this section, we concern ourselves with the following question: “How does modifying the starting state distribution affect the original performance metric and, ultimately, the learned policy?” To answer this, we consider the cases of tabular and approximate solution methods separately. In tabular methods, the learned values at each state are decoupled from one another, so that an update at one state affects no other. Let us now consider the control problem in which the agent’s goal is to maximize its value from the set of the environment’s start states. As per the principle of optimality, a policy achieves the optimal value from a state $s$ if and only if it achieves the optimal value from every state reachable from $s$. Therefore, by letting the agent also start in states outside the designated set, we are, in principle, guaranteed to better optimize for the original set through better optimizing for the states that are reachable from it.

On the contrary, with approximation, an update at one state could affect many others since, by assumption, we have far more states than weights. Thus, making one state’s estimate more accurate often means making others’ less accurate (Sutton and Barto, 1998). Now let us consider, for instance, the prediction problem of approximating the action values of a given policy $\pi$ using a common loss function:

$$\mathcal{L}(\theta) \;=\; \sum_{s \in \mathcal{S}} d_{\pi}(s) \sum_{a \in \mathcal{A}} \pi(a \mid s) \big[ q_{\pi}(s, a) - \hat{q}_{\theta}(s, a) \big]^{2}.$$

As shown, the overall loss is weighted according to the (discounted) state distribution $d_{\pi}$, which in episodic tasks depends on the policy as well as the starting state distribution. In effect, the approximation of the value function becomes more accurate at states that have higher density. The same rationale holds for approximate control methods, such as policy gradient methods and DQN. We established that, for approximate methods, changing the starting state distribution to a more diverse one, as in our case, could indeed bias the objective function. But does it result in different optimal policies? Unfortunately, the policy that maximizes our new objective within some restricted class of policies may have poor performance according to the original objective. By using a parameterization that adequately accommodates the domain’s underlying complexity, we may presume that an optimal policy for the modified objective with more diverse start states maximizes the original objective as well. While this assumption may seem impractical for larger problems, considering the relative simplicity of the current domains of interest w.r.t. the common high-capacity parametric representations (Li et al., 2018), it is often admissible (see, e.g., Florensa et al. (2017)).
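In illustrative notation, the modified objective can be written as a convex combination of the original objective and a buffer-start objective, where $\beta$ denotes the fraction of episodes started from buffered states (10% or 20% in our experiments) and $\mu_{\mathcal{B}}$ the sampling distribution over the buffer:

```latex
\eta_{\mathrm{mix}}(\pi_{\theta})
  \;=\; (1-\beta)\, \mathbb{E}_{s_0 \sim \rho_0}\!\left[ V^{\pi_{\theta}}(s_0) \right]
  \;+\; \beta\, \mathbb{E}_{s_0 \sim \mu_{\mathcal{B}}}\!\left[ V^{\pi_{\theta}}(s_0) \right].
```

Setting $\beta = 0$ recovers the original performance metric; the assumption above is that, with a sufficiently expressive parameterization, a maximizer of $\eta_{\mathrm{mix}}$ also maximizes the $\beta = 0$ objective.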

4.3 Methods

For the set of buffered states $\mathcal{B}_t$ at training step $t$ and the set of the environment’s starting states $\mathcal{S}_0$, where typically $|\mathcal{S}_0| \ll |\mathcal{B}_t|$, we enable sampling starting states from the increasingly more diverse set $\mathcal{S}_0 \cup \mathcal{B}_t$ instead of the conventional approach of sampling from the fixed, original set $\mathcal{S}_0$. Assuming no prior knowledge of the environment’s model, the original set of starting states $\mathcal{S}_0$ and the corresponding density $\rho_0$ are unknown. Nevertheless, as we assume access to a generative model of the environment, we can sample on demand from the original starting state distribution. By maintaining a balance between sampling starting states from a distribution over $\mathcal{B}_t$ and from the original distribution over $\mathcal{S}_0$, we can further diversify the starting states by effectively sampling from a new distribution that encompasses both $\mathcal{S}_0$ and $\mathcal{B}_t$. We achieve this by sustaining a fixed ratio between the experienced states originating from the environment’s starting states and those originating from the buffered states. This ratio is, therefore, a hyperparameter of our approach. In general, we believe this ratio should favor more experiences from the environment’s starting states, to let the agent focus the majority of its estimator’s resources on optimizing for the original performance metric. In our experiments, for instance, we only let 10% or 20% of experiences stem from the buffered states.
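A minimal sketch of this episode-start decision is given below; the function name and the simulator calls `set_state` and `reset` are assumptions rather than any specific library API, and the buffer object follows the interface sketched in the introduction.

```python
def choose_start(env, buffer, buffered_steps, total_steps, target_ratio=0.1):
    """Decide whether the next episode starts from a buffered state or from the
    environment's own starting state distribution, keeping roughly a
    `target_ratio` fraction of experienced steps originating from buffered starts."""
    use_buffer = len(buffer) > 0 and buffered_steps < target_ratio * max(total_steps, 1)
    if use_buffer:
        env.set_state(buffer.sample())   # restore a past simulator state (assumed API)
    else:
        env.reset()                      # sample from the original start distribution
    return use_buffer
```

After each episode, `total_steps` is incremented by the number of steps taken, and `buffered_steps` is incremented only when the episode originated from the buffer.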

The distribution over the buffered states could simply be uniform or, alternatively, it could be biased towards important regions as identified by an appropriate priority measure. We experiment both with uniform sampling from the buffer and with prioritization based on the TD error of the state-value estimates (van Seijen and Sutton, 2013; Schaul et al., 2016). When using TD-error prioritization, we calculate the probability of sampling a state in the same way as Schaul et al. (2016) calculate the probability of sampling transitions, using their proportional prioritization scheme. It is important to note that prioritization of starting states does not introduce learning bias in the way that prioritization does in the context of experience replay. For experience replay, biasing the sampling of buffered transitions directly alters the perceived state transition and reward distributions in a stochastic environment. This is not the case for our approach: biasing the choice of starting state distribution does not change the perceived transition probabilities of the MDP, as every experienced transition is still sampled directly from the environment rather than from a buffer.
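A sketch of the proportional scheme applied to buffered states follows; `alpha` and `epsilon` are the usual proportional-prioritization constants (the values here are placeholders), and the priorities are the magnitudes of the state-value TD errors.

```python
import numpy as np

def sample_td_prioritized_start(states, td_errors, alpha=0.6, epsilon=1e-6):
    """Sample a buffered starting state with probability proportional to
    (|TD error| + epsilon)^alpha, mirroring proportional prioritized replay."""
    priorities = (np.abs(np.asarray(td_errors)) + epsilon) ** alpha
    probs = priorities / priorities.sum()
    idx = np.random.choice(len(states), p=probs)
    return states[idx]
```

Unlike prioritized experience replay, no importance-sampling correction is applied to the updates here, since every transition used for learning is still generated by interacting with the environment.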

Additionally, we experiment with a trajectory-centric prioritization scheme, where states are stored and prioritized on the basis of their episodic returns. In the simplest form (i.e., what we consider) all states belonging to the same trajectory are prioritized similarly. Refer to Section 5.1 for more details.

We can further diversify the agent’s experience by prematurely terminating the trajectories that originate from the buffered states after a short, fixed horizon. Doing so is possible for bootstrapping methods via partial-episode bootstrapping (PEB) (Pardo et al., 2018), where the agent continues to bootstrap from such early terminations, allowing it to learn long-term policies from short partial episodes. In our experiments in Section 5.3, we use a short, fixed horizon of 10 steps for interactions that originate from buffered starts and compare its performance against having full-length interactions (i.e., using the environment’s time limit of 1000 steps).
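A sketch of such a truncated rollout is shown below; `value_fn` is the critic’s state-value estimate, and `set_state`/`observe` are assumed simulator calls. The key point is that a cut made purely because of the short horizon is bootstrapped from, rather than treated as a terminal state.

```python
def rollout_from_buffered_start(env, policy, value_fn, start_state,
                                horizon=10, gamma=0.99):
    """Short partial episode from a buffered start with partial-episode
    bootstrapping: bootstrap from V(s_T) when the cut is not a true terminal."""
    env.set_state(start_state)            # assumed simulator-restore call
    obs = env.observe()                   # assumed call returning the current observation
    rewards, done = [], False
    for _ in range(horizon):
        obs, reward, done, _ = env.step(policy(obs))
        rewards.append(reward)
        if done:                          # true environmental termination
            break
    returns, g = [], (0.0 if done else value_fn(obs))
    for r in reversed(rewards):           # discounted returns with bootstrap
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))
```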

5 Experiments

In this section, we empirically demonstrate the following using our proposed method of prioritized starting states:

  • We show that, using a trajectory-centric storage and prioritization scheme, our proposed framework can be effective in domains characterized with sparse rewards. We illustrate this on a MuJoCo (Todorov et al., 2012) control task from the OpenAI Gym collection (Brockman et al., 2016), one that is characterized with a significantly sparse reward signal.

  • Prioritization of starting states can be easily adopted in any on-policy or off-policy deep RL algorithm, by only maintaining a buffer of starting states. We demonstrate this empirically using a popular on-policy algorithm, Proximal Policy Optimization (PPO) (Schulman et al., 2017), on a set of MuJoCo-based continuous control benchmarking domains (Brockman et al., 2016; Duan et al., 2016) with dense rewards, and using the off-policy Deep Q-Network (DQN) algorithm (Mnih et al., 2015) on the Super Mario All-Stars video game (provided by OpenAI’s Gym Retro (Pfau et al., 2018)).

5.1 Sparse Reward MuJoCo Control Task

We first demonstrate our proposed method in a modified, sparse reward Thrower domain from the OpenAI’s Gym MuJoCo collection. This task is particularly challenging as the agent only gets a reward of 1 for successfully throwing the ball in the goal region, and 0 otherwise. The domain is depicted in Figure 1 (left).

We propose using a trajectory-centric starting state buffer to solve such sparse reward tasks. In particular, we configure our buffer to only add states belonging to the trajectories that obtain the highest sums of rewards (the top 50 trajectories in our case). When such trajectories originate from previously buffered starts, we also concatenate the original portion of the stored trajectory preceding the buffer-sampled starting state with the new trajectory based on that state, thus preventing a progressive bias towards the ends of the trajectories. In other words, we store trajectories with appropriate augmentation so that they effectively correspond to full trajectories from the environment’s starting states. We run each episode originating from the buffered states for the full length (i.e., until an environment termination or the default time limit of 100 steps). We sample starting states by first sampling uniformly from the set of stored trajectories, and then sampling uniformly from the sampled trajectory’s visited states.
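The following sketch captures this trajectory-centric buffer (class and method names are our own; `sim_states` are restorable simulator snapshots along a trajectory): keep the top-k trajectories by episodic return, prepend the stored prefix when an episode began from a buffered start, and sample starts uniformly over trajectories and then over states.

```python
import heapq
import random


class TrajectoryStartBuffer:
    """Sketch of a trajectory-centric starting-state buffer."""

    def __init__(self, k=50):
        self.k = k
        self.entries = []        # min-heap of (return, counter, [sim_states])
        self._counter = 0        # tie-breaker so heapq never compares state lists

    def add(self, sim_states, episodic_return, prefix=None):
        # If the episode began from a buffered start, prepend the stored prefix
        # so the entry corresponds to a full trajectory from an original start.
        full = list(prefix or []) + list(sim_states)
        item = (episodic_return, self._counter, full)
        self._counter += 1
        if len(self.entries) < self.k:
            heapq.heappush(self.entries, item)
        elif episodic_return > self.entries[0][0]:
            heapq.heapreplace(self.entries, item)   # evict the lowest-return trajectory

    def sample(self):
        _, _, states = random.choice(self.entries)  # uniform over stored trajectories
        return random.choice(states)                # uniform over that trajectory's states
```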

We evaluate our proposed method as applied to the PPO algorithm. In each rollout, to collect on-policy data for the updates, we alternate between sampling starting states from the past successful trajectories in our trajectory-centric states buffer and from the environment’s starting state distribution. We find that this approach enables PPO to learn robustly across 16 unique seeds, whereas the original PPO completely fails to learn the task in 50% of the experiments. Interestingly, while learning more slowly for 8 of the seeds, our method consistently achieves the same final performance across all of the runs. In this significantly sparse reward MuJoCo domain, the odds of stumbling upon a successful experience are so low that, across 190 trials with unique seeding, only 16 seeds ever managed to hit the goal during exploration. Experimental results are shown in Figure 1 (right).

Figure 1: Performance comparison of PPO with and without our trajectory-centric starting state buffer, which is specifically suited for tasks with (significantly) sparse rewards. The graphs are averaged over 16 seeds. While the standard PPO fails to generalize from the significantly sparse experience of throwing the ball into the goal region (the green area in the figure on the left) and fails entirely on 50% of the runs, with our approach it achieves a 100% success rate.

5.2 Super Mario All-Stars Video Game

We then evaluate the performance of our proposed method on a video game using the existing off-policy DQN algorithm. We train the agent on the first level of the Super Mario All-Stars game, which has somewhat sparse rewards: only the collection of coins or consumables, the destruction of blocks or enemies, and arrival at the flag are rewarded. We also restrict the agent’s available action set so as to prevent it from traversing the pipes. The environment terminates upon reaching a time limit, the end-of-level flag, or an out-of-bounds condition triggered by falling down a chasm.

Implementing our starting state buffer for this environment reveals a practical limitation of the approach: a full simulation state save has a significantly larger memory footprint than the corresponding agent observation that would be stored in the experience replay buffer. However, it is possible to record states into the buffer sparsely, both to conserve available memory and to extend the buffer’s horizon past the DQN’s experience replay memory limit.

For our experiments, we used a fixed-capacity prioritized experience replay (Schaul et al., 2016) and a starting state buffer of 5000 states, recording candidate starting states every 500 steps (effectively giving the starting state buffer a horizon of 2.5M transitions in the environment). At the beginning of every episode, the choice to use the buffer is made so as to keep the number of states experienced in episodes originating from buffered states at roughly 20% of the total number of visited states so far.
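A sketch of this sparse recording follows (the names and the `env.get_state()` snapshot call are assumptions); with a capacity of 5000 and a recording interval of 500 steps, the buffer spans up to 2.5M environment steps.

```python
import random
from collections import deque


class SparseStateRecorder:
    """Sketch of sparse starting-state recording: keep a full simulator
    snapshot only every `record_every` environment steps, in a bounded buffer."""

    def __init__(self, capacity=5000, record_every=500):
        self.buffer = deque(maxlen=capacity)   # oldest snapshots are evicted first
        self.record_every = record_every
        self.step_count = 0

    def on_step(self, env):
        self.step_count += 1
        if self.step_count % self.record_every == 0:
            self.buffer.append(env.get_state())   # large restorable snapshot (assumed API)

    def sample(self):
        return random.choice(self.buffer)
```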

Figure 2 (right) shows the training performance comparison between the baseline DQN and one using our starting state buffer. From our experimental results, DQN using our starting state buffer achieves approximately 10% better performance than the baseline. While due to limited computational resources we did not experiment with prioritization of the buffered starts, we believe that using an appropriate priority measure (e.g., a proxy for curiosity) could significantly improve performance.

Figure 2: Performance evaluation of applying our starting state buffer to DQN on the Super Mario All-Stars video game. Here we only consider the uniform-sampling variant of our buffer, using a ratio of 20%, and with a buffer capacity of 5000 states. We add a new state to our starting state buffer every 500 steps. The results are averaged over 4 seeds.

5.3 Continuous Control Benchmarks

We finally evaluate the performance of our method using PPO on a representative set of standard continuous control benchmarks from the Gym MuJoCo collection. These domains are characterized by highly dense, shaped reward signals on which PPO generally performs well. We chose PPO here as the state-of-the-art, on-policy algorithm for continuous control tasks. We used the OpenAI Baselines (Hesse et al., 2017) implementation of PPO, mainly with the hyperparameters reported by Schulman et al. (2017). We run PPO with a lower learning rate than that reported in (Schulman et al., 2017), which appeared to perform more stably across the tasks and achieved much better performance on the Humanoid-v2 domain. Additionally, we use PPO with partial-episode bootstrapping (Pardo et al., 2018) as our baseline. This is because the time limits in these benchmarks are not environmental terminations and, thus, a bootstrapping agent such as PPO should continue to bootstrap from such terminations.

For all variants of our approach in this experiment, we used a ratio of 10% and a buffer size of 20K states. For prioritizing the buffered states, we adopted the same procedure as the proportional variant of prioritized experience replay, albeit without the importance sampling correction, and with the corresponding prioritization hyperparameters. When the horizon is not explicitly indicated, the agent uses the environment’s default episode length of 1000 steps for interactions originating from the buffered starts.

Figure 3: Performance evaluations of applying our starting state buffer to PPO. We compare uniform sampling (blue graphs) against the TD-error prioritized variant of our approach (TD-SSB). We also compare our approach using a short horizon for interactions originating from the buffered starts (dashed graphs) against using the environment’s time limit of 1000 steps (solid graphs). For all the experiments here, we use a ratio of 10% and a buffer size of 20K states. The performances are averaged over 5 seeds.

Figure 3 shows the results of our evaluations. Our experimental results show that sampling starting states from the agent’s past experiences can in many cases offer improvements, either in terms of sample efficiency or final performance, and in some cases both. While we distinguish between these criteria here, we note that in several cases (such as Walker2d) the performances do not plateau over the course of training, so final performance cannot be compared reliably and sample efficiency is perhaps the more appropriate criterion.

In general, the performances of our variants are better than or on par with the standard PPO. On the Humanoid domain, which is the most complex among those considered here, using the starting state buffer offers the most significant improvements. Notably, the variants with the short horizon (dashed graphs) achieve the best performance across three tasks: Reacher, HalfCheetah, and Humanoid. It is noteworthy that this variant of our approach effectively experiences the highest diversification of starting states, as our fixed ratio hyperparameter sustains a fixed proportion between the states experienced from the environment’s starting states and those originating from our buffered starts. While using a shorter horizon of 10 steps seems to significantly improve performance in some instances, the differences between our TD-prioritized and uniform buffers are neither consistent nor significant across the tasks considered here.

6 Conclusion

In this work, we proposed a framework of prioritized starting states from which an agent can start acting in the environment. We achieve this by maintaining a buffer of the agent’s previously encountered states from which we enable prioritized sampling of starting states. Our proposed framework can be easily adopted by any existing off-policy or on-policy RL algorithm. We applied our method to two popular RL algorithms, the on-policy PPO and the off-policy DQN, and showed empirically that different variants of our approach can effectively improve performance in almost all domains. Furthermore, we demonstrated how our approach can be used to learn robustly in problems that involve sparse rewards, where a single informative trajectory can be of vital importance to the learning progress; this is especially significant for on-policy algorithms, which cannot straightforwardly adopt experience replay.

In future work, we aim to explore other prioritization signals which may be better suited for prioritizing starting states. We believe that it would be interesting to combine our framework with existing methods for intrinsic motivation (Schmidhuber, 2010; Ostrovski et al., 2017; Bellemare et al., 2016) and curiosity-based exploration (Pathak et al., 2017) to further help in sparse reward tasks.

Acknowledgments

We thank Harm van Seijen for helpful suggestions and comments on the paper, and Fabio Pardo for letting us use his flexible experimental setup. The research presented in this paper has been supported by EPSRC, Samsung, and computation resources provided by Microsoft via the Microsoft Azure for Research Award program.

References

  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Lillicrap et al. (2016) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
  • Jaderberg et al. (2017) Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In International Conference on Learning Representations, 2017.
  • Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
  • Tan et al. (2018) Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez, and Vincent Vanhoucke. Sim-to-real: learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332, 2018.
  • Lin (1992) Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3):293–321, 1992.
  • Kakade and Langford (2002) Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, volume 2, pages 267–274, 2002.
  • Schaul et al. (2016) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In International Conference on Learning Representations, 2016.
  • Oh et al. (2018) Junhyuk Oh, Yijie Guo, Satinder Singh, and Honglak Lee. Self-imitation learning. In International Conference on Machine Learning, volume 80, pages 3878–3887, 2018.
  • Kearns et al. (2002) Michael Kearns, Yishay Mansour, and Andrew Y Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49(2-3):193–208, 2002.
  • Popov et al. (2017) Ivaylo Popov, Nicolas Heess, Timothy Lillicrap, Roland Hafner, Gabriel Barth-Maron, Matej Vecerik, Thomas Lampe, Yuval Tassa, Tom Erez, and Martin Riedmiller. Data-efficient deep reinforcement learning for dexterous manipulation. arXiv preprint arXiv:1704.03073, 2017.
  • Salimans and Chen (2018) Tim Salimans and Richard Chen. Learning Montezuma’s Revenge from a single demonstration. https://blog.openai.com/learning-montezumas-revenge-from-a-single-demonstration, 2018.
  • Florensa et al. (2017) Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. In Conference on Robot Learning, volume 78, pages 482–495, 2017.
  • Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. Reinforcement Learning: an Introduction. MIT Press, 1998.
  • Li et al. (2018) Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. In International Conference on Learning Representations, 2018.
  • van Seijen and Sutton (2013) Harm van Seijen and Richard Sutton. Planning by prioritized sweeping with small backups. In International Conference on Machine Learning, volume 28, pages 361–369, 2013.
  • Pardo et al. (2018) Fabio Pardo, Arash Tavakoli, Vitaly Levdik, and Petar Kormushev. Time limits in reinforcement learning. In International Conference on Machine Learning, volume 80, pages 4045–4054, 2018.
  • Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: a physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Duan et al. (2016) Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
  • Pfau et al. (2018) Vicki Pfau, Alex Nichol, Christopher Hesse, Larissa Schiavo, John Schulman, and Oleg Klimov. OpenAI Gym Retro. https://github.com/openai/retro, 2018.
  • Hesse et al. (2017) Christopher Hesse, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. https://github.com/openai/baselines, 2017.
  • Schmidhuber (2010) Jürgen Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
  • Ostrovski et al. (2017) Georg Ostrovski, Marc G. Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with neural density models. In International Conference on Machine Learning, volume 70, pages 2721–2730, 2017.
  • Bellemare et al. (2016) Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
  • Pathak et al. (2017) Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, volume 70, pages 2778–2787, 2017.