Exploration in Deep Reinforcement Learning: A Survey

by   Pawel Ladosz, et al.

This paper reviews exploration techniques in deep reinforcement learning. Exploration techniques are of primary importance when solving sparse reward problems. In sparse reward problems, the reward is rare, which means that the agent will not often find the reward by acting randomly. In such a scenario, it is challenging for reinforcement learning to learn the association between actions and rewards, so more sophisticated exploration methods need to be devised. This review provides a comprehensive overview of existing exploration approaches, which are categorized based on their key contribution as follows: reward novel states, reward diverse behaviours, goal-based methods, probabilistic methods, imitation-based methods, safe exploration and random-based methods. Then, the unsolved challenges are discussed to provide valuable future research directions. Finally, the approaches of the different categories are compared in terms of complexity, computational effort and overall performance.





1 Introduction

In numerous real-world problems, the outcomes of a certain event are only visible after a significant number of other events have occurred. These types of problems are called sparse reward problems, since the reward is rare and lacks a clear link to previous actions. We note that sparse reward problems are common in the real world. For example, during search and rescue missions, the reward is only given when an object is found; during delivery, the reward is only given when an object is delivered. In sparse reward problems, thousands of decisions might need to be made before the outcomes are visible. Here, we present a review of a group of techniques that can solve this issue, namely exploration in reinforcement learning.

In reinforcement learning, an agent is given a state and a reward from the environment. The task of the agent is to determine an appropriate action. In reinforcement learning, the appropriate action is such that it maximises the reward, or it could be said that the action is exploitative. However, solving problems with just exploitation may not be feasible owing to reward sparseness. With reward sparseness, the agent is unlikely to find a reward quickly, and thus, it has nothing to exploit. Thus, an exploration algorithm is required to solve sparse reward problems.

The most common technique for exploration in reinforcement learning is random exploration Sutton and Barto (2018). In this type of approach, the agent decides what to do randomly, regardless of its progress. The most commonly used technique of this type, called $\epsilon$-greedy, uses a time-decaying parameter $\epsilon$ to reduce exploration over time. This can theoretically solve the sparse reward problem given a sufficient amount of time, but it is often impractical in real-world applications because learning times can be very long. However, we note that even with just random exploration, deep reinforcement learning has shown impressive performance in Atari games Mnih et al. (2015), the MuJoCo simulator Lillicrap et al. (2016), controller tuning Lee and Bang (2020), autonomous landing Polvara et al. (2017), self-driving cars Kiran et al. (2021) and healthcare Yu et al. (2019).
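As a minimal sketch, an $\epsilon$-greedy strategy with a time-decaying parameter can be written as follows; the linear decay schedule and its constants are illustrative assumptions, not taken from any particular paper:

```python
import random

def epsilon_by_step(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly decay epsilon from eps_start down to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, step, rng=random):
    """With probability epsilon take a random action; otherwise act greedily."""
    if rng.random() < epsilon_by_step(step):
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Early in training the agent acts almost entirely at random; as the step count grows, it increasingly exploits its current value estimates.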

Another solution for exploration could be reward shaping. In reward shaping, the designer ’artificially’ imposes a reward more often. For example, for search and rescue missions, agents can be given a negative reward every time they do not find the victim. However, reward shaping is a challenging problem that is heavily dependent on the experience of the designer. Punishing the agent too much could lead to the agent not moving at all Irpan (2018), while rewarding it too much may cause the agent to repeat certain actions infinitely Clark and Amodei (2016). Thus, with the issues of random exploration and reward shaping, there is a need for more sophisticated exploration algorithms.

While exploration in reinforcement learning was considered as early as 1991 Schmidhuber (1991b, a), it is still under development. Recently, exploration methods have shown significant performance gains compared to non-exploratory algorithms: Diversity Is All You Need (DIAYN) Eysenbach et al. (2019) improved on MuJoCo benchmarks; random network distillation (RND) Burda et al. (2018) and pseudocounts Bellemare et al. (2016) were the first to score on the difficult Montezuma's Revenge problem; and Agent57 Badia et al. (2020a) was the first to beat humans in all 57 Atari games.

This review focuses on exploratory approaches that fulfil at least one of the following criteria: (i) the exploration degree is determined based on the agent's learning, (ii) the agent actively decides to take certain actions in the hope of finding new outcomes, or (iii) the agent motivates itself to continue exploring despite a lack of environmental rewards. In addition, this review focuses on approaches that have been applied to deep reinforcement learning. Note that this review is intended for beginners in exploration for deep reinforcement learning; thus, the focus is on breadth of approaches and their relatively simplified description. Note also that, throughout the paper, we use 'reinforcement learning' rather than 'deep reinforcement learning', as it is the more general term.

Several review articles exist in the field of reinforcement learning. Aubret et al. (2019) presented an overview of intrinsic motivation in reinforcement learning, Li (2018) presented a comprehensive overview of techniques and applications, Nguyen et al. (2018) considered applications to multi-agent problems, Levine (2018) provided a tutorial and extensive comparison with probabilistic inference methods, and Lazaridis et al. (2020) provided an extensive description of key breakthrough methods in reinforcement learning, including ones in exploration. However, none of the aforementioned reviews focused on exploration or considered it in great detail. The only other review focused on exploration is from 1999 and is now outdated and inaccurate Mcfarlane (1999).

The contributions of this study are as follows. First, a systematic overview of exploration in deep reinforcement learning is presented; as mentioned above, no other modern review exists with this focus. Second, a categorization of exploration in reinforcement learning is provided, devised to offer a good way of comparing different approaches. Finally, future challenges are identified and discussed.

2 Preliminaries

2.1 Introduction to Reinforcement Learning

2.1.1 Markov Decision Process

We consider a standard reinforcement learning setting in which an agent interacts with a stochastic and fully observable environment by sequentially choosing actions at discrete time steps to maximise the cumulative reward. This process is called a Markov decision process (MDP). An MDP is a tuple $(S, A, P, R, \gamma)$, where $S$ is a set of states, $A$ is a set of actions the agent can select, and $P$ is a transition probability that satisfies the Markov property given as:

$$P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t)$$

$R$ is a set of rewards, and $\gamma \in [0, 1)$ is a discount factor. At each time step $t$, the agent receives a state $s_t$ from the environment and selects an action according to its policy $\pi$, which maps states $s_t$ to actions $a_t$. The agent then receives a reward $r_t$ from the environment for taking action $a_t$. The goal of the agent is to maximise the discounted expected return $\mathbb{E}[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k}]$ from each state $s_t$.
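The discounted return being maximised can be illustrated with a simple backward recursion over a finite reward sequence (a minimal sketch; the discount value is an arbitrary illustrative choice):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_k gamma^k * r_k by folding from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, with $\gamma = 0.5$, a reward of 1 arriving two steps in the future contributes only $0.25$ to the return, which is why distant sparse rewards are hard to learn from.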

2.1.2 Value-Based Methods

Given that the agent follows policy $\pi$, the state-value function is defined as $V^{\pi}(s) = \mathbb{E}_{\pi}[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \mid s_t = s]$. Similarly, the action-value function $Q^{\pi}(s, a)$ is the expected return for a given state $s$ when taking action $a$. Q-learning is a typical off-policy method that updates a target policy using samples generated by any stochastic behaviour policy in the environment. Following the Bellman equation and temporal difference (TD) learning for the action-value function, the Q-learning algorithm is recursively updated using the following equation:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

where $\max_{a} Q(s_{t+1}, a)$ follows the (greedy) target policy and $\alpha$ is the learning rate. While Q-learning is being updated, the next actions are sampled from the behaviour policy, which follows an $\epsilon$-greedy exploration strategy, and among them the action yielding the largest Q-value is selected.
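A tabular sketch of this update rule (the state and action labels and the values of $\alpha$ and $\gamma$ below are illustrative):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One TD Q-learning step:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]
```

`Q` is a `defaultdict(float)` mapping `(state, action)` pairs to values, so unseen entries start at zero.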

2.1.3 Policy-Based Methods

In contrast to value-based methods, policy-based methods directly update a policy $\pi_{\theta}$ parameterized by $\theta$. Because the goal in reinforcement learning is to maximise the expected return over states, the objective function for the policy is defined as $J(\theta) = \mathbb{E}_{\pi_{\theta}}[\sum_{t} \gamma^{t} r_t]$. Williams (1992) suggested the REINFORCE algorithm, which updates the policy network by taking a gradient ascent step in the direction of $\nabla_{\theta} J(\theta)$. The gradient of the objective function is expressed as:

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s) \, Q^{\pi}(s, a) \right]$$

where $\rho^{\pi}$ denotes the state distribution. A general overview of reinforcement learning can be found in Arulkumaran et al. (2017).
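For a softmax policy over discrete actions, the score function $\nabla_{\theta} \log \pi_{\theta}(a)$ has the closed form $\mathrm{one\_hot}(a) - \pi$. This sketch computes the REINFORCE gradient for one sampled action and return $G$; the softmax parameterization is an assumption made here for illustration:

```python
import math

def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_grad(prefs, action, G):
    """REINFORCE gradient for a softmax policy: G * (one_hot(action) - pi)."""
    pi = softmax(prefs)
    return [G * ((1.0 if a == action else 0.0) - p) for a, p in enumerate(pi)]
```

Ascending this gradient raises the preference of the sampled action in proportion to the return it earned, at the expense of the alternatives.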

2.2 Exploration

Exploration can be defined as "the activity of searching and finding out about something" [32]. In the context of reinforcement learning, "something" is a reward function, and "searching and finding" is the agent's attempt to maximise that reward function. Exploration in reinforcement learning is of particular importance because reward functions are often complex and agents are expected to improve over their lifetime. Exploration can take various forms, such as randomly taking certain actions and observing the outcome, following the best known solution, or actively considering moves that are good for novel discoveries.

Problems that can be solved by exploration are common in nature. We note that exploration is most useful in problems in which the route to the actual solution (i.e. the reward) is obstructed by local minima (maxima) or areas of flat reward; under these conditions, discovering the true nature of the reward is challenging. The following examples are intuitive illustrations of such problems: (i) search and rescue: the agent needs to explore to find a target (victim) and is rewarded only when it finds the victim; otherwise, the reward is 0; and (ii) delivery: the agent tries to deliver an object in an unknown area and is rewarded only when the appropriate drop-off point has been found; otherwise, the reward is 0. Exploration can thus be considered a ubiquitous problem that is highly relevant to many domains with ongoing research.

2.3 Challenging Problems

In this section, some of the challenging problems for exploration in reinforcement learning are described, namely noisy-TV and sparse reward problems.

2.3.1 Noisy-TV

In the noisy-TV problem Burda et al. (2018), the agent gets stuck exploring an infinite number of states that lead to no reward. The phenomenon is easily explained with an example: imagine a state containing a virtual TV that the agent can operate with a remote control, but where operating the remote yields no reward. A new random image is generated on the TV every time the remote is operated, so the agent experiences novelty all the time. This keeps the agent's attention indefinitely but clearly leads to no meaningful progress. This kind of behaviour is also described as the couch-potato problem.

2.3.2 Sparse Reward Problems

Sparse rewards are a classical problem in exploration. In the sparse reward problem, the reward is relatively rare. In other words, there is a long gap between an action and a reward. This is problematic for reinforcement learning because for a long time (or at all times) it has no reward to learn from. The agent cannot learn any useful behaviours and eventually converges to a trivial solution. As an example, consider a maze where the agent has to complete numerous steps before reaching the end and being rewarded. The larger the maze is, the less likely it is for the agent to see the reward. Eventually, the maze will be so large that the agent will never see the reward; thus, it will have no opportunity to learn.

2.4 Benchmarks

In this section, the most commonly used benchmarks for reinforcement learning are briefly introduced and described. We highlight four benchmarks: Atari games, VizDoom, Malmo (Minecraft), and MuJoCo.

2.4.1 Atari Games

The Atari games benchmark is a set of 57 Atari games combined under the Arcade Learning Environment (ALE) Bellemare et al. (2013). In Atari games, the state space is normally either images or random-access memory (RAM) snapshots. The action space consists of five joystick actions (up, down, left, right, and action button). Atari games can be largely split into two groups: easy exploration (54 games) and difficult exploration (3 games) Aytar et al. (2018). In easy exploration problems, the reward is relatively easy to find. In difficult exploration problems, the reward is given infrequently, and the association between states and rewards is complex.

2.4.2 VizDoom

VizDoom Kempka et al. (2016) is a benchmark based on the Doom game. The game has a first-person perspective (i.e., view from characters’ eyes), and the image seen by the character is normally used as a state space. The action space is normally eight directional control and two action buttons (picking up key cards and opening doors). Note that more actions can be added, if needed. One of the key advantages of VizDoom is the availability of easy-to-use tools for editing scenarios and low computational burden.

2.4.3 Malmo

Malmo Johnson et al. (2016) is a benchmark based on the game Minecraft. In Minecraft, environments are built using same-shaped blocks, similar to how Lego bricks are used for building. Similar to VizDoom, it is also from the first-person perspective, and the image is the state space. The key advantage of Malmo is its flexibility in terms of the environment structure, domain size, custom scripts, and reward functions.

2.4.4 Mujoco

MuJoCo (Multi-Joint dynamics with Contact) Todorov et al. (2012) is a popular benchmark for physics-based simulations. In reinforcement learning, MuJoCo is typically used to simulate walking robots, usually cheetah, ant, and humanoid models and their derivatives. The task of reinforcement learning is to control the various joint angles and forces to develop walking behaviour; normally, the goal is to walk as far as possible or to reach a specific target.

3 Exploration in Reinforcement Learning

Figure 1: Overview of exploration in reinforcement learning.

Exploration in reinforcement learning can be split into two main streams: efficiency and safe exploration. In efficiency, the idea is to make exploration more sample efficient so that the agent can explore in as few steps as possible. In safe exploration, the focus is on ensuring safety during exploration. We suggest splitting efficiency-based methods further into imitation-based and self-taught methods. In imitation-based learning, the agent learns how to utilise a policy from an expert to improve exploration. In self-taught methods, learning is performed from scratch. Self-taught methods can be further divided into planning, intrinsic reward, and random methods. In planning methods, the agent plans its next action to gain a better understanding of the environment. In random methods, the agent does not make conscious plans; rather, it explores and then observes the consequences of this exploration. We divide intrinsic reward methods into two categories: (i) reward novel states, which rewards agents for visiting novel states; and (ii) reward diverse behaviours, which rewards agents for discovering novel behaviours. Note that intrinsic rewards are part of the larger notion of intrinsic motivation; for an extensive review of intrinsic motivation, see Aubret et al. (2019) and Schmidhuber (2010). In planning methods, two categories are considered: (i) goal-based, in which an agent is given an exploratory goal to reach; and (ii) probabilistic, in which probabilistic models of the environment are used. The full categorization is shown in Fig. 1. In the following, each category is described in detail. The main objective of the categorization is to highlight the key contribution of each approach. Note that a certain approach could be a combination of various techniques; for example, Go-Explore Ecoffet et al. (2019) utilises reward novel states methods, but its main contribution is best described by goal-based methods.

3.1 Reward Novel States

Figure 2: Overview of the reward novel state methods. In general, in reward novel states, the agent is given an additional reward for discovering novelty. This additional reward is generated by an intrinsic reward module.

In this section, approaches that reward novel states are discussed and compared. Reward novel state approaches give the agent a reward for discovering new states. This reward is called an intrinsic reward. As can be observed in Fig. 2, the intrinsic reward supplements the reward given by the environment (called the extrinsic reward). By rewarding novel states, agents incorporate exploration into their behaviours Schmidhuber (2010).

These approaches were generalised in Schmidhuber (2010). In general, there are two necessary components: "an adaptive predictor or compressor or model of the growing data history as the agent is interacting with its environment" to provide an intrinsic reward, and "a general reinforcement learner" to learn behaviours Schmidhuber (2010). In this division, the reinforcement learner is asked to invent things which the predictor does not yet know. In our review, the former is simply referred to as the intrinsic reward module, and the latter as the agent.

There are different ways of classifying intrinsic rewards Oudeyer and Kaplan (2007); Aubret et al. (2019). Here, we largely follow the classification of Aubret et al. (2019) with the following categories: (i) prediction error methods, (ii) count-based methods and (iii) memory methods.

3.1.1 Prediction Error Methods

In prediction error methods, the error of a prediction model when predicting a previously visited state is used to compute the intrinsic reward. For a certain state, if the model's prediction is inaccurate, the state has not been seen often and the intrinsic reward is high. One of the key questions that needs to be addressed is how to use the model's error to compute the intrinsic reward. To this end, Achiam and Sastry (2017) compared two intrinsic reward functions: (i) the magnitude of the prediction model's error and (ii) the learning progress. The first method has shown better performance and is therefore recommended; it can be formalised as:

$$r^{i}_{t} = f\left( M(g(s_t)) - g(s_{t+1}) \right)$$

where $s$ represents a state, $M$ is an environmental model, $t$ and $t+1$ are two consecutive time steps, $g$ is an optional model for state representation, and $f$ is an optional reward scaling function.
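As a minimal sketch, the Euclidean-distance form of such a bonus can be computed as below; the environment model producing the prediction is assumed to exist elsewhere:

```python
import math

def prediction_error_bonus(predicted_next_state, actual_next_state):
    """Intrinsic reward: Euclidean distance between the model's prediction of
    the next state and the state actually observed (larger error => more novel)."""
    return math.sqrt(sum((p - a) ** 2
                         for p, a in zip(predicted_next_state, actual_next_state)))
```

States the model predicts well (i.e. familiar transitions) earn a bonus near zero, so the agent is pushed toward transitions it cannot yet predict.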

The simplest method of this type was described in Schmidhuber (1991b, a). The intrinsic reward is measured as the Euclidean distance between a model's prediction of a state and that state. This simple idea was revisited in Li et al. (2019). Generative adversarial networks (GANs, Goodfellow et al. (2014)), distinguishing real from fake states, were proposed as a prediction error method in Hong et al. (2019). Since then, many other approaches have been devised, which can be further divided into (i) state representation prediction, (ii) a priori knowledge and (iii) uncertainty about the environment.

State representation prediction methods

In state representation prediction methods, the state is represented in a higher-dimensional space. Then, a model is tasked with predicting the next state representation given the previous state representation. The larger the error is in the prediction, the larger the intrinsic reward is. One way of providing state representation is using an autoencoder Stadie et al. (2015). Both pre-trained and online-trained autoencoders were considered and showed similar performance. Improvements to autoencoder-based approaches were proposed in Bougie and Ichise (2020a, b), where a slow-trained autoencoder was added. Thus, the intrinsic reward decays more slowly and the agent explores for longer, increasing the chance of finding the optimal reward.

Another method of providing state representation involves utilising fixed networks with random weights. Then, another network is used to predict the outputs of randomly initialised networks as shown in Fig.  3. The most popular approach of this type is called random network distillation (RND) Burda et al. (2018). A similar approach was considered in Osband et al. (2018).

Figure 3: RND overview. The predictor is trained to predict the output of a randomly parameterized target network.
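A toy sketch of the RND idea using single-layer linear networks; the dimensions, learning rate, and squared-error bonus are illustrative simplifications of the original deep networks:

```python
import numpy as np

rng = np.random.default_rng(0)
# Fixed, randomly initialised target network (never trained).
W_target = rng.standard_normal((8, 4))
# Predictor network, trained to match the target's output on visited states.
W_pred = np.zeros((8, 4))

def rnd_bonus(state):
    """Intrinsic reward: squared error between predictor and frozen target outputs."""
    return float(np.sum((state @ W_target - state @ W_pred) ** 2))

def rnd_train(state, lr=0.01):
    """One gradient step moving the predictor toward the target on this state,
    so frequently visited states earn a shrinking bonus."""
    global W_pred
    err = state @ W_pred - state @ W_target        # prediction error, shape (4,)
    W_pred = W_pred - lr * np.outer(state, err)    # gradient of 0.5 * ||err||^2
```

Because the target is frozen, the predictor's error on a state shrinks only through repeated visits, which is exactly what makes the bonus a novelty signal.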

A state representation method derived from inverse dynamics features (IDF) was used in Pathak et al. (2017). In IDF, the representation comes from forcing the agent to predict the action, as illustrated in Fig. 4. IDF was compared against the state prediction method and random representation in Burda et al. (2019), with the following conclusions: IDF had the best performance and scaled best to unseen environments. IDF was utilised in Raileanu and Rocktäschel (2020), where the Euclidean distance between two consecutive state representations was used as the intrinsic reward, as shown in Fig. 4. Intuitively, the more significant a transition is, the larger the change in IDF's state representation. In another study, RND and IDF were combined into a single intrinsic reward Li et al. (2020b).

Figure 4: IDF and rewarding impact driven exploration (RIDE) overview. In IDF, the features are extracted based on the network predicting the next action. In RIDE, the intrinsic reward is based on the difference in state representation. (adapted from Pathak et al. (2019))

A compact representation using information theory was proposed in Kim et al. (2019a). Information theory is used to represent states that are close to the representation space in the environment space. Information theory can also be used to create a bottleneck latent representation Kim et al. (2019b). Bottleneck latent representation occurs when mutual information between the input to the network and latent representation is minimised.

A priori knowledge methods

In some types of problems, it makes sense to use certain parts of the state space as an error and use it for computing the intrinsic reward. Those parts could be depth point cloud, position, and sound, and they rely on a priori knowledge from the designer.

Depth point cloud prediction error was used in Mirowski et al. (2017). The scalability of this approach was analysed by Dhiman et al. (2018). It was found that the performance was good in the same environment with different starting positions, but it did not scale to a new scenario. Positions in a 3D space can also be used Li et al. (2020a). An approach using the position was proposed in Stanton and Clune (2018). The environment is split into the x-y grid where each node’s intrinsic reward is placed. When the episode terminates, the rewards are restored to a default value.

Sound as a source of intrinsic reward was used in Dean et al. (2020). To model sounds, the model is trained to recognise when the sound and the corresponding frame match. If the model indicates misalignment between frames and sounds, it means that the state is novel.

Uncertainty about the environment methods

In these methods, the intrinsic reward is based on the uncertainty the agent has about the environment. If the agent is exploring highly uncertain areas of the environment, the reward is high. Uncertainty can be utilised using the following techniques: Bayesian methods, ensembles of models, and information-theoretic approaches.

Bayesian approaches are generally intractable for large problem spaces; thus, approximations are used. Kolter and Ng (2009) presented a close-to-optimal approximation method using the Dirichlet probability distribution over the (state, action, next state) triplet. Another approximation is to use ensembles of models, as proposed in Pathak et al. (2019). The intrinsic reward is given based on model disagreement, as shown in Fig. 5. The models were initialised with different random weights and trained on different mini-batches to maintain diversity.

In information-theoretic approaches, the intrinsic reward is computed using the information gained from agent actions. The higher the gain, the more the agent learns, and the higher the intrinsic reward. The general framework for these types of approaches was presented in Still and Precup (2012); Still (2009). One of the most popular information-theoretic approaches is variational information maximization exploration (VIME) Houthooft et al. (2016). In this approach, the information gain is approximated as the Kullback-Leibler (KL) divergence between the weight distributions of a Bayesian neural network before and after seeing new observations. In Mohamed and Rezende (2015), maximising the mutual information between a sequence of actions and the resulting state is rewarded; rewarding this mutual information gain means maximising the information contained in the action sequence about the state. Mutual information gain was combined with the state prediction error into a single intrinsic reward in De Abril and Kanai (2018); Chien and Hsu (2020).

Figure 5: Overview of self-supervised exploration via the disagreement method. The intrinsic reward is based on disagreement between models. (adapted from Pathak et al. (2019))

The key advantage of prediction error methods is that they rely only on a model of the environment; thus, there is no need for buffers or complex approximation methods. Each of the four categories of methods above has unique advantages and challenges.

While predicting the state directly requires little to no a priori knowledge, the model needs to learn how to recognise different states, and such methods struggle when many states are present in the environment. State representation methods can cope with large state spaces at the cost of increased designer burden and reduced accuracy. Moreover, in a state representation method, the agent cannot affect the state representation, which can often lead to different states being represented similarly. Utilising a priori knowledge relies on defining a special element of the state space as the source of error for computing the intrinsic reward. These methods do not suffer from problems with the speed of prediction and state recognition; however, they rely on designer experience to define parts of the state space appropriately. Finally, in uncertainty-about-the-environment approaches, the agent's uncertainty is used to generate the intrinsic reward. The key advantage of this approach is its high scalability and automatic transition between exploration and exploitation. Prediction error methods have also shown the ability to solve the couch-potato (noisy-TV) problem by storing observations in a memory buffer Savinov et al. (2019). An intrinsic reward is given only when an observation is sufficiently far away (in terms of time steps) from the observations stored in the buffer. This mitigates the couch-potato problem, since repeatedly visiting states close to each other is not rewarded.

3.1.2 Count-Based Methods

In count-based methods, each state is associated with a visitation count $N(s)$. If a state has a low count, the agent is given a high intrinsic reward to encourage revisiting it. How to compute the reward from the count was discussed in Ménard et al. (2020), where it was shown that a bonus proportional to $1/N(s)$ guarantees a faster convergence rate than the commonly used $1/\sqrt{N(s)}$.
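A minimal sketch of a count-based bonus, here in the $1/N(s)$ form; the scale parameter is an illustrative assumption:

```python
from collections import Counter

visit_counts = Counter()

def count_bonus(state, scale=1.0):
    """Count-based intrinsic reward: increment N(state) and return
    scale / N(state), so rarely visited states earn a larger bonus."""
    visit_counts[state] += 1
    return scale / visit_counts[state]
```

The bonus for any single state shrinks toward zero with repeated visits, gradually shifting the agent from exploration to exploitation.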

In problems with a large number of states, counting visits is difficult because it requires saving a count for each state. To address this, the count is normally kept over a reduced-size state representation.

Count on state representation methods

In count on state representation methods, the states are represented in a reduced form to alleviate memory requirements. This allows storing the count and a state with minimal memory in a table, even in the case of a large state space.

One popular method of this type was proposed in Tang et al. (2017), where static hashing was used: a technique called SimHash Charikar (2002) represents images as a set of numbers called a hash. To generate an even more compact representation, in Choi et al. (2019) the state was represented as the learned x-y position of the agent, achieved using an attentive dynamic model (ADM). Successor state representation (SSR) Machado et al. (2020) is a method that combines count and representation; the SSR is based on the count of, and order between, the states. Intuitively, the SSR can be used as a count replacement.
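A toy sketch of hashed counting in the spirit of SimHash: the state is projected through a fixed random matrix and its sign pattern is used as the hash key, with the $1/\sqrt{n}$ bonus form used by Tang et al.; the projection size and state dimension below are illustrative:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(42)
A = rng.standard_normal((16, 64))  # fixed random projection: 64-dim state -> 16-bit hash
hash_counts = Counter()

def simhash(state):
    """SimHash key: the sign pattern of a fixed random projection of the state."""
    return tuple((A @ state > 0).astype(int))

def hashed_count_bonus(state, beta=1.0):
    """Count visits per hash bucket instead of per raw state, then reward beta/sqrt(n)."""
    key = simhash(state)
    hash_counts[key] += 1
    return beta / np.sqrt(hash_counts[key])
```

Nearby states tend to fall into the same bucket, so the table stays small even when the raw state space is continuous or very large.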

It is also possible to approximate the count on a state representation using a function. For example, Bellemare et al. (2016) proposed an approximation based on a density model. Density models include context tree switching (CTS) Bellemare et al. (2016), Gaussian mixture models Zhao and Tresp (2019) and PixelCNN Ostrovski et al. (2017). Martin et al. (2017) proposed an improvement to the approximate count by counting on the feature space rather than on raw inputs.


Count-based methods approximate the intrinsic reward by counting the number of times a given state has been visited. To reduce the computational effort, counts are usually associated with state representations rather than raw states. This, however, relies on being able to represent states efficiently; state representations can still require substantial memory and careful design.

3.1.3 Memory Methods

In these methods, an intrinsic reward is given based on how easy it is to distinguish a state from all others: the easier a state is to distinguish, the more novel it is. As comparing states directly is computationally expensive, several approximation methods have been devised. Here, we categorize them into comparison models and experience replay.

Models can be trained to compare state-to-state and thereby reduce the computational load. One example is the exemplar model Malisiewicz et al. (2011), developed further in Fu et al. (2017). Exemplar models are a set of classifiers, each of which is trained to distinguish a specific state from the others. Training multiple classifiers is generally computationally expensive, so the following two strategies are proposed to further reduce the cost: updating a single exemplar with each new data point, and sampling exemplars from a buffer.

Instead of developing models for comparison, a limited-size experience replay buffer was combined with prediction error methods in Badia et al. (2020b). Two rewards are combined into the intrinsic reward: (i) an episodic reward, where an episodic experience replay buffer stores states and compares new states to them; and (ii) a long-term novelty reward, where RND Burda et al. (2018) determines the state's long-term novelty. Additionally, multiple policies are trained, each with a different ratio between the extrinsic and intrinsic reward. A meta learner that automatically chooses different ratios of extrinsic and intrinsic rewards at each step was proposed in Badia et al. (2020a).


In memory-based approaches, the agent derives an intrinsic reward by comparing its current state with the states stored in memory. The comparison model method has the advantage of small memory requirements but requires careful tuning of model parameters. Using a limited experience buffer, on the other hand, does not suffer from model inaccuracies and has shown great performance in difficult exploratory Atari games.

3.1.4 Summary

The reward novel state-based approaches are summarised in Table 1. Prediction error methods are the most commonly used and have generally shown very good performance (for example, RND Burda et al. (2018) scores 6,500 in Montezuma's Revenge). However, they normally require a hand-designed state representation method for computational efficiency. This requires problem-specific adaptations, reducing the applicability of these approaches. Count-based methods are computationally efficient, but they require either memory to store counts or complex models Bellemare et al. (2016). Moreover, counting states in continuous-state domains is challenging and requires discretising continuous states into bins. Recently, memory methods have shown good performance in games such as Montezuma's Revenge, scoring as much as 11,000 Badia et al. (2020b). Memory methods require a careful balance of how much data to remember for comparison; otherwise, computing the comparison can take a long time.

Legend: A - action space, Ac - action, R - reference, MB - model based, MF - model free, D - discrete, C - continuous, Q - Q values, V - values, P - policy, O - output, S - state space, U - underlying algorithm and Top score on a key benchmark explanation - [benchmark]:[scenario] [score] ([baseline approach] [score]).

R Prior Knowledge U Method Top score on a key benchmark Input Types O MB/ MF A/ S
Pathak et al. (2017) A3C Prediction error Vizdoom: very sparse 0.6 (A3C 0) Vizdoom image P MB D/ D
Stadie et al. (2015) autoencoder DQN Prediction error Atari: Alien 1436 (DQN 300) Atari images Q MB D/ D
Savinov et al. (2019) pretrained discriminator (non-online) PPO Prediction error Vizdoom: very sparse 1 (PPO 0); Dmlab: very sparse 30 (PPO+ICM 0) Vizdoom images/ Mujoco joints angles Ac MB C/ C
Burda et al. (2018) PPO Prediction error Atari: Montezuma Revenge 7500 (PPO 2500) Atari images P MB D/ D
Bougie and Ichise (2020a) PPO Prediction error Atari: Montezuma Revenge 20 rooms found (RND 14) images Ac MB D/ D
Hong et al. (2019) DQN Prediction error Atari: Montezuma Revenge 200 (DQN 0) Enumerated state id/ Atari Image Q MB D/ D
Kim et al. (2019a) TRPO Prediction error Atari: Frostbite 6000 (ICM 3000) Atari Images Ac MB D/ D
Stanton and Clune (2018) agent position, reward grid A2C Prediction error Atari: Montezuma Revenge 3200 (A2C 0) Atari images Ac MB D/ D
Achiam and Sastry (2017) TRPO Prediction error Mujoco: halfcheetah 80 (VIME 40); Atari: Venture 400 (VIME 0) Atari RAM states/ Mujoco joints angles Ac MB C/ C
Li et al. (2020b) A2C Prediction error Atari: Asterix 500000 (RND 10000) Atari images Ac MB D/ D
Kim et al. (2019b) PPO Prediction error Atari: Montezuma Revenge with distraction 1500 (RND 0) Atari images Ac MB D/ D
Chien and Hsu (2020) DQN Prediction error PyDial: 85 (CME 80); OpenAI: Mario 0.8 (CME 0.8) Images Q MB D/ D
Li et al. (2019) DDPG Prediction error Robot: FetchPush 1 (DDPG 0) Robot joints angles Ac MB C/ C
Raileanu and Rocktäschel (2020) IMPALA Prediction error Vizdoom: 0.95 (ICM 0.95) Vizdoom Images Ac MB D/ D
Mirowski et al. (2017) A3C Prediction error DM Lab: Random Goal 96 (LSTM-A3C 65) DM Lab images Ac MB C/ C
Tang et al. (2017) TRPO Count-based Atari: Montezuma Revenge 238 (TRPO 0); Mujoco: swimmergather 0.3 (VIME 0.15) Atari images/ Mujoco joints angles P MF C/ C
Martin et al. (2017) Blob-PROST features SARSA-e Count-based Atari: Montezuma Revenge 2000 (SARSA 200) Blob-PROST features Q MB D/ D
Machado et al. (2020) DQN Count-based Atari: Montezuma Revenge 1396 (Pseudo-counts 1671) Atari images Q MF D/ D
Ostrovski et al. (2017) DQN and Reactor Count-based Atari: Gravitar 1500 (Reactor 1000) Atari images Ac MB D/ D
Badia et al. (2020b) R2D2 Memory Atari: Pitfall 15000 (R2D2 -0.5) Atari images P MB D/ D
Badia et al. (2020a) R2D2 Memory Beat humans in all 57 atari games Atari images P MB D/ D
Fu et al. (2017) state encoder TRPO Memory Mujoco: SparseHalfCheetah 173.2 (VIME 98); Atari: Frostbite 4901 (TRPO 2869); Doom: MyWayHome 0.788 (VIME 0.443) Atari images/ Mujoco joints angles Ac MB C/ C
Table 1: Comparison of reward novel state approaches

3.2 Reward Diverse Behaviours

In reward diverse behaviours, the agent is rewarded for collecting as many different experiences as possible, as shown in Fig. 6. This makes exploration itself the objective rather than a means of finding reward. These types of approaches are also referred to as diversity methods and can be split into evolution strategies and policy learning.

Figure 6: Overview of reward diverse behaviour-based methods. The key idea is for the agent to experience as many things as possible, in which either evolution or policy learning can be used to generate a set of diverse experiences

3.2.1 Evolution Strategies

Reward diverse behaviours were initially used with evolutionary-based approaches. In evolutionary approaches, a group of sample solutions (a population) is evaluated and evolves over time to get closer to the optimal solution. Note that evolutionary approaches are generally not considered part of reinforcement learning but can be used to solve the same type of problems Such et al. (2017); Salimans et al. (2017).

One of the earliest methods called novelty search was devised in Lehman and Stanley (2011) and Risi et al. (2009). In novelty search, the agent is encouraged to generate numerous different behaviours using a metric called diversity measure. The diversity measure must be hand-designed for each environment, limiting transferability between different domains. Recently, novelty search has been combined with other approaches, such as reward maximization Conti et al. (2018) and reward novel state method Gravina et al. (2018). In Conti et al. Conti et al. (2018), the novelty-search policy is combined with a reward maximisation policy to encourage diverse behaviours and search for the reward. Gravina et al. Gravina et al. (2018) compared three ways of combining novelty search and reward novel state: (i) novelty search, (ii) sum of reward novel state and novelty search, and (iii) sequential optimisation where the second one performed the best in a simulated robot environment. More detailed reviews of exploration in evolution strategies can be found in Mouret and Doncieux (2012) and Pugh et al. (2016).
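The novelty score at the heart of novelty search can be sketched compactly: a behaviour is scored by its mean distance to the k nearest behaviours already in an archive. The choice of behaviour descriptor (here a 2-D final position) is hand-designed, as the text notes, and all constants are illustrative:

```python
import numpy as np

def novelty_score(behaviour, archive, k=5):
    """Novelty of a behaviour descriptor, in the spirit of novelty search
    (Lehman and Stanley, 2011): mean distance to the k nearest neighbours
    in the archive of previously seen behaviours."""
    if not archive:
        return float("inf")  # nothing to compare against: maximally novel
    d = sorted(float(np.linalg.norm(np.asarray(b) - np.asarray(behaviour)))
               for b in archive)
    return float(np.mean(d[:k]))

# Archive of final x-y positions reached by earlier individuals.
archive = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
common = novelty_score((0.05, 0.05), archive)  # near existing behaviours
rare = novelty_score((3.0, 3.0), archive)      # far from all of them
```

Since `rare` exceeds `common`, the distant behaviour would be favoured for selection, which is exactly the pressure toward diverse behaviours described above.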


Initially, novelty search was used as a stand-alone technique; however, recently, combining it with other techniques Gravina et al. (2018); Conti et al. (2018) has shown more promise. Such a combination is more beneficial (in terms of reward) as diverse behaviours are more directed toward highly scoring ones.

3.2.2 Policy Learning

Recently, diversity measures have been applied in policy learning approaches. The diversity among policies was measured in Hong et al. (2018). Diversity is computed by measuring the distance between policies (either KL divergence or simple mean squared error). Very promising results for diversity are presented in Eysenbach et al. (2019), as shown in Fig. 7. To generate diverse policies, the objective function consists of (i) maximising the entropy of skills, (ii) inferring behaviours from the current state, and (iii) maximising randomness within a skill. A similar approach was proposed in Cohen et al. (2019) with a new entropy-based objective function. A combination of diversity with a goal-based approach was proposed in Pong et al. (2019). In this study, the agent learns both diverse goals and goals useful for rewards using the skew-fit algorithm. In the skew-fit algorithm, the agent skews the empirical goal distribution so that rarely visited states are visited more frequently. The algorithm was tested using both simulations and real robots.
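Measuring diversity via distances between policies, as in Hong et al. (2018), can be sketched as follows. The callable policy interface (state to action-probability vector) and the mean-squared-error choice are assumptions for illustration:

```python
import numpy as np

def policy_diversity(pi, others, states):
    """Mean squared distance between the current policy's action
    probabilities and those of previously stored policies, averaged over
    a set of sampled states."""
    total = 0.0
    for other in others:
        for s in states:
            diff = np.asarray(pi(s)) - np.asarray(other(s))
            total += float(np.mean(diff ** 2))
    return total / (len(others) * len(states))

pi_new = lambda s: [0.9, 0.1]
pi_old = lambda s: [0.5, 0.5]
score = policy_diversity(pi_new, [pi_old], states=[0])  # 0.16
```

A policy identical to the stored ones scores zero, so maximising this quantity pushes the learner toward behaviours its predecessors did not exhibit.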

Figure 7: An overview of Diversity is all you need (DIAYN), where the agent is encouraged to have as many diverse policies as possible. (adapted from Pathak et al. (2019))

In Gangwani et al. (2019), the agent stores a set of successful policies in an experience replay and then minimises the difference between the current policy and the best policies from storage. To allow exploration at the same time, the entropy of parameters between policies is maximised. The results show an advantage over evolution strategies and PPO in sparse reward Mujoco problems.


Diversity in policy-based approaches is a relatively new concept that is still being developed. Careful design of a diversity criterion shows very promising performance, beating standard reinforcement learning by significant margins Eysenbach et al. (2019).

3.2.3 Summary

Reward diverse behaviour methods are summarised in Table 2. In evolution strategies approaches, a diverse population is maintained, whereas in policy learning, a diverse policy is found. Evolution strategies have the potential to find solutions not envisioned by designers, as they search over the neural network structure as well as for diversity. However, evolution strategies suffer from low sample efficiency, making training either computationally expensive or slow. Policy learning cannot go beyond pre-specified structures but can still show remarkable results Eysenbach et al. (2019). Another advantage of policy learning methods is their suitability to both continuous and discrete state-space problems.

Legend: A - action space, Ac - action, R - reference, MB - model based, MF - model free, D - discrete, C - continuous, Q - Q values, V - values, P - policy, O - output, S - state space, U - underlying algorithm and Top score on a key benchmark explanation - [benchmark]:[scenario] [score] ([baseline approach] [score]).

R Prior Knowledge U Method Top score on a key benchmark Input Types O MB/ MF A/ S
Conti et al. (2018) domain specific behaviours Reinforce Evolution strategies Atari: Frostbite 3785 (DQN 1000) Atari RAM state/ Mujoco joints angles Ac MF C/ C
Gravina et al. (2018) NS population based Evolution strategies Robotic navigation: 400 successes six range finders, pie-slice goal-direction sensor Ac MB C/ C
Lehman and Stanley (2011) measure of policy distance NEAT Evolution strategies maze: 295 (maximum achievable) six range finders, pie-slice goal-direction sensor Ac MF D/ D
Risi et al. (2009) measure of policy distance NEAT Evolution strategies T-maze: solved after 50,000 evaluations enumerated state id Ac MF D/ D
Cohen et al. (2019) SAC Policy learning Mujoco: Hopper 3155 (DIAYN 3120) Mujoco joint angles Ac MB C/ C
Pong et al. (2019) RIG Policy learning Door Opening (distance to the objective): 0.02 (RIG + DISCERN-g 0.04) Robots joint angles Ac MB C/ C
Eysenbach et al. (2019) SAC Policy learning Mujoco: half cheetah 4.5 (TRPO 2) Mujoco joints angles Ac MF C/ C
Hong et al. (2018) DQN, DDPG, A2C Policy learning Atari: Venture 900 (others 0); Mujoco: SparseHalfCheetah 80 (Noisy-DDPG 5) Atari images/ Mujoco joints angles Ac/ Q MB C/ C
Gangwani et al. (2019) Itself Policy learning Mujoco: SparseHalfCheetah 1000 (PPO 0) Robot joints angles Ac MF C/ C
Table 2: Comparison of reward diverse behaviour-based approaches

3.3 Goal-Based Methods

Figure 8: Illustration of goal-based methods. In goal-based methods, the agent's task is to reach a specific goal (or state). This goal is then either explored from using another exploration method (left) or generated as an exploratory target itself (right). The key concept is to guide the agent directly to unexplored areas.

In goal-based methods, the states of interest for exploration are used to guide the agent’s exploration. In this way, the exploration can immediately focus on largely unknown areas. In those types of methods, the agent requires a goal generator, a policy to find a goal, and an exploration strategy (see Fig. 8). The goal generator is responsible for creating goals for the agent. The policy is used to achieve the desired goals. An exploration strategy is used to explore once a goal has been achieved or while trying to achieve goals.

Here, we split goal-based methods into two categories: goals to explore from and exploratory goal methods.

3.3.1 Goals to Explore from Methods

The main techniques used in these methods are: (i) memorising visited states and trajectories - storing past states in a buffer and choosing an exploratory goal from the buffer; and (ii) learning from the goal - assuming the goal state is known but the path to it is unknown.

One of the most famous approaches of this type, in which the goal is chosen from a buffer, is Go-Explore Ecoffet et al. (2019). States and trajectories are saved in a buffer and selected probabilistically. Once a state to explore from has been chosen, the agent is teleported there and explores randomly. In Ecoffet et al. (2020), teleportation was replaced with policy learning. Go-Explore was extended to continuous domains in Matheron et al. (2020). Concurrently, similar concepts were developed in Guo and Brunskill (2019); Guo et al. (2019); Oh et al. (2018); Guo et al. (2020). In these approaches, a past trajectory is selected for the agent to either exploit or explore from randomly. If exploration is selected, a state sampled from the trajectory is chosen as a goal to explore from, based on the visitation count.
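The archive-selection step can be sketched in a few lines, loosely following Go-Explore (Ecoffet et al., 2019): cells visited less often are more likely to be chosen as states to explore from. The 1/(visits+1) weighting is an illustrative simplification of the paper's selection heuristics:

```python
import random

def select_cell(archive):
    """Probabilistically pick a cell to return to and explore from,
    favouring cells that have been visited less often."""
    cells = list(archive)
    weights = [1.0 / (archive[c]["visits"] + 1) for c in cells]
    return random.choices(cells, weights=weights, k=1)[0]

# The archive maps a coarse state representation ("cell") to statistics
# and the trajectory that first reached it.
archive = {
    "cell_a": {"visits": 100, "trajectory": []},
    "cell_b": {"visits": 1, "trajectory": []},
}
chosen = select_cell(archive)  # "cell_b" is selected with high probability
```

After selection, the stored trajectory is replayed (or a goal-conditioned policy is used, as in Ecoffet et al. (2020)) to reach the cell before random exploration resumes.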

Another goal method was proposed in Liu et al. (2020), where the least-visited state on an x-y grid of an Atari game was selected as the goal. This significantly reduces the computational effort of remembering where the agent has been.

Learn from the goal methods assume that the agent knows what the state with maximum reward looks like but does not know how to get there. In this case, it is plausible to utilise this knowledge, as described in Edwards et al. (2018); Florensa et al. (2017). In Edwards et al. (2018), a model was trained to predict backward steps in reinforcement learning. With such a model, the agent can 'imagine' states preceding the goal and thus explore outward from the goal state. Similarly, in the scenario where the agent can start at the reward position, it can explore toward the starting position from the goal Florensa et al. (2017).
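The backward-imagination idea can be sketched as follows, in the spirit of Edwards et al. (2018). The `backward_model` interface (returning a list of candidate predecessor states) is an illustrative assumption:

```python
def imagine_predecessors(goal, backward_model, depth=3):
    """Expand outward from the goal: repeatedly ask a learned backward
    model which states could have preceded the current frontier."""
    frontier = [goal]
    for _ in range(depth):
        frontier = [p for s in frontier for p in backward_model(s)]
    return frontier

# Toy chain environment: state s is always preceded by state s - 1.
backward = lambda s: [s - 1]
states = imagine_predecessors(5, backward, depth=3)  # [2]
```

In a learned setting the backward model would be trained from observed transitions, and the imagined predecessors would seed exploration from near the goal.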


Memorise visited states and trajectories methods have shown some remarkable results in hard exploration benchmarks such as Montezuma's Revenge and Pitfall. By utilising a reward state as a goal, as outlined in learn from the goal methods, the exploration problem can be mitigated, as the agent knows where to look for the reward.

3.3.2 Exploratory Goal Methods

In exploratory goal methods, the agent is given an exploratory goal to reach; exploration occurs as the agent attempts to reach it. The following techniques are considered: (i) meta-controllers, (ii) goals in the region of highest uncertainty, and (iii) sub-goal methods.


In meta-controllers, the algorithm consists of two parts: a controller and a worker. The controller has a high-level overview and provides goals for the worker to reach.

One of the simple approaches is to generate and sample goals randomly Forestier et al. (2017). The random goal selection mechanism was refined in Colas et al. (2019) with goal selection based on the learning progress. A similar approach in two phases was proposed by Pere et al. Péré et al. (2018). First, the agent explores randomly to learn the environment representation. Second, the goals are randomly sampled from the learned representation. An approach in which both goal creation and selection mechanisms are devised by a meta-controller was proposed in Vezhnevets et al. (2017). In this work, a meta-controller proposes goals within a certain horizon for a worker to find.

In Hester and Stone (2013), a multi-arm bandit-based method to choose one strategy from a group of hand-designed strategies was proposed. At each episode, every ten steps, the agent chooses a strategy based on its performance. The goal selection mechanism from a group of hand-designed goals is also discussed in Kulkarni et al. (2016). The low-level controller is trained on a state-action space, and the meta-controller is trained on a goal-state space. An approach in which each subtask is learned by one learner was proposed in Riedmiller et al. (2018). To allow any sub-task learner to perform its task from all states, the starting points for learning are shared between sub-task learners.
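The bandit-based strategy selection described above can be sketched with a standard UCB bandit. This is a generic illustration, not the exact algorithm of Hester and Stone (2013); the UCB formula and the exploration constant are assumptions:

```python
import math

class StrategyBandit:
    """UCB-style bandit over a set of hand-designed exploration
    strategies: the agent periodically picks the strategy whose
    upper confidence bound on performance is highest."""

    def __init__(self, n_strategies, c=1.0):
        self.counts = [0] * n_strategies
        self.values = [0.0] * n_strategies
        self.c = c
        self.t = 0

    def choose(self):
        self.t += 1
        for i, n in enumerate(self.counts):
            if n == 0:  # try every strategy at least once
                return i
        ucb = [v + self.c * math.sqrt(math.log(self.t) / n)
               for v, n in zip(self.values, self.counts)]
        return ucb.index(max(ucb))

    def update(self, i, reward):
        """Incrementally update the running mean reward of strategy i."""
        self.counts[i] += 1
        self.values[i] += (reward - self.values[i]) / self.counts[i]

bandit = StrategyBandit(2)
for _ in range(100):  # strategy 0 consistently performs better
    i = bandit.choose()
    bandit.update(i, 1.0 if i == 0 else 0.0)
```

After a few rounds the bandit concentrates on the better-performing strategy while still occasionally revisiting the other one, which is the behaviour the meta-level selection needs.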


In sub-goal methods, the algorithm finds sub-goals for the agent to reach. In general, sub-goal methods can be split into: (i) bottleneck states, which lead to many other states, used as exploratory goals; (ii) sub-goals that make progress towards the main goal and are thus likely to lead to the reward; and (iii) uncertainty-based sub-goals.

One of the early methods of discovering bottleneck states was described in Ghafoorian et al. (2013) using an ant colony optimisation method. Bottleneck states are identified as the states often visited by ants during exploration (by measuring pheromone levels). To discover bottleneck states, Machado et al. (2017) proposed the use of proto-value functions based on the eigenvalues of representations. This allows the computation of eigenvector centrality Zaki and Meira (2014), which is high if a node has many connections. This was later improved in Machado et al. (2018) by replacing the handcrafted adjacency matrix with successor representations.

To design sub-goals which lead to a reward, Fang et al. Fang et al. (2020) proposed progressively generating sub-tasks that are closer to the main task. To this end, two components are used: the learning progress estimator and task generator. The learning progress estimator determines the learning progress on the main task. The task generator then uses the learning progress to generate sub-tasks closer to the main tasks.

In uncertainty-based methods, the sub-goals for the agent are positioned at the most uncertain states. One of the earliest attempts of this type was proposed by Guestrin et al. (2002). Here, the upper and lower bounds of the reward are estimated. Then, states with high uncertainty regarding the reward are used as exploratory goals. Clustering states using k-means and visiting the least-visited clusters was proposed in Abel et al. (2016). Clustering can also help to solve the couch potato problem, as described in Kovač et al. (2020). In this approach, the states are clustered using Gaussian mixture models. The agent avoids the couch potato problem by clustering all states originating from a TV into a single avoidable cluster.
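The cluster-based goal selection can be sketched as follows, in the spirit of Abel et al. (2016): visited states are assigned to the nearest centroid and the least-visited cluster becomes the next goal region. The centroids would normally come from k-means on previously seen states; the toy data here are illustrative:

```python
import numpy as np

def least_visited_cluster(visited_states, centroids):
    """Return the index of the cluster with the fewest visited states,
    using nearest-centroid assignment."""
    counts = [0] * len(centroids)
    for s in visited_states:
        d = [float(np.linalg.norm(np.asarray(s) - np.asarray(c)))
             for c in centroids]
        counts[d.index(min(d))] += 1
    return counts.index(min(counts))

centroids = [(0.0, 0.0), (10.0, 10.0)]
visited = [(0.1, 0.2), (0.3, 0.1), (9.5, 10.2)]
goal_cluster = least_visited_cluster(visited, centroids)  # 1
```

The agent would then direct its goal-conditioned policy toward a representative state of the returned cluster.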


There are two main categories of exploratory goal methods: meta-controllers and sub-goals. The key advantage of meta-controllers is that they allow the agent to set its own goals without excessively rewarding itself. However, training the controller is a challenge that has not been fully solved yet. In sub-goal methods, what constitutes a goal is defined by human designers. This puts a significant burden on the designer to provide suitable and meaningful goals.

3.3.3 Summary

The goal-based methods are summarised in Table 3. Goals to explore from methods have recently shown very good performance Ecoffet et al. (2020); Guo et al. (2020) in difficult exploratory games such as Montezuma's Revenge. The key challenges of these methods are the need to store states and trajectories, as well as how to navigate to the goal. This issue is partially mitigated in Guo et al. (2020) by using the agent's position as the state representation; however, this is highly problem-specific. Exploratory goal methods are limited because devising an exploratory goal becomes more challenging as reward sparsity increases. This is somewhat mitigated in Colas et al. (2019) and Fang et al. (2020), but these approaches rely on the ability to parametrize the task.

Legend: A - action space, Ac - action, R - reference, MB - model based, MF - model free, D - discrete, C - continuous, Q - Q values, V - values, P - policy, O - output, S - state space, U - underlying algorithm and Top score on a key benchmark explanation - [benchmark]:[scenario] [score] ([baseline approach] [score]).

R Prior Knowledge U Method Top score on a key benchmark Input Types O MB/ MF A/ S
Guo et al. (2019) A2C and PPO Goals to Explore from Atari: Montezuma Revenge 20158 (A2C+CoEX 6600) Atari images P MF C/ C
Guo and Brunskill (2019) DQN, DDPG Goals to Explore from Mujoco: Fetch Push 0.9 after 400 epochs (DDPG 0.5) Mujoco joints angles Q MB C/ C
Florensa et al. (2017) goal position TRPO Goals to Explore from Mujoco: Key Insertion 0.55 (TRPO 0.01) Mujoco joints angles Ac MF C/ C
Edwards et al. (2018) goal state information DDQN Goals to Explore from Gridworld 0 (DDQN -1) Enumerated state id Q MF D/ D
Matheron et al. (2020) state storage method DDPG Goals to Explore from Maze: reach reward after 146k (TD3 never) x-y position Ac MB C/ C
Oh et al. (2018) A2C Goals to Explore from Atari: Montezuma Revenge 2500 (A2C 0) Atari images Ac MF D/ D
Guo et al. (2020) access to agent position Itself Goals to Explore from Atari: Pitfall 11,000 (PPO 0); Robot manipulation task: 40 (PPO 0) Atari images/ agent positions/ robotics joint angles Ac MF C/ C
Ecoffet et al. (2019) teleportation ability itself Goals to Explore from Atari: Montezuma Revenge 46000 (RND 11000) Atari images Ac MB D/ D
Ecoffet et al. (2020) access to agent position itself Goals to Explore from Atari: Montezuma Revenge 46000 (RND 11000) Atari images Ac MB D/ D
Hester and Stone (2013) Strategies set texplore-vanir Exploratory Goal Sensor Goal: -53 (greedy -54) Enumerated state id Ac MB D/ D
Machado et al. (2017) handcrafted features itself Exploratory Goal 4-room domain: 1 Enumerated state id Ac MB D/ D
Machado et al. (2018) itself Exploratory Goal 4-room domain: 1 Enumerated state id Ac MB D/ D
Abel et al. (2016) DQN Exploratory Goal Malmo: Visual Hill Climbing 170 (DQN+boosted 60) Image/ Vehicle positions Q MB C/ C
Forestier et al. (2017) randomly generated goals Itself Exploratory Goal Minecraft: mountain car 84% explored (ε-greedy 3%) State Id Ac MB C/ C
Colas et al. (2019) M-UVFA Exploratory Goal OpenAI: Goal Fetch Arm 0.8 (M-UVFA 0.78) Robot joints angles Ac MB C/ C
Péré et al. (2018) IMGEP Exploratory Goal Mujoco (KLC): ArmArrow 7.4 (IMGEP with handcrafted features 7.7) Mujoco joints angles Ac MF C/ C
Ghafoorian et al. (2013) Q-learning Exploratory Goal Taxi Driver: Found goal after 50 episodes (Q-learning after 200) State Id Q MF D/ D
Riedmiller et al. (2018) rewards for auxiliary tasks Itself Exploratory Goal Block stacking: 140 (DDPG 0) Robot joints angles Ac MB C/ C
Fang et al. (2020) tasks parameterization itself Exploratory Goal GridWorld: 1 (GoalGAN 0.6) Robot joints angles, images Ac MB C/ C
Table 3: Comparison of Goal-based approaches

3.4 Probabilistic Methods

In probabilistic approaches, the agent holds a probability distribution over states, actions, values, rewards, or their combination, and chooses the next action based on that distribution. Probabilistic methods can be split into optimistic and uncertainty methods Osband and Van Roy (2017). The main difference between them lies in what the probability models and how the agent utilises it, as shown in Fig. 9. In optimistic methods, the estimate must depend on the reward, either implicitly or explicitly, and the agent acts according to the upper bound of the estimate. In uncertainty-based methods, the estimate captures the uncertainty about the environment, such as the value function or state prediction, and the agent takes actions that minimise this uncertainty. Note that uncertainty methods can use the same estimates as optimistic methods but utilise them differently.

Figure 9: Overview of probabilistic methods. The agent uses uncertainty over the environment model to either behave optimistically (left) or follow the most uncertain solution (right). Both should lead to a reduction in the uncertainty of the agent.

3.4.1 Optimistic Methods

In optimistic approaches, the agent follows the optimism under uncertainty principle. In other words, the agent follows the upper confidence bound of the reward estimate. The use of a Gaussian process (GP) as a reward model was presented in Jung and Stone (2010); the GP readily provides the uncertainty needed for reward estimation. The linear Gaussian algorithm can also be used as a model of the reward Xie et al. (2016). Bootstrapped deep Q-networks (DQN) and Thompson sampling were utilised in D'Eramo et al. (2019). Bootstrapped DQNs naturally provide a distribution over rewards and values so that optimistic decisions can be taken.

It is also possible to hold a set of value functions and sample one to follow during exploration Osband et al. (2016b, 2019). The most optimistic value function is used by the agent for an episode; at the end of the episode, the distribution over the value functions is updated.
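The per-episode sampling of a value function can be sketched as follows, in the spirit of Osband et al. (2016b). Here a simple ensemble of callables stands in for an approximate posterior over Q-functions; this interface is an illustrative assumption:

```python
import random

def episode_policy(q_ensemble, rng=random):
    """Sample one Q-function from the ensemble and act greedily with
    respect to it for the whole episode."""
    q = rng.choice(q_ensemble)

    def policy(state, actions):
        return max(actions, key=lambda a: q(state, a))

    return policy

# A one-member "posterior" that believes "left" is best everywhere.
q_fn = lambda s, a: 1.0 if a == "left" else 0.0
policy = episode_policy([q_fn])
action = policy(None, ["left", "right"])  # "left"
```

Committing to one sampled value function for a whole episode gives temporally consistent (deep) exploration, unlike per-step noise.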


In optimistic approaches, the agent utilises the optimism under uncertainty principle. To apply this principle, the agent needs to be able to model the reward, either by modelling the reward directly or by approximating value functions. Value function approximation becomes advantageous as reward sparsity increases, since the agent can utilise the partial reward signal contained in the value functions for learning.

3.4.2 Uncertainty Methods

In uncertainty-based methods, the agent holds a probability distribution over actions and/or states, which represents the uncertainty about the environment. It then chooses the action that minimises this uncertainty. Here, three subcategories are distinguished: parameter uncertainty, policy and Q-value uncertainty (including information-theoretic approaches), and network ensembles.

Parameter uncertainty

In parameter uncertainty, the agent holds uncertainty over the parameters defining the policy. The agent samples from this distribution, follows the resulting policy for a certain time, and updates the parameters based on performance. One of the simplest approaches is to hold a distribution over the parameters of the network Tang and Agrawal (2018); here, the network parameters are sampled from the weight distribution. Colas et al. Colas et al. (2018) split exploration into two phases: (i) explore randomly and (ii) compare the collected experiences to an expert-created imitation to determine good behaviour.
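A minimal sketch of holding a distribution over policy weights, in the spirit of Tang and Agrawal (2018). The tiny linear policy, its dimensionality, and the standard deviation are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Factorised Gaussian over the weights of a small linear policy.
mu = np.zeros(3)
sigma = 0.5 * np.ones(3)

def sample_policy():
    """Sample one concrete weight vector; the agent would follow the
    resulting policy for a while before updating mu and sigma based on
    the observed performance."""
    w = rng.normal(mu, sigma)
    return lambda state: float(np.dot(w, state))

pi = sample_policy()
action = pi(np.array([1.0, 0.0, -1.0]))
```

Because the noise lives in parameter space, a single sample yields a coherent, deterministic policy for the whole rollout, rather than independent per-step action noise.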

In Janz et al. (2019), the successor state representation was utilised as a model of the environment. Exploration was performed by sampling parameters from a Bayesian linear regression model that predicts the successor representation.

Policy and Q-value uncertainty

In policy and Q-value uncertainty, the agent holds uncertainty over Q-values/actions and samples the appropriate action. Some of the simplest approaches rely on optimisation to determine the distribution parameters. For example, in Stulp (2012), the cross-entropy method (CEM) was used to control the variance of a Gaussian distribution from which actions were drawn. Alternatively, policies can be sampled Akiyama et al. (2010). In this study, a set of sampling policies drawn from a base policy was used; at the end of the episode, the best policy was chosen as the update to the base policy.

The most prevalent approach of this type is to use the Bayesian framework. In Strens (2000), the hypothesis is generated once and then followed for a certain number of steps, which saves computational time. This idea was further developed in Guez et al. (2012), where Bayesian sampling was combined with a tree-based state representation for further efficiency gains. To bring Bayesian uncertainty approaches to deep learning, O'Donoghue et al. (2018) derived a Bayesian uncertainty estimate that can be computed using the Bellman principle and the output of the neural network.

To minimize the uncertainty about policy and/or Q-values, information-theoretic approaches can be used. Agents choose actions that will result in maximal information gain, thus reducing uncertainty about the environment. An example of this approach, called information-directed sampling (IDS), is discussed in Nikolov et al. (2019). In IDS, the information gain function is expressed as a ratio between regret and how informative the action is.
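The IDS selection rule can be sketched compactly, in the spirit of Nikolov et al. (2019): choose the action minimising the ratio of squared expected regret to information gain. How the per-action estimates are obtained (e.g. from a bootstrapped DQN) is abstracted away here, and the epsilon guard is an illustrative detail:

```python
def ids_action(expected_regrets, info_gains, eps=1e-8):
    """Information-directed sampling: pick the action with the smallest
    squared-regret-to-information-gain ratio."""
    ratios = [r * r / (g + eps)
              for r, g in zip(expected_regrets, info_gains)]
    return ratios.index(min(ratios))

# Action 1 has higher expected regret but is far more informative,
# so IDS prefers it over the safer but uninformative action 0.
choice = ids_action([0.2, 0.5], [0.01, 0.5])  # 1
```

This makes the regret/information trade-off explicit: an action with some regret is still worth taking if it reduces uncertainty enough.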

Network ensembles

In the network ensemble method, the agent uses several models (initialised with different parameters) to approximate the distribution. Sampling one model from the ensemble to follow was discussed in Osband et al. (2016a). In this study, a DQN with multiple heads, each estimating a Q-value, was proposed; at each episode, one head is chosen at random and used.

It is difficult to determine model convergence by sampling one model at a time. Therefore, using multiple models to approximate the distribution over states was devised in Pearce et al. (2018). In this approach, the Q-values estimated by the different models were fitted to a Gaussian distribution. A similar approach was developed in Shyam et al. (2019), which uses the information gain among the environment models to decide where to go. Another ensemble model was presented in Henaff (2019), where exploration is achieved by finding a policy that results in the highest disagreement among the environment models.
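Disagreement among ensemble members can serve directly as an exploration bonus, in the spirit of Shyam et al. (2019) and Henaff (2019). The (state, action) to predicted-next-state interface is an illustrative assumption:

```python
import numpy as np

def disagreement_bonus(models, state, action):
    """Exploration bonus equal to the variance of next-state predictions
    across an ensemble of dynamics models: high where the models
    disagree, i.e. where the dynamics are poorly known."""
    preds = np.stack([np.asarray(m(state, action), dtype=float)
                      for m in models])
    return float(np.mean(np.var(preds, axis=0)))

agree_a = lambda s, a: [0.0, 0.0]
agree_b = lambda s, a: [0.0, 0.0]
differ = lambda s, a: [1.0, 1.0]
low = disagreement_bonus([agree_a, agree_b], None, None)   # 0.0
high = disagreement_bonus([agree_a, differ], None, None)   # 0.25
```

Where the ensemble agrees the bonus vanishes, so the agent is pushed toward transitions the models have not yet learned to predict consistently.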


In parameter sampling, the policy is parameterized (e.g., represented by a neural network) and a probability distribution over the parameters is maintained. The agent samples parameters and continues the update-exploitation cycle. In contrast, in policy and Q-value sampling methods, the probability distribution is held not over policy parameters but over actions and Q-values. The advantage over parameter sampling is faster updates, because the policy can be adjusted dynamically; the disadvantage is that estimating the exact probability is intractable, so simplifications must be made. Another option is to use network ensembles to approximate the distribution over actions/states. The agent can either sample from this distribution or choose one model to follow. While more computationally intensive, this approach can also be updated instantaneously.

3.4.3 Summary

A tabular summary of optimistic and uncertainty approaches is given in Table 4; the two families have been extensively compared in Osband and Van Roy (2017). That article concludes that the biggest issue for optimistic exploration is that the confidence sets are built independently of each other. Thus, an agent can have multiple states with high confidence, which results in unnecessary exploration as the agent visits states that do not lead to the reward; remedying this issue would be computationally intractable. In uncertainty methods, the confidence bounds are built dependently on each other and thus do not suffer from this problem.

Legend: A - action space, Ac - action, R - reference, MB - model based, MF - model free, D - discrete, C - continuous, Q - Q values, V - values, P - policy, O - output, S - state space, U - underlying algorithm and Top score on a key benchmark explanation - [benchmark]:[scenario] [score] ([baseline approach] [score]).

R Prior Knowledge U Method Top score on a key benchmark Input Types O MB/ MF A/ S
D'Eramo et al. (2019) bDQN, SARSA Optimistic Mujoco: acrobot -100 (Thompson -120) Mujoco joints angles Q MF C/ C
Osband et al. (2016b) LSVI Optimistic Tetris: 5000 (LSVI 4000) Hand tuned 22 features Ac MF D/ D
Jung and Stone (2010) Optimistic Mujoco: Inverted Pendulum 0 (SARSA -10) State Id Ac MB D/ C
Xie et al. (2016) MPC Optimistic Robotics hand simulation: complete each of 10 poses joints angles Ac MB C/ C
Osband et al. (2019) LSVI Optimistic Cartpole Swing up: 600 (DQN 0) State Id Ac MF D/ D
Nikolov et al. (2019) bDQN and C51 Uncertainty 55 atari games: 1058% of reference human performance Atari images Q MB D/ D
Colas et al. (2018) a set of goal policies DDPG Uncertainty Mujoco: Half Cheetah 6000 (DDPG 5445) Mujoco joints angles P MF C/ C
Osband et al. (2016a) DQN Uncertainty Atari: James Bond 1000 (DQN 600) Atari images Q MB D/ D
Tang and Agrawal (2018) DDPG Uncertainty Mujoco: sparse mountaincar 0.2 (NoisyNet 0) Mujoco joints angles Ac MF D/ C
Strens (2000) Dynamic Programming Uncertainty Maze: 1864 (QL SEMI-UNIFORM 1147) Enumerated state id Ac MB D/ D
Akiyama et al. (2010) initial policy guess LSPI Uncertainty Ball batting 2-DoF simulation: 67 (Passive learning 61) Robot angles Ac MB D/ C
Henaff (2019) DQN Uncertainty Maze: -4 (UE2 -14) Enumerated state id Q MB D/ D
Guez et al. (2012) guess of a prior Policy learning Uncertainty Dearden Maze: 965.2 (SBOSS 671.3) Enumerated state id Ac MB D/ D
Pearce et al. (2018) guess of a prior DQN Uncertainty Cart pole: 200 Enumerated state id Q MB C/ C
O’Donoghue et al. (2018) prior distribution DQN Uncertainty Atari: Montezuma Revenge 3000 (DQN 0) Atari images Q MB D/ D
Shyam et al. (2019) SAC Uncertainty Chain: 100% explored (bootstrapped-DQN 30%) Enumerated state id/ Mujoco joints angles Ac MB C/ C
Stulp (2012) Uncertainty Ball batting: learned after 20 steps Robot joints angles Ac MB C/ C
Janz et al. (2019) DQN Uncertainty 49 Atari games: 77.55% superhuman (Bootstrapped DQN 67.35%) Atari images Q MB D/ D
Table 4: Comparison of probabilistic approaches

3.5 Imitation-Based Methods

In imitation learning, exploration is ’kick-started’ with demonstrations from different sources (usually humans). This is similar to how humans learn, as we are initially guided in what to do by society and teachers. Thus, it is plausible to see imitation learning as a supplement to standard reinforcement learning. Note that demonstrations do not have to be perfect; they just need to be a good starting point. Imitation learning can be categorized into imitation in experience replay and imitation with an exploration strategy, as illustrated in Fig. 10.


Figure 10: Overview of imitation-based methods. In imitation-based methods, the agent receives demonstrations from an expert on how to behave. These are then used in two ways: (i) directly learning on the demonstrations or (ii) using the demonstrations as a starting point for other exploration techniques.

3.5.1 Imitations in Experience Replay Methods

One of the most common techniques is combining samples from demonstrations with samples collected by the agent in a single experience replay. This guarantees that demonstrations can be used throughout the learning process alongside newly collected experiences.

In Vecerik et al. (2017), the demonstrations were stored in a prioritised experience replay alongside the agent’s experience. The transitions from demonstrations have a higher probability of being selected. Deep Q-learning from demonstrations (DQfD) Hester et al. (2018) differs in two aspects from Vecerik et al. (2017). First, the agent was pre-trained on demonstrations only. Second, the ratio between samples from the agent’s run and from demonstrations was controlled by a parameter. A similar work with R2D2 was reported in Gulcehr et al. (2020). Storing demonstrations and agent experience in two separate replay buffers was presented in Nair et al. (2018); every time the agent samples a training batch, it draws a fixed amount from each buffer.
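The fixed-ratio sampling idea above (a parameter controlling the mix of demonstration and agent samples) can be sketched as follows. This is a hedged illustration; the function name and the `demo_ratio` default are assumptions, not values from the cited papers.

```python
import random

def sample_batch(agent_buffer, demo_buffer, batch_size, demo_ratio=0.25, rng=random):
    """Draw a fixed fraction of each training batch from the demonstration
    buffer and the rest from the agent's own experience."""
    n_demo = int(batch_size * demo_ratio)
    batch = [rng.choice(demo_buffer) for _ in range(n_demo)]
    batch += [rng.choice(agent_buffer) for _ in range(batch_size - n_demo)]
    rng.shuffle(batch)
    return batch
```

Annealing `demo_ratio` toward zero over training is one simple way to phase out the demonstrations once the agent's own experience suffices.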


Using one or two experience replays seems to have negligible impact on performance. However, storing everything in one experience replay is conceptually and implementation-wise simpler. Moreover, it allows the agent to stop using imitation experiences when they are no longer needed.

3.5.2 Imitation with Exploration Strategy Methods

Instead of using experience replays, imitations and exploration strategies can be combined directly. In such an approach, imitations are used as a ’kick-start’ for exploration.

A single demonstration was used as a starting point for exploration in Salimans and Chen (2018). The agent explores randomly from states chosen along the single demonstration run; trained this way from a mediocre demonstration, it can score highly in Montezuma’s Revenge. An auxiliary reward approach was proposed in Aytar et al. (2018). The architecture can combine several YouTube videos into a single embedding space for training, and the auxiliary reward is added for every frame reached from the demonstration video. An agent that can ask the demonstrator for help was proposed in Subramanian et al. (2016): if the agent detects an unknown environment, the human demonstrator is asked to show it how to navigate.
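The demonstration-reset idea of Salimans and Chen (2018) can be caricatured as a backward curriculum: start episodes near the end of the demonstration, and move the start point earlier as the agent learns to finish. This is a hedged sketch; the function names, the `threshold` value, and the success-rate criterion are assumptions made here.

```python
def demo_start_state(demo_states, start_idx):
    """Reset an episode to a state taken from the demonstration."""
    return demo_states[start_idx]

def update_start(start_idx, success_rate, threshold=0.2, step=1):
    """Once the agent reliably finishes from the current start point,
    move the start point one step earlier in the demonstration."""
    if success_rate >= threshold and start_idx > 0:
        return start_idx - step
    return start_idx
```

Because the agent only ever has to bridge a short gap to states it already masters, exploration stays local and tractable even in long, sparse-reward games.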


Using imitations as a starting point for exploration has shown impressive performance in difficult exploratory games. In particular, Aytar et al. (2018) and Salimans and Chen (2018) scored highly in Montezuma’s Revenge, an effect of overcoming the initial burden of exploration through demonstrations. The approach of Salimans and Chen (2018) can score highly with just a single demonstration, making it very sample efficient. Meanwhile, the approach of Aytar et al. (2018) can combine data from multiple sources, making it more suitable for problems with many demonstrations.

3.5.3 Summary

A comparison of the imitation methods is presented in Table 5. Imitations in experience replay allow the agent to seamlessly and continuously learn from demonstration experiences. Imitations with an exploration strategy, on the other hand, have the potential to find good novel strategies around existing ones and have shown a great capability to overcome the initial exploration difficulty; overall, they achieve better performance than imitations in experience replay alone.

Legend: A - action space, Ac - action, R - reference, MB - model based, MF - model free, D - discrete, C - continuous, Q - Q values, V - values, P - policy, O - output, S - state space, U - underlying algorithm. Top score on a key benchmark is given as [benchmark]: [scenario] [score] ([baseline approach] [score]).

R Prior Knowledge U Method Top score on a key benchmark Input Types O MB/ MF A/ S
Hester et al. (2018) imitation trained policy DQN Imitations in Experience Replay Atari: Pitfall 50.8 (Baseline 0) Atari Images Q MF D/ D
Vecerik et al. (2017) demonstrations DDPG Imitations in Experience Replay Peg insertion: 5 ( DDPG -15) Robot joints angles Ac MF C/ C
Nair et al. (2018) demonstrations DDPG Imitations in Experience Replay Brick stacking: Pick and Place 0.9 (Behaviour cloning 0.8) Robot joints angles Ac MF C/ C
Gulcehr et al. (2020) demonstrations R2D2 Imitations in Experience Replay Hard-eight: Drawbridge 12.5 (R2D2:0) Vizdoom Images Ac MF D/ D
Aytar et al. (2018) YouTube embedding IMPALA Imitation with Exploration Strategy Atari: Montezuma’s Revenge 80k (DQfD 4k) Atari Images Ac MF D/ D
Salimans and Chen (2018) single demonstration PPO Imitation with Exploration Strategy Atari: Montezuma’s Revenge with distraction 74500 (Playing by YouTube 41098) Atari images Ac MF D/ D
Table 5: Comparison of imitation-based approaches

3.6 Safe Exploration

Figure 11: Illustration of safe exploration methods. In safe exploration methods, attempts are made to prevent unsafe behaviours during exploration. Here, three techniques are highlighted: (i) human designer knowledge: the agent’s behaviours are limited by human-designed boundaries; (ii) prediction models: the agent learns unsafe behaviours and how to avoid them; and (iii) auxiliary rewards: agents are punished in dangerous states.

In safe exploration, the problem of preventing agents from unsafe behaviours is considered. This is an important aspect of exploration research, as the agent’s safety needs to be ensured. Safe exploration can be split into three categories: (i) human designer knowledge, (ii) prediction model, and (iii) auxiliary reward as illustrated in Fig. 11. For more details about safe exploration in reinforcement learning, the reader is invited to read Garcıa and Fernandez (2015).

3.6.1 Human Designer Knowledge Methods

Human-designated safety boundaries are used in human designer knowledge methods. Knowledge from the human designer can be split into baseline behaviours, direct human intervention and prediction models.

Baseline behaviours impose an impassable safety baseline. Garcia and Fernandez (2012) proposed the addition of a risk function (which determines unsafe states) and a baseline behaviour (which decides what to do in unsafe states). In Dalal et al. (2018), the agent was constrained by an additional pre-trained module that prevents unsafe actions, as shown in Fig. 12, while in Garcelon et al. (2020), agents are expected to perform no worse than an a priori known baseline. Classifying which objects are dangerous and how to avoid them before training the agent was proposed in Hunt et al. (2020). The agent learns to avoid certain objects rather than states; thus, the approach can generalise to new scenarios.

Figure 12: Overview of safe exploration in continuous action spaces Dalal et al. (2018). The additional model modifies the actions of the original policy.
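The action-correction idea behind Dalal et al. (2018) can be sketched for a single 1-D action and a linear constraint model. This is a heavily simplified, hypothetical version: the linear model, the function name `safe_action`, and its parameters are assumptions for illustration, not the paper's implementation.

```python
def safe_action(action, constraint_value, sensitivity, limit):
    """Predict the next constraint value with a linear model and minimally
    correct the action when it would breach the safety limit."""
    predicted = constraint_value + sensitivity * action
    if predicted <= limit:
        return action  # action already safe, pass it through unchanged
    # Solve constraint_value + sensitivity * a = limit for the corrected action.
    return (limit - constraint_value) / sensitivity
```

The key design choice is that the correction is minimal: the policy's action is altered only as much as needed to stay within the constraint, so learning is disturbed as little as possible.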

The human intervention approach was discussed in Saunders et al. (2018). During the initial phases of exploration, a human in the loop stops disasters. Then, a network trained with supervised learning on the data collected from the human is used as a replacement for the human.

In prediction model methods, a human-designed safety model determines whether the agent’s next action leads to an unsafe position and avoids it. In Turchetta et al. (2016), a rover traversing a terrain of different heights was considered. A Gaussian process model provides estimates of the height at a given location; if the height is lower than the safe-behaviour limit, the robot can explore safely. A heuristic safety model using a priori knowledge was proposed in Gao et al. (2019). To this end, they proposed an algorithm called action pruning, which uses the heuristics to prevent the agent from committing to unsafe actions.


In human designer knowledge methods, the barriers to unsafe behaviours are placed by a human designer. Baseline behaviours and human intervention methods guarantee certain performance, but only in pre-defined situations. Prediction model methods require a model of the environment, either in the form of a mathematical model Turchetta et al. (2016) or heuristic rules Gao et al. (2019). Prediction models have a higher chance of working in previously unseen environments and are more adaptable than baseline behaviours and human intervention methods.

3.6.2 Auxiliary Reward Methods

In auxiliary reward methods, the agent is punished for putting itself into dangerous situations. This approach requires the least human intervention, but it produces the weakest safety behaviours.

One method is to find states in which an episode terminates and discourage the agent from approaching them using an intrinsic fear penalty Lipton et al. (2016). The approach counts back a certain number of states from death and applies a distance-to-death penalty. Additionally, the authors built a simple environment in which the highest positive reward was placed next to a negative reward; the DQN eventually jumps to the negative reward. The authors state: ”We might critically ask, in what real-world scenario, we could depend upon a system [DQN] that cannot solve [these kinds of problems]”. A similar approach, but with more stochasticity, was later proposed in Fatemi et al. (2019).
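The counting-back idea of Lipton et al. (2016) can be sketched as reward shaping over a completed episode. This is a hedged illustration; the linear penalty ramp and all names (`fear_shaped_rewards`, `k`, `penalty`) are assumptions made here, not the paper's exact formulation.

```python
def fear_shaped_rewards(rewards, died, k=3, penalty=-1.0):
    """If the episode ended in a catastrophe, penalise the last k steps,
    scaling the penalty up with proximity to the terminal state."""
    shaped = list(rewards)
    if died:
        n = len(shaped)
        for i in range(max(0, n - k), n):
            closeness = (i - (n - k) + 1) / k   # ramps up to 1 at the death step
            shaped[i] += penalty * closeness
    return shaped
```

Ramping the penalty keeps the shaping local: states far from the catastrophe are untouched, so the agent is discouraged only in the danger zone rather than across the whole episode.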

Allowing the agent to autonomously learn undesirable states from previous experiences was discussed in Karimpanal et al. (2020). States and their advantage values are stored in a common buffer; frequently visited states with the lowest advantage then have additional negative rewards associated with them.


Auxiliary rewards can be an effective method of discouraging agents from unsafe behaviours. For example, in Lipton et al. (2016), half of the agent’s deaths were prevented. Moreover, some approaches, such as Karimpanal et al. (2020), can fully automatically determine undesirable states and avoid them. This, however, assumes that when the agent perishes it has a low score, which may not always be the case.

3.6.3 Summary

An overview of the safety approaches is shown in Table 6. Safety is a vital aspect of reinforcement learning for practical applications in many domains. There are three general approaches: human designer knowledge, prediction models, and auxiliary rewards. Human designer knowledge guarantees safe behaviour in predefined states, but the agent struggles to learn new safe behaviours. Auxiliary reward approaches can adjust to new scenarios, but they require training time and careful design of the negative reward.

Legend: A - action space, Ac - action, R - reference, MB - model based, MF - model free, D - discrete, C - continuous, Q - Q values, V - values, P - policy, O - output, S - state space, U - underlying algorithm. Top score on a key benchmark is given as [benchmark]: [scenario] [score] ([baseline approach] [score]).

R Prior Knowledge U Method Top score on a key benchmark Input Types O MB/ MF A/ S
Garcelon et al. (2020) Baseline safe policy Policy-based UCRL2 Human Designer Knowledge stochastic inventory control: never breaching the safety baseline amount of products in inventory P MB D/ D
Garcia and Fernandez (2012) baseline behaviour Human Designer Knowledge car parking problem 6.5 angles and positions of respective controllable vehicles Ac MB D/ D
Hunt et al. (2020) pretrained safety network PPO Human Designer Knowledge Point mass environment: 0 unsafe actions (PPO 3000) bird’s eye view of the problem Ac MB D/ D
Saunders et al. (2018) human intervention data DQN Human Designer Knowledge Atari: Space Invaders 0 catastrophes (DQN 800000) Atari images Q MB D/ D
Dalal et al. (2018) pretrained safety model DDPG Human Designer Knowledge spaceship: Arena 1000 (DDPG 300) x-y position Ac MB C/ C
Gao et al. (2019) environmental knowledge PPO Human Designer Knowledge Pommerman: 0.8 (Baseline 0) Agent, enemy agents and bombs positions Ac MB D/ D
Turchetta et al. (2016) Bayesian optimisation Human Designer Knowledge Simulated rover: 80% exploration (Random 0.98%) x-y position Ac MB C/ C
Fatemi et al. (2019) DQN Auxiliary Reward Bridge: optimal after 14k episodes (ten times faster than competitor) card types/ Atari images Q MB D/ D
Lipton et al. (2016) DQN Auxiliary Reward Atari: Asteroids total death 40,000 (DQN 80,000) Atari images Ac MB C/ C
Karimpanal et al. (2020) Q-learning and DDPG Auxiliary Reward Navigation environment: -3 (PQRL -3.5) enumerated state id Q MF C/ C
Table 6: Comparison of Safe approaches

3.7 Random-Based Methods

In random-based approaches, improvements to simple random exploration are discussed. Random exploration tends to be inefficient as it often revisits the same states. To solve this problem, the following approaches are considered: (i) reduced states/actions for exploration methods, (ii) exploration parameters methods, and (iii) network parameter noise methods, as illustrated in Fig. 13.

Figure 13: Overview of random-based methods. In random-based methods, simple random exploration is modified for improved efficiency. In reduced states/actions for exploration, the number of actions to be taken randomly is reduced. In exploration parameters methods, the amount of exploration is decided automatically. In network parameter noise, noise is imposed on the policy parameters.

3.7.1 Exploration Parameters Methods

In this section, exploration itself is parameterized (for example, ε in ε-greedy). The parameters are then modified according to the agent’s learning progress.

One technique to adapt the exploration rate is to consider the reward and adjust the random exploration parameter accordingly, as described in Patrascu and Stacey (1999). Relying on the pure reward can cause problems with sparse rewards. To address this, in Tokic (2010), ε was made to depend on the error of the value-function estimates instead of the reward. It is also possible to determine the amount of random exploration using the entropy of the environmental model, as discussed in Usama and Chang (2019). The learning rate can also depend on exploration Shani et al. (2019): a parameter that is functionally equivalent to the learning rate is introduced, and if the agent is exploring a lot, this parameter slows down learning to account for uncertainty. Khamassi et al. (2017) used long-term and short-term reward averages to control exploration and exploitation: when the short-term average falls below the long-term average, exploration is increased.
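The short-term/long-term average rule of Khamassi et al. (2017) can be sketched as a simple update to an ε-greedy rate. This is a hedged illustration; the step size, bounds, and function name are assumptions made here rather than the paper's values.

```python
def adapt_epsilon(epsilon, short_avg, long_avg, step=0.05, low=0.05, high=1.0):
    """Increase random exploration when recent rewards fall below the
    long-term trend; decrease it when the agent is doing well lately."""
    if short_avg < long_avg:
        return min(high, epsilon + step)   # doing worse lately: explore more
    return max(low, epsilon - step)        # doing well: exploit more
```

The bounds `low` and `high` keep the agent from ever becoming fully greedy or fully random, which is a common practical safeguard.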

Chang (2004) used multiple agents (ants) to adjust the exploration parameters. At each step, the ants choose their actions randomly, but the choices are skewed by pheromone values left by other ants.

Another approach of this type reduces the states available for exploration based on a predefined metric. An approach using adaptive resonance theory (ART) Grossberg (1987) was presented in Teng and Tan (2012) and later extended in Wang et al. (2018). In ART, knowledge about actions can be split into: (i) positive chunks, which lead to positive rewards, (ii) negative chunks, which lead to negative results, and (iii) empty chunks, for actions not yet taken. The action is randomly chosen from the positive and empty chunks; thus, the agent explores either new actions or ones with positive reward. Wang et al. (2018) extended this to include a probability of selecting the remaining actions based on how well they are known.
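The chunk-based selection described above amounts to masking out known-bad actions before sampling. A hedged sketch in the spirit of Teng and Tan (2012), with hypothetical names and a fallback rule assumed here:

```python
import random

def choose_action(actions, negative, rng=random):
    """Explore only among actions that are positive (known good) or empty
    (not yet tried), i.e. skip actions known to lead to negative results."""
    candidates = [a for a in actions if a not in negative]
    # Fall back to all actions if every action is known to be bad.
    return rng.choice(candidates or actions)
```

The fallback matters: without it, a state where every action has failed before would leave the agent with nothing to sample.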


Different parameters can be changed based on learning progress. Early approaches used learning progress, reward, or the value of states to determine the rate of exploration; the challenge with these approaches is determining the parameters controlling the exploration. It is also possible to adjust the learning rate based on exploration Shani et al. (2019); the advantage is that the agent avoids learning uncertain information, but it slows down training. Finally, reducing the states for exploration can make exploration more sample efficient, but it struggles to account for unseen states that occur after the eliminated states.

3.7.2 Random Noise

In random noise approaches, random noise is used for exploration. The random noise can either be imposed on network parameters or be produced based on the states encountered during exploration.

The simplest method is to include a fixed amount of noise Rückstieß et al. (2010); this paper reviews the usage of small perturbations in the parameter space. In Shibata and Sakashita (2015), chaotic networks were used to induce the noise in the network. It is also possible to adjust the noise strength using backpropagation, as described in Fortunato et al. (2018), where the noise is created by a constant noise source multiplied by a gradient-adaptable parameter. Another way of using the noise is to compare the decisions made by a noisy and a noiseless policy Plappert et al. (2018); exploration is imposed if the decisions are sufficiently different.
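The "constant noise source multiplied by a gradient-adaptable parameter" of Fortunato et al. (2018) can be caricatured with a single noisy parameter. This is a hedged sketch: the class name is hypothetical, and real NoisyNets apply this per weight of a linear layer and train both mu and sigma by backpropagation, which is omitted here.

```python
import random

class NoisyWeight:
    """A single noisy parameter: w = mu + sigma * eps, with eps resampled
    per use. In NoisyNets, mu and sigma are learned by backpropagation."""
    def __init__(self, mu, sigma, rng=None):
        self.mu, self.sigma = mu, sigma
        self.rng = rng or random.Random()

    def sample(self):
        eps = self.rng.gauss(0.0, 1.0)
        return self.mu + self.sigma * eps
```

Because sigma is learnable, the network can shrink its own exploration noise as it becomes confident, removing the need for a hand-tuned ε schedule.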

In Rückstieß et al. (2008), the problem of assigning rewards when the same state is visited multiple times is discussed. In such a problem, the agent is likely to take different actions for the same state, making credit assignment difficult. To solve this, a random action generation function dependent on the input state was developed: if the state is the same, the random action is the same.
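The state-dependent action function above can be sketched by seeding a random generator from the state itself. This is a hedged illustration of the idea in Rückstieß et al. (2008); the hashing scheme and the `episode_seed` parameter are assumptions made here.

```python
import random

def state_consistent_action(state, n_actions, episode_seed=0):
    """Derive the 'random' action deterministically from the state, so the
    same state always yields the same exploratory action."""
    rng = random.Random(hash((state, episode_seed)))
    return rng.randrange(n_actions)
```

Varying `episode_seed` between episodes restores diversity across episodes while keeping actions consistent within one, which is what makes credit assignment tractable.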


Network parameter noise was first developed for evolutionary approaches, such as Rückstieß et al. (2010). Recently, parameter noise has also been used in policy-based methods. In particular, Fortunato et al. (2018) achieved a 50% improvement averaged over 52 Atari games.

3.7.3 Summary

A comparison of the random-based approaches is presented in Table 7. The key advantage of reduced states for exploration methods is that exploration can be very effective, but the agent needs to hold a memory of where it has been. Exploration parameter methods solve the trade-off between exploration and exploitation well; however, the agent can still get stuck exploring unnecessary states. The random noise approaches are very simple to implement and show promising results, but they rely on careful tuning of parameters by designers.

Legend: A - action space, Ac - action, R - reference, MB - model based, MF - model free, D - discrete, C - continuous, Q - Q values, V - values, P - policy, O - output, S - state space, U - underlying algorithm. Top score on a key benchmark is given as [benchmark]: [scenario] [score] ([baseline approach] [score]).

R Prior Knowledge U Method Top score on a key benchmark Input Types O MB/ MF A/ S
Wang et al. (2018) ART Exploration parameters minefield navigation (successful rate): 91% (Baseline 91%) Vehicles positions Q MB C/ C
Shani et al. (2019) DDQN and DDPG Exploration parameters Atari: Frostbite 2686 (DDPG 1720); Mujoco: HalfCheetah 4579 (DDPG 2255) Atari images, Mujoco joints angles Ac/ Q MF C/ C
Patrascu and Stacey (1999) Fuzzy ART MAP architecture Exploration parameters Changing world environment (grid with two alternating paths to reward) 0.9 Enumerated state id Ac MB D/ D
Usama and Chang (2019) DQN Exploration parameters VizDoom: Defend the centre 12.2 (ε-greedy 11.8) Images Q MB C/ C
Tokic (2010) V-learning Exploration parameters Multi-arm bandit: 1.42 (Softmax 1.38) Enumerated state id V MF D/ D
Khamassi et al. (2017) Q-learning Exploration parameters Nao simulator: Engagement 10 (Kalman-QL 5) Robot joints angles Q MF D/ D
Shibata and Sakashita (2015) Actor-critic Random noise area with randomly positioned obstacle: 0.6 out of 1 Enumerated state id Ac MF D/ D
Plappert et al. (2018) measure of policies distance DQN, DDPG and TRPO Random noise Atari: BeamRider 9000 (ε-greedy 5000); Mujoco: Half Cheetah 5000 (ε-greedy 1500) Atari images/Mujoco joints angles Ac/ Q MF C/ C
Shibata and Sakashita (2015) Random noise Multi Arm Bandit Problem: 1 (Optimal) Stateless Ac MB C/ C
Fortunato et al. (2018) Random noise Atari: 57 games 633 points (Dueling DQN 524) Atari images Ac MF C/ C
Table 7: Comparison of Random based approaches

4 Future Challenges

In this section, we discuss the following future challenges on exploration in reinforcement learning: evaluation, scalability, exploration-exploitation dilemma, intrinsic reward, noisy TV problems, safety, and transferability.


Currently, evaluating and comparing different exploration algorithms is challenging. This issue arises for three reasons: the lack of a common benchmark, the lack of a common evaluation strategy, and the lack of good metrics to measure exploration.

Currently, four major benchmarks used by the community are VizDoom Kempka et al. (2016), Minecraft Johnson et al. (2016), Atari Games Badia et al. (2020a) and Mujoco Todorov et al. (2012). Each benchmark is characterised by different complexities in terms of state space, reward sparseness, and action space. Moreover, each benchmark offers several scenarios with various degrees of complexity. Such a wealth of benchmarks is desirable for exposing agents to various complexities; however, the difference in complexity between benchmarks is not well understood. This leads to difficulty in comparing algorithms evaluated on different benchmarks. There have been attempts to solve the evaluation issue using a common benchmark, for example, in Osband et al. (2020). However, this benchmark is not commonly adopted yet.

Regarding the evaluation strategy, most algorithms report a reward after a certain number of steps (in the context of this paragraph, steps can also mean episodes, iterations or epochs). This makes the reporting of results inconsistent in two aspects: (i) the number of steps over which the algorithm was tested and (ii) how the reward is reported. The first makes comparisons between algorithms difficult because performance can vary widely depending on when the comparison is made. Regarding the second, most authors report the average reward the agent scored, but some instead compare with average human performance (without a clear indication of what average human performance means exactly). Moreover, the distinction between the average and the maximum reward is sometimes not clearly made.

Finally, it is arguable whether reward is an appropriate measure for evaluation Stadie et al. (2015). One key issue is that it fails to account for the speed of learning, which should be higher if exploration is more efficient Stadie et al. (2015). A metric addressing this was proposed in Stadie et al. (2015), but as of the time of writing this review, it is not widely adopted. Another issue is that reward provides no information about the quality of the exploratory behaviour itself. This is even more difficult in continuous action space problems, where computing novelty is considerably more challenging.


Exploration in reinforcement learning does not scale well to real-world problems. This is caused by two limitations: training time and inefficient state representation. Currently, even the fastest training requires millions of samples in complex environments. Note that even the most complex environments currently used in reinforcement learning are still relatively simple compared to the real world, where collecting millions of samples for training is unrealistic owing to the wear and tear of physical devices. To cope with the real world, either the sim-to-real gap needs to be reduced or exploration needs to become more sample efficient.

Another limitation is efficient state representation, so that memorising states and actions is possible in large domains. For example, Go-Explore Ecoffet et al. (2019) does not scale well if the environment is large. This problem was discussed in Jaegle et al. (2019) by comparing how the brain stores memories and computes novelty. The human brain is much faster at determining scene novelty and has a much larger capacity. To achieve this, the brain uses agreement between multiple neurons: the more neurons indicate that a given image is novel, the higher the novelty. Thus, the brain does not need to remember full states; instead, it trains itself to recognise novelty. This representation efficiency is currently unmatched in reinforcement learning.

Exploration-exploitation dilemma

The exploration-exploitation dilemma is an ongoing research topic, not only in reinforcement learning but also as a general problem. Most current exploration approaches have a built-in solution to exploration-exploitation, but not all methods do. This is particularly true of goal-based methods, which rely on hand-designed solutions. Moreover, even in approaches that solve it automatically, the balance is still mostly decided by a designer-provided threshold. One potential way of solving this problem is to train a set of skills (policies) during exploration and combine the skills into greater goal-oriented policies OpenAI (2021). This is similar to how humans solve problems: by learning smaller skills and later exploiting them as part of a larger policy.

Intrinsic reward

Reward novel states and reward diverse behaviour approaches can be improved in two ways: (i) the agent should be freer to reward itself, and (ii) a better balance between long-term and short-term novelty should be achieved.

In most intrinsic reward approaches, the exact reward formulation is designed by an expert. Designing a reward that guarantees good exploration is a challenging and time-consuming task. Moreover, there might be ways of rewarding agents that their designers have not conceived. Thus, it could be beneficial if the agent were not only trained in the environment but also trained in how to reward itself. This would be closer to human behaviour, where the self-rewarding mechanism was developed through evolution.

Balancing long-term and short-term novelty is another challenge. Here, the agent tries to balance two factors: revisiting known states in case they lead to something new, or abandoning them quickly in search of novelty elsewhere. This balance is currently a hand-designed parameter, and its tuning is time-consuming. Recently, a fix was proposed in Badia et al. (2020a), where meta-learning decides the appropriate balance, at the cost of additional computational complexity for training.

Noisy-TV problem

The noisy-TV (or couch potato) problem remains largely unsolved. While memory-based methods can mitigate it, they are limited by memory requirements. Thus, if the noisy sequence is very long and the state space is complex, memory approaches will also struggle. One method that has shown some promise is clustering Kovač et al. (2020): noisy states are grouped into a cluster, which the agent then avoids. However, this requires the design of correct clusters.

Optimal exploration

One area that is rarely considered in current exploration research is how to explore optimally. Under optimal exploration, the agent does not revisit states unnecessarily and explores the most promising areas first. This problem and a proposed solution are described in detail in Zhang et al. (2019). The solution uses a demand matrix, a states-by-actions matrix indicating how many times each state-action pair needs to be explored. It then defines the exploration cost of an exploration policy as the number of steps needed to meet these demands. Note that the demand matrix does not need to be known a priori and can be updated online. This aspect needs further development.
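The demand-matrix bookkeeping can be sketched as follows. This is a hedged, simplified reading of the idea in Zhang et al. (2019): the function name and the "outstanding visits" cost are assumptions made here, not the paper's exact definition.

```python
def exploration_cost(demand, counts):
    """Given required visit counts per (state, action) and visits so far,
    return the number of exploratory visits still outstanding."""
    return sum(
        max(0, demand[s][a] - counts[s][a])
        for s in range(len(demand))
        for a in range(len(demand[s]))
    )
```

An exploration policy that drives this cost to zero in the fewest environment steps would, under this reading, be exploring optimally.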

Safe exploration

Safe exploration is of paramount importance for real-world applications. However, so far, there have been very few approaches that cope with this issue, and most rely heavily on hand-designed rules to prevent catastrophes. Moreover, it has been shown in Lipton et al. (2016) that current reinforcement learning struggles to prevent catastrophes even with carefully engineered rewards. Thus, the agent needs to recognise unsafe situations and act accordingly. Moreover, what constitutes an unsafe situation is not well defined beyond hand-designed rules, which limits the scalability and transferability of safe exploration in reinforcement learning. A more rigorous definition of an unsafe situation would help address this problem.


Most exploratory approaches are currently limited to the domain on which they were trained. When faced with new environments (e.g., increased state space or different reward functions), exploration strategies do not seem to perform well Dhiman et al. (2018); Raileanu and Rocktäschel (2020). Coping with this issue would help in two scenarios. First, teaching the agent behaviours in smaller scenarios and then letting it perform well in larger ones would alleviate computational issues. Second, in some domains, defining state spaces suitable for exploration is challenging, and their size may vary significantly between tasks (e.g., searching for the victim of an accident).

5 Conclusions

This paper presents a review of the exploration in reinforcement learning. The following methods were discussed: reward novel states, reward diverse behaviours, goal-based methods, uncertainty, imitation-based methods, safe exploration, and random methods.

In reward novel states methods, the agent is given a reward for discovering a novel or surprising state. This reward can be computed using prediction error, counts, or memory. In prediction-error methods, the reward is based on the prediction error of the agent's internal model of the environment. In count-based methods, the reward is based on how often a given state has been visited. In memory-based methods, the reward is computed from how different a state is from the states stored in a buffer.
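As an illustration, a count-based bonus can be sketched in a few lines; the 1/√N form is the classic choice, and the `beta` scale is an assumed hyperparameter:

```python
import math
from collections import defaultdict

# Count-based novelty bonus (sketch): the reward for visiting a state decays
# with the square root of its visit count, so rarely seen states pay more.
visit_counts = defaultdict(int)

def novelty_bonus(state, beta=1.0):
    visit_counts[state] += 1
    return beta / math.sqrt(visit_counts[state])

r1 = novelty_bonus("s0")  # first visit -> 1.0
r2 = novelty_bonus("s0")  # second visit -> 1/sqrt(2)
```

In practice this intrinsic bonus is added to the environment reward; for large state spaces, the table would be replaced by density models or pseudo-counts.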

In reward diverse behaviours methods, the agent is rewarded for discovering as many diverse behaviours as possible. Note that we use the word behaviour loosely here, to mean a sequence of actions or a policy. Reward diverse behaviours methods can be divided into evolution strategies and policy learning. In evolution strategies, diversity among a population of agents is encouraged. In policy learning, diversity of policy parameters is encouraged.
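The population-diversity idea can be sketched with a novelty score in the spirit of novelty search (Lehman and Stanley, 2011); the behaviour descriptors and archive below are illustrative assumptions:

```python
import numpy as np

# Novelty score (sketch): each agent's behaviour descriptor is scored by its
# mean distance to the k nearest descriptors in an archive, so behaviours far
# from what the population has already produced earn a larger bonus.
def novelty_score(descriptor, archive, k=3):
    dists = sorted(float(np.linalg.norm(descriptor - a)) for a in archive)
    return sum(dists[:k]) / min(k, len(dists))

archive = [np.zeros(2), np.ones(2)]
rare = novelty_score(np.array([5.0, 5.0]), archive, k=2)
common = novelty_score(np.zeros(2), archive, k=2)
assert rare > common  # the unusual behaviour gets the larger bonus
```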

In goal-based methods, the agent either explores from a goal or explores while trying to reach one. In the first approach, the agent chooses a goal, reaches it, and then explores from there. This results in very efficient exploration, as the agent predominantly visits unknown areas. In the second approach, called exploratory goals, the agent explores while travelling toward a goal. The key idea of this approach is to provide goals that are well suited to exploration.
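A minimal sketch of the first approach, in the style of Go-Explore: return to a promising (here, least-visited) known state and explore from there. The selection rule and state names are illustrative:

```python
from collections import Counter

# Goal selection (sketch): among known states, pick the least-visited one as
# the next goal; the agent would return to it and then explore from it.
def select_goal(visit_counts):
    return min(visit_counts, key=visit_counts.get)

counts = Counter({"start": 10, "corridor": 4, "treasure_room": 1})
print(select_goal(counts))  # treasure_room
```

Real systems replace raw states with downsampled cells or learned representations and use a trained policy (or a simulator reset) to return to the chosen goal.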

In probabilistic methods, the agent maintains an uncertainty model of the environment and uses it to choose its next move. These methods have two subcategories: optimistic methods and uncertainty methods. In optimistic methods, the agent follows the optimism-under-uncertainty principle, acting on the most optimistic estimate of the reward. In uncertainty methods, the agent samples from its internal uncertainty to move toward the least-known areas.
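Optimism under uncertainty can be sketched with a UCB-style action rule: act greedily with respect to the value estimate plus an uncertainty bonus. The bonus form is the standard UCB1 choice, used here only for illustration:

```python
import math

# UCB-style optimistic action selection (sketch): rarely tried actions get a
# large bonus, so the agent acts on its most optimistic reward estimate.
def ucb_action(q_values, counts, t, c=2.0):
    scores = []
    for q, n in zip(q_values, counts):
        if n == 0:
            scores.append(float("inf"))  # untried actions are maximally optimistic
        else:
            scores.append(q + c * math.sqrt(math.log(t) / n))
    return scores.index(max(scores))

# A rarely tried action wins despite its lower value estimate.
print(ucb_action([1.0, 0.5], [10, 1], t=11))  # 1
```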

Imitation-based methods rely on demonstrations to help exploration. In general, there are two approaches: combining demonstrations with experience replay, and combining them with an exploration strategy. In the first approach, samples from demonstrations and samples collected by the agent are combined into one buffer for the agent to learn from. In the second, the demonstrations serve as a starting point for other exploration techniques, such as rewarding novel states.
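The first approach can be sketched as a replay buffer that mixes the two sources at a fixed ratio; the class and the 25% demonstration share are assumptions for illustration:

```python
import random

# Mixed replay buffer (sketch): each sampled batch draws a fixed fraction
# from demonstrations and the rest from the agent's own transitions.
class MixedReplayBuffer:
    def __init__(self, demos, demo_ratio=0.25):
        self.demos = list(demos)
        self.agent = []
        self.demo_ratio = demo_ratio

    def add(self, transition):
        self.agent.append(transition)

    def sample(self, batch_size, rng=random):
        n_demo = int(batch_size * self.demo_ratio) if self.agent else batch_size
        batch = rng.choices(self.demos, k=n_demo)
        if self.agent:
            batch += rng.choices(self.agent, k=batch_size - n_demo)
        return batch
```

Methods such as DQfD additionally keep a margin loss on demonstration samples and anneal the demonstration share over training; those refinements are omitted here.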

Safe exploration methods were devised to ensure the safe behaviour of the agent during exploration. The most prevalent approach is to use human designer knowledge to define boundaries for the agent. It is also possible to train a model that predicts disastrous moves and stops the agent from making them. Finally, the agent can be discouraged from visiting dangerous states with a negative reward.
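Both the blocking and the penalty ideas can be sketched in a few lines; the unsafe set, the `noop` fallback, and the penalty value are illustrative assumptions:

```python
# Safety shield + reward penalty (sketch): veto actions whose predicted next
# state is unsafe, and shape the reward to discourage unsafe states.
UNSAFE = {"cliff_edge", "lava"}

def shielded_action(state, proposed_action, transition):
    """Replace an action with a safe fallback if it would lead somewhere unsafe."""
    next_state = transition(state, proposed_action)
    return proposed_action if next_state not in UNSAFE else "noop"

def shaped_reward(reward, next_state, penalty=-10.0):
    """Add a negative reward whenever an unsafe state is reached anyway."""
    return reward + (penalty if next_state in UNSAFE else 0.0)
```

In practice, `transition` would be a learned predictive model rather than a known function, which is exactly where the prediction errors discussed above become a safety risk.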

Random exploration methods improve upon standard random exploration. The improvements include modifying the states available for exploration, modifying exploration parameters, and adding noise to network parameters. In the first approach, states and actions are removed from the random choice once they have been sufficiently explored. In the second, the parameters governing when to explore randomly are chosen automatically based on the agent's learning progress. Lastly, in the network parameter noise approach, random noise is applied to the network parameters to induce exploration before the weights converge.
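A minimal sketch of the second improvement, an epsilon-greedy schedule driven by learning progress; the mapping from recent TD-error magnitude to epsilon is an illustrative assumption:

```python
import random

# Adaptive epsilon (sketch): explore more while TD errors (a proxy for
# learning progress) are large, and fall back to eps_min as they shrink.
def adaptive_epsilon(recent_td_errors, eps_min=0.05, eps_max=1.0):
    if not recent_td_errors:
        return eps_max
    progress = sum(abs(e) for e in recent_td_errors) / len(recent_td_errors)
    return max(eps_min, min(eps_max, progress))

def epsilon_greedy(q_values, epsilon, rng=random):
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return q_values.index(max(q_values))
```

Parameter-noise approaches such as NoisyNet instead perturb the network weights themselves, so the perturbation is state-dependent and learned rather than scheduled.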

Finally, the best approaches in terms of ease of implementation, computational cost, and overall performance are highlighted. The easiest methods to implement are reward novel states, reward diverse behaviours, and random-based approaches. Basic implementations of these approaches can be combined with almost any existing reinforcement learning algorithm, though they may require a few additions and some tuning to work. In terms of computational efficiency, random-based, reward novel states, and reward diverse behaviours methods generally require the least resources. In particular, random-based approaches are computationally efficient as their additional components are lightweight. Currently, the best-performing methods are goal-based and reward novel states methods, with goal-based methods achieving high scores on difficult exploration problems such as Montezuma's Revenge. However, goal-based methods tend to be the most complex to implement. Overall, reward novel states methods seem like a good compromise between ease of implementation and performance.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


  • D. Abel, A. Agarwal, F. Diaz, A. Krishnamurthy, and R. E. Schapire (2016) Exploratory Gradient Boosting for Reinforcement Learning in Complex Domains. External Links: 1603.04119, Link Cited by: §3.3.2, Table 3.
  • J. Achiam and S. Sastry (2017) Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning. pp. 1–13. External Links: 1703.01732, Link Cited by: §3.1.1, Table 1.
  • T. Akiyama, H. Hachiya, and M. Sugiyama (2010) Efficient exploration through active learning for value function approximation in reinforcement learning. Neural Networks 23 (5), pp. 639–648. External Links: Document, ISSN 08936080, Link Cited by: §3.4.2, Table 4.
  • K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath (2017) Deep Reinforcement Learning: A Brief Survey. IEEE Signal Processing Magazine 34 (6), pp. 26–38. External Links: Document, arXiv:1708.05866v2, ISSN 10535888 Cited by: §2.1.3.
  • A. Aubret, L. Matignon, and S. Hassas (2019) A survey on Intrinsic Motivation in Reinforcement Learning. arXiv (Im). External Links: 1908.06976 Cited by: §1, §3.1, §3.
  • Y. Aytar, T. Pfaff, D. Budden, T. Le Paine, Z. Wang, and N. De Freitas (2018) Playing Hard Exploration Games by Watching Youtube. Conference on Neural Information Processing Systems, NeurIPS 2018, pp. 2930–2941. External Links: 1805.11592, ISSN 10495258 Cited by: §2.4.1, §3.5.2, §3.5.2, Table 5.
  • A. P. Badia, B. Piot, S. Kapturowski, P. Sprechmann, A. Vitvitskyi, D. Guo, and C. Blundell (2020a) Agent57: Outperforming the Atari Human Benchmark. External Links: 2003.13350, Link Cited by: §1, §3.1.3, Table 1, §4, §4.
  • A. P. Badia, P. Sprechmann, A. Vitvitskyi, D. Guo, B. Piot, S. Kapturowski, O. Tieleman, M. Arjovsky, A. Pritzel, A. Bolt, and C. Blundell (2020b) Never Give Up: Learning Directed Exploration Strategies. 8th International Conference on Learning Representations, ICLR 2020. External Links: 2002.06038, Link Cited by: §3.1.3, §3.1.4, Table 1.
  • M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling (2013) The Arcade Learning Environment: An Evaluation Platform For General Agents. Journal of Artificial Intelligence Research 47, pp. 253–279. External Links: Document, 1207.4708, ISSN 10769757 Cited by: §2.4.1.
  • M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos (2016) Unifying Count-Based Exploration and Intrinsic Motivation. Conference on Neural Information Processing Systems, NeurIPS 2016. External Links: Document, 1606.01868, ISBN 978-84-85395-69-9, ISSN 10495258, Link Cited by: §1, §3.1.2, §3.1.4.
  • N. Bougie and R. Ichise (2020a) Fast And Slow Curiosity for High-Level Exploration in Reinforcement Learning. Applied Intelligence. External Links: Document, ISSN 15737497 Cited by: §3.1.1, Table 1.
  • N. Bougie and R. Ichise (2020b) Towards High-Level Intrinsic Exploration in Reinforcement Learning. International Joint Conference on Artificial Intelligence (IJCAI-20). External Links: 1810.12894 Cited by: §3.1.1.
  • Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros (2019) Large-Scale Study of Curiosity-Driven Learning. 7th International Conference on Learning Representations, ICLR 2019. External Links: 1808.04355, Link Cited by: §3.1.1.
  • Y. Burda, H. Edwards, A. Storkey, and O. Klimov (2018) Exploration by Random Network Distillation. 7th International Conference on Learning Representations, ICLR 2019. External Links: Document, 1810.12894, Link Cited by: §1, §2.3.1, §3.1.1, §3.1.3, §3.1.4, Table 1.
  • H. S. Chang (2004) An Ant System Based Exploration-Exploitation for Reinforcement Learning. Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics 4, pp. 3805–3810. External Links: Document, ISBN 0780385667, ISSN 1062922X Cited by: §3.7.1.
  • M. S. Charikar (2002) Similarity Estimation Techniques From Rounding Algorithms. Conference Proceedings of the Annual ACM Symposium on Theory of Computing, pp. 380–388. External Links: Document, ISBN 1581134959, ISSN 07349025 Cited by: §3.1.2.
  • J. Chien and P. Hsu (2020) Stochastic Curiosity Maximizing Exploration. 2020 International Joint Conference on Neural Networks (IJCNN). External Links: Document, ISBN 9781728169262 Cited by: §3.1.1, Table 1.
  • J. Choi, Y. Guo, M. Moczulski, J. Oh, N. Wu, M. Norouzi, and H. Lee (2019) Contingency-aware Exploration in Reinforcement Learning. 7th International Conference on Learning Representations, ICLR 2019, pp. 1–19. External Links: 1811.01483 Cited by: §3.1.2.
  • J. Clark and D. Amodei (2016) Faulty reward functions in the wild. Note: https://openai.com/blog/faulty-reward-functions/ Cited by: §1.
  • A. Cohen, L. Yu, X. Qiao, and X. Tong (2019) Maximum Entropy Diverse Exploration: Disentangling Maximum Entropy Reinforcement Learning. External Links: 1911.00828, Link Cited by: §3.2.2, Table 2.
  • C. Colas, P. Fournier, O. Sigaud, M. Chetouani, and P. Y. Oudeyer (2019) CURIOUS: Intrinsically Motivated Modular Multi-goal Reinforcement Learning. 36th International Conference on Machine Learning, ICML 2019, pp. 2372–2387. External Links: 1810.06284, ISBN 9781510886988 Cited by: §3.3.2, §3.3.3, Table 3.
  • C. Colas, O. Sigaud, and P. Y. Oudeyer (2018) GEP-PG: Decoupling Exploration and Exploitation in Deep Reinforcement Learning Algorithms. Proceedings of the 35th International Conference on Machine Learning, ICML 2018. Cited by: §3.4.2, Table 4.
  • E. Conti, V. Madhavan, F. P. Such, J. Lehman, K. O. Stanley, and J. Clune (2018) Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. Conference on Neural Information Processing Systems, NeurIPS 2018, pp. 5027–5038. External Links: 1712.06560, ISSN 10495258 Cited by: §3.2.1, §3.2.1, Table 2.
  • C. D’Eramo, A. Cini, and M. Restelli (2019) Exploiting action-value uncertainty to drive exploration in reinforcement learning. Proceedings of the International Joint Conference on Neural Networks, IJCNN 2019. External Links: Document, ISBN 9781728119854 Cited by: §3.4.1, Table 4.
  • G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa (2018) Safe Exploration in Continuous Action Spaces. External Links: 1801.08757, Link Cited by: Figure 12, §3.6.1, Table 6.
  • I. M. De Abril and R. Kanai (2018) Curiosity-Driven Reinforcement Learning with Homeostatic Regulation. Proceedings of the International Joint Conference on Neural Networks 2018-July (1). External Links: Document, 1801.07440, ISBN 9781509060146 Cited by: §3.1.1.
  • V. Dean, S. Tulsiani, and A. Gupta (2020) See, Hear, Explore: Curiosity via Audio-Visual Association. External Links: 2007.03669, Link Cited by: §3.1.1.
  • V. Dhiman, S. Banerjee, B. Griffin, J. M. Siskind, and J. J. Corso (2018) A Critical Investigation of Deep Reinforcement Learning for Navigation. External Links: 1802.02274, Link Cited by: §3.1.1, §4.
  • A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune (2019) Go-Explore: a New Approach for Hard-Exploration Problems. pp. 1–37. External Links: 1901.10995, Link Cited by: §3.3.1, Table 3, §3, §4.
  • A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune (2020) First Return Then Explore. pp. 1–46. External Links: 2004.12919, Link Cited by: §3.3.1, §3.3.3, Table 3.
  • A. D. Edwards, L. Downs, and J. C. Davidson (2018) Forward-Backward Reinforcement Learning. External Links: 1803.10227, Link Cited by: §3.3.1, Table 3.
  • [32] (2020) Exploration. Note: https://dictionary.cambridge.org/dictionary/english/exploration Accessed: 2020-04-09 Cited by: §2.2.
  • B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine (2019) Diversity Is All You Need. 7th International Conference on Learning Representations, ICLR 2019. External Links: Link Cited by: §1, §3.2.2, §3.2.2, §3.2.3, Table 2.
  • K. Fang, Y. Zhu, S. Savarese, and L. Fei-Fei (2020) Adaptive Procedural Task Generation for Hard-Exploration Problems. Under Review at ICLR 2021. External Links: 2007.00350, Link Cited by: §3.3.2, §3.3.3, Table 3.
  • M. Fatemi, S. Sharma, H. van Seijen, and S. E. Kahou (2019) Dead-ends and Secure Exploration in Reinforcement Learning. 36th International Conference on Machine Learning, ICML 2019, pp. 3315–3323. External Links: ISBN 9781510886988 Cited by: §3.6.2, Table 6.
  • C. Florensa, D. Held, M. Wulfmeier, M. Zhang, and P. Abbeel (2017) Reverse curriculum generation for reinforcement learning. (CoRL). External Links: 1707.05300, Link Cited by: §3.3.1, Table 3.
  • S. Forestier, Y. Mollard, and P. Y. Oudeyer (2017) Intrinsically Motivated Goal Exploration Processes with Automatic Curriculum Learning. External Links: 1708.02190, Link Cited by: §3.3.2, Table 3.
  • M. Fortunato, M. G. Azar, B. Piot, J. Menick, M. Hessel, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg (2018) Noisy networks for exploration. 6th International Conference on Learning Representations, ICLR 2018, pp. 1–21. External Links: 1706.10295 Cited by: §3.7.2, §3.7.2, Table 7.
  • J. Fu, J. D. Co-Reyes, and S. Levine (2017) Ex2: Exploration With Exemplar Models for Deep Reinforcement Learning. Conference on Neural Information Processing Systems, NeurIPS 2017, pp. 2578–2588. External Links: 1703.01260, ISSN 10495258 Cited by: §3.1.3, Table 1.
  • T. Gangwani, Q. Liu, and J. Peng (2019) Learning Self-Imitating Diverse Policies. 7th International Conference on Learning Representations, ICLR 2019. Cited by: §3.2.2, Table 2.
  • C. Gao, B. Kartal, P. Hernandez-Leal, and M. E. Taylor (2019) On Hard Exploration for Reinforcement Learning: a Case Study in Pommerman. Fifteenth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment. External Links: 1907.11788, Link Cited by: §3.6.1, §3.6.1, Table 6.
  • E. Garcelon, M. Ghavamzadeh, A. Lazaric, and M. Pirotta (2020) Conservative Exploration in Reinforcement Learning. Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics. External Links: 2002.03218, Link Cited by: §3.6.1, Table 6.
  • J. Garcia and F. Fernandez (2012) Safe Exploration of State and Action Spaces in Reinforcement Learning. Journal of Artificial Intelligence Research 45, pp. 515–564. External Links: Document, ISSN 10769757 Cited by: §3.6.1, Table 6.
  • J. García and F. Fernandez (2015) A Comprehensive Survey on Safe Reinforcement Learning. Journal of Machine Learning Research 16. External Links: ISSN 21622388 Cited by: §3.6.
  • M. Ghafoorian, N. Taghizadeh, and H. Beigy (2013) Automatic Abstraction in Reinforcement Learning Using Ant System Algorithm. AAAI Spring Symposium - Technical Report SS-13-05, pp. 9–14. External Links: ISBN 9781577356028 Cited by: §3.3.2, Table 3.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. Conference on Neural Information Processing Systems, NeurIPS 2014 27, pp. . External Links: Link Cited by: §3.1.1.
  • D. Gravina, A. Liapis, and G. N. Yannakakis (2018) Quality Diversity Through Surprise. IEEE Transactions on Evolutionary Computation, pp. 1–14. External Links: Document, 1807.02397, ISSN 1089778X Cited by: §3.2.1, §3.2.1, Table 2.
  • S. Grossberg (1987) Competitive Learning: From Interactive Activation To Adaptive Resonance. Cognitive Science 11 (1), pp. 23–63. External Links: Document, ISSN 03640213 Cited by: §3.7.1.
  • C. Guestrin, R. Patrascu, and D. Schuurmans (2002) Algorithm-directed Exploration for Model-Based Reinforcement Learning In Factored Mdps. Machine Learning International Workshop, pp. 235–242. External Links: ISBN 1-55860-873-7, ISSN 0268-1161, Link Cited by: §3.3.2.
  • A. Guez, D. Silver, and P. Dayan (2012) Efficient Bayes-adaptive Reinforcement Learning Using Sample-based Search. Conference on Neural Information Processing Systems, NeurIPS 2012, pp. 1025–1033. External Links: 1205.3109, ISBN 9781627480031, ISSN 10495258 Cited by: §3.4.2, Table 4.
  • C. Gulcehre, T. L. Paine, B. Shahriari, M. Denil, M. Hoffman, H. Soyer, R. Tanburn, S. Kapturowski, N. Rabinowitz, D. Williams, G. Barth-Maron, Z. Wang, and N. de Freitas (2020) Making Efficient Use of Demonstrations To Solve Hard Exploration Problems. 8th International Conference on Learning Representations, ICLR 2020. Cited by: §3.5.1, Table 5.
  • Y. Guo, J. Choi, M. Moczulski, S. Bengio, M. Norouzi, and H. Lee (2019) Self-Imitation Learning via Trajectory-Conditioned Policy for Hard-Exploration Tasks. pp. 1–22. External Links: 1907.10247, Link Cited by: §3.3.1, Table 3.
  • Y. Guo, J. Choi, M. Moczulski, S. Feng, S. Bengio, M. Norouzi, and H. Lee (2020) Memory Based Trajectory-conditioned Policies for Learning from Sparse Rewards. Conference on Neural Information Processing Systems, NeurIPS 2020. Cited by: §3.3.1, §3.3.3, Table 3.
  • Z. D. Guo and E. Brunskill (2019) Directed Exploration for Reinforcement Learning. External Links: 1906.07805, Link Cited by: §3.3.1, Table 3.
  • M. Henaff (2019) Explicit explore-exploit algorithms in continuous state spaces. Conference on Neural Information Processing Systems, NeurIPS 2019. External Links: 1911.00617, ISSN 10495258 Cited by: §3.4.2, Table 4.
  • T. Hester, T. Schaul, A. Sendonaris, M. Vecerik, B. Piot, I. Osband, O. Pietquin, D. Horgan, G. Dulac-Arnold, M. Lanctot, J. Quan, J. Agapiou, J. Z. Leibo, and A. Gruslys (2018) Deep q-learning From Demonstrations. 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, pp. 3223–3230. External Links: 1704.03732, ISBN 9781577358008 Cited by: §3.5.1, Table 5.
  • T. Hester and P. Stone (2013) Learning Exploration Strategies in Model-Based Reinforcement Learning. Proceedings of the 12th International Conference on Autonomous Agents and Multiagent Systems (AAAI. Cited by: §3.3.2, Table 3.
  • W. Hong, M. Zhu, M. Liu, W. Zhang, M. Zhou, Y. Yu, and P. Sun (2019) Generative Adversarial Exploration for Reinforcement Learning. ACM International Conference Proceeding Series. External Links: Document, ISBN 9781450376563 Cited by: §3.1.1, Table 1.
  • Z. W. Hong, T. Y. Shann, S. Y. Su, Y. H. Chang, and C. Y. Lee (2018) Diversity-Driven Exploration Strategy for Deep Reinforcement Learning. 6th International Conference on Learning Representations, ICLR 2018 - Workshop Track Proceedings. Cited by: §3.2.2, Table 2.
  • R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel (2016) VIME: Variational Information Maximizing Exploration. Conference on Neural Information Processing Systems, NeurIPS 2016 0, pp. 1117–1125. External Links: 1605.09674, ISSN 10495258 Cited by: §3.1.1.
  • N. Hunt, N. Fulton, S. Magliacane, N. Hoang, S. Das, and A. Solar-Lezama (2020) Verifiably Safe Exploration for End-to-End Reinforcement Learning. External Links: 2007.01223, Link Cited by: §3.6.1, Table 6.
  • A. Irpan (2018) Deep reinforcement learning doesn’t work yet. Note: https://www.alexirpan.com/2018/02/14/rl-hard.html Cited by: §1.
  • A. Jaegle, V. Mehrpour, and N. Rust (2019) Visual Novelty, Curiosity, And Intrinsic Reward in Machine Learning And The Brain. Current Opinion in Neurobiology 58, pp. 167–174. External Links: Document, 1901.02478, ISSN 18736882, Link Cited by: §4.
  • D. Janz, J. Hron, P. Mazur, K. Hofmann, J. M. Hernández-Lobato, and S. Tschiatschek (2019) Successor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning. Conference on Neural Information Processing Systems, NeurIPS 2019 33. External Links: 1810.06530, ISSN 10495258 Cited by: §3.4.2, Table 4.
  • M. Johnson, K. Hofmann, T. Hutton, and D. Bignell (2016) The Malmo Platform For Artificial Intelligence Experimentation. IJCAI International Joint Conference on Artificial Intelligence 2016-Janua, pp. 4246–4247. External Links: ISSN 10450823 Cited by: §2.4.3, §4.
  • T. Jung and P. Stone (2010) Gaussian Processes for Sample Efficient Reinforcement Learning With Rmax-Like Exploration. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6321 LNAI (PART 1), pp. 601–616. External Links: Document, 1201.6604, ISBN 364215879X, ISSN 03029743 Cited by: §3.4.1, Table 4.
  • T. G. Karimpanal, S. Rana, S. Gupta, T. Tran, and S. Venkatesh (2020) Learning Transferable Domain Priors for Safe Exploration in Reinforcement Learning. pp. 1–10. External Links: Document, 1909.04307, ISBN 9781728169262 Cited by: §3.6.2, §3.6.2, Table 6.
  • M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaskowski (2016) ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning. IEEE Conference on Computatonal Intelligence and Games, CIG 0. External Links: Document, 1605.02097, ISBN 9781509018833, ISSN 23254289 Cited by: §2.4.2, §4.
  • M. Khamassi, G. Velentzas, T. Tsitsimis, and C. Tzafestas (2017) Active Exploration And Parameterized Reinforcement Learning Applied to A Simulated Human-Robot Interaction Task. 2017 1st IEEE International Conference on Robotic Computing, IRC 2017, pp. 28–35. External Links: Document, ISBN 9781509067237 Cited by: §3.7.1, Table 7.
  • H. Kim, J. Kim, Y. Jeong, S. Levine, and H. O. Song (2019a) EMI: Exploration with Mutual Information. 36th International Conference on Machine Learning, ICML 2019, pp. 5837–5851. External Links: 1810.01176, ISBN 9781510886988 Cited by: §3.1.1, Table 1.
  • Y. Kim, W. Nam, H. Kim, J. H. Kim, and G. Kim (2019b) Curiosity-Bottleneck: Exploration by Distilling Task-Specific Novelty. 36th International Conference on Machine Learning, ICML 2019, pp. 5861–5874. External Links: ISBN 9781510886988 Cited by: §3.1.1, Table 1.
  • B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Sallab, S. Yogamani, and P. Perez (2021) Deep Reinforcement Learning for Autonomous Driving: A Survey. IEEE Transactions on Intelligent Transportation Systems, pp. 1–18. External Links: Document, 2002.00444, ISSN 15580016 Cited by: §1.
  • J. Z. Kolter and A. Y. Ng (2009) Near-Bayesian Exploration in Polynomial Time. pp. 513–520. External Links: ISBN 9781605585161 Cited by: §3.1.1.
  • G. Kovač, A. Laversanne-Finot, and P. Oudeyer (2020) GRIMGEP: Learning Progress for Robust Goal Sampling in Visual Deep Reinforcement Learning. (CoRL), pp. 1–15. External Links: 2008.04388v1, Link Cited by: §3.3.2, §4.
  • T. D. Kulkarni, K. R. Narasimhan, A. Saeedi, and J. B. Tenenbaum (2016) Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation. Conference on Neural Information Processing Systems, NeurIPS 2016. External Links: Document, NIHMS150003, ISBN 0924-6703, ISSN 1530888X Cited by: §3.3.2.
  • A. Lazaridis, A. Fachantidis, and I. Vlahavas (2020) Deep reinforcement learning: a state-of-the-art walkthrough. Journal of Artificial Intelligence Research 69. External Links: ISSN 2331-8422 Cited by: §1.
  • S. Lee and H. Bang (2020) Automatic Gain Tuning Method of a Quad-Rotor Geometric Attitude Controller Using A3C. International Journal of Aeronautical and Space Sciences 21 (2), pp. 469–478. External Links: Document, ISSN 20932480, Link Cited by: §1.
  • J. Lehman and K. O. Stanley (2011) Abandoning Objectives: Evolution Through the Search for Novelty Alone. Evolutionary Computation 19 (2), pp. 189–222. External Links: Document, ISSN 10636560 Cited by: §3.2.1, Table 2.
  • S. Levine (2018) Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv. External Links: 1805.00909, ISSN 23318422 Cited by: §1.
  • B. Li, T. Lu, J. Li, N. Lu, Y. Cai, and S. Wang (2020a) ACDER: Augmented Curiosity-Driven Experience Replay. IEEE International Conference on Robotics and Automation, ICRA 2020, pp. 4218–4224. External Links: Document, ISBN 9781728173955, ISSN 10504729 Cited by: §3.1.1.
  • B. Li, T. Lu, J. Li, N. Lu, Y. Cai, and S. Wang (2019) Curiosity-driven Exploration for Off-policy Reinforcement Learning Methods. IEEE International Conference on Robotics and Biomimetics, ROBIO 2019 (December), pp. 1109–1114. External Links: Document, ISBN 9781728163215 Cited by: §3.1.1, Table 1.
  • J. Li, X. Shi, J. Li, X. Zhang, and J. Wang (2020b) Random Curiosity-Driven Exploration in Deep Reinforcement Learning. Neurocomputing 418, pp. 139–147. External Links: Document, ISSN 0925-2312, Link Cited by: §3.1.1, Table 1.
  • Y. Li (2018) Deep reinforcement learning. External Links: Document, 1911.10107, ISBN 9781948087667 Cited by: §1.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous Control with Deep Reinforcement Learning. 4th International Conference on Learning Representations, ICLR 2016. External Links: 1509.02971 Cited by: §1.
  • Z. C. Lipton, K. Azizzadenesheli, A. Kumar, L. Li, J. Gao, and L. Deng (2016) Combating Reinforcement Learning’s Sisyphean Curse with Intrinsic Fear. External Links: 1611.01211, Link Cited by: §3.6.2, §3.6.2, Table 6, §4.
  • E. Z. Liu, R. Keramati, S. Seshadri, K. Guu, P. Pasupat, E. Brunskill, and P. Liang (2020) Learning Abstract Models for Strategic Exploration and Fast Reward Transfer. External Links: 2007.05896, Link Cited by: §3.3.1.
  • M. C. Machado, M. G. Bellemare, and M. Bowling (2017) A Laplacian Framework for Option Discovery in Reinforcement Learning. 34th International Conference on Machine Learning, ICML 2017 5, pp. 3567–3582. External Links: 1703.00956, ISBN 9781510855144 Cited by: §3.3.2, Table 3.
  • M. C. Machado, C. Rosenbaum, X. Guo, M. Liu, G. Tesauro, and M. Campbell (2018) Eigenoption Discovery Through The Deep Successor Representation. 6th International Conference on Learning Representations, ICLR 2018. External Links: arXiv:1710.11089v3 Cited by: §3.3.2, Table 3.
  • M. C. Machado, M. G. Bellemare, and M. Bowling (2020) Count-Based Exploration with the Successor Representation. AAAI Conference on Artificial Intelligence. External Links: Document, 1807.11622, ISSN 2374-3468, Link Cited by: §3.1.2, Table 1.
  • T. Malisiewicz, A. Gupta, and A. A. Efros (2011) Ensemble of Exemplar-SVMs for Object Detection and Beyond. Proceedings of the IEEE International Conference on Computer Vision, pp. 89–96. External Links: Document, ISBN 9781457711015 Cited by: §3.1.3.
  • J. Martin, S. S. Narayanan, T. Everitt, and M. Hutter (2017) Count-based exploration in feature space for reinforcement learning. IJCAI International Joint Conference on Artificial Intelligence, pp. 2471–2478. External Links: Document, arXiv:1706.08090v1, ISBN 9780999241103, ISSN 10450823 Cited by: §3.1.2, Table 1.
  • G. Matheron, N. Perrin, and O. Sigaud (2020) PBCS: Efficient Exploration and Exploitation Using a Synergy Between Reinforcement Learning and Motion Planning. ICANN 2020 12397 LNCS, pp. 295–307. External Links: Document, 2004.11667, ISBN 9783030616151, ISSN 16113349, Link Cited by: §3.3.1, Table 3.
  • R. Mcfarlane (1999) A Survey of Exploration Strategies in Reinforcement Learning. pp. 1–10. External Links: Link Cited by: §1.
  • P. Ménard, O. D. Domingues, A. Jonsson, E. Kaufmann, E. Leurent, and M. Valko (2020) Fast Active Learning for Pure Exploration in Reinforcement Learning. pp. 1–36. External Links: 2007.13442, Link Cited by: §3.1.2.
  • P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, D. Kumaran, and R. Hadsell (2017) Learning to Navigate in Complex Environments. 5th International Conference on Learning Representations, ICLR 2017. External Links: 1611.03673 Cited by: §3.1.1, Table 1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level Control Through Deep Reinforcement Learning. Nature 518 (7540), pp. 529–533. External Links: Document, ISSN 14764687, Link Cited by: §1.
  • S. Mohamed and D. J. Rezende (2015) Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning. Conference on Neural Information Processing Systems, NeurIPS 2015, pp. 2125–2133. External Links: 1509.08731, ISSN 10495258 Cited by: §3.1.1.
  • J. B. Mouret and S. Doncieux (2012) Encouraging Behavioral Diversity In Evolutionary Robotics: An Empirical Study. Evolutionary Computation 20 (1), pp. 91–133. External Links: Document, ISBN 1063-6560, ISSN 10636560 Cited by: §3.2.1.
  • A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel (2018) Overcoming Exploration in Reinforcement Learning with Demonstrations. Proceedings - IEEE International Conference on Robotics and Automation, pp. 6292–6299. External Links: Document, 1709.10089, ISBN 9781538630815, ISSN 10504729 Cited by: §3.5.1, Table 5.
  • T. T. Nguyen, N. D. Nguyen, and S. Nahavandi (2018) Deep Reinforcement Learning For Multi-Agent Systems: A Review Of Challenges, Solutions and Applications. arXiv 50 (9), pp. 3826–3839. External Links: ISSN 23318422 Cited by: §1.
  • N. Nikolov, J. Kirschner, F. Berkenkamp, and A. Krause (2019) Information-Directed Exploration for Deep Reinforcement Learning. 7th International Conference on Learning Representations, ICLR 2019. External Links: 1812.07544 Cited by: §3.4.2, Table 4.
  • B. O’Donoghue, I. Osband, R. Munos, and V. Mnih (2018) The Uncertainty Bellman Equation And Exploration. 35th International Conference on Machine Learning, ICML 2018 9, pp. 6154–6173. External Links: 1709.05380, ISBN 9781510867963 Cited by: §3.4.2, Table 4.
  • J. Oh, Y. Guo, S. Singh, and H. Lee (2018) Self-Imitation Learning. Proceedings of the 35th International Conference on Machine Learning. Cited by: §3.3.1, Table 3.
  • OpenAI (2021) Asymmetric Self-Play for Automatic Goal Discovery in Robotic Manipulation. External Links: 2101.04882, Link Cited by: §4.
  • I. Osband, J. Aslanides, and A. Cassirer (2018) Randomized Prior Functions for Deep Reinforcement Learning. Conference on Neural Information Processing Systems (NeurIPS 2018) (NeurIPS). Cited by: §3.1.1.
  • I. Osband, C. Blundell, A. Pritzel, and B. V. Roy (2016a) Deep exploration via bootstrapped dqn. Conference on Neural Information Processing Systems, NeurIPS 2016. External Links: ISSN 00437131 Cited by: §3.4.2, Table 4.
  • I. Osband, Y. Doron, M. Hessel, J. Aslanides, E. Sezener, A. Saraiva, K. McKinney, T. Lattimore, C. Szepezvari, S. Singh, B. van Roy, R. Sutton, D. Silver, and H. van Hasselt (2020) Behaviour Suite for Reinforcement Learning. 8th International Conference on Learning Representations, ICLR 2020. External Links: 1908.03568, ISSN 23318422 Cited by: §4.
  • I. Osband, B. Van Roy, D. J. Russo, and Z. Wen (2019) Deep Exploration via Randomized Value Functions. Journal of Machine Learning Research 20, pp. 1–62. External Links: 1703.07608, ISSN 15337928 Cited by: §3.4.1, Table 4.
  • I. Osband, B. Van Roy, and Z. Wen (2016b) Generalization and exploration via randomized value functions. 33rd International Conference on Machine Learning, ICML 2016, pp. 3540–3561. External Links: 1402.0635, ISBN 9781510829008 Cited by: §3.4.1, Table 4.
  • I. Osband and B. Van Roy (2017) Why is Posterior Sampling Better Than Optimism for Reinforcement Learning?. 34th International Conference on Machine Learning, ICML 2017, pp. 4133–4148. External Links: 1607.00215, ISBN 9781510855144 Cited by: §3.4.3, §3.4.
  • G. Ostrovski, M. G. Bellemare, A. Van Den Oord, and R. Munos (2017) Count-based Exploration with Neural Density Models. 34th International Conference on Machine Learning, ICML 2017 6, pp. 4161–4175. External Links: 1703.01310, ISBN 9781510855144 Cited by: §3.1.2, Table 1.
  • P. Oudeyer and F. Kaplan (2007) What Is Intrinsic Motivation? A Typology Of Computational Approaches. Frontiers in Neurorobotics 1 (6), pp. 1184–1191. External Links: Document, ISSN 1662-5218, Link Cited by: §3.1.
  • D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017) Curiosity-Driven Exploration by Self-Supervised Prediction. Proceedings of the 34th International Conference on Machine Learning, pp. 488–489. External Links: Document, 1705.05363, ISBN 9781538607336, ISSN 21607516 Cited by: §3.1.1, Table 1.
  • D. Pathak, D. Gandhi, and A. Gupta (2019) Self-Supervised Exploration via Disagreement. Proceedings of the 36th International Conference on Machine Learning. Cited by: Figure 4, Figure 5, Figure 7, §3.1.1.
  • R. Patrascu and D. Stacey (1999) Adaptive Exploration in Reinforcement Learning. Proceedings of the International Joint Conference on Neural Networks 4, pp. 2276–2281. External Links: Document, ISBN 0780355296 Cited by: §3.7.1, Table 7.
  • T. Pearce, N. Anastassacos, M. Zaki, and A. Neely (2018) Bayesian Inference with Anchored Ensembles of Neural Networks, and Application to Exploration in Reinforcement Learning. Exploration in Reinforcement Learning Workshop at the 35th International Conference on Machine Learning. External Links: 1805.11324, Link Cited by: §3.4.2, Table 4.
  • A. Péré, S. Forestier, O. Sigaud, and P. Y. Oudeyer (2018) Unsupervised Learning Of Goal Spaces For Intrinsically Motivated Goal Exploration. 6th International Conference on Learning Representations, ICLR 2018, pp. 1–26. External Links: 1803.00781 Cited by: §3.3.2, Table 3.
  • M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz (2018) Parameter Space Noise for Exploration. 6th International Conference on Learning Representations, ICLR 2018, pp. 1–18. External Links: 1706.01905 Cited by: §3.7.2, Table 7.
  • R. Polvara, M. Patacchiola, S. Sharma, J. Wan, A. Manning, R. Sutton, and A. Cangelosi (2017) Autonomous Quadrotor Landing using Deep Reinforcement Learning. External Links: 1709.03339, ISSN 2331-8422, Link Cited by: §1.
  • V. H. Pong, M. Dalal, S. Lin, A. Nair, S. Bahl, and S. Levine (2019) Skew-Fit: State-Covering Self-Supervised Reinforcement Learning. External Links: 1903.03698, Link Cited by: §3.2.2, Table 2.
  • J. K. Pugh, L. B. Soros, and K. O. Stanley (2016) Quality Diversity: A New Frontier for Evolutionary Computation. Frontiers in Robotics and AI 3 (July). External Links: Document, ISBN 9781424447152, ISSN 2296-9144, Link Cited by: §3.2.1.
  • R. Raileanu and T. Rocktäschel (2020) RIDE: Rewarding Impact-Driven Exploration for Procedurally-Generated Environments. 8th International Conference on Learning Representations, ICLR 2020. External Links: 2002.12292, Link Cited by: §3.1.1, Table 1, §4.
  • M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. van de Wiele, V. Mnih, N. Heess, and T. Springenberg (2018) Learning by Playing - Solving Sparse Reward Tasks From Scratch. Proceedings of the 35th International Conference on Machine Learning. Cited by: §3.3.2, Table 3.
  • S. Risi, S. D. Vanderbleek, C. E. Hughes, and K. O. Stanley (2009) How Novelty Search Escapes the Deceptive Trap of Learning to Learn. Proceedings of the 11th Annual conference on Genetic and evolutionary computation - GECCO ’09, pp. 153. External Links: Document, ISBN 9781605583259, Link Cited by: §3.2.1, Table 2.
  • T. Rückstieß, M. Felder, and J. Schmidhuber (2008) State-dependent Exploration for Policy Gradient Methods. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 5212 LNAI (PART 2), pp. 234–249. External Links: Document, ISBN 3540874801, ISSN 03029743 Cited by: §3.7.2.
  • T. Rückstieß, F. Sehnke, T. Schaul, D. Wierstra, Y. Sun, and J. Schmidhuber (2010) Exploring Parameter Space in Reinforcement Learning. Journal of Behavioral Robotics 1 (1), pp. 14–24. External Links: Document, ISSN 2081-4836 Cited by: §3.7.2, §3.7.2.
  • T. Salimans and R. Chen (2018) Learning Montezuma’s Revenge from a Single Demonstration. Conference on Neural Information Processing Systems, NeurIPS 2018. External Links: 1812.03381, Link Cited by: §3.5.2, §3.5.2, Table 5.
  • T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever (2017) Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv, pp. 476–485. External Links: Document, 1703.03864v2, ISBN 9780769543451 Cited by: §3.2.1.
  • W. Saunders, A. Stuhlmüller, G. Sastry, and O. Evans (2018) Trial Without Error: Towards Safe Reinforcement Learning via Human Intervention. Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS, pp. 2067–2069. External Links: 1707.05173, ISBN 9781510868083, ISSN 15582914 Cited by: §3.6.1, Table 6.
  • N. Savinov, A. Raichuk, R. Marinier, D. Vincent, M. Pollefeys, T. Lillicrap, and S. Gelly (2019) Episodic Curiosity Through Reachability. 7th International Conference on Learning Representations, ICLR 2019, pp. 1–20. External Links: 1810.02274 Cited by: §3.1.1, Table 1.
  • J. Schmidhuber (2010) Formal Theory Of Creativity, Fun, and Intrinsic Motivation (1990-2010). IEEE Transactions on Autonomous Mental Development 2 (3), pp. 230–247. External Links: Document, 1510.05840, ISBN 1943-0604, ISSN 19430604 Cited by: §3.1, §3.1, §3.
  • J. Schmidhuber (1991a) A Possibility for Implementing Curiosity And Boredom in Model-Building Neural Controllers. Proceedings of the First International Conference on Simulation of Adaptive Behavior 1, pp. 5–10. External Links: Link Cited by: §1, §3.1.1.
  • J. Schmidhuber (1991b) Curious model-building control systems. 1991 IEEE International Joint Conference on Neural Networks, IJCNN 1991, pp. 1458–1463. External Links: Document, ISBN 0780302273 Cited by: §1, §3.1.1.
  • L. Shani, Y. Efroni, and S. Mannor (2019) Exploration Conscious Reinforcement Learning Revisited. 36th International Conference on Machine Learning, ICML 2019, pp. 9986–10012. External Links: 1812.05551, ISBN 9781510886988 Cited by: §3.7.1, §3.7.1, Table 7.
  • K. Shibata and Y. Sakashita (2015) Reinforcement Learning With Internal-Dynamics-Based Exploration Using A Chaotic Neural Network. Proceedings of the International Joint Conference on Neural Networks, IJCNN 2015. External Links: Document, ISBN 9781479919604 Cited by: §3.7.2, Table 7.
  • P. Shyam, W. Jaskowski, and F. Gomez (2019) Model-based Active Exploration. 36th International Conference on Machine Learning, ICML 2019, pp. 10136–10152. External Links: 1810.12162, ISBN 9781510886988 Cited by: §3.4.2, Table 4.
  • B. C. Stadie, S. Levine, and P. Abbeel (2015) Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models. pp. 1–11. External Links: 1507.00814, Link Cited by: §3.1.1, Table 1, §4.
  • C. Stanton and J. Clune (2018) Deep Curiosity Search: Intra-life Exploration Can Improve Performance on Challenging Deep Reinforcement Learning Problems. External Links: 1806.00553, Link Cited by: §3.1.1, Table 1.
  • S. Still and D. Precup (2012) An Information-Theoretic Approach to Curiosity-Driven Reinforcement Learning. Theory in Biosciences 131 (3), pp. 139–148. External Links: Document, ISSN 14317613 Cited by: §3.1.1.
  • S. Still (2009) Information Theoretic Approach to Interactive Learning. arXiv, pp. 1–6. Cited by: §3.1.1.
  • M. Strens (2000) A Bayesian Framework for Reinforcement Learning. Proceedings of the 17th International Conference on Machine Learning, pp. 943–950. External Links: ISBN 1-55860-707-2, Link Cited by: §3.4.2, Table 4.
  • F. Stulp (2012) Adaptive Exploration for Continual Reinforcement Learning. IEEE International Conference on Intelligent Robots and Systems, IROS 2012, pp. 1631–1636. External Links: Document, ISBN 9781467317375, ISSN 21530858 Cited by: §3.4.2, Table 4.
  • K. Subramanian, C. L. Isbell, and A. L. Thomaz (2016) Exploration From Demonstration for Interactive Reinforcement Learning. Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS, pp. 447–456. External Links: ISBN 9781450342391, ISSN 15582914 Cited by: §3.5.2.
  • F. P. Such, V. Madhavan, E. Conti, J. Lehman, K. O. Stanley, and J. Clune (2017) Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning. External Links: Document, 1712.06567, ISBN 9781439854242, ISSN 07387946, Link Cited by: §3.2.1.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA. External Links: ISBN 0262039249 Cited by: §1.
  • H. Tang, R. Houthooft, D. Foote, A. Stooke, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel (2017) #Exploration: A Study Of Count-Based Exploration For Deep Reinforcement Learning. Conference on Neural Information Processing Systems, NeurIPS 2017, pp. 2754–2763. External Links: 1611.04717, ISSN 10495258 Cited by: §3.1.2, Table 1.
  • Y. Tang and S. Agrawal (2018) Exploration by Distributional Reinforcement Learning. IJCAI International Joint Conference on Artificial Intelligence 2018-July, pp. 2710–2716. External Links: Document, 1805.01907, ISBN 9780999241127, ISSN 10450823 Cited by: §3.4.2, Table 4.
  • T. H. Teng and A. H. Tan (2012) Knowledge-based Exploration for Reinforcement Learning in Self-organizing Neural Networks. Proceedings - 2012 IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IAT 2012, pp. 332–339. External Links: Document, ISBN 9780769548807 Cited by: §3.7.1.
  • E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: A Physics Engine for Model-based Control. IEEE International Conference on Intelligent Robots and Systems, pp. 5026–5033. External Links: Document, ISBN 9781467317375, ISSN 21530858 Cited by: §2.4.4, §4.
  • M. Tokic (2010) Adaptive ε-greedy Exploration in Reinforcement Learning Based on Value Differences. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6359 LNAI, pp. 203–210. External Links: Document, ISBN 3642161103, ISSN 03029743 Cited by: §3.7.1, Table 7.
  • M. Turchetta, F. Berkenkamp, and A. Krause (2016) Safe Exploration in Finite Markov Decision Processes with Gaussian Processes. Conference on Neural Information Processing Systems, NeurIPS 2016, pp. 4312–4320. External Links: 1606.04753, ISSN 10495258 Cited by: §3.6.1, §3.6.1, Table 6.
  • M. Usama and D. E. Chang (2019) Learning-Driven Exploration for Reinforcement Learning. External Links: 1906.06890, Link Cited by: §3.7.1, Table 7.
  • M. Vecerik, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. Riedmiller (2017) Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards. External Links: 1707.08817, Link Cited by: §3.5.1, Table 5.
  • A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu (2017) FeUdal Networks for Hierarchical Reinforcement Learning. 34th International Conference on Machine Learning, ICML 2017 7, pp. 5409–5418. External Links: 1703.01161, ISBN 9781510855144 Cited by: §3.3.2.
  • P. Wang, W. J. Zhou, D. Wang, and A. H. Tan (2018) Probabilistic Guided Exploration for Reinforcement Learning in Self-Organizing Neural Networks. Proceedings - 2018 IEEE International Conference on Agents, ICA 2018, pp. 109–112. External Links: Document, ISBN 9781538681800, Link Cited by: §3.7.1, Table 7.
  • R. J. Williams (1992) Simple Statistical Gradient-Following Algorithms For Connectionist Reinforcement Learning. Machine Learning 8 (3-4), pp. 229–256. External Links: Document, ISSN 0885-6125 Cited by: §2.1.3.
  • C. Xie, S. Patil, T. Moldovan, S. Levine, and P. Abbeel (2016) Model-Based Reinforcement Learning With Parametrized Physical Models And Optimism-Driven Exploration. IEEE International Conference on Robotics and Automation, ICRA 2016, pp. 504–511. External Links: Document, 1509.06824, ISBN 9781467380263, ISSN 10504729 Cited by: §3.4.1, Table 4.
  • C. Yu, J. Liu, and S. Nemati (2019) Reinforcement Learning in Healthcare: A Survey. External Links: 1908.08796, ISSN 2331-8422, Link Cited by: §1.
  • M. J. Zaki and W. Meira (2014) Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press. External Links: Document Cited by: §3.3.2.
  • L. Zhang, K. Tang, and X. Yao (2019) Explicit Planning for Efficient Exploration in Reinforcement Learning. Conference on Neural Information Processing Systems, NeurIPS 2019 32. External Links: ISSN 10495258 Cited by: §4.
  • R. Zhao and V. Tresp (2019) Curiosity-Driven Experience Prioritization via Density Estimation. arXiv. External Links: 1902.08039 Cited by: §3.1.2.