1 Introduction
Reinforcement learning (RL) methods have recently enjoyed widely publicised success in domains that once seemed far beyond their reach [Mnih et al.2015]. Much of this progress is due to the application of modern function approximation techniques to the problem of policy evaluation for Markov Decision Processes (MDPs) [Sutton and Barto1998]. These techniques address a key shortcoming of tabular MDP solution methods: their inability to generalise what is learnt from one context to another. This sort of generalisation is crucial if the state-action space of the MDP is large, because the agent typically only visits a small subset of that space during training.
Comparatively little progress has been made on the problem of efficient exploration in large domains. Even algorithms that use sophisticated nonlinear methods for policy evaluation tend to use very old, inefficient exploration techniques, such as the $\epsilon$-greedy strategy [van Hasselt et al.2016b, Mnih et al.2016, Nair et al.2015]. There are more efficient tabular count-based exploration algorithms for finite MDPs, which drive the agent to reduce its uncertainty by visiting states that have low visit-counts [Strehl and Littman2008]. However, these algorithms are often ineffective in MDPs with high-dimensional state-action spaces, because most states are never visited during training, and the visit-count remains at zero nearly everywhere.
Count-based exploration algorithms have only very recently been successfully adapted for these large problems [Bellemare et al.2016, Tang et al.2016]. Just as function approximation techniques achieve generalisation across the state space regarding value, these algorithms achieve generalisation regarding uncertainty. The breakthrough has been the development of generalised state visit-counts, which are larger for states that are more similar to visited states, and which can be nonzero for unvisited states. The key challenge is to compute an appropriate similarity measure in an efficient way, such that these exploration methods can be combined with scalable RL algorithms. It soon becomes infeasible, for example, to do so by storing the entire history of visited states and comparing each new state to those in the history. The most promising proposals instead compute generalised counts from a compressed representation of the history of visited states: for example, by constructing a visit-density model over the state space and deriving a "pseudocount" [Bellemare et al.2016, Ostrovski et al.2017], or by using locality-sensitive hashing to cluster states and counting the occurrences in each cluster [Tang et al.2016].
This paper presents a new count-based exploration algorithm that is feasible in environments with large state-action spaces. It can be combined with any value-based RL algorithm that uses linear function approximation (LFA). Our principal contribution is a new method for computing generalised visit-counts. Following [Bellemare et al.2016], we construct a visit-density model in order to measure the similarity between states. Our approach departs from theirs in that we do not construct our density model over the raw state space. Instead, we exploit the feature map that is used for value function approximation, and construct a density model over the transformed feature space. This model assigns higher probability to state feature vectors that share features with visited states. Generalised visit-counts are then computed from these probabilities; states with frequently observed features are assigned higher counts. These counts serve as a measure of the uncertainty associated with a state. Exploration bonuses are then computed from these counts in order to encourage the agent to visit regions of the state space with less familiar features.
Our density model can be trivially derived from any feature map used for LFA, regardless of the application domain, and requires little or no additional design. In contrast to existing algorithms, there is no need to perform a special dimensionality reduction of the state space in order to compute our generalised visit-counts. Our method uses the same lower-dimensional feature representation to estimate value and to estimate uncertainty. This makes it simpler to implement and less computationally expensive than some existing proposals. Our evaluation demonstrates that this simple approach achieves near state-of-the-art performance on high-dimensional RL benchmarks.
2 Background and Related Work
2.1 Reinforcement Learning
The reinforcement learning (RL) problem formalises the task of learning from interaction to achieve a goal [Sutton and Barto1998]. It is usually formulated as an MDP $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the set of states of the environment, $\mathcal{A}$ is the set of available actions, $\mathcal{P}$ is the state transition distribution, $\mathcal{R}$ is the reward function, and $\gamma$ is the discount factor. The agent is formally a policy $\pi : \mathcal{S} \to \mathcal{A}$ that maps a state to an action. At timestep $t$, the agent is in a state $s_t \in \mathcal{S}$, receives a reward $r_t$, and takes an action $a_t \in \mathcal{A}$. We seek a policy $\pi$ that maximises the expected sum of future rewards, or value. The action-value $Q^\pi(s,a)$ of a state-action pair $(s,a)$ under a policy $\pi$ is the expected discounted sum of future rewards, given that the agent takes action $a$ from state $s$, and follows $\pi$ thereafter: $Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$.
RL methods that compute a value function are called value-based methods. Tabular methods store the value function as a table having one entry for each state(-action). This representation of the state space does not have sufficient structure to permit generalisation based on the similarity between states. Function approximation methods achieve generalisation by approximating the value function by a parameterised functional form. In LFA the approximate action-value function $\hat{Q}_t(s,a) = \theta_t^\top \phi(s,a)$ is a linear combination of state-action features, where $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^M$ is an $M$-dimensional feature map and $\theta_t \in \mathbb{R}^M$ is a parameter vector.
2.2 Count-Based Exploration and Optimism
Since the true transition and reward distributions $\mathcal{P}$ and $\mathcal{R}$ are unknown to the agent, it must explore the environment to gather more information and reduce its uncertainty. At the same time, it must exploit its current information to maximise expected cumulative reward. This tradeoff between exploration and exploitation is a fundamental problem in RL.
Many of the exploration algorithms that enjoy strong theoretical guarantees implement the 'optimism in the face of uncertainty' (OFU) heuristic [Strehl et al.2009]. Most are tabular and count-based in that they compute exploration bonuses from a table of state(-action) visit-counts. These bonuses are added to the estimated state/action value. Lower counts entail higher bonuses, so the agent is effectively optimistic about the value of less frequently visited regions of the environment. OFU algorithms are more efficient than random strategies like $\epsilon$-greedy because the agent avoids actions that yield neither large rewards nor large reductions in uncertainty [Osband et al.2016b].
One of the best known is the UCB1 bandit algorithm, which selects an action $a$ that maximises an upper confidence bound $\hat{R}_t(a) + \sqrt{2 \log t / N_t(a)}$, where $\hat{R}_t(a)$ is the estimated mean reward and $N_t(a)$ is the visit-count [Lai and Robbins1985]. The dependence of the bonus term on the inverse square-root of the visit-count is justified using Chernoff bounds. In the MDP setting, the tabular OFU algorithm most closely resembling our method is Model-Based Interval Estimation with Exploration Bonuses (MBIE-EB) [Strehl and Littman2008].^1 Empirical estimates $\hat{\mathcal{P}}$ and $\hat{\mathcal{R}}$ of the transition and reward functions are maintained, and $\hat{\mathcal{R}}(s,a)$ is augmented with a bonus term $\beta / \sqrt{N_t(s,a)}$, where $N_t(s,a)$ is the state-action visit-count, and $\beta$ is a theoretically derived constant. The Bellman optimality equation for the augmented action-value function is $\tilde{Q}_t(s,a) = \hat{\mathcal{R}}(s,a) + \gamma \, \mathbb{E}_{\hat{\mathcal{P}}}\left[\max_{a'} \tilde{Q}_t(s',a')\right] + \beta / \sqrt{N_t(s,a)}$. Here the dependence of the bonus on the inverse square-root of the visit-count is provably optimal [Kolter and Ng2009]. This equation can be solved using any MDP solution method.
^1 To the best of our knowledge, the first work to use exploration bonuses in the MDP setting was the Dyna-Q+ algorithm, in which the bonus is a function of the recency of visits to a state, rather than the visit-count [Sutton1990].
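The UCB1 selection rule mentioned above can be illustrated with a short sketch. It assumes the standard form of the bound, mean reward plus $\sqrt{2 \log t / N(a)}$, with $t$ taken as the total number of pulls so far; the function name `ucb1_select` and its arguments are hypothetical and not part of the original paper.

```python
import math


def ucb1_select(mean_rewards, counts):
    """Return the index of the action with the largest UCB1 score.

    mean_rewards[a] is the estimated mean reward of action a and
    counts[a] is its visit-count. Unvisited actions are tried first,
    since their confidence bound is unbounded.
    """
    for action, n in enumerate(counts):
        if n == 0:
            return action
    t = sum(counts)  # total number of pulls so far (assumption)
    scores = [
        mean + math.sqrt(2.0 * math.log(t) / n)
        for mean, n in zip(mean_rewards, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# A rarely tried action can win despite a lower estimated mean.
chosen = ucb1_select([0.5, 0.4], [10, 1])
```

The bonus shrinks with the inverse square-root of the visit-count, so an action that has been tried only once keeps a large bonus until more evidence accumulates.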
2.3 Exploration in Large MDPs
While tabular OFU algorithms perform well in practice on small MDPs [Strehl and Littman2004], their sample complexity becomes prohibitive for larger problems [Kakade2003]. MBIE-EB, for example, has a sample complexity bound of $\tilde{O}\left(\frac{|\mathcal{S}|^2 |\mathcal{A}|}{\epsilon^3 (1-\gamma)^6}\right)$. In the high-dimensional setting, where the agent cannot hope to visit every state during training, this bound offers no guarantee that the trained agent will perform well.
Several very recent extensions of count-based exploration methods have produced impressive results on high-dimensional RL benchmarks. These algorithms closely resemble MBIE-EB, but they replace the state-action visit-count with a generalised count which quantifies the similarity of a state to previously visited states. Bellemare et al. construct a Context Tree Switching (CTS) density model over the state space such that higher probability is assigned to states that are more similar to visited states [Bellemare et al.2016, Veness et al.2012]. A state pseudocount is then derived from this density. A subsequent extension of this work replaces the CTS density model with a neural network [Ostrovski et al.2017]. Another recent proposal uses locality-sensitive hashing (LSH) to cluster similar states, and the number of visited states in a cluster serves as a generalised visit-count [Tang et al.2016]. As in the MBIE-EB algorithm, these counts are used to compute exploration bonuses. These three algorithms outperform random strategies, and are currently the leading exploration methods in large discrete domains where exploration is hard.
3 Method
Here we introduce the $\phi$-Exploration Bonus ($\phi$-EB) algorithm, which drives the agent to visit states about which it is uncertain. Following other optimistic count-based exploration algorithms, we use a (generalised) state visit-count in order to estimate the uncertainty associated with a state. A generalised count is a novelty measure that quantifies how dissimilar a state is from those already visited. Measuring novelty therefore involves choosing a similarity measure for states. Of course, states can be similar in myriad ways, but not all of these are relevant to solving the MDP. If the solution method used is value-based, then states should only be considered similar if they share the features that are determinative of value. This motivates us to construct a similarity measure that exploits the feature representation that is used for value function approximation. These features are explicitly designed to be relevant for estimating value. If they were not, they would not permit a good approximation to the true value function. This sets our method apart from the approaches described in section 2.3. They measure novelty with respect to a separate, exploration-specific representation of the state space, one that bears no relation to the value function or the reward structure of the MDP. We argue that measuring novelty in feature space is a simpler and more principled approach, and hypothesise that more efficient exploration will result.
3.1 A Visit-Density over Feature Space
Our exploration method is designed for use with LFA, and measures novelty with respect to a fixed feature representation of the state space. The challenge is to measure novelty without computing the distance between each new feature vector and those in the history. That approach becomes infeasible because the cost of computing these distances grows with the size of the history.
Our method constructs a density model over feature space that assigns higher probability to states that share more features with more frequently observed states. Let $\phi : \mathcal{S} \to \mathcal{T} \subseteq \mathbb{R}^M$ be the feature mapping from the state space into an $M$-dimensional feature space $\mathcal{T}$. Let $\phi_t = \phi(s_t)$ denote the state feature vector observed at time $t$. We denote the sequence of observed feature vectors after $t$ timesteps by $\phi_{1:t}$, and denote the set of all finite sequences of feature vectors by $\mathcal{T}^*$. Let $\phi_{1:t}\phi$ denote the sequence where $\phi_{1:t}$ is followed by $\phi$. The $i$-th element of $\phi$ is denoted by $\phi_i$, and the $i$-th element of $\phi_t$ is $\phi_{t,i}$.
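Definition 1 (Feature Visit-Density).
Let $\rho : \mathcal{T}^* \times \mathcal{T} \to [0,1]$ be a density model that maps a finite sequence of feature vectors $\phi_{1:t} \in \mathcal{T}^*$ to a probability distribution over $\mathcal{T}$. The feature visit-density $\rho_t(\phi)$ at time $t$ is the distribution over $\mathcal{T}$ that is returned by $\rho$ after observing $\phi_{1:t}$.
We construct our feature visit-density as a product of independent factor distributions $\rho_t^i(\phi_i)$ over individual features $\phi_i \in \mathcal{U} \subseteq \mathbb{R}$:
$$\rho_t(\phi) = \prod_{i=1}^{M} \rho_t^i(\phi_i).$$
If $\mathcal{U}$ is countable we can use a count-based estimator for the factor models $\rho_t^i(\phi_i)$, such as the empirical estimator $\rho_t^i(\phi_i) = \frac{N_t(\phi_i)}{t}$, where $N_t(\phi_i)$ is the number of times $\phi_i$ has occurred. In our implementation we use the Krichevsky-Trofimov (KT) estimator $\rho_t^i(\phi_i) = \frac{N_t(\phi_i) + \frac{1}{2}}{t + 1}$.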
This density model induces a similarity measure on the feature space. Loosely speaking, feature vectors that share component features are deemed similar. This enables us to use $\rho_t(\phi)$ as a novelty measure for states, by comparing the features of newly observed states to those in the history. If $\phi(s)$ has more novel component features, $\rho_t(\phi)$ will be lower. By modelling the features as independent, and using count-based estimators as factor models, our method learns reasonable novelty estimates from very little data.
Example.
Suppose we use a 3-D binary feature map and that after 3 timesteps the history of observed feature vectors is $\phi_{1:3}$. Let us estimate the feature visit-densities of two unobserved feature vectors $\phi'$ and $\phi''$. Using the KT estimator for the factor models, we obtain $\rho_3(\phi')$ and $\rho_3(\phi'')$. Note that $\rho_3(\phi') > \rho_3(\phi'')$ because the component features of $\phi'$ are more similar to those in the history. As desired, our novelty measure generalises across the state space.
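To make this kind of calculation concrete, here is a small sketch that evaluates the factorised KT density exactly with rational arithmetic. The history and the two query vectors are hypothetical values chosen for illustration and are not necessarily those of the example above; the function names are likewise assumptions.

```python
from fractions import Fraction


def kt_factor(count, t):
    """KT estimate (count + 1/2) / (t + 1) for a single feature value."""
    return (Fraction(count) + Fraction(1, 2)) / (t + 1)


def feature_visit_density(history, phi):
    """Product of per-feature KT factors for the feature vector phi."""
    t = len(history)
    density = Fraction(1)
    for i, value in enumerate(phi):
        # Number of past vectors whose i-th feature equals phi's i-th feature.
        count = sum(1 for past in history if past[i] == value)
        density *= kt_factor(count, t)
    return density


# Hypothetical history: the same vector observed three times.
history = [(0, 1, 0), (0, 1, 0), (0, 1, 0)]
near = (1, 1, 0)  # differs from the history in one feature
far = (1, 0, 1)   # differs from the history in all three features

rho_near = feature_visit_density(history, near)  # 1/8 * 7/8 * 7/8 = 49/512
rho_far = feature_visit_density(history, far)    # 1/8 * 1/8 * 1/8 = 1/512
```

Neither query vector has been observed, yet the one that shares two features with the history receives a density 49 times larger, which is the generalisation the factorised model is meant to provide.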
3.2 The $\phi$-pseudocount
Here we adopt a recently proposed method for computing generalised visit-counts from density models [Bellemare et al.2016, Ostrovski et al.2017]. By analogy with these pseudocounts, we derive two $\phi$-pseudocounts from our feature visit-density.
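Definition 2 ($\phi$-pseudocount and Naive $\phi$-pseudocount).
Let $\rho_t(\phi)$ be the feature visit-density after observing $\phi_{1:t}$. Let $\rho'_t(\phi)$ denote the same density model after $\phi_{1:t}\phi$ has been observed.

The naive $\phi$-pseudocount $\tilde{N}^\phi_t(s)$ for a state $s \in \mathcal{S}$ at time $t$ is
$$\tilde{N}^\phi_t(s) = t \cdot \rho_t(\phi(s)).$$

The $\phi$-pseudocount $\hat{N}^\phi_t(s)$ for a state $s \in \mathcal{S}$ at time $t$ is
$$\hat{N}^\phi_t(s) = \frac{\rho_t(\phi(s)) \left(1 - \rho'_t(\phi(s))\right)}{\rho'_t(\phi(s)) - \rho_t(\phi(s))}.$$
Empirically, $\hat{N}^\phi_t(s)$ is usually larger than $\tilde{N}^\phi_t(s)$ and leads to better performance.^2
^2 The expression for $\hat{N}^\phi_t(s)$ is derived by letting it depend on an implicit total pseudocount $\hat{n}$ that can be much larger than $t$, and assuming $\rho_t(\phi) = \hat{N}^\phi_t(s) / \hat{n}$, and $\rho'_t(\phi) = (\hat{N}^\phi_t(s) + 1) / (\hat{n} + 1)$ [Bellemare et al.2016].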
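The pseudocount expression follows from a short piece of algebra. The derivation below is a sketch under the two assumptions used by [Bellemare et al.2016]: the density before the extra observation equals the pseudocount divided by an implicit total count, and observing the vector once more raises both by one. The shorthand $\rho$, $\rho'$, $\hat{N}$ and $\hat{n}$ is introduced here for brevity.

```latex
% Assumptions: \rho = \hat{N} / \hat{n} and \rho' = (\hat{N} + 1) / (\hat{n} + 1).
\begin{align*}
\hat{n} &= \frac{\hat{N}}{\rho}
  && \text{(from the first assumption)} \\
\rho' \left( \frac{\hat{N}}{\rho} + 1 \right) &= \hat{N} + 1
  && \text{(substitute into the second)} \\
\hat{N} \left( \frac{\rho'}{\rho} - 1 \right) &= 1 - \rho' \\
\hat{N} &= \frac{\rho \, (1 - \rho')}{\rho' - \rho}
\end{align*}
```

The result is well defined whenever $\rho' > \rho$, that is, whenever observing a feature vector once more increases its density.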
3.3 Reinforcement Learning with $\phi$-EB
Following traditional count-based exploration algorithms, we drive optimistic exploration by computing a bonus from the $\phi$-pseudocount.
Definition 3 ($\phi$-Exploration Bonus).
Let $\beta \in \mathbb{R}$ be a free parameter. The $\phi$-exploration bonus for a state-action pair $(s,a) \in \mathcal{S} \times \mathcal{A}$ at time $t$ is
$$\mathcal{R}^\phi_t(s,a) = \frac{\beta}{\sqrt{\hat{N}^\phi_t(s)}}.$$
As in the MBIE-EB algorithm, this bonus is added to the reward $r_t$. The agent is trained on the augmented reward $r^+_t = r_t + \mathcal{R}^\phi_t(s,a)$ using any value-based RL algorithm with LFA. At each timestep our algorithm performs updates for at most $M$ estimators, one for each feature. The cost of our method is therefore independent of the size of the state-action space, and scales only in the number of features. If the feature vectors are sparse, we can maintain a single prototype estimator for all the features that have not yet been observed. Under these conditions our method scales only in the number of observed features.
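The bonus computation described in this section can be sketched as follows for dense binary feature vectors. This is a minimal illustration rather than the authors' implementation: the class name `PhiExplorationBonus`, its methods, and the choice to work in log space (to avoid underflow when the number of features is large) are all assumptions. It uses KT factor estimators and the pseudocount $\rho(1-\rho')/(\rho'-\rho)$, rewritten as $(1-\rho')/(\rho'/\rho - 1)$.

```python
import math


class PhiExplorationBonus:
    """Exploration bonus from a factorised KT density over binary features."""

    def __init__(self, num_features, beta=0.05):
        self.beta = beta
        self.t = 0                        # number of feature vectors observed
        self.ones = [0] * num_features    # times each feature was equal to 1

    def _log_density(self, phi, extra):
        """Log of the KT density of phi; extra=1 pretends phi was seen once more."""
        total = 0.0
        for value, n_ones in zip(phi, self.ones):
            count = n_ones if value == 1 else self.t - n_ones
            total += math.log((count + extra + 0.5) / (self.t + extra + 1))
        return total

    def pseudocount(self, phi):
        log_rho = self._log_density(phi, extra=0)
        log_rho_next = self._log_density(phi, extra=1)
        # (1 - rho') / (rho'/rho - 1), algebraically equal to rho(1-rho')/(rho'-rho).
        return (1.0 - math.exp(log_rho_next)) / math.expm1(log_rho_next - log_rho)

    def bonus(self, phi):
        return self.beta / math.sqrt(self.pseudocount(phi))

    def observe(self, phi):
        """Update the per-feature counts with a newly visited feature vector."""
        for i, value in enumerate(phi):
            self.ones[i] += value
        self.t += 1


eb = PhiExplorationBonus(num_features=3)
for _ in range(3):
    eb.observe((0, 1, 0))
familiar, near, far = eb.bonus((0, 1, 0)), eb.bonus((1, 1, 0)), eb.bonus((1, 0, 1))
```

In use, an agent would call `bonus` on the current feature vector, add the result to the environment reward, and then call `observe`. Each call touches one counter per feature, so the cost depends on the number of features and not on the size of the state space.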
4 Theoretical Results
Here we formalise the comments made in section 3.1 by proving a bound that relates our pseudocount to an appropriate similarity measure. To simplify the analysis, we prove results for the naive $\phi$-pseudocount $\tilde{N}^\phi_t(s)$, though we expect analogous results to hold for $\hat{N}^\phi_t(s)$ as well. We use the empirical estimator for the factor models in the visit-density. Since the feature set we use in our implementation is binary, our analysis assumes $\phi \in \{0,1\}^M$. We begin by defining a similarity measure for binary feature vectors, and prove two lemmas.
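Definition 4 (Hamming Similarity for Binary Vectors).
Let $\phi, \phi' \in \{0,1\}^M$ be $M$-length binary vectors. The Hamming similarity between $\phi$ and $\phi'$ is $\mathrm{Sim}(\phi, \phi') = 1 - \frac{1}{M} \lVert \phi - \phi' \rVert_1$.
Note that $\mathrm{Sim}(\phi, \phi') \in [0,1]$ for all $\phi, \phi' \in \{0,1\}^M$. The Hamming similarity is large if $\phi$ and $\phi'$ share features (i.e. if the $\ell_1$-distance between them is small). We now prove a lemma relating the joint probability of a feature vector to the sum of the probabilities of its factors.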
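Lemma 1 (AM-GM Inequality and Factorised $\rho$).
Let $\phi \in \{0,1\}^M$, and let $\rho_t(\phi) = \prod_{i=1}^{M} \rho_t^i(\phi_i)$. Then $\rho_t(\phi) \le \frac{1}{M} \sum_{i=1}^{M} \rho_t^i(\phi_i)$.
Proof.
By the inequality of arithmetic and geometric means, and because the product of the factors lies in $[0,1]$,
$$\rho_t(\phi) = \prod_{i=1}^{M} \rho_t^i(\phi_i) \le \left( \prod_{i=1}^{M} \rho_t^i(\phi_i) \right)^{1/M} \le \frac{1}{M} \sum_{i=1}^{M} \rho_t^i(\phi_i).$$
∎
The following lemma relates the probability of an individual feature to its $\ell_1$-distance from previously observed values.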
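Lemma 2 (Feature Visit-Density and $\ell_1$-distance).
Let $\rho_t^i(\phi_i) = \frac{1}{t} N_t(\phi_i)$. Then for all $\phi_i \in \{0,1\}$, $\rho_t^i(\phi_i) = 1 - \frac{1}{t} \sum_{k=1}^{t} |\phi_i - \phi_{k,i}|$.
Proof.
Suppose $\phi_i = 1$:
$$\rho_t^i(1) = \frac{1}{t} N_t(1) = \frac{1}{t} \sum_{k=1}^{t} \phi_{k,i} = \frac{1}{t} \sum_{k=1}^{t} \left(1 - |1 - \phi_{k,i}|\right) = 1 - \frac{1}{t} \sum_{k=1}^{t} |\phi_i - \phi_{k,i}|.$$
The case $\phi_i = 0$ follows by an almost identical argument. ∎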
The following theorem and its corollary are the major results of this section. These connect the Hamming similarity (to previously observed feature vectors) with both the feature visit-density and the $\phi$-pseudocount. We show that a state which shares few features with those in the history will be assigned low probability by our density model, and will therefore have a low $\phi$-pseudocount.
Theorem 1 (Feature Visit-Density and Average Similarity).
Let $s \in \mathcal{S}$ be a state with binary feature representation $\phi = \phi(s) \in \{0,1\}^M$, and let $\rho_t(\phi)$ be its feature visit-density at time $t$. Then
$$\rho_t(\phi) \le \frac{1}{t} \sum_{k=1}^{t} \mathrm{Sim}(\phi, \phi_k).$$
We immediately get a similar bound for the naive $\phi$-pseudocount $\tilde{N}^\phi_t(s)$.
Corollary 1 ($\phi$-pseudocount and Total Similarity).
$$\tilde{N}^\phi_t(s) \le \sum_{k=1}^{t} \mathrm{Sim}(\phi, \phi_k).$$
$\tilde{N}^\phi_t(s)$ therefore captures an intuitive relation between novelty and similarity to visited states. By visiting a state that minimises the $\phi$-pseudocount, an agent also minimises a lower bound on its Hamming similarity to previously visited states. As desired, we have a novelty measure that is closely related to the distances between states in feature space, but which obviates the cost of computing those distances directly.
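The relationship between density and similarity can be checked numerically. The sketch below assumes the empirical factor estimator (count divided by $t$) and takes the Hamming similarity to be one minus the normalised $\ell_1$-distance; it then verifies on random binary data that the factorised density never exceeds the average similarity to the history. The dimensions, seed and function names are arbitrary choices made for illustration.

```python
import random


def empirical_density(history, phi):
    """Product over features of the fraction of past vectors that agree with phi."""
    t = len(history)
    density = 1.0
    for i, value in enumerate(phi):
        density *= sum(1 for past in history if past[i] == value) / t
    return density


def hamming_similarity(phi, other):
    """1 - (l1 distance) / M for two binary vectors of length M."""
    return 1.0 - sum(abs(a - b) for a, b in zip(phi, other)) / len(phi)


def average_similarity(history, phi):
    return sum(hamming_similarity(phi, past) for past in history) / len(history)


rng = random.Random(0)
M, t = 8, 50
history = [tuple(rng.randint(0, 1) for _ in range(M)) for _ in range(t)]
queries = [tuple(rng.randint(0, 1) for _ in range(M)) for _ in range(200)]

# Small tolerance guards against floating-point rounding only.
bound_holds = all(
    empirical_density(history, q) <= average_similarity(history, q) + 1e-12
    for q in queries
)
```

Multiplying both sides by $t$ gives the corresponding statement for the naive pseudocount and the total similarity.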
5 Empirical Evaluation
Our evaluation is designed to answer the following research questions:

Is a novelty measure derived from the features used for LFA a good way to generalise state visit-counts?

Does $\phi$-EB produce improvement across a range of environments, or only if rewards are sparse?

Can $\phi$-EB with LFA compete with the state-of-the-art in exploration and deep RL?
5.1 Setup
We evaluate our algorithm on five games from the Arcade Learning Environment (ALE), which has recently become a standard high-dimensional benchmark for RL [Bellemare et al.2013]. The reward signal is computed from the game score. The raw state is a frame of video (a 160×210 array of 7-bit pixels). There are 18 available actions. The ALE is a particularly interesting testbed in our context, because the difficulty of exploration varies greatly between games. Random strategies often work well, and it is in these games that Deep Q-Networks (DQN) with $\epsilon$-greedy is able to achieve so-called human-level performance [Mnih et al.2015]. In others, however, DQN with $\epsilon$-greedy does not improve upon a random policy, and its inability to explore efficiently is one of the key determinants of this failure [Osband et al.2016a]. We chose five of these games where exploration is hard. Three of the chosen games have sparse rewards (Montezuma's Revenge, Venture, Freeway) and two have dense rewards (Frostbite, Q*bert).^3
^3 Note that our experimental evaluation uses the stochastic version of the ALE [Bellemare et al.2013].
Evaluating agents in the ALE is computationally demanding. We chose to focus more resources on Montezuma's Revenge and Venture, for two reasons: (1) we hypothesise that $\phi$-EB will produce more improvement in sparse reward games, and (2) leading algorithms with which we seek to compare $\phi$-EB have also focused on these games. We conducted five independent learning trials for Montezuma's Revenge and Venture, and two trials for the remaining three games. All agents were trained for 100 million frames on the no-op metric [Bellemare et al.2013]. Trained agents were then evaluated for 500 episodes; Table 1 reports the average evaluation score.
We implement Algorithm 1 using Sarsa($\lambda$) with replacing traces and LFA as our RL method, because it is less likely to diverge than Q-learning [Sutton and Barto1998]. To implement LFA in the ALE we use the Blob-PROST feature set presented in [Liang et al.2016]. To date this is the best performing feature set for LFA in the ALE. The parameters for the Sarsa($\lambda$) algorithm are set to the same values as in [Liang et al.2016]. Hereafter we refer to our algorithm as Sarsa-$\phi$-EB. To conduct a controlled investigation of the effectiveness of $\phi$-EB, we also evaluate a baseline implementation of Sarsa($\lambda$) with the same features but with $\epsilon$-greedy exploration (which we denote Sarsa-$\epsilon$). The same training and evaluation regime is used for both; learning curves are reported in Figure 1.
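For readers unfamiliar with the learner, the following is a minimal sketch of a single Sarsa($\lambda$) update with replacing traces over sparse binary features, trained on the reward plus an exploration bonus. It is an illustration under assumptions, not the configuration used in the experiments: the function name, the dictionary-based weight storage, and the default step size, discount and trace decay are hypothetical.

```python
def sarsa_lambda_step(theta, traces, active, next_active, reward, bonus,
                      alpha=0.1, gamma=0.99, lam=0.9, terminal=False):
    """Apply one Sarsa(lambda) update in place and return the TD error.

    theta and traces map feature indices to floats. `active` lists the
    indices of the binary features that are on for the current
    state-action pair, `next_active` those of the next pair.
    """
    q = sum(theta.get(i, 0.0) for i in active)
    q_next = 0.0 if terminal else sum(theta.get(i, 0.0) for i in next_active)
    # The learner sees the augmented reward: environment reward plus bonus.
    delta = (reward + bonus) + gamma * q_next - q

    # Decay all existing traces, then reset the active ones to 1 (replacing traces).
    for i in list(traces):
        traces[i] *= gamma * lam
    for i in active:
        traces[i] = 1.0

    for i, trace in traces.items():
        theta[i] = theta.get(i, 0.0) + alpha * trace * delta
    return delta


theta, traces = {}, {}
delta = sarsa_lambda_step(theta, traces, active=[0, 1], next_active=[2],
                          reward=1.0, bonus=0.5, alpha=0.1, gamma=0.9, lam=0.8)
```

Because the bonus enters only through the reward, the same update can be used with or without exploration bonuses, which is what makes a controlled comparison against an $\epsilon$-greedy baseline straightforward.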
The $\beta$ coefficient in the $\phi$-exploration bonus was set to 0.05 for all games, after a coarse parameter search. This search was performed once, across a range of ALE games, and a value was chosen for which the agent achieved good scores in most games.
5.2 Results
Comparison with $\epsilon$-greedy Baseline
Figure 1: Learning curves. Dashed lines are min/max scores. Shaded regions describe one standard deviation.
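Table 1: Average evaluation scores.
| Algorithm | Venture | Montezuma's Revenge | Freeway | Frostbite | Q*bert |
|---|---|---|---|---|---|
| Sarsa-$\phi$-EB | 1169.2 | 2745.4 | 0.0 | 2770.1 | 4111.8 |
| Sarsa-$\epsilon$ | 0.0 | 399.5 | 29.9 | 1394.3 | 3895.3 |
| DDQN-PC | N/A | 3459 | N/A | N/A | N/A |
| A3C+ | 0 | 142 | 27 | 507 | 15805 |
| TRPO-Hash | 445 | 75 | 34 | 5214 | N/A |
| MP-EB | N/A | 0 | 12 | 380 | N/A |
| DDQN | 98 | 0 | 33 | 1683 | 15088 |
| DQN-PA | 1172 | 0 | 33 | 3469 | 5237 |
| Gorila | 1245 | 4 | 12 | 605 | 10816 |
| TRPO | 121 | 0 | 16 | 2869 | 7733 |
| Dueling | 497 | 0 | 0 | 4672 | 19220 |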
In Montezuma's Revenge, Sarsa-$\epsilon$ rarely leaves the first room. Its policy converges after an average of 20 million frames. Sarsa-$\phi$-EB continues to improve throughout training, visiting up to 14 rooms. The largest improvement over the baseline occurs in Venture. Sarsa-$\epsilon$ fails to score, while Sarsa-$\phi$-EB continues to improve throughout training. In Q*bert and Frostbite, the difference is less dramatic. These games have dense, well-shaped rewards that guide the agent's path through state space and elide $\epsilon$-greedy's inefficiency. Nonetheless, Sarsa-$\phi$-EB consistently outperforms Sarsa-$\epsilon$ throughout training, so its cumulative reward is much higher.
In Freeway, Sarsa-$\phi$-EB with $\beta = 0.05$ fails to match the performance of the baseline algorithm, but with a lower value of $\beta$ it performs better (Figure 1 shows the learning curve for the latter). This sensitivity to the $\beta$ parameter likely results from the large number of unique Blob-PROST features that are active in Freeway, many of which are not relevant for finding the optimal policy. If $\beta$ is too high the agent is content to stand still and receive exploration bonuses for observing new configurations of traffic. This accords with our hypothesis that efficient optimistic exploration should involve measuring novelty with respect to task-relevant features.
In summary, Sarsa-$\phi$-EB with $\beta = 0.05$ outperforms Sarsa-$\epsilon$ on all tested games except Freeway. Since both use the same feature set and RL algorithm, and differ only in their exploration policies, this is strong evidence that $\phi$-EB produces improvement over random exploration across a range of environments. This also supports our conjecture that using the same features for value function approximation and novelty estimation is an appropriate way to generalise visit-counts to the high-dimensional setting.
Comparison with Leading Algorithms
Table 1 compares our evaluation scores to Double DQN (DDQN) [van Hasselt et al.2016b], Double DQN with pseudocount (DDQN-PC) [Bellemare et al.2016], A3C+ [Bellemare et al.2016], DQN Pop-Art (DQN-PA) [van Hasselt et al.2016a], Dueling Network (Dueling) [Wang et al.2016], Gorila [Nair et al.2015], DQN with Model Prediction Exploration Bonuses (MP-EB) [Stadie et al.2015], Trust Region Policy Optimisation (TRPO) [Schulman et al.2015], and TRPO-AE-SimHash (TRPO-Hash) [Tang et al.2016]. The most interesting comparisons for our purposes are with TRPO-Hash, DDQN-PC, A3C+, and MP-EB, because these algorithms all use exploration strategies that drive the agent to reduce its uncertainty. TRPO-Hash, DDQN-PC, and A3C+ are count-based methods; MP-EB seeks high model prediction error.
Our Sarsa-$\phi$-EB algorithm achieves an average score of 2745.4 on Montezuma's Revenge: the second highest reported score. On this game it far outperforms every algorithm apart from DDQN-PC, despite only having trained for half the number of frames. Note that neither A3C+ nor TRPO-Hash achieves more than 200 points, despite their exploration strategies.
On Venture Sarsa-$\phi$-EB also achieves state-of-the-art performance. It achieves the third highest reported score despite its short training regime, and far outperforms A3C+ and TRPO-Hash. DDQN-PC evaluation scores are not given for Venture, but reported learning curves suggest Sarsa-$\phi$-EB performs much better here [Bellemare et al.2016]. The performance of Sarsa-$\phi$-EB in Frostbite also seems competitive given the shorter training regime. Nonlinear algorithms perform better in Q*bert. In Freeway Sarsa-$\phi$-EB fails to score any points, for reasons already discussed.
6 Conclusion
We have introduced the $\phi$-Exploration Bonus method, a count-based optimistic exploration strategy that scales to high-dimensional environments. It is simpler to implement and less computationally demanding than some other proposals. Our evaluation shows that it improves upon $\epsilon$-greedy exploration on a variety of games, and that it is even competitive with leading exploration techniques developed for deep RL. Unlike other methods, it does not require the design of an exploration-specific state representation, but rather exploits the features used in the approximate value function. We have argued that computing novelty with respect to these task-relevant features is an efficient and principled way to generalise visit-counts for exploration. We conclude by noting that this reliance on the feature representation used for LFA is also a limitation. It is not obvious how a method like ours could be combined with the nonlinear function approximation techniques that have driven recent progress in RL. We hope the success of our simple method will inspire future work in this direction.
References

[Bellemare et al.2013] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
[Bellemare et al.2016] Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos. Unifying count-based exploration and intrinsic motivation. CoRR, abs/1606.01868, 2016.
[Kakade2003] Sham Machandranath Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.
[Kolter and Ng2009] J Zico Kolter and Andrew Y Ng. Near-Bayesian exploration in polynomial time. International Conference on Machine Learning, pages 513–520, 2009.
[Lai and Robbins1985] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[Liang et al.2016] Yitao Liang, Marlos C Machado, Erik Talvitie, and Michael Bowling. State of the art control of Atari games using shallow reinforcement learning. In Autonomous Agents and Multi-Agent Systems, 2016.
[Mnih et al.2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[Mnih et al.2016] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.
[Nair et al.2015] Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, Shane Legg, Volodymyr Mnih, Koray Kavukcuoglu, and David Silver. Massively parallel methods for deep reinforcement learning. CoRR, abs/1507.04296, 2015.
[Osband et al.2016a] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. CoRR, abs/1602.04621, 2016.
[Osband et al.2016b] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. International Conference on Machine Learning, pages 1–26, 2016.
[Ostrovski et al.2017] Georg Ostrovski, Marc G. Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with neural density models. CoRR, abs/1703.01310, 2017.
[Schulman et al.2015] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.
[Stadie et al.2015] Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. CoRR, abs/1507.00814, 2015.
[Strehl and Littman2004] A. L. Strehl and M. L. Littman. An empirical evaluation of interval estimation for Markov decision processes. In 16th IEEE International Conference on Tools with Artificial Intelligence, pages 128–135, 2004.
[Strehl and Littman2008] Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
[Strehl et al.2009] Alexander L Strehl, Lihong Li, and Michael L Littman. Reinforcement Learning in Finite MDPs: PAC Analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.
[Sutton and Barto1998] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. IEEE Transactions on Neural Networks, 9(5):1054–1054, 1998.
[Sutton1990] Richard S. Sutton. Integrated architecture for learning, planning, and reacting based on approximating dynamic programming. In International Conference on Machine Learning, pages 216–224, 1990.
[Tang et al.2016] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. CoRR, abs/1611.04717, 2016.
[van Hasselt et al.2016a] Hado van Hasselt, Arthur Guez, Matteo Hessel, and David Silver. Learning values across many orders of magnitude. CoRR, abs/1602.07714, 2016.
[van Hasselt et al.2016b] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI, 2016.
[Veness et al.2012] Joel Veness, Kee Siong Ng, Marcus Hutter, and Michael Bowling. Context tree switching. In IEEE Data Compression Conference, pages 327–336, 2012.
[Wang et al.2016] Ziyu Wang, Nando de Freitas, Tom Schaul, Matteo Hessel, Hado van Hasselt, and Marc Lanctot. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, 2016.