Clustered Reinforcement Learning

06/06/2019
by   Xiao Ma, et al.
Nanjing University

Exploration strategy design is one of the challenging problems in reinforcement learning (RL), especially when the environment contains a large state space or sparse rewards. During exploration, the agent tries to discover novel areas or high reward (quality) areas. In most existing methods, the novelty and quality in the neighboring area of the current state are not well utilized to guide the exploration of the agent. To tackle this problem, we propose a novel RL framework, called clustered reinforcement learning (CRL), for efficient exploration in RL. CRL adopts clustering to divide the collected states into several clusters, based on which a bonus reward reflecting both novelty and quality in the neighboring area (cluster) of the current state is given to the agent. Experiments on a continuous control task and several Atari 2600 games show that CRL can outperform other state-of-the-art methods, achieving the best performance in most cases.


1 Introduction

Reinforcement learning (RL) [29] studies how an agent can maximize its cumulative reward in an unknown environment, by learning through exploration and by exploiting the collected experience. A key challenge in RL is to balance exploration and exploitation. If the agent only explores novel states, it might never find enough rewards to guide the learning direction. Conversely, if the agent exploits rewards too intensely, it might converge to suboptimal behaviors and have fewer opportunities to discover higher rewards through exploration.

Although reinforcement learning, especially deep RL (DRL), has recently attracted much attention and achieved significant performance in a variety of applications, such as game playing [20, 25] and robot navigation [33], exploration techniques in RL are far from satisfactory in many cases. Exploration strategy design is still one of the challenging problems in RL, especially when the environment has a large state space or sparse rewards. Hence, exploration strategy design has become a hot research topic, and many exploration methods have been proposed in recent years.

Some heuristic methods for exploration, such as $\epsilon$-greedy [25, 29], uniform sampling [20] and i.i.d./correlated Gaussian noise [19, 24], try to directly obtain more diverse samples [5] during exploration. For hard applications or games, these heuristic methods are insufficient, and the agent needs exploration techniques that can incorporate meaningful information about the environment.

In recent years, some exploration strategies have tried to discover novel state areas for exploration. The most direct way to measure novelty is to use counts. In [5, 21], a pseudo-count is estimated from a density model. The hash-based method [30] records the visits of hash codes as counts. There also exist some approximate ways of computing counts [15, 6, 3, 12, 28, 18, 13]. Besides counts, state novelty can also be measured by empowerment [17], the agent's belief of environment dynamics [14], the prediction error of a system dynamics model [22, 26], prediction by an exemplar model [11], and the error of predicting features of states [7].

All the above methods perform exploration mainly based on the novelty of states, without considering the quality of states. Furthermore, in most existing methods, the novelty and quality in the neighboring area of the current state are not well utilized to guide the exploration of the agent. To tackle this problem, we propose a novel RL framework, called clustered reinforcement learning (CRL), for efficient exploration in RL. The contributions of CRL are briefly outlined as follows:

  • CRL adopts clustering to divide the collected states into several clusters. The states from the same cluster have similar features. Hence, the clustered results in CRL provide a possibility to share meaningful information among different states from the same cluster.

  • CRL proposes a novel bonus reward, which reflects both novelty and quality in the neighboring area of the current state. Here, the neighboring area is defined by the states which share the same cluster with the current state. This bonus reward can guide the agent to perform efficient exploration, by seamlessly integrating novelty and quality of states.

  • Experiments on a continuous control task and several Atari 2600 [4] games with sparse rewards show that CRL can outperform other state-of-the-art methods, achieving the best performance in most cases. In particular, on several games known to be hard for heuristic exploration strategies, CRL achieves significant improvement over the baselines.

The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 presents the details of CRL, including the clustering algorithm and the clustering-based bonus reward. Section 4 presents the experimental results and analysis. Section 5 concludes the paper.

2 Related Work

In the tabular setting, the number of state-action pairs is finite, so a decreasing function of the true visitation count can directly be used as the exploration bonus. MBIE-EB [27] adds a bonus inversely proportional to the square root of the state-action visitation count to the augmented Bellman equation, which encourages exploration of less visited pairs with a theoretical guarantee.
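
For reference, the MBIE-EB exploration bonus augments the Bellman equation as follows (a restatement of the standard form in [27], with $n(s,a)$ the visitation count of the state-action pair and $\beta$ a constant):

$$\tilde{Q}(s,a) = \hat{R}(s,a) + \gamma \sum_{s'} \hat{T}(s' \mid s,a) \max_{a'} \tilde{Q}(s',a') + \frac{\beta}{\sqrt{n(s,a)}}.$$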

In finite MDPs, E$^3$ [15], R-Max [6] and UCRL [3] all make use of state-action counts and are motivated by the idea of optimism under uncertainty. E$^3$ [15] decides online whether to explore or exploit in order to obtain an efficient learning policy. R-Max [6] optimistically assumes the maximum possible reward for insufficiently visited state-action pairs and learns the optimal policy in a fictitious model. UCRL [3] chooses an optimistic policy by using upper confidence bounds. Bayesian RL methods maintain a belief distribution over possible MDPs [12, 28, 18, 13] and use counts to guide exploration.

In continuous and high-dimensional spaces, the number of states is too large to be counted directly. In [5, 21], the exploration bonus is designed based on a state pseudo-count, which is estimated from a density model. In the hash-based method [30], a hash function encodes states into hash codes, and a bonus inversely proportional to the square root of the visitation count of each hash code is used as the reward bonus, which performs well on some hard exploration Atari 2600 games. The hash-based method is limited by the hash function. Static hashing, using locality-sensitive hashing, is stable but random. Learned hashing, using an autoencoder (AE) to capture semantic features, is updated during training. A related work is [1], which records counts of cluster-center and action pairs and uses them to select an action from a Gibbs distribution given the current state.

These count-based methods motivate the agent by making use of the novelty of states but do not take quality into consideration. To the best of our knowledge, the novelty and quality in the neighboring area of the current state have not been well utilized to guide the exploration of the agent in existing methods. This motivates the work of this paper.

3 Clustered Reinforcement Learning

This section presents the details of our proposed RL framework, called clustered reinforcement learning (CRL).

3.1 Notation

In this paper, we adopt similar notation to that in [30]. More specifically, we model the RL problem as a finite-horizon discounted Markov decision process (MDP), which can be defined by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma, T)$. Here, $\mathcal{S}$ denotes the state space, $\mathcal{A}$ denotes the action space, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{\geq 0}$ denotes a transition probability distribution, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ denotes a reward function, $\rho_0$ is an initial state distribution, $\gamma \in (0, 1]$ is a discount factor, and $T$ denotes the horizon. In this paper, we assume $r(s, a) \geq 0$. For cases with negative rewards, we can transform them to cases without negative rewards. The goal of RL is to maximize $\mathbb{E}_{\pi}\big[\sum_{t=0}^{T} \gamma^{t} r(s_t, a_t)\big]$, the total expected discounted reward over a policy $\pi$.

3.2 CRL

The key idea of CRL is to adopt clustering to divide the collected states into several clusters, and then design a novel cluster-based bonus reward for exploration. The algorithmic framework of CRL is shown in Algorithm 1, from which we can find that CRL is actually a general framework. We can get different RL variants by taking different clustering algorithms and different policy updating algorithms. The details of Algorithm 1 are presented in the following subsections, including clustering and clustering-based bonus reward.

  Initialize the number of clusters $k$, bonus coefficient $\beta$, count coefficient $\eta$
  for each iteration do
     Collect a set of state-action samples $\{(s_i, a_i, r_i)\}_{i=1}^{M}$ with policy $\pi$;
     Cluster the state samples $\{s_i\}_{i=1}^{M}$ into $k$ clusters with centers $\{C_j\}_{j=1}^{k} = f(\{s_i\}_{i=1}^{M}, k)$, where $f(\cdot)$ is some clustering algorithm;
     Compute the cluster assignment $\phi(s_i)$ for each state $s_i$;
     Compute the sum of rewards $R_j$ using (1) and the number of states $N_j$ using (2), for $j = 1, \ldots, k$;
     Compute the bonus $b(s_i)$ using (3);
     Update the policy $\pi$ using the rewards $\{r_i + b(s_i)\}_{i=1}^{M}$ with some policy updating algorithm;
  end for
Algorithm 1 Framework of Clustered Reinforcement Learning (CRL)
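
The sketch below illustrates one iteration of this framework in Python. It is a minimal sketch, not the authors' implementation: collect_samples and update_policy are hypothetical placeholders for the sampling and TRPO-style update steps, scikit-learn's k-means stands in for the clustering algorithm $f(\cdot)$, and the bonus follows the reconstructed form of (3) given in Section 3.2.2.

```python
import numpy as np
from sklearn.cluster import KMeans

def crl_iteration(policy, collect_samples, update_policy, k=16, beta=0.01, eta=1e-4):
    """One CRL iteration (sketch): cluster states, add a cluster-based bonus, update the policy."""
    # 1. Collect a batch of samples with the current policy.
    #    states: (M, d) array, rewards: (M,) array of non-negative rewards.
    states, actions, rewards = collect_samples(policy)

    # 2. Cluster the collected states into k clusters (k-means used for illustration).
    kmeans = KMeans(n_clusters=k, n_init=10).fit(states)
    phi = kmeans.labels_                      # cluster assignment phi(s_i) for each sample

    # 3. Per-cluster statistics: summed reward R_j (Eq. 1) and visit count N_j (Eq. 2).
    R = np.zeros(k)
    N = np.zeros(k)
    for j in range(k):
        mask = (phi == j)
        R[j] = rewards[mask].sum()
        N[j] = mask.sum()

    # 4. Cluster-based bonus (Eq. 3, as reconstructed): zero if the batch collected no reward.
    if R.sum() > 0:
        bonus = beta * (R[phi] + eta) / np.maximum(N[phi], 1)
    else:
        bonus = np.zeros(len(states))

    # 5. Update the policy (e.g., with TRPO) using the augmented rewards.
    return update_policy(policy, states, actions, rewards + bonus)
```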

3.2.1 Clustering

Intuitively, both novelty and quality are useful for exploration strategy design. If the agent only cares about novelty, it might explore intensively in unexplored areas without any reward. If the agent only cares about quality, it might converge to suboptimal behaviors and have little opportunity to discover unexplored areas with higher rewards. Hence, it is better to integrate both novelty and quality into the same exploration strategy.

We find that clustering can provide the possibility to integrate both novelty and quality together. Intuitively, a cluster of states can be treated as an area. The number of collected states in a cluster reflects the count (novelty) information of that area. The average reward of the collected states in a cluster reflects the quality of that area. Hence, based on the clustered results, we can design an exploration strategy considering both novelty and quality. Furthermore, the states from the same cluster have similar features, and hence the clustered results provide a possibility to share meaningful information among different states from the same cluster. The details of exploration strategy design based on clustering will be left to the following subsection. Here, we only describe the clustering algorithm.

In CRL, we perform clustering on states. Assume the number of clusters is $k$, and we have collected $M$ state-action samples $\{(s_i, a_i, r_i)\}_{i=1}^{M}$ with some policy. We need to cluster the collected states into $k$ clusters by using some clustering algorithm $f(\cdot)$: $\{C_j\}_{j=1}^{k} = f(\{s_i\}_{i=1}^{M}, k)$, where $C_j$ is the center of the $j$th cluster. We can use any clustering algorithm in the CRL framework. Although more sophisticated clustering algorithms might be able to achieve better performance, in this paper we just choose the k-means algorithm [9] for illustration. K-means is one of the simplest clustering algorithms and has wide applications. The details of k-means are omitted here, and readers can find them in most machine learning textbooks.
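
To make the clustering step concrete for high-dimensional (e.g., pixel) states, the sketch below flattens a batch of preprocessed frames and fits mini-batch k-means. This is only an illustration: the frame shape, the use of scikit-learn's MiniBatchKMeans, and the preprocessing are assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def cluster_states(frames, k=16, seed=0):
    """Cluster a batch of preprocessed frames into k clusters.

    frames: array of shape (M, H, W); returns (cluster centers, assignments phi(s_i)).
    """
    X = frames.reshape(len(frames), -1).astype(np.float32)  # flatten each frame to a vector
    km = MiniBatchKMeans(n_clusters=k, random_state=seed, n_init=10).fit(X)
    return km.cluster_centers_, km.labels_

# Example with random "frames" standing in for grayscale game screens.
frames = np.random.rand(1000, 42, 42)
centers, phi = cluster_states(frames, k=16)
```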

3.2.2 Clustering-based Bonus Reward

As stated above, clustering can provide the possibility to integrate both novelty and quality together for exploration. Here, we propose a novel clustering-based bonus reward, based on which many policy updating algorithms can be adopted to get an exploration strategy considering both novelty and quality.

Given a state $s$, it will be allocated to the nearest cluster by the cluster assignment function $\phi(s) = \arg\min_{j \in \{1, \ldots, k\}} d(s, C_j)$. Here, $d(s, C_j)$ denotes the distance between $s$ and the $j$th cluster center $C_j$. The sum of rewards in the $j$th cluster is denoted as $R_j$, which can be computed as follows:

$$R_j = \sum_{i=1}^{M} r_i \cdot \mathbb{I}[\phi(s_i) = j], \qquad (1)$$

where $\mathbb{I}[\cdot]$ is an indicator function. $R_j$ is also called the cluster reward of cluster $j$ in this paper. The number of states in the $j$th cluster is denoted as $N_j$, which can be computed as follows:

$$N_j = \sum_{i=1}^{M} \mathbb{I}[\phi(s_i) = j]. \qquad (2)$$

Intuitively, a larger $N_j$ typically means that the area corresponding to cluster $j$ has more visits (exploration), which implies the novelty of this area is lower. Hence, the bonus reward should be inversely proportional to $N_j$. The average reward of cluster $j$, $R_j / N_j$, can be used to represent the quality of the corresponding area of cluster $j$. Hence, the bonus reward should be proportional to this average reward.

With the above intuition, we propose a clustering-based bonus reward to integrate both novelty and quality of the neighboring area of the current state $s$, which is defined as follows:

$$b(s) = \begin{cases} \beta \cdot \dfrac{R_{\phi(s)} + \eta}{N_{\phi(s)}}, & \text{if } \sum_{j=1}^{k} R_j > 0, \\[4pt] 0, & \text{if } \sum_{j=1}^{k} R_j = 0, \end{cases} \qquad (3)$$

where $\beta > 0$ is the bonus coefficient and $\eta > 0$ is the count (novelty) coefficient. Typically, $\eta$ is set to a very small number. There are two cases, depending on whether the collected batch contains any reward:

Please note that in this paper, we assume $r_i \geq 0$, so $\sum_{j=1}^{k} R_j \geq 0$.

In the first case, $\sum_{j=1}^{k} R_j > 0$, which means that the current policy can get some rewards in some states. Hence, the states with rewards can share meaningful information with other states from the same cluster. Please note that $\sum_{j=1}^{k} R_j > 0$ only means that there exist some clusters with positive cluster reward; it does not mean that all clusters have positive cluster reward. It is possible that all states in some clusters have zero reward. Since $\eta$ is typically set to be a very small positive number, as long as there exist one or two states with positive rewards in cluster $j$, $R_j$ will be larger than $\eta$. Hence, if $R_j \leq \eta$, it is highly possible that all states in cluster $j$ have zero reward. When $R_j = 0$, which means no rewards have been obtained in cluster $j$, the bonus reward should be determined by the count of the cluster, and this is just what the bonus reward function in (3) does: a larger $N_{\phi(s)}$ results in a smaller bonus reward $b(s)$, which guides the agent to explore novel areas corresponding to clusters with fewer visits, which is reasonable. When $R_{\phi(s)} > 0$, typically $R_{\phi(s)} \gg \eta$. For two clusters with the same cluster reward, the cluster with the smaller number of states (higher novelty) is more likely to be explored, which is reasonable. For two clusters with the same number of states, the cluster with the higher cluster reward (higher quality) is more likely to be explored, which is also reasonable.
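
As a purely hypothetical numerical illustration of this first case, using the reconstructed form of (3) above with $\beta = 0.01$ and $\eta = 10^{-4}$: suppose three clusters have $(R_1, N_1) = (10, 100)$, $(R_2, N_2) = (10, 20)$ and $(R_3, N_3) = (0, 20)$. Then

$$b_1 = 0.01 \cdot \frac{10 + 10^{-4}}{100} \approx 1.0 \times 10^{-3}, \quad b_2 = 0.01 \cdot \frac{10 + 10^{-4}}{20} \approx 5.0 \times 10^{-3}, \quad b_3 = 0.01 \cdot \frac{0 + 10^{-4}}{20} = 5 \times 10^{-8}.$$

States in cluster 2 (same cluster reward as cluster 1 but fewer visits) receive the largest bonus, while the reward-free cluster 3 only receives a tiny count-driven bonus.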

In the second case, $\sum_{j=1}^{k} R_j = 0$, which means that the current policy has not obtained any reward, so sharing information among different states from the same cluster is not a good choice. Furthermore, the states explored by the current policy should not get any extra bonus reward. This is just what the bonus reward function in (3) does.

Hence, the clustering-based bonus reward function defined in (3) is intuitively reasonable, and it seamlessly integrates both novelty and quality into the same bonus function. Finally, the agent adopts the augmented rewards $r_i + b(s_i)$ to update the policy (perform exploration). Many policy updating algorithms, such as trust region policy optimization (TRPO) [24], can be adopted. Please note that $b(s)$ is only used for training CRL in Algorithm 1; the performance evaluation (test) is measured without $b(s)$, so the results can be directly compared with existing RL methods without an extra bonus reward.
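
To make this training/evaluation distinction explicit, the two objectives can be written as follows (a restatement in the notation of Section 3.1; the bonus only enters the training objective):

$$J_{\text{train}}(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \gamma^{t} \big(r(s_t, a_t) + b(s_t)\big)\Big], \qquad J_{\text{eval}}(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \gamma^{t} r(s_t, a_t)\Big].$$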

4 Experiments

We use a continuous control task and several Atari 2600 games to evaluate CRL and baselines. We want to investigate and answer the following research questions:

  • Is count-based exploration sufficient to drive the agent to achieve the final goal?

  • Can CRL improve performance significantly across different tasks?

  • What is the impact of hyperparameters on the performance?

Due to space limitation, the hyperparameter settings are reported in the supplementary material.

4.1 Experimental Setup

4.1.1 Environments

MuJoCo. The rllab benchmark [10] consists of various continuous control tasks for testing RL algorithms. We design a variant of MountainCar with sparse, non-negative rewards: the agent receives a positive reward only when the car escapes the valley from the right side and receives zero reward at all other positions. One snapshot of this task is shown in Figure 1 (a).

Figure 1: (a) A snapshot of MountainCar; (b) Mean average return of different algorithms on MountainCar over 5 random seeds. The solid line represents the mean average return and the shaded area represents one standard deviation.

Arcade Learning Environment. The Arcade Learning Environment (ALE) [4] is an important benchmark for RL because of its high-dimensional state space and wide variety of video games. We select five games featuring long horizons and still requiring significant exploration: Freeway, Frostbite, Gravitar, Solaris and Venture. (The Montezuma's Revenge game evaluated in [30] is not adopted for evaluation in this paper, because this paper only uses raw pixels, which are not enough for learning an effective policy on that game for most methods, including CRL and the other baselines. Advanced features could be used to learn an effective policy, but this is not the focus of this paper.) Figure 2 shows a snapshot of each game. For example, in Freeway, the agent needs to cross the road, avoiding the traffic, to get the reward on the other side of the street. These games are classified into the hard exploration category, according to the taxonomy in [5].

Figure 2: Snapshots of the five hard exploration Atari 2600 games: (a) Freeway; (b) Frostbite; (c) Gravitar; (d) Solaris; (e) Venture.

4.1.2 Baselines

CRL is a general framework which can adopt many different policy updating (optimization) algorithms to get different variants. In this paper, we only adopt trust region policy optimization (TRPO) [24] as the policy updating algorithm for CRL, and leave other variants of CRL for future work. We denote this method as TRPO-Clustering in the following. The baselines for comparison include TRPO and TRPO-Hash [30], which are also TRPO-based methods and have achieved state-of-the-art performance in many tasks.

TRPO [24] is a classic policy gradient method, which uses trust region to guarantee stable improvement of policy and can handle both discrete and continuous action space. Furthermore, this method is not too sensitive to hyperparameters. TRPO adopts a Gaussian control noise as a heuristic exploration strategy.

TRPO-Hash [30] is a hash-based method, which generalizes the classic count-based method to high-dimensional and continuous state spaces. The main idea is to use locality-sensitive hashing (LSH) [2] to encode continuous and high-dimensional data into binary hash codes of length $D$. TRPO-Hash has several variants in [30]. For a fair comparison, we choose SimHash [8] as the hash function and raw pixels as inputs for TRPO-Hash in this paper, because our CRL also adopts raw pixels rather than advanced features as inputs. TRPO-Hash is trained using the code provided by its authors.
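
For readers unfamiliar with SimHash, the sketch below shows the basic idea (signs of random projections, plus a count-based bonus of the form used in [30]); it is a minimal illustration, not the TRPO-Hash implementation.

```python
import numpy as np

class SimHash:
    """Minimal SimHash: map a continuous state vector to a D-bit binary code
    via the signs of random projections, and count visits per code."""
    def __init__(self, state_dim, code_bits=64, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((code_bits, state_dim))  # random projection matrix
        self.counts = {}

    def code(self, state):
        bits = (self.A @ np.ravel(state) > 0).astype(np.uint8)
        return bits.tobytes()                                  # hashable key for the code

    def update_and_bonus(self, state, beta=0.01):
        key = self.code(state)
        self.counts[key] = self.counts.get(key, 0) + 1
        return beta / np.sqrt(self.counts[key])                # bonus decays with visit count
```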

4.2 Disadvantage of Count-based Exploration

TRPO-Hash tries to help the agent explore more novel states, in the hope of achieving better performance than TRPO. But it might have to go through nearly all states before reaching the goal state, which is the disadvantage of count (novelty) based exploration. Here, we use MountainCar to show this disadvantage. Figure 1 (b) shows the results of TRPO, TRPO-Hash and TRPO-Clustering on MountainCar. We find that TRPO-Hash is slower than TRPO and TRPO-Clustering at finding the goal state: its curve only ascends around the middle of training, and at the end of training TRPO-Hash still fails to reach the goal, while the other methods achieve it. Our method, TRPO-Clustering, is the first to visit the goal and learn to achieve it. The reason why TRPO-Hash fails is that the novelty of states diverts the agent's attention. In the worst case, the agent visits all states before it finds the goal. This disadvantage of count-based methods becomes even more serious in high-dimensional state spaces, since it is impossible to go through all states there. Therefore, strategies with only count-based exploration are insufficient.

4.3 Performance on Atari 2600

For video games, which typically have high-dimensional and complex state spaces, advanced features like those extracted by an auto-encoder (AE) or a variational auto-encoder (VAE) [16, 23] could be used for performance improvement. But this is not the focus of this paper. Hence, we simply use raw pixels as inputs for our method and all baselines, which makes the comparison fair.

For the five Atari 2600 games, the agent is trained for 500 iterations in all experiments, with each iteration consisting of 0.4M frames. Although the previous four frames are taken into account by the policy and the baseline function, clustering and counting are performed on the latest frame only. The performance is evaluated over 5 random seeds, and the seeds are the same for TRPO, TRPO-Hash and TRPO-Clustering.

We show the training curves in Figure 3 and summarize all results in Table 1. Please note that TRPO and TRPO-Hash are trained with the code provided by the authors of TRPO-Hash. All hyperparameters are reported in the supplementary material. We also compare our results with double DQN [31], dueling network [32], A3C+ [5] and double DQN with pseudo-count [5], whose results are taken from [30].

Figure 3: Mean average return of different algorithms on the five Atari 2600 games ((a) Freeway; (b) Frostbite; (c) Gravitar; (d) Solaris; (e) Venture) over 5 random seeds. The solid line represents the mean average return and the shaded area represents one standard deviation.
Freeway Frostbite Gravitar Solaris Venture
TRPO 17.55 1229.66 500.33 2110.22 283.48
TRPO-Hash 22.29 2954.10 577.47 2619.32 299.61
TRPO-Clustering 26.68 4558.52 541.72 2976.23 523.79
Double-DQN 33.3 1683 412 3068 98.0
Dueling network 0.0 4672 588 2251 497
A3C+ 27.3 507 246 2175 0
pseudo-count 29.2 1450 - - 369
Table 1: Average total reward after training for 50M time steps.

TRPO-Clustering achieves significant improvement over TRPO and TRPO-Hash on Freeway, Frostbite, Solaris and Venture. In particular, on Frostbite, TRPO-Clustering more than triples the score of TRPO and improves on TRPO-Hash by over 50%. On Venture, TRPO-Clustering improves on both TRPO and TRPO-Hash by more than 70%. Furthermore, TRPO-Clustering can outperform all other methods in most cases. Please note that DQN-based methods reuse off-policy experience, which is an advantage over TRPO, so DQN-based methods tend to perform better than TRPO. But our TRPO-Clustering can still outperform the DQN-based methods in most cases. This suggests that the novelty and quality information aggregated over the neighboring area of states gives the on-policy agent a benefit similar to that of an off-policy experience buffer.

4.4 Hyperparameter Effect

We use Venture of Atari 2600 to study the performance sensitivity to hyperparameters, including the number of clusters $k$ in k-means, and $\beta$ and $\eta$ in the bonus reward.

We choose different $k$ from $\{8, 12, 16, 20\}$ on Venture to illustrate the effect of $k$ with $\beta$ and $\eta$ fixed. A larger $k$ divides the state space more precisely, but the per-cluster average-reward statistic becomes less reliable. A smaller $k$ mixes information from different areas, which might be too coarse for exploration. The results on Venture are summarized in Table 2. On Venture, the scores are roughly concave in $k$, peaking at around $k = 12$. We can see that the performance is not too sensitive to $k$ in a relatively large range.

k 8 12 16 20
Venture 347.86 663.32 523.79 451.94
Table 2: Effect of the number of clusters (k) on Venture

We choose $\beta$ from $\{0.01, 0.1\}$ and $\eta$ from $\{0, 0.0001, 0.001, 0.01, 0.1\}$. The results are shown in Table 3. The value of $\eta$ decides the degree of pure novelty-driven exploration. Fixing $\eta$, the performance with $\beta = 0.01$ is better than that with $\beta = 0.1$, because a large $\beta$ causes the bonus rewards to overwhelm the true rewards. Fixing $\beta$, a larger $\eta$ means that more novel states will be explored. The scores are roughly concave in $\eta$, peaking at around $\eta = 0.0001$, which shows that count-based exploration alone is insufficient.

β \ η 0 0.0001 0.001 0.01 0.1
0.01 292.39 523.79 (10e-7) 512.74 (10e-6) 279.84 (10e-5) 182.04 (10e-4)
0.1 218.44 113.12 (10e-6) 101.95 (10e-5) 81.70 (10e-4) 88.51 (10e-3)
Table 3: Effect of hyperparameters β and η on Venture, where the number in brackets is the product βη.

5 Conclusion

In this paper, we propose a novel RL framework, called clustered reinforcement learning (CRL), for efficient exploration. By using clustering, CRL provides a general framework to exploit both the novelty and the quality of the neighboring area of the current state for exploration. Experiments on a continuous control task and several hard exploration Atari 2600 games show that CRL can outperform other state-of-the-art methods, achieving the best performance in most cases.

References

  • [1] D. Abel, A. Agarwal, F. Diaz, A. Krishnamurthy, and R. E. Schapire. Exploratory gradient boosting for reinforcement learning in complex domains. CoRR, abs/1603.04119, 2016.
  • [2] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, pages 459–468, 2006.
  • [3] P. Auer and R. Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In NeurIPS, pages 49–56, 2006.
  • [4] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. JAIR, 47:253–279, 2013.
  • [5] M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. In NeurIPS, pages 1471–1479, 2016.
  • [6] R. I. Brafman and M. Tennenholtz. R-MAX - A general polynomial time algorithm for near-optimal reinforcement learning. JMLR, 3:213–231, 2002.
  • [7] Y. Burda, H. Edwards, A. J. Storkey, and O. Klimov. Exploration by random network distillation. CoRR, abs/1810.12894, 2018.
  • [8] M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, 2002.
  • [9] A. Coates and A. Y. Ng. Learning feature representations with k-means. In Neural Networks: Tricks of the Trade - Second Edition, pages 561–580. 2012.
  • [10] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In ICML, pages 1329–1338, 2016.
  • [11] J. Fu, J. D. Co-Reyes, and S. Levine. EX2: exploration with exemplar models for deep reinforcement learning. In NeurIPS, pages 2574–2584, 2017.
  • [12] M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar. Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning, 8(5-6):359–483, 2015.
  • [13] A. Guez, N. Heess, D. Silver, and P. Dayan. Bayes-adaptive simulation-based search with value function approximation. In NeurIPS, pages 451–459, 2014.
  • [14] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. D. Turck, and P. Abbeel. VIME: variational information maximizing exploration. In NeurIPS, pages 1109–1117, 2016.
  • [15] M. J. Kearns and S. P. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
  • [16] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
  • [17] A. S. Klyubin, D. Polani, and C. L. Nehaniv. Empowerment: a universal agent-centric measure of control. In CEC, pages 128–135, 2005.
  • [18] J. Z. Kolter and A. Y. Ng. Near-bayesian exploration in polynomial time. In ICML, pages 513–520, 2009.
  • [19] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016.
  • [20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [21] G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos. Count-based exploration with neural density models. In ICML, pages 2721–2730, 2017.
  • [22] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, pages 2778–2787, 2017.
  • [23] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278–1286, 2014.
  • [24] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In ICML, pages 1889–1897, 2015.
  • [25] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • [26] B. C. Stadie, S. Levine, and P. Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. CoRR, abs/1507.00814, 2015.
  • [27] A. L. Strehl and M. L. Littman. An analysis of model-based interval estimation for markov decision processes. JCSS, 74(8):1309–1331, 2008.
  • [28] Y. Sun, F. J. Gomez, and J. Schmidhuber. Planning to be surprised: Optimal bayesian exploration in dynamic environments. In AGI, pages 41–51, 2011.
  • [29] R. S. Sutton and A. G. Barto. Reinforcement learning - an introduction. Adaptive computation and machine learning. MIT Press, 1998.
  • [30] H. Tang, R. Houthooft, D. Foote, A. Stooke, X. Chen, Y. Duan, J. Schulman, F. D. Turck, and P. Abbeel. #exploration: A study of count-based exploration for deep reinforcement learning. In NeurIPS, pages 2750–2759, 2017.
  • [31] H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double q-learning. In AAAI, pages 2094–2100, 2016.
  • [32] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas. Dueling network architectures for deep reinforcement learning. In ICML, pages 1995–2003, 2016.
  • [33] M. Zhang, Z. McCarthy, C. Finn, S. Levine, and P. Abbeel. Learning deep neural network policies with continuous memory states. In ICRA, pages 520–527, 2016.

Appendix A Hyperparameter Settings

A.1 Hyperparameter setting in MuJoCo

In MuJoCo, the hyperparameter setting is shown in Table 4.

TRPO | TRPO-Hash | TRPO-Clustering
TRPO batchsize: 5000
TRPO stepsize: 0.01
Discount factor: 0.99
Policy hidden units: (32, 32)
Baseline function: Linear
Iteration: 30
Max length of path: 500
Bonus coefficient: - | 0.01 | 1
Others: - | SimHash dimension: 32 | cluster centers: 16
Table 4: Hyperparameter setting in MuJoCo

A.2 Hyperparameter settings in Atari 2600

The hyperparameter settings for the results in Figure 3 and Table 1 are shown in Table 5 and Table 6.

In [30], the SimHash dimension for TRPO-Hash is chosen from 16, 64, 128, 256 and 512. When the SimHash dimension is 16, there are only 65536 distinct hash codes. When the SimHash dimension is 64, there are $2^{64}$ (more than $10^{19}$) distinct hash codes, while the agent only receives about 50M states during training. Therefore, we choose 64 as the SimHash dimension, which is sufficient. The exploration-related hyperparameter settings of TRPO-Hash and TRPO-Clustering are shown in Table 6.

We choose a smaller $\eta$ for Venture than for the other games, because Venture belongs to the hard exploration category with sparse rewards. As analyzed in Section 4.4, a large $\eta$ might mislead the agent to novel but low-quality areas, because the bonus is then dominated by the novelty term.

TRPO, TRPO-Hash, TRPO-Clustering
TRPO batchsize: 100K
TRPO stepsize: 0.001
Discount factor: 0.99
Iteration: 500
Max length of path: 4500
Policy structure: 16 conv filters, stride 4; 32 conv filters, stride 2; fully-connected layer with 256 units; linear transform and softmax to output action probabilities
Input preprocessing: grayscale; downsampled; each pixel rescaled; 4 previous frames are concatenated to form the input state
Table 5: Hyperparameter setting in Atari 2600
TRPO-Hash | TRPO-Clustering
Bonus coefficient: 0.01 | 0.01
Others: SimHash dimension: 64 | number of cluster centers: 16
Count coefficient η (TRPO-Clustering): smaller for Venture than for the other games (see Section 4.4)
Table 6: Hyperparameter setting of exploration for the results in Figure 3 and Table 1

Appendix B Hyperparameter sensitivity in Frostbite

Frostbite is easier than Venture because of its denser rewards, although Frostbite is also in the hard exploration category. On Frostbite, TRPO-Clustering achieves more than three times the score of the baseline (TRPO) and more than 1.5 times the score of TRPO-Hash. Due to space limitations in the main text, we report the hyperparameter effects in this section.

B.1 The hyperparameter k in k-means

Similar to Venture, we choose different $k$ from $\{8, 12, 16, 20\}$. When $k$ is too large, the information carried by the cluster centers becomes less useful, while a suitable $k$ brings a significant improvement in performance. This illustrates that the choice of $k$ needs to balance segmentation granularity against the reliability of the per-cluster statistics.

k 8 12 16 20
Frostbite 6275.06 2249.02 4526.88 1346.49
Table 7: Effect of the number of clusters (k) on Frostbite

B.2 The hyperparameters of the bonus

With $k = 16$, we choose $\beta$ from $\{0.01, 0.1\}$ and $\eta$ from $\{0, 0.0001, 0.001, 0.01, 0.1\}$. Fixing $\eta$, $\beta = 0.01$ performs better than $\beta = 0.1$ in most cases. When $\beta$ is fixed as 0.01, all the performances are better than TRPO. The performances show no significant trend in $\eta$ because this game has dense rewards; therefore, the bonus is only slightly affected by the novelty term.

β \ η 0 0.0001 0.001 0.01 0.1
0.01 3292.63 4526.88 (10e-7) 2719.07 (10e-6) 3691.03 (10e-5) 4558.52 (10e-4)
0.1 2835.28 766.28 (10e-6) 4125.28 (10e-5) 2350.22 (10e-4) 497.64 (10e-3)
Table 8: Effect of hyperparameters β and η on Frostbite, where the number in brackets is the product βη.