1 Introduction
Reinforcement learning (RL) [29] studies how an agent can maximize its cumulative reward in an unknown environment, by learning through exploration and exploiting the collected experience. A key challenge in RL is to balance exploration and exploitation. If the agent explores novel states too intensely, it might never find rewards to guide the learning direction. Conversely, if the agent exploits rewards too intensely, it might converge to suboptimal behaviors and have fewer opportunities to discover more rewards through exploring.
Although reinforcement learning, especially deep RL (DRL), has recently attracted much attention and achieved impressive performance in a variety of applications, such as game playing [20, 25] and robot navigation [33], exploration techniques in RL are far from satisfactory in many cases. Exploration strategy design is still one of the challenging problems in RL, especially when the environment has a large state space or sparse rewards. Hence, designing exploration strategies has become a hot research topic, and many exploration methods have been proposed in recent years.
Some heuristic methods for exploration, such as ϵ-greedy [25, 29], uniform sampling [20] and i.i.d./correlated Gaussian noise [19, 24], try to directly obtain more diverse samples [5] during exploration. For hard applications or games, these heuristic methods are insufficient, and the agent needs exploration techniques that can incorporate meaningful information about the environment.

In recent years, some exploration strategies have tried to discover novel state areas for exploring. The direct way to measure novelty is to use counts. In [5, 21], a pseudo-count is estimated from a density model. The hash-based method [30] records the visits of hash codes as counts. There also exist some approximate ways of computing counts [15, 6, 3, 12, 28, 18, 13]. Besides, state novelty can also be measured by empowerment [17], the agent's belief of environment dynamics [14], prediction error of a system dynamics model [22, 26], prediction by exemplar models [11], and the error of predicting features of states [7].

All the above methods perform exploration mainly based on the novelty of states, without considering the quality of states. Furthermore, in most existing methods, the novelty and quality in the neighboring area of the current state are not well utilized to guide the exploration of the agent. To tackle this problem, we propose a novel RL framework, called clustered reinforcement learning (CRL), for efficient exploration in RL. The contributions of CRL are briefly outlined as follows:

CRL adopts clustering to divide the collected states into several clusters. The states in the same cluster have similar features. Hence, the clustering results in CRL make it possible to share meaningful information among different states in the same cluster.

CRL proposes a novel bonus reward, which reflects both novelty and quality in the neighboring area of the current state. Here, the neighboring area is defined by the states which share the same cluster with the current state. This bonus reward can guide the agent to perform efficient exploration, by seamlessly integrating novelty and quality of states.

Experiments on a continuous control task and several Atari 2600 [4] games with sparse rewards show that CRL can outperform other state-of-the-art methods and achieve the best performance in most cases. In particular, on several games known to be hard for heuristic exploration strategies, CRL achieves significant improvement over the baselines.
The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 presents the details of CRL, including the clustering algorithm and the clustering-based bonus reward. Section 4 presents experimental results and analysis. Section 5 concludes the paper.
2 Related Work
In the tabular setting, there is a finite number of state-action pairs, so one can directly define a decreasing function of the true visitation count as the exploration bonus. MBIE-EB [27] adds a bonus inversely proportional to the square root of the state-action count to the augmented Bellman equation, encouraging exploration of less visited pairs with theoretical guarantees.
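As a concrete illustration, a tabular count-based bonus in the spirit of MBIE-EB can be sketched as follows; the class name, the `beta` value, and the string-valued states are our own illustrative choices, not from [27]:

```python
import math
from collections import defaultdict

class CountBonus:
    """Tabular MBIE-EB-style exploration bonus: beta / sqrt(n(s, a))."""

    def __init__(self, beta=0.05):
        self.beta = beta                 # bonus scale (illustrative value)
        self.counts = defaultdict(int)   # visitation counts n(s, a)

    def augmented_reward(self, state, action, reward):
        # Count the visit, then add a bonus that decays with the count.
        self.counts[(state, action)] += 1
        bonus = self.beta / math.sqrt(self.counts[(state, action)])
        return reward + bonus

bonus = CountBonus(beta=0.05)
r1 = bonus.augmented_reward("s0", "a0", 0.0)  # first visit: bonus = 0.05
r2 = bonus.augmented_reward("s0", "a0", 0.0)  # second visit: smaller bonus
```

Repeated visits to the same pair receive a shrinking bonus, which is exactly the "decreasing function of the true visitation count" that tabular methods exploit.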
In finite MDPs, E3 [15], R-Max [6] and UCRL [3] all make use of state-action counts and are motivated by the idea of optimism in the face of uncertainty. The algorithm in [15] decides online whether to keep exploring or to exploit an efficient learning policy. R-Max [6] optimistically assumes that unknown state-action pairs yield the maximal reward and learns the optimal policy in a fictitious model. UCRL [3] chooses an optimistic policy by using upper confidence bounds. Bayesian RL methods maintain a belief distribution over possible MDPs to represent uncertainty [12, 28, 18, 13] and use counts to explore.
In continuous and high-dimensional spaces, the number of states is too large to be counted directly. In [5, 21], the exploration bonus reward is designed based on a state pseudo-count quantity, which is estimated from a density model. In the hash-based method [30], a hash function encodes states into hash codes, and the method then explores with the reciprocal of the visitation count as a reward bonus, which performs well on some hard exploration Atari 2600 games. The hash-based method is limited by the hash function. Static hashing, using locality-sensitive hashing, is stable but random. Learned hashing, using an autoencoder (AE) to capture semantic features, keeps updating during training. A related work is [1], which records counts of cluster-center/action pairs and uses them to select an action from a Gibbs distribution given a state.

These count-based methods motivate the agent by exploiting the novelty of states and do not take quality into consideration. To the best of our knowledge, the novelty and quality in the neighboring area of the current state have not been well utilized to guide the exploration of the agent in existing methods. This motivates the work of this paper.
3 Clustered Reinforcement Learning
This section presents the details of our proposed RL framework, called clustered reinforcement learning (CRL).
3.1 Notation
In this paper, we adopt similar notation to that in [30]. More specifically, we model the RL problem as a finite-horizon discounted Markov decision process (MDP), which can be defined by a tuple (S, A, P, r, ρ₀, γ, T). Here, S denotes the state space, A denotes the action space, P denotes a transition probability distribution, r denotes a reward function, ρ₀ is an initial state distribution, γ ∈ (0, 1] is a discount factor, and T denotes the horizon. In this paper, we assume r(s, a) ≥ 0. For cases with negative rewards, we can transform them to cases without negative rewards. The goal of RL is to maximize E_π[∑_{t=0}^{T} γ^t r(s_t, a_t)], the total expected discounted reward over a policy π.

3.2 CRL
The key idea of CRL is to adopt clustering to divide the collected states into several clusters, and then design a novel clustering-based bonus reward for exploration. The algorithmic framework of CRL is shown in Algorithm 1, from which we can find that CRL is actually a general framework. We can get different RL variants by adopting different clustering algorithms and different policy updating algorithms. The details of Algorithm 1 are presented in the following subsections, including clustering and the clustering-based bonus reward.
3.2.1 Clustering
Intuitively, both novelty and quality are useful for exploration strategy design. If the agent only cares about novelty, it might explore intensively in some unexplored areas without any reward. If the agent only cares about quality, it might converge to suboptimal behaviors and have little opportunity to discover unexplored areas with higher rewards. Hence, it is better to integrate both novelty and quality into the same exploration strategy.
We find that clustering can provide the possibility to integrate both novelty and quality together. Intuitively, a cluster of states can be treated as an area. The number of collected states in a cluster reflects the count (novelty) information of that area. The average reward of the collected states in a cluster reflects the quality of that area. Hence, based on the clustered results, we can design an exploration strategy considering both novelty and quality. Furthermore, the states from the same cluster have similar features, and hence the clustered results provide a possibility to share meaningful information among different states from the same cluster. The details of exploration strategy design based on clustering will be left to the following subsection. Here, we only describe the clustering algorithm.
In CRL, we perform clustering on states. Assume the number of clusters is k, and we have collected N state-action samples with some policy. We cluster the collected states into k clusters by using some clustering algorithm, obtaining cluster centers {c_1, c_2, …, c_k}, where c_i is the center of the i-th cluster. We can use any clustering algorithm in the CRL framework. Although more sophisticated clustering algorithms might achieve better performance, in this paper we simply choose the k-means algorithm [9] for illustration. K-means is one of the simplest clustering algorithms with wide applications. The details of k-means are omitted here; readers can find them in most machine learning textbooks.
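As an illustration of the clustering step, here is a minimal k-means sketch in pure Python, restricted to one-dimensional states for brevity; the function name, the toy data and the fixed iteration budget are our own choices. The CRL framework only needs cluster centers and an assignment rule, so any clustering algorithm with this interface could be substituted.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal 1-D k-means: alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers from the data
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # Update step: each center moves to its cluster mean
        # (empty clusters keep their old center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Two well-separated groups: centers should land near 0.15 and ~5.03.
centers = sorted(kmeans([0.1, 0.2, 0.15, 5.0, 5.2, 4.9], k=2))
```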
3.2.2 Clusteringbased Bonus Reward
As stated above, clustering can provide the possibility to integrate both novelty and quality together for exploration. Here, we propose a novel clusteringbased bonus reward, based on which many policy updating algorithms can be adopted to get an exploration strategy considering both novelty and quality.
Given a state s, it will be allocated to the nearest cluster by the cluster assignment function φ(s) = argmin_{1 ≤ i ≤ k} d(s, c_i). Here, d(s, c_i) denotes the distance between s and the i-th cluster center c_i. The sum of rewards in the i-th cluster is denoted as r_c(i), which can be computed as follows:

    r_c(i) = ∑_{j=1}^{N} r_j · 1[φ(s_j) = i],    (1)

where 1[·] is an indicator function and r_j is the reward of the j-th collected sample. r_c(i) is also called the cluster reward of cluster i in this paper. The number of states in the i-th cluster is denoted as n(i), which can be computed as follows:

    n(i) = ∑_{j=1}^{N} 1[φ(s_j) = i].    (2)
Intuitively, a larger n(i) typically means that the area corresponding to cluster i has more visits (exploration), which implies the novelty of this area is lower. Hence, the bonus reward should be inversely proportional to n(i). The average reward of cluster i, denoted as r̄(i) = r_c(i)/n(i), can be used to represent the quality of the corresponding area of cluster i. Hence, the bonus reward should be proportional to r̄(i).
With the above intuition, we propose a clustering-based bonus reward to integrate both novelty and quality of the neighboring area of the current state s, which is defined as follows:

    b(s) = β · (r_c(φ(s)) + α) / n(φ(s))   if ∑_{i=1}^{k} r_c(i) > 0,   and   b(s) = 0   otherwise,    (3)
where β is the bonus coefficient and α is the count (novelty) coefficient. Typically, α is set to a very small number. There are two cases:
Please note that in this paper, we assume r_j ≥ 0.
In the first case, ∑_{i=1}^{k} r_c(i) > 0, which means that the current policy can get some rewards in some states. Hence, the states with rewards can share meaningful information with other states from the same cluster. Please note that ∑_{i=1}^{k} r_c(i) > 0 only means that there exist some clusters with positive cluster reward; it does not mean all clusters have positive cluster reward. It is possible that all states in some clusters have zero reward. Please note that α is typically set to be a very small positive number. In general, as long as there exist one or two states with positive rewards in cluster i, r_c(i) will be larger than α. Hence, if r_c(i) ≤ α, it is highly possible that all states in cluster i have zero reward. Hence, when r_c(i) = 0, which means no rewards have been got for cluster i, the bonus reward should be determined by the count of the cluster. This is just what our bonus reward function in (3) does. From (3), a larger n(i) will result in a smaller bonus reward b(s). This will guide the agent to explore novel areas corresponding to clusters with fewer visits (exploration), which is reasonable. When r_c(i) > 0, typically b(s) ≈ β · r_c(i)/n(i). For two clusters with the same cluster reward, the cluster with the smaller number of states (higher novelty) will be more likely to be explored, which is reasonable. For two clusters with the same number of states, the cluster with the higher cluster reward (higher quality) will be more likely to be explored, which is also reasonable.
In the second case, ∑_{i=1}^{k} r_c(i) = 0, which means that the policy is unreliable, and sharing information among different states from the same cluster is not a good choice. Furthermore, the states explored by the current policy should not get any extra bonus reward. This is just what our bonus reward function in (3) does.
Hence, the clustering-based bonus reward function defined in (3) is intuitively reasonable, and it can seamlessly integrate both novelty and quality into the same bonus function. Finally, the agent will adopt the augmented reward r + b to update the policy (perform exploration). Many policy updating algorithms, such as trust region policy optimization (TRPO) [24], can be adopted. Please note that the bonus b is only used for training CRL in Algorithm 1. The performance evaluation (test) is measured without b, so the results can be directly compared with existing RL methods without extra bonus rewards.
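The cluster rewards r_c(i), counts n(i), and the bonus of Eq. (3) can be sketched as follows, under a hedged reading of this section: the bonus takes the form β(r_c + α)/n when some cluster has positive reward, and 0 otherwise. All variable and function names are our own.

```python
def cluster_bonus(assignments, rewards, k, beta=0.01, alpha=0.001):
    """Per-state bonus from cluster statistics, one value per sample."""
    r_c = [0.0] * k  # Eq. (1): sum of rewards in each cluster
    n = [0] * k      # Eq. (2): number of states in each cluster
    for i, r in zip(assignments, rewards):
        r_c[i] += r
        n[i] += 1
    if sum(r_c) > 0:  # case 1: the current policy has found some reward
        return [beta * (r_c[i] + alpha) / n[i] for i in assignments]
    return [0.0] * len(assignments)  # case 2: no reward found anywhere

# Two clusters with equal cluster reward: the smaller (more novel)
# cluster earns a larger per-state bonus, as the text argues.
b = cluster_bonus([0, 0, 0, 1], [1.0, 0.0, 0.0, 1.0], k=2)
```

In this toy call, cluster 0 holds three states and cluster 1 holds one, both with cluster reward 1.0, so the lone state in cluster 1 receives the largest bonus.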
4 Experiments
We use a continuous control task and several Atari 2600 games to evaluate CRL and baselines. We want to investigate and answer the following research questions:

Is count-based exploration sufficient to drive the agent to achieve the final goal?

Can CRL improve performance significantly across different tasks?

What is the impact of hyperparameters on the performance?
Due to space limitations, the hyperparameter settings are reported in the supplementary material.
4.1 Experimental Setup
4.1.1 Environments
MuJoCo. The rllab benchmark [10] consists of various continuous control tasks to test RL algorithms. We design a sparse-reward variant of MountainCar: the agent receives a positive reward when the car escapes the valley from the right side and receives zero reward at all other positions. One snapshot of this task is shown in Figure 1 (a).
Arcade Learning Environment. The Arcade Learning Environment (ALE) [4] is an important benchmark for RL because of its high-dimensional state space and wide variety of video games. We select five games featuring long horizons and requiring significant exploration: Freeway, Frostbite, Gravitar, Solaris and Venture.¹ Figure 2 shows a snapshot of each game. For example, in Freeway, the agent needs to go through the road, avoid the traffic, and get the reward by crossing the street. These games are classified into the hard exploration category, according to the taxonomy in [5].

¹The Montezuma game evaluated in [30] is not adopted in this paper for evaluation, because this paper only uses raw pixels, which are not enough for learning an effective policy on Montezuma for most methods, including CRL and the other baselines. Advanced features could be used to learn an effective policy, but this is not the focus of this paper.

4.1.2 Baselines
CRL is a general framework which can adopt many different policy updating (optimization) algorithms to get different variants. In this paper, we only adopt trust region policy optimization (TRPO) [24] as the policy updating algorithm for CRL, and leave other variants of CRL for future work. We denote our method as TRPO-Clustering in the following. The baselines for comparison include TRPO and TRPO-Hash [30], which are also TRPO-based methods and have achieved state-of-the-art performance in many tasks.
TRPO [24] is a classic policy gradient method, which uses a trust region to guarantee stable policy improvement and can handle both discrete and continuous action spaces. Furthermore, this method is not too sensitive to hyperparameters. TRPO adopts Gaussian control noise as a heuristic exploration strategy.
TRPO-Hash [30] is a hash-based method, which is a generalization of classic count-based methods to high-dimensional and continuous state spaces. The main idea is to use locality-sensitive hashing (LSH) [2] to encode continuous and high-dimensional data into binary hash codes of length D.
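The SimHash flavor of this encoding can be sketched as follows: project the state vector onto random Gaussian directions and keep only the signs, yielding a D-bit code, so that nearby states are likely to share a code and code counts approximate state-visitation counts. The code length D = 8 and the toy 4-dimensional states below are illustrative choices, not the paper's settings.

```python
import random

def simhash(x, projections):
    """D-bit SimHash code: the sign pattern of D random projections of x."""
    return tuple(1 if sum(a * b for a, b in zip(row, x)) >= 0 else 0
                 for row in projections)

rng = random.Random(0)
D, dim = 8, 4  # 8-bit codes over 4-dimensional toy states
proj = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(D)]

code_a = simhash([1.0, 0.0, 0.0, 0.0], proj)
code_b = simhash([1.01, 0.0, 0.0, 0.01], proj)  # a nearby state
# Nearby states usually collide into the same code, so counting codes
# approximates counting (neighborhoods of) states.
```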
TRPO-Hash has several variants in [30]. For fair comparison, we choose SimHash [8] as the hash function and pixels as inputs for TRPO-Hash in this paper, because our CRL also adopts pixels rather than advanced features as inputs. TRPO-Hash is trained using the code provided by its authors.

4.2 Disadvantage of Count-based Exploration
TRPO-Hash tries to help the agent explore more novel states, in the hope of achieving better performance than TRPO. But it might have to go through all states before reaching the goal state, which is the disadvantage of count (novelty) based exploration. Here, we use MountainCar to illustrate this disadvantage. Figure 1 (b) shows the results of TRPO, TRPO-Hash and TRPO-Clustering on MountainCar. We find that TRPO-Hash is slower than TRPO and TRPO-Clustering at finding the goal state, because the curve of TRPO-Hash does not ascend until the middle of training. At the end of training, TRPO-Hash fails to reach the goal, but the other methods achieve it. Our method, TRPO-Clustering, is the first to visit the goal and learn to achieve it. The reason why TRPO-Hash fails is that the novelty of states diverts the agent's attention. In the worst case, the agent visits all states before it finds the goal. This disadvantage of count-based methods becomes more serious in high-dimensional state spaces, since it is impossible to go through all states there. Therefore, strategies with only count-based exploration are insufficient.
4.3 Performance on Atari 2600
For video games, which typically have high-dimensional and complex state spaces, advanced features like those extracted by an autoencoder (AE) or variational autoencoder (VAE) [16, 23] could be used to improve performance. But this is not the focus of this paper. Hence, we simply use raw pixels as inputs for our method and all baselines, which makes the comparison fair.
For the five Atari 2600 games, the agent is trained for 500 iterations in all experiments, with each iteration consisting of 0.4M frames. Although the policy and the baseline function take the previous four frames into account, clustering is performed on the frames collected in the latest iteration, and the counts are also computed on these latest frames. The performance is evaluated over 5 random seeds. The seeds for evaluation are the same for TRPO, TRPO-Hash and TRPO-Clustering.
We show the training curves in Figure 3 and summarize all results in Table 1. Please note that TRPO and TRPO-Hash are trained with the code provided by the authors of TRPO-Hash. All hyperparameters are reported in the supplementary material. We also compare our results to double DQN [31], dueling network [32], A3C+ [5] and double DQN with pseudo-counts [5], whose results are taken from [30].
  Freeway  Frostbite  Gravitar  Solaris  Venture
TRPO  17.55  1229.66  500.33  2110.22  283.48
TRPO-Hash  22.29  2954.10  577.47  2619.32  299.61
TRPO-Clustering  26.68  4558.52  541.72  2976.23  523.79
Double DQN  33.3  1683  412  3068  98.0
Dueling network  0.0  4672  588  2251  497
A3C+  27.3  507  246  2175  0
Pseudo-count  29.2  1450  –  –  369
TRPO-Clustering achieves significant improvement over TRPO and TRPO-Hash on Freeway, Frostbite, Solaris and Venture. In particular, on Frostbite, TRPO-Clustering achieves more than 3× the score of TRPO and more than 1.5× the score of TRPO-Hash. On Venture, TRPO-Clustering achieves more than 80% improvement over TRPO and more than 70% improvement over TRPO-Hash. Furthermore, TRPO-Clustering outperforms all other methods in most cases. Please note that DQN-based methods reuse off-policy experience, which is an advantage over TRPO; hence, DQN-based methods tend to perform better than TRPO. But our TRPO-Clustering can still outperform DQN-based methods in most cases. This suggests that the novelty and quality statistics of neighboring states give the on-policy agent some of the benefits of an off-policy experience buffer.
4.4 Hyperparameter Effect
We use Venture of Atari 2600 to study the performance sensitivity to hyperparameters, including k in k-means, and β and α in the bonus reward.
We choose different k from {8, 12, 16, 20} on Venture to illustrate the effect of k. A larger k divides the state space more precisely, but the average-reward statistic of each cluster might become less meaningful. A smaller k mixes information from different areas, which might be too coarse for exploration. The results on Venture are summarized in Table 2, with β and α fixed. On Venture, the scores are roughly concave in k, peaking at around k = 12. We can find that the performance is not too sensitive to k in a relatively large range.

k  8  12  16  20
Venture  347.86  663.32  523.79  451.94
We choose β from {0.01, 0.1} and α from {0, 0.0001, 0.001, 0.01, 0.1}. The results are shown in Table 3. The value of β decides the overall scale of the bonus. With α fixed, the performance with β = 0.01 is better than with β = 0.1, because a large β causes the bonus rewards to overwhelm the true rewards. With β fixed, α decides the degree of pure exploration of novel states: a larger α means that more novel states will be explored. The scores are roughly concave in α, peaking at around α = 0.0001, which again shows that pure count-based exploration is insufficient.

α  0  0.0001  0.001  0.01  0.1
β = 0.01  292.39  523.79 (10e7)  512.74 (10e6)  279.84 (10e5)  182.04 (10e4)
β = 0.1  218.44  113.12 (10e6)  101.95 (10e5)  81.70 (10e4)  88.51 (10e3)
5 Conclusion
In this paper, we propose a novel RL framework, called clustered reinforcement learning (CRL), for efficient exploration. By using clustering, CRL provides a general framework to exploit both novelty and quality in the neighboring area of the current state for exploration. Experiments on a continuous control task and several hard exploration Atari 2600 games show that CRL can outperform other state-of-the-art methods and achieve the best performance in most cases.
References
 [1] D. Abel, A. Agarwal, F. Diaz, A. Krishnamurthy, and R. E. Schapire. Exploratory gradient boosting for reinforcement learning in complex domains. CoRR, abs/1603.04119, 2016.
 [2] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, pages 459–468, 2006.
 [3] P. Auer and R. Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In NeurIPS, pages 49–56, 2006.
 [4] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. JAIR, 47:253–279, 2013.
 [5] M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying countbased exploration and intrinsic motivation. In NeurIPS, pages 1471–1479, 2016.
 [6] R. I. Brafman and M. Tennenholtz. R-MAX: A general polynomial time algorithm for near-optimal reinforcement learning. JMLR, 3:213–231, 2002.
 [7] Y. Burda, H. Edwards, A. J. Storkey, and O. Klimov. Exploration by random network distillation. CoRR, abs/1810.12894, 2018.
 [8] M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, 2002.
 [9] A. Coates and A. Y. Ng. Learning feature representations with kmeans. In Neural Networks: Tricks of the Trade  Second Edition, pages 561–580. 2012.
 [10] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In ICML, pages 1329–1338, 2016.
 [11] J. Fu, J. D. CoReyes, and S. Levine. EX2: exploration with exemplar models for deep reinforcement learning. In NeurIPS, pages 2574–2584, 2017.
 [12] M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar. Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning, 8(5-6):359–483, 2015.
 [13] A. Guez, N. Heess, D. Silver, and P. Dayan. Bayesadaptive simulationbased search with value function approximation. In NeurIPS, pages 451–459, 2014.
 [14] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. D. Turck, and P. Abbeel. VIME: variational information maximizing exploration. In NeurIPS, pages 1109–1117, 2016.
 [15] M. J. Kearns and S. P. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
 [16] D. P. Kingma and M. Welling. Autoencoding variational bayes. In ICLR, 2014.
 [17] A. S. Klyubin, D. Polani, and C. L. Nehaniv. Empowerment: a universal agentcentric measure of control. In CEC, pages 128–135, 2005.
 [18] J. Z. Kolter and A. Y. Ng. Nearbayesian exploration in polynomial time. In ICML, pages 513–520, 2009.
 [19] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016.
 [20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 [21] G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos. Countbased exploration with neural density models. In ICML, pages 2721–2730, 2017.
 [22] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiositydriven exploration by selfsupervised prediction. In ICML, pages 2778–2787, 2017.

 [23] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278–1286, 2014.
 [24] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In ICML, pages 1889–1897, 2015.
 [25] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 [26] B. C. Stadie, S. Levine, and P. Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. CoRR, abs/1507.00814, 2015.
 [27] A. L. Strehl and M. L. Littman. An analysis of model-based interval estimation for Markov decision processes. JCSS, 74(8):1309–1331, 2008.
 [28] Y. Sun, F. J. Gomez, and J. Schmidhuber. Planning to be surprised: Optimal bayesian exploration in dynamic environments. In AGI, pages 41–51, 2011.
 [29] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, 1998.
 [30] H. Tang, R. Houthooft, D. Foote, A. Stooke, X. Chen, Y. Duan, J. Schulman, F. D. Turck, and P. Abbeel. #exploration: A study of countbased exploration for deep reinforcement learning. In NeurIPS, pages 2750–2759, 2017.
 [31] H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double qlearning. In AAAI, pages 2094–2100, 2016.
 [32] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas. Dueling network architectures for deep reinforcement learning. In ICML, pages 1995–2003, 2016.
 [33] M. Zhang, Z. McCarthy, C. Finn, S. Levine, and P. Abbeel. Learning deep neural network policies with continuous memory states. In ICRA, pages 520–527, 2016.
Appendix A Hyperparameter Settings
A.1 Hyperparameter settings in MuJoCo
In MuJoCo, the hyperparameter settings are shown in Table 4.
  TRPO  TRPO-Hash  TRPO-Clustering
TRPO batch size  5000
TRPO step size  0.01
Discount factor  0.99
Policy hidden units  (32, 32)
Baseline function  Linear
Iterations  30
Max length of path  500
Bonus coefficient  –  0.01  1
Others  –  SimHash dimension: 32  number of cluster centers: 16
A.2 Hyperparameter settings in Atari 2600
For TRPO-Hash, [30] chooses among 16, 64, 128, 256 and 512 as the SimHash dimension. When the SimHash dimension is 16, there are only 65536 distinct hash codes. When the SimHash dimension is 64, there are more than 10^19 distinct hash codes, while the agent only receives about 2 × 10^8 states during training. Therefore, we choose 64 as the SimHash dimension, which is sufficient. The hyperparameter settings for exploration in TRPO-Hash and TRPO-Clustering are shown in Table 6.
We choose a smaller α for Venture because Venture belongs to the hard exploration category with sparse rewards. As analyzed in Section 4.4, a large α might mislead the agent to novel but low-quality areas, because the bonus is then dominated by the novelty part. Therefore, we choose a smaller α for Venture.
  TRPO, TRPO-Hash, TRPO-Clustering
TRPO batch size  100K
TRPO step size  0.001
Discount factor  0.99
Iterations  500
Max length of path  4500
Policy structure  16 conv filters, stride 4;
  32 conv filters, stride 2;
  fully-connected layer with 256 units;
  linear transform and softmax to output action probabilities
Input preprocessing  grayscale; downsampled; each pixel rescaled to [0, 1];
  4 previous frames are concatenated to form the input state
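The preprocessing pipeline in the table above can be sketched as follows, on a toy 4×4 "frame" of nested lists; real implementations would operate on full Atari frames (typically with NumPy), and the stride-2 downsampling here is purely illustrative:

```python
def preprocess(frame_rgb):
    """Grayscale, downsample, and rescale one RGB frame (toy sketch)."""
    # Grayscale: average the three channels of each pixel.
    gray = [[sum(px) / 3.0 for px in row] for row in frame_rgb]
    # Downsample: keep every second pixel in each dimension.
    small = [row[::2] for row in gray[::2]]
    # Rescale pixel values from [0, 255] to [0, 1].
    return [[v / 255.0 for v in row] for row in small]

# In the full pipeline, 4 consecutive preprocessed frames would then be
# stacked to form the input state.
frame = [[(255, 255, 255)] * 4 for _ in range(4)]  # an all-white 4x4 frame
out = preprocess(frame)
```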
  TRPO-Hash  TRPO-Clustering
Bonus coefficient  0.01  0.01
Others  SimHash dimension: 64  number of cluster centers: 16;
  α for Venture: 0.0001; α for the other games: 0.1
Appendix B Hyperparameter sensitivity in Frostbite
Frostbite is easier than Venture because of its dense rewards, although Frostbite is also one of the games in the hard exploration category. On Frostbite, TRPO-Clustering achieves more than 3× the score of the baseline (TRPO) and more than 1.5× the score of TRPO-Hash. Due to space limitations, we show the hyperparameter effects in this section.
B.1 The hyperparameter k in k-means
Similar to Venture, we choose different k from {8, 12, 16, 20} with β and α fixed. When k = 20, the clusters are too fine-grained, making the information in the cluster centers less useful. When k = 8, the performance improves significantly. This illustrates that the choice of k needs to balance segmentation granularity against the reliability of the cluster statistics.

k  8  12  16  20
Frostbite  6275.06  2249.02  4526.88  1346.49
B.2 The hyperparameters β and α of the bonus
Fixing k = 16, we choose β from {0.01, 0.1} and α from {0, 0.0001, 0.001, 0.01, 0.1}. With α fixed, β = 0.01 performs better than β = 0.1 in most cases. When β is fixed as 0.01, all settings perform better than TRPO. The performance shows no significant trend in α, because this game has dense rewards; therefore, the bonus is only slightly affected by the novelty part.

α  0  0.0001  0.001  0.01  0.1
β = 0.01  3292.63  4526.88 (10e7)  2719.07 (10e6)  3691.03 (10e5)  4558.52 (10e4)
β = 0.1  2835.28  766.28 (10e6)  4125.28 (10e5)  2350.22 (10e4)  497.64 (10e3)