Exploration is an essential component of reinforcement learning (RL). During training, agents have to choose between exploiting the current policy and exploring the environment. On the one hand, exploration can make the training process more efficient and improve the current policy. On the other hand, excessive exploration may waste computing resources by visiting task-irrelevant regions of the environment.
Exploration is essential to solving sparse-reward tasks in environments with high-dimensional state spaces, where an exhaustive exploration of the environment is impossible in practice. A considerable amount of interaction data is required to train an effective policy due to the sparseness of the reward. A common approach is to use knowledge-based or competence-based intrinsic motivation. In the first, more commonly used, approach, it is proposed to augment an extrinsic reward with an additional dense intrinsic reward that encourages exploration [2, 3, 15]. Another approach is to separate an exploration phase from a learning phase. As has been noted, the disadvantage of the first approach is that an intrinsic reward is a non-renewable resource: after exploring an area and consuming its intrinsic reward, the agent will likely never return to that area to continue exploration, both because of catastrophic forgetting and because it cannot rediscover the path, having already consumed the intrinsic reward that could lead it back.
Implementing a mechanism that reliably returns the agent to the neighborhood of known states, from which further exploration might be most effective, is a challenging task for both approaches. In the case of resettable environments (e.g., Atari games or some robotic simulators), it is possible to save the current state of the simulator and restore it in the future. Many real-world RL applications are inherently reset-free and require a non-episodic learning process; examples include robotics problems in real-world settings and problems in domains where effective simulators are not available and agents have to learn directly in the real world. Recent work has focused on this reset-free setting [17, 18]. On the other hand, for many domains, simulators are available and widely used, at least in the pretraining phase (e.g., robotic simulators). Specific properties of resettable environments make it possible to reliably return to previously visited states and to increase exploration efficiency by reducing the required number of interactions with the environment. Ideally, an exploration algorithm should visit all states of the environment; however, given the high dimension of the state space, it is intractable in practice to store all the visited states. Therefore, effective exploration remains a difficult problem, even for resettable environments.
In this paper, we propose to formalize the interaction with resettable environments as a persistent Markov decision process (pMDP). We introduce the RbExplore algorithm, which combines the properties of pMDPs with clustering of the state space based on the similarity of states to approach long-term exploration problems. The distance between states in trajectories is used as a feature for clustering: states located close to each other are considered similar, and states distant from each other are considered dissimilar. Clusters are organized into a directed graph, where vertices correspond to clusters and arcs correspond to possible transitions between states belonging to different clusters. RbExplore uses a novelty detection module as a filter of promising states. We introduce the Prince of Persia game environment as a hard-exploration benchmark suitable for comparing various exploration methods, and propose a percentage coverage metric over the game's levels to evaluate exploration. RbExplore outperforms or shows performance comparable with the state-of-the-art curiosity methods ICM and RND on different levels of the Prince of Persia environment.
2 Related Work
Three types of exploration policies can be distinguished. Exploration policies of the first type use an intrinsic reward as an exploration bonus. Exploration policies of the second type are specific to multi-goal RL settings, where exploration is driven by selecting sub-goals. Exploration policies of the third type use a clustered representation of the set of visited states.
In recent works [4, 3, 11, 15], the curiosity-driven exploration of deep RL agents is investigated. The exploration methods proposed by these works can be attributed to the first type. The extrinsic sparse reward is replaced or augmented by a dense intrinsic reward measuring the curiosity or uncertainty of the agent at a given state. In this way, the agent is encouraged to explore unseen scenarios and unexplored regions of the environment. It has been shown that such a curiosity-driven policy can improve learning efficiency, overcome the sparse reward problem to some extent, and successfully learn challenging tasks in no-reward settings.
Another line of recent work focuses on multi-goal RL and can be attributed to the second type. The HER algorithm augments trajectories in the memory buffer by replacing the original goals with the actually achieved goals. This helps to extract a positive reward from initially unsuccessful experience, makes the reward signal denser, and makes learning more efficient, especially in sparse-reward environments. A number of RL methods [13, 7] focus on developing a better policy for selecting sub-goals for the augmentation of failure trajectories in order to improve HER. These policies ensure that the distribution of the selected goals adaptively changes throughout training: the distribution should have greater variance in the early stages of training and direct the agent to the original goal in the latter stages. Other works [8, 12, 16] propose methods to generate feasible goals whose complexity corresponds to the quality of the agent's policy. The distribution of generated goals changes adaptively to support sufficient variance, ensuring exploration in goal space.
The Go-Explore algorithm can be attributed to the third type of exploration policy. It builds a clustered lower-dimensional representation of the set of visited states in the form of an archive of cells. Two types of representation are proposed for the Montezuma's Revenge environment: one with domain knowledge, based on discretized agent coordinates, room number, and collected items; and one without domain knowledge, based on compressed grayscale images with pixel intensity discretized into eight levels.
Exploration of the state space is implemented as an iterative process. At each iteration, a cell is sampled from the archive, its state is restored in the environment, and the agent starts exploration with a stochastic exploration policy. If the agent visits new cells during the run, they are added to the archive. The visit statistics of existing cells are updated in each iteration. For both types of representation, a cell stores the highest score that the agent had when it visited the cell. A cell is sampled from the archive by a heuristic preferring more promising cells.
Exploiting domain-specific knowledge makes it difficult to use Go-Explore in a new environment. In our work, we use the idea of clustering the set of visited states and propose to use a supervised learning model to perform clustering based on the similarity of states. We use a reachability network from the Episodic Curiosity Module as a similarity model predicting a similarity score for a pair of states. The clusters are organized into a graph using connectivity information between their states, in a similar way as the Memory graph is built. An RND module is used to detect novel states. Our approach does not exploit domain knowledge, which allows us to apply RbExplore to the Prince of Persia environment without feature handcrafting.
3.1 Markov Decision Processes
A Markov Decision Process (MDP) for a fully observable environment is considered as a model for the interaction of an agent with an environment: $(\mathcal{S}, \mathcal{A}, P, r, \gamma, s_0)$, where $\mathcal{S}$ — a state space, $\mathcal{A}$ — an action space, $P(s_{t+1} \mid s_t, a_t)$ — a state transition distribution, $r(s_t, a_t)$ — a reward function, $\gamma$ — a discount factor, and $s_0$ — an initial state of the environment.
An episode starts in the state $s_0$. At each step the agent samples an action $a_t$ based on the current state $s_t$: $a_t \sim \pi(a \mid s_t)$, where $\pi$ — a stochastic policy, which defines the conditional distribution over the action space. The environment responds with a reward $r_t = r(s_t, a_t)$ and moves into a new state $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$. The result of the episode is a return $R = \sum_{t} \gamma^{t} r_t$ — a discounted sum of the rewards obtained by the agent during the episode, where $\gamma \in [0, 1]$. Action-value function $Q^{\pi}(s, a)$ is defined as the expected return for using action $a$ in a certain state $s$: $Q^{\pi}(s, a) = \mathbb{E}\left[R \mid s_0 = s, a_0 = a, \pi\right]$. State-value function $V^{\pi}(s)$ can be defined via the action-value function $Q^{\pi}(s, a)$: $V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q^{\pi}(s, a)\right]$. The goal of reinforcement learning is to find the optimal policy $\pi^{*}$:
$$\pi^{*} = \operatorname*{arg\,max}_{\pi} \mathbb{E}\left[R \mid \pi\right].$$
3.2 Persistent MDPs
A persistent data structure allows access to any version of it at any time. Inspired by such structures, we propose persistent MDPs for RL. We consider an MDP to have a persistence property if for any visited state $s$ there exists a policy $\pi_s$, which transits the agent from the initial state $s_0$ to the state $s$ in a finite number of timesteps. Thus, a persistent MDP (pMDP) is expressed as:
$$(\mathcal{S}, \mathcal{A}, P, r, \gamma, s_0, \{\pi_s\}_{s \in \mathcal{S}}).$$
However, the way of returning to visited states can differ. For example, instead of a policy $\pi_s$, it could be an environment property that allows one to save and load states.
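In practice, this save/load property can be exposed as a thin wrapper around the simulator. The sketch below uses a hypothetical toy environment; the `save_snapshot`/`load_snapshot` names are our own and not part of any particular simulator's API.

```python
import copy

class ToyEnv:
    """A hypothetical deterministic environment used only for illustration."""
    def __init__(self):
        self.position = 0

    def step(self, action):
        self.position += action  # trivial dynamics
        return self.position

class PersistentEnv:
    """Wraps a simulator to provide the save/load property of a pMDP."""
    def __init__(self, env):
        self.env = env

    def save_snapshot(self):
        # Deep-copy the full simulator state so it can be restored later.
        return copy.deepcopy(self.env)

    def load_snapshot(self, snapshot):
        # Restoring a snapshot transits the agent back to a visited state
        # without executing a return policy.
        self.env = copy.deepcopy(snapshot)

env = PersistentEnv(ToyEnv())
env.env.step(3)
snap = env.save_snapshot()   # remember the state with position == 3
env.env.step(4)              # position becomes 7
env.load_snapshot(snap)      # roll back to position == 3
```

Snapshots are deep-copied on both save and load so that later interaction cannot mutate a stored version, mirroring the immutability of versions in persistent data structures.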
4 Exploration via State Space Clustering
In this paper, we propose the RbExplore algorithm, which uses the similarity of states to build a clustered representation of the set of visited states. There are two essential components of the algorithm: a similarity model, which predicts a similarity measure for a pair of states, and a graph of clusters, which is a clustered representation of the set of visited states organized as a graph. The scheme of the algorithm is shown in Fig. 1.
A high-level overview of one iteration of the RbExplore algorithm:
1. Generate exploration trajectories: sample clusters from the graph of clusters based on cluster-visit statistics (e.g., preferring the least visited clusters), roll back to the corresponding states, and run exploration.
2. Generate training data for the similarity model from the exploration trajectories and from additional trajectories that start at novel states filtered by the novelty detection module. Full trajectory prefixes are used to generate negative examples.
3. Train the similarity model.
4. Update the graph with states from the exploration trajectories and merge its clusters. A state is added to the graph and forms a new cluster if it is dissimilar to the states already in the graph; the similarity model is used to select such states.
5. Train the novelty detection module on the states from the exploration trajectories.
As a result of one iteration, novel states are added to the graph, the statistics of visits to existing clusters are updated, and the similarity model and the novelty detection module are trained on the data collected during the current iteration.
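The loop above can be condensed into a self-contained Python sketch. Everything here is a stub: `StubEnv`, the thresholded `similarity` function, and the dictionary-based graph are illustrative stand-ins for the real similarity network, snapshot mechanism, and graph of clusters (novelty detection and similarity-model training are omitted for brevity).

```python
import copy
import random

class StubEnv:
    """A hypothetical 1-D environment with save/load, for illustration only."""
    def __init__(self):
        self.pos = 0

    def step(self, action):
        self.pos += action
        return self.pos

    def save(self):
        return copy.deepcopy(self)

    def load(self, snapshot):
        self.pos = snapshot.pos

def similarity(s1, s2):
    # Stub similarity model: states within 2 units count as similar.
    return 1.0 if abs(s1 - s2) <= 2 else 0.0

def rbexplore_iteration(graph, env, explore_steps=20):
    # 1. Sample a cluster with probability inversely proportional to visits.
    weights = [1.0 / (1 + c["visits"]) for c in graph]
    cluster = random.choices(graph, weights=weights)[0]
    cluster["visits"] += 1
    # 2. Roll back to the cluster's snapshot and run a random policy.
    env.load(cluster["snapshot"])
    trajectory = [env.step(random.choice([-1, 1])) for _ in range(explore_steps)]
    # 3. A state dissimilar to every existing cluster founds a new cluster.
    for state in trajectory:
        if all(similarity(state, c["state"]) < 0.5 for c in graph):
            snap = env.save()
            snap.pos = state  # snapshot consistent with the stored center
            graph.append({"state": state, "snapshot": snap, "visits": 0})
    return trajectory

random.seed(0)
env = StubEnv()
graph = [{"state": 0, "snapshot": env.save(), "visits": 0}]
for _ in range(50):
    rbexplore_iteration(graph, env)
```

After the 50 stub iterations, the graph contains the initial cluster plus new clusters founded by random-walk states that drifted away from all existing centers.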
4.1 Similarity Model
As a feature for clustering, it is proposed to use the distance between states in trajectories: states located close to each other are considered similar, and states distant from each other are considered dissimilar. A supervised model is used to estimate the similarity measure between states; it takes a pair of states as input and outputs a similarity measure between them. The training dataset is produced by labeling pairs of states from the same trajectory $\tau = (s_0, \ldots, s_T)$: triples $(s_i, s_j, y_{ij})$ are constructed, where $y_{ij}$ is a class label. States are considered similar ($y_{ij} = 1$) if the distance between them in the trajectory is less than $d_{pos}$ steps: $|i - j| < d_{pos}$. Negative examples ($y_{ij} = 0$) are obtained from pairs of states that are more than $d_{neg}$ steps apart from each other: $|i - j| > d_{neg}$. The model is trained as a binary classifier predicting whether two states are close in the trajectory (class 1) or not (class 0).
Fig. 2 illustrates the training data generation procedure. A neural network is used as the similarity model, as the experiments are performed in environments with high-dimensional state spaces. The network $M_{\phi}$ with parameters $\phi$ is trained on the training dataset using binary cross-entropy as a loss function:
$$\mathcal{L}(\phi) = -\sum_{(s_i, s_j, y_{ij})} \left[ y_{ij} \log M_{\phi}(s_i, s_j) + (1 - y_{ij}) \log\left(1 - M_{\phi}(s_i, s_j)\right) \right].$$
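The pair-labeling procedure can be sketched as follows; `d_pos` and `d_neg` are placeholder names for the two distance thresholds, whose concrete values are a design choice.

```python
import random

def make_pairs(trajectory, d_pos, d_neg, n_pairs=100, rng=random):
    """Sample (s_i, s_j, y) triples from a single trajectory.

    y = 1 if the states are fewer than d_pos steps apart,
    y = 0 if they are more than d_neg steps apart.
    """
    triples = []
    n = len(trajectory)
    while len(triples) < n_pairs:
        i, j = rng.randrange(n), rng.randrange(n)
        dist = abs(i - j)
        if dist < d_pos:
            triples.append((trajectory[i], trajectory[j], 1))
        elif dist > d_neg:
            triples.append((trajectory[i], trajectory[j], 0))
        # pairs in the ambiguous band [d_pos, d_neg] are discarded

    return triples

random.seed(1)
# Toy trajectory whose "states" are just their own indices.
data = make_pairs(list(range(200)), d_pos=5, d_neg=25, n_pairs=50)
```

Discarding pairs in the intermediate band keeps the two classes well separated, which makes the binary classification problem easier to learn.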
4.2 Graph of Clusters
The clustering of the state space is an iterative process using the similarity model and a chosen stochastic exploration policy $\pi_{explore}$ (e.g., the uniform distribution over actions). A cluster is a pair of a state $s_c$ — the center of the cluster — and the corresponding snapshot of the simulator. At each iteration, a cluster is selected from which exploration will be continued. The state of the selected cluster is restored in the environment using the corresponding snapshot, and the agent starts exploration with the stochastic exploration policy $\pi_{explore}$. For each state $s$ of the obtained trajectory, a measure of similarity with the current set of clusters is calculated. A state $s$ is considered as belonging to a cluster if the measure of similarity between the state and the cluster's center $s_c$ is greater than a selected threshold $\theta_{sim}$: $M_{\phi}(s, s_c) > \theta_{sim}$. Otherwise, a new cluster is created.
Clusters are organized into a directed graph $G$. Each vertex of the graph corresponds to a cluster. If two successive states $s_t$ and $s_{t+1}$ in the same trajectory belong to different clusters $u$ and $v$, an arc $(u, v)$ is added to the graph. Cluster visit statistics and arc visit statistics are updated each iteration. The graph is initialized with the initial state of the environment $s_0$. A cluster is selected from the graph for exploration using a sampling strategy that can take into account the structure of the graph and the collected statistics (e.g., the probability of sampling a cluster is inversely proportional to its number of visits). Each iteration of the graph-building procedure can be alternated with training the similarity model on the obtained trajectories. The search for the cluster to which the current state of the trajectory belongs can be accelerated by first considering the vertices adjacent to the cluster to which the previous state was assigned.
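The assignment of trajectory states to clusters and the construction of arcs might look like the following sketch, with a simple distance-based function standing in for the learned similarity model.

```python
def assign_cluster(state, clusters, similarity, theta):
    """Return the index of the first sufficiently similar cluster,
    or create a new cluster and return its index."""
    for idx, center in enumerate(clusters):
        if similarity(state, center) > theta:
            return idx
    clusters.append(state)
    return len(clusters) - 1

def build_graph(trajectory, similarity, theta):
    clusters = []   # cluster centers, in creation order
    arcs = set()    # directed arcs between distinct clusters
    prev = None
    for state in trajectory:
        cur = assign_cluster(state, clusters, similarity, theta)
        if prev is not None and prev != cur:
            arcs.add((prev, cur))  # transition between different clusters
        prev = cur
    return clusters, arcs

# Toy run: 1-D states, similarity decaying with distance.
sim = lambda a, b: 1.0 / (1.0 + abs(a - b))
clusters, arcs = build_graph([0, 1, 2, 5, 6, 2, 0], sim, theta=0.4)
# clusters == [0, 2, 5]; arcs record the back-and-forth transitions
```

With `theta=0.4`, states at distance at most 1 from a center join that cluster, so the toy trajectory produces three clusters and four arcs, including the return arcs created when the walk doubles back.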
In order to improve the quality of the similarity model on states from novel regions of the state space, an RND module is used. For each state, the RND module outputs an intrinsic reward, which is used as a measure of the state's novelty. A state is considered novel if the intrinsic reward is greater than a threshold $\tau_{novel}$. At each iteration, all states from the exploration trajectories that the RND module detects as novel are placed into a buffer of novel states. The buffer is used to generate additional training data for the similarity model that includes states from novel regions of the state space: a set of states is randomly sampled from the buffer, and the agent starts an exploration with the exploration policy $\pi_{explore}$ by restoring the sampled simulator states. The resulting trajectories are used solely to generate additional training data for the similarity model.
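The novelty test can be illustrated with a deliberately tiny numpy version of RND: a predictor is regressed onto a fixed, randomly initialized target network, and the remaining prediction error serves as the novelty score. The one-layer "networks" and the learning rate here are arbitrary illustration choices, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyRND:
    """Random network distillation on scalar states (illustrative sketch)."""
    def __init__(self, dim=8):
        self.target = rng.normal(size=dim)  # fixed, randomly initialized net
        self.pred = np.zeros(dim)           # predictor trained to match it

    def _feat(self, w, s):
        return np.tanh(w * s)               # shared feature map

    def novelty(self, s):
        # Prediction error is high for states unlike the training data.
        diff = self._feat(self.target, s) - self._feat(self.pred, s)
        return float((diff ** 2).mean())

    def train(self, s, lr=0.5, steps=100):
        for _ in range(steps):
            diff = self._feat(self.pred, s) - self._feat(self.target, s)
            # Gradient of the mean squared error w.r.t. predictor weights.
            grad = 2.0 * diff * (1.0 - np.tanh(self.pred * s) ** 2) * s / len(diff)
            self.pred -= lr * grad

rnd = TinyRND()
before = rnd.novelty(1.0)
rnd.train(1.0)              # state 1.0 becomes familiar
after = rnd.novelty(1.0)
```

After training on a state, its prediction error drops, so a fixed threshold on the error separates familiar states from novel ones.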
When an exploration trajectory is processed and a new cluster $v$ is added to the graph, a new arc from $v$ to the parent cluster $u$ from which $v$ was created is added to the set of arcs to parent clusters. A prefix — the sequence of states from the parent cluster $u$ to $v$ — is stored along with the arc. Thus, for any cluster in the graph, it is possible to construct a sequence of states that leads back to the initial cluster. This property is used to add negative pairs $(s_i, s_j, 0)$ to the training dataset of the similarity model such that $s_i$ and $s_j$ are distant from each other. If the exploration trajectory started from a cluster $u$, a full prefix for the trajectory is constructed to obtain additional negative examples: for a state $s_j$ of the trajectory, a sufficiently distant state $s_i$ from the full prefix is randomly selected to form a negative example $(s_i, s_j, 0)$. Fig. 2 (c) illustrates this procedure.
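Reconstructing the full prefix amounts to walking the parent arcs back to the root and concatenating the stored segments, e.g.:

```python
def full_prefix(cluster, parent, prefix):
    """Concatenate stored prefixes along parent arcs back to the root.

    parent[c] is the cluster c was created from (None for the root);
    prefix[c] is the stored state sequence from parent[c] to c.
    """
    states = []
    while parent[cluster] is not None:
        states = prefix[cluster] + states  # prepend this arc's segment
        cluster = parent[cluster]
    return states

# Toy chain: root -> A -> B, with a two-state prefix stored on each arc.
parent = {"root": None, "A": "root", "B": "A"}
prefix = {"A": ["s1", "s2"], "B": ["s3", "s4"]}
path = full_prefix("B", parent, prefix)
```

Because each cluster stores only the segment back to its immediate parent, the memory cost is linear in the number of clusters while still allowing the complete root-to-cluster sequence to be rebuilt on demand.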
Redundant clusters are created at each iteration due to the inaccuracy of the similarity model. A cluster merge procedure is proposed to mitigate the issue. It tests all pairs of clusters in the graph and merges a pair into a new cluster if the similarity measure of their center states $s_u$ and $s_v$ is greater than a selected threshold $\theta_{merge}$: $M_{\phi}(s_u, s_v) > \theta_{merge}$. The new cluster is incident to every arc that was incident to the two original clusters. Cluster visit statistics are summed during merging. The state and the snapshot of the new cluster are taken from whichever of the two original clusters was added to the graph earlier.
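A sketch of the merge procedure, assuming clusters are kept in creation order so that the earlier cluster of a merged pair provides the surviving state and snapshot (arc endpoints keep their original indices for simplicity):

```python
def merge_clusters(clusters, arcs, similarity, theta):
    """Merge pairs of similar clusters; earlier clusters absorb later ones.

    clusters: list of dicts with 'state' and 'visits', in creation order.
    arcs: set of (i, j) index pairs.
    """
    merged_into = list(range(len(clusters)))
    for i in range(len(clusters)):
        if merged_into[i] != i:
            continue  # this cluster was already absorbed
        for j in range(i + 1, len(clusters)):
            if merged_into[j] != j:
                continue
            if similarity(clusters[i]["state"], clusters[j]["state"]) > theta:
                merged_into[j] = i
                clusters[i]["visits"] += clusters[j]["visits"]  # sum statistics
    # Re-point arcs at surviving clusters and drop self-loops.
    new_arcs = {(merged_into[a], merged_into[b])
                for a, b in arcs if merged_into[a] != merged_into[b]}
    survivors = [c for k, c in enumerate(clusters) if merged_into[k] == k]
    return survivors, new_arcs

sim = lambda a, b: 1.0 / (1.0 + abs(a - b))
clusters = [{"state": 0, "visits": 3}, {"state": 0.5, "visits": 2},
            {"state": 5, "visits": 1}]
arcs = {(0, 1), (1, 2)}
survivors, new_arcs = merge_clusters(clusters, arcs, sim, theta=0.6)
```

In the toy run, the two nearby clusters (centers 0 and 0.5) merge into the earlier one, whose visit count becomes 5, and the arc into the absorbed cluster is re-pointed at the survivor.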
5 The Prince of Persia Domain
We evaluate our algorithm on the challenging Prince of Persia game environment, which has over ten complex levels. Fig. 3 shows the first level of the game. To pass it, the prince needs to find a sword while avoiding traps, return to the starting point, defeat the guard, and end the level. In most cases, the agent goes to the next level when it passes the final door, which it also needs to open somehow.
The input of the agent is a 96x96 grayscale image. The agent chooses from seven possible joystick actions: no-op, left, right, up, down, A, B. The same action may work differently depending on the situation in the game. For example, the agent can jump forward over different distances, depending on the take-off run. The agent can jump after using action A, or strike with a sword if in combat. Also, the agent can interact with various objects: ledges, pressing plates, jugs, and breakable plates.
The environment is difficult for RL algorithms because its action space has changing causal relationships and it requires mastery of many game aspects, such as fighting. Also, the first reward is thousands of steps away from the initial agent position.
We use the percentage coverage metric $\%\mathrm{Cov} = |V| / |C|$, i.e., the ratio of the coverage to the maximum coverage of the corresponding level, where $V$ is the set of visited units and $C$ is the full coverage set. We consider the minimum unit to be the area that roughly corresponds to the space above one plate. For example, for the first level, in the room with the sword, the full area has 36 units, but the agent can visit only 34 of them.
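The metric itself is just a set-size ratio; the numbers below are hypothetical:

```python
def percent_coverage(visited, full):
    """%Cov: fraction of reachable units the agent has visited."""
    return 100.0 * len(visited & full) / len(full)

# Hypothetical level with 34 reachable units, 17 of them visited.
full = set(range(34))
visited = set(range(17))
cov = percent_coverage(visited, full)
```

Intersecting with the full coverage set guards against counting spurious units (e.g., detection noise) that are not actually reachable.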
We evaluate the exploration performance of our RbExplore algorithm on the first three levels of the Prince of Persia environment alongside state-of-the-art curiosity methods ICM and RND.
6.1 Experimental Setup
Raw environment frames are preprocessed by applying a frameskip of 4 frames, converting to grayscale, and downsampling to 96x96. Frame pixels are rescaled to the 0-1 range. The similarity neural network consists of two subnetworks: a ResNet-18 network and a four-layer fully connected neural network. ResNet-18 accepts a frame as input and produces an embedding of dimension 512. The embeddings of the pair's frames are concatenated and fed into the fully connected network, which performs binary classification of the pair of embeddings.
The same algorithm parameters are used for all levels. The maximum number of frames per exploration trajectory is 1,500. Episodes are terminated on the loss of life. Training data for the similarity model is generated with fixed distance parameters $d_{pos}$ and $d_{neg}$. States are assigned to clusters using the similarity threshold $\theta_{sim}$, and clusters are merged using a separate similarity threshold $\theta_{merge}$; merging is run every iteration. For exploration, clusters are sampled from the graph with probabilities inversely proportional to their number of visits, with a fixed number of clusters sampled every iteration. The uniform distribution over all actions is used as the exploration policy $\pi_{explore}$. The RND module's intrinsic reward threshold $\tau_{novel}$ is used to detect novel states, and a fixed number of states is sampled from the buffer of novel states.
To prevent the formation of a large number of clusters at the early stages due to the low quality of the similarity model, the similarity model is pretrained for 500,000 steps with a gradual increase of the similarity threshold from 0 to its final value $\theta_{sim}$. At the same time, the necessary normalization parameters of the RND module are initialized. After pretraining, the graph is reset to the environment's initial state, and RbExplore is restarted with the fixed similarity threshold $\theta_{sim}$.
6.2 Exploring the Prince of Persia Environment
By design of the Prince of Persia environment, the agent's observation does not always contain information about whether the agent carries a sword. To get around this issue, the agent starts the first level with the sword and is placed at the point where the sword is located. This location is far enough from the final door that reaching the door is still a challenging task. For the other levels, we did not make any changes to the initial state.
The evaluation of RbExplore, ICM, and RND on the first three levels of the Prince of Persia environment is shown in Fig. 5. On the first level, RbExplore significantly outperforms ICM and RND and visits all possible rooms of the level. The visualization of the coverage is shown in Fig. 4.
On levels two and three, none of the algorithms was able to visit all the rooms within 15 million steps. On level two, RbExplore shows slightly worse results than RND and ICM. We explain this by the fact that the learning process in RND and ICM is driven by an exploration bonus, which helps them to explore local areas inside rooms more thoroughly. Each of the algorithms was able to visit only seven rooms at the very beginning of the level. On level three, RbExplore shows slightly better coverage than RND, and both of them outperform ICM.
6.3 Ablation Study
In order to evaluate the contribution of each component of the RbExplore algorithm, we perform an ablation study. Experiments were run on level one with two versions of RbExplore. The first version does not merge clusters in the graph. The second one does not build the full trajectory prefix when generating negative examples for the similarity model's training data; thus, the states of negative pairs are sampled only from the same trajectory. Fig. 6 shows that disabling these components hurts the performance of the algorithm.
Additional experiments were run on level one to study the impact of the value of the threshold parameter on performance. Fig. 6 shows that RbExplore with the alternative value of the parameter performs worse than the default configuration.
In this paper, we introduce the pure exploration algorithm RbExplore, which uses a formalized version of resettable environments named persistent MDPs. The experiments showed that RbExplore, coupled with a simple exploration policy (the uniform distribution over actions), demonstrates performance comparable with the RND and ICM methods in the hard-exploration environment of the Prince of Persia game in a no-reward setting. RbExplore, ICM, and RND get stuck on the second and third levels at roughly the same locations, where the agent must perform a very specific sequence of actions over a long time horizon to go further. Combining RbExplore's exploration with the exploitation of RL approaches that also utilize pMDPs is an important direction for future work to resolve this problem.
Acknowledgements. This work was supported by the Russian Science Foundation (Project No. 20-71-10116).
1. Andrychowicz, M., et al. (2017) Hindsight experience replay. In: Advances in Neural Information Processing Systems, vol. 30, pp. 5048–5058.
2. Bellemare, M.G., et al. (2016) Unifying count-based exploration and intrinsic motivation. In: Advances in Neural Information Processing Systems, vol. 29, pp. 1471–1479.
3. Burda, Y., et al. (2019) Large-scale study of curiosity-driven learning. In: International Conference on Learning Representations.
4. Burda, Y., et al. (2019) Exploration by random network distillation. In: International Conference on Learning Representations.
5. Driscoll, J.R., et al. (1989) Making data structures persistent. Journal of Computer and System Sciences 38(1), 86–124.
6. Ecoffet, A., et al. (2021) Go-Explore: a new approach for hard-exploration problems.
7. Fang, M., et al. (2019) Curriculum-guided hindsight experience replay. In: Advances in Neural Information Processing Systems, vol. 32, pp. 12623–12634.
8. Florensa, C., et al. (2018) Automatic goal generation for reinforcement learning agents. In: Proceedings of the 35th International Conference on Machine Learning, PMLR, vol. 80, pp. 1515–1528.
9. OpenAI, et al. (2019) Learning dexterous in-hand manipulation.
10. Oudeyer, P.-Y., Kaplan, F. (2008) How can we define intrinsic motivation? In: Proceedings of the 8th International Conference on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems.
11. Pathak, D., et al. (2017) Curiosity-driven exploration by self-supervised prediction. In: International Conference on Machine Learning.
12. Racanière, S., et al. (2020) Automated curriculum generation through setter-solver interactions. In: International Conference on Learning Representations.
13. Ren, Z., et al. (2019) Exploration via hindsight goal generation. In: Advances in Neural Information Processing Systems, vol. 32, pp. 13485–13496.
14. Savinov, N., et al. (2018) Semi-parametric topological memory for navigation. In: International Conference on Learning Representations.
15. Savinov, N., et al. (2019) Episodic curiosity through reachability. In: International Conference on Learning Representations.
16. Skrynnik, A., Panov, A.I. (2019) Hierarchical reinforcement learning with clustering abstract machines. In: Artificial Intelligence. RCAI 2019. Communications in Computer and Information Science, vol. 1093, pp. 30–43.
17. Xu, K., et al. (2020) Continual learning of control primitives: skill discovery via reset-games.
18. Zhu, H., et al. (2020) The ingredients of real world robotic reinforcement learning. In: International Conference on Learning Representations.