1 Introduction
Animals are able to discover a topological map (graph) of their surrounding environment [O’Keefe and Dostrovsky, 1971, Moser et al., 2008], which they use as hints for navigation. For example, previous maze experiments on rats [O’Keefe and Dostrovsky, 1971] reveal that rats can create a mental representation of the maze and use this representation to reach food placed in the maze. In the cognitive science community, researchers summarize these discoveries in cognitive map theory [Tolman, 1948], which states that animals can extract and encode the structure of an environment in a compact and abstract map representation.
Inspired by this biological phenomenon, researchers have recently proposed to learn (generate) topological graph representations for Markov decision processes (MDPs) and to use such graphs for planning in reinforcement learning (RL). To generate graphs, existing methods generally treat the states in a replay buffer as vertices. For the edges of the graphs, some methods [Savinov et al., 2018] train a reachability predictor via self-supervised learning and combine it with human experience to construct the edges. Other methods [Eysenbach et al., 2019] exploit a goal-conditioned agent to estimate the distance between vertices, based on which edges are constructed. These existing methods suffer from the following drawbacks. Firstly, they do not learn an abstraction for graphs and usually consider all the states in the buffer as vertices [Savinov et al., 2018], which results in high memory cost. Secondly, they only work in specific settings: they either assume human experience or require training a goal-conditioned agent. Therefore, these methods cannot generate graphs alongside a common policy learning process, which limits their application.
In this paper, we propose a new method, called TOpological Map Abstraction (TOMA), for graph generation. The main contributions of this paper are outlined as follows:

TOMA can generate an abstract graph representation for MDP. Different from existing methods in which each vertex of the graph represents a state, each vertex in TOMA represents a cluster of states. As a result, TOMA costs much less memory than existing methods.

TOMA can generate graphs by self-supervised learning, and the generated graphs of TOMA can be used as a basic multi-purpose tool for different RL applications.

As an application example, we propose planning to explore, in which TOMA is used to accelerate exploration by guiding the agent towards unexplored states. A novel experience replay module called vertex memory is also proposed to improve exploration performance.

Experimental results show that TOMA can robustly generate abstract graph representation on several 2D world environments with different types of observation. Under the guidance of such graph representation, the agent can escape local minima during exploration.
2 Algorithm
2.1 Notations
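In this paper, we model an RL problem as a Markov decision process (MDP). An MDP is a tuple $(\mathcal{S}, \mathcal{A}, R, \gamma, P)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $R$ is a reward function, $\gamma$ is a discount factor and $P$ is the transition dynamic. $\|\cdot\|$ denotes Euclidean distance. $G = (V, E)$ denotes a graph, where $V$ is its vertex set and $E$ is its edge set. For any set $X$, we define its indicator function as follows: $\mathbb{1}_X(x) = 1$ if $x \in X$, and $\mathbb{1}_X(x) = 0$ if $x \notin X$.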
2.2 TOMA
Figure 1 gives an illustration of TOMA. Before discussing our algorithm, we first introduce the basic idea of mapping states to an abstract graph. A landmark set is a subset of and each landmark is in one-to-one correspondence with a vertex in the graph. Each and will represent a cluster of states. In order to decide which vertex a state corresponds to, we first use a locality sensitive embedding function to calculate its latent representation in the embedding space . Then if ’s nearest neighbor in the embedded landmark set is , we will map to vertex .
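The mapping from a state to a vertex can be illustrated with a minimal sketch. The embedding `toy_embed` below is a hypothetical stand-in (the identity on 2D coordinates) for the learned locality sensitive embedding, and the function name `nearest_landmark` is our own choice for illustration; the sketch only shows the nearest-neighbor lookup in the embedding space.

```python
import math

def toy_embed(state):
    # Hypothetical stand-in for the learned embedding: identity on 2D coordinates.
    return tuple(float(x) for x in state)

def nearest_landmark(z, embedded_landmarks):
    """Return (index, distance) of the embedded landmark closest to z."""
    best_idx, best_dist = -1, math.inf
    for idx, lm in enumerate(embedded_landmarks):
        dist = math.dist(z, lm)  # Euclidean distance in the embedding space
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist

# Three landmarks; landmark i corresponds to vertex i of the graph.
landmarks = [(0, 0), (4, 0), (4, 4)]
embedded = [toy_embed(lm) for lm in landmarks]

# The state (3, 1) is mapped to the vertex of its nearest embedded landmark.
vertex, dist = nearest_landmark(toy_embed((3, 1)), embedded)
```

In this toy setting the state (3, 1) is assigned to the vertex of landmark (4, 0), since that landmark is the closest one in the embedding space.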
In the following sections, we describe the details of how to generate the graph. In Section 2.2.1 we describe how to learn this locality sensitive embedding function, which preserves the distance between nearby states in the embedding space. In Section 2.2.2 we present dynamical graph generation, which takes advantage of such embedding to generate the vertices and edges in the graph. We also introduce an approach to increase robustness of TOMA in Section 2.2.3, which we find necessary on image domains without rich visual information.
2.2.1 Locality Sensitive Embedding
A locality sensitive embedding is a local distance preserving mapping from state space to an embedding space , which is a Euclidean space in our implementation. Given a trajectory , we can use to estimate the distance between and . Here is a radius hyperparameter to rescale the distance and we will further explain its meaning later. In practice, however, is a noisy estimate of the shortest distance, and approximating it directly does not converge in most cases. Hence, we propose to estimate which interval the real distance lies in. First, we define three indicator functions:
(1)  
(2)  
(3) 
which mark three disjoint regions , respectively. Then we define an anti-bump function . Here, is the rectified linear unit function [Glorot et al., 2011]. With this , we can measure the deviation from the above intervals. Let
(4)  
(5)  
(6) 
and to be the distance between and in the embedding space. Our embedding loss is defined as
(7) 
Here is a sample distribution which will be described later, and are two hyperparameters to balance the importance of the estimation for different distances. We find that a good choice is to pick , to ensure that our model focuses on the terms with lower variance. In this equation, there are some critical components to notice:
Radius
As we will see later, the hyperparameter will determine the granularity of each graph vertex, which we term the radius. If we define the ball neighborhood of to be
(8) 
then will cover more states when is larger. During the graph generation process, we will remove redundant vertices by checking whether and intersect too much. Rescaling by makes it easier to train the embedding function.
Sample Distribution
The state pair in the loss function is sampled from a neighborhood-biased distribution . We will sample () with probability , if . And we will sample () with probability , if . We simply take , and our experiments are not sensitive to the choice of . In the implementation, we use this sample distribution to draw samples from trajectories and put them into a replay pool. Then we train the embedding function by uniformly drawing samples from the pool.
Anti-Bump Functions
The idea of the anti-bump function is inspired by the partition of unity theorem in differential topology [Hirsch, 1997], where a collection of bump functions is used to glue the local charts of a manifold together so as to derive global properties of differentiable manifolds. In the proofs of many differential topology theorems, one crucial step is to use bump functions to segregate each local chart into three disjoint regions by radius 1, 2 and 3, which is analogous to our method. The loss function is crucial in our method: in our experiments, training does not converge if we replace it with a commonly used loss.
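As a rough illustration of the idea, the sketch below builds a function from two rectified linear units that is zero inside an interval [a, b] and grows linearly outside it, so that it measures how far a value deviates from the interval. The functional form, the name `anti_bump`, and the interval endpoints in the examples are assumptions chosen for illustration, and are not necessarily the exact definitions used in our loss.

```python
import math

def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)

def anti_bump(x, a, b):
    """Assumed form: zero for x in [a, b], linear penalty outside the interval."""
    return relu(a - x) + relu(x - b)

# A value inside the interval incurs no penalty.
inside = anti_bump(0.5, 0.0, 1.0)

# A value above the interval is penalized by its distance to the upper end.
above = anti_bump(1.5, 0.0, 1.0)

# A half-open interval [3, inf) only penalizes values that are too small.
below = anti_bump(2.0, 3.0, math.inf)
```

Because the penalty vanishes on the whole interval, such a term only asks the embedded distance to fall in the right range rather than to match a noisy target exactly.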
2.2.2 Dynamic Graph Generation
An abstract graph representation should satisfy the following basic requirements:

Simple: For any , if , should not contain too many elements.

Accurate: For any , if and the agent can’t travel from some to some in few steps, then .

Abundant: should cover as many states as possible.

Dynamic: grows dynamically, by absorbing topology information of novel states.
In the following, we present a dynamic graph generation method that fulfils these requirements. First, we introduce the basic operations in our generation procedure. The operations can be reduced to the following three categories:
Initializing [I1: Initialize] If , we will pick a landmark from currently sampled trajectories and add a vertex into accordingly. In our implementation, this landmark is the initial state of the agent.
Adding [A1: Add Labels] For each state on a trajectory, we label it with its corresponding graph vertex. Let and . There are three possible cases: (1) . We label with . (2) . We consider as an appropriate landmark candidate. Therefore, we label with NULL but add it to a candidate queue. (3) Otherwise, is simply labelled with NULL. [A2: Add Vertices] We move some states from the candidate queue into the landmark set and update accordingly. Once a state is added to the landmark set, we will relabel it from NULL to its vertex identifier. [A3: Add Edges] Let the labelled trajectory be . If we find that and are different vertices in the existing graph, we will add an edge into the graph.
Checking [C1: Check Vertices] If , then we will merge and . [C2: Check Edges] For any edge , if , we will remove this edge.
For efficient nearest neighbor search, we use a k-d tree [Bentley, 1975] to manage the vertices. Based on the above operations, we obtain our graph generation algorithm TOMA, which is summarized in Algorithm 1.
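The labelling and edge-adding operations can be sketched as follows. This is a simplified illustration, not Algorithm 1 itself: the thresholds `label_radius`, `cand_lo` and `cand_hi` are hypothetical parameters, the edge rule assumes that consecutive labelled states on the trajectory are compared, and a linear scan replaces the k-d tree for brevity.

```python
import math

NULL = None  # label for states that are not assigned to any vertex

def label_states(embedded_traj, embedded_landmarks, label_radius, cand_lo, cand_hi):
    """A1 (sketch): label states near a landmark, queue suitable candidates."""
    labels, candidates = [], []
    for t, z in enumerate(embedded_traj):
        dists = [math.dist(z, lm) for lm in embedded_landmarks]
        nearest = min(range(len(dists)), key=dists.__getitem__)
        if dists[nearest] <= label_radius:
            labels.append(nearest)            # case (1): close to a landmark
        else:
            labels.append(NULL)               # cases (2) and (3)
            if cand_lo <= dists[nearest] <= cand_hi:
                candidates.append(t)          # case (2): landmark candidate
    return labels, candidates

def add_edges(labels, edges):
    """A3 (sketch): connect distinct vertices adjacent on the labelled trajectory."""
    for u, v in zip(labels, labels[1:]):
        if u is not NULL and v is not NULL and u != v:
            edges.add((min(u, v), max(u, v)))
    return edges

landmarks = [(0.0, 0.0), (2.0, 0.0)]
traj = [(0.1, 0.0), (0.9, 0.0), (1.6, 0.0), (2.1, 0.0), (2.0, 3.5), (2.0, 9.0)]
labels, candidates = label_states(traj, landmarks, 1.0, 2.0, 4.0)
edges = add_edges(labels, set())
```

In this toy run the first four states are labelled with vertices 0, 0, 1 and 1, which yields a single edge between vertex 0 and vertex 1; the fifth state becomes a landmark candidate and the last one is too far away to be considered.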
2.2.3 Increasing Robustness
In practice, we find that TOMA sometimes provides inaccurate estimates on image domains without rich visual information. This is similar to the findings of [Eysenbach et al., 2019], which uses an ensemble of distributional value functions for robust distance estimation. To increase robustness, we can also use an ensemble of embedding functions to provide reliable estimation of neighborhood relationships on these difficult domains. The functions in the ensemble are trained with data drawn from the same pool. During labelling, each function votes for a nearest neighbor of the given observation and TOMA selects the winner as the label. To evaluate the distance between states, we use the average of the distance estimates of all embedding functions. In [Eysenbach et al., 2019], the authors find that an ensemble is an indispensable component of their value function based method for all applications. In contrast, TOMA does not require an ensemble to increase robustness on applications with rich information.
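The voting and averaging steps can be sketched as below. The three embedding functions are hypothetical toy stand-ins for trained ensemble members (the third one deliberately ignores the first coordinate to mimic an unreliable member), and the tie-breaking behavior of the vote is an artifact of this sketch rather than a specified part of the method.

```python
import math
from collections import Counter

def ensemble_label(state, embed_fns, landmarks):
    """Each embedding function votes for a nearest landmark; the majority wins."""
    votes = []
    for f in embed_fns:
        z = f(state)
        dists = [math.dist(z, f(lm)) for lm in landmarks]
        votes.append(min(range(len(dists)), key=dists.__getitem__))
    # On ties, Counter returns the candidate that was voted for first.
    return Counter(votes).most_common(1)[0][0]

def ensemble_distance(s1, s2, embed_fns):
    """Average the embedded distance over all ensemble members."""
    return sum(math.dist(f(s1), f(s2)) for f in embed_fns) / len(embed_fns)

# Hypothetical ensemble members (toy stand-ins for learned embeddings).
embed_fns = [
    lambda s: (s[0], s[1]),
    lambda s: (2 * s[0], 2 * s[1]),
    lambda s: (0.0, s[1]),  # unreliable member: ignores the first coordinate
]
landmarks = [(0, 0), (4, 1)]

label = ensemble_label((3, 0), embed_fns, landmarks)
avg_dist = ensemble_distance((0, 0), (3, 0), embed_fns)
```

Here two of the three members vote for landmark 1, so the unreliable member is outvoted.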
2.3 Planning to Explore
Since the graph of TOMA expands dynamically as the agent samples in the environment, it can be fitted into standard RL settings for different purposes. For example, we can facilitate exploration by setting unexplored vertices as goals. We can also use TOMA to tackle catastrophic forgetting of the policy network [Fedus et al., 2020]. In particular, since TOMA can segregate the state space into different clusters, we can maintain different policies on different clusters. Then training the policy on one cluster will not lead to forgetting on the other clusters. Here, we only show that TOMA can be used for facilitating exploration. We choose the furthest vertex or the least visited vertex as the ultimate goal for the agent in each episode. During sampling we periodically run Dijkstra’s algorithm to find the path towards the goal from the current state, and the vertices on the path are used as intermediate goals. To ensure the agent can stably reach the border, we further introduce the following memory module.
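The planning step can be illustrated with a small sketch of Dijkstra’s algorithm on an abstract graph. The adjacency structure, the unit edge weights and the helper names `dijkstra` and `path_to` are assumptions made for this example; here the goal is taken to be the furthest vertex from the agent’s current vertex.

```python
import heapq

def dijkstra(adj, source):
    """Shortest distances and predecessors from source on {u: {v: weight}}."""
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return dist, prev

def path_to(prev, source, goal):
    """Recover the vertex sequence from source to goal."""
    path = [goal]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

# Toy abstract graph with unit edge weights (an assumption of this sketch).
adj = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0, 3: 1.0},
    2: {1: 1.0, 4: 1.0},
    3: {1: 1.0},
    4: {2: 1.0},
}
current = 0
dist, prev = dijkstra(adj, current)
goal = max(dist, key=dist.get)            # furthest reachable vertex
intermediate_goals = path_to(prev, current, goal)[1:]
```

The vertices in `intermediate_goals` would then be handed to the goal-conditioned agent one after another.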
Vertex Memory
We observe that the agent usually fails to explore efficiently simply because it forgets how to reach the border of the explored area as training goes on. In order to make the agent recall the way to the border, we require that each vertex maintain a small replay buffer to record successful transitions into the cluster of . Then, if our agent is going towards goal and the vertices on the shortest path towards the corresponding landmark of are , …, , then we will draw some experience from the replay pool of , ,…, to inform the agent of relevant knowledge during training. In the implementation, we use the following replay strategy: half of the training data are drawn from experience of vertex memory, which provides task-specific knowledge, while the other half are drawn from normal hindsight experience replay (HER) [Andrychowicz et al., 2017], which provides general knowledge. We will use the sampled trajectory to update the memory of visited vertices at the end of each epoch. The overall procedure is summarized in Algorithm 2.
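The half-and-half replay strategy can be sketched as follows. The class `VertexMemory`, its capacity, the representation of transitions as tuples, and the fallback to HER samples when the vertex memory is empty are all assumptions of this illustration; the HER pool is represented by a plain list.

```python
import random
from collections import deque

class VertexMemory:
    """Per-vertex replay buffers of bounded size (hypothetical sketch)."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.buffers = {}

    def record(self, vertex, transition):
        """Store a successful transition into the cluster of the given vertex."""
        buf = self.buffers.setdefault(vertex, deque(maxlen=self.capacity))
        buf.append(transition)

    def sample(self, path_vertices, k, rng):
        """Draw k transitions from the buffers of the vertices on the path."""
        pool = [t for v in path_vertices for t in self.buffers.get(v, ())]
        return [rng.choice(pool) for _ in range(k)] if pool else []

def mixed_batch(memory, her_pool, path_vertices, batch_size, rng):
    """Half of the batch from vertex memory, the rest from the HER pool."""
    from_memory = memory.sample(path_vertices, batch_size // 2, rng)
    n_her = batch_size - len(from_memory)  # falls back to HER if memory is empty
    from_her = [rng.choice(her_pool) for _ in range(n_her)]
    return from_memory + from_her

rng = random.Random(0)
memory = VertexMemory(capacity=4)
for i in range(6):
    memory.record(1 + i % 2, ("vm", i))   # transitions into vertices 1 and 2
her_pool = [("her", i) for i in range(20)]

batch = mixed_batch(memory, her_pool, path_vertices=[1, 2], batch_size=8, rng=rng)
```

Bounding each buffer with `deque(maxlen=...)` keeps the per-vertex memory small, since older transitions are discarded automatically.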
3 Experiments
In the experiments, we first use visualization to show that TOMA can generate abstract graphs. Then we carry out exploration experiments in some sparse-reward environments and show that TOMA can facilitate exploration.
3.1 Visualization
3.1.1 Setting
In order to provide visualizations that are easy to follow, we use several 2D world environments, which are shown in Figure 2. In this planar world, the agent can take 4 different actions at each step: going up, down, left or right by one unit distance. These worlds contain walls that the agent cannot cross. To simulate various reinforcement learning domains, we test the agent on three different types of observation: Sensor, MNIST digit [LeCun and Cortes, 2010] and Top-down observation. Sensor observation is simply the coordinates of the agent. Both MNIST digit and top-down observation are image observations. MNIST digit observation is a mixture of MNIST digit images, which is similar to the reconstructed MNIST digit image of variational autoencoders [Kingma and Welling, 2014]. The observed digit is based on the agent’s position and varies continuously as the agent moves in this world. Top-down observation is a blank image with a white dot indicating the agent’s location. We use three different maps: “Empty”, “Lines” and “Four rooms”. Since in this experiment we only care about whether TOMA can generate an abstract graph representation from enough samples, we spawn a random agent at a random position in the map at the beginning of each episode. Each episode lasts 1000 steps and we run 500 episodes in each experiment. We use an ensemble to increase robustness only for the top-down observation.
3.1.2 Result
The graph generation results are shown in the first three columns of Figure 3 (a). Though there are a few missing or wrong edges, the generated graphs are reasonable in all nine cases. For sensor and MNIST digit observation, the graph generation process converges after 150 iterations. For top-down observation, the graph generation process usually requires about 250 iterations. Once the graph generation process converges, further sampling does not change the graph much, since the state space has been covered by the clusters. Among the three types of observation, we find that the top-down observation is the hardest for the agent. The embedding function cannot generalize well across different states for this observation, since it does not contain rich information. Thanks to the ensemble technique, TOMA still achieves promising results. The successful results on various observation domains suggest that TOMA is a reliable and robust abstract graph generation algorithm.
3.1.3 Learning under Distraction
“Noisy TV” [Burda et al., 2019] is a hard problem in reinforcement learning. In brief, if we inject some random features into the observation, most existing model learning methods will fail since it is impossible to approximate the dynamics of random features. In the sensor 2D world environment, we further append 8 random features to the agent’s observation. The graph generation result is shown in the last column of Figure 3 (a). We find that, regardless of such distraction, TOMA still discovers reasonable topological graphs without using an ensemble.
3.1.4 Effects of the Radius
We also measure the effect of radius and find it is consistent with our expectation. We use “Empty” map and let for different levels of abstraction. The graph generation results are illustrated in Figure 3 (b). The state coverage of each vertex does increase as we increase , leading to a higher level of abstraction.
3.1.5 Analysis of Training Process
We further analyze the graph generation process in the above experiments. In the early stages of training, since the embedding function has not been fully optimized, there may exist false edges in the graph, as illustrated in Figure 3 (c1). In the red circle on the right, we find that the blue cluster is separated by the barrier and it leads to false connections. As training goes on, the embedding function eventually separates distant states, at which moment TOMA detects such unreasonable connections and removes them. Similarly, there may exist crowded areas in the graph, as illustrated in Figure 3 (c2), when the local nearest relationships are not fully learned. These areas will also be eliminated eventually. We further replace our loss function with the commonly used loss and find that the training will hardly converge. We display its result on “Empty” in Figure 3 (c3). Though we set , the fragments are still small and there hardly exists any stable edge even if we carefully tune the threshold in the graph generation process. This is mainly caused by the noisy nature of . Therefore, the design of our embedding loss function is crucial for getting a stable and robust graph abstraction.
3.1.6 Visualizing Embedding Space
3.2 Unsupervised Exploration
3.2.1 Setting
In this section, we test whether Algorithm 2 can explore sparse-reward environments efficiently. The test environments are MountainCar and another 2D world called Snake maze, which are shown in Figure 5. MountainCar is a classical RL task where the agent tries to drive a car up the hill on the right.
Snake maze is a 2D world environment where the agent tries to go from the upper-left corner to the bottom-right corner. In this environment, reaching the end of the maze usually requires 300 to 400 steps. This is a challenging task since the agent should learn how to make turns at the right points over a long-distance journey.
In these environments, we set the reward provided by the environment to 0. We use DQN [Mnih et al., 2015] as the agent for MountainCar and Snake maze, as they are tasks with discrete actions. In MountainCar, we set the goal of each episode to be the least visited landmark, since the agent needs to explore an acceleration skill and the furthest vertex will sometimes guide the agent into a local minimum. In Snake maze, we simply set the goal to be the furthest vertex in the graph. Since HER makes up part of our memory, we use DQN with HER as the baseline for comparison. We test two variants of TOMA: TOMA with vertex memory (TOMA-VM) and TOMA without vertex memory (TOMA). For fair comparison, these three methods share the same DQN and HER parameters. For MountainCar, we train the agent for 20 iterations and each iteration lasts for 200 steps. For Snake maze on sensor observation, we train the agent for 300 iterations. For Snake maze on MNIST digit and top-down observation, we train the agent for 500 iterations. Each Snake maze iteration lasts for 1000 steps. In each iteration, we record the maximum distance the agent has reached so far. We additionally calculate a mean reached distance for experiments in Snake maze, which is the average reached distance over the past 10 iterations. We repeat each experiment 5 times and report the mean results.
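The two recorded quantities can be computed as in the following sketch. The function names and the toy per-iteration distances are hypothetical; the window is 10 iterations in our experiments, while a window of 2 is used in the example to keep the numbers easy to follow.

```python
def running_max(values):
    """Maximum reached distance over all iterations so far."""
    out, best = [], float("-inf")
    for v in values:
        best = max(best, v)
        out.append(best)
    return out

def trailing_mean(values, window=10):
    """Mean reached distance over the most recent `window` iterations."""
    out = []
    for i in range(len(values)):
        recent = values[max(0, i - window + 1): i + 1]
        out.append(sum(recent) / len(recent))
    return out

# Hypothetical per-iteration reached distances.
reached = [1.0, 3.0, 2.0, 5.0]
max_curve = running_max(reached)
mean_curve = trailing_mean(reached, window=2)
```

The running maximum shows whether the agent has ever reached a region, whereas the trailing mean indicates whether it reaches that region reliably.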
3.2.2 Result
The results are shown in Figure 7. We find that TOMA-VM and TOMA outperform the baseline HER in all these experiments. In the MountainCar experiment, we find that the HER agent fails to discover the acceleration skill and gets stuck at a local minimum. In contrast, both TOMA-VM and TOMA agents can discover the acceleration skill within 3 iterations and successfully climb up the right hill. Figure 6 (a) shows some intermediate goals of the TOMA-VM agent, which intuitively demonstrates the effectiveness of TOMA-VM.
In Snake maze with sensor observation, the HER agent cannot learn any meaningful action, while our TOMA-VM agent can successfully reach the end of the maze. Though the TOMA agent cannot always reach the border of the explored states in every iteration, there is still over probability of reaching the final goal. In the image-based experiments, however, we find that the learning process of goal-conditioned DQN on such a domain is not stable enough. Therefore, our agent can only reach the left or middle bottom corner of the maze on average. A typical example of the generated graph representation during exploration in Top-down Snake maze is shown in Figure 6 (b). This generated graph does provide correct guidance, but the agent struggles to learn the right action across all states. In these experiments, TOMA-VM consistently performs better than TOMA. We will discuss the reason in the next subsection.
3.2.3 Dynamics
We visualize the trajectory and the generated graph during training on the Snake maze with sensor observation. We render the last 10 trajectories and the generated graph every 50 iterations. The result is shown in Figure 8. There are several phenomena we can observe. Firstly, TOMA-VM explores much faster than TOMA. We find that TOMA gets stuck at the first corner simply because it fails to realize that it should go left, as the past experience from the HER pool is mainly for going right and down. In contrast, since TOMA-VM can recall the past experience of reaching the middle of the second corridor, it can successfully go across the second corridor and reach the bottom. Secondly, we find that both TOMA and TOMA-VM can follow intermediate goals. For instance, under the guidance of intermediate goals, the agents will directly go right at the first corridor, and go down without hesitation at the first turning point. Thirdly, the generated graph expands as the exploration continues and it maintains correct neighborhood relationships about the environment.
4 Related Work
Studies on animals [O’Keefe and Dostrovsky, 1971, Moser et al., 2008, Collett, 1996] reveal that animals are able to build a mental representation that reflects the topological map (graph) of the surrounding environment, and that animals use this representation for navigation. This mental representation is usually termed a mental map [Lynch and for Urban Studies, 1960] or a cognitive map [Tolman, 1948]. Furthermore, there exists evidence [Gillner and Mallot, 1998, Driscoll et al., 2000] showing that the mental representation is based on landmarks, which serve as an abstraction of the real environment.
A graph is a natural implementation of this mental representation. Inspired by this biological phenomenon, researchers have recently proposed to generate graph representations for RL. Existing methods such as SPTM [Savinov et al., 2018] and SoRB [Eysenbach et al., 2019] propose to generate graph representations for planning, and they treat the states in the replay buffer as vertices. SPTM learns a reachability predictor and a locomotion model from random samples by self-supervised learning, and it applies them over a replay buffer of human experience to compute paths towards goals. SoRB considers the value function of a goal-conditioned policy as a distance metric, which is used to determine edges between vertices. SoRB requires training the agent on several randomly generated goal-reaching tasks in the environment during learning. Compared with these approaches, which do not adopt abstraction, TOMA generates an abstract graph which has a lower memory cost. Furthermore, TOMA can be used for multiple purposes since it can be easily learned from sampled trajectories, while existing methods can only learn graphs for specific tasks.
Graph generation methods are also related to some model-based RL methods [Sutton, 1990, Amos et al., 2018, Kaiser et al., 2020], but these model-based RL methods do not learn topological maps. There also exist some state abstraction methods for RL [Sutton et al., 1999, Singh et al., 1994, Andre and Russell, 2002, Mannor et al., 2004, Nouri and Littman, 2010, Li et al., 2006, Abel et al., 2016]. However, these state abstraction methods only aggregate states and do not model topology information.
5 Conclusion
In this paper, we propose an abstract graph generation method called TOMA for reinforcement learning. TOMA provides a basic multi-purpose tool for different RL applications. There are several possible directions for future work.
TOMA on POMDP: This work only studies TOMA on MDP. One future direction is to investigate how to enable TOMA to work in partially observable Markov decision process (POMDP), where the agent needs to reason about latent states beyond observation.
Count-based exploration: We propose Planning to Explore in this work, while some previous exploration baselines are based on counting [Tang et al., 2017]. TOMA provides suitable abstractions for counting and one can investigate whether counting on vertices will improve the performance.
Reducing catastrophic forgetting: Deep RL (DRL) may suffer from catastrophic forgetting [McCloskey and Cohen, 1989, Fedus et al., 2020], because the policy networks may forget the learned knowledge when there exists drift in the sampled training data. Since TOMA can divide the whole space into several clusters, we can train different policies on different clusters. Then training the policy on new clusters will not affect the performance of policies on existing clusters.
References
 O’Keefe and Dostrovsky [1971] J. O’Keefe and J. Dostrovsky. The hippocampus as a spatial map. Preliminary evidence from unit activity in the freely-moving rat. Brain Research, 34(1):171–175, 1971.
 Moser et al. [2008] Edvard I. Moser, Emilio Kropff, and May-Britt Moser. Place cells, grid cells, and the brain’s spatial representation system. Annual Review of Neuroscience, 31(1):69–89, 2008.
 Tolman [1948] Edward C. Tolman. Cognitive maps in rats and men. Psychological Review, 55:189–208, 1948.
 Savinov et al. [2018] Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.
 Eysenbach et al. [2019] Ben Eysenbach, Ruslan Salakhutdinov, and Sergey Levine. Search on the replay buffer: Bridging planning and reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2019.

 Glorot et al. [2011] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
 Hirsch [1997] M.W. Hirsch. Differential Topology. Graduate Texts in Mathematics. Springer New York, 1997.
 Bentley [1975] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 18(9):509–517, 1975.
 Fedus et al. [2020] William Fedus, Dibya Ghosh, John D. Martin, Marc G. Bellemare, Yoshua Bengio, and Hugo Larochelle. On catastrophic interference in Atari 2600 games. arXiv:2002.12499, 2020.
 Andrychowicz et al. [2017] Marcin Andrychowicz, Dwight Crow, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2017.
 LeCun and Cortes [2010] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
 Kingma and Welling [2014] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.
 Burda et al. [2019] Yuri Burda, Harrison Edwards, Amos J. Storkey, and Oleg Klimov. Exploration by random network distillation. In Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019.

 van der Maaten and Hinton [2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
 Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
 Collett [1996] Thomas Collett. Insect navigation en route to the goal: Multiple strategies for the use of landmarks. The Journal of Experimental Biology, 199:227–35, 02 1996.
 Lynch and for Urban Studies [1960] K. Lynch and Joint Center for Urban Studies. The Image of the City. Harvard-MIT Joint Center for Urban Studies Series. Harvard University Press, 1960.
 Gillner and Mallot [1998] Sabine Gillner and Hanspeter Mallot. Navigation and acquisition of spatial knowledge in a virtual maze. Journal of Cognitive Neuroscience, 10:445–63, 08 1998.
 Driscoll et al. [2000] Ira Driscoll, Derek Hamilton, and Robert Sutherland. Limitations on the use of distal cues in virtual place learning. Journal of Cognitive Neuroscience, page 21, 01 2000.
 Sutton [1990] Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the 7th International Conference on Machine Learning (ICML), 1990.
 Amos et al. [2018] Brandon Amos, Ivan Dario Jimenez Rodriguez, Jacob Sacks, Byron Boots, and J. Zico Kolter. Differentiable MPC for end-to-end planning and control. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2018.
 Kaiser et al. [2020] Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, George Tucker, and Henryk Michalewski. Model-based reinforcement learning for Atari. In Proceedings of the 8th International Conference on Learning Representations (ICLR), 2020.
 Sutton et al. [1999] Richard S. Sutton, Doina Precup, and Satinder P. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
 Singh et al. [1994] Satinder P. Singh, Tommi S. Jaakkola, and Michael I. Jordan. Reinforcement learning with soft state aggregation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 1994.
 Andre and Russell [2002] David Andre and Stuart J. Russell. State abstraction for programmable reinforcement learning agents. In Proceedings of the 18th National Conference on Artificial Intelligence and the 14th Conference on Innovative Applications of Artificial Intelligence (AAAI), 2002.
 Mannor et al. [2004] Shie Mannor, Ishai Menache, Amit Hoze, and Uri Klein. Dynamic abstraction in reinforcement learning via clustering. In Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.
 Nouri and Littman [2010] Ali Nouri and Michael L. Littman. Dimension reduction and its application to modelbased exploration in continuous spaces. Machine Learning, 81(1):85–98, 2010.
 Li et al. [2006] Lihong Li, Thomas J. Walsh, and Michael L. Littman. Towards a unified theory of state abstraction for MDPs. In Proceedings of the 9th International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2006.
 Abel et al. [2016] David Abel, D. Ellis Hershkowitz, and Michael L. Littman. Near optimal behavior via approximate state abstraction. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
 Tang et al. [2017] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems 30 (NeurIPS), 2017.
 McCloskey and Cohen [1989] Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. volume 24 of Psychology of Learning and Motivation, pages 109 – 165. Academic Press, 1989.