1. Introduction
The ability to create useful abstractions automatically is a critical tool for an autonomous agent. Without this, the agent is condemned to plan or learn policies at a relatively low level of abstraction, and it becomes hard to solve complex tasks. What we would like is the ability for the agent to learn new skills or abstractions over time that gradually increase its ability to solve challenging tasks. This paper explores this in the context of reinforcement learning.
There are two main approaches to abstraction in reinforcement learning: temporal abstraction and state abstraction. In temporal abstraction, the agent learns multistep skills, i.e. policies for achieving subtasks. In state abstraction, the agent learns to group similar states together for the purposes of decision making. For example, for the purposes of handwriting a note, it may be irrelevant whether the agent is holding a pencil or a pen. In the context of the Markov decision process (MDP), state abstraction can be understood using an elegant approach known as the MDP homomorphism framework (Ravindran (2004)). An MDP homomorphism is a mapping from the original MDP to a more compact MDP that preserves the important transition and reward structure of the original system. Given an MDP homomorphism to a compact MDP, one may solve the original problem by solving the compact MDP and then projecting those solutions back onto the original problem. Figure 1 illustrates this in the context of a toy-domain puck stacking problem. The bottom left of Figure 1 shows two pucks on a grid. The agent must pick up one of the pucks (bottom middle of Figure 1) and place it on top of the other puck (bottom right of Figure 1). The key observation to make here is that although there are many different two-puck configurations (bottom left of Figure 1), they are all equivalent in the sense that the next step is for the agent to pick up one of the pucks. In fact, for the purposes of puck stacking, the entire system can be summarized by the three-state MDP shown at the top of Figure 1. This compact MDP is clearly a useful abstraction for the purposes of solving this problem.
Although MDP homomorphisms are a useful mechanism for abstraction, it is not yet clear how to learn the MDP homomorphism mapping from experience in a model-free scenario. This is particularly true for a deep reinforcement learning context where the state space is effectively continuous. The closest piece of related work is probably that of Wolfe and Barto (2006), who study the MDP homomorphism learning problem in a narrow context. This paper considers the problem of learning general MDP homomorphisms from experience. We make the following key contributions:
#1: We propose an algorithm for learning MDP homomorphisms from experience in both discrete and continuous state spaces (Subsection 4.2). The algorithm groups together state-action pairs with similar behaviors, creating a partition of the state-action space. The partition then induces an abstract MDP homomorphic to the original MDP. We prove the correctness of our method (Section 5).
#2: Our abstraction algorithm requires a learning component. We develop a classifier based on a convolutional network that enables our algorithm to handle medium-sized environments with continuous state spaces. We include several augmentations, such as sharing the weights of previously learned models and oversampling minority classes, for speeding up learning and dealing with extreme class imbalance (Subsection 4.3). We test our algorithm in two environments (Subsection 6.2): a continuous state space puck stacking task, which leverages the convolutional network, and a discrete state space blocks world task, which we solve with a decision tree.
#3: We propose a transfer learning method for guiding exploration in a new task with a previously learned abstract MDP (Subsection 4.4). Our method is based on the framework of options (Sutton et al. (1999)): it can augment any existing reinforcement learning agent with a new set of temporally extended actions. The method beats a baseline based on a deep Q-network in one class of tasks and performs equally well in another.
2. Background
2.1. Reinforcement Learning
An agent’s interaction with an environment can be modeled as a Markov Decision Process (MDP, Bellman (1957)). An MDP is a tuple $\langle S, A, \Phi, P, R \rangle$, where $S$ is the set of states, $A$ is the set of actions, $\Phi \subseteq S \times A$ is the state-action space (the set of available actions for each state), $P(s, a, s')$ is the transition function and $R(s, a)$ is the reward function.
We use the framework of options (Sutton et al. (1999)) for the purpose of transferring knowledge between similar tasks. An option $\langle I, \pi, \beta \rangle$ is a temporally extended action: it can be executed from the set of states $I$ and selects primitive actions with a policy $\pi$ until it terminates. The probability of termination for each state is expressed by $\beta : S \to [0, 1]$.
2.2. Abstraction with MDP homomorphisms
The aim of abstraction in our paper is to group similar state-action pairs from the state-action space $\Phi$. The grouping can be described as a partitioning of $\Phi$.
Definition 1.
A partition of an MDP $M = \langle S, A, \Phi, P, R \rangle$ is a partition of $\Phi$. Given a partition $B$ of $M$, the block transition probability of $M$ is the function $T : \Phi \times B|S \to [0, 1]$ defined by $T(s, a, [s']_{B|S}) = \sum_{s'' \in [s']_{B|S}} P(s, a, s'')$.
Definition 2.
A partition $B'$ is a refinement of a partition $B$, $B' \ll B$, if and only if each block of $B'$ is a subset of some block of $B$.
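To obtain a grouping of states, the partition of $\Phi$ is projected onto the state space $S$.
Definition 3.
Let $B$ be a partition of $Z \subseteq X \times Y$, where $X$ and $Y$ are arbitrary sets. For any $x \in X$, let $B(x)$ denote the set of distinct blocks of $B$ containing pairs of which $x$ is a component, that is, $B(x) = \{ [(w, y)]_B \mid (w, y) \in Z, w = x \}$. The projection of $B$ onto $X$ is the partition $B|X$ of $X$ such that for any $x, x' \in X$, $[x]_{B|X} = [x']_{B|X}$ if and only if $B(x) = B(x')$.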
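The projection can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `project_partition`, the block labels and the toy states are hypothetical, and the partition is assumed to be given as a finite table rather than learned.

```python
from collections import defaultdict


def project_partition(block_of):
    """Project a partition of (x, y) pairs onto the x component.

    block_of maps each pair (x, y) to the label of its block.
    Returns a dict mapping each x to the frozenset of block labels that
    contain a pair with x as a component. Two x values fall into the same
    projected block exactly when these frozensets are equal.
    """
    blocks_seen = defaultdict(set)
    for (x, _y), label in block_of.items():
        blocks_seen[x].add(label)
    return {x: frozenset(labels) for x, labels in blocks_seen.items()}


# Hypothetical toy partition: three states with two actions each.
block_of = {
    ("s1", "pick"): "b1", ("s1", "place"): "b2",
    ("s2", "pick"): "b1", ("s2", "place"): "b2",
    ("s3", "pick"): "b3", ("s3", "place"): "b2",
}
projection = project_partition(block_of)
```

In this toy table, `s1` and `s2` share a projected block because their pairs fall into the same set of blocks, whereas `s3` does not.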
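Next, we define two desirable properties of a partition over $\Phi$.
Definition 4.
A partition $B$ of an MDP $M = \langle S, A, \Phi, P, R \rangle$ is said to be reward respecting if $(s_1, a_1) \equiv_B (s_2, a_2)$ implies $R(s_1, a_1) = R(s_2, a_2)$ for all $(s_1, a_1), (s_2, a_2) \in \Phi$.
Definition 5.
A partition $B$ of an MDP $M = \langle S, A, \Phi, P, R \rangle$ has the stochastic substitution property (SSP) if for all $(s_1, a_1), (s_2, a_2) \in \Phi$, $(s_1, a_1) \equiv_B (s_2, a_2)$ implies $T(s_1, a_1, [s']_{B|S}) = T(s_2, a_2, [s']_{B|S})$ for all $[s']_{B|S} \in B|S$.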
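Both properties are easy to check on a small tabular MDP. The sketch below assumes deterministic transitions, in which case the block transition probability is either 0 or 1 and the SSP reduces to "all pairs in a block lead into the same state block". All names and the three-state chain are hypothetical.

```python
def is_reward_respecting(block_of, reward):
    """True if all state-action pairs in a block share one reward value."""
    reward_of_block = {}
    for pair, label in block_of.items():
        if reward_of_block.setdefault(label, reward[pair]) != reward[pair]:
            return False
    return True


def has_ssp_deterministic(block_of, next_state, state_block):
    """SSP check for deterministic transitions (an assumption of this sketch).

    state_block maps each state to the label of its block in the projection
    of the partition onto the state space.
    """
    target_of_block = {}
    for pair, label in block_of.items():
        target = state_block[next_state[pair]]
        if target_of_block.setdefault(label, target) != target:
            return False
    return True


# Hypothetical chain: s0 -> s1 -> g, with g absorbing.
next_state = {("s0", "go"): "s1", ("s1", "go"): "g", ("g", "go"): "g"}
reward = {("s0", "go"): 0, ("s1", "go"): 10, ("g", "go"): 0}

# Fine partition: every pair in its own block.
fine = {("s0", "go"): "b0", ("s1", "go"): "b1", ("g", "go"): "b2"}
fine_states = {"s0": "c0", "s1": "c1", "g": "c2"}

# Coarse partition: the two zero-reward pairs are merged.
coarse = {("s0", "go"): "b0", ("s1", "go"): "b1", ("g", "go"): "b0"}
coarse_states = {"s0": "c0", "s1": "c1", "g": "c0"}
```

The coarse partition is reward respecting but violates the SSP, because the two merged pairs lead into different state blocks.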
Having a partition with these properties, we can construct the quotient MDP (we also call it the abstract MDP).
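Definition 6.
Given a reward respecting SSP partition $B$ of an MDP $M = \langle S, A, \Phi, P, R \rangle$, the quotient MDP $M/B$ is the MDP $\langle S', A', \Phi', P', R' \rangle$, where $S' = B|S$; $A' = \bigcup_{[s]_{B|S} \in S'} A'_{[s]_{B|S}}$, where $A'_{[s]_{B|S}} = \{a'_1, \ldots, a'_{\eta(s)}\}$ for each $[s]_{B|S} \in S'$; $P'$ is given by $P'([s]_{B|S}, a'_i, [s']_{B|S}) = T(s, a_i, [s']_{B|S})$ and $R'$ is given by $R'([s]_{B|S}, a'_i) = R(s, a_i)$. $\eta(s)$ is the number of distinct classes of $B$ that contain a state-action pair with $s$ as the state component.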
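As a rough illustration of this construction, the following sketch builds a quotient MDP from a tabular, deterministic MDP and a given partition. Abstract states are represented by the set of state-action blocks a state participates in, and abstract actions by block labels. The function name, the data layout and the toy task are assumptions made for the example.

```python
def build_quotient(block_of, next_state, reward):
    """Build a quotient MDP from a reward respecting SSP partition.

    Assumes deterministic transitions. Because the partition is reward
    respecting and SSP, any representative pair of a block yields the same
    abstract transition and reward.
    """
    seen = {}
    for (s, _a), label in block_of.items():
        seen.setdefault(s, set()).add(label)
    state_block = {s: frozenset(labels) for s, labels in seen.items()}

    transitions, rewards = {}, {}
    for (s, a), label in block_of.items():
        key = (state_block[s], label)  # (abstract state, abstract action)
        transitions[key] = state_block[next_state[(s, a)]]
        rewards[key] = reward[(s, a)]
    return state_block, transitions, rewards


# Hypothetical task: two equivalent start states, a holding state, a goal.
block_of = {
    ("p1", "pick"): "pick", ("p2", "pick"): "pick",
    ("h", "place"): "stack", ("done", "noop"): "stay",
}
next_state = {
    ("p1", "pick"): "h", ("p2", "pick"): "h",
    ("h", "place"): "done", ("done", "noop"): "done",
}
reward = {
    ("p1", "pick"): 0, ("p2", "pick"): 0,
    ("h", "place"): 10, ("done", "noop"): 0,
}
state_block, transitions, rewards = build_quotient(block_of, next_state, reward)
```

The four original states collapse into three abstract states, since `p1` and `p2` behave identically.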
We want the quotient MDP to retain the structure of the original MDP while abstracting away unnecessary information. MDP homomorphism formalizes this intuition.
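Definition 7.
An MDP homomorphism from $M = \langle S, A, \Phi, P, R \rangle$ to $M' = \langle S', A', \Phi', P', R' \rangle$ is a tuple of surjections $h = \langle f, \{g_s : s \in S\} \rangle$ with $h(s, a) = (f(s), g_s(a))$, where $f : S \to S'$ and $g_s : A \to A'$ such that $R(s, a) = R'(f(s), g_s(a))$ and $P(s, a, f^{-1}(f(s'))) = P'(f(s), g_s(a), f(s'))$. We call $M'$ a homomorphic image of $M$ under $h$.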
The following theorem states that the quotient MDP defined above retains the structure of the original MDP.
Theorem 1 (Ravindran (2004)).
Let $B$ be a reward respecting SSP partition of MDP $M = \langle S, A, \Phi, P, R \rangle$. The quotient MDP $M/B$ is a homomorphic image of $M$.
Computing the optimal state-action value function in the quotient MDP usually requires fewer computations, but does it help us act in the underlying MDP? The last theorem states that the optimal state-action value function lifted from the minimized MDP is still optimal in the original MDP:
Theorem 2 (Optimal value equivalence, Ravindran (2004)).
Let $M' = \langle S', A', \Phi', P', R' \rangle$ be the homomorphic image of the MDP $M = \langle S, A, \Phi, P, R \rangle$ under the MDP homomorphism $h = \langle f, \{g_s : s \in S\} \rangle$. For any $(s, a) \in \Phi$, $Q^*(s, a) = Q^*(f(s), g_s(a))$.
3. Related Work
Balaraman Ravindran proposed Markov Decision Process (MDP) homomorphisms, together with a sketch of an algorithm for finding homomorphisms (i.e. finding the minimal MDP homomorphic to the underlying MDP) given the full specification of the MDP, in his Ph.D. thesis (Ravindran (2004)). The first and only algorithm (to the best of our knowledge) for finding homomorphisms from experience (online) (Wolfe and Barto (2006)) operates over Controlled Markov Processes (CMPs), MDPs extended with an output function that provides more supervision than the reward function alone. Homomorphisms over CMPs were also used in Wolfe (2006) to find objects that react the same to a defined set of actions.
An approximate MDP homomorphism (Ravindran and Barto (2004)) allows aggregating state-action pairs with similar, but not identical, dynamics. It is essential when learning homomorphisms from experience in nondeterministic environments because the estimated transition probabilities for individual state-action pairs will rarely be the same, which is required by the MDP homomorphism. Taylor et al. (2008) built upon this framework by introducing a similarity metric for state-action pairs as well as an algorithm for finding approximate homomorphisms.
Sorg and Singh (2009) developed a method based on homomorphisms for transferring a predefined optimal policy to a similar task. However, their approach maps only states and not actions, requiring actions to behave the same across all MDPs. Soni and Singh (2006) and Rajendran and Huber (2009) also studied skill transfer in the framework of MDP homomorphisms. Their works focus on the problem of transferring policies between discrete or factored MDPs with predefined mappings, whereas our primary contribution is the abstraction of MDPs with continuous state spaces.
4. Methods
We solve the problem of abstracting an MDP with a discrete or continuous state space and a discrete action space. The MDP can have an arbitrary reward function, but we restrict the transition function to be deterministic. This restriction simplifies our algorithm and makes it more sample-efficient (because we do not have to estimate the transition probabilities for each state-action pair).
This section starts with an overview of our abstraction process (Subsection 4.1), followed by a description of our algorithm for finding MDP homomorphisms (Subsection 4.2). We describe several augmentations to the base algorithm that make it faster and increase its robustness in Subsection 4.3. Finally, Subsection 4.4 contains the description of our transfer learning method that leverages the learned MDP homomorphism to speed up the learning of new tasks.
4.1. Abstraction
Algorithm 1 gives an overview of our abstraction process. Since we find MDP homomorphisms from experience, we first need to collect experience that is diverse enough. For simple environments, a random exploration policy provides such experience. However, a random walk is clearly not sufficient for more realistic environments because it rarely reaches the goal of the task. Therefore, we use the vanilla version of a deep Q-network (Mnih et al. (2015)) to collect the initial experience in bigger environments.
Subsequently, we partition the state-action space of the original MDP based on the collected experience with our Online Partition Iteration algorithm (Algorithm 2). The algorithm is described in detail in Subsection 4.2. The state-action partition, which is the output of Algorithm 2, induces a quotient, or abstract, MDP according to the definition of the quotient MDP in Subsection 2.2.
4.2. Partitioning algorithm
Our online partitioning algorithm (Algorithm 2) is based on the Partition Iteration algorithm from Givan et al. (2003). It was originally developed for stochastic bisimulation based partitioning, and we adapted it to MDP homomorphisms (following Ravindran’s sketch (Ravindran (2004))). Algorithm 2 starts with a reward respecting partition obtained by separating transitions that receive distinct rewards (SplitRewards). The reward respecting partition is subsequently refined with the Split operation (Algorithm 4) until it attains the SSP property. Split($b$, $c$, $B$) splits a state-action block $b$ from state-action partition $B$ with respect to a state block $c$ obtained by projecting the partition onto the state space.
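The refinement loop can be sketched for a small tabular MDP as follows. This is not the paper's Algorithm 2: it assumes a fully enumerated, deterministic transition table and replaces the learned State Projection with an exact projection, so it only illustrates how SplitRewards and Split interact. All function and variable names are hypothetical.

```python
def split(block, state_block_c, next_state):
    """Split a state-action block by whether the successor state lies in c."""
    inside = {pair for pair in block if next_state[pair] in state_block_c}
    outside = block - inside
    return [part for part in (inside, outside) if part]


def project(partition):
    """Group states by the set of block indices their pairs belong to."""
    signature = {}
    for index, block in enumerate(partition):
        for state, _action in block:
            signature.setdefault(state, set()).add(index)
    groups = {}
    for state, sig in signature.items():
        groups.setdefault(frozenset(sig), set()).add(state)
    return list(groups.values())


def partition_iteration(pairs, next_state, reward):
    # SplitRewards: one initial block per distinct reward value.
    by_reward = {}
    for pair in pairs:
        by_reward.setdefault(reward[pair], set()).add(pair)
    partition = list(by_reward.values())

    changed = True
    while changed:
        changed = False
        for c in project(partition):
            refined = []
            for b in partition:
                parts = split(b, c, next_state)
                changed |= len(parts) > 1
                refined.extend(parts)
            partition = refined
            if changed:
                break  # re-project the refined partition before continuing
    return partition


# Hypothetical task: two equivalent start states, a holding state, a goal.
next_state = {
    ("p1", "pick"): "h", ("p2", "pick"): "h",
    ("h", "place"): "done", ("done", "noop"): "done",
}
reward = {
    ("p1", "pick"): 0, ("p2", "pick"): 0,
    ("h", "place"): 10, ("done", "noop"): 0,
}
result = partition_iteration(list(next_state), next_state, reward)
```

On this toy task the loop stops with three blocks: the two pick actions stay together, while the goal-reaching action and the action taken in the goal state each form their own block.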
The projection of the state-action partition onto the state space (Algorithm 3) is the most complex component of our method. We train a classifier $g$, which can be an arbitrary model, to classify state-action pairs into their corresponding state-action blocks. The training set consists of all transitions the agent experienced, with each transition belonging to a particular state-action block. During State Projection, $g$ evaluates a state under a sampled set of actions, predicting a state-action block for each action. For discrete action spaces, the set of actions should include all available actions. The set of predicted state-action blocks determines which state block the state belongs to.
Figure 2 illustrates the projection process: a single state $s$ is evaluated under four actions. The first three actions are classified into one state-action block, whereas the last action is assigned to a different block. Therefore, $s$ belongs to the state block identified by the set of these two predicted state-action blocks.
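A minimal sketch of this projection step is given below. The classifier is passed in as a plain callable, and the toy classifier, the state encoding and the block labels are hypothetical stand-ins for a trained model.

```python
def state_projection(states, actions, classify):
    """Group states by the set of state-action blocks predicted for them.

    classify(state, action) is assumed to return the label of the predicted
    state-action block for that pair.
    """
    state_blocks = {}
    for state in states:
        predicted = frozenset(classify(state, action) for action in actions)
        state_blocks.setdefault(predicted, []).append(state)
    return state_blocks


# Hypothetical stand-in for a trained classifier.
# A state is (puck position, hand_full).
def toy_classifier(state, action):
    hand_full = state[1]
    if action == "place":
        return "b_place" if hand_full else "b_noop"
    return "b_noop" if hand_full else "b_pick"


states = [(0, False), (1, False), (2, True)]
blocks = state_projection(states, ["pick", "place"], toy_classifier)
```

States whose actions receive the same set of predicted blocks end up in the same state block, regardless of which action produced which prediction.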
The output of Online Partition Iteration is a partition $B$ of the state-action space $\Phi$. According to the definition of the quotient MDP in Subsection 2.2, we can use the partition to construct a quotient MDP. Since the quotient MDP is fully defined, we can compute its optimal Q-values with a dynamic programming method such as Value Iteration (Sutton and Barto (1998)).
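For a small deterministic quotient MDP, Value Iteration over Q-values can be written in a few lines. The sketch below is illustrative: the discount factor `gamma` and tolerance `tol` are assumptions (the text does not state them), and the three-state abstract MDP is a hypothetical example.

```python
def value_iteration(transitions, rewards, gamma=0.9, tol=1e-8):
    """Q-values of a deterministic MDP given as dicts keyed by (state, action).

    States without outgoing actions are treated as terminal (value 0).
    """
    q = {key: 0.0 for key in transitions}
    while True:
        delta = 0.0
        for (s, a), s_next in transitions.items():
            best_next = max(
                (q[k] for k in q if k[0] == s_next), default=0.0
            )
            new_value = rewards[(s, a)] + gamma * best_next
            delta = max(delta, abs(new_value - q[(s, a)]))
            q[(s, a)] = new_value
        if delta < tol:
            return q


# Hypothetical three-state abstract MDP for stacking two pucks.
transitions = {
    ("two_pucks", "pick"): "holding",
    ("holding", "stack"): "stacked",
}
rewards = {("two_pucks", "pick"): 0.0, ("holding", "stack"): 10.0}
q_values = value_iteration(transitions, rewards)
```

With these assumed settings, the stacking action is worth 10 and the pick action is worth the discounted value 9.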
To be able to act according to the quotient MDP, we need to connect it to the original MDP in which we select actions. Given a current state $s$ and the set of actions admissible in $s$, $A_s$, we predict the state-action block of each pair $(s, a_i)$, $a_i \in A_s$, using the classifier $g$. Note that Online Partition Iteration trains $g$ in the process of refining the partition. This process of predicting state-action blocks corresponds to a single step of State Projection: we determine which state block $s$ belongs to. Since each state in the quotient MDP corresponds to a single state block (by the definition of the quotient MDP in Subsection 2.2), we can map $s$ to some state $s'$ in the quotient MDP.
Given the current state $s'$ in the quotient MDP, we select the action with the highest Q-value and map it back to the underlying MDP. An action in the quotient MDP can correspond to more than one action in the underlying MDP. For instance, an action that places a puck on the ground can be executed in many locations, while still having the same Q-value in the context of puck stacking. We break the ties between actions by sampling a single action in proportion to the confidence predicted by $g$, which predicts a state-action block with some probability given a state-action pair.
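The tie-breaking step amounts to weighted sampling. A minimal sketch follows, assuming the classifier's confidence for the chosen block is available as one non-negative number per candidate action; the function name and the uniform fallback for all-zero confidences are assumptions of this example.

```python
import random


def sample_action(actions, confidences, rng=random):
    """Pick one action with probability proportional to its confidence."""
    total = sum(confidences)
    if total <= 0:
        # Assumed fallback: choose uniformly when no confidence is available.
        return rng.choice(actions)
    return rng.choices(actions, weights=confidences, k=1)[0]


# Hypothetical example: two placements mapped to the same abstract action.
rng = random.Random(0)
picks = [sample_action(["cell_a", "cell_b"], [0.9, 0.1], rng) for _ in range(1000)]
```

Actions the classifier is more confident about are executed more often, while less certain ones are still tried occasionally.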
4.3. Speedingup and increasing robustness
Retraining the classifier from scratch at every step of Online Partition Iteration can be time-consuming (especially if $g$ is a neural network). Moreover, if the task is sufficiently complex, the classifier will make mistakes during State Projection. For these reasons, we developed several modifications to our partition algorithm that increase its speed and robustness. Some of them are specific to a neural network classifier.

weight reuse: The neural network is initialized with the number of logits (outputs) equal to the maximum number of state-action blocks. First, the neural network is trained from scratch, but its weights are kept for all subsequent retrainings, with old state-action blocks being assigned to the same logits and new state-action blocks to free logits.

early stopping of training: We reserve 20% of the experience to calculate the validation accuracy. Having a measure of performance, we can select the snapshot of the neural network that performs the best and stop the training after no improvement for a fixed number of steps.

class balancing: The sets of state-action pairs belonging to different state-action blocks can be extremely unbalanced. Namely, the number of transitions that are assigned a positive reward is usually low. We follow the best practices from Buda et al. (2018) and oversample all minority classes so that the number of samples for each class is equal to the size of the majority class. We found that decision trees do not require oversampling; hence we use this method only with a neural network.

state-action block size threshold: During State Projection, the classifier sometimes makes mistakes in classifying a state-action pair to a state-action block. Hence, the State Projection algorithm can assign a state to a wrong state block. This problem usually manifests itself as the algorithm "hallucinating" state blocks that do not exist in reality (note that the number of possible state blocks grows quickly with the number of state-action blocks in the partition). To prevent the Split function from over-segmenting the state-action partition due to these phantom state blocks, we only split a state-action block if the new blocks contain a number of samples higher than a threshold.
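The class balancing step above can be sketched in plain Python. This is an illustration of random oversampling to the majority class size, not the authors' implementation; the function name and the fixed seed are assumptions.

```python
import random
from collections import defaultdict


def oversample(samples, labels, rng=None):
    """Randomly repeat minority-class samples until all classes match
    the size of the majority class."""
    rng = rng or random.Random(0)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical imbalance: nine zero-reward transitions, one rewarded one.
balanced_x, balanced_y = oversample(list(range(10)), [0] * 9 + [1])
```

After balancing, the single rewarded transition is repeated until its class is as large as the majority class.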
4.4. Transferring abstract MDPs
Solving a new task from scratch requires the agent to take a random walk before it stumbles upon a reward. The abstract MDP learned in the previous task can guide exploration by taking the agent into a good starting state. However, how do we know which state block in the abstract MDP is a good start for solving a new task?
If we do not have any prior information about the structure of the next task, the agent needs to explore the starting states. To formalize this, we create options, each taking the agent to a particular state in the quotient MDP from the first task. Each option is a tuple $\langle I, \pi, \beta \rangle$ with

$I$ being the set of all starting states of the MDP for the new task,

$\pi$, which uses the quotient MDP from the previous task to select actions that lead to a particular state in the quotient MDP (see Subsection 4.2 for more details), and

$\beta$, which terminates the option when the target state is reached.
The agent learns the values of the options with a Monte Carlo update (Sutton and Barto (1998)) with a fixed $\alpha$ (the learning rate): the agent prefers options that make it reach the goal the fastest upon being executed. If the tasks are similar enough, the agent will find an option that brings it closer to the goal of the next task. If not, the agent can choose not to execute any option.
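A constant-step-size Monte Carlo update of option values can be sketched as follows. The dictionary-based value table, the option names and the choice of the episode return as the update target are assumptions of this example; the step size 0.1 is the value reported in the experiments.

```python
def update_option_value(q, option, episode_return, alpha=0.1):
    """Move the value of an option towards the observed return:
    Q(o) <- Q(o) + alpha * (G - Q(o))."""
    old_value = q.get(option, 0.0)
    q[option] = old_value + alpha * (episode_return - old_value)
    return q[option]


# Hypothetical usage: the option leading to abstract state 2 was executed
# in two episodes that both ended with a return of 10.
option_values = {}
first = update_option_value(option_values, "reach_state_2", 10.0)
second = update_option_value(option_values, "reach_state_2", 10.0)
```

Including a "no option" entry in the same table would let the agent learn when executing no option is preferable.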
We use a DQN to collect the initial experience in all transfer learning experiments. While our algorithm suffers from the same scalability issues as a DQN when learning the initial task, our transfer learning method makes the learning of new tasks easier by guiding the agent’s exploration.
5. Proof of Correctness
This section contains the proof of the correctness of our algorithm. We first prove two lemmas that support the main theorem. The first lemma and corollary ensure that Algorithm 2 finds a reward respecting SSP partition.
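Lemma 1
Given a reward respecting partition $B$ of an MDP $M = \langle S, A, \Phi, P, R \rangle$ and $(s_1, a_1), (s_2, a_2) \in \Phi$ such that $T(s_1, a_1, [s']_{B|S}) \neq T(s_2, a_2, [s']_{B|S})$ for some $s' \in S$, $(s_1, a_1)$ and $(s_2, a_2)$ are not in the same block of any reward respecting SSP partition refining $B$.
Proof.
Following the proof of Lemma 8.1 from Givan et al. (2003): proof by contradiction.
Let $B'$ be a reward respecting SSP partition that is a refinement of $B$. Let $s' \in S$ and $c = [s']_{B|S}$, such that $T(s_1, a_1, c) \neq T(s_2, a_2, c)$. Suppose that $(s_1, a_1), (s_2, a_2)$ are in the same block of $B'$, and write $c = \bigcup_{i=1}^{k} c'_i$ with $c'_i \in B'|S$, which is possible because $B'$ refines $B$. Because $B'$ is a reward respecting SSP partition, for each state block $c' \in B'|S$, $T(s_1, a_1, c') = T(s_2, a_2, c')$. Then, $T(s_1, a_1, c) = \sum_{i=1}^{k} T(s_1, a_1, c'_i) = \sum_{i=1}^{k} T(s_2, a_2, c'_i) = T(s_2, a_2, c)$. This contradicts $T(s_1, a_1, c) \neq T(s_2, a_2, c)$.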
∎
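Corollary 1
Let $B$ be a reward respecting partition of an MDP $M = \langle S, A, \Phi, P, R \rangle$, $b$ a block in $B$ and $c$ a union of blocks from $B|S$. Every reward respecting SSP partition over $\Phi$ that refines $B$ is a refinement of the partition Split($b$, $c$, $B$).
Proof.
Following the proof of Corollary 8.2 from Givan et al. (2003).
Let $c = \bigcup_{i=1}^{n} c_i$, $c_i \in B|S$. Let $B'$ be a reward respecting SSP partition that refines $B$. Split($b$, $c$, $B$) will only split state-action pairs $(s_1, a_1), (s_2, a_2)$ if $T(s_1, a_1, c) \neq T(s_2, a_2, c)$. But if $T(s_1, a_1, c) \neq T(s_2, a_2, c)$, then there must be some $i$ such that $T(s_1, a_1, c_i) \neq T(s_2, a_2, c_i)$ because for any $(s, a) \in \Phi$, $T(s, a, c) = \sum_{i=1}^{n} T(s, a, c_i)$. Therefore, we can conclude by the preceding lemma that $[(s_1, a_1)]_{B'} \neq [(s_2, a_2)]_{B'}$.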
∎
The versions of Partition Iteration from Givan et al. (2003) and Ravindran (2004) partition a fully defined MDP. We designed our algorithm for the more realistic case, where only a stream of experience is available. This change makes the algorithm different only during State Projection (Algorithm 3). In the next lemma, we prove that the output of State Projection converges to a state partition as the number of experienced transitions goes to infinity.
Lemma 2
Let $M = \langle S, A, \Phi, P, R \rangle$ be an MDP with a finite $A$, a finite or infinite $S$, a state-action space $\Phi$ that is a separable metric space and a deterministic $P$ defined such that each state-action pair is visited with a probability greater than zero. Let the set of actions sampled during State Projection contain all actions in $A$ (Algorithm 3, line 5). Let $t_1, t_2, \ldots$ be i.i.d. random variables that represent observed transitions, $g$ a 1-nearest-neighbor classifier that classifies state-action pairs into state-action blocks, and let $(s, a)'_N$ be the nearest neighbor to $(s, a)$ from a set of $N$ transitions $\{t_1, \ldots, t_N\}$. Let $B$ be a state-action partition over the state-action pairs observed in these transitions and $S_N$ the set of states they contain. Let $(B|S_N)'$ be a state partition obtained by the State Projection algorithm with $g$ taking neighbors from $\{t_1, \ldots, t_N\}$. Then $(B|S_N)' \to B|S_N$ as $N \to \infty$ with probability one.
Proof.
$B|S_N$ is obtained by projecting $B$ onto $S_N$. In this process, $S_N$ is divided into blocks based on $B(s)$, the set of distinct blocks containing pairs of which $s$ is a component, $s \in S_N$. Given $s$, line 8 in Algorithm 3 predicts a block for each $(s, a)$, $a \in A$, namely the block of its nearest neighbor $(s, a)'_N$. By the Convergence of the Nearest Neighbor Lemma (Cover and Hart (1967)), $(s, a)'_N$ converges to $(s, a)$ with probability one. The rest of Algorithm 3 exactly follows the projection procedure (Definition 3); therefore, $(B|S_N)' \to B|S_N$ with probability one.
∎
Finally, we prove the correctness of our algorithm given an infinite stream of i.i.d. experience. While the i.i.d. assumption does not usually hold in reinforcement learning (RL), the deep RL literature often leverages the experience buffer (Mnih et al. (2015)) to ensure the training data is diverse enough. Our algorithm also contains a large experience buffer to collect the data needed to run Online Partition Iteration.
Theorem 3 (Correctness).
Let $M = \langle S, A, \Phi, P, R \rangle$ be an MDP with a finite $A$, a finite or infinite $S$, a state-action space $\Phi$ that is a separable metric space and a deterministic $P$ defined such that each state-action pair is visited with a probability greater than zero. Let the set of actions sampled during State Projection contain all actions in $A$ (Algorithm 3, line 5). Let $t_1, t_2, \ldots$ be i.i.d. random variables that represent observed transitions and $g$ a 1-nearest-neighbor classifier that classifies state-action pairs into state-action blocks. As the number of observed transitions goes to infinity, Algorithm 2 computes a reward respecting SSP partition over the observed state-action pairs with probability one.
Proof.
Loosely following the proof of Theorem 8 from Givan et al. (2003).
Let $B$ be a partition over the observed state-action pairs, $S$ the set of observed states and $(B|S)'$ the result of StateProjection($B$, $g$) (Algorithm 3).
Algorithm 2 first splits the initial partition such that a block is created for each set of transitions with a distinct reward (line 2). Therefore, Algorithm 2 refines a reward respecting partition from line 2 onward.
Algorithm 2 terminates with $B$ when, for all $b \in B$ and $c \in (B|S)'$, Split($b$, $c$, $B$) $= B$. Split($b$, $c$, $B$) will split any block $b$ containing $(s_1, a_1), (s_2, a_2)$ for which $T(s_1, a_1, c) \neq T(s_2, a_2, c)$. According to Lemma 2, $(B|S)' \to B|S$ with probability one as the number of observed transitions goes to infinity. Consequently, any partition returned by Algorithm 2 must be a reward respecting SSP partition.
Since Algorithm 2 first creates a reward respecting partition, and each step only refines the partition by applying Split, we can conclude by Corollary 1 that each partition encountered, including the resulting partition, must contain a reward respecting SSP partition.
∎
6. Experiments
Our algorithm is intended to run in continuous state space environments. However, to compare with prior work, we also developed a version for discrete environments. We test our algorithm’s ability to solve a single task and compare it with prior work in Subsection 6.2 and experiment with two kinds of task transfer in Subsection 6.3. All our neural networks were implemented in TensorFlow (https://www.tensorflow.org/), version 1.7.0. The source code for our work is available online; we will link it after the end of the review period.
6.1. Setup
We first describe the convolutional network we use for the classifier $g$ and the exact settings of the deep Q-network we use both as a baseline and to generate experience for Online Partition Iteration. Then we describe the two environments in which we test our agent.
6.1.1. Convolutional network architecture and settings
We ran a grid search over various convolutional network architectures. In particular, we searched for the architecture with the highest downsampling factor of the depth image (the observed state) that still achieves an acceptable accuracy. Figure 7 shows the best-performing architecture. We set the learning rate to 0.001, the batch size to 32 and the weight decay for all layers to 0.0001 without an extensive grid search. The early stopping constant was set to 1000 steps.
6.1.2. Deep Q-network settings
Our implementation of the vanilla deep Q-network (DQN, Mnih et al. (2015)) is based on the OpenAI baselines (https://github.com/openai/baselines) with the custom architecture described above. We did not use prioritized experience replay (Schaul et al. (2015)) or dueling networks (Wang et al. (2015)). We ran a second grid search to find the best learning rate and target network update frequency, which resulted in a learning rate of 0.0001 and a target network update every 100 steps. For the exploration, we linearly decayed the value of $\epsilon$ from 1.0 to 0.1 over 5000 steps in all experiments involving a DQN. The batch size was set to 32.
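The exploration schedule described above is a simple linear interpolation. A minimal sketch, with a hypothetical function name and the reported endpoints used as default arguments:

```python
def epsilon_at(step, start=1.0, end=0.1, decay_steps=5000):
    """Linearly decay epsilon from start to end over decay_steps steps,
    then hold it at end."""
    fraction = min(max(step, 0) / decay_steps, 1.0)
    return start + fraction * (end - start)


# Exploration rate at the start, halfway through and after the decay.
schedule = [epsilon_at(step) for step in (0, 2500, 5000, 10000)]
```

Halfway through the decay the exploration rate is 0.55, and it stays at 0.1 once the 5000 steps have elapsed.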
6.1.3. Puck stacking
For our continuous task, we chose stacking pucks in a grid world environment (see Figure 1) with end-to-end actions: the agent can attempt to pick or place a puck in any of the cells. Executing the pick action in a cell with a puck places it into the agent’s hand, whereas trying to pick from an empty cell does not alter the state of the environment. The place action is allowed everywhere as long as the agent is holding a puck. The agent senses the environment as a depth image whose size depends on the number of cells along an axis, together with the state of its hand: either full or empty. The task is episodic and the goal is to stack a target number of pucks on top of each other. The task terminates after 20 time steps or when the goal is reached. Upon reaching the goal, the agent receives a reward of 10; all other state-action pairs yield a reward of 0.
The convolutional network described above was used as the classifier $g$. The agent collected experience either with a random uniform policy or a DQN (Mnih et al. (2015)) of the same structure as $g$, except for the number of output neurons. The number of state-action blocks was limited to 10: the purpose of the limit was mostly to speed up faulty experiments, but an over-segmented partition can still perform well in some cases. The replay buffers of the partitioning algorithm and the DQN were both limited to 10000 transitions, which suffices for the purpose of finding a reward respecting SSP partition.
6.1.4. Blocks world
We implemented the blocks world environment from Wolfe and Barto (2006). The environment consists of three blocks that can be placed in four positions. The blocks can be stacked on top of one another, and the goal is to place a particular block, called the focus block, in the goal position and height. With four positions and three blocks, 12 tasks of increasing difficulty can be generated. The agent is penalized with a reward of -1 for each action that does not lead to the goal; reaching the goal state results in a reward of 100.
Although a neural network can learn to solve this task, a decision tree trains two orders of magnitude faster and often reaches better performance. We used a decision tree from the scikit-learn package (http://scikit-learn.org, version 0.19.2) with the default settings as our classifier. All modifications from Subsection 4.3 specific to a neural network were omitted: weight reuse, early stopping of training and class balancing. We also disabled the state-action block size threshold because the number of unique transitions generated by this environment is low and the decision tree does not make many mistakes. Despite the decision tree reaching high accuracy, we cap the number of state-action blocks at 100 to avoid creating thousands of state-action pairs if the algorithm fails. The abstract MDP was recreated every 3000 time steps and the task terminated after 15000 time steps.
6.2. Single task
We first demonstrate our algorithm’s ability to find the abstract MDP homomorphic to the underlying problem and to act optimally given the abstract MDP. The agent was taught to solve the task of stacking 3 pucks in a 3x3 grid world environment; a random uniform policy selects actions in the first 100 episodes and we run the Online Partition Iteration algorithm on the collected experience to update the quotient MDP every 100 episodes after that.
Figure 3 shows the cumulative reward of our algorithm over 2000 episodes with five different settings of the state-action block size threshold. If the threshold is set too high, the algorithm cannot split the necessary blocks, which leads to under-segmentation. A policy planned in an under-segmented abstract MDP does not fit well to the underlying MDP. All other settings eventually achieved nearly optimal behavior, but the higher the threshold, the longer it took because more experience needed to be collected. Conversely, if the threshold was set too low, our algorithm became more susceptible to misclassification and the performance dropped. The minimal MDP learned by our algorithm contains five state-action blocks; see Figure 1 for an example of a quotient MDP for a similar task of stacking two pucks with three state-action blocks.
In Figure 4, we compare the decision tree version of our algorithm (as described in Subsection 6.1) with the results reported in Wolfe and Barto (2006). There are several differences between our experiments: the algorithm described in Wolfe and Barto (2006) works with a Controlled Markov Process (CMP), an MDP augmented with an output function that provides more supervision than the reward. Therefore, their algorithm can start segmenting state-action blocks before it even observes the reward. A CMP also allows an easy transfer of the learned partition from one task to another; we solve each task separately. On the other hand, each action in Wolfe’s version of the task has a 0.2 chance of failure, but we omit this detail to satisfy the assumptions of our algorithm. Even though each version of the task is easier in some ways and harder in others, we believe the comparison with the only previous algorithm that solves the same problem is valuable.
6.3. Task transfer
We first test our algorithm on a sequence of two tasks, where getting to the goal of the first task helps the agent reach the goal of the second task. Specifically, the first task is stacking two pucks in a 4x4 grid world environment and the second task requires the agent to stack three pucks. Both of the tasks run for 1000 episodes. We compare our method, which first executes the option that brings the agent to the goal of the first task and then acts with a vanilla DQN, with two baselines:

baseline: A vanilla DQN. We reset its weights and replay buffer after the end of each task, so it does not retain any information from the previous task it solved.

baseline, weight sharing: The same as the above, but we do not reset its weights. When a new task starts, it goes to the goal state of the previous task and explores from there.
Our agent augmented with the goal option reaches a similar cumulative reward to the baseline with weight sharing (Figure 6). We expected this result because creating an abstract MDP of the whole environment does not bring any more benefits than simply knowing how to get to the goal state of the previous task.
To show the benefit of the abstract MDP, we created a sequence of tasks in which reaching the goal of the first task does not help: the first task is stacking three pucks and the second task is making two stacks of height two. Upon the completion of the first task, our agent is augmented with options for reaching all state blocks of the abstract MDP. The agent learns the Q-values of the options with a Monte Carlo update (Sutton and Barto (1998)) with the learning rate set to 0.1. The baselines are the same as in the previous experiment.
Figure 7 shows that our agent learns to select the correct option that brings it closer to the goal of the second task, reaching a significantly higher cumulative reward than both of the baselines. An unexpected result is that the baseline with weight sharing performs better than the other baseline even when reaching the goal of the first task is not as beneficial. We hypothesize that the DQN can learn the second policy more easily because all convolutional layers are pretrained to spot the pucks from the first task; pretraining convolutional networks has been shown to help in tasks such as image classification.
7. Conclusion
We developed Online Partition Iteration, an algorithm for finding abstract MDPs in discrete and continuous state spaces from experience, building on the existing work of Givan et al. (2003), Ravindran (2004), and Wolfe and Barto (2006). We proved the correctness of our algorithm under certain assumptions and demonstrated that it can successfully abstract medium-sized MDPs. In addition to being interpretable, the abstract MDPs can guide exploration when learning a new task. We created a transfer learning method in the framework of options (Sutton et al., 1999) and demonstrated that it learns new tasks faster than a baseline transfer learning method based on weight sharing.
References
 Bellman (1957) Richard Bellman. 1957. A Markovian Decision Process. Journal of Mathematics and Mechanics 6, 5 (1957), 679–684.

 Buda et al. (2018) Mateusz Buda, Atsuto Maki, and Maciej A. Mazurowski. 2018. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks: The Official Journal of the International Neural Network Society 106 (2018), 249–259.
 Cover and Hart (1967) T. Cover and P. Hart. 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13, 1 (January 1967), 21–27. https://doi.org/10.1109/TIT.1967.1053964
 Givan et al. (2003) Robert Givan, Thomas Dean, and Matthew Greig. 2003. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence 147, 1 (2003), 163–223. https://doi.org/10.1016/S0004-3702(02)00376-4 Planning with Uncertainty and Incomplete Information.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level control through deep reinforcement learning. Nature 518 (25 Feb 2015), 529–533.
 Rajendran and Huber (2009) S. Rajendran and M. Huber. 2009. Learning to generalize and reuse skills using approximate partial policy homomorphisms. In 2009 IEEE International Conference on Systems, Man and Cybernetics. 2239–2244. https://doi.org/10.1109/ICSMC.2009.5345891
 Ravindran (2004) Balaraman Ravindran. 2004. An Algebraic Approach to Abstraction in Reinforcement Learning. Ph.D. Dissertation. AAI3118325.
 Ravindran and Barto (2004) Balaraman Ravindran and Andrew G Barto. 2004. Approximate homomorphisms: A framework for nonexact minimization in Markov decision processes. (2004).
 Schaul et al. (2015) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. 2015. Prioritized Experience Replay. CoRR abs/1511.05952 (2015). arXiv:1511.05952 http://arxiv.org/abs/1511.05952
 Soni and Singh (2006) Vishal Soni and Satinder Singh. 2006. Using Homomorphisms to Transfer Options Across Continuous Reinforcement Learning Domains. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1 (AAAI’06). AAAI Press, 494–499.
 Sorg and Singh (2009) Jonathan Sorg and Satinder Singh. 2009. Transfer via Soft Homomorphisms. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 2 (AAMAS ’09). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 741–748.
 Sutton and Barto (1998) Richard S. Sutton and Andrew G. Barto. 1998. Introduction to Reinforcement Learning (1st ed.). MIT Press, Cambridge, MA, USA.
 Sutton et al. (1999) Richard S. Sutton, Doina Precup, and Satinder Singh. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112, 1 (1999), 181–211. https://doi.org/10.1016/S0004-3702(99)00052-1
 Taylor et al. (2008) Jonathan J. Taylor, Doina Precup, and Prakash Panangaden. 2008. Bounding Performance Loss in Approximate MDP Homomorphisms. In Proceedings of the 21st International Conference on Neural Information Processing Systems (NIPS’08). Curran Associates Inc., USA, 1649–1656.
 Wang et al. (2015) Ziyu Wang, Nando de Freitas, and Marc Lanctot. 2015. Dueling Network Architectures for Deep Reinforcement Learning. CoRR abs/1511.06581 (2015). arXiv:1511.06581 http://arxiv.org/abs/1511.06581
 Wolfe (2006) Alicia P. Wolfe. 2006. Defining Object Types and Options Using MDP Homomorphisms.
 Wolfe and Barto (2006) Alicia Peregrin Wolfe and Andrew G. Barto. 2006. Decision Tree Methods for Finding Reusable MDP Homomorphisms. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1 (AAAI’06). AAAI Press, 530–535.