Online abstraction with MDP homomorphisms for Deep Learning

11/30/2018 ∙ by Ondrej Biza, et al. ∙ Czech Technical University in Prague / Northeastern University

Abstraction of Markov Decision Processes is a useful tool for solving complex problems, as it can ignore unimportant aspects of an environment, simplifying the process of learning an optimal policy. In this paper, we propose a new algorithm for finding abstract MDPs in environments with continuous state spaces. It is based on MDP homomorphisms, a structure-preserving mapping between MDPs. We demonstrate our algorithm's ability to learn abstractions from collected experience and show how to reuse the abstractions to guide exploration in new tasks the agent encounters. Our novel task transfer method beats a baseline based on a deep Q-network.


1. Introduction

The ability to create useful abstractions automatically is a critical tool for an autonomous agent. Without this, the agent is condemned to plan or learn policies at a relatively low level of abstraction, and it becomes hard to solve complex tasks. What we would like is the ability for the agent to learn new skills or abstractions over time that gradually increase its ability to solve challenging tasks. This paper explores this in the context of reinforcement learning.

There are two main approaches to abstraction in reinforcement learning: temporal abstraction and state abstraction. In temporal abstraction, the agent learns multi-step skills, i.e. policies for achieving subtasks. In state abstraction, the agent learns to group similar states together for the purposes of decision making. For example, for the purposes of handwriting a note, it may be irrelevant whether the agent is holding a pencil or a pen. In the context of the Markov decision process (MDP), state abstraction can be understood using an elegant approach known as the MDP homomorphism framework Ravindran (2004). An MDP homomorphism is a mapping from the original MDP to a more compact MDP that preserves the important transition and reward structure of the original system. Given an MDP homomorphism to a compact MDP, one may solve the original problem by solving the compact MDP and then projecting those solutions back onto the original problem. Figure 1 illustrates this in the context of a toy-domain puck stacking problem. The bottom left of Figure 1 shows two pucks on a grid. The agent must pick up one of the pucks (bottom middle of Figure 1) and place it on top of the other puck (bottom right of Figure 1). The key observation to make here is that although there are many different two-puck configurations (bottom right of Figure 1), they are all equivalent in the sense that the next step is for the agent to pick up one of the pucks. In fact, for the purposes of puck stacking, the entire system can be summarized by the three-state MDP shown at the top of Figure 1. This compact MDP is clearly a useful abstraction for the purposes of solving this problem.

Although MDP homomorphisms are a useful mechanism for abstraction, it is not yet clear how to learn the MDP homomorphism mapping from experience in a model-free scenario. This is particularly true for a deep reinforcement learning context where the state space is effectively continuous. The closest piece of related work is probably that of Wolfe and Barto (2006), who study the MDP homomorphism learning problem in a narrow context. This paper considers the problem of learning general MDP homomorphisms from experience. We make the following key contributions:

Figure 1. Abstraction for the task of stacking two pucks on top of one another. The diagram shows a minimal quotient MDP that is homomorphic to the underlying MDP. The minimal MDP has three states, the last of them being the goal state, and four actions. Each action is annotated with the state-action block that induced it. Two actions carry the same annotation because they both lead to the first state.

#1: We propose an algorithm for learning MDP homomorphisms from experience in both discrete and continuous state spaces (Subsection 4.2). The algorithm groups together state-action pairs with similar behaviors, creating a partition of the state-action space. The partition then induces an abstract MDP homomorphic to the original MDP. We prove the correctness of our method (Section 5).

#2: Our abstraction algorithm requires a learning component. We develop a classifier based on a convolutional network that enables our algorithm to handle medium-sized environments with continuous state spaces. We include several augmentations, such as sharing the weights of previously learned models and oversampling minority classes, to speed up learning and deal with extreme class imbalance (Subsection 4.3). We test our algorithm in two environments (Subsection 6.2): a continuous state space puck stacking task, which leverages the convolutional network, and a discrete state space blocks world task, which we solve with a decision tree.

#3: We propose a transfer learning method for guiding exploration in a new task with a previously learned abstract MDP (Subsection 4.4). Our method is based on the framework of options (Sutton et al. (1999)): it can augment any existing reinforcement learning agent with a new set of temporally-extended actions. The method beats a baseline based on a deep Q-network in one class of tasks and performs equally well in another.

2. Background

2.1. Reinforcement Learning

An agent’s interaction with an environment can be modeled as a Markov Decision Process (MDP, Bellman (1957)). An MDP is a tuple $\langle S, A, \Psi, T, R \rangle$, where $S$ is the set of states, $A$ is the set of actions, $\Psi \subseteq S \times A$ is the state-action space (the set of available actions for each state), $T$ is the transition function and $R$ is the reward function.

We use the framework of options Sutton et al. (1999) for the purpose of transferring knowledge between similar tasks. An option $\langle I, \pi, \beta \rangle$ is a temporally extended action: it can be executed from the set of states $I \subseteq S$ and selects primitive actions with a policy $\pi$ until it terminates. The probability of termination in each state is expressed by $\beta : S \to [0, 1]$.
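As a concrete reference for this notation, a lightweight representation of the two tuples could look as follows; the class and field names are our own, and the deterministic transition signature anticipates the assumption made in Section 4.

```python
# Minimal containers for the MDP tuple <S, A, Psi, T, R> and an option <I, pi, beta>.
# Illustrative only: the paper does not prescribe any particular data structure.
from dataclasses import dataclass
from typing import Callable, Hashable, Set, Tuple

State = Hashable
Action = Hashable


@dataclass
class MDP:
    states: Set[State]                               # S
    actions: Set[Action]                             # A
    state_action_space: Set[Tuple[State, Action]]    # Psi: admissible (s, a) pairs
    transition: Callable[[State, Action], State]     # T (deterministic, as assumed in Section 4)
    reward: Callable[[State, Action], float]         # R


@dataclass
class Option:
    initiation_set: Set[State]             # I: states in which the option can be started
    policy: Callable[[State], Action]      # pi: selects primitive actions until termination
    termination: Callable[[State], float]  # beta: per-state termination probability
```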

2.2. Abstraction with MDP homomorphisms

The aim of abstraction in our paper is to group similar state-action pairs from the state-action space $\Psi$. The grouping can be described as a partitioning of $\Psi$.

Definition (Partition of an MDP, block transition probability).

A partition $B$ of an MDP $M = \langle S, A, \Psi, T, R \rangle$ is a partition of $\Psi$. Given a partition $B$ of $M$, the block transition probability of $M$ is the function $T : \Psi \times B|S \to [0, 1]$ defined by $T(s, a, [s']_{B|S}) = \sum_{s'' \in [s']_{B|S}} T(s, a, s'')$.

Definition (Refinement).

A partition $B_1$ is a refinement of a partition $B_2$, written $B_1 \preceq B_2$, if and only if each block of $B_1$ is a subset of some block of $B_2$.

To obtain a grouping of states, the partition of $\Psi$ is projected onto the state space $S$.

Definition (Partition projection).

Let $B$ be a partition of $Z \subseteq X \times Y$, where $X$ and $Y$ are arbitrary sets. For any $x \in X$, let $B(x)$ denote the set of distinct blocks of $B$ containing pairs of which $x$ is a component, that is, $B(x) = \{ [(x, y)]_B \mid (x, y) \in Z \}$. The projection of $B$ onto $X$ is the partition $B|X$ of $X$ such that for any $x, x' \in X$, $[x]_{B|X} = [x']_{B|X}$ if and only if $B(x) = B(x')$.

Next, we define two desirable properties of a partition over $\Psi$.

Definition (Reward respecting partition).

A partition $B$ of an MDP $M = \langle S, A, \Psi, T, R \rangle$ is said to be reward respecting if $[(s, a)]_B = [(s', a')]_B$ implies $R(s, a) = R(s', a')$ for all $(s, a), (s', a') \in \Psi$.

Definition (Stochastic substitution property).

A partition $B$ of an MDP $M = \langle S, A, \Psi, T, R \rangle$ has the stochastic substitution property (SSP) if for all $(s, a), (s', a') \in \Psi$, $[(s, a)]_B = [(s', a')]_B$ implies $T(s, a, [s'']_{B|S}) = T(s', a', [s'']_{B|S})$ for all $[s'']_{B|S} \in B|S$.

Having a partition with these properties, we can construct the quotient MDP (we also call it the abstract MDP).

Definition (Quotient MDP).

Given a reward respecting SSP partition $B$ of an MDP $M = \langle S, A, \Psi, T, R \rangle$, the quotient MDP $M/B$ is the MDP $\langle S', A', \Psi', T', R' \rangle$, where $S' = B|S$; $A' = \bigcup_{[s]_{B|S}} A'_{[s]}$, where $A'_{[s]} = \{ a^{[s]}_1, \ldots, a^{[s]}_{\eta(s)} \}$ for each $[s]_{B|S} \in B|S$; $T'$ is given by $T'([s]_{B|S}, a^{[s]}_i, [s']_{B|S}) = T(s, a_i, [s']_{B|S})$ and $R'$ is given by $R'([s]_{B|S}, a^{[s]}_i) = R(s, a_i)$. Here, $\eta(s)$ is the number of distinct classes of $B$ that contain a state-action pair with $s$ as the state component.
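For the deterministic transition functions considered later in the paper, the quotient MDP can be assembled from a reward respecting SSP partition along the following lines. This is a sketch under our own conventions, not the authors' implementation: the partition is a list of blocks of observed transitions, abstract states are frozensets of block indices (the projection B|S), and abstract actions are block indices.

```python
from collections import defaultdict


def build_quotient_mdp(blocks):
    """Assemble a deterministic quotient MDP from a reward respecting SSP partition.

    `blocks` is a list of state-action blocks, each a list of transitions
    (s, a, s_next, r) with hashable states.  Returns the abstract states and the
    abstract transition/reward tables keyed by (state_block, block_index).
    """
    # Project the state-action partition onto the state space: a state is identified
    # by the set of blocks that contain a pair with that state as the state component.
    state_to_blocks = defaultdict(set)
    for i, block in enumerate(blocks):
        for (s, a, s_next, r) in block:
            state_to_blocks[s].add(i)
    state_block = {s: frozenset(ids) for s, ids in state_to_blocks.items()}

    abstract_states = set(state_block.values())
    T, R = {}, {}
    for i, block in enumerate(blocks):
        for (s, a, s_next, r) in block:
            f = state_block[s]
            R[(f, i)] = r
            # Deterministic dynamics: every pair in the block behaves the same, so a
            # single representative transition defines the abstract successor state.
            if s_next in state_block:  # goal states may never appear as source states
                T[(f, i)] = state_block[s_next]
    return abstract_states, T, R
```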

We want the quotient MDP to retain the structure of the original MDP while abstracting away unnecessary information. MDP homomorphism formalizes this intuition.

Definition (MDP homomorphism).

An MDP homomorphism from $M = \langle S, A, \Psi, T, R \rangle$ to $M' = \langle S', A', \Psi', T', R' \rangle$ is a tuple of surjections $h = \langle f, \{ g_s \mid s \in S \} \rangle$ with $h(s, a) = (f(s), g_s(a))$, where $f : S \to S'$ and $g_s : A_s \to A'_{f(s)}$, such that $T'(f(s), g_s(a), f(s')) = \sum_{s'' \in f^{-1}(f(s'))} T(s, a, s'')$ and $R'(f(s), g_s(a)) = R(s, a)$. We call $M'$ a homomorphic image of $M$ under $h$.

The following theorem states that the quotient MDP defined above retains the structure of the original MDP.

Theorem (Ravindran (2004)).

Let $B$ be a reward respecting SSP partition of an MDP $M = \langle S, A, \Psi, T, R \rangle$. The quotient MDP $M/B$ is a homomorphic image of $M$.

Computing the optimal state-action value function in the quotient MDP usually requires fewer computations, but does it help us act in the underlying MDP? The following theorem states that the optimal state-action value function lifted from the minimized MDP is still optimal in the original MDP:

Theorem (Optimal value equivalence, Ravindran (2004)).

Let $M' = \langle S', A', \Psi', T', R' \rangle$ be the homomorphic image of the MDP $M = \langle S, A, \Psi, T, R \rangle$ under the MDP homomorphism $h = \langle f, \{ g_s \mid s \in S \} \rangle$. For any $(s, a) \in \Psi$, $Q^*(s, a) = Q^*(f(s), g_s(a))$.

3. Related Work

Balaraman Ravindran proposed MDP homomorphisms, together with a sketch of an algorithm for finding them (i.e. finding the minimal MDP homomorphic to the underlying MDP) given the full specification of the MDP, in his Ph.D. thesis Ravindran (2004). The first and, to the best of our knowledge, only algorithm for finding homomorphisms from experience (online) Wolfe and Barto (2006) operates over Controlled Markov Processes (CMPs), an MDP extended with an output function that provides more supervision than the reward function alone. Homomorphisms over CMPs were also used in Wolfe (2006) to find objects that react the same to a defined set of actions.

An approximate MDP homomorphism Ravindran and Barto (2004) allows aggregating state-action pairs with similar, but not identical, dynamics. It is essential when learning homomorphisms from experience in non-deterministic environments, because the estimated transition probabilities of individual state-action pairs will rarely be exactly equal, as an exact MDP homomorphism requires. Taylor et al. (2008) built upon this framework by introducing a similarity metric for state-action pairs as well as an algorithm for finding approximate homomorphisms.

Sorg and Singh (2009) developed a method based on homomorphisms for transferring a predefined optimal policy to a similar task. However, their approach maps only states and not actions, requiring actions to behave the same across all MDPs. Soni and Singh (2006) and Rajendran and Huber (2009) also studied skill transfer in the framework of MDP homomorphisms. Their works focus on the problem of transferring policies between discrete or factored MDPs with pre-defined mappings, whereas our primary contribution is the abstraction of MDPs with continuous state spaces.

1: procedure Abstraction
2:     E ← collect initial experience with an arbitrary policy
3:     g ← a classifier for state-action pairs
4:     B ← OnlinePartitionIteration(E, g)
5:     M/B ← a quotient MDP constructed from B according to Definition 2.2
6: end procedure
Algorithm 1 Abstraction

4. Methods

We solve the problem of abstracting an MDP with a discrete or continuous state-space and a discrete action space. The MDP can have an arbitrary reward function, but we restrict the transition function to be deterministic. This restriction simplifies our algorithm and makes it more sample-efficient (because we do not have to estimate the transition probabilities for each state-action pair).

This section starts with an overview of our abstraction process (Subsection 4.1), followed by a description of our algorithm for finding MDP homomorphisms (Subsection 4.2). We describe several augmentations to the base algorithm that make it faster and increase its robustness in Subsection 4.3. Finally, Subsection 4.4 contains the description of our transfer learning method that leverages the learned MDP homomorphism to speed up the learning of new tasks.

4.1. Abstraction

Algorithm 1 gives an overview of our abstraction process. Since we find MDP homomorphisms from experience, we first need to collect experience that is diverse enough. For simple environments, a random exploration policy provides such experience. However, a random walk is clearly not sufficient for more realistic environments because it rarely reaches the goal of the task. Therefore, we use the vanilla version of a deep Q-network Mnih et al. (2015) to collect the initial experience in bigger environments.
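A minimal sketch of the experience collection step, assuming an environment object that exposes reset(), admissible_actions(s) and step(a) returning (next_state, reward, done); this interface and the function name are illustrative, not the paper's.

```python
import random


def collect_experience(env, num_episodes, max_steps=20):
    """Collect (s, a, s_next, r) transitions with a uniform random policy."""
    experience = []
    for _ in range(num_episodes):
        s = env.reset()
        for _ in range(max_steps):
            a = random.choice(env.admissible_actions(s))
            s_next, r, done = env.step(a)
            experience.append((s, a, s_next, r))
            s = s_next
            if done:
                break
    return experience
```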

Subsequently, we partition the state-action space of the original MDP based on the collected experience with our Online Partition Iteration algorithm (Algorithm 2). The algorithm is described in detail in Subsection 4.2. The state-action partition, the output of Algorithm 2, induces a quotient, or abstract, MDP according to Definition 2.2.

The quotient MDP can be used both to select actions for the current task (Subsection 4.2) and for the purpose of learning a new task faster (Subsection 4.4).

Input: Experience E, classifier g.

Output: Reward respecting SSP partition B.

1: procedure OnlinePartitionIteration(E, g)
2:     B ← SplitRewards(E)
3:     B' ← ∅
4:     while B ≠ B' do
5:         B' ← B
6:         B|S ← Project(B, g)
7:         for block c in B|S do
8:             while B contains a block b for which Split(b, c, B) ≠ B do
9:                 B ← Split(b, c, B)
10:            end while
11:        end for
12:    end while
13: end procedure
Algorithm 2 Online Partition Iteration

4.2. Partitioning algorithm

Our online partitioning algorithm (Algorithm 2) is based on the Partition Iteration algorithm from Givan et al. (2003). It was originally developed for partitioning based on stochastic bisimulation, and we adapted it to MDP homomorphisms (following Ravindran’s sketch Ravindran (2004)). Algorithm 2 starts with a reward respecting partition obtained by separating transitions that receive distinct rewards (SplitRewards). The reward respecting partition is subsequently refined with the Split operation (Algorithm 4) until it attains the SSP property. Split(b, c, B) splits a state-action block b from the state-action partition B with respect to a state block c obtained by projecting the partition onto the state space.
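The following Python sketch mirrors the structure of Algorithm 2. The functions project and split correspond to the sketches given after Algorithms 3 and 4 below; train_classifier and sample_actions are assumed helpers that fit the classifier g on the current blocks and enumerate candidate actions for a state.

```python
def split_rewards(experience):
    """Initial reward respecting partition: one block per distinct reward."""
    by_reward = {}
    for (s, a, s_next, r) in experience:
        by_reward.setdefault(r, []).append((s, a, s_next, r))
    return list(by_reward.values())


def online_partition_iteration(experience, train_classifier, sample_actions):
    """Refine the reward respecting partition until it has the SSP property."""
    blocks = split_rewards(experience)
    previous = None
    while previous != blocks:
        previous = [list(b) for b in blocks]       # copy for the convergence test
        g = train_classifier(blocks)               # g(s, a) -> block index
        state_blocks = project(blocks, g, sample_actions)
        for c in state_blocks:                     # refine w.r.t. every state block
            changed = True
            while changed:
                changed = False
                for i in range(len(blocks)):
                    new_blocks = split(i, c, blocks)
                    if new_blocks != blocks:
                        blocks = new_blocks
                        changed = True
                        break
    return blocks
```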

The projection of the state-action partition onto the state space (Algorithm 3) is the most complex component of our method. We train a classifier $g$, which can be an arbitrary model, to classify state-action pairs into their corresponding state-action blocks. The training set consists of all transitions the agent has experienced, with each transition labeled with the state-action block it belongs to. During State Projection, $g$ evaluates a state under a sampled set of actions, predicting a state-action block for each action. For discrete action spaces, the set of actions should include all available actions. The set of predicted state-action blocks determines which state block the state belongs to.

Figure 2 illustrates the projection process: a single state $s$ is evaluated under four actions $a_1$, $a_2$, $a_3$ and $a_4$. The first three actions are classified into the state-action block $b_1$, whereas the last action is assigned to block $b_2$. Therefore, $s$ belongs to a state block identified by the set of predicted state-action blocks $\{b_1, b_2\}$.

The output of Online Partition Iteration is a partition of the state-action space $\Psi$. According to Definition 2.2, we can use the partition to construct a quotient MDP. Since the quotient MDP is fully defined, we can compute its optimal Q-values with a dynamic programming method such as Value Iteration Sutton and Barto (1998).
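Because the abstract model is small and deterministic, its optimal Q-values can be computed with a few lines of standard value iteration. The sketch below operates on the tables returned by the build_quotient_mdp sketch in Subsection 2.2; the discount factor is an assumption of ours, as the paper does not state one.

```python
def value_iteration(T, R, gamma=0.9, tol=1e-6):
    """Q-values of a deterministic quotient MDP whose transition/reward tables
    are keyed by (abstract_state, abstract_action)."""
    Q = {sa: 0.0 for sa in R}
    while True:
        delta = 0.0
        for (f, b), r in R.items():
            if (f, b) in T:
                s_next = T[(f, b)]
                next_value = max(
                    (Q[(s2, b2)] for (s2, b2) in Q if s2 == s_next), default=0.0)
            else:
                next_value = 0.0  # terminal or unobserved successor
            new_q = r + gamma * next_value
            delta = max(delta, abs(new_q - Q[(f, b)]))
            Q[(f, b)] = new_q
        if delta < tol:
            return Q
```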

To be able to act according to the quotient MDP, we need to connect it to the original MDP in which we select actions. Given a current state $s$ and the set of actions admissible in $s$, $A_s$, we predict the state-action block of each pair $(s, a)$, $a \in A_s$, using the classifier $g$. Note that Online Partition Iteration trains $g$ in the process of refining the partition. This process of predicting state-action blocks corresponds to a single step of State Projection: we determine which state block $s$ belongs to. Since each state in the quotient MDP corresponds to a single state block (by Definition 2.2), we can map $s$ to some state in the quotient MDP.

Given the current state in the quotient MDP, we select the action with the highest Q-value and map it back to the underlying MDP. An action in the quotient MDP can correspond to more than one action in the underlying MDP. For instance, an action that places a puck on the ground can be executed in many locations while still having the same Q-value in the context of puck stacking. We break ties between actions by sampling a single action in proportion to the confidence predicted by $g$: the probability it assigns to the selected state-action block given the state-action pair.
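A sketch of this action selection step follows. The interfaces are assumptions: g_proba(s, a) returns a dictionary of block probabilities, q_values maps (abstract state, block index) to a Q-value, and state_block_of(s) performs the single State Projection step described above.

```python
import random


def select_action(s, admissible_actions, g_proba, q_values, state_block_of):
    """Greedy abstract action, with ties among primitive actions broken in
    proportion to the classifier's confidence."""
    f = state_block_of(s)
    candidate_blocks = [b for (state, b) in q_values if state == f]
    if not candidate_blocks:                  # state block unknown to the abstract MDP
        return random.choice(admissible_actions)
    best_block = max(candidate_blocks, key=lambda b: q_values[(f, b)])
    # Primitive actions whose most likely block is the chosen abstract action.
    actions, weights = [], []
    for a in admissible_actions:
        proba = g_proba(s, a)
        if max(proba, key=proba.get) == best_block:
            actions.append(a)
            weights.append(proba[best_block])
    if not actions:                           # classifier disagrees; fall back to uniform
        return random.choice(admissible_actions)
    return random.choices(actions, weights=weights, k=1)[0]
```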

Figure 2. Projection (Algorithm 3) of a single state $s$. $s$ is evaluated under actions $a_1$, $a_2$, $a_3$ and $a_4$. For each pair $(s, a_i)$, the classifier $g$ predicts its state-action block. $s$ belongs to a state block identified by the set of predicted state-action blocks $\{b_1, b_2\}$.

4.3. Speeding-up and increasing robustness

Input: State-action partition B, classifier g.

Output: State partition B|S.

1: procedure Project(B, g)
2:     B|S ← ∅
3:     for block b in B do
4:         for transition (s, a, s', r) in b do
5:             A' ← a sampled set of actions available in s'
6:             blocks ← ∅
7:             for action a'' in A' do
8:                 b'' ← g(s', a'')
9:                 blocks ← blocks ∪ {b''}
10:            end for
11:            add s' to the block of B|S identified by blocks
12:        end for
13:    end for
14: end procedure
Algorithm 3 State Projection
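In Python, the projection can be written compactly by grouping the observed next states by their sets of predicted block indices. The sketch assumes g(s, a) returns a block index and sample_actions(s) returns the candidate actions (all actions in the discrete case); states must be hashable.

```python
def project(blocks, g, sample_actions):
    """Sketch of Algorithm 3: project a state-action partition onto the state space.

    Returns a list of sets of states; two states share a set exactly when the
    classifier predicts the same set of state-action blocks for them.
    """
    by_signature = {}
    for block in blocks:
        for (s, a, s_next, r) in block:
            # Evaluate the next state under a set of candidate actions.
            signature = frozenset(g(s_next, a2) for a2 in sample_actions(s_next))
            by_signature.setdefault(signature, set()).add(s_next)
    return list(by_signature.values())
```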

Retraining the classifier from scratch at every step of Online Partition Iteration can be time-consuming (especially if $g$ is a neural network). Moreover, if the task is sufficiently complex, the classifier will make mistakes during State Projection. For these reasons, we developed several modifications to our partitioning algorithm that increase its speed and robustness. Some of them are specific to a neural network classifier.

  • weight reuse: The neural network is initialized with the number of logits (outputs) equal to the maximum number of state-action blocks. The network is trained from scratch only the first time; its weights are kept for all subsequent re-trainings, with old state-action blocks assigned to the same logits and new state-action blocks assigned to free logits.

  • early stopping of training: We reserve 20% of the experience to calculate the validation accuracy. Having a measure of performance, we can select the snapshot of the neural network that performs best and stop the training after no improvement for a given number of steps.

  • class balancing: The sets of state-action pairs belonging to different state-action blocks can be extremely unbalanced. Namely, the number of transitions that are assigned a positive reward is usually low. We follow the best practices from Buda et al. (2018) and over-sample all minority classes so that the number of samples for each class is equal to the size of the majority class (see the sketch after this list). We found that decision trees do not require oversampling; hence, we use this method only with a neural network.

  • state-action block size threshold: During State Projection, the classifier sometimes misclassifies a state-action pair into the wrong state-action block. Hence, the State Projection algorithm can assign a state to a wrong state block. This problem usually manifests itself in the algorithm "hallucinating" state blocks that do not exist in reality (note that there are $2^{|B|} - 1$ possible state blocks, given a state-action partition $B$). To prevent the Split function from over-segmenting the state-action partition due to these phantom state blocks, we only split a state-action block if the new blocks each contain a number of samples higher than a threshold.
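The class balancing step can be implemented as plain random oversampling, in the spirit of Buda et al. (2018). The sketch below is our own: features holds the state-action inputs and labels the corresponding state-action block indices.

```python
import numpy as np


def oversample_to_majority(features, labels, rng=None):
    """Replicate minority-class samples until every class matches the majority class."""
    rng = rng or np.random.default_rng()
    features, labels = np.asarray(features), np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(labels))]                   # keep every original sample
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(labels == cls)
            keep.append(rng.choice(idx, size=target - count, replace=True))
    order = rng.permutation(np.concatenate(keep))     # shuffle the oversampled set
    return features[order], labels[order]
```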

Input: State-action block b, state block c, state-action partition B.

Output: State-action partition B'.

1: procedure Split(b, c, B)
2:     b1 ← ∅, b2 ← ∅
3:     for transition (s, a, s', r) in b do
4:         if s' ∈ c then
5:             b1 ← b1 ∪ {(s, a, s', r)}
6:         else
7:             b2 ← b2 ∪ {(s, a, s', r)}
8:         end if
9:     end for
10:    B' ← (B \ {b}) ∪ {b1, b2}
11:    if b1 or b2 contains fewer samples than the block size threshold then
12:        B' ← B
13:    end if
14: end procedure
Algorithm 4 Split
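For deterministic transitions, Split reduces to separating the transitions whose next state falls inside the given state block from those whose next state does not. The following sketch uses our own list-based representation and folds in the block size threshold from Subsection 4.3.

```python
def split(block_index, state_block, blocks, min_size=1):
    """Sketch of Algorithm 4: split blocks[block_index] with respect to state_block.

    Returns a new partition, or the original one unchanged if the split would create
    a block smaller than min_size (the state-action block size threshold).
    """
    inside, outside = [], []
    for (s, a, s_next, r) in blocks[block_index]:
        (inside if s_next in state_block else outside).append((s, a, s_next, r))
    if len(inside) < min_size or len(outside) < min_size:
        return blocks                        # refuse the split
    new_blocks = [b for i, b in enumerate(blocks) if i != block_index]
    new_blocks.extend([inside, outside])
    return new_blocks
```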

4.4. Transferring abstract MDPs

Solving a new task from scratch requires the agent to take a random walk before it stumbles upon a reward. The abstract MDP learned in the previous task can guide exploration by taking the agent into a good starting state. However, how do we know which state block in the abstract MDP is a good start for solving a new task?

If we do not have any prior information about the structure of the next task, the agent needs to explore the starting states. To formalize this, we create options, each taking the agent to a particular state in the quotient MDP from the first task. Each option is a tuple $\langle I, \pi, \beta \rangle$ with

  • $I$ being the set of all starting states of the MDP for the new task,

  • $\pi$ using the quotient MDP from the previous task to select actions that lead to a particular state in the quotient MDP (see Subsection 4.2 for more details), and

  • $\beta$ terminating the option when the target state is reached.

The agent learns the Q-values of the options with a Monte Carlo update (Sutton and Barto (1998)) with a fixed learning rate $\alpha$; the agent prefers options that, upon being executed, make it reach the goal the fastest. If the tasks are similar enough, the agent will find an option that brings it closer to the goal of the next task. If not, the agent can choose not to execute any option.
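A sketch of the option-value learning just described. The constant-alpha Monte Carlo update and the epsilon-greedy choice among options (including a possible "no option" entry) are our reading of the text; the names are illustrative.

```python
import random


def update_option_value(option_q, executed_option, episode_return, alpha=0.1):
    """Constant-alpha Monte Carlo update of the value of the executed option."""
    option_q[executed_option] += alpha * (episode_return - option_q[executed_option])
    return option_q


def choose_option(option_q, epsilon=0.1):
    """Epsilon-greedy choice among options; the caller may include a 'no option'
    entry in option_q to represent skipping the options entirely."""
    if random.random() < epsilon:
        return random.choice(list(option_q))
    return max(option_q, key=option_q.get)
```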

We use a DQN to collect the initial experience in all transfer learning experiments. While our algorithm suffers from the same scalability issues as a DQN when learning the initial task, our transfer learning method makes the learning of new tasks easier by guiding the agent’s exploration.

5. Proof of Correctness

This section contains the proof of the correctness of our algorithm. We first prove two lemmas that support the main theorem. The first lemma and corollary ensure that Algorithm 2 finds a reward respecting SSP partition.

Lemma

Given a reward respecting partition $B$ of an MDP $M = \langle S, A, \Psi, T, R \rangle$ and $(s, a), (s', a') \in \Psi$ such that $T(s, a, [s'']_{B|S}) \neq T(s', a', [s'']_{B|S})$ for some $[s'']_{B|S} \in B|S$, $(s, a)$ and $(s', a')$ are not in the same block of any reward respecting SSP partition refining $B$.

Proof.

Following the proof of Lemma 8.1 from Givan et al. (2003): proof by contradiction.

Let $B'$ be a reward respecting SSP partition that is a refinement of $B$. Let $(s, a), (s', a') \in \Psi$ be such that $T(s, a, [s'']_{B|S}) \neq T(s', a', [s'']_{B|S})$ for some $[s'']_{B|S} \in B|S$, and assume $(s, a)$ and $(s', a')$ are in the same block of $B'$. Because $B'$ is a reward respecting SSP partition, for each state block $c \in B'|S$, $T(s, a, c) = T(s', a', c)$. Since $B'$ refines $B$, each block of $B|S$ is a union of blocks of $B'|S$, so $T(s, a, [s'']_{B|S}) = \sum_{c \subseteq [s'']_{B|S}} T(s, a, c) = \sum_{c \subseteq [s'']_{B|S}} T(s', a', c) = T(s', a', [s'']_{B|S})$. This contradicts the assumption that the two block transition probabilities differ.

Corollary

Let $B$ be a reward respecting partition of an MDP $M = \langle S, A, \Psi, T, R \rangle$, $b$ a block in $B$ and $c$ a union of blocks from $B|S$. Every reward respecting SSP partition over $\Psi$ that refines $B$ is a refinement of the partition $\mathrm{Split}(b, c, B)$.

Proof.

Following the proof of Corollary 8.2 from Givan et al. (2003).

Let $(s, a), (s', a') \in \Psi$ lie in the same block of $B$. Let $B'$ be a reward respecting SSP partition that refines $B$. $\mathrm{Split}(b, c, B)$ will only separate $(s, a)$ and $(s', a')$ if $T(s, a, c) \neq T(s', a', c)$. But if $T(s, a, c) \neq T(s', a', c)$, then there must be some block $c' \in B|S$ with $c' \subseteq c$ such that $T(s, a, c') \neq T(s', a', c')$, because for any $(s, a)$, $T(s, a, c) = \sum_{c' \subseteq c} T(s, a, c')$. Therefore, we can conclude by Lemma 5 that $(s, a)$ and $(s', a')$ are not in the same block of $B'$, and hence $B'$ refines $\mathrm{Split}(b, c, B)$.

The versions of Partition Iteration from Givan et al. (2003) and Ravindran (2004) partition a fully-defined MDP. We designed our algorithm for the more realistic case, where only a stream of experience is available. This change makes the algorithm different only during State Projection (Algorithm 3). In the next lemma, we prove that the output of State Projection converges to the projected state partition $B|S$ as the number of experienced transitions goes to infinity.

Lemma

Let $M = \langle S, A, \Psi, T, R \rangle$ be an MDP with a finite $A$, a finite or infinite $S$, a state-action space $\Psi$ that is a separable metric space and a deterministic $T$, with experience collected such that each state-action pair is visited with a probability greater than zero. Let $A'$ be the sampled set of actions (Algorithm 3, line 5). Let $x_1, x_2, \ldots$ be i.i.d. random variables that represent observed transitions and let $g$ be a 1-nearest-neighbor classifier that classifies state-action pairs into state-action blocks, taking the nearest neighbor to a queried state-action pair from the set of observed transitions. Let $B$ be a state-action partition over the observed transitions and $B|S$ its projection onto the state space. Let $(B|S)'$ be the state partition obtained by the State Projection algorithm with $g$ taking neighbors from the observed transitions. Then $(B|S)' \to B|S$ as the number of observed transitions goes to infinity, with probability one.

Proof.

$B|S$ is obtained by projecting $B$ onto $S$. In this process, $S$ is divided into blocks based on $B(s)$, the set of distinct blocks of $B$ containing pairs of which $s$ is a component, $B(s) = \{ [(s, a)]_B \mid (s, a) \in \Psi \}$. Given a state, line 8 in Algorithm 3 predicts a state-action block for each sampled action, so that the set of predicted blocks estimates $B(s)$. By the Convergence of the Nearest Neighbor Lemma Cover and Hart (1967), each prediction converges to the true block of the queried state-action pair with probability one. The rest of Algorithm 3 exactly follows the projection procedure (Definition 3); therefore, $(B|S)' \to B|S$ with probability one.

Finally, we prove the correctness of our algorithm given an infinite stream of i.i.d. experience. While the i.i.d. assumption does not usually hold in reinforcement learning (RL), the deep RL literature often leverages an experience buffer Mnih et al. (2015) to ensure the training data is diverse enough. Our algorithm also uses a large experience buffer to collect the data needed to run Online Partition Iteration.

Theorem (Correctness).

Let $M = \langle S, A, \Psi, T, R \rangle$ be an MDP with a finite $A$, a finite or infinite $S$, a state-action space $\Psi$ that is a separable metric space and a deterministic $T$, with experience collected such that each state-action pair is visited with a probability greater than zero. Let $A'$ be the sampled set of actions (Algorithm 3, line 5). Let $x_1, x_2, \ldots$ be i.i.d. random variables that represent observed transitions and let $g$ be a 1-nearest-neighbor classifier that classifies state-action pairs into state-action blocks. As the number of observed transitions goes to infinity, Algorithm 2 computes a reward respecting SSP partition over the observed state-action pairs with probability one.

Proof.

Loosely following the proof of Theorem 8 from Givan et al. (2003).

Let $B$ be a partition over the observed state-action pairs, $S$ the set of observed states and $(B|S)'$ the result of Project$(B, g)$ (Algorithm 3).

Algorithm 2 first splits the initial partition such that a block is created for each set of transitions with a distinct reward (line 2). Therefore, Algorithm 2 refines a reward respecting partition from line 2 onward.

Algorithm 2 terminates with a partition $B$ when $\mathrm{Split}(b, c, B) = B$ for all $b \in B$ and $c \in (B|S)'$. $\mathrm{Split}$ will split any block containing $(s, a)$ and $(s', a')$ for which $T(s, a, c) \neq T(s', a', c)$. According to Lemma 2, $(B|S)' \to B|S$ as the number of observed transitions goes to infinity, with probability one. Consequently, any partition returned by Algorithm 2 must be a reward respecting SSP partition.

Since Algorithm 2 first creates a reward respecting partition, and each step only refines the partition by applying Split, we can conclude by Corollary 1 that each partition encountered, including the resulting partition, must contain a reward respecting SSP partition.

Figure 3. Cumulative rewards for the task of stacking three pucks in a 3x3 environment with varying state-action block size thresholds. Our method is quite robust to the value of the threshold. Too high a threshold value prevents our algorithm from finding a homomorphic quotient MDP. The results were averaged over 20 runs.

6. Experiments

Our algorithm is intended to run in continuous state space environments. However, to compare with prior work, we also developed a version for discrete environments. We test our algorithm’s ability to solve a single task and compare it with prior work in Subsection 6.2, and we experiment with two kinds of task transfer in Subsection 6.3. All our neural networks were implemented in Tensorflow (https://www.tensorflow.org/, version 1.7.0). The source code for our work is available online; we will link it after the end of the review period.

6.1. Setup

Figure 4. Comparison with Wolfe and Barto (2006) in the Blocks World environment. The horizontal line marks the highest mean reward per time step reached by Wolfe and Barto. We averaged our results over 100 runs; each run selects a different goal state.

We first describe the convolutional network we use for the classifier and the exact settings of the deep Q-network we use both as a baseline and to generate experience for Online Partition Iteration. Then we describe the two environments in which we test our agent.

6.1.1. Convolutional network architecture and settings

We ran a grid search over various convolutional network architectures. In particular, we searched for the architecture with the highest downsampling factor of the depth image (the observed state) that still achieves an acceptable accuracy. Figure 7 shows the best-performing architecture. We set the learning rate to 0.001, the batch size to 32 and the weight decay for all layers to 0.0001 without an extensive grid search. The early stopping constant was set to 1000 steps.

Figure 5. Cumulative rewards for the first transfer learning experiment in a 4x4 environment. In the first 1000 episodes, the agent learns the initial task: stacking two pucks. Subsequently, the agent tries to learn a second, harder, task: stacking three pucks. The results were averaged over 20 runs.

6.1.2. Deep Q-network settings

Our implementation of the vanilla deep Q-network (DQN, Mnih et al. (2015)) is based on the OpenAI baselines (https://github.com/openai/baselines) with a custom architecture described above. We did not use prioritized experience replay Schaul et al. (2015) or dueling networks Wang et al. (2015). We ran a second grid search to find the best learning rate and target network update frequency, with the following results: a learning rate of 0.0001 and a target network update every 100 steps. For exploration, we linearly decayed the value of $\epsilon$ from 1.0 to 0.1 over 5000 steps in all experiments involving a DQN. The batch size was set to 32.

6.1.3. Puck stacking

For our continuous task, we chose stacking pucks in a grid world environment (see Figure 1) with end-to-end actions: the agent can attempt to pick up or place a puck in any of the cells. Executing the pick action in a cell with a puck places it into the agent’s hand, whereas trying to pick from an empty cell does not alter the state of the environment. The place action is allowed everywhere as long as the agent is holding a puck. The agent senses the environment as a depth image, whose size is determined by the number of cells along an axis, together with the state of its hand: either full or empty. The task is episodic and the goal is to stack a target number of pucks on top of each other. The task terminates after 20 time steps or when the goal is reached. Upon reaching the goal, the agent is awarded a reward of 10; all other state-action pairs get a reward of 0.
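For reference, a toy reimplementation of the puck stacking environment from this textual description might look as follows; the depth image is approximated by a map of stack heights, and the interface matches the collect_experience sketch from Subsection 4.1. This is our own sketch, not the authors' environment.

```python
import numpy as np


class PuckStacking:
    """Grid world puck stacking with end-to-end pick/place actions."""

    def __init__(self, grid=4, num_pucks=2, goal_height=2, max_steps=20):
        self.grid, self.num_pucks = grid, num_pucks
        self.goal_height, self.max_steps = goal_height, max_steps

    def reset(self):
        self.heights = np.zeros((self.grid, self.grid), dtype=int)
        cells = np.random.choice(self.grid * self.grid, self.num_pucks, replace=False)
        for c in cells:  # place each puck in a distinct random cell
            self.heights[c // self.grid, c % self.grid] = 1
        self.hand_full, self.t = False, 0
        return self._obs()

    def _obs(self):
        # Stand-in for the depth image: a map of stack heights plus the hand state.
        return (tuple(self.heights.flatten()), self.hand_full)

    def admissible_actions(self, obs=None):
        kind = "place" if self.hand_full else "pick"
        return [(kind, r, c) for r in range(self.grid) for c in range(self.grid)]

    def step(self, action):
        kind, r, c = action
        if kind == "pick" and not self.hand_full and self.heights[r, c] > 0:
            self.heights[r, c] -= 1
            self.hand_full = True
        elif kind == "place" and self.hand_full:
            self.heights[r, c] += 1
            self.hand_full = False
        self.t += 1
        reached_goal = self.heights.max() >= self.goal_height
        reward = 10.0 if reached_goal else 0.0
        done = reached_goal or self.t >= self.max_steps
        return self._obs(), reward, done
```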

The convolutional network described above was used as the classifier $g$. The agent collected experience either with a random uniform policy or a DQN (Mnih et al. (2015)) of the same structure as $g$, except for the number of output neurons. The number of state-action blocks was limited to 10: the purpose of the limit was mostly to speed up faulty experiments, but an over-segmented partition can still perform well in some cases. The replay buffers of the partitioning algorithm and the DQN were both limited to 10000 transitions, which suffices for the purpose of finding a reward respecting SSP partition.

6.1.4. Blocks world

We implemented the blocks world environment from Wolfe and Barto (2006). The environment consists of three blocks that can be placed in four positions. The blocks can be stacked on top of one another, and the goal is to place a particular block, called the focus block, in the goal position and height. With four positions and three blocks, 12 tasks of increasing difficulty can be generated. The agent is penalized with -1 reward for each action that does not lead to the goal; reaching the goal state results in 100 reward.

Although a neural network can learn to solve this task, a decision tree trains two orders of magnitude faster and often reaches better performance. We used a decision tree from the scikit-learn package (http://scikit-learn.org, version 0.19.2) with the default settings as our classifier. All modifications from Subsection 4.3 specific to a neural network were omitted: weight reuse, early stopping of training and class balancing. We also disabled the state-action block size threshold because the number of unique transitions generated by this environment is low and the decision tree does not make many mistakes. Despite the decision tree reaching high accuracy, we cap the number of state-action blocks at 100 to avoid creating thousands of blocks if the algorithm fails. The abstract MDP was recreated every 3000 time steps and the task terminated after 15000 time steps.

6.2. Single task

Figure 6. Cumulative rewards for the second transfer learning experiment in a 4x4 environment. In the first 1500 episodes, the agent learns the initial task: stacking three pucks. The second task requires the agent to make two stacks of two pucks. The results were averaged over 20 runs.

We first demonstrate our algorithm’s ability to find the abstract MDP homomorphic to the underlying problem and to act optimally given the abstract MDP. The agent was taught to solve the task of stacking 3 pucks in a 3x3 grid world environment; a random uniform policy selects actions in the first 100 episodes and we run the Online Partition Iteration algorithm on the collected experience to update the quotient MDP every 100 episodes after that.

Figure 3 shows the cumulative reward of our algorithm over 2000 episodes with five different settings of the state-action block size threshold. If the threshold is set too high, the algorithm cannot split the necessary blocks, which leads to under-segmentation. A policy planned in an under-segmented abstract MDP does not fit the underlying MDP well. All other settings eventually achieved nearly optimal behavior, but the higher the threshold, the longer it took, because more experience needed to be collected. Conversely, if the threshold was set too low, our algorithm became more susceptible to misclassification and the performance dropped. The minimal MDP learned by our algorithm contains five state-action blocks; see Figure 1 for an example of a quotient MDP for the similar task of stacking two pucks, which has three state-action blocks.

In Figure 4, we compare the decision tree version of our algorithm (as described in Subsection 6.1) with the results reported in Wolfe and Barto (2006). There are several differences between our experiments: the algorithm described in Wolfe and Barto (2006) works with a Controlled Markov Process (CMP), an MDP augmented with an output function that provides more supervision than the reward. Therefore, their algorithm can start segmenting state-action blocks before it even observes the reward. The CMP also allows an easy transfer of the learned partition from one task to another; we solve each task separately. On the other hand, each action in Wolfe and Barto's version of the task has a 0.2 chance of failure, but we omit this detail to satisfy the assumptions of our algorithm. Even though each version of the task is easier in some ways and harder in others, we believe the comparison with the only previous algorithm that solves the same problem is valuable.

6.3. Task transfer

Figure 7. A diagram of the architecture of our convolutional network. "Conv" stands for a convolutional layer. Each convolutional and fully-connected layer is followed by a ReLU activation function.

We first test our algorithm on a sequence of two tasks, where getting to the goal of the first task helps the agent reach the goal of the second task. Specifically, the first task is stacking two pucks in a 4x4 grid world environment and the second task requires the agent to stack three pucks. Both of the tasks run for 1000 episodes. We compare our method, which first executes the option that brings the agent to the goal of the first task and then acts with a vanilla DQN, with two baselines:

  • baseline: A vanilla DQN. We reset its weights and replay buffer after the end of each task, so it does not retain any information from the previous task it solved.

  • baseline, weight sharing: The same as the above, but we do not reset its weights. When a new task starts, it goes to the goal state of the previous task and explores from there.

Our agent augmented with the goal option reaches a similar cumulative reward to the baseline with weight sharing (Figure 5). We expected this result because creating an abstract MDP of the whole environment does not bring any more benefit than simply knowing how to get to the goal state of the previous task.

To show the benefit of the abstract MDP, we created a sequence of tasks in which reaching the goal of the first task does not help: the first task is stacking three pucks and the second task is making two stacks of height two. Upon the completion of the first task, our agent is augmented with options for reaching all state blocks of the abstract MDP. The agent learns the Q-values of the options with a Monte Carlo update Sutton and Barto (1998) with the learning rate set to 0.1. The baselines are the same as in the previous experiment.

Figure 6 shows that our agent learns to select the correct option that brings it closer to the goal of the second task, reaching a significantly higher cumulative reward than both of the baselines. An unexpected result is that the baseline with weight sharing performs better than the other baseline even though reaching the goal of the first task is not as beneficial. We hypothesize that the DQN can learn the second policy more easily because all of its convolutional layers are pre-trained to spot the pucks from the first task; pre-training a convolutional network has been shown to help in tasks such as image classification.

7. Conclusion

We developed Online Partition Iteration, an algorithm for finding abstract MDPs in discrete and continuous state spaces from experience, building on the work of Givan et al. (2003), Ravindran (2004) and Wolfe and Barto (2006). We proved the correctness of our algorithm under certain assumptions and demonstrated that it can successfully abstract medium-sized MDPs. In addition to being interpretable, the abstract MDPs can guide exploration when learning a new task. We created a transfer learning method in the framework of options Sutton et al. (1999), and demonstrated that it learns new tasks faster than a baseline transfer learning method based on weight sharing.

References

  • Bellman (1957) Richard Bellman. 1957. A Markovian Decision Process. Journal of Mathematics and Mechanics 6, 5 (1957), 679–684.
  • Buda et al. (2018) Mateusz Buda, Atsuto Maki, and Maciej A. Mazurowski. 2018. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks 106 (2018), 249–259.
  • Cover and Hart (1967) T. Cover and P. Hart. 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13, 1 (January 1967), 21–27. https://doi.org/10.1109/TIT.1967.1053964
  • Givan et al. (2003) Robert Givan, Thomas Dean, and Matthew Greig. 2003. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence 147, 1 (2003), 163 – 223. https://doi.org/10.1016/S0004-3702(02)00376-4 Planning with Uncertainty and Incomplete Information.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level control through deep reinforcement learning. Nature 518 (2015), 529–533.
  • Rajendran and Huber (2009) S. Rajendran and M. Huber. 2009. Learning to generalize and reuse skills using approximate partial policy homomorphisms. In 2009 IEEE International Conference on Systems, Man and Cybernetics. 2239–2244. https://doi.org/10.1109/ICSMC.2009.5345891
  • Ravindran (2004) Balaraman Ravindran. 2004. An Algebraic Approach to Abstraction in Reinforcement Learning. Ph.D. Dissertation. AAI3118325.
  • Ravindran and Barto (2004) Balaraman Ravindran and Andrew G Barto. 2004. Approximate homomorphisms: A framework for non-exact minimization in Markov decision processes. (2004).
  • Schaul et al. (2015) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. 2015. Prioritized Experience Replay. CoRR abs/1511.05952 (2015). arXiv:1511.05952 http://arxiv.org/abs/1511.05952
  • Soni and Singh (2006) Vishal Soni and Satinder Singh. 2006. Using Homomorphisms to Transfer Options Across Continuous Reinforcement Learning Domains. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1 (AAAI’06). AAAI Press, 494–499.
  • Sorg and Singh (2009) Jonathan Sorg and Satinder Singh. 2009. Transfer via Soft Homomorphisms. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 2 (AAMAS ’09). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 741–748.
  • Sutton and Barto (1998) Richard S. Sutton and Andrew G. Barto. 1998. Introduction to Reinforcement Learning (1st ed.). MIT Press, Cambridge, MA, USA.
  • Sutton et al. (1999) Richard S. Sutton, Doina Precup, and Satinder Singh. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112, 1 (1999), 181 – 211. https://doi.org/10.1016/S0004-3702(99)00052-1
  • Taylor et al. (2008) Jonathan J. Taylor, Doina Precup, and Prakash Panangaden. 2008. Bounding Performance Loss in Approximate MDP Homomorphisms. In Proceedings of the 21st International Conference on Neural Information Processing Systems (NIPS’08). Curran Associates Inc., USA, 1649–1656.
  • Wang et al. (2015) Ziyu Wang, Nando de Freitas, and Marc Lanctot. 2015. Dueling Network Architectures for Deep Reinforcement Learning. CoRR abs/1511.06581 (2015). arXiv:1511.06581 http://arxiv.org/abs/1511.06581
  • Wolfe (2006) Alicia P. Wolfe. 2006. Defining Object Types and Options Using MDP Homomorphisms.
  • Wolfe and Barto (2006) Alicia Peregrin Wolfe and Andrew G. Barto. 2006. Decision Tree Methods for Finding Reusable MDP Homomorphisms. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1 (AAAI’06). AAAI Press, 530–535.