Hierarchical Linearly-Solvable Markov Decision Problems

03/10/2016 · Anders Jonsson, Vicenç Gómez · Universitat Pompeu Fabra

We present a hierarchical reinforcement learning framework that formulates each task in the hierarchy as a special type of Markov decision process for which the Bellman equation is linear and has an analytical solution. Problems of this type, called linearly-solvable MDPs (LMDPs), have interesting properties that can be exploited in a hierarchical setting, such as efficient learning of the optimal value function or task compositionality. The proposed hierarchical approach can also be seen as a novel alternative for solving LMDPs with large state spaces. We derive a hierarchical version of the so-called Z-learning algorithm that learns different tasks simultaneously, and show empirically that it significantly outperforms state-of-the-art learning methods in two classical hierarchical reinforcement learning domains: the taxi domain and an autonomous guided vehicle task.


1 Introduction

Hierarchical reinforcement learning (HRL) is a general framework for addressing large-scale reinforcement learning problems. It exploits the task (or action) structure of a problem by considering policies over temporally extended actions that typically involve a reduced subset of the state components. For example, the MAXQ approach [Dietterich2000] decomposes a Markov decision process (MDP) and its value function into a hierarchy of smaller MDPs such that the value function of the target MDP corresponds to an additive combination of the value functions of the smaller MDPs. Another example is the options approach for which different tasks can be learned simultaneously in an online fashion [Sutton and Precup1998]. HRL methods have also been used to explain human and animal behavior [Botvinick, Niv, and Barto2009].

Independently, a class of stochastic optimal control problems was introduced for which the actions and cost function are restricted in ways that make the Bellman equation linear and thus more efficiently solvable [Todorov2006, Kappen2005]. This class of problems is known in the discrete setting as linearly-solvable MDPs (LMDPs), in the continuous setting as path-integral control or more generally, as Kullback-Leibler (KL) control [Kappen, Gómez, and Opper2012]. Optimal control computation for this class of problems is equivalent to a KL-divergence minimization.

The original LMDP formulation considers a single action that changes the stochastic laws of the environment. An alternative interpretation (that we adopt in this work) is to consider a stochastic policy over deterministic actions. LMDPs have many interesting properties. For example, optimal control laws for LMDPs can be linearly combined to derive composite optimal control laws efficiently [Todorov2009a]. Also, the power iteration method, used to solve LMDPs, is equivalent to the popular belief propagation algorithm used for probabilistic inference in dynamic graphical models [Kappen, Gómez, and Opper2012]. The optimal value function for LMDPs can be learned efficiently using an off-policy learning algorithm, Z-learning, that operates directly in the state space instead of in the product space of states and actions [Todorov2009b].

In the continuous setting, the KL-divergence reduces to the familiar quadratic energy cost, widely used in robotic applications. Examples of such applications include robot navigation [Kinjo, Uchibe, and Doya2013] and motor skill reinforcement learning [Theodorou, Buchli, and Schaal2010]. This class of problems is also relevant in other disciplines, such as cognitive science and decision-making theory [Friston et al.2013, Ortega and Braun2013]. However, in general, the application of LMDPs to real-world problems is challenging, mainly due to the curse of dimensionality [Abbasi-Yadkori et al.2015, Matsubara, Gómez, and Kappen2014, Todorov2009c].

In this paper, we propose to combine the HRL and LMDP frameworks and formulate a reinforcement learning problem as a hierarchy of LMDPs. Surprisingly, although LMDPs were introduced a decade ago, no unifying framework that combines both methodologies has been proposed yet. The benefits of this combination are two-fold. On one hand, HRL problems expressed in this way can benefit from the same properties that LMDPs enjoy. For example, one can use Z-learning as an efficient alternative to state-of-the-art HRL methods. Another example is task compositionality, by which a composite task can be learned at no cost given the optimal solutions of its component tasks. This is useful in tasks that have several terminal states, as we will show later. On the other hand, LMDPs can also benefit from the HRL framework, for example by addressing the curse of dimensionality in an alternative way to the previously mentioned approaches, or through simultaneous intra-task learning from HRL.

The paper is organized as follows. We review HRL and LMDPs in Section 2. The main contribution of this work is the hierarchical formulation of LMDPs, which we present in Section 3. We empirically illustrate its benefits on two benchmarks in Section 4, and conclude in Section 5.

2 Preliminaries

In this section we introduce preliminaries and notation. We first define MDPs and Semi-MDPs, then explain the idea behind MAXQ decomposition, and finally describe linearly-solvable MDPs.

2.1 MDPs and Semi-MDPs

An MDP consists of a set of states $S$, a set of actions $A$, a transition probability distribution $P(s'|s,a)$ satisfying $\sum_{s'\in S} P(s'|s,a)=1$ for each state-action pair $(s,a)$, and an expected reward function $r(s,a)$. The aim is to learn an optimal policy $\pi^*:S\to A$, i.e. a mapping from states to actions that maximizes expected future reward.

MDPs are usually solved by defining a value function $V:S\to\mathbb{R}$ that estimates the expected future reward in each state. In the undiscounted case, the optimal value function is obtained by solving the Bellman optimality equation:

$$V^*(s) = \max_{a\in A}\Big[r(s,a) + \sum_{s'\in S} P(s'|s,a)\,V^*(s')\Big].$$

To bound the optimal value function, we only consider first-exit problems that define a set of terminal states $T\subset S$. A function $R:T\to\mathbb{R}$ defines the final reward of each terminal state. As an alternative to the value function $V$, one can define an action-value function $Q:S\times A\to\mathbb{R}$ that estimates the expected future reward for each state-action pair $(s,a)$.

The Bellman optimality equation can be solved globally using algorithms such as value iteration and policy iteration. However, for large state spaces this is not feasible. Alternative algorithms make local updates to the value function online. Arguably, the most popular online algorithm for MDPs is Q-learning [Watkins1989]. Given a transition from state $s$ to state $s'$ when taking action $a$ and receiving reward $r$, Q-learning makes the following update to the estimate $\hat Q$ of the optimal action-value function:

$$\hat Q(s,a) \leftarrow (1-\alpha)\,\hat Q(s,a) + \alpha\Big(r + \max_{a'}\hat Q(s',a')\Big),$$

where $\alpha$ is a learning rate.
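As a concrete illustration, the following is a minimal tabular sketch of this update for an undiscounted first-exit problem; the array layout, state/action encoding and function name are our own illustrative choices, not part of the paper.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, terminal, alpha):
    """One tabular Q-learning update (undiscounted, first-exit).

    Q        : array of shape (num_states, num_actions)
    (s, a)   : observed state-action pair
    r        : observed reward
    s_next   : observed next state
    terminal : True if s_next is a terminal state (no further action values)
    alpha    : learning rate
    """
    target = r if terminal else r + np.max(Q[s_next])
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q
```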

A Semi-MDP generalizes an MDP by including actions that take more than one time step to complete. In this case, the Bellman optimality equation becomes

$$V^*(s) = \max_{a\in A}\Big[\sum_{\tau} P(\tau|s,a)\,r(s,a,\tau) + \sum_{s',\tau} P(s',\tau|s,a)\,V^*(s')\Big],$$

where $\tau$ is the duration of action $a$, $r(s,a,\tau)$ is the expected reward when $a$ applied in $s$ lasts for $\tau$ steps, and $P(s',\tau|s,a)$ is the probability of transitioning to $s'$ in $\tau$ steps. By defining $r(s,a)=\sum_{\tau}P(\tau|s,a)\,r(s,a,\tau)$ and $P(s'|s,a)=\sum_{\tau}P(s',\tau|s,a)$, we get

$$V^*(s) = \max_{a\in A}\Big[r(s,a) + \sum_{s'} P(s'|s,a)\,V^*(s')\Big],$$

i.e. a Semi-MDP can be treated and solved as an MDP.

2.2 MAXQ Decomposition

MAXQ decomposition [Dietterich2000] decomposes an MDP $M$ into a finite set of tasks $\{M_0,\ldots,M_n\}$ with root task $M_0$, i.e. solving $M_0$ is equivalent to solving $M$. Each task $M_i$, $0\le i\le n$, consists of a termination set $T_i$, an action set $A_i$ and a pseudo-reward function $\tilde R_i$.

A task $M_i$ is primitive if it has no subtasks, i.e. if $A_i=\emptyset$. A primitive task corresponds to an action $a$ of the original MDP $M$, defined such that $M_i$ is always applicable, terminates after one time step and has pseudo-reward $0$ everywhere. A non-primitive task $M_i$ can only be applied in non-terminal states (i.e. states not in $T_i$). Terminating in state $s\in T_i$ produces pseudo-reward $\tilde R_i(s)$. $M_i$ corresponds to a Semi-MDP with action set $A_i$, i.e. its actions are other tasks.

Figure 1: The task graph of the Taxi domain.

MAXQ defines a task graph with the tasks in $\{M_0,\ldots,M_n\}$ as nodes. There is an edge from node $M_i$ to node $M_j$ if and only if $M_j\in A_i$, i.e. if $M_j$ is an action of task $M_i$. To avoid infinite recursion, the task graph has to be acyclic. Figure 1 shows a simplified task graph of the Taxi domain, commonly used to illustrate MAXQ decomposition.

The aim of MAXQ decomposition is to learn a hierarchical policy $\pi=\{\pi_0,\ldots,\pi_n\}$, i.e. a separate policy $\pi_i$ for each individual task $M_i$, $0\le i\le n$. Each task $M_i$ defines its own value function $V_i$ that, for each state $s$, estimates the expected cumulative reward until $M_i$ terminates. The reward associated with applying action $M_j\in A_i$ in state $s$ of task $M_i$ equals the value of $M_j$ in $s$, i.e. $V_j(s)$. Hence the Bellman optimality equation for $M_i$ decomposes as

$$V_i(s) = \max_{M_j\in A_i}\Big[V_j(s) + \sum_{s'} P_i(s'|s,M_j)\,V_i(s')\Big],$$

where $P_i(s'|s,M_j)$ is the probability of transitioning from $s$ to $s'$ when applying the (possibly composite) action $M_j$. If $M_j$ is a primitive task corresponding to an action $a$ of the original MDP $M$, its value is the expected immediate reward, i.e. $V_j(s)=r(s,a)$. The pseudo-reward $\tilde R_i$ is only used for learning the policy of $M_i$ and does not contribute to the value function $V_i$.

Dietterich [Dietterich2000] proposed an online algorithm for MAXQ decomposition called MAXQ-Q learning. The algorithm maintains two value functions for each task $M_i$: an estimate $\hat V_i$ of the value function defined above, and an estimate $\tilde V_i$ of the expected cumulative reward that includes the pseudo-reward $\tilde R_i$. The estimate $\tilde V_i$ defines the policy for $M_i$, while the estimate $\hat V_i$ is passed as reward to parent tasks of $M_i$. MAXQ-Q learning achieves recursive optimality, i.e. each policy $\pi_i$ is locally optimal with respect to $M_i$. Dietterich [Dietterich1999] also showed how to use state abstraction to simplify learning in MAXQ decomposition.

2.3 Linearly-Solvable MDPs

Linearly-solvable MDPs (LMDPs) were first introduced by Todorov [Todorov2006, Todorov2009b]. The original formulation has no explicit actions, and control consists in changing a predefined uncontrolled probability distribution over next states. An alternative interpretation is to view the resulting probability distribution as a stochastic policy over deterministic actions. Todorov's idea was to transform the discrete optimization problem over actions into a continuous optimization problem over transition probabilities, which is convex and analytically tractable.

Formally, an LMDP consists of a set of states $S$, an uncontrolled transition probability distribution $p(s'|s)$ satisfying $\sum_{s'\in S} p(s'|s)=1$ for each state $s$, and an expected reward function $r(s)$. Given a state $s$ and any next-state distribution $q(\cdot|s)$, we define the set of next states under $q$ as $\{s' : q(s'|s)>0\}$. For first-exit problems, LMDPs also have a subset of terminal states $T\subset S$.

The control in LMDPs is a probability distribution $u(\cdot|s)$ over next states; for a given next state $s'$, $u(s'|s)$ can be non-zero only if $p(s'|s)$ is non-zero. The reward for applying control $u$ in state $s$ is

$$r(s,u) = r(s) - \lambda\,\mathrm{KL}\big(u(\cdot|s)\,\|\,p(\cdot|s)\big),$$

where $r(s)$ is the (non-positive) reward associated with state $s$, and $\mathrm{KL}\big(u(\cdot|s)\,\|\,p(\cdot|s)\big)$ is the Kullback-Leibler divergence between $u(\cdot|s)$ and $p(\cdot|s)$, penalizing controls that are significantly different from $p$. Typically, $p$ is a random walk and $\lambda$ acts as a temperature parameter. Large values of $\lambda$ (high temperature) lead to solutions which are more stochastic, since deviating from the random dynamics is penalized more. Conversely, very small values of $\lambda$ (low temperature) result in nearly deterministic policies, since the state-dependent term dominates the immediate cost. LMDPs, in a sense, replace deterministic policies defined over stochastic actions with stochastic policies defined over deterministic actions. In what follows, unless otherwise stated, the next state $s'$ is always drawn from the distribution $u(\cdot|s)$.

We can now define the Bellman optimality equation:

$$V(s) = \max_{u\in\mathcal U(s)}\Big[r(s) - \lambda\,\mathrm{KL}\big(u(\cdot|s)\,\|\,p(\cdot|s)\big) + \sum_{s'} u(s'|s)\,V(s')\Big].$$

For a given state $s$, the set $\mathcal U(s)$ consists of control inputs $u(\cdot|s)$ that satisfy $\sum_{s'} u(s'|s)=1$ and $u(s'|s)=0$ whenever $p(s'|s)=0$. To bound the values in the absence of a discount factor, terminal states are absorbing, i.e. $V(t)=R(t)$ for each $t\in T$.

Introducing $z(s)=e^{V(s)/\lambda}$ we obtain

$$V(s) = r(s) + \max_{u\in\mathcal U(s)}\Big[-\lambda\sum_{s'} u(s'|s)\log\frac{u(s'|s)}{p(s'|s)\,z(s')}\Big].$$

To obtain a KL divergence on the right-hand side, introduce a normalization term $G[z](s)=\sum_{s'} p(s'|s)\,z(s')$ and insert it into the Bellman equation:

$$V(s) = r(s) + \lambda\log G[z](s) - \lambda\min_{u\in\mathcal U(s)}\mathrm{KL}\Big(u(\cdot|s)\,\Big\|\,\frac{p(\cdot|s)\,z(\cdot)}{G[z](s)}\Big).$$

The KL term achieves a minimum of $0$ when the distributions are equal, i.e. the optimal policy is

$$u^*(s'|s) = \frac{p(s'|s)\,z(s')}{G[z](s)}.$$

Exponentiating the Bellman equation gives

$$z(s) = e^{r(s)/\lambda}\,G[z](s) = e^{r(s)/\lambda}\sum_{s'} p(s'|s)\,z(s').$$

We can write this equation in matrix form as

$$z = GPz, \qquad (1)$$

where $G$ is a diagonal matrix with the terms $e^{r(s)/\lambda}$ on the diagonal and $P$ is the transition probability matrix derived from the distribution $p$. Unlike the Bellman optimality equation, this is a system of linear equations.
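As an illustration of solving Equation (1), here is a minimal power-iteration sketch under the first-exit setting described above; the interface (argument names, the way terminal states and final rewards are passed) is an assumption for illustration, not the authors' implementation.

```python
import numpy as np

def solve_lmdp(P, r, terminal, R_final, lam=1.0, iters=5000):
    """Iterate z <- G P z, keeping terminal states fixed at exp(R/lam).

    P        : (n, n) passive dynamics, rows sum to 1
    r        : (n,) non-positive state rewards
    terminal : (n,) boolean mask of terminal states
    R_final  : (n,) final rewards (only used at terminal states)
    """
    g = np.exp(r / lam)                          # diagonal of G
    z = np.ones(len(r))
    z[terminal] = np.exp(R_final[terminal] / lam)
    for _ in range(iters):
        z_new = g * (P @ z)                      # z = G P z
        z_new[terminal] = np.exp(R_final[terminal] / lam)
        z = z_new
    V = lam * np.log(z)                          # value function
    U = P * z[None, :]                           # optimal policy u*(s'|s) proportional to p(s'|s) z(s')
    U /= U.sum(axis=1, keepdims=True)
    return z, V, U
```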

Since Equation (1) is linear, we can solve it as an eigenvector problem using, for example, the power iteration method. As an alternative, Todorov [Todorov2006, Todorov2009b] proposed an online learning algorithm for LMDPs called Z-learning. Similar to Q-learning for MDPs, the idea of Z-learning is to follow a trajectory, record transitions and perform incremental updates to the value function.

Since LMDPs have no explicit actions, each transition consists of a state $s$, a next state $s'$ and a reward $r$ recorded during the transition. Z-learning maintains an estimate $\hat z$ of the optimal value, and this estimate is updated after each transition as

$$\hat z(s) \leftarrow (1-\alpha)\,\hat z(s) + \alpha\, e^{r/\lambda}\,\hat z(s'), \qquad (2)$$

where $\alpha$ is a learning rate.

Naive Z-learning samples transitions from the passive dynamics $p$, which essentially amounts to a random walk and leads to slow learning. A better alternative is to use importance sampling to guide exploration by sampling transitions from a more informed distribution. A natural choice is the estimated optimal policy $\hat u$ derived from $\hat z$, resulting in the following corrected update rule [Todorov2006]:

$$\hat z(s) \leftarrow (1-\alpha)\,\hat z(s) + \alpha\, e^{r/\lambda}\,\hat z(s')\,\frac{p(s'|s)}{\hat u(s'|s)}. \qquad (3)$$

Note that the importance weight $p(s'|s)/\hat u(s'|s)$ requires access to the passive dynamics $p$.
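A minimal sketch of the updates in Equations (2) and (3); the tabular layout and the way the transition probabilities are passed in are illustrative assumptions.

```python
import numpy as np

def z_update(z_hat, s, s_next, r, alpha, lam, p_prob=None, u_prob=None):
    """Z-learning update of Equation (2). If the passive probability p(s'|s)
    and the behavior probability u_hat(s'|s) of the sampled transition are
    supplied, the importance-sampled update of Equation (3) is applied."""
    target = np.exp(r / lam) * z_hat[s_next]
    if p_prob is not None and u_prob is not None:
        target *= p_prob / u_prob      # importance weight p(s'|s) / u_hat(s'|s)
    z_hat[s] = (1.0 - alpha) * z_hat[s] + alpha * target
    return z_hat
```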

2.4 LMDPs With Transition-Dependent Rewards

In the original formulation of LMDPs, reward is state-dependent. To develop a hierarchical framework based on LMDPs, we have to account for the fact that each task may accumulate different amounts of reward. Hence reward is transition-dependent, depending not only on the current state but also on the next state. In this section we extend LMDPs to transition-dependent reward, i.e. the expected reward function $r(s,s')$ is defined over pairs of states. The reward of applying control $u$ in state $s$ is then

$$r(s,u) = \sum_{s'} u(s'|s)\,r(s,s') - \lambda\,\mathrm{KL}\big(u(\cdot|s)\,\|\,p(\cdot|s)\big).$$

The Bellman equation becomes

$$V(s) = \max_{u\in\mathcal U(s)}\Big[\sum_{s'} u(s'|s)\,r(s,s') - \lambda\,\mathrm{KL}\big(u(\cdot|s)\,\|\,p(\cdot|s)\big) + \sum_{s'} u(s'|s)\,V(s')\Big].$$

Letting $z(s)=e^{V(s)/\lambda}$ and $g(s,s')=e^{r(s,s')/\lambda}$ yields

$$V(s) = \max_{u\in\mathcal U(s)}\Big[-\lambda\sum_{s'} u(s'|s)\log\frac{u(s'|s)}{p(s'|s)\,g(s,s')\,z(s')}\Big].$$

Normalizing as $G[z](s)=\sum_{s'} p(s'|s)\,g(s,s')\,z(s')$ yields $V(s)=\lambda\log G[z](s)$ and results in the policy

$$u^*(s'|s) = \frac{p(s'|s)\,g(s,s')\,z(s')}{G[z](s)}.$$

Exponentiating the Bellman equation gives $z(s)=\sum_{s'} p(s'|s)\,e^{r(s,s')/\lambda}\,z(s')$, which can be written in matrix form as

$$z = Qz, \qquad (4)$$

where each entry of $Q$ equals $q(s,s')=p(s'|s)\,e^{r(s,s')/\lambda}$.

To solve Equation (4) we can either apply the power iteration method or Z-learning. It is trivial to extend Z-learning to LMDPs with transition-dependent reward. Each transition is still a triplet $(s,s',r)$, and the only difference is that the reward $r$ now depends on the next state $s'$ as well as the current state $s$. If we compare the sampled quantity $e^{r/\lambda}\,\hat z(s')$ to the target value $\sum_{s'} p(s'|s)\,e^{r(s,s')/\lambda}\,z(s')$, we see that the update rule in Equation (2) causes $\hat z(s)$ to tend towards the optimal value when using the uncontrolled distribution $p$ to sample transitions. The update for importance sampling in Equation (3) can also be directly applied to LMDPs with transition-dependent reward.
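Under the reconstruction above, building the matrix in Equation (4) reduces to an elementwise product; a minimal sketch (names are illustrative):

```python
import numpy as np

def q_matrix(P, R, lam=1.0):
    """Entries q(s, s') = p(s'|s) * exp(r(s, s') / lam) of Equation (4)."""
    return P * np.exp(R / lam)

# Equation (4), z = Q z, can then be solved exactly like Equation (1),
# e.g. with the power-iteration sketch shown earlier (replacing g * (P @ z)
# by Q @ z), or estimated online with Z-learning.
```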

3 Hierarchical LMDPs

In this section we formalize a framework for hierarchical LMDPs based on MAXQ decomposition. We assume that there exists an underlying LMDP $L$ and a set of tasks $\{L_0,\ldots,L_n\}$, with $L_0$ being the root task. Each task $L_i$, $0\le i\le n$, consists of a termination set $T_i$, a set of subtasks $A_i$ and a pseudo-reward $\tilde R_i$. Each state $t\in T_i$ is absorbing and produces reward $\tilde R_i(t)$. For clarity of presentation, we first assume that each task $L_i$ is deterministic, i.e. for each state $s$, $L_i$ terminates in a unique state $t_i\in T_i$. We later show how to extend hierarchical LMDPs to non-deterministic tasks.

In MAXQ decomposition, since the actions of the original MDP $M$ are included as primitive tasks, the action set $A_i$ of task $M_i$ contains a subset of the actions in $M$. The analogy for hierarchical LMDPs is that each task $L_i$ contains a subset of the allowed transitions of the original LMDP $L$, i.e. transitions with non-zero probability according to $p$. Intuitively, the optimal control of each task can be viewed as a stochastic policy that selects between deterministic tasks and deterministic next states.

We associate each task $L_i$ with an LMDP with transition-dependent rewards. Task $L_i$ is primitive if and only if $A_i=\emptyset$. For each state $s$, let $N_i(s)$ be the subset of next states of the original LMDP $L$ that are also present in $L_i$. Further let $A_i(s)\subseteq A_i$ be the set of subtasks of $L_i$ that are applicable in $s$, i.e. for which $s$ is not a terminal state. Clearly, $A_i(s)=\emptyset$ if $L_i$ is primitive.

For a given state $s$, the passive dynamics $p_i(\cdot|s)$ and immediate reward $r_i(s,\cdot)$ of the task LMDP $L_i$ are defined in terms of transitions due to the original LMDP $L$ and transitions due to the subtasks in $A_i(s)$. Let $K_i(s)=|A_i(s)|+\sum_{s''\in N_i(s)} p(s''|s)$ be a normalization constant. For transitions due to $L$,

$$p_i(s'|s) = \frac{p(s'|s)}{K_i(s)}, \qquad r_i(s,s') = r(s), \qquad s'\in N_i(s), \qquad (5)$$

and for each subtask $L_j\in A_i(s)$ with terminal state $t_j$,

$$p_i(t_j|s) = \frac{1}{K_i(s)}, \qquad r_i(s,t_j) = V_j(s). \qquad (6)$$

Just as in MAXQ decomposition, the reward associated with applying a subtask $L_j$ of $L_i$ in state $s$ equals the value function of $L_j$ in $s$, i.e. $V_j(s)$. The transition associated with subtask $L_j$ has uniform probability, while transitions in $N_i(s)$ have probabilities proportional to those of $p$ and produce the same reward as in the original LMDP $L$.

Figure 2: The LMDP task graph of the Taxi domain.

For each task $L_i$, the value function $V_i$ estimates the expected cumulative reward until $L_i$ terminates and defines the immediate reward for higher-level tasks. We can write the Bellman optimality equation for $L_i$ as

$$V_i(s) = \max_{u\in\mathcal U_i(s)}\Big[\sum_{s'} u(s'|s)\,r_i(s,s') - \lambda\,\mathrm{KL}\big(u(\cdot|s)\,\|\,p_i(\cdot|s)\big) + \sum_{s'} u(s'|s)\,V_i(s')\Big]. \qquad (7)$$

The task graph for hierarchical LMDPs is defined as for MAXQ decomposition, and has to be acyclic. Figure 2 shows the LMDP task graph of the Taxi domain. Compared to Figure 1, primitive actions no longer appear as tasks, and new sink nodes, e.g. Navigate(), correspond to primitive tasks of the hierarchical LMDP.

The above definitions implicitly rely on the following assumptions, which differ from the MAXQ formulation and are required for hierarchical LMDPs:

  1. First, we assume that the terminal states of subtasks are mutually exclusive and do not overlap with the next states in $N_i(s)$. The reason is that an LMDP is not allowed to have more than one transition between two states with different rewards: if this happens, the optimal policy is not identifiable, since one has to collapse both transitions into one and determine what the resulting immediate reward should be, an ill-defined problem.

  2. Another difference is that each LMDP task needs the equivalent of a no-op action, i.e. a self-transition with $p_i(s|s)>0$, so that the corresponding Markov chain is aperiodic, which is needed for convergence of the power iteration method.

  3. Finally, unlike MAXQ decomposition, the value function in Equation (7) includes KL terms due to differences between the control $u$ and the uncontrolled dynamics $p_i$. The reward $r_i$ also includes subtask-dependent KL terms. Consequently, the value function of the root task includes KL terms from all other tasks. Although this introduces an approximation, we can control the relative importance of the KL terms by adjusting the value of $\lambda$.

3.1 Task Compositionality for Non-Deterministic Tasks

In this subsection, we extend the definition of hierarchical LMDPs to non-deterministic tasks. As before, we associate with each task $L_i$ an LMDP, with the important difference that $L_i$ (and its subtasks) can have more than one terminal state. Primitive subtasks (or transitions due to $L$) are addressed as before, and we omit them for clarity. We thus have to define passive dynamics and immediate rewards for non-primitive subtasks that can have more than one terminal state.

Denote by $t_{jk}$, $1\le k\le m_j$, the $k$-th terminal state of subtask $L_j$. We define the counterparts of Equations (5) and (6) for multiple terminal states as

$$p_i(t_{jk}|s) = \frac{\rho_j(t_{jk}|s)}{K_i(s)}, \qquad (8)$$
$$r_i(s,t_{jk}) = V_j(s), \qquad (9)$$

where the transition probability $\rho_j(t_{jk}|s)$ of subtask $L_j$ from state $s$ to terminal state $t_{jk}$ and the value function $V_j$ can be expressed using compositionality of optimal control laws for LMDPs [Todorov2009a], as described below. Note that $\rho_j$ is different from the immediate transition probabilities $p_j$ for subtask $L_j$, and that the total transition probability of subtask $L_j$ is still $1/K_i(s)$, but it is now distributed among the possible terminal states according to $\rho_j$.

For each terminal state $t_{jk}$ of task $L_j$, we define a separate task $L_{jk}$. The new tasks are identical to $L_j$ and have the same terminal states, differing only in their pseudo-rewards. For task $L_{jk}$, the pseudo-reward for $t_{jk}$ (the goal) is zero and the remaining terminal states have the same (negative) pseudo-reward $\tilde R$, e.g. $\tilde R_{jk}(t_{jk})=0$ and $\tilde R_{jk}(t_{jl})=\tilde R<0$ for $l\neq k$.

Consider the optimal policy $u_{jk}^*$ and the optimal value $z_{jk}$ of each of the individual tasks. Using compositionality, the original task $L_j$ with multiple terminal states can be expressed as a weighted sum of the individual tasks $L_{jk}$. In particular, the composite optimal value $z_j$ and policy $u_j^*$ are [Todorov2009a]

$$z_j(s) = \sum_{k} w_k\,z_{jk}(s), \qquad u_j^*(s'|s) = \frac{p_j(s'|s)\,e^{r_j(s,s')/\lambda}\,z_j(s')}{\sum_{s''} p_j(s''|s)\,e^{r_j(s,s'')/\lambda}\,z_j(s'')},$$

where the mixing weights $w_k$ for the component tasks are uniform and equal $1/m_j$, since each task assigns the same pseudo-reward to non-goal terminal states.

The value function $V_j$ in Equation (9) is then given by

$$V_j(s) = \lambda\log z_j(s) = \lambda\log\sum_k w_k\,z_{jk}(s),$$

and the transition probability $\rho_j$ for Equation (8) is defined recursively for all states as

$$\rho_j(t_{jk}|s) = \sum_{s'} u_j^*(s'|s)\,\rho_j(t_{jk}|s'), \qquad \rho_j(t_{jk}|t_{jk})=1, \qquad \rho_j(t_{jk}|t_{jl})=0 \;\; (l\neq k).$$

This defines the hierarchical LMDP framework for non-deterministic tasks. Note that each individual task $L_{jk}$ is still deterministic; its purpose is to avoid terminating in a state different from $t_{jk}$.
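A minimal sketch of this composition step, under the reconstruction above: the single-goal desirability functions are combined with uniform weights, and the composite policy follows from the usual LMDP policy formula. The function and argument names are illustrative assumptions.

```python
import numpy as np

def compose_task(z_components, P, R, lam=1.0):
    """Combine single-goal solutions z_jk into the multi-goal task L_j.

    z_components : list of (n,) desirability functions, one per terminal state
    P, R         : passive dynamics and transition-dependent rewards of L_j
    """
    w = 1.0 / len(z_components)              # uniform mixing weights
    z = w * np.sum(z_components, axis=0)     # composite desirability z_j
    V = lam * np.log(z)                      # composite value function V_j
    U = P * np.exp(R / lam) * z[None, :]     # u_j*(s'|s) proportional to p(s'|s) e^{r/lam} z_j(s')
    U /= U.sum(axis=1, keepdims=True)
    return z, V, U
```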

3.2 Hierarchical Learning Algorithms

The aim of a hierarchical LMDP is to learn an estimated hierarchical control policy $u=\{u_0,\ldots,u_n\}$, i.e. an individual control policy $u_i$ for each task $L_i$, $0\le i\le n$. Similar to MAXQ decomposition, there are two main alternatives for learning a hierarchical policy:

  1. Learn each individual policy separately in a bottom-up fashion.

  2. Learn all policies simultaneously using a hierarchical execution in a top-down fashion.

Implementing an algorithm of the first type is straightforward: since each individual task is an LMDP, we can use the power iteration method or Z-learning. Since all subtasks of a task $L_i$ are solved before $L_i$ itself, the rewards of $L_i$ are known and fixed when solving it.

To implement an algorithm of the second type, similar to MAXQ-Q learning, we start at the root task $L_0$ and sample a subtask $L_j$ to execute using the current estimate $\hat u_0$ of the policy for $L_0$. We then execute $L_j$ until termination, possibly applying subtasks of $L_j$ along the way. When $L_j$ terminates, control returns to the root task and another subtask is sampled using $\hat u_0$. This continues until we reach an absorbing state of $L_0$. During this process, the value function estimates of each task are updated using Z-learning.
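A rough sketch of this top-down execution loop with Z-learning updates along the way; the `Task` interface (`is_terminal`, `sample_control`, `is_primitive`, `step`) is a hypothetical abstraction used only for illustration, not the paper's implementation.

```python
import numpy as np

def run_task(task, s, z_hat, alpha, lam, rng):
    """Execute `task` from state s until it terminates (hierarchical execution).

    z_hat maps each task to its desirability estimate; the estimate of `task`
    is updated with Z-learning (Equation (2)) after every transition, whether
    it is a single environment step or the execution of a whole subtask.
    Returns the state in which `task` terminates and its accumulated reward."""
    total_reward = 0.0
    while not task.is_terminal(s):
        choice = task.sample_control(z_hat[task], s, rng)   # subtask or primitive move
        if choice.is_primitive:
            s_next, r = choice.step(s)                       # one environment step
        else:
            s_next, r = run_task(choice, s, z_hat, alpha, lam, rng)
        z = z_hat[task]
        z[s] = (1.0 - alpha) * z[s] + alpha * np.exp(r / lam) * z[s_next]
        total_reward += r
        s = s_next
    return s, total_reward
```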

As in MAXQ-Q learning, if a task $L_i$ has pseudo-rewards different from zero, we have to learn two estimates of the value function for $L_i$: one estimate $\hat z_i$ of the optimal value function that excludes the pseudo-reward $\tilde R_i$, and another estimate $\tilde z_i$ that includes it. The estimate $\tilde z_i$ defines the policy of $L_i$, while $\hat z_i$ is passed as reward to parent tasks of $L_i$.

3.3 Intra-Task Z-Learning

In hierarchical MDPs, the aim is to learn a separate policy for each individual task. Since Q-learning is an off-policy algorithm, it is possible to use transitions recorded during one task to learn the policy of another task. Such intra-task learning is known to converge faster [Sutton and Precup1998]. In this section we describe an algorithm for intra-task Z-learning.

As described in Subsection 2.3, we can use importance sampling to improve exploration. Let $(s,s',r)$ be a transition sampled using the estimated policy $\hat u_i$ of task $L_i$, and consider an update to the estimated value $\hat z_j$ of another task $L_j$. Even though the sample distribution $\hat u_i$ is different from the estimated policy of $L_j$, we apply the update in Equation (3):

$$\hat z_j(s) \leftarrow (1-\alpha)\,\hat z_j(s) + \alpha\, e^{r/\lambda}\,\hat z_j(s')\,\frac{p(s'|s)}{\hat u_i(s'|s)}. \qquad (10)$$

To see why the update rule is correct, simply substitute the expressions for $\hat u_i(s'|s)=p(s'|s)\,\hat z_i(s')/\hat G_i(s)$ and the importance weight:

$$e^{r/\lambda}\,\hat z_j(s')\,\frac{p(s'|s)}{\hat u_i(s'|s)} = e^{r/\lambda}\,\hat z_j(s')\,\frac{\hat G_i(s)}{\hat z_i(s')}, \qquad \hat G_i(s)=\sum_{s''} p(s''|s)\,\hat z_i(s'').$$

In other words, instead of moving in the direction of $e^{r/\lambda}\,\hat z_i(s')$, the update rule moves in the direction of $e^{r/\lambda}\,\hat z_j(s')\,p(s'|s)/\hat u_i(s'|s)$, whose expectation under the sample distribution $\hat u_i$ is $e^{r/\lambda}\sum_{s'} p(s'|s)\,\hat z_j(s')$, which is precisely the desired value of $\hat z_j(s)$. In particular, the importance weight $p(s'|s)/\hat u_i(s'|s)$ can be shared by the different tasks in intra-task learning.

For LMDPs with transition-dependent rewards, the same update rule can be used, but substituting the expressions for $\hat u_i$ and the importance weight leads to a slightly different result:

$$\mathbb{E}\Big[e^{r/\lambda}\,\hat z_j(s')\,\frac{p(s'|s)}{\hat u_i(s'|s)}\Big] = \sum_{s'} p(s'|s)\,e^{r(s,s')/\lambda}\,\hat z_j(s').$$

Recall that $r$ is the reward observed during the transition from $s$ to $s'$. The expectation results from the fact that the observed reward and the expected reward $r(s,s')$ may be different, but are equal in expectation. For LMDPs with transition-dependent rewards, $\sum_{s'} p(s'|s)\,e^{r(s,s')/\lambda}\,\hat z_j(s')$ is the desired value of $\hat z_j(s)$.
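A minimal sketch of this shared update: one transition, sampled under the behavior task's policy, updates the estimates of all tasks with the same importance weight. The dictionary layout and argument names are illustrative assumptions.

```python
import numpy as np

def intra_task_z_update(z_hats, s, s_next, r, p_prob, u_prob, alpha, lam):
    """Apply Equation (10) to every task's desirability estimate.

    z_hats : dict mapping task id -> (n,) desirability estimate
    p_prob : passive probability p(s'|s) of the observed transition
    u_prob : probability of the transition under the behavior policy u_hat_i
    """
    w = p_prob / u_prob                      # importance weight, shared by all tasks
    for z in z_hats.values():
        z[s] = (1.0 - alpha) * z[s] + alpha * np.exp(r / lam) * z[s_next] * w
    return z_hats
```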

3.4 State Abstraction

In hierarchical LMDPs, we can apply the same forms of state abstraction as for MAXQ decomposition [Dietterich1999]. The most common form of state abstraction is projection, or max node irrelevance. This form of state abstraction assumes that the state space is factored, i.e. $S=D_1\times\cdots\times D_m$, where $D_1,\ldots,D_m$ are the domains of the state variables. Max node irrelevance identifies state variables that are irrelevant for a given task, implying that the values of these state variables remain the same until completion of the task. Irrelevant state variables can be ignored while learning the value function.

Dietterich [Dietterich1999] identified other conditions under which it is safe to perform state abstraction. One condition, leaf irrelevance, does not apply to hierarchical LMDPs since actions are no longer included as leaves of the task graph. Another condition, result distribution irrelevance, does apply to hierarchical LMDPs: when two or more states have the same transition probabilities with respect to a given task $L_i$, we only need to estimate a single value $V_i$ (or $z_i$) for these states.

Figure 3: Taxi problem: results in the primitive tasks (NAVIGATE).

4 Experiments

We evaluate the proposed framework in two tasks commonly used in hierarchical MDPs: the taxi domain [Dietterich2000] and an autonomous guided vehicle (AGV) task [Ghavamzadeh and Mahadevan2007]. We compare the following methods:

Z:

Z-learning using naive sampling (i.e. a random walk) without the correction term, as in Equation (2).

Z-IS:

Z-learning with importance sampling but without intra-task learning, as in Equation (3).

Z-IS-IL:

Z-learning with importance sampling and intra-task learning, as in Equation (10).

Q-G:

$\epsilon$-greedy Q-learning without intra-task learning.

Q-G-IL:

$\epsilon$-greedy Q-learning with intra-task learning.

The Z-learning variants are evaluated using task LMDPs as described in Section 3. To compare with the Q-learning variants, for each task LMDP we construct a traditional MDP following the methodology of [Todorov2009b]. The resulting traditional MDP is guaranteed to have the same optimal value function as the original LMDP. Following [Todorov2006], we use dynamic learning rates that decay as $c/(c+k)$, where the constant $c$ is optimized separately for each algorithm and $k$ is the current trial. The parameter $\epsilon$ for Q-learning is also optimized for best performance.

To compare performance, we calculate, for each iteration, the norm of the difference between the learned and the optimal value function, which can be computed exactly in the tasks we consider here. More details about the experiments are described in the supplementary material.

4.1 The Taxi Domain

Figure 4: Taxi problem: results in the abstract task (ROOT).

The taxi domain is defined on a grid with four distinguished locations. There is a passenger at one of the four locations, and that passenger wishes to be transported to one of the other three locations. There is also a taxi that must navigate to the passenger's location, pick up the passenger, navigate to the destination location, and put down the passenger there. We use a variant of the taxi domain [Dietterich2000] with a much larger grid and hence a much larger state space.

We decompose the taxi domain as shown in Figure 2. Just like Dietterich [Dietterich2000], we apply state abstraction in the form of projection to the navigate tasks, ignoring the passenger's location and destination. This results in considerably smaller state spaces for the navigation tasks than for the full task.

The primitive tasks (NAVIGATE) contain all state transitions associated with the navigation actions NORTH, SOUTH, EAST, WEST and IDLE (i.e. the no-op action). There are four of these primitive tasks, one for each distinguished location (corner) in the grid. The corresponding LMDPs are very similar to the grid example of [Todorov2006]: the passive dynamics is a random walk and the state-cost term is zero for the terminal states (the corresponding corner) and negative elsewhere.

Figure 3 shows the performance of the different learning methods in the primitive tasks of this domain. The best results are obtained with Z-learning with importance sampling and intra-task learning (Z-IS-IL). The Z-learning variants outperform the Q-learning variants mainly because Z-learning, unlike Q-learning, does not need a maximization operator or state-action values [Todorov2006]. Remarkably, Z-IS (without intra-task learning) still performs better than Q-learning with intra-task learning. Naive Z-learning (Z) performs better than greedy Q-learning (Q-G) because in this particular task, random exploration is still useful for learning to reach a location at one corner of the grid.

Figure 5: AGV problem: results in the primitive tasks (NAVIGATE).

The full task is composed of the four possible navigation subtasks plus the transitions resulting from applying the original PICKUP and PUTDOWN actions and the IDLE transition. Figure 4 shows the results in the full task. Since there is only one such task, intra-task learning does not apply. In this case, random exploration converges very slowly, as the curve of Z indicates. Also, the advantage of Z-learning with importance sampling over $\epsilon$-greedy Q-learning is less pronounced than in the primitive tasks.

From these results, we can conclude that the proposed extensions of Z-learning outperform the state-of-the-art learning methods for this domain.

4.2 The Autonomous Guided Vehicle (AGV) Domain

The second domain we consider is a variant of the AGV domain [Ghavamzadeh and Mahadevan2007]. In this problem, an AGV has to transport parts between machines in a warehouse. Different parts arrive at the warehouse at uncertain times, and these parts can be loaded from the warehouse and delivered to the specific machines that can process and assemble them. Once a machine terminates, the AGV can pick up the assembly and bring it to the unload location of the warehouse.

The state space of the full problem has nine components: three components for the position of the AGV ($x$, $y$ and orientation), one for the type of part that the AGV is carrying, and five to represent the different parts that are available to pick up from the warehouse or from the assembly locations. To convert the overall problem into a first-exit task, we do not allow new parts to arrive at the warehouse, and the task is to assemble all parts and deliver them to the unload station. For more details, see the supplementary material.

An important feature of this problem is that the AGV can only navigate using transitions corresponding to the primitive actions FORWARD, TURN-LEFT, TURN-RIGHT and STAY. Unlike the taxi domain, this significantly constrains the trajectories required to navigate from one location to another in the warehouse. Similar to the taxi domain, we define six primitive NAVIGATE tasks for navigating to the six dropoff and pickup locations in the warehouse. As before, we apply state abstraction in the form of projection to these tasks, ignoring the location of all parts and assemblies.

Figure 5 shows the results of the different learning methods on the NAVIGATE tasks. They are similar to those in the taxi domain, although Q-learning with intra-task learning performs comparatively better, while naive Z-learning performs comparatively worse. The latter result can be explained by the need for guided exploration when navigating in this domain.

Since the total number of states is large, we also apply state abstraction in the form of result distribution irrelevance to the overall task. Since NAVIGATE tasks always terminate in a predictable state, it is not necessary to maintain a value function for states other than dropoff and pickup locations.

We have also implemented an online algorithm similar to MAXQ-Q learning. Instead of using the value function of subtasks to estimate transition rewards, we execute each subtask until termination, recording the reward along the way. The reward accumulated during the subtask is then used as the observed immediate reward for the abstract task. To measure performance we use throughput, i.e. the number of assemblies delivered to the unload station per time step.

Figure 6: AGV problem: comparison between Q-learning and Z-learning in terms of throughput.

Figure 6 shows the relative performance of Z-learning with importance sampling and Q-learning for this MAXQ-Q variant. We omit naive Z-learning, since the throughput of a random walk is constant over time. The number of time steps includes all primitive transitions, including those of the NAVIGATE subtasks. As the figure shows, Z-learning converges more quickly to a suboptimal policy than Q-learning, illustrating the benefits of hierarchical LMDPs.

5 Conclusions

We have presented a framework for hierarchical reinforcement learning that combines MAXQ decomposition with linearly-solvable MDPs, formulating each task in the hierarchy as an LMDP. The framework has been illustrated in two domains, in which the hierarchical, intra-task Z-learning algorithm outperforms state-of-the-art learning methods for hierarchical MDPs.

Hierarchical Linearly-Solvable Markov Decision Problems
(Supplementary Material) Anders Jonsson Vicenç Gómez
Department of Information and Communication Technologies
Universitat Pompeu Fabra
Roc Boronat 138, 08018 Barcelona, Spain
{anders.jonsson,vicen.gomez}@upf.edu

Appendix A Experimental Setup

To compare with the Q-learning variants, for each task LMDP we construct a traditional MDP following the methodology of [Todorov2009b]. For each state $s$, we define a symbolic action whose transition probability distribution matches the optimal control $u^*(\cdot|s)$, which is computed using the power iteration method in the original LMDP. We also define as many other symbolic actions as there are possible next states under $p(\cdot|s)$. Their transition probabilities are obtained from $u^*(\cdot|s)$ by circular shifting. The reward of a symbolic action $a$ in state $s$ is the corresponding LMDP reward, $r(s,a)=r(s)-\lambda\,\mathrm{KL}\big(P(\cdot|s,a)\,\|\,p(\cdot|s)\big)$. The resulting traditional MDP is guaranteed to have the same optimal value function as the original LMDP.
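A rough sketch of this construction as we read it; the circular shift is applied to the optimal control's probability vector over the successors of each state, and each symbolic action is assigned the LMDP reward of the corresponding control. The exact reward assignment and data layout are our assumptions, not a verbatim reproduction of the methodology.

```python
import numpy as np

def lmdp_to_mdp(P, r, U_opt, lam=1.0):
    """Build symbolic MDP actions from an LMDP's optimal control U_opt.

    For each state s, one symbolic action reproduces u*(.|s); the remaining
    actions are circular shifts of u*(.|s) over the successors of s (assumed
    to have strictly positive probability). Returns, per state, the list of
    action distributions and their rewards.
    """
    actions, rewards = [], []
    for s in range(len(r)):
        succ = np.flatnonzero(P[s])                  # successors under p(.|s)
        probs = U_opt[s, succ]
        acts, rews = [], []
        for shift in range(len(succ)):
            q = np.roll(probs, shift)                # shift 0 is u*(.|s) itself
            kl = np.sum(q * np.log(q / P[s, succ]))  # KL(q || p(.|s))
            acts.append((succ, q))
            rews.append(r[s] - lam * kl)             # LMDP reward of control q
        actions.append(acts)
        rewards.append(rews)
    return actions, rewards
```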

Following [Todorov2006], we use dynamic learning rates that decay as $c/(c+k)$, where the constant $c$ is optimized separately for each algorithm and $k$ is the current trial. The parameter $\epsilon$ for Q-learning is also optimized for best performance.

To compare performance, we calculate, for each iteration, the norm of the difference between the learned and the optimal value function.

A.1 The Taxi Domain

The taxi domain is defined on a grid with four distinguished locations. There is a passenger at one of the four locations, and that passenger wishes to be transported to one of the other three locations. There is also a taxi that must navigate to the passenger's location, pick up the passenger, navigate to the destination location, and put down the passenger there. We use a variant of the taxi domain [Dietterich2000] with a much larger grid, as shown in Figure 7.

Figure 7: Taxi domain used for the experiments. In color, the value function of the composite task NAVIGATE.

The state space is composed of tuples $(x,y,\ell)$, where $x$ and $y$ are the horizontal and vertical coordinates of the taxi location and $\ell$ is the location of the passenger, with one value for each pickup location and one value for when the passenger is in the taxi. Just like Dietterich [Dietterich2000], we apply state abstraction in the form of projection to the navigate tasks, ignoring the passenger's location and destination. This results in considerably smaller state spaces for the navigation tasks than for the full task.

The primitive tasks (NAVIGATE) are composed of all state transitions corresponding to the navigation actions NORTH, SOUTH, EAST, WEST and IDLE (i.e. the no-op action). There are four of these primitive tasks, one for each distinguished location (corner) in the grid. The corresponding LMDPs are very similar to the grid example of [Todorov2006]: the passive dynamics is a random walk and the state-cost term is zero for the terminal states (the corresponding corner) and negative elsewhere.

In terms of scalability, the bottleneck of the algorithm is the numerical precision required for computing the exact optimal value function. This precision strongly depends on the absolute difference between the maximum and the minimum values in the matrix $GP$ (or $Q$). In order to obtain good estimates, one needs to set $\lambda$ sufficiently small, as mentioned in [Todorov2006], which increases the required numerical precision. Although we obtained correct policies for moderately large grids, for larger grids we started to have numerical problems, since the threshold required to check convergence of the power iteration method becomes very small.

A.2 The Autonomous Guided Vehicle (AGV) Domain

Figure 8: AGV domain used for the experiments.

The second domain we consider is a variant of the AGV domain [Ghavamzadeh and Mahadevan2007]. In this problem, an AGV has to transport parts between machines in a warehouse. Different parts arrive at the warehouse at uncertain times, and these parts can be loaded from the warehouse and delivered to the specific machines that can process and assemble them. Once a machine terminates, the AGV can pick up the assembly and bring it to the unload location of the warehouse.

We simplified the original problem by reducing the number of machines to two and fixing the processing time of the machines, so that the task is fully observable (see Figure 8).

The state space of the full problem has the following components:

  • $x$ coordinate of the AGV position

  • $y$ coordinate of the AGV position

  • orientation of the AGV (up, right, down, left)

  • Num. parts at the input buffer of Machine 1

  • Num. parts at the output buffer of Machine 1

  • Num. parts at the input buffer of Machine 2

  • Num. parts at the output buffer of Machine 2

  • Part of type 1 available at the warehouse

  • Part of type 2 available at the warehouse

To convert the overall problem into a first-exit task, we do not allow new parts to arrive at the warehouse, and the task is to assemble all parts and deliver them to the unload station. The resulting task has a large number of states.

The bottleneck of the algorithm is again the numerical precision required for calculating the optimal value function using the power iteration method.

References

  • [Abbasi-Yadkori et al.2015] Abbasi-Yadkori, Y.; Bartlett, P. L.; Chen, X.; and Malek, A. 2015. Large-scale Markov decision problems with KL control cost and its application to crowdsourcing. In 32nd International Conference on Machine Learning (ICML), 1053–1062.
  • [Botvinick, Niv, and Barto2009] Botvinick, M. M.; Niv, Y.; and Barto, A. C. 2009. Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective. Cognition 113(3):262–280.
  • [Dietterich1999] Dietterich, T. G. 1999. State abstraction in MAXQ hierarchical reinforcement learning. In Advances in Neural Information Processing Systems 12 (NIPS), 994–1000.
  • [Dietterich2000] Dietterich, T. G. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13:227–303.
  • [Friston et al.2013] Friston, K.; Schwartenbeck, P.; FitzGerald, T.; Moutoussis, M.; Behrens, T.; and Dolan, R. J. 2013. The anatomy of choice: active inference and agency. Frontiers in Human Neuroscience 7:598.
  • [Ghavamzadeh and Mahadevan2007] Ghavamzadeh, M., and Mahadevan, S. 2007. Hierarchical average reward reinforcement learning. Journal of Machine Learning Research 8:2629–2669.
  • [Kappen, Gómez, and Opper2012] Kappen, H. J.; Gómez, V.; and Opper, M. 2012. Optimal control as a graphical model inference problem. Machine Learning 87(2):159–182.
  • [Kappen2005] Kappen, H. J. 2005. Linear theory for control of nonlinear stochastic systems. Physical Review Letters 95:200201.
  • [Kinjo, Uchibe, and Doya2013] Kinjo, K.; Uchibe, E.; and Doya, K. 2013. Evaluation of linearly solvable Markov decision process with dynamic model learning in a mobile robot navigation task. Frontiers in Neurorobotics 7:1–13.
  • [Matsubara, Gómez, and Kappen2014] Matsubara, T.; Gómez, V.; and Kappen, H. J. 2014. Latent Kullback Leibler control for continuous-state systems using probabilistic graphical models. In 30th Conference on Uncertainty in Artificial Intelligence (UAI), 583–592. AUAI Press.
  • [Ortega and Braun2013] Ortega, P. A., and Braun, D. A. 2013. Thermodynamics as a theory of decision-making with information-processing costs. In Proceedings of the Royal Society of London A, volume 469, 20120683. The Royal Society.
  • [Sutton and Precup1998] Sutton, R. S., and Precup, D. 1998. Intra-option learning about temporally abstract actions. In Proceedings of the 15th International Conference on Machine Learning (ICML), 556–564. Morgan Kaufmann.
  • [Theodorou, Buchli, and Schaal2010] Theodorou, E.; Buchli, J.; and Schaal, S. 2010. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research 11:3137–3181.
  • [Todorov2006] Todorov, E. 2006. Linearly-solvable Markov decision problems. Advances in Neural Information Processing Systems 19 (NIPS) 1369–1376.
  • [Todorov2009a] Todorov, E. 2009a. Compositionality of optimal control laws. Advances in Neural Information Processing Systems 22 (NIPS) 1856–1864.
  • [Todorov2009b] Todorov, E. 2009b. Efficient computation of optimal actions. Proceedings of the National Academy of Sciences of the United States of America 106(28):11478–11483.
  • [Todorov2009c] Todorov, E. 2009c. Eigenfunction approximation methods for linearly-solvable optimal control problems. In Proceedings of the 2nd IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 161–168.
  • [Watkins1989] Watkins, C. J. C. H. 1989. Learning from Delayed Rewards. Ph.D. Dissertation, King’s College, Cambridge, UK.