# Compositional planning in Markov decision processes: Temporal abstraction meets generalized logic composition

In hierarchical planning for Markov decision processes (MDPs), temporal abstraction allows planning with macro-actions that take place at different time scales in the form of sequential composition. In this paper, we propose a novel approach to compositional reasoning and hierarchical planning for MDPs under temporal logic constraints. In addition to sequential composition, we introduce a composition of policies based on generalized logic composition: given sub-policies for sub-tasks and a new task expressed as a logic composition of those sub-tasks, a semi-optimal policy, which is optimal when planning with only sub-policies, can be obtained by simply composing the sub-policies. Building on this, we develop a synthesis algorithm that efficiently computes optimal policies by planning with primitive actions, policies for sub-tasks, and compositions of sub-policies, so as to maximize the probability of satisfying temporal logic specifications. We demonstrate the correctness and efficiency of the proposed method in stochastic planning examples with a single agent and multiple task specifications.

## I Introduction

Temporal logic is an expressive language for describing desired system properties: safety, reachability, obligation, stability, and liveness [18]. Algorithms for planning and probabilistic verification with temporal logic constraints have been developed, with both centralized [2, 7, 17] and distributed methods [10]. Yet, there are two main barriers to practical applications: 1) Scalability: In temporal logic constrained control problems, it is often necessary to introduce additional memory states to keep track of the evolution of state variables with respect to the temporal logic constraints. The number of additional memory states grows exponentially (or doubly exponentially, depending on the class of temporal logic) in the length of a specification [11] and makes synthesis computationally expensive. 2) Lack of flexibility: With a small change in the specification, a new policy may need to be synthesized from scratch.

To improve scalability for planning given complex tasks, composition is an idea exploited in temporal abstraction and hierarchical planning in Markov decision processes (MDPs) [24, 1]. To accomplish complex tasks, temporal abstraction allows planning with macro-actions (policies for simple subtasks) at different time scales. A well-known hierarchical planner is the options framework [20, 26, 22]. An option is a pre-learned policy for a subtask; the original task can then be completed by temporally abstracting subgoals and sequencing the subtasks’ policies.

Once an agent learns the set of options from an underlying MDP, it can use conventional reinforcement learning to learn the globally optimal policy with the original action set augmented with the set of options, also known as sub-policies or macro-actions. In light of the options framework, hierarchical planning in MDPs is evolving rapidly, with both model-free [15] and model-based [24, 25] methods, and with many practical applications in robotic systems [13, 14]. The option-critic method [1] integrates approximate dynamic programming [3] with the options framework to improve its scalability.

Since temporal logic specifications describe temporally extended goals and the options framework uses temporally abstracted actions, it may seem that applying the options framework to planning under temporal logic constraints is straightforward. However, a direct application does not take full advantage of the various compositions observed in temporal logic. The options framework captures sequential composition, but it does not consider composition for conjunction or disjunction in logic. In this paper, we are interested in answering two questions: Given two options that maximize the probabilities of and , is there a way to compose these two options to obtain a “good enough” policy that maximizes the probability of , or ? If such a composition exists, what is the smallest set of options that one needs to generate? Having multiple ways of composition makes planning more flexible and modular given temporal logic constraints. For example, consider a specification , i.e., eventually reaching the region after visiting any of the two regions and . With composition for sequential tasks only, we may generate an option that maximizes the probability of reaching and an option that maximizes the probability of reaching . With compositions of sequencing, conjunction, and disjunction of tasks, we may generate options that maximize the probabilities of reaching , , and , respectively, and compose the first two to obtain the option for . In addition, we can compose options not only for but also for , , etc. When the task changes to , the new option for need not be learned or computed, but composed.

In pursuit of answers to these two questions, this paper makes the following contributions. First, we develop an automatic decomposition procedure to generate a small set of primitive options from a given temporal logic specification. Second, we formally establish an equivalence relation between Generalized Conjunction/Disjunction (GCD) functions [8] in quantitative logic and composable solutions of MDPs using entropy-regulated Bellman operators [24, 19]. This equivalence enables us to compose policies for simple formulas/tasks to maximize the probability of satisfying formulas obtained via GCD composition of these simple formulas. Last, we use these novel composition operations to develop a hierarchical planning method for MDPs under temporal logic constraints. We demonstrate the efficiency and correctness of the proposed method with several examples.

## II Preliminaries

Notation: Let be the set of nonnegative integers. Let be an alphabet (a finite set of symbols). Given , indicates a set of strings with length , indicates a set of finite strings with length smaller or equal to , and is the empty string. is the set of all finite strings (also known as Kleene closure of ). Given a set , let be a set of probabilistic distributions with as the support.

In this paper, we consider temporal logic formulas for specifying desired properties in a stochastic system. Given a set of atomic propositions, a syntactically co-safe linear temporal logic (sc-LTL) [16] formula over is inductively defined as follows:

.

The above formula is composed of the constant true, state predicates and its negation , conjunction () and disjunction (), and the temporal operators next () and until (). The temporal operator “eventually” () is defined by: . However, the temporal operator “always” cannot be expressed in sc-LTL. A detailed description of the syntax and semantics of sc-LTL can be found in [21]. An sc-LTL formula is evaluated over finite words. In addition to the above notation, we use a backslash () between two propositions to represent logical exclusion, i.e., we rewrite to .

Given a sc-LTL formula , there exists a deterministic finite-state automaton (DFA) that accepts all strings that satisfy the formula [11]. The DFA is a tuple , where is a finite set of states, is a finite alphabet, is a deterministic transition function such that when the symbol is read at state , the automaton makes a deterministic transition to state , is the initial state, and is a set of final, accepting states. The transition function is extended to a sequence of symbols, or a word , in the usual way: for and . A finite word satisfies if and only if . The set of words satisfying is the language of the automaton , denoted .
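The DFA tuple and its extended transition function can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `DFA`, the dictionary encoding of the transition function, and the two-state "eventually a" automaton are all assumptions introduced here.

```python
class DFA:
    """Deterministic finite-state automaton with a total transition function.

    delta maps (state, symbol) pairs to the next state; this dictionary
    encoding is an assumption made for illustration.
    """

    def __init__(self, delta, init, final):
        self.delta = dict(delta)
        self.init = init
        self.final = set(final)

    def run(self, word):
        """Extended transition function: read the word symbol by symbol."""
        q = self.init
        for symbol in word:
            q = self.delta[(q, symbol)]
        return q

    def accepts(self, word):
        """A finite word is accepted iff the run ends in a final state."""
        return self.run(word) in self.final


# Hypothetical example: "eventually a" over the alphabet {"a", "b"}.
eventually_a = DFA(
    delta={("q0", "a"): "q1", ("q0", "b"): "q0",
           ("q1", "a"): "q1", ("q1", "b"): "q1"},
    init="q0",
    final={"q1"},
)
```

In the sketch, the empty word leaves the automaton in its initial state, so it is accepted only if the initial state is final.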

We consider stochastic systems modeled by MDPs. The specification is given by an sc-LTL formula and related to paths in an MDP via a labeling function.

###### Definition 1 (Labeled MDP).

A labeled MDP is a tuple where and are finite state and action sets. is the initial state distribution. The transition probability function is defined such that for any state and any action . is a finite set of atomic propositions and is a labeling function which assigns to each state a set of atomic propositions that are valid at the state . can be extended to state sequences in the usual way, i.e., for .

A finite-memory, stochastic policy in the MDP is a function . A Markovian, stochastic policy in the MDP is a function . Given an MDP and a policy , the policy induces a Markov chain where is the random variable for the -th state in the Markov chain and it holds that and and .

Given a finite (resp. infinite) path (resp. ), we obtain a sequence of labels (resp. ). A path satisfies the formula , denoted , if and only if . Given a Markov chain induced by policy , the probability of satisfying the specification, denoted is the sum of the probabilities of paths satisfying the specification.

$$\mathrm{Prob}(M^{\pi} \models \varphi) \coloneqq \mathbb{E}\left[\sum_{t=0}^{\infty} \mathbf{1}(\rho_t \models \varphi)\right],$$

where is a path of length in .

We relate each subset of atomic propositions to a propositional logic formula . A set of states satisfying the propositional logic formula for is denoted . Given subsets of proposition sets , let . Slightly abusing the notation, we use to refer to the propositional logic formula that corresponds to. The optimal planning problem for MDP under sc-LTL constraints is defined as follows.

###### Problem 1.

Given an MDP and an sc-LTL formula , design a policy that maximizes the probability of satisfying the specification, i.e.,

$$\pi \leftarrow \arg\max_{\pi} \mathrm{Prob}(M^{\pi} \models \varphi).$$

Problem 1 can be solved with dynamic programming methods in a product MDP. The idea is to augment the state space of the MDP with additional memory states—the states in the automaton , and reformulate the problem into a stochastic shortest path problem in the product MDP with the augmented state space. The reader is referred to [7] for more details. In this paper, our goal is to develop an efficient and hierarchical planner for solving Problem 1.

###### Remark 1.

The extension from sc-LTL to the class of LTL formulas can be made by expressing the specification formula using a deterministic Rabin automaton [9, 12] and performing a two-step synthesis approach: the first step is to compute the maximal accepting end components, and the second step is to solve the Stochastic Shortest Path (SSP) MDP in the product MDP (assigning reward to reaching a state in any maximal accepting end component). The details of the method can be found in [4, 7, 6]. In particular, tools that facilitate symbolic computation of maximal accepting end components have been developed [5]. In the scope of this paper, we only consider sc-LTL formulas. Yet, the approach can be generalized to handle planning for general LTL formulas with a similar two-step approach.

## III Hierarchical and compositional planning under sc-LTL constraints

In this section, we present a compositional planning method for solving Problem 1. First, we propose a task decomposition method to identify a set of modular and reusable primitive options. Second, we establish a relation between logical conjunction/disjunction and composition of primitive options. Building on the options framework, we develop a hierarchical and compositional planning method for temporal logic constrained stochastic systems.

### III-A Automata-guided generation of primitive options

We present a procedure to decompose the task in sc-LTL into a set of primitive tasks. These primitive tasks will be composed in Sec III-B to generate the set of options in hierarchical planning.

We first present an algorithm to identify primitive tasks. Given a specification automaton , let the rank of a state be the minimal number of transitions to the set of final states. Let be the set of states of rank . Thus, we have

• , and

• .

By definition, if DFA is coaccessible, i.e., for every state there is a word that takes us from to a final state, then for any state , there exists with a finite rank that includes state . Any DFA can be made coaccessible by trimming [23]. Finally, for a coaccessible DFA, we introduce a sink state to make it complete: For a state and symbol , if is undefined, then let .
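The rank of each DFA state, i.e., the minimal number of transitions to the set of final states, can be computed with a backward breadth-first search. The sketch below is only an illustration under stated assumptions: the function name `compute_ranks`, the dictionary encoding of the transition function, and the small example automaton are hypothetical and not taken from the paper.

```python
from collections import deque


def compute_ranks(delta, final):
    """Return {state: rank}, where rank is the minimal number of
    transitions needed to reach a final state.

    delta maps (state, symbol) -> next state (assumed encoding).
    States that cannot reach a final state (e.g. a sink) get no rank.
    """
    # Build the predecessor relation, ignoring the symbols.
    preds = {}
    for (q, _symbol), q_next in delta.items():
        preds.setdefault(q_next, set()).add(q)

    rank = {q: 0 for q in final}
    queue = deque(final)
    while queue:
        q = queue.popleft()
        for p in preds.get(q, ()):
            if p not in rank:  # first visit gives the shortest distance
                rank[p] = rank[q] + 1
                queue.append(p)
    return rank


# Hypothetical automaton: q0 -a-> q1 -b-> q2 (final), plus a sink state.
example_delta = {
    ("q0", "a"): "q1", ("q0", "b"): "q0", ("q0", "c"): "sink",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("sink", "a"): "sink",
}
example_ranks = compute_ranks(example_delta, final={"q2"})
```

Grouping the states of `example_ranks` by value gives the level sets; the sink state is absent because it has no finite rank.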

Based on the ranking, for each state , we distinguish two types of transitions from the state:

• A transition is progressing: and if then .

• A transition is unsafe: , where is a non-accepting state with self-loops on all symbols.

Note that the DFA may have self-loops which are not included in either progressing transition or unsafe transitions. However, we shall see later that ignoring these self-loops will not affect the optimality of the planning algorithm.

A state may have multiple progressing and unsafe transitions. Let be the set of labels for unsafe transitions on . Let be the set of labels for progressing transitions on . A conditional reachability formula is defined for as

$$\neg \varphi_{\mathsf{Unsafe}(q)} \,\mathsf{U}\, \varphi_{\mathsf{Prog}(q)},$$

where and and . This subformula is further decomposed into

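$$\varphi_q^i \coloneqq \neg \varphi_{\mathsf{Unsafe}(q)} \,\mathsf{U}\, \sigma_i, \quad \text{for each } \sigma_i \in \mathsf{Prog}(q).$$

We define the decomposition of to be the collection of conditional reachability formulas

$$\Phi_{cr} = \{\varphi_q^i,\ q \in Q \mid i = 1, \ldots, |\mathsf{Prog}(q)|\}.$$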

Next, we prune to obtain the set of primitive tasks: if and only if there does not exist a set of formulas , such that .

For each primitive task, the policy for maximizing the probability of satisfying a conditional reachability formula can be computed by solving a stochastic shortest path problem in MDP , referred to as an SSP MDP and formally defined as follows.

###### Definition 2.

A (discounted) SSP MDP is defined as a tuple where is a set of absorbing goal states and is a set of absorbing unsafe states. The transition probability function satisfies for all , for all . The planning problem is to maximize the (discounted) probability of reaching while avoiding , which is equivalent to maximizing the total (discounted) reward with the reward function defined as: For each , for all , for . is the discounting factor.

Given , the corresponding SSP MDP shares the same state and action sets with the underlying MDP that models the system. The transition function is obtained from the transition function in the original MDP by making and absorbing states. Recall that is a set of states satisfying the propositional logic formula . Note that when , the SSP MDP becomes a discounted stochastic shortest path problem, and the expected total reward becomes the discounted probability of satisfying the conditional reachability formula.
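The revision of the transition function, in which goal and unsafe states are made absorbing, can be illustrated as follows. This is a sketch under assumptions: the nested-dictionary encoding `P[s][a] = {s_next: probability}`, the function name `make_absorbing`, and the three-state example are introduced here for illustration only.

```python
def make_absorbing(P, goal, unsafe):
    """Return a copy of the transition function in which every goal or
    unsafe state loops to itself with probability one under any action.

    P is assumed to be a nested dict: P[s][a] = {s_next: probability}.
    """
    absorbing = set(goal) | set(unsafe)
    new_P = {}
    for s, by_action in P.items():
        new_P[s] = {}
        for a, dist in by_action.items():
            if s in absorbing:
                new_P[s][a] = {s: 1.0}      # self-loop, all other mass removed
            else:
                new_P[s][a] = dict(dist)    # unchanged elsewhere
    return new_P


# Hypothetical three-state example with a single action "go".
example_P = {
    "s0": {"go": {"g": 0.8, "u": 0.2}},
    "g": {"go": {"s0": 1.0}},
    "u": {"go": {"s0": 0.5, "g": 0.5}},
}
example_ssp_P = make_absorbing(example_P, goal={"g"}, unsafe={"u"})
```

The original transition function is left untouched, so the same underlying MDP can be reused to build the SSP MDP of each primitive task.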

For stochastic shortest path problems, there exists a deterministic, optimal, Markov policy. However, to compose policies, we use a class of policies called entropy-regulated policies, where the softmax Bellman operator is used instead of the hardmax Bellman operator. Given $\tau$ as the temperature parameter, the optimal value function with the softmax Bellman operator satisfies:

$$V^*(s) = \tau \log \sum_{a \in A} \exp\left\{\left(r(s,a) + \mathbb{E}_{s' \sim P(\cdot \mid s,a)} V^*(s')\right)/\tau\right\}.$$

The Q-function is:

$$Q^*(s,a) = r(s,a) + \mathbb{E}_{s' \sim P(\cdot \mid s,a)} V^*(s'),$$

and the entropy-regulated optimal policy is

$$\pi^*(a \mid s) = \exp\left((Q^*(s,a) - V^*(s))/\tau\right) = \frac{\exp(Q^*(s,a)/\tau)}{\sum_{a'} \exp(Q^*(s,a')/\tau)}.$$

In the following, by optimal policy/value function, we mean the entropy-regulated optimal policy/value function unless otherwise specified.
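The softmax Bellman backup, the Q-function, and the entropy-regulated policy above can be sketched as a small value iteration. The code is illustrative only: the function names, the nested-dictionary MDP encoding, the discount `gamma`, the stopping tolerance, and the toy MDP are assumptions, and the value of absorbing states is held at zero as a modeling choice of this sketch.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_value_iteration(P, reward, absorbing, tau=1.0, gamma=0.9,
                         tol=1e-8, max_iter=10000):
    """Iterate V(s) = tau * log sum_a exp(Q(s, a) / tau) to a fixed point.

    P[s][a] = {s_next: probability}; reward(s, a) returns a float.
    Returns (V, Q, policy), with Q and policy defined on non-absorbing states.
    """
    def q_values(s, V):
        return {a: reward(s, a) + gamma * sum(p * V[s2] for s2, p in dist.items())
                for a, dist in P[s].items()}

    V = {s: 0.0 for s in P}
    for _ in range(max_iter):
        new_V = {s: 0.0 if s in absorbing else
                 tau * logsumexp([q / tau for q in q_values(s, V).values()])
                 for s in P}
        diff = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if diff < tol:
            break
    Q = {s: q_values(s, V) for s in P if s not in absorbing}
    policy = {s: {a: math.exp((q - V[s]) / tau) for a, q in Q[s].items()}
              for s in Q}
    return V, Q, policy


# Hypothetical MDP: from s0, "safe" reaches the goal more reliably than "risky".
toy_P = {
    "s0": {"safe": {"goal": 0.9, "bad": 0.1}, "risky": {"goal": 0.5, "bad": 0.5}},
    "goal": {"safe": {"goal": 1.0}, "risky": {"goal": 1.0}},
    "bad": {"safe": {"bad": 1.0}, "risky": {"bad": 1.0}},
}
toy_reward = lambda s, a: 10.0 * toy_P[s][a].get("goal", 0.0)  # scaled goal-reaching reward
toy_V, toy_Q, toy_policy = soft_value_iteration(toy_P, toy_reward, {"goal", "bad"})
```

Because the policy is computed as `exp((Q - V) / tau)` with `V` the log-sum-exp of `Q / tau`, its entries sum to one in each state without a separate normalization step.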

It is noted that the softmax Bellman operator is also proved equivalent to the following form [19]:

$$V^*(s) = \sum_{a \in A} \pi^*(a \mid s)\left[r(s,a) - \tau \log \pi^*(a \mid s) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)} V^*(s')\right]$$

which means that in softmax optimal planning the objective trades off maximizing the reward against the total entropy of the stochastic policy, which enters the value through the $-\tau \log \pi^*(a \mid s)$ term; this trade-off is governed by the choice of $\tau$. In this case, the reward function needs to be constructed differently from the reward defined in Def. 2 to reduce the value diminishing problem, in which the entropy of the stochastic policy outweighs the total reward when reward signals are small. In this paper, we define the reward function for the entropy-regulated MDP as:

$$r(s,a) = \alpha \cdot \mathbb{E}_{s'}\, \mathbf{1}_{\mathsf{Goal}}(s'), \quad \forall s \notin \mathsf{Goal}, \tag{1}$$

where $\alpha$ is a large constant.

For each conditional reachability subtask , the optimal policy in the corresponding stochastic shortest path MDP is an option , following the definition in [26, 22] in which is an initiation set and is the domain of and is the termination function, defined by only if . We refer to the option for task as .

###### Example 1.

Consider the DFA in Fig. 1; the corresponding sc-LTL task specification is (reach and then reach regions and , always avoiding ). The set of atomic propositions is . The level sets are , , and . For a given state, for example, , the set of labels for progressing transitions is . The set of primitive tasks is . However, , and are not primitive tasks. Three primitive options are computed as: , for .

### III-B Composition of options with disjunction and conjunction

For a given state , we have obtained a set of conditional reachability formulas , for where is the total number of primitive tasks generated from . However, a progressing transition can be made by satisfying any of the conditional reachability formulas. That is to say, we may be interested in synthesizing an option that maximizes , or potentially the conjunction/disjunction of a subset of . A naive approach is to take the new specification , construct a DFA, and synthesize the optimal policy using the methods of [7, 2] for MDPs under temporal logic constraints. However, we are interested in finding a “good enough” policy given the new specification via composing existing policies. The problem is formally stated as follows.

###### Problem 2.

Given two conditional reachability formulas , and , construct a good enough policy given the goal of maximizing the probability of satisfying a) the disjunction: ; or b) the conjunction: ; or c) the exclusion or .

The definition of “good enough” policies will be provided later. Here, we consider the case when and share the same set of unsafe states. Particularly, if , and for some and , then it is always the case that and share the same set of unsafe states. Next, we propose a method for policy composition based on generalized logic conjunction/disjunction [8], which is briefly introduced below.

#### Generalized conjunction/disjunction

Generalized Conjunction-Disjunction (GCD) was introduced in [8] for quantitative reasoning with logic formulas. GCD is a mapping , , that has properties similar to logic conjunction and disjunction. The level of similarity is adjustable using a parameter , called the conjunction degree (andness). Formally, let be variables representing the level of truthfulness for a set of logic formulas , the GCD formula , which unifies conjunction and disjunction, is defined as,

$$\lambda(x_1,\ldots,x_n \mid \eta) = \frac{1}{\eta}\log\left(\sum_{i=1}^{n} W_i \exp(\eta x_i)\right), \qquad 0 < |\eta| < +\infty, \tag{2}$$

where, when $\eta \to +\infty$, $\lambda$ recovers the conventional disjunction, and when $\eta \to -\infty$, $\lambda$ recovers the conventional conjunction. For any , returns a level of truthfulness of a GCD. In addition, the parameter $W_i$ is the corresponding weight (or relative importance) of the $i$-th formula, for $i = 1, \ldots, n$.
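A short numerical sketch may help to see how Eq. (2) interpolates between disjunction and conjunction. The function name `gcd_lambda` and the sample truth levels are assumptions made for illustration; the weights are assumed to be strictly positive so that their logarithm exists.

```python
import math


def gcd_lambda(xs, eta, weights=None):
    """Evaluate (1/eta) * log(sum_i W_i * exp(eta * x_i)) for finite eta != 0.

    xs are levels of truthfulness; weights default to 1 and must be positive.
    A max-shift keeps the exponentials from overflowing for large |eta|.
    """
    if eta == 0 or not math.isfinite(eta):
        raise ValueError("eta must be finite and nonzero")
    if weights is None:
        weights = [1.0] * len(xs)
    terms = [math.log(w) + eta * x for w, x in zip(weights, xs)]
    m = max(terms)
    return (m + math.log(sum(math.exp(t - m) for t in terms))) / eta


levels = [0.2, 0.9]                          # hypothetical truth levels
near_disjunction = gcd_lambda(levels, 200.0)   # close to max(levels)
near_conjunction = gcd_lambda(levels, -200.0)  # close to min(levels)
```

With unit weights, a large positive `eta` returns a value close to the maximum of the inputs and a large negative `eta` a value close to the minimum, which matches the disjunction and conjunction limits of the GCD.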

We use GCD to compose a “good enough” policy, that is, the optimal policy in semi-MDP planning.

###### Definition 3.

[26, 22] Given an MDP and a set of options where is a set of initial states, is a termination condition and is a policy in the MDP . An option policy in is a function . Let be the set of option policies in . Given a reward function , an option policy is optimal if and only if it maximizes the total discounted reward:

$$\pi^{o,*} = \arg\max_{\pi^o \in \Pi^o} \mathbb{E}_{\pi^o} \sum_{t=1}^{\infty} r(s_t, o_t) \tag{3}$$

where is the total accumulated rewards when the policy of option is applied to the MDP for the duration of steps.

###### Assumption 1.

Given a conditional reachability subtask , the optimal policy that maximizes the probability of satisfying induces in an absorbing Markov chain .

In other words, with probability one, the system will visit an absorbing state.

###### Lemma 1.

Under Assumption 1, given a set of options where , is the softmax optimal policy for maximizing the probability of satisfying , i.e., the SSP MDP where is the termination function. In the MDP , the optimal option policy for maximizing the GCD for any , with unit weights, i.e., , , is

$$\pi^o(s)[j] = \frac{\sum_{i=1}^{n} \exp(\eta Q_i(s, o_j))}{\sum_{o_k \in O} \sum_{i=1}^{n} \exp(\eta Q_i(s, o_k))}, \quad \text{for } j = 1, \ldots, n, \tag{4}$$

where is the evaluation of policy with respect to specification .
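Eq. (4) can be evaluated directly once the option values at a state are available. The following sketch assumes they are given as a table `Q[i][j]`, the value of option `o_j` with respect to the `i`-th formula at a fixed state; the function name and the numbers are hypothetical and serve only to illustrate the formula.

```python
import math


def composed_option_policy(Q, eta):
    """Distribution over options at one state, following Eq. (4).

    Q[i][j] is the evaluation of option j with respect to formula i
    at the state of interest (assumed layout). eta > 0 corresponds to
    disjunction and eta < 0 to conjunction.
    """
    n_formulas, n_options = len(Q), len(Q[0])
    # Subtract a common shift so that the exponentials cannot overflow;
    # the shift cancels between numerator and denominator.
    shift = max(eta * q for row in Q for q in row)
    scores = [sum(math.exp(eta * Q[i][j] - shift) for i in range(n_formulas))
              for j in range(n_options)]
    total = sum(scores)
    return [score / total for score in scores]


# Hypothetical values: option 0 is best for formula 0, option 1 for formula 1.
Q_at_state = [[0.9, 0.3],
              [0.2, 0.8]]
disjunction_policy = composed_option_policy(Q_at_state, eta=5.0)
conjunction_policy = composed_option_policy(Q_at_state, eta=-5.0)
```

The same function covers both cases of the lemma; only the sign of `eta` changes.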

###### Proof.

For state and an option , by definition of GCD, we have

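$$\lambda(\varphi_1,\ldots,\varphi_n \mid \eta; s, o_j) = \frac{1}{\eta}\log\left(\sum_{i=1}^{n} \exp\left(\eta\, \mathbb{E}_j[\mathbf{1}(s \models \varphi_i)]\right)\right),$$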

where the expectation is taken over paths in the Markov chain . Given the option , —the evaluation of policy with respect to specification and thus .

Next, we distinguish two cases between conjunction and disjunction:

Case I (Disjunction): : To maximize given an option-only decision rule , based on the softmax operator, when , where is a temperature parameter. When , then

$$\pi^o(s)[j] = \frac{\sum_{i=1}^{n} \exp(\eta Q_i(s, o_j))}{\sum_{o_k \in O} \sum_{i=1}^{n} \exp(\eta Q_i(s, o_k))}, \quad \text{for } j = 1, \ldots, n, \tag{5}$$

which is the same as in (4).

Case II (Conjunction): : In this case, we have

$$\lambda(\varphi_1,\ldots,\varphi_n \mid \eta; s, o_j) = -\frac{1}{|\eta|}\log\left(\sum_{i=1}^{n} \exp\left(-|\eta|\, \mathbb{E}_j[\mathbf{1}(s \models \varphi_i)]\right)\right). \tag{6}$$

Maximizing is equivalent to minimizing where is the level of truthfulness for formula . Further, minimizing is equivalent to minimizing as , which is exactly the opposite case to that of disjunction. Thus, the optimal option policy satisfies (softmin operator) where is a temperature parameter. When , then given , for ,

$$\pi^o(s)[j] = \frac{\sum_{i=1}^{n} \exp(-Q_i(s, o_j)/\tau)}{\sum_{o_k \in O} \sum_{i=1}^{n} \exp(-Q_i(s, o_k)/\tau)} = \frac{\sum_{i=1}^{n} \exp(\eta Q_i(s, o_j))}{\sum_{o_k \in O} \sum_{i=1}^{n} \exp(\eta Q_i(s, o_k))},$$

which is the same as in (4). Thus the proof is completed. ∎

Next we show that the GCD method can be inverted to compute the exclusion . Since , applying the generalized disjunction in Eq. (2) to the MDP gives

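$$\exp\left(\eta\, \mathbb{E}_j[\mathbf{1}(s \models \varphi_1)]\right) \approx \exp\left(\eta\, \lambda(\varphi_1 \wedge \varphi_2, \varphi_1 \setminus \varphi_2 \mid \eta; s, o_j)\right) = \exp\left(\eta\, \mathbb{E}_j[\mathbf{1}(s \models \varphi_1 \wedge \varphi_2)]\right) + \exp\left(\eta\, \mathbb{E}_j[\mathbf{1}(s \models \varphi_1 \setminus \varphi_2)]\right).$$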

Therefore, the policy of task exclusion can be computed by

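$$\frac{1}{|\eta|}\log\left(\exp\left(\eta\, \mathbb{E}_j[\mathbf{1}(s \models \varphi_1)]\right) - \exp\left(\eta\, \mathbb{E}_j[\mathbf{1}(s \models \varphi_1 \wedge \varphi_2)]\right)\right).$$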
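The inversion can be checked numerically: composing two values with the generalized disjunction and then subtracting one of them in the exponential domain should return the other. The sketch below assumes a positive `eta` and unit weights; the function name `exclusion_value` and the sample values are hypothetical.

```python
import math


def exclusion_value(x_phi1, x_phi1_and_phi2, eta):
    """Evaluate (1/|eta|) * log(exp(eta*x_phi1) - exp(eta*x_phi1_and_phi2)).

    Assumes eta > 0 and x_phi1 > x_phi1_and_phi2, so that the argument
    of the logarithm is positive.
    """
    if eta <= 0:
        raise ValueError("this sketch assumes eta > 0 (disjunction case)")
    a, b = eta * x_phi1, eta * x_phi1_and_phi2
    if b >= a:
        raise ValueError("the conjunction value must be below the value of phi1")
    # log(e^a - e^b) = a + log(1 - e^(b - a)), computed stably with log1p.
    return (a + math.log1p(-math.exp(b - a))) / abs(eta)


# Round trip with hypothetical values for the conjunction and the exclusion.
eta = 5.0
x_and, x_excl = 0.2, 0.6
x_phi1 = math.log(math.exp(eta * x_and) + math.exp(eta * x_excl)) / eta
recovered = exclusion_value(x_phi1, x_and, eta)
```

The round trip recovers `x_excl` up to floating-point error, which is the sense in which the disjunction of Eq. (2) is invertible.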

Intuitively, for the case of disjunction, this policy makes sense: given that is the optimal policy for satisfying , selects policy with a likelihood proportional to plus some bonus obtained by satisfying other specifications. Given two specifications and , since the disjunction can be satisfied by satisfying only one of the two, this policy exponentially prefers to if has a higher probability of being satisfied.

The situation is more complicated for conjunction. The conjunction of two formulas, and , is . For any state, the planner will select the option with a probability that is inversely proportional to , i.e., if has a lower probability of being satisfied, then option has a higher probability of being chosen. Once it reaches , it will select option with a higher probability because for to force a visit to . However, without memory, the planner will alternate between the two options indefinitely, or until it reaches an unsafe region. Thus the conjunction of multiple memoryless options requires additional memory to manage the switching condition of the termination function among goals. However, when the intersection and either option has a nonzero probability of reaching the intersection, a memoryless composed option may eventually reach a state in . Thus, we may approximate the solution of with a memoryless composed option for the conjunction. In this paper, we focus only on the memoryless option; further discussion of the additional memory method is left to future work.

Finally, the set of options includes both primitive options (one for each primitive task) and composed options obtained using GCD. The set of actions is now augmented with options , and the optimal policy can be obtained by solving the following planning problem in the product MDP with an augmented action space.

###### Definition 4.

Given a labeled MDP and a co-safe linear temporal logic (LTL) formula , represented by a DFA , and a set of options, the product MDP with macro- and micro-actions is defined by

$$M \ltimes \mathcal{A}_{\varphi} = (S \times Q,\ A \cup O,\ \bar{P},\ \bar{\mu}_0),$$

where the probabilistic transition function is defined by: and where and , is the number of steps that are taken under the (policy) of option before being interrupted by triggering a discrete transition in the automaton, is the Markov chain induced by the policy of option , and . The reward function is defined by , when , and is the total accumulated rewards when the policy of option is applied to the MDP for the duration of steps before it is interrupted.

Note that when the chain is absorbing for any option policy, the discounting factor can be set to . By setting , we encourage satisfying the specification in fewer steps.

###### Remark 2.

Planning is performed in the product MDP with both actions and options. It is guaranteed to recover the optimal policy that would be obtained had only actions been used. Having options helps to speed up convergence. Note that even if self-loops in the DFA have not been considered in generating primitive and composed options, the optimality of the planner will not be affected.

## IV Case Study

This section illustrates our compositional planning method using robotic motion planning problems. All experiments in this section are performed on a computer equipped with an Intel Core i7-5820K and 32GB of RAM, running a Python 3.6 script on 64-bit Ubuntu 16.04 LTS.

The environment is modeled as a 2D grid world, shown in Fig. 2. The robot has actions: Up, Down, Left, and Right. With probability 0.7, the robot arrives at the cell it intended with the action and has a probability of to transit to other adjacent locations, where is the number of adjacent cells, including the current cell when the action is applied. In particular, when the transition hits the boundary of the grid world, the probability of that transition is added to the probability of staying put. The discounting factor is fixed at . Fig. 2 shows a grid world of size . In this grid world, there is a set of obstacles (unsafe cells) marked with cross signs and several regions of interest, marked with numbers. The cell marked with number is labeled with symbol , for . Region marked with number is the nonempty intersection satisfying .

We consider three sc-LTL tasks where

$$\varphi_1 \coloneqq \neg C \,\mathsf{U}\, \bigl(\Diamond(\sigma_1 \wedge (\Diamond\sigma_2 \wedge \Diamond\sigma_3))\bigr),$$

(reach and then reach regions and , while avoiding .)

$$\varphi_2 \coloneqq \neg C \,\mathsf{U}\, \bigl(\Diamond((\sigma_1 \vee \sigma_3) \wedge \Diamond\sigma_2)\bigr),$$

(reach either or and then reach , while avoiding .)

$$\varphi_3 \coloneqq \neg C \,\mathsf{U}\, \bigl(\Diamond((\sigma_1 \vee \sigma_2) \wedge \Diamond(\sigma_2 \wedge \sigma_3))\bigr),$$

(reach either or and then reach either or , while avoiding .)

Figure 1.a shows the DFA of . We omit the DFAs for and given the limited space. Based on the set of tasks, the following set of primitive tasks are generated:

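$$\phi_1 \coloneqq \neg C \,\mathsf{U}\, \sigma_1, \quad \phi_2 \coloneqq \neg C \,\mathsf{U}\, \sigma_2, \quad \phi_3 \coloneqq \neg C \,\mathsf{U}\, \sigma_3.$$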

For each conditional reachability specification, we formulate the SSP MDP and compute the softmax optimal policy using temperature parameter and the reward function defined in Eq. 1, where parameter is selected to be to prevent the reward from being outweighed by the entropy term.

Our first experiment is to demonstrate the composition of policies based on GCD.

#### Policy composition

We use composition of , to generate the options where . To validate that the composed policies are “good enough”, we compare the values of the optimal policies for these two formulas, computed using standard value iteration, and the values of the composed policies obtained via Lemma 1. For comparison, we consider relative errors and . We have , , , .

Figure 3 shows heat maps comparing two option value functions for the case of disjunction or conjunction. The shaded areas represent globally unsafe regions whose values are always fixed at zero during the iteration. In Fig. 3, all value distributions are in the range from to because we scaled the reward of by to prevent the entropy term from outweighing the total reward. Fig. 3 (c) shows that the value of regions marked by either or is highest, corresponding to the disjunction. In the case of conjunction in Fig. 3 (d), the intersection of regions marked by and has the highest value.

Next, we compare the convergence of three different planning methods for three task specifications: planning with only micro-actions (action), planning with macro-actions (primitive and composed options), and planning with both micro- and macro-actions (mixed). In addition, we compare the optimality of these planners against optimal planning with only micro-actions using the hardmax Bellman operator as the baseline. The quantities compared are the speed of convergence and the optimality of the converged policy.

Figure 4 shows the convergence of the value function evaluated at the initial state with given specification . It shows that among all three methods, the mixed planner converges the fastest. Both the option and mixed planners converge much faster than the action planner: the action planner converges after about 20 iterations, while the other two converge after 6-9 iterations. It is also interesting to notice that the policy obtained by the mixed planner achieves a higher value compared to the action planner. This is because the entropy of the policy weighs less in the policy obtained by the mixed planner than in that obtained by the action planner in softmax optimal planning. Moreover, the influence of entropy can also be observed between the softmax action planner (action) and the hardmax action planner (optimal), where the two planners converge at almost the same rate but to different values, since softmax adds additional policy entropy to the total value.

Table I compares the performance of three planners in the given three tasks. The entry refers to the probability of satisfying the specification from an initial state under the optimal policies obtained by three planners. Number is the number of value iterations taken for each method to converge with a pre-defined error tolerance threshold . Number shows the CPU time costs.

In terms of iterations to convergence and CPU time, the option planner significantly outperforms the other two. Considering the additional time cost of learning the primitive options, the experiment shows that every single option takes on average 30 iterations to converge in a option state space, and in total costs seconds to compute all the options for primitive tasks. However, these options only need to be solved once and are reused across the three tasks. The composition of options takes negligible computation time ( seconds on average for each composition). The performance loss of the option planner, compared with the globally optimal planner (using hardmax Bellman), is only of the optimal value for task and negligible for tasks and . Composition makes temporal logic planning flexible: If we change a task from to , then the option and mixed planner can quickly generate new, optimal policies without reconstructing primitive optio