Compact Policies for Fully-Observable Non-Deterministic Planning as SAT

06/25/2018 · Tomas Geffner et al. · Universitat Pompeu Fabra

Fully observable non-deterministic (FOND) planning is becoming increasingly important as an approach for computing proper policies in probabilistic planning, extended temporal plans in LTL planning, and general plans in generalized planning. In this work, we introduce a SAT encoding for FOND planning that is compact and can produce compact strong cyclic policies. Simple variations of the encodings are also introduced for strong planning and for what we call, dual FOND planning, where some non-deterministic actions are assumed to be fair (e.g., probabilistic) and others unfair (e.g., adversarial). The resulting FOND planners are compared empirically with existing planners over existing and new benchmarks. The notion of "probabilistic interesting problems" is also revisited to yield a more comprehensive picture of the strengths and limitations of current FOND planners and the proposed SAT approach.


Introduction

Planning is the model-based approach to autonomous behavior. A planner produces a plan for a given goal and initial situation using a model of the actions and sensors. In fully-observable non-deterministic (FOND) planning, actions may have non-deterministic effects and states are assumed to be fully observable [Cimatti et al.2003]. FOND planning is closely related to probabilistic planning in Markov Decision Processes (MDPs), except that uncertainty about successor states is represented by sets rather than probabilities. However, the policies that achieve the goal with probability 1, the so-called proper policies [Bertsekas and Tsitsiklis1996], correspond exactly to the strong cyclic policies of the associated FOND model [Daniele, Traverso, and Vardi1999] where the possible transitions are those with non-zero probabilities [Geffner and Bonet2013].

FOND planning has become increasingly important as a way of solving other types of problems. In generalized planning, one is interested in plans that provide the solution to multiple, and even infinite, collections of instances [Srivastava, Immerman, and Zilberstein2011, Hu and De Giacomo2011]. For example, the policy “if end not visible, then move” can take an agent to the end of a grid regardless of the initial location of the agent and the size of the grid. In many cases general policies can be obtained effectively from suitable FOND abstractions [Srivastava et al.2011, Bonet and Geffner2015]. For example, the policy above can be obtained from a FOND abstraction over a grid where the “move” actions become non-deterministic and can leave the agent in the same cell. The non-determinism is a result of the abstraction [Bonet et al.2017]. Planning for extended temporal (LTL) goals like “forever, visit each of the rooms eventually”, which require “loopy” plans, has also been reduced to FOND planning in many cases [Camacho et al.2017]. In such a case, the non-determinism enters as a device for obtaining infinite executions. For example, the extended temporal goal “forever eventually p” can be reduced to reaching a dummy goal once the deterministic outcomes containing p are replaced by non-deterministic outcomes [Patrizi, Lipovetzky, and Geffner2013].

In spite of the increasing importance of FOND planning, research on effective computational approaches has slowed down in recent years. There are indeed OBDD-based planners like MBP and Gamer [Cimatti et al.2003, Kissmann and Edelkamp2009], planners relying on explicit AND/OR graph search like MyND and Grendel [Mattmüller et al.2010, Ramirez and Sardina2014], and planners that rely on classical algorithms like NDP, FIP, and PRP [Kuter et al.2008, Fu et al.2011, Muise, McIlraith, and Beck2012], yet recent ideas, formulations, and new benchmarks have been scarce. This may have to do with the fact that one of these planners, namely PRP, does incredibly well on the existing FOND benchmarks. This is somewhat surprising, though, given that non-determinism plays a passive role in the search for plans in PRP, which is based on the computation of classical plans using the deterministic relaxation [Yoon, Fern, and Givan2007].

The goals of this work are twofold. On the one hand, we want to improve the analysis of the computational challenges in FOND planning by appealing to three dimensions of analysis: problem size, policy size, and robust non-determinism. For this last dimension we provide a precise measure that refines the notion of “probabilistic interesting problems” introduced for distinguishing challenging forms of non-determinism from trivial ones [Little and Thiebaux2007]. Non-determinism is trivial when there is no need to take it into account when reasoning about the future, which is the case when the risk involved is minimal. In on-line FOND planning, the risk is the probability of not reaching the goal. This is the type of risk that Little and Thiebaux had in mind when they introduced the notion of probabilistic interesting problems. In off-line FOND planning, on the other hand, any complete algorithm will reach the goal with probability 1 in solvable problems. The “risk” in such a case lies in the computational cost of producing one such solution. We will see that this computational cost for FOND planners based on classical planning can be estimated and used to analyze current benchmarks and to introduce new ones.

The second goal of the paper is to introduce a new approach to FOND planning based on SAT. The potential advantage of SAT approaches to FOND planning is that while classical replanners reason about the possible executions of the plan one by one, the SAT approach performs inference about all branching executions in parallel (interleaved). Moreover, while previous SAT approaches to FOND planning rely on CNF encodings where there is a propositional symbol for each possible state [Baral, Eiter, and Zhao2005, Chatterjee, Chmelik, and Davies2016], we develop a compact encoding that can produce compact policies too. That is, the size of the encodings does not grow with the number of states in the problem, and the size of the resulting policies does not necessarily grow with the number of states that are reachable with the policy. Simple variations of the encoding are introduced for strong planning and for what we call dual FOND planning, where some non-deterministic actions are assumed to be fair (e.g., probabilistic) and others unfair (e.g., adversarial). The resulting SAT-based FOND planner is compared empirically with Gamer, MyND, and PRP.

The paper is organized as follows. We review FOND planning, classical approaches, and the challenge of non-determinism. We then introduce the new SAT approach, its formal properties, optimizations that preserve these properties, and the empirical evaluation. We then look at the variations required for strong and dual FOND planning, and draw final conclusions.

FOND Planning

A FOND model is a tuple M = ⟨S, s0, SG, Act, A, F⟩ where S is a finite set of states, s0 ∈ S is the initial state, SG ⊆ S is a non-empty set of goal states, Act is a set of actions, A(s) ⊆ Act is the set of actions applicable in the state s, and F(a, s) for a ∈ A(s) represents the non-empty set of successor states that follow action a in state s. A FOND problem P is a compact description of a FOND model in terms of a finite set of atoms, so that the states in S correspond to truth valuations over the atoms, represented by the set of atoms that are true. The standard syntax for FOND problems is a simple extension of the STRIPS syntax for classical planning. A FOND problem is a tuple P = ⟨At, I, Ops, G⟩ where At is a set of atoms, I ⊆ At is the set of atoms true in the initial state s0, Ops is a set of actions with atomic preconditions and effects, and G ⊆ At is the set of goal atoms. If ⟨Add, Del⟩ represents the sets of positive and negative effects of an action in the classical setting, action effects in FOND planning can be deterministic, of the form E = ⟨Add, Del⟩, or non-deterministic, of the form E1 | ⋯ | En where each Ei is a pair ⟨Addi, Deli⟩. Alternatively, a non-deterministic action a with effect E1 | ⋯ | En can be regarded as a set of deterministic actions a1, …, an with effects E1, …, En respectively, written as a = a1 | ⋯ | an, all sharing the preconditions of a. The application of a results in the application of one of the actions ai chosen non-deterministically.
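To make this representation concrete, the following minimal Python sketch (the names and the tireworld-style action are illustrative, not taken from the paper) stores a non-deterministic action as a tuple of outcomes and derives its deterministic siblings, which is also all that is needed to build the all-outcome deterministic relaxation used later.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Atom = str

@dataclass(frozen=True)
class Effect:
    add: FrozenSet[Atom]
    delete: FrozenSet[Atom]

@dataclass(frozen=True)
class Action:
    name: str
    precondition: FrozenSet[Atom]
    outcomes: Tuple[Effect, ...]   # one outcome: deterministic; several: non-deterministic

    def siblings(self) -> List["Action"]:
        """Deterministic siblings a_1 | ... | a_n, all sharing the precondition of a."""
        return [Action(f"{self.name}#{i}", self.precondition, (eff,))
                for i, eff in enumerate(self.outcomes)]

# Illustrative tireworld-style action: driving may or may not produce a flat tire.
drive = Action(
    name="drive-l1-l2",
    precondition=frozenset({"at-l1", "not-flat"}),
    outcomes=(
        Effect(add=frozenset({"at-l2"}), delete=frozenset({"at-l1"})),
        Effect(add=frozenset({"at-l2", "flat"}), delete=frozenset({"at-l1", "not-flat"})),
    ),
)

# The all-outcome deterministic relaxation simply collects all siblings.
relaxed_actions = drive.siblings()
```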

A policy π for a FOND problem P is a partial function mapping non-goal states s into actions a ∈ A(s). A policy π for P induces state trajectories s0, s1, … where ai = π(si) and si+1 ∈ F(ai, si) for i ≥ 0. A trajectory induced by π is complete if its last state sk is the first state in the sequence such that sk is a goal state or π(sk) is undefined, or if the trajectory is infinite. Similarly, a trajectory induced by the policy π is fair if it is finite, or if infinite occurrences of a state s in the trajectory with a = π(s) are followed an infinite number of times by the state s′ that results from a and s, for each s′ ∈ F(a, s). A policy π is a strong solution for P if the complete state trajectories induced by π are all goal reaching, and it is a strong cyclic solution for P if the complete state trajectories induced by π that are fair are all goal reaching. Strong and strong cyclic solutions are also called strong and strong cyclic policies for P, respectively.
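For small problems whose models can be enumerated explicitly, the strong cyclic condition can be checked directly: the policy must be defined on every non-goal state it can reach, and from every reachable state the goal must be reachable inside the policy-restricted graph. The sketch below is a brute-force checker along these lines; it assumes an explicit successor function F and is meant only to illustrate the definition, not the verification machinery of any of the planners discussed.

```python
from collections import deque
from typing import Any, Callable, Dict, Hashable, Iterable, Set

State = Hashable

def is_strong_cyclic(s0: State,
                     goals: Set[State],
                     policy: Dict[State, Any],
                     F: Callable[[Any, State], Iterable[State]]) -> bool:
    """Check the strong cyclic condition on an explicit FOND model."""
    # 1. Collect all states reachable from s0 under the policy.
    reachable, frontier = {s0}, deque([s0])
    while frontier:
        s = frontier.popleft()
        if s in goals:
            continue
        if s not in policy:          # open (unclosed) policy: not a solution
            return False
        for t in F(policy[s], s):
            if t not in reachable:
                reachable.add(t)
                frontier.append(t)

    # 2. From every reachable state, the goal must be reachable under the policy.
    def goal_reachable(s: State) -> bool:
        seen, queue = {s}, deque([s])
        while queue:
            u = queue.popleft()
            if u in goals:
                return True
            for t in F(policy[u], u):
                if t not in seen:
                    seen.add(t)
                    queue.append(t)
        return False

    return all(goal_reachable(s) for s in reachable)
```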

The methods for computing strong and strong cyclic solutions to FOND problems have been mostly based on OBDDs [Cimatti et al.2003, Kissmann and Edelkamp2009], explicit forms of AND/OR search [Mattmüller et al.2010, Ramirez and Sardina2014], and classical planners [Kuter et al.2008, Fu et al.2011, Muise, McIlraith, and Beck2012]. Some of the planners compute compact policies in the sense that the size of the policies, measured by their representation, can be exponentially smaller than the number of states reachable with the policy. This is crucial in some benchmark domains where the number of states reachable in the solution is exponential in the problem size.

Classical Replanning for FOND Planning

The FOND planners that scale up best are built on top of classical planners. These planners, which we call classical replanners, all follow a loop where many classical plans are computed until the set of classical plans forms a strong cyclic policy. In this loop, non-determinism plays a passive role: it is not taken into account for computing the classical plans but only for determining which plans are still missing, if any. The good performance of these planners is a result of the robustness and scalability of classical planners and of the types of benchmarks considered so far. We describe classical replanners for FOND first, and then turn to three dimensions for analyzing challenges and benchmarks.

The (all-outcome) deterministic relaxation of a FOND problem P is obtained by replacing each non-deterministic action a = a1 | ⋯ | an by the set of deterministic actions a1, …, an. A weak plan for P refers to a classical plan for the deterministic relaxation of P.

For a given FOND problem P, complete classical replanners yield strong cyclic policies that solve P by computing a partial function Π mapping non-goal states s into classical plans Π(s) for the deterministic relaxation of P with initial state s. We write Π(s) = a·σ to denote a plan for s in the relaxation that starts with the action a followed by the action sequence σ. The following conditions ensure that the partial function Π encodes a strong cyclic policy for P [Geffner and Bonet2013]:

  1. Init: Π(s0) is defined, i.e., s0 ∈ Dom(Π) (unless s0 is a goal state),

  2. Consistency: if Π(s) = a·σ and s′ = f(a, s) is not a goal state, then Π(s′) = σ,

  3. Closure: if Π(s) = a·σ and s″ ∈ F(a, s), then Π(s″) is defined unless s″ is a goal state.

In these conditions, f(a, s) denotes the single next state for actions a in the relaxation, while F(a, s) denotes the set of possible successor states for actions a in the original problem P, with F(a, s) thus set to {f(a1, s), …, f(an, s)} when a = a1 | ⋯ | an.

A partial function Π that complies with conditions 1–3 encodes a strong cyclic solution π to P. If Π(s) = a·σ is the plan for s in the relaxation, then π(s) = a if a is a deterministic action in P, and π(s) = b if a is a deterministic sibling of a non-deterministic action b in P. Any strong cyclic plan for P can be expressed as such a partial mapping of states into plans for the relaxation. The different classical replanners construct the function Π in different ways, usually starting with a first plan Π(s0) and then enforcing consistency and closure. In a problem with no deadend states, the process finishes monotonically in a number of iterations and classical planner calls that is bounded by the number of states that are reachable with the policy. PRP uses regression to reduce this number, resulting in policies that map partial states into actions and may have an exponentially smaller size. In the presence of deadends, the computation in PRP is similar, but the process is restarted from scratch with more action-state pairs excluded each time the classical planner fails to find a plan and close the function Π. An additional component of PRP is an algorithm for inferring and generalizing deadends that in certain cases can exclude many weak plans from consideration in one shot.
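A generic version of this loop can be sketched as follows. The helpers classical_plan (any classical planner run on the all-outcome relaxation from a given state), f (the deterministic successor of a relaxation action), outcomes (the possible successors of the original, possibly non-deterministic action behind a relaxation action), and is_goal are assumptions of the sketch; deadend handling and PRP's regression are omitted, so this is the basic loop for problems without deadends rather than PRP itself.

```python
from typing import Dict, List

def build_strong_cyclic_policy(P, s0, classical_plan, f, outcomes, is_goal):
    """Build a partial map Pi from states to relaxation plans that satisfies
    the init, consistency, and closure conditions (no-deadend setting)."""
    Pi: Dict = {}                 # state -> suffix of a weak plan (relaxation actions)
    pending: List = [s0]          # init: a plan is needed for s0
    while pending:
        s = pending.pop()
        if is_goal(s) or s in Pi:
            continue
        plan = classical_plan(P, s)       # weak plan for the relaxation from s
        if plan is None:
            return None                   # deadend reached: the basic loop gives up
        state = s
        for i, a in enumerate(plan):
            if is_goal(state) or state in Pi:
                break
            Pi[state] = plan[i:]          # consistency: suffixes along the weak plan
            for t in outcomes(a, state):  # closure: cover every possible outcome
                if not is_goal(t) and t not in Pi:
                    pending.append(t)
            state = f(a, state)           # follow the deterministic relaxation
    return Pi
```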

Challenges in FOND Planning

The challenges in (exact) FOND planning have to do with three dimensions: problem size, policy size, and robust non-determinism.

Problem Size. The size of the state space for a FOND problem is exponential in the number of problem atoms. This is like in classical planning. Approaches relying on classical planners, in particular PRP, appear to be the ones that scale up best. This, by itself, however, is not surprising given that classical planning problems are (deterministic) FOND problems and FOND approaches that do not rely on classical planners won’t be as competitive. The trivial conclusion is that problem size alone is likely to exclude non-classical approaches to FOND planning in certain classes of problems.

Policy Size. Many FOND problems have solutions of exponential size. This situation is uncommon in classical planning (one example is Towers of Hanoi) but rather common in the presence of non-determinism. An example of such a domain is tireworld. In tireworld and its variations, there are roads leading to the goal with spare tires at some locations. A drive action moves the car from one location to the next and may result in a flat tire. The fix action requires a spare at the location. The number of states reachable by the solution policy is exponential in the length of the road as, while the car moves from each location to the goal, it may leave a spare behind or not. Exponential policy size excludes all (exact) FOND approaches except symbolic methods like MBP and Gamer, and those using regression like PRP and Grendel. It is only for some problems like tireworld, however, that PRP can compute correct policies using regression without having to enumerate all reachable states. This is achieved by a fast but incomplete verification algorithm [Muise, McIlraith, and Beck2012]. In general, the correctness and the completeness of PRP rely on this enumeration, and this means that PRP, like methods that compute flat, non-compact policies, will not scale up in general to problems with policies that reach an exponential number of states. This limitation of PRP could potentially be addressed by using a complete verification algorithm working on the compact representation; this, however, would require regression over non-deterministic actions [Rintanen2008] and not just over deterministic plans.

Robust Non-Determinism. In the first MDP planning competition, the planner that did best was a simple classical replanner used on-line, called FF-replan [Yoon, Fern, and Givan2007]. Little and Thiebaux argued then that the MDP and corresponding FOND evaluation benchmarks were not “probabilistic interesting” in general, as they seldom featured avoidable deadends, i.e., states with no weak plans which can be avoided on the way to the goal [Little and Thiebaux2007]. Avoidable deadends by themselves, however, present a challenge for incomplete, on-line planners like FF-replan but not for complete, off-line classical replanners such as PRP. The reason is that such planners rely on and require the ability to recover from bad choices; without the ability to “backtrack” in one way or another, these planners wouldn’t be complete. The computational challenge for complete replanners arises not from the presence of avoidable deadends but from the number of “backtracks” required to find a solution. In particular, FOND problems with avoidable deadends but a small number of weak plans impose no challenge to complete replanners. There is indeed no need to take non-determinism into account when reasoning about the “future” in complete replanners when the failure to do so translates into a small number of backtracks.

The computational cost of reasoning about the future while ignoring non-determinism can be estimated. For this, let ℓ(π, P) refer to the length of the shortest possible execution that reaches the goal of P from its initial state while following a policy π that solves P, and let L(P) be the minimum ℓ(π, P) over all such policies π. We refer to the weak plans that have length smaller than L(P) as misleading plans. A misleading plan is thus a weak plan that does not lead to a full policy, but which, due to its length, is likely to be found before weak plans that do. As a result, while non-classical approaches won't scale up to problems of large size, and flat methods won't scale up to problems with policies of exponential size, classical replanners will tend to fail on problems that have an exponential number of misleading plans, as the consideration of all such plans is the price that they may have to pay for ignoring non-determinism when reasoning about the future.
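In symbols (the notation here is ours):

```latex
\ell(\pi, P) = \min \{\, |\tau| \;:\; \tau \text{ is a goal-reaching execution of } \pi \text{ from } s_0 \,\}, \qquad
L(P) = \min_{\pi \,\text{solves}\, P} \ell(\pi, P),
```
```latex
\mathit{misleading}(P) = \{\, \sigma \;:\; \sigma \text{ is a weak plan for } P \text{ and } |\sigma| < L(P) \,\}.
```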

We refer to the ability to handle problems with an exponential number of misleading plans as robust non-determinism. Classical replanners like PRP are not bound to generate and discard each of the misleading weak plans one by one, given their ability to propagate and generalize deadends. Yet this component isn't spelled out in sufficient detail in the case of PRP, and from the observed behavior (see below), this is probably done in a heuristic and limited manner. Approaches that do not rely on classical planners but which make use of heuristics obtained from deterministic relaxations are likely to face similar limitations.

While few existing benchmark domains give rise to an exponential number of misleading plans, it is very simple to come up with variations of them that do. Consider for example a version of triangle tireworld containing two roads to the goal: a short one with spare tires everywhere except in the last three locations, and a longer one with spare tires everywhere. The car has capacity for a single spare, but unlike the original domain, spares can be loaded and unloaded. In the instances of this changed domain, the number of misleading plans grows exponentially with the length of the short road. These are weak plans where the agent takes the short road while moving spare tires around (for no good reason) on the way to the goal. We will see that in this revised domain, a planner like Gamer does much better than PRP. The same will be true for the proposed SAT approach.

SAT Approach to FOND Planning

We provide a SAT approach to FOND planning that is based on CNF encodings that are polynomial in the number of atoms and actions. It borrows elements from both the SAT approach to classical planning [Kautz and Selman1996] and previous SAT approaches to FOND and Goal POMDPs [Baral, Eiter, and Zhao2005, Chatterjee, Chmelik, and Davies2016] that have CNF encodings that are polynomial in the number of states and hence exponential in the number of atoms. Our approach, on the other hand, relies on compact, polynomial encodings, and may result in compact policies too, i.e., policy representations that are polynomial while reaching an exponential number of states.

While the SAT approach to classical planning relies on atoms and actions that are indexed by time points bounded by a given horizon, the proposed SAT approach to FOND planning relies on atoms and actions indexed by controller states or nodes n, whose number is bounded by a given parameter k that is increased until a solution is found. Each controller node n stands for a partial state, and there are two special nodes: the initial node n0 where executions start, and the goal node nG where executions end. The encoding only features deterministic actions, so that non-deterministic actions a = a1 | ⋯ | am are encoded through their deterministic siblings ai. The atoms ai(n) express that ai is one of the (deterministic) actions to be applied in the controller node n, and suitable constraints express that all and only the siblings of ai apply in n when ai applies. If a is a deterministic action in the problem, it has no siblings. The atoms (n, ai, n′) express that ai is applied in node n and the control passes to node n′. Below we will see how to get a strong cyclic policy from these atoms and how to execute it. For obtaining compact policies in this STRIPS non-deterministic setting, where goals and action preconditions are positive atoms (no negation), we propagate negative information forward and positive information backward. So, for example, the encoding does not force an atom p to be true in n′ when p is added by the action applied in n and the control passes to n′. Yet if there are executions from n′ where p is relevant and required, p will be forced to be true in n′. On the other hand, if p is false in n and not added by the action applied in n, p is forced to be false in n′.

Basic Encoding

We present first the atoms and clauses of the CNF formula C(P, k) for a FOND problem P and a positive integer parameter k that provides the bound on the number of controller nodes. Non-deterministic actions a = a1 | ⋯ | am in P are encoded through the siblings ai. For deterministic actions a in P, a is its only sibling. The atoms in C(P, k) are:

  • p(n): atom p is true in controller state n,

  • a(n): deterministic action a is applied in controller state n,

  • (n, a, n′): n′ is next after applying a in n,

  • reach(n): there is a path from n0 to n in the policy,

  • reachG(n, j): there is a path from n to nG with at most j steps, 0 ≤ j ≤ k.

The number of atoms is quadratic in the number of controller states; this is different from the SAT encoding of classical planning, where the number of atoms is linear in the horizon. The clauses in C(P, k) are given by the following formulas, where P is given by a set of atoms At, the set I of atoms true in the initial state s0, a set of actions Ops with preconditions and (possibly) non-deterministic effects, and the set of goal atoms G:

  1. ¬p(n0) if p ∉ I; negative info in n0

  2. p(nG) if p ∈ G; goal

  3. a(n) → p(n) if p ∈ Prec(a); preconditions

  4. a(n) → a′(n) if a and a′ are siblings

  5. a(n) → ¬a′(n) if a and a′ are not siblings, a ≠ a′

  6. a(n) → ∨_{n′} (n, a, n′); some next controller state

  7. ¬p(n) ∧ (n, a, n′) → ¬p(n′) if p ∉ Add(a); fwd prop. of negative info

  8. (n, a, n′) → ¬p(n′) if p ∈ Del(a); fwd prop. of negative info for deleted atoms

  9. reach(n0), and reach(n) ∧ (n, a, n′) → reach(n′); reachability from n0

  10. reachG(nG, j) for 0 ≤ j ≤ k, and ¬reachG(n, 0) for n ≠ nG; reach nG in at most j steps

  11. reachG(n, j) ↔ ∨_{a, n′} [(n, a, n′) ∧ reachG(n′, j−1)] for all 0 < j ≤ k and n ≠ nG

  12. reach(n) → reachG(n, k): if n0 reaches n, n reaches nG.

The controller nodes form a labeled graph where the edges (n, n′) are labeled with the deterministic actions a for which (n, a, n′) is true. A controller node n represents a partial state comprised of the atoms p that are true in n. Goals are true in nG, and preconditions of actions applied in n are true in n. Negative information flows forward along the edges, while positive information flows backward, so that multiple system states will be associated with the same controller node in an execution. Clause 9 captures reachability from n0, while clauses 10 and 11 capture reachability to nG in a bounded number of steps. The last clause states that any controller state reachable from n0 must reach the goal node nG. Formula 11 is key for strong cyclic planning: it says that the goal is reachable from n in at most j steps iff the goal is reachable in at most j−1 steps from one of its successors n′. For strong planning, we will change this formula so that the goal is reachable from n in at most j steps iff the goal is reachable in at most j−1 steps from all of its successors n′.

For computing policies for a FOND problem P, a SAT solver is called over the formulas C(P, k), where k stands for the number of controller nodes. Starting with a small value, this bound is increased until the formula is satisfiable. A solution policy can then be obtained from the satisfying truth assignment as indicated below. If the formula is unsatisfiable for a value of k that exceeds the number of states in P, then P has no strong cyclic solution.
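The outer loop can be sketched as follows. The helpers build_cnf (producing the clauses of C(P, k) as lists of signed integers together with a variable map) and extract_policy are hypothetical, and the python-sat bindings are used here only for illustration; the planner described in the paper calls MiniSAT.

```python
from pysat.solvers import Minisat22   # pip install python-sat

def solve_fond(P, build_cnf, extract_policy, k_max):
    """Grow the number of controller nodes k until C(P, k) is satisfiable."""
    for k in range(2, k_max + 1):             # starting value is a choice; n0 and nG are always present
        clauses, varmap = build_cnf(P, k)     # CNF of C(P, k)
        with Minisat22(bootstrap_with=clauses) as solver:
            if solver.solve():
                true_vars = {v for v in solver.get_model() if v > 0}
                return k, extract_policy(true_vars, varmap)
    return None                               # no strong cyclic policy found up to k_max
```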

Policy

A satisfying assignment σ of the formula C(P, k) defines a policy πσ that is a function from controller states n into actions of P. If the atom a(n) is true in σ, then πσ(n) = a if a is a deterministic action in P, and πσ(n) = b if a is a deterministic sibling of a non-deterministic action b in P.

For applying the compact policy πσ, however, it is necessary to keep track of the controller state. For this, it is convenient to consider a second policy determined by σ, this one being a standard mapping of states into actions over an extended FOND Pσ that denotes a FOND model Mσ. In this (cross-product) model, the states are pairs (n, s) of a controller state and a system state, the initial state is (n0, s0), the goal states are the pairs (nG, s) where s is a goal state of P, and the set of actions applicable in (n, s) is restricted to the singleton set containing the action πσ(n) of the compact policy above. The transition function results in the pairs (n′, s′) where s′ ∈ F(πσ(n), s) and n′ is the unique controller state for which a) the atom (n, a, n′) is true in σ when a = πσ(n) is deterministic, or b) the atom (n, ai, n′) is true in σ for the deterministic sibling ai of πσ(n) such that s′ is the unique successor f(ai, s) otherwise.

In the extended FOND Pσ there is just one policy, denoted π′, that over the reachable pairs (n, s) selects the only applicable action πσ(n). We say that the compact policy πσ is a strong cyclic (resp. strong) policy for P iff π′ is a strong cyclic (resp. strong) policy for Pσ.
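The following sketch simulates one execution of the compact policy in this cross-product fashion. It reuses the Action representation with deterministic siblings from the earlier sketch, and assumes a dictionary next_node mapping pairs (controller node, sibling name) to the next controller node, as read off the true (n, a, n′) atoms; all names are illustrative.

```python
import random
from typing import Callable, Dict, Hashable, Tuple

State = Hashable

def execute_compact_policy(s0: State,
                           n0: int,
                           policy_c: Dict[int, "Action"],
                           next_node: Dict[Tuple[int, str], int],
                           f: Callable[["Action", State], State],
                           is_goal: Callable[[State], bool],
                           max_steps: int = 1000):
    """Simulate one execution: keep a (controller node, system state) pair,
    apply the action selected at the node, sample one of its deterministic
    siblings as the outcome, and move to the controller node recorded for it."""
    n, s = n0, s0
    for _ in range(max_steps):
        if is_goal(s):
            return s                               # execution ends at the goal
        a = policy_c[n]                            # action of P selected at node n
        outcome = random.choice(a.siblings())      # the environment picks a sibling a_i
        s = f(outcome, s)                          # deterministic successor of that sibling
        n = next_node[(n, outcome.name)]           # node n' with (n, a_i, n') true
    return None                                    # step bound reached
```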

Properties

We show that the SAT approach is sound and complete for strong cyclic planning. We consider strong planning later.

Theorem 1 (Soundness).

If σ is a satisfying assignment for C(P, k), the compact policy πσ is a strong cyclic solution for P.

Proof sketch: Let σ be a satisfying assignment for C(P, k) and let π′ be the only policy for the extended FOND Pσ above. We need to show that π′ is a strong cyclic policy for Pσ. First, it can be shown inductively that if (n, s) is reachable by π′ and p(n) is true in σ, then p is true in s. This also implies that if the extended state (n, s) is reachable in Pσ and πσ(n) = a, then each precondition p of a is true in s. If the policy reaches a joint state (n, s), i.e., there is a path (n0, s0), …, (n, s), then reach(n) will be set to true by clause 9, thus forcing reachG(n, k) to be true (clause 12). In order to satisfy 11, there must then be a path n, n1, …, nm of at most k steps in the controller graph, with each (ni, ai, ni+1) true in σ, such that nm is nG and the system state reached through the corresponding outcomes is a goal state. This last fact follows from clause 2.

For showing completeness, if π is a strong cyclic policy for P, let us define the relevant atoms in a state s reachable by π from s0 as the atoms p true in s such that there is an execution s, s1, …, sm induced by π from s to a state sm where p is not deleted along the way and either 1) sm is a goal state and p is a goal atom, or 2) sm is not a goal state and p is a precondition of the action π(sm). For a reachable state s, let s[π] be the reachable partial state comprising the atoms in s that are relevant given π. We call these partial states the π-reduced states. Then, completeness can be expressed as follows:

Theorem 2 (Completeness).

Let π be a strong cyclic policy for P and let m represent the number of different π-reduced states. Then if k is at least m, there is an assignment σ that satisfies C(P, k) such that πσ is a compact strong cyclic policy for P.

Proof sketch: Consider the problem P′ whose states are the pairs (s, R) where R is the set of relevant atoms in s given the policy π; P′ represents the original problem but with the states augmented with the set of relevant atoms determined by π and s. The policy π is clearly a strong cyclic solution to P′, and it can be shown that a compact policy πR can be defined over the “reduced states” s[π] only, such that πR is also a strong cyclic solution to P′. For this, if R is the set of relevant atoms in s, it suffices to set πR(R) to π(s) for any such s. It then needs to be shown that there is a truth assignment σ that satisfies C(P, k) for k as in the theorem, where the controller nodes n are associated with the “reduced states” s[π], p(n) is true in σ iff p ∈ s[π], and (n, a, n′) is true in σ iff n and n′ are associated with reduced states s[π] and s′[π] for some states s and s′ with s′ ∈ F(b, s), where a is the deterministic sibling of b = π(s) that maps s into s′.

Finally, the policy πσ is compact in the following sense:

Theorem 3 (Compactness).

The size of the policy πσ for a truth assignment σ satisfying C(P, k) can be exponentially smaller than the number of states reachable by πσ.

Proof sketch: A single example suffices to show that the number of states reachable with the policy can be exponentially larger than the size of the policy. For this, consider a version of tireworld where there is a single road to the goal with locations l1, …, lm, where lm is the goal and there is a spare tire in each location. The number of states reachable by the solution policy is exponential in m, as the goal may be reached leaving behind any 0/1 distribution of spares over the locations. However, when the execution of the policy reaches a state s where the car is at location li, only the spare-tire atoms for the locations lj with j ≥ i are relevant in s, and these atoms are then all true, with the possible exception of the one for li. The spare-tire atoms for the locations lj with j < i, which may be true or false, are not relevant then. As a result, the number of reachable π-reduced states, unlike the number of reachable states, grows linearly rather than exponentially with m. Theorem 2 then implies that there must be an assignment σ that satisfies C(P, k) for a value of k that is linear in m, and hence a compact strong cyclic policy for P with a linear number of controller states that reaches a number of system states that is exponential in m.

Optimizations

We introduced simple extensions and modifications to the SAT encoding to make it more efficient and scalable while maintaining its formal properties. The actual encodings used in the experiments feature extra variables that are true iff (n, a, n′) is true for some action a. Also, since the number of (n, a, n′) variables grows quadratically with the number of controller nodes, we substitute them by variables (n, nm(a), n′), where nm(a) is the action name of a without the arguments. It is assumed that siblings ai and aj of non-deterministic actions get different action names from the parser. As a result, the conjunction a(n) ∧ (n, nm(a), n′) can be used in substitution of (n, a, n′). Similarly, add lists of actions tend to be short, resulting in a huge number of clauses of type 7 for capturing the forward propagation of negative information. These clauses are replaced by two clause schemas, 7′ and 7″, defined over the new variables, with the second handling actions that do not add the atom but have siblings that do. Finally, extra formulas are added for breaking symmetries that result from exchanges in the names (numbers) associated with different controller nodes, other than n0 and nG, that result in equivalent controllers.

Columns: SAT approach | PRP | MyND | Gamer
Domain (#inst)  #atoms  #acts  |  %solved  time  size  |  %solved  time  size  |  %solved  time  size  |  %solved  time
acrobatics (8) 67 623 50 572.5 17 100 20.1 127 100 4.5 126 87 5.2
beam walk (11) 746 2231 27 43.2 20 100 27.6 1488 100 126.6 1487 90 41.2
faults (20) 43 35 100 7.2 12 100 0.1 9 100 46.7 45 100 2.5
faults (20) 84 92 65 684.4 19 100 0.2 16 100 69.9 261 70 128.8
faults (15) 129 173 - - 100 0.2 20 100 59.3 1258 20 152.9
first resp (30) 28 41 63 129.6 13 66 0.1 13 63 71.8 11 50 3.5
first resp (40) 99 436 57 194.2 13 82 1.6 17 77 240.2 18 17 1.2
first resp (30) 172 1333 30 174.4 12 76 0.3 24 36 433.4 17 3 1.0
t. tireworld (20) 669 1406 15 149.6 18 100 5.7 125 35 136.3 9382 30 1.7
t. tireworld (20) 4129 9006 - - 100 374.6 365 - - -
zenotravel (15) 377 8424 33 243.9 15 100 5.1 54 - - 6 0.0
elevators (10) 64 58 70 28.8 18 100 0.2 29 100 71.1 28 100 4.4
elevators (5) 123 116 - - 100 1.1 81 60 2221.9 91 20 464.9
blocks (15) 78 1350 66 27.6 13 100 0.2 15 93 33.2 16 66 34.8
blocks (15) 238 8116 - - 100 0.9 33 53 927.3 40 -
tireworld (15) 63 304 80 6.5 7 80 0.1 10 80 11.6 7 73 126.5
earth obs (20) 46 87 40 697.1 16 100 0.4 62 100 192.4 73 -
earth obs (20) 110 224 5 3510 38 100 1.8 234 50 459.4 138 -
miner (30) 587 1209 100 160.6 21 26 556.5 19 - - -
miner (21) 1410 2920 100 1102 25 6 721.3 25 - - -
ttire spiky (11) 256 484 90 911.0 26 - - - - 18 115.8
doors (15) 48 69 93 597.0 20 80 3.2 22 73 288.3 1486 100 4.2
islands (30) 100 333 100 8.1 8 76 167 5 30 127.1 4 43 1.9
islands (30) 388 1588 96 496.4 12 26 256.7 11 13 85.3 10 16 2.8
ttire truck (24) 61 107 100 6 14 37 185.1 18 33 73.8 12 -
ttire truck (25) 80 150 100 96.8 19 32 500 27 16 860.5 17 -
ttire truck (25) 101 198 88 193.8 19 16 384.5 21 8 24 17 -
Table 1: Results for strong cyclic planning. Each line contains the domain name with the number of instances in parenthesis and the average number of atoms and actions per instance, followed by the percentage of instances solved, the average time in seconds, and the average policy size for each of the four planners. Domains that involve many instances of very different sizes are split into multiple lines. New domains appear in the bottom part. Coverage is expressed as a percentage because different lines involve different numbers of instances. Best coverages are in bold. When coverage is 0, the remaining entries are marked with a dash, the cause of failure being a time out, a memory out, or a parsing error.

Experimental Results

We have compared our SAT-based FOND solver with some of the best existing planners, namely PRP, MyND, and Gamer. The version of PRP is the newest (8/2017), from https://bitbucket.org/haz/deadend-and-strengthening; MyND was obtained from https://bitbucket.org/robertmattmueller/mynd, while we managed to get Gamer only from the authors of MyND. The four planners were run on an AMD Opteron 6300 @ 2.4 GHz, with time and memory limits of 1h and 4 GB (10 GB for Gamer). The SAT solver used was MiniSAT [Een and Sorensson2004]. We used the FOND domains and instances available from previous publications, and added new domains of our own, which we explain briefly below.

Tireworld Spiky: A modification of triangle tireworld. The main difference is that the agent (car) can drop spare tires, not just pick them up, while holding at most one spare at a time. In addition, not all roads can produce a flat tire (i.e., there are normal and spiky roads). In these instances there are two roads to the goal: a shorter one with two consecutive spiky segments and not enough spares, and a longer one with one spiky segment only. At the first location of the short road there are several spare tires. The misleading plans take the short road to the goal, moving spares around with no purpose.

Tireworld Truck: A modification of Tireworld Spiky where there are a few spiky segments. In this version, all the spares are in the initial location, and there is a truck there too that can load and unload tires and is not affected by spiky roads. The truck and the car cannot be in the same location except for the initial location. The solution is for the truck to pick up the spares that the car will need and place them at the proper places, returning to the initial location before the car leaves.

Islands. Two grid-like islands of the same size are connected by a bridge. Initially the agent is on island 1 and the goal is to reach a specific location on island 2. There are two ways of doing this: the short way is to swim from island 1 to island 2, and the long way is to go to the bridge and cross it. Swimming is non-deterministic as the agent may drown. Crossing the bridge is possible when the bridge is free; otherwise the animals that block it have to be moved away first. The misleading plans are those where the agent moves some animals away and then swims to the other island.

Doors: A row of rooms, one after the other, connected through doors. The agent has to move from the first room to the last. Every time the agent enters a room, the in and out doors of the room (except for the first and last rooms) open or close non-deterministically. There are actions for entering a room when the door is open and when the door is closed, except for the last room, which requires a key when its door is closed. The key is initially in the first room. The agent cannot move backwards. The solution is to pick up the key first and then head to the end room. A version of this domain was considered in [Cimatti et al.2003].

Miner. An agent has to retrieve a number of items that can be found in two regions. In each region, an item can be dug out by moving stones. At the places that are closer to the agent, these operations are not fully safe. The misleading plans are those where the items are sought at the close but unsafe sites, possibly moving stones around.

Columns: SAT approach | MyND | Gamer
Domain (#inst)  #atoms  #acts  |  %solved  time  size  |  %solved  time  size  |  %solved  time
zenotravel (15) 377 8424 33 130.7 15 - - 6 0.0
elevators (10) 64 58 70 14.5 18 80 9.3 19 80 6.5
elevators (5) 123 116 - - 20 34.0 66 -
miner (30) 587 1209 100 90.9 21 - - -
miner (21) 1410 2920 100 446.5 25 - - -
tireworld-spiky (11) 256 484 90 225.7 26 63 1149.5 27 18 99.8
doors (15) 48 69 86 160.6 20 73 166.4 1486 100 4.3
tireworld (15) 63 304 80 4.1 7 20 8.0 1 20 0.0
islands (30) 100 333 100 7.7 8 96 97.7 4 43 2.4
islands (30) 388 1588 100 334.4 12 26 338.8 10 20 5.1
triangle-tireworld (20) 669 1406 15 112.3 18 15 90.3 2182 30 1.7
Table 2: Results for strong planning over the domains from Table 1 where at least one of the planners found a strong solution.

The results for the four planners over the existing and new domains are shown in Table 1. Domains that involve many instances of widely different sizes are split into multiple lines, and coverage is expressed as a percentage since different lines involve different numbers of instances. The best coverages for each line are shown in bold. Overall, PRP does best, yet in order to understand the strengths and limitations of the various planners, it is useful to consider the problem size, the policy size, and the type of non-determinism, and also whether the problems are old or new. Indeed, it turns out that PRP is best on the existing domains, most of which predate PRP, while for the new domains the SAT approach is best. PRP can deal with very large problems (as measured by the number of atoms and actions), and can also produce large controllers, with hundreds and even thousands of partial or full states. To some extent, MyND is also quite robust to problem and controller size, but does not achieve the same results. On the other hand, the SAT approach has difficulties scaling up to problems that require controllers with more than 30 states, in particular if the problem size is large too. In classical planning, the SAT approach has a similar limitation with long sequential plans. In our SAT approach to FOND, this limitation is compounded by the fact that the CNF encodings are quadratic in the number of controller states. On the other hand, the table shows that the SAT approach is the most robust for dealing with problems with many misleading plans, as in several of the new domains, where not taking non-determinism into account when reasoning about the future makes the “optimistic” search for plans computationally infeasible.

Discussion

Overall, PRP scales up best to problem size and even policy size, but it does not scale up as well as the SAT approach on forms of non-determinism that involve many misleading plans. Clearly, PRP has excellent coverage on a domain like triangle tireworld that also involves an exponential number of misleading plans, but this is achieved by methods for identifying, generalizing, and propagating deadends that are not general. The SAT approach does better in handling richer forms of non-determinism because of its ability to reason in parallel about the different, branching futures arising from non-deterministic actions. The challenge for the SAT approach is to scale up more robustly to problem size and, in particular, to controller size. For classical planning, similar challenges have been addressed quite successfully through a number of techniques, including better encodings, different forms of parallelism, planning-specific variable selection heuristics, and alternative ways of increasing the time horizon [Rintanen2012]. For SAT approaches to FOND, this is all to be explored, and additional techniques like incremental SAT solving should be explored as well [Eén and Sörensson2003, Gocht and Balyo2017].

Strong Planning

The SAT encoding above is for computing strong cyclic policies. For computing strong policies instead, formula 11 in the encoding has to be replaced by 11′, which requires that the goal be reachable in at most j−1 steps from all successors n′ of n; that is, reachG(n, j) may hold only if reachG(n′, j−1) holds for every n′ with (n, a, n′) true. Table 2 shows the figures for the resulting SAT-based strong FOND planner in comparison with MyND and Gamer used in strong planning mode. The domains are those from Table 1 where at least one of the strong planners found a solution. In this case, the results over the existing domains are mixed, with the SAT approach doing best in one of the domains, and MyND and Gamer doing best on the other two. The SAT approach is best on the new domains, with the exception of Doors, where Gamer does better.
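Schematically, and following the reconstruction of the encoding above (the notation is ours), the defining directions of the two variants differ only in how they constrain the successors of a node:

```latex
\text{(11)}\;\; \mathit{reach}_G(n,j) \rightarrow \bigvee_{a,\,n'} \big[\,(n,a,n') \wedge \mathit{reach}_G(n',j-1)\,\big]
\qquad \text{(strong cyclic: some successor suffices)}
```
```latex
\text{(11$'$)}\;\; \mathit{reach}_G(n,j) \rightarrow \bigwedge_{a,\,n'} \big[\,(n,a,n') \rightarrow \mathit{reach}_G(n',j-1)\,\big]
\qquad \text{(strong: all successors required)}
```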

Strong and Cyclic Planning Combined

A feature of the SAT approach that is not shared by classical replanners, OBDD-based planners, or explicit AND/OR search approaches like MyND and Grendel is that in SAT it is very simple to reason with a combination of actions that can be assumed to be fair and actions that cannot, leading to a form of planning that is neither strong nor strong cyclic. We call this dual FOND planning. Related issues are discussed in [Camacho and McIlraith2016].

Dual FOND planning is planning with a FOND problem where some of the actions are tagged as fair and the others as unfair. For example, consider a problem featuring a planning agent and an adversary, one in front of the other in the middle row of a grid with two columns: the agent on the left, the adversary on the right, and the agent must reach a position on the right. The agent can move up and down non-deterministically, moving 0, 1, or 2 cells without ever leaving the grid; he can also wait, or he can move to the opposing cell on the right if that position is empty. Every turn, however, the adversary moves 0 or 1 cells up or down. The solution to the problem is for the agent to keep moving up and down until he is at a vertical distance of 2 from the opponent, and then to move right. This strategy is not a strong or a strong cyclic policy, but a dual policy.

A state trajectory τ is fair for a dual FOND problem P and a policy π when infinite occurrences of a state s in τ with a = π(s), where a is a fair action, imply infinite occurrences in τ of the transition from s to s′, for each successor s′ ∈ F(a, s). A solution to a dual FOND problem is a policy π such that all the fair trajectories induced by π are goal reaching. Strong cyclic and strong planning are special cases of dual FOND planning where all or none of the actions are fair. A sound and complete SAT formulation of dual FOND planning is obtained by introducing atoms fair(n) that are true when the action chosen in node n is fair:

  13. a(n) → fair(n) if a is (a sibling of) a fair action,

  14. a(n) → ¬fair(n) if a is (a sibling of) an unfair action,

and by replacing formulas 11 and 11′ by:

  11″. reachG(n, j) may hold only if either fair(n) is true and the condition on the successors of n imposed by formula 11 holds, or fair(n) is false and the condition imposed by formula 11′ holds,

where 11 and 11′ are the formulas above for strong cyclic and strong planning. The above encoding captures dual FOND planning in the same way that the first encoding captures strong cyclic planning.

We have also run some experiments for dual planning, for the example above where the two agents move over a grid. For the values of k tried, the resulting dual policy is the one mentioned above, where the agent keeps moving up and down until leaving the adversary behind. Notice that strong, strong cyclic, and dual FOND planning result from simple changes in some of the clauses. This flexibility is a strength of the SAT approach that is not available in other approaches, which require different algorithms in each case.

Conclusions

We have introduced the first SAT formulation for FOND planning that is compact and can produce compact policies. Small changes in the formulation account for strong planning, strong cyclic planning, and a combined form of the two that we call dual FOND planning, where some actions are assumed to be fair and the others unfair. From a computational point of view, the SAT approach performs well on problems that are not too large and that do not require large controllers, and it is not affected by the presence of a large number of misleading plans. Classical replanners like PRP and explicit AND/OR search planners like MyND can scale up to larger problems and problems with larger controllers, respectively, but do not appear to be as robust to non-determinism.

Acknowledgments

We thank Miquel Ramírez and Jussi Rintanen for useful exchanges, Chris Muise for answering questions about PRP, and Robert Mattmüller for the code of both MyND and Gamer. H. Geffner is partially funded by grant TIN2015-67959-P from MINECO, Spain.

References

  • [Baral, Eiter, and Zhao2005] Baral, C.; Eiter, T.; and Zhao, J. 2005.

    Using SAT and logic programming to design polynomial-time algorithms for planning in non-deterministic domains.

    In Proc. AAAI.
  • [Bertsekas and Tsitsiklis1996] Bertsekas, D., and Tsitsiklis, J. 1996. Neuro-Dynamic Programming. Athena Scientific.
  • [Bonet and Geffner2015] Bonet, B., and Geffner, H. 2015. Policies that generalize: Solving many planning problems with the same policy. In Proc. IJCAI, 2798–2804.
  • [Bonet et al.2017] Bonet, B.; De Giacomo, G.; Geffner, H.; and Rubin, S. 2017. Generalized planning: Non-deterministic abstractions and trajectory constraints. In Proc. IJCAI.
  • [Camacho and McIlraith2016] Camacho, A., and McIlraith, S. 2016. Strong-cyclic planning when fairness is not a valid assumption. In Proc. IJCAI Workshop on Knowledge-based techniques for Problem Solving.
  • [Camacho et al.2017] Camacho, A.; Triantafillou, E.; Muise, C.; Baier, J. A.; and McIlraith, S. 2017. Non-deterministic planning with temporally extended goals: LTL over finite and infinite traces. In Proc. AAAI, 3716–3724.
  • [Chatterjee, Chmelik, and Davies2016] Chatterjee, K.; Chmelik, M.; and Davies, J. 2016. A symbolic SAT-based algorithm for almost-sure reachability with small strategies in POMDPs. In Proc. AAAI, 3225–3232.
  • [Cimatti et al.2003] Cimatti, A.; Pistore, M.; Roveri, M.; and Traverso, P. 2003. Weak, strong, and strong cyclic planning via symbolic model checking. Artificial Intelligence 147(1-2):35–84.
  • [Daniele, Traverso, and Vardi1999] Daniele, M.; Traverso, P.; and Vardi, M. Y. 1999. Strong cyclic planning revisited. In Recent Advances in AI Planning, 35–48. Springer.
  • [Eén and Sörensson2003] Eén, N., and Sörensson, N. 2003. Temporal induction by incremental SAT solving. Electronic Notes in Theoretical Computer Science 89(4):543–560.
  • [Een and Sorensson2004] Een, N., and Sorensson, N. 2004. An extensible SAT-solver. Lecture notes in computer science 2919:502–518.
  • [Fu et al.2011] Fu, J.; Ng, V.; Bastani, F.; and Yen, I. 2011. Simple and fast strong cyclic planning for fully-observable nondeterministic planning problems. In Proc. IJCAI.
  • [Geffner and Bonet2013] Geffner, H., and Bonet, B. 2013. A Concise Introduction to Models and Methods for Automated Planning. Morgan & Claypool Publishers.
  • [Gocht and Balyo2017] Gocht, S., and Balyo, T. 2017. Accelerating SAT based planning with incremental SAT solving. In Proc. ICAPS.
  • [Hu and De Giacomo2011] Hu, Y., and De Giacomo, G. 2011. Generalized planning: Synthesizing plans that work for multiple environments. In Proc. IJCAI, 918–923.
  • [Kautz and Selman1996] Kautz, H., and Selman, B. 1996. Pushing the envelope: Planning, propositional logic, and stochastic search. In Proc. AAAI, 1194–1201.
  • [Kissmann and Edelkamp2009] Kissmann, P., and Edelkamp, S. 2009. Solving fully-observable non-deterministic planning problems via translation into a general game. KI 2009: Advances in AI 1–8.
  • [Kuter et al.2008] Kuter, U.; Nau, D.; Reisner, E.; and Goldman, R. 2008. Using classical planners to solve nondeterministic planning problems. In Proc. ICAPS, 190–197.
  • [Little and Thiebaux2007] Little, I., and Thiebaux, S. 2007. Probabilistic planning vs. replanning. In Proc. ICAPS Workshop on IPC: Past, Present and Future.
  • [Mattmüller et al.2010] Mattmüller, R.; Ortlieb, M.; Helmert, M.; and Bercher, P. 2010. Pattern database heuristics for fully observable nondeterministic planning. In Proc. ICAPS, 105–112.
  • [Muise, McIlraith, and Beck2012] Muise, C.; McIlraith, S.; and Beck, J. C. 2012. Improved non-deterministic planning by exploiting state relevance. In Proc. ICAPS.
  • [Patrizi, Lipovetzky, and Geffner2013] Patrizi, F.; Lipovetzky, N.; and Geffner, H. 2013. Fair LTL synthesis for non-deterministic systems using strong cyclic planners. In Proc. IJCAI, 2343–2349.
  • [Ramirez and Sardina2014] Ramirez, M., and Sardina, S. 2014. Directed fixed-point regression-based planning for non-deterministic domains. In Proc. ICAPS.
  • [Rintanen2008] Rintanen, J. 2008. Regression for classical and nondeterministic planning. In Proc. ECAI, 568–572.
  • [Rintanen2012] Rintanen, J. 2012. Planning as satisfiability: Heuristics. Artificial Intelligence 193:45–86.
  • [Srivastava et al.2011] Srivastava, S.; Zilberstein, S.; Immerman, N.; and Geffner, H. 2011. Qualitative numeric planning. In Proc. AAAI.
  • [Srivastava, Immerman, and Zilberstein2011] Srivastava, S.; Immerman, N.; and Zilberstein, S. 2011. A new representation and associated algorithms for generalized planning. Artificial Intelligence 175(2):615–647.
  • [Yoon, Fern, and Givan2007] Yoon, S.; Fern, A.; and Givan, R. 2007. FF-replan: A baseline for probabilistic planning. In Proc. ICAPS-07, 352–359.