# Graph Planning with Expected Finite Horizon

Graph planning gives rise to fundamental algorithmic questions such as shortest path, traveling salesman problem, etc. A classical problem in discrete planning is to consider a weighted graph and construct a path that maximizes the sum of weights for a given time horizon T. However, in many scenarios, the time horizon is not fixed, but the stopping time is chosen according to some distribution such that the expected stopping time is T. If the stopping-time distribution is not known, then to ensure robustness, the distribution is chosen by an adversary, to represent the worst-case scenario. A stationary plan always chooses the same outgoing edge at each vertex. For fixed horizon or fixed stopping-time distribution, stationary plans are not sufficient for optimality. Quite surprisingly, we show that when an adversary chooses the stopping-time distribution with expected stopping time T, stationary plans are sufficient. While computing optimal stationary plans for fixed horizon is NP-complete, we show that computing optimal stationary plans under adversarial stopping-time distribution can be achieved in polynomial time. Consequently, our polynomial-time algorithm for adversarial stopping time also computes an optimal plan among all possible plans.


## 1 Introduction

Graph search algorithms. Reasoning about graphs is a fundamental problem in computer science, which is studied widely in logic (such as to describe graph properties with logic [GradelBook, CourcelleBook]) and in artificial intelligence [AIBook, LaValle]. Graph search/planning algorithms are at the heart of such analysis, and give rise to some of the most important algorithmic problems in computer science, such as shortest path, travelling salesman problem (TSP), etc.

Finite-horizon planning. A classical problem in graph planning is the finite-horizon planning problem [LaValle], where the input is a directed graph with weights assigned to every edge and a time horizon $T$. The weight of an edge represents the reward/cost of the edge. A plan is an infinite path, and for finite horizon $T$ the utility of the plan is the sum of the weights of the first $T$ edges. An optimal plan maximizes the utility. The computational problem for finite-horizon planning is to compute the optimal utility and an optimal plan. The finite-horizon planning problem has many applications: the qualitative version of the problem corresponds to finite-horizon reachability, which plays an important role in logic and verification (e.g., bounded until in RTCTL, and bounded model-checking [EMSS92, BCCSZ03]); and the more general quantitative problem of optimizing the sum of rewards has applications in artificial intelligence and robotics [AIBook, Chapter 10, Chapter 25], and in control theory and game theory [FV97, Chapter 2.2][OR94, Chapter 6].

Solutions for finite-horizon planning. For finite-horizon planning the classical solution approach is dynamic programming (or Bellman equations), which corresponds to backward induction [Howard, FV97]. This approach works not only for graphs, but also for other models (e.g., Markov decision processes [PT87]). A stationary plan is a path where the same choice of edge is always made at each vertex. For finite-horizon planning, stationary plans are not sufficient for optimality; in general, optimal plans are quite involved, and when represented as transducers they require storage at least proportional to $T$ (see Example 1). Since optimal plans are involved in general, a related computational question is to compute effective simple plans, i.e., plans that are optimal among stationary plans.

Expected finite-horizon planning. A natural variant of the finite-horizon planning problem is to consider an expected time horizon instead of a fixed time horizon. In the finite-horizon problem the allowed stopping time of the planning problem is a Dirac distribution at time $T$. In the expected finite-horizon problem the expected stopping time is $T$. A well-known example where the fixed finite-horizon and the expected finite-horizon problems are fundamentally different is playing Prisoner’s Dilemma: if the time horizon is fixed, then defection is the only dominant strategy, whereas for the expected finite-horizon problem cooperation is feasible [Nowak, Chapter 5]. Another classical and very well-studied example is the notion of discounting, where at each time step the stopping probability is $1 - \lambda$ for a discount factor $\lambda \in (0,1)$, and this corresponds to the case that the expected stopping time is $1/(1-\lambda)$ [FV97].
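The correspondence between discounting and expected stopping time can be checked numerically. The sketch below assumes the convention that, with discount factor `lam`, the process stops at each step $t \ge 1$ with probability $1-\lambda$ given that it has not stopped before, so that the stopping time is geometric with $P(\text{stop at } t) = (1-\lambda)\lambda^{t-1}$; the function name is hypothetical.

```python
from fractions import Fraction


def truncated_mean_stopping_time(lam: Fraction, horizon: int) -> Fraction:
    """Sum_{t=1}^{horizon} t * P(stop at t), with P(stop at t) = (1 - lam) * lam**(t - 1)."""
    total = Fraction(0)
    for t in range(1, horizon + 1):
        total += t * (1 - lam) * lam ** (t - 1)
    return total


lam = Fraction(9, 10)
closed_form = 1 / (1 - lam)  # expected stopping time 1 / (1 - lam), here 10
approx = truncated_mean_stopping_time(lam, 400)
print(closed_form, float(approx))  # the truncated sum approaches 10 from below
```

Exact rationals are used so that the truncated sum stays strictly below the closed form, which makes the convergence visible without floating-point noise.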

Specified vs. adversarial distribution. For the expected finite-horizon problem there are two variants: (a) specified distribution: the stopping-time distribution is specified; and (b) adversarial distribution: the stopping-time distribution is unknown and decided by an adversary. The expected finite-horizon problem with adversarial distribution represents the robust version of the planning problem, where the distribution is unknown and the adversary represents the worst-case scenario. Thus this problem is the robust extension of classical finite-horizon planning, which has a wide range of applications.

Results. In this work we consider the expected finite-horizon planning problems in graphs. To the best of our knowledge this problem has not been studied in the literature.

• Our first, simple result is that for the specified distribution problem, the optimal value can be computed in polynomial time (Theorem 1). However, since the specified distribution problem generalizes the fixed finite-horizon problem, an explicit transducer describing an optimal plan can require size proportional to $T$. Hence the output complexity is not polynomial in general. Second, we consider the decision problem of whether there is a stationary plan that ensures a given utility. We show that this problem is NP-complete (Theorem 2).

Our most interesting and surprising results are for the adversarial distribution problem, which we describe below:

• We show that stationary plans suffice for optimality (Theorem 3). This result is surprising and counter-intuitive: both in the classical finite-horizon problem and in the specified distribution problem the adversary has no choice, and in both cases stationary plans do not suffice for optimality. Yet in the presence of an adversary, the simpler class of stationary plans suffices for optimality.

• For the expected finite-horizon problem with adversarial distribution, the backward induction approach does not work, as there is no a-priori bound on the stopping time. We develop new algorithmic ideas to show that the optimal value can still be computed in polynomial time (Theorem 4). Moreover, our algorithm also outputs an optimal stationary plan (which is optimal among all plans as well) in polynomial time, whereas for fixed finite horizon, even deciding whether some stationary plan ensures a given utility is NP-complete.

Our results are summarized in Table 1 and are relevant for synthesis of robust plans for expected finite-horizon planning.

## 2 Preliminaries

Weighted graphs. A weighted graph $G = \langle V, E, w \rangle$ consists of a finite set $V$ of vertices, a set $E \subseteq V \times V$ of edges, and a function $w$ that assigns a weight to each edge of the graph.

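Plans and utilities. A plan is an infinite path $\rho = e_0 e_1 \dots$ in $G$ from a vertex $v_0$, that is, a sequence of edges such that $e_i = (v_i, v_{i+1}) \in E$ for all $i \ge 0$. A path induces the sequence of utilities $u = u_0, u_1, \dots$ where $u_t = \sum_{i=0}^{t-1} w(e_i)$ for all $t \ge 0$ (in particular, $u_0 = 0$). We denote by $U_G$ the set of all sequences of utilities induced by the paths of $G$. For finite paths (i.e., finite prefixes of paths), we also refer to their initial vertex, their last vertex, and their length (number of edges).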

Plans as transducers. A plan is described by a transducer (Mealy machine or Moore machine [HU79]) that, given a prefix of the path (i.e., a finite sequence of edges), chooses the next edge. A stationary plan is a path where the same choice of edge is always made at each vertex. A stationary plan as a Mealy machine has one state, and as a Moore machine has at most $|V|$ states. Given a graph $G$, we denote by $S_G$ the set of all sequences of utilities induced by stationary plans in $G$.

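Distributions and stopping times. A sub-distribution is a function $\delta: \mathbb{N} \to [0,1]$ such that $\sum_{t \in \mathbb{N}} \delta(t) \le 1$. The value $p_\delta = \sum_{t \in \mathbb{N}} \delta(t)$ is the probability mass of $\delta$; note that $p_\delta \le 1$. The support of $\delta$ is $\mathrm{Supp}(\delta) = \{t \in \mathbb{N} \mid \delta(t) > 0\}$, and we say that $\delta$ is the sum of two sub-distributions $\delta_1$ and $\delta_2$, written $\delta = \delta_1 + \delta_2$, if $\delta(t) = \delta_1(t) + \delta_2(t)$ for all $t \in \mathbb{N}$. A stopping-time distribution (or simply, a distribution) is a sub-distribution with probability mass equal to $1$. We denote by $\Delta$ the set of all stopping-time distributions, and by $\Delta^{\upuparrows}$ the set of all distributions $\delta \in \Delta$ with $|\mathrm{Supp}(\delta)| \le 2$, called the bi-Dirac distributions.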

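Expected utility and expected time. The expected utility of a sequence of utilities $u = (u_t)_{t \in \mathbb{N}}$ under a sub-distribution $\delta$ is $E_\delta(u) = \sum_{t \in \mathbb{N}} \delta(t) \cdot u_t$. In particular, the expected utility of the identity sequence $(t)_{t \in \mathbb{N}} = 0, 1, 2, \dots$ is called the expected time, denoted by $E_\delta$.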
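For a finite-support distribution these definitions can be computed directly. The following Python sketch assumes the path is given by the weights of its first edges (enough edges to cover the support of the distribution), and represents a distribution as a dictionary from times to exact rational probabilities; all names are hypothetical.

```python
from fractions import Fraction
from typing import Dict, List


def utilities(weights: List[int]) -> List[int]:
    """u_0 = 0 and u_t = w(e_0) + ... + w(e_{t-1}): prefix sums of the edge weights."""
    u = [0]
    for w in weights:
        u.append(u[-1] + w)
    return u


def expected_utility(u: List[int], delta: Dict[int, Fraction]) -> Fraction:
    """E_delta(u) = sum_t delta(t) * u_t; requires len(u) > max(delta)."""
    return sum((p * u[t] for t, p in delta.items()), Fraction(0))


def expected_time(delta: Dict[int, Fraction]) -> Fraction:
    """E_delta: the expected utility of the identity sequence u_t = t."""
    return sum((p * t for t, p in delta.items()), Fraction(0))


u = utilities([3, -1, 2, 0, 5])                 # [0, 3, 2, 4, 4, 9]
delta = {1: Fraction(1, 2), 5: Fraction(1, 2)}  # a bi-Dirac distribution
print(expected_time(delta), expected_utility(u, delta))  # 3 6
```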

## 3 Expected Finite-horizon: Specified Distribution

Given a stopping-time distribution with finite support, we show that the optimal expected utility can be computed in polynomial time. This result is straightforward.

###### Theorem 1.

Let $G$ be a weighted graph. Given a stopping-time distribution $\delta$ with finite support, with all numbers encoded in binary, the optimal expected utility $\sup_{u \in U_G} E_\delta(u)$ can be computed in polynomial time.

A special case of the problem in Theorem 1 is the fixed-length optimal path problem, which is to find an optimal path (that maximizes the total utility) of fixed length $T$, corresponding to the distribution $\delta$ with $\delta(T) = 1$. A pseudo-polynomial time solution is known for this problem, based on a value-iteration algorithm [LaValle, Section 2.3]. The algorithm runs in time $O(T \cdot |E|)$, which is exponential in the input size when $T$ is encoded in binary, and relies on the following recursive relation, where $A_t(v)$ is the optimal value among the paths of length $t$ that start in $v$:

$$A_t(v) = \max_{(v,v') \in E} \big( w(v,v') + A_{t-1}(v') \big).$$

A polynomial algorithm, running in time $O(|V|^3 \cdot \log T)$ (counting arithmetic operations), to obtain $A_T(v)$ is to compute, in the max-plus algebra (where the matrix product is defined by $(A \otimes B)(i,j) = \max_k A(i,k) + B(k,j)$), the $T$-th power of the transition matrix $M$ of the weighted graph, where $M(v,v') = w(v,v')$ if $(v,v') \in E$, and $M(v,v') = -\infty$ otherwise. The power $M^T$ can be computed with $O(\log T)$ matrix products by successive squaring of $M$ and combining the squares according to the binary representation of $T$. This gives a polynomial algorithm to compute $A_T(v)$, since it is the largest element in the row of $M^T$ corresponding to $v$ (note that the finite entries of $M^T$ are bounded in absolute value by $T \cdot W$, where $W$ is the largest absolute weight in the graph, so their bit size is polynomial). We now present the proof of Theorem 1.
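The repeated-squaring computation in the max-plus algebra can be sketched in a few lines of Python. This is a minimal illustration, assuming the graph is given as a dense matrix with `float("-inf")` for missing edges; the helper names and the example graph are hypothetical.

```python
from typing import List

NEG_INF = float("-inf")  # encodes a missing edge
Matrix = List[List[float]]


def maxplus_mul(a: Matrix, b: Matrix) -> Matrix:
    """Max-plus product: (a ⊗ b)(i, j) = max_k a(i, k) + b(k, j)."""
    n = len(a)
    return [[max(a[i][k] + b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def maxplus_power(m: Matrix, t: int) -> Matrix:
    """m^t (t >= 0) in the max-plus algebra, using O(log t) products by repeated squaring."""
    n = len(m)
    # Max-plus identity matrix: 0 on the diagonal, -inf elsewhere.
    result = [[0.0 if i == j else NEG_INF for j in range(n)] for i in range(n)]
    base = m
    while t > 0:
        if t & 1:  # this bit of t is set: multiply the current square in
            result = maxplus_mul(result, base)
        base = maxplus_mul(base, base)
        t >>= 1
    return result


def best_fixed_length(m: Matrix, t: int, v: int) -> float:
    """A_t(v): best total weight of a path of length t from v (max of row v of m^t)."""
    return max(maxplus_power(m, t)[v])


# Example graph: 0->1 (2), 1->0 (-1), 0->2 (0), 2->2 (1), 1->1 (0).
M = [
    [NEG_INF, 2, 0],
    [-1, 0, NEG_INF],
    [NEG_INF, NEG_INF, 1],
]
print([best_fixed_length(M, t, 0) for t in range(6)])  # A_0(0), ..., A_5(0)
```

Powers of a single matrix commute, so the order in which the squares are multiplied into `result` does not matter.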

###### Proof of Theorem 1.

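Given the weighted graph $G$ and the distribution $\delta$, let $\mathrm{Supp}(\delta) \setminus \{0\} = \{t_1 < t_2 < \dots < t_k\}$, let $t_0 = 0$, and let $p_j = \delta(t_j)$. We reduce the problem to finding an optimal path of length $k$ in a layered graph where the transitions between layer $i$ and layer $i+1$ mimic sequences of $t_{i+1} - t_i$ transitions in the original graph. For $t \ge 1$, define the $t$-th power of a matrix $M$ (in the max-plus algebra) recursively by $M^t = M^{t-1} \otimes M$, where $M^1 = M$. Let $M$ be the transition matrix of the original weighted graph. We construct the graph $G' = \langle V', E', w' \rangle$ where

• $V' = V \times \{0, 1, \dots, k\}$,

• $E' = \{((v,i),(v',i+1)) \mid 0 \le i < k \text{ and } M^{t_{i+1}-t_i}(v,v') > -\infty\}$, and

• $w'((v,i),(v',i+1)) = \big(\sum_{j=i+1}^{k} p_j\big) \cdot M^{t_{i+1}-t_i}(v,v')$.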

The optimal expected utility is the same as the optimal fixed-length path value for length $k$ in $G'$. The correctness of this reduction relies on the fact that the probability of not stopping before time $t_{i+1}$ is $\sum_{j=i+1}^{k} p_j$, and that the largest utility of a path of length $t_{i+1} - t_i$ from $v$ to $v'$ is $M^{t_{i+1}-t_i}(v,v')$. Given a path $(v_0,0)(v_1,1)\dots(v_k,k)$ of length $k$ in $G'$ (that induces a sequence of weights $w'_0, \dots, w'_{k-1}$), we can construct a path of length $t_k$ in $G$ (visiting $v_i$ at time $t_i$ and inducing a sequence of utilities $u_0, u_1, \dots$), and we show that the value of the path of length $k$ in $G'$ is the same as the expected utility of the corresponding path in $G$ with stopping time distributed according to $\delta$, as follows (where $u_{t_0} = u_0 = 0$):

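$$\sum_{i=0}^{k-1} w'_i = \sum_{i=0}^{k-1} \Big(\sum_{j=i+1}^{k} p_j\Big) \cdot (u_{t_{i+1}} - u_{t_i}) = \sum_{j=1}^{k} p_j \cdot \sum_{i=0}^{j-1} (u_{t_{i+1}} - u_{t_i}) = \sum_{j=1}^{k} p_j \cdot u_{t_j}$$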

Conversely, given an arbitrary path in $G$, let $v_i$ be the vertex visited at time $t_i$, and consider the path $(v_0,0)(v_1,1)\dots(v_k,k)$ in $G'$, which has a total utility at least equal to the expected utility of the given path in $G$.

Therefore, the problem can be solved by finding the optimal fixed-length path value for length in , which can be done in polynomial time (see the remark after Theorem 1). ∎
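The layered-graph reduction can be sketched as a backward dynamic program over the layers of $G'$. The Python code below is a minimal illustration under stated assumptions: the graph is a dense max-plus matrix with `float("-inf")` for missing edges, `delta` is a finite-support distribution given as a dictionary of exact probabilities, and all names (including the example graph) are hypothetical rather than taken from the paper.

```python
from fractions import Fraction
from typing import Dict, List

NEG_INF = float("-inf")  # missing edge / unreachable


def maxplus_mul(a, b):
    """(a ⊗ b)(i, j) = max_k a(i, k) + b(k, j)."""
    n = len(a)
    return [[max(a[i][k] + b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def maxplus_power(m, t):
    """m^t in the max-plus algebra by repeated squaring (identity for t = 0)."""
    n = len(m)
    result = [[0 if i == j else NEG_INF for j in range(n)] for i in range(n)]
    while t > 0:
        if t & 1:
            result = maxplus_mul(result, m)
        m = maxplus_mul(m, m)
        t >>= 1
    return result


def optimal_expected_utility(M: List[list], delta: Dict[int, Fraction], start: int):
    """max over paths from `start` of E_delta(u), computed on the layered graph G'."""
    times = [0] + sorted(t for t in delta if t > 0)  # t_0 = 0 < t_1 < ... < t_k
    k, n = len(times) - 1, len(M)
    # tail[i] = p_{i+1} + ... + p_k: probability of not stopping before time t_{i+1}
    tail = [sum((delta[times[j]] for j in range(i + 1, k + 1)), Fraction(0)) for i in range(k)]
    value = [Fraction(0)] * n  # layer k: nothing left to collect
    for i in reversed(range(k)):
        seg = maxplus_power(M, times[i + 1] - times[i])  # best weight over t_{i+1} - t_i steps
        value = [
            max(
                (tail[i] * seg[v][v2] + value[v2] for v2 in range(n) if seg[v][v2] != NEG_INF),
                default=NEG_INF,
            )
            for v in range(n)
        ]
    return value[start]


# Example graph: 0->1 (2), 1->0 (-1), 0->2 (0), 2->2 (1), 1->1 (0).
M = [[NEG_INF, 2, 0], [-1, 0, NEG_INF], [NEG_INF, NEG_INF, 1]]
print(optimal_expected_utility(M, {1: Fraction(1, 2), 5: Fraction(1, 2)}, 0))  # 3
```

With `delta = {T: 1}` the program reduces to the fixed-length problem, so it returns $A_T(\text{start})$.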

In the fixed-horizon problem with $\delta(T) = 1$, the optimal plan need not be stationary. The example below shows that, in general, the transducer for optimal plans requires a number of states that grows linearly with $T$, both as a Mealy machine and as a Moore machine.

###### Example 1.

Consider the graph of Figure 1 with vertices, and time bound (for some constant ). The optimal plan from is to repeat times the cycle and then switch to . This path has value , and all other paths have lower value: if only the cycle is used, then the value is at most , and the same holds if the cycle on is ever used before time . The optimal plan can be represented by a Mealy machine of size that counts the number of cycle repetitions before switching to . A Moore machine requires size as it needs a new memory state at every step of the plan.

###### Example 2.

In the example of Figure 2 the optimal plan needs to visit several different cycles, not just repeat a single cycle and possibly switch only at the end. The graph consists of three loops on  with weights  and respective lengths , , and , and an edge to  with weight . For expected time , the optimal plan has value  and needs to stop exactly when reaching  (to avoid the negative self-loop on ). It is easy to show that the remaining length can only be obtained by visiting each cycle once: as  is not an even number, the path has to visit a cycle of odd length, thus the cycle of length ; analogously, as  is not a multiple of , the path has to visit the cycle of length , etc. This example can easily be generalized to an arbitrary number of cycles by using more prime numbers.

We now consider the complexity of computing optimal plans among stationary plans.

###### Theorem 2.

Let $G$ be a weighted graph and $\lambda$ be a rational utility threshold. Given a stopping-time distribution $\delta$, deciding whether $\sup_{u \in S_G} E_\delta(u) \ge \lambda$ (i.e., whether there is a stationary plan with utility at least $\lambda$) is NP-complete. The NP-hardness holds already for the fixed-horizon problem ($\delta(T) = 1$), even when the horizon and all weights are polynomially bounded, and thus expressed in unary.

###### Proof.

The NP upper bound is easily obtained by guessing a stationary plan (i.e., one edge for each vertex of the graph) and checking that the value of the induced path is at least .

The NP-hardness follows from a result of [FHW80] where, given a directed graph $G$ and four vertices $s_1, t_1, s_2, t_2$, the problem of deciding the existence of two (vertex-)disjoint simple paths (one from $s_1$ to $t_1$ and the other from $s_2$ to $t_2$) is shown to be NP-complete. It easily follows that, given a directed graph and two of its vertices, the problem of deciding the existence of a simple cycle that contains both vertices is NP-complete. We present a reduction from the latter problem, illustrated in Figure 3. We construct a weighted graph from $G$ by adding two vertices start and sink, and all edges have weight  except those from  with weight , and the edge  with weight  where  is the number of vertices in $G$. Let  and the utility threshold .

If there exists a simple cycle containing and in , then there exists a stationary plan from start that visits then in at most steps. This plan can be prolonged to a plan of steps by going to sink and using the self-loop. The total weight is .

If there is no simple cycle containing and in , then no stationary plan can visit first then . We show that every stationary plan has value at most . First if a stationary plan uses the edge , then is not visited and all weights are except the weight from to sink. Otherwise, if a stationary plan does not use the edge , then all weights are at most , and the total utility is at most . In both cases, the utility is smaller than , which establishes the correctness of the reduction. ∎
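The NP upper bound in the proof above guesses one outgoing edge per vertex and evaluates the induced path. A deterministic sketch replaces the guess by exhaustive enumeration of all stationary plans, which takes exponential time in general (there are $\prod_v |\text{succ}(v)|$ candidates). The Python code below is such a sketch; the adjacency-list representation and all names are hypothetical, and every vertex is assumed to have at least one successor.

```python
from fractions import Fraction
from itertools import product
from typing import Dict, List, Tuple


def best_stationary_value(
    succ: Dict[int, List[int]],
    w: Dict[Tuple[int, int], int],
    delta: Dict[int, Fraction],
    start: int,
) -> Fraction:
    """Max of E_delta(u) over all stationary plans, by exhaustive enumeration."""
    vertices = sorted(succ)
    horizon = max(delta)
    best = None
    # One choice of outgoing edge per vertex: prod_v |succ[v]| candidate plans.
    for choice in product(*(succ[v] for v in vertices)):
        plan = dict(zip(vertices, choice))
        v, utils = start, [0]
        for _ in range(horizon):  # follow the plan for `horizon` steps
            nxt = plan[v]
            utils.append(utils[-1] + w[(v, nxt)])
            v = nxt
        value = sum((p * utils[t] for t, p in delta.items()), Fraction(0))
        if best is None or value > best:
            best = value
    return best


succ = {0: [1, 2], 1: [0, 1], 2: [2]}
w = {(0, 1): 2, (1, 0): -1, (0, 2): 0, (2, 2): 1, (1, 1): 0}
print(best_stationary_value(succ, w, {4: Fraction(1)}, 0))  # 3
```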

## 4 Expected Finite-horizon: Adversarial Distribution

We now consider the computation of the following optimal values under adversarial distribution. Given a weighted graph $G$ and an expected stopping time $T$, we define the following:

• Value of a plan. For a plan $\rho$ that induces the sequence $u$ of utilities, let

$$\mathrm{val}(\rho,T) = \mathrm{val}(u,T) = \inf_{\delta \in \Delta \,:\, E_\delta = T} E_\delta(u).$$

• Optimal value. The optimal value is the supremum value over all plans:

$$\mathrm{val}(G,T) = \sup_{u \in U_G} \mathrm{val}(u,T).$$

Our two main results are related to the plan complexity and a polynomial-time algorithm.

###### Theorem 3.

For all weighted graphs $G$ and for all $T$ we have

$$\mathrm{val}(G,T) = \sup_{u \in U_G} \mathrm{val}(u,T) = \sup_{u \in S_G} \mathrm{val}(u,T),$$

i.e., optimal stationary plans exist for expected finite-horizon under adversarial distribution.

###### Remark 1.

Note that, in contrast to the fixed finite-horizon problem, where stationary plans do not suffice, we show that in the presence of an adversary the simpler class of stationary plans is sufficient for optimality in expected finite-horizon. Moreover, while optimal plans for fixed-length paths require Mealy and Moore machines whose size grows with $T$, our results show that under adversarial distribution optimal plans can be represented by one-state Mealy (resp., $|V|$-state Moore) machines.

###### Theorem 4.

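Given a weighted graph $G$, an expected finite horizon $T$, and a threshold $\lambda$, whether $\mathrm{val}(G,T) \ge \lambda$ can be decided in polynomial time, and $\mathrm{val}(G,T)$ can be computed in polynomial time.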

### 4.1 Theorem 3: Plan Complexity

In this section we prove Theorem 3. We start with the notion of equivalent sub-distributions. Two sub-distributions $\delta, \delta'$ are equivalent if they have the same probability mass and the same expected time, that is $p_\delta = p_{\delta'}$ and $E_\delta = E_{\delta'}$. The following result is straightforward.

###### Lemma 1.

If $\delta_1, \delta_2$ are equivalent sub-distributions, and $\delta$ is a sub-distribution, then $\delta_1 + \delta$ and $\delta_2 + \delta$ are equivalent sub-distributions.

#### 4.1.1 Bi-Dirac distributions are sufficient

By Lemma 1, we can decompose distributions as the sum of two sub-distributions, and we can replace one of the two sub-distributions by a simpler (yet equivalent) one to obtain an equivalent distribution. We show that, given a sequence $u$ of utilities, for all sub-distributions with three points in their support (see Figure 4), there exists an equivalent sub-distribution with only two points in its support that gives no larger expected value for $u$. Intuitively, if one has to distribute a fixed probability mass (say $1$) among three points $t_1 < t_2 < t_3$ with a fixed expected time $T$, assigning probability $p_i$ at point $t_i$, then we have $p_1 + p_2 + p_3 = 1$ and $p_1 t_1 + p_2 t_2 + p_3 t_3 = T$, i.e.,

 p1⋅(t1−t3)p′1+p2⋅(t2−t3)p′2=T−t3.

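The expected utility is

$$p_1 \cdot u_{t_1} + p_2 \cdot u_{t_2} + p_3 \cdot u_{t_3} = p'_1 \cdot \frac{u_{t_1} - u_{t_3}}{t_1 - t_3} + p'_2 \cdot \frac{u_{t_2} - u_{t_3}}{t_2 - t_3} + u_{t_3}$$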

which is a linear expression in the variables $p'_1, p'_2$, where the sum $p'_1 + p'_2$ is constant. Hence the least expected utility is obtained for either $p'_1 = 0$ or $p'_2 = 0$. This is the main argument² to show that bi-Dirac distributions are sufficient to compute the optimal expected value.

² This argument works here because , which implies that  when , and vice versa. A symmetric argument can be used in the case , to show that then either , or .
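The convexity argument can be checked numerically. The set of distributions on three fixed points with a fixed mean $T$ is a segment whose endpoints have at most two support points, and those points lie on both sides of $T$; since the expected utility is linear, some two-point distribution on a pair of the original points that straddles $T$ is never worse for the adversary. The Python sketch below (hypothetical names, exact rationals, random utilities) checks this on random instances; it illustrates the claim and is not part of the proof.

```python
import random
from fractions import Fraction
from typing import Sequence


def two_point_value(u: Sequence[int], ta: int, tb: int, T: Fraction) -> Fraction:
    """E(u) under the distribution on {ta, tb} (ta < tb, ta <= T <= tb) with mean T."""
    return u[ta] + Fraction(T - ta) / (tb - ta) * (u[tb] - u[ta])


def best_two_point(u: Sequence[int], ts: Sequence[int], ps: Sequence[Fraction]) -> Fraction:
    """Smallest E(u) over two-point distributions on pairs of ts with the same mean."""
    T = sum((p * t for p, t in zip(ps, ts)), Fraction(0))
    return min(
        two_point_value(u, ta, tb, T)
        for i, ta in enumerate(ts)
        for tb in ts[i + 1:]
        if ta <= T <= tb
    )


def check_random(trials: int = 200, seed: int = 0) -> bool:
    rng = random.Random(seed)
    for _ in range(trials):
        u = [rng.randint(-10, 10) for _ in range(30)]
        ts = sorted(rng.sample(range(30), 3))      # t1 < t2 < t3
        raw = [rng.randint(1, 9) for _ in range(3)]
        ps = [Fraction(r, sum(raw)) for r in raw]  # probabilities summing to 1
        three_point = sum((p * u[t] for p, t in zip(ps, ts)), Fraction(0))
        if best_two_point(u, ts, ps) > three_point:
            return False
    return True


print(best_two_point([0, 5, 0, 0], [1, 2, 3], [Fraction(1, 3)] * 3))  # 0
```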

###### Lemma 2 (Bi-Dirac distributions are sufficient).

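For all sequences $u$ of utilities, for all time bounds $T$, the following holds:

$$\inf\{E_\delta(u) \mid \delta \in \Delta \wedge E_\delta = T\} = \inf\{E_\delta(u) \mid \delta \in \Delta^{\upuparrows} \wedge E_\delta = T\},$$

i.e., the set of bi-Dirac distributions suffices for the adversary.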

###### Proof.

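First, we show that for all distributions $\delta$ with $E_\delta = T$,

• there exists an equivalent distribution $\delta'$ such that $E_{\delta'}(u) \le E_\delta(u)$ and at most one point of $\mathrm{Supp}(\delta')$ lies before $T$, i.e., only one point before $T$ in the support is sufficient, and

• there exists an equivalent distribution $\delta'$ such that $E_{\delta'}(u) \le E_\delta(u)$ and at most one point of $\mathrm{Supp}(\delta')$ lies after $T$, i.e., only one point after $T$ in the support is sufficient.

The result of the lemma follows from these two claims.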

To prove claim , first consider an arbitrary sub-distribution with where . Then and either , or .

We show that among the sub-distributions equivalent to and with , the smallest expected utility of is obtained for . We present below the argument in the case , and show that either , or . A symmetric argument in the case shows that either , or .

Let $x = \delta'(t_1)$, $y = \delta'(t_2)$, and $z = \delta'(t_3)$. Since $\delta$ and $\delta'$ are equivalent, we have

$$x + y + z = p_\delta \qquad x \cdot t_1 + y \cdot t_2 + z \cdot t_3 = p_\delta \cdot E_\delta$$

Hence

$$z = p_\delta - x - y \qquad \underbrace{x \cdot (t_1 - t_3)}_{x'} + \underbrace{y \cdot (t_2 - t_3)}_{y'} = p_\delta \cdot (E_\delta - t_3)$$

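The expected utility of $u$ under $\delta'$ is

$$\begin{aligned} E_{\delta'}(u) &= x \cdot u_{t_1} + y \cdot u_{t_2} + z \cdot u_{t_3} \\ &= x \cdot (u_{t_1} - u_{t_3}) + y \cdot (u_{t_2} - u_{t_3}) + u_{t_3} \cdot p_\delta \\ &= x' \cdot \frac{u_{t_1} - u_{t_3}}{t_1 - t_3} + y' \cdot \frac{u_{t_2} - u_{t_3}}{t_2 - t_3} + u_{t_3} \cdot p_\delta \end{aligned} \tag{1}$$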

Since is constant and , the least value of is obtained either for (if ), or for (otherwise), thus either for , or for . Note that for , we have and , which is a feasible solution as and since , and . Symmetrically, for we have a feasible solution.

As an intermediate remark, note that for $p_\delta = 1$ and $E_\delta = T$, we get (for $y = 0$, and symmetrically for $x = 0$)

$$E_{\delta'}(u) = u_{t_3} + \frac{T - t_3}{t_1 - t_3} \cdot (u_{t_1} - u_{t_3}). \tag{2}$$

To complete the proof of the first claim, given an arbitrary distribution $\delta$ with $E_\delta = T$, we use the above argument to construct a distribution equivalent to $\delta$ (equivalence follows from Lemma 1) with no larger expected utility and one fewer point in the support. We repeat this argument until we obtain a distribution with support that contains at most two points in the interval  where  is such that . Such a value of  exists since . By the construction of , we have  and therefore at most one point in the support of  lies in the interval , which completes the proof of the first claim.

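To prove the second claim, consider a distribution $\delta$ with $E_\delta = T$, and by the first claim assume that $\delta(t_0) > 0$ for some $t_0 < T$, and $\delta(t) = 0$ for all $t < T$ with $t \ne t_0$. Let $\nu = \inf_{t \ge T} \frac{u_t - u_{t_0}}{t - t_0}$, and consider two cases: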

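• if for all $t \ge T$ such that $\delta(t) > 0$ we have $\frac{u_t - u_{t_0}}{t - t_0} = \nu$, then by an analogue of Equation (1), we get

$$E_\delta(u) = u_{t_0} + \sum_{t \ge T} \delta(t) \cdot (t - t_0) \cdot \frac{u_t - u_{t_0}}{t - t_0} = u_{t_0} + \nu \cdot \sum_{t \ge 0} \delta(t) \cdot (t - t_0) = u_{t_0} + \nu \cdot (T - t_0)$$

which is the expected utility of $u$ under a bi-Dirac distribution with support $\{t_0, t_1\}$ where $t_1$ is any element of $\mathrm{Supp}(\delta)$ with $t_1 \ge T$ (see Equation (2));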

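• otherwise there exists $t \ge T$ such that $\delta(t) > 0$ and $\frac{u_t - u_{t_0}}{t - t_0} > \nu$. By an analogue of Equation (1), we have

$$E_\delta(u) - u_{t_0} = \sum_{t \ge T} \delta(t) \cdot (t - t_0) \cdot \frac{u_t - u_{t_0}}{t - t_0} \quad \text{where} \quad \sum_{t \ge T} \delta(t) \cdot (t - t_0) = T - t_0,$$

that is, $\frac{E_\delta(u) - u_{t_0}}{T - t_0}$ is a convex combination of elements greater than or equal to $\nu$, among which one is greater than $\nu$. It follows that $\frac{E_\delta(u) - u_{t_0}}{T - t_0} > \nu$, and thus there exists $\epsilon > 0$ such that $E_\delta(u) - u_{t_0} = (T - t_0) \cdot (\nu + \epsilon)$.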

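Consider $t_1 \ge T$ such that $\frac{u_{t_1} - u_{t_0}}{t_1 - t_0} < \nu + \epsilon$ (which exists by definition of $\nu$), and let $\delta'$ be the bi-Dirac distribution with support $\{t_0, t_1\}$ and expected time $T$. By an analogue of Equation (2), we have

$$E_{\delta'}(u) - u_{t_0} = \frac{T - t_0}{t_1 - t_0} \cdot (u_{t_1} - u_{t_0}) < (T - t_0) \cdot (\nu + \epsilon)$$

Therefore, $E_{\delta'}(u) < E_\delta(u)$, which concludes the proof since $\delta'$ is a bi-Dirac distribution with $E_{\delta'} = T$. ∎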

#### 4.1.2 Geometric interpretation

It follows from the proof of Lemma 2 (and Equation (2)) that the expected utility of a sequence $u$ of utilities under a bi-Dirac distribution with support $\{t_1, t_2\}$ (where $t_1 \le T \le t_2$) and expected time $T$ is

$$u_{t_1} + \frac{T - t_1}{t_2 - t_1} \cdot (u_{t_2} - u_{t_1}).$$

In Figure 5, this value is obtained as the intersection of the vertical line at $T$ and the line that connects the two points $(t_1, u_{t_1})$ and $(t_2, u_{t_2})$. Intuitively, the value of a path is obtained by choosing the two points $t_1$ and $t_2$ such that the connecting line intersects the vertical line at $T$ as low as possible.
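The geometric reading suggests a direct, if naive, computation: enumerate pairs $t_1 \le T \le t_2$ and take the lowest intersection with the vertical line at $T$. The Python sketch below does this over a finite prefix of the utility sequence. It is an assumption-laden illustration with hypothetical names: forbidding stopping times beyond the prefix restricts the adversary, so the result is only an upper bound on $\mathrm{val}(u,T)$, not necessarily equal to it.

```python
import math
from fractions import Fraction
from typing import Sequence


def bi_dirac_value(u: Sequence[int], t1: int, t2: int, T: Fraction) -> Fraction:
    """Height at T of the segment joining (t1, u[t1]) and (t2, u[t2]), t1 <= T <= t2."""
    if t1 == t2:  # only possible when t1 == t2 == T: the Dirac distribution at T
        return Fraction(u[t1])
    return u[t1] + Fraction(T - t1) / (t2 - t1) * (u[t2] - u[t1])


def val_truncated(u: Sequence[int], T) -> Fraction:
    """Min over bi-Dirac distributions with mean T and support inside range(len(u)).
    This restricts the adversary, so it is an upper bound on val(u, T)."""
    T = Fraction(T)
    return min(
        bi_dirac_value(u, t1, t2, T)
        for t1 in range(math.floor(T) + 1)
        for t2 in range(math.ceil(T), len(u))
    )


print(val_truncated(list(range(20)), 5))  # u_t = t: every distribution with mean 5 gives 5
print(val_truncated([0] + [10] * 50, 2))  # 2/5: most mass at 0, a little mass at t2 = 50
```

In the second example the value keeps decreasing as longer prefixes are allowed, which is exactly the situation of Figure 5 where no optimal $t_2$ exists.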

###### Lemma 3.

For all sequences $u$ of utilities, if $u_t \ge a \cdot t + b$ for all $t \ge 0$, then the value of the sequence is at least $a \cdot T + b$.

###### Proof.

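By Lemma 2, it is sufficient to consider bi-Dirac distributions, and for all bi-Dirac distributions $\delta$ with arbitrary support $\{t_1, t_2\}$ (where $t_1 \le T \le t_2$ and $t_1 < t_2$) the value of $u$ under $\delta$ is

$$\begin{aligned} u_{t_1} + \frac{T - t_1}{t_2 - t_1} \cdot (u_{t_2} - u_{t_1}) &= \frac{u_{t_1} \cdot (t_2 - T) + u_{t_2} \cdot (T - t_1)}{t_2 - t_1} \\ &\ge \frac{(a \cdot t_1 + b) \cdot (t_2 - T) + (a \cdot t_2 + b) \cdot (T - t_1)}{t_2 - t_1} \\ &= a \cdot T + b \end{aligned}$$

∎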

It is always possible to fix an optimal value of $t_1$ (because $t_1$ is to be chosen among a finite set of points), but an optimal value of $t_2$ may not exist, as in Figure 5. The value of the path is then obtained as an infimum over $t_2$. In general, there exists $t_1$ such that it is sufficient to consider bi-Dirac distributions with support containing $t_1$ to compute the optimal value. We say that such a $t_1$ is a left-minimizer of the expected value in the path. Given such a value of $t_1$, let $\nu = \inf_{t_2 \ge T} \frac{u_{t_2} - u_{t_1}}{t_2 - t_1}$; we show in Lemma 4 that $u_t \ge u_{t_1} + (t - t_1) \cdot \nu$ for all $t \ge 0$. This motivates the following definition.

Line of equation $f_u$. Given a left-minimizer $t_1$, we define the line of equation $f_u$ as follows:

$$f_u(t) = u_{t_1} + (t - t_1) \cdot \nu.$$

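Note that the optimal expected utility is

$$\min_{0 \le t_1 \le T} \inf_{t_2 \ge T} \Big( u_{t_1} + \frac{T - t_1}{t_2 - t_1} \cdot (u_{t_2} - u_{t_1}) \Big) = \min_{0 \le t_1 \le T} \big( u_{t_1} + (T - t_1) \cdot \nu \big) = f_u(T).$$

In other words, $f_u(T)$ is the optimal value.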
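For paths induced by stationary plans, which are lassos (a finite prefix followed by a cycle repeated forever), the infimum over $t_2$ in the formula above can be computed exactly. Along each residue class $t_2 = s + j\cdot c$ modulo the cycle length $c$, the slope $\frac{u_{t_2} - u_{t_1}}{t_2 - t_1}$ has the form $\frac{\alpha + jC}{\beta + jc}$ with $\beta > 0$, which is monotone in $j$ and tends to the mean weight $C/c$ of the cycle; so its infimum is attained at the first point of the class or in the limit. The Python sketch below uses this observation under the assumption that $T$ is a non-negative integer; the function names are hypothetical and this is not the algorithm behind Theorem 4.

```python
from fractions import Fraction
from typing import List, Sequence


def lasso_utilities(prefix: Sequence[int], cycle: Sequence[int], length: int) -> List[int]:
    """u_0, ..., u_{length-1} for the path with weights `prefix`, then `cycle` forever."""
    u = [0]
    for t in range(length - 1):
        if t < len(prefix):
            w = prefix[t]
        else:
            w = cycle[(t - len(prefix)) % len(cycle)]
        u.append(u[-1] + w)
    return u


def val_lasso(prefix: Sequence[int], cycle: Sequence[int], T: int) -> Fraction:
    """val(u, T) = min over t1 <= T of u_{t1} + (T - t1) * nu(t1), for an integer T >= 0."""
    c = len(cycle)
    base = max(T, len(prefix))  # from `base` on, u_{t + c} = u_t + sum(cycle)
    u = lasso_utilities(prefix, cycle, base + c)
    limit_slope = Fraction(sum(cycle), c)  # slopes tend to the cycle's mean weight
    best = Fraction(u[T])  # Dirac distribution at T
    for t1 in range(T):
        # nu(t1) = inf over t2 >= T of the slope from (t1, u_{t1}) to (t2, u_{t2}):
        # check every t2 < base + c, plus the limit along each residue class.
        slopes = [Fraction(u[t2] - u[t1], t2 - t1) for t2 in range(T, base + c)]
        nu = min(slopes + [limit_slope])
        best = min(best, u[t1] + (T - t1) * nu)
    return best


print(val_lasso([], [1], 5))       # u_t = t, value 5
print(val_lasso([10], [0], 2))     # 0: the infimum is only reached as t2 -> infinity
print(val_lasso([], [2, -1], 2))   # 1: u_t >= t/2, with equality at even t
```

The second example matches the truncated estimates of the geometric interpretation: the value is $0$ even though every utility after time $0$ equals $10$.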

###### Lemma 4 (Geometric interpretation).

For all sequences $u$ of utilities, we have $u_t \ge f_u(t)$ for all $t \ge 0$, and the expected value of $u$ is $f_u(T)$.

###### Proof.

The result holds by definition of $\nu$ for all $t \ge T$. For $t < T$, assume towards contradiction that $u_t < f_u(t)$. Let $\varepsilon > 0$ such that $u_t = f_u(t) - \varepsilon$. We obtain a contradiction by showing that there exists a bi-Dirac distribution under which the expected value of $u$ is smaller than the optimal value $f_u(T)$ of $u$. Consider a bi-Dirac distribution with support $\{t, t_2\}$ where the value $t_2 \ge T$ is defined later.

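We need to show that

$$u_t + \frac{T - t}{t_2 - t} \cdot (u_{t_2} - u_t) < u_{t_1} + (T - t_1) \cdot \nu,$$

that is

$$\frac{u_t \cdot (t_2 - T) + u_{t_2} \cdot (T - t)}{t_2 - t} < u_{t_1} + (T - t_1) \cdot \nu,$$

which, since $u_t = u_{t_1} + (t - t_1) \cdot \nu - \varepsilon$, holds if (successively)

$$\begin{aligned}
u_{t_1} (t_2 - T) + (t - t_1)(t_2 - T)\,\nu + u_{t_2} (T - t) &< \varepsilon (t_2 - T) + u_{t_1} (t_2 - t) + (t_2 - t)(T - t_1)\,\nu \\
u_{t_1} (t - T) + u_{t_2} (T - t) &< \varepsilon (t_2 - T) - \nu \, (t\, t_2 + t_1 T - t_2 T - t\, t_1) \\
(u_{t_2} - u_{t_1})(T - t) + \nu \,(t_2 - t_1)(t - T) &< \varepsilon (t_2 - T) \\
(T - t) \Big( \frac{u_{t_2} - u_{t_1}}{t_2 - t_1} - \nu \Big) (t_2 - t_1) &< \varepsilon (t_2 - T)
\end{aligned}$$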

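We consider two cases: if the infimum is attained, then we have $\frac{u_{t_2} - u_{t_1}}{t_2 - t_1} = \nu$ for some $t_2 \ge T$, and the inequality holds; otherwise, there are arbitrarily large $t_2$ for which $\frac{u_{t_2} - u_{t_1}}{t_2 - t_1} - \nu$ is arbitrarily small, and since $\frac{t_2 - t_1}{t_2 - T} \to 1$ as $t_2 \to \infty$, we can choose $t_2$ large enough so that the inequality holds. ∎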

A corollary of the geometric interpretation lemma is that the value of a path can be obtained as the intersection of the vertical line at point