Markov decision processes. Markov decision processes (MDPs) [Howard1960]
are a standard mathematical model for sequential decision making, with a wide range of applications in artificial intelligence and beyond [Puterman1994, Filar and Vrieze1997, Bertsekas2005]. An MDP consists of a set of states, a finite set of actions (that represent the nondeterministic choices), and a probabilistic transition function that describes the transition probability over the next states, given the current state and action. One of the most classical optimization objectives in MDPs is the stochastic shortest path (SSP) problem, where the transitions of the MDP are labeled with rewards/costs, and the goal is to optimize the expected total reward until a target set is reached.
Curse of dimensionality.
In many typical applications, the computational analysis of MDPs suffers from the curse of dimensionality: the state space of the MDP is huge, as it consists of valuations to the many variables that constitute the MDP. A well-studied approach for the algorithmic analysis of large MDPs is to consider factored MDPs [Guestrin et al.2003, Delgado et al.2011], where the transition and reward functions depend only on a small number of variables.
Succinct MDPs. In the spirit of factored MDPs, which aim to deal with the curse of dimensionality, we consider succinct MDPs, where the MDP is described implicitly by a set of variables, and a set of rules that describe how the variables are updated. The rules can be chosen non-deterministically from this set at every time step to update the variables, until a target set (of valuations) is reached. The rules and the target set represent the succinct description of the MDP. We consider the SSP problem for succinct MDPs.
Our contributions. Our main contributions are as follows:
First, we show that many examples from the AI literature (e.g., Gambler’s Ruin, Robot Planning, and variants of Roulette) can be naturally modeled as succinct MDPs.
Second, we present mathematical and computational results for the SSP problem for succinct MDPs. For the SSP problem, the sup-value (resp., inf-value) represents the expected shortest path value with supremum (resp., infimum) over all policies. We consider linear bounds for the SSP problem, and our algorithmic bounds are as follows: (a) for the sup-value (resp., inf-value) we show that an upper (resp., lower) bound can be computed in polynomial time in the implicit description of the MDP; (b) for the sup-value (resp., inf-value) we show that a lower (resp., upper) bound can be computed by a polynomial-time (in the implicit description) reduction to quadratic programming. Our approach is as follows: we use results from probability theory to establish a mathematical approach to compute bounds for the SSP problem for succinct MDPs (Section 3), and reduce the mathematical approach to constraint solving to obtain a computational method (Section 4).
Finally, we present experimental results on several classical examples from the literature where our method computes tight bounds (i.e., lower and upper bounds are quite close) on the SSP problem extremely efficiently.
Comparison with approaches for factored MDPs. Some key advantages of our approach are the following. First, our approach gives a provably polynomial-time algorithm or polynomial-time reduction to quadratic programming in terms of the size of the implicit description of the MDP. Second, while algorithms for factored MDPs are typically suitable for finite-state MDPs, our method can handle MDPs with countable state space (e.g., when the domains of the variables are integers), or even uncountable state space (e.g., when the domains of the variables are reals). See Remark 1 for further details.
A full version of this paper, including details and proofs, is available at [arxivVersion].
2 Definitions and Examples
We define succinct Markov decision processes, the stochastic shortest path problem, and provide illustrative examples.
2.1 Succinct Markov Decision Processes
Markov decision processes (MDPs). MDPs are a standard mathematical model for sequential decision making. An MDP consists of a state space and a finite action space, and given a state and an action permissible in that state, there is a probability distribution function that describes the transition probability from the current state to the next states. Moreover, there is a reward assigned to each state and action pair.
Factored MDPs. In many applications of MDPs, the state space of the MDP is high-dimensional and huge (i.e., the state space consists of valuations to many variables). For computational analysis it is often considered that the MDP can be factored, i.e., the transition and reward probabilities depend only on a small number of variables. For algorithmic analysis of finite-state factored MDPs see [Guestrin et al.2003].
Succinct MDPs. In this work, we consider succinct MDPs, which are related to factored MDPs. First, we present an informal description and then the details. A succint MDP is described by a set of variables, and a set of rules that update them. The update rule can be chosen non-deterministically or stochastically. Thus the MDP is described implicitly with the set of rules and a condition that describes the target set of states, and the MDP terminates when the target set is reached. We describe succinct MDPs as a special class of programs with a single while loop.
Simple-while-loop succinct MDPs. We consider succinct MDPs described by a simple while loop program. There are two types of variables, namely, program and sampling variables. Program variables are normal variables, while sampling variables are those whose values are sampled independently wrt some probability distribution. In general, both program and sampling variables can take integer, or even real values. The succinct MDP is described by a simple while loop of the form:
is the loop guard, defined as a single comparison between linear arithmetic expressions over program variables (e.g., );
in the loop body, are sequential compositions of assignment statements, grouped by the non-determinism operator .
Every assignment statement has a program variable as its left-hand-side and a linear arithmetic expression over program and sampling variables as its right-hand-side. The operator is the nondeterministic choice which means that the decision as to which will be executed in the current loop iteration depends on a scheduler (or policy) that resolves nondeterminism. We first provide the formal syntax and then a simple example.
Formal syntax of simple-while-loop programs. A succinct MDP is specified by a simple-while-loop program equipped with probability distributions for sampling variables. We now formalize the intuitive description provided in Equation (1). Formally, a simple-while-loop program can be produced using the following grammar, where each <pvar> is chosen from a finite fixed set of program variables, each <svar> from a finite fixed set of sampling variables and each <constant> denotes a floating point number:
<simple-while-loop-program> ::= ‘while’ <guard> ‘do’ ‘’ <nondet-block-list> ‘’ ‘od’
<guard> ::= <linear-pvar-expr> <cmp> <linear-pvar-expr>
<linear-pvar-expr> ::= <constant> | <constant> ‘*’ <pvar> | <linear-pvar-expr> ‘+’ <linear-pvar-expr>
<cmp> ::= ‘>=’ | ‘>’ | ‘<=’ | ‘<’
<nondet-block-list> ::= <block> | <block> <nondet-block-list>
<block> ::= <assignment> | <assignment> <block>
<assignment> ::= <pvar> ‘:=’ <linear-expr> ‘;’
<linear-expr> ::= <constant> | <constant> ‘*’ <pvar> | <constant> ‘*’ <svar> | <linear-expr> ‘+’ <linear-expr>
Consider the following program
where is a program variable and is a sampling variable that observes the two-point distribution . Informally, at each step the program either decrements the variable or performs a random increment/decrement on it, until its value reaches zero.
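To make the semantics concrete, such a loop can be simulated directly. The sketch below is illustrative only: it assumes a concrete instantiation with a program variable x, loop guard x > 0, a fair two-point distribution over {-1, +1} for the sampling variable r, and the two blocks x := x - 1 and x := x + r; the scheduler is an arbitrary function resolving the nondeterministic choice.

```python
import random

# Hypothetical concretization of the simple while loop above:
# program variable x, sampling variable r with Pr[r=1] = Pr[r=-1] = 1/2,
# guard x > 0, and two nondeterministic blocks:
#   block 0:  x := x - 1   (deterministic decrement)
#   block 1:  x := x + r   (random increment/decrement)
def run(x, scheduler, rng):
    steps = 0
    while x > 0:                      # loop guard over program variables
        r = rng.choice([-1, 1])       # sample the sampling variable
        if scheduler(x, steps) == 0:  # scheduler resolves nondeterminism
            x = x - 1
        else:
            x = x + r
        steps += 1
    return steps

rng = random.Random(0)
# The always-decrement scheduler terminates in exactly x0 iterations.
assert run(10, lambda x, t: 0, rng) == 10
```

Any function of the history can serve as the scheduler; the always-decrement scheduler shown reaches the target set x = 0 in exactly x0 iterations.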
Informal description of the semantics. Given a simple-while-loop program in the form (1), an MDP is obtained as follows: the state space consists of values assigned to program variables (i.e., valuations for program variables); the action space corresponds to the non-deterministic choice between ; and the transition function depends on the sampling of the sampling variables, which, given the current valuation for program variables, probabilistically updates the valuation. The assignments are linear functions, and the loop guard describes the target states as those which do not satisfy . Given the above components, the notion of a policy (or scheduler) that resolves the non-deterministic choices, and that of the probability space given a policy, are standard [Puterman1994]. For a more formal treatment of the semantics, see Section 2.4.
Remark 1 (Simple-while-loop MDPs and Factored MDPs).
The principle behind factored MDPs and simple-while-loop succinct MDPs is similar. Both aim to consider high-dimensional and large MDPs described implicitly in a succinct way. In factored MDPs the transition and reward functions can be factored based on small sets of variables, but the dependency on the variables in these sets can be arbitrary. In contrast, in simple-while-loop succinct MDPs, we only allow linear arithmetic expressions as guards and assignments. Moreover, our MDPs do not allow nesting of while loops. The goal of our work is to consider linear upper and lower bounds, and nesting of linear loops can result in super-linear bounds; hence we do not consider nesting of loops. Therefore, simple-while-loop succinct MDPs are a special class of factored MDPs.
For algorithmic approaches with theoretical guarantees on computational complexity, the analysis of factored MDPs has typically been restricted to finite-state MDPs. In contrast, we present solutions for simple-while-loop programs where the variables can take integer or real values, and thus the underlying state space is infinite or even uncountable. Thus the class of simple-while-loop MDPs consists of large finite-state MDPs (when the variables are bounded); countable-state MDPs (when the variables are integer-valued); and even uncountable-state MDPs (when the variables are real-valued). Moreover, our algorithmic approaches provide computational complexity guarantees in terms of the input size of the implicit representation of the MDP. Note that for finite-state MDPs, the implicit representation can be exponentially more compact than the explicit MDP; for example, n boolean variables lead to a state space of size 2^n.
In the sequel we consider MDPs obtained from simple-while-loop programs and, for brevity, call them succinct MDPs.
2.2 Stochastic Shortest Path on Succinct MDPs
We consider the stochastic shortest path (SSP) problem on succinct MDPs. Below we fix a succinct MDP described by a while-loop in the form (1).
Reward function. We consider a reward function that assigns a reward when the sampling valuation for sampling variables is and the non-deterministic choice is (in a loop iteration). We assume that there is a maximal constant such that for all . The rewards need not be nonnegative as our approach is able to handle negative rewards as well.
Stochastic shortest path. Given an initial valuation for program variables and a policy , the definition of the expected total reward/cost until termination is standard. The inf-value (resp., sup-value) of a succinct MDP, given an initial valuation for program variables, is the infimum (resp., supremum) of the expected reward value over all policies that ensure finite expected termination time.
Computational problem. We consider the computational problem of obtaining upper and lower bounds on the sup-value and inf-value for succinct MDPs. Due to the similarity of the problems, we focus only on computing lower and upper bounds for the sup-value; the results for the inf-value are similar and omitted.
2.3 Illustrative Examples
Example 2 (Gambler’s Ruin).
We start with a simple and classical example. A gambler has tokens for gambling. In each round he can choose one of the two types of gambling. In type 1, he wins with probability and in type 2 with probability . A win leads to a reward of and an extra token. A loss costs one token. The player gambles as long as he has tokens left. His goal is to maximize overall expected reward. Letting and , a succinct MDP for this example is shown in Figure 1.
Note that above we use a probabilistic if as syntactic sugar, where the assignments in the if branch run with probability and those in the else branch with the remaining probability. Given that the assignments are linear, this can be translated back to a succinct MDP; e.g., the first if-else block in Figure 1 is equivalent to:
where r is a sampling variable with and .
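The desugaring can be checked mechanically by comparing one-step distributions of the two forms. The snippet below is a sketch with a hypothetical win probability p = 2/5; the identifiers step_if and step_sampling are illustrative and not from the paper.

```python
from fractions import Fraction

# Hypothetical win probability for the probabilistic if:
#   "if prob(p) then x := x + 1 else x := x - 1"
# versus the sampling-variable form "x := x + r", where
#   Pr[r = 1] = p and Pr[r = -1] = 1 - p.
p = Fraction(2, 5)

def step_if(x):
    # distribution over next values of x for the probabilistic if
    return {x + 1: p, x - 1: 1 - p}

def step_sampling(x):
    # distribution over next values of x for the desugared assignment
    r_dist = {1: p, -1: 1 - p}
    out = {}
    for r, q in r_dist.items():
        out[x + r] = out.get(x + r, 0) + q
    return out

# The two forms induce identical one-step distributions.
assert step_if(7) == step_sampling(7)
```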
Example 3 (Continuous variant).
We can also consider a continuous variant of the example where the sampling variable
is chosen from some continuous distribution with expected value (e.g., a uniform distribution).
2.4 Technical Details of Semantics and SSP
2.4.1 Formal Semantics
We formalize our semantics by introducing the notion of valuations, which specify current values for program and sampling variables. Below we fix a program in the form (1) and let (resp. ) be the set of program (resp. sampling) variables appearing in it. The sizes of and are denoted by and , respectively. We also impose arbitrary linear orders on both and , and assume that for each sampling variable , a distribution is given; each time the variable appears in the program, its value is stochastically sampled according to that distribution.
Valuations. A program valuation
is a vector. Intuitively, a valuation specifies that for each with rank in the linear order, the value assigned is the th coordinate of . Likewise, a sampling valuation is a vector .
For each program valuation , we say that satisfies the loop guard , denoted by , if the formula holds when every appearance of a program variable is replaced by its corresponding value in . Moreover, each in () now encodes a function which transforms the program valuation before the execution of and the sampled values in into the program valuation . By our linear setting, each is also linear.
Our semantics is based on paths. Intuitively, a path is an infinite sequence of valuations, where the valuation at the th step reflects the current values of program and sampling variables at that step.
Paths. A path is a finite or infinite sequence of triples such that each (resp. ) is a program valuation (resp. sampling valuation, nondeterministic choice). The intuition is that each (resp. ) is the current program valuation (resp. sampling valuation, nondeterministic choice) right before the th loop-iteration of .
The program may involve nondeterministic choices (i.e., the operator ) which are still unspecified. To resolve nondeterminism, we need the standard notion of schedulers.
Schedulers. A scheduler is a function which maps each finite path and current program valuation to a number in representing the choice of () at the th loop iteration.
Intuitively, a scheduler resolves the nondeterministic choice at each iteration of by choosing which to run in the loop body. The resolution at the th iteration may depend on all previous valuations has traversed before.
Intuitive Semantics. Consider an initial program valuation and a scheduler . An infinite path is constructed as follows. Initially, . Then at each step (): first, a sampling valuation is obtained through samplings for all sampling variables, where the sampling of each sampling variable observes a predefined probability distribution for the variable; second, if then the program enters the loop and , , otherwise the program terminates and .
Probability Space. A probability space is a triple , where is a nonempty set (so-called sample space), is a -algebra over (i.e., a collection of subsets of that contains the empty set and is closed under complementation and countable union), and is a probability measure on , i.e., a function such that (i) and (ii) for all set-sequences that are pairwise-disjoint (i.e., whenever ) it holds that . Elements are usually called events. An event is said to hold almost surely (a.s.) if .
Formal Semantics. Given an initial program valuation and a scheduler , we build the probability space for the program as follows. First, is the set of all infinite paths. Then, we construct
through general state-space Markov chains. In detail, we build the kernel function on the set of all finite paths. The construction of the kernel function follows exactly our aforementioned intuitive semantics. Then the probability space on infinite paths is generated uniquely from the kernel function.
2.4.2 Formal Details of the SSP Problem
We consider accumulated cost until program termination. We first define several classic notions of probability theory.
Random Variables. A random variable for a probability space is an -measurable function , i.e., a function satisfying the condition that for all , the set belongs to ; by convention, we abbreviate as .
Below we fix a program in the form (1) with program variables and sampling variables . We first establish the notion of a cost function, which measures the cost consumed in one loop iteration. In this paper, we consider that the cost depends only on the sampling valuation and the nondeterministic choice before the execution of the loop body.
A cost function is a function .
The next definition introduces the notion of accumulated cost.
For each , we define the random variable for cost at the th step by:
for any infinite path . Then the random variable for accumulated cost until termination is defined as for all infinite paths .
Expectation. The expected value of a random variable from a probability space , denoted by , is defined as the Lebesgue integral of w.r.t , i.e., ; the precise definition of Lebesgue integral is somewhat technical and is omitted here (cf. [Williams1991, Chapter 5] for a formal definition).
We study the expected accumulated cost , which is an important criterion for measuring how much cost the program consumes upon termination. In the presence of nondeterminism, we consider maximum and minimum accumulated cost over all schedulers.
The maximum (resp. minimum) expected accumulated cost (resp. ) w.r.t an initial program valuation is defined as (resp. ), where ranges over all schedulers and is the expectation under the probability space for infinite paths.
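For intuition, on a finite truncation of the state space the maximum expected accumulated reward can be approximated by standard value iteration over the explicit MDP. The sketch below uses hypothetical Gambler's Ruin-style parameters (choice i wins with probability p_i, gaining a token and reward r_i, and loses a token otherwise; the target set is x = 0); none of these numbers come from the paper.

```python
# Value iteration for the maximum expected accumulated reward on the
# truncated state space {0, ..., N}.  Parameters are hypothetical:
# choice i wins with probability p_i (gaining a token and reward r_i)
# and loses a token otherwise; the loop terminates at x = 0.
N = 200
params = [(0.4, 1.0), (0.3, 2.0)]   # (p_i, r_i), hypothetical

V = [0.0] * (N + 1)                 # V[0] = V[N] = 0 (truncation boundary)
for _ in range(2000):
    for x in range(1, N):
        V[x] = max(p * (r + V[x + 1]) + (1 - p) * V[x - 1]
                   for p, r in params)

# A linear fixed point V(x) = a*x requires a*(1 - 2*p_i) = p_i*r_i for
# the best choice i, i.e. a = max(0.4*1/0.2, 0.3*2/0.4) = 2, so the
# sup-value here is 2*x (up to truncation error, negligible for small x).
assert abs(V[1] - 2.0) < 0.05
assert abs(V[10] - 20.0) < 0.5
```

Such explicit iteration scales with the size of the truncated state space, which is exactly what the implicit approach of the following sections avoids.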
3 Theoretical Results
In this section we present the main theoretical results, which form the basis of the algorithms in the following section.
Notation. Given a succinct MDP in the form (1), we let be the set of program variables in the program, be the size of , be the set of -dimensional real-valued vectors, be a valuation for program variables such that the value for the th program variable is the th coordinate of , be a valuation for sampling variables, be the set of non-deterministic choices, be a non-deterministic choice, and () be the function such that is the valuation for program variables resulting from executing with the valuation for program variables and the sampled valuation for sampling variables. Table 1 summarizes the notation.
|the set of program variables|
|the size of|
|the set of -dimensional real column vectors|
|a valuation for program variables|
|a valuation for sampling variables|
|the -value for initial valuation|
|the -value for initial valuation|
|the set of non-deterministic choices|
|a non-deterministic choice in|
|the transformation function of mapping valuations before its execution to the resulting valuation after it|
|the reward function|
|an upper bound for the absolute value of the rewards|
3.0.1 Upper bounds
We first formally introduce the main concept for computing an upper bound for , and then present an informal description.
Definition 4 (Linear Upper Potential Functions (LUPFs)).
A linear upper potential function (LUPF) is a function that satisfies the following conditions:
(C1) is linear, i.e., there exist and such that for all , we have ;
(C2) for all such that , and , we have for some fixed constants and ;
(C3) for all and all valuations such that ,
where are the expected values over the sampling when fixing and ;
(C4) for all such that , all sampling valuations and all , we have for some fixed constant .
Informally, (C1) specifies the linearity of LUPFs, (C2) specifies that the value of the function at terminating valuations should be bounded, (C3) specifies that the current value of the function at is no less than that of the next step at plus the cost/reward at the current step, and finally (C4) specifies that the change of value between the current step and the next step is bounded. Note that is linear in since and are linear. Note that the function (only) depends on the valuations of the variables before the loop execution, and hence it is only loop-dependent (but not dependent on each assignment).
The following theorem shows that LUPFs indeed serve as upper bounds for .
Theorem 1. If is an LUPF, then for all valuations such that .
Proof sketch. The key ideas of the proof are as follows: Fix any scheduler that ensures finite expected termination time.
We first construct a stochastic process based on . Using condition (C3), which is the non-increasing property, we establish that the resulting stochastic process is a supermartingale (for the definition of supermartingales, see [Williams1991, Chapter 10]). The supermartingale in essence preserves the non-increasing property.
Given the supermartingale, we apply the Optional Stopping Theorem (OST) ([Williams1991, Chapter 10.10]) and use condition (C4) to establish the required boundedness condition of the OST, arriving at the desired result.
While conditions (C1) and (C2) are not central to the proof, condition (C1) ensures linearity, which will be required by our algorithms, and condition (C2) is the boundedness after termination, which yields the desired upper bound (i.e., the contribution of the term comes from condition (C2)).
Formal proof of Theorem 1. In order to formally prove the theorem, we need the fundamental mathematical notions of filtrations, stochastic processes and conditional expectation.
Filtrations . A filtration of a probability space is an infinite sequence of -algebras over such that for all .
Discrete-Time Stochastic Processes. A discrete-time stochastic process is a sequence of random variables where ’s are all for some probability space (say, ); is adapted to a filtration of sub--algebras of if for all , is -measurable.
Conditional Expectation. Let be any random variable for a probability space such that . Then given any -algebra , there exists a random variable (for ), conventionally denoted by , such that
is -measurable, and
for all , we have .
The random variable is called the conditional expectation of given , and is unique in the sense that if is another random variable satisfying (E1)–(E3), then . See [Williams1991, Chapter 9] for more details.
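As a concrete sanity check of these properties, the following sketch verifies (E1) and (E3) on a four-point sample space (two fair coin flips, with the sub-σ-algebra generated by the first flip); this is a standard textbook construction, not taken from the paper.

```python
from fractions import Fraction

# Sample space: two fair coin flips; X = number of heads; G = the
# sigma-algebra generated by the first flip.
omega = [(0, 0), (0, 1), (1, 0), (1, 1)]
P = {w: Fraction(1, 4) for w in omega}
X = {w: w[0] + w[1] for w in omega}

# (E1): E[X | G] is G-measurable -- it depends only on the first flip.
Y = {w: w[0] + Fraction(1, 2) for w in omega}

def integral(Z, A):
    """Lebesgue integral of Z over the event A (finite sample space)."""
    return sum(Z[w] * P[w] for w in A)

# (E3): the integrals of X and Y agree on every event in G.
G_events = [[], omega,
            [w for w in omega if w[0] == 0],
            [w for w in omega if w[0] == 1]]
for A in G_events:
    assert integral(X, A) == integral(Y, A)
```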
Proof of Theorem 1.
Fix any scheduler and initial valuation for our simple while loop. Let be the random variable that measures the number of loop iterations. By our assumption, under . Define the following sequences of (vectors of) random variables:
where each represents the valuation before the th loop iteration of the while loop (so that );
where each represents the sampled valuation for the th loop iteration of the while loop;
where each represents the nondeterministic choice from the scheduler for the th loop iteration.
We also recall the random variables where each represents the cost/reward during the th iteration.
By the execution of the loop, we have:
We also have that . Then we define the stochastic process by:
We accompany with the filtration such that each is the smallest sigma-algebra that makes all random variables from and measurable. Hence,
where is the random variable such that
and . Hence, is a supermartingale. Moreover, we have from (C4) that . Thus, by applying the Optional Stopping Theorem, we obtain immediately that . By definition,
It follows from (C2) that . Since the scheduler is chosen arbitrarily, we obtain that . ∎
3.0.2 Lower bounds
For the lower bound, we have the following definition:
Definition 5 (Linear Lower Potential Functions (LLPFs)).
A linear lower potential function (LLPF) is a function that satisfies (C1), (C2), (C4) and, instead of (C3), the following condition (C3’):
there exists such that
for all satisfying .
Similar to Theorem 1, we obtain the following result on lower bounds for .
Theorem 2. If is an LLPF, then for all valuations such that .
This theorem can be proved in the same manner as Theorem 1. ∎
4 Computational Results
By Theorem 1 and Theorem 2, to obtain tight upper and lower bounds for the SSP problem, we need an algorithm to obtain good LUPFs and LLPFs, respectively. We present the results for upper and lower bounds separately.
4.1 Computational Approach for Upper Bound
The key steps to obtain an algorithmic approach are as follows: (i) we first establish a linear template with unknown coefficients for an LUPF from (C1); (ii) then we transform the logical conditions (C2)–(C4) equivalently into inclusion assertions between polyhedra; (iii) next we transform the inclusion assertions into linear inequalities through Farkas’ Lemma; (iv) finally we solve the linear inequalities through linear programming, where the solution for the unknown coefficients in the template synthesizes a concrete LUPF that serves as an upper bound for the sup-value. Below we recall the well-known Farkas’ Lemma.
Theorem 3 (Farkas’ Lemma [Farkas1894]).
Let , , and . Suppose that . Then
iff there exists such that , and .
Intuitively, Farkas’ Lemma transforms the inclusion problem of a nonempty polyhedron within a halfspace into a feasibility problem of a system of linear inequalities. As a result, one can decide the inclusion problem in polynomial time through linear programming.
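As an illustration of this reduction, the inclusion of a nonempty polyhedron {x : Ax <= b} in a halfspace {x : c·x <= d} can be decided with one linear program: by Farkas' Lemma the inclusion holds iff min{b·y : Aᵀy = c, y >= 0} <= d. The sketch below uses SciPy and hypothetical data.

```python
import numpy as np
from scipy.optimize import linprog

def halfspace_contains_polyhedron(A, b, c, d):
    """Decide {x : Ax <= b} subset of {x : c.x <= d} via Farkas' Lemma,
    assuming {x : Ax <= b} is nonempty.  Inclusion holds iff some y >= 0
    has A^T y = c and b.y <= d, i.e. iff the LP below has optimum <= d."""
    res = linprog(c=b, A_eq=np.asarray(A).T, b_eq=c,
                  bounds=[(0, None)] * len(b), method="highs")
    return res.success and res.fun <= d + 1e-9

# Hypothetical data: the interval -1 <= x <= 1 lies inside x <= 2,
# but not inside x <= 0.5.
A = [[1.0], [-1.0]]
b = [1.0, 1.0]
assert halfspace_contains_polyhedron(A, b, c=[1.0], d=2.0)
assert not halfspace_contains_polyhedron(A, b, c=[1.0], d=0.5)
```

Since the LP has size polynomial in A, b, c, d, this gives the claimed polynomial-time decision procedure.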
The Algorithm . Consider as input a succinct MDP in the form (1).
Template. The algorithm sets up a column vector of fresh variables and a fresh scalar variable such that the template for an LUPF is .
Constraints on and . The algorithm first encodes the condition (C2) for the template as the inclusion assertion
parameterized with for every , where are fresh unknown constants. Then for every , the algorithm encodes (C3) as
where are unique linear combinations of unknown coefficients satisfying that is equivalent to . Finally, the algorithm encodes (C4) as inclusion assertions with a fresh unknown constant using similar transformations. All the inclusion assertions (with parameters ) are grouped conjunctively so that these inclusions should all hold.
Applying Farkas’ Lemma. The algorithm applies Farkas’ Lemma to all the inclusion assertions generated in the previous step and obtains a system of linear inequalities involving the parameters .
Constraint Solving. The algorithm calls a linear programming solver on the linear program consisting of the system of linear inequalities generated in the previous step and the minimizing objective function where is an initial valuation for program variables.
Correctness and running time. The above algorithm obtains concrete values for and leads to a concrete LUPF . The correctness that is an upper bound for the sup-value follows from Theorem 1. The only optimization problem required by the algorithm is linear programming, and thus our algorithm runs in polynomial time in the size of the input succinct MDP.
Theorem 4. Given a succinct MDP and the SSP problem, the best linear upper bound (wrt an initial valuation) on the sup-value can be computed in polynomial time in the implicit description of the MDP.
Example 4. Let be an LUPF for this example; we have:
Note that for condition (C2) we need to quantify over , as if is not in this range, then the loop does not terminate in the next iteration. Given condition (C1), the two (C4) conditions are equivalent to . Also, the (C2) condition is equivalent to and or more precisely , , and . By expanding the occurrences of in the first (C3) condition and simplifying, we get and we can drop the quantification given that does not appear. Similarly, the second (C3) condition is equivalent to . In our method, such equivalences are automatically obtained by applying Farkas’ Lemma rather than manual inspection of the inequalities. Now that all necessary conditions are replaced by equivalent linear inequalities, we can solve the linear program to find an LUPF. An optimal answer (with minimal ) is the following: . Therefore by Theorem 1, we have for all initial valuations that satisfy the loop guard. ∎
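A worked example of this kind can be reproduced mechanically as a small linear program. The sketch below is illustrative only: it assumes hypothetical Gambler's Ruin parameters (choice i wins with probability p_i, gaining a token and reward r_i, and loses a token otherwise), under which condition (C3) for the template h(x) = a*x + b expands, after taking expectations, to a*(1 - 2*p_i) >= p_i*r_i for every i, and condition (C2) at the terminating valuation x = 0 reduces to b >= 0.

```python
from scipy.optimize import linprog

# LP synthesis of a linear upper potential function h(x) = a*x + b for a
# Gambler's Ruin-style loop with hypothetical parameters: choice i wins
# with probability p_i (gaining a token and reward r_i) and loses a
# token otherwise; the loop guard is x >= 1.
params = [(0.4, 1.0), (0.3, 2.0)]   # (p_i, r_i), hypothetical
x0 = 10.0                            # initial number of tokens

# Variables (a, b); minimize the bound h(x0) = a*x0 + b subject to
#   (C3):  a*(1 - 2*p_i) >= p_i*r_i   for every choice i
#   (C2):  b >= 0                      at the terminating valuation x = 0
A_ub = [[-(1 - 2 * p), 0.0] for p, _ in params]
b_ub = [-(p * r) for p, r in params]
res = linprog(c=[x0, 1.0], A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None), (0, None)], method="highs")
a, b = res.x
print(f"h(x) = {a:.2f}*x + {b:.2f}")   # optimal: h(x) = 2.00*x + 0.00
```

The binding constraint is a >= max(0.4·1/0.2, 0.3·2/0.4) = 2, so the synthesized upper bound on the sup-value is 2·x0 for these hypothetical numbers.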
Note that our approach is applicable to succinct MDPs with integer as well as real-valued variables (i.e., the underlying state space of the MDP is infinite). Even when we consider integer variables, since gives upper bounds, the reduction is to linear programming rather than integer linear programming. Note that our approach depends only on the expectations of the sampling variables, and is thus applicable even to continuous sampling variables with the same expectation; e.g., our results apply uniformly to Example 2 and Example 3, given that the sampling variable has the same expectation.
4.2 Computational Approach for Lower Bound
The algorithm for the lower bound is similar to that for the upper bound; however, there are some subtle and key differences. An important difference is that while Step 2 of the upper-bound algorithm involves a conjunction of constraints, the lower-bound problem requires a disjunction. This has two important implications: first, we need a generalization of Farkas’ Lemma, namely Motzkin’s Transposition Theorem (which extends Farkas’ Lemma with strict inequalities); and second, instead of linear programming we require quadratic programming.
Theorem 5 (Motzkin’s Transposition Theorem [Motzkin1936]).
Let , and . Assume that . Then
iff there exist and such that , , and .
The Algorithm . The algorithm here is similar to the one described previously. For the sake of brevity, we explain only the differences.
Template . Same as in the Algorithm .
Constraints on and . The algorithm first encodes (C2) and (C4) as inclusion assertions in the same way as in and transforms them into linear inequalities over through Farkas’ Lemma. Then the algorithm transforms (C3’) equivalently into the inclusion assertion
where are determined in the same way as in . Furthermore, this inclusion assertion is equivalently written as
and then transformed into a system of quadratic inequalities over and through Motzkin’s Transposition Theorem. The system may involve quadratic inequalities since contains the unknown parameters and .
Constraint Solving. The algorithm calls a nonlinear-programming solver on the system of linear and quadratic inequalities generated in the previous step with the maximizing objective function where is an appropriate initial valuation.
Correctness and optimization problem. As in the upper-bound algorithm, once are found, we obtain a concrete lower bound for the sup-value from Theorem 2, establishing the correctness of our algorithm. The reduction leads to a quadratic optimization problem of polynomial size in the size of the succinct MDP, implying the following result.
Theorem 6. Given a succinct MDP and the SSP problem, the best linear lower bound (wrt an initial valuation) on the sup-value can be computed via a polynomial (in the implicit description of the MDP) reduction to quadratic programming.
Let be an LLPF for the Gambler’s Ruin example (Figure 1 in Section 2.3). The conditions of the form (C1), (C2) and (C4) are exactly the same as in Example 4. In addition, must also satisfy the following condition:
Expanding the occurrences of in the condition above using , and discarding the quantification, we obtain the following equivalent disjunctive system of inequalities: or . This system is obviously equivalent to . Note that in general we have disjunction of linear inequalities, which require quadratic programming. As explained previously, such equivalences are automatically obtained by our algorithm using Motzkin’s transposition theorem.
Adding the equivalent linear forms of conditions (C1), (C2) and (C4) as in Example 4, and considering the resulting linear program with the objective of maximizing leads to the following solution: . Therefore, by Theorem 2, for every initial valuation that satisfies the loop guard. Since this matches the upper bound found in Example 4, the bound is tight. ∎
Remark 4 (Sampling Distributions).
For simplicity, we only considered discrete and uniform sampling distributions in our examples in this paper. However, as mentioned in Remark 3, since we only use the mean of random variables, our approach extends to other sampling distributions with known means.
Remark 5 (Dependence of Rewards on Program Variables).
Our proof (of Theorem 1, which is the basis of all our results) depends on the Optional Stopping Theorem, which requires bounded differences at all steps. We assumed that the rewards depend only on sampling variables and nondeterminism. This assumption is sufficient, but not always necessary, to ensure bounded differences. Our approach is also applicable if the rewards depend on program variables, provided that they remain bounded.
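For reference, the form of the Optional Stopping Theorem invoked here (see, e.g., [Williams1991]) can be stated as follows; the bounded-difference condition is exactly the one discussed above:

```latex
% Optional Stopping Theorem (bounded-difference form).
% If (X_n)_{n \ge 0} is a supermartingale with bounded differences,
% i.e., |X_{n+1} - X_n| \le c for some constant c, and \tau is a
% stopping time with \mathbb{E}[\tau] < \infty, then
\mathbb{E}[X_\tau] \;\le\; \mathbb{E}[X_0],
% with equality when (X_n) is a martingale.
```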
Limitations. Our approach computes the best linear upper and lower bounds wrt the initial valuations. However, a succinct MDP might collect logarithmic reward. For example, the succinct MDP shown in Figure 2 halves the variable until it becomes less than or equal to , and has a unit reward at each step. Hence, it collects reward in total. In such cases, the obtained bounds, which are linear, can be arbitrarily bad.
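The halving process can be illustrated with a minimal sketch (assuming, for concreteness, that the threshold is 1 and that the update is exact integer halving):

```python
def halving_reward(x):
    """Total reward collected by a halving MDP in the style of Figure 2:
    one unit of reward per step, halving x until x <= 1.
    Threshold and exact halving are illustrative assumptions."""
    reward = 0
    while x > 1:
        x //= 2
        reward += 1
    return reward

# Starting from x = 1024 = 2**10, the process runs for exactly 10 steps,
# so the collected reward grows logarithmically in x, while any linear
# lower bound a*x + b with a > 0 overshoots for large x.
print(halving_reward(1024))  # prints 10
```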
Future work. While in this work we consider succinct MDPs, which are a special case of factored MDPs, extending our approach to factored MDPs with linear dependency on the variables, but without restrictions on the nesting structure of while-loops, is an interesting direction of future work.
5 Case Studies and Experiments
We present more case studies and our experimental results.
5.1 Additional Examples
We consider several other classical problems in probabilistic planning that can be described as succinct MDPs. We provide examples of Robot Planning and two variants of Roulette as typically played in casinos.
Two-dimensional Robot Planning. Consider a robot placed at an initial point of a two-dimensional grid. A player controls the robot and, at each step, can order it to move one unit in either direction (left, right, up, down). However, the robot is not perfect: it follows the order with probability , but ignores it and moves to the left with probability . The process ends when the robot leaves the half-plane and the player's objective is to keep the robot in this half-plane. The player receives a reward of each time the robot moves. The half-plane was chosen arbitrarily to demonstrate our approach; our method can handle any half-plane and starting point.
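A simulation of this process can be sketched as follows. All concrete numbers here (the follow probability 0.4, the half-plane x ≥ 0, the start at x = 10, the policy that always orders "right") are illustrative assumptions, since the original parameters are not reproduced in this text.

```python
import random

def robot_exit_time(x0, p_follow=0.4, seed=0):
    """Simulate the robot's x-coordinate under the policy that always
    orders 'move right': the order is followed with probability p_follow,
    otherwise the robot moves one unit to the left.  The episode ends when
    the robot leaves the assumed half-plane x >= 0 (the y-coordinate never
    affects this guard).  Returns (steps, final_x); with one unit of reward
    per move, the collected reward equals `steps`."""
    rng = random.Random(seed)
    x, steps = x0, 0
    while x >= 0:
        x += 1 if rng.random() < p_follow else -1
        steps += 1
    return steps, x

steps, final_x = robot_exit_time(10)
# Unit moves imply the robot exits exactly at x = -1, after at least 11 steps.
assert final_x == -1 and steps >= 11
```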
Multi-robot Planning. Our approach can handle as many variables as necessary and depends only polynomially on the succinct representation of the MDP. To demonstrate this, we consider a scenario similar to the previous case, in which there are now two robots and starting at positions and . The robot follows the orders with probability and malfunctions and goes right with probability . Similarly, follows the commands with probability and goes left with probability . The player's goal is to keep to the left of , i.e., to keep the robots in the four-dimensional half-space . The process ends when the robots leave this half-space, and the player gets a reward of per step as long as the process has not terminated.
Mini-roulette. We model Mini-roulette, a popular casino game based on a 13-slot wheel. A player starts the game with chips. She needs one chip to make a bet, and she bets as long as she has chips. If she loses a bet, the chip is not returned; a winning bet does not consume the chip and results in a specific amount of (monetary) reward, and possibly even more chips. The following types of bets can be placed at each round: (i) Even money bets: 6 specific slots are chosen; the ball is rolled, and the player wins the bet if it lands in one of the 6 slots, so the winning probability is . Winning gives a reward of one unit and one extra chip. (ii) 2-to-1 bets: these bets correspond to 4 slots, and winning them gives a reward of 2 and 2 extra chips. (iii, iv, v) 3-to-1, 5-to-1 and 11-to-1 bets: these are defined similarly and have winning probabilities of and respectively.
American Roulette. This game is similar to Mini-roulette in all aspects, except that more complicated types of bets are available to the player and the wheel has 38 slots. The player can now have half-chips and can bet as long as she has at least one chip remaining. A bet leads to one of three outcomes: a definite loss, a partial loss, or a win. A definite loss costs one chip, a partial loss costs half a chip, and a win leads to some reward and more chips. Table 2 summarizes the types of bets available in the game.
Table 2 columns: Type, Winning Probability, Partial Loss Probability, Winning Reward, Winning Tokens.
5.2 Experimental Results
Table 3 columns: MDP (e.g., 2D Robot Planning), Parameters, Upper bound, Lower bound, Time (ms).
We implemented our approach in Java and ran it on all examples mentioned previously. The results are summarized in Table 3, where "Parameters" shows the concrete parameters for our examples (the last two examples have no parameters), "Upper bound" (resp. "Lower bound") presents the LUPFs (resp. LLPFs) obtained through our approach, and "Time" shows the running time in milliseconds. The reported upper bounds on -values are obtained as in Theorem 1. Similarly, the reported lower bounds on -values are obtained as in Theorem 2. All initial valuations lead to the same results in our experiments. Finally, our approach is not sensitive to the concrete parameters: varying them only changes the coefficients of our LUPFs/LLPFs.
Runtime and Platform. Our approach is efficient and handles all these MDPs, even those with a large succinct representation, such as the American Roulette MDP, in less than a second. The results were obtained on a machine with an Intel Core i5-2520M dual-core processor (2.5 GHz), running Ubuntu Linux 16.04.3 LTS. We used lpsolve [Berkelaar et al.2004], JavaILP [Lukasiewycz2008] and JOptimizer [Tivellato2017] for solving the linear and quadratic optimization tasks.
Significance of our results. First, observe that in most experimental results the upper and lower bounds are tight; thus our approach provides precise bounds on the SSP problem for several classical examples. Second, our results apply to infinite-state MDPs: all the above examples are infinite-state MDPs, where algorithmic approaches for factored MDPs do not work. Finally, if, instead of infinite-state MDPs, we consider large finite-state variants of the above examples (e.g., bounding the variable in Gambler's Ruin by a large domain), then as the MDP becomes larger, the SSP value of the finite-state MDP approaches the infinite-state value. Since we provide tight bounds on the SSP value for this infinite-state limit, our approach provides an efficient approximation even for large finite-state MDP variants.
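The finite-state approximation described above can be illustrated with a small value-iteration sketch for a Gambler's-Ruin-like chain. All parameters here (win probability 0.45, unit reward per step, truncation bound 200) are illustrative assumptions rather than the paper's instance.

```python
def expected_steps_to_ruin(x0, p_win=0.45, bound=200, sweeps=20000):
    """Value iteration for the expected number of steps until the capital
    reaches 0 in a chain where the capital goes up by 1 with probability
    p_win and down by 1 otherwise.  The chain is truncated at `bound`
    (state `bound` is made absorbing, which only perturbs values near the
    boundary).  Illustrative stand-in for a Gambler's-Ruin-like MDP."""
    V = [0.0] * (bound + 1)
    for _ in range(sweeps):
        new = [0.0] * (bound + 1)
        for i in range(1, bound):
            new[i] = 1.0 + p_win * V[i + 1] + (1.0 - p_win) * V[i - 1]
        V = new
    return V[x0]

# For the untruncated chain the exact value is x / (1 - 2*p_win) = 10 * x;
# the truncated value converges to it as the bound grows.
v = expected_steps_to_ruin(5)
assert abs(v - 50.0) < 1.0
```

As the truncation bound grows, the finite-state value approaches the infinite-state SSP value, which is exactly the regime where the linear bounds computed by our approach remain informative.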
6 Related Work
MDPs. MDPs have been studied widely and deeply in the AI literature [Sigaud and Buffet2010, Dean et al.1997, Singh et al.1994, Williams and Young2007, Poupart et al.2015, Gilbert et al.2017, Topin et al.2015, Perrault and Boutilier2017, Boutilier and Lu2016, Ferns et al.2004], and factored MDPs have been considered as an efficient algorithmic approach [Guestrin et al.2003]. Within this line of work, we introduce succinct MDPs and efficient algorithms for them, which are applicable to several problems in AI.
PPDDL and RDDL. There is a variety of languages for specifying MDPs, and especially factored MDPs. Two of the most commonly used are PPDDL [Younes and Littman2004] and RDDL [Sanner2010]. Both are general languages that can capture all factored MDPs. In contrast, we consider succinct MDPs, where guards and assignments are linear expressions and while-loops are not nested; our language is thus simpler than PPDDL and RDDL.
Programming language results. Besides the AI community, research in programming languages also considers probabilistic programs and algorithmic approaches [Chakarov and Sankaranarayanan2013, Chatterjee et al.2016]; but the main focus is termination with probability 1 or in finite time, whereas we consider the SSP problem and compute precise bounds for it. While both approaches use the theory of martingales as the underlying mathematical tool, the key differences are as follows:
Problem difference: The work of [Chatterjee et al.2016] considers the number of steps to termination. There is no notion of rewards or stochastic shortest path (SSP). In contrast, we consider rewards and SSP. In particular, we have negative rewards that cannot be modeled by the notion of steps.
Result difference: [Chatterjee et al.2016] considers the qualitative question of whether expected termination time is finite or not, and then applies Azuma’s inequality to martingales for concentration bounds. In contrast, we present upper and lower bounds on expected SSP. Thus our results provide quantitative (rather than qualitative) bounds for expected SSP that significantly generalize expected termination time. However, our results are applicable to a more restricted class of programs.
Proof-technique difference: [Chatterjee et al.2016] considers the qualitative characterization of expected termination time, and its main mathematical tool is martingale convergence, which does not handle negative rewards. In contrast, we present quantitative bounds for SSP, and our mathematical tool is the Optional Stopping Theorem.
Comparison with [Hansen and Abdoulahi2015].
This work provides convergence tests for heuristic-search value-iteration algorithms. While both approaches provide bounds for SSPs, the main differences are as follows: (i) our approach can handle negative costs, whereas [Hansen and Abdoulahi2015] can handle only positive costs; (ii) our results are on the implicit representation of MDPs, while [Hansen and Abdoulahi2015] evaluates parts of the explicit MDP; and (iii) our approach presents polynomial reductions to optimization problems and does not depend on value iteration.
7 Conclusion
We consider succinct MDPs, which can model several classical examples from the AI literature, and present algorithmic approaches for the SSP problem on them. There are several interesting directions for future work. The first is to consider other algorithmic approaches for succinct MDPs. In this work, we consider linear templates for efficient algorithmic analysis; generalizing our approach to more expressive templates is another interesting direction. Finally, whether our approach can be extended to other classes of MDPs is also an interesting problem to investigate.
- [Berkelaar et al.2004] M. Berkelaar, K. Eikland, P. Notebaert, et al. LPsolve: Open source (mixed-integer) linear programming system. Eindhoven U. of Technology, 2004.
- [Bertsekas2005] D. P. Bertsekas. Dynamic programming and optimal control, 3rd Ed. Athena Scientific, 2005.
- [Boutilier and Lu2016] C. Boutilier and T. Lu. Budget allocation using weakly coupled, constrained Markov decision processes. In UAI, 2016.
- [Chakarov and Sankaranarayanan2013] A. Chakarov and S. Sankaranarayanan. Probabilistic program analysis with martingales. In CAV, 2013.
- [Chatterjee et al.2016] K. Chatterjee, H. Fu, P. Novotný, and R. Hasheminezhad. Algorithmic analysis of qualitative and quantitative termination problems for affine probabilistic programs. In POPL, 2016.
- [Dean et al.1997] T. Dean, R. Givan, and S. Leach. Model reduction techniques for computing approximately optimal solutions for Markov decision processes. In UAI, 1997.
- [Delgado et al.2011] K. Delgado, S. Sanner, and L. Nunes de Barros. Efficient solutions to factored MDPs with imprecise transition probabilities. Artif. Intell., 175(9-10):1498–1527, 2011.
- [Farkas1894] J. Farkas. A Fourier-féle mechanikai elv alkalmazásai (Hungarian). Mathematikai és Természettudományi Értesítő, 12:457–472, 1894.
- [Ferns et al.2004] N. Ferns, P. Panangaden, and D. Precup. Metrics for finite Markov decision processes. In UAI, 2004.
- [Filar and Vrieze1997] J. Filar and K. Vrieze. Competitive Markov Decision Processes. Springer, 1997.
- [Gilbert et al.2017] H. Gilbert, P. Weng, and Y. Xu. Optimizing quantiles in preference-based Markov decision processes. In AAAI, 2017.
- [Guestrin et al.2003] C. Guestrin, D. Koller, R. Parr, and S. Venkataraman. Efficient solution algorithms for factored MDPs. J. Artif. Intell. Res., 19:399–468, 2003.
- [Hansen and Abdoulahi2015] E. Hansen and I. Abdoulahi. Efficient bounds in heuristic search algorithms for stochastic shortest path problems. In AAAI, 2015.
- [Howard1960] R. A. Howard. Dynamic Programming and Markov Processes. MIT Press, 1960.
- [Lukasiewycz2008] M. Lukasiewycz. JavaILP - java interface to ILP solvers, http://javailp.sourceforge.net/, 2008.
- [Motzkin1936] T. S. Motzkin. Beiträge zur Theorie der linearen Ungleichungen (German). PhD thesis, Basel, Jerusalem, 1936.
- [Perrault and Boutilier2017] A. Perrault and C. Boutilier. Multiple-profile prediction-of-use games. In IJCAI, 2017.
- [Poupart et al.2015] P. Poupart, A. Malhotra, P. Pei, K. Kim, B. Goh, and M. Bowling. Approximate linear programming for constrained partially observable Markov decision processes. In AAAI, 2015.
- [Puterman1994] M. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.
- [Sanner2010] S. Sanner. Relational dynamic influence diagram language (RDDL): Language description. Technical Report, Australian National University, 2010.
- [Sigaud and Buffet2010] O. Sigaud and O. Buffet. Markov Decision Processes in Artificial Intelligence. Wiley-IEEE Press, 2010.
- [Singh et al.1994] S.P. Singh, T. Jaakkola, and M.I. Jordan. Learning without state-estimation in partially observable Markovian decision processes. In Machine Learning Proceedings, pages 284–292. Morgan Kaufmann, 1994.
- [Tivellato2017] A. Tivellato. JOptimizer - java convex optimizer, http://www.joptimizer.com/, 2017.
- [Topin et al.2015] N. Topin, N. Haltmeyer, S. Squire, R.J. Winder, M. desJardins, and J. MacGlashan. Portable option discovery for automated learning transfer in object-oriented Markov decision processes. In IJCAI, 2015.
- [Williams and Young2007] J.D. Williams and S. Young. Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2):393 – 422, 2007.
- [Williams1991] D. Williams. Probability with Martingales. Cambridge University Press, 1991.
- [Younes and Littman2004] H. Younes and M. Littman. PPDDL1.0: The language for the probabilistic part of IPC-4. In Proc. International Planning Competition, 2004.