Markov decision processes (MDPs) provide a class of stochastic optimisation models that have found wide applicability to problems in Operational Research. The standard methods for computing an optimal policy are based on value iteration, policy iteration and linear programming algorithms [Whi93]. Each approach has its advantages and disadvantages. In particular, each step in value iteration is relatively computationally inexpensive but the value function may take some time to converge and the algorithm provides no direct check that it has computed the optimal value function and an optimal policy. Conversely, each step in policy iteration may be computationally expensive but the algorithm can be proved to converge in a finite number of steps, confirms when it has converged and automatically identifies the optimal value function and an optimal policy on exit.
Here we focus on models with special structure, in that they are skip-free in the negative direction [Kei65, p.10] or skip-free to the left [StWe89]; i.e. whatever the action taken, the process cannot pass from one state to a ‘lower’ state without passing through all the intervening states. Such skip-free models arise naturally in many areas where OR is applied. The most obvious examples are the control of discrete time random walks and continuous time birth and death processes [Ser81] such as queueing control problems with single unit arrivals and departures (see, for example, StWe89 and references therein). In these basic one-dimensional models, the state space is (a subset of) the integer lattice and transitions are only possible to the next higher or lower integer state. However there are several other standard OR models that fall within the wider one-dimensional skip-free framework, including examples from the areas of queueing control with batch arrivals [StWe89], inventory control [Mil81] and reliability and maintenance [Der70, Tho82].
Previous treatments of controlled skip-free processes have considered only the one-dimensional formulation. For processes with the ‘skip-free to the left’ property, work has focused on qualitative properties, in particular the existence of monotone optimal policies for models with appropriately structured cost functions [StWe89, StWe99]. Conversely, work on processes with the corresponding ‘skip-free to the right’ property has concentrated on analysis of an approximating bisection method for countable state space models [WiSt86, WiSt00]. We note that skip-free type ideas have also been exploited in a different direction by [Whi05] and citing authors, where the emphasis has been on reducing the computational complexity associated with policy iteration for quasi birth-death processes.
An intuitive way of characterising the essential features of our finite skip-free recurrent model is that the model is skip-free if and only if the state space can be identified with the graph of a finite tree, rooted at , with each state corresponding to a unique node in the tree, and such that for every action , the only possible transitions from state under action are either to its ‘parent’ state or to a state in the subtree rooted at , with appropriate modifications for state which has no parent and for terminal nodes which have only a parent and no descendants.
In this setting, the one-dimensional skip-free model above, with state space , corresponds to the simplest case where the tree reduces to a single linearly ordered branch connecting the root node through states to the terminal node , and transitions from state are possible only to states . However, the analysis extends easily to cases with a richer, possibly multidimensional, state space, where the appropriate model is in terms of transitions on a finite tree. Examples of genuinely skip-free models with multidimensional state spaces arise in simple multi-class queueing systems with batch arrivals [YeSe94, He00, and references therein], but such treatments have focused mainly on describing the behaviour of the process for a fixed set of parameters (actions) rather than comparing actions in an optimality framework.
The rest of the paper is organized as follows. We start by describing models for average cost finite state recurrent MDPs that are skip-free in the negative direction, illustrating our approach with a motivating example. We then propose a skip-free algorithm that combines the advantages of value iteration and policy iteration: the computational effort required for each iteration step is comparable with that for value iteration, but the algorithm is guaranteed to converge after a finite number of iterations and automatically identifies the optimal value function and an optimal policy on exit. We go on to show that the algorithm can also be used to solve discounted cost models and continuous time models, and that a suitably modified algorithm can be used to solve communicating models. Finally, we build on the relationship between the average cost problem and a corresponding -revised first passage problem to provide a proof of the main theorem and identify other possible variants of the algorithm.
2 The skip-free MDP model
Consider a discrete time Markov decision process (MDP) with finite state space over an infinite time horizon . Associated with each state is a non-empty finite set of possible actions; since is finite, we assume without loss of generality that the set of actions is the same for each . If action is chosen when the process is in state at time , then the process incurs an immediate cost and the next state is
A policy is a sequence of (possibly history dependent and randomised) rules for choosing the action at each given time point . A deterministic decision rule corresponds to a function and specifies taking action when the process is in state . A stationary deterministic policy is one which always uses the same deterministic decision rule at each time point . Where the meaning is clear from the context, we use the same notation for both the decision rule and the corresponding stationary deterministic policy.
The expected average cost incurred by a policy with initial state is given by where is the state at time and is the action chosen at time under . Similarly, for a given discount factor , the total expected discounted cost incurred by a policy with initial state is given by
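For concreteness, the two criteria can be written in the usual way. The symbols used here ($c(i,a)$ for the immediate cost, $X_t$ and $A_t$ for the state and action at time $t$, $\beta$ for the discount factor, $g^{\pi}$ and $v^{\pi}_{\beta}$ for the two criteria) are assumed notation chosen to match standard usage in [Put94]; the displays are a sketch of the standard definitions and may differ in detail from the original, for example in using a limit rather than a limit superior.

```latex
\begin{align*}
g^{\pi}(i) &= \limsup_{N \to \infty} \frac{1}{N}\,
  \mathbb{E}^{\pi}\!\left[ \sum_{t=0}^{N-1} c(X_t, A_t) \,\middle|\, X_0 = i \right], \\
v^{\pi}_{\beta}(i) &= \mathbb{E}^{\pi}\!\left[ \sum_{t=0}^{\infty} \beta^{t}\, c(X_t, A_t) \,\middle|\, X_0 = i \right],
  \qquad 0 \le \beta < 1 .
\end{align*}
```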
We say an MDP model is recurrent if the transition matrix corresponding to every stationary deterministic policy consists of a single recurrent class. We say an MDP model is communicating if, for every pair of states and in , is reachable from under some (stationary deterministic) policy ; i.e. there exists a policy , with corresponding transition matrix , and an integer , such that .
When is a subset of the integer lattice, we say the MDP model is skip-free in the negative direction [Kei65, StWe89] if for all and , i.e. the process cannot move from state to a state with index without passing through all the intermediate states. We will often find it easier to work in terms of the upper tail probabilities . To avoid degeneracy, we assume that for and that for each , for at least one . In this setting, a recurrent model requires that, for all , for and for all . In contrast a communicating model allows there to be and with and/or .
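The one-dimensional skip-free condition and the upper tail probabilities are easy to check numerically. The following is a minimal sketch, assuming the states are labelled 0, 1, ..., M and the transition probabilities are stored as nested lists `P[a][i][j]` (probability of moving from state `i` to state `j` under action `a`); the function names and the data layout are illustrative choices, not taken from the paper.

```python
def is_skip_free_left(P, tol=0.0):
    """Return True if no action allows a jump from i to any state j < i - 1.

    P[a][i][j] is the assumed layout: action a, current state i, next state j.
    """
    for Pa in P:
        for i, row in enumerate(Pa):
            # States 0 .. i-2 lie strictly below the only permitted lower state i-1.
            if any(p > tol for p in row[:max(i - 1, 0)]):
                return False
    return True


def upper_tail(P, a, i, k):
    """Probability that the next state is k or higher, from state i under action a."""
    return sum(P[a][i][k:])


# Toy model with three states and a single action (illustrative numbers only).
P_ok = [[[0.5, 0.5, 0.0],
         [0.3, 0.3, 0.4],
         [0.0, 0.6, 0.4]]]
print(is_skip_free_left(P_ok))
print(upper_tail(P_ok, 0, 1, 1))
```

Changing the last row to `[0.1, 0.5, 0.4]` would allow a direct jump from state 2 to state 0, and the check would then fail.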
To apply this idea in a wider context, we note that the essence of a skip-free model is that: (i) there is a single distinguished state, say ; (ii) for any other state there is a unique shortest path from to ; (iii) from each state the process can only make transitions to either the adjacent state in the unique path from to , or to some state for which lies in the unique shortest path from to .
In the finite one-dimensional case, for each there is exactly one state for which the shortest path to state has length . Thus there is a one-to-one mapping of the states to the integers such that the distinguished state maps to and the state for which the shortest path had length maps to . In a more general setting, for each there may be more than one state for which the shortest path has length . In this case, rather than mapping to the integer lattice, there is a fixed tree (in the graph theoretic sense) such that each state corresponds to a unique node of the tree, with the distinguished state mapping to the root node. It may help to visualise movement between states in terms of the corresponding movement between nodes on the tree.
To formalise this general model, we start by considering a finite rooted tree with nodes labelled , with root node , and with a given edge set. The tree structure implies that for each pair of nodes and there is a unique minimal path (set of edges) in the tree that connects and . Thus the nodes in the tree can be partitioned into level sets such that, for , if and only if the minimal path from to passes through exactly intermediate nodes.
For adjacent nodes and , we say is the parent of and is a child of if the minimal path from to passes through . More generally, for and , we say is a descendant of if the minimal path from to passes through . Each node has a unique parent. We write for the parent of , we write for the set of descendants of , and we write for (the nodes of the) sub-tree rooted at , so . A state with no descendants is said to be a terminal state, so all states in the highest level are terminal states. For simplicity of presentation we will assume that these are the only terminal states; the analysis easily extends to cases where intermediate levels can also contain some terminal states. For each , we write for the set of states following in the unique minimal path in the tree connecting to , so if the path passes through intermediate states and takes the form , then .
Now consider a finite MDP with state space and action space . Assume we can construct a rooted tree such that (i) the states in correspond to the nodes of , and (ii) for every state and action , the only possible transitions from state under action are either to its parent state or to a state in the subtree rooted at , with appropriate modifications for state which has no parent and for terminal nodes which have only a parent and no descendants. We will say that such an MDP is skip-free (in the negative direction) on the tree . As with the integer lattice model above, it is often convenient to work in terms of the upper tail probabilities corresponding to the probability that the next transition from state under action is to a state in the subtree rooted at .
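The tree condition can be checked mechanically once the parent of each node is known. The sketch below assumes a hypothetical representation in which the tree is a dictionary `parent` mapping each non-root node to its parent, and the transition law is a dictionary `trans[(i, a)]` mapping next states to probabilities; none of these names come from the paper.

```python
def in_subtree(parent, node, top):
    """True if `node` lies in the subtree rooted at `top` (including `top` itself)."""
    while node is not None:
        if node == top:
            return True
        node = parent.get(node)  # the root has no entry, so the walk stops there
    return False


def is_skip_free_on_tree(parent, trans):
    """Check that every transition goes to the parent or into the current subtree."""
    for (i, _a), row in trans.items():
        for j, p in row.items():
            if p > 0 and j != parent.get(i) and not in_subtree(parent, j, i):
                return False
    return True


def upper_tail(parent, row, k):
    """Probability that the next state lies in the subtree rooted at k."""
    return sum(p for j, p in row.items() if in_subtree(parent, j, k))


# Illustrative tree: root 0 with children 1 and 2; node 3 is a child of node 1.
parent = {1: 0, 2: 0, 3: 1}
trans = {(1, "a"): {0: 0.5, 1: 0.2, 3: 0.3}}
print(is_skip_free_on_tree(parent, trans))
print(upper_tail(parent, trans[(1, "a")], 1))
```

A transition from node 3 directly to the root, or from node 1 to its sibling node 2, would violate the condition, since neither target is the parent or a member of the current subtree.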
To illustrate and motivate the general case, where a multidimensional model is required, consider ([He00, YeSe94]) a single-server multi-class queueing system with customer classes and finite capacity (including the job, if any, in service). Assume the service discipline is pre-emptive but otherwise takes no account of class. A job that arrives when the system is not full enters service immediately and the job currently in service at that point returns to the head of the buffer. When a job completes service, the server next serves the job at the head of the buffer. Any job that arrives when the system is full is lost.
The model is most naturally formulated in continuous time, with exponential inter-arrival and service time distributions, though it can easily be translated to a discrete time setting using the methods of Section 4.2. Assume class jobs arrive at rate and complete service at class- and action-dependent rate , where different actions correspond to different service levels. Since the model needs to keep track of the class of each job as it enters service, we take the state to be the multidimensional vector where denotes the class of the job currently in service, denotes the class of the job waiting for service in the buffer in place , and if the th place is empty. Assume costs are incurred at rate reflecting both holding costs and action costs.
The possible transitions under the model are the completion of the job currently in service, corresponding to the transition , or the arrival of a class job () to a partially full system, corresponding to the transition .
For this model cannot be represented as a skip-free MDP with linear structure, i.e. with each state having exactly one child with . To see this, let denote the state with , let denote the state , differing from in only the first component, and let denote the state . The only possible direct transitions to and from are from and to . Similarly for . If is restricted to having just one child, then the only possibilities are either (i) has no parent (so is the root state), and , or (ii) has no parent (so is the root state), and . In case (i), can have no children so none of the other states can reach the root state as they cannot reach in a skip-free manner under any policy; in case (ii) can have no children and a similar argument applies.
However we can represent the model as a skip-free MDP on a tree as follows. We take to contain the state corresponding to the empty queue and take the level sets to each contain the states of the form . Given a state we assign it parent and assign it children of the form (with appropriate modifications for and ). The set of descendants is the set of all states of the form for (where there are trailing s). The possible transitions under the model correspond exactly to transitions from to its parent or to one of its children, so the MDP satisfies the conditions required for it to be skip-free in the negative direction on the tree . Figure 1 illustrates the tree corresponding to the state space for a system with job classes and capacity . Extensions with direct transitions to more general descendants, of form are possible if batch arrivals are allowed, subject to appropriate capacity constraints.
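The tree for the queueing example can be enumerated directly. The sketch below makes two representational assumptions that are not in the paper: job classes are numbered 1, ..., K, and a state is stored as a tuple listing the job in service followed by the waiting jobs in buffer order, with the trailing empty places dropped (so the empty system is the empty tuple). A service completion removes the first entry, giving the parent; a pre-emptive arrival of class k adds k at the front, giving a child.

```python
from itertools import product


def queue_tree(num_classes, capacity):
    """Enumerate states, parents and children for the pre-emptive multi-class queue."""
    classes = range(1, num_classes + 1)
    # Level n holds all class sequences of length n; level 0 is the empty system.
    states = [s for n in range(capacity + 1) for s in product(classes, repeat=n)]
    # Service completion: the job in service leaves and the head of the buffer starts.
    parent = {s: s[1:] for s in states if s}
    # Arrival to a system that is not full: the new job pre-empts the one in service.
    children = {
        s: [(k,) + s for k in classes] if len(s) < capacity else []
        for s in states
    }
    return states, parent, children


states, parent, children = queue_tree(num_classes=2, capacity=2)
print(len(states))
print(parent[(1, 2)], children[(2,)])
```

With K classes and capacity N this gives 1 + K + ... + K^N states, and each full state is a terminal node because arrivals to a full system are lost.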
3 The skip-free algorithm
For finite recurrent MDP models, the solution to the expected average cost problem can be characterised by the corresponding average cost optimality equations [Put94, §8.4]
in that (i) there exist real numbers and satisfying the optimality equations; (ii) the optimal average cost is the same for each initial state and is given by ; (iii) the optimality equations uniquely determine and determine the up to an arbitrary additive constant; (iv) the stationary deterministic policy is an average cost optimal policy, where, for each , is an action achieving
It follows from (iv) above that there is an optimal policy in the class of stationary deterministic policies. We therefore restrict attention from now on to stationary deterministic policies, writing ‘policy’ as a shorthand for ‘stationary deterministic policy’ and writing for the average cost under a given stationary deterministic policy .
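For reference, the standard form of the average cost optimality equations for a finite recurrent model, as given in [Put94], is shown below. The symbols are assumed notation rather than a transcription of the original display: $S$ and $A$ are the state and action sets, $c(i,a)$ the immediate cost, $p_{ij}(a)$ the transition probability, $g$ the optimal average cost, $h(i)$ the relative costs and $d^{*}$ the decision rule built from minimising actions.

```latex
\begin{align*}
g + h(i) &= \min_{a \in A} \Big\{ c(i,a) + \sum_{j \in S} p_{ij}(a)\, h(j) \Big\},
  \qquad i \in S, \\
d^{*}(i) &\in \operatorname*{arg\,min}_{a \in A} \Big\{ c(i,a) + \sum_{j \in S} p_{ij}(a)\, h(j) \Big\},
  \qquad i \in S .
\end{align*}
```

Adding the same constant to every $h(i)$ leaves both lines unchanged, which is why the relative costs are determined only up to an additive constant.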
For each , we can interpret as the asymptotic relative difference in the total cost that results from starting the process in state rather than state , under the stationary deterministic policy . Thus the quantities are uniquely defined, but the quantities are defined only up to an arbitrary additive constant. We focus on the particular solution normalised by setting and refer to the corresponding as the normalised relative costs under an optimal policy.
In general, the optimality equations (1) cannot be solved directly. Instead an optimal policy in the class of stationary deterministic policies is usually found by methods based on value iteration, policy iteration or linear programming, or combinations of these approaches [Put94]. For skip-free models, however, we have the following simplification.
For finite recurrent skip-free average cost MDPs, the optimality equations (1) are equivalent to the equations
in that (i) these equations also have unique solutions and ; (ii) the optimal average cost is and the normalised relative costs under an optimal policy satisfy ; (iii) an optimal stationary deterministic policy is given by , where is any action minimising the rhs of the corresponding equation for and is an action minimising the rhs in (2c).
Proof For skip-free models, the only possible transitions from state are to state , to state itself, or to a state . Thus equations (1) take the form
with appropriate modification to give the normalised solution with . Values and satisfy (3) if and only if in each equation the rhs for all , with equality for at least one . With appropriate modifications for the root node and for terminal nodes, simple rearrangement in shows that if and only if and that equality in one expression implies equality in the other.
Now write for and for each write for . For each , write for the states following in the unique minimal path from to . For each , is the parent of so that . Hence . Now if is a descendant of and is in the path connecting and , then is a descendant of and is in the subtree rooted at , and vice versa. Thus for fixed and we have that .
In the optimality equations (2), the value of depends only on , and in each subsequent equation the value of depends only on and the values of for . Thus, if the value of was known, it would be easy to compute the in turn for and to determine the corresponding policy which takes the optimal action in each state .
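To make this triangular structure explicit, here is one way of writing the recursion in the one-dimensional case. It is a sketch derived from the standard optimality equations under assumed notation, not a transcription of equations (2): states are $0, 1, \dots, M$, $y(i) = h(i) - h(i-1)$ is the difference in relative cost between adjacent states, $q_{ik}(a) = \sum_{j \ge k} p_{ij}(a)$ is the upper tail probability, and $p_{i,i-1}(a) > 0$ for $i \ge 1$ as required for a recurrent model.

```latex
\begin{align*}
y(M) &= \min_{a \in A} \frac{c(M,a) - g}{p_{M,M-1}(a)}, \\
y(i) &= \min_{a \in A} \frac{c(i,a) - g + \sum_{k=i+1}^{M} q_{ik}(a)\, y(k)}{p_{i,i-1}(a)},
  \qquad i = M-1, \dots, 1, \\
g &= \min_{a \in A} \Big\{ c(0,a) + \sum_{k=1}^{M} q_{0k}(a)\, y(k) \Big\}.
\end{align*}
```

The first two lines follow by substituting $h(j) = h(i-1) + \sum_{k=i}^{j} y(k)$ for $j \ge i$ into the optimality equation for state $i$, collecting the terms in $y(i)$ and using $1 - q_{ii}(a) = p_{i,i-1}(a)$; the last line is the optimality equation for state $0$ with $h(0) = 0$.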
This observation motivates an iterative approach to finding an average cost optimal policy:
(i) choose an initial policy and compute its expected average cost ;
(ii) given a current policy with expected average cost , compute an updated policy by setting and solving (2a) and (2b), and compute its expected average cost ;
(iii) iterate until convergence.
This approach forms the basis for the following skip-free algorithm. Its properties are set out in the subsequent theorem.
Choose an arbitrary initial policy . Perform a single iteration of step 2 below, with and with restricted to the single value , . Set .
Set for and set .
If then return to step 2.
If then stop. Return as an optimal policy, return as the optimal average cost, and for each return as the corresponding normalised relative cost.
Consider the skip-free algorithm above applied to a finite recurrent skip-free average cost MDP model. Then:
(i) At each iteration either , so is a strict improvement on , or . In the latter case , is an optimal average cost policy, and the corresponding normalised relative costs are given by .
(ii) The algorithm converges after a finite number of iterations.
Remarks (1) The motivation for the particular choice of action in state is given in the remarks following the proof of the theorem. (2) The updates are particularly simple in the one-dimensional case where . Here simplifies to and simplifies to . (3) The computational requirement for each iteration in step 2 of the algorithm is clearly similar to that of the corresponding step in value iteration, in that it only requires simple evaluations rather than the solution of a set of equations. While the algorithm is also similar to policy evaluation in that it returns the average cost of policy at the end of the th iteration, it differs from standard policy iteration in that the values of returned do not correspond to the relative costs under . Only at convergence do the relative costs and average cost correspond to the same (optimal) policy. (4) The basic principle underlying this iterative approach appears to be similar to that used in [Low74], but the results there were restricted to a very specific model with simple birth and death structure. Other treatments of skip-free models [WiSt86, StWe89, StWe99, WiSt00] have used iterative methods to search for a good approximation for the average cost , based on the value of current and previous approximations, or used the form of the optimality equations to derive qualitative properties of the solution, in particular monotonicity of optimal policies, but neither approach explicitly identified the simple skip-free improvement algorithm described here.
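The iteration can be sketched in code for the one-dimensional case. This is an illustrative reconstruction under stated assumptions, not the authors' exact update formulas: states are 0, ..., M, `P[a][i][j]` and `c[i][a]` are assumed data layouts, every action has a positive probability of stepping down from each state i >= 1 (recurrent model), and the action in state 0 is chosen to minimise the revised cost of first return to state 0, which is one valid choice and may differ from the choice made in the paper. The new average cost is obtained from the renewal-reward identity, as the old value plus the revised first return cost divided by the expected first return time.

```python
def skip_free_average_cost(P, c, d0=None, tol=1e-12, max_iter=10_000):
    """Sketch of a skip-free improvement iteration for states 0..M.

    Returns (g, d, h): average cost, policy, and relative costs with h[0] = 0.
    """
    n_actions, M = len(P), len(P[0]) - 1

    def tail(a, i, k):
        return sum(P[a][i][k:])  # probability of moving from i to k or above

    def sweep(g, allowed):
        # y[i], t[i]: g-revised cost and expected time of first passage i -> i-1;
        # for i = 0 they refer to first return to state 0.
        y, t, d = [0.0] * (M + 1), [0.0] * (M + 1), [0] * (M + 1)
        for i in range(M, -1, -1):
            best = None
            for a in allowed(i):
                ks = range(i + 1, M + 1)
                cost = c[i][a] - g + sum(tail(a, i, k) * y[k] for k in ks)
                time = 1.0 + sum(tail(a, i, k) * t[k] for k in ks)
                if i > 0:
                    down = P[a][i][i - 1]  # assumed strictly positive
                    cost, time = cost / down, time / down
                if best is None or cost < best[0]:
                    best = (cost, time, a)
            y[i], t[i], d[i] = best
        return y, t, d

    d = list(d0) if d0 is not None else [0] * (M + 1)
    # Initial step: evaluate the starting policy with g = 0, so that
    # y[0] / t[0] is (expected cost per cycle) / (expected cycle length).
    y, t, _ = sweep(0.0, lambda i: [d[i]])
    g = y[0] / t[0]
    for _ in range(max_iter):
        y, t, d_new = sweep(g, lambda i: range(n_actions))
        g_new = g + y[0] / t[0]  # exact average cost of d_new
        if g_new >= g - tol:     # no strict improvement: stop
            h = [0.0] * (M + 1)
            for i in range(1, M + 1):
                h[i] = h[i - 1] + y[i]
            return g, d_new, h
        g, d = g_new, d_new
    raise RuntimeError("no convergence within max_iter iterations")


# Two-state toy model: in state 1, action 1 costs more but always steps down.
P = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [1.0, 0.0]]]
c = [[0.0, 0.0], [1.0, 1.2]]
print(skip_free_average_cost(P, c))
```

Each sweep uses only arithmetic on quantities already computed for higher states, so no linear system is solved; the tolerance `tol` stands in for the exact equality test in the stopping rule, which is not reliable in floating point.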
4 Discounted, continuous and communicating models
The skip-free algorithm can also be used to solve discounted cost and continuous time problems, in each case by transforming the problem into an equivalent average cost problem. Moreover, a suitably modified algorithm can be used to solve communicating models. For ease of presentation, we focus on the one-dimensional case, indicating how the argument can be extended to the general model as required.
4.1 Discounted cost models
Consider a recurrent MDP model that is skip-free in the negative direction, with state space , finite action space , transition probabilities , immediate costs and discount factor . Following [Der70, p.31], we construct an average cost MDP with modified state space and modified transition probabilities and immediate costs given by:
In the spirit of similar models [Low74, WiSt86], we note that this new average cost MDP inherits from the original model the property of being skip-free in the negative direction.
Let and be the optimal average cost and the corresponding relative costs for the new average cost problem, normalised by setting . From above, and , are the unique solutions to the optimality equations (1), and any set of actions achieving the minimum on the rhs defines an optimal policy. In terms of the original parameters, these equations take the form
Now set . Then rewriting the equations for in terms of , we see that the satisfy the equations
Thus the satisfy the optimality equations for the discounted cost problem, and so represent the unique optimal discounted cost function [Put94, p.148].
Finally, let and be solutions to the policy iteration algorithm applied to the new skip-free average cost problem. Then and . Thus the optimal value function for the discounted problem is given explicitly in terms of the output of the policy iteration algorithm by
and a policy which is optimal for the modified average cost problem is also optimal for the original discounted cost problem.
The extension to the general skip-free MDP tree model is straightforward, requiring just the addition of an extra state for each terminal state (node) to preserve the skip-free property. This extra state now becomes the terminal node in that branch. Transitions from this extra state are to the corresponding previous terminal node, with probability , or back to itself, with probability . Transition probabilities from non-terminal states are modified as above, by setting if is a non-terminal node of the modified tree and by assigning the remaining transition probability to the newly added terminal nodes of the modified sub-tree rooted at . The precise assignment may be chosen arbitrarily – for example, each new terminal node in the modified sub-tree may be chosen with equal probability – as long as the total probability sums to .
4.2 Continuous time models
Consider a continuous time Markov decision process (CTMDP) with finite state space and finite action space . Assume that when the current action is and the process is in state , the process incurs costs at rate and makes transitions to state at rate (where transitions back to the same state are allowed). For infinite horizon problems, under either an average cost or a discounted cost criterion, we can restrict attention to stationary policies and to models in which decisions are made only at transition epochs [Put94, p.560]. For simplicity of presentation we again restrict attention to recurrent models and defer treatment of unichain and communicating models to Section 4.3. As for MDPs, we say a CTMDP is skip-free in the negative direction if the process cannot move from each state to a state without passing through all the intermediate states, i.e. for all and .
To apply the skip-free algorithm, we first convert the model to an equivalent uniformised model [Lip75] with rate . In this model, when the current action is and the process is in state , transitions back to state occur at rate while transitions to state occur at rate , so that overall transitions occur at uniform rate . Next we construct a discrete time problem with the same state and action space, where for and the transition probabilities and immediate costs are given by . If the original CTMDP is recurrent and skip-free, then the discretised model is recurrent and skip-free and can be solved using the algorithm.
Finally, let and be the optimal policy and the optimal average cost identified by the algorithm for the discrete time problem. Then the optimal policy and the optimal average cost for the uniformised continuous time problem are the same as and , and the normalised relative costs for the uniformised problem are given in terms of those for the discrete problem by [Put94, §11.5].
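The uniformisation step can be sketched as follows. The layout `rates[a][i][j]` (transition rate from state `i` to state `j` under action `a`), the function name and the default choice of the uniform rate as the largest total exit rate are assumptions made for illustration. The sketch covers only the transition probabilities; how the cost rates are scaled is not recoverable from the text, so it is left out. One relevant fact is that the uniformised chain has the same stationary distribution as the continuous time process under any fixed policy, so if cost rates are used unchanged as per-step costs, the per-step average cost equals the average cost per unit time.

```python
def uniformise(rates, Lam=None):
    """Convert transition rates into transition probabilities at uniform rate Lam.

    Any rate on the diagonal (a real transition back to the same state) is
    absorbed into the self-transition probability, since it leaves the state
    unchanged.
    """
    # Total rate of leaving each state, for every action.
    out = [[sum(r for j, r in enumerate(row) if j != i)
            for i, row in enumerate(Ra)] for Ra in rates]
    max_out = max(max(o) for o in out)
    if Lam is None:
        Lam = max_out
    if Lam <= 0 or Lam < max_out:
        raise ValueError("Lam must be positive and at least the largest exit rate")
    P = []
    for Ra, oa in zip(rates, out):
        Pa = []
        for i, row in enumerate(Ra):
            new_row = [r / Lam for r in row]
            new_row[i] = 1.0 - oa[i] / Lam  # fictitious plus real self-transitions
            Pa.append(new_row)
        P.append(Pa)
    return P, Lam


# Two states, one action: rate 2 from state 0 up to 1, rate 3 from state 1 down to 0.
P, Lam = uniformise([[[0.0, 2.0], [3.0, 0.0]]])
print(Lam, P)
```

Because a zero rate becomes a zero probability, a skip-free rate structure is carried over unchanged to the discrete time model.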
4.3 Communicating models
So far we have assumed the MDP model is recurrent. There are natural applications for which this assumption excludes sensible policies, such as policies that are recurrent only on a strict subset of . Simple examples include: maintenance/replacement problems where a policy might specify replacing an item when the state reached some lower level with an item of level ; inventory problems where a policy might reorder when the stock reached some lower level and/or reorder up to level ; queueing control problems where a policy might turn the server off when the queue size reached some lower level and/or might refuse to admit new entrants when the queue size reached level . In each case, determining optimal values for and might be part of the problem. In this section we extend our result to the wider class of communicating MDP models, to enable us to address examples like these.
We say an MDP model is communicating if, for every pair of states and in , is reachable from under some (stationary deterministic) policy ; i.e. there exists a policy , with corresponding transition matrix , and an integer , such that . We say that is unichain if it decomposes into a single recurrent class plus a (possibly empty) set of transient states; if there is more than one recurrent class we say is multichain. Let be a multichain policy and, for each , let denote the average cost under starting in a state in , and let be a recurrent set with smallest average cost, say . Because the model is skip-free, must consist of a sequence of consecutive states ; again, because the model is skip-free, the action in each state greater than can be changed if necessary so that is reachable from ; finally, because the model is communicating, the action in each state less than can be changed if necessary so that is reachable from . Denote by the new policy created by changing actions in this way, if necessary, but leaving the actions in unchanged. Then is unichain by construction, and the average cost starting in each state is , which is no greater than the average cost starting in under . Thus, for average cost skip-free communicating models, nothing is lost by restricting attention to unichain policies.
In contrast to recurrent models, communicating models allow there to be and with and/or . For each , let be the (possibly empty) set of unichain policies for which but for (where we take for all for ). Every unichain policy must be in for some . Partition the possible actions for each state into and its complement , where may be empty but is non-empty by the assumptions of the skip-free model in Section 2. Then for a unichain policy , we have that ; that state is recurrent and by definition; and that states are transient.
Thus the minimum average cost over policies in is the same as the minimum average cost for a modified skip-free MDP model with the same transition probabilities and immediate costs but with reduced state space and with state-dependent action spaces for and . In this notation, the model of Section 2 corresponds to and state plays the same role as the recurrent distinguished state in that state plays in .
If we compare the result of applying the skip-free algorithm to with the result of applying it to , we see that, for the same current value of , the algorithm computes the same values of , , and in states . However, in state , the skip-free algorithm applied to computes quantities appropriate to the distinguished state, say and , where and computes an updated ‘minimising’ policy with average cost , where
This motivates the following modified skip-free algorithm. First, it includes these extra computations for each state , so that, in a single iteration, it simultaneously computes the optimal policy and its average cost for each . Secondly, at the end of the th iteration it sets , and sets to be the corresponding policy, where ties are broken by choosing the with the smallest index . Say the minimum average cost at this stage is achieved by a policy with index . Then, by the properties of the skip-free algorithm applied to , at the end of the next iteration either (i) , in which case ; or (ii) and , so and is an optimal average cost policy for starting states . In this case, because the model is communicating, it is possible [Put94, p.351] to modify the actions chosen by the policy in the, now transient, states so that the modified satisfies the optimality equations for all states and is an average cost optimal policy. We summarise this discussion in the following theorem.
Consider the skip-free algorithm modified as above applied to a finite communicating discrete time average cost skip-free MDP model with state space . Then:
(i) At each iteration of the skip-free algorithm either and is a strict improvement on , or and for some the policy satisfies the optimality equations for states .
(ii) The modified skip-free algorithm converges after a finite number of iterations.
Finally, note that it is easy to check if a skip-free model is communicating. An assumption of the (non-degenerate) skip-free model was that each state was reachable from . It follows that a skip-free MDP with state space is communicating if and only if is reachable from under at least one stationary deterministic policy . Let , let be the index of the maximum state for which for some , and for let be the index of the maximum state for which for some and . As the state space is finite, the sequence terminates, say with state . Since the model is skip-free, is the largest state that is reachable by all states below it, and the model is communicating if and only if .
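The check described above amounts to repeatedly extending the highest state known to be reachable from below. A minimal sketch for the one-dimensional case follows, assuming states 0, ..., M, the layout `P[a][i][j]` for transition probabilities, and (as in the non-degenerate skip-free model) that state 0 is reachable from every state; the function name is illustrative.

```python
def is_communicating_skip_free(P):
    """Return True if the top state M can be reached from state 0 under some policy."""
    n_actions = len(P)
    M = len(P[0]) - 1
    m = 0  # highest state known to be reachable from state 0
    while True:
        # Highest state reachable in one step from any state 0..m under any action.
        reach = max(
            (j for a in range(n_actions)
               for i in range(m + 1)
               for j in range(M + 1) if P[a][i][j] > 0),
            default=m,
        )
        if reach <= m:      # the sequence has stopped growing
            return m == M
        m = reach


# With only the first action, state 1 can never move up, so state 2 is unreachable.
P_no = [[[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 1.0, 0.0]]]
P_yes = P_no + [[[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]]
print(is_communicating_skip_free(P_no), is_communicating_skip_free(P_yes))
```

Because the model is skip-free, every state between 0 and a reachable state must be visited on the way back down, so tracking only the maximum reachable index is enough.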
The extension to general skip-free communicating models is straightforward. Again, the idea is that for each state the skip-free algorithm is modified so that in passing it solves the corresponding sub-problem with state space and with state as the distinguished state, and then computes the optimal updated average cost and policy by minimising over the costs and policies for each of the sub-problems.
5 Proof of Theorem 2
We start our analysis of the average cost MDP model by defining a related problem (or class of problems) that we will call the -revised first return problem. The model for this problem has the same state space , the same action space and the same transition probabilities as the average cost model. However, for each fixed , the immediate costs in the corresponding -revised problem are revised downward by , so is revised to . Whereas the original problem was to find a policy that minimised the expected average cost , the objective for this new problem is to find a policy that minimises the expected -revised cost until first return to state , where, for a process starting with , we define the first return epoch to state to be the smallest value such that and . The MDP is assumed recurrent under any stationary deterministic policy, so is well defined and almost surely finite.
For a fixed policy , starting in state , write for the expected first return epoch under , for the expected first return cost under , and for the expected -revised first return cost under . The average costs and the -revised costs under are related by the equations
where the first equation follows from viewing the average cost problem from a renewal-reward perspective [Ros70, p.160] and noting that state is recurrent under any stationary deterministic policy , and the second follows from noting that the expected -revised cost under until first return to state is just the original expected cost adjusted downwards by an amount for an expected time period .
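The two relations described here can be written out as follows. The symbols are assumed notation, since the original display is not available: $T^{d}$ is the expected first return epoch, $C^{d}$ the expected first return cost, $C^{d}_{g}$ the expected $g$-revised first return cost, and $g^{d}$ the average cost under policy $d$. The third line is a direct consequence of the first two and is included only as a convenient summary.

```latex
\begin{align*}
g^{d} &= \frac{C^{d}}{T^{d}}, \\
C^{d}_{g} &= C^{d} - g\, T^{d}, \\
C^{d}_{g} &= \left( g^{d} - g \right) T^{d}.
\end{align*}
```

Since $T^{d} > 0$, the last line shows that the revised first return cost is negative, zero or positive exactly when the average cost of $d$ is below, equal to or above the revision level $g$.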
Proof Since the process is Markov and skip-free in the negative direction, it follows that a policy minimises the expected -revised cost until first return to state if and only if it also minimises the expected -revised total cost until first passage to state for each starting state , i.e. , and hence minimises the expected cost until first passage from to for each . For the one-dimensional case where , this problem has been called the -revised first passage problem [StWe89]. For fixed and , let be actions minimising the rhs in equations (2a) and (2b) and let be the corresponding values. Then they show that the policy that takes action in state is optimal for the -revised first passage problem and the minimal expected cost until first passage from to is given by . With only minor notational changes, their results extend directly to the general case where corresponds to the nodes of a tree, is replaced by and is replaced by . It follows that the policy that uses actions in has the property that for each state it also minimises the expected total -revised cost until first passage to state and that the minimum expected -revised total cost until first passage to state , starting in state , is given by the sum of the values along the path from to , i.e. .
Now consider a process that starts in state . Under a policy that specifies action in state , the expected time until the process first leaves state is