1 Introduction
We consider a discrete-time dynamic system whose state transition depends on a control. We assume that there is a state space $X$ of finite size $n$. When at state $s \in X$, the control is chosen from a control space $A$ of finite size $m$\footnote{In the works of ye; ye2; hansen that we reference, the integer $m$ denotes the total number of actions, that is $nm$ with our notation. When we restate their results, we do so with our own notation, that is, we replace their $m$ by $nm$.}. The control $a \in A$ specifies the transition probability $P(s'|s,a)$ to the next state $s'$. At each transition, the system is given a reward $r(s,a)$, where $r$ is the instantaneous reward function. In this context, we look for a stationary deterministic policy (a function $\pi : X \to A$ that maps states into controls\footnote{Restricting our attention to stationary deterministic policies is not a limitation. Indeed, for the optimality criterion to be defined soon, it can be shown that there exists at least one stationary deterministic policy that is optimal (puterman).}) that maximizes the expected discounted sum of rewards from any state $s$, called the value of policy $\pi$ at state $s$:
$$v_\pi(s) = E\left[ \sum_{t=0}^\infty \gamma^t r\big(s_t, \pi(s_t)\big) \,\middle|\, s_0 = s,\ s_{t+1} \sim P(\cdot \mid s_t, \pi(s_t)) \right], \qquad (1)$$
where $\gamma \in (0,1)$ is a discount factor. The tuple $(X, A, P, r, \gamma)$ is called a Markov Decision Process (MDP) (puterman; ndp), and the associated problem is known as optimal control.
The optimal value starting from state $s$ is defined as $v_*(s) = \max_\pi v_\pi(s)$.
For any policy $\pi$, we write $P_\pi$ for the $n \times n$ stochastic matrix whose elements are $P_\pi(s,s') = P(s'|s,\pi(s))$, and $r_\pi$ for the vector whose components are $r_\pi(s) = r(s,\pi(s))$. The value functions $v_\pi$ and $v_*$ can be seen as vectors on $\mathbb{R}^n$. It is well known that $v_\pi$ is the solution of the following Bellman equation:
$$v_\pi = r_\pi + \gamma P_\pi v_\pi,$$
that is, $v_\pi$ is a fixed point of the affine operator $T_\pi v = r_\pi + \gamma P_\pi v$. It is also well known that $v_*$ satisfies the following Bellman equation:
$$v_* = \max_\pi \left( r_\pi + \gamma P_\pi v_* \right) = \max_\pi T_\pi v_*,$$
where the max operator is componentwise. In other words, $v_*$ is a fixed point of the nonlinear operator $T v = \max_\pi T_\pi v$. For any value vector $v$, we say that a policy $\pi$ is greedy with respect to the value $v$ if it satisfies
$$\pi \in \arg\max_{\pi'} T_{\pi'} v,$$
or equivalently $T_\pi v = T v$. With some slight abuse of notation, we write $\mathcal{G}(v)$ for any policy that is greedy with respect to $v$. The notions of optimal value function and greedy policies are fundamental to optimal control because of the following property: any policy $\pi_* \in \mathcal{G}(v_*)$ that is greedy with respect to the optimal value is an optimal policy, and its value $v_{\pi_*}$ is equal to $v_*$.
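To make these definitions concrete, here is a minimal sketch in Python on a hypothetical two-state, two-action MDP (the transition and reward tables below are illustrative, not taken from the text): policy evaluation computes the fixed point of the affine operator $T_\pi$ by iterating it, and the greedy policy performs the componentwise argmax.

```python
# Toy MDP (assumed for illustration): P[s][a][s2] is a transition probability,
# r[s][a] an instantaneous reward, gamma a discount factor in (0, 1).
gamma = 0.9
P = [[[0.8, 0.2], [0.1, 0.9]],   # transitions from state 0 under actions 0, 1
     [[0.5, 0.5], [0.0, 1.0]]]   # transitions from state 1 (action 1 self-loops)
r = [[1.0, 0.0],
     [0.0, 2.0]]
n, m = 2, 2

def evaluate(pi, iters=2000):
    """Fixed point of T_pi v = r_pi + gamma * P_pi v, by iterating T_pi."""
    v = [0.0] * n
    for _ in range(iters):
        v = [r[s][pi[s]] + gamma * sum(P[s][pi[s]][t] * v[t] for t in range(n))
             for s in range(n)]
    return v

def greedy(v):
    """A policy pi' satisfying T_{pi'} v = T v (componentwise argmax)."""
    return [max(range(m),
                key=lambda a: r[s][a] + gamma * sum(P[s][a][t] * v[t] for t in range(n)))
            for s in range(n)]
```

For instance, the policy that always plays action 1 self-loops in state 1 with reward 2, so its value there is $2/(1-\gamma) = 20$; it is also greedy with respect to its own value, hence optimal for this toy instance.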
Let $\pi$ be some policy. We call the advantage with respect to $\pi$ the following quantity:
$$a_\pi = \max_{\pi'} T_{\pi'} v_\pi - v_\pi = T v_\pi - v_\pi.$$
We call the set of switchable states of $\pi$ the following set:
$$S_\pi = \{\, s : a_\pi(s) > 0 \,\}.$$
Assume now that $\pi$ is non-optimal (this implies that $S_\pi$ is a non-empty set). For any non-empty subset $Y$ of $S_\pi$, we denote by $\mathrm{switch}(\pi, Y)$ a policy satisfying:
$$\forall s, \quad \mathrm{switch}(\pi, Y)(s) = \begin{cases} \mathcal{G}(v_\pi)(s) & \text{if } s \in Y, \\ \pi(s) & \text{otherwise.} \end{cases}$$
The following result is well known (see for instance puterman).
Lemma 1.
Let $\pi$ be some non-optimal policy. If $\pi' = \mathrm{switch}(\pi, Y)$ for some non-empty subset $Y$ of $S_\pi$, then $v_{\pi'} \ge v_\pi$, and there exists at least one state $s$ such that $v_{\pi'}(s) > v_\pi(s)$.
This lemma is the foundation of the well-known iterative procedure, called Policy Iteration (PI), that generates a sequence of policies $(\pi_k)$ as follows:
$$\pi_{k+1} = \mathrm{switch}(\pi_k, Y_k), \qquad (2)$$
where $Y_k$ is some non-empty subset of $S_{\pi_k}$.
The choice of the subsets $Y_k$ leads to different variations of PI. In this paper we will focus on two specific variations:

When for all iterations $k$, $Y_k = S_{\pi_k}$, that is, one switches the actions in all states with positive advantage with respect to $\pi_k$, the above algorithm is known as Howard’s PI; it can then be seen that $T_{\pi_{k+1}} v_{\pi_k} = T v_{\pi_k}$.

When for all $k$, $Y_k$ is a singleton containing a state $s_k \in \arg\max_s a_{\pi_k}(s)$, that is, if we only switch one action in the state with maximal advantage with respect to $\pi_k$, we will call it Simplex-PI\footnote{In this case, PI is equivalent to running the simplex algorithm with the highest-pivot rule on a linear-program version of the MDP problem (ye).}.
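As an illustration, the two variations can be sketched as follows on a toy two-state MDP (the tables and the tolerance `1e-9` are illustrative choices, not from the text). Howard's PI switches every state with positive advantage, while Simplex-PI switches a single state of maximal advantage:

```python
# Toy MDP (assumed): same conventions as before.
gamma = 0.9
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
r = [[1.0, 0.0], [0.0, 2.0]]
n, m = 2, 2

def evaluate(pi, iters=2000):
    v = [0.0] * n
    for _ in range(iters):
        v = [r[s][pi[s]] + gamma * sum(P[s][pi[s]][t] * v[t] for t in range(n))
             for s in range(n)]
    return v

def q(s, a, v):
    """One application of T for a single state-action pair."""
    return r[s][a] + gamma * sum(P[s][a][t] * v[t] for t in range(n))

def howard_pi(pi):
    """Switch, at every iteration, all states with positive advantage."""
    while True:
        v = evaluate(pi)
        adv = [max(q(s, a, v) for a in range(m)) - v[s] for s in range(n)]
        if max(adv) < 1e-9:                      # no switchable state: pi is optimal
            return pi
        pi = [max(range(m), key=lambda a: q(s, a, v)) if adv[s] > 1e-9 else pi[s]
              for s in range(n)]

def simplex_pi(pi):
    """Switch only one state of maximal advantage per iteration."""
    while True:
        v = evaluate(pi)
        adv = [max(q(s, a, v) for a in range(m)) - v[s] for s in range(n)]
        s_star = max(range(n), key=lambda s: adv[s])
        if adv[s_star] < 1e-9:
            return pi
        pi = list(pi)
        pi[s_star] = max(range(m), key=lambda a: q(s_star, a, v))
```

On this toy instance both variations reach the same optimal policy, as the theory guarantees; they differ only in how many states they switch per iteration.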
Since it generates a sequence of policies with increasing values, any variation of PI converges to the optimal policy in a number of iterations that is smaller than the total number of policies $m^n$. In practice, PI converges in very few iterations. On random MDP instances, convergence often occurs in time sublinear in $n$. The aim of this paper is to discuss existing upper bounds, and to provide new ones, on the number of iterations required by Howard’s PI and Simplex-PI that are much sharper than $m^n$.
In the next sections, we describe some known results (see ye for a recent and comprehensive review) about the number of iterations required by Howard’s PI and Simplex-PI, along with some of our original improvements and extensions. For clarity, all proofs are deferred to the later sections.
2 Bounds with respect to a Fixed Discount Factor
A key observation for both algorithms, which will be central to the results we are about to discuss, is that the sequences they generate satisfy a contraction property\footnote{A sequence of nonnegative numbers $(x_k)_{k \ge 0}$ is contracting with coefficient $\alpha$ if and only if for all $k$, $x_{k+1} \le \alpha x_k$.}. For any vector $u \in \mathbb{R}^n$, let $\|u\|_\infty = \max_s |u(s)|$ be the max-norm of $u$. Let $\mathbb{1}$ be the vector of which all components are equal to 1.
Lemma 2 (Proof in Section 5).
The sequence $\left( \|v_* - v_{\pi_k}\|_\infty \right)_{k \ge 0}$ built by Howard’s PI is contracting with coefficient $\gamma$.
Lemma 3 (Proof in Section 6).
The sequence $\left( \mathbb{1}^\top (v_* - v_{\pi_k}) \right)_{k \ge 0}$ built by Simplex-PI is contracting with coefficient $1 - \frac{1-\gamma}{n}$.
Though this observation is widely known for Howard’s PI, it was to our knowledge never mentioned explicitly in the literature for Simplex-PI. These contraction properties have the following immediate consequence\footnote{For Howard’s PI, we have $\|v_* - v_{\pi_k}\|_\infty \le \gamma^k \|v_* - v_{\pi_0}\|_\infty \le \gamma^k V_{\max}$. Thus, a sufficient condition for $\|v_* - v_{\pi_k}\|_\infty < \epsilon$ is $\gamma^k V_{\max} < \epsilon$, which is implied by $k \ge \frac{\log(V_{\max}/\epsilon)}{1-\gamma} \ge \frac{\log(V_{\max}/\epsilon)}{\log(1/\gamma)}$. For Simplex-PI, we have $\|v_* - v_{\pi_k}\|_\infty \le \mathbb{1}^\top(v_* - v_{\pi_k}) \le \left(1 - \frac{1-\gamma}{n}\right)^k n V_{\max}$, and the conclusion is similar to that for Howard’s PI.}.
Corollary 1.
Let $V_{\max}$ be an upper bound on $\|v_* - v_\pi\|_\infty$ for all policies $\pi$. In order to get an $\epsilon$-optimal policy, that is a policy $\pi_k$ satisfying $\|v_* - v_{\pi_k}\|_\infty \le \epsilon$, Howard’s PI requires at most $\left\lceil \frac{\log(V_{\max}/\epsilon)}{1-\gamma} \right\rceil$ iterations, while Simplex-PI requires at most $\left\lceil \frac{n \log(n V_{\max}/\epsilon)}{1-\gamma} \right\rceil$ iterations.
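As a quick numerical sanity check of the Howard's PI part of this corollary (with illustrative values $V_{\max} = 100$ and $\epsilon = 10^{-3}$, and using the elementary inequality $\log(1/\gamma) \ge 1-\gamma$):

```python
import math

# Corollary (Howard's PI part): k = ceil(log(Vmax/eps) / (1 - gamma)) iterations
# suffice, since gamma^k <= exp(-k*(1 - gamma)) <= eps/Vmax for this k.
gamma, Vmax, eps = 0.9, 100.0, 1e-3   # illustrative values, not from the text

k = math.ceil(math.log(Vmax / eps) / (1 - gamma))
assert gamma ** k * Vmax < eps        # the contraction has driven the error below eps
print(k)                              # → 116 iterations prescribed by the bound
```

The bound is pessimistic on purpose: it only uses the contraction coefficient, not the finiteness of the policy space exploited in the next results.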
These bounds depend on the precision term $V_{\max}/\epsilon$, which means that Howard’s PI and Simplex-PI are only weakly polynomial for a fixed discount factor $\gamma$. An important breakthrough was recently achieved by ye, who proved that one can remove the dependency with respect to $\epsilon$, and thus show that Howard’s PI and Simplex-PI are strongly polynomial for a fixed discount factor $\gamma$.
Theorem 1 (ye).
Simplex-PI and Howard’s PI both terminate after at most $n(m-1) \left\lceil \frac{n}{1-\gamma} \log \frac{n^2}{1-\gamma} \right\rceil$ iterations.
The proof is based on the fact that PI corresponds to the simplex algorithm in a linear programming formulation of the MDP problem. Using a more direct proof, hansen recently improved the result by a factor $n$ for Howard’s PI.
Theorem 2 (hansen).
Howard’s PI terminates after at most $n(m-1) \left\lceil \frac{1}{1-\gamma} \log \frac{n}{1-\gamma} \right\rceil$ iterations.
Our first two results, which are consequences of the contraction properties (Lemmas 2 and 3), are stated in the following theorems.
Theorem 3 (Proof in Section 7).
Howard’s PI terminates after at most $n(m-1) \left\lceil \frac{1}{1-\gamma} \log \frac{1}{1-\gamma} \right\rceil$ iterations.
Theorem 4 (Proof in Section 8).
Simplex-PI terminates after at most $n(m-1) \left\lceil \frac{n}{1-\gamma} \log \frac{n}{1-\gamma} \right\rceil$ iterations.
Our result for Howard’s PI is a factor $O(\log n)$ better than the previous best result of hansen. Our result for Simplex-PI is only very slightly better (by a factor 2) than that of ye, and uses a proof that is more direct. Using a more refined argument, we managed to also improve the bound for Simplex-PI by a factor $O(\log n)$.
Theorem 5 (Proof in Section 9).
Simplex-PI terminates after at most $n^2(m-1) \left( 1 + \frac{2}{1-\gamma} \log \frac{1}{1-\gamma} \right)$ iterations.
Compared to Howard’s PI, our bound for Simplex-PI is a factor $O(n)$ larger. However, since one changes only one action per iteration, each iteration may have a complexity lower by a factor $n$: the update of the value can be done in time $O(n^2)$ through the Sherman-Morrison formula, while in general each iteration of Howard’s PI, which amounts to computing the value of some policy that may be arbitrarily different from the previous policy, may require $O(n^3)$ time. Overall, both algorithms seem to have a similar complexity.
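The rank-one value update mentioned above can be sketched as follows (a generic Sherman-Morrison helper in pure Python; the $2 \times 2$ matrices are illustrative). When Simplex-PI changes the action in a single state $s$, only row $s$ of $I - \gamma P_\pi$ changes, i.e. the matrix is perturbed by $e_s d^\top$ for some vector $d$, so its inverse, and hence the value $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$, can be updated in $O(n^2)$ time:

```python
# Sherman-Morrison: (A + u v^T)^{-1} = A^{-1} - (A^{-1} u)(v^T A^{-1}) / (1 + v^T A^{-1} u).
# Updating A^{-1} this way costs O(n^2), versus O(n^3) for a fresh inversion.

def sherman_morrison(Ainv, u, v):
    n = len(Ainv)
    Au = [sum(Ainv[i][j] * u[j] for j in range(n)) for i in range(n)]   # A^{-1} u
    vA = [sum(v[i] * Ainv[i][j] for i in range(n)) for j in range(n)]   # v^T A^{-1}
    denom = 1.0 + sum(v[i] * Au[i] for i in range(n))                   # 1 + v^T A^{-1} u
    return [[Ainv[i][j] - Au[i] * vA[j] / denom for j in range(n)]
            for i in range(n)]

# Illustrative check: A = [[2, 0], [0, 4]] perturbed into A + e_0 e_1^T = [[2, 1], [0, 4]].
Ainv = [[0.5, 0.0], [0.0, 0.25]]
new_inv = sherman_morrison(Ainv, [1.0, 0.0], [0.0, 1.0])
# new_inv is the inverse of [[2, 1], [0, 4]], namely [[0.5, -0.125], [0.0, 0.25]].
```

In the Simplex-PI setting, one would take $u = e_s$ and $d = v$ equal to the (scaled) difference between the old and new rows of $\gamma P_\pi$.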
It is easy to see that the linear dependency of the bound for Howard’s PI with respect to $n$ is optimal. We conjecture that the linear dependency of both bounds with respect to $m$ is also optimal. The dependency with respect to the term $\frac{1}{1-\gamma} \log \frac{1}{1-\gamma}$ may be improved, but removing it is impossible for Howard’s PI and very unlikely for Simplex-PI. fearnley describes an MDP for which Howard’s PI requires an exponential (in $n$) number of iterations when $\gamma = 1$, and hollanders argued that this holds also when $\gamma$ is in the vicinity of 1. Though a similar result does not seem to exist for Simplex-PI in the literature, condon consider four variations of PI that all switch one action per iteration, and show through specifically designed MDPs that they may require an exponential (in $n$) number of iterations when $\gamma = 1$.
3 Bounds for Simplex-PI that are independent of $\gamma$
In this section, we describe some bounds that do not depend on $\gamma$ but that are based on some structural assumptions on the MDPs. On this topic, ye2 recently showed the following result for deterministic MDPs.
Theorem 6 (ye2).
If the MDP is deterministic, then Simplex-PI terminates after at most $O(n^5 m^3 \log^2 n)$ iterations.
Given a policy $\pi$ of a deterministic MDP, states are either on cycles or on paths induced by $\pi$. The core of the proof relies on the following lemmas, which altogether show that cycles are created regularly and that significant progress is made every time a new cycle appears; in other words, significant progress is made regularly.
Lemma 4.
If the MDP is deterministic, then after a number of iterations that is at most polynomial in $n$ and $m$, either Simplex-PI finishes or a new cycle appears.
Lemma 5.
If the MDP is deterministic, when Simplex-PI moves from $\pi$ to $\pi'$ where $\pi'$ involves a new cycle, we have
$$\mathbb{1}^\top (v_* - v_{\pi'}) \le \left( 1 - \frac{1}{n} \right) \mathbb{1}^\top (v_* - v_\pi). \qquad (3)$$
Indeed, these observations suffice to prove\footnote{This can be done by using arguments similar to the proof of Theorem 4 in Section 8.} that Simplex-PI terminates after a number of iterations that is polynomial in $n$, $m$ and $\frac{1}{1-\gamma}$. Removing completely the dependency with respect to the discount factor $\gamma$ requires a careful extra work described in ye2, which incurs an extra multiplicative term that is polynomial in $n$ and $m$.
At a more technical level, the proof of ye2 critically relies on some properties of the vector $x_\pi$ that provides a discounted measure of state visitations along the trajectories induced by a policy $\pi$ starting from a uniform distribution:
$$x_\pi^\top = \mu^\top \sum_{t=0}^\infty \gamma^t P_\pi^t = \mu^\top (I - \gamma P_\pi)^{-1}, \qquad (4)$$
where $\mu$ denotes the uniform distribution on the state space $X$. For any policy $\pi$ and state $s$, we trivially have $x_\pi(s) \ge \mu(s) = \frac{1}{n}$. The proof exploits the fact that $x_\pi(s)$ belongs to the set $\left[ \frac{1}{n}, 1 \right]$ when $s$ is on a path of $\pi$, while $x_\pi(s)$ belongs to the set $\left[ \frac{1}{n(1-\gamma^n)}, \frac{1}{1-\gamma} \right]$ when $s$ is on a cycle of $\pi$. As we are going to show, it is possible to extend the proof of ye2 to stochastic MDPs. Given a policy $\pi$ of a stochastic MDP, states are either recurrent or transient (these two categories respectively generalize those of cycles and paths). We will consider the following structural assumption.
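The vector $x_\pi$ of Eq. (4) can be computed by propagating the initial distribution, as in this small sketch (the three-state chain is illustrative: states 0 and 1 are transient, state 2 is absorbing; the truncation at `iters=500` steps is an assumption of the sketch):

```python
gamma, n = 0.9, 3
P_pi = [[0.0, 1.0, 0.0],   # deterministic transitions under a fixed policy (assumed)
        [0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0]]   # state 2 is absorbing (a cycle of length 1)

def visitation(P, iters=500):
    """x_pi(s) = sum_t gamma^t * P(s_t = s | s_0 ~ uniform), truncated at `iters`."""
    d = [1.0 / n] * n            # distribution of s_t, starting uniform
    x = [0.0] * n
    g = 1.0                      # current power gamma^t
    for _ in range(iters):
        x = [x[s] + g * d[s] for s in range(n)]
        d = [sum(d[s] * P[s][t] for s in range(n)) for t in range(n)]
        g *= gamma
    return x

x = visitation(P_pi)
# Each component is at least 1/n (the t = 0 term), the components sum to
# 1/(1 - gamma), and the absorbing (recurrent) state accumulates most of the mass.
```

On this chain, the transient state 0 is only visited at $t = 0$, so $x_\pi(0) = 1/n$ exactly, while the recurrent state 2 carries a mass of order $\frac{1}{1-\gamma}$, illustrating the two regimes discussed above.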
Assumption 1.
Let $\tau_t$ and $\tau_r$ be the smallest constants such that for all policies $\pi$ and all states $s$,
$$x_\pi(s) \le \frac{\tau_t}{n} \quad \text{if } s \text{ is transient for } \pi, \text{ and} \qquad (5)$$
$$x_\pi(s) \ge \frac{1}{(1-\gamma)\,\tau_r\, n} \quad \text{if } s \text{ is recurrent for } \pi. \qquad (6)$$
The constant $\tau_t$ (resp. $\tau_r$) can be seen as a measure of the time needed to leave transient states (resp. the time needed to revisit states in recurrent classes). In particular, when $\gamma$ tends to 1, it can be seen that $\tau_t$ is an upper bound of the expected time needed to leave the set of transient states, since for any policy $\pi$,
$$\lim_{\gamma \to 1} \sum_{s \text{ transient for } \pi} x_\pi(s) = E\left[ \sum_{t=0}^\infty \mathbb{1}\{ s_t \text{ transient for } \pi \} \,\middle|\, s_0 \sim \mu,\ \pi \right] \le \tau_t. \qquad (7\text{--}8)$$
Similarly, when $\gamma$ is in the vicinity of 1, $\frac{1}{\tau_r n}$ is a lower bound on the minimal asymptotic frequency of visits to recurrent states\footnote{If the MDP is aperiodic and irreducible, and thus admits a stationary distribution $\nu_\pi$ for any policy $\pi$, one can see that
$$\lim_{\gamma \to 1} (1-\gamma)\, x_\pi(s) = \nu_\pi(s) \ge \frac{1}{\tau_r n}. \qquad (9\text{--}10)$$}.
With Assumption 1 in hand, we can generalize Lemmas 4 and 5 as follows.
Lemma 6.
If the MDP satisfies Assumption 1, then after a number of iterations that is at most polynomial in $n$, $m$ and $\tau_t$, either Simplex-PI finishes or a new recurrent class appears.
Lemma 7.
If the MDP satisfies Assumption 1, when Simplex-PI moves from $\pi$ to $\pi'$ where $\pi'$ involves a new recurrent class, we have
$$\mathbb{1}^\top (v_* - v_{\pi'}) \le \left( 1 - \frac{1}{n \tau_r} \right) \mathbb{1}^\top (v_* - v_\pi). \qquad (11)$$
From these generalized observations, we can deduce the following original result.
Theorem 7 (Proof in Section 10).
If the MDP satisfies Assumption 1, then Simplex-PI terminates after a number of iterations that is at most polynomial in $n$, $m$, $\tau_t$ and $\tau_r$ (and independent of the discount factor $\gamma$).
Remark 1.
An immediate consequence of the above result is that Simplex-PI is strongly polynomial for sets of MDPs that are much larger than the deterministic MDPs mentioned in Theorem 6.
Corollary 2.
For any family of MDPs indexed by $n$ and $m$ such that $\tau_t$ and $\tau_r$ are polynomial functions of $n$ and $m$, Simplex-PI terminates after a number of steps that is polynomial in $n$ and $m$.
4 Similar results for Howard’s PI?
One may then wonder whether similar results can be derived for Howard’s PI. Unfortunately, and as quickly mentioned by ye2, the line of analysis developed for Simplex-PI does not seem to adapt easily to Howard’s PI, because simultaneously switching several actions can interfere in such a way that the overall policy improvement turns out to be small. We can be more precise about what actually breaks in the approach we have described so far. On the one hand, it is possible to write counterparts of Lemmas 4 and 6 for Howard’s PI (see Section 11).
Lemma 8.
If the MDP is deterministic, then after at most $n$ iterations, either Howard’s PI finishes or a new cycle appears.
Lemma 9.
If the MDP satisfies Assumption 1, then after a number of iterations that is at most polynomial in $n$ and $\tau_t$, either Howard’s PI finishes or a new recurrent class appears.
However, on the other hand, we did not manage to adapt Lemma 5 nor Lemma 7. In fact, it is unlikely that a result similar to that of Lemma 5 holds for Howard’s PI. In a recent deterministic example due to hansen2, built to show that Howard’s PI may require $\Omega(n^2)$ iterations, new cycles are created every single iteration, but the one-step progress made on $v_* - v_{\pi_k}$ corresponds to a contraction factor that tends to 1 exponentially fast with the iteration number $k$\footnote{This MDP has an even number of states. The goal is to minimize the long-term expected cost.}. Contrary to Lemma 5, as $k$ grows, the amount of contraction gets (exponentially) smaller and smaller. With respect to Simplex-PI, this suggests that Howard’s PI may suffer from subtle specific pathologies. In fact, the problem of determining the worst-case number of iterations required by Howard’s PI has been challenging for almost 30 years. It was originally identified as an open problem by schmitz. In the simplest, deterministic, case, the question is still open: the currently best known lower bound is the $\Omega(n^2)$ bound by hansen2 we have just mentioned, while the best known upper bound, valid for all MDPs, is $O\!\left(\frac{m^n}{n}\right)$ due to mansour.
On the positive side, an adaptation of the line of proof we have considered so far can be carried out under the following assumption.
Assumption 2.
The state space $X$ can be partitioned into two sets $X_t$ and $X_r$ such that for all policies $\pi$, the states of $X_t$ are transient and those of $X_r$ are recurrent.
Indeed, under this assumption, we can prove for Howard’s PI a variation of Lemma 7 introduced for Simplex-PI.
Lemma 10.
If the MDP satisfies Assumption 2, when Howard’s PI moves from $\pi$ to $\pi'$ where $\pi'$ involves a new recurrent class, a contraction inequality analogous to that of Lemma 7 holds for $\mathbb{1}^\top (v_* - v_{\pi'})$.
And we can deduce the following original bound (which also applies to Simplex-PI).
Theorem 8 (Proof in Section 12).
If the MDP satisfies Assumptions 1 and 2, then Howard’s PI and Simplex-PI terminate after a number of iterations that is at most polynomial in $n$, $m$, $\tau_t$ and $\tau_r$.
It should however be noted that Assumption 2 is rather restrictive. It implies that the algorithms converge on the recurrent states independently of the transient states, so that the analysis can be decomposed in two phases: 1) the convergence on recurrent states, and then 2) the convergence on transient states (given that the recurrent states do not change anymore). The analysis of the first phase (convergence on recurrent states) is greatly facilitated by the fact that, in this case, a new recurrent class appears at every single iteration (this is in contrast with Lemmas 4, 6, 8 and 9, which were designed to show under which conditions cycles and recurrent classes are created). Furthermore, the analysis of the second phase (convergence on transient states) is similar to that of the discounted case of Theorems 3 and 4. In other words, while this last result sheds some light on the practical efficiency of Howard’s PI and Simplex-PI, a general analysis of Howard’s PI is still largely open, and constitutes our main future work.
5 Contraction property for Howard’s PI (Proof of Lemma 2)
For any $k$, since by Lemma 1 and the definition of Howard’s PI we have $v_{\pi_{k+1}} \ge T_{\pi_{k+1}} v_{\pi_k} = T v_{\pi_k}$, we can write:
$$v_* - v_{\pi_{k+1}} \le v_* - T v_{\pi_k} \qquad (14)$$
$$\le T_{\pi_*} v_* - T_{\pi_*} v_{\pi_k} \qquad (15)$$
$$= \gamma P_{\pi_*} (v_* - v_{\pi_k}). \qquad (16)$$
Since $v_* - v_{\pi_{k+1}}$ is nonnegative and $P_{\pi_*}$ is a stochastic matrix, we can take the max-norm and get:
$$\|v_* - v_{\pi_{k+1}}\|_\infty \le \gamma \|v_* - v_{\pi_k}\|_\infty. \qquad (17)$$
6 Contraction property for SimplexPI (Proof of Lemma 3)
We begin by proving a useful identity.
Lemma 11.
For all pairs of policies $\pi$ and $\pi'$,
$$v_{\pi'} - v_\pi = (I - \gamma P_{\pi'})^{-1} \left( T_{\pi'} v_\pi - v_\pi \right). \qquad (18)$$
Proof.
We have:
$$v_{\pi'} - v_\pi = (I - \gamma P_{\pi'})^{-1} r_{\pi'} - v_\pi \qquad (19)$$
$$= (I - \gamma P_{\pi'})^{-1} \left( r_{\pi'} - (I - \gamma P_{\pi'}) v_\pi \right) \qquad (20)$$
$$= (I - \gamma P_{\pi'})^{-1} \left( T_{\pi'} v_\pi - v_\pi \right). \qquad (21)$$
∎
7 A bound for Howard’s PI when $\gamma < 1$ (Proof of Theorem 3)
Though the overall line of arguments follows those given originally by ye and adapted by hansen, our proof is slightly more direct and leads to a better result. By Lemma 2, for any $k$, we have:
$$\|v_* - v_{\pi_k}\|_\infty \le \gamma^k \|v_* - v_{\pi_0}\|_\infty. \qquad (34)$$
Moreover, for any policy $\pi$, the Bellman equation $v_\pi = T_\pi v_\pi$ implies
$$v_* - T_\pi v_* = (I - \gamma P_\pi)(v_* - v_\pi). \qquad (35)$$
By definition of the max-norm, there exists a state $s_0$ such that $v_*(s_0) - v_{\pi_0}(s_0) = \|v_* - v_{\pi_0}\|_\infty$. We deduce that for all $k$ such that $\pi_k(s_0) = \pi_0(s_0)$,
$$\gamma^k \|v_* - v_{\pi_0}\|_\infty \ge v_*(s_0) - v_{\pi_k}(s_0) \qquad (40)$$
$$\ge (v_* - T_{\pi_k} v_*)(s_0) = (v_* - T_{\pi_0} v_*)(s_0) \qquad (41)$$
$$= \left[ (I - \gamma P_{\pi_0})(v_* - v_{\pi_0}) \right](s_0) \ge (1-\gamma) \|v_* - v_{\pi_0}\|_\infty, \qquad (42)$$
where (41) uses $v_{\pi_k} = T_{\pi_k} v_{\pi_k} \le T_{\pi_k} v_*$ and the fact that $\pi_k$ and $\pi_0$ take the same action in $s_0$. As a consequence, the action $\pi_k(s_0)$ must be different from $\pi_0(s_0)$ when $\gamma^k < 1-\gamma$, that is for all values of $k$ satisfying
$$k \ge \left\lceil \frac{1}{1-\gamma} \log \frac{1}{1-\gamma} \right\rceil \ge \frac{\log \frac{1}{1-\gamma}}{\log \frac{1}{\gamma}}.$$
In other words, if some policy is not optimal, then one of its non-optimal actions will be eliminated for good after at most $\left\lceil \frac{1}{1-\gamma} \log \frac{1}{1-\gamma} \right\rceil$ iterations. By repeating this argument, one can eliminate all non-optimal actions (there are at most $n(m-1)$ of them), and the result follows.
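The elimination threshold used above can be sanity-checked numerically (the discount factors below are illustrative): at $k = \left\lceil \frac{1}{1-\gamma} \log \frac{1}{1-\gamma} \right\rceil$, the condition $\gamma^k < 1-\gamma$ indeed holds.

```python
import math

# Check that gamma^k <= exp(-k*(1 - gamma)) < 1 - gamma at the claimed threshold,
# for a few illustrative discount factors.
for gamma in (0.5, 0.9, 0.99):
    k = math.ceil(math.log(1.0 / (1.0 - gamma)) / (1.0 - gamma))
    assert gamma ** k < 1.0 - gamma   # the non-optimal action at s0 must have been switched
```

This is exactly the quantity that the theorem multiplies by the number $n(m-1)$ of non-optimal actions.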
8 A bound for Simplex-PI when $\gamma < 1$ (Proof of Theorem 4)
The overall line of arguments follows those given originally by ye, and is similar to that of the previous section; still, the result we get is slightly better. Using Lemma 3 and the fact that $\|u\|_\infty \le \mathbb{1}^\top u \le n \|u\|_\infty$ for any nonnegative vector $u$, we have for any $k$:
$$\|v_* - v_{\pi_k}\|_\infty \le \mathbb{1}^\top (v_* - v_{\pi_k}) \qquad (43)$$
$$\le \left( 1 - \frac{1-\gamma}{n} \right)^k \mathbb{1}^\top (v_* - v_{\pi_0}) \qquad (44)$$
$$\le n \left( 1 - \frac{1-\gamma}{n} \right)^k \|v_* - v_{\pi_0}\|_\infty \qquad (45)$$
$$\le n\, e^{-k \frac{1-\gamma}{n}} \|v_* - v_{\pi_0}\|_\infty. \qquad (46)$$
Similarly to the proof for Howard’s PI, we deduce that a non-optimal action is eliminated after at most
$$k^* = \left\lceil \frac{n}{1-\gamma} \log \frac{n}{1-\gamma} \right\rceil \qquad (49)$$
iterations (since $n\, e^{-k(1-\gamma)/n} < 1-\gamma$ for $k \ge k^*$), and the overall number of iterations is obtained by noting that there are at most $n(m-1)$ non-optimal actions to eliminate.
9 Another bound for Simplex-PI when $\gamma < 1$ (Proof of Theorem 5)
This second bound for Simplex-PI is a factor $O(\log n)$ better, but requires a slightly more careful analysis.
At each iteration $k$, let $s_k$ be the state in which an action is switched. We have, by definition of the algorithm:
$$a_{\pi_k}(s_k) = \max_s a_{\pi_k}(s) = \|T v_{\pi_k} - v_{\pi_k}\|_\infty. \qquad (50)$$
Starting with arguments similar to those for the contraction property of Simplex-PI, we have, using Lemma 11, $(I - \gamma P_{\pi_{k+1}})^{-1} \ge I$ and $T_{\pi_{k+1}} v_{\pi_k} - v_{\pi_k} \ge 0$:
$$\mathbb{1}^\top (v_{\pi_{k+1}} - v_{\pi_k}) = \mathbb{1}^\top (I - \gamma P_{\pi_{k+1}})^{-1} \left( T_{\pi_{k+1}} v_{\pi_k} - v_{\pi_k} \right) \qquad (51)$$
$$\ge \mathbb{1}^\top \left( T_{\pi_{k+1}} v_{\pi_k} - v_{\pi_k} \right) \qquad (52)$$
$$= a_{\pi_k}(s_k), \qquad (53)$$
the last equality holding because only the action at state $s_k$ is switched. This implies that
$$\mathbb{1}^\top (v_{\pi_{k+1}} - v_{\pi_k}) \ge \|T v_{\pi_k} - v_{\pi_k}\|_\infty. \qquad (54)$$
On the other hand, we have, using Lemma 11 with the pair $(\pi_k, \pi_*)$:
$$v_* - v_{\pi_k} = (I - \gamma P_{\pi_*})^{-1} \left( T_{\pi_*} v_{\pi_k} - v_{\pi_k} \right) \qquad (55)$$
$$\le (I - \gamma P_{\pi_*})^{-1} \left( T v_{\pi_k} - v_{\pi_k} \right) \le \frac{\|T v_{\pi_k} - v_{\pi_k}\|_\infty}{1-\gamma}\, \mathbb{1}, \qquad (56)$$
which implies that
$$\mathbb{1}^\top (v_* - v_{\pi_k}) \le \frac{n}{1-\gamma} \|T v_{\pi_k} - v_{\pi_k}\|_\infty. \qquad (57)$$
Write $\Delta_k = \mathbb{1}^\top (v_* - v_{\pi_k})$. From Equations (54) and (57), we deduce that:
$$\Delta_k - \Delta_{k+1} = \mathbb{1}^\top (v_{\pi_{k+1}} - v_{\pi_k}) \ge \|T v_{\pi_k} - v_{\pi_k}\|_\infty \qquad (58)$$
$$\ge \frac{1-\gamma}{n} \Delta_k. \qquad (59)$$
This implies in particular that
$$\Delta_{k+1} \le \left( 1 - \frac{1-\gamma}{n} \right) \Delta_k, \qquad (60)$$
but also, since $\Delta_{k+1}$ and $T v_{\pi_k} - v_{\pi_k}$ are nonnegative, that
$$\|T v_{\pi_k} - v_{\pi_k}\|_\infty \le \Delta_k. \qquad (61)$$
Now, write $n_k$ for the vector on the state space such that $n_k(s)$ is the number of times state $s$ has been switched up to iteration $k$ (included). Since by Lemma 1 the sequence