We consider a discrete-time dynamic system whose state transition depends on a control. We assume that there is a state space $X$ of finite size $n$. When at state $i \in \{1, \dots, n\}$, the control is chosen from a control space $A$ of finite size $m$ (Footnote 1: In the works of ye; ye2; hansen that we reference, the integer “$m$” denotes the total number of actions, that is $nm$ with our notation. When we restate their results, we do so in our own notation, that is, we replace their $m$ by $nm$.). The control $a \in A$ specifies the transition probability $p_{ij}(a) = \mathbb{P}(i_{t+1} = j \mid i_t = i, a_t = a)$ to the next state $j$. At each transition, the system is given a reward $r(i, a, j)$, where $r$ is the instantaneous reward function. In this context, we look for a stationary deterministic policy (a function $\pi : X \to A$ that maps states into controls; Footnote 2: Restricting our attention to stationary deterministic policies is not a limitation. Indeed, for the optimality criterion to be defined soon, it can be shown that there exists at least one stationary deterministic policy that is optimal (puterman).) that maximizes the expected discounted sum of rewards from any state $i$, called the value of policy $\pi$ at state $i$:
$$v_\pi(i) := \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k\, r(i_k, a_k, i_{k+1}) \;\middle|\; i_0 = i,\ \forall k \ge 0,\ a_k = \pi(i_k),\ i_{k+1} \sim \mathbb{P}(\cdot \mid i_k, a_k)\right],$$
where $\gamma \in (0, 1)$ is a discount factor. The tuple $\langle X, A, p, r, \gamma \rangle$ is called a Markov Decision Process (MDP) (puterman; ndp), and the associated problem is known as optimal control.
The optimal value starting from state $i$ is defined as
$$v_*(i) := \max_\pi v_\pi(i).$$
For any policy $\pi$, we write $P_\pi$ for the $n \times n$ stochastic matrix whose elements are $p_{ij}(\pi(i))$, and $r_\pi$ for the vector whose components are $\sum_j p_{ij}(\pi(i))\, r(i, \pi(i), j)$. The value functions $v_\pi$ and $v_*$ can be seen as vectors on $X$. It is well known that $v_\pi$ is the solution of the following Bellman equation:
$$v_\pi = r_\pi + \gamma P_\pi v_\pi,$$
that is, $v_\pi$ is a fixed point of the affine operator $T_\pi : v \mapsto r_\pi + \gamma P_\pi v$. It is also well known that $v_*$ satisfies the following Bellman equation:
$$v_* = \max_\pi \left( r_\pi + \gamma P_\pi v_* \right) = \max_\pi T_\pi v_*,$$
where the max operator is componentwise. In other words, $v_*$ is a fixed point of the nonlinear operator $T : v \mapsto \max_\pi T_\pi v$. For any value vector $v$, we say that a policy $\pi$ is greedy with respect to the value $v$ if it satisfies:
$$\pi \in \arg\max_{\pi'} T_{\pi'} v,$$
or equivalently $T_\pi v = T v$. With some slight abuse of notation, we write $G(v)$ for any policy that is greedy with respect to $v$. The notions of optimal value function and greedy policies are fundamental to optimal control because of the following property: any policy $\pi_*$ that is greedy with respect to the optimal value $v_*$ is an optimal policy and its value $v_{\pi_*}$ is equal to $v_*$.
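To make the two Bellman equations concrete, here is a minimal sketch in plain Python. It assumes a tabular encoding that is not part of the text: `P[a][i][j]` holds the transition probability from state `i` to state `j` under action `a`, and `R[a][i]` holds the expected one-step reward $\sum_j p_{ij}(a)\, r(i, a, j)$. The function names and the two-state example are hypothetical, and the value of a policy is approximated by iterating the affine operator $T_\pi$ rather than by solving the linear system exactly.

```python
def evaluate_policy(P, R, gamma, policy, tol=1e-12):
    """Approximate v_pi, the fixed point of T_pi: v -> r_pi + gamma * P_pi v."""
    n = len(policy)
    v = [0.0] * n
    while True:
        new_v = [
            R[policy[i]][i]
            + gamma * sum(P[policy[i]][i][j] * v[j] for j in range(n))
            for i in range(n)
        ]
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v


def greedy_policy(P, R, gamma, v):
    """Return a policy pi with T_pi v = T v (ties go to the lowest action index)."""
    n, m = len(v), len(P)
    policy = []
    for i in range(n):
        q = [R[a][i] + gamma * sum(P[a][i][j] * v[j] for j in range(n))
             for a in range(m)]
        policy.append(q.index(max(q)))
    return policy


# Hypothetical 2-state, 2-action MDP: action 0 stays put, action 1 moves to
# the other state; only staying in state 1 is rewarded.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 1.0],
     [0.0, 0.0]]
gamma = 0.9

v = evaluate_policy(P, R, gamma, [1, 0])  # move out of state 0, stay in state 1
pi = greedy_policy(P, R, gamma, v)
```

In this example `v` is close to `[9, 10]`, and the policy that is greedy with respect to it is again `[1, 0]`, which illustrates the property that a policy that is greedy with respect to the optimal value is optimal.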
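Let $\pi$ be some policy. We call advantage with respect to $\pi$ the following quantity:
$$a_\pi = \max_{\pi'} T_{\pi'} v_\pi - v_\pi = T v_\pi - v_\pi.$$
We call the set of switchable states of $\pi$ the following set:
$$S_\pi = \{ i : a_\pi(i) > 0 \}.$$
Assume now that $\pi$ is non-optimal (this implies that $S_\pi$ is a non-empty set). For any non-empty subset $Y$ of $S_\pi$, we denote by $\text{switch}(\pi, Y)$ a policy satisfying:
$$\forall i, \quad \text{switch}(\pi, Y)(i) = \begin{cases} G(v_\pi)(i) & \text{if } i \in Y, \\ \pi(i) & \text{if } i \notin Y. \end{cases}$$
The following result is well known (see for instance puterman).
Lemma 1.
Let $\pi$ be some non-optimal policy. If $\pi' = \text{switch}(\pi, Y)$ for some non-empty subset $Y$ of $S_\pi$, then $v_{\pi'} \ge v_\pi$ and there exists at least one state $i$ such that $v_{\pi'}(i) > v_\pi(i)$.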
This lemma is the foundation of the well-known iterative procedure, called Policy Iteration (PI), that generates a sequence of policies $(\pi_k)$ as follows:
$$\pi_{k+1} \leftarrow \text{switch}(\pi_k, Y_k) \quad \text{for some set } Y_k \text{ such that } \emptyset \subsetneq Y_k \subseteq S_{\pi_k}.$$
The choice for the subsets $Y_k$ leads to different variations of PI. In this paper we will focus on two specific variations:
When for all iterations $k$, $Y_k = S_{\pi_k}$, that is one switches the actions in all states with positive advantage with respect to $\pi_k$, the above algorithm is known as Howard’s PI; it can be seen then that $\pi_{k+1} \in G(v_{\pi_k})$.
When for all $k$, $Y_k$ is a singleton containing a state $i_k \in \arg\max_i a_{\pi_k}(i)$, that is if we only switch one action in the state with maximal advantage with respect to $\pi_k$, we will call it Simplex-PI (Footnote 3: In this case, PI is equivalent to running the simplex algorithm with the highest-pivot rule on a linear program version of the MDP problem (ye).).
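The two variations can be sketched side by side in plain Python. This is only an illustration under assumptions that are not in the text: the MDP is stored as `P[a][i][j]` (transition probabilities) and `R[a][i]` (expected one-step rewards), policy evaluation is done by fixed-point iteration, a small threshold `eps` decides whether an advantage counts as positive, and the example MDP (two states that never communicate) is hypothetical.

```python
def evaluate_policy(P, R, gamma, policy, tol=1e-12):
    """Approximate v_pi by iterating v <- r_pi + gamma * P_pi v."""
    n = len(policy)
    v = [0.0] * n
    while True:
        new_v = [
            R[policy[i]][i]
            + gamma * sum(P[policy[i]][i][j] * v[j] for j in range(n))
            for i in range(n)
        ]
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v


def policy_iteration(P, R, gamma, policy, variant="howard", eps=1e-9):
    """Run PI from `policy`; return (final policy, number of iterations)."""
    policy = list(policy)
    n, m = len(policy), len(P)
    iterations = 0
    while True:
        v = evaluate_policy(P, R, gamma, policy)
        # q[i][a] is the value at state i of playing a once, then following v.
        q = [[R[a][i] + gamma * sum(P[a][i][j] * v[j] for j in range(n))
              for a in range(m)] for i in range(n)]
        advantage = [max(q[i]) - v[i] for i in range(n)]
        switchable = [i for i in range(n) if advantage[i] > eps]
        if not switchable:
            return policy, iterations
        if variant == "howard":
            chosen = switchable  # switch every state with positive advantage
        elif variant == "simplex":
            chosen = [max(switchable, key=lambda i: advantage[i])]  # one state
        else:
            raise ValueError("variant must be 'howard' or 'simplex'")
        for i in chosen:
            policy[i] = q[i].index(max(q[i]))  # greedy action at state i
        iterations += 1


# Hypothetical MDP: both actions keep the state unchanged; action 1 pays
# 1 in state 0 and 2 in state 1, action 0 pays nothing.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
R = [[0.0, 0.0],
     [1.0, 2.0]]

howard_policy, howard_iters = policy_iteration(P, R, 0.9, [0, 0], "howard")
simplex_policy, simplex_iters = policy_iteration(P, R, 0.9, [0, 0], "simplex")
```

On this toy instance both variants reach the policy `[1, 1]`; the Howard variant switches both states at once (one iteration), while the Simplex variant switches the state with the larger advantage first and needs two iterations.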
Since it generates a sequence of policies with increasing values, any variation of PI converges to the optimal policy in a number of iterations that is smaller than the total number of policies $m^n$. In practice, PI converges in very few iterations. On random MDP instances, convergence often occurs in time sub-linear in $n$. The aim of this paper is to discuss existing upper bounds, and to provide new ones, on the number of iterations required by Howard’s PI and Simplex-PI that are much sharper than $m^n$.
In the next sections, we describe some known results (see ye for a recent and comprehensive review) about the number of iterations required by Howard’s PI and Simplex-PI, along with some of our original improvements and extensions. For clarity, all proofs are deferred to the later sections.
2 Bounds with respect to a Fixed Discount Factor
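A key observation for both algorithms, that will be central to the results we are about to discuss, is that the sequence they generate satisfies some contraction property (Footnote 4: A sequence of non-negative numbers $(x_k)_{k \ge 0}$ is contracting with coefficient $\alpha$ if and only if for all $k \ge 0$, $x_{k+1} \le \alpha x_k$.). For any vector $u \in \mathbb{R}^n$, let $\|u\|_\infty = \max_{1 \le i \le n} |u(i)|$ be the max-norm of $u$. Let $\mathbb{1}$ be the vector of which all components are equal to 1.
Lemma 2 (Proof in Section 5).
The sequence $(\|v_* - v_{\pi_k}\|_\infty)_{k \ge 0}$ built by Howard’s PI is contracting with coefficient $\gamma$.
Lemma 3 (Proof in Section 6).
The sequence $(\mathbb{1}^T (v_* - v_{\pi_k}))_{k \ge 0}$ built by Simplex-PI is contracting with coefficient $1 - \frac{1-\gamma}{n}$.
Though this observation is widely known for Howard’s PI, it was to our knowledge never mentioned explicitly in the literature for Simplex-PI. These contraction properties have the following immediate consequence (Footnote 5: For Howard’s PI, we have: $\|v_* - v_{\pi_k}\|_\infty \le \gamma^k \|v_* - v_{\pi_0}\|_\infty \le \gamma^k V_{\max}$. Thus, a sufficient condition for $\|v_* - v_{\pi_k}\|_\infty < \epsilon$ is $\gamma^k V_{\max} < \epsilon$, which is implied by $k \ge \frac{\log(V_{\max}/\epsilon)}{1-\gamma} > \frac{\log(V_{\max}/\epsilon)}{\log(1/\gamma)}$. For Simplex-PI, we have $\|v_* - v_{\pi_k}\|_\infty \le \|v_* - v_{\pi_k}\|_1 \le \left(1 - \frac{1-\gamma}{n}\right)^k \|v_* - v_{\pi_0}\|_1 \le \left(1 - \frac{1-\gamma}{n}\right)^k n V_{\max}$, and the conclusion is similar to that for Howard’s PI.).
Let $V_{\max} = \frac{\max_\pi \|r_\pi\|_\infty}{1-\gamma}$ be an upper bound on $\|v_\pi\|_\infty$ for all policies $\pi$. In order to get an $\epsilon$-optimal policy, that is a policy $\pi_k$ satisfying $\|v_* - v_{\pi_k}\|_\infty \le \epsilon$, Howard’s PI requires at most $\left\lceil \frac{\log(V_{\max}/\epsilon)}{1-\gamma} \right\rceil$ iterations, while Simplex-PI requires at most $\left\lceil \frac{n \log(n V_{\max}/\epsilon)}{1-\gamma} \right\rceil$ iterations.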
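The arithmetic behind these iteration counts can be checked numerically. The sketch below assumes that the contraction coefficients are $\gamma$ for Howard’s PI and $1 - \frac{1-\gamma}{n}$ for Simplex-PI, and that the resulting counts read $\lceil \log(V_{\max}/\epsilon)/(1-\gamma) \rceil$ and $\lceil n \log(n V_{\max}/\epsilon)/(1-\gamma) \rceil$; the function names and the numerical values are hypothetical.

```python
import math


def howard_eps_iterations(v_max, eps, gamma):
    """Iterations after which gamma**k * v_max <= eps is guaranteed."""
    return math.ceil(math.log(v_max / eps) / (1.0 - gamma))


def simplex_eps_iterations(n, v_max, eps, gamma):
    """Iterations after which (1 - (1-gamma)/n)**k * n * v_max <= eps is guaranteed."""
    return math.ceil(n * math.log(n * v_max / eps) / (1.0 - gamma))


# Hypothetical values: 5 states, values bounded by 10, target precision 0.01.
n, v_max, eps, gamma = 5, 10.0, 0.01, 0.9
k_howard = howard_eps_iterations(v_max, eps, gamma)
k_simplex = simplex_eps_iterations(n, v_max, eps, gamma)

# Residual error bounds after that many iterations.
howard_residual = gamma ** k_howard * v_max
simplex_residual = (1.0 - (1.0 - gamma) / n) ** k_simplex * n * v_max
```

With these values `k_howard` is 70 and `k_simplex` is 426, and both residuals fall below `eps`, as the sufficient conditions predict.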
These bounds depend on the precision term $\epsilon$, which means that Howard’s PI and Simplex-PI are weakly polynomial for a fixed discount factor $\gamma$. An important breakthrough was recently achieved by ye who proved that one can remove the dependency with respect to $\epsilon$, and thus show that Howard’s PI and Simplex-PI are strongly polynomial for a fixed discount factor $\gamma$.
Theorem 1 (ye).
Simplex-PI and Howard’s PI both terminate after at most $n(m-1) \left\lceil \frac{n}{1-\gamma} \log\left(\frac{n^2}{1-\gamma}\right) \right\rceil$ iterations.
The proof is based on the fact that PI corresponds to the simplex algorithm in a linear programming formulation of the MDP problem. Using a more direct proof, hansen recently improved the result by a factor $O(n)$ for Howard’s PI.
Theorem 2 (hansen).
Howard’s PI terminates after at most $(nm+1) \left\lceil \frac{1}{1-\gamma} \log\left(\frac{n}{1-\gamma}\right) \right\rceil$ iterations.
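Theorem 3 (Proof in Section 7).
Howard’s PI terminates after at most $n(m-1) \left\lceil \frac{1}{1-\gamma} \log\left(\frac{1}{1-\gamma}\right) \right\rceil$ iterations.
Theorem 4 (Proof in Section 8).
Simplex-PI terminates after at most $n(m-1) \left\lceil \frac{n}{1-\gamma} \log\left(\frac{n}{1-\gamma}\right) \right\rceil$ iterations.
Our result for Howard’s PI is a factor $O(\log n)$ better than the previous best result of hansen. Our result for Simplex-PI is only very slightly better (by a factor 2) than that of ye, and uses a proof that is more direct. Using a more refined argument, we managed to also improve the bound for Simplex-PI by a factor $O(\log n)$.
Theorem 5 (Proof in Section 9).
Simplex-PI terminates after at most $n^2(m-1) \left( 1 + \frac{2}{1-\gamma} \log\left(\frac{1}{1-\gamma}\right) \right)$ iterations.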
Compared to Howard’s PI, our bound for Simplex-PI is a factor $O(n)$ larger. However, since one changes only one action per iteration, each iteration may have a complexity lower by a factor $n$: the update of the value can be done in time $O(n^2)$ through the Sherman-Morrison formula, whereas in general each iteration of Howard’s PI, which amounts to computing the value of some policy that may be arbitrarily different from the previous policy, may require $O(n^3)$ time. Overall, both algorithms seem to have a similar complexity.
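The rank-one update mentioned above can be sketched as follows. Switching the action in a single state $i$ changes only row $i$ of $P_\pi$, so $I - \gamma P_{\pi'}$ differs from $I - \gamma P_\pi$ by a rank-one term $e_i w^T$, and the Sherman-Morrison formula $(A + u w^T)^{-1} = A^{-1} - \frac{A^{-1} u\, w^T A^{-1}}{1 + w^T A^{-1} u}$ applies. The helper names and the two-state example are hypothetical; this is an illustration in plain Python, not the implementation used in the referenced works.

```python
def sherman_morrison(A_inv, u, w):
    """Return (A + u w^T)^{-1} from A^{-1} using O(n^2) operations."""
    n = len(A_inv)
    A_inv_u = [sum(A_inv[i][j] * u[j] for j in range(n)) for i in range(n)]
    w_A_inv = [sum(w[i] * A_inv[i][j] for i in range(n)) for j in range(n)]
    denom = 1.0 + sum(w[i] * A_inv_u[i] for i in range(n))
    return [[A_inv[i][j] - A_inv_u[i] * w_A_inv[j] / denom for j in range(n)]
            for i in range(n)]


def switch_one_state(A_inv, old_row, new_row, gamma, i):
    """Update (I - gamma P_pi)^{-1} when only row i of P_pi changes."""
    n = len(A_inv)
    u = [1.0 if k == i else 0.0 for k in range(n)]           # unit vector e_i
    w = [-gamma * (new_row[j] - old_row[j]) for j in range(n)]
    return sherman_morrison(A_inv, u, w)


# Hypothetical example: two states, old policy stays in place everywhere,
# so I - gamma * P_pi = 0.1 * I and its inverse is 10 * I.
gamma = 0.9
A_inv = [[10.0, 0.0], [0.0, 10.0]]

# Switch state 0 so that it now moves to state 1 with probability 1.
new_inv = switch_one_state(A_inv, [1.0, 0.0], [0.0, 1.0], gamma, 0)

# Value of the new policy, with rewards 0 in state 0 and 1 in state 1.
r_new = [0.0, 1.0]
v_new = [sum(new_inv[i][j] * r_new[j] for j in range(2)) for i in range(2)]
```

Here `new_inv` is close to `[[1, 9], [0, 10]]`, the inverse of `[[1, -0.9], [0, 0.1]]`, and `v_new` is close to `[9, 10]`, without any new matrix inversion.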
It is easy to see that the linear dependency of the bound for Howard’s PI with respect to $n$ is optimal. We conjecture that the linear dependency of both bounds with respect to $m$ is also optimal. The dependency with respect to the term $\frac{1}{1-\gamma}$ may be improved, but removing it is impossible for Howard’s PI and very unlikely for Simplex-PI. fearnley describes an MDP for which Howard’s PI requires an exponential (in $n$) number of iterations for $\gamma = 1$, and hollanders argued that this holds also when $\gamma$ is in the vicinity of 1. Though a similar result does not seem to exist for Simplex-PI in the literature, condon consider four variations of PI that all switch one action per iteration, and show through specifically designed MDPs that they may require an exponential (in $n$) number of iterations when $\gamma = 1$.
3 Bounds for Simplex-PI that are independent of $\gamma$
In this section, we will describe some bounds that do not depend on $\gamma$ but that will be based on some structural assumptions of the MDPs. On this topic, ye2 recently showed the following result for deterministic MDPs.
Theorem 6 (ye2).
If the MDP is deterministic, then Simplex-PI terminates after at most $O(n^5 m^2 \log^2 n)$ iterations.
Given a policy $\pi$ of a deterministic MDP, states are either on cycles or on paths induced by $\pi$. The core of the proof relies on the following lemmas that altogether show that cycles are created regularly and that significant progress is made every time a new cycle appears; in other words, significant progress is made regularly.
If the MDP is deterministic, after at most iterations, either Simplex-PI finishes or a new cycle appears.
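Lemma 5.
If the MDP is deterministic, when Simplex-PI moves from $\pi$ to $\pi'$ where $\pi'$ involves a new cycle, we have
$$\mathbb{1}^T (v_* - v_{\pi'}) \le \left(1 - \frac{1}{n}\right) \mathbb{1}^T (v_* - v_\pi).$$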
Indeed, these observations suffice to prove (Footnote 6: This can be done by using arguments similar to the proof of Theorem 4 in Section 8.) that Simplex-PI terminates after a number of iterations that still depends on the discount factor $\gamma$ through a term in $O\left(\log \frac{1}{1-\gamma}\right)$. Removing this dependency completely requires careful extra work described in ye2, which incurs an extra term of order $O(n \log n)$.
At a more technical level, the proof of ye2 critically relies on some properties of the vector $x_\pi$ that provides a discounted measure of state visitations along the trajectories induced by a policy $\pi$ starting from a uniform distribution:
$$x_\pi^T = \mathbb{1}^T (I - \gamma P_\pi)^{-1} = n\, \mu^T (I - \gamma P_\pi)^{-1},$$
where $\mu$ denotes the uniform distribution on the state space $X$. For any policy $\pi$ and state $i$, we trivially have $x_\pi(i) \in \left[1, \frac{n}{1-\gamma}\right]$. The proof exploits the fact that $x_\pi(i)$ belongs to the set $[1, n]$ when $i$ is on a path of $\pi$, while $x_\pi(i)$ belongs to the set $\left[\frac{1}{1-\gamma}, \frac{n}{1-\gamma}\right]$ when $i$ is on a cycle of $\pi$. As we are going to show, it is possible to extend the proof of ye2 to stochastic MDPs. Given a policy $\pi$ of a stochastic MDP, states are either in recurrent classes or transient classes (these two categories respectively generalize those of cycles and paths). We will consider the following structural assumption.
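Assumption 1.
Let $\tau_t \ge 1$ and $\tau_r \ge 1$ be the smallest constants such that for all policies $\pi$ and all states $i$,
$$1 \le x_\pi(i) \le \tau_t \quad \text{if } i \text{ is transient for } \pi, \text{ and} \qquad (5)$$
$$\frac{n}{(1-\gamma)\,\tau_r} \le x_\pi(i) \le \frac{n}{1-\gamma} \quad \text{if } i \text{ is recurrent for } \pi. \qquad (6)$$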
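To get a feel for the visitation vector behind this assumption, the sketch below computes it for a small deterministic policy. It assumes that $x_\pi$ is defined by $x_\pi^T = \mathbb{1}^T (I - \gamma P_\pi)^{-1}$, so that it solves $x_\pi^T = \mathbb{1}^T + \gamma\, x_\pi^T P_\pi$; the function name and the three-state chain are hypothetical.

```python
def visitation_vector(P_pi, gamma, tol=1e-12):
    """Approximate x_pi by iterating x(j) <- 1 + gamma * sum_i P_pi[i][j] * x(i)."""
    n = len(P_pi)
    x = [1.0] * n
    while True:
        new_x = [1.0 + gamma * sum(P_pi[i][j] * x[i] for i in range(n))
                 for j in range(n)]
        if max(abs(a - b) for a, b in zip(new_x, x)) < tol:
            return new_x
        x = new_x


# Hypothetical deterministic policy on 3 states: 0 -> 1 -> 2, and 2 loops on
# itself. States 0 and 1 are on a path (transient), state 2 is on a cycle.
P_pi = [[0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0]]
gamma = 0.9
x = visitation_vector(P_pi, gamma)
```

The result is close to `[1, 1.9, 27.1]`: the two path states fall in $[1, n] = [1, 3]$, the cycle state falls in $\left[\frac{1}{1-\gamma}, \frac{n}{1-\gamma}\right] = [10, 30]$, and the components sum to $\frac{n}{1-\gamma} = 30$.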
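The constant $\tau_t$ (resp. $\tau_r$) can be seen as a measure of the time needed to leave transient states (resp. the time needed to revisit states in recurrent classes). In particular, when $\gamma$ tends to 1, it can be seen that $\tau_t$ is an upper bound of the expected time $L$ needed to “Leave the set of transient states”, since for any policy $\pi$,
$$\lim_{\gamma \to 1} \tau_t \;\ge\; \frac{1}{n} \lim_{\gamma \to 1} \sum_{i \text{ transient for } \pi} x_\pi(i) \;=\; \sum_{t=0}^{\infty} \mathbb{P}\left(i_t \text{ transient for } \pi \mid i_0 \sim \mu,\ a_t = \pi(i_t)\right) \;=\; \mathbb{E}\left[L \mid i_0 \sim \mu,\ a_t = \pi(i_t)\right].$$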
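Similarly, when $\gamma$ is in the vicinity of 1, $\frac{1}{\tau_r}$ is the minimal asymptotic frequency (Footnote 7: If the MDP is aperiodic and irreducible, and thus admits a stationary distribution $\nu_\pi$ for any policy $\pi$, one can see that $\frac{1}{\tau_r} = \min_{\pi, i} \nu_\pi(i)$.) with which states in recurrent classes are visited when starting from a uniformly random state.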
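Lemma 6.
If the MDP satisfies Assumption 1, after at most $n \left[ (m-1) \lceil n \tau_t \log(n \tau_t) \rceil + \lceil n \tau_t \log(n^2 \tau_t) \rceil \right]$ iterations either Simplex-PI finishes or a new recurrent class appears.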
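Lemma 7.
If the MDP satisfies Assumption 1, when Simplex-PI moves from $\pi$ to $\pi'$ where $\pi'$ involves a new recurrent class, we have
$$\mathbb{1}^T (v_* - v_{\pi'}) \le \left(1 - \frac{1}{\tau_r}\right) \mathbb{1}^T (v_* - v_\pi).$$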
From these generalized observations, we can deduce the following original result.
Theorem 7 (Proof in Section 10).
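If the MDP satisfies Assumption 1, then Simplex-PI terminates after at most
$$n^2 (m-1) \left( \lceil \tau_r \log(n \tau_r) \rceil + \lceil \tau_r \log(n \tau_t) \rceil \right) \left[ (m-1) \lceil n \tau_t \log(n \tau_t) \rceil + \lceil n \tau_t \log(n^2 \tau_t) \rceil \right] = \tilde{O}\left(n^3 m^2 \tau_t \tau_r\right)$$
iterations.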
An immediate consequence of the above result is that Simplex-PI is strongly polynomial for sets of MDPs that are much larger than the deterministic MDPs mentioned in Theorem 6.
For any family of MDPs indexed by $n$ and $m$ such that $\tau_t$ and $\tau_r$ are polynomial functions of $n$ and $m$, Simplex-PI terminates after a number of steps that is polynomial in $n$ and $m$.
4 Similar results for Howard’s PI?
One may then wonder whether similar results can be derived for Howard’s PI. Unfortunately, and as quickly mentioned by ye2, the line of analysis developed for Simplex-PI does not seem to adapt easily to Howard’s PI, because simultaneously switching several actions can interfere in a way that the policy improvement turns out to be small. We can be more precise on what actually breaks in the approach we have described so far. On the one hand, it is possible to write counterparts of Lemmas 4 and 6 for Howard’s PI (see Section 11).
If the MDP is deterministic, after at most iterations, either Howard’s PI finishes or a new cycle appears.
If the MDP satisfies Assumption 1, after at most iterations, either Howard’s PI finishes or a new recurrent class appears.
However, on the other hand, we did not manage to adapt Lemma 5 nor Lemma 7. In fact, it is unlikely that a result similar to that of Lemma 5 will be shown to hold for Howard’s PI. In a recent deterministic example due to hansen2 to show that Howard’s PI may require at most iterations, new cycles are created every single iteration but the sequence of values satisfies888This MDP has an even number of states . The goal is to minimize the long term expected cost. The optimal value function satisfies for all , with . The policies generated by Howard’s PI have values . We deduce that for all iterations and states , . for all iterations and states ,
Contrary to Lemma 5, as $k$ grows, the amount of contraction gets (exponentially) smaller and smaller. Compared with Simplex-PI, this suggests that Howard’s PI may suffer from subtle specific pathologies. In fact, the problem of determining the number of iterations required by Howard’s PI has been challenging for almost 30 years. It was originally identified as an open problem by schmitz. In the simplest (deterministic) case, the question is still open: the currently best known lower bound is the $\Omega(n^2)$ bound by hansen2 we have just mentioned, while the best known upper bound is $O\left(\frac{m^n}{n}\right)$ (valid for all MDPs) due to mansour.
On the positive side, an adaptation of the line of proof we have considered so far can be carried out under the following assumption.
Assumption 2.
The state space $X$ can be partitioned in two sets $\mathcal{T}$ and $\mathcal{R}$ such that for all policies $\pi$, the states of $\mathcal{T}$ are transient and those of $\mathcal{R}$ are recurrent.
Indeed, under this assumption, we can prove for Howard’s PI a variation of Lemma 7 introduced for Simplex-PI.
And we can deduce the following original bound (that also applies to Simplex-PI).
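Theorem 8 (Proof in Section 12).
If the MDP satisfies Assumptions 1 and 2, then Howard’s PI terminates after at most $n(m-1) \left( \lceil \tau_t \log(n \tau_t) \rceil + \lceil \tau_r \log(n \tau_r) \rceil \right)$ iterations, while Simplex-PI terminates after at most $n(m-1) \left( \lceil n \tau_t \log(n \tau_t) \rceil + \lceil \tau_r \log(n \tau_r) \rceil \right)$ iterations.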
It should however be noted that Assumption 2 is rather restrictive. It implies that the algorithms converge on the recurrent states independently of the transient states, and thus the analysis can be decomposed in two phases: 1) the convergence on recurrent states and then 2) the convergence on transient states (given that recurrent states do not change anymore). The analysis of the first phase (convergence on recurrent states) is greatly facilitated by the fact that in this case, a new recurrent class appears every single iteration (this is in contrast with Lemmas 4, 6, 8 and 9 that were designed to show under which conditions cycles and recurrent classes are created). Furthermore, the analysis of the second phase (convergence on transient states) is similar to that of the discounted case of Theorems 3 and 4. In other words, if this last result sheds some light on the practical efficiency of Howard’s PI and Simplex-PI, a general analysis of Howard’s PI is still largely open, and constitutes our main future work.
5 Contraction property for Howard’s PI (Proof of Lemma 2)
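For any $k \ge 1$, writing $\pi_*$ for an optimal policy and using the notation “$M \ge 0$” for “all entries of $M$ are non-negative”, we have
$$\begin{aligned}
v_* - v_{\pi_k} &= T_{\pi_*} v_* - T_{\pi_*} v_{\pi_{k-1}} + T_{\pi_*} v_{\pi_{k-1}} - T_{\pi_k} v_{\pi_{k-1}} + T_{\pi_k} v_{\pi_{k-1}} - T_{\pi_k} v_{\pi_k} && \{\forall \pi,\ T_\pi v_\pi = v_\pi\} \\
&\le \gamma P_{\pi_*} (v_* - v_{\pi_{k-1}}) + \gamma P_{\pi_k} (v_{\pi_{k-1}} - v_{\pi_k}) && \{T_{\pi_*} v_{\pi_{k-1}} \le T_{\pi_k} v_{\pi_{k-1}}\} \\
&\le \gamma P_{\pi_*} (v_* - v_{\pi_{k-1}}). && \{\text{Lemma 1 and } P_{\pi_k} \ge 0\}
\end{aligned}$$
Since $v_* - v_{\pi_k}$ is non-negative, we can take the max norm and get:
$$\|v_* - v_{\pi_k}\|_\infty \le \gamma\, \|v_* - v_{\pi_{k-1}}\|_\infty.$$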
6 Contraction property for Simplex-PI (Proof of Lemma 3)
We begin by proving a useful identity.
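For all pairs of policies $\pi$ and $\pi'$,
$$v_{\pi'} - v_\pi = (I - \gamma P_{\pi'})^{-1} \left( T_{\pi'} v_\pi - v_\pi \right).$$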
7 A bound for Howard’s PI when $\gamma < 1$ (Proof of Theorem 3)
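Though the overall line of arguments follows those given originally by ye and adapted by hansen, our proof is slightly more direct and leads to a better result. For any $k$, we have:
$$v_* - T_{\pi_k} v_* = (I - \gamma P_{\pi_k})(v_* - v_{\pi_k}) \le v_* - v_{\pi_k}.$$
Since $v_* - T_{\pi_k} v_*$ is non-negative, we can take the max norm and get:
$$\|v_* - T_{\pi_k} v_*\|_\infty \le \|v_* - v_{\pi_k}\|_\infty \le \gamma^k \|v_* - v_{\pi_0}\|_\infty = \gamma^k \left\|(I - \gamma P_{\pi_0})^{-1} (v_* - T_{\pi_0} v_*)\right\|_\infty \le \frac{\gamma^k}{1-\gamma} \|v_* - T_{\pi_0} v_*\|_\infty.$$
By definition of the max-norm, there exists a state $s_0$ such that $v_*(s_0) - [T_{\pi_0} v_*](s_0) = \|v_* - T_{\pi_0} v_*\|_\infty$. We deduce that for all $k$,
$$v_*(s_0) - [T_{\pi_k} v_*](s_0) \le \frac{\gamma^k}{1-\gamma} \left( v_*(s_0) - [T_{\pi_0} v_*](s_0) \right).$$
As a consequence, the action $\pi_k(s_0)$ must be different from $\pi_0(s_0)$ when $\frac{\gamma^k}{1-\gamma} < 1$, that is for all values of $k$ satisfying
$$k \ge k^* = \left\lceil \frac{\log \frac{1}{1-\gamma}}{1-\gamma} \right\rceil > \frac{\log \frac{1}{1-\gamma}}{\log \frac{1}{\gamma}}.$$
In other words, if some policy $\pi$ is not optimal, then one of its non-optimal actions will be eliminated for good after at most $k^*$ iterations. By repeating this argument, one can eliminate all non-optimal actions (they are at most $n(m-1)$), and the result follows.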
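The elimination argument can be checked numerically. The sketch below assumes that a non-optimal action is eliminated as soon as $\frac{\gamma^k}{1-\gamma} < 1$, that the threshold used is $k^* = \left\lceil \frac{1}{1-\gamma} \log \frac{1}{1-\gamma} \right\rceil$, and that the overall bound is $n(m-1)\,k^*$; the function names and the numerical values are hypothetical.

```python
import math


def elimination_threshold(gamma):
    """k* = ceil(log(1/(1-gamma)) / (1-gamma)); gamma**k / (1-gamma) < 1 for k >= k*."""
    return math.ceil(math.log(1.0 / (1.0 - gamma)) / (1.0 - gamma))


def smallest_sufficient_k(gamma):
    """Smallest k with gamma**k / (1-gamma) < 1, found by direct search."""
    k = 0
    while gamma ** k / (1.0 - gamma) >= 1.0:
        k += 1
    return k


def howard_iteration_bound(n, m, gamma):
    """At most n*(m-1) non-optimal actions, each eliminated within k* iterations."""
    return n * (m - 1) * elimination_threshold(gamma)


gamma = 0.9
k_star = elimination_threshold(gamma)
k_min = smallest_sufficient_k(gamma)
bound = howard_iteration_bound(10, 3, gamma)
```

For $\gamma = 0.9$ the closed-form threshold `k_star` is 24 while the exact smallest value `k_min` is 22, so the closed form is slightly conservative; with 10 states and 3 actions the resulting bound is 480 iterations.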
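8 A bound for Simplex-PI when $\gamma < 1$ (Proof of Theorem 4)
The overall line of arguments follows those given originally by ye, and is similar to that of the previous section. Still, the result we get is slightly better. For any $k$, we have:
$$\|v_* - T_{\pi_k} v_*\|_\infty \le \|v_* - v_{\pi_k}\|_\infty \le \mathbb{1}^T (v_* - v_{\pi_k}) \le \left(1 - \frac{1-\gamma}{n}\right)^k \mathbb{1}^T (v_* - v_{\pi_0}) \le \frac{n}{1-\gamma} \left(1 - \frac{1-\gamma}{n}\right)^k \|v_* - T_{\pi_0} v_*\|_\infty.$$
Similarly to the proof for Howard’s PI, we deduce that a non-optimal action is eliminated after at most
$$k^* = \left\lceil \frac{n}{1-\gamma} \log \frac{n}{1-\gamma} \right\rceil$$
iterations, and the overall number of iterations is obtained by noting that there are at most $n(m-1)$ non-optimal actions to eliminate.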
9 Another bound for Simplex-PI when $\gamma < 1$ (Proof of Theorem 5)
This second bound for Simplex-PI is a factor $O(\log n)$ better, but requires a slightly more careful analysis.
At each iteration , let be the state in which an action is switched. We have (by definition of the algorithm):
Starting with arguments similar to those for the contraction property of Simplex-PI, we have:
which implies that
On the other hand, we have:
which implies that
This implies in particular that
but also—since and are non-negative—that
Now, write the vector on the state space such that is the number of times state has been switched until iteration (including ). Since by Lemma 1 the sequence