I Introduction
Policy iteration is a family of algorithms used to find an optimal policy for a given Markov Decision Process (MDP). Simple Policy Iteration (SPI) is a variant of policy iteration in which the policy is changed at exactly one improvable state in every iteration. Melekopoglou and Condon [2] showed an exponential lower bound on the number of iterations taken by SPI for 2-action MDPs. This result has not since been generalized to k-action MDPs.
In this paper we revisit the algorithm and the analysis done in [2]. We generalize the previous result and prove a novel exponential lower bound on the number of iterations taken by policy iteration for n-state, k-action MDPs. We construct a family of MDPs and give an index-based switching rule that yields a strong lower bound of (k + 3) · 2^(n−2) − 2 iterations. In Section II we describe the relevant background and in Section III we present the notation. In Section IV we show the MDP construction and in Section V we describe the index-based switching rule. This is followed by experiments in Section VI and the proof in Section VII.
II Background
A Markov Decision Process (MDP) [1], [3] represents the environment for a sequential decision-making problem. An MDP is defined by the tuple (S, A, T, R, γ). S is the set of states and A is the set of actions; |S| is denoted as n and |A| is denoted as k. T : S × A × S → [0, 1] is the function that gives the transition probabilities. Specifically,
T(s, a, s′) is the probability of reaching state s′ from state s by taking action a. R : S × A → ℝ is the reward function; R(s, a) is the expected reward that the agent gets by taking action a from state s. γ ∈ [0, 1] is the discount factor, which indicates the importance given to future expected reward. A policy π(a | s) is defined as the probability that the agent chooses action a from state s. If the policy is deterministic, then π(s) is the action that the agent takes when it is in state s. For a given policy π, we define the value function V_π(s) as the total expected reward that the agent receives by following π from state s. The state-action value function Q_π(s, a) for a policy π is defined as the total expected reward that the agent receives if it takes action a from state s and then follows π.
II-A Policy Iteration
Policy Iteration (PI) is an iterative algorithm used to obtain the optimal policy of an MDP. Let (S, A, T, R, γ) describe an MDP and let Π be the set of all policies. It has been proved that there exists an optimal policy π* such that V_{π*}(s) ≥ V_π(s) for every state s and every policy π ∈ Π. PI consists of two fundamental steps performed sequentially in every iteration:
Policy Evaluation: This step evaluates the state values of the MDP under a particular policy. Given a deterministic policy π that maps each state to an action, the state values satisfy the relation:
V_π(s) = R(s, π(s)) + γ Σ_{s′} T(s, π(s), s′) V_π(s′).
The state values can be computed by solving this system of n linear equations.
Policy Improvement: The state-action value function can be found using:
Q_π(s, a) = R(s, a) + γ Σ_{s′} T(s, a, s′) V_π(s′).
A state s is defined as an improvable state under a policy π if max_a Q_π(s, a) > V_π(s). One or more improvable states are switched to an improvable action in the policy improvement step, and the resultant policy is denoted π′. There can be many choices for the "locally improving" policies, and different PI variants follow different switching strategies.
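The two steps above can be sketched in a few lines of NumPy (a minimal illustration, not the paper's code; the two-state MDP in the usage note is our own toy example):

```python
import numpy as np

def policy_evaluation(T, R, policy, gamma):
    """Solve V = R_pi + gamma * T_pi V for a deterministic policy.
    T has shape (n, k, n); R has shape (n, k); policy has shape (n,)."""
    n = T.shape[0]
    T_pi = T[np.arange(n), policy]              # (n, n) transitions under pi
    R_pi = R[np.arange(n), policy]              # (n,) one-step rewards
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)

def policy_iteration(T, R, policy, gamma, tol=1e-10):
    """Greedy PI: evaluate, then switch every improvable state at once."""
    policy = np.array(policy)
    while True:
        V = policy_evaluation(T, R, policy, gamma)
        Q = R + gamma * T @ V                   # Q(s, a)
        improvable = Q.max(axis=1) > V + tol
        if not improvable.any():
            return policy, V
        policy[improvable] = Q.argmax(axis=1)[improvable]
```

For example, on a two-state MDP where action 1 moves between the states, action 0 stays put, and state 1 pays reward 1, PI with γ = 0.9 converges to the policy (1, 0) with values (9, 10).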
III Notation
Starting from time t = 0, policy iteration updates the policy π_t to π_{t+1}. The value function and Q-values corresponding to π_t are denoted V_t and Q_t respectively.
The policy at any general time t is expressed as the tuple of actions chosen at the n state vertices, (π_t(s_1), …, π_t(s_n)).
IV MDP Construction
In this section we formulate a method of constructing a family of MDPs that gives the lower bound for the switching procedure. Our MDP construction builds on the formulation given by Melekopoglou and Condon [2]. For an MDP having n states and k (k ≥ 2) actions, the MDP graph is as follows:

The graph has "state" vertices s_1, …, s_n.

The graph has "average" vertices a_1, …, a_n.

The graph has 2 sink (terminal) vertices, s_0 and a_0, each with a fixed sink value; the two sink values differ, and Corollary 7.2 fixes them without loss of generality.
The transitions for the k actions on the MDP graph are constructed as follows:

Every action taken on an average vertex a_i results in an equally likely transition to the state vertex s_{i−1} and the average vertex a_{i−1}.

Action 0 on a state vertex s_i results in a deterministic transition to the state vertex s_{i−1}.

Action 1 on a state vertex s_i results in a deterministic transition to the average vertex a_i.

In an MDP having k > 2: action 2 on a state vertex results in a deterministic transition to an average vertex.

In an MDP having k > 3: the action k − 1 on a state vertex results in a deterministic transition to an average vertex.

In an MDP having k > 3: every other action on a state vertex results in a stochastic transition between two average vertices, reaching one with some probability p and the other with probability 1 − p. An increasing order is maintained over these transition probabilities.
Every transition into a sink state gives a reward equal to the sink value. Every other transition gives a reward of 0. The MDP is undiscounted: γ is set to 1. Note that setting k equal to 2 gives the family of MDPs described by Melekopoglou and Condon [2]. We shall denote the n-state, k-action MDP belonging to this family as F(n, k) henceforth.
Clearly, PI will never update the policy at the average vertices, since all actions are equivalent there, and so their policy always remains the initial policy. Thus, in all subsequent analysis, only the policy at the state vertices is considered.
Note that the optimal policy for this MDP plays action 1 at state vertex 1 and action 0 at every other state vertex.
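As a concrete aid, the two-action member of the family can be sketched in code. This is a speculative reconstruction from the description above: the exact vertex indexing, the target of action 1 (we send s_i to a_i), and the sink values (0 for the state-chain sink s_0, 1 for the average-chain sink a_0) are our assumptions, not taken verbatim from the paper.

```python
import numpy as np

def build_two_action_mdp(n):
    """Speculative sketch of the k = 2 (Melekopoglou-Condon) member of the
    family. Index layout: 0 = sink s_0, 1..n = state vertices s_1..s_n,
    n+1 = sink a_0, n+2..2n+1 = average vertices a_1..a_n."""
    N, k = 2 * n + 2, 2
    s = lambda i: i                  # index of s_i (s_0 is the first sink)
    a = lambda i: n + 1 + i          # index of a_i (a_0 is the second sink)
    sink_value = {s(0): 0.0, a(0): 1.0}   # assumed sink values

    T = np.zeros((N, k, N))
    for act in range(k):
        T[s(0), act, s(0)] = 1.0     # sinks are absorbing
        T[a(0), act, a(0)] = 1.0
        for i in range(1, n + 1):    # average vertices: all actions alike
            T[a(i), act, s(i - 1)] = 0.5
            T[a(i), act, a(i - 1)] = 0.5
    for i in range(1, n + 1):
        T[s(i), 0, s(i - 1)] = 1.0   # action 0: down the state chain
        T[s(i), 1, a(i)] = 1.0       # action 1: jump onto the average chain

    # Reward = sink value on any transition into a sink, 0 elsewhere.
    R = np.zeros((N, k))
    for v in range(N):
        if v not in sink_value:
            for act in range(k):
                R[v, act] = sum(p * sink_value.get(dst, 0.0)
                                for dst, p in enumerate(T[v, act]))
    terminal = np.zeros(N, dtype=bool)
    terminal[[s(0), a(0)]] = True
    return T, R, terminal
```

Because the MDP is undiscounted, policy evaluation should treat the two sinks as terminal (value 0 beyond the entry reward) and solve the linear system over the remaining vertices only.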
V Simple Policy Iteration
In Simple Policy Iteration (SPI), the policy of a single improvable state is switched to an improving action in each iteration. In our index-based variant, specifically, the improvable state with the highest index is selected and its policy is switched to its improvable action with the highest index.
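The index-based rule can be sketched directly (a minimal illustration assuming a discounted MDP in the standard (T, R) tensor form; the helper and the toy MDP in the test are our own, not the paper's code):

```python
import numpy as np

def simple_policy_iteration(T, R, policy, gamma, tol=1e-9):
    """One switch per iteration: the highest-indexed improvable state moves
    to its highest-indexed improvable action. Returns (policy, #switches)."""
    policy = np.array(policy)
    n = T.shape[0]
    switches = 0
    while True:
        # Policy evaluation: V = (I - gamma * T_pi)^(-1) R_pi
        T_pi = T[np.arange(n), policy]
        R_pi = R[np.arange(n), policy]
        V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
        Q = R + gamma * T @ V
        improvable = np.flatnonzero(Q.max(axis=1) > V + tol)
        if improvable.size == 0:
            return policy, switches
        s = improvable.max()                          # highest-indexed state
        a = np.flatnonzero(Q[s] > V[s] + tol).max()   # highest-indexed action
        policy[s] = a
        switches += 1
```

Each pass solves one linear system and performs exactly one switch, so the returned switch count is the iteration count studied in this paper.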
We denote the number of iterations taken by SPI, for the n-state, k-action MDP from the above family with an initial policy of all zeros, to converge to the optimum policy as f(n, k). We shall experimentally show and later prove that:
f(n, k) = (k + 3) · 2^(n−2) − 2.   (1)
VI Experiments
Figure 4 shows a plot of the number of iterations against the number of states and actions. Table II in the appendix contains the number of iterations for all (n, k) pairs up to n = 10, k = 10.
We next describe how the switching happens for the MDP graph shown in Figure 1. Initially, both state vertices 1 and 2 are improvable.
According to the switching rule, state 2 switches to action 1. At this point only state 1 is improvable, so state 1 switches to its highest-indexed improvable action, which is 2. After this switch, state 1 is again the only improvable state and it switches to action 1. This switch causes state 1 to attain its best possible value (0) and also makes state 2 improvable.
Hence state 2 switches to action 0 and SPI converges to the optimal policy. The switching is shown in the table below.
t  policy
0  (0, 0)
1  (0, 1)
2  (2, 1)
3  (1, 1)
4  (1, 0)
VII Proof
The proof of the recursive relation requires the construction of a complementary family of MDPs, which have the same structure and transition probabilities as the original family but sink values in the opposite relative order. We shall denote the n-state, k-action MDP belonging to this complementary family as the complementary MDP henceforth. By Corollary 7.2, the complementary MDP is set to have sink values of (1, 0). Note that its optimal policy is the corresponding optimum under these flipped sink values. We denote the number of iterations taken by SPI for the n-state, k-action complementary MDP, beginning with the given initial policy, to converge as f′(n, k).
Lemma 7.1.
Policy iteration for the n-state, k-action MDPs from the family and its complement is invariant to the actual sink values, and depends only on their relative order.
Proof.
Let the sink values be σ = (σ_0, σ_1).
A transformation of the sink values that maintains their relative order can be expressed as a linear transform σ′_i = α σ_i + β,
where α > 0.
The linear transformation of the sink rewards results in the same transformation of the V and Q values. As α > 0, the relative order of the Q-values does not change, and so the switches do not change.
∎
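Since all intermediate rewards are zero and every trajectory eventually reaches a sink, the invariance step can be written out explicitly (our rendering of the sketch above, writing σ_0, σ_1 for the two sink values):

```latex
% Every trajectory absorbs into one of the two sinks, and all other rewards
% are zero, so with absorption probabilities p_0(s) + p_1(s) = 1:
V_\pi(s) = p_0(s)\,\sigma_0 + p_1(s)\,\sigma_1 .
% Under the order-preserving transform \sigma_i' = \alpha \sigma_i + \beta,
% with \alpha > 0:
V'_\pi(s) = \alpha V_\pi(s) + \beta , \qquad
Q'_\pi(s,a) = \alpha Q_\pi(s,a) + \beta ,
% hence
Q'_\pi(s,a) - V'_\pi(s) = \alpha \bigl( Q_\pi(s,a) - V_\pi(s) \bigr),
% and the set of improvable (state, action) pairs is exactly preserved.
```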
Corollary 7.2.
Sink values for the complementary MDPs can be set to σ = (1, 0), and those for the MDPs from the original family to values in the opposite relative order, without loss of generality.
Lemma 7.3.
At any time t, for ,
Proof.
By the structure of the MDP,
and
This results in
By the construction of ,
Plugging the values of will yield the desired relation. ∎
Lemma 7.4.
At any time t, for ,
Proof.
By the structure of the MDP,
and
This results in
By the construction of ,
Plugging the values of will yield the desired relation. ∎
Lemma 7.5 (Baseline).
For the 2-state, k-action MDP of this family with initial policy (0, 0), SPI takes exactly k + 1 iterations.
Proof.
We have the initial policy π_0 = (0, 0), under which both state vertices are improvable.
As per our definition of SPI, the highest-indexed improvable action at the highest-indexed improvable state is chosen. Consequently, π_1 = (0, 1).
Next, state 2 is no longer improvable, and we focus on improving state 1. We observe that its highest-indexed improvable action is k − 1.
Hence, π_2 = (k − 1, 1).
Even now, state 2 is not improvable, by our choice of transition probabilities.
From here on, only state 1 is improvable, and by Lemma 7.4 its policy descends one action index per switch.
Thus, the next k − 2 iterations are required to reach π = (1, 1); a final switch of state 2 to action 0 then yields the optimal policy (1, 0), giving a total of k + 1 iterations.
∎
Lemma 7.6.
For the MDP of this family with the given initial policy, it takes exactly the stated number of iterations before the policy at the first state vertex changes.
Proof.
Due to the switching rule of SPI, the policy at the first state vertex will change only when none of the other state vertices is improvable. Until all the higher states finish improving, the current sinks and state 1 can be effectively reduced to new sinks. This reduction is shown in Figure 5. Using Lemma 7.1, the resultant MDP is equivalent to an (n − 1)-state MDP of our construction with the corresponding initial policy, and it takes the corresponding number of iterations to converge to its optimal policy. By this logic, after these iterations the policy at the first state vertex is finally ready to change. ∎
Lemma 7.7.
Following the iterations of Lemma 7.6, the next 2 switches of simple policy iteration occur at state 1.
Proof.
Using Lemma 7.6, the policy is now optimal with respect to the higher states. Vertex 1 is the only improvable state and, according to Lemma 7.4, it switches to its highest-indexed improvable action. With respect to the current policy, the sinks and state 1 can be effectively reduced to new sinks. This reduction is shown in Figure 6. The action values are equal for equal sink values, and hence the policy for the higher states is still optimal. State 1 is still the only improvable state and, according to Lemma 7.4, it switches to the next improvable action. This completes the proof. ∎
Lemma 7.8.
Following the switches of Lemma 7.7, it takes the stated number of further iterations to converge to the optimal policy.
Proof.
With respect to the current policy, the sinks and state 1 can be effectively reduced to new sinks. This reduction is shown in Figure 7. Invoking Lemma 7.1, this MDP is equivalent to an (n − 1)-state MDP of our construction with the corresponding initial policy, and hence takes the corresponding number of iterations to converge. The complete policy is then also the optimal policy for the original MDP. ∎
Theorem 7.9.
(2) 
Proof.
This can be proved by sequentially applying Lemmas 7.6–7.8. ∎
Lemma 7.10.
For the complementary MDP with the given initial policy, it takes exactly the stated number of iterations before the policy at the first state vertex changes.
Proof.
Due to the switching rule of SPI, the policy at the first state vertex will change only when none of the other state vertices is improvable. Until the higher states finish improving, the current sinks and state 1 can be effectively reduced to new sinks. This reduction is shown in Figure 8. Using Lemma 7.1, the resultant MDP is equivalent to an (n − 1)-state MDP of our construction with the corresponding initial policy, and it takes the corresponding number of iterations to converge to its optimal policy. By this logic, after these iterations the policy at the first state vertex is finally ready to change. ∎
Lemma 7.11.
Following the iterations of Lemma 7.10, the next 2 switches of simple policy iteration occur at vertex 1.
Proof.
Using Lemma 7.3, the policy is now optimal with respect to the higher states. Vertex 1 is the only improvable state and, according to Lemma 7.3, it switches to its highest-indexed improvable action. With respect to the current policy, the sinks and state 1 can be effectively reduced to new sinks. This reduction is shown in Figure 9. The action values are equal for equal sink values, and hence the policy for the higher states is still optimal. State 1 is still the only improvable state and, according to Lemma 7.3, it switches to the next improvable action, k − 2. This completes the proof. ∎
Lemma 7.12.
Following the switches of Lemma 7.11, it takes the stated number of iterations before the policy at the first state vertex changes.
Proof.
Lemma 7.13.
If for , it takes iterations to converge to optimal policy.
Proof.
With respect to the current policy, the sinks and state 1 can be effectively reduced to new sinks. Invoking Lemma 7.1, this MDP is equivalent to a smaller MDP of our construction whose initial policy is already optimal. This means that the only change that will happen is at state 1. Using Lemma 7.3, we get an incremental change in the improvable policy. Also, at any subsequent stage the sink values imply that only state 1 can be improved. Hence it takes the stated number of iterations to converge to the optimal policy.
∎
Theorem 7.14.
(3) 
Theorem 7.15 (Recursive Relation).
Proof.
Theorem 7.16.
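The closed form can be cross-checked against Table II: the counts there satisfy a first-order recurrence with the Lemma 7.5 baseline. Writing f(n, k) for the number of SPI iterations (our notation, read off the experimental table rather than the elided theorem statements):

```latex
f(n,k) = 2\,f(n-1,k) + 2, \qquad f(2,k) = k + 1,
% which unrolls to
f(n,k) = 2^{\,n-2}\,f(2,k) + 2\left(2^{\,n-2} - 1\right)
       = (k+3)\,2^{\,n-2} - 2 .
```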
VIII Conclusion
In this work, we established a generalized lower bound on the number of iterations of simple policy iteration for an n-state, k-action MDP. We demonstrated the MDP formulation and proved a lower bound of (k + 3) · 2^(n−2) − 2 iterations. However, we do not reject the existence of an MDP with a tighter (larger) lower bound. Among the families of MDPs that we constructed and verified, this construction yielded the strongest bound. Considering the switching rule employed by our construction, and the other families of MDPs we tested, finding an MDP with a tighter lower bound would be an interesting extension of our work.
IX Additional Result
We observed that the pattern used to define the multiple actions was not being followed at the very last state, which left scope for improvement. A simple modification to the actions from the final state improves the baseline, which further increases the lower bound; this was confirmed experimentally. Since the change is only at the final state, we believe that the rest of the proof remains the same. The variation in the MDP is shown in Appendix Fig. 11, where the modified actions transition with the indicated probabilities.
References
[1] R. Bellman (1957) Dynamic Programming. 1st edition, Princeton University Press, Princeton, NJ, USA.
[2] M. Melekopoglou and A. Condon (1994) On the complexity of the policy improvement algorithm for Markov decision processes. INFORMS Journal on Computing 6, pp. 188–192.
[3] M. L. Puterman (1994) Markov Decision Processes. Wiley.
[4] (2019) A tighter analysis of randomised policy iteration. In UAI, pp. 174.
X Appendix
We present the simulation results for all actions and states up to ten in the table below. The experimental results are consistent with the theoretical values derived and proved.
n=2  n=3  n=4  n=5  n=6  n=7  n=8  n=9  n=10  

k=3  4  10  22  46  94  190  382  766  1534 
k=4  5  12  26  54  110  222  446  894  1790 
k=5  6  14  30  62  126  254  510  1022  2046 
k=6  7  16  34  70  142  286  574  1150  2302 
k=7  8  18  38  78  158  318  638  1278  2558 
k=8  9  20  42  86  174  350  702  1406  2814 
k=9  10  22  46  94  190  382  766  1534  3070 
k=10  11  24  50  102  206  414  830  1662  3326 
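The table can be checked mechanically against the closed form implied by the counts (the recurrence f(n, k) = 2 f(n − 1, k) + 2 with f(2, k) = k + 1 is our reading of the data; the entries below are transcribed from Table II):

```python
def predicted_iterations(n, k):
    """Closed form (k + 3) * 2^(n - 2) - 2, the solution of
    f(n, k) = 2 * f(n - 1, k) + 2 with baseline f(2, k) = k + 1."""
    return (k + 3) * 2 ** (n - 2) - 2

# A few entries transcribed from Table II: (n, k) -> iterations.
table = {(2, 3): 4, (10, 3): 1534, (2, 10): 11, (10, 10): 3326,
         (5, 6): 70, (7, 8): 350}

for (n, k), observed in table.items():
    assert predicted_iterations(n, k) == observed
    if n > 2:  # the recurrence itself also holds
        assert predicted_iterations(n, k) == 2 * predicted_iterations(n - 1, k) + 2
```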