Analysis of Lower Bounds for Simple Policy Iteration

11/28/2019
by Sarthak Consul, et al.
IIT Bombay

Policy iteration is a family of algorithms used to find an optimal policy for a given Markov Decision Problem (MDP). Simple Policy Iteration (SPI) is a variant of policy iteration in which the policy is switched at exactly one improvable state in every iteration. Melekopoglou and Condon [1990] showed an exponential lower bound on the number of iterations taken by SPI for 2-action MDPs. This result has not since been generalized to k-action MDPs. In this paper, we revisit the algorithm and the analysis of Melekopoglou and Condon. We generalize the previous result and prove a novel exponential lower bound on the number of iterations taken by policy iteration for N-state, k-action MDPs. We construct a family of MDPs and give an index-based switching rule that yields a strong lower bound of O((3+k)·2^(N/2−3)).

I Introduction

Policy iteration is a family of algorithms used to find an optimal policy for a given Markov Decision Problem (MDP). Simple Policy Iteration (SPI) is a variant of policy iteration in which the policy is switched at exactly one improvable state in every iteration. Melekopoglou and Condon [2] showed an exponential lower bound on the number of iterations taken by SPI for 2-action MDPs. This result has not since been generalized to k-action MDPs.

In this paper we revisit the algorithm and the analysis done in [2]. We generalize the previous result and prove a novel exponential lower bound on the number of iterations taken by policy iteration for N-state, k-action MDPs. We construct a family of MDPs and give an index-based switching rule that yields a strong lower bound of O((3+k)·2^(N/2−3)). In Section II we describe the relevant background and in Section III we present the important notation. In Section IV we show the MDP construction and in Section V we describe the index-based switching rule. This is followed by experiments in Section VI and the proof in Section VII.

II Background

A Markov Decision Process (MDP) [1], [3] represents the environment for a sequential decision-making problem. An MDP is defined by the tuple (S, A, T, R, γ). S is the set of states and A is the set of actions; |S| is denoted by n and |A| by k.

T : S × A × S → [0, 1] is the transition function: T(s, a, s') is the probability of reaching state s' from state s by taking action a. R : S × A → ℝ is the reward function: R(s, a) is the expected reward that the agent gets by taking action a from state s. γ ∈ [0, 1] is the discount factor, which indicates the importance given to future expected reward. A policy π gives π(s, a), the probability that the agent chooses action a from state s. If the policy is deterministic, then π(s) is the action that the agent takes in state s. For a given policy π, we define the value function V^π(s) as the total expected reward that the agent receives by following π from state s. The state-action value function Q^π(s, a) for a policy π is defined as the total expected reward that the agent receives if it takes action a from state s and then follows π.
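For concreteness, these two definitions can be written in standard MDP notation (using the symbols introduced above) as:

```latex
\begin{aligned}
V^{\pi}(s)   &= \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^{t} R(s_t, a_t) \;\middle|\; s_0 = s,\; a_t \sim \pi(s_t,\cdot)\right],\\
Q^{\pi}(s,a) &= R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, V^{\pi}(s').
\end{aligned}
```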

II-A Policy Iteration

Policy Iteration (PI) is an iterative algorithm that is used to obtain an optimal policy of an MDP. Let (S, A, T, R, γ) describe an MDP and let Π be the set of all policies. It has been proved that there exists an optimal policy π* such that V^{π*}(s) ≥ V^{π}(s) for all policies π ∈ Π and all states s. PI consists of two fundamental steps performed sequentially in every iteration:
Policy Evaluation: This step evaluates the state values of the MDP under a particular policy. Given a deterministic policy π that maps each state s to an action π(s), the state values satisfy the following relation:

V^{π}(s) = R(s, π(s)) + γ Σ_{s' ∈ S} T(s, π(s), s') V^{π}(s')

The state values can be computed by solving this system of n linear equations.
Policy Improvement: The state-action value function can be found using:

Q^{π}(s, a) = R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') V^{π}(s')

A state s is defined as an improvable state under a policy π if max_a Q^{π}(s, a) > Q^{π}(s, π(s)). One or more improvable states are switched to an improving action in the policy improvement step, and the resulting policy is denoted π'. There can be many choices for the "locally improving" policy, and different PI variants follow different switching strategies; a sketch of one evaluation-plus-improvement cycle is given below.
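The sketch below (Python with NumPy) performs one such cycle; the tabular array layout and all names are illustrative assumptions rather than notation from the paper:

```python
import numpy as np

def evaluate_policy(T, R, gamma, policy):
    """Policy evaluation: solve (I - gamma * P_pi) V = r_pi for a
    deterministic policy. T is (n, k, n), R is (n, k), policy has length n."""
    n = T.shape[0]
    P_pi = T[np.arange(n), policy]              # rows T(s, pi(s), .)
    r_pi = R[np.arange(n), policy]              # entries R(s, pi(s))
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def q_values(T, R, gamma, V):
    """Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') V(s')."""
    return R + gamma * T @ V

def improvable_states(Q, policy, tol=1e-9):
    """States with at least one action strictly better than the current one."""
    current = Q[np.arange(len(policy)), policy]
    return np.where(Q.max(axis=1) > current + tol)[0]

# Toy usage on a random 3-state, 2-action MDP (purely illustrative).
rng = np.random.default_rng(0)
T = rng.random((3, 2, 3)); T /= T.sum(axis=2, keepdims=True)
R = rng.random((3, 2))
policy = np.zeros(3, dtype=int)
V = evaluate_policy(T, R, 0.9, policy)
print(improvable_states(q_values(T, R, 0.9, V), policy))
```

For the undiscounted MDPs constructed in Section IV, the same linear solve can be applied with γ = 1 over the non-sink states, since the sinks are absorbing.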

III Notation

Starting from time t, policy iteration updates the policy π_t to π_{t+1}. The value function and Q-values corresponding to π_t are denoted by V_t and Q_t respectively.

The policy π_t at any time t is written out as the tuple of actions it assigns to the n state vertices.

IV MDP Construction

In this section we formulate a method of constructing a family of MDPs that gives the lower bound for the switching procedure. Our MDP construction builds on the formulation given by Melekopoglou and Condon [2]. For an MDP having n states and k actions, the MDP graph is as follows:

  • The graph has n “state” vertices

  • The graph has n “average” vertices

  • The graph has 2 sink (terminal) vertices with sink values σ = (σ1, σ2),

where σ1 < σ2.
The transitions for actions 0, 1, ..., k−1 on the MDP graph are constructed as follows:

  • Every action taken on an average vertex results in an equally likely transition to the state vertex and average vertex

  • Action 0 on a state vertex results in a deterministic transition to the state vertex

  • Action 1 on a state vertex results in a deterministic transition to the average vertex

  • In an MDP having 2: An action on the state vertex results in a deterministic transition to the average vertex .

  • In an MDP having 3: The action on a state vertex results in a deterministic transition to the average vertex .

  • In an MDP having 3: An action on a state vertex results in a stochastic transition to the average vertex with a probability and to the average vertex with a probability . An increasing order is maintained over the transition probabilities, that is .

Every transition into a sink state gives a reward equal to that sink's value. Every other transition gives a reward of 0. The MDP is undiscounted, i.e. γ is set to 1. Note that setting k equal to 2 gives the family of MDPs described by Melekopoglou and Condon [2]. We shall refer to the n-state, k-action MDP from this family as the (n, k) MDP henceforth.

Clearly, PI will never update the policy at the average vertices, since all actions at an average vertex are equivalent; their policy therefore always remains the initial policy. Thus, for all subsequent analysis, only the policy at the state vertices is considered.

Note that the optimal policy for this MDP is ().

Figures 1, 2 and 3 show the MDP graph for , and respectively.

Fig. 1: , Sinks are square vertices (Best viewed in color)
Fig. 2: (Best viewed in color)
Fig. 3: (Best viewed in color)

V Simple Policy Iteration

In Simple Policy Iteration (SPI), the policy of an arbitrary improvable state is switched to an arbitrary improving action. Specifically, the improvable state with the highest index is selected and its policy is switched to the improving action with the highest index; a sketch of this rule follows.
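The following sketch (Python with NumPy, same illustrative tabular layout as the earlier sketch; using array position as the state index is an assumption about how vertices are numbered) implements the rule:

```python
import numpy as np

def simple_policy_iteration(T, R, policy, gamma=0.9, tol=1e-9):
    """SPI with the highest-index-state / highest-index-action switching rule.

    T: (n, k, n) transition probabilities, R: (n, k) expected rewards,
    policy: length-n initial deterministic policy (one action index per state).
    Returns the final policy and the number of iterations taken.
    """
    n = T.shape[0]
    policy = policy.copy()
    iterations = 0
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi.
        P_pi = T[np.arange(n), policy]
        r_pi = R[np.arange(n), policy]
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
        Q = R + gamma * T @ V

        current = Q[np.arange(n), policy]
        improvable = np.where(Q.max(axis=1) > current + tol)[0]
        if improvable.size == 0:
            return policy, iterations

        # Highest-indexed improvable state, then the highest-indexed action
        # that strictly improves on its current action.
        s = improvable.max()
        better_actions = np.where(Q[s] > current[s] + tol)[0]
        policy[s] = better_actions.max()
        iterations += 1
```

On the undiscounted MDPs of Section IV, the evaluation step would instead solve the linear system restricted to the non-sink states; the switching logic itself is unchanged.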

We count the number of iterations taken by SPI to converge to the optimal policy for the n-state, k-action MDP from the above family, starting from its initial policy. We shall experimentally show and later prove that:

(1)

VI Experiments

Figure 4 shows a plot of the number of iterations against the number of states and actions. Table II in the appendix contains the number of iterations for all (n, k) pairs up to n = 10 and k = 10.

Fig. 4: (Top) Variation of the number of iterations with the number of states for a fixed number of actions. The vertical scale is logarithmic (base 2); the lines are almost linear, consistent with an exponential dependence on the number of states. (Bottom) Variation of the number of iterations with the number of actions for a fixed number of states. This dependence is linear, with a slope that increases with the number of states.
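A small script in the spirit of Figure 4 (top), using the iteration counts from Table II in the appendix (matplotlib is assumed to be available; only a subset of the k rows is plotted for brevity):

```python
import matplotlib.pyplot as plt

# Iteration counts copied from Table II (columns: n = 2..10).
table = {
    3: [4, 10, 22, 46, 94, 190, 382, 766, 1534],
    5: [6, 14, 30, 62, 126, 254, 510, 1022, 2046],
    7: [8, 18, 38, 78, 158, 318, 638, 1278, 2558],
    10: [11, 24, 50, 102, 206, 414, 830, 1662, 3326],
}
ns = list(range(2, 11))

for k, iterations in table.items():
    plt.plot(ns, iterations, marker="o", label=f"k = {k}")
plt.yscale("log", base=2)        # log-2 vertical scale, as in Figure 4 (top)
plt.xlabel("number of states n")
plt.ylabel("SPI iterations")
plt.legend()
plt.show()
```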

We next describe how the switching happens for the MDP graph shown in Figure 1. Initially both vertices 1 and 2 are switchable because:

According to the switching rule, state 2 switches to action 1. At this point only state 1 is improvable. So state 1 switches to the highest-indexed improving action, which is 2. After this switch, state 1 is again the only improvable state and it switches to action 1. This switch causes state 1 to attain its best possible value (0) and also makes state 2 improvable because:

Hence state 2 switches to action 0 and SPI converges to the optimal policy. The switching has been shown in the table below.

t
0
1
2
3
4
TABLE I: Switching sequence for

VII Proof

The proof of the recursive relation requires the construction of a complementary family of MDPs, which have the same structure and transition probabilities as the original family but sink values of the opposite relative order. We shall refer to the n-state, k-action MDP from this complementary family as the complementary (n, k) MDP henceforth. By Corollary 7.2, the complementary MDP is set to have sink rewards of (1, 0). Note that the optimal policy for the complementary MDP is . We similarly count the number of iterations taken by SPI for the n-state, k-action complementary MDP, beginning from its initial policy, to converge to its optimal policy.

Lemma 7.1.

Policy iteration for the n-state, k-action MDPs from the original and complementary families is invariant to the actual sink values and depends only on their relative order.

Proof.

Let the sink values be σ = (σ1, σ2).

A transformation of the sink rewards that maintains their relative order can be expressed as a linear transform:

σ' = a·σ + b, where a > 0.

The linear transformation of the sink rewards results in the same transformation of the V and Q values. As a > 0, the relative order of the Q-values does not change, and so the switches do not change.
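One way to make the middle step explicit (a sketch, under the assumptions of the construction above: γ = 1, non-sink transitions give zero reward, and every trajectory eventually enters exactly one sink, whose value is the only reward collected):

```latex
\[
V'^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, a\,\sigma_{\text{sink}} + b \mid s_0 = s \,\right]
            = a\, V^{\pi}(s) + b,
\qquad
Q'^{\pi}(s,u) = a\, Q^{\pi}(s,u) + b,
\]
```

so for a > 0 the difference Q'^π(s, u1) − Q'^π(s, u2) = a (Q^π(s, u1) − Q^π(s, u2)) has the same sign as before the transformation, and SPI makes exactly the same sequence of switches.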

Corollary 7.2.

Sink values for the MDPs from the original family can be set to (−1, 0) and those for the complementary MDPs can be set to (1, 0) without loss of generality.

Lemma 7.3.

At any time t, for ,

Proof.

By the structure of the MDP, and
This results in

By the construction of ,

Plugging the values of will yield the desired relation. ∎

Lemma 7.4.

At any time t, for ,

Proof.

By the structure of the MDP, and
This results in

By the construction of ,

Plugging the values of will yield the desired relation. ∎

Lemma 7.5 (Baseline).
Proof.

We have the initial policy and

As per our definition of SPI, the highest-indexed improving action will be chosen. Consequently,

Next we have , we focus on improving state 1. We observe

Hence,

Even now, state 2 is not improvable and we have

by our choice of . Hence,

Now,

So,

For all subsequent iterations,

and hence only is improvable. From Lemma 7.4, we have

Thus, the next iterations are required to reach

thus giving a total of iterations.

Lemma 7.6.

For the (n, k) MDP with initial policy , it takes exactly iterations before the policy at the first state vertex changes.

Proof.

Due to the switching rule of SPI, the policy at the first state vertex will only change when all the other state vertices are not improvable. Until all the higher states finish improving, the current sinks and state 1 can be effectively reduced to new sinks with . This reduction is shown in Figure 5. Using Lemma 7.1, the resultant MDP is equivalent to having initial policy . This MDP takes iterations to converge to the optimal policy . By this logic, after these iterations the policy would be

Fig. 5: Reducing to with respect to its initial policy
Lemma 7.7.

If for , the next 2 switches of simple policy iteration occur at state 1

Proof.

Using Lemma 7.6, the policy is optimal with respect to the states and is . Vertex 1 is the only improvable state and, according to Lemma 7.4, it switches to its highest-indexed improving action . Hence the policy becomes . With respect to the current policy, the sinks and state 1 can be effectively reduced to new sinks with . This reduction is shown in Figure 6. The action values are equal for equal sink values and hence the policy for states is still optimal. State 1 is still the only improvable state and, according to Lemma 7.4, it switches to the next improving action, 0. This completes the proof. The policy would now be

Fig. 6: Reducing with respect to the policy
Lemma 7.8.

If for , it takes iterations to converge to optimal policy.

Proof.

With respect to the sinks and state 1 can be effectively reduced to new sinks with . This reduction is shown in Figure 7. Invoking Lemma 7.1 this MDP is equivalent to with initial policy and hence takes iterations to converge to optimal policy . The complete policy is now which is also the optimal policy for

Fig. 7: Reducing to with respect to policy
Theorem 7.9.
(2)
Proof.

This can be proved by sequentially applying Lemmas 7.6-7.8 ∎

Lemma 7.10.

For the complementary (n, k) MDP having initial policy , it takes exactly iterations before the policy at the first state vertex changes.

Proof.

Due to the switching rule of SPI, the policy at the first state vertex will only change when all the other state vertices are not improvable. Until the higher states finish improving, the current sinks and state 1 can be effectively reduced to new sinks with . This reduction is shown in Figure 8. Using Lemma 7.1, the resultant MDP is equivalent to having initial policy . This MDP takes iterations to converge to the optimal policy . By this logic, after these iterations the policy would be

Fig. 8: Reducing to with respect to its initial policy
Lemma 7.11.

If , the next 2 switches of simple policy iteration occur at vertex 1

Proof.

Using Lemma 7.3, the policy is optimal with respect to the states and is . Vertex 1 is the only improvable state and, according to Lemma 7.3, it switches to its highest-indexed improving action . Hence the policy becomes . With respect to the current policy, the sinks and state 1 can be effectively reduced to new sinks with . This reduction is shown in Figure 9. The action values are equal for equal sink values and hence the policy for states is still optimal. State 1 is still the only improvable state and, according to Lemma 7.3, it switches to the next improving action, k−2. This completes the proof. The policy would now be

Fig. 9: Reducing with respect to policy
Lemma 7.12.

If for , it takes iterations before the policy at the first state vertex changes.

Proof.

With respect to the sinks and state 1 can be effectively reduced to new sinks with . This reduction is shown in Figure 10. Invoking Lemma 7.1 this MDP is equivalent to with initial policy and hence takes iterations to converge to optimal policy . The complete policy is now

Fig. 10: Reducing to with respect to policy
Lemma 7.13.

If for , it takes iterations to converge to optimal policy.

Proof.

With respect to for the sinks and state 1 can be effectively reduced to new sinks with . Invoking Lemma 7.1, this MDP is equivalent to with initial policy and hence is already optimal. This means that the only change that will happen is at state 1. Using Lemma 7.3, we get an incremental change in the improvable policy. Also, at any subsequent stage the value of the sink is , which again implies that only state 1 can be improved. Hence it would take iterations to converge to the optimal policy

Theorem 7.14.
(3)
Proof.

This can be proven by sequentially applying Lemmas 7.10-7.13. ∎

Theorem 7.15 (Recursive Relation).
Proof.

Subtracting eq.(2) from eq.(3), we get the relation:

As the RHS of the above is independent of , we can replace with to get

Substituting from the above equation into eq.(3) completes the proof. ∎

Theorem 7.16.
Proof.

For a fixed k, use the baseline from Lemma 7.5 and apply the recursive relation described by Theorem 7.15 to complete the proof. ∎
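As a numerical cross-check against Table II in the appendix, the short script below regenerates every tabulated entry from the closed-form expression f(n, k) = (k + 3)·2^(n−2) − 2; this closed form is an inferred expression that matches all 72 entries of the table and has the same order as the abstract's O((3+k)·2^(N/2−3)) bound, and it is stated here only as a consistency check.

```python
# Cross-check: f(n, k) = (k + 3) * 2**(n - 2) - 2 reproduces Table II.
# (The closed form is our inferred reading of the bound, not a quoted formula.)

TABLE_II = {  # rows: k, columns: n = 2..10
    3: [4, 10, 22, 46, 94, 190, 382, 766, 1534],
    4: [5, 12, 26, 54, 110, 222, 446, 894, 1790],
    5: [6, 14, 30, 62, 126, 254, 510, 1022, 2046],
    6: [7, 16, 34, 70, 142, 286, 574, 1150, 2302],
    7: [8, 18, 38, 78, 158, 318, 638, 1278, 2558],
    8: [9, 20, 42, 86, 174, 350, 702, 1406, 2814],
    9: [10, 22, 46, 94, 190, 382, 766, 1534, 3070],
    10: [11, 24, 50, 102, 206, 414, 830, 1662, 3326],
}

def f(n, k):
    return (k + 3) * 2 ** (n - 2) - 2

for k, row in TABLE_II.items():
    assert row == [f(n, k) for n in range(2, 11)], k
print("closed form matches all entries of Table II")
```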

VIII Conclusion

In this work, we established a generalized lower bound on the number of iterations taken by simple policy iteration for an n-state, k-action MDP. We demonstrated the MDP formulation and proved a lower bound of O((3+k)·2^(N/2−3)). However, we do not reject the existence of an MDP with a tighter lower bound. Out of all the families of MDPs that we constructed and verified, most of them were and a few were . Considering the switching rule employed by our construction, and the families of other MDPs we tested, finding an MDP with a tighter lower bound would be an interesting extension of our work.

IX Additional Result

We observed that the pattern used to define the multiple actions was not being followed at the very last states, which left scope for improvement. A simple modification to the actions from the final state improves the baseline from

to

which further increases the lower bound from to , as confirmed experimentally. Since the change is only at the final state, we believe that the rest of the proof will remain the same. The variation in the MDP is shown in Fig. 11 in the Appendix. are actions with probability respectively, and .

References

  • [1] R. Bellman (1957) Dynamic Programming. 1st edition, Princeton University Press, Princeton, NJ, USA.
  • [2] M. Melekopoglou and A. Condon (1994) On the complexity of the policy improvement algorithm for Markov decision processes. INFORMS Journal on Computing 6, pp. 188–192.
  • [3] M. L. Puterman (1994) Markov Decision Processes. Wiley. ISBN 978-0471727828.
  • [4] M. Taraviya and S. Kalyanakrishnan (2019) A tighter analysis of randomised policy iteration. In UAI, pp. 174.

X Appendix

We present the simulation results for all numbers of actions and states up to ten in the table below. The experimental results are consistent with the theoretical values derived and proved above.

        n=2   n=3   n=4   n=5   n=6   n=7   n=8   n=9   n=10
k=3       4    10    22    46    94   190   382   766   1534
k=4       5    12    26    54   110   222   446   894   1790
k=5       6    14    30    62   126   254   510  1022   2046
k=6       7    16    34    70   142   286   574  1150   2302
k=7       8    18    38    78   158   318   638  1278   2558
k=8       9    20    42    86   174   350   702  1406   2814
k=9      10    22    46    94   190   382   766  1534   3070
k=10     11    24    50   102   206   414   830  1662   3326
TABLE II: Number of iterations for the constructed family of MDPs
Fig. 11: Improvement in MDP (Best viewed in color)