Geometric Policy Iteration for Markov Decision Processes

Recently discovered polyhedral structures of the value function for finite state-action discounted Markov decision processes (MDP) shed light on understanding the success of reinforcement learning. We investigate the value function polytope in greater detail and characterize the polytope boundary using a hyperplane arrangement. We further show that the value space is a union of finitely many cells of the same hyperplane arrangement and relate it to the polytope of the classical linear programming formulation for MDPs. Inspired by these geometric properties, we propose a new algorithm, Geometric Policy Iteration (GPI), to solve discounted MDPs. GPI updates the policy of a single state by switching to an action that is mapped to the boundary of the value function polytope, followed by an immediate update of the value function. This new update rule aims at a faster value improvement without compromising computational efficiency. Moreover, our algorithm allows asynchronous updates of state values, which makes it more flexible and advantageous than traditional policy iteration when the state set is large. We prove that the complexity of GPI achieves the best known bound $\mathcal{O}\!\left(\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}\log\frac{1}{1-\gamma}\right)$ of policy iteration and empirically demonstrate the strength of GPI on MDPs of various sizes.


1. Introduction

The Markov decision process (MDP) is the mathematical foundation of reinforcement learning (RL), which has achieved great empirical success in sequential decision problems. Despite RL's success, many mathematical properties of MDPs remain to be discovered in order to understand RL algorithms theoretically. In this paper, we study the geometric properties of discounted MDPs with finite states and actions and propose a new value-based algorithm inspired by their polyhedral structures.

A large family of methods for solving MDPs is based on the notion of the value function, which maps a policy to state values. When the value function is maximized, the optimal policy can be extracted by taking a greedy step with respect to the value function. Policy iteration (howard60dynamic) is one such algorithm: it repeatedly alternates between a policy evaluation step and a policy improvement step until convergence. The policy is mapped into the value space in the policy evaluation step and then greedily improved according to the state values in the policy improvement step. It is also well known that the optimal state values can be computed by linear programming (LP) (puterman94markov), a formulation that attracts considerable research interest in its own right.

Although these algorithms are efficient in practice, their worst-case complexity was long believed to be exponential (mansour1999complexity). The major breakthrough was made by ye2011, where the author proved that both policy iteration and LP with the simplex method (danzigsimplex) terminate in a strongly polynomial number of iterations when the discount factor is fixed. The author first proved that the simplex method with the most-negative-reduced-cost pivoting rule is strongly polynomial in this situation; a variant of policy iteration called simple policy iteration was then shown to be equivalent to the simplex method. hansen2013strategy later improved the complexity bound of policy iteration by a factor of $|\mathcal{S}|$. The best known complexity bound of policy iteration, $\mathcal{O}\!\left(\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}\log\frac{1}{1-\gamma}\right)$, was proved by scherrer2016improved.

In the LP formulation, the state values are optimized over the vertices of the LP feasible region, which is a convex polytope. Surprisingly, it was recently discovered that the space of the value function is itself a (possibly non-convex) polytope (Dadashi2019value). We call this object the value function polytope and denote it by $\mathcal{V}$. As opposed to LP, policy iteration navigates the state values through $\mathcal{V}$. Moreover, the line theorem (Dadashi2019value) states that the set of policies that differ only in one state is mapped onto the same line segment in the value function polytope. This suggests the potential of new algorithms based on single-state updates.

Our first contribution concerns the structure of the value function polytope $\mathcal{V}$. Specifically, we show that a hyperplane arrangement $\mathcal{H}$ is shared by $\mathcal{V}$ and the polytope of the linear programming formulation for MDPs. We characterize these hyperplanes using the Bellman equation of policies that are deterministic in a single state. We prove that the boundary of the value function polytope is the union of finitely many (convex polyhedral) cells of $\mathcal{H}$. Moreover, each full-dimensional cell of the value function polytope is contained in the union of finitely many full-dimensional cells defined by $\mathcal{H}$. We further conjecture that cells of the arrangement cannot be only partially covered: whenever a cell intersects the value function polytope, it has to be entirely contained in it.

The learning dynamics of policy iteration in the value function polytope show that every policy update leads to an improvement of state values along one line segment of $\mathcal{V}$. Based on this, we propose a new algorithm, geometric policy iteration (GPI), a variant of classic policy iteration with several improvements. First, policy iteration may perform multiple updates on the same line segment. GPI avoids this situation by always reaching an endpoint of a line segment in the value function polytope for every policy update. This is achieved by efficiently calculating the true state value of each potential policy update instead of using the Bellman operator, which only guarantees a value improvement. Second, GPI updates the values of all states immediately after each single-state policy update, which makes the value function monotonically increasing with respect to every policy update. Last but not least, GPI can be implemented in an asynchronous fashion, which makes it more flexible and advantageous over policy iteration in MDPs with a very large state set.

We prove that GPI converges in $\mathcal{O}\!\left(\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}\log\frac{1}{1-\gamma}\right)$ iterations, which matches the best known bound for solving finite discounted MDPs. Although it uses a more sophisticated strategy for policy improvement, GPI requires the same number of arithmetic operations per iteration as policy iteration. We empirically demonstrate that GPI takes fewer iterations and fewer policy updates to attain the optimal value.

1.1. Related Work

One line of work related to this paper is on the complexity of policy iteration. For MDPs with a fixed discount factor, the complexity of policy iteration has been improved significantly (Littman94; ye2011; Ye2013Post; hansen2013strategy; scherrer2016improved). There are also positive results on stochastic games (SGs). hansen2013strategy proved that a two-player turn-based SG can be solved by policy iteration in strongly polynomial time when the discount factor is fixed. Akian2013PolicyIF further proved that policy iteration is strongly polynomial for mean-payoff SGs with state-dependent discount factors under some restrictions. In more general settings, the worst-case complexity can still be exponential (mansour1999complexity; Fearnley; Hollanders2012; Hollanders2016). Another line of related work studies the geometric properties of MDPs and RL algorithms. The concept of the value function polytope used in this paper was first proposed in Dadashi2019value, which was also the first recent work studying the geometry of the value function. Later, Bellemare2019Geometric explored the direction of using these geometric structures as auxiliary tasks for representation learning in deep RL. policyimprovepath also aimed at improving representation learning by shaping the policy improvement path within the value function polytope. The geometric perspective on RL also contributes to unsupervised skill learning, where no reward function can be accessed (unsupervised_skill_learning). Very recently, geometryPOMDP analyzed the geometry of state-action frequencies in partially observable MDPs and formulated the problem of finding the optimal memoryless policy as a polynomial program with a linear objective and polynomial constraints. The geometry of the value function in robust MDPs is also studied in geometryRMDP.

2. Preliminaries

An MDP has five components $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the finite state set and action set, $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition function, with $\Delta(\mathcal{S})$ denoting the probability simplex over $\mathcal{S}$, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor that represents the value of time.

A policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ is a mapping from states to distributions over actions. The goal is to find a policy that maximizes the expected cumulative sum of discounted rewards.

Define $V^{\pi} \in \mathbb{R}^{|\mathcal{S}|}$ as the vector of state values under $\pi$. The entry $V^{\pi}(s)$ is the expected cumulative discounted reward starting from state $s$ and acting according to $\pi$:

$$V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s,\ a_t \sim \pi(\cdot \mid s_t),\ s_{t+1} \sim P(\cdot \mid s_t, a_t)\right].$$

The Bellman equation (Bellman:DynamicProgramming) connects the value at a state with the values at the subsequent states when following $\pi$:

$$V^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\left(r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V^{\pi}(s')\right). \qquad (1)$$

Define $r^{\pi} \in \mathbb{R}^{|\mathcal{S}|}$ and $P^{\pi} \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$ as follows:

$$r^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, r(s, a), \qquad P^{\pi}(s, s') = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, P(s' \mid s, a).$$

Then, the Bellman equation for a policy $\pi$ can be expressed in matrix form as

$$V^{\pi} = r^{\pi} + \gamma P^{\pi} V^{\pi}. \qquad (2)$$

Under this notation, we can define the Bellman operator $T^{\pi}$ and the optimality Bellman operator $T^{*}$ for an arbitrary value vector $V \in \mathbb{R}^{|\mathcal{S}|}$ as follows:

$$T^{\pi} V = r^{\pi} + \gamma P^{\pi} V, \qquad (T^{*} V)(s) = \max_{a \in \mathcal{A}}\left(r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V(s')\right).$$

A value vector $V$ is optimal if and only if $V = T^{*} V$. MDPs can be solved by value iteration, which consists of the repeated application of the optimality Bellman operator until a fixed point has been reached.
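To make the matrix notation concrete, here is a minimal NumPy sketch (our own illustration; the array layout and names such as bellman_operator are assumptions, not the paper's code) of $r^{\pi}$, $P^{\pi}$, the two Bellman operators, and value iteration run to a fixed point:

import numpy as np

def policy_matrices(P, r, pi):
    # P: (S, A, S) transition tensor, r: (S, A) rewards, pi: (S, A) action probabilities.
    P_pi = np.einsum('sa,sap->sp', pi, P)   # P^pi(s, s') = sum_a pi(a|s) P(s'|s, a)
    r_pi = np.einsum('sa,sa->s', pi, r)     # r^pi(s)     = sum_a pi(a|s) r(s, a)
    return P_pi, r_pi

def bellman_operator(P, r, pi, V, gamma):
    P_pi, r_pi = policy_matrices(P, r, pi)
    return r_pi + gamma * P_pi @ V          # T^pi V

def optimality_bellman_operator(P, r, V, gamma):
    Q = r + gamma * np.einsum('sap,p->sa', P, V)  # Q(s, a) = r(s, a) + gamma * E[V(s')]
    return Q.max(axis=1)                    # (T* V)(s) = max_a Q(s, a)

def value_iteration(P, r, gamma, tol=1e-10):
    # Repeatedly apply T* until (approximately) reaching its fixed point V*.
    V = np.zeros(P.shape[0])
    while True:
        V_new = optimality_bellman_operator(P, r, V, gamma)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new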

Let $\Pi$ denote the space of all policies and $\mathcal{V}$ denote the space of all state values. We define the value function map as

$$f : \Pi \to \mathbb{R}^{|\mathcal{S}|}, \qquad f(\pi) = V^{\pi} = (I - \gamma P^{\pi})^{-1} r^{\pi}. \qquad (3)$$

The value function is fundamental to many algorithmic solutions of an MDP. Policy iteration (PI) (howard60dynamic) repeatedly alternates between a policy evaluation step and a policy improvement step until convergence. In the policy evaluation step, the state values of the current policy $\pi_k$ are evaluated, which involves solving the linear system of Eq. (2). In the policy improvement step, the next policy is obtained by taking a greedy step using the optimality Bellman operator:

$$\pi_{k+1}(s) \in \operatorname*{arg\,max}_{a \in \mathcal{A}} \left(r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V^{\pi_k}(s')\right) \quad \text{for every } s \in \mathcal{S}.$$

Simple policy iteration (SPI) is a variant of policy iteration. It only differs from policy iteration in the policy improvement step, where the policy is updated only for the state-action pair with the largest value of the following advantage function:

$$A^{\pi}(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V^{\pi}(s') - V^{\pi}(s).$$

SPI selects a state-action pair from $\operatorname*{arg\,max}_{(s,a)} A^{\pi}(s, a)$ and then updates the policy accordingly.
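As an illustration of the two improvement rules (our own sketch, not the paper's code), policy evaluation solves the linear system of Eq. (2); PI then switches every state to a greedy action, while SPI switches only the single state-action pair with the largest advantage:

import numpy as np

def policy_evaluation(P, r, pi, gamma):
    # Solve (I - gamma * P^pi) V = r^pi, i.e., Eq. (2).
    S = P.shape[0]
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def policy_iteration_step(P, r, pi, gamma):
    # One PI iteration: evaluate the current policy, then act greedily in every state.
    V = policy_evaluation(P, r, pi, gamma)
    Q = r + gamma * np.einsum('sap,p->sa', P, V)
    new_pi = np.zeros_like(pi)
    new_pi[np.arange(P.shape[0]), Q.argmax(axis=1)] = 1.0
    return new_pi, V

def simple_policy_iteration_step(P, r, pi, gamma):
    # One SPI iteration: switch only the state-action pair with the largest advantage.
    V = policy_evaluation(P, r, pi, gamma)
    adv = r + gamma * np.einsum('sap,p->sa', P, V) - V[:, None]   # A^pi(s, a)
    s, a = np.unravel_index(np.argmax(adv), adv.shape)
    new_pi = pi.copy()
    new_pi[s] = 0.0
    new_pi[s, a] = 1.0
    return new_pi, V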

2.1. Geometry of the Value Function

While the space of policies is the Cartesian product of probability simplices, Dadashi2019value proved that the value function space is a possibly non-convex polytope (Ziegler_polytope). Figure 3 shows the value spaces of two MDPs, one convex and one non-convex, as blue regions. The proof is built upon the line theorem, an equally important geometric property of the value space. The line theorem depends on the following definition of policy determinism.

Definition 2.0 (Policy Determinism).

A policy $\pi$ is

  • $s$-deterministic for $s \in \mathcal{S}$ if it selects one concrete action with probability one in state $s$, i.e., $\pi(a \mid s) = 1$ for some $a \in \mathcal{A}$;

  • deterministic if it is $s$-deterministic for all $s \in \mathcal{S}$.

Figure 3. The blue regions are the value spaces of two MDPs with two states; they are obtained by plotting the values of randomly sampled policies. Panels (a) and (b) highlight several policies that agree on one state, some deterministic and some only $s$-deterministic, to illustrate how determinism controls where a policy lands in the polytope.

The line theorem captures the geometric property of a set of policies that differ in only one state. Specifically, we say two policies agree on a set of states if they assign the same action distribution to each state in that set. For a given policy $\pi$ and a state $s$, we will consider the set of policies that agree with $\pi$ on all states except $s$. When we keep the action probabilities fixed at all states but $s$, the value function draws a line segment that is oriented in the positive orthant (that is, one endpoint dominates the other). Furthermore, the endpoints of this line segment are attained by $s$-deterministic policies.

The line theorem is stated as follows:

Theorem 2 (Line theorem (Dadashi2019value)).

Let $s$ be a state and $\pi$ a policy. Among the policies that agree with $\pi$ on all states except $s$, there are two $s$-deterministic policies, denoted $\pi_l$ and $\pi_u$, which bracket the value of every other policy $\pi'$ in this set:

$$V^{\pi_l} \,\leq\, V^{\pi'} \,\leq\, V^{\pi_u} \quad \text{componentwise}.$$

In both Figure 3(a) and 3(b), we plot policies that agree on one state to illustrate the line theorem. The degree of determinism decides whether a policy is mapped to a vertex, onto the boundary, or into the interior of the polytope.
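The line theorem is easy to check numerically. The sketch below (our own construction; value_of and values_along_one_state are illustrative helpers, not the paper's code) fixes a policy at all states but one, interpolates between two action choices at the remaining state, and verifies that the resulting values are collinear:

import numpy as np

def value_of(P, r, pi, gamma):
    # Exact value of a stochastic policy pi (shape (S, A)) via Eq. (2).
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r_pi)

def values_along_one_state(P, r, pi, gamma, s, a0, a1, n=11):
    # Keep pi fixed everywhere except state s, where we interpolate
    # between the two s-deterministic choices a0 and a1.
    values = []
    for alpha in np.linspace(0.0, 1.0, n):
        mixed = pi.copy()
        mixed[s] = 0.0
        mixed[s, a0] = 1.0 - alpha
        mixed[s, a1] = alpha
        values.append(value_of(P, r, mixed, gamma))
    return np.array(values)

def lies_on_a_line_segment(values, tol=1e-8):
    # The differences from the first point should span a one-dimensional space.
    diffs = values[1:] - values[0]
    return np.linalg.matrix_rank(diffs, tol=tol) <= 1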

3. The Cell Structure of the Value Function Polytope

In this section, we revisit the geometry of the (non-convex) value function polytope presented in Dadashi2019value. We establish a connection to linear programming formulations of the MDP which then can be adapted to show a finer description of cells in the value function polytope as unions of cells of a hyperplane arrangement. For more on hyperplane arrangements and their structure, see hyperplanes-intro.

It has been known since at least the 1990s that finding the optimal value function of an MDP can be formulated as a linear program (see for example (puterman94markov; bertsekas96neurodynamic)). In the primal form, the feasible region is defined by the constraints $V \geq T^{*} V$, where $T^{*}$ is the optimality Bellman operator. Concretely, the following linear program is well known to be equivalent to maximizing the expected total reward in Eq. (2). We call this convex polyhedron the MDP-LP polytope (because it is a linear programming form of the MDP problem):

$$\min_{V \in \mathbb{R}^{|\mathcal{S}|}} \ \sum_{s \in \mathcal{S}} \mu(s)\, V(s) \qquad \text{s.t.} \quad V(s) \,\geq\, r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V(s') \quad \forall s \in \mathcal{S},\ a \in \mathcal{A},$$

where $\mu$ is a probability distribution over $\mathcal{S}$.
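A minimal sketch of this LP with scipy (our own code; the uniform choice of $\mu$ and the dense constraint construction are illustrative assumptions). The feasible region of the program is exactly the MDP-LP polytope, and its optimizer is the optimal value function:

import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, r, gamma, mu=None):
    # P: (S, A, S) transitions, r: (S, A) rewards. Decision variable: V in R^S.
    S, A, _ = P.shape
    mu = np.full(S, 1.0 / S) if mu is None else mu
    # Constraint V(s) >= r(s, a) + gamma * P(.|s, a)^T V for every (s, a),
    # rewritten in <= form as (gamma * P(.|s, a) - e_s)^T V <= -r(s, a).
    A_ub = np.zeros((S * A, S))
    b_ub = np.zeros(S * A)
    for s in range(S):
        for a in range(A):
            row = gamma * P[s, a].copy()
            row[s] -= 1.0
            A_ub[s * A + a] = row
            b_ub[s * A + a] = -r[s, a]
    res = linprog(c=mu, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
    return res.x   # the optimal value function V*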

Our main new observation is that the MDP-LP polytope and the value polytope are actually closely related and one can describe the regions of the (non-convex) value function polytope in terms of the (convex) cells of the arrangement.

Theorem 1.

Consider the hyperplane arrangement $\mathcal{H}$, with $|\mathcal{S}||\mathcal{A}|$ hyperplanes, consisting of the bounding hyperplanes of the MDP-LP polytope, i.e.,

$$\mathcal{H} = \left\{ H_{s,a} : H_{s,a} = \left\{ V \in \mathbb{R}^{|\mathcal{S}|} : V(s) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V(s') \right\},\ s \in \mathcal{S},\ a \in \mathcal{A} \right\}.$$

Then, the boundary of the value function polytope is the union of finitely many (convex polyhedral) cells of the arrangement $\mathcal{H}$. Moreover, each full-dimensional cell of the value polytope is contained in the union of finitely many full-dimensional cells defined by $\mathcal{H}$.

Proof.

Let us first consider a point $V^{\pi}$ on the boundary of the value function polytope. Theorem 2 and Corollary 3 of Dadashi2019value demonstrated that the boundary of the space of value functions is a (possibly proper) subset of the ensemble of value functions of policies for which at least one state has a fixed deterministic action choice. Note that, by the value function of Eq. (3), if $\pi$ takes action $a$ with probability one in state $s$, then $V^{\pi}$ lies on the hyperplane

$$H_{s,a} = \left\{ V : V(s) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V(s') \right\},$$

which contains the values of all policies taking action $a$ in state $s$. Thus the points of the boundary of the value function polytope are contained in the hyperplanes of $\mathcal{H}$. Lower-dimensional cells of the boundary then lie in the intersections of several of these hyperplanes.

The zero-dimensional cells (vertices) are clearly a subset of the zero-dimensional cells of the arrangement because, by the above results, they lie precisely in the intersection of $|\mathcal{S}|$ hyperplanes from $\mathcal{H}$, which is equivalent to choosing a fixed action for every state. This corresponds to solving the linear system formed by those hyperplanes (the same system as Eq. (2)). More generally, if we fix the policy at only $k$ states $s_1, \dots, s_k$, the induced set of value functions lies in an affine space of dimension $|\mathcal{S}| - k$: writing the Bellman equations of the fixed states in terms of the columns of the transition matrix corresponding to the remaining states defines this affine vector space.

Now, for a given policy $\pi$, consider the set of policies that agree with $\pi$ on $s_1, \dots, s_k$; the value functions generated by these policies are contained in that affine vector space. Its points lie in one or more of the hyperplanes of $\mathcal{H}$ (each hyperplane precisely fixes one state-action pair), namely in the intersection of the $k$ hyperplanes corresponding to the fixed state-action choices of $\pi$. Thus we can be sure of the stated containment.

Finally, the only remaining case is when $V^{\pi}$ is in the interior of the value polytope. If that is the case, because $\mathcal{H}$ partitions the entire Euclidean space, $V^{\pi}$ must be contained in at least one of the full-dimensional cells of $\mathcal{H}$. ∎

Figure 6(a) is an example of the value function polytope in blue, the MDP-LP polytope in green, and its bounding hyperplanes (the arrangement $\mathcal{H}$) as blue and red lines. In Figure 6(b) we illustrate Theorem 1 by presenting a value function polytope with delimited boundaries, where the hyperplanes are indicated in different colors. In both pictures, the values of deterministic policies, i.e., policies that select a single action in every state, are shown as red dots. The boundaries of the value polytope are indeed included in the set of cells of the arrangement, as stated by Theorem 1. These figures of value function polytopes (blue regions) were obtained by randomly sampling policies and plotting their corresponding state values.

Figure 6. (a): the value function polytope (blue) and the MDP-LP polytope (green) of a two-state MDP. (b): the value polytope overlapped with the hyperplane arrangement $\mathcal{H}$ from Theorem 1; this MDP has 3 actions per state, so $|\mathcal{H}| = 6$.
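For intuition about the arrangement, one can check numerically which hyperplanes of $\mathcal{H}$ a given value function lies on; the value of a deterministic policy, for instance, satisfies the equality $V(s) = r(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) V(s')$ at every state and therefore sits on $|\mathcal{S}|$ hyperplanes. A small helper (our own sketch, not the paper's code):

import numpy as np

def active_hyperplanes(P, r, V, gamma, tol=1e-8):
    # Return the (s, a) pairs whose hyperplane
    # V(s) = r(s, a) + gamma * P(.|s, a)^T V passes through the point V.
    residual = V[:, None] - (r + gamma * np.einsum('sap,p->sa', P, V))
    return {(int(s), int(a)) for s, a in zip(*np.where(np.abs(residual) < tol))}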

Some remarks are in order. Note how several adjacent cells of the MDP arrangement sometimes together form a connected cell of the value function polytope. We also observe that, for any set of states and any policy $\pi$, the value $V^{\pi}$ can be expressed as a convex combination of value functions of policies that are deterministic on those states. In particular, the value function polytope is included in the convex hull of the value functions of deterministic policies. Figure 6(b) also demonstrates clearly that the value functions of deterministic policies are not always vertices, and the vertices of the value polytope are not always value functions of deterministic policies, but they are always intersections of hyperplanes of $\mathcal{H}$. However, the optimal values are always attained at a vertex corresponding to a deterministic policy. This observation suggests that it suffices to find the optimal policy by only visiting deterministic policies on the boundary. It is worthwhile to note that the optimal value of our MDP is at the unique vertex in the intersection of the two polytopes. We note that the blue regions in Figure 6(a) are not related to the polytope of the dual formulation of the LP. Unlike the MDP-LP polytope, which can be characterized as the intersection of finitely many half-spaces, we do not have such a neat representation for the value function polytope. The pictures presented here, and many more experiments we have done, suggest that the following stronger result is true:

Conjecture: if the value polytope intersects a cell of the arrangement $\mathcal{H}$, then it contains the entire cell; thus all full-dimensional cells of the value function polytope are equal to unions of full-dimensional cells of the arrangement.

Proving this conjecture requires showing that the map from policies to value functions is surjective onto the cells it touches. At the moment we can only guarantee that there are no isolated components, because the value polytope is a compact set. More strongly, Dadashi2019value showed (using the line theorem) that any value function is connected to any other, in any cell, by a polygonal path. More precisely, let $V^{\pi}$ and $V^{\pi'}$ be two value functions. Then there exists a sequence of policies $\pi_0, \pi_1, \dots, \pi_k$ with $\pi_0 = \pi$ and $\pi_k = \pi'$ such that, for every $i$, consecutive policies differ in at most one state and the set of value functions of policies interpolating between $\pi_i$ and $\pi_{i+1}$ forms a line segment.

It has been observed that algorithms for solving MDPs exhibit different learning behaviors when visualized in the value polytope. For example, policy gradient methods (sutton2000policy; kakade2002natural; policygradient_actorcritic; policygradientWilliam; policygradientWilliamsPeng91) follow an improvement path inside the value function polytope; value iteration can go outside of the polytope, which means there may be no corresponding policy during the update process; and policy iteration navigates exactly through the values of deterministic policies. In the rest of the paper we use this geometric intuition to design a new algorithm.

4. The Method of Geometric Policy Iteration

Figure 9. (a): one iteration of a PI update; we may not reach the end of a line segment for an action switch. (b): one iteration of GPI; an endpoint is reached in each update.

We now present geometric policy iteration (GPI), which improves over PI based on the geometric properties of the learning dynamics. Define an action switch to be an update of the policy in any single state $s$. The line theorem shows that policies agreeing on all but one state lie on a line segment, so an action switch is a move along a line segment to improve the value function. In PI, we use the optimality Bellman operator to decide which action to switch to for state $s$. However, the optimality Bellman operator does not guarantee the largest value improvement for $s$. This phenomenon is illustrated in Figure 9, where we plot the value sequences of PI and the proposed GPI.

We propose an alternative action-switch strategy in GPI that directly calculates the improvement of the value function for one state. By choosing the action with the largest value improvement, we always reach the endpoint of a line segment, which potentially reduces the number of action switches.

This strategy requires efficient computation of the value function, because naively re-evaluating the value function with Eq. (2) is very expensive due to the matrix inversion, whereas PI only re-evaluates the value function once per iteration. Our next theorem states that the new state value can be computed efficiently. This relies on the fact that the policy improvement step can be done state by state within a sweep over the state set, so adjacent policies in the update sequence differ in only one state.

Theorem 1.

Given $V^{\pi}$ and $Q = (I - \gamma P^{\pi})^{-1}$, if a new policy $\pi'$ differs from $\pi$ only in state $s$, with $\pi'$ taking action $a$ in $s$, then the new value of state $s$ can be calculated efficiently by

$$V^{\pi'}(s) = V^{\pi}(s) + \frac{\left(T^{\pi'} V^{\pi}(s) - V^{\pi}(s)\right)\, e_s^{\top} q_s}{1 - \gamma\, p^{\top} q_s}, \qquad (4)$$

where $p^{\top} = P^{\pi'}(s, \cdot) - P^{\pi}(s, \cdot)$ is a $|\mathcal{S}|$-dimensional vector, $1 - \gamma\, p^{\top} q_s$ is a scalar, $q_s$ is the $s$-th column of $Q$, and $e_s$ is a vector with entry $s$ being $1$ and all other entries being $0$.

Proof.

We here provide a general proof that we can calculate $V^{\pi'}$ given a policy $\pi$, its value $V^{\pi}$, the matrix $Q = (I - \gamma P^{\pi})^{-1}$, and a policy $\pi'$ that differs from $\pi$ in only one state.

We have $V^{\pi'} = (I - \gamma P^{\pi'})^{-1} r^{\pi'}$, where $I - \gamma P^{\pi'} = I - \gamma P^{\pi} - \gamma\,(P^{\pi'} - P^{\pi})$. Assume $\pi$ and $\pi'$ differ in state $s$. Then $P^{\pi'} - P^{\pi}$ is a rank-$1$ matrix with row $s$ being $p^{\top}$, and all other rows being zero vectors.

We can then express $P^{\pi'} - P^{\pi}$ as the outer product of two vectors, $P^{\pi'} - P^{\pi} = e_s\, p^{\top}$, where $e_s$ is the one-hot vector

$$e_s(i) = \begin{cases} 1 & \text{if } i = s, \\ 0 & \text{otherwise,} \end{cases} \qquad (5)$$

and

$$p^{\top} = P^{\pi'}(s, \cdot) - P^{\pi}(s, \cdot). \qquad (6)$$

Similarly, $r^{\pi'} = r^{\pi} + \left(r^{\pi'}(s) - r^{\pi}(s)\right) e_s$. Then, writing $q_s = Q e_s$ for the $s$-th column of $Q$, the Sherman–Morrison formula gives

$$(I - \gamma P^{\pi'})^{-1} = Q + \frac{\gamma\, q_s\, p^{\top} Q}{1 - \gamma\, p^{\top} q_s},$$

and therefore

$$V^{\pi'} = (I - \gamma P^{\pi'})^{-1} r^{\pi'} = V^{\pi} + \frac{r^{\pi'}(s) - r^{\pi}(s) + \gamma\, p^{\top} V^{\pi}}{1 - \gamma\, p^{\top} q_s}\, q_s = V^{\pi} + \frac{T^{\pi'} V^{\pi}(s) - V^{\pi}(s)}{1 - \gamma\, p^{\top} q_s}\, q_s.$$

Thus, for state $s$, reading off the $s$-th entry gives Eq. (4), which completes the proof. ∎

Figure 12. Two policy improvement paths are shown for each of PI and GPI; the green and red paths denote one iteration with a different state updated first. (a): the policy improvement path of PI. The red path is not action-switch-monotone, which will lead to an additional iteration. (b): GPI is always action-switch-monotone. The red path achieves the optimal values in one action switch.

Theorem 1 implies that computing the new value of a single state using Eq. (4) takes $\mathcal{O}(|\mathcal{S}|)$ arithmetic operations per candidate action, or $\mathcal{O}(|\mathcal{S}||\mathcal{A}|)$ for a full action switch, which matches the complexity of the optimality Bellman operator used in policy iteration.
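The single-state computation of Theorem 1 can be sketched directly with the Sherman–Morrison identity (our own implementation; variable names and the (S, A, S) array layout are assumptions, not the paper's code). Evaluating a candidate switch costs $\mathcal{O}(|\mathcal{S}|)$, and committing it refreshes $Q$ in $\mathcal{O}(|\mathcal{S}|^2)$:

import numpy as np

def switch_value(P, r, P_pi, r_pi, V, Q, gamma, s, a):
    # New value of state s if the policy switches to action a there (Eq. (4)).
    p = P[s, a] - P_pi[s]                                # row difference, length |S|
    q_s = Q[:, s]                                        # s-th column of (I - gamma P^pi)^{-1}
    improvement = r[s, a] + gamma * P[s, a] @ V - V[s]   # T^{pi'} V(s) - V(s)
    return V[s] + improvement * q_s[s] / (1.0 - gamma * p @ q_s)

def apply_switch(P, r, P_pi, r_pi, V, Q, gamma, s, a):
    # Commit the switch: rank-one update of V, Sherman-Morrison update of Q (Eq. (10)).
    p = P[s, a] - P_pi[s]
    q_s = Q[:, s].copy()
    improvement = r[s, a] + gamma * P[s, a] @ V - V[s]
    denom = 1.0 - gamma * p @ q_s
    V_new = V + (improvement / denom) * q_s
    Q_new = Q + (gamma / denom) * np.outer(q_s, p @ Q)
    P_pi_new, r_pi_new = P_pi.copy(), r_pi.copy()
    P_pi_new[s], r_pi_new[s] = P[s, a], r[s, a]
    return V_new, Q_new, P_pi_new, r_pi_new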

The second improvement over policy iteration comes from the fact that the value improvement path of PI may not be monotonic with respect to action switches. Although it is well known that the sequence of value functions is non-decreasing in the iteration number, the value function can decrease during the policy improvement step of policy iteration. An illustration is shown in Figure 12. This is because, when the Bellman operator is used to decide an action switch, the value vector is held fixed for the entire sweep over the states. This leads us to the motivation for GPI, which is to update the value function after each action switch so that the value function is action-switch-monotone. This idea can be seamlessly combined with Theorem 1, since the values of all states can be updated efficiently in $\mathcal{O}(|\mathcal{S}|^2)$ arithmetic operations. Thus, the complexity of completing one iteration is the same as policy iteration.

1: Input: transition function $P$, reward function $r$, discount factor $\gamma$
2: set iteration number $k \leftarrow 0$ and randomly initialize $\pi^0$
3: calculate $V \leftarrow V^{\pi^0}$ and $Q \leftarrow (I - \gamma P^{\pi^0})^{-1}$
4: for $s \in \mathcal{S}$ do
5:     calculate the best action $a^*$ according to Eq. (7)
6:     update $Q$ according to Eq. (10)
7:     update the policy: $\pi(s) \leftarrow a^*$
8:     update the value vector: $V \leftarrow Q\, r^{\pi}$
9: if $V$ is optimal then return $\pi$
10: $k \leftarrow k + 1$. Go to step 4
Algorithm 1 Geometric Policy Iteration

We summarize GPI in Algorithm 1. GPI looks for action switches for all states in one iteration and updates the value function after each action switch. Let the superscript denote the iteration index and the subscript denote the state index within one iteration. To avoid clutter, we use $s$ to denote the state being updated and drop the iteration superscript on $V$ and $Q$. Step 3 evaluates the initial policy $\pi^0$; the difference from PI here is that we store the intermediate matrix $Q = (I - \gamma P^{\pi})^{-1}$ for later computation. From step 4 to step 8, we iterate over all states to search for potential updates. In step 5, GPI selects the best action by computing the new state-value of each potential action switch by Eq. (7):

$$a^* = \operatorname*{arg\,max}_{a \in \mathcal{A}} \left\{ V(s) + \frac{\left(r(s, a) + \gamma\, P(\cdot \mid s, a)^{\top} V - V(s)\right) q_s(s)}{1 - \gamma\, p_a^{\top} q_s} \right\}, \qquad (7)$$

where

$$p_a^{\top} = P(\cdot \mid s, a)^{\top} - P^{\pi}(s, \cdot), \qquad (8)$$

$$q_s = Q\, e_s \ \text{ is the } s\text{-th column of } Q = (I - \gamma P^{\pi})^{-1}, \qquad (9)$$

and $e_s$ is a vector with entry $s$ being $1$ and all other entries being $0$.

Let $q_s$ be the $s$-th column of $Q$ and let $p_{a^*}$ be obtained by Eq. (8) using the selected action $a^*$. In step 6, we update $Q$ as follows:

$$Q \leftarrow Q + \frac{\gamma\, q_s\, p_{a^*}^{\top} Q}{1 - \gamma\, p_{a^*}^{\top} q_s}. \qquad (10)$$

The policy is updated in step 7, and the value vector is updated in step 8 as $V \leftarrow Q\, r^{\pi}$, where $r^{\pi}$ is the reward vector under the new policy. The algorithm terminates when the optimal values are achieved.
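Putting the pieces together, here is a compact, self-contained sketch of Algorithm 1 (our own code, assuming a dense (S, A, S) transition tensor; it implements Eqs. (7) and (10) as reconstructed above):

import numpy as np

def geometric_policy_iteration(P, r, gamma, max_iter=1000, tol=1e-12):
    # P: (S, A, S) transitions, r: (S, A) rewards. The policy is deterministic,
    # stored as one action index per state.
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)
    P_pi, r_pi = P[np.arange(S), pi], r[np.arange(S), pi]
    Q = np.linalg.inv(np.eye(S) - gamma * P_pi)          # (I - gamma P^pi)^{-1}, kept up to date
    V = Q @ r_pi
    for _ in range(max_iter):
        changed = False
        for s in range(S):
            q_s = Q[:, s]
            # Step 5 (Eq. (7)): exact new value of state s for every candidate action.
            improvements = r[s] + gamma * P[s] @ V - V[s]          # T^a V(s) - V(s), per action
            denoms = 1.0 - gamma * (P[s] - P_pi[s]) @ q_s
            candidates = V[s] + improvements * q_s[s] / denoms
            a_star = int(np.argmax(candidates))
            if a_star != pi[s] and candidates[a_star] > V[s] + tol:
                # Steps 6-8: Sherman-Morrison update of Q, then policy and value updates.
                p = P[s, a_star] - P_pi[s]
                denom = 1.0 - gamma * p @ q_s
                V = V + (improvements[a_star] / denom) * q_s
                Q = Q + (gamma / denom) * np.outer(q_s, p @ Q)
                pi[s] = a_star
                P_pi[s], r_pi[s] = P[s, a_star], r[s, a_star]
                changed = True
        if not changed:        # no single-state switch improves the value, so V = T* V
            return pi, V
    return pi, V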

4.1. Theoretical Guarantees

Before we present any properties of GPI, let us first prove the following very useful lemma.

Lemma 2.

Given two policies $\pi$ and $\pi'$, we have the following equalities:

$$V^{\pi'} - V^{\pi} = (I - \gamma P^{\pi'})^{-1}\left(T^{\pi'} V^{\pi} - V^{\pi}\right), \qquad (11)$$
$$V^{\pi'} - V^{\pi} = (I - \gamma P^{\pi})^{-1}\left(V^{\pi'} - T^{\pi} V^{\pi'}\right). \qquad (12)$$

Proof.

By the Bellman equation, we have

$$V^{\pi'} - V^{\pi} = r^{\pi'} + \gamma P^{\pi'} V^{\pi'} - r^{\pi} - \gamma P^{\pi} V^{\pi}. \qquad (13)$$

Eq. (13) can be rearranged as

$$(I - \gamma P^{\pi'})\left(V^{\pi'} - V^{\pi}\right) = r^{\pi'} + \gamma P^{\pi'} V^{\pi} - V^{\pi} = T^{\pi'} V^{\pi} - V^{\pi},$$

and Eq. (11) follows.

To get Eq. (12), we rearrange Eq. (13) as

$$(I - \gamma P^{\pi})\left(V^{\pi'} - V^{\pi}\right) = V^{\pi'} - r^{\pi} - \gamma P^{\pi} V^{\pi'} = V^{\pi'} - T^{\pi} V^{\pi'},$$

and Eq. (12) follows. ∎
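These two identities (as reconstructed above; the equation labels and helper names are ours) are easy to sanity-check numerically on a random MDP:

import numpy as np

def check_lemma_identities(S=4, A=3, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
    r = rng.random((S, A))
    def matrices(pi):                       # P^pi and r^pi of a deterministic policy
        return P[np.arange(S), pi], r[np.arange(S), pi]
    def value(pi):
        P_pi, r_pi = matrices(pi)
        return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    pi, pi2 = rng.integers(A, size=S), rng.integers(A, size=S)
    V1, V2 = value(pi), value(pi2)
    P1, r1 = matrices(pi)
    P2, r2 = matrices(pi2)
    lhs = V2 - V1
    rhs_11 = np.linalg.solve(np.eye(S) - gamma * P2, (r2 + gamma * P2 @ V1) - V1)  # Eq. (11)
    rhs_12 = np.linalg.solve(np.eye(S) - gamma * P1, V2 - (r1 + gamma * P1 @ V2))  # Eq. (12)
    assert np.allclose(lhs, rhs_11) and np.allclose(lhs, rhs_12)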

Our first result is an immediate consequence of re-evaluating the value function after an action switch.

Proposition 3.

The value function is non-decreasing with respect to action switches in GPI, i.e., $V^{\pi_{i+1}} \geq V^{\pi_i}$ componentwise for every action switch, where $\pi_i$ and $\pi_{i+1}$ denote the policies before and after the switch.

Proof.

From Eq. (11) in Lemma 2, we have

$$V^{\pi_{i+1}} - V^{\pi_i} = (I - \gamma P^{\pi_{i+1}})^{-1}\left(T^{\pi_{i+1}} V^{\pi_i} - V^{\pi_i}\right).$$

Since $(I - \gamma P^{\pi_{i+1}})^{-1} = \sum_{t=0}^{\infty} \left(\gamma P^{\pi_{i+1}}\right)^{t}$ has only non-negative entries, we have that for any pair of policies,

$$T^{\pi_{i+1}} V^{\pi_i} \geq V^{\pi_i} \ \Longrightarrow\ V^{\pi_{i+1}} \geq V^{\pi_i}. \qquad (14)$$

Now, consider $\pi_i$ and $\pi_{i+1}$, which differ only in the switched state $s$. According to the updating rule of GPI, for state $s$ we have $T^{\pi_{i+1}} V^{\pi_i}(s) \geq V^{\pi_i}(s)$, since a switch is only made when it does not decrease the value of $s$. For every other state $s' \neq s$, we have $T^{\pi_{i+1}} V^{\pi_i}(s') = T^{\pi_i} V^{\pi_i}(s') = V^{\pi_i}(s')$. Combined, we have $T^{\pi_{i+1}} V^{\pi_i} \geq V^{\pi_i}$, and by (14), $V^{\pi_{i+1}} \geq V^{\pi_i}$, which completes the proof. ∎

We next turn to the complexity of GPI and bound the number of iterations required to find the optimal solution. The analysis depends on the following lemma.

Lemma 4.

Let $V^{*}$ denote the optimal value. At iteration $k$ of GPI, we have the following inequality:

$$\left\| V^{*} - V^{\pi^{k+1}} \right\|_{\infty} \;\leq\; \gamma\, \left\| V^{*} - V^{\pi^{k}} \right\|_{\infty}.$$

Proof.

For state $s$, let $V$ denote the intermediate value vector at the moment state $s$ is processed during iteration $k$. We have

$$\max_{a \in \mathcal{A}}\left(r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi^{k}}(s')\right) \;\leq\; \max_{a \in \mathcal{A}}\left(r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s')\right) \qquad (15)$$
$$\;\leq\; V^{\pi^{k+1}}(s). \qquad (16)$$

The inequality (15) holds because $V \geq V^{\pi^{k}}$ by Proposition 3, and the inequality (16) is because of the updating rule of GPI and (14). Therefore $V^{\pi^{k+1}} \geq T^{*} V^{\pi^{k}}$ componentwise, and

$$\left\| V^{*} - V^{\pi^{k+1}} \right\|_{\infty} \;\leq\; \left\| T^{*} V^{*} - T^{*} V^{\pi^{k}} \right\|_{\infty} \;\leq\; \gamma\, \left\| V^{*} - V^{\pi^{k}} \right\|_{\infty},$$

which completes the proof. ∎

Theorem 5.

GPI finds the optimal policy in $\mathcal{O}\!\left(\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}\log\frac{1}{1-\gamma}\right)$ iterations.

Proof.

Define $\Delta_k = \left\| V^{*} - V^{\pi^{k}} \right\|_{\infty}$ with $\pi^{k}$ the policy at iteration $k$. Then, by Lemma 4, we have

$$\Delta_{k'} \;\leq\; \gamma^{\,k' - k}\, \Delta_k \quad \text{for all } k' \geq k.$$

Let $\bar{s}$ be the state such that $V^{*}(\bar{s}) - T^{\pi^{k}} V^{*}(\bar{s}) = \left\| V^{*} - T^{\pi^{k}} V^{*} \right\|_{\infty}$. The following property can be obtained by Eq. (12) in Lemma 2:

$$\Delta_k \;\leq\; \frac{1}{1-\gamma}\left( V^{*}(\bar{s}) - T^{\pi^{k}} V^{*}(\bar{s}) \right).$$

Also from Eq. (12), for any later iteration $k' > k$ in which the policy still uses the action $\pi^{k}(\bar{s})$ in state $\bar{s}$, we have

$$\Delta_{k'} \;\geq\; V^{*}(\bar{s}) - T^{\pi^{k'}} V^{*}(\bar{s}) \;=\; V^{*}(\bar{s}) - T^{\pi^{k}} V^{*}(\bar{s}) \;\geq\; (1-\gamma)\,\Delta_k. \qquad (17)$$

It follows that $\gamma^{\,k'-k} \geq 1 - \gamma$, which implies that when $k' - k > \left\lceil \frac{1}{1-\gamma}\log\frac{1}{1-\gamma} \right\rceil$, the non-optimal action chosen for $\bar{s}$ in $\pi^{k}$ has been switched away and will never be switched back to in future iterations. Now we are ready to bound the number of iterations: by taking the logarithm on both sides of $\gamma^{\,k'-k} \geq 1-\gamma$, we get $k' - k \leq \frac{\log\frac{1}{1-\gamma}}{\log\frac{1}{\gamma}} \leq \frac{1}{1-\gamma}\log\frac{1}{1-\gamma}$.

Each non-optimal action is eliminated after at most $\left\lceil \frac{1}{1-\gamma}\log\frac{1}{1-\gamma} \right\rceil$ iterations, and there are at most $|\mathcal{S}|\left(|\mathcal{A}| - 1\right)$ non-optimal actions. Thus, GPI takes at most $\mathcal{O}\!\left(\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}\log\frac{1}{1-\gamma}\right)$ iterations to reach the optimal policy. ∎

Figure 28. Results on random MDPs of increasing state set size, shown in columns (a)-(e). The horizontal axis is the number of actions for all graphs. The vertical axes are the number of iterations, the number of action switches, and the wall time for the first to the third row, respectively. The performance curves of SPI, PI, and GPI are shown in green, blue, and red, respectively. The SPI curves are only presented in (a) and (b) to provide a "lower bound" on the number of action switches and are dropped for larger MDPs due to SPI's higher running time. The number of switches of GPI remains low compared to PI. The proposed GPI consistently outperforms PI in both iteration count and wall time, and the advantages of GPI become more significant as the action set size grows.

4.2. Asynchronous Geometric Policy Iteration

When the state set is large, it can be beneficial to perform policy updates in an orderless way (Sutton1998). This is because iterating over the entire state set may be prohibitive, and exactly evaluating the value function with Eq. (2) may be too expensive. Thus, in practice, the value function is often approximated when the state set is large. One example is modified policy iteration (puterman_modified_pi; puterman94markov), where the policy evaluation step is approximated with a certain number of value iteration steps.

Since GPI avoids the matrix inversion by updating the value function incrementally, it has the potential to update the policy for arbitrary states available to the agent. This property opens up the possibility of asynchronous (orderless) updates of policies and the value function when the state set is large or when the agent has to update the policy for the states it encounters in real time. The asynchronous update strategy can also help avoid getting stuck in states that lead to minimal progress, and it may reach the optimal policy without visiting a certain set of states.

Asynchronous GPI (Async-GPI) can be described as follows. Assuming the transition matrix is available to the agent, we initialize the policy and calculate the initial value vector $V$ and matrix $Q = (I - \gamma P^{\pi})^{-1}$ accordingly. In real-time settings, a sequence of states is collected by the agent through interaction with the environment. At each time step, we perform a policy update for the current state using Eq. (7), then update $Q$ using Eq. (10) and the value vector using Eq. (2). Asynchronous value-based methods converge if each state is visited infinitely often (bertsekasAsyncVI). We later demonstrate in experiments that Async-GPI converges well in practice.
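A minimal, self-contained sketch of this loop (our own code; state_stream stands for whatever order the agent encounters states in, which is an assumption about the interface):

import numpy as np

def async_gpi(P, r, gamma, state_stream, tol=1e-12):
    # state_stream: iterable of state indices visited by the agent, in any order.
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)
    P_pi, r_pi = P[np.arange(S), pi], r[np.arange(S), pi]
    Q = np.linalg.inv(np.eye(S) - gamma * P_pi)
    V = Q @ r_pi
    for s in state_stream:
        q_s = Q[:, s]
        improvements = r[s] + gamma * P[s] @ V - V[s]
        denoms = 1.0 - gamma * (P[s] - P_pi[s]) @ q_s
        a_star = int(np.argmax(V[s] + improvements * q_s[s] / denoms))
        if improvements[a_star] > tol:                 # only switch if state s actually improves
            p = P[s, a_star] - P_pi[s]
            denom = 1.0 - gamma * p @ q_s
            V = V + (improvements[a_star] / denom) * q_s
            Q = Q + (gamma / denom) * np.outer(q_s, p @ Q)
            pi[s] = a_star
            P_pi[s], r_pi[s] = P[s, a_star], r[s, a_star]
    return pi, V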

Figure 33. Comparison between asynchronous geometric policy iteration (red curve) and asynchronous value iteration (green curve) on 4 MDPs of increasing state set size in (a)-(d), all with the same discount factor. The horizontal axis shows the number of updates. The vertical axis shows the mean of the value function.

5. Experiments

We test GPI on random MDPs of different sizes. The baselines are policy iteration (PI) and simple policy iteration (SPI). We compare the number of iterations, the number of action switches, and the wall time. Here, the number of iterations is the number of sweeps over the entire state set, and action switches are the policy updates within each iteration. The results are shown in Figure 28. We generate MDPs of five increasing state set sizes, corresponding to Figure 28 (a)-(e), and for each state size we increase the number of actions (horizontal axes) to observe the difference in performance. The rows from top to bottom report the number of iterations, the number of action switches, and the wall time (vertical axes), respectively. Since SPI performs only one action switch per iteration, we only show its number of action switches. The purpose of adding SPI as a baseline is to verify whether GPI can effectively reduce the number of action switches. Because SPI sweeps over the entire state set and updates the single state with the largest improvement, it should have the smallest number of action switches; however, SPI's larger complexity per update should lead to higher running time. This is supported by the experiments: Figure 28 (a) and (b) show that SPI (green curves) takes the fewest switches and the longest time. We drop SPI in Figure 28 (c)-(e) to have a clearer comparison between GPI and PI (especially in wall time). The proposed GPI has a clear advantage over PI in almost all tests. The second row of Figure 28 (a) and (b) shows that the number of action switches of GPI is significantly smaller than that of PI and very close to that of SPI, even though each GPI switch is considerably cheaper than an SPI switch. The reduction in the number of action switches leads to fewer iterations. Another important observation is that the margin increases as the action set becomes larger. This is strong empirical evidence for the benefit of GPI's action selection strategy, which is to reach the endpoints of line segments in the value function polytope: the larger the action set, the more policies lie on each line segment, and thus the more actions are excluded in one switch. The wall time of GPI is also very competitive compared to PI, which further demonstrates that GPI can be a very practical algorithm for solving finite discounted MDPs.
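The random benchmark instances can be reproduced with a generator along the following lines (our own sketch; the paper does not spell out its exact sampling scheme, so uniform random rewards and dense random transition kernels are an assumption):

import numpy as np

def random_mdp(num_states, num_actions, seed=0):
    # Dense random MDP: uniform random rewards and random transition distributions.
    rng = np.random.default_rng(seed)
    P = rng.random((num_states, num_actions, num_states))
    P /= P.sum(axis=2, keepdims=True)          # normalize each P(.|s, a) into a distribution
    r = rng.random((num_states, num_actions))
    return P, r

# Example usage with the GPI sketch from Section 4:
# P, r = random_mdp(500, 20, seed=1)
# pi_star, V_star = geometric_policy_iteration(P, r, gamma=0.9)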

We also test the performance of asynchronous GPI (Async-GPI) on MDPs of several state and action set sizes. For each setting, we randomly generate a sequence of states that is longer than the state set to drive the asynchronous updates. We compare Async-GPI with asynchronous value iteration (Async-VI), which is a classic asynchronous dynamic programming algorithm: at each time step, Async-VI applies one step of the optimality Bellman operator to the single state that is available to the algorithm. The results are shown in Figure 33, where the mean of the value function is plotted against the number of updates. We observe that Async-GPI takes significantly fewer updates to reach the optimal value function, and the gap becomes larger as the state set grows. These results are expected, because each Async-GPI update is more expensive and Async-VI never solves for the exact value function of a policy before reaching optimality.

6. Conclusions and Future Work

In this paper, we discussed the geometric properties of finite MDPs. We characterized the hyperplane arrangement that contains the boundary of the value function polytope and related it to the MDP-LP polytope by showing that they share the same hyperplane arrangement. Unlike the well-defined MDP-LP polytope, it remains unclear which bounding hyperplanes are active and which of their half-spaces belong to the value space. Besides the conjecture stated earlier, we would like to understand which cells of the hyperplane arrangement form the value function polytope and to derive a bound on the number of convex cells. It is also plausible that the rest of the hyperplane arrangement will help us devise new algorithms for solving MDPs.

Building on the facts that policies differing in only one state are mapped onto a line segment in the value function polytope and that the two policies at the endpoints of that segment are deterministic in that state, we proposed a new algorithm called geometric policy iteration that is guaranteed to reach an endpoint of the corresponding line segment for every action switch. We developed a mechanism that makes the value function monotonically increase with respect to action switches while keeping the whole process computationally efficient. Our experiments showed that our algorithm is very competitive with the widely used policy iteration and value iteration. We believe this type of algorithm can be extended to multi-agent settings, e.g., stochastic games (shapley1953stochastic). It will also be interesting to apply similar ideas to model-based reinforcement learning.

Acknowledgements.

This work is supported by NSF DMS award 1818969 and a seed award from Center for Data Science and Artificial Intelligence Research at UC Davis.

References