1. Introduction
The Markov decision process (MDP) is the mathematical foundation of reinforcement learning (RL), which has achieved great empirical success in sequential decision problems. Despite RL's success, many mathematical properties of MDPs remain to be discovered in order to understand RL algorithms theoretically. In this paper, we study the geometric properties of discounted MDPs with finite states and actions and propose a new value-based algorithm inspired by their polyhedral structures.
A large family of methods for solving MDPs is based on the notion of value function, which maps a policy to state values. Once the value function is maximized, the optimal policy can be extracted by taking a greedy step with respect to it. Policy iteration (howard60dynamic) is such an algorithm: it repeatedly alternates between a policy evaluation step and a policy improvement step until convergence. The policy is mapped into the value space in the policy evaluation step and then greedily improved according to the state values in the policy improvement step. It is also well known that the optimal state values can be found by linear programming (LP) (puterman94markov), which has itself attracted considerable research interest due to its clean mathematical formulation.
Although these algorithms are efficient in practice, their worst-case complexity was long believed to be exponential (mansour1999complexity). A major breakthrough was made in ye2011, where the author proved that both policy iteration and LP with the Simplex method (danzigsimplex) terminate in strongly polynomial time when the discount factor is fixed. The author first proved that the Simplex method with the most-negative-reduced-cost pivoting rule is strongly polynomial in this setting. Then, a variant of policy iteration called simple policy iteration was shown to be equivalent to the Simplex method. hansen2013strategy later improved the complexity bound for policy iteration, and the best-known bound was proved by scherrer2016improved.
In the LP formulation, the state values are optimized over the vertices of the LP feasible region, which is a convex polytope. Surprisingly, it was recently discovered that the image of the value function is a (possibly non-convex) polytope (Dadashi2019value). We call this object the value function polytope. In contrast to LP, policy iteration navigates the state values through this polytope. Moreover, the line theorem (Dadashi2019value) states that a set of policies that differ in only one state is mapped onto a line segment in the value function polytope. This suggests the potential of new algorithms based on single-state updates.
Our first contribution is on the structure of the value function polytope. Specifically, we show that a hyperplane arrangement is shared by the value function polytope and the polytope of the linear programming formulation for MDPs. We characterize these hyperplanes using the Bellman equation of policies that are deterministic in a single state. We prove that the boundary of the value function polytope is the union of finitely many (convex polyhedral) cells of this arrangement. Moreover, each full-dimensional cell of the value function polytope is contained in the union of finitely many full-dimensional cells defined by the arrangement. We further conjecture that the cells of the arrangement cannot be partial: they have to be entirely contained in the value function polytope.
The learning dynamics of policy iteration in the value function polytope show that every policy update leads to an improvement of state values along one line segment of the polytope. Based on this, we propose geometric policy iteration (GPI), a variant of classic policy iteration with several improvements. First, policy iteration may perform multiple updates along the same line segment. GPI avoids this by always reaching an endpoint of a line segment in the value function polytope with every policy update. This is achieved by efficiently calculating the true state value of each potential policy update instead of using the Bellman operator, which only guarantees a value improvement. Second, GPI updates the values of all states immediately after each single-state policy update, which makes the value function monotonically increasing with respect to every policy update. Last but not least, GPI can be implemented in an asynchronous fashion, which makes it more flexible than policy iteration on MDPs with a very large state set.
We prove that the number of iterations GPI needs to converge matches the best-known bound for solving finite discounted MDPs. Although it uses a more sophisticated strategy for policy improvement, GPI requires the same number of arithmetic operations per iteration as policy iteration. We empirically demonstrate that GPI takes fewer iterations and fewer policy updates to attain the optimal value.
1.1. Related Work
One line of work related to this paper concerns the complexity of policy iteration. For MDPs with a fixed discount factor, the complexity of policy iteration has been improved significantly (Littman94; ye2011; Ye2013Post; hansen2013strategy; scherrer2016improved). There are also positive results on stochastic games (SGs). hansen2013strategy proved that a two-player turn-based SG can be solved by policy iteration in strongly polynomial time when the discount factor is fixed. Akian2013PolicyIF further proved that policy iteration is strongly polynomial for mean-payoff SGs with state-dependent discount factors under some restrictions. In more general settings, the worst-case complexity can still be exponential (mansour1999complexity; Fearnley; Hollanders2012; Hollanders2016). Another line of related work studies the geometric properties of MDPs and RL algorithms. The concept of the value function polytope used in this paper was first proposed in Dadashi2019value, which was also the first recent work to study the geometry of the value function. Later, Bellemare2019Geometric explored using these geometric structures as auxiliary tasks in representation learning for deep RL. policyimprovepath also aimed to improve representation learning by shaping the policy improvement path within the value function polytope. The geometric perspective on RL also contributes to unsupervised skill learning, where no reward function can be accessed (unsupervised_skill_learning). Very recently, geometryPOMDP analyzed the geometry of state-action frequencies in partially observable MDPs and formulated the problem of finding the optimal memoryless policy as a polynomial program with a linear objective and polynomial constraints. The geometry of the value function in robust MDPs is studied in geometryRMDP.
2. Preliminaries
An MDP has five components $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the finite state set and action set, $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition function, with $\Delta(\mathcal{S})$ denoting the probability simplex over $\mathcal{S}$, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in [0,1)$ is the discount factor that represents the value of time. A policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ is a mapping from states to distributions over actions. The goal is to find a policy that maximizes the expected cumulative sum of discounted rewards.
Define $V^{\pi} \in \mathbb{R}^{|\mathcal{S}|}$ as the vector of state values. $V^{\pi}(s)$ is the expected cumulative discounted reward starting from state $s$ and acting according to $\pi$:
$$V^{\pi}(s) = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \;\Big|\; s_0 = s,\ a_t \sim \pi(\cdot|s_t),\ s_{t+1} \sim P(\cdot|s_t, a_t) \Big].$$
The Bellman equation (Bellman:DynamicProgramming) connects the value at a state with the values at subsequent states when following $\pi$:
$$V^{\pi}(s) = \sum_{a} \pi(a|s) \Big[ r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^{\pi}(s') \Big]. \qquad (1)$$
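As a quick sanity check on the Bellman equation, the following NumPy sketch (our own toy example, not code from the paper) solves a hand-computable two-state chain and verifies that the resulting values satisfy Eq. (1):

```python
import numpy as np

# Two-state chain: state 0 moves to state 1, state 1 loops on itself; all rewards are 1.
g = 0.9
P_pi = np.array([[0.0, 1.0],
                 [0.0, 1.0]])     # row s is the next-state distribution under pi
r_pi = np.array([1.0, 1.0])

# Solve the matrix Bellman equation V = r + g P V, i.e., V = (I - g P)^{-1} r.
V = np.linalg.solve(np.eye(2) - g * P_pi, r_pi)

assert np.isclose(V[1], 1.0 / (1.0 - g))    # geometric series: 1 + g + g^2 + ... = 10
assert np.isclose(V[0], 1.0 + g * V[1])     # Eq. (1) at state 0
assert np.allclose(V, r_pi + g * P_pi @ V)  # the Bellman equation holds componentwise
```

Here $V(1) = 1/(1-\gamma) = 10$ by the geometric series, and $V(0) = 1 + \gamma V(1) = 10$ as well.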
Define $r^{\pi} \in \mathbb{R}^{|\mathcal{S}|}$ and $P^{\pi} \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$ as follows:
$$r^{\pi}(s) = \sum_{a} \pi(a|s)\, r(s,a), \qquad P^{\pi}(s, s') = \sum_{a} \pi(a|s)\, P(s'|s,a).$$
Then, the Bellman equation for a policy $\pi$ can be expressed in matrix form as
$$V^{\pi} = r^{\pi} + \gamma P^{\pi} V^{\pi}. \qquad (2)$$
Under this notation, we can define the Bellman operator $T^{\pi}$ and the optimality Bellman operator $T^{*}$ for an arbitrary value vector $V$ as
$$T^{\pi} V = r^{\pi} + \gamma P^{\pi} V, \qquad (T^{*} V)(s) = \max_{a \in \mathcal{A}} \Big[ r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V(s') \Big].$$
$V$ is optimal if and only if $V = T^{*} V$. MDPs can be solved by value iteration, which repeatedly applies the optimality Bellman operator until a fixed point is reached.
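The fixed-point characterization above can be exercised directly. Below is a minimal value iteration sketch on a small random MDP (dimensions, seed, and variable names are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, g = 4, 3, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a] = distribution over next states
r = rng.uniform(size=(n_s, n_a))

V = np.zeros(n_s)
for _ in range(2000):
    Q = r + g * np.einsum('san,n->sa', P, V)      # one-step lookahead values
    V_new = Q.max(axis=1)                         # optimality Bellman operator T*
    if np.max(np.abs(V_new - V)) < 1e-12:
        break
    V = V_new

# At the fixed point V = T*V, so V is the optimal value.
assert np.allclose(V, (r + g * np.einsum('san,n->sa', P, V)).max(axis=1), atol=1e-9)
```

Since $T^{*}$ is a $\gamma$-contraction, the loop converges geometrically from any starting vector.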
Let $\Pi$ denote the space of all policies and $\mathcal{V}$ denote the space of all state values. We define the value function $V : \Pi \to \mathbb{R}^{|\mathcal{S}|}$ as
$$V(\pi) = V^{\pi} = (I - \gamma P^{\pi})^{-1} r^{\pi}. \qquad (3)$$
The value function is fundamental to many algorithmic solutions of an MDP. Policy iteration (PI) (howard60dynamic) repeatedly alternates between a policy evaluation step and a policy improvement step until convergence. In the policy evaluation step, the state values of the current policy are computed, which involves solving the linear system in Eq. (2). In the policy improvement step, the next policy is obtained by taking a greedy step using the optimality Bellman operator:
$$\pi_{k+1}(s) \in \arg\max_{a \in \mathcal{A}} \Big[ r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^{\pi_k}(s') \Big].$$
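The two steps can be sketched as follows; this is a plain textbook implementation on a random toy MDP, not the paper's code (the einsum computes the one-step lookahead $r(s,a) + \gamma \sum_{s'} P(s'|s,a)V(s')$):

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a, g = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a] = next-state distribution
r = rng.uniform(size=(n_s, n_a))
idx = np.arange(n_s)

pi = np.zeros(n_s, dtype=int)                     # a deterministic initial policy
for _ in range(100):
    # Policy evaluation: solve (I - g P^pi) V = r^pi, the matrix Bellman equation.
    V = np.linalg.solve(np.eye(n_s) - g * P[idx, pi], r[idx, pi])
    # Policy improvement: act greedily with respect to the one-step lookahead.
    Q = r + g * np.einsum('san,n->sa', P, V)
    pi_new = Q.argmax(axis=1)
    if np.array_equal(pi_new, pi):
        break                                     # greedy fixed point: pi is optimal
    pi = pi_new

assert np.allclose(V, Q.max(axis=1), atol=1e-9)
```

At termination the greedy policy reproduces itself, so $V^{\pi} = T^{*}V^{\pi}$ and the policy is optimal.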
Simple policy iteration (SPI) is a variant of policy iteration. It differs from policy iteration only in the policy improvement step, where the policy is updated only for the state-action pair with the largest value of the advantage function
$$\mathrm{adv}(s,a) = r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^{\pi}(s') - V^{\pi}(s).$$
SPI selects a state-action pair maximizing this advantage and then updates the policy accordingly.
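A minimal SPI sketch, assuming the standard advantage definition above (the toy MDP and all names are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n_s, n_a, g = 4, 3, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
r = rng.uniform(size=(n_s, n_a))
idx = np.arange(n_s)

pi = np.zeros(n_s, dtype=int)
for _ in range(500):                              # SPI: one action switch per iteration
    V = np.linalg.solve(np.eye(n_s) - g * P[idx, pi], r[idx, pi])
    adv = r + g * np.einsum('san,n->sa', P, V) - V[:, None]   # advantage of (s, a)
    s, a = np.unravel_index(adv.argmax(), adv.shape)
    if adv[s, a] <= 1e-12:                        # no improving switch left: optimal
        break
    pi[s] = a                                     # switch only the single best pair

Q = r + g * np.einsum('san,n->sa', P, V)
assert np.allclose(V, Q.max(axis=1), atol=1e-8)
```

Each switch strictly improves the value, so no policy repeats and the loop terminates.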
2.1. Geometry of the Value Function
While the space of policies is the Cartesian product of probability simplices, Dadashi2019value proved that the value function space is a possibly non-convex polytope (Ziegler_polytope). Figure 3 shows the value function polytopes of two MDPs (blue regions), one convex and one non-convex. The proof is built upon the line theorem, an equally important geometric property of the value space. The line theorem depends on the following definition of policy determinism.
Definition 2.0 (Policy Determinism).
A policy $\pi$ is

deterministic for state $s$ if it selects a single action with certainty in $s$, i.e., $\pi(a|s) = 1$ for some action $a$;

deterministic if it is deterministic for all states.
The line theorem captures the geometric property of a set of policies that differ in only one state. Specifically, we say two policies agree on states $s_1, \dots, s_k$ if they choose the same action distribution at each $s_i$. For a given policy $\pi$, consider the set of policies that agree with $\pi$ on all states except a fixed state $s$. When we keep the action probabilities fixed at all states but $s$, the value function draws a line segment that is oriented in the positive orthant (that is, one endpoint dominates the other). Furthermore, the endpoints of this line segment are attained by policies that are deterministic in state $s$.
The line theorem is stated as follows:
Theorem 2 (Line theorem (Dadashi2019value)).
Let $s$ be a state and $\pi$ a policy. Then there are two policies that agree with $\pi$ on all states except $s$, denoted $\pi_l$ and $\pi_u$, which bracket the value of every other policy $\pi'$ agreeing with $\pi$ outside $s$: $V^{\pi_l} \preceq V^{\pi'} \preceq V^{\pi_u}$.
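The line theorem can be checked numerically: fixing a randomly drawn policy at all states but one and sweeping the remaining state's action probabilities traces a segment in value space whose endpoints come from deterministic choices at that state. A sketch (toy MDP and names are our own):

```python
import numpy as np

rng = np.random.default_rng(3)
n_s, n_a, g = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
r = rng.uniform(size=(n_s, n_a))

def value(pi):                                # pi[s] is a distribution over actions
    P_pi = np.einsum('sa,san->sn', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(n_s) - g * P_pi, r_pi)

base = rng.dirichlet(np.ones(n_a), size=n_s)  # policy fixed at states 1 and 2

def mixed(t):                                 # vary only state 0: mix its two actions
    pi = base.copy()
    pi[0] = [1.0 - t, t]
    return value(pi)

v0, v1 = mixed(0.0), mixed(1.0)               # deterministic choices at state 0
u = v1 - v0
for t in [0.25, 0.5, 0.75]:
    w = mixed(t) - v0
    lam = w @ u / (u @ u)                     # projection onto the segment direction
    assert -1e-9 <= lam <= 1.0 + 1e-9         # the point lies between the endpoints
    assert np.allclose(w, lam * u, atol=1e-9) # and exactly on the line

# One endpoint dominates the other componentwise (positive-orthant orientation).
assert np.all(v0 <= v1 + 1e-9) or np.all(v1 <= v0 + 1e-9)
```

Note that the map $t \mapsto V$ is not affine in $t$; only its image is a segment, which is exactly what the collinearity check verifies.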
3. The Cell Structure of the Value Function Polytope
In this section, we revisit the geometry of the (nonconvex) value function polytope presented in Dadashi2019value. We establish a connection to linear programming formulations of the MDP which then can be adapted to show a finer description of cells in the value function polytope as unions of cells of a hyperplane arrangement. For more on hyperplane arrangements and their structure, see hyperplanesintro.
It has been known since at least the 1990s that finding the optimal value function of an MDP can be formulated as a linear program (see for example (puterman94markov; bertsekas96neurodynamic)). In the primal form, the feasible region is defined by the constraints $V \succeq T^{*} V$, where $T^{*}$ is the optimality Bellman operator. Concretely, the following linear program is well known to be equivalent to maximizing the expected total reward in Eq. (2):
$$\min_{V} \ \sum_{s \in \mathcal{S}} \alpha(s)\, V(s) \quad \text{s.t.} \quad V(s) \geq r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V(s') \quad \forall\, s \in \mathcal{S},\ a \in \mathcal{A},$$
where $\alpha$ is a probability distribution over $\mathcal{S}$. We call this convex polyhedron the MDP-LP polytope (because it is a linear programming form of the MDP problem). Our main new observation is that the MDP-LP polytope and the value function polytope are closely related, and one can describe the regions of the (non-convex) value function polytope in terms of the (convex) cells of a hyperplane arrangement.
Theorem 1.
Consider the hyperplane arrangement consisting of the bounding hyperplanes of the MDP-LP polytope, i.e., one hyperplane
$$\Big\{ V : V(s) = r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V(s') \Big\}$$
for each state-action pair $(s,a)$, so $|\mathcal{S}|\,|\mathcal{A}|$ hyperplanes in total. Then, the boundary of the value function polytope is the union of finitely many (convex polyhedral) cells of the arrangement. Moreover, each full-dimensional cell of the value function polytope is contained in the union of finitely many full-dimensional cells of the arrangement.
Proof.
Let us first consider a point on the boundary of the value function polytope. Theorem 2 and Corollary 3 of Dadashi2019value show that the boundary of the space of value functions is a (possibly proper) subset of the ensemble of value functions of policies for which at least one state has a fixed deterministic choice of action. Note that, by the value function definition (Eq. (3)), the hyperplane $\{ V : V(s) = r(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \}$ contains the values of all policies taking action $a$ in state $s$. Thus the points of the boundary of the value function polytope are contained in the hyperplanes of the arrangement, and the lower-dimensional cells of the boundary lie in intersections of these hyperplanes.
The zero-dimensional cells (vertices) are clearly a subset of the zero-dimensional cells of the arrangement because, by the above results, the zero-dimensional cells are precisely intersections of hyperplanes from the arrangement, which is equivalent to choosing a fixed action for every state. This corresponds to solving the linear system formed by the bounding hyperplanes (the same as Eq. (2)). More generally, if we fix the policy at only $k$ states, the induced space lies in an affine space of dimension $|\mathcal{S}| - k$. Consider a policy $\pi$ and states $s_1, \dots, s_k$, and consider the columns of the matrix $(I - \gamma P^{\pi})^{-1}$ corresponding to the other states; these define an affine vector space.
Now, for a given policy $\pi$, the value functions generated by the policies that agree with $\pi$ on $s_1, \dots, s_k$ are contained in this affine vector space. The points of this set lying in one or more of the hyperplanes (each hyperplane precisely fixes one state-action pair) form the intersection of the corresponding hyperplanes. Thus we can be sure of the stated containment.
Finally, the only remaining case is a point in the interior of the value function polytope. In that case, because the arrangement partitions the entire Euclidean space, the point must be contained in at least one of the full-dimensional cells of the arrangement. ∎
Figure (a) is an example of the value function polytope (blue), the MDP-LP polytope (green), and its bounding hyperplanes (the arrangement) as blue and red lines. In Figure (b) we illustrate Theorem 1 with a value function polytope with delimited boundaries, where the hyperplanes are indicated in different colors. The deterministic policies are those that select a single action in every state. In both pictures, the values of deterministic policies in the value space are shown as red dots. The boundaries of the value polytope are indeed included in the set of cells of the arrangement, as stated by Theorem 1. These figures of value function polytopes (blue regions) were obtained by randomly sampling policies and plotting their corresponding state values.
Some remarks are in order. Note how several adjacent cells of the MDP arrangement sometimes together form a connected cell of the value function polytope. We also observe that the value function of any policy can be expressed as a convex combination of value functions of deterministic policies; in particular, the value function polytope is included in the convex hull of the value functions of deterministic policies. Figure (b) also demonstrates clearly that the value functions of deterministic policies are not always vertices, and the vertices of the value polytope are not always value functions of deterministic policies, but they are always intersections of hyperplanes of the arrangement. However, the optimal values will always include a deterministic vertex. This observation suggests that it suffices to find the optimal policy by only visiting deterministic policies on the boundary. It is worth noting that the optimal value of our MDP is at the unique intersection vertex of the two polytopes. We note that the blue regions in Figure (a) are not related to the polytope of the dual formulation of the LP. Unlike the MDP-LP polytope, which can be characterized as the intersection of finitely many halfspaces, we do not have such a neat representation for the value function polytope. The pictures presented here, and many more experiments we have done, suggest that the following stronger result is true:
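The sampling procedure behind these figures is easy to reproduce. The sketch below (our own illustration with a hypothetical 2-state, 3-action MDP) samples random policies and checks two of the facts above: every sampled value is dominated by the componentwise maximum over deterministic values, and that maximum is itself attained by a deterministic policy:

```python
import numpy as np

rng = np.random.default_rng(4)
n_s, n_a, g = 2, 3, 0.8
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
r = rng.uniform(size=(n_s, n_a))

def value(pi):
    P_pi = np.einsum('sa,san->sn', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(n_s) - g * P_pi, r_pi)

# Random stochastic policies traced into value space: with |S| = 2 each value is
# a 2-D point, and the cloud fills out the (possibly non-convex) value polytope.
cloud = np.array([value(rng.dirichlet(np.ones(n_a), size=n_s))
                  for _ in range(2000)])

# The nine deterministic policies (the red dots in the figures).
det = np.array([value(np.eye(n_a)[[a0, a1]])
                for a0 in range(n_a) for a1 in range(n_a)])

# A single deterministic policy dominates all others, so the componentwise max
# over deterministic values is the optimal value V*, and it dominates the cloud.
v_star = det.max(axis=0)
assert np.all(cloud <= v_star + 1e-9)
assert np.any(np.all(np.isclose(det, v_star, atol=1e-9), axis=1))
```

Plotting `cloud` as a scatter (e.g. with matplotlib) reproduces the blue regions described above.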
Conjecture: if the value function polytope intersects a cell of the arrangement, then it contains the entire cell; thus all full-dimensional cells of the value function polytope are equal to a union of full-dimensional cells of the arrangement.
Proving this conjecture requires showing that the map from policies to value functions is surjective over the cells it touches. At the moment we can only guarantee that there are no isolated components, because the value function polytope is a compact set. More strongly, Dadashi2019value showed (using the line theorem) that path connectivity between any two points, in any cells, is guaranteed by a polygonal path. More precisely, let $V^{\pi}$ and $V^{\pi'}$ be two value functions. Then there exists a sequence of policies $\pi_1, \dots, \pi_k$ such that $\pi_1 = \pi$, $\pi_k = \pi'$, and for every $i$, consecutive policies differ in only one state, so their values are joined by a line segment.

It has been observed that algorithms for solving MDPs exhibit different learning behavior when visualized in the value polytope. For example, policy gradient methods (sutton2000policy; kakade2002natural; policygradient_actorcritic; policygradientWilliam; policygradientWilliamsPeng91) have an improvement path inside the value function polytope; value iteration can go outside the polytope, meaning that there can be no corresponding policy during the update process; and policy iteration navigates exactly through deterministic policies. In the rest of the paper we use this geometric intuition to design a new algorithm.
4. The Method of Geometric Policy Iteration
We now present geometric policy iteration (GPI), which improves over PI based on the geometric properties of the learning dynamics. Define an action switch to be an update of the policy in any single state. The line theorem shows that policies agreeing on all but one state lie on a line segment, so an action switch is a move along a line segment that improves the value function. In PI, the optimality Bellman operator decides the action to switch to for each state. However, the optimality Bellman operator does not guarantee the largest value improvement at that state. This phenomenon is illustrated in Figure 9, where we plot the value sequences of PI and the proposed GPI.
We propose an alternative action-switch strategy in GPI that directly calculates the improvement of the value function for one state. By choosing the action with the largest value improvement, we always reach an endpoint of a line segment, which potentially reduces the number of action switches.
This strategy requires efficient computation of the value function, because naively recomputing it via Eq. (2) is expensive due to the matrix inversion; PI, by contrast, re-evaluates the value function only once per iteration. Our next theorem states that the new state value can be computed efficiently. The key fact is that the policy improvement step can be done state by state within a sweep over the state set, so adjacent policies in the update sequence differ in only one state.
Theorem 1.
Given $\pi$, $V^{\pi}$, and $Q^{\pi} = (I - \gamma P^{\pi})^{-1}$. If a new policy $\pi'$ differs from $\pi$ only in state $s$, where $\pi'$ takes action $a'$ instead of $a$, then $V^{\pi'}$ can be calculated efficiently by
$$V^{\pi'} = V^{\pi} + \frac{r(s,a') - r(s,a) + \gamma\, w^{\top} V^{\pi}}{1 - \gamma\, w^{\top} q_s}\; q_s, \qquad (4)$$
where $w = P(\cdot|s,a') - P(\cdot|s,a)$ is a $|\mathcal{S}|$-dimensional vector, $1 - \gamma\, w^{\top} q_s$ is a scalar, and $q_s$ is the $s$-th column of $Q^{\pi}$.
Proof.
We give a general argument showing how to calculate $V^{\pi'}$ given a policy $\pi$, $Q^{\pi} = (I - \gamma P^{\pi})^{-1}$, and a policy $\pi'$ that differs from $\pi$ in only one state. Write
$$I - \gamma P^{\pi'} = I - \gamma P^{\pi} - \gamma B,$$
where $B = P^{\pi'} - P^{\pi}$. Assume $\pi$ and $\pi'$ differ in state $s$. Then $B$ is a rank-$1$ matrix whose row $s$ is $w^{\top} = \big(P(\cdot|s,a') - P(\cdot|s,a)\big)^{\top}$ and whose other rows are zero vectors. We can express $B$ as the outer product of two vectors, $B = e_s w^{\top}$, where $e_s$ is the one-hot vector
$$e_s(i) = \begin{cases} 1 & i = s, \\ 0 & i \neq s, \end{cases} \qquad (5)$$
and
$$w = P(\cdot|s,a') - P(\cdot|s,a). \qquad (6)$$
Similarly, $r^{\pi'} = r^{\pi} + \big(r(s,a') - r(s,a)\big)\, e_s$. Then, by the Sherman-Morrison formula,
$$Q^{\pi'} = \big(I - \gamma P^{\pi} - \gamma\, e_s w^{\top}\big)^{-1} = Q^{\pi} + \frac{\gamma\, q_s\, w^{\top} Q^{\pi}}{1 - \gamma\, w^{\top} q_s}.$$
Thus, expanding $V^{\pi'} = Q^{\pi'} r^{\pi'}$ and collecting terms gives
$$V^{\pi'} = V^{\pi} + \frac{r(s,a') - r(s,a) + \gamma\, w^{\top} V^{\pi}}{1 - \gamma\, w^{\top} q_s}\; q_s,$$
which completes the proof. ∎
Theorem 1 suggests that updating the value of a single state via Eq. (4) takes $O(|\mathcal{S}|)$ arithmetic operations, which matches the per-state cost of the optimality Bellman operator used in policy iteration.
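The efficiency claim rests on the Sherman-Morrison identity: a single-state switch changes $I - \gamma P^{\pi}$ by a rank-one term, so the stored inverse can be patched in $O(|\mathcal{S}|^2)$ rather than recomputed in $O(|\mathcal{S}|^3)$. A sketch verifying this against direct inversion (the toy MDP and names are our own, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(5)
n_s, n_a, g = 6, 4, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
r = rng.uniform(size=(n_s, n_a))
idx = np.arange(n_s)

pi = np.zeros(n_s, dtype=int)
A_inv = np.linalg.inv(np.eye(n_s) - g * P[idx, pi])   # (I - g P^pi)^{-1}, kept around

s, a_new = 2, 3                          # switch the action at a single state s
w = P[s, a_new] - P[s, pi[s]]            # the only row of P^pi that changes
u = A_inv[:, s]                          # A^{-1} e_s, the s-th column of the inverse

# Sherman-Morrison:
# (A - g e_s w^T)^{-1} = A^{-1} + g (A^{-1} e_s)(w^T A^{-1}) / (1 - g w^T A^{-1} e_s)
A_inv_new = A_inv + g * np.outer(u, w @ A_inv) / (1.0 - g * (w @ u))

pi[s] = a_new
direct = np.linalg.inv(np.eye(n_s) - g * P[idx, pi])  # O(|S|^3) recomputation
assert np.allclose(A_inv_new, direct)

# The new values then follow in O(|S|^2) from V = (I - g P^pi)^{-1} r^pi.
V_new = A_inv_new @ r[idx, pi]
assert np.allclose(V_new, direct @ r[idx, pi])
```

The denominator is always nonzero for $\gamma < 1$, because $I - \gamma P^{\pi'}$ is invertible for every policy.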
The second improvement over policy iteration comes from the fact that the value improvement path may not be monotonic with respect to action switches. Although it is well known that the update sequence of value functions is nondecreasing in the iteration number, the value function can decrease during the policy improvement step of policy iteration, as illustrated in Figure 12. This is because, when the Bellman operator is used to decide an action switch, the value vector is held fixed for the entire sweep of states. This motivates GPI, which updates the value function after each action switch so that the value function is monotone with respect to action switches. This idea combines seamlessly with Theorem 1, since the values of all states can be updated in $O(|\mathcal{S}|^2)$ arithmetic operations via a rank-one update of the matrix inverse. Thus, the complexity of completing one iteration is the same as policy iteration.
We summarize GPI in Algorithm 1. GPI looks for action switches for all states in one iteration and updates the value function after each action switch. Let the superscript denote the iteration index and the subscript denote the state index within one iteration; to avoid clutter, we write $s$ for the state being updated and drop the iteration superscript where it is clear from context. Step 3 evaluates the initial policy; the difference from PI is that we store the intermediate inverse matrix for later computation. From step 4 to step 8, we iterate over all states to search for potential updates. In step 5, GPI selects the best action by computing the new state value of each potential action switch by Eq. (7).
$$a^{*} \in \arg\max_{a \in \mathcal{A}} \ V(s) + \frac{r(s,a) - r(s,\pi(s)) + \gamma\, w_a^{\top} V}{1 - \gamma\, w_a^{\top} q_s}\; q_s(s), \qquad (7)$$
where
$$w_a = P(\cdot|s,a) - P(\cdot|s,\pi(s)), \qquad (8)$$
$$q_s = Q\, e_s, \qquad (9)$$
and $e_s$ is the one-hot vector with the $s$-th entry being $1$ and all other entries being $0$.
Here $q_s$ is the $s$-th column of the stored inverse $Q$, and $w_{a^{*}}$ is obtained by Eq. (8) using the selected action $a^{*}$. In step 6, we update the inverse as follows:
$$Q \leftarrow Q + \frac{\gamma\, q_s\, w_{a^{*}}^{\top} Q}{1 - \gamma\, w_{a^{*}}^{\top} q_s}. \qquad (10)$$
The policy is updated in step 7, and the value vector is updated in step 8 using the reward vector of the new policy. The algorithm terminates when the optimal values are attained.
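Putting the pieces together, here is a naive end-to-end sketch of a GPI-style sweep. It is our own reconstruction from the description above, not the paper's Algorithm 1: for readability it recomputes full candidate value vectors (an $O(|\mathcal{S}|^2)$ shortcut per candidate rather than the cheaper per-state evaluation), applies the best switch per state, and refreshes the inverse and values after every switch:

```python
import numpy as np

rng = np.random.default_rng(6)
n_s, n_a, g = 5, 4, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
r = rng.uniform(size=(n_s, n_a))
idx = np.arange(n_s)

pi = np.zeros(n_s, dtype=int)
A_inv = np.linalg.inv(np.eye(n_s) - g * P[idx, pi])
V = A_inv @ r[idx, pi]

for _ in range(100):                        # outer iterations (sweeps over the states)
    switched = False
    for s in range(n_s):
        best = None
        for a in range(n_a):                # try every single-state action switch
            if a == pi[s]:
                continue
            w = P[s, a] - P[s, pi[s]]
            u = A_inv[:, s]
            A_cand = A_inv + g * np.outer(u, w @ A_inv) / (1.0 - g * (w @ u))
            pi_cand = pi.copy(); pi_cand[s] = a
            v_cand = A_cand @ r[idx, pi_cand]       # true values after the switch
            if v_cand[s] > V[s] + 1e-12 and (best is None or v_cand[s] > best[2][s]):
                best = (a, A_cand, v_cand)
        if best is not None:                # take the switch with the largest true value
            pi[s], A_inv, V = best          # values refresh right after the switch
            switched = True
    if not switched:
        break                               # no improving switch anywhere: optimal

Q = r + g * np.einsum('san,n->sa', P, V)
assert np.allclose(V, Q.max(axis=1), atol=1e-8)
```

Because every applied switch strictly improves the true value and there are finitely many deterministic policies, the sweep terminates, and a policy with no improving single-state switch is optimal.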
4.1. Theoretical Guarantees
Before we present any properties of GPI, let us first prove the following very useful lemma.
Lemma 0.
Given two policies $\pi$ and $\pi'$, we have the following equalities:
$$V^{\pi'} - V^{\pi} = (I - \gamma P^{\pi'})^{-1} \big( T^{\pi'} V^{\pi} - V^{\pi} \big), \qquad (11)$$
$$V^{\pi'} - V^{\pi} = (I - \gamma P^{\pi})^{-1} \big( V^{\pi'} - T^{\pi} V^{\pi'} \big). \qquad (12)$$
Proof.
Our first result is an immediate consequence of re-evaluating the value function after an action switch.
Proposition 0.
The value function is nondecreasing with respect to action switches in GPI; that is, each action switch produces a value vector that componentwise dominates the previous one.
Proof.
We next turn to the complexity of GPI and bound the number of iterations required to find the optimal solution. The analysis depends on the following lemma.
Lemma 0.
Let $V^{*}$ denote the optimal value. At iteration $k$ of GPI, we have the following inequality.
Proof.
Theorem 5.
GPI finds the optimal policy in iterations.
Proof.
Define with . Then, by Lemma 4, we have
Let be the state such that , the following properties can be obtained by Eq. (12) in Lemma 2.
Also from Eq. (12), we have
(17) 
It follows that
which implies when , the nonoptimal action for in is switched in and will never be switched back to in future iterations. Now we are ready to bound . By taking the logarithm for both sides of , we have
Each nonoptimal action is eliminated after at most iterations, and there are nonoptimal actions. Thus, GPI takes at most iterations to reach the optimal policy. ∎
4.2. Asynchronous Geometric Policy Iteration
When the state set is large, it is beneficial to perform policy updates in an orderless way (Sutton1998): iterating over the entire state set may be prohibitive, and exactly evaluating the value function with Eq. (2) may be too expensive. Thus, in practice, the value function is often approximated when the state set is large. One example is modified policy iteration (puterman_modified_pi; puterman94markov), where the policy evaluation step is approximated by a fixed number of value iteration steps.
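For concreteness, a modified-policy-iteration sketch with an $m$-step approximate evaluation ($m$, the toy MDP, and all names are our own choices, not from the cited works):

```python
import numpy as np

rng = np.random.default_rng(7)
n_s, n_a, g, m = 6, 3, 0.9, 10             # m = evaluation backups per iteration
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
r = rng.uniform(size=(n_s, n_a))
idx = np.arange(n_s)

V = np.zeros(n_s)
for _ in range(500):
    Q = r + g * np.einsum('san,n->sa', P, V)
    pi = Q.argmax(axis=1)                  # greedy policy w.r.t. the current V
    V_old = V.copy()
    for _ in range(m):                     # approximate evaluation: m applications of T^pi
        V = r[idx, pi] + g * P[idx, pi] @ V
    if np.max(np.abs(V - V_old)) < 1e-12:
        break

# The limit satisfies the optimality Bellman equation.
assert np.allclose(V, (r + g * np.einsum('san,n->sa', P, V)).max(axis=1), atol=1e-8)
```

With $m = 1$ this reduces to value iteration, and as $m \to \infty$ it approaches exact policy iteration; intermediate $m$ trades evaluation accuracy for per-iteration cost.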
Since GPI avoids the matrix inversion by updating the value function incrementally, it can update the policy at arbitrary states available to the agent. This opens up the possibility of asynchronous (orderless) updates of policies and the value function when the state set is large, or when the agent has to update the policy for the states it encounters in real time. The asynchronous update strategy can also help avoid getting stuck in states that lead to minimal progress, and it may reach the optimal policy without visiting a certain set of states.
Asynchronous GPI (AsyncGPI) can be described as follows. Assuming the transition matrix is available to the agent, we initialize the policy and calculate the initial inverse matrix and value vector accordingly. In real-time settings, a sequence of states is collected by the agent through interaction with the environment. At each time step, we perform a policy update for the current state using Eq. (7), then update the stored inverse using Eq. (10) and the value vector accordingly. Asynchronous value-based methods converge if each state is visited infinitely often (bertsekasAsyncVI). We demonstrate in experiments that AsyncGPI converges well in practice.
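An AsyncGPI-style sketch under our own simplifications (random state arrivals standing in for a real-time stream, and a stall counter as the stopping rule, neither of which is specified in the paper):

```python
import numpy as np

rng = np.random.default_rng(8)
n_s, n_a, g = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))
r = rng.uniform(size=(n_s, n_a))
idx = np.arange(n_s)

pi = np.zeros(n_s, dtype=int)
A_inv = np.linalg.inv(np.eye(n_s) - g * P[idx, pi])   # (I - g P^pi)^{-1}
V = A_inv @ r[idx, pi]

stall = 0
while stall < 40 * n_s:                    # stop after many consecutive no-op visits
    s = int(rng.integers(n_s))             # states arrive in arbitrary (random) order
    improved = False
    for a in range(n_a):
        if a == pi[s]:
            continue
        w = P[s, a] - P[s, pi[s]]
        u = A_inv[:, s]
        A_cand = A_inv + g * np.outer(u, w @ A_inv) / (1.0 - g * (w @ u))
        pi_cand = pi.copy(); pi_cand[s] = a
        v_cand = A_cand @ r[idx, pi_cand]
        if v_cand[s] > V[s] + 1e-12:       # switch only on a true-value improvement
            pi[s] = a
            A_inv, V = A_cand, v_cand
            improved = True
    stall = 0 if improved else stall + 1

# When no single-state switch improves any visited state, the policy is optimal.
Q = r + g * np.einsum('san,n->sa', P, V)
assert np.allclose(V, Q.max(axis=1), atol=1e-8)
```

Each improving switch strictly increases the value vector, so only finitely many switches occur; the stall counter then ends the loop once the random stream has repeatedly confirmed that no state admits an improvement.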
5. Experiments
We test GPI on random MDPs of different sizes. The baselines are policy iteration (PI) and simple policy iteration (SPI). We compare the number of iterations, the number of action switches, and wall time, where the number of iterations counts sweeps over the entire state set and action switches are the policy updates within each iteration. The results are shown in Figure 28. We generate MDPs of increasing state-set sizes corresponding to Figure 28 (a)-(e), and for each state size we increase the number of actions (horizontal axes) to observe the difference in performance. The rows, from top to bottom, show the number of iterations, the number of action switches, and the wall time (vertical axes), respectively. Since SPI only performs one action switch per iteration, we only show its number of action switches. The purpose of including SPI as a baseline is to verify whether GPI effectively reduces the number of action switches. Because SPI sweeps over the entire state set and updates the single state with the largest improvement, it should have the fewest action switches; however, the larger per-update cost of SPI should lead to higher running time. This is supported by the experiments: Figure 28 (a) and (b) show that SPI (green curves) takes the fewest switches and the longest time. We drop SPI in Figure 28 (c)-(e) to give a clearer comparison between GPI and PI (especially in wall time). The proposed GPI has a clear advantage over PI in almost all tests. The second row of Figure 28 (a) and (b) shows that the number of action switches of GPI is significantly smaller than that of PI and very close to that of SPI, even though a GPI switch is cheaper to compute. The reduction in the number of action switches leads to fewer iterations. Another important observation is that the margin increases as the action set becomes larger.
This is strong empirical evidence that demonstrates the benefits of GPI’s action selection strategy which is to reach the endpoints of line segments in the value function polytope. The larger the action set is, the more policies lying on the line segments and thus the more actions being excluded in one switch. The wall time of GPI is also very competitive compared to PI which further demonstrates that GPI can be a very practical algorithm for solving finite discounted MDPs.
We also test the performance of asynchronous GPI (AsyncGPI) on MDPs of several sizes. For each setting, we randomly generate a sequence of states longer than the state set. We compare AsyncGPI with asynchronous value iteration (AsyncVI), a classic asynchronous dynamic programming algorithm that, at each time step, applies one step of the optimality Bellman operator to the single state available to the algorithm. The results are shown in Figure 33, where the mean of the value function is plotted against the number of updates. We observe that AsyncGPI takes significantly fewer updates to reach the optimal value function, and the gap grows with the size of the state set. These results are expected, because each AsyncGPI update is more expensive, and AsyncVI never exactly evaluates a policy's value function before reaching optimality.
6. Conclusions and Future Work
In this paper, we discussed the geometric properties of finite MDPs. We characterized the hyperplane arrangement containing the boundary of the value function polytope and related it to the MDP-LP polytope by showing that they share the same hyperplane arrangement. Unlike the well-defined MDP-LP polytope, it remains unclear which bounding hyperplanes are active and which of their halfspaces belong to the value space. Besides the conjecture stated earlier, in the future we would like to understand which cells of the hyperplane arrangement form the value function polytope and perhaps derive a bound on the number of convex cells. It is also plausible that the rest of the hyperplane arrangement will help us devise new algorithms for solving MDPs.
Building on the facts that policies differing in only one state are mapped onto a line segment in the value function polytope and that the two endpoint policies are deterministic in that state, we proposed a new algorithm called geometric policy iteration, which is guaranteed to reach an endpoint of the line segment in every action switch. We developed a mechanism that makes the value function monotonically increase with respect to action switches, and the whole process can be computed efficiently. Our experiments showed that our algorithm is very competitive with the widely used policy iteration and value iteration. We believe this type of algorithm can be extended to multi-agent settings, e.g., stochastic games (shapley1953stochastic), and it will also be interesting to apply similar ideas to model-based reinforcement learning.
Acknowledgements.
This work is supported by NSF DMS award 1818969 and a seed award from Center for Data Science and Artificial Intelligence Research at UC Davis.