 # The Value Function Polytope in Reinforcement Learning

We establish geometric and topological properties of the space of value functions in finite state-action Markov decision processes. Our main contribution is the characterization of the nature of its shape: a general polytope (Aigner et al., 2010). To demonstrate this result, we exhibit several properties of the structural relationship between policies and value functions including the line theorem, which shows that the value functions of policies constrained on all but one state describe a line segment. Finally, we use this novel perspective to introduce visualizations to enhance the understanding of the dynamics of reinforcement learning algorithms.


## 1 Introduction

The notion of value function is central to reinforcement learning (RL). It arises directly in the design of algorithms such as value iteration (Bellman, 1957), policy gradient (Sutton et al., 2000), policy iteration (Howard, 1960), and evolutionary strategies (e.g. Szita & Lőrincz, 2006), which either predict it directly or estimate it from samples, while also seeking to maximize it. The value function is also a useful tool for the analysis of approximation errors (Bertsekas & Tsitsiklis, 1996; Munos, 2003).

In this paper we study the map from stationary policies, which are typically used to describe the behaviour of RL agents, to their respective value functions. Specifically, we vary over the joint simplex describing all policies and show that the resulting image forms a polytope, albeit one that is possibly self-intersecting and non-convex.

We provide three results all based on the notion of “policy agreement”, whereby we study the behaviour of the map as we only allow the policy to vary at a subset of all states.

Line theorem. We show that policies that agree on all but one state generate a line segment within the value function polytope, and that this segment is monotone (all state values increase or decrease along it).

Relationship between faces and semi-deterministic policies. We show that the $d$-dimensional faces of this polytope are mapped one-to-many to policies which behave deterministically in at least $|S| - d$ states.

Sub-polytope characterization. We use this result to generalize the line theorem to higher dimensions, and demonstrate that varying a policy along $k$ states generates a $k$-dimensional sub-polytope.

Although our “line theorem” may not be completely surprising or novel to expert practitioners, we believe we are the first to highlight its existence. In turn, it forms the basis of the other two results, which require additional technical machinery which we develop in this paper, leaning on results from convex analysis and topology.

While our characterization is interesting in and of itself, it also opens up new perspectives on the dynamics of learning algorithms. We use the value polytope to visualize the expected behaviour and pitfalls of common algorithms: value iteration, policy iteration, policy gradient, natural policy gradient (Kakade, 2002), and finally the cross-entropy method (de Boer, 2004).

## 2 Preliminaries

We are in the reinforcement learning setting (Sutton & Barto, 2018). We consider a Markov decision process with $S$ the finite state space, $A$ the finite action space, $r$ the reward function, $P$ the transition function, and $\gamma$ the discount factor, for which we assume $\gamma \in [0, 1)$. We denote the number of states by $|S|$ and the number of actions by $|A|$.

A stationary policy $\pi$ is a mapping from states to distributions over actions; we denote the space of all policies by $\mathcal{P}(A)^S$. Taken with the transition function $P$, a policy $\pi$ defines a state-to-state transition function $P^\pi$:

$$P^\pi(s' \mid s) = \sum_{a \in A} \pi(a \mid s) P(s' \mid s, a).$$

The value function $V^\pi$ is defined as the expected cumulative discounted reward from starting in a particular state and acting according to $\pi$:

$$V^\pi(s) = \mathbb{E}_{P^\pi} \Big[ \sum_{i=0}^{\infty} \gamma^i r(s_i, a_i) \,\Big|\, s_0 = s \Big].$$

The Bellman equation (Bellman, 1957) connects the value function $V^\pi$ at a state $s$ with the value function at the subsequent states when following $\pi$:

$$V^\pi(s) = \mathbb{E}_{P^\pi} \big[ r(s, a) + \gamma V^\pi(s') \big]. \tag{1}$$

Throughout we will make use of vector notation (e.g. Puterman, 1994). Specifically, we view (with some abuse of notation) $P^\pi$ as an $|S| \times |S|$ matrix, $V^\pi$ as an $|S|$-dimensional vector, and write $r^\pi$ for the vector of expected rewards under $\pi$. In this notation, the Bellman equation for a policy $\pi$ is

$$V^\pi = r^\pi + \gamma P^\pi V^\pi, \qquad \text{whose solution is} \qquad V^\pi = (I - \gamma P^\pi)^{-1} r^\pi.$$
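As a sanity check, the vector-form Bellman equation can be solved directly with NumPy. The sketch below uses the two-state, two-action MDP of Appendix A (Section 5); the helper name `value_function` is ours, not the paper's.

```python
import numpy as np

gamma = 0.9
# Two-state, two-action MDP from Appendix A (Section 5).
# P[s, a] is the next-state distribution; r[s, a] the expected reward.
P = np.array([[[0.70, 0.30], [0.99, 0.01]],
              [[0.20, 0.80], [0.99, 0.01]]])
r = np.array([[-0.45, -0.10], [0.50, 0.50]])

def value_function(pi):
    """Solve the Bellman equation V = r_pi + gamma * P_pi V exactly."""
    P_pi = np.einsum('sa,sat->st', pi, P)   # state-to-state transition matrix
    r_pi = np.einsum('sa,sa->s', pi, r)     # expected reward per state
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

V = value_function(np.full((2, 2), 0.5))    # uniformly random policy
```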

In this work we study how the value function changes as we continuously vary the policy $\pi$. As such, we will find it convenient to also view the value function as the functional

$$f_v : \mathcal{P}(A)^S \to \mathbb{R}^S, \qquad \pi \mapsto V^\pi = (I - \gamma P^\pi)^{-1} r^\pi.$$

We will use the notation $V^\pi$ when the emphasis is on the vector itself, and $f_v(\pi)$ when the emphasis is on the mapping from policies to value functions.

Finally, we will use $\preccurlyeq$ and $\succcurlyeq$ for element-wise vector inequalities, and for a function $f$ and a subset $X$ of its domain, write $f(X)$ to mean the image of $f$ applied to $X$.

### 2.1 Polytopes in $\mathbb{R}^n$

Central to our work will be the result that the image of the functional $f_v$ applied to the space of policies forms a polytope, possibly nonconvex and self-intersecting, with certain structural properties. This section lays down some of the necessary definitions and notation. For a complete overview of the topic, we refer the reader to Grünbaum et al. (1967), Ziegler (2012), and Brondsted (2012).

We begin by characterizing what it means for a subset $P \subseteq \mathbb{R}^n$ to be a convex polytope or polyhedron. In what follows we write $\mathrm{Conv}(x_1, \dots, x_k)$ to denote the convex hull of the points $x_1, \dots, x_k$.

###### Definition 1 (Convex Polytope).

$P \subseteq \mathbb{R}^n$ is a convex polytope iff there are finitely many points $x_1, \dots, x_k \in \mathbb{R}^n$ such that $P = \mathrm{Conv}(x_1, \dots, x_k)$.

###### Definition 2 (Convex Polyhedron).

$P$ is a convex polyhedron iff there are finitely many half-spaces $\hat H_1, \dots, \hat H_k$ whose intersection is $P$, that is

$$P = \bigcap_{i=1}^{k} \hat H_i.$$

A celebrated result from convex analysis relates these two definitions: a bounded, convex polyhedron is a polytope (Ziegler, 2012).

The next two definitions generalize convex polytopes and polyhedra to non-convex bodies.

###### Definition 3 (Polytope).

A (possibly non-convex) polytope is a finite union of convex polytopes.

###### Definition 4 (Polyhedron).

A (possibly non-convex) polyhedron is a finite union of convex polyhedra.

We will make use of another, recursive characterization based on the notion that the boundaries of a polytope should be “flat” in a topological sense (Klee, 1959).

For an affine subspace $K \subseteq \mathbb{R}^n$, recall that $H \subseteq K$ is a relative neighbourhood of a point $x \in K$ if $x \in H$ and $H$ is open in $K$. For $P \subseteq K$, the relative interior of $P$ in $K$, denoted $\mathrm{relint}_K(P)$, is then the set of points in $P$ which have a relative neighbourhood contained in $P$. The notion of "open in $K$" is key here: a point that lies on an edge of the unit square does not have a relative neighbourhood in the square, but it has a relative neighbourhood in that edge. The relative boundary $\partial_K P$ is defined as the set of points in $P$ not in the relative interior of $P$, that is

$$\partial_K P = P \setminus \mathrm{relint}_K(P).$$

Finally, we recall that $H \subseteq K$ is a hyperplane in $K$ if $H$ is an affine subspace of $K$ of dimension $\dim(K) - 1$.

###### Proposition 1.

$P \subseteq K$ is a polyhedron in an affine subspace $K \subseteq \mathbb{R}^n$ if

1. $P$ is closed;

2. there are finitely many hyperplanes $H_1, \dots, H_k$ in $K$ whose union contains the relative boundary of $P$ in $K$: $\partial_K P \subseteq H_1 \cup \dots \cup H_k$; and

3. for each of these hyperplanes, $P \cap H_i$ is a polyhedron in $H_i$.

All proofs may be found in the appendix.

## 3 The Space of Value Functions

We now turn to the main object of our study, the space of value functions $\mathcal{V}$. The space of value functions is the set of all value functions that are attained by some policy. As noted earlier, this corresponds to the image of $\mathcal{P}(A)^S$ under the mapping $f_v$:

$$\mathcal{V} = f_v(\mathcal{P}(A)^S) = \{ f_v(\pi) \mid \pi \in \mathcal{P}(A)^S \}. \tag{2}$$

As a warm-up, Figure 2 depicts the space $\mathcal{V}$ corresponding to four 2-state MDPs; each set is made of the value functions corresponding to 50,000 policies sampled uniformly at random from $\mathcal{P}(A)^S$. The specifics of all MDPs depicted in this work can be found in Appendix A.
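The sampling behind figures such as Figure 2 can be sketched as follows, again on the two-state MDP of Appendix A; `value_function`, the sample count, and the Dirichlet sampler are illustrative choices of ours.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.70, 0.30], [0.99, 0.01]],
              [[0.20, 0.80], [0.99, 0.01]]])
r = np.array([[-0.45, -0.10], [0.50, 0.50]])

def value_function(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

rng = np.random.default_rng(0)
# Sample policies uniformly from the product of simplices P(A)^S:
# one Dirichlet(1, 1) draw per state.
values = np.array([value_function(rng.dirichlet(np.ones(2), size=2))
                   for _ in range(1000)])
# Scatter-plotting the rows (V(s1), V(s2)) reproduces the shapes of Figure 2.
```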

While the space of policies is easily described (it is the Cartesian product of simplices), value function spaces arise as complex polytopes. Of note, they may be non-convex – justifying our more intricate definition.

In passing, we remark that the polytope gives a clear illustration of the following classic results regarding MDPs (e.g. Bertsekas & Tsitsiklis, 1996):

• (Dominance of $V^*$) The optimal value function $V^*$ is the unique dominating vertex of $\mathcal{V}$;

• (Monotonicity) The edges of $\mathcal{V}$ are oriented with the positive orthant;

• (Continuity) The space $\mathcal{V}$ is connected.

The next sections will formalize these and other, less-understood properties of the space of value functions.

### 3.1 Basic Shape from Topology

We begin with a first approximation of how the functional $f_v$ transforms the space of policies into the space of value functions (Figure 1). Recall that

$$f_v(\pi) = (I - \gamma P^\pi)^{-1} r^\pi.$$

Hence $f_v$ is infinitely differentiable everywhere on $\mathcal{P}(A)^S$ (Appendix C). The following is a topological consequence of this property, along with the fact that $\mathcal{P}(A)^S$ is a compact and connected set.

###### Lemma 1.

The space of value functions is compact and connected.

The interested reader may find more details on this topological argument in (Engelking, 1989).

### 3.2 Policy Agreement and Policy Determinism

Two notions play a central role in our analysis: policy agreement and policy determinism.

###### Definition 5 (Policy Agreement).

Two policies $\pi_1, \pi_2$ agree on states $s_1, \dots, s_k$ if $\pi_1(\cdot \mid s_i) = \pi_2(\cdot \mid s_i)$ for each $i = 1, \dots, k$.

For a given policy $\pi$, we denote by $Y^\pi_{s_1,\dots,s_k}$ the set of policies which agree with $\pi$ on $s_1, \dots, s_k$; we will also write $Y^\pi_{S \setminus \{s\}}$ to describe the set of policies that agree with $\pi$ on all states except $s$. Note that policy agreement does not imply disagreement elsewhere; in particular, $\pi \in Y^\pi_{s_1,\dots,s_k}$ for any subset of states $s_1, \dots, s_k$.

###### Definition 6 (Policy Determinism).

A policy $\pi$ is

1. $s$-deterministic for $s \in S$ if $\pi(a \mid s) \in \{0, 1\}$ for every $a \in A$.

2. semi-deterministic if it is $s$-deterministic for at least one $s \in S$.

3. deterministic if it is $s$-deterministic for all states $s \in S$.

We will denote by $D_{s,a}$ the set of semi-deterministic policies that take action $a$ with probability 1 when in state $s$.

###### Lemma 2.

Consider two policies $\pi_1, \pi_2$ that agree on $s_1, \dots, s_k$. Then the vector $r^{\pi_1} - r^{\pi_2}$ has zeros in the components corresponding to $s_1, \dots, s_k$, and the matrix $P^{\pi_1} - P^{\pi_2}$ has zeros in the corresponding rows.

This lemma highlights that when two policies agree on a given state, they have the same immediate dynamics at this state: they receive the same expected reward and induce the same next-state transition probabilities. Lemma 3 in Section 3.3 is a direct consequence of this property.

### 3.3 Value Functions and Policy Agreement

We begin our characterization by considering the subsets of value functions that are generated when the action probabilities are kept fixed at certain states, that is: when we restrict the functional to the set of policies that agree with some base policy on these states.

Something special arises when we keep the probabilities fixed at all but one state $s$: the functional $f_v$ draws a line segment which is oriented in the positive orthant (that is, one end dominates the other end). Furthermore, the extremes of this line segment can be taken to be $s$-deterministic policies. This is the main result of this section, which we now state more formally.

###### Theorem 1.

[Line Theorem] Let $s$ be a state and $\pi$ a policy. Then there are two $s$-deterministic policies in $Y^\pi_{S \setminus \{s\}}$, denoted $\pi_l, \pi_u$, which bracket the value of all other policies $\pi' \in Y^\pi_{S \setminus \{s\}}$:

$$f_v(\pi_l) \preccurlyeq f_v(\pi') \preccurlyeq f_v(\pi_u).$$

Furthermore, the image of $f_v$ restricted to $Y^\pi_{S \setminus \{s\}}$ is a line segment, and the following three sets are equivalent:

1. $f_v(Y^\pi_{S \setminus \{s\}})$,

2. $\{ f_v(\alpha \pi_u + (1 - \alpha)\pi_l) \mid \alpha \in [0, 1] \}$,

3. $\{ \alpha f_v(\pi_u) + (1 - \alpha) f_v(\pi_l) \mid \alpha \in [0, 1] \}$.

The second part of Theorem 1 states that one can generate the set of value functions $f_v(Y^\pi_{S \setminus \{s\}})$ in two ways: either by drawing the line segment in value space, from $f_v(\pi_l)$ to $f_v(\pi_u)$, or by drawing the line segment in policy space, from $\pi_l$ to $\pi_u$, and then mapping it to value space. While somewhat technical, this characterization of the line segment is needed to prove some of our later results. Figure 3 illustrates the path drawn by interpolating between two policies that agree on all states but one.

Figure 3: Illustration of Theorem 1. The orange points are the value functions of mixtures of policies that agree everywhere but one state.
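The line theorem can be checked numerically on the two-state MDP of Appendix A: varying the policy at one state while keeping the other state fixed produces collinear, componentwise-monotone value functions. The endpoint policies `pi_l` and `pi_u` below are illustrative; which endpoint dominates depends on the MDP.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.70, 0.30], [0.99, 0.01]],
              [[0.20, 0.80], [0.99, 0.01]]])
r = np.array([[-0.45, -0.10], [0.50, 0.50]])

def value_function(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Two s1-deterministic policies that agree on s2: endpoints as in Theorem 1.
pi_l = np.array([[1.0, 0.0], [0.3, 0.7]])
pi_u = np.array([[0.0, 1.0], [0.3, 0.7]])
mix = np.linspace(0.0, 1.0, 11)
points = np.array([value_function((1 - m) * pi_l + m * pi_u) for m in mix])

# Collinearity: the 2D cross product of displacements from the first point
# against the endpoint displacement vanishes for every interpolated value.
d = points - points[0]
cross = d[:, 0] * d[-1, 1] - d[:, 1] * d[-1, 0]
```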

Theorem 1 depends on two lemmas, which we now provide in turn. Consider a policy $\pi$ and states $s_1, \dots, s_k$, and write $C^\pi_{k+1}, \dots, C^\pi_{|S|}$ for the columns of the matrix $(I - \gamma P^\pi)^{-1}$ corresponding to the states other than $s_1, \dots, s_k$. Define the hyperplane

$$H^\pi_{s_1,\dots,s_k} = V^\pi + \mathrm{Span}(C^\pi_{k+1}, \dots, C^\pi_{|S|}).$$
###### Lemma 3.

Consider a policy $\pi$ and states $s_1, \dots, s_k$. Then the value functions generated by $Y^\pi_{s_1,\dots,s_k}$ are contained in the hyperplane $H^\pi_{s_1,\dots,s_k}$:

$$f_v(Y^\pi_{s_1,\dots,s_k}) \subseteq \mathcal{V} \cap H^\pi_{s_1,\dots,s_k}.$$

Put another way, Lemma 3 shows that if we fix the policy on $k$ states, the induced space of value functions loses at least $k$ degrees of freedom: it lies in an $(|S| - k)$-dimensional affine vector space.

For $k = |S| - 1$, Lemma 3 implies that the value functions $f_v(Y^\pi_{S \setminus \{s\}})$ lie on a line; however, the following lemma is necessary to expose the full structure of $f_v(Y^\pi_{S \setminus \{s\}})$ within this line.

###### Lemma 4.

Consider the ensemble $Y^\pi_{S \setminus \{s\}}$ of policies that agree with a policy $\pi$ everywhere but on $s \in S$. For $\pi_0, \pi_1 \in Y^\pi_{S \setminus \{s\}}$ and $\mu \in [0, 1]$, define the function

$$g(\mu) = f_v(\mu \pi_1 + (1 - \mu)\pi_0).$$

Then the following hold regarding $g$:

1. $g$ is continuously differentiable;

2. (Total order) $g(0) \preccurlyeq g(1)$ or $g(0) \succcurlyeq g(1)$;

3. If $g(0) = g(1)$, then $g(\mu) = g(0)$ for all $\mu \in [0, 1]$;

4. (Monotone interpolation) If $g(0) \neq g(1)$, there is a function $\rho : [0, 1] \to [0, 1]$ such that $g(\mu) = \rho(\mu) g(1) + (1 - \rho(\mu)) g(0)$, and $\rho$ is a strictly monotonic rational function of $\mu$.

The result (ii) in Lemma 4 was established by Mansour & Singh (1999) for deterministic policies. We remark that (iv) in Lemma 4 arises from the Sherman-Morrison formula, which has been used in reinforcement learning for efficient sequential matrix inverse estimation (Bradtke & Barto, 1996). Note that in general $\rho(\mu) \neq \mu$, as the following example demonstrates.

###### Example 1.

Suppose $S = \{s_1, s_2\}$, with $s_2$ terminal with no reward associated to it, and $A = \{a_1, a_2\}$. The transitions and rewards from $s_1$ are defined by $P(s_1 \mid s_1, a_1) = 1$, $r(s_1, a_1) = 0$, $P(s_2 \mid s_1, a_2) = 1$, $r(s_1, a_2) = 1$. Define two deterministic policies $\pi_1, \pi_2$ such that $\pi_1(a_1 \mid s_1) = 1$ and $\pi_2(a_2 \mid s_1) = 1$. We have

$$f_v((1 - \mu)\pi_1 + \mu \pi_2) = \Big[ \frac{\mu}{1 - \gamma(1 - \mu)},\; 0 \Big]^\top.$$
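The closed form can be verified numerically. The sketch below assumes the natural reconstruction of the example's dynamics implied by that formula: $a_1$ loops on $s_1$ with reward 0, while $a_2$ moves to the absorbing terminal state $s_2$ with reward 1.

```python
import numpy as np

gamma = 0.9
# Assumed dynamics of Example 1: a1 stays in s1 (reward 0); a2 reaches the
# absorbing terminal s2 (reward 1); s2 yields no further reward.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
r = np.array([[0.0, 1.0], [0.0, 0.0]])

def value_function(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

def predicted(mu):
    # The closed form from Example 1.
    return np.array([mu / (1.0 - gamma * (1.0 - mu)), 0.0])

errs = [np.abs(value_function(np.array([[1 - m, m], [1.0, 0.0]])) - predicted(m)).max()
        for m in np.linspace(0.0, 1.0, 5)]
```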

Remarkably, Theorem 1 shows that policies agreeing on all but one state draw line segments irrespective of the size of the action space; this may be of particular interest in the context of continuous-action problems. Second, this structure is unique, in the sense that the paths traced by interpolating between two arbitrary policies may be neither linear nor monotone (Figure 4 depicts two examples).

Figure 4: Value functions of mixtures of two policies in the general case. The orange points describe the value functions of mixtures of two policies.

### 3.4 Convex Consequences of Theorem 1

Some consequences arise immediately from Theorem 1. First, the result suggests a recursive application from the value function of a policy into its deterministic constituents.

###### Corollary 1.

For any set of states $s_1, \dots, s_k$ and a policy $\pi$, $f_v(\pi)$ can be expressed as a convex combination of value functions of policies that are $s$-deterministic for $s \in \{s_1, \dots, s_k\}$. In particular, $\mathcal{V}$ is included in the convex hull of the value functions of deterministic policies.

This result indicates a relationship between the vertices of $\mathcal{V}$ and deterministic policies. Nevertheless, we observe in Figure 5 that the value functions of deterministic policies are not necessarily vertices of $\mathcal{V}$, and that the vertices of $\mathcal{V}$ are not necessarily attained by value functions of deterministic policies.

Figure 5: Visual representation of Corollary 1. The space of value functions is included in the convex hull of value functions of deterministic policies (red dots).
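Corollary 1 can be probed numerically on the two-state MDP of Appendix A. The `in_hull` helper below is our own 2D membership test (it checks all triangles formed by the candidate vertices); it is not part of the paper.

```python
import numpy as np
from itertools import combinations, product

gamma = 0.9
P = np.array([[[0.70, 0.30], [0.99, 0.01]],
              [[0.20, 0.80], [0.99, 0.01]]])
r = np.array([[-0.45, -0.10], [0.50, 0.50]])

def value_function(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Value functions of the four deterministic policies (one action per state).
dets = [np.eye(2)[list(choice)] for choice in product(range(2), repeat=2)]
v_det = np.array([value_function(d) for d in dets])

def in_hull(p, pts, tol=1e-9):
    """2D test: p is in the hull iff some triangle of pts contains it."""
    for i, j, k in combinations(range(len(pts)), 3):
        A = np.column_stack([pts[j] - pts[i], pts[k] - pts[i]])
        if abs(np.linalg.det(A)) < 1e-12:
            continue  # skip degenerate (collinear) triples
        lam = np.linalg.solve(A, p - pts[i])
        if lam.min() >= -tol and lam.sum() <= 1 + tol:
            return True
    return False

v_mixed = value_function(np.array([[0.6, 0.4], [0.2, 0.8]]))
```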

The space of value functions is in general not convex. However, it does possess a weaker structural property regarding paths between value functions which is reminiscent of policy iteration-type results.

###### Corollary 2.

Let $V_1$ and $V_2$ be two value functions. Then there exists a sequence of policies $\pi_1, \dots, \pi_k$ such that $f_v(\pi_1) = V_1$, $f_v(\pi_k) = V_2$, and for every $i \in \{1, \dots, k - 1\}$, the set

$$\{ f_v(\alpha \pi_i + (1 - \alpha)\pi_{i+1}) \mid \alpha \in [0, 1] \}$$

forms a line segment.

### 3.5 The Boundary of V

We are almost ready to show that $\mathcal{V}$ is a polytope. To do so, however, we need to show that the boundary of the space of value functions is described by semi-deterministic policies.

While at first glance reasonable given our earlier topological analysis, the result is complicated by the many-to-one mapping from policies to value functions, and requires additional tooling not provided by the line theorem. Recall from Lemma 3 the use of the hyperplane $H^\pi_{s_1,\dots,s_k}$ to constrain the value functions generated by fixing certain action probabilities.

###### Theorem 2.

Consider the ensemble $Y^\pi_{s_1,\dots,s_k}$ of policies that agree with $\pi$ on states $s_1, \dots, s_k$. Suppose that for every $s \notin \{s_1, \dots, s_k\}$ there is an $a \in A$ s.t. $\pi(a \mid s) \in (0, 1)$; then $f_v(\pi)$ has a relative neighborhood in $\mathcal{V} \cap H^\pi_{s_1,\dots,s_k}$.

Theorem 2 demonstrates by contraposition that the boundary of the space of value functions is a subset of the ensemble of value functions of semi-deterministic policies. Figure 6 shows that this inclusion can be strict: some semi-deterministic policies have value functions lying in the interior.

###### Corollary 3.

Consider a policy $\pi$, the states $s_1, \dots, s_k$, and the ensemble $Y^\pi_{s_1,\dots,s_k}$ of policies that agree with $\pi$ on $s_1, \dots, s_k$. Define $\mathcal{V}_Y = f_v(Y^\pi_{s_1,\dots,s_k})$; we have that the relative boundary of $\mathcal{V}_Y$ in $H^\pi_{s_1,\dots,s_k}$ is included in the value functions spanned by policies in $Y^\pi_{s_1,\dots,s_k}$ that are $s$-deterministic for some $s \notin \{s_1, \dots, s_k\}$:

$$\partial \mathcal{V}_Y \subseteq \bigcup_{s \notin \{s_1,\dots,s_k\}} \bigcup_{a \in A} f_v\big(Y^\pi_{s_1,\dots,s_k} \cap D_{s,a}\big),$$

where $\partial$ refers to $\partial_{H^\pi_{s_1,\dots,s_k}}$.

Figure 6: Visual representation of Corollary 3. The orange points are the value functions of semi-deterministic policies.

### 3.6 The Polytope of Value Functions

We are now in a position to combine the results of the previous sections to arrive at our main contribution: $\mathcal{V}$ is a polytope in the sense of Definition 3 and Proposition 1. Our result is in fact stronger: we show that any ensemble of policies that agree on a set of states generates a sub-polytope of $\mathcal{V}$.

###### Theorem 3.

Consider a policy $\pi$, the states $s_1, \dots, s_k$, and the ensemble $Y^\pi_{s_1,\dots,s_k}$ of policies that agree with $\pi$ on $s_1, \dots, s_k$. Then $f_v(Y^\pi_{s_1,\dots,s_k})$ is a polytope; in particular, $\mathcal{V} = f_v(\mathcal{P}(A)^S)$ is a polytope.

Despite the evidence gathered in the previous section in favour of the above theorem, the result is surprising given the fundamental non-linearity of the functional $f_v$: again, mixtures of policies can describe curves (Figure 4), and even the mapping $g$ in Lemma 4 is nonlinear in $\mu$.

That the polytope can be non-convex is obvious from the preceding figures. As Figure 6 (right) shows, this can happen when value functions along two different line segments cross. At that intersection, something interesting occurs: there are two policies with the same value function but that do not agree on either state. We will illustrate the effect of this structure on learning dynamics in Section 5.

Finally, there is a natural sub-polytope structure in the space of value functions. If policies are free to vary only on a subset of states of cardinality $k$, then there is a polytope of dimension $k$ associated with the induced space of value functions. This makes sense since constraining policies on a subset of states is equivalent to defining a new MDP, where the transitions associated with the complement of this subset of states are not dependent on policy decisions.

## 4 Related Work

The link between geometry and reinforcement learning has so far been fairly limited. However, we note the prior use of convex polyhedra in the following:

Simplex Method and Policy Iteration. The policy iteration algorithm (Howard, 1960) closely relates to the simplex algorithm (Dantzig, 1948). In fact, when the policy is updated at no more than one state per iteration, policy iteration is exactly the simplex method; this variant is sometimes referred to as simple policy iteration. While the simplex algorithm has exponential worst-case running time (Littman et al., 1995), the simplex method applied to MDPs with an adequate pivot rule converges in polynomial time for a fixed discount rate (Ye, 2011).

Linear Programming. Finding the optimal value function of an MDP can be formulated as a linear program (Puterman, 1994; Bertsekas & Tsitsiklis, 1996; De Farias & Van Roy, 2003; Wang et al., 2007). In the primal form, the feasible set is defined by the constraint $V \succcurlyeq T^* V$, where $T^*$ is the optimality Bellman operator. Notice that only one value function of a policy is feasible, which is exactly the optimal value function $V^*$.

The dual formulation consists of maximizing the expected return for a given initial state distribution, as a function of the discounted state action visit frequency distribution. Contrary to the primal form, any feasible discounted state action visit frequency distribution maps to an actual policy (Wang et al., 2007).
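The primal characterization can be illustrated without an LP solver: iterating $T^*$ yields $V^*$, which satisfies the constraint $V \succcurlyeq T^* V$ with equality, while the value function of a suboptimal policy violates it. A sketch on the two-state MDP of Appendix A, with our own helper names:

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.70, 0.30], [0.99, 0.01]],
              [[0.20, 0.80], [0.99, 0.01]]])
r = np.array([[-0.45, -0.10], [0.50, 0.50]])

def bellman_optimality(V):
    """T*V(s) = max_a [ r(s, a) + gamma * E_{s'} V(s') ]."""
    return np.max(r + gamma * np.einsum('sat,t->sa', P, V), axis=1)

# V* as the fixed point of T*, found by iteration.
V_star = np.zeros(2)
for _ in range(1000):
    V_star = bellman_optimality(V_star)

def value_function(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

V_unif = value_function(np.full((2, 2), 0.5))   # a suboptimal policy's value
```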

## 5 Dynamics in the Polytope

In this section we study how the behaviour of common reinforcement learning algorithms is reflected in the value function polytope. We consider two value-based methods, value iteration and policy iteration, three variants of the policy gradient method, and an evolutionary strategy.

Our experiments use the two-state, two-action MDP depicted elsewhere in this paper (details in Appendix A). Value-based methods are parametrized directly in terms of the value vector in $\mathbb{R}^{|S|}$; policy-based methods are parametrized using the softmax distribution, with one parameter per state. We initialize all methods at the same starting value functions (indicated on Figure 7): near a vertex, near a boundary, and in the interior of the polytope.¹

¹ The use of the softmax precludes initializing policy-based methods exactly at boundaries.

We are chiefly interested in three aspects of the different algorithms’ learning dynamics: 1) the path taken through the value polytope, 2) the speed at which they traverse the polytope, and 3) any accumulation points that occur along this path. As such, we compute model-based versions of all relevant updates; in the case of evolutionary strategies, we use large population sizes (de Boer, 2004).

### 5.1 Value Iteration

Value iteration (Bellman, 1957) consists of the repeated application of the optimality Bellman operator $T^*$, defined by $(T^* V)(s) = \max_{a \in A} \big[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \big]$:

$$V_{k+1} := T^* V_k,$$

where in all cases $V_0$ is initialized to the relevant starting value function.
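A minimal model-based value iteration sketch on the two-state MDP of Appendix A (`bellman_optimality` is our helper name):

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.70, 0.30], [0.99, 0.01]],
              [[0.20, 0.80], [0.99, 0.01]]])
r = np.array([[-0.45, -0.10], [0.50, 0.50]])

def bellman_optimality(V):
    return np.max(r + gamma * np.einsum('sat,t->sa', P, V), axis=1)

V = np.zeros(2)
iterates = [V]
for _ in range(200):
    V = bellman_optimality(V)
    iterates.append(V)
# Plotting the iterates in the (V(s1), V(s2)) plane reproduces paths like
# those of Figure 7; intermediate iterates need not be the value function
# of any policy.
```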

Figure 7 depicts the paths in value space taken by value iteration, from the starting point to the optimal value function $V^*$. We observe that the path does not remain within the polytope: value iteration generates a sequence of vectors that may not map to any policy. Our visualization also highlights results by Bertsekas (1994) showing that value iteration spends most of its time along the constant (1, 1) vector, and that the “real” convergence rate is in terms of the second largest eigenvalue of the transition matrix $P^{\pi^*}$.

### 5.2 Policy Iteration

Policy iteration (Howard, 1960) consists of the repeated application of a policy improvement step and a policy evaluation step until convergence to the optimal policy. The policy improvement step updates the policy by acting greedily according to the current value function; the value function of the new policy is then evaluated. The algorithm is based on the following update rule

$$\pi_{k+1} := \mathrm{greedy}(V_k), \qquad V_{k+1} := \mathrm{evaluate}(\pi_{k+1}),$$

with $V_0$ initialized as in value iteration.

Figure 8: Policy iteration. The red arrows show the sequence of value functions (blue) generated by the algorithm.

The sequence of value functions visited by policy iteration (Figure 8) corresponds to value functions of deterministic policies, which in this specific MDP corresponds to vertices of the polytope.
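A model-based policy iteration sketch on the same MDP; `greedy` and `evaluate` are our implementations of the two steps above.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.70, 0.30], [0.99, 0.01]],
              [[0.20, 0.80], [0.99, 0.01]]])
r = np.array([[-0.45, -0.10], [0.50, 0.50]])

def evaluate(pi):
    """Exact policy evaluation: V = (I - gamma P_pi)^{-1} r_pi."""
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

def greedy(V):
    """Deterministic policy acting greedily with respect to V."""
    Q = r + gamma * np.einsum('sat,t->sa', P, V)
    pi = np.zeros_like(r)
    pi[np.arange(2), np.argmax(Q, axis=1)] = 1.0
    return pi

V = np.zeros(2)
for _ in range(20):
    pi = greedy(V)
    V_next = evaluate(pi)
    if np.allclose(V_next, V):
        break
    V = V_next
```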

### 5.3 Policy Gradient

Policy gradient is a popular approach for directly optimizing the value function via parametrized policies (Williams, 1992; Konda & Tsitsiklis, 2000; Sutton et al., 2000). For a policy $\pi_\theta$ with parameters $\theta$, the policy gradient is

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^\pi,\, a \sim \pi(\cdot \mid s)} \Big[ \nabla_\theta \log \pi(a \mid s) \big( r(s, a) + \gamma \, \mathbb{E}\, V^\pi(s') \big) \Big],$$

where $d^\pi$ is the discounted stationary distribution over states; here we assume a uniformly random initial distribution over the states. The policy gradient update is then

$$\theta_{k+1} := \theta_k + \eta \nabla_\theta J(\theta_k).$$

Figure 9 shows that the convergence rate of policy gradient strongly depends on the initial condition. In particular, Figures 9a) and 9b) show accumulation points along the update path (not shown here, the method does eventually converge to $V^*$). This behaviour is sensible given the dependence of the gradient on $\pi(a \mid s)$, with gradients vanishing at the boundary of the polytope. We provide the corresponding gradient fields in Appendix E.
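A model-based sketch of the softmax policy-gradient setup on the two-state MDP of Appendix A. For brevity we use a finite-difference approximation of $\nabla_\theta J$ in place of the analytic gradient; the learning rate and iteration count are illustrative choices of ours.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.70, 0.30], [0.99, 0.01]],
              [[0.20, 0.80], [0.99, 0.01]]])
r = np.array([[-0.45, -0.10], [0.50, 0.50]])

def value_function(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

def softmax(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def J(theta):
    # Objective under a uniform initial state distribution.
    return value_function(softmax(theta)).mean()

def grad(theta, eps=1e-6):
    # Finite-difference stand-in for the analytic policy gradient.
    g = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        t = theta.copy()
        t[idx] += eps
        g[idx] = (J(t) - J(theta)) / eps
    return g

theta = np.zeros((2, 2))   # uniform softmax policy
J0 = J(theta)
for _ in range(500):
    theta = theta + 0.1 * grad(theta)
```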

### 5.4 Entropy Regularized Policy Gradient

Entropy regularization adds an entropy term to the objective (Williams & Peng, 1991). The new policy gradient becomes

$$\nabla_\theta J_{\mathrm{ent}}(\theta) = \nabla_\theta J(\theta) + \nabla_\theta\, \mathbb{E}_{s \sim d^\pi} H(\pi(\cdot \mid s)),$$

where $H$ denotes the Shannon entropy.

The entropy term encourages policies to move away from the boundary of the polytope. Consistent with our previous observation regarding policy gradient, we find that this improves the convergence rate of the optimization procedure (Figure 10). One trade-off is that the method converges to a sub-optimal policy, which is not deterministic.

### 5.5 Natural Policy Gradient

Natural policy gradient (Kakade, 2002) preconditions the gradient with the inverse of the Fisher information matrix $F$ of the policy parametrization:

$$\theta_{k+1} := \theta_k + \eta F^{-1} \nabla_\theta J(\theta_k).$$

This causes the gradient steps to follow the steepest ascent direction in the underlying structure of the parameter space.

In our experiment, we observe that natural policy gradient is less prone to accumulation than policy gradient (Fig. 11), in part because the step-size is better conditioned. Figure 11b) shows that unregularized policy gradient does not, surprisingly enough, take the “shortest path” through the polytope to the optimal value function: instead, it moves from one vertex to the next, similar to policy iteration.

### 5.6 Cross-Entropy Method

Gradient-free optimization methods have shown impressive performance on complex control tasks (de Boer, 2004; Salimans et al., 2017). We present the dynamics of the cross-entropy method (CEM), without noise and with a constant noise factor (CEM-CN) (Szita & Lőrincz, 2006). The mechanics of the algorithm are threefold: (i) sample a population of policy parameters from a Gaussian distribution with a given mean and covariance; (ii) evaluate the returns of the population; (iii) select the top members, and fit a new Gaussian onto them. In the CEM-CN variant, we inject additional isotropic noise at each iteration. We use an initial covariance proportional to $I_2$, the identity matrix of size 2, and a constant noise level.

Figure 12: The cross-entropy method without noise (CEM) (a, b, c); with constant noise (CEM-CN) (d, e, f).

As observed in the original work (Szita & Lőrincz, 2006), the covariance of CEM without noise collapses (Figure 12, panels a, b, c), and the algorithm therefore converges to a suboptimal policy. However, the noise added at each iteration prevents this undesirable behaviour (Figure 12, panels d, e, f), as the algorithm converges to the optimal value function for all three initialization points.
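A sketch of the CEM-CN variant on the softmax parametrization of the two-state MDP of Appendix A; the population size, elite count, noise level, and seed below are illustrative choices of ours, not the paper's.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.70, 0.30], [0.99, 0.01]],
              [[0.20, 0.80], [0.99, 0.01]]])
r = np.array([[-0.45, -0.10], [0.50, 0.50]])

def value_function(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

def softmax(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def J(theta_flat):
    return value_function(softmax(theta_flat.reshape(2, 2))).mean()

rng = np.random.default_rng(0)
mean, cov = np.zeros(4), 0.5 * np.eye(4)
J_init = J(mean)
for _ in range(30):
    population = rng.multivariate_normal(mean, cov, size=200)   # (i) sample
    scores = np.array([J(p) for p in population])               # (ii) evaluate
    elite = population[np.argsort(scores)[-20:]]                # (iii) select top 10%
    mean = elite.mean(axis=0)
    cov = np.cov(elite.T) + 1e-3 * np.eye(4)   # CEM-CN: constant isotropic noise
```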

## 6 Discussion and Concluding Remarks

In this work, we characterized the shape of the space of value functions and established its surprising geometric nature: a possibly non-convex polytope. This result was based on the line theorem, which provides guarantees of monotonic improvement as well as a line-like variation in the space of value functions. This structural property raises the question of new learning algorithms based on a single-state change, and what this might mean in the context of function approximation.

We noticed the existence of self-intersecting spaces of value functions, which have a bottleneck. However, from our simple study of learning dynamics over a class of reinforcement learning methods, it does not seem that this bottleneck leads to any particular learning slowdown.

Some questions remain open. Although these geometric concepts make sense for finite state and action spaces, it is not clear how they generalize to the continuous case. There is a connection between representation learning and the polytopal structure of value functions that we have started exploring (Bellemare et al., 2019). Another exciting research direction is the relationship between the geometry of value functions and function approximation.

## 7 Acknowledgements

The authors would like to thank their colleagues at Google Brain for their help; Carles Gelada, Doina Precup, Georg Ostrovski, Marco Cuturi, Marek Petrik, Matthieu Geist, Olivier Pietquin, Pablo Samuel Castro, Rémi Munos, Rémi Tachet, Saurabh Kumar, and Zafarali Ahmed for useful discussion and feedback; Jake Levinson and Mathieu Guay-Paquet for their insights on the proof of Proposition 1; Mark Rowland for providing invaluable feedback on two earlier versions of this manuscript.

## References

• Aigner et al. (2010) Aigner, M., Ziegler, G. M., Hofmann, K. H., and Erdos, P. Proofs from the Book, volume 274. Springer, 2010.
• Bellemare et al. (2019) Bellemare, M. G., Dabney, W., Dadashi, R., Taiga, A. A., Castro, P. S., Roux, N. L., Schuurmans, D., Lattimore, T., and Lyle, C. A geometric perspective on optimal representations for reinforcement learning. arXiv preprint arXiv:1901.11530, 2019.
• Bellman (1957) Bellman, R. Dynamic Programming. Dover Publications, 1957.
• Bertsekas (1994) Bertsekas, D. P. Generic rank-one corrections for value iteration in markovian decision problems. Technical report, M.I.T., 1994.
• Bertsekas & Tsitsiklis (1996) Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, 1996.
• Bradtke & Barto (1996) Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.
• Brondsted (2012) Brondsted, A. An Introduction to Convex Polytopes, volume 90. Springer Science & Business Media, 2012.
• Dantzig (1948) Dantzig, G. B. Programming in a linear structure. Washington, DC, 1948.
• de Boer (2004) de Boer, P., Kroese, D. P., Mannor, S., and Rubinstein, R. Y. A tutorial on the cross-entropy method. Annals of Operations Research, 2004.
• De Farias & Van Roy (2003) De Farias, D. P. and Van Roy, B. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.
• Engelking (1989) Engelking, R. General Topology. Heldermann, 1989.
• Grünbaum et al. (1967) Grünbaum, B., Klee, V., Perles, M. A., and Shephard, G. C. Convex Polytopes. Springer, 1967.
• Howard (1960) Howard, R. A. Dynamic Programming and Markov Processes. MIT Press, 1960.
• Klee (1959) Klee, V. Some characterizations of convex polyhedra. Acta Mathematica, 102(1-2):79–107, 1959.
• Konda & Tsitsiklis (2000) Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008–1014, 2000.
• Littman et al. (1995) Littman, M. L., Dean, T. L., and Kaelbling, L. P. On the complexity of solving Markov decision problems. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 394–402. Morgan Kaufmann Publishers Inc., 1995.
• Mansour & Singh (1999) Mansour, Y. and Singh, S. On the complexity of policy iteration. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 401–408. Morgan Kaufmann Publishers Inc., 1999.
• Munos (2003) Munos, R. Error bounds for approximate policy iteration. In Proceedings of the International Conference on Machine Learning, 2003.
• Puterman (1994) Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
• Salimans et al. (2017) Salimans, T., Ho, J., Chen, X., Sidor, S., and Sutskever, I. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
• Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT press, 2nd edition, 2018.
• Sutton et al. (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.
• Szita & Lőrincz (2006) Szita, I. and Lőrincz, A. Learning tetris using the noisy cross-entropy method. Neural Computation, 2006.
• Wang et al. (2007) Wang, T., Bowling, M., and Schuurmans, D. Dual representations for dynamic programming and reinforcement learning. In Approximate Dynamic Programming and Reinforcement Learning, 2007. ADPRL 2007. IEEE International Symposium on, pp. 44–51. IEEE, 2007.
• Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
• Williams & Peng (1991) Williams, R. J. and Peng, J. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.
• Ye (2011) Ye, Y. The simplex and policy-iteration methods are strongly polynomial for the markov decision problem with a fixed discount rate. Mathematics of Operations Research, 36(4):593–603, 2011.
• Ziegler (2012) Ziegler, G. M. Lectures on Polytopes, volume 152. Springer Science & Business Media, 2012.

## Appendix A Details of Markov Decision Processes

In this section we give the specifics of the Markov Decision Processes presented in this work. We will use the following convention:

 r(s_i, a_j) = ^r[i×|A|+j],  P(s_k | s_i, a_j) = ^P[i×|A|+j][k]

where ^r and ^P are the reward and transition arrays given below.

In Section ???, Figure ???:

• (a) |A|=2, γ=0.9, ^r=[0.06, 0.38, −0.13, 0.64], ^P=[[0.01, 0.99], [0.92, 0.08], [0.08, 0.92], [0.70, 0.30]]

• (b) |A|=2, γ=0.9, ^r=[0.88, −0.02, −0.98, 0.42], ^P=[[0.96, 0.04], [0.19, 0.81], [0.43, 0.57], [0.72, 0.28]]

• (c) |A|=3, γ=0.9, ^r=[−0.93, −0.49, 0.63, 0.78, 0.14, 0.41], ^P=[[0.52, 0.48], [0.5, 0.5], [0.99, 0.01], [0.85, 0.15], [0.11, 0.89], [0.1, 0.9]]

• (d) |A|=2, γ=0.9, ^r=[−0.45, −0.1, 0.5, 0.5], ^P=[[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]]

In Section 3, Figures 3, 4, 5, 6:

• (left) |A|=3, γ=0.8, ^r=[−0.1, −1., 0.1, 0.4, 1.5, 0.1], ^P=[[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.05, 0.95], [0.25, 0.75], [0.3, 0.7]]

• (right) |A|=2, γ=0.9, ^r=[−0.45, −0.1, 0.5, 0.5], ^P=[[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]]

In Section 5:

• |A|=2, γ=0.9, ^r=[−0.45, −0.1, 0.5, 0.5], ^P=[[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]]
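The indexing convention above can be unpacked in a few lines of code. The sketch below, assuming NumPy, instantiates the two-state MDP labelled (d) above; the helper names `reward` and `transition` are ours, not the paper's.

```python
import numpy as np

# Indexing convention of Appendix A:
# r(s_i, a_j) = r_hat[i*|A| + j],  P(s_k | s_i, a_j) = P_hat[i*|A| + j][k].
# Numbers are those of the two-state MDP labelled (d) (|A| = 2, gamma = 0.9).
r_hat = np.array([-0.45, -0.1, 0.5, 0.5])
P_hat = np.array([[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]])
num_actions = 2

def reward(i, j):
    """Expected reward for taking action a_j in state s_i."""
    return r_hat[i * num_actions + j]

def transition(k, i, j):
    """Probability of landing in s_k after taking a_j in s_i."""
    return P_hat[i * num_actions + j][k]
```

Each row of `^P` is a distribution over next states, so the rows sum to one.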

## Appendix B Notation for the proofs

In this section we present the notation that we use to establish the results in the main text. The space of policies is a Cartesian product of simplices, which we can express as a space of matrices. We adopt for policies, as well as for the other components of the MDP, a convenient matrix form similar to that of (Wang et al., 2007).

• The transition matrix P is a |S||A| × |S| matrix denoting the probability of going to state s′ when taking action a in state s.

• A policy π is represented by a block diagonal matrix of size |S| × |S||A|. Suppose the state s is indexed by i and the action a is indexed by j in the matrix form; then the entry of π in row i and column i×|A|+j is π(a|s). The rest of the entries of π are 0. From now on, we identify a policy with its matrix representation to enhance readability.

• The transition matrix induced by a policy π is the |S| × |S| matrix P_π = πP, denoting the probability of going from state s to state s′ when following the policy π.

• The reward vector r is a vector of size |S||A| denoting the expected reward when taking action a in state s. The reward vector of a policy π is the vector r_π = πr of size |S|.

• The value function V_π of a policy π is the vector of size |S| given by V_π = (I − γP_π)^{-1} r_π.

• We note C_π^i the i-th column of the matrix C_π = (I − γP_π)^{-1}.

Under these notations, we can define the Bellman operator T_π and the optimality Bellman operator T* as follows:

 T_π V = r_π + γP_π V = π(r + γPV),
 ∀s ∈ S,  T*V(s) = max_{π′ ∈ P(A)^S} [r_{π′}(s) + γ(P_{π′}V)(s)].
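The matrix form above translates directly into code. The sketch below, assuming NumPy, builds the block-diagonal policy matrix, forms P_π = πP and r_π = πr, and checks that V_π is a fixed point of T_π; the helper `policy_matrix` is ours, and the MDP numbers are those of example (d) in Appendix A.

```python
import numpy as np

# Matrix-form notation: pi is the |S| x |S||A| block-diagonal policy matrix,
# so that P_pi = pi @ P and r_pi = pi @ r.
gamma = 0.9
r = np.array([-0.45, -0.1, 0.5, 0.5])                  # reward vector, length |S||A|
P = np.array([[0.7, 0.3], [0.99, 0.01],                # transition matrix, |S||A| x |S|
              [0.2, 0.8], [0.99, 0.01]])

def policy_matrix(probs):
    """Stack per-state action distributions into the block-diagonal form."""
    num_states, num_actions = probs.shape
    pi = np.zeros((num_states, num_states * num_actions))
    for i in range(num_states):
        pi[i, i * num_actions:(i + 1) * num_actions] = probs[i]
    return pi

pi = policy_matrix(np.array([[0.5, 0.5], [0.1, 0.9]]))
P_pi, r_pi = pi @ P, pi @ r
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)  # V_pi = (I - gamma P_pi)^{-1} r_pi

# One application of the Bellman operator leaves V_pi unchanged (fixed point).
TV = r_pi + gamma * P_pi @ V_pi
```

The same fixed point can be reached in the unreduced form T_π V = π(r + γPV), since πr = r_π and πPV = P_πV.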

## Appendix C Supplementary Results

###### Lemma 5.

The function f_v : π ↦ V_π is infinitely differentiable on P(A)^S.

###### Proof.

We have that:

 V_π = (I − γP_π)^{-1} r_π = (det(I − γP_π))^{-1} adj(I − γP_π) r_π,

where det is the determinant and adj is the adjugate. The entries of adj(I − γP_π) and of r_π are polynomial functions of the entries of π, and det(I − γP_π) ≠ 0 since I − γP_π is invertible; therefore f_v is infinitely differentiable. ∎

###### Lemma 6.

Let π, π′ ∈ P(A)^S be two policies, let s_1, …, s_k ∈ S, and suppose that π′ agrees with π on s_1, …, s_k. We have

 Span(C_π^{k+1}, …, C_π^{|S|}) = Span(C_{π′}^{k+1}, …, C_{π′}^{|S|})
###### Proof.

As π and π′ agree on s_1, …, s_k (taken without loss of generality to be the first k states in the matrix form notation), the matrices I − γP_π and I − γP_{π′} are equal on their first k rows. We note these rows L_1, …, L_k.

Since C_π and C_{π′} are the inverses of I − γP_π and I − γP_{π′} respectively, each row L_i is orthogonal to every column of the corresponding inverse except the i-th. In particular:

 ∀i ∈ {1, …, k}, ∀j ∈ {k+1, …, |S|},  L_i C_π^j = 0,  L_i C_{π′}^j = 0

which we can rewrite:

 Span(C_π^{k+1}, …, C_π^{|S|}) ⊂ Span(L_1, …, L_k)^⊥
 Span(C_{π′}^{k+1}, …, C_{π′}^{|S|}) ⊂ Span(L_1, …, L_k)^⊥

Now, the columns of the invertible matrix C_π (and likewise of C_{π′}) are linearly independent, so the left-hand side spans have dimension |S| − k; since the rows L_1, …, L_k are also linearly independent, Span(L_1, …, L_k)^⊥ has dimension |S| − k as well. Both inclusions are therefore equalities:

 Span(C_π^{k+1}, …, C_π^{|S|}) = Span(L_1, …, L_k)^⊥
 Span(C_{π′}^{k+1}, …, C_{π′}^{|S|}) = Span(L_1, …, L_k)^⊥ ∎
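Lemma 6 is easy to probe numerically. The sketch below, assuming NumPy and reusing the two-state MDP (d) of Appendix A, builds C_π = (I − γP_π)^{-1} for two policies that agree on s_1 and checks that their last columns span the same line; the helper `C` is ours.

```python
import numpy as np

# Numerical check of Lemma 6 on a 2-state, 2-action MDP: if two policies
# agree on s_1, the last column of C_pi = (I - gamma P_pi)^{-1} spans the
# same one-dimensional subspace for both.
gamma = 0.9
P = np.array([[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]])

def C(probs):
    """(I - gamma P_pi)^{-1} for per-state action distributions `probs`."""
    P_pi = np.vstack([probs[i] @ P[2 * i:2 * i + 2] for i in range(2)])
    return np.linalg.inv(np.eye(2) - gamma * P_pi)

pi       = np.array([[0.3, 0.7], [0.5, 0.5]])   # agree on s_1 (first row) ...
pi_prime = np.array([[0.3, 0.7], [0.9, 0.1]])   # ... differ on s_2

c2, c2p = C(pi)[:, 1], C(pi_prime)[:, 1]
# Parallel columns <=> vanishing 2x2 determinant, i.e. equal spans.
cross = c2[0] * c2p[1] - c2[1] * c2p[0]
```

In the 2×2 case the effect is transparent: the second column of the inverse depends only on the first row of I − γP_π, which is shared by the two policies.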

## Appendix D Proofs

See 1

###### Proof.

P(A)^S is connected since it is a convex space, and it is compact because it is closed and bounded in a finite-dimensional real vector space. Since f_v is continuous (Lemma 5), its image V = f_v(P(A)^S) is compact and connected. ∎

See 2

###### Proof.

Suppose without loss of generality that s_1, …, s_k are the first k states in the matrix form notation. We have,

 r_{π_1} = π_1 r   r_{π_2} = π_2 r
 P_{π_1} = π_1 P   P_{π_2} = π_2 P

Since π_1(·|s_i) = π_2(·|s_i) for all i ∈ {1, …, k}, the first k rows of π_1 and π_2 are identical in the matrix form notation. Therefore, the first k elements of r_{π_1} and r_{π_2} are identical, and the first k rows of P_{π_1} and P_{π_2} are identical, hence the result. ∎

See 3

###### Proof.

Let us first show that the value function of any policy agreeing with π on s_1, …, s_k lies in V_π + Span(C_π^{k+1}, …, C_π^{|S|}).
Let π′ be such a policy, i.e. π′ agrees with π on s_1, …, s_k. Using Bellman's equation, we have:

 V_{π′} − V_π = r_{π′} − r_π + γP_{π′}V_{π′} − γP_π V_π
             = r_{π′} − r_π + γ(P_{π′} − P_π)V_{π′} + γP_π(V_{π′} − V_π)
             = (I − γP_π)^{-1}(r_{π′} − r_π + γ(P_{π′} − P_π)V_{π′})  (3)

Since the policies π and π′ agree on the states s_1, …, s_k, we have, using Lemma 2:

 r_{π′} − r_π is zero on its first k elements,
 P_{π′} − P_π is zero on its first k rows.

Hence, the right-hand side of Eq. 3 is the product of the matrix (I − γP_π)^{-1} with a vector whose first k elements are 0, i.e. a linear combination of the columns C_π^{k+1}, …, C_π^{|S|}. Therefore

 V_{π′} ∈ V_π + Span(C_π^{k+1}, …, C_π^{|S|}).
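Equation (3) holds for any pair of policies (agreement is only used afterwards), so it can be checked numerically. The sketch below, assuming NumPy and the MDP (d) of Appendix A, compares V_{π′} − V_π with the right-hand side of Eq. (3); the helper `mdp_of` is ours.

```python
import numpy as np

# Sanity check of Eq. (3):
# V_pi' - V_pi = (I - gamma P_pi)^{-1} (r_pi' - r_pi + gamma (P_pi' - P_pi) V_pi').
gamma = 0.9
r = np.array([-0.45, -0.1, 0.5, 0.5])
P = np.array([[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]])

def mdp_of(probs):
    """r_pi, P_pi, V_pi for per-state action distributions `probs`."""
    r_pi = np.array([probs[i] @ r[2 * i:2 * i + 2] for i in range(2)])
    P_pi = np.vstack([probs[i] @ P[2 * i:2 * i + 2] for i in range(2)])
    V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
    return r_pi, P_pi, V_pi

r1, P1, V1 = mdp_of(np.array([[0.2, 0.8], [0.6, 0.4]]))   # pi
r2, P2, V2 = mdp_of(np.array([[0.7, 0.3], [0.6, 0.4]]))   # pi', agrees with pi on s_2

# Right-hand side of Eq. (3), with pi' = second policy and pi = first.
rhs = np.linalg.solve(np.eye(2) - gamma * P1, r2 - r1 + gamma * (P2 - P1) @ V2)
```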

We shall now show that, conversely, any value function in V_π + Span(C_π^{k+1}, …, C_π^{|S|}) is the value function of a policy agreeing with π on s_1, …, s_k.

Suppose ^π is a policy such that V_{^π} ∈ V_π + Span(C_π^{k+1}, …, C_π^{|S|}). We want to show that there is a policy π′ agreeing with π on s_1, …, s_k such that V_{π′} = V_{^π}. We construct π′ the following way:

 π′(·|s) = π(·|s) if s ∈ {s_1, …, s_k},  ^π(·|s) otherwise.

Therefore, using the result of the first implication of this proof:

 V_{^π} − V_{π′} ∈ Span(C_π^{k+1}, …, C_π^{|S|})  by assumption and since π′ agrees with π on s_1, …, s_k
 V_{^π} − V_{π′} ∈ Span(C_{π′}^{1}, …, C_{π′}^{k})  since ^π and π′ agree on s_{k+1}, …, s_{|S|}

However, as π and π′ agree on s_1, …, s_k, we have, using Lemma 6:

 Span(C_π^{k+1}, …, C_π^{|S|}) = Span(C_{π′}^{k+1}, …, C_{π′}^{|S|}).

Therefore V_{^π} − V_{π′} ∈ Span(C_{π′}^{k+1}, …, C_{π′}^{|S|}) ∩ Span(C_{π′}^{1}, …, C_{π′}^{k}) = {0}, since the columns of the invertible matrix C_{π′} are linearly independent. Hence V_{^π} = V_{π′}, meaning that V_{^π} is the value function of a policy agreeing with π on s_1, …, s_k. ∎

See 4

###### Proof.

(i) The function μ ↦ V_{π_μ} is continuously differentiable as the composition of two continuously differentiable functions, μ ↦ π_μ and f_v (Lemma 5).

(ii) We want to show that we have either V_{π_0} ≤ V_{π_1} or V_{π_0} ≥ V_{π_1} componentwise.

Suppose, without loss of generality, that s is the first state in the matrix form. Using Lemma 3, we have:

 V_{π_0} = V_{π_1} + αC_{π_1}^1, with α ∈ R

As C_{π_1} = (I − γP_{π_1})^{-1} = Σ_{t≥0} γ^t P_{π_1}^t, whose entries are all nonnegative, C_{π_1}^1 is a vector with nonnegative entries (and first entry at least 1). Therefore we have V_{π_0} ≥ V_{π_1} or V_{π_0} ≤ V_{π_1}, depending on the sign of α.
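The monotonicity established in (ii) can be observed numerically. The sketch below, assuming NumPy and the MDP (d) of Appendix A, sweeps the mixtures of two policies that differ only on s_1 and records the resulting path of value functions, which moves in one direction in every component; the helper `value` is ours.

```python
import numpy as np

# Illustration of the line theorem on a 2-state, 2-action MDP: mixing two
# policies that differ only at s_1 traces a monotone path of value functions.
gamma = 0.9
r = np.array([-0.45, -0.1, 0.5, 0.5])
P = np.array([[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]])

def value(probs):
    """V_pi for per-state action distributions `probs`."""
    r_pi = np.array([probs[i] @ r[2 * i:2 * i + 2] for i in range(2)])
    P_pi = np.vstack([probs[i] @ P[2 * i:2 * i + 2] for i in range(2)])
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

pi0 = np.array([[1.0, 0.0], [0.5, 0.5]])   # differ on s_1 ...
pi1 = np.array([[0.0, 1.0], [0.5, 0.5]])   # ... agree on s_2

# Value functions along the line of mixed policies pi_mu.
path = np.array([value((1 - mu) * pi0 + mu * pi1) for mu in np.linspace(0, 1, 11)])
diffs = np.diff(path, axis=0)              # successive increments per state
```

Each column of `diffs` keeps a constant sign along the sweep, matching the claim that every state value increases or decreases monotonically along the segment.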

(iii) We have, using Equation (3):

 V_{π_0} − V_{π_μ} = (I − γP_{π_μ})^{-1}(r_{π_0} − r_{π_μ} + γ(P_{π_0} − P_{π_μ})V_{π_0})