# Policy Evaluation with Variance Related Risk Criteria in Markov Decision Processes

In this paper we extend temporal difference policy evaluation algorithms to performance criteria that include the variance of the cumulative reward. Such criteria are useful for risk management, and are important in domains such as finance and process control. We propose both TD(0) and LSTD(lambda) variants with linear function approximation, prove their convergence, and demonstrate their utility in a 4-dimensional continuous state space problem.


## 1 Introduction

In both Reinforcement Learning (RL; Bertsekas & Tsitsiklis, 1996) and planning in Markov Decision Processes (MDPs; Puterman, 1994), the typical objective is to maximize the cumulative (possibly discounted) expected reward, denoted by $J$. In many applications, however, the decision maker is also interested in minimizing some form of risk of the policy. By risk, we mean reward criteria that take into account not only the expected reward, but also some additional statistics of the total reward such as its variance, its Value at Risk, etc. (Luenberger, 1998).

In this work we focus on risk measures that involve the variance of the cumulative reward, denoted by $V$. Typical performance criteria that fall under this definition include

1. Maximize $J$ s.t. $V \le c$

2. Minimize $V$ s.t. $J \ge c$

3. Maximize the Sharpe Ratio: $J / \sqrt{V}$

4. Maximize $J - c\sqrt{V}$

The rationale behind our choice of risk measure is that these performance criteria, such as the Sharpe Ratio (Sharpe, 1966) mentioned above, are being used in practice. Moreover, it seems that human decision makers understand how to use variance well, in comparison to exponential utility functions (Howard & Matheson, 1972), which require determining a non-intuitive exponent coefficient.

A fundamental concept in RL is the value function, i.e., the expected reward to go from a given state. Estimates of the value function drive most RL algorithms, and efficient methods for obtaining these estimates have been a prominent area of research. In particular, Temporal Difference (TD; Sutton & Barto, 1998) based methods have been found suitable for problems where the state space is large, requiring some sort of function approximation. TD methods enjoy theoretical guarantees (Bertsekas, 2012; Lazaric et al., 2010) and empirical success (Tesauro, 1995), and are considered the state of the art in policy evaluation.

In this work we present a TD framework for estimating the variance of the reward to go $V$. Our approach is based on the following key observation: the second moment of the reward to go, denoted by $M$, together with the value function $J$, obey a linear equation, similar to the Bellman equation that drives regular TD algorithms. By extending TD methods to jointly estimate $J$ and $M$, we obtain a solution for estimating the variance, using the relation $V = M - J^2$.

We propose both a variant of Least Squares Temporal Difference (LSTD) (Boyan, 2002) and of TD(0) (Sutton & Barto, 1998) for jointly estimating $J$ and $M$ with a linear function approximation. For these algorithms, we provide convergence guarantees and error bounds. In addition, we introduce a novel approach for enforcing the approximate variance to be positive, through a constrained TD equation.

Finally, an empirical evaluation on a challenging continuous maze domain highlights both the usefulness of our approach, and the importance of the variance function in understanding the risk of a policy.

This paper is organized as follows. In Section 2 we present our formal RL setup. In Section 3 we derive the fundamental equations for jointly approximating $J$ and $M$, and discuss their properties. A solution to these equations may be obtained by simulation, through the use of TD algorithms, as presented in Section 4. In Section 5 we further extend the LSTD framework by forcing the approximated variance to be positive. Section 6 presents an empirical evaluation, and Section 7 concludes and discusses future directions.

## 2 Framework and Background

We consider a Stochastic Shortest Path (SSP) problem, also known as an episodic setup (Bertsekas, 2012), where the environment is modeled by an MDP in discrete time with a finite state set $X$ and a terminal state $x^*$. A fixed policy $\pi$ determines, for each $x \in X$, a stochastic transition to a subsequent state $y \in X \cup \{x^*\}$ with probability $P(y|x)$. We consider a deterministic and bounded reward function $r: X \to \mathbb{R}$. We denote by $x_k$ the state at time $k$, where $k = 0, 1, 2, \dots$.

A policy is said to be proper (Bertsekas, 2012) if there is a positive probability that the terminal state will be reached after at most $n$ transitions, from any initial state, where $n$ is the number of states in $X$. In this paper we make the following assumption.

###### Assumption 1.

The policy is proper.

Let $\tau$ denote the first visit time to the terminal state, and let the random variable $B$ denote the accumulated reward along the trajectory until that time,

$$B \triangleq \sum_{k=0}^{\tau-1} r(x_k).$$

We do not define the reward at the terminal state as it is not relevant to our performance criteria. However, the customary zero terminal reward may be assumed throughout the paper.

In this work, we are interested in the mean-variance tradeoff in $B$, represented by the value function

$$J(x) \triangleq E[B \mid x_0 = x], \qquad x \in X,$$

and the variance of the reward to go

$$V(x) \triangleq \mathrm{Var}[B \mid x_0 = x], \qquad x \in X.$$

We will find it convenient to define also the second moment of the reward to go

$$M(x) \triangleq E[B^2 \mid x_0 = x], \qquad x \in X.$$

Our goal is to estimate $J(x)$ and $V(x)$ from trajectories obtained by simulating the MDP with policy $\pi$.

## 3 Approximation of the Variance of the Reward To Go

In this section we derive a projected equation method for approximating $J$ and $M$ using linear function approximation. The estimation of $V$ will then follow from the relation $V = M - J^2$.

Our starting point is a system of equations for $J$ and $M$, first derived by Sobel (1982) for a discounted infinite horizon case, and extended here to the SSP case. Note that the equation for $J$ is the well known Bellman equation for a fixed policy, and independent of the equation for $M$.

###### Proposition 2.

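The following equations hold for $x \in X$

$$\begin{aligned}
J(x) &= r(x) + \sum_{y \in X} P(y|x) J(y), \\
M(x) &= r(x)^2 + 2 r(x) \sum_{y \in X} P(y|x) J(y) + \sum_{y \in X} P(y|x) M(y).
\end{aligned} \tag{1}$$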

Furthermore, under Assumption 1 a unique solution to (1) exists.

The proof is straightforward, and given in Appendix A.
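As a quick numerical illustration of Proposition 2, the following sketch solves (1) directly for a small two-state chain. The chain, the function name `exact_moments`, and the use of NumPy are assumptions made for illustration and are not part of the original paper. The sketch relies only on the fact that both equations in (1) are linear once $J$ is known.

```python
import numpy as np

def exact_moments(P, r):
    """Solve equation (1) for J and M and return (J, M, V = M - J^2).

    P[i, j] is the transition probability between non-terminal states;
    the mass missing from each row is the probability of terminating.
    """
    I = np.eye(len(r))
    J = np.linalg.solve(I - P, r)          # J = r + P J
    rhs_M = r**2 + 2.0 * r * (P @ J)       # R r + 2 R P J
    M = np.linalg.solve(I - P, rhs_M)      # M = rhs_M + P M
    return J, M, M - J**2

# Hypothetical chain: state 0 moves to state 1; state 1 repeats itself with
# probability 0.5 and terminates otherwise. Every step earns reward 1.
P = np.array([[0.0, 1.0],
              [0.0, 0.5]])
r = np.array([1.0, 1.0])
J, M, V = exact_moments(P, r)
print("J =", J, "M =", M, "V =", V)
```

In this toy chain the return from state 1 is a geometrically distributed number of steps, so $J = (3, 2)$ and $V = (2, 2)$: the deterministic first step shifts the mean but adds no variance.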

At this point the reader may wonder why an equation for $V$ is not presented. While such an equation may be derived, as was done in (Tamar et al., 2012), it is not linear. The linearity of (1) is the key to our approach. As we show in the next subsection, the solution to (1) may be expressed as the fixed point of a linear mapping in the joint space of $J$ and $M$. We will then show that a projection of this mapping onto a linear feature space is contracting, thus allowing us to use existing TD theory to derive estimation algorithms for $J$ and $M$.

### 3.1 A Projected Fixed Point Equation on the Joint Space of J and M

For the sequel we introduce the following vector notations. We denote by $P \in \mathbb{R}^{n \times n}$ and $r \in \mathbb{R}^n$ the SSP transition matrix and reward vector, i.e., $P_{i,j} = P(j|i)$ and $r_i = r(i)$, where $i, j \in X$. Also, we define $R \triangleq \mathrm{diag}(r)$.

For a vector $z \in \mathbb{R}^{2n}$ we let $z_J \in \mathbb{R}^n$ and $z_M \in \mathbb{R}^n$ denote its leading and ending $n$ components, respectively. Thus, such a vector belongs to the joint space of $J$ and $M$.

We define the mapping $T: \mathbb{R}^{2n} \to \mathbb{R}^{2n}$ by

$$[Tz]_J = r + P z_J, \qquad [Tz]_M = Rr + 2RP z_J + P z_M.$$

It may easily be verified that a fixed point of $T$ is a solution to (1), and by Proposition 2 such a fixed point exists and is unique.

When the state space is large, a direct solution of (1) is not feasible, even if $P$ may be accurately obtained. A popular approach in this case is to approximate $J$ by restricting it to a lower dimensional subspace, and use simulation based TD algorithms to adjust the approximation parameters (Bertsekas, 2012). In this paper we extend this approach to the approximation of $M$ as well.

We consider a linear approximation architecture of the form

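$$\tilde J(x) = \phi_J(x)^T w_J, \qquad \tilde M(x) = \phi_M(x)^T w_M, \tag{2}$$

where $w_J \in \mathbb{R}^{s_J}$ and $w_M \in \mathbb{R}^{s_M}$ are the approximation parameter vectors, $\phi_J(x) \in \mathbb{R}^{s_J}$ and $\phi_M(x) \in \mathbb{R}^{s_M}$ are state dependent features, and $(\cdot)^T$ denotes the transpose of a vector. The low dimensional subspaces are therefore

$$S_J = \{\Phi_J w \mid w \in \mathbb{R}^{s_J}\}, \qquad S_M = \{\Phi_M w \mid w \in \mathbb{R}^{s_M}\},$$

where $\Phi_J$ and $\Phi_M$ are matrices whose rows are $\phi_J(x)^T$ and $\phi_M(x)^T$, respectively. We make the following standard independence assumption on the features.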

###### Assumption 3.

The matrix $\Phi_J$ has rank $s_J$ and the matrix $\Phi_M$ has rank $s_M$.

As outlined earlier, our goal is to estimate $w_J$ and $w_M$ from simulated trajectories of the MDP. Thus, it is constructive to consider projections onto $S_J$ and $S_M$ with respect to a norm that is weighted according to the state occupancy in these trajectories.

For a trajectory $x_0, \dots, x_{\tau-1}$, where $x_0$ is drawn from a fixed initial distribution, and the states evolve according to the MDP with policy $\pi$, define the state occupancy probabilities

$$q_t(x) = P(x_t = x), \qquad x \in X, \; t = 0, 1, \dots$$

and let

$$q(x) = \sum_{t=0}^{\infty} q_t(x), \quad x \in X, \qquad Q \triangleq \mathrm{diag}(q).$$

We make the following assumption on the policy and initial distribution

###### Assumption 4.

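Each state has a positive probability of being visited, namely, $q(x) > 0$ for all $x \in X$.

For vectors in $\mathbb{R}^n$, we introduce the weighted Euclidean norm

$$\|y\|_q = \sqrt{\sum_{i=1}^{n} q(i)\,(y(i))^2}, \qquad y \in \mathbb{R}^n,$$

and we denote by $\Pi_J$ and $\Pi_M$ the projections from $\mathbb{R}^n$ onto the subspaces $S_J$ and $S_M$, respectively, with respect to this norm. For $z \in \mathbb{R}^{2n}$ we denote by $\Pi z$ the projection of $z_J$ onto $S_J$ and $z_M$ onto $S_M$, namely

$$\Pi = \begin{pmatrix} \Pi_J & 0 \\ 0 & \Pi_M \end{pmatrix}. \tag{3}$$

The projection operators $\Pi_J$ and $\Pi_M$ are linear, and may be written explicitly as $\Pi_J = \Phi_J (\Phi_J^T Q \Phi_J)^{-1} \Phi_J^T Q$, and similarly for $\Pi_M$.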

We are now ready to fully describe our approximation scheme. We consider the projected fixed point equation

$$z = \Pi T z, \tag{4}$$

and, letting $z^*$ denote its solution, propose the approximate value function $\tilde J = z^*_J \in S_J$ and second moment function $\tilde M = z^*_M \in S_M$.

We proceed to derive some properties of the projected fixed point equation (4). We begin by stating a well known result regarding the contraction properties of the projected Bellman operator $\Pi_J T_J$, where $T_J y = r + P y$. A proof can be found in (Bertsekas, 2012), Proposition 7.1.1.

###### Lemma 5.

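Let Assumptions 1, 3, and 4 hold. Then, there exists some norm $\|\cdot\|_J$ and some $\beta_J < 1$ such that

$$\|\Pi_J P y\|_J \le \beta_J \|y\|_J, \qquad \forall y \in \mathbb{R}^n.$$

Similarly, there exists some norm $\|\cdot\|_M$ and some $\beta_M < 1$ such that

$$\|\Pi_M P y\|_M \le \beta_M \|y\|_M, \qquad \forall y \in \mathbb{R}^n.$$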

Next, we define a weighted norm on $\mathbb{R}^{2n}$.

###### Definition 6.

For a vector $z \in \mathbb{R}^{2n}$ and a scalar $0 < \alpha < 1$, the $\alpha$-weighted norm is

$$\|z\|_\alpha = \alpha \|z_J\|_J + (1 - \alpha) \|z_M\|_M, \tag{5}$$

where the norms $\|\cdot\|_J$ and $\|\cdot\|_M$ are defined in Lemma 5.

Our main result of this section is given in the following lemma, where we show that the projected operator $\Pi T$ is a contraction with respect to the $\alpha$-weighted norm.

###### Lemma 7.

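Let Assumptions 1, 3, and 4 hold. Then, there exists some $0 < \alpha < 1$ and some $\beta < 1$ such that $\Pi T$ is a $\beta$-contraction with respect to the $\alpha$-weighted norm, i.e.,

$$\|\Pi T z_1 - \Pi T z_2\|_\alpha \le \beta \|z_1 - z_2\|_\alpha, \qquad \forall z_1, z_2 \in \mathbb{R}^{2n}.$$
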
###### Proof.

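Let $\mathcal{P}$ denote the following matrix in $\mathbb{R}^{2n \times 2n}$

$$\mathcal{P} = \begin{pmatrix} P & 0 \\ 2RP & P \end{pmatrix},$$

and let $z \in \mathbb{R}^{2n}$. Since $T$ is affine with linear part $\mathcal{P}$, the difference $\Pi T z_1 - \Pi T z_2$ equals $\Pi \mathcal{P}(z_1 - z_2)$ for any $z_1, z_2 \in \mathbb{R}^{2n}$, so we need to show that

$$\|\Pi \mathcal{P} z\|_\alpha \le \beta \|z\|_\alpha.$$

From (3) we have

$$\Pi \mathcal{P} = \begin{pmatrix} \Pi_J P & 0 \\ 2 \Pi_M R P & \Pi_M P \end{pmatrix}.$$

Therefore, we have

$$\begin{aligned}
\|\Pi \mathcal{P} z\|_\alpha &= \alpha \|\Pi_J P z_J\|_J + (1 - \alpha) \|2 \Pi_M R P z_J + \Pi_M P z_M\|_M \\
&\le \alpha \|\Pi_J P z_J\|_J + (1 - \alpha) \|\Pi_M P z_M\|_M + (1 - \alpha) \|2 \Pi_M R P z_J\|_M \\
&\le \alpha \beta_J \|z_J\|_J + (1 - \alpha) \beta_M \|z_M\|_M + (1 - \alpha) \|2 \Pi_M R P z_J\|_M,
\end{aligned} \tag{6}$$

where the equality is by definition of the weighted norm (5), the first inequality is from the triangle inequality, and the second inequality is by Lemma 5. Now, we claim that there exists some finite $C$ such that

$$\|2 \Pi_M R P y\|_M \le C \|y\|_J, \qquad \forall y \in \mathbb{R}^n. \tag{7}$$

To see this, note that since $\mathbb{R}^n$ is a finite dimensional real vector space, all vector norms are equivalent (Horn & Johnson, 1985). Therefore there exist finite $C_1$ and $C_2$ such that for all $y \in \mathbb{R}^n$

$$C_1 \|2 \Pi_M R P y\|_2 \le \|2 \Pi_M R P y\|_M \le C_2 \|2 \Pi_M R P y\|_2,$$

where $\|\cdot\|_2$ denotes the Euclidean norm. Let $\lambda$ denote the spectral norm of the matrix $2 \Pi_M R P$, which is finite since all the matrix elements are finite. We have

$$\|2 \Pi_M R P y\|_2 \le \lambda \|y\|_2, \qquad \forall y \in \mathbb{R}^n.$$

Using again the fact that all vector norms are equivalent, there exists a finite $C_3$ such that

$$\|y\|_2 \le C_3 \|y\|_J, \qquad \forall y \in \mathbb{R}^n.$$

Setting $C = C_2 \lambda C_3$ we get the desired bound. Let $\tilde\beta = \max\{\beta_J, \beta_M\} < 1$, and choose $\epsilon > 0$ such that

$$\tilde\beta + \epsilon < 1.$$

Now, choose $\alpha$ such that

$$\alpha = \frac{C}{\epsilon + C}.$$

We have that

$$(1 - \alpha) C = \alpha \epsilon,$$

and plugging in (7)

$$(1 - \alpha) \|2 \Pi_M R P y\|_M \le \alpha \epsilon \|y\|_J.$$

Plugging in (6) we have

$$\begin{aligned}
\alpha \beta_J \|z_J\|_J + (1 - \alpha) \beta_M \|z_M\|_M + (1 - \alpha) \|2 \Pi_M R P z_J\|_M
&\le \alpha \beta_J \|z_J\|_J + (1 - \alpha) \beta_M \|z_M\|_M + \alpha \epsilon \|z_J\|_J \\
&\le (\tilde\beta + \epsilon) \left( \alpha \|z_J\|_J + (1 - \alpha) \|z_M\|_M \right)
\end{aligned}$$

and therefore

$$\|\Pi \mathcal{P} z\|_\alpha \le (\tilde\beta + \epsilon) \|z\|_\alpha.$$

Finally, choose $\beta = \tilde\beta + \epsilon$. ∎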

Lemma 7 guarantees that the projected operator $\Pi T$ has a unique fixed point. Let us denote this fixed point by $z^*$, and let $w_J^*, w_M^*$ denote the corresponding weights, which are unique due to Assumption 3

$$\Pi T z^* = z^*, \qquad z^*_J = \Phi_J w_J^*, \qquad z^*_M = \Phi_M w_M^*. \tag{8}$$

In the next lemma we provide a bound on the approximation error. The proof is in Appendix B.

###### Lemma 8.

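Let Assumptions 1, 3, and 4 hold. Denote by $z_{true} \in \mathbb{R}^{2n}$ the true value and second moment functions, i.e., $z_{true}$ satisfies $z_{true} = T z_{true}$. Then,

$$\|z_{true} - z^*\|_\alpha \le \frac{1}{1 - \beta} \|z_{true} - \Pi z_{true}\|_\alpha,$$

with $\alpha$ and $\beta$ defined in Lemma 7.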

## 4 Simulation Based Estimation Algorithms

We now use the theoretical results of the previous section to derive simulation based algorithms for jointly estimating the value function and second moment. The projected equation (8) is linear, and can be written in matrix form as follows. First let us write the equation explicitly as

$$\Pi_J (r + P \Phi_J w_J^*) = \Phi_J w_J^*, \qquad \Pi_M (Rr + 2RP \Phi_J w_J^* + P \Phi_M w_M^*) = \Phi_M w_M^*. \tag{9}$$

The projection $\Phi w$ of a vector $y$ onto the column space of $\Phi$ satisfies the following orthogonality condition

$$\Phi^T Q (y - \Phi w) = 0,$$

therefore we have

$$\Phi_J^T Q \left( \Phi_J w_J^* - (r + P \Phi_J w_J^*) \right) = 0, \qquad \Phi_M^T Q \left( \Phi_M w_M^* - (Rr + 2RP \Phi_J w_J^* + P \Phi_M w_M^*) \right) = 0,$$

which can be written as

$$A w_J^* = b, \qquad C w_M^* = d, \tag{10}$$

with

$$A = \Phi_J^T Q (I - P) \Phi_J, \quad b = \Phi_J^T Q r, \quad C = \Phi_M^T Q (I - P) \Phi_M, \quad d = \Phi_M^T Q R (r + 2 P \Phi_J A^{-1} b), \tag{11}$$

and the matrices $A$ and $C$ are invertible since Lemma 7 guarantees a unique solution to (8) and Assumption 3 guarantees the unique weights of its projection.
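When the model is small and known, the matrix form (10)-(11) can be evaluated directly. The sketch below does this for a toy two-state chain; the chain, the feature choices, the name `projected_solution`, and the initial distribution vector `zeta` are all illustrative assumptions rather than anything specified in the paper. It uses the identity $q^T = \zeta^T (I - P)^{-1}$, which follows from summing the occupancy probabilities over time.

```python
import numpy as np

def projected_solution(P, r, Phi_J, Phi_M, zeta):
    """Solve A w_J = b and C w_M = d from (10)-(11) with a known model."""
    I = np.eye(len(r))
    # q(x) = sum_t P(x_t = x), i.e. q^T = zeta^T (I - P)^{-1}
    q = np.linalg.solve((I - P).T, zeta)
    Q, R = np.diag(q), np.diag(r)
    A = Phi_J.T @ Q @ (I - P) @ Phi_J
    b = Phi_J.T @ Q @ r
    w_J = np.linalg.solve(A, b)
    C = Phi_M.T @ Q @ (I - P) @ Phi_M
    d = Phi_M.T @ Q @ R @ (r + 2.0 * (P @ Phi_J @ w_J))
    w_M = np.linalg.solve(C, d)
    return w_J, w_M, q

# Hypothetical chain: 0 -> 1, state 1 repeats w.p. 0.5, reward 1 per step.
P = np.array([[0.0, 1.0], [0.0, 0.5]])
r = np.array([1.0, 1.0])
zeta = np.array([1.0, 0.0])        # always start in state 0
Phi_J = np.ones((2, 1))            # a single constant feature for J
Phi_M = np.eye(2)                  # tabular features for M
w_J, w_M, q = projected_solution(P, r, Phi_J, Phi_M, zeta)

# Check the first equation in (9): Pi_J (r + P Phi_J w_J) = Phi_J w_J.
Q = np.diag(q)
Pi_J = Phi_J @ np.linalg.solve(Phi_J.T @ Q @ Phi_J, Phi_J.T @ Q)
residual = Pi_J @ (r + P @ Phi_J @ w_J) - Phi_J @ w_J
print(w_J, w_M, residual)
```

With these deliberately coarse features for $J$, the approximation is $\tilde J = (3, 3)$ and $\tilde M = (15, 8)$, so $\tilde M - \tilde J^2 = (6, -1)$ is negative in state 1 even though the true variance is positive everywhere. This is the kind of artifact that Section 5 addresses.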

### 4.1 A Least Squares TD Algorithm

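Our first simulation based algorithm is an extension of the Least Squares Temporal Difference (LSTD) algorithm (Boyan, 2002). We simulate $N$ trajectories of the MDP with the policy $\pi$ and the initial state distribution. Let $x_0^k, x_1^k, \dots, x_{\tau^k - 1}^k$ and $\tau^k$, where $k = 1, \dots, N$, denote the state sequence and visit times to the terminal state within these trajectories, respectively. We now use these trajectories to form the following estimates of the terms in (11)

$$\begin{aligned}
A_N &= E_N\left[ \sum_{t=0}^{\tau-1} \phi_J(x_t) \left( \phi_J(x_t) - \phi_J(x_{t+1}) \right)^T \right], \\
b_N &= E_N\left[ \sum_{t=0}^{\tau-1} \phi_J(x_t) r(x_t) \right], \\
C_N &= E_N\left[ \sum_{t=0}^{\tau-1} \phi_M(x_t) \left( \phi_M(x_t) - \phi_M(x_{t+1}) \right)^T \right], \\
d_N &= E_N\left[ \sum_{t=0}^{\tau-1} \phi_M(x_t) r(x_t) \left( r(x_t) + 2 \phi_J(x_{t+1})^T A_N^{-1} b_N \right) \right],
\end{aligned} \tag{12}$$

where $E_N$ denotes an empirical average over trajectories, i.e., $E_N[f(x, \tau)] = \frac{1}{N} \sum_{k=1}^{N} f(x^k, \tau^k)$. The LSTD approximation is given by

$$\hat w_J^* = A_N^{-1} b_N, \qquad \hat w_M^* = C_N^{-1} d_N.$$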

The next theorem shows that the LSTD approximation converges.

###### Theorem 9.

Let Assumptions 1, 3, and 4 hold. Then $\hat w_J^* \to w_J^*$ and $\hat w_M^* \to w_M^*$ as $N \to \infty$ with probability 1.

The proof involves a straightforward application of the law of large numbers and is described in Appendix C.
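The estimates in (12) can be accumulated in a few lines of code. The following is a minimal sketch under stated assumptions: the toy chain, the helper names `simulate` and `lstd`, the initial distribution vector `zeta`, and the convention that the terminal state has zero features are all illustrative choices, not details taken from the paper.

```python
import numpy as np

def simulate(P, r, zeta, rng):
    """Sample one trajectory and return its list of non-terminal states."""
    n = len(r)
    p_term = 1.0 - P.sum(axis=1)           # probability of terminating
    x = rng.choice(n, p=zeta)
    states = []
    while x < n:                           # index n stands for the terminal state
        states.append(int(x))
        x = rng.choice(n + 1, p=np.append(P[x], p_term[x]))
    return states

def lstd(trajectories, r, phi_J, phi_M):
    """Sample-based version of (12); row x of phi_* is the feature of state x.

    The 1/N factors are omitted because they cancel in A^{-1} b and C^{-1} d.
    """
    s_J, s_M = phi_J.shape[1], phi_M.shape[1]
    A, b = np.zeros((s_J, s_J)), np.zeros(s_J)
    C, d = np.zeros((s_M, s_M)), np.zeros(s_M)

    def next_feature(xs, t, phi):
        # Feature of x_{t+1}; zero at the terminal state (an assumption).
        return phi[xs[t + 1]] if t + 1 < len(xs) else np.zeros(phi.shape[1])

    for xs in trajectories:
        for t, x in enumerate(xs):
            A += np.outer(phi_J[x], phi_J[x] - next_feature(xs, t, phi_J))
            b += phi_J[x] * r[x]
            C += np.outer(phi_M[x], phi_M[x] - next_feature(xs, t, phi_M))
    w_J = np.linalg.solve(A, b)
    for xs in trajectories:                # second pass: d needs w_J
        for t, x in enumerate(xs):
            target = r[x] * (r[x] + 2.0 * (next_feature(xs, t, phi_J) @ w_J))
            d += phi_M[x] * target
    w_M = np.linalg.solve(C, d)
    return w_J, w_M

# Hypothetical chain: 0 -> 1, state 1 repeats w.p. 0.5, reward 1 per step.
rng = np.random.default_rng(0)
P = np.array([[0.0, 1.0], [0.0, 0.5]])
r = np.array([1.0, 1.0])
zeta = np.array([1.0, 0.0])
trajs = [simulate(P, r, zeta, rng) for _ in range(5000)]
w_J, w_M = lstd(trajs, r, np.eye(2), np.eye(2))
V_hat = w_M - w_J**2                       # tabular features: weights are values
print(w_J, V_hat)
```

With tabular features the weights are per-state values, so for this chain the estimates should approach $J = (3, 2)$ and $V = (2, 2)$ as the number of trajectories grows. The two-pass structure reflects that $d_N$ depends on $A_N^{-1} b_N$.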

### 4.2 An online TD(0) Algorithm

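Our second estimation algorithm is an extension of the well known TD(0) algorithm (Sutton & Barto, 1998). Again, we simulate trajectories of the MDP corresponding to the policy $\pi$ and the initial state distribution, and we iteratively update our estimates at every visit to the terminal state. (An extension to an algorithm that updates at every state transition is also possible, but we do not pursue such here.) For some $0 \le t < \tau^k$ and weights $w_J, w_M$, we introduce the TD terms

$$\begin{aligned}
\delta_J^k(t, w_J, w_M) &= r(x_t^k) + \left( \phi_J(x_{t+1}^k)^T - \phi_J(x_t^k)^T \right) w_J, \\
\delta_M^k(t, w_J, w_M) &= r^2(x_t^k) + 2 r(x_t^k) \phi_J(x_{t+1}^k)^T w_J + \left( \phi_M(x_{t+1}^k)^T - \phi_M(x_t^k)^T \right) w_M.
\end{aligned}$$

Note that $\delta_J^k$ is the standard TD error (Sutton & Barto, 1998). The TD(0) update is given by

$$\begin{aligned}
\hat w_{J;k+1} &= \hat w_{J;k} + \xi_k \sum_{t=0}^{\tau^k - 1} \phi_J(x_t) \, \delta_J^k(t, \hat w_{J;k}, \hat w_{M;k}), \\
\hat w_{M;k+1} &= \hat w_{M;k} + \xi_k \sum_{t=0}^{\tau^k - 1} \phi_M(x_t) \, \delta_M^k(t, \hat w_{J;k}, \hat w_{M;k}),
\end{aligned}$$

where $\{\xi_k\}$ are positive step sizes.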

The next theorem shows that the TD(0) algorithm converges.

###### Theorem 10.

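Let Assumptions 1, 3, and 4 hold, and let the step sizes satisfy

$$\sum_{k=0}^{\infty} \xi_k = \infty, \qquad \sum_{k=0}^{\infty} \xi_k^2 < \infty.$$

Then $\hat w_{J;k} \to w_J^*$ and $\hat w_{M;k} \to w_M^*$ as $k \to \infty$ with probability 1.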

The proof, provided in Appendix D, is based on representing the TD(0) algorithm as a stochastic approximation and using contraction properties similar to the ones of the previous section to prove convergence.
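To make the per-trajectory update concrete, here is a minimal sketch of a single TD(0) step. The function names, the zero-feature convention at the terminal state, and the step size schedule $\xi_k = c / (k + k_0)$ are assumptions chosen for illustration; the schedule is one simple choice that satisfies the step size conditions of Theorem 10.

```python
import numpy as np

def td0_episode_update(xs, r, phi_J, phi_M, w_J, w_M, step):
    """One TD(0) update from a single trajectory xs (non-terminal states only).

    Both TD terms are evaluated at the current weights (w_J, w_M); features
    of the terminal state are taken to be zero (an assumption).
    """
    g_J, g_M = np.zeros_like(w_J), np.zeros_like(w_M)
    for t, x in enumerate(xs):
        last = t + 1 == len(xs)
        fJ_next = np.zeros_like(w_J) if last else phi_J[xs[t + 1]]
        fM_next = np.zeros_like(w_M) if last else phi_M[xs[t + 1]]
        delta_J = r[x] + (fJ_next - phi_J[x]) @ w_J
        delta_M = (r[x] ** 2 + 2.0 * r[x] * (fJ_next @ w_J)
                   + (fM_next - phi_M[x]) @ w_M)
        g_J += phi_J[x] * delta_J
        g_M += phi_M[x] * delta_M
    return w_J + step * g_J, w_M + step * g_M

def step_size(k, c=1.0, k0=10.0):
    """Hypothetical schedule: the sum diverges, the sum of squares converges."""
    return c / (k + k0)

# One update on a hand-written trajectory 0 -> 1 -> 1 -> terminal,
# with reward 1 per step, tabular features, and zero initial weights.
r = np.array([1.0, 1.0])
phi = np.eye(2)
w_J, w_M = td0_episode_update([0, 1, 1], r, phi, phi,
                              np.zeros(2), np.zeros(2), step=0.5)
print(w_J, w_M)
```

Starting from zero weights, every TD term on this trajectory equals 1, so both weight vectors move to $(0.5, 1.0)$. In a full run one would call `td0_episode_update` once per simulated trajectory with `step_size(k)` as the step.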

### 4.3 Multistep Algorithms

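A common method in value function approximation is to replace the single step mapping $T_J$ with a multistep version of the form

$$T_J^{(\lambda)} = (1 - \lambda) \sum_{l=0}^{\infty} \lambda^l T_J^{l+1}$$

with $0 < \lambda < 1$. The projected equation (9) then becomes

$$\Pi_J T_J^{(\lambda)} \left( \Phi_J w_J^{*(\lambda)} \right) = \Phi_J w_J^{*(\lambda)}.$$

Similarly, we may write a multistep equation for $M$

$$\Pi_M T_M^{(\lambda)} \left( \Phi_M w_M^{*(\lambda)} \right) = \Phi_M w_M^{*(\lambda)}, \tag{13}$$

where

$$T_M^{(\lambda)} = (1 - \lambda) \sum_{l=0}^{\infty} \lambda^l T_{M^*}^{l+1},$$

and

$$T_{M^*}(y) = Rr + 2RP \Phi_J w_J^{*(\lambda)} + P y.$$

Note the difference between $T_{M^*}$ and $[T]_M$ defined earlier; we are no longer working on the joint space of $J$ and $M$ but instead we have an independent equation for approximating $J$, and its solution $w_J^{*(\lambda)}$ is part of equation (13) for approximating $M$. By Proposition 7.1.1 of (Bertsekas, 2012) both $\Pi_J T_J^{(\lambda)}$ and $\Pi_M T_M^{(\lambda)}$ are contractions with respect to the weighted norm $\|\cdot\|_q$, therefore both multistep projected equations admit a unique solution. In a similar manner to the single step version, the projected equations may be written in matrix form

$$A^{(\lambda)} w_J^{*(\lambda)} = b^{(\lambda)}, \qquad C^{(\lambda)} w_M^{*(\lambda)} = d^{(\lambda)}, \tag{14}$$

where

$$\begin{aligned}
A^{(\lambda)} &= \Phi_J^T Q (I - P^{(\lambda)}) \Phi_J, & b^{(\lambda)} &= \Phi_J^T Q (I - \lambda P)^{-1} r, \\
C^{(\lambda)} &= \Phi_M^T Q (I - P^{(\lambda)}) \Phi_M, & d^{(\lambda)} &= \Phi_M^T Q (I - \lambda P)^{-1} R \left( r + 2 P \Phi_J w_J^{*(\lambda)} \right),
\end{aligned}$$

and

$$P^{(\lambda)} = (1 - \lambda) \sum_{l=0}^{\infty} \lambda^l P^{l+1}.$$

Simulation based estimates $A_N^{(\lambda)}$ and $b_N^{(\lambda)}$ of the expressions above may be obtained by the use of eligibility traces, as described in (Bertsekas, 2012), and the LSTD($\lambda$) approximation is then given by $\hat w_J^{*(\lambda)} = (A_N^{(\lambda)})^{-1} b_N^{(\lambda)}$. By substituting $w_J^{*(\lambda)}$ with $\hat w_J^{*(\lambda)}$ in the expression for $d^{(\lambda)}$, a similar procedure may be used to derive estimates $C_N^{(\lambda)}$ and $d_N^{(\lambda)}$, and to obtain the LSTD($\lambda$) approximation $\hat w_M^{*(\lambda)} = (C_N^{(\lambda)})^{-1} d_N^{(\lambda)}$. Due to the similarity to the LSTD procedure in (12), the exact details are omitted.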
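For a small model, the infinite sums in the multistep quantities can be evaluated in closed form. The short derivation below is a standard geometric series argument and is not taken from the original text; it assumes that $I - \lambda P$ is invertible, which holds because $P$ is a sub-stochastic matrix and $0 < \lambda < 1$.

```latex
\begin{aligned}
P^{(\lambda)} &= (1-\lambda)\sum_{l=0}^{\infty}\lambda^{l}P^{l+1}
             = (1-\lambda)\,P\sum_{l=0}^{\infty}(\lambda P)^{l}
             = (1-\lambda)\,P\,(I-\lambda P)^{-1},\\
T_J^{(\lambda)}J &= (1-\lambda)\sum_{l=0}^{\infty}\lambda^{l}
               \Bigl(\sum_{m=0}^{l}P^{m}r + P^{l+1}J\Bigr)
             = (I-\lambda P)^{-1}r + P^{(\lambda)}J .
\end{aligned}
```

The second line explains the factor $(I - \lambda P)^{-1}$ that multiplies the reward terms in the multistep matrix form: exchanging the order of summation gives $\sum_{m} \lambda^m P^m r$ for the reward part.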

## 5 Positive Variance as a Constraint in LSTD

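The TD algorithms of the preceding section approximated $J$ and $M$ by the solution to the fixed point equation (8). While Lemma 8 provides us a bound on the approximation error of $\tilde J$ and $\tilde M$ measured in the $\alpha$-weighted norm, it does not guarantee that the approximated variance $\tilde V$, given by $\tilde M - \tilde J^2$, is positive for all states. If we are estimating $M$ as a means to infer $V$, it may be useful to include our prior knowledge that $V \ge 0$ in the estimation process. In this section we propose to enforce this knowledge as a constraint in the projected fixed point equation.

The multistep equation for the second moment weights (13) may be written with the projection operator as an explicit minimization

$$w_M^{*(\lambda)} = \arg\min_w \left\| \Phi_M w - \left( \tilde r + \tilde\Phi w_M^{*(\lambda)} \right) \right\|_q,$$

with

$$\tilde\Phi = P^{(\lambda)} \Phi_M,$$

and

$$\tilde r = (I - \lambda P)^{-1} \left( Rr + 2RP \Phi_J w_J^{*(\lambda)} \right).$$

Requiring non negative variance in some state $x$ may be written as a linear constraint in $w_M^{*(\lambda)}$

$$\phi_M(x)^T w_M^{*(\lambda)} - \left( \phi_J(x)^T w_J^{*(\lambda)} \right)^2 \ge 0.$$

Let $\{x_1, \dots, x_m\}$ denote a set of states in which we demand that the variance be non negative. Let $H \in \mathbb{R}^{m \times s_M}$ denote a matrix with the features $-\phi_M(x_i)^T$ as its rows, and let $g \in \mathbb{R}^m$ denote a vector with elements $-\left( \phi_J(x_i)^T w_J^{*(\lambda)} \right)^2$. We can write the variance-constrained projected equation for the second moment as

$$w_M^{vc} = \left\{ \begin{array}{l} \arg\min_w \left\| \Phi_M w - \left( \tilde r + \tilde\Phi w_M^{vc} \right) \right\|_q \\ \text{s.t. } Hw \le g \end{array} \right. \tag{15}$$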

The following assumption guarantees that the constraints in (15) admit a feasible solution.

###### Assumption 11.

There exists $w$ such that $Hw \le g$.

Note that a simple way to satisfy Assumption 11 is to have some feature vector that is positive for all states. Equation (15) is a form of projected equation studied in (Bertsekas, 2011), the solution of which may be obtained by the following iterative procedure

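$$w_{k+1} = \Pi_{\Xi, \hat W_M} \left[ w_k - \gamma \Xi^{-1} \left( C^{(\lambda)} w_k - d^{(\lambda)} \right) \right], \tag{16}$$

where $\Xi$ is some positive definite matrix, and $\Pi_{\Xi, \hat W_M}$ denotes a projection onto the convex set $\hat W_M = \{ w \mid Hw \le g \}$ with respect to the $\Xi$ weighted Euclidean norm. The following lemma, which is based on a convergence result of (Bertsekas, 2011), guarantees that algorithm (16) converges.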

###### Lemma 12.

Assume $\lambda > 0$. Then there exists $\bar\gamma > 0$ such that for all $\gamma \in (0, \bar\gamma)$ the algorithm (16) converges at a linear rate to $w_M^{vc}$.

###### Proof.

This is a direct application of the convergence result in (Bertsekas, 2011). The only nontrivial assumption that needs to be verified is that $\Pi_M T_M^{(\lambda)}$ is a contraction in the $\|\cdot\|_q$ norm (Proposition 1 in Bertsekas, 2011). For $\lambda > 0$, Proposition 7.1.1 of (Bertsekas, 2012) guarantees that $\Pi_M T_M^{(\lambda)}$ is indeed contracting in the $\|\cdot\|_q$ norm. ∎

We illustrate the effect of the positive variance constraint in a simple example. Consider the Markov chain depicted in Figure 1, which consists of states with reward and a terminal state with zero reward. The transition from each state is either to a subsequent state (with probability ) or to a preceding state (with probability ), with the exception of the first state, which transitions to itself instead. We chose to approximate $J$ and $M$ with polynomials of degree 1 and 2, respectively. For such a small problem the fixed point equation (14) may be solved exactly, yielding the approximation depicted in Figure 2 (dotted line), for , , and . Note that the variance is negative for the last two states. Using algorithm (16) we obtained a positive variance constrained approximation, which is depicted in Figure 2 (dashed line). Note that the variance is now positive for all states (as was required by the constraints).

## 6 Experiments

In this section we present numerical simulations of policy evaluation on a challenging continuous maze domain. The goal of this presentation is twofold; first, we show that the variance function may be estimated successfully on a large domain using a reasonable amount of samples. Second, the intuitive maze domain highlights the information that may be gleaned from the variance function. We begin by describing the domain and then present our policy evaluation results.

The Pinball Domain (Konidaris & Barto, 2009) is a continuous 2-dimensional maze where a small ball needs to be maneuvered between obstacles to reach some target area, as depicted in Figure 3 (left). The ball is controlled by applying a constant force in one of the 4 directions at each time step, which causes acceleration in the respective direction. In addition, the ball's velocity is susceptible to additive Gaussian noise (zero mean, standard deviation 0.03) and friction (drag coefficient 0.995). The state of the ball is thus 4-dimensional ($x, y, \dot x, \dot y$), and the action set is discrete, with 4 available controls. The obstacles are sharply shaped and fully elastic, and collisions cause the ball to bounce. As noted in (Konidaris & Barto, 2009), the sharp obstacles and continuous dynamics make the pinball domain more challenging for RL than simple navigation tasks or typical benchmarks like Acrobot.

A Java implementation of the pinball domain used in (Konidaris & Barto, 2009) is available on-line and was used for our simulations as well, with the addition of noise to the velocity.

We obtained a near-optimal policy using SARSA (Sutton & Barto, 1998) with radial basis function features and a reward of -1 for all states until reaching the target. The value function for this policy is plotted in Figure 3, for states with zero velocity. As should be expected, the value is approximately a linear function of the distance to the target.

Using 3000 trajectories (starting from uniformly distributed random states in the maze) we estimated the value and second moment functions by the LSTD($\lambda$) algorithm described above. We used uniform tile coding as features ( non-overlapping tiles in $x$ and $y$, no dependence on velocity) and set . The resulting estimated standard deviation function is shown in Figure 4 (left). In comparison, the standard deviation function shown in Figure 4 (right) was estimated by the naive sample variance, and required 500 trajectories from each point, for a total of 1,250,000 trajectories.
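For reference, the naive baseline mentioned above amounts to running independent rollouts from each evaluation point and taking the sample statistics of the returns. The sketch below illustrates this under stated assumptions: `naive_mean_and_std` and `toy_return` are hypothetical names, and `toy_return` is a stand-in simulator unrelated to the pinball domain.

```python
import numpy as np

def naive_mean_and_std(sample_return, x0, n_rollouts, rng):
    """Estimate J(x0) and sqrt(V(x0)) from independent rollouts started at x0."""
    returns = np.array([sample_return(x0, rng) for _ in range(n_rollouts)])
    return returns.mean(), returns.std(ddof=1)

def toy_return(x0, rng):
    # Hypothetical stand-in for a simulator: reward 1 per step, with the
    # episode ending after each step with probability 0.5.
    return float(rng.geometric(0.5))

rng = np.random.default_rng(0)
mean_hat, std_hat = naive_mean_and_std(toy_return, x0=None,
                                       n_rollouts=5000, rng=rng)
print(mean_hat, std_hat)
```

Because every evaluation point needs its own batch of rollouts, the cost of this estimator grows with the number of points, whereas the LSTD-based estimate shares one set of trajectories across all states.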

Note that the variance function is clearly not a linear function of the distance to the target, and in some places not even monotone. Furthermore, we see that an area in the top part of the maze before the first turn is very risky, even more than the farthest point from the target. We stress that this information cannot be gleaned from inspecting the value function alone.

## 7 Conclusion

This work presented a novel framework for policy evaluation in RL with variance related performance criteria. We presented both formal guarantees and empirical evidence that this approach is useful in problems with a large state space.

A few issues are in need of further investigation. First, we note a possible extension to other risk measures such as the percentile criterion (Delage & Mannor, 2010). In a recent work, Morimura et al. (2012) derived Bellman equations for the distribution of the total return, and appropriate TD learning rules were proposed, albeit without function approximation and formal guarantees.

More importantly, at the moment it remains unclear how the variance function may be used for policy optimization. While a naive policy improvement step may be performed, its usefulness should be questioned, as it was shown to be problematic for the standard deviation adjusted reward (Sobel, 1982) and the variance constrained reward (Mannor & Tsitsiklis, 2011). In (Tamar et al., 2012), a policy gradient approach was proposed for handling variance related criteria, which may be extended to an actor-critic method by using the variance function presented here.

## References

• Bertsekas (2012) Bertsekas, D. P. Dynamic Programming and Optimal Control, Vol II. Athena Scientific, fourth edition, 2012.
• Bertsekas & Tsitsiklis (1996) Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming. Athena Scientific, 1996.
• Bertsekas (2011) Bertsekas, D.P. Temporal difference methods for general projected equations. IEEE Trans. Auto. Control, 56(9):2128–2139, 2011.
• Borkar (2008) Borkar, V.S. Stochastic approximation: a dynamical systems viewpoint. Cambridge Univ Press, 2008.
• Boyan (2002) Boyan, J.A. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2):233–246, 2002.
• Delage & Mannor (2010) Delage, E. and Mannor, S. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1):203–213, 2010.
• Horn & Johnson (1985) Horn, R. A. and Johnson, C. R. Matrix Analysis. Cambridge University Press, 1985.
• Howard & Matheson (1972) Howard, R. A. and Matheson, J. E. Risk-sensitive Markov decision processes. Management Science, 18(7):356–369, 1972.
• Konidaris & Barto (2009) Konidaris, G.D. and Barto, A.G. Skill discovery in continuous reinforcement learning domains using skill chaining. In NIPS, 2009.
• Lazaric et al. (2010) Lazaric, A., Ghavamzadeh, M., and Munos, R. Finite-sample analysis of LSTD. In ICML, 2010.
• Luenberger (1998) Luenberger, D. Investment Science. Oxford University Press, 1998.
• Mannor & Tsitsiklis (2011) Mannor, S. and Tsitsiklis, J. N. Mean-variance optimization in Markov decision processes. In ICML, 2011.
• Morimura et al. (2012) Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., and Tanaka, T. Parametric return density estimation for reinforcement learning. arXiv preprint arXiv:1203.3497, 2012.
• Puterman (1994) Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, Inc., 1994.
• Sharpe (1966) Sharpe, W. F. Mutual fund performance. The Journal of Business, 39(1):119–138, 1966.
• Sobel (1982) Sobel, M. J. The variance of discounted Markov decision processes. J. Applied Probability, pp. 794–802, 1982.
• Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Reinforcement Learning. MIT Press, 1998.
• Tamar et al. (2012) Tamar, A., Di Castro, D., and Mannor, S. Policy gradients with variance related risk criteria. In ICML, 2012.
• Tesauro (1995) Tesauro, G. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.

## Appendix A Proof of Proposition 2

###### Proof.

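The equation for $J$ is well-known, and its proof is given here only for completeness. Choose $x \in X$. Then,

$$\begin{aligned}
J(x) &= E[B \mid x_0 = x] = E\left[ \sum_{k=0}^{\tau-1} r(x_k) \,\Big|\, x_0 = x \right] \\
&= r(x) + E\left[ \sum_{k=1}^{\tau-1} r(x_k) \,\Big|\, x_0 = x \right] \\
&= r(x) + E\left[ E\left[ \sum_{k=1}^{\tau-1} r(x_k) \,\Big|\, x_0 = x, x_1 = y \right] \right] \\
&= r(x) + \sum_{y \in X} P(y|x) J(y),
\end{aligned}$$

where we excluded the terminal state from the sum since reaching it ends the trajectory.

Similarly,

$$\begin{aligned}
M(x) &= E[B^2 \mid x_0 = x] = E\left[ \left( \sum_{k=0}^{\tau-1} r(x_k) \right)^2 \Big|\, x_0 = x \right] = E\left[ \left( r(x_0) + \sum_{k=1}^{\tau-1} r(x_k) \right)^2 \Big|\, x_0 = x \right] \\
&= r(x)^2 + 2 r(x) E\left[ \sum_{k=1}^{\tau-1} r(x_k) \,\Big|\, x_0 = x \right] + E\left[ \left( \sum_{k=1}^{\tau-1} r(x_k) \right)^2 \Big|\, x_0 = x \right] \\
&= r(x)^2 + 2 r(x) \sum_{y \in X} P(y|x) J(y) + \sum_{y \in X} P(y|x) M(y).
\end{aligned}$$

The uniqueness of the value function $J$ for a proper policy is well known, cf. Proposition 3.2.1 in (Bertsekas, 2012). The uniqueness of $M$ follows by observing that in the equation for $M$, $M$ may be seen as the value function of an MDP with the same transitions but with reward $r(x)^2 + 2 r(x) \sum_{y \in X} P(y|x) J(y)$. Since only the rewards change, the policy remains proper and Proposition 3.2.1 in (Bertsekas, 2012) applies. ∎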

## Appendix B Proof of Lemma 8

###### Proof.

We have

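$$\begin{aligned}
\|z_{true} - z^*\|_\alpha &\le \|z_{true} - \Pi z_{true}\|_\alpha + \|\Pi z_{true} - z^*\|_\alpha \\
&= \|z_{true} - \Pi z_{true}\|_\alpha + \|\Pi T z_{true} - \Pi T z^*\|_\alpha \\
&\le \|z_{true} - \Pi z_{true}\|_\alpha + \beta \|z_{true} - z^*\|_\alpha.
\end{aligned}$$

Rearranging gives the stated result. ∎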

## Appendix C Proof of Theorem 9

###### Proof.

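Let $\phi_1(x)$, $\phi_2(x)$ be some vector functions of the state. We claim that

$$E\left[ \sum_{t=0}^{\tau-1} \phi_1(x_t) \phi_2(x_t)^T \right] = \sum_x q(x) \phi_1(x) \phi_2(x)^T. \tag{17}$$

To see this, let $\mathbb{1}(\cdot)$ denote the indicator function and write

$$\begin{aligned}
E\left[ \sum_{t=0}^{\tau-1} \phi_1(x_t) \phi_2(x_t)^T \right] &= E\left[ \sum_{t=0}^{\tau-1} \sum_x \phi_1(x) \phi_2(x)^T \mathbb{1}(x_t = x) \right] \\
&= E\left[ \sum_x \phi_1(x) \phi_2(x)^T \sum_{t=0}^{\tau-1} \mathbb{1}(x_t = x) \right] \\
&= \sum_x \phi_1(x) \phi_2(x)^T E\left[ \sum_{t=0}^{\tau-1} \mathbb{1}(x_t = x) \right].
\end{aligned}$$

Now, note that the last term on the right hand side is an expectation (over all possible trajectories) of the number of visits to a state $x$ until reaching the terminal state, which is exactly $q(x)$ since

$$q(x) = \sum_{t=0}^{\infty} P(x_t = x) = \sum_{t=0}^{\infty} E[\mathbb{1}(x_t = x)] = E\left[ \sum_{t=0}^{\infty} \mathbb{1}(x_t = x) \right] = E\left[ \sum_{t=0}^{\tau-1} \mathbb{1}(x_t = x) \right],$$

where the last equality follows from the absorbing property of the terminal state. Similarly, we have

$$E\left[ \sum_{t=0}^{\tau-1} \phi_1(x_t) \phi_2(x_{t+1})^T \right] = \sum_x \sum_y q(x) P(y|x) \phi_1(x) \phi_2(y)^T, \tag{18}$$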

since

and

 q(x)P(y|x)=∞∑t=0P(xt=x)P(y|x)=∞∑t=0P(xt=x,xt+1=y)=∞∑t=0E[1(xt=x,xt+1=y)]=E[∞∑t=01(xt=x