1 Introduction
Reinforcement Learning (RL) [27] methods have recently grown in popularity across many types of applications. Powerful policy search [2] algorithms, such as TRPO [22] and PPO [23], achieve impressive results [13, 7] in terms of efficiently maximizing the expected value of the cumulative discounted rewards (referred to as the expected return). These algorithms are proving very effective in many sectors, but they leave behind problems in which maximizing the return is not the only goal and risk aversion becomes an important objective too. Risk-averse reinforcement learning is not a new theme: a utility-based approach was introduced in [24] and [9], where the value function becomes the expected value of a utility function of the reward. A category of objective functions called coherent risk functions, characterized by convexity, monotonicity, translation invariance and positive homogeneity, has been defined and studied in [29]. These include well-known risk functions such as CVaR (Conditional Value at Risk) and mean-semideviation.
Another category of risk-averse objective functions comprises those which include the variance of the returns (referred to as return variance throughout the paper), which is then combined with the standard return in a mean-variance [30, 21] or a Sharpe ratio [10] fashion.
In certain domains, return variance and CVaR are not suitable to correctly capture risk. In finance, for instance, keeping a low return variance may be appropriate for long-term investments, where performance can be measured on the Profit and Loss (P&L) made at the end of the year. However, in most other types of investment, interim results are evaluated frequently, so keeping a low-varying daily P&L becomes crucial.
This paper analyzes, for the first time, the variance of the reward at each time step w.r.t. the state visitation probabilities. We call this quantity reward volatility. Intuitively, the return variance measures the variation of the accumulated reward among trajectories, while reward volatility is concerned with the variation of single-step rewards among visited states. Reward volatility is used to define a new risk-averse performance objective, called mean-volatility, which trades off the maximization of the expected return with the minimization of short-term risk. The main contribution of this paper is the derivation of a policy gradient theorem for the new risk-averse objective, based on a novel risk-averse Bellman equation. These theoretical results are made possible by the greater tractability of reward volatility compared to return variance, as the former lacks the problematic inter-temporal correlation terms of the latter. However, we also show that reward volatility upper bounds the return variance (up to a normalization factor). This is an interesting result, indicating that the analytic tools we derive for reward volatility can also be used to keep the return variance under control.
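To make the distinction concrete, the following sketch (our own toy illustration, not from the paper) computes empirical versions of the two measures from reward sequences: `return_variance` is the variance of discounted returns across trajectories, while `reward_volatility` mirrors the occupancy-weighted, per-step definition:

```python
import numpy as np

def return_variance(trajectories, gamma):
    """Variance of discounted returns across trajectories."""
    returns = [sum(gamma**t * r for t, r in enumerate(traj)) for traj in trajectories]
    return float(np.var(returns))

def reward_volatility(trajectories, gamma):
    """Discounted variance of single-step rewards around the
    normalized expected return J (empirical version)."""
    weights, rewards = [], []
    for traj in trajectories:
        for t, r in enumerate(traj):
            weights.append(gamma**t)
            rewards.append(r)
    weights = np.array(weights) / np.sum(weights)   # normalized occupancy weights
    rewards = np.array(rewards)
    j = float(np.dot(weights, rewards))             # normalized expected return
    return float(np.dot(weights, (rewards - j) ** 2))

# Two deterministic reward streams with the same (zero) return variance:
smooth = [[0.0, 0.0, 0.0, 0.0]]
jagged = [[1.0, -1.0, 1.0, -1.0]]
assert return_variance(smooth, 0.9) == return_variance(jagged, 0.9) == 0.0
assert reward_volatility(jagged, 0.9) > reward_volatility(smooth, 0.9)
```

The jagged stream is indistinguishable from the smooth one under return variance, but is heavily penalized by reward volatility.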
If correctly optimized, the mean-volatility objective limits the inherent risk due to the stochastic nature of the environment. However, the imperfect knowledge of the model parameters, and the consequent imprecise optimization process, is another relevant source of risk, known as model risk. This is especially important when the optimization is performed online, as may happen for an autonomous, adaptive trading system. To avoid any kind of performance oscillation, the intermediate solutions implemented by the learning algorithm must guarantee continuous improvement. The TRPO algorithm provides this kind of guarantee (at least in its ideal formulation) for the risk-neutral objective. The second contribution of our paper is the derivation of the Trust Region Volatility Optimization (TRVO) algorithm, a TRPO-style algorithm for the new mean-volatility objective. After some background on policy gradients (Section 2), the volatility measure is introduced in Section 3 and compared to the return variance. The Policy Gradient Theorem for the mean-volatility objective is provided in Section 4. In Section 4.1, we introduce an estimator for the gradient based on sample trajectories obtained from direct interaction with the environment, which yields a practical risk-averse policy gradient algorithm (VOLA-PG). The TRVO algorithm is introduced in Section 5. Finally, in Section 7, we test our algorithms on two simulated financial environments: in the former, the agent has to balance investing in liquid and non-liquid assets with simulated dynamics; in the latter, the agent has to learn to trade on a real asset using historical data.
2 Preliminaries
A discrete-time Markov Decision Process (MDP) is defined as a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma, \mu \rangle$, where $\mathcal{S}$ is the (continuous) state space, $\mathcal{A}$ the (continuous) action space, $\mathcal{P}$ is a Markovian transition model that assigns to each state-action pair $(s,a)$ the probability $\mathcal{P}(s' \mid s,a)$ of reaching the next state $s'$, $\mathcal{R}$ is a bounded reward function, i.e. $\sup_{s,a} |\mathcal{R}(s,a)| \leq R_{\max} < \infty$, $\gamma \in [0,1)$ is the discount factor, and $\mu$ is the distribution of the initial state. The policy of an agent is characterized by $\pi(\cdot \mid s)$, which assigns to each state $s$ the density distribution over the action space $\mathcal{A}$. We consider infinite-horizon problems in which future rewards are exponentially discounted with $\gamma$. Following a trajectory $\tau = (s_0, a_0, s_1, a_1, \dots)$, let the return be defined as the discounted cumulative reward: $G_\tau = \sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t, a_t)$. For each state $s$ and action $a$, the action-value function is defined as:

$$Q_{\pi}(s,a) = \mathop{\mathbb{E}}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t,a_t) \,\middle|\, s_0 = s, a_0 = a\right], \qquad (1)$$

which can be recursively defined by the following Bellman equation:

$$Q_{\pi}(s,a) = \mathcal{R}(s,a) + \gamma \int_{\mathcal{S}} \mathcal{P}(s' \mid s,a) \int_{\mathcal{A}} \pi(a' \mid s')\, Q_{\pi}(s',a') \,\mathrm{d}a' \,\mathrm{d}s'.$$

For each state $s$, we define the state-value function of the stationary policy $\pi$ as:

$$V_{\pi}(s) = \int_{\mathcal{A}} \pi(a \mid s)\, Q_{\pi}(s,a) \,\mathrm{d}a. \qquad (2)$$

It is useful to introduce the (discounted) state-occupancy measure induced by $\pi$:

$$d_{\mu,\pi}(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t p_{\pi}^{t}(s \mid \mu),$$

where $p_{\pi}^{t}(s \mid \mu)$ is the probability of reaching state $s$ in $t$ steps from the initial distribution $\mu$ following policy $\pi$ (see Appendix A).
2.1 Actor-only policy gradient
From the previous subsection, it is possible to define the normalized^1 expected return using two distinct formulations, one based on transition probabilities and the other on the state occupancy $d_{\mu,\pi}$:

$$J_{\pi} = (1-\gamma) \mathop{\mathbb{E}}_{\tau}\left[\sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t,a_t)\right] = \mathop{\mathbb{E}}_{\substack{s \sim d_{\mu,\pi} \\ a \sim \pi(\cdot \mid s)}}\left[\mathcal{R}(s,a)\right].$$

^1 In our notation, the expected return (as commonly defined in the RL literature) is $J_{\pi}/(1-\gamma)$.
For the rest of the paper, we consider parametric policies, where the policy $\pi_{\theta}$ is parametrized by a vector $\theta \in \Theta$. In this case, the goal is to find the optimal parametric policy maximizing the performance, i.e. $\max_{\theta} J_{\pi_{\theta}}$.^2 The Policy Gradient Theorem (PGT) [27] states that, for a given policy $\pi_{\theta}$:

$$\nabla_{\theta} J_{\pi_{\theta}} = \mathop{\mathbb{E}}_{\substack{s \sim d_{\mu,\pi} \\ a \sim \pi_{\theta}(\cdot \mid s)}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q_{\pi}(s,a)\right]. \qquad (3)$$

^2 For the sake of brevity, when a variable depends on the policy $\pi_{\theta}$, only $\pi$ is shown in subscripts, omitting the dependency on the parameters $\theta$.
In the following, we omit the gradient subscript $\theta$ whenever clear from the context.
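For concreteness, the score-function form of Equation (3) can be estimated by plain Monte Carlo sampling. The sketch below is our own one-step example (a Gaussian policy with a quadratic reward standing in for $Q_\pi$), not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_gaussian(a, theta, sigma=1.0):
    # Score function of a Gaussian policy N(theta, sigma^2) w.r.t. its mean.
    return (a - theta) / sigma**2

def pgt_estimate(theta, n=20000, sigma=1.0):
    """Monte Carlo estimate of the policy gradient for a one-step problem
    with reward r(a) = -(a - 3)^2 (the reward stands in for Q)."""
    a = rng.normal(theta, sigma, size=n)
    r = -(a - 3.0) ** 2
    return float(np.mean(grad_log_gaussian(a, theta, sigma) * r))

# The true gradient of E[r] w.r.t. theta is -2 * (theta - 3), i.e. 6 at theta = 0:
g = pgt_estimate(0.0)
assert abs(g - 6.0) < 1.0
```

Ascending this estimate moves the policy mean toward the reward-maximizing action, as the theorem predicts.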
3 Risk Measures
This section introduces the concept of reward volatility, comparing it with the more common return variance. The latter, denoted by $\sigma^2_{\pi}$, is defined as:

$$\sigma^2_{\pi} = \mathop{\mathbb{E}}_{\tau}\left[\left(\sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t,a_t) - \frac{J_{\pi}}{1-\gamma}\right)^{2}\right]. \qquad (4)$$

In our case, it is useful to define reward volatility in terms of the distribution $d_{\mu,\pi}$. As it is not possible to define the return variance in the same way, we also rewrite reward volatility as an expected sum over trajectories:^3

^3 In finance, the term "volatility" refers to a generic measure of variation, often defined as a standard deviation. In this paper, volatility is defined as a variance.

$$\nu^2_{\pi} = \mathop{\mathbb{E}}_{\substack{s \sim d_{\mu,\pi} \\ a \sim \pi(\cdot \mid s)}}\left[\left(\mathcal{R}(s,a) - J_{\pi}\right)^{2}\right] \qquad (5)$$

$$\phantom{\nu^2_{\pi}} = (1-\gamma) \mathop{\mathbb{E}}_{\tau}\left[\sum_{t=0}^{\infty} \gamma^t \left(\mathcal{R}(s_t,a_t) - J_{\pi}\right)^{2}\right]. \qquad (6)$$

Once we have set a risk-aversion parameter $\lambda \geq 0$, the performance or objective function related to the policy $\pi$ can be defined as:

$$\eta_{\pi} = J_{\pi} - \lambda \nu^2_{\pi}, \qquad (7)$$

called mean-volatility hereafter, where $\lambda$ allows to trade off expected return maximization with risk minimization. Similarly, the mean-variance objective is $J_{\pi} - \lambda \sigma^2_{\pi}$. An important result on the relationship between the two variance measures is the following:
Lemma 1
$$\sigma^2_{\pi} \leq \frac{\nu^2_{\pi}}{(1-\gamma)^2}.$$
The proofs of this and the other formal statements can be found in Appendix A. It is important to notice that the factor $\frac{1}{(1-\gamma)^2}$ simply comes from the fact that the return variance is not normalized, unlike the reward volatility (intuitively, volatility measures risk on a shorter time scale). What is lost in the reward volatility compared to the return variance are the inter-temporal correlations between the rewards. However, Lemma 1 shows that minimizing the reward volatility yields a low return variance. The opposite is clearly not true: a counterexample can be a stock price having the same value at the beginning and at the end of a day, but making complex movements in-between.
To better understand the difference between the two types of variance, consider the deterministic MDP in Figure 1, where the agent chooses between a safe loop with constant rewards and a risky loop with alternating positive and negative rewards. First assume the bonus $\epsilon = 0$. Every optimal policy yields the same expected return. The reward volatility of a deterministic policy that repeats the safe action is lower than the reward volatility of repeating the risky action. If we were minimizing the reward volatility, we would prefer the first policy, while we would be indifferent between the two policies based on the return variance ($\sigma^2_{\pi}$ is 0 in both cases, since both the MDP and the policies are deterministic). Now let us complicate the example, setting $\epsilon > 0$, so that the risky loop yields a slightly higher return. As a consequence, a mean-variance objective would always choose the risky action, since the return variance is still 0, while the mean-volatility objective may choose the other path, depending on the value of the risk-aversion parameter $\lambda$. This simple example shows how the mean-variance objective can be insensitive to short-term risk (the oscillating reward), even when the gain in expected return is very small in comparison (of order $\epsilon$). Instead, the mean-volatility objective correctly captures this kind of trade-off.
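The example can be checked numerically. The sketch below uses our own choice of rewards (not necessarily those of Figure 1): a safe loop with constant zero reward against a risky loop alternating between $1+\epsilon$ and $-1$:

```python
import numpy as np

def stats(rewards, gamma):
    """Normalized expected return, reward volatility, and return variance
    for a single deterministic reward stream (truncated horizon)."""
    w = np.array([gamma**t for t in range(len(rewards))])
    w = w / w.sum()
    r = np.array(rewards, dtype=float)
    j = float(w @ r)
    vol = float(w @ (r - j) ** 2)
    ret_var = 0.0  # deterministic MDP + deterministic policy: one trajectory
    return j, vol, ret_var

gamma, eps, horizon = 0.9, 0.01, 50
safe  = [0.0] * horizon                                        # safe loop
risky = [(1.0 + eps) if t % 2 == 0 else -1.0 for t in range(horizon)]

j_s, vol_s, var_s = stats(safe, gamma)
j_r, vol_r, var_r = stats(risky, gamma)

assert var_s == var_r == 0.0        # mean-variance cannot tell them apart
assert vol_r > vol_s                # volatility penalizes the risky loop
assert j_r > j_s                    # but the risky loop has a higher return
lam = 1.0
assert (j_s - lam * vol_s) > (j_r - lam * vol_r)   # mean-volatility prefers safe
```

With a sufficiently large risk-aversion parameter, the mean-volatility objective picks the safe loop, while any mean-variance objective picks the risky one.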
4 RiskAverse Policy Gradient
In this section, we derive a policy gradient theorem for the reward volatility $\nu^2_{\pi}$, and we propose an unbiased gradient estimator. This allows us to solve the optimization problem via stochastic gradient ascent. We introduce a volatility equivalent of the action-value function (Equation (1)), which is the volatility observed by starting from state $s$, taking action $a$ and following $\pi$ thereafter:

$$X_{\pi}(s,a) = \mathop{\mathbb{E}}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(\mathcal{R}(s_t,a_t) - J_{\pi}\right)^{2} \,\middle|\, s_0 = s, a_0 = a\right], \qquad (8)$$

called the action-volatility function. Like the $Q$ function, $X_{\pi}$ can be written recursively by means of a Bellman equation:

$$X_{\pi}(s,a) = \left(\mathcal{R}(s,a) - J_{\pi}\right)^{2} + \gamma \int_{\mathcal{S}} \mathcal{P}(s' \mid s,a) \int_{\mathcal{A}} \pi(a' \mid s')\, X_{\pi}(s',a') \,\mathrm{d}a' \,\mathrm{d}s'. \qquad (9)$$

The state-volatility function, which is the equivalent of the $V$ function (Equation (2)), can then be defined as:

$$W_{\pi}(s) = \int_{\mathcal{A}} \pi(a \mid s)\, X_{\pi}(s,a) \,\mathrm{d}a.$$
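Since Equation (9) is linear in $X_\pi$, policy evaluation for the volatility reduces to solving a linear system, exactly as for $Q_\pi$. A minimal tabular sketch (our own illustration, assuming a two-state deterministic MDP with a fixed policy and state-dependent rewards):

```python
import numpy as np

gamma = 0.9
R = np.array([1.0, -1.0])        # reward in state 0 and state 1
P = np.array([[0.0, 1.0],        # state 0 -> state 1
              [1.0, 0.0]])       # state 1 -> state 0

# Normalized expected return J, starting in state 0:
w = np.array([gamma**t for t in range(2000)])
w /= w.sum()
traj_r = np.array([R[t % 2] for t in range(2000)])
J = float(w @ traj_r)

# Solve the linear Bellman equation X = (R - J)^2 + gamma * P X exactly:
X = np.linalg.solve(np.eye(2) - gamma * P, (R - J) ** 2)

# It matches the direct discounted sum of squared deviations from state 0:
X_direct = sum(gamma**t * (R[t % 2] - J) ** 2 for t in range(2000))
assert abs(X[0] - X_direct) < 1e-6
```

The linearity is what makes exact policy evaluation (and, later, trust-region guarantees) possible, in contrast to the non-linear recursions of other risk measures.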
This allows us to derive a Policy Gradient Theorem (PGT) for $\nu^2_{\pi}$, as done in [26] for the expected return:
Theorem 2 (Reward Volatility PGT)
Using the definitions of the action-volatility and state-volatility functions, the volatility term can be rewritten as:

$$\nu^2_{\pi} = (1-\gamma) \mathop{\mathbb{E}}_{s \sim \mu}\left[W_{\pi}(s)\right]. \qquad (10)$$

Moreover, for a given policy $\pi_{\theta}$:

$$\nabla_{\theta} \nu^2_{\pi} = \mathop{\mathbb{E}}_{\substack{s \sim d_{\mu,\pi} \\ a \sim \pi_{\theta}(\cdot \mid s)}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, X_{\pi}(s,a)\right].$$
The existence of a linear Bellman equation (9) and the consequent policy gradient theorem are two non-trivial theoretical advantages of the new reward-volatility approach with respect to the return variance. With a simple extension, it is possible to obtain the policy gradient theorem for the mean-volatility objective defined in Equation (7). The mean-volatility action-value and state-value functions are obtained by combining the corresponding functions of the expected return (1) and of the volatility (8):

$$Q^{\lambda}_{\pi}(s,a) = Q_{\pi}(s,a) - \lambda X_{\pi}(s,a), \qquad V^{\lambda}_{\pi}(s) = V_{\pi}(s) - \lambda W_{\pi}(s).$$

The policy gradient theorem thus states:

$$\nabla_{\theta} \eta_{\pi} = \mathop{\mathbb{E}}_{\substack{s \sim d_{\mu,\pi} \\ a \sim \pi_{\theta}(\cdot \mid s)}}\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, \left(Q_{\pi}(s,a) - \lambda X_{\pi}(s,a)\right)\right].$$
In the following part, we use these results to design a VOLatility-Averse Policy Gradient algorithm (VOLA-PG).
4.1 Estimating the risk-averse policy gradient
To design a practical actor-only policy gradient algorithm, the action-value function $Q_{\pi}$ needs to be estimated, as in [27, 16]. Similarly, we need an estimator for the action-volatility function $X_{\pi}$. In this approximate framework, we collect $N$ finite trajectories of length $T$ per each policy update. An unbiased estimator of $Q_{\pi}(s^n_t, a^n_t)$ can be defined as:

$$\widehat{Q}(s^n_t, a^n_t) = \sum_{t'=t}^{T} \gamma^{t'-t} r^n_{t'}, \qquad (11)$$

where rewards are denoted by $r^n_{t'}$. This can be used to compute an estimator for the action-volatility function:
Lemma 3
An unbiased estimator of $X_{\pi}(s^n_t, a^n_t)$ is given by:

$$\widehat{X}(s^n_t, a^n_t) = \sum_{t'=t}^{T} \gamma^{t'-t} \left(r^n_{t'} - \widehat{J}\right)^{2}, \qquad (12)$$

provided that the two occurrences of $\widehat{J}$ produced by expanding the square are estimated on two batches of trajectories independent of each other and of the $n$-th trajectory.

Note that, in order to obtain an unbiased estimator for $X_{\pi}$, a triple sampling procedure is needed. This may be very restrictive in practice. However, by adopting single sampling instead, the bias introduced is equivalent to the variance of $\widehat{J}$, so the estimator is still consistent. This result can be used to build a consistent estimator for the gradient $\nabla_{\theta} \eta_{\pi}$:^4

^4 Obtained by adopting single sampling.
$$\widehat{\nabla}_{\theta} \eta_{\pi} = \frac{1-\gamma}{N} \sum_{n=1}^{N} \sum_{t=0}^{T} \gamma^{t}\, \nabla_{\theta} \log \pi_{\theta}(a^n_t \mid s^n_t) \left(\widehat{Q}(s^n_t, a^n_t) - \lambda \widehat{X}(s^n_t, a^n_t)\right). \qquad (13)$$
This is just a PGT [26] estimator with $\widehat{Q} - \lambda \widehat{X}$ in place of the plain action-value estimate. As shown in [16], it can be turned into a GPOMDP estimator, for which variance-minimizing baselines can be easily computed. Pseudocode for the resulting actor-only policy gradient method is reported in Algorithm 1.
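A minimal rendition of the resulting estimator (our own simplified sketch of Equation (13) with single sampling; the toy usage at the bottom is illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def volapg_gradient(trajs, grad_log_pi, gamma, lam):
    """Single-sampling VOLA-PG-style gradient estimate (consistent,
    slightly biased): a PGT estimator with Q_hat - lam * X_hat in place
    of Q_hat. trajs: list of [(state, action, reward), ...] tuples;
    grad_log_pi(s, a): score function of the policy."""
    # Plug-in estimate of the normalized expected return J:
    w_sum, r_sum = 0.0, 0.0
    for traj in trajs:
        for t, (_, _, r) in enumerate(traj):
            w_sum += gamma**t
            r_sum += gamma**t * r
    j_hat = r_sum / w_sum

    grad = 0.0
    for traj in trajs:
        rewards = np.array([r for (_, _, r) in traj])
        for t, (s, a, _) in enumerate(traj):
            disc = gamma ** np.arange(len(traj) - t)
            q_hat = float(disc @ rewards[t:])                   # Eq. (11)
            x_hat = float(disc @ (rewards[t:] - j_hat) ** 2)    # single sampling
            grad += gamma**t * grad_log_pi(s, a) * (q_hat - lam * x_hat)
    return (1.0 - gamma) * grad / len(trajs)

# Toy usage: one-step "trajectories" under a unit Gaussian policy with
# mean 0, where the reward equals the sampled action itself:
trajs = [[(None, a, a)] for a in rng.normal(0.0, 1.0, size=5000)]
g = volapg_gradient(trajs, lambda s, a: a, gamma=0.9, lam=0.0)
```

With `lam = 0.0` this reduces to a plain normalized PGT estimate; raising `lam` shifts the update toward low-volatility behavior.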
5 Trust Region Volatility Optimization
In this section, we go beyond the standard policy gradient theorem and show that it is possible to guarantee a monotonic improvement of the mean-volatility performance measure (7) at each policy update. Safe (in the sense of non-pejorative) updates are of fundamental importance when learning online on a real system, such as when controlling a robot or trading in the financial markets; they also give interesting results during offline training, guaranteeing an automatic tuning of the optimal learning rate. While the mean-volatility objective ensures a risk-averse behavior of the policy, the safe update ensures a risk-averse update of the parameters of the policy. Thus, if we care about the agent's performance during the learning process, we must pay attention to the step size of each parameter update. Adapting the approach of [22] to our mean-volatility objective, we show it is possible to obtain a learning rate that guarantees that the performance of the updated policy is bounded with respect to the previous policy. The safe update is based on the advantage function, defined as the difference between the action-value and the state-value function. Because of the linearity of the Bellman equations, we can extend the definition to obtain the mean-volatility advantage function:

$$A^{\lambda}_{\pi}(s,a) = Q^{\lambda}_{\pi}(s,a) - V^{\lambda}_{\pi}(s). \qquad (14)$$
Furthermore, with the mean-volatility objective, all the theoretical results leading to the TRPO algorithm hold. The main ones are in this section, but the full derivation can be found in Appendix A. Lemma 4 is an extension of Lemma 6.1 in [8]:^5

^5 The only slight difference is in the normalization term, consequent to the choice of different definitions.
Lemma 4 (Performance Difference Lemma)
The difference between the performances of two policies $\pi'$ and $\pi$ can be expressed in terms of the expected advantage:

$$\eta_{\pi'} - \eta_{\pi} = \mathop{\mathbb{E}}_{\substack{s \sim d_{\mu,\pi'} \\ a \sim \pi'(\cdot \mid s)}}\left[A^{\lambda}_{\pi}(s,a)\right] + \lambda \left(J_{\pi'} - J_{\pi}\right)^{2}. \qquad (15)$$
This result is very interesting, since the last term adds a gain related to the square of the difference between the expected returns of the two policies; therefore, there is a bonus whenever the expected return of the second policy is either higher or lower than that of the first one. However, this term is difficult to handle, and we can bound the performance difference by neglecting it; in this case the bound becomes:

$$\eta_{\pi'} - \eta_{\pi} \geq \mathop{\mathbb{E}}_{\substack{s \sim d_{\mu,\pi'} \\ a \sim \pi'(\cdot \mid s)}}\left[A^{\lambda}_{\pi}(s,a)\right]. \qquad (16)$$
This is the result one would obtain by considering a transformation of the MDP with a new, policy-dependent reward $\mathcal{R}_{\lambda}(s,a) = \mathcal{R}(s,a) - \lambda \left(\mathcal{R}(s,a) - J_{\pi}\right)^{2}$. In any case, the reader should be careful not to underestimate the choice of adopting a reward of this kind, since, in general, for policy-dependent rewards the PGT in Equation (3) and the performance bound in Equation (16) do not hold without proper assumptions.
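For a fixed policy, the transformed reward reproduces the mean-volatility objective exactly, which the following sketch (our own toy check on a single deterministic reward stream) verifies:

```python
import numpy as np

def mean_volatility(rewards, gamma, lam):
    """Normalized mean-volatility objective J - lam * nu^2 for one
    deterministic reward stream (truncated horizon)."""
    w = np.array([gamma**t for t in range(len(rewards))])
    w /= w.sum()
    r = np.asarray(rewards, dtype=float)
    j = float(w @ r)
    return j - lam * float(w @ (r - j) ** 2)

def transformed_return(rewards, gamma, lam):
    """Normalized return of the policy-dependent reward
    R - lam * (R - J)^2, with J computed for the same policy."""
    w = np.array([gamma**t for t in range(len(rewards))])
    w /= w.sum()
    r = np.asarray(rewards, dtype=float)
    j = float(w @ r)
    return float(w @ (r - lam * (r - j) ** 2))

stream = [1.0, -1.0, 0.5, 2.0, 0.0]
assert abs(mean_volatility(stream, 0.9, 0.7)
           - transformed_return(stream, 0.9, 0.7)) < 1e-12
```

The identity holds only because $J_\pi$ is evaluated under the same policy that generates the rewards, which is exactly why policy-dependent rewards require care when differentiating.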
Following the approach proposed in [22], it is possible to adopt an approximation of the surrogate function, which provides monotonic improvement guarantees by considering the KL divergence between the policies:
Theorem 5
Consider the following approximation of $\eta_{\pi'}$, obtained by replacing the state-occupancy density of the new policy with that of the old policy $\pi$:

$$L_{\pi}(\pi') = \eta_{\pi} + \mathop{\mathbb{E}}_{\substack{s \sim d_{\mu,\pi} \\ a \sim \pi'(\cdot \mid s)}}\left[A^{\lambda}_{\pi}(s,a)\right]. \qquad (17)$$

Let $\epsilon^{\lambda} = \max_{s,a} \left|A^{\lambda}_{\pi}(s,a)\right|$. Then, the performance of $\pi'$ can be bounded as follows:^6

$$\eta_{\pi'} \geq L_{\pi}(\pi') - \frac{4\gamma\, \epsilon^{\lambda}}{1-\gamma}\, \max_{s} D_{\mathrm{KL}}\left(\pi(\cdot \mid s) \,\|\, \pi'(\cdot \mid s)\right). \qquad (18)$$

^6 Comparing this bound to the results shown in [22], the denominator term is not squared due to return normalization.
As a consequence, we can devise a trust-region variant of VOLA-PG, called TRVO (Trust Region Volatility Optimization), outlined in Algorithm 2. Note that this would not have been possible without the linear Bellman equations, which are not available for more common risk measures. For the practical implementation, we follow [22].
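The core of the TRVO update can be sketched as follows (a heavily simplified, illustrative version on a one-state MDP with a softmax policy: numeric gradients and a backtracking line search stand in for the conjugate-gradient machinery of [22]):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    # KL divergence between two discrete action distributions.
    return float(np.sum(p * np.log(p / q)))

def mean_volatility(theta, r, lam):
    # One-state MDP: at each step a ~ softmax(theta), reward r[a].
    # J = E[r]; the volatility is the per-step variance of the reward.
    p = softmax(theta)
    j = float(p @ r)
    return j - lam * float(p @ (r - j) ** 2)

def trust_region_step(theta, r, lam=0.5, delta=0.01, lr=1.0):
    """One TRVO-style update: ascend the mean-volatility objective,
    backtracking the step until the KL trust region of size delta holds."""
    p_old = softmax(theta)
    g = np.zeros_like(theta)
    for i in range(len(theta)):                 # numeric gradient
        e = np.zeros_like(theta)
        e[i] = 1e-5
        g[i] = (mean_volatility(theta + e, r, lam)
                - mean_volatility(theta - e, r, lam)) / 2e-5
    step = lr
    while kl(p_old, softmax(theta + step * g)) > delta:
        step *= 0.5                             # backtracking line search
    return theta + step * g

theta = np.zeros(2)
r = np.array([1.0, 0.0])
for _ in range(20):
    theta = trust_region_step(theta, r, lam=0.0)
assert softmax(theta)[0] > 0.9   # risk-neutral case: concentrate on reward 1
```

Every accepted step satisfies the KL constraint by construction; with `lam > 0` the same loop trades expected reward against per-step variance.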
6 Related Works
Two streams of RL literature have been merged in this paper (for the first time, to the best of our knowledge): risk-averse objective functions and safe policy updates. As stated in [3, 4], these two themes are related, as they both reduce risk. The first is the reduction of inherent risk, which is generated by the stochastic nature of the environment, while the second is the reduction of model risk, which is related to the imperfect knowledge of the model parameters.
Several risk criteria have been considered in the literature, such as coherent risk measures [28], utility functions [24] and return variance [30] (defined in Equation (4)); the latter is the most studied and the closest to reward volatility.
Typically, the risk-averse Bellman equation for the return variance defined by Sobel [25] is considered.
Unlike Equation (9), Sobel's equation does not satisfy the monotonicity property of dynamic programming. This has led to a variety of complex approaches to return-variance minimization [3, 30, 21, 20].
The popular mean-variance objective is similar to the one considered in this paper (Equation (7)), with the return variance in place of the reward volatility. Actor-critic algorithms for this objective function are presented in [30, 20, 21], the latter also proposing an actor-only policy gradient. Sometimes, the objective is slightly modified by maximizing the expected value of the return while constraining the variance, as in [3].
The Sharpe ratio is an important financial index, which describes the excess return received in exchange for the extra volatility endured by holding a riskier asset. This ratio is used in [3], where an actor-only policy gradient is proposed.
In [10] and [11], different approaches are used which do not need a Bellman equation for the return variance. In [10], one of the first applications of RL to trading, the authors consider the differential Sharpe ratio (a first-order approximation of the Sharpe ratio), bypassing the direct variance calculation and proposing a policy gradient algorithm. In [11], the problem is tackled by approximating the distribution of the possible returns, deriving a distributional Bellman equation and minimizing the CVaR. Actor-critic algorithms for maximizing the expected return under CVaR constraints are proposed in [1].
Even if a Bellman equation can be derived for some of these measures, such recursive relationships are, unfortunately, always non-linear. This prevents one from extending the safety guarantees (Theorem 5) and the TRPO algorithm to risk measures other than reward volatility.
The second literature stream is dedicated to safe updates, which, until now, have only been defined for the standard risk-neutral objective function. The seminal paper in this setting is [8], which proposes a conservative policy iteration algorithm with monotonic improvement guarantees for mixtures of greedy policies. This approach is generalized to stationary policies in [19] and to stochastic policies in [22]. Building on the former, monotonically improving policy gradient algorithms are devised in [17, 14] for Gaussian policies. Similar algorithms are developed for Lipschitz policies [18], under the assumption of Lipschitz-continuous reward and transition functions, and, more recently, for smoothing policies [15] on general MDPs.^7 On the other hand, [22] proposes TRPO, a general policy gradient algorithm inspired by the monotonically-improving policy iteration strategy. Although the theoretical guarantees are formally lost due to a series of approximations, TRPO has enjoyed great empirical success in recent years, especially in combination with deep policies. Moreover, the exact version of TRPO has been shown to converge to the optimal policy [12]. Proximal Policy Optimization [23] is a further approximation of TRPO which has been used to tackle complex, large-scale control problems [13, 7].

^7 A safe version of VOLA-PG based on this line of work is presented in Appendix B as a more rigorous (albeit less practical) alternative to TRVO.
7 Experiments
In this section, we present an empirical analysis of the performance of VOLA-PG (Algorithm 1) and TRVO (Algorithm 2), applied to a simplified portfolio management task taken from [3] and to a trading task. We compare these results with two other approaches: a mean-variance policy gradient presented in the same article, which we adjusted to take discounting into account, and a risk-averse transformation of the rewards inside the TRPO algorithm. Indeed, a possible way to control the risk of the immediate reward is to consider its expected utility, converting the immediate reward $r$ into its exponential utility $u(r) = \frac{1 - e^{-c r}}{c}$, with the parameter $c > 0$ controlling the sensitivity to risk. This transformation generates a strong parallelism with the literature on the variance of the return: as the maximization of the exponential utility function applied to the return is approximately equal to the optimization of the mean-variance objective used in [30], the suggested reward transformation is similarly related to our mean-volatility objective (in a loose approximation, as shown in Appendix C). By applying such a transformation, it could be possible to obtain risk-averse policies similar to ours while running standard return-optimizing algorithms. However, the aforementioned approximation is sound only for small values of the risk-aversion coefficient, and it suffers the same limitations as any expected utility, as shown for example in [5]. Moreover, the possibility for the agent to receive large negative rewards may cause strong instability in the learning process. We use it as a baseline in the following experiments, under the name TRPO-exp.
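A sketch of the transformation (assuming the standard exponential utility form; the exact variant used in the experiments may differ):

```python
import numpy as np

def exp_utility(r, c):
    """Exponential-utility reward transformation of the TRPO-exp baseline:
    u(r) = (1 - exp(-c * r)) / c, with risk-sensitivity parameter c > 0."""
    return (1.0 - np.exp(-c * r)) / c

# For small c, a Taylor expansion gives u(r) ~ r - (c/2) * r^2, i.e. a
# quadratic penalty on large-magnitude rewards, loosely related to the
# mean-volatility objective:
r = np.linspace(-1.0, 1.0, 201)
c = 0.05
approx = r - 0.5 * c * r**2
assert np.max(np.abs(exp_utility(r, c) - approx)) < 1e-3

# For large c the transformation blows up on negative rewards, which is
# one source of the instability observed in the experiments:
assert exp_utility(-2.0, 5.0) < -4000.0
```

The second assertion illustrates why the approach degrades at high risk aversion: a single bad reward dominates the transformed objective.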
7.1 Portfolio Management
Setting
The portfolio domain considered is composed of a liquid and a non-liquid asset. The liquid asset has a fixed interest rate $r_{\mathrm{l}}$, while the non-liquid asset has a stochastic interest rate that switches between two values, $r_{\mathrm{nl}}^{\mathrm{low}}$ and $r_{\mathrm{nl}}^{\mathrm{high}}$, with probability $p_{\mathrm{switch}}$. If the non-liquid asset does not default (a default can happen with probability $p_{\mathrm{def}}$), it is sold when it reaches maturity after a fixed period of $T_{\mathrm{m}}$ time steps. At $t = 0$, the portfolio is composed only of the liquid asset. At every time step, the investor can choose to buy an amount of non-liquid assets (up to a fixed maximum), each one with a fixed cost. When a non-liquid asset is sold, the gain is added to the liquid one. The state at time $t$ contains the allocation in the liquid asset, the allocations in the non-liquid assets (with times to maturity respectively equal to $1, \dots, T_{\mathrm{m}}$ time steps), and the time $t$. The reward at time $t$ is computed (unlike [3]) as the liquid P&L, i.e., the difference between the liquid allocations at times $t+1$ and $t$. Further details on this domain are given in Appendix D.
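A step of this environment could be sketched as follows (all numeric parameters are our own illustrative placeholders, not the values used in the experiments):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative-only parameters (not the paper's): liquid rate, the two
# non-liquid rates, switch/default probabilities, maturity, max purchase.
R_LIQ, R_LOW, R_HIGH = 0.001, 0.002, 0.01
P_SWITCH, P_DEFAULT = 0.1, 0.001
MATURITY, MAX_BUY, COST = 4, 2, 0.2

def step(liquid, nonliquid, rate_high, action):
    """One step of the portfolio MDP sketch. nonliquid[i] = amount
    maturing in i+1 steps. Reward = liquid P&L (change in liquid wealth)."""
    liquid_before = liquid
    if rng.random() < P_SWITCH:                  # regime switch
        rate_high = not rate_high
    rate = R_HIGH if rate_high else R_LOW
    liquid *= 1.0 + R_LIQ                        # liquid interest accrual
    if rng.random() < P_DEFAULT:                 # non-liquid default
        nonliquid = [0.0] * MATURITY
    else:
        nonliquid = [x * (1.0 + rate) for x in nonliquid]
    liquid += nonliquid[0]                       # maturing asset is sold
    nonliquid = nonliquid[1:] + [0.0]
    buy = min(min(action, MAX_BUY) * COST, liquid)
    liquid -= buy                                # buy new non-liquid assets
    nonliquid[-1] += buy
    reward = liquid - liquid_before              # liquid P&L
    return liquid, nonliquid, rate_high, reward

liquid, nonliquid, hi = 1.0, [0.0] * MATURITY, False
liquid, nonliquid, hi, r = step(liquid, nonliquid, hi, action=1)
assert abs(sum(nonliquid) - 0.2) < 1e-12         # one asset bought
```

Buying the non-liquid asset immediately lowers the liquid P&L, which is exactly the short-term risk that the mean-volatility objective penalizes.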
Results
Figure 2 shows how the optimal solutions of the two algorithms trade off the maximization of the expected return against the minimization of one of the two forms of variability as the risk-aversion parameter $\lambda$ changes. The plot on the right shows that the mean-variance frontier obtained by Algorithm 1 almost coincides with the frontier obtained using the algorithm proposed in [3]. This means that, at least in this domain, the return variance is equally reduced by optimizing either the mean-volatility or the mean-variance objective. Instead, from the plot on the left, we can notice that the reward volatility is better optimized by VOLA-PG. These results are consistent with Lemma 1.
7.2 Trading Framework
The second environment is a realistic setting: trading on the S&P 500 index. The possible actions are buy, sell, and stay flat.^8 Transaction costs are also considered, i.e., a fee is paid when the position changes. The data used are the daily prices of the S&P 500 from the 1980s until 2019. The state consists of the last 10 days of percentage price changes, with the addition of the previous portfolio position as well as the fraction of the episode to go, with episodes 50 days long.

^8 This means we are assuming that short selling is possible, i.e., selling the stock without owning it, betting that its price is going to decrease.
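The state construction can be sketched as follows (our own illustrative rendition; names and the toy price series are not from the paper):

```python
import numpy as np

def make_state(prices, t, position, episode_len=50, window=10):
    """State at day t: the last `window` daily percentage price changes,
    the previous position (-1 short, 0 flat, 1 long), and the fraction
    of the episode still to go."""
    pct = np.diff(prices) / prices[:-1]        # daily percentage changes
    obs = pct[t - window:t]                    # last `window` days
    to_go = (episode_len - t % episode_len) / episode_len
    return np.concatenate([obs, [position, to_go]])

prices = np.linspace(100.0, 110.0, 61)         # toy price series
s = make_state(prices, t=10, position=1.0)
assert s.shape == (12,)                        # 10 returns + position + time
assert s[-2] == 1.0                            # previous position
```

Including the previous position in the state lets the agent account for the transaction fee before changing position.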
Results
In Figure 3, we can see that the frontier generated by TRVO dominates that of the naive approach (TRPO-exp). The same figure also shows VOLA-PG trained with the same number of iterations as TRVO. The frontier obtained is entirely dominated, since the algorithm has not reached convergence; this reflects the faster convergence of TRPO with respect to REINFORCE. TRPO-exp becomes unstable when the risk-aversion parameter $c$ is raised too much, so it is not possible to find the value for which the risk aversion is maximal, which is why there are no points at the bottom left.
8 Conclusions
Throughout this paper, we propose a combination of methodologies to control risk in an RL framework. We define a risk measure called reward volatility that captures the variability of the rewards between steps. Optimizing this measure yields smoother trajectories that avoid shocks, a fundamental feature in risk-averse contexts. This measure bounds the variance of the returns, and it is possible to derive a linear Bellman equation for it (differently from other risk measures). We propose a policy gradient algorithm that optimizes the mean-volatility objective and, thanks to the aforementioned linearity, we derive a trust-region update that ensures a monotonic improvement of our objective at each gradient update, a fundamental characteristic for online algorithms in a real-world context. The proposed algorithms are tested on two financial environments: TRVO outperforms the naive approach on the trading framework and reaches convergence faster than its policy gradient counterpart. Future developments could consist in testing the algorithm in a real online financial setting, where new data points do not belong to the training distribution. To conclude, the framework is the first to take into account two kinds of safety, as it is capable of keeping risk under control while maintaining the same training and convergence properties as state-of-the-art risk-neutral approaches.
References
 [1] (2017) Riskconstrained reinforcement learning with percentile risk criteria. JMLR 18 (1), pp. 6070–6120. Cited by: §6.
 [2] (2013) A survey on policy search for robotics. Foundations and Trends® in Robotics 2 (1–2), pp. 1–142. Cited by: §1.
 [3] (2012) Policy gradients with variance related risk criteria. ICML. Cited by: §6, §6, §7.1, §7.1, §7.
 [4] (2015) A comprehensive survey on safe reinforcement learning. JMLR 16, pp. 1437–1480. External Links: Link Cited by: §6.
 [5] (2014) Beyond exponential utility functions: a variance-adjusted approach for risk-averse reinforcement learning. In 2014 ADPRL, pp. 1–8. External Links: Document, ISSN 23251824 Cited by: §7.
 [6] (2012) Applied multivariate statistical analysis. Vol. 22007, Springer. Cited by: Appendix B.
 [7] (2017) Emergence of locomotion behaviours in rich environments. CoRR abs/1707.02286. Cited by: §1, §6.
 [8] (2002) Approximately optimal approximate reinforcement learning. In ICML, Vol. 2, pp. 267–274. Cited by: §A.3, §5, §6.
 [9] (2012) Risk aversion in Markov decision processes via near optimal Chernoff bounds. In NeurIPS, pp. 3131–3139. Cited by: §1.

 [10] (2001) Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks 12 (4), pp. 875–889. Cited by: §1, §6.
 [11] (2010) Nonparametric return distribution approximation for reinforcement learning. In ICML, Cited by: §6.
 [12] (2017) A unified view of entropyregularized markov decision processes. CoRR abs/1705.07798. Cited by: §6.
 [13] (2018) OpenAI five. Note: https://blog.openai.com/openaifive/ Cited by: §1, §6.
 [14] (2017) Adaptive batch size for safe policy gradients. In NeurIPS, pp. 3591–3600. Cited by: §6.
 [15] (2019) Smoothing policies and safe policy gradients. External Links: 1905.03231 Cited by: §A.2, Appendix B, Appendix B, Appendix B, Appendix B, Appendix B, Appendix B, §6.
 [16] (2008) Reinforcement learning of motor skills with policy gradients. Neural networks 21 (4), pp. 682–697. Cited by: §4.1, §4.1.
 [17] (2013) Adaptive stepsize for policy gradient methods. In NeurIPS 26, pp. 1394–1402. Cited by: §6.
 [18] (2015) Policy gradient in lipschitz markov decision processes. Machine Learning 100 (23), pp. 255–283. Cited by: §6.
 [19] (2013) Safe policy iteration. In ICML, pp. 307–315. Cited by: §6.
 [20] (2014) Actorcritic algorithms for risksensitive reinforcement learning. arXiv preprint arXiv:1403.6530. Cited by: §6.
 [21] (2014) Varianceconstrained actorcritic algorithms for discounted and average reward mdps. CoRR abs/1403.6530. External Links: Link, 1403.6530 Cited by: §1, §6.
 [22] (2015) Trust region policy optimization. In ICML, pp. 1889–1897. Cited by: §1, §5, §5, §5, §6, footnote 6.
 [23] (2017) Proximal policy optimization algorithms. CoRR abs/1707.06347. Cited by: §1, §6.
 [24] (2014) Risk-averse reinforcement learning for algorithmic trading. In 2014 IEEE Conference on Computational Intelligence for Financial Engineering Economics (CIFEr), pp. 391–398. External Links: Document, ISSN 23808454 Cited by: §1, §6.
 [25] (1982) The variance of discounted Markov decision processes. Journal of Applied Probability 19 (4), pp. 794–802. Cited by: §6.
 [26] (2000) Policy gradient methods for reinforcement learning with function approximation. In NeurIPS, pp. 1057–1063. Cited by: §A.2, §4.1.
 [27] (1998) Introduction to reinforcement learning. 1st edition, MIT Press, Cambridge, MA, USA. External Links: ISBN 0262193981 Cited by: §1, §2.1, §4.1.
 [28] (2015) Policy Gradient for Coherent Risk Measures. CoRR, pp. 9 (en). Cited by: §6.
 [29] (2017) Sequential Decision Making With Coherent Risk. IEEE Transactions on Automatic Control 62 (7), pp. 3323–3338. External Links: ISSN 00189286, 15582523, Link, Document Cited by: §1.
 [30] (2013) Variance adjusted actor critic algorithms. arXiv preprint arXiv:1310.3697. Cited by: §1, §6, §7.
Appendix A Proofs
First, let us define the state-occupancy measure more formally. The $t$-step marginal state transition density under policy $\pi$ is defined recursively as follows:

$$p^{1}_{\pi}(s' \mid s) = \int_{\mathcal{A}} \pi(a \mid s)\, \mathcal{P}(s' \mid s,a) \,\mathrm{d}a, \qquad p^{t}_{\pi}(s' \mid s) = \int_{\mathcal{S}} p^{t-1}_{\pi}(s'' \mid s)\, p^{1}_{\pi}(s' \mid s'') \,\mathrm{d}s''.$$

This allows us to define the following state-occupancy densities:

$$d_{\pi}(s' \mid s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t} p^{t}_{\pi}(s' \mid s), \qquad d_{\mu,\pi}(s') = \int_{\mathcal{S}} \mu(s)\, d_{\pi}(s' \mid s) \,\mathrm{d}s,$$

measuring the discounted probability of visiting a state starting from another state or from the initial-state distribution, respectively.
A.1 Variance inequality
See Lemma 1.
Proof. Taking the left-hand side (Equation (4)) and expanding the square, we obtain:^9

^9 To shorten the notation, $R_t$ is used instead of $\mathcal{R}(s_t, a_t)$.
Similarly, for the right-hand side of the inequality:
Thus, the inequality we want to prove reduces to:
Consider the left-hand side. By the Cauchy–Schwarz inequality, it reduces to:
A.2 Risk-averse policy gradient theorem
See Theorem 2.
Proof
First, we need the following property (see, e.g., Lemma 1 in [15]):

Lemma 6
Any integrable function $f_{\pi}$ that can be recursively defined as:

$$f_{\pi}(s) = g(s) + \gamma \int_{\mathcal{S}} p^{1}_{\pi}(s' \mid s)\, f_{\pi}(s') \,\mathrm{d}s',$$

where $g$ is any integrable function, is equal to:

$$f_{\pi}(s) = \frac{1}{1-\gamma} \int_{\mathcal{S}} d_{\pi}(s' \mid s)\, g(s') \,\mathrm{d}s'.$$
A.3 Trust Region Volatility Optimization
See Lemma 4.
Proof.^11

^11 For the sake of clarity, we use the notation $\mathbb{E}_{\tau}$ to denote the expectation over trajectories generated by the policy under consideration.
Now, the goal is to obtain the discounted sum of the mean-volatility advantages defined in Equation (14); however, it must be evaluated using policy $\pi'$ instead of $\pi$. Hence, we recall the result in [8]:^12

^12 The difference is only in the normalization terms, which are used in accordance with the definitions above.