Risk-Averse Trust Region Optimization for Reward-Volatility Reduction

12/06/2019 ∙ by Lorenzo Bisi, et al. ∙ Politecnico di Milano

In real-world decision-making problems, for instance in the fields of finance, robotics or autonomous driving, keeping uncertainty under control is as important as maximizing expected returns. Risk aversion has been addressed in the reinforcement learning literature through risk measures related to the variance of returns. However, in many cases the risk is measured not only over a long-term horizon, but also on the step-wise rewards (e.g., in trading, to ensure the stability of the investment bank, it is essential to monitor the risk of portfolio positions on a daily basis). In this paper, we define a novel measure of risk, which we call reward volatility, consisting of the variance of the rewards under the state-occupancy measure. We show that the reward volatility bounds the return variance, so that reducing the former also constrains the latter. We derive a policy gradient theorem with a new objective function that exploits the mean-volatility relationship, and develop an actor-only algorithm. Furthermore, thanks to the linearity of the Bellman equations defined under the new objective function, it is possible to adapt well-known policy gradient algorithms with monotonic improvement guarantees, such as TRPO, in a risk-averse manner. Finally, we test the proposed approach in two simulated financial environments.


1 Introduction

Reinforcement Learning (RL) [27] methods have recently grown in popularity in many types of applications. Powerful policy search [2] algorithms, such as TRPO [22] and PPO [23], achieve very exciting results [13, 7] in terms of efficiently maximizing the expected value of the cumulative discounted rewards (referred to as the expected return). These types of algorithms are proving to be very effective in many sectors, but they leave behind problems in which maximizing the return is not the only goal and risk aversion becomes an important objective too. Risk-averse reinforcement learning is not a new theme: a utility-based approach was introduced in [24] and [9], where the value function becomes the expected value of a utility function of the reward. A category of objective functions called coherent risk functions, characterized by convexity, monotonicity, translation invariance and positive homogeneity, has been defined and studied in [29]. These include well-known risk functions such as CVaR (Conditional Value at Risk) and mean-semideviation. Another category of risk-averse objective functions is those that include the variance of the returns (referred to as the return variance throughout the paper), which is then combined with the standard return in a mean-variance [30, 21] or a Sharpe-ratio [10] fashion.
In certain domains, the return variance and CVaR are not suitable to correctly capture risk. In finance, for instance, keeping a low return variance may be appropriate for long-term investments, where performance can be measured on the Profit and Loss (P&L) made at the end of the year. However, in most other types of investment, interim results are evaluated frequently, so keeping the daily P&L stable becomes crucial.

This paper analyzes, for the first time, the variance of the reward at each time step w.r.t. the state-visitation probabilities. We call this quantity reward volatility. Intuitively, the return variance measures the variation of the accumulated reward among trajectories, while the reward volatility is concerned with the variation of single-step rewards among visited states. Reward volatility is used to define a new risk-averse performance objective, which trades off the maximization of the expected return with the minimization of short-term risk (called mean-volatility).
The main contribution of this paper is the derivation of a policy gradient theorem for the new risk-averse objective, based on a novel risk-averse Bellman equation. These theoretical results are made possible by the greater tractability of the reward volatility compared to the return variance, as the former lacks the problematic inter-temporal correlation terms of the latter. However, we also show that the reward volatility upper-bounds the return variance (up to a normalization term). This is an interesting result, indicating that the analytic results derived for the reward volatility can be used to keep the return variance under control.
If correctly optimized, the mean-volatility objective allows us to limit the inherent risk due to the stochastic nature of the environment. However, the imperfect knowledge of the model parameters, together with the consequent imprecise optimization process, is another relevant source of risk, known as model risk. This is especially important when the optimization is performed online, as may happen for an autonomous, adaptive trading system. To avoid any kind of performance oscillation, the intermediate solutions implemented by the learning algorithm must guarantee continuing improvement. The TRPO algorithm provides this kind of guarantee (at least in its ideal formulation) for the risk-neutral objective. The second contribution of our paper is the derivation of the Trust Region Volatility Optimization (TRVO) algorithm, a TRPO-style algorithm for the new mean-volatility objective.

After some background on policy gradients (Section 2), the volatility measure is introduced in Section 3 and compared to the return variance. The Policy Gradient Theorem for the mean-volatility objective is provided in Section 4. In Section 4.1, we introduce an estimator for the gradient based on sample trajectories obtained from direct interaction with the environment, which yields a practical risk-averse policy gradient algorithm (VOLA-PG). The TRVO algorithm is introduced in Section 5. Finally, in Section 7, we test our algorithms on two simulated financial environments: in the former, the agent has to balance investments in liquid and non-liquid assets with simulated dynamics, while in the latter, the agent has to learn to trade a real asset using historical data.

2 Preliminaries

A discrete-time Markov Decision Process (MDP) is defined as a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma, \mu \rangle$, where $\mathcal{S}$ is the (continuous) state space, $\mathcal{A}$ the (continuous) action space, $\mathcal{P}(\cdot|s,a)$ is a Markovian transition model that assigns to each state-action pair $(s,a)$ the probability of reaching the next state $s'$, $\mathcal{R}$ is a bounded reward function, $\gamma \in [0,1)$ is the discount factor, and $\mu$ is the distribution of the initial state. The policy of an agent is characterized by $\pi(\cdot|s)$, which assigns to each state $s$ the density distribution over the action space $\mathcal{A}$.
We consider infinite-horizon problems in which future rewards are exponentially discounted with $\gamma$. Following a trajectory $\tau = (s_0, a_0, s_1, a_1, \dots)$, let the return be defined as the discounted cumulative reward: $G_\tau = \sum_{t=0}^{\infty}\gamma^t\,\mathcal{R}(s_t,a_t)$. For each state $s$ and action $a$, the action-value function is defined as:

$Q_\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty}\gamma^t\,\mathcal{R}(s_t,a_t) \,\middle|\, s_0 = s,\, a_0 = a\right]$,   (1)

which can be recursively defined by the following Bellman equation:

$Q_\pi(s,a) = \mathcal{R}(s,a) + \gamma\int_{\mathcal{S}}\mathcal{P}(s'|s,a)\int_{\mathcal{A}}\pi(a'|s')\,Q_\pi(s',a')\,\mathrm{d}a'\,\mathrm{d}s'$.

For each state $s$, we define the state-value function of the stationary policy $\pi$ as:

$V_\pi(s) = \int_{\mathcal{A}}\pi(a|s)\,Q_\pi(s,a)\,\mathrm{d}a$.   (2)

It is useful to introduce the (discounted) state-occupancy measure induced by $\pi$:

$d_{\mu,\pi}(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t\,\Pr(s_t = s \mid \pi, \mu)$,

where $\Pr(s_t = s \mid \pi, \mu)$ is the probability of reaching state $s$ in $t$ steps from $\mu$ following policy $\pi$ (see Appendix A).
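To make the occupancy measure concrete, the following minimal NumPy sketch (ours, not from the paper; truncating the infinite sum at the trajectory length is an implementation assumption) draws states approximately distributed according to $d_{\mu,\pi}$ by sampling a time index with probability proportional to $(1-\gamma)\gamma^t$:

import numpy as np

rng = np.random.default_rng(0)

def occupancy_samples(states, gamma, n_samples=10_000):
    # states: the sequence s_0, s_1, ... visited along one trajectory drawn with policy pi
    # pick a time index t with probability proportional to (1 - gamma) * gamma**t,
    # then return s_t; the resulting states are (approximate) samples from d_{mu,pi}
    T = len(states)
    w = (1.0 - gamma) * gamma ** np.arange(T)
    w = w / w.sum()  # renormalize because the trajectory is truncated after T steps
    idx = rng.choice(T, size=n_samples, p=w)
    return [states[t] for t in idx]

Averaging any function of the state and action over such samples estimates its expectation under the state-occupancy measure, which is how $d_{\mu,\pi}$ is used throughout the paper.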

2.1 Actor-only policy gradient

From the previous subsection, it is possible to define the normalized expected return using two distinct formulations, one based on transition probabilities and the other on the state-occupancy measure $d_{\mu,\pi}$:

$J_\pi = (1-\gamma)\,\mathbb{E}_\pi\left[\sum_{t=0}^{\infty}\gamma^t\,\mathcal{R}(s_t,a_t)\right] = \mathbb{E}_{s\sim d_{\mu,\pi},\, a\sim\pi(\cdot|s)}\left[\mathcal{R}(s,a)\right]$.

(In our notation, the expected return as commonly defined in the RL literature is $J_\pi/(1-\gamma)$.)

For the rest of the paper, we consider parametric policies, where the policy $\pi_\theta$ is parametrized by a vector $\theta$. In this case, the goal is to find the optimal parametric policy maximizing the performance, i.e., $\theta^* \in \arg\max_\theta J_{\pi_\theta}$. For the sake of brevity, when a variable depends on the policy $\pi_\theta$, only $\theta$ is shown in subscripts, omitting the dependency on the parameters. The Policy Gradient Theorem (PGT) [27] states that, for a given policy $\pi_\theta$:

$\nabla_\theta J_\theta = \mathbb{E}_{s\sim d_{\mu,\theta},\, a\sim\pi_\theta(\cdot|s)}\left[\nabla_\theta\log\pi_\theta(a|s)\,Q_\theta(s,a)\right]$.   (3)

In the following, we omit the gradient subscript whenever clear from the context.

3 Risk Measures

This section introduces the concept of reward volatility, comparing it with the more common return variance. The latter, denoted by $\sigma^2_\pi$, is defined as:

$\sigma^2_\pi = \mathbb{E}_\pi\left[\left(\sum_{t=0}^{\infty}\gamma^t\,\mathcal{R}(s_t,a_t) - \frac{J_\pi}{1-\gamma}\right)^2\right]$.   (4)

In our case, it is useful to define the reward volatility $\nu^2_\pi$ in terms of the distribution $d_{\mu,\pi}$. As it is not possible to define the return variance in the same way, we also rewrite the reward volatility as an expected sum over trajectories. (In finance, the term "volatility" refers to a generic measure of variation, often defined as a standard deviation. In this paper, volatility is defined as a variance.)

$\nu^2_\pi = \mathbb{E}_{s\sim d_{\mu,\pi},\, a\sim\pi(\cdot|s)}\left[\left(\mathcal{R}(s,a) - J_\pi\right)^2\right]$   (5)

$\quad\; = (1-\gamma)\,\mathbb{E}_\pi\left[\sum_{t=0}^{\infty}\gamma^t\left(\mathcal{R}(s_t,a_t) - J_\pi\right)^2\right]$.   (6)

Once we have set a risk-aversion parameter $\lambda \ge 0$, the performance or objective function related to the policy $\pi$ can be defined as:

$\eta_\pi = J_\pi - \lambda\,\nu^2_\pi$,   (7)

called mean-volatility hereafter, where $\lambda$ allows us to trade off expected return maximization with risk minimization. Similarly, the mean-variance objective trades the expected return off against $\lambda\,\sigma^2_\pi$. An important result on the relationship between the two variance measures is the following:

Lemma 1

Consider the return variance $\sigma^2_\pi$ defined in Equation (4) and the reward volatility $\nu^2_\pi$ defined in Equation (6). The following inequality holds:

$\sigma^2_\pi \le \dfrac{\nu^2_\pi}{(1-\gamma)^2}$.

The proofs of this and the other formal statements can be found in Appendix A. It is important to notice that the $\frac{1}{(1-\gamma)^2}$ factor simply comes from the fact that the return variance is not normalized, unlike the reward volatility (intuitively, volatility measures risk on a shorter time scale). What is lost in the reward volatility compared to the return variance are the inter-temporal correlations between the rewards. However, Lemma 1 shows that minimizing the reward volatility keeps the return variance low. The opposite is clearly not true: a counterexample is a stock price that has the same value at the beginning and at the end of a day, but makes complex movements in between.
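To make the comparison concrete, the following NumPy sketch (the function name and the toy reward streams are ours, not taken from the paper) estimates $J$, the reward volatility of Equation (6) and the return variance of Equation (4) from sampled reward sequences, and echoes the stock-price counterexample: two deterministic reward streams with almost the same return, both with zero return variance, but with very different volatility:

import numpy as np

def mc_risk_measures(reward_trajs, gamma):
    # Monte-Carlo estimates of the normalized return J, the reward volatility
    # nu^2 (Eq. 6) and the return variance sigma^2 (Eq. 4)
    reward_trajs = [np.asarray(r, dtype=float) for r in reward_trajs]
    J = np.mean([(1 - gamma) * np.sum(gamma ** np.arange(len(r)) * r)
                 for r in reward_trajs])
    sigma2_terms, nu2_terms = [], []
    for r in reward_trajs:
        disc = gamma ** np.arange(len(r))
        G = np.sum(disc * r)                                          # discounted return
        sigma2_terms.append((G - J / (1 - gamma)) ** 2)               # Eq. (4)
        nu2_terms.append((1 - gamma) * np.sum(disc * (r - J) ** 2))   # Eq. (6)
    return J, np.mean(nu2_terms), np.mean(sigma2_terms)

# two deterministic reward streams with (almost) the same return: a constant one
# and one that oscillates between 2 and 0; with a single deterministic trajectory
# the return variance is 0, while the volatility differs sharply
gamma = 0.99
smooth = [np.full(200, 1.0)]
jumpy = [np.tile([2.0, 0.0], 100)]
for name, trajs in (("smooth", smooth), ("jumpy", jumpy)):
    J, nu2, sigma2 = mc_risk_measures(trajs, gamma)
    print(f"{name}: J={J:.3f} nu2={nu2:.3f} sigma2={sigma2:.3f} "
          f"bound nu2/(1-gamma)^2={nu2 / (1 - gamma) ** 2:.3f}")

The printed bound $\nu^2_\pi/(1-\gamma)^2$ is the right-hand side of Lemma 1 and, as expected, always dominates the estimated return variance.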


Figure 1: A deterministic MDP. The two available actions and the initial state are shown in the diagram; rewards are reported on the arrows.

To better understand the difference between the two types of variance, consider the deterministic MDP in Figure 1. First assume . Every optimal policy (thus avoiding the rewards) yields an expected return . The reward volatility of a deterministic policy that repeats the action is , while the reward volatility of repeating the action is . If we were minimizing the reward volatility, we would prefer the first policy, while we would be indifferent between the two policies based on the return variance (which is 0 in both cases). Now let us complicate the example by setting . The returns are now . As a consequence, a mean-variance objective would always choose action , since the return variance is still , while the mean-volatility objective may choose the other path, depending on the value of the risk-aversion parameter . This simple example shows how the mean-variance objective can be insensitive to short-term risk (the reward), even when the gain in expected return is very small in comparison. Instead, the mean-volatility objective correctly captures this kind of trade-off.

4 Risk-Averse Policy Gradient

In this section, we derive a policy gradient theorem for the reward volatility $\nu^2_\pi$, and we propose an unbiased gradient estimator. This will allow us to solve the optimization problem via stochastic gradient ascent. We introduce a volatility equivalent of the action-value function (Equation (1)), which is the volatility observed by starting from state $s$, taking action $a$, and following $\pi$ thereafter:

$X_\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty}\gamma^t\left(\mathcal{R}(s_t,a_t) - J_\pi\right)^2 \,\middle|\, s_0 = s,\, a_0 = a\right]$,   (8)

called the action-volatility function. Like the $Q_\pi$ function, $X_\pi$ can be written recursively by means of a Bellman equation:

$X_\pi(s,a) = \left(\mathcal{R}(s,a) - J_\pi\right)^2 + \gamma\int_{\mathcal{S}}\mathcal{P}(s'|s,a)\int_{\mathcal{A}}\pi(a'|s')\,X_\pi(s',a')\,\mathrm{d}a'\,\mathrm{d}s'$.   (9)

The state-volatility function, which is the equivalent of the $V_\pi$ function (Equation (2)), can then be defined as:

$W_\pi(s) = \int_{\mathcal{A}}\pi(a|s)\,X_\pi(s,a)\,\mathrm{d}a$.

This allows us to derive a Policy Gradient Theorem (PGT) for $\nu^2_\pi$, as done in (Sutton et al. 2000) for the expected return:

Theorem 2 (Reward Volatility PGT)

Using the definitions of the action-volatility and state-volatility functions, the reward volatility can be rewritten as:

$\nu^2_\pi = (1-\gamma)\int_{\mathcal{S}}\mu(s)\,W_\pi(s)\,\mathrm{d}s$.   (10)

Moreover, for a given policy $\pi_\theta$:

$\nabla_\theta\,\nu^2_\theta = \mathbb{E}_{s\sim d_{\mu,\theta},\, a\sim\pi_\theta(\cdot|s)}\left[\nabla_\theta\log\pi_\theta(a|s)\,X_\theta(s,a)\right]$.

The existence of a linear Bellman equation (9) and the consequent policy gradient theorem are two non-trivial theoretical advantages of the new reward-volatility approach with respect to the return variance. With a simple extension, it is possible to obtain the policy gradient theorem for the mean-volatility objective defined in Equation (7). Its action-value and state-value functions are obtained by combining the action-value functions of the expected return (1) and of the volatility (8); the same holds for the state-value functions:

$Q^\lambda_\pi(s,a) = Q_\pi(s,a) - \lambda\,X_\pi(s,a)$,   $V^\lambda_\pi(s) = V_\pi(s) - \lambda\,W_\pi(s)$.

The policy gradient theorem thus states:

$\nabla_\theta\,\eta_\theta = \mathbb{E}_{s\sim d_{\mu,\theta},\, a\sim\pi_\theta(\cdot|s)}\left[\nabla_\theta\log\pi_\theta(a|s)\left(Q_\theta(s,a) - \lambda\,X_\theta(s,a)\right)\right]$.

In the following part, we use these results to design a VOLatility-Averse Policy Gradient algorithm (VOLA-PG).

4.1 Estimating the risk-averse policy gradient

To design a practical actor-only policy gradient algorithm, the action-value function $Q_\pi$ needs to be estimated, as in [27, 16]. Similarly, we need an estimator for the action-volatility function $X_\pi$. In this approximate framework, we collect $N$ finite trajectories of length $T$ for each policy update. An unbiased estimator of $J_\pi$ can be defined as:

$\hat{J} = \dfrac{1-\gamma}{N}\sum_{n=1}^{N}\sum_{t=0}^{T-1}\gamma^t\, r^n_t$,   (11)

where the rewards collected along the $n$-th trajectory are denoted as $r^n_t$. This can be used to compute an estimator for the action-volatility function:

Lemma 3

Let $\hat{X}$ be the following estimator for the action-volatility function:

$\hat{X}(s^n_t, a^n_t) = \sum_{t'=t}^{T-1}\gamma^{t'-t}\left(r^n_{t'} - \hat{J}_{\mathcal{D}_1}\right)\left(r^n_{t'} - \hat{J}_{\mathcal{D}_2}\right)$,   (12)

where $\hat{J}_{\mathcal{D}_1}$ and $\hat{J}_{\mathcal{D}_2}$, defined as in Equation (11), are computed from two different sets of trajectories $\mathcal{D}_1$ and $\mathcal{D}_2$, and a third set of samples $\mathcal{D}_3$ is used for the rewards in Equation (12).
Then, $\hat{X}$ is unbiased.

Note that, in order to obtain an unbiased estimator for $X_\pi$, a triple-sampling procedure is needed. This may be very restrictive in practice. However, by adopting single sampling instead (i.e., using the same trajectories both for $\hat{J}$ and for the rewards), the bias introduced is equivalent to the variance of $\hat{J}$, so the estimator is still consistent. This result can be used to build a consistent estimator for the gradient $\nabla_\theta\eta_\theta$ (obtained by adopting single sampling):

$\hat{\nabla}_\theta\eta_\theta = \dfrac{1-\gamma}{N}\sum_{n=1}^{N}\sum_{t=0}^{T-1}\gamma^t\,\nabla_\theta\log\pi_\theta(a^n_t|s^n_t)\left(\hat{Q}(s^n_t,a^n_t) - \lambda\,\hat{X}(s^n_t,a^n_t)\right)$,   (13)

where $\hat{Q}(s^n_t,a^n_t) = \sum_{t'=t}^{T-1}\gamma^{t'-t}\, r^n_{t'}$. This is just a PGT [26] estimator in which the risk-adjusted reward $r_t - \lambda\,(r_t - \hat{J})^2$ takes the place of the plain reward. As shown in [16], this can be turned into a GPOMDP estimator, for which variance-minimizing baselines can be easily computed. Pseudocode for the resulting actor-only policy gradient method is reported in Algorithm 1.
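Under our reading of Lemma 3 (the exact indexing of Equation (12) is our reconstruction), the triple-sampling estimator can be sketched as follows: two disjoint trajectory sets give independent estimates of $J$, and a third set supplies the rewards whose volatility is measured:

import numpy as np

def j_hat(reward_trajs, gamma):
    # Eq. (11): normalized expected-return estimate from a set of reward sequences
    return np.mean([(1 - gamma) * np.sum(gamma ** np.arange(len(r)) * np.asarray(r))
                    for r in reward_trajs])

def x_hat(rewards, t, j1, j2, gamma):
    # action-volatility estimate at time t of one trajectory from the third set,
    # using two independent return estimates j1 and j2:
    # sum_{t' >= t} gamma^(t'-t) * (r_{t'} - j1) * (r_{t'} - j2).
    # For any fixed reward r, E[(r - j1)(r - j2)] = (r - J)^2 when j1 and j2 are
    # independent and unbiased, which is what removes the bias of single sampling.
    r = np.asarray(rewards[t:], dtype=float)
    disc = gamma ** np.arange(len(r))
    return np.sum(disc * (r - j1) * (r - j2))

# usage sketch: split the batch into three disjoint sets D1, D2, D3, compute
# j1 = j_hat(D1, gamma) and j2 = j_hat(D2, gamma), and evaluate x_hat on D3 only.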

Input: initial policy parameter $\theta_0$, batch size $N$, number of iterations $K$, learning rate $\alpha$.
for $k = 0, \dots, K-1$ do
     Collect $N$ trajectories with $\pi_{\theta_k}$ to obtain dataset $\mathcal{D}_k$
     Compute the estimates $\hat{J}$, $\hat{Q}$ and $\hat{X}$ as in Equations (11) and (12)
     Estimate the gradient $\hat{\nabla}_\theta\eta_{\theta_k}$ as in Equation (13)
     Update the policy parameters as $\theta_{k+1} = \theta_k + \alpha\,\hat{\nabla}_\theta\eta_{\theta_k}$
end for
Algorithm 1 Volatility-Averse Policy Gradient (VOLA-PG)
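The following NumPy sketch mirrors Algorithm 1 under single sampling (so the gradient estimate is consistent rather than unbiased); the trajectory container format and the grad_log_pi score-function callback are our assumptions, not an interface defined in the paper:

import numpy as np

def vola_pg_gradient(trajs, grad_log_pi, gamma, lam):
    # single-sampling estimator of grad eta = grad (J - lam * nu^2), i.e. Eq. (13)
    # with J estimated from the same batch via Eq. (11)
    n = len(trajs)
    j_hat = np.mean([(1 - gamma) * np.sum(gamma ** np.arange(len(r)) * np.asarray(r))
                     for _, _, r in trajs])
    grad = 0.0
    for states, actions, rewards in trajs:
        rewards = np.asarray(rewards, dtype=float)
        disc = gamma ** np.arange(len(rewards))
        adj = rewards - lam * (rewards - j_hat) ** 2              # risk-adjusted reward
        togo = np.flip(np.cumsum(np.flip(disc * adj))) / disc     # (Q_hat - lam * X_hat) to go
        for t, (s, a) in enumerate(zip(states, actions)):
            grad = grad + disc[t] * togo[t] * grad_log_pi(s, a)
    return (1 - gamma) * grad / n

def vola_pg(theta, collect_trajs, grad_log_pi, gamma, lam, lr, n_iterations):
    # Algorithm 1: plain stochastic gradient ascent on the mean-volatility objective
    for _ in range(n_iterations):
        trajs = collect_trajs(theta)   # list of (states, actions, rewards) tuples
        g = vola_pg_gradient(trajs, lambda s, a: grad_log_pi(theta, s, a), gamma, lam)
        theta = theta + lr * g
    return theta

Replacing adj with the plain rewards recovers the standard risk-neutral PGT estimator, which is the sense in which VOLA-PG is a drop-in, risk-averse variant.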

5 Trust Region Volatility Optimization

In this section, we go beyond the standard policy gradient theorem and show that it is possible to guarantee a monotonic improvement of the mean-volatility performance measure (7) at each policy update. Safe (in the sense of non-worsening) updates are of fundamental importance when learning online on a real system, such as when controlling a robot or trading in the financial markets, but they also give useful results during offline training, providing an automatic tuning of the learning rate. While the mean-volatility objective ensures a risk-averse behavior of the policy, the safe update ensures a risk-averse update of the parameters of the policy. Thus, if we care about the agent's performance during the learning process, we must pay attention to the step size of each parameter update. Adapting the approach of [22] to our mean-volatility objective, we show it is possible to obtain a learning rate that guarantees that the performance of the updated policy is bounded with respect to the previous policy. The safe update is based on the advantage function, defined as the difference between the action-value and state-value functions. Because of the linearity of the Bellman equations, we can extend this definition to obtain the mean-volatility advantage function:

$A^\lambda_\pi(s,a) = Q^\lambda_\pi(s,a) - V^\lambda_\pi(s) = \left(Q_\pi(s,a) - \lambda\,X_\pi(s,a)\right) - \left(V_\pi(s) - \lambda\,W_\pi(s)\right)$.   (14)

Furthermore, with the mean-volatility objective, all the theoretical results leading to the TRPO algorithm hold. The main ones are reported in this section, while the full derivations can be found in Appendix A. Lemma 4 extends Lemma 6.1 in [8] to the mean-volatility setting (the only slight difference is in the normalization term, which follows from our choice of definitions):

Lemma 4 (Performance Difference Lemma)

The difference between the performances of two policies $\pi'$ and $\pi$ can be expressed in terms of the expected advantage:

$\eta_{\pi'} - \eta_\pi = \mathbb{E}_{s\sim d_{\mu,\pi'},\, a\sim\pi'(\cdot|s)}\left[A^\lambda_\pi(s,a)\right] + \lambda\left(J_{\pi'} - J_\pi\right)^2$.   (15)

This result is very interesting, since the last term adds a gain related to the square of the difference between the expected returns of the two policies; therefore, there is a bonus whenever the expected return of the second policy is either higher or lower than that of the first. However, this term is difficult to handle, and we can bound the performance difference by neglecting it; in this case the bound becomes:

$\eta_{\pi'} - \eta_\pi \ge \mathbb{E}_{s\sim d_{\mu,\pi'},\, a\sim\pi'(\cdot|s)}\left[A^\lambda_\pi(s,a)\right]$.   (16)

This is the same result that one would obtain by considering a transformation of the MDP with a new, policy-dependent reward $\mathcal{R}(s,a) - \lambda\left(\mathcal{R}(s,a) - J_\pi\right)^2$. In any case, the reader must be careful not to underestimate the choice of adopting a reward of this kind since, in general, for policy-dependent rewards the PGT in Equation (3) and the performance bound in Equation (16) do not hold without proper assumptions.

Following the approach proposed in  [22], it is possible to adopt an approximation of the surrogate function, which provides monotonic improvement guarantees by considering the KL divergence between the policies:

Theorem 5

Consider the following approximation of $\eta_{\pi'}$, obtained by replacing the state-occupancy density of the new policy $\pi'$ with that of the old policy $\pi$:

$L_\pi(\pi') = \eta_\pi + \mathbb{E}_{s\sim d_{\mu,\pi},\, a\sim\pi'(\cdot|s)}\left[A^\lambda_\pi(s,a)\right]$.   (17)

Let $\epsilon = \max_{s,a}\left|A^\lambda_\pi(s,a)\right|$.

Then, the performance of $\pi'$ can be bounded as follows (comparing this bound with the results shown in [22], the denominator term is not squared due to return normalization):

$\eta_{\pi'} \ge L_\pi(\pi') - \dfrac{4\,\epsilon\,\gamma}{1-\gamma}\, D^{\max}_{KL}\!\left(\pi, \pi'\right)$.   (18)

As a consequence, we can devise a trust-region variant of VOLA-PG, called TRVO (Trust Region Volatility Optimization) as outlined in Algorithm 2. Note that this would not have been possible without the linear Bellman equations, which are not available for more common risk measures. For the practical implementation, we follow [22].

Input: initial policy parameter $\theta_0$, batch size $N$, number of iterations $K$, discount factor $\gamma$, KL threshold $\delta$.
for $k = 0, \dots, K-1$ do
     Collect $N$ trajectories with $\pi_{\theta_k}$ to obtain dataset $\mathcal{D}_k$
     Compute the estimates $\hat{J}$, $\hat{Q}$ and $\hat{X}$ as in Equations (11) and (12)
     Estimate the advantage values $\hat{A}^\lambda_{\theta_k}(s^n_t, a^n_t)$
     Solve the constrained optimization problem $\theta_{k+1} \in \arg\max_\theta L_{\theta_k}(\theta)$ s.t. $\bar{D}_{KL}(\pi_{\theta_k}, \pi_\theta) \le \delta$
end for
Algorithm 2 Trust Region Volatility Optimization (TRVO)
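A compact way to sketch TRVO (ours; the authors' implementation follows [22] with the mean-volatility advantages, and the trpo_update callback below is a placeholder for any off-the-shelf KL-constrained TRPO step, not an API from the paper) is to re-estimate $\hat{J}$ on each batch, rewrite the rewards as $r - \lambda\,(r - \hat{J})^2$, and delegate the constrained surrogate maximization to the standard machinery:

import numpy as np

def risk_adjusted_rewards(reward_trajs, gamma, lam):
    # rewrite each reward as r - lam * (r - J_hat)^2, with J_hat the normalized
    # expected-return estimate (Eq. (11)) computed on the current batch
    j_hat = np.mean([(1 - gamma) * np.sum(gamma ** np.arange(len(r)) * np.asarray(r))
                     for r in reward_trajs])
    return [np.asarray(r, dtype=float) - lam * (np.asarray(r, dtype=float) - j_hat) ** 2
            for r in reward_trajs]

def trvo(policy, collect_trajs, trpo_update, gamma, lam, n_iterations):
    # Algorithm 2 skeleton: the advantages of the adjusted reward play the role of
    # A^lambda in the surrogate (17); the KL-constrained step (maximize the surrogate
    # subject to an average KL bound) is delegated to a TRPO-style update
    for _ in range(n_iterations):
        states, actions, rewards = collect_trajs(policy)   # per-trajectory lists
        adj_rewards = risk_adjusted_rewards(rewards, gamma, lam)
        policy = trpo_update(policy, states, actions, adj_rewards)
    return policy

As the discussion after Equation (16) warns, this reward is policy-dependent, so $\hat{J}$ must be re-estimated from the current batch at every iteration rather than fixed once.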

6 Related Works

Figure 2: Portfolio Management: comparison of the expected return, reward volatility and return variance obtained with four different learning algorithms: TRVO, TRPO-exp, Vola-PG and Mean-Variance.

Two streams of RL literature have been merged together in this paper (for the first time, to the best of our knowledge): risk-averse objective functions and safe policy updates. As stated in [3, 4], these two themes are related as they both reduce risk. The first is the reduction of inherent risk, which is generated by the stochastic nature of the environment, while the second is the reduction of model risk, which is related to the imperfect knowledge of the model parameters.

Several risk criteria have been taken into consideration in the literature, such as coherent risk measures [28], utility functions [24] and the return variance [30] (defined in Equation (4)); the latter is the most studied and the closest to reward volatility. Typically, the risk-averse Bellman equation for the return variance defined by Sobel [25] is considered. Unlike Equation (9), Sobel's equation does not satisfy the monotonicity property of dynamic programming. This has led to a variety of complex approaches to return-variance minimization [3, 30, 21, 20]. The popular mean-variance objective is similar to the one considered in this paper (Equation (7)), with the return variance instead of the reward volatility. Actor-critic algorithms that consider this objective function are presented in [30, 20, 21], the latter also proposing an actor-only policy gradient. Sometimes, the objective is slightly modified by maximizing the expected value of the return while constraining the variance, such as in [3]. The Sharpe ratio is an important financial index, which describes how much excess return is received for the extra volatility endured by holding a riskier asset. This ratio is used in [3], where an actor-only policy gradient is proposed. In [10] and [11], different approaches are used which do not need a Bellman equation for the return variance. In [10], one of the first applications of RL to trading, the authors consider the differential Sharpe ratio (a first-order approximation of the Sharpe ratio), bypassing the direct variance calculation and proposing a policy gradient algorithm. In [11], the problem is tackled by approximating the distribution of the possible returns, deriving a distributional Bellman equation and minimizing the CVaR. Actor-critic algorithms for maximizing the expected return under CVaR constraints are proposed in [1].
Even if a Bellman equation can be derived for some of these measures, such recursive relationships are always non-linear. This prevents the extension of the safe guarantees (Theorem 5) and of the TRPO algorithm to risk measures other than reward volatility.

The second literature stream is dedicated to the safe update, which, until now, has only been defined for the standard risk-neutral objective function. The seminal paper for this setting is [8], which proposes a conservative policy iteration algorithm with monotonic improvement guarantees for mixtures of greedy policies. This approach is generalized to stationary policies in [19] and to stochastic policies in [22]. Building on the former, monotonically improving policy gradient algorithms are devised in [17, 14] for Gaussian policies. Similar algorithms are developed for Lipschitz policies [18], under the assumption of Lipschitz-continuous reward and transition functions, and, more recently, for smoothing policies [15] on general MDPs (a safe version of VOLA-PG based on this line of work is presented in Appendix B as a more rigorous, albeit less practical, alternative to TRVO). On the other hand, [22] propose TRPO, a general policy gradient algorithm inspired by the monotonically-improving policy iteration strategy. Although the theoretical guarantees are formally lost due to a series of approximations, TRPO has enjoyed great empirical success in recent years, especially in combination with deep policies. Moreover, the exact version of TRPO has been shown to converge to the optimal policy [12]. Proximal Policy Optimization [23] is a further approximation of TRPO, which has been used to tackle complex, large-scale control problems [13, 7].

7 Experiments

In this section, we show an empirical analysis of the performance of VOLA-PG (Algorithm 1) and TRVO (Algorithm 2), applied to a simplified portfolio-management task taken from [3] and to a trading task. We compare these results with two other approaches: a mean-variance policy gradient approach presented in the same article, which we adjusted to take discounting into account, and a risk-averse transformation of the rewards inside the TRPO algorithm. Indeed, a possible way to control the risk of the immediate reward is to consider its expected utility, converting the immediate reward into its exponential utility, with a parameter controlling the sensitivity to the risk. This transformation creates a strong parallelism with the literature on the variance of the return: as the maximization of the exponential utility function applied to the return is approximately equal to the optimization of the mean-variance objective used in [30], the suggested reward transformation is similarly related to our mean-volatility objective (in a loose approximation, as shown in Appendix C). By applying such a transformation, it could be possible to obtain risk-averse policies similar to ours while running standard return-optimizing algorithms. However, the aforementioned approximation is sound only for small values of the risk-aversion coefficient, and it suffers the same limitations as any expected-utility approach, as shown for example in [5]. Moreover, the possibility for the agent to receive negative rewards may cause strong instability in the learning process. We use this method as a baseline in the following experiments, under the name TRPO-exp.
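For reference, a minimal sketch of the TRPO-exp reward transformation as we read it, assuming the standard exponential-utility form $(1-e^{-c\,r})/c$ with our own parameter name c (the exact normalization used in the paper is detailed in Appendix C and may differ):

import numpy as np

def exp_utility_reward(r, c):
    # exponential-utility reshaping of the step reward (our reading of TRPO-exp):
    # for small c, (1 - exp(-c * r)) / c ~ r - (c / 2) * r**2, a concave, risk-averse
    # transformation; for strongly negative r the exponential blows up, which is
    # consistent with the instability of TRPO-exp reported in Section 7.2
    return (1.0 - np.exp(-c * np.asarray(r, dtype=float))) / c

The transformed rewards are then fed to an unmodified TRPO, which is why TRPO-exp serves as a natural baseline.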

7.1 Portfolio Management

Setting

The portfolio domain considered is composed of a liquid and a non-liquid asset. The liquid asset has a fixed interest rate , while the non-liquid asset has a stochastic interest rate that switches between two values, and , with probability . If the non-liquid asset does not default (a default can happen with probability ), it is sold when it reaches maturity after a fixed period of time steps. At , the portfolio is composed only of the liquid asset. At every time step, the investor can choose to buy an amount of non-liquid assets (with a maximum of assets), each one with a fixed cost . Let us denote the state at time t as , where is the allocation in the liquid asset, are the allocations in the non-liquid assets (with time to maturity respectively equal to time-steps), and . When the non-liquid asset is sold, the gain is added to the liquid asset. The reward at time is computed (unlike [3]) as the liquid P&L, i.e., the difference between the liquid assets at time and . Further details on this domain are specified in the Appendix D.

Results

Figure 2 shows how the optimal solutions of the two algorithms trade off the maximization of the expected return against the minimization of one of the two forms of variability as the risk-aversion parameter $\lambda$ changes. The plot on the right shows that the mean-variance frontier obtained with Algorithm 1 almost coincides with the frontier obtained using the algorithm proposed in [3]. This means that, at least in this domain, the return variance is equally reduced by optimizing either the mean-volatility or the mean-variance objective. Instead, from the plot on the left, we can notice that the reward volatility is better optimized by VOLA-PG. These results are consistent with Lemma 1.


Figure 3: Trading Framework: comparison of the expected return and reward volatility obtained with three different learning algorithms: TRVO, TRPO-exp and Vola-PG.

7.2 Trading Framework

The second environment is a realistic setting: trading on the S&P 500 index. The possible actions are buy, sell, and stay flat (we assume that short selling is possible, i.e., the agent can sell the stock without owning it, betting that its price is going to decrease). Transaction costs are also considered, i.e., a fee is paid whenever the position changes. The data used is the daily price of the S&P 500 from the 1980s until 2019. The state consists of the last 10 days of percentage price changes, with the addition of the previous portfolio position and the fraction of the episode left to go; episodes are 50 days long.

Results

In Figure 3 we can see that the frontier generated by TRVO dominates the one of the naive approach (TRPO-exp). The same figure also shows VOLA-PG trained with the same number of iterations as TRVO: the obtained frontier is entirely dominated, since the algorithm has not reached convergence. This reflects the faster convergence of TRPO with respect to REINFORCE. TRPO-exp becomes unstable when its risk-aversion parameter is raised too much, so it is not possible to find the value for which risk aversion is maximal; this is why there are no points in the bottom-left part of the plot.

8 Conclusions

Throughout this paper, we propose a combination of methodologies to control risk in an RL framework. We define a risk measure called reward volatility that captures the variability of the rewards between steps. Optimizing this measure yields smoother trajectories that avoid shocks, a fundamental feature in risk-averse contexts. This measure bounds the variance of the returns, and it is possible to derive a linear Bellman equation for it (unlike other risk measures). We propose a policy gradient algorithm that optimizes the mean-volatility objective and, thanks to the aforementioned linearity, we derive a trust-region update that ensures a monotonic improvement of our objective at each gradient update, a fundamental characteristic for online algorithms in a real-world context. The proposed algorithms are tested on two financial environments: TRVO outperforms the naive approach on the trading framework and reaches convergence faster than its policy gradient counterpart. Future developments could consist in testing the algorithm in a real online financial setting, where new data points do not belong to the training distribution. To conclude, the framework is the first to take into account two kinds of safety, as it is capable of keeping risk under control while maintaining the same training and convergence properties as state-of-the-art risk-neutral approaches.

References

  • [1] Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone (2017) Risk-constrained reinforcement learning with percentile risk criteria. JMLR 18 (1), pp. 6070–6120. Cited by: §6.
  • [2] M. P. Deisenroth, G. Neumann, J. Peters, et al. (2013) A survey on policy search for robotics. Foundations and Trends® in Robotics 2 (1–2), pp. 1–142. Cited by: §1.
  • [3] D. Di Castro, A. Tamar, and S. Mannor (2012-06) Policy gradients with variance related risk criteria. ICML 1, pp. . Cited by: §6, §6, §7.1, §7.1, §7.
  • [4] J. García and F. Fernández (2015) A comprehensive survey on safe reinforcement learning. JMLR 16, pp. 1437–1480. External Links: Link Cited by: §6.
  • [5] A.A. Gosavi et al. (2014-12) Beyond exponential utility functions: a variance-adjusted approach for risk-averse reinforcement learning. In 2014 ADPRL, Vol. , pp. 1–8. External Links: Document, ISSN 2325-1824 Cited by: §7.
  • [6] W. Härdle and L. Simar (2012) Applied multivariate statistical analysis. Vol. 22007, Springer. Cited by: Appendix B.
  • [7] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. M. A. Eslami, M. A. Riedmiller, and D. Silver (2017) Emergence of locomotion behaviours in rich environments. CoRR abs/1707.02286. Cited by: §1, §6.
  • [8] S. Kakade and J. Langford (2002) Approximately optimal approximate reinforcement learning. In ICML, Vol. 2, pp. 267–274. Cited by: §A.3, §5, §6.
  • [9] T. M. Moldovan and P. Abbeel (2012) Risk aversion in Markov decision processes via near optimal Chernoff bounds. In NeurIPS, pp. 3131–3139. Cited by: §1.
  • [10] J. Moody and M. Saffell (2001) Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks 12 (4), pp. 875–889. Cited by: §1, §6.
  • [11] T. Morimura, M. Sugiyama, H. Kashima, H. Hachiya, and T. Tanaka (2010) Nonparametric return distribution approximation for reinforcement learning. In ICML, Cited by: §6.
  • [12] G. Neu, A. Jonsson, and V. Gómez (2017) A unified view of entropy-regularized markov decision processes. CoRR abs/1705.07798. Cited by: §6.
  • [13] OpenAI (2018) OpenAI five. Note: https://blog.openai.com/openai-five/ Cited by: §1, §6.
  • [14] M. Papini, M. Pirotta, and M. Restelli (2017) Adaptive batch size for safe policy gradients. In NeurIPS, pp. 3591–3600. Cited by: §6.
  • [15] M. Papini, M. Pirotta, and M. Restelli (2019) Smoothing policies and safe policy gradients. External Links: 1905.03231 Cited by: §A.2, Appendix B, Appendix B, Appendix B, Appendix B, Appendix B, Appendix B, §6.
  • [16] J. Peters and S. Schaal (2008) Reinforcement learning of motor skills with policy gradients. Neural networks 21 (4), pp. 682–697. Cited by: §4.1, §4.1.
  • [17] M. Pirotta, M. Restelli, and L. Bascetta (2013) Adaptive step-size for policy gradient methods. In NeurIPS 26, pp. 1394–1402. Cited by: §6.
  • [18] M. Pirotta, M. Restelli, and L. Bascetta (2015) Policy gradient in lipschitz markov decision processes. Machine Learning 100 (2-3), pp. 255–283. Cited by: §6.
  • [19] M. Pirotta, M. Restelli, A. Pecorino, and D. Calandriello (2013) Safe policy iteration. In ICML, pp. 307–315. Cited by: §6.
  • [20] L. A. Prashanth and M. Ghavamzadeh (2014) Actor-critic algorithms for risk-sensitive reinforcement learning. arXiv preprint arXiv:1403.6530. Cited by: §6.
  • [21] L. A. Prashanth and M. Ghavamzadeh (2014) Variance-constrained actor-critic algorithms for discounted and average reward mdps. CoRR abs/1403.6530. External Links: Link, 1403.6530 Cited by: §1, §6.
  • [22] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz (2015) Trust region policy optimization. In ICML, pp. 1889–1897. Cited by: §1, §5, §5, §5, §6, footnote 6.
  • [23] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. CoRR abs/1707.06347. Cited by: §1, §6.
  • [24] Y. Shen, R. Huang, C. Yan, and K. Obermayer (2014-03) Risk-averse reinforcement learning for algorithmic trading. In 2014 IEEE Conference on Computational Intelligence for Financial Engineering Economics (CIFEr), Vol. , pp. 391–398. External Links: Document, ISSN 2380-8454 Cited by: §1, §6.
  • [25] M. J. Sobel (1982) The variance of discounted Markov decision processes. Journal of Applied Probability 19 (4), pp. 794–802. Cited by: §6.
  • [26] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In NeurIPS, pp. 1057–1063. Cited by: §A.2, §4.1.
  • [27] R. S. Sutton and A. G. Barto (1998) Introduction to reinforcement learning. 1st edition, MIT Press, Cambridge, MA, USA. External Links: ISBN 0262193981 Cited by: §1, §2.1, §4.1.
  • [28] A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor (2015) Policy Gradient for Coherent Risk Measures. CoRR, pp. 9 (en). Cited by: §6.
  • [29] A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor (2017-07) Sequential Decision Making With Coherent Risk. IEEE Transactions on Automatic Control 62 (7), pp. 3323–3338 (en). External Links: ISSN 0018-9286, 1558-2523, Link, Document Cited by: §1.
  • [30] A. Tamar and S. Mannor (2013) Variance adjusted actor critic algorithms. arXiv preprint arXiv:1310.3697. Cited by: §1, §6, §7.

Appendix A Proofs

First, let us define the state-occupancy measure more formally. The $t$-step marginal state transition density under policy $\pi$ is defined recursively as follows:

$p^1_\pi(s'|s) = \int_{\mathcal{A}}\pi(a|s)\,\mathcal{P}(s'|s,a)\,\mathrm{d}a$,   $p^{t+1}_\pi(s'|s) = \int_{\mathcal{S}} p^t_\pi(s''|s)\, p^1_\pi(s'|s'')\,\mathrm{d}s''$.

This allows us to define the following state-occupancy densities:

$d_\pi(s'|s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t\, p^t_\pi(s'|s)$,   $d_{\mu,\pi}(s) = \int_{\mathcal{S}}\mu(s_0)\, d_\pi(s|s_0)\,\mathrm{d}s_0$,

measuring the discounted probability of visiting a state starting from another state or from the initial-state distribution, respectively.

A.1 Variance inequality

Lemma 1 (restated)

Proof. Taking the left-hand side (Equation 4) and expanding the square, we obtain (to shorten the notation, is used instead of ):

Similarly, for the right hand side of the inequality:

Thus, the inequality we want to prove reduces to:

Consider the left hand side. By the Cauchy-Schwarz inequality, it reduces to:

 

A.2 Risk-averse policy gradient theorem

Theorem 2 (restated)

Proof 
First, we need the following property (see e.g., Lemma 1 in [15]):

Lemma 6

Any integrable function $f_\pi$ that can be recursively defined as:

$f_\pi(s,a) = g(s,a) + \gamma\int_{\mathcal{S}}\mathcal{P}(s'|s,a)\int_{\mathcal{A}}\pi(a'|s')\,f_\pi(s',a')\,\mathrm{d}a'\,\mathrm{d}s'$,

where $g$ is any integrable function, is equal to:

$f_\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty}\gamma^t\, g(s_t,a_t)\,\middle|\, s_0 = s,\, a_0 = a\right]$.

From Equation (9) and the definition of , we have (to simplify the notation, the dependency on is left implicit):

For the second part, we follow a similar argument as in [26]. We first consider the gradient of and :

where the last equality is from Lemma 6. Finally:

 

A.3 Trust Region Volatility Optimization

Lemma 4 (restated)

Proof. For the sake of clarity, we use a dedicated notation to denote the expectation over trajectories.

Now, the goal is to obtain the discounted sum of the mean-volatility advantages defined in Equation (14); however, it must be evaluated using the new policy instead of the old one. Hence, we recall the result in [8] (the difference is only in the normalization terms, which are used consistently with the definitions above):