1 Introduction
Control is one of the key problems of Reinforcement Learning (RL, Sutton and Barto 2018), in which we seek a policy that maximizes certain performance metrics. The performance metric is usually the expectation of some random variable, for example, the expected episodic return (Puterman, 2014; Sutton and Barto, 2018). Although this paradigm has enjoyed great success in various domains (Mnih et al., 2015; Silver et al., 2016; Vinyals et al., 2019), we sometimes want to minimize certain risk measures of that random variable while maximizing its expectation. For example, a portfolio manager is usually willing to sacrifice some return of a portfolio to lower its risk. Risk-averse RL is a framework for studying such problems and has broad applications (Wang, 2000; Parker, 2009; Lai et al., 2011; Matthaeia et al., 2015; Majumdar and Pavone, 2020). Many risk measures have been applied to the episodic return random variable to control risk, for example, variance (Sobel, 1982; Mannor and Tsitsiklis, 2011; Tamar et al., 2012; Prashanth and Ghavamzadeh, 2013; Xie et al., 2018), value at risk (VaR, Chow et al. 2018), and conditional value at risk (CVaR, Chow and Ghavamzadeh 2014; Tamar et al. 2015; Chow et al. 2018). In this paper, we focus mainly on variance, given its advantages in interpretability and computation (Markowitz and Todd, 2000; Li and Ng, 2000).
When the primary performance metric is the expectation of the episodic return random variable, it is natural to use the variance of the episodic return random variable as a risk measure. We, however, are not obligated to do so. In this paper, we design a new random variable, the per-step reward, and use its variance for risk-averse RL. The expectation of the per-step reward matches the expectation of the episodic return up to a constant multiplier. Furthermore, we prove that the variance of the per-step reward bounds the variance of the episodic return from above, indicating that minimizing the variance of the per-step reward implicitly minimizes the variance of the episodic return.
Considering the variance of the per-step reward, we derive the mean-variance policy iteration (MVPI) framework for risk-averse RL, with the help of cyclic coordinate ascent (Luenberger and Ye, 1984) and the Fenchel duality. MVPI is flexible in that all existing policy evaluation methods and risk-neutral control methods can be dropped in to obtain risk-averse control off the shelf, in both on-policy and off-policy settings. This flexibility offers two significant benefits: (1) It enables risk-averse RL to scale up easily to challenging domains with neural network function approximation. We propose risk-averse Proximal Policy Optimization (PPO, Schulman et al. 2017) as an instance of MVPI, which outperforms PPO in many Mujoco robot simulation domains. By contrast, previous risk-averse control methods that optimize the variance of the episodic return (Tamar et al., 2012; Prashanth and Ghavamzadeh, 2013; Xie et al., 2018) cannot be easily combined with advanced policy optimization techniques like PPO due to their reliance on the squared episodic return. As shown in our empirical study, the methods of Tamar et al. (2012); Prashanth and Ghavamzadeh (2013); Xie et al. (2018) suffer from poor performance in most Mujoco domains with neural network function approximation. (2) It enables off-policy risk-averse learning, which was difficult to achieve previously. For example, enabling off-policy learning for the methods of Tamar et al. (2012); Prashanth and Ghavamzadeh (2013); Xie et al. (2018) usually involves products of importance sampling ratios to reweight the squared episodic return, which suffer from high variance (Precup et al., 2001; Liu et al., 2018) and are compatible only with the setting where we have a single known behavior policy.
By contrast, MVPI can leverage recent advances in density ratio learning (Hallak and Mannor, 2017; Gelada and Bellemare, 2019; Liu et al., 2018; Nachum et al., 2019a; Zhang et al., 2020a, b), which significantly reduces the variance of off-policy learning and is compatible with the behavior-agnostic off-policy learning setting (Nachum et al., 2019a), where we may have multiple unknown behavior policies.
2 Background
We consider an infinite-horizon MDP with a state space $\mathcal{S}$, an action space $\mathcal{A}$, a bounded reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, a transition kernel $p$, an initial distribution $\mu_0$, and a discount factor $\gamma \in (0, 1]$. The initial state $S_0$ is sampled from $\mu_0$. At time step $t$, an agent takes an action $A_t$ according to $\pi(\cdot | S_t)$, where $\pi$ is the policy followed by the agent. The agent then gets a reward $R_{t+1} \doteq r(S_t, A_t)$ and proceeds to the next state $S_{t+1}$ according to $p(\cdot | S_t, A_t)$. In this paper, we consider a deterministic reward setting for the ease of presentation, following Chow (2017); Xie et al. (2018). The return at time step $t$ is defined as $G_t \doteq \sum_{i=0}^{\infty} \gamma^i R_{t+i+1}$. When $\gamma < 1$, $G_t$ is always well defined. When $\gamma = 1$, to ensure $G_t$ remains well defined, it is usually assumed that all policies are proper (Bertsekas and Tsitsiklis, 1996), i.e., for any policy $\pi$, the chain induced by $\pi$ has some absorbing states, one of which the agent will eventually reach with probability 1. Furthermore, the rewards are always 0 thereafter. For any $\pi$, $G \doteq G_0$ is the random variable indicating the episodic return, and we use its expectation
$J(\pi) \doteq \mathbb{E}[G] \qquad (1)$
as our performance metric. In particular, when $\gamma = 1$, we can express $J(\pi)$ as $\mathbb{E}[\sum_{t=0}^{T-1} R_{t+1}]$, where $T$ is a random variable indicating the first time the agent reaches an absorbing state. For any $\gamma$, the state value function and the state-action value function are defined as $v_\pi(s) \doteq \mathbb{E}[G_t \mid S_t = s]$ and $q_\pi(s, a) \doteq \mathbb{E}[G_t \mid S_t = s, A_t = a]$ respectively.
Mean-Variance RL. As $G$ is a random variable, we sometimes want to control its variance while maximizing its expectation (Prashanth and Ghavamzadeh, 2013; Tamar et al., 2012; Xie et al., 2018), which is usually referred to as mean-variance RL. Namely, we consider the following problem:
$\max_\theta \mathbb{E}[G] \quad \text{s.t.} \quad \mathbb{V}(G) \leq \xi, \qquad (2)$
where $\mathbb{V}(\cdot)$ indicates the variance of a random variable, $\xi$ indicates the user's tolerance for variance, and $\pi$ is parameterized by $\theta$. We use $\pi$ and $\pi_\theta$ interchangeably in the rest of the paper.
Prashanth and Ghavamzadeh (2013) consider the setting $\gamma = 1$. To solve (2), they use a Lagrangian relaxation procedure to convert it into an unconstrained saddle-point problem:
$\max_\theta \min_{\lambda \geq 0} L(\theta, \lambda) \doteq \mathbb{E}[G] - \lambda (\mathbb{V}(G) - \xi), \qquad (3)$
where $\lambda$ is the dual variable. Prashanth and Ghavamzadeh (2013) use stochastic gradient descent to find the saddle point of $L(\theta, \lambda)$. To estimate $\nabla_\theta \mathbb{V}(G)$, they propose two simultaneous perturbation methods: simultaneous perturbation stochastic approximation and smoothed functional (Bhatnagar et al., 2013), yielding a three-timescale algorithm. Empirical success is observed in a simple traffic control MDP. Tamar et al. (2012) consider the setting $\gamma = 1$. Instead of using the saddle-point formulation (3), they consider the following unconstrained problem:
$\max_\theta \mathbb{E}[G] - \mu g(\mathbb{V}(G) - \xi), \qquad (4)$
where $\mu$ is a hyperparameter to be tuned and $g$ is a penalty function, which they define as $g(x) \doteq (\max(0, x))^2$. The analytical expression of $\nabla_\theta \mathbb{V}(G)$ they provide involves a term $\mathbb{E}[G] \nabla_\theta \mathbb{E}[G]$, leading to a double sampling issue. To address this, Tamar et al. (2012) consider a two-timescale algorithm and keep running estimates for $\mathbb{E}[G]$ and $\mathbb{V}(G)$ in a faster timescale, yielding an episodic algorithm. Given the $k$-th episode $\tau_k \doteq (s_0^k, a_0^k, r_1^k, \dots)$, they propose the following updates:
$J_{k+1} \doteq J_k + \alpha_k (G_k - J_k), \quad V_{k+1} \doteq V_k + \alpha_k (G_k^2 - J_k^2 - V_k), \qquad (5)$
$\theta_{k+1} \doteq \theta_k + \beta_k \big( G_k - \mu g'(V_k - \xi)(G_k^2 - 2 J_k G_k) \big) \nabla_\theta \log p_\theta(\tau_k), \qquad (6)$
where $G_k \doteq \sum_t r_t^k$ is the return of the episode; $J_k$ and $V_k$ are running estimates for $\mathbb{E}[G]$ and $\mathbb{V}(G)$; $\alpha_k$ and $\beta_k$ are learning rates; and $\nabla_\theta \log p_\theta(\tau_k) = \sum_t \nabla_\theta \log \pi_\theta(a_t^k | s_t^k)$ is the likelihood score of the episode. Empirical success is observed in a simple portfolio management MDP.
Xie et al. (2018) consider the setting $\gamma = 1$ and set $g$ in (4) to the identity function. To address the double sampling issue, they exploit the Fenchel duality $x^2 = \max_y (2xy - y^2)$ and transform (4) into an equivalent problem:
$\max_{\theta, y} L(\theta, y) \doteq \mathbb{E}\big[(1 + 2\mu y) G - \mu G^2\big] - \mu y^2, \qquad (7)$
where $y$ is the dual variable. Xie et al. (2018) use stochastic coordinate ascent to solve (7), which updates $y$ and $\theta$ alternately. Given the $k$-th episode $\tau_k \doteq (s_0^k, a_0^k, r_1^k, \dots)$, they propose the following updates:
$y_{k+1} \doteq y_k + \alpha_k \, 2\mu (G_k - y_k), \quad \theta_{k+1} \doteq \theta_k + \beta_k \big( (1 + 2\mu y_{k+1}) G_k - \mu G_k^2 \big) \nabla_\theta \log p_\theta(\tau_k). \qquad (8)$
We remark: (1) $\xi$ does not matter in Xie et al. (2018) as $g$ is the identity function. (2) $\mu$ can also be regarded as the dual variable in Tamar et al. (2012); Xie et al. (2018). Consequently, we can use gradient descent to optimize it. (3) Tamar et al. (2012); Xie et al. (2018) can also cope with the setting $\gamma < 1$, if all policies are proper. Otherwise, it is infeasible to compute $G_k$.
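As a sanity check on the Fenchel-duality trick above, the following sketch (illustrative only; the scalar and the Gaussian returns are made up for the example) verifies numerically that $x^2 = \max_y (2xy - y^2)$, and that the dual form $2y\mathbb{E}[G] - y^2$ admits an unbiased single-sample estimate, which $\mathbb{E}[G]^2$ itself does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# The Fenchel duality used by Xie et al. (2018): for any scalar x,
#   x**2 == max_y (2*x*y - y**2),  with maximizer y* = x.
x = 1.7
ys = np.linspace(-5, 5, 10001)
assert np.isclose(np.max(2 * x * ys - ys**2), x**2, atol=1e-4)

# Why this helps: E[G]**2 cannot be estimated unbiasedly from a single
# sampled return G (since E[G**2] != E[G]**2), but 2*y*G - y**2 can,
# for any fixed dual variable y. A toy check with G ~ N(2, 1):
G = rng.normal(2.0, 1.0, size=200_000)
y = G.mean()                      # inner maximization: y* = E[G]
dual_estimate = (2 * y * G - y**2).mean()
assert abs(dual_estimate - 2.0**2) < 0.05
```

In the actual algorithms, $y$ is updated on a slower timescale from sampled returns, while the $2yG - y^2$ term enters the policy gradient as an ordinary reward-like quantity.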
3 The Per-Step Reward Perspective
Although in many problems our goal is to maximize the undiscounted expected episodic return, practitioners often find that optimizing the discounted objective ($\gamma < 1$) as a proxy for the undiscounted objective ($\gamma = 1$) is better than optimizing the undiscounted objective directly, especially when deep neural networks are used as function approximators (Mnih et al., 2015; Lillicrap et al., 2015; Espeholt et al., 2018; Xu et al., 2018; Van Seijen et al., 2019). We, therefore, focus on the discounted setting in this paper.
It is well known that the expected discounted episodic return can be expressed as
$J(\pi) = \tfrac{1}{1 - \gamma} \sum_{s, a} d_\pi(s, a) r(s, a), \qquad (9)$
where $d_\pi$ is the normalized discounted state-action distribution:
$d_\pi(s, a) \doteq (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(S_t = s, A_t = a). \qquad (10)$
We now formally define the per-step reward random variable $\hat{R}$. Let $\Omega \doteq \mathcal{S} \times \mathcal{A}$ and $\mathcal{F}$ be the corresponding Borel algebra. We define a probability measure $P$ such that $P(\{(s, a)\}) \doteq d_\pi(s, a)$ for any $(s, a)$. Then $(\Omega, \mathcal{F}, P)$ forms a probability space, and we define the random variable $\hat{R}$, a deterministic mapping from $\Omega$ to $\mathbb{R}$, as $\hat{R}(s, a) \doteq r(s, a)$. Intuitively, $\hat{R}$ is a discrete random variable taking values in the image of $r$ with a probability mass function $\Pr(\hat{R} = x) = \sum_{s, a} d_\pi(s, a) \mathbb{I}(r(s, a) = x)$, where $\mathbb{I}$ is the indicator function. It follows that
$\mathbb{E}[\hat{R}] = \sum_{s, a} d_\pi(s, a) r(s, a) = (1 - \gamma) J(\pi), \qquad (11)$
and remarkably, we have
Theorem 1. $\mathbb{V}(\hat{R}) \geq (1 - \gamma)^2 \, \mathbb{V}(G)$.
Proof.
As $\mathbb{E}[\hat{R}] = (1 - \gamma) \mathbb{E}[G]$, it suffices to show $\mathbb{E}[\hat{R}^2] \geq (1 - \gamma)^2 \mathbb{E}[G^2]$.
$\mathbb{E}[\hat{R}^2] = \sum_{s, a} d_\pi(s, a) r(s, a)^2 = (1 - \gamma) \, \mathbb{E}\big[\textstyle\sum_{t=0}^{\infty} \gamma^t R_{t+1}^2\big], \qquad (12)$
$(1 - \gamma)^2 \mathbb{E}[G^2] = \mathbb{E}\big[\big(\textstyle\sum_{t=0}^{\infty} (1 - \gamma) \gamma^t R_{t+1}\big)^2\big] \leq \mathbb{E}\big[\textstyle\sum_{t=0}^{\infty} (1 - \gamma) \gamma^t R_{t+1}^2\big] = (1 - \gamma) \, \mathbb{E}\big[\textstyle\sum_{t=0}^{\infty} \gamma^t R_{t+1}^2\big], \qquad (13)$
where the inequality comes from the infinite discrete form of Jensen's inequality, applied to the convex function $x \mapsto x^2$ with the weights $\{(1 - \gamma)\gamma^t\}$ summing to 1. ∎
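Theorem 1 can also be checked numerically. The sketch below uses a hypothetical two-state chain (stay in a rewarding state with probability $p$, then move to an absorbing zero-reward state); it samples the episodic return $G$ and a draw of the per-step reward by picking a time step $t$ with probability $(1-\gamma)\gamma^t$, then confirms Eq (11) and the variance bound by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, p = 0.9, 0.7   # discount; prob. of staying in the rewarding state

def sample_G_and_R(n):
    """Sample n pairs of (discounted return G, per-step reward R).
    A draw of R follows d_pi: pick t with prob. (1-gamma)*gamma**t and
    record the reward received at that step of the trajectory."""
    Gs, Rs = np.empty(n), np.empty(n)
    for i in range(n):
        # duration in the rewarding state ~ Geometric(1-p):
        # rewards are 1 for t < T and 0 afterwards
        T = rng.geometric(1 - p)
        Gs[i] = np.sum(gamma ** np.arange(T))    # G = sum_t gamma^t r_t
        t = rng.geometric(1 - gamma) - 1         # t ~ (1-gamma)*gamma^t
        Rs[i] = 1.0 if t < T else 0.0
    return Gs, Rs

G, R = sample_G_and_R(100_000)
# Eq (11): E[R] = (1-gamma) * E[G];  Theorem 1: Var(R) >= (1-gamma)^2 Var(G)
assert np.isclose(R.mean(), (1 - gamma) * G.mean(), atol=0.01)
assert R.var() >= (1 - gamma) ** 2 * G.var()
```

The bound is loose here: the per-step reward is Bernoulli with substantial variance, while $(1-\gamma)^2\mathbb{V}(G)$ is small.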
Eq (11) suggests that the expectation of the random variable $\hat{R}$ correctly indicates the performance of the policy $\pi$, up to a constant multiplier. Theorem 1 suggests that minimizing the variance of $\hat{R}$ implicitly minimizes the variance of $G$. We, therefore, consider the following objective for risk-averse RL:
$J_\lambda(\pi) \doteq \mathbb{E}[\hat{R}] - \lambda \mathbb{V}(\hat{R}) = \sum_{s, a} d_\pi(s, a) \big( r(s, a) - \lambda r(s, a)^2 \big) + \lambda \big( \textstyle\sum_{s, a} d_\pi(s, a) r(s, a) \big)^2 \qquad (14)$
$= \max_y L(\pi, y), \quad \text{where } L(\pi, y) \doteq \sum_{s, a} d_\pi(s, a) \big( r(s, a) - \lambda r(s, a)^2 + 2 \lambda y \, r(s, a) \big) - \lambda y^2, \qquad (15)$
where $\lambda > 0$ is a hyperparameter and we have used the Fenchel duality $x^2 = \max_y (2xy - y^2)$ on the squared expectation $\mathbb{E}[\hat{R}]^2$ to avoid a double-sampling issue. We, therefore, consider the following problem:
$\max_{\pi, y} L(\pi, y). \qquad (16)$
3.1 Mean-Variance Policy Iteration
We propose a cyclic coordinate ascent (CCA, Luenberger and Ye 1984; Tseng 2001; Saha and Tewari 2010, 2013; Wright 2015) framework to solve (16), which updates $\pi$ and $y$ alternately as shown in Algorithm 1.
In Algorithm 1, at the $k$-th iteration, we first fix $\pi_k$ and update $y$ (Step 1). As $L(\pi_k, y)$ is quadratic in $y$, $y_{k+1} \doteq \arg\max_y L(\pi_k, y)$ can be computed analytically as $y_{k+1} = \mathbb{E}[\hat{R}_{\pi_k}] = (1 - \gamma) J(\pi_k)$, i.e., all we need in this step is $J(\pi_k)$, which is exactly the performance metric of the policy $\pi_k$. We, therefore, refer to Step 1 as policy evaluation. We then fix $y_{k+1}$ and update $\pi$ (Step 2). This maximization problem seems complicated at first glance, and we will soon discuss the solution to it. In this step, a new policy $\pi_{k+1}$ is computed. An intuitive conjecture is that this step is a policy improvement step, and we confirm this with the following theorem:
Theorem 2.
(Monotonic Policy Improvement) $J_\lambda(\pi_{k+1}) \geq J_\lambda(\pi_k)$.
Though the monotonic improvement w.r.t. the objective in Eq (16) follows directly from standard CCA theories, Theorem 2 provides the monotonic improvement w.r.t. the objective in Eq (14). Details are provided in the appendix. Given Theorem 2, we can now consider the whole CCA framework in Algorithm 1 as a policy iteration framework, which we call mean-variance policy iteration (MVPI), and we have
Corollary 1.
(Convergence of MVPI) Under mild conditions, the policies $\{\pi_k\}$ generated by MVPI satisfy that $\{J_\lambda(\pi_k)\}$ converges and $\lim_{k \to \infty} \nabla_\theta J_\lambda(\theta_k) = 0$.
The precise statement of Corollary 1 and its proof are provided in the appendix. As we have shown that Algorithm 1 is a policy iteration framework, the problem narrows down to Step 2. The maximization in Step 2 seems challenging as it is nonlinear and nonconvex. A key observation is that it can be reduced to the following formulation:
$\pi_{k+1} = \arg\max_\pi \sum_{s, a} d_\pi(s, a) \, \hat{r}(s, a), \qquad (17)$
where $\hat{r}(s, a) \doteq r(s, a) - \lambda r(s, a)^2 + 2 \lambda y_{k+1} r(s, a)$. In other words, to compute $\pi_{k+1}$, we need to solve a new MDP, which is the same as the original MDP except that the reward function is $\hat{r}$ instead of $r$. Any risk-neutral control method can be used to solve this new MDP in a plug-and-play manner.
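To make Steps 1 and 2 concrete, here is a minimal tabular instantiation of MVPI. All MDP quantities (transitions, rewards, $\lambda$) are made up for illustration, and value iteration stands in for the risk-neutral control method.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP for illustration.
n_s, n_a, gamma, lam = 2, 2, 0.9, 0.5
mu0 = np.array([1.0, 0.0])                       # initial distribution
P = np.zeros((n_s, n_a, n_s))                    # P[s, a, s']
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.0, 1.0]; P[1, 1] = [1.0, 0.0]
r = np.array([[1.0, 2.0], [0.0, 3.0]])           # r[s, a]

def d_pi(pi):
    """Normalized discounted state-action distribution of a policy pi[s, a],
    via d = (1-gamma)*mu0 + gamma*P_pi^T d."""
    P_pi = np.einsum('sap,sa->sp', P, pi)        # state chain under pi
    d_s = (1 - gamma) * np.linalg.solve(np.eye(n_s) - gamma * P_pi.T, mu0)
    return d_s[:, None] * pi

def greedy_policy(reward, iters=500):
    """Risk-neutral control (value iteration) for an arbitrary reward table."""
    q = np.zeros((n_s, n_a))
    for _ in range(iters):
        q = reward + gamma * np.einsum('sap,p->sa', P, q.max(axis=1))
    pi = np.zeros((n_s, n_a))
    pi[np.arange(n_s), q.argmax(axis=1)] = 1.0
    return pi

pi = np.full((n_s, n_a), 0.5)                    # start from a uniform policy
for k in range(20):
    y = (d_pi(pi) * r).sum()                     # Step 1: y_{k+1} = E[R_hat]
    r_hat = r - lam * r**2 + 2 * lam * y * r     # augmented reward
    pi = greedy_policy(r_hat)                    # Step 2: policy improvement

d = d_pi(pi)
mean_R = (d * r).sum()
var_R = (d * r**2).sum() - mean_R**2
print(mean_R - lam * var_R)                      # the objective J_lambda(pi)
```

Note that Step 2 maximizes $\sum_{s,a} d_\pi(s,a)\hat{r}(s,a)$, which is $(1-\gamma)$ times the discounted return under $\hat{r}$, so the optimal policy of the augmented MDP is exactly the maximizer in Eq (17).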
MVPI differs from the standard policy iteration (PI, e.g., see Bertsekas and Tsitsiklis 1996; Puterman 2014; Sutton and Barto 2018) in two key ways: (1) policy evaluation in MVPI requires only a scalar performance metric, while standard policy evaluation involves computing the value of all states. (2) policy improvement in MVPI considers an augmented reward (i.e., $\hat{r}$), which is different at each iteration, while standard policy improvement always considers the original reward. Standard PI can be used to solve the policy improvement step in MVPI.
Motivated by the empirical success of PPO (OpenAI, 2018) and its convergence to the globally optimal policy with overparameterized neural networks (Liu et al., 2019a), we now present Mean-Variance PPO (MVPPO, Algorithm 2), instantiating the idea of MVPI.
In MVPPO, $y$ is set to the empirical average reward directly. Theoretically, we should use a $\gamma$-discounted weighted average, as $d_\pi$ is a discounted distribution. Though implementing this weighted average is straightforward, practitioners usually ignore discounting for state visitation in policy gradient methods to improve sample efficiency (Mnih et al., 2016; Schulman et al., 2015, 2017; Bacon et al., 2017). As the policy evaluation and the policy improvement in MVPPO are only partial, the convergence of MVPPO does not follow directly from Corollary 1, and we leave it for future work. Here we provide MVPPO mainly to verify the idea of MVPI empirically in challenging domains.
3.2 Off-Policy Learning
Previous work on mean-variance RL from the episodic return perspective considers only the on-policy setting and cannot be easily extended to the off-policy setting. For example, it is not clear whether perturbation methods for estimating gradients (Prashanth and Ghavamzadeh, 2013) can be used off-policy. Furthermore, the methods of Tamar et al. (2012); Xie et al. (2018) are episodic, involving terms like $G_k$ and $G_k^2$. To reweight $G_k$ and $G_k^2$ in the off-policy setting, we would need to compute the product of importance sampling ratios $\prod_t \frac{\pi(a_t^k | s_t^k)}{b(a_t^k | s_t^k)}$, where $b$ is the behavior policy. This product usually suffers from high variance (Precup et al., 2001; Liu et al., 2018) and requires knowing the behavior policy $b$, both of which are practical obstacles in real applications.
We now consider off-policy mean-variance RL under the per-step reward perspective, i.e., we want to perform MVPI with off-policy samples. In particular, we consider the behavior-agnostic off-policy learning setting (Nachum et al., 2019a), where we have access to a batch of transitions $\{(s_i, a_i, r_i, s'_i)\}$. The state-action pairs $(s_i, a_i)$ are distributed according to some unknown distribution $d_b$, which may result from multiple unknown behavior policies. The successor state $s'_i$ is distributed according to $p(\cdot | s_i, a_i)$, and $r_i = r(s_i, a_i)$.
In an off-policy setting, the policy evaluation step in MVPI becomes the standard off-policy evaluation problem (Thomas et al., 2015; Thomas and Brunskill, 2016; Jiang and Li, 2016; Liu et al., 2018), where we want to estimate a scalar performance metric of a policy with off-policy samples. One promising approach to off-policy evaluation is density ratio learning, where we use function approximation to learn the density ratio $\frac{d_\pi(s, a)}{d_b(s, a)}$ directly (Hallak and Mannor, 2017; Liu et al., 2018; Gelada and Bellemare, 2019), which we then use to reweight the sampled rewards. Compared with products of importance sampling ratios, this density ratio learning approach significantly reduces the variance (Liu et al., 2018). Furthermore, it has recently been extended to the behavior-agnostic off-policy learning setting (Nachum et al., 2019a; Zhang et al., 2020a, b; Mousavi et al., 2020), and can thus be integrated into MVPI in a plug-and-play manner. In an off-policy setting, the policy improvement step in MVPI becomes the standard off-policy policy optimization problem. In the behavior-agnostic off-policy learning setting, we can reweight the canonical on-policy actor-critic (Sutton et al., 2000; Konda, 2002) with the density ratio as in Liu et al. (2019b) to achieve off-policy policy optimization.
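A minimal sketch of the policy evaluation step with density ratios, assuming the ratio $d_\pi/d_b$ is already available (in practice it would be learned, e.g., with GradientDICE); the three-point distributions and rewards below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Behavior-agnostic off-policy estimation of y = E[R_hat]:
# state-action pairs arrive from an unknown distribution d_b, and we
# reweight their rewards with tau(s, a) = d_pi(s, a) / d_b(s, a).
d_pi = np.array([0.5, 0.3, 0.2])    # target distribution over 3 (s, a) pairs
d_b  = np.array([0.2, 0.2, 0.6])    # sampling distribution (unknown in practice)
r    = np.array([1.0, -1.0, 0.5])   # deterministic rewards

idx = rng.choice(3, size=100_000, p=d_b)   # off-policy batch of (s, a) indices
tau = (d_pi / d_b)[idx]                    # density ratios of the samples
y_hat = (tau * r[idx]).mean()              # estimate of E_{d_pi}[R_hat]

# Unbiasedness: E_{d_b}[tau * r] = sum_{s,a} d_pi(s,a) r(s,a) = E_{d_pi}[R_hat]
assert abs(y_hat - (d_pi * r).sum()) < 0.02
```

A single per-sample ratio replaces the product of per-step importance sampling ratios, which is the source of the variance reduction discussed above.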
4 Experiments
All curves in this section are averaged over 30 independent runs, with shaded regions indicating standard errors. All implementations are publicly available.¹
¹Link available upon publication.
On-Policy Setting. In many real-world robot applications, e.g., in a warehouse, it is crucial that the robots' performance be consistent. In such cases, risk-averse RL is an appealing option to train the robots. Motivated by this, we investigate mean-variance RL algorithms with six Mujoco robot manipulation tasks from OpenAI Gym (Brockman et al., 2016). We use two-hidden-layer neural networks for function approximation. Details are provided in the appendix.
Comparison with other mean-variance RL methods. We compare MVPPO against the mean-variance RL algorithms in Prashanth and Ghavamzadeh (2013); Tamar et al. (2012); Xie et al. (2018). Previously, those algorithms were tested only on simple domains with tabular representations or linear function approximation. None has been benchmarked on Mujoco domains with neural network function approximation. Experience replay (Lin, 1992) and multiple parallel actors (Mnih et al., 2016) are common techniques for stabilizing RL agents when neural network function approximators are used. To make the comparison fair, we also implement the algorithms in Prashanth and Ghavamzadeh (2013); Tamar et al. (2012); Xie et al. (2018) with multiple parallel actors (they are on-policy algorithms, so we cannot use experience replay). We also introduce Mean-Variance A2C (MVA2C) as a baseline, where A2C, a synchronized version of A3C (Mnih et al., 2016), is used for the policy improvement step in MVPI. For the compared algorithms, we tune $\lambda$, $\xi$, and the initial learning rates to maximize the performance at the end of training. More details are in the appendix.
The results are reported in Figure 3a. MVPPO outperforms all baselines. By contrast, previous mean-variance RL methods generally suffer from poor performance in Mujoco domains with neural network function approximation, with the exception that Xie et al. (2018) achieve reasonable performance in Reacher-v2. No matter how we tune $\lambda$, the learning curves of the baseline mean-variance RL methods always remain flat, indicating they fail to achieve the risk-performance trade-off in Mujoco domains with neural network function approximation. We conjecture that perturbation-based gradient estimation in Prashanth and Ghavamzadeh (2013) does not work well with neural networks, and the $G_k^2$ term in Tamar et al. (2012); Xie et al. (2018) suffers from high variance, yielding instability. MVA2C also outperforms previous mean-variance RL methods, though it is outperformed by MVPPO. This indicates that part of the performance advantage of MVPPO comes from the more advanced policy optimization techniques (i.e., PPO). This flexibility of using any existing risk-neutral control method for risk-averse control is exactly the key feature of MVPI.
Table 1: Normalized statistics of the episodic returns.
HalfCheetah-v2: 9%, 6%, 8%, 42%, 84%, 84%, 79%, 99%, 99%
Walker2d-v2: 15%, 11%, 11%, 14%, 56%, 56%, 59%, 92%, 92%
Swimmer-v2: 11%, 36%, 12%, 32%, 63%, 24%, 27%, 168%, 361%
Hopper-v2: 14%, 50%, 52%, 4%, 58%, 58%, 9%, 57%, 57%
Reacher-v2: 5%, 3%, 5%, 9%, 21%, 16%, 33%, 28%, 24%
Humanoid-v2: 1%, 7%, 7%, 35%, 150%, 150%, 21%, 51%, 51%
Comparison with PPO. We compare MVPPO with vanilla PPO. Our PPO implementation uses the same architecture and hyperparameters as Schulman et al. (2017), and the performance of our PPO matches the PPO performance reported in Achiam (2018). The results are reported in Figure 3b. In 4 out of the 6 tested domains, there is a $\lambda$ such that MVPPO outperforms PPO, while in the remaining two domains, the performance of MVPPO is also competitive. Though it is infeasible to expect a risk-averse algorithm to always outperform its risk-neutral counterpart in terms of a risk-neutral performance metric, our experimental results suggest that this does occur sometimes with a task-specific risk level ($\lambda$). We conjecture that this is because the penalty term from the variance regularizes and stabilizes the training of neural networks. We also demonstrate that MVPPO achieves the trade-off between risk and performance. To this end, we test the agent at the end of training for an extra 100 episodes. We then report the normalized statistics of the episodic returns in Table 1. The original statistics are provided in the appendix. When we increase $\lambda$ from 0.1 to 10, the decrease in the variance often becomes more and more significant. Furthermore, MVPPO usually outperforms PPO significantly in terms of the risk-sensitive performance metric for larger $\lambda$. For small $\lambda$, MVPPO sometimes has higher variance than PPO, even though PPO is exactly MVPPO with $\lambda = 0$. We conjecture that this increase in variance results from variability in estimating the average reward $y$, which further translates into variability of the augmented reward $\hat{r}$, in turn increasing variability of the learned policy.
Off-Policy Setting. We consider a tabular infinite-horizon MDP (Figure 6a). Two actions, $a_1$ and $a_2$, are available at the initial state $s_0$, where the agent is initialized. We consider the objective in Eq (14). If $\lambda = 0$, the optimal policy is to choose one action; if $\lambda$ is large enough, the optimal policy is to choose the other. We consider the behavior-agnostic off-policy setting, where the sampling distribution $d_b$ may result from multiple unknown behavior policies. We use density-ratio-learning-based off-policy MVPI. Details are provided in the appendix. In particular, we use GradientDICE (Zhang et al., 2020b) for learning the density ratio. We report the probability of selecting each action against training iterations. As shown in Figure 6b, off-policy MVPI behaves correctly for both settings of $\lambda$. The main challenge in off-policy MVPI rests on learning the density ratio. Scaling up density ratio learning algorithms reliably to more challenging domains like Mujoco is out of the scope of this paper.
5 Related Work
Besides risk measures like variance, VaR, and CVaR as discussed in Section 1, exponential utility functions (Borkar, 2002) and the Sharpe ratio (Tamar et al., 2012) are also used for risk-averse control. Risk measures can be classified into coherent/incoherent risk measures and time-consistent/time-inconsistent risk measures; see Chow (2017) for details. Besides the expected episodic return, which is the primary objective for many mean-variance RL methods as discussed in Section 2, the average reward $\sum_s \bar{d}_\pi(s) \sum_a \pi(a|s) r(s, a)$ (Puterman, 2014) is also a commonly used performance metric, where $\bar{d}_\pi$ is the stationary distribution of the chain induced by $\pi$. Prashanth and Ghavamzadeh (2013) develop risk-averse control methods under the average reward criterion. Although our proposed metric $\mathbb{E}[\hat{R}]$ has a similar structure to the average reward and can be interpreted as the average reward of a different MDP (see Konda (2002) for details), $\mathbb{E}[\hat{R}]$ is literally the expected episodic return of the original MDP, up to a constant multiplier. Furthermore, our approach for optimizing $\pi$ and $y$ differs dramatically from Prashanth and Ghavamzadeh (2013). Deriving the MVPI framework under the average reward criterion is straightforward, and we leave it for future work.
6 Conclusion
In this paper, we propose the per-step reward perspective for risk-averse RL. Considering the variance of the per-step reward random variable, we derive the MVPI framework. MVPI enjoys great flexibility such that all policy evaluation methods and risk-neutral control methods can be dropped in for risk-averse control off the shelf, in both on-policy and off-policy settings. MVPI is the first empirical success of risk-averse RL in Mujoco robot simulation domains. Off-policy MVPI is the first success of off-policy risk-averse RL. Possibilities for future work include considering other risk measures (e.g., VaR and CVaR) of the per-step reward random variable, integrating more advanced off-policy policy optimization techniques (e.g., Nachum et al. 2019b) into off-policy MVPI, optimizing $\lambda$ with meta-gradients (Xu et al., 2018), conducting a sample complexity analysis for MVPI, and developing theories for approximate MVPI.
Broader Impact
Our work is beneficial for those exploiting RL to solve sequential decision making problems, including but not limited to portfolio management, autonomous driving, and warehouse robotics. They will be better at controlling the risk of their automated decision making systems. Failure of the proposed system may increase the variability of their automated decision making systems, yielding property loss. Like any artificial intelligence system, our work may reduce the need for human workers, resulting in job losses.
Acknowledgments and Disclosure of Funding
SZ is generously funded by the Engineering and Physical Sciences Research Council (EPSRC). This project has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713). The experiments were made possible by a generous equipment grant from NVIDIA. BL’s research is funded by the National Science Foundation (NSF) under grant NSF IIS1910794 and an Amazon Research Award.
References
Spinning up in deep reinforcement learning.
The option-critic architecture. In Proceedings of the 31st AAAI Conference on Artificial Intelligence.
Nonlinear programming. Athena Scientific.
Neuro-dynamic programming. Athena Scientific.
Stochastic recursive algorithms for optimization. Springer.
Q-learning for risk-sensitive control. Mathematics of Operations Research.
OpenAI Gym. arXiv preprint arXiv:1606.01540.
Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research.
Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems.
Risk-sensitive and data-driven sequential decision making. Ph.D. thesis, Stanford University.
Multi-step reinforcement learning: a unifying algorithm. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
OpenAI Baselines. GitHub, https://github.com/openai/baselines.
IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561.
Off-policy deep reinforcement learning by bootstrapping the covariate shift. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence.
Consistent on-line off-policy evaluation. In Proceedings of the 34th International Conference on Machine Learning.
Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the International Conference on Machine Learning.
Actor-critic algorithms. Ph.D. thesis, Massachusetts Institute of Technology.
Mean-variance portfolio optimization when means and covariances are unknown. The Annals of Applied Statistics.
Optimal dynamic portfolio selection: multiperiod mean-variance formulation. Mathematical Finance.
Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning.
Neural trust region/proximal policy optimization attains globally optimal policy. In Advances in Neural Information Processing Systems.
Breaking the curse of horizon: infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems.
Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473.
Linear and nonlinear programming (3rd edition). Springer.
How should a robot assess risk? Towards an axiomatic theory of risk in robotics. In Robotics Research.
Mean-variance optimization in Markov decision processes. arXiv preprint arXiv:1104.5601.
Mean-variance analysis in portfolio choice and capital markets. John Wiley & Sons.
Autonomous driving: technical, legal and social aspects. Springer, Berlin.
Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning.
Human-level control through deep reinforcement learning. Nature.
Black-box off-policy estimation for infinite-horizon reinforcement learning. In International Conference on Learning Representations.
DualDICE: behavior-agnostic estimation of discounted stationary distribution corrections. arXiv preprint arXiv:1906.04733.
AlgaeDICE: policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074.
Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning.
OpenAI Five. https://openai.com/five/.
Managing risk in healthcare: understanding your safety culture using the Manchester Patient Safety Framework (MaPSaF). Journal of Nursing Management.
Actor-critic algorithms for risk-sensitive MDPs. In Advances in Neural Information Processing Systems.
Off-policy temporal-difference learning with function approximation. In Proceedings of the 18th International Conference on Machine Learning.
Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
On the finite time convergence of cyclic coordinate descent methods. arXiv preprint arXiv:1005.2146.
On the nonasymptotic convergence of cyclic coordinate descent methods. SIAM Journal on Optimization 23(1), pp. 576-601.
Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning.
Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Mastering the game of Go with deep neural networks and tree search. Nature.
The variance of discounted Markov decision processes. Journal of Applied Probability.
Reinforcement learning: an introduction (2nd edition). MIT Press.
Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.
Policy gradients with variance related risk criteria. arXiv preprint arXiv:1206.6404.
Optimizing the CVaR via sampling. In Proceedings of the 29th AAAI Conference on Artificial Intelligence.
Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the International Conference on Machine Learning, pp. 2139-2148.
High-confidence off-policy evaluation. In Proceedings of the 29th AAAI Conference on Artificial Intelligence.
Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications 109(3), pp. 475-494.
Using a logarithmic mapping to enable lower discount factors in reinforcement learning. In Advances in Neural Information Processing Systems.
Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature.
A class of distortion operators for pricing financial and insurance risks. Journal of Risk and Insurance.
Coordinate descent algorithms. Mathematical Programming 151(1), pp. 3-34.
A block coordinate ascent algorithm for mean-variance optimization. In Advances in Neural Information Processing Systems.
Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems.
GenDICE: generalized offline estimation of stationary values. In International Conference on Learning Representations.
GradientDICE: rethinking generalized offline estimation of stationary values. arXiv preprint arXiv:2001.11113.
Appendix A Proofs
A.1 Proof of Theorem 2
Proof.
Recall that $y_{t+1} \doteq \arg\max_y L(\pi_t, y)$ and $\pi_{t+1} \doteq \arg\max_\pi L(\pi, y_{t+1})$ are the two alternating updates. Then
\begin{align}
J(\pi_{t+1}) - \lambda \mathbb{V}(\pi_{t+1}) &= \max_y L(\pi_{t+1}, y) \tag{18} \\
&= L(\pi_{t+1}, y_{t+2}) \tag{19} \\
&\qquad \text{(By definition, $y_{t+2}$ is the maximizer.)} \nonumber \\
&\geq L(\pi_{t+1}, y_{t+1}) \tag{20} \\
&\geq L(\pi_t, y_{t+1}) \tag{21} \\
&\qquad \text{(By definition, $\pi_{t+1}$ is the maximizer.)} \nonumber \\
&= \max_y L(\pi_t, y) \tag{22} \\
&\qquad \text{(By definition, $y_{t+1}$ is the maximizer of the quadratic.)} \nonumber \\
&= J(\pi_t) - \lambda \mathbb{V}(\pi_t). \tag{23}
\end{align}
∎
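The coordinate-ascent argument in the proof above can be illustrated numerically. The sketch below is not from the paper: it assumes a one-step two-arm bandit (arm 1 pays 1, arm 2 pays 0, so the policy is a single probability `p`) and the standard mean-variance dual objective $L(p, y) = \mathbb{E}[r] - \lambda \mathbb{E}[r^2] + 2\lambda y \mathbb{E}[r] - \lambda y^2$, whose maximum over $y$ recovers $J - \lambda \mathbb{V}$.

```python
# Toy check (assumed setup, not the paper's experiments): alternating the
# y-step and the policy-step can only increase L, as in the proof of Theorem 2.

lam = 2.0  # hypothetical risk weight

def L(p, y):
    Er, Er2 = p, p  # rewards are 0/1, so E[r] = E[r^2] = p
    return Er - lam * Er2 + 2 * lam * y * Er - lam * y ** 2

def J_minus_lam_var(p):
    # J(p) - lam * Var(p) for a Bernoulli(p) reward
    return p - lam * p * (1 - p)

p, y = 0.5, 0.0
values = []
for _ in range(5):
    y = p                            # y-step: the quadratic's maximizer is E[r]
    coef = 1 - lam + 2 * lam * y
    p = 1.0 if coef > 0 else 0.0     # policy-step: L is linear in p on [0, 1]
    values.append(L(p, y))

# The recorded objective values are non-decreasing, and the max over y
# of L(p, y) matches J(p) - lam * Var(p), as the duality requires.
assert all(a <= b + 1e-12 for a, b in zip(values, values[1:]))
assert abs(max(L(p, i / 100) for i in range(101)) - J_minus_lam_var(p)) < 1e-2
```

On this toy problem the iteration settles on the deterministic arm-1 policy, which here maximizes the mean-variance objective exactly.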
A.2 Proof of Corollary 1
Let $\{\pi_\theta \mid \theta \in \Theta\}$ be our function class for the policy optimization. We assume
Assumption 1.
The policy $\pi_\theta$ is differentiable in $\theta$, $\Theta$ is a compact set, and for each $\theta$, $\arg\max_y L(\theta, y)$ lies in a compact set $Y$.
Assumption 2.
For each $\theta$, there is a unique $y$ maximizing $L(\theta, y)$.
Remark 1.
As $L(\theta, y)$ is quadratic in $y$ with a negative leading coefficient, the maximizing $y$ for each $\theta$ is $y^*(\theta) = J(\pi_\theta)$. As rewards are bounded, $J(\pi_\theta)$ is bounded as well, allowing us to specify a compact set $Y$ such that for each $\theta$, $y^*(\theta) \in Y$. To make it clear, we rephrase Corollary 1 as
Proof.
Under Assumptions 1 and 2, standard cyclic coordinate ascent (CCA) theories (e.g., page 253 in Luenberger and Ye (1984)) show that the limit $(\theta_*, y_*)$ of any convergent subsequence of $\{(\theta_t, y_t)\}$ satisfies $\nabla_\theta L(\theta_*, y_*) = 0$ and $\nabla_y L(\theta_*, y_*) = 0$. As $L$ is quadratic in $y$, $\nabla_y L(\theta_*, y_*) = 0$ implies $y_* = J(\pi_{\theta_*})$. Recall the Fenchel duality
\begin{align}
x^2 = \max_y \, (2xy - y^2), \tag{30}
\end{align}
where the maximizer is $y = x$. Applying Danskin’s theorem (Proposition B.25 in Bertsekas (1995)) to the Fenchel duality yields
\begin{align}
\nabla_x \, x^2 = \nabla_x \, (2xy - y^2)\big|_{y = x} = 2x. \tag{31}
\end{align}
Note Danskin’s theorem shows that we can treat $y$ as a constant independent of $x$ when computing the gradients in the RHS of Eq (31). Applying Danskin’s theorem to the Fenchel duality used in Eq (14) yields
\begin{align}
\nabla_\theta \big( J(\pi_\theta) - \lambda \mathbb{V}(\pi_\theta) \big) \big|_{\theta = \theta_*} = \nabla_\theta L(\theta, y_*) \big|_{\theta = \theta_*} = 0. \tag{32}
\end{align}
Eq (32) can also be easily verified without invoking Danskin’s theorem by expanding the gradients explicitly. Eq (32) indicates that the subsequence converges to a stationary point of $J(\pi_\theta) - \lambda \mathbb{V}(\pi_\theta)$. As $\Theta$ and $Y$ are compact, such a convergent subsequence always exists, implying that $\{\theta_t\}$ has a limit point that is stationary for $J(\pi_\theta) - \lambda \mathbb{V}(\pi_\theta)$. The convergence of $\{L(\theta_t, y_t)\}$ follows directly from Theorem 1.
∎
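The Fenchel duality in Eq (30), $x^2 = \max_y (2xy - y^2)$, and the Danskin-style gradient $2y^*(x) = 2x$ can be checked numerically. A minimal sketch (the grid and the test points are arbitrary choices, not from the paper):

```python
# Numeric check of x^2 = max_y (2*x*y - y^2): the dual can be rewritten as
# x^2 - (y - x)^2, so the maximizer is y*(x) = x and the max value is x^2.
# Treating y*(x) as a constant (Danskin), d/dx (2*x*y*) = 2*y* = 2*x,
# which matches d/dx x^2 directly.

def dual(x, y):
    return 2 * x * y - y * y

for x in [-1.5, 0.0, 0.7, 3.0]:
    ys = [i / 1000 - 5 for i in range(10001)]   # grid on [-5, 5]
    best = max(dual(x, y) for y in ys)
    assert abs(best - x * x) < 1e-4             # duality holds on the grid
    y_star = x                                  # closed-form maximizer
    assert abs(2 * y_star - 2 * x) < 1e-12      # Danskin-style gradient
```

This is exactly the mechanism that lets the gradient in Eq (32) be computed with the dual variable held fixed.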
Appendix B Experiment Details
Task Selection: We use 6 MuJoCo tasks from OpenAI Gym (https://gym.openai.com/; Brockman et al., 2016) and implement the tabular MDP in Figure 6a ourselves.
Function Parameterization: For all compared algorithms in the MuJoCo domains, we use two-hidden-layer networks to parameterize the policy and the value function. Each hidden layer has 64 hidden units and a ReLU (Nair and Hinton, 2010) activation function. In particular, we parameterize the policy as a diagonal Gaussian distribution with the mean being the output of the network. The standard deviation is a global state-independent variable. This is a common policy parameterization for continuous-action problems (Schulman et al., 2015, 2017).
Hyperparameter Tuning: For PPO, we use the same hyperparameters as Schulman et al. (2017). MVPPO inherits the hyperparameters from PPO directly without any further tuning. For MVA2C, we use the common A2C hyperparameters from Dhariwal et al. (2017). We implement the methods of Prashanth and Ghavamzadeh (2013), Tamar et al. (2012), and Xie et al. (2018) with multiple parallelized actors like A2C.
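The policy parameterization described above can be sketched as follows. This is an illustrative, dependency-free version (the 3-dimensional state and 2-dimensional action are hypothetical; only the two 64-unit ReLU hidden layers and the global state-independent log-std come from the text):

```python
import math
import random

random.seed(0)

def linear(x, w, b):
    # w has one row of weights per output unit
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def make_layer(n_in, n_out):
    w = [[random.gauss(0, 1 / math.sqrt(n_in)) for _ in range(n_in)]
         for _ in range(n_out)]
    return w, [0.0] * n_out

state_dim, action_dim, hidden = 3, 2, 64     # hypothetical task sizes
l1 = make_layer(state_dim, hidden)
l2 = make_layer(hidden, hidden)
head = make_layer(hidden, action_dim)
log_std = [0.0] * action_dim                 # global, state-independent

def policy(state):
    # Two ReLU hidden layers produce the Gaussian mean; the std does
    # not depend on the state at all.
    h = relu(linear(state, *l1))
    h = relu(linear(h, *l2))
    mean = linear(h, *head)
    return mean, [math.exp(s) for s in log_std]

mean, std = policy([0.1, -0.2, 0.3])
assert len(mean) == action_dim and std == [1.0, 1.0]
```

In practice one would implement this with an autodiff framework (the experiments use PyTorch); the sketch only shows the shape of the parameterization.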
Hyperparameters of PPO and MVPPO: We use an Adam optimizer with and an initial learning rate . The discount factor is 0.99. The GAE coefficient is 0.95. We clip the gradient by norm with threshold 0.5. The rollout length ( in Algorithm 2) is 2048. The number of optimization epochs is 10 with batch size 64. We clip the action probability ratio with threshold 0.2.
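The probability-ratio clipping with threshold 0.2 is the standard PPO clipped surrogate (Schulman et al., 2017). A minimal per-sample sketch (function name is ours):

```python
# Clipped surrogate for one (ratio, advantage) pair, eps = 0.2 as above.
def clipped_surrogate(ratio, advantage, eps=0.2):
    clipped = min(max(ratio, 1 - eps), 1 + eps)   # clamp ratio to [0.8, 1.2]
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the ratio's contribution is capped at 1 + eps.
assert clipped_surrogate(1.5, 1.0) == 1.2
# Negative advantage: the pessimistic (clipped) term is taken instead.
assert clipped_surrogate(0.5, -1.0) == -0.8
```

The batch objective averages this quantity over the 64-sample minibatches for 10 epochs per 2048-step rollout.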
Hyperparameters of MVA2C: We use 16 parallelized actors. The initial learning rate for the RMSprop optimizer is . The discount factor is 0.99. We use policy entropy as a regularization term, whose weight is 0.01. The rollout length is 5. As the rollout length is much smaller than that of PPO/MVPPO, we use running estimates for the policy evaluation step. We clip the gradient by norm with threshold 0.5. We tune from and find the best.
Hyperparameters of Prashanth and Ghavamzadeh (2013): To increase stability, we treat as a hyperparameter instead of a variable. Consequently, does not matter. We tune from and find the best. We set the perturbation in Prashanth and Ghavamzadeh (2013) to . We use 16 parallelized actors. The initial learning rate of the RMSprop optimizer is , tuned from . We also test the Adam optimizer, which performs the same as the RMSprop optimizer. We use policy entropy as a regularization term, whose weight is 0.01. The discount factor is 0.99. We clip the gradient by norm with threshold 0.5.
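The running estimates used for MVA2C's policy evaluation step can be maintained with a simple exponential moving average. A sketch (the step size `beta` is a hypothetical value, not from the paper):

```python
# Exponential moving average: each sample nudges the estimate by a
# fixed fraction beta toward the new observation.
def update(estimate, sample, beta=0.01):
    return (1 - beta) * estimate + beta * sample

est = 0.0
for r in [1.0] * 1000:          # a stream of constant samples
    est = update(est, r)

# After many samples the estimate approaches the true mean of the stream.
assert 0.99 < est <= 1.0
```

With 5-step rollouts there are too few samples per update to estimate the evaluation quantities from scratch, so a persistent running estimate of this form is used instead.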
Hyperparameters of Tamar et al. (2012): We use , tuned from . We use , tuned from . We set the initial learning rate of the RMSprop optimizer to , tuned from . We also test the Adam optimizer, which performs the same as the RMSprop optimizer. The learning rates for the running estimates of and are 100 times the initial learning rate of the RMSprop optimizer. We use 16 parallelized actors. We use policy entropy as a regularization term, whose weight is 0.01. We clip the gradient by norm with threshold 0.5.
Hyperparameters of Xie et al. (2018): We use , tuned from . We set the initial learning rate of the RMSprop optimizer to , tuned from . We also test the Adam optimizer, which performs the same as the RMSprop optimizer. We use 16 parallelized actors. We use policy entropy as a regularization term, whose weight is 0.01. We clip the gradient by norm with threshold 0.5.
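All methods above clip the gradient by norm with threshold 0.5. For a flat list of gradient components, global-norm clipping can be sketched as (function name is ours):

```python
import math

# Rescale the whole gradient vector when its Euclidean norm exceeds
# max_norm; leave it untouched otherwise (threshold 0.5 as above).
def clip_by_norm(grads, max_norm=0.5):
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]

g = clip_by_norm([3.0, 4.0])                  # norm 5 -> rescaled to norm 0.5
assert abs(math.hypot(*g) - 0.5) < 1e-9
assert clip_by_norm([0.1, 0.2]) == [0.1, 0.2]  # small gradients pass through
```

Because the entire vector is rescaled by one factor, clipping preserves the gradient's direction, unlike per-component clipping.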
Computing Infrastructure: We conduct our experiments on an NVIDIA DGX-1 with PyTorch, though no GPU is used.
The pseudocode of Off-Policy MVPI is provided in Algorithm 3. In our off-policy experiments, we set to and use tabular representations for , and .
Appendix C Other Experimental Results
We report the original evaluation performance of MVPPO and PPO in Table 2.
Task  PPO  MVPPO()  MVPPO()  MVPPO()
HalfCheetahv2  2159.3(396.1)  1962.5(407.7)  1243.7(159.3)  443.3(30.6) 
Walker2dv2  2087.5(802.5)  2401.1(845.8)  1786.2(530.4)  851.2(224.9) 
Swimmerv2  62.5(3.5)  55.7(4.1)  42.7(2.2)  45.5(5.8) 
Hopperv2  1858.1(596.3)  2111.6(729.3)  1933.5(386.9)  1694.6(390.9) 
Reacherv2  6.7(3.1)  7.0(3.1)  6.0(2.7)  8.9(2.6) 
Humanoidv2  1133.6(505.3)  1148.7(523.1)  1527.5(798.9)  896.2(355.5) 