Per-Step Reward: A New Perspective for Risk-Averse Reinforcement Learning

04/22/2020 ∙ by Shangtong Zhang, et al. ∙ University of Oxford

We present a new per-step reward perspective for risk-averse control in a discounted infinite horizon MDP. Unlike previous work, where the variance of the episodic return random variable is used for risk-averse control, we design a new random variable indicating the per-step reward and consider its variance for risk-averse control. The expectation of the per-step reward matches the expectation of the episodic return up to a constant multiplier, and the variance of the per-step reward bounds the variance of the episodic return from above. Furthermore, we derive the mean-variance policy iteration framework under this per-step reward perspective, where all existing policy evaluation methods and risk-neutral control methods can be dropped in for risk-averse control off the shelf, in both on-policy and off-policy settings. We propose risk-averse PPO as an example for mean-variance policy iteration, which outperforms PPO in many Mujoco domains. By contrast, previous risk-averse control methods cannot be easily combined with advanced policy optimization techniques like PPO due to their reliance on the squared episodic return, and all those that we test suffer from poor performance in Mujoco domains with neural network function approximation.


1 Introduction

Control is one of the key problems of Reinforcement Learning (RL, Sutton and Barto 2018), in which we seek a policy that maximizes certain performance metrics. The performance metric is usually the expectation of some random variable, for example, the expected episodic return (Puterman, 2014; Sutton and Barto, 2018). Although this paradigm has enjoyed great success in various domains (Mnih et al., 2015; Silver et al., 2016; Vinyals et al., 2019), we sometimes want to minimize certain risk measures of that random variable while maximizing its expectation. For example, a portfolio manager is usually willing to sacrifice the return of a portfolio to lower its risk. Risk-averse RL is a framework for studying such problems and has broad applications (Wang, 2000; Parker, 2009; Lai et al., 2011; Matthaeia et al., 2015; Majumdar and Pavone, 2020). Many risk measures have been applied to the episodic return random variable to control risk, for example, variance (Sobel, 1982; Mannor and Tsitsiklis, 2011; Tamar et al., 2012; Prashanth and Ghavamzadeh, 2013; Xie et al., 2018), value at risk (VaR, Chow et al. 2018), and conditional value at risk (CVaR, Chow and Ghavamzadeh 2014; Tamar et al. 2015; Chow et al. 2018). In this paper, we mainly focus on variance given its advantages in interpretability and computation (Markowitz and Todd, 2000; Li and Ng, 2000).

When the primary performance metric is the expectation of the episodic return random variable, it is natural to use the variance of the episodic return random variable as a risk measure. We, however, are not obligated to do so. In this paper, we design a new random variable, the per-step reward, and use its variance for risk-averse RL. The expectation of the per-step reward matches the expectation of the episodic return up to a constant multiplier. Furthermore, we prove that the variance of the per-step reward bounds the variance of the episodic return from above, indicating that minimizing the variance of the per-step reward implicitly minimizes the variance of the episodic return.

Considering the variance of the per-step reward, we derive the mean-variance policy iteration (MVPI) framework for risk-averse RL, with the help of cyclic coordinate ascent (Luenberger and Ye, 1984) and the Fenchel duality. MVPI is flexible in that all existing policy evaluation methods and risk-neutral control methods can be dropped in to obtain risk-averse control off the shelf, in both on-policy and off-policy settings. This flexibility offers two significant benefits: (1) It enables risk-averse RL to scale up easily to challenging domains with neural network function approximation. We propose risk-averse Proximal Policy Optimization (PPO, Schulman et al. 2017) as an instance of MVPI, which outperforms PPO in many Mujoco robot simulation domains. By contrast, previous risk-averse control methods that optimize the variance of the episodic return (Tamar et al., 2012; Prashanth and Ghavamzadeh, 2013; Xie et al., 2018) cannot be easily combined with advanced policy optimization techniques like PPO due to their reliance on the squared episodic return. As shown in our empirical study, the methods of Tamar et al. (2012); Prashanth and Ghavamzadeh (2013); Xie et al. (2018) suffer from poor performance in most Mujoco domains with neural network function approximation. (2) It enables off-policy risk-averse learning, which was difficult to achieve previously. For example, enabling off-policy learning for the methods of Tamar et al. (2012); Prashanth and Ghavamzadeh (2013); Xie et al. (2018) usually involves products of importance sampling ratios to reweight the squared episodic return, which suffer from high variance (Precup et al., 2001; Liu et al., 2018) and are compatible only with the setting where we have a single known behavior policy. By contrast, MVPI can leverage recent advances in density ratio learning (Hallak and Mannor, 2017; Gelada and Bellemare, 2019; Liu et al., 2018; Nachum et al., 2019a; Zhang et al., 2020a, b), which significantly reduces the variance from off-policy learning and is compatible with the behavior-agnostic off-policy learning setting (Nachum et al., 2019a), where we may have multiple unknown behavior policies.

2 Background

We consider an infinite horizon MDP with a state space $\mathcal{S}$, an action space $\mathcal{A}$, a bounded reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, a transition kernel $p$, an initial distribution $\mu_0$, and a discount factor $\gamma \in [0, 1]$. The initial state $S_0$ is sampled from $\mu_0$. At time step $t$, an agent takes an action $A_t$ according to $\pi(\cdot \mid S_t)$, where $\pi$ is the policy followed by the agent. The agent then gets a reward $R_{t+1} \doteq r(S_t, A_t)$ and proceeds to the next state $S_{t+1}$ according to $p(\cdot \mid S_t, A_t)$. In this paper, we consider a deterministic reward setting for ease of presentation, following Chow (2017); Xie et al. (2018). The return at time step $t$ is defined as $G_t \doteq \sum_{i=0}^{\infty} \gamma^i R_{t+i+1}$. When $\gamma < 1$, $G_t$ is always well defined. When $\gamma = 1$, to ensure $G_t$ remains well defined, it is usually assumed that all policies are proper (Bertsekas and Tsitsiklis, 1996), i.e., for any policy $\pi$, the chain induced by $\pi$

has some absorbing states, one of which the agent will eventually reach with probability 1. Furthermore, the rewards are always 0 thereafter. For any policy $\pi$,

$G_0$ is the random variable indicating the episodic return, and we use its expectation

(1)    $J(\pi) \doteq \mathbb{E}[G_0]$

as our performance metric. In particular, when $\gamma = 1$, we can express $G_0$ as $G_0 = \sum_{t=0}^{T-1} R_{t+1}$, where $T$ is a random variable indicating the first time the agent reaches an absorbing state. For any $\pi$, the state value function and the state-action value function are defined as $v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]$ and $q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$, respectively.

Mean-Variance RL. As $G_0$ is a random variable, we sometimes want to control its variance while maximizing its expectation (Prashanth and Ghavamzadeh, 2013; Tamar et al., 2012; Xie et al., 2018), which is usually referred to as mean-variance RL. Namely, we consider the following problem:

(2)    $\max_\theta \; J(\pi_\theta) \quad \text{s.t.} \quad \mathbb{V}(G_0) \le \xi,$

where $\mathbb{V}(\cdot)$ indicates the variance of a random variable, $\xi$ indicates the user's tolerance for variance, and the policy $\pi_\theta$ is parameterized by $\theta$. We use $\pi_\theta$ and $\theta$ interchangeably in the rest of the paper.

Prashanth and Ghavamzadeh (2013) consider the setting $\gamma < 1$. To solve (2), they use a Lagrangian relaxation procedure to convert it into an unconstrained saddle-point problem:

(3)    $\max_\theta \min_{\lambda \ge 0} \; J(\theta) - \lambda \bigl(\mathbb{V}(G_0) - \xi\bigr),$

where $\lambda$ is the dual variable. Prashanth and Ghavamzadeh (2013) use stochastic gradient descent to find the saddle-point of (3). To estimate the gradient of the variance term, they propose two simultaneous perturbation methods: simultaneous perturbation stochastic approximation (SPSA) and smoothed functional (Bhatnagar et al., 2013), yielding a three-timescale algorithm. Empirical success is observed in a simple traffic control MDP.
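To illustrate the simultaneous perturbation idea, the following is a minimal SPSA gradient-estimate sketch (our own illustration, not the exact estimator of Prashanth and Ghavamzadeh (2013); `estimate_objective` is a hypothetical Monte-Carlo evaluator of whichever scalar is being differentiated):

```python
import numpy as np

def spsa_gradient(f, theta, c=1e-2, rng=np.random.default_rng(0)):
    """One two-sided SPSA estimate of the gradient of f at theta.

    f maps a parameter vector to a (possibly noisy) scalar, e.g., a
    Monte-Carlo estimate of J(theta) or of the variance of the return.
    """
    delta = rng.choice([-1.0, 1.0], size=theta.shape)   # Rademacher perturbation
    f_plus, f_minus = f(theta + c * delta), f(theta - c * delta)
    return (f_plus - f_minus) / (2.0 * c * delta)        # elementwise 1 / delta_i

# usage (hypothetical evaluator): g = spsa_gradient(estimate_objective, theta)
```

A single perturbation direction gives an estimate of every coordinate of the gradient from only two objective evaluations, which is what makes perturbation methods attractive when the analytical gradient of the variance is unavailable.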

Tamar et al. (2012) consider the setting $\gamma = 1$. Instead of using the saddle-point formulation (3), they consider the following unconstrained problem:

(4)    $\max_\theta \; J(\theta) - \lambda\, g\bigl(\mathbb{V}(G_0) - \xi\bigr),$

where $\lambda$ is a hyperparameter to be tuned and $g$ is a penalty function penalizing violations of the variance constraint. The analytical expression they provide for the gradient of the variance term involves a product of two expectations of the episodic return, leading to a double sampling issue. To address this, Tamar et al. (2012) consider a two-timescale algorithm and keep running estimates of the first and second moments of the episodic return on a faster timescale, yielding an episodic algorithm. Given the $k$-th episode, they propose the following updates:

(5)
(6)

where the two auxiliary quantities are running estimates of the expected return and of its second moment, each updated with its own learning rate on the faster timescale. Empirical success is observed in a simple portfolio management MDP.

Xie et al. (2018) consider the setting $\gamma = 1$ and set $g$ in (4) to the identity function. To address the double sampling issue, they exploit the Fenchel duality $x^2 = \max_y (2xy - y^2)$ and transform (4) into an equivalent problem:

(7)    $\max_{\theta, y} \; \mathbb{E}[G_0] - \lambda\, \mathbb{E}[G_0^2] + \lambda \bigl(2 y\, \mathbb{E}[G_0] - y^2\bigr),$

where $y$ is the dual variable. Xie et al. (2018) use stochastic coordinate ascent to solve (7), updating $y$ and $\theta$ alternately. Given the $k$-th episode, they propose the following updates:

(8)

We remark: (1) the tolerance $\xi$ does not matter in Xie et al. (2018), as $g$ is the identity function and $\xi$ then only contributes an additive constant to the objective. (2) The dual variable $y$ can also be regarded as a running estimate of the expected return, as in Tamar et al. (2012); consequently, we can use gradient descent to optimize it. (3) Tamar et al. (2012) and Xie et al. (2018) can also cope with the setting $\gamma < 1$, provided all policies are proper; otherwise, it is infeasible to compute $G_0$ within a finite episode.
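For reference, the Fenchel duality exploited by Xie et al. (2018), and again in Section 3, is the following elementary identity, stated here in our notation:

```latex
% For any x \in \mathbb{R},
%   x^2 = \max_{y \in \mathbb{R}} (2xy - y^2),
% with the maximum attained at y = x. Applying it to x = \mathbb{E}[G_0] turns the
% squared expectation into an expression that is linear in \mathbb{E}[G_0] for a
% fixed y, so a single sampled return gives an unbiased estimate of the inner
% objective and the double sampling issue disappears.
\[
  \bigl(\mathbb{E}[G_0]\bigr)^2
  \;=\; \max_{y \in \mathbb{R}} \Bigl( 2y\,\mathbb{E}[G_0] - y^2 \Bigr)
  \;=\; \max_{y \in \mathbb{R}} \; \mathbb{E}\bigl[\, 2yG_0 - y^2 \,\bigr].
\]
```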

3 The Per-Step Reward Perspective

Although in many problems our goal is to maximize the undiscounted expected episodic return, practitioners often find that optimizing the discounted objective ($\gamma < 1$) as a proxy for the undiscounted objective ($\gamma = 1$) works better than optimizing the undiscounted objective directly, especially when deep neural networks are used as function approximators (Mnih et al., 2015; Lillicrap et al., 2015; Espeholt et al., 2018; Xu et al., 2018; Van Seijen et al., 2019). We, therefore, focus on the discounted setting in this paper.

It is well known that the expected discounted episodic return can be expressed as

(9)    $J(\pi) = \frac{1}{1-\gamma} \sum_{s, a} d_\pi(s, a)\, r(s, a),$

where $d_\pi$ is the normalized discounted state-action distribution:

(10)    $d_\pi(s, a) \doteq (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(S_t = s, A_t = a \mid \mu_0, \pi).$

We now formally define the per-step reward random variable $R_\pi$. Let $\Omega \doteq \mathcal{S} \times \mathcal{A}$ and let $\mathcal{F}$ be the Borel algebra on $\Omega$. We define a probability measure $P_\pi$ such that $P_\pi(\{(s, a)\}) \doteq d_\pi(s, a)$ for any $(s, a) \in \Omega$. Then $(\Omega, \mathcal{F}, P_\pi)$ forms a probability space, and we define the random variable $R_\pi$, a deterministic mapping from $\Omega$ to $\mathbb{R}$, as $R_\pi(s, a) \doteq r(s, a)$. Intuitively,

$R_\pi$ is a discrete random variable taking values in the image of $r$

with probability mass function $p(x) \doteq \sum_{s, a} d_\pi(s, a)\, \mathbb{I}[r(s, a) = x]$, where $\mathbb{I}$ is the indicator function. It follows that

(11)    $\mathbb{E}[R_\pi] = \sum_{s, a} d_\pi(s, a)\, r(s, a) = (1-\gamma)\, J(\pi),$

and remarkably, we have

Theorem 1. $\mathbb{V}(R_\pi) \ge (1-\gamma)^2\, \mathbb{V}(G_0)$, where $\mathbb{V}$ denotes the variance of a random variable.

Proof.

As $\mathbb{E}[R_\pi] = (1-\gamma)\, \mathbb{E}[G_0]$, it suffices to show $\mathbb{E}[R_\pi^2] \ge (1-\gamma)^2\, \mathbb{E}[G_0^2]$.

(12)    $(1-\gamma)^2\, \mathbb{E}[G_0^2] = \mathbb{E}\Bigl[\bigl(\textstyle\sum_{t=0}^{\infty} (1-\gamma)\gamma^t\, r(S_t, A_t)\bigr)^2\Bigr]$
(13)    $\le \mathbb{E}\Bigl[\textstyle\sum_{t=0}^{\infty} (1-\gamma)\gamma^t\, r(S_t, A_t)^2\Bigr] = \sum_{s, a} d_\pi(s, a)\, r(s, a)^2 = \mathbb{E}[R_\pi^2],$

where the inequality comes from the infinite discrete form of Jensen's inequality, applied to the convex function $x \mapsto x^2$ with weights $(1-\gamma)\gamma^t$. ∎
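To make the bound concrete, the following self-contained sketch (our own illustration on a randomly generated MDP; it is not part of the paper's experiments) numerically checks both Eq (11) and Theorem 1:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # P[s, a] is a distribution over s'
r = rng.normal(size=(nS, nA))                    # deterministic reward r(s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)         # a fixed stochastic policy pi(a|s)
mu0 = np.ones(nS) / nS                           # uniform initial distribution

# Normalized discounted state-action distribution d_pi (truncated power series).
d, p_s = np.zeros((nS, nA)), mu0.copy()
for t in range(1000):
    d += (1 - gamma) * gamma**t * p_s[:, None] * pi
    p_s = np.einsum('s,sa,sap->p', p_s, pi, P)

mean_R = (d * r).sum()                           # E[R_pi]
var_R = (d * r**2).sum() - mean_R**2             # V(R_pi)

# Monte-Carlo estimates of E[G_0] and V(G_0).
returns = []
for _ in range(3000):
    s, g, disc = rng.choice(nS, p=mu0), 0.0, 1.0
    for _ in range(150):                         # gamma^150 is negligible
        a = rng.choice(nA, p=pi[s])
        g += disc * r[s, a]
        disc *= gamma
        s = rng.choice(nS, p=P[s, a])
    returns.append(g)

print(mean_R, (1 - gamma) * np.mean(returns))    # Eq (11): these should match
print(var_R, (1 - gamma)**2 * np.var(returns))   # Theorem 1: left >= right
```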

Eq (11) suggests that the expectation of the random variable $R_\pi$ correctly indicates the performance of the policy $\pi$, up to a constant multiplier. Theorem 1 suggests that minimizing the variance of $R_\pi$ implicitly minimizes the variance of $G_0$. We, therefore, consider the following objective for risk-averse RL:

(14)    $\max_\pi \; J_\lambda(\pi) \doteq \mathbb{E}[R_\pi] - \lambda\, \mathbb{V}(R_\pi),$
(15)    $L(\pi, y) \doteq \mathbb{E}_{(s,a) \sim d_\pi}\bigl[r(s, a) - \lambda\, r(s, a)^2 + 2\lambda\, y\, r(s, a)\bigr] - \lambda\, y^2,$

where $\lambda > 0$ is a hyperparameter and we have used the Fenchel duality $x^2 = \max_y (2xy - y^2)$, applied to $(\mathbb{E}[R_\pi])^2$, to avoid a double-sampling issue. We, therefore, consider the following problem:

(16)    $\max_{\pi, y} \; L(\pi, y).$
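For completeness, the inner maximization over y in (16) has a closed form, and plugging it back recovers the objective (14) (a short derivation we supply in the notation above):

```latex
% L(\pi, y) is concave and quadratic in y; setting
%   \partial_y L(\pi, y) = 2\lambda\,\mathbb{E}_{d_\pi}[r(S, A)] - 2\lambda y = 0
% gives the maximizer y_\pi = \mathbb{E}_{d_\pi}[r(S, A)] = \mathbb{E}[R_\pi]. Then
\[
  \max_{y}\, L(\pi, y)
  = \mathbb{E}_{d_\pi}[r] - \lambda\,\mathbb{E}_{d_\pi}[r^2]
    + \lambda\bigl(\mathbb{E}_{d_\pi}[r]\bigr)^2
  = \mathbb{E}[R_\pi] - \lambda\,\mathbb{V}(R_\pi)
  = J_\lambda(\pi),
\]
% so maximizing L over (\pi, y) is equivalent to maximizing J_\lambda over \pi.
```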

3.1 Mean-Variance Policy Iteration

We propose a cyclic coordinate ascent (CCA, Luenberger and Ye 1984; Tseng 2001; Saha and Tewari 2010, 2013; Wright 2015) framework to solve (16), which updates $y$ and $\pi$ alternately, as shown in Algorithm 1.

for k = 0, 1, … do
       Step 1: $y_{k+1} \leftarrow \mathbb{E}_{(s,a) \sim d_{\pi_k}}[r(s, a)]$   // The exact solution for $\arg\max_y L(\pi_k, y)$
       Step 2: $\pi_{k+1} \leftarrow \arg\max_\pi L(\pi, y_{k+1})$
end for
Algorithm 1 Mean-Variance Policy Iteration (MVPI)

In Algorithm 1, at the $k$-th iteration, we first fix $\pi_k$ and update $y$ (Step 1). As $L(\pi_k, y)$ is quadratic in $y$, $\arg\max_y L(\pi_k, y)$ can be computed analytically as $y_{k+1} = \mathbb{E}_{(s,a) \sim d_{\pi_k}}[r(s, a)]$, i.e., all we need in this step is $\mathbb{E}_{d_{\pi_k}}[r(s, a)] = (1-\gamma) J(\pi_k)$, which is exactly the performance metric of the policy $\pi_k$. We, therefore, refer to Step 1 as policy evaluation. We then fix $y_{k+1}$ and update $\pi$ (Step 2). This maximization problem seems complicated at first glance, and we will soon discuss its solution. In this step, a new policy $\pi_{k+1}$ is computed. An intuitive conjecture is that this step is a policy improvement step, and we confirm this with the following theorem:

Theorem 2.

(Monotonic Policy Improvement) $J_\lambda(\pi_{k+1}) \ge J_\lambda(\pi_k)$ for all $k$, where $J_\lambda$ is the objective in Eq (14).

Though the monotonic improvement w.r.t. the objective in Eq (16) follows directly from standard CCA theories, Theorem 2 provides the monotonic improvement w.r.t. the objective in Eq (14). Details are provided in the appendix. Given Theorem 2, we can now consider the whole CCA framework in Algorithm 1 as a policy iteration framework, which we call mean-variance policy iteration (MVPI), and we have

Corollary 1.

(Convergence of MVPI) Under mild conditions, the policies $\{\pi_k\}$ generated by MVPI satisfy that $\{J_\lambda(\pi_k)\}$ converges and every limit point of the policy sequence is a stationary point of $J_\lambda$.

The precise statement of Corollary 1 and its proof are provided in the appendix. As we have shown that Algorithm 1 is a policy iteration framework, the problem narrows down to Step 2. The maximization in Step 2 seems challenging as it is nonlinear and nonconvex. A key observation is that it can be reduced to the following formulation:

(17)    $\pi_{k+1} = \arg\max_\pi \; \mathbb{E}_{(s,a) \sim d_\pi}\bigl[\hat r(s, a)\bigr],$

where $\hat r(s, a) \doteq r(s, a) - \lambda\, r(s, a)^2 + 2\lambda\, y_{k+1}\, r(s, a)$. In other words, to compute $\pi_{k+1}$, we need to solve a new MDP, which is the same as the original MDP except that the reward function is $\hat r$ instead of $r$. Any risk-neutral control method can be used to solve this new MDP in a plug-and-play manner.

MVPI differs from standard policy iteration (PI, e.g., see Bertsekas and Tsitsiklis 1996; Puterman 2014; Sutton and Barto 2018) in two key ways: (1) policy evaluation in MVPI requires only a scalar performance metric, while standard policy evaluation computes the value of all states; (2) policy improvement in MVPI considers an augmented reward (i.e., $\hat r$), which changes at each iteration, while standard policy improvement always considers the original reward. Standard PI can be used to solve the policy improvement step in MVPI.
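To make the plug-in structure concrete, here is a minimal tabular MVPI sketch (our own illustration; the MDP interface and the use of exact value iteration for Step 2 are our choices, not prescribed by the paper): policy evaluation only produces the scalar y, and policy improvement solves an ordinary MDP with the augmented reward.

```python
import numpy as np

def mvpi_tabular(P, r, mu0, gamma, lam, iters=50, sweeps=500):
    """Mean-Variance Policy Iteration on a small, fully known MDP.

    P[s, a] is a distribution over next states, r[s, a] a deterministic reward.
    Step 2 solves the augmented MDP exactly by value iteration.
    """
    nS, nA = r.shape
    pi = np.ones((nS, nA)) / nA                      # start from the uniform policy
    for _ in range(iters):
        # Step 1 (policy evaluation): y <- E_{d_pi}[r], via the discounted
        # state-action distribution d_pi of the current policy.
        d, p_s = np.zeros((nS, nA)), np.array(mu0, dtype=np.float64)
        for t in range(1000):
            d += (1 - gamma) * gamma**t * p_s[:, None] * pi
            p_s = np.einsum('s,sa,sap->p', p_s, pi, P)
        y = (d * r).sum()

        # Step 2 (policy improvement): solve the MDP whose reward is the
        # augmented reward r_hat = r - lam*r^2 + 2*lam*y*r by value iteration.
        r_hat = r - lam * r**2 + 2.0 * lam * y * r
        v = np.zeros(nS)
        for _ in range(sweeps):
            q = r_hat + gamma * (P @ v)              # q[s, a]
            v = q.max(axis=1)
        pi = np.eye(nA)[q.argmax(axis=1)]            # greedy deterministic policy
    return pi
```

In MVPPO below, the exact value-iteration step is simply replaced by a few PPO updates on the same augmented reward.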

Motivated by the empirical success of PPO (OpenAI, 2018) and its convergence to the globally optimal policy with over-parameterized neural networks (Liu et al., 2019a), we now present Mean-Variance PPO (MVPPO, Algorithm 2), instantiating the idea of MVPI.

Input: $\theta, w$: parameters for the policy $\pi_\theta$ and the value function $v_w$; $K, \lambda$: rollout length and weight for the variance
while True do
       Empty a buffer $\mathcal{M}$
       Run $\pi_\theta$ for $K$ steps in the environment, storing the transitions $\{(s_i, a_i, r_i, s_{i+1})\}_{i=1}^{K}$ into $\mathcal{M}$
       $y \leftarrow \frac{1}{K} \sum_{i=1}^{K} r_i$   // Update $y$, policy evaluation
       for $i = 1, \dots, K$ do
             $r_i \leftarrow r_i - \lambda r_i^2 + 2\lambda y\, r_i$   // Recompute rewards in $\mathcal{M}$
       end for
       // Policy improvement
       Use PPO (Algorithm 1 in Schulman et al. (2017)) with the transitions in $\mathcal{M}$ to optimize $\theta$ and $w$.
end while
Algorithm 2 Mean-Variance PPO

In MVPPO, $y$ is set to the empirical average reward directly. Theoretically, we should use a weighted average, as $d_\pi$ is a discounted distribution. Though implementing this weighted average is straightforward, practitioners usually ignore discounting in the state visitation of policy gradient methods to improve sample efficiency (Mnih et al., 2016; Schulman et al., 2015, 2017; Bacon et al., 2017). As the policy evaluation and the policy improvement in MVPPO are only partial, the convergence of MVPPO does not follow directly from Corollary 1, and we leave its analysis for future work. Here we provide MVPPO mainly to verify the idea of MVPI empirically in challenging domains.
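The only modification MVPPO makes to an off-the-shelf PPO implementation is the step between data collection and the PPO update; a minimal sketch of that step (our own code; the buffer layout and names are hypothetical):

```python
import numpy as np

def mvppo_recompute_rewards(rewards, lam):
    """MVPPO's policy evaluation and reward recomputation for one rollout.

    rewards: the K rewards collected by the current policy.
    Returns y and the augmented rewards r - lam*r^2 + 2*lam*y*r, which are fed
    to an unmodified PPO update in place of the original rewards.
    """
    r = np.asarray(rewards, dtype=np.float64)
    y = r.mean()                                  # empirical average reward
    return y, r - lam * r**2 + 2.0 * lam * y * r

# usage (hypothetical buffer): y, buffer.rewards = mvppo_recompute_rewards(buffer.rewards, lam)
# then run one ordinary PPO update (policy and value function) on the buffer.
```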

3.2 Off-Policy Learning

Previous work on mean-variance RL from the episodic return perspective considers only the on-policy setting and cannot be easily extended to the off-policy setting. For example, it is not clear whether the perturbation methods for estimating gradients (Prashanth and Ghavamzadeh, 2013) can be used off-policy. Furthermore, the methods of Tamar et al. (2012); Xie et al. (2018) are episodic, involving terms like the squared episodic return $G_0^2$. To reweight $G_0$ and $G_0^2$ in the off-policy setting, we would need to compute the product of importance sampling ratios $\prod_t \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)}$, where $\mu$ is the behavior policy. This product usually suffers from high variance (Precup et al., 2001; Liu et al., 2018) and requires knowing the behavior policy $\mu$, both of which are practical obstacles in real applications.

We now consider off-policy mean-variance RL under the per-step reward perspective, i.e., we want to perform MVPI with off-policy samples. In particular, we consider the behavior-agnostic off-policy learning setting (Nachum et al., 2019a), where we have access to a batch of transitions $\{(s_i, a_i, r_i, s'_i)\}$. The state-action pairs $(s_i, a_i)$ are distributed according to some unknown distribution $d_\mu$, which may result from multiple unknown behavior policies. The successor state $s'_i$ is distributed according to $p(\cdot \mid s_i, a_i)$, and $r_i = r(s_i, a_i)$.

In the off-policy setting, the policy evaluation step in MVPI becomes the standard off-policy evaluation problem (Thomas et al., 2015; Thomas and Brunskill, 2016; Jiang and Li, 2016; Liu et al., 2018), where we want to estimate a scalar performance metric of a policy with off-policy samples. One promising approach to off-policy evaluation is density ratio learning, where we use function approximation to learn the density ratio $\tau(s, a) \doteq \frac{d_\pi(s, a)}{d_\mu(s, a)}$ directly (Hallak and Mannor, 2017; Liu et al., 2018; Gelada and Bellemare, 2019), which we then use to reweight the observed rewards. Compared with products of importance sampling ratios, this density ratio learning approach significantly reduces the variance (Liu et al., 2018). Furthermore, it has recently been extended to the behavior-agnostic off-policy learning setting (Nachum et al., 2019a; Zhang et al., 2020a, b; Mousavi et al., 2020) and can thus be integrated into MVPI in a plug-and-play manner. In the off-policy setting, the policy improvement step in MVPI becomes the standard off-policy policy optimization problem. In the behavior-agnostic off-policy learning setting, we can reweight the canonical on-policy actor-critic (Sutton et al., 2000; Konda, 2002) with the density ratio, as in Liu et al. (2019b), to achieve off-policy policy optimization.
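Concretely, once density-ratio estimates $\tau(s_i, a_i) \approx d_\pi(s_i, a_i)/d_\mu(s_i, a_i)$ are available from any of the estimators above, the policy evaluation step of off-policy MVPI reduces to a weighted average; a hedged sketch (our own code, with the ratios assumed given):

```python
import numpy as np

def off_policy_mvpi_evaluation(rewards, ratios, lam):
    """Behavior-agnostic policy evaluation and reward augmentation for MVPI.

    rewards: r_i from the batch; ratios: tau(s_i, a_i) estimates of d_pi/d_mu
    (e.g., from GradientDICE). Returns y and the augmented rewards.
    """
    r = np.asarray(rewards, dtype=np.float64)
    tau = np.asarray(ratios, dtype=np.float64)
    y = (tau * r).sum() / tau.sum()     # self-normalized estimate of E_{d_pi}[r]
    return y, r - lam * r**2 + 2.0 * lam * y * r
```

Either the plain weighted average or the self-normalized version above can be used; the latter is a common choice when the learned ratios are imperfectly normalized.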

4 Experiments

All curves in this section are averaged over 30 independent runs, with shaded regions indicating standard errors. All implementations are publicly available (link available upon publication).

On-Policy Setting. In many real-world robot applications, e.g., in a warehouse, it is crucial that the robots' performance be consistent. In such cases, risk-averse RL is an appealing option for training the robots. Motivated by this, we investigate mean-variance RL algorithms on six Mujoco robot simulation tasks from OpenAI gym (Brockman et al., 2016). We use two-hidden-layer neural networks for function approximation. Details are provided in the appendix.

Figure 3: Online performance of MVPPO and baselines. (a) MVPPO vs. other mean-variance RL methods. (b) MVPPO vs. PPO. Curves are smoothed by a window of size 10.

Comparison with other mean-variance RL methods. We compare MVPPO against the mean-variance RL algorithms of Prashanth and Ghavamzadeh (2013); Tamar et al. (2012); Xie et al. (2018). Previously, those algorithms were tested only on simple domains with tabular representations or linear function approximation; none has been benchmarked on Mujoco domains with neural network function approximation. Experience replay (Lin, 1992) and multiple parallel actors (Mnih et al., 2016) are common techniques for stabilizing RL agents when neural network function approximators are used. To make the comparison fair, we also implement the algorithms of Prashanth and Ghavamzadeh (2013); Tamar et al. (2012); Xie et al. (2018) with multiple parallel actors (they are on-policy algorithms, so we cannot use experience replay). We also introduce Mean-Variance A2C (MVA2C) as a baseline, where A2C, a synchronized version of A3C (Mnih et al., 2016), is used for the policy improvement step in MVPI. For the compared algorithms, we tune $\lambda$, the initial learning rates, and, where applicable, the variance tolerance $\xi$ over small grids to maximize the mean episodic return at the end of training. More details are in the appendix.

The results are reported in Figure 3a. MVPPO with a properly chosen $\lambda$ outperforms all baselines. By contrast, previous mean-variance RL methods generally suffer from poor performance in Mujoco domains with neural network function approximation, with the exception that Xie et al. (2018) achieve reasonable performance in Reacher-v2. No matter how we tune $\lambda$, the learning curves of the baseline mean-variance RL methods remain flat, indicating that they fail to achieve the risk-performance trade-off in Mujoco domains with neural network function approximation. We conjecture that the perturbation-based gradient estimation in Prashanth and Ghavamzadeh (2013) does not work well with neural networks, and that the squared episodic return term in Tamar et al. (2012); Xie et al. (2018) suffers from high variance, yielding instability. MVA2C also outperforms previous mean-variance RL methods, though it is outperformed by MVPPO. This indicates that part of the performance advantage of MVPPO comes from more advanced policy optimization techniques (i.e., PPO). This flexibility of using any existing risk-neutral control method for risk-averse control is exactly the key feature of MVPI.

                  λ = 0.1                      λ = 1                        λ = 10
                  Δmean   Δvar    ΔJ_λ         Δmean   Δvar    ΔJ_λ         Δmean   Δvar    ΔJ_λ
HalfCheetah-v2    -9%     6%      -8%          -42%    -84%    84%          -79%    -99%    99%
Walker2d-v2       15%     11%     -11%         -14%    -56%    56%          -59%    -92%    92%
Swimmer-v2        -11%    36%     -12%         -32%    -63%    -24%         -27%    168%    -361%
Hopper-v2         14%     50%     -52%         4%      -58%    58%          -9%     -57%    57%
Reacher-v2        -5%     3%      -5%          9%      -21%    16%          -33%    -28%    24%
Humanoid-v2       1%      7%      -7%          35%     150%    -150%        -21%    -51%    51%
Table 1: Normalized evaluation performance of MVPPO with different risk levels λ ∈ {0.1, 1, 10}. Each cell consists of (from left to right) the normalized mean change Δmean = (μ_MVPPO − μ_PPO)/|μ_PPO|, the normalized variance change Δvar = (σ²_MVPPO − σ²_PPO)/σ²_PPO, and the normalized change ΔJ_λ = (J_λ,MVPPO − J_λ,PPO)/|J_λ,PPO| of the risk-sensitive performance metric J_λ ≐ mean − λ · variance, all relative to PPO. All numbers are averaged over 30 independent runs.

Comparison with PPO. We compare MVPPO with vanilla PPO. Our PPO implementation uses the same architecture and hyperparameters as Schulman et al. (2017), and its performance matches the PPO performance reported in Achiam (2018). The results are reported in Figure 3b. In 4 out of the 6 tested domains, there is a λ such that MVPPO outperforms PPO, while in the remaining two domains, the performance of MVPPO with the best λ is also competitive. Though it is unreasonable to expect a risk-averse algorithm to always outperform its risk-neutral counterpart in terms of a risk-neutral performance metric, our experimental results suggest that this does occur sometimes with a task-specific risk level λ. We conjecture that this is because the penalty term from the variance regularizes and stabilizes the training of the neural networks. We also demonstrate that MVPPO achieves the trade-off between risk and performance. To this end, we test the agent at the end of training for an extra 100 episodes. We then report the normalized statistics of the episodic returns in Table 1; the original statistics are provided in the appendix. When we increase λ from 0.1 to 10, the decrease in the variance often becomes more and more significant. Furthermore, MVPPO usually outperforms PPO significantly in terms of the risk-sensitive performance metric J_λ when λ = 1 or λ = 10. For λ = 0.1, MVPPO has higher variance than PPO, even though PPO is exactly MVPPO with λ = 0. We conjecture that this increase in variance results from variability in estimating the average reward y, which translates into variability of the augmented reward r̂, in turn increasing the variability of the learned policy.
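For clarity, the normalized quantities in Table 1 are computed from the 100 evaluation episodes as follows (a sketch matching our description of the table; variable names are ours):

```python
import numpy as np

def normalized_changes(returns_mvppo, returns_ppo, lam):
    """Normalized mean, variance, and risk-sensitive-metric changes vs. PPO."""
    m1, v1 = np.mean(returns_mvppo), np.var(returns_mvppo)
    m0, v0 = np.mean(returns_ppo), np.var(returns_ppo)
    j1, j0 = m1 - lam * v1, m0 - lam * v0         # J_lambda = mean - lambda * variance
    return ((m1 - m0) / abs(m0),                  # normalized mean change
            (v1 - v0) / v0,                       # normalized variance change
            (j1 - j0) / abs(j0))                  # normalized J_lambda change
```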

Off-Policy Setting. We consider a tabular infinite horizon MDP (Figure 6a). Two actions are available at the initial state, and the agent is always initialized at that state. We consider the objective in Eq (14). If λ = 0, the optimal policy is to choose one of the two actions; if λ is large enough, the optimal policy is to choose the other. We consider the behavior-agnostic off-policy setting, where the state-action pairs in the batch are drawn from a fixed sampling distribution that may result from multiple unknown behavior policies. We use density-ratio-learning-based off-policy MVPI; details are provided in the appendix. In particular, we use GradientDICE (Zhang et al., 2020b) to learn the density ratio. We report the action-selection probability of the learned policy against training iterations. As shown in Figure 6b, off-policy MVPI behaves correctly for both settings of λ. The main challenge in off-policy MVPI rests on learning the density ratio; scaling density ratio learning algorithms reliably to more challenging domains like Mujoco is beyond the scope of this paper.

Figure 6: (a) A tabular MDP. (b) The training progress of off-policy MVPI.

5 Related Work

Besides risk measures like variance, VaR, and CVaR, as discussed in Section 1, exponential utility functions (Borkar, 2002) and the Sharpe ratio (Tamar et al., 2012) are also used for risk-averse control. Risk measures can be classified into coherent and incoherent risk measures, as well as time-consistent and time-inconsistent risk measures; see Chow (2017) for details. Besides the expected episodic return, which is the primary objective for many mean-variance RL methods as discussed in Section 2, the average reward $\bar J(\pi) \doteq \mathbb{E}_{s \sim \bar d_\pi, a \sim \pi}[r(s, a)]$ (Puterman, 2014) is also a commonly used performance metric, where $\bar d_\pi$ is the stationary distribution of the chain induced by $\pi$. Prashanth and Ghavamzadeh (2013) develop risk-averse control methods under the average reward criterion. Although our proposed metric $\mathbb{E}[R_\pi]$ has a similar structure to the average reward and can be interpreted as the average reward of a different MDP (see Konda (2002) for details), it is literally the expected episodic return of the original MDP, up to a constant multiplier. Furthermore, our approach for optimizing the mean and the variance differs dramatically from that of Prashanth and Ghavamzadeh (2013). Deriving the MVPI framework under the average reward criterion is straightforward, and we leave it for future work.

6 Conclusion

In this paper, we propose the per-step reward perspective for risk-averse RL. Considering the variance of the per-step reward random variable, we derive the MVPI framework. MVPI enjoys great flexibility in that all policy evaluation methods and risk-neutral control methods can be dropped in for risk-averse control off the shelf, in both on-policy and off-policy settings. MVPI is the first empirical success of risk-averse RL in Mujoco robot simulation domains, and off-policy MVPI is the first success of off-policy risk-averse RL. Possibilities for future work include considering other risk measures (e.g., VaR and CVaR) of the per-step reward random variable, integrating more advanced off-policy policy optimization techniques (e.g., Nachum et al. 2019b) into off-policy MVPI, optimizing the risk level λ with meta-gradients (Xu et al., 2018), conducting a sample complexity analysis for MVPI, and developing theories for approximate MVPI.

Broader Impact

Our work is beneficial for those exploiting RL to solve sequential decision making problems, including but not limited to portfolio management, autonomous driving, and warehouse robotics. They will be better at controlling the risk of their automated decision making systems. Failure of the proposed system may increase the variability of their automated decision making systems, yielding property loss. Like any artificial intelligence system, our work may reduce the need for human workers, resulting in job losses.

Acknowledgments and Disclosure of Funding

SZ is generously funded by the Engineering and Physical Sciences Research Council (EPSRC). This project has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713). The experiments were made possible by a generous equipment grant from NVIDIA. BL’s research is funded by the National Science Foundation (NSF) under grant NSF IIS1910794 and an Amazon Research Award.

References

  • J. Achiam (2018) Spinning up in deep reinforcement learning. Cited by: §4.
  • P. Bacon, J. Harb, and D. Precup (2017) The option-critic architecture.. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, Cited by: §3.1.
  • D.P. Bertsekas (1995) Nonlinear programming. Athena Scientific. Cited by: §A.2.
  • D. P. Bertsekas and J. N. Tsitsiklis (1996) Neuro-dynamic programming. Athena Scientific Belmont, MA. Cited by: §2, §3.1.
  • S. Bhatnagar, H.L. Prasad, and L.A. Prashanth (2013) Stochastic recursive algorithms for optimization. Springer London. Cited by: §2.
  • V. S. Borkar (2002) Q-learning for risk-sensitive control. Mathematics of operations research. Cited by: §5.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: Appendix B, §4.
  • Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone (2018) Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research. Cited by: §1.
  • Y. Chow and M. Ghavamzadeh (2014) Algorithms for cvar optimization in mdps. In Advances in Neural Information Processing Systems, Cited by: §1.
  • Y. Chow (2017) Risk-sensitive and data-driven sequential decision making. Ph.D. Thesis, Stanford University. Cited by: §2, §5.
  • K. De Asis, J. F. Hernandez-Garcia, G. Z. Holland, and R. S. Sutton (2018) Multi-step reinforcement learning: a unifying algorithm. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, Cited by: 3.
  • P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov (2017) OpenAI baselines. GitHub. Note: https://github.com/openai/baselines Cited by: Appendix B.
  • L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. (2018) Impala: scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561. Cited by: §3.
  • C. Gelada and M. G. Bellemare (2019) Off-policy deep reinforcement learning by bootstrapping the covariate shift. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Cited by: §1, §3.2.
  • A. Hallak and S. Mannor (2017) Consistent on-line off-policy evaluation. In Proceedings of the 34th International Conference on Machine Learning, Cited by: §1, §3.2.
  • N. Jiang and L. Li (2016) Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, Cited by: §3.2.
  • V. R. Konda (2002) Actor-critic algorithms. Ph.D. Thesis, Massachusetts Institute of Technology. Cited by: §3.2, §5.
  • T. L. Lai, H. Xing, Z. Chen, et al. (2011) Mean-variance portfolio optimization when means and covariances are unknown. The Annals of Applied Statistics. Cited by: §1.
  • D. Li and W. Ng (2000) Optimal dynamic portfolio selection: multiperiod mean-variance formulation. Mathematical finance. Cited by: §1.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §3.
  • L. Lin (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning. Cited by: §4.
  • B. Liu, Q. Cai, Z. Yang, and Z. Wang (2019a) Neural trust region/proximal policy optimization attains globally optimal policy. In Advances in Neural Information Processing Systems, Cited by: §3.1.
  • Q. Liu, L. Li, Z. Tang, and D. Zhou (2018) Breaking the curse of horizon: infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, Cited by: §1, §3.2, §3.2.
  • Y. Liu, A. Swaminathan, A. Agarwal, and E. Brunskill (2019b) Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473. Cited by: §3.2.
  • D. G. Luenberger and Y. Ye (1984) Linear and nonlinear programming (3rd edition). Springer. Cited by: §A.2, §1, §3.1, Remark 1.
  • A. Majumdar and M. Pavone (2020) How should a robot assess risk? towards an axiomatic theory of risk in robotics. In Robotics Research, Cited by: §1.
  • S. Mannor and J. Tsitsiklis (2011) Mean-variance optimization in Markov decision processes. arXiv preprint arXiv:1104.5601. Cited by: §1.
  • H. M. Markowitz and G. P. Todd (2000) Mean-variance analysis in portfolio choice and capital markets. John Wiley & Sons. Cited by: §1.
  • R. Matthaeia, A. Reschkaa, J. Riekena, F. Dierkesa, S. Ulbricha, T. Winkleb, and M. Maurera (2015) Autonomous driving: technical, legal and social aspects. Springer, Berlin. Cited by: §1.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, Cited by: §3.1, §4.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature. Cited by: §1, §3.
  • A. Mousavi, L. Li, Q. Liu, and D. Zhou (2020) Black-box off-policy estimation for infinite-horizon reinforcement learning. In International Conference on Learning Representations, Cited by: §3.2.
  • O. Nachum, Y. Chow, B. Dai, and L. Li (2019a) DualDICE: behavior-agnostic estimation of discounted stationary distribution corrections. arXiv preprint arXiv:1906.04733. Cited by: §1, §3.2, §3.2.
  • O. Nachum, B. Dai, I. Kostrikov, Y. Chow, L. Li, and D. Schuurmans (2019b) AlgaeDICE: policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074. Cited by: §6.
  • V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, Cited by: Appendix B.
  • OpenAI (2018) OpenAI five. Note: https://openai.com/five/ Cited by: §3.1.
  • D. Parker (2009) Managing risk in healthcare: understanding your safety culture using the manchester patient safety framework (mapsaf). Journal of nursing management. Cited by: §1.
  • L. Prashanth and M. Ghavamzadeh (2013) Actor-critic algorithms for risk-sensitive mdps. In Advances in neural information processing systems, Cited by: Appendix B, Appendix B, §1, §1, §2, §2, §3.2, §4, §4, §5.
  • D. Precup, R. S. Sutton, and S. Dasgupta (2001) Off-policy temporal-difference learning with function approximation. In Proceedings of the 18th International Conference on Machine Learning, Cited by: §1, §3.2.
  • M. L. Puterman (2014) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons. Cited by: §1, §3.1, §5.
  • A. Saha and A. Tewari (2010) On the finite time convergence of cyclic coordinate descent methods. arXiv preprint arXiv:1005.2146. Cited by: §3.1.
  • A. Saha and A. Tewari (2013) On the nonasymptotic convergence of cyclic coordinate descent methods. SIAM Journal on Optimization 23 (1), pp. 576–601. Cited by: §3.1.
  • J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, Cited by: Appendix B, §3.1.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: Appendix B, Appendix B, §1, §3.1, §4, 2.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of go with deep neural networks and tree search. Nature. Cited by: §1.
  • M. J. Sobel (1982) The variance of discounted markov decision processes. Journal of Applied Probability. Cited by: §1.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction (2nd edition). MIT press. Cited by: §1, §3.1.
  • R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, Cited by: §3.2.
  • A. Tamar, D. Di Castro, and S. Mannor (2012) Policy gradients with variance related risk criteria. arXiv preprint arXiv:1206.6404. Cited by: Appendix B, Appendix B, §1, §1, §2, §2, §2, §3.2, §4, §4, §5.
  • A. Tamar, Y. Glassner, and S. Mannor (2015) Optimizing the cvar via sampling. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, Cited by: §1.
  • P. Thomas and E. Brunskill (2016) Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148. Cited by: §3.2.
  • P. S. Thomas, G. Theocharous, and M. Ghavamzadeh (2015) High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, Cited by: §3.2.
  • P. Tseng (2001) Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of optimization theory and applications 109 (3), pp. 475–494. Cited by: §3.1, Remark 1.
  • H. Van Seijen, M. Fatemi, and A. Tavakoli (2019) Using a logarithmic mapping to enable lower discount factors in reinforcement learning. In Advances in Neural Information Processing Systems, Cited by: §3.
  • O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. (2019) Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature. Cited by: §1.
  • S. S. Wang (2000) A class of distortion operators for pricing financial and insurance risks. Journal of risk and insurance. Cited by: §1.
  • S. J. Wright (2015) Coordinate descent algorithms. Mathematical Programming 151 (1), pp. 3–34. Cited by: §3.1.
  • T. Xie, B. Liu, Y. Xu, M. Ghavamzadeh, Y. Chow, D. Lyu, and D. Yoon (2018) A block coordinate ascent algorithm for mean-variance optimization. In Advances in Neural Information Processing Systems, Cited by: Appendix B, Appendix B, §1, §1, §2, §2, §2, §2, §3.2, §4, §4.
  • Z. Xu, H. P. van Hasselt, and D. Silver (2018) Meta-gradient reinforcement learning. In Advances in neural information processing systems, Cited by: §3, §6.
  • R. Zhang, B. Dai, L. Li, and D. Schuurmans (2020a) GenDICE: generalized offline estimation of stationary values. In International Conference on Learning Representations, Cited by: §1, §3.2.
  • S. Zhang, B. Liu, and S. Whiteson (2020b) GradientDICE: rethinking generalized offline estimation of stationary values. arXiv preprint arXiv:2001.11113. Cited by: §1, §3.2, §4, 3.

Appendix A Proofs

a.1 Proof of Theorem 2

Proof.
Let $y_\pi \doteq \mathbb{E}_{(s,a) \sim d_\pi}[r(s, a)]$ denote the maximizer of the quadratic $L(\pi, \cdot)$, and recall from Section 3 that $L(\pi, y_\pi) = \mathbb{E}[R_\pi] - \lambda \mathbb{V}(R_\pi) = J_\lambda(\pi)$. Then

(18)    $J_\lambda(\pi_k) = \mathbb{E}[R_{\pi_k}] - \lambda \mathbb{V}(R_{\pi_k})$
(19)    $= L(\pi_k, y_{\pi_k})$
(20)    $= L(\pi_k, y_{k+1})$   (Step 1 of Algorithm 1 sets $y_{k+1} = y_{\pi_k}$.)
(21)    $\le L(\pi_{k+1}, y_{k+1})$
(By definition, $\pi_{k+1}$ is the maximizer of $L(\cdot, y_{k+1})$.)
(22)    $\le L(\pi_{k+1}, y_{\pi_{k+1}})$
(By definition, $y_{\pi_{k+1}}$ is the maximizer of the quadratic $L(\pi_{k+1}, \cdot)$.)
(23)    $= \mathbb{E}[R_{\pi_{k+1}}] - \lambda \mathbb{V}(R_{\pi_{k+1}}) = J_\lambda(\pi_{k+1}).$ ∎

a.2 Proof of Corollary 1

Let $\Pi \doteq \{\pi_\theta : \theta \in \Theta\}$ be our function class for the policy optimization. We assume

Assumption 1.

The policy $\pi_\theta(a \mid s)$ is differentiable in $\theta$, $\Theta$ is a compact set, and for each $y$, $L(\theta, y)$ is continuously differentiable in $\theta$.

Assumption 2.

For each $y$, there is a unique $\theta \in \Theta$ maximizing $L(\theta, y)$.

Remark 1.

Assumption 1 is standard in optimization literature (e.g., Luenberger and Ye (1984)). A lookup table is a simple function class that satisfies Assumption 1. Assumption 2 is standard in CCA literature (e.g., page 253 in Luenberger and Ye (1984) or Theorem 4.1 in Tseng (2001)).

As $L(\theta, y)$ is quadratic in $y$, the maximizing $y$ for each $\theta$ is $y_\theta \doteq \mathbb{E}_{(s,a) \sim d_{\pi_\theta}}[r(s, a)]$. As rewards are bounded, $y_\theta$ is bounded as well, allowing us to specify a compact set $Y \subset \mathbb{R}$ such that $y_\theta \in Y$ for each $\theta \in \Theta$. To make it clear, we rephrase Corollary 1 as

Corollary 1. Under Assumptions 1 and 2, let

(28)    $y_{k+1} \doteq \arg\max_{y \in Y} L(\theta_k, y),$
(29)    $\theta_{k+1} \doteq \arg\max_{\theta \in \Theta} L(\theta, y_{k+1});$

then $\{L(\theta_k, y_k)\}$ converges and every limit point of $\{\theta_k\}$ is a stationary point of $J_\lambda(\theta)$.

Proof.

Under Assumptions 1 and 2, standard CCA theories (e.g., page 253 in Luenberger and Ye (1984)) show that the limit $(\theta_*, y_*)$ of any convergent subsequence of $\{(\theta_k, y_k)\}$ satisfies $\nabla_y L(\theta_*, y_*) = 0$ and $\nabla_\theta L(\theta_*, y_*) = 0$. As $L(\theta, y)$ is quadratic in $y$, $\nabla_y L(\theta_*, y_*) = 0$ implies $y_* = y_{\theta_*}$. Recall the Fenchel duality

(30)    $x^2 = \max_{y} \bigl(2xy - y^2\bigr),$

where the maximum is attained at $y_x = x$. Applying Danskin’s theorem (Proposition B.25 in Bertsekas (1995)) to the Fenchel duality yields

(31)    $\nabla_x\, x^2 = \nabla_x \bigl(2 x y_x - y_x^2\bigr)\big|_{y_x \text{ treated as constant}} = 2 y_x.$

Note Danskin’s theorem shows that we can treat $y_x$ as a constant independent of $x$ when computing the gradients in the RHS of Eq (31). Applying Danskin’s theorem to the Fenchel duality used in Eq (14) yields

(32)    $\nabla_\theta J_\lambda(\theta) = \nabla_\theta L(\theta, y)\big|_{y = y_\theta}.$

Eq (32) can also be easily verified without invoking Danskin’s theorem by expanding the gradients explicitly. Eq (32) indicates that $\theta_*$ is a stationary point of $J_\lambda(\theta)$, since $\nabla_\theta J_\lambda(\theta_*) = \nabla_\theta L(\theta_*, y_{\theta_*}) = \nabla_\theta L(\theta_*, y_*) = 0$. As $\Theta$ and $Y$ are compact, such a convergent subsequence always exists, and the same argument applies to every limit point of $\{\theta_k\}$. The convergence of $\{L(\theta_k, y_k)\}$ follows directly from the monotonic improvement in Theorem 2 together with the boundedness of the rewards.

Appendix B Experiment Details

Task Selection: We use 6 Mujoco tasks from OpenAI gym (https://gym.openai.com/; Brockman et al., 2016) and implement the tabular MDP in Figure 6a ourselves.

Function Parameterization:

For all compared algorithms in the Mujoco domains, we use two-hidden-layer networks to parameterize the policy and the value function. Each hidden layer has 64 hidden units and a ReLU (Nair and Hinton, 2010) activation function. In particular, we parameterize $\pi$ as a diagonal Gaussian distribution with the mean being the output of the network; the standard deviation is a global state-independent variable. This is a common policy parameterization for continuous-action problems (Schulman et al., 2015, 2017).

Hyperparameter Tuning: For PPO, we use the same hyperparameters as Schulman et al. (2017). MVPPO inherits the hyperparameters from PPO directly without any further tuning. For MVA2C, we use the common A2C hyperparameters from Dhariwal et al. (2017). We implement the methods of Prashanth and Ghavamzadeh (2013); Tamar et al. (2012); Xie et al. (2018) with multiple parallelized actors like A2C.

Hyperparameters of PPO and MVPPO: We use an Adam optimizer with the initial learning rate from Schulman et al. (2017). The discount factor is 0.99 and the GAE coefficient is 0.95. We clip the gradient by norm with threshold 0.5. The rollout length ($K$ in Algorithm 2) is 2048. The number of optimization epochs is 10 with batch size 64. We clip the action probability ratio with threshold 0.2.

Hyperparameters of MVA2C:

We use 16 parallelized actors. The initial learning rate of the RMSprop optimizer is tuned over a small grid. The discount factor is 0.99. We use policy entropy as a regularization term, with weight 0.01. The rollout length is 5. As the rollout length is much smaller than that of PPO/MVPPO, we use running estimates for the policy evaluation step. We clip the gradient by norm with threshold 0.5. We tune $\lambda$ over a small grid and report the best.

Hyperparameters of Prashanth and Ghavamzadeh (2013): To increase stability, we treat the Lagrange multiplier as a hyperparameter instead of a variable; consequently, the tolerance $\xi$ does not matter. We tune this multiplier over a small grid and report the best. We set the perturbation in Prashanth and Ghavamzadeh (2013) to a fixed small value. We use 16 parallelized actors. The initial learning rate of the RMSprop optimizer is tuned over a small grid. We also test the Adam optimizer, which performs the same as the RMSprop optimizer. We use policy entropy as a regularization term, with weight 0.01. The discount factor is 0.99. We clip the gradient by norm with threshold 0.5.

Hyperparameters of Tamar et al. (2012): We tune $\lambda$ and $\xi$ over small grids. We set the initial learning rate of the RMSprop optimizer by tuning over a small grid. We also test the Adam optimizer, which performs the same as the RMSprop optimizer. The learning rates for the running estimates of the first and second moments of the return are 100 times the initial learning rate of the RMSprop optimizer. We use 16 parallelized actors. We use policy entropy as a regularization term, with weight 0.01. We clip the gradient by norm with threshold 0.5.

Hyperparameters of Xie et al. (2018): We tune $\lambda$ over a small grid. We set the initial learning rate of the RMSprop optimizer by tuning over a small grid. We also test the Adam optimizer, which performs the same as the RMSprop optimizer. We use 16 parallelized actors. We use policy entropy as a regularization term, with weight 0.01. We clip the gradient by norm with threshold 0.5.

Computing Infrastructure:

We conduct our experiments on an Nvidia DGX-1 with PyTorch, though no GPU is used.

The pseudocode of Off-Policy MVPI is provided in Algorithm 3. In our off-policy experiments, we set $\lambda$ to a fixed value and use tabular representations for the policy, the q-function, and the density ratio.

Input: a batch of transitions $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$; $\lambda$: weight for the variance; $\theta$: parameters for the policy $\pi_\theta$
while True do
       Learn the density ratio $\tau(s, a) \approx \frac{d_{\pi_\theta}(s, a)}{d_\mu(s, a)}$   // For example, use GradientDICE (Zhang et al., 2020b)
       $y \leftarrow \frac{1}{N} \sum_{i=1}^{N} \tau(s_i, a_i)\, r_i$   // Policy evaluation with the density ratio
       for $i = 1, \dots, N$ do
             $\hat r_i \leftarrow r_i - \lambda r_i^2 + 2\lambda y\, r_i$
       end for
       Learn $q$ w.r.t. the reward $\hat r$   // For example, use Off-Policy Expected SARSA (De Asis et al., 2018)
       Update $\theta$ in the direction of $\sum_i \tau(s_i, a_i)\, \nabla_\theta \log \pi_\theta(a_i \mid s_i)\, q(s_i, a_i)$
end while
Algorithm 3 Off-Policy MVPI

Appendix C Other Experimental Results

We report the original evaluation performance of MVPPO and PPO in Table 2.

                 PPO              MVPPO (λ = 0.1)   MVPPO (λ = 1)    MVPPO (λ = 10)
HalfCheetah-v2   2159.3 (396.1)   1962.5 (407.7)    1243.7 (159.3)   443.3 (30.6)
Walker2d-v2      2087.5 (802.5)   2401.1 (845.8)    1786.2 (530.4)   851.2 (224.9)
Swimmer-v2       62.5 (3.5)       55.7 (4.1)        42.7 (2.2)       45.5 (5.8)
Hopper-v2        1858.1 (596.3)   2111.6 (729.3)    1933.5 (386.9)   1694.6 (390.9)
Reacher-v2       -6.7 (3.1)       -7.0 (3.1)        -6.0 (2.7)       -8.9 (2.6)
Humanoid-v2      1133.6 (505.3)   1148.7 (523.1)    1527.5 (798.9)   896.2 (355.5)
Table 2: Evaluation performance of the algorithms on the Mujoco domains. Mean episodic returns are reported with one standard deviation in brackets. Numbers are averaged over 30 independent runs.