The objective in reinforcement learning (RL) is to find a policy that maximizes the mean of the cumulative reward. Risk-sensitive RL goes beyond this mean, and considers other aspects of the reward distribution such as variance, tail probabilities, and shape. Such attributes are quantified using a risk measure.
Though there is no scarcity of risk measures in the literature, there is no consensus on an ideal risk measure. A risk measure is said to be coherent if it is translation invariant, sub-additive, positively homogeneous, and monotonic [1]. Coherent risk measures are desirable, as the aforementioned properties help avoid inconsistent decisions. Later, the author of [2] suggests that coherency may not be sufficient, and introduces a new smooth coherent risk measure. Risk measures such as Value-at-Risk (VaR), Conditional Value-at-Risk (CVaR) [3], and cumulative prospect theory (CPT) [4] have been studied in the RL literature. Popular risk measures like VaR and CVaR have attracted considerable criticism, since they overlook information from infrequent, high-severity outcomes. While risk-neutral RL gives equal weight to all outcomes, it is intuitive to emphasize desirable events and de-emphasize undesirable ones, without ignoring infrequent extreme outcomes altogether.
A family of risk measures called distortion risk measures (DRM) [5, 6] uses a distortion function to distort the original distribution, and calculates the mean of the rewards with respect to the distorted distribution. A distortion function allows one to vary the emphasis on each possible reward value, and its choice governs the risk measure. Further, choosing a concave distortion function ensures that the DRM is coherent [7]. Spectral risk functions are equivalent to distortion functions [8]. A DRM with the identity distortion function is simply the mean of the rewards. Popular risk measures like VaR and CVaR can be expressed as DRMs using appropriate distortion functions. However, the distortion function corresponding to VaR is discontinuous, while that of CVaR, though continuous, is not differentiable at every point. Hence, the author of [2] disfavors such distortion functions and focuses on smooth distortion functions.
In this paper, we consider the family of DRMs with smooth distortion functions. Some examples of smooth distortion functions are the dual-power function, quadratic function, square-root function, exponential function, and logarithmic function (see [9, 10] for more examples). In risk-neutral RL, occasional extreme events receive the same priority as all other events. In a DRM, the distortion function operates on the entire reward distribution without discarding any information. Hence, it is possible to emphasize frequent events while still accounting for infrequent, high-severity events. As there is no universal ideal risk measure, it is natural to choose a risk measure that best fits the problem at hand; for DRMs, this amounts to choosing a suitable distortion function.
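As an illustration, the sketch below implements two of the smooth distortion functions mentioned above. The parameterized forms are standard in the actuarial literature, but the function names and default parameter values here are our own.

```python
import numpy as np

def dual_power(u, r=2.0):
    """Dual-power distortion g(u) = 1 - (1 - u)^r; concave for r >= 1."""
    return 1.0 - (1.0 - u) ** r

def exponential(u, k=1.0):
    """Exponential distortion g(u) = (1 - exp(-k*u)) / (1 - exp(-k))."""
    return (1.0 - np.exp(-k * u)) / (1.0 - np.exp(-k))

# A valid distortion function maps [0, 1] onto [0, 1] monotonically,
# with g(0) = 0 and g(1) = 1.
for g in (dual_power, exponential):
    assert abs(g(0.0)) < 1e-12 and abs(g(1.0) - 1.0) < 1e-12
```

Both functions are smooth and concave, so the resulting DRMs are coherent per the discussion above.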
In this paper, we consider optimizing the DRM in a risk-sensitive RL context. The goal in our formulation is to find a policy that maximizes the DRM of the cumulative reward in an episodic Markov decision process (MDP). We consider this problem in on-policy as well as off-policy settings, and employ a gradient ascent solution approach. Solving a DRM-sensitive MDP is challenging for two reasons. First, the DRM depends on the entire distribution of the cumulative reward, while the regular value function objective in a risk-neutral RL setting is concerned only with the mean of this distribution. This observation implies that a sample average of the total reward across sample episodes is not sufficient to estimate the DRM. Second, a gradient ascent algorithm requires an estimate of the gradient of the DRM objective, and such gradient information is not directly available in a typical RL setting. In the risk-neutral case, one has the policy gradient theorem, which leads to a straightforward gradient estimate from sample episodes.
To estimate the DRM from sample episodes, we use the empirical distribution function (EDF) as a proxy for the true distribution. We provide a non-asymptotic bound on the mean squared error of this estimator, which may be of independent interest. Next, to estimate the DRM gradient, we employ the smoothed functional (SF) method [11, 12, 13]. We use a variant of SF that uses two function measurements corresponding to two perturbed policies. An SF-based estimation scheme may be restrictive for some applications in an on-policy RL setting, since we need separate sets of episodes corresponding to the two perturbed policies. In an off-policy RL context, however, we only need a single set of episodes corresponding to a behavior policy. We provide bounds on the bias and variance of the aforementioned gradient estimates. Using these bounds, we establish non-asymptotic bounds on the number of iterations our DRM gradient ascent algorithms require to find an $\epsilon$-stationary point of the DRM objective. To the best of our knowledge, non-asymptotic bounds have not been derived for an SF-based DRM gradient ascent algorithm in the current literature.
Related work. In [14], the authors propose a policy gradient algorithm for an abstract coherent risk measure, and derive a policy gradient theorem using the dual representation of a coherent risk measure. Their estimation scheme requires solving a convex optimization problem, and they establish asymptotic consistency of their proposed gradient estimate. In [15], the authors survey policy gradient algorithms for optimizing different risk measures in constrained as well as unconstrained RL settings. In a non-RL context, the authors of [16] study the sensitivity of DRMs using an estimator based on the generalized likelihood ratio method, and establish a central limit theorem for their gradient estimator. In [17], the authors study DRMs and derive a policy gradient theorem that caters to the DRM objective. They establish non-asymptotic bounds for their policy gradient algorithms, which use a likelihood ratio (LR) based gradient estimation scheme. In [18], the authors consider a CPT-based objective in an RL setting, employ the simultaneous perturbation stochastic approximation (SPSA) method for gradient estimation, and provide asymptotic convergence guarantees for their algorithm. In comparison to the aforementioned works, we would like to note the following aspects:
(i) For the DRM, we estimate the gradient using an SF-based estimation scheme, while [17] uses an LR-based gradient estimation scheme. Similar to our work, [17] establishes a convergence rate, expressed in terms of the number of iterations of the DRM gradient ascent algorithm, that implies convergence to a stationary point of the DRM objective. However, the algorithms in [17] require a batch of episodes per iteration in both on-policy and off-policy RL settings, whereas our algorithm for the off-policy RL setting requires only a constant number of episodes per iteration, though our algorithm for the on-policy RL setting requires a batch of episodes per iteration. The algorithms in [17] directly estimate the gradient using order statistics. Our algorithms use a two-part estimation scheme, where we first estimate the DRM using order statistics, and then estimate its gradient using an SF-based scheme.
(ii) For a general coherent risk measure, [14] uses a gradient estimation scheme that requires solving a convex optimization problem, whereas our algorithms directly estimate the gradient from samples without solving any optimization sub-problem.
(iii) In [18], the guarantees for a gradient ascent algorithm based on SPSA are asymptotic in nature, and apply to CPT in an on-policy RL setting. CPT is also based on a distortion function, but the distortion function underlying CPT is neither concave nor convex, and hence CPT is not coherent.
(iv) In related work, the authors derive a non-asymptotic bound for an abstract smooth risk measure, using abstract gradient oracles that satisfy certain bias-variance conditions. In contrast, we provide concrete gradient estimation schemes in RL settings, and our bounds feature an improved rate.
II Problem formulation
II-A Distortion risk measure (DRM)
The DRM of a random variable $X$ is the expected value of $X$ under a distortion of its cumulative distribution function (CDF), obtained using a given distortion function $g$. A DRM is defined using a Choquet integral as follows:
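For concreteness, the standard Choquet-integral form from the DRM literature (cf. [5, 6]) reads, for a random variable $X$ with CDF $F$ and a distortion function $g: [0,1] \to [0,1]$ with $g(0)=0$ and $g(1)=1$ (the notation $X$, $F$, $g$ here is ours):

$$\rho_g(X) \;=\; \int_0^{\infty} g\big(1 - F(x)\big)\, dx \;-\; \int_{-\infty}^{0} \Big[\,1 - g\big(1 - F(x)\big)\Big]\, dx.$$

With the identity distortion $g(u) = u$, this reduces to $\mathbb{E}[X]$.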
The DRMs are well studied from an ‘attitude towards risk’ perspective, and we refer the reader to [19, 20] for details. In this paper, we focus on ‘risk-sensitive decision making under uncertainty’, with DRM as the chosen risk measure. We incorporate DRMs into a risk-sensitive RL framework, and the following section describes our problem formulation.
II-B DRM-sensitive Markov decision process
We consider a Markov decision process (MDP) with state space $\mathcal{S}$ and action space $\mathcal{A}$, both assumed finite. Let $r(s,a)$ denote the single-stage scalar reward, and $p(\cdot \mid s, a)$ the transition probability function. We consider episodic problems, where each episode starts at a fixed state $s_0$, and terminates at a special zero-reward absorbing state. The action selection is based on parameterized stochastic policies $\pi_\theta$, $\theta \in \mathbb{R}^d$. We assume that the parameterized policies are proper, i.e., they satisfy the following assumption:
We denote by $s_t$ and $a_t$ the state and the action at time $t$, respectively. The cumulative discounted reward is defined by $R = \sum_{t=0}^{\tau-1} \gamma^t\, r(s_t, a_t)$, where $\gamma \in (0,1)$ is the discount factor, and $\tau$ is the random length of an episode. We can see that $\tau < \infty$ a.s.
The DRM of $R$, denoted $\rho(\theta)$, is defined as follows:
where $F_\theta$ is the CDF of $R$ under the policy $\pi_\theta$, and $g$ is a given distortion function.
Our goal is to find a policy parameter that maximizes the DRM, i.e.,
III DRM policy gradient algorithms
The optimization problem in (2) can be solved by a gradient ascent algorithm. However, in a typical RL setting, we have direct measurements of neither the DRM nor its gradient. In the following sections, we describe our algorithms that estimate these quantities in on-policy as well as off-policy RL settings.
III-A DRM optimization in an on-policy RL setting
Our first algorithm, DRM-OnP-SF, solves (2) in an on-policy RL setting. In the following sections, we describe the estimation of the DRM and its gradient.
III-A1 DRM estimation
We generate episodes using the policy $\pi_\theta$, and estimate the CDF using sample averages. We denote by $R_i$ the cumulative reward of the $i$-th episode. We form the estimate of the CDF as follows:
Now, we form an estimate of the DRM as follows:
We simplify (4) in terms of order statistics as follows:
where $R_{(i)}$ denotes the $i$-th smallest order statistic of the samples $R_1, \ldots, R_m$, with $m$ the number of episodes. The reader is referred to Lemma 1 in Appendix A for a proof. If we choose the distortion function as the identity function, then the estimator in (5) is simply the sample mean.
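As a minimal sketch of this order-statistics estimator (function and variable names are ours, and the dual-power distortion used in the usage example is one of the smooth choices listed earlier):

```python
import numpy as np

def drm_estimate(rewards, g):
    """Estimate the DRM of `rewards` under a distortion function g.

    Uses the order-statistics form: with R_(1) <= ... <= R_(m), the
    weight on R_(i) is g((m-i+1)/m) - g((m-i)/m), i.e. the distorted
    probability mass of the upper tail beyond the i-th sample.
    """
    r = np.sort(np.asarray(rewards, dtype=float))
    m = len(r)
    i = np.arange(1, m + 1)
    weights = g((m - i + 1) / m) - g((m - i) / m)
    return float(np.dot(weights, r))

# With the identity distortion, the weights are all 1/m, so the
# estimate reduces to the sample mean, consistent with the text.
samples = [1.0, 3.0, 2.0, 4.0]
assert abs(drm_estimate(samples, lambda u: u) - np.mean(samples)) < 1e-12
```

A concave distortion such as `lambda u: 1 - (1 - u) ** 2` shifts weight toward the larger order statistics, so the resulting estimate exceeds the sample mean.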
III-A2 DRM gradient estimation
We use an SF approach (cf. [11, 12, 13]) to estimate the DRM gradient. SF-based methods form a smoothed version of the objective, and use the gradient of this smoothed functional as an approximation to the true gradient. The smoothed functional is defined as
where the perturbation vector is sampled uniformly at random from the unit ball, and $\nu > 0$ is the smoothing parameter. Here, $\|\cdot\|$ denotes the $d$-dimensional Euclidean norm. The gradient of the smoothed functional is
where the perturbation direction is sampled uniformly at random from the unit sphere (cf. [22, Lemma 2.1]).
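In standard form (cf. [12, 22]; the symbols here are generic stand-ins), with the smoothed objective $\rho_\nu(\theta) = \mathbb{E}_{v}\big[\rho(\theta + \nu v)\big]$ for $v$ uniform on the unit ball $\mathbb{B}^d$, the identity underlying this estimate reads

$$\nabla \rho_\nu(\theta) \;=\; \frac{d}{\nu}\, \mathbb{E}_{u}\big[\rho(\theta + \nu u)\, u\big], \qquad u \sim \mathrm{Unif}(\mathbb{S}^{d-1}),$$

so evaluating $\rho$ at randomly perturbed parameters yields an unbiased estimate of the gradient of the smoothed objective.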
We estimate the gradient using two randomly perturbed policies, namely $\pi_{\theta + \nu u}$ and $\pi_{\theta - \nu u}$, where $u$ is a random unit vector sampled uniformly from the surface of the unit sphere. The estimate of the DRM gradient is formed as follows:
where each perturbation direction is sampled uniformly at random from the unit sphere, and the DRM estimate is as defined in (5). The gradient estimate is averaged over multiple unit vectors to reduce its variance.
The update iteration in DRM-OnP-SF is as follows:
where the initial point $\theta_0$ is set arbitrarily, and $\alpha_k$ is the step-size. Algorithm 1 presents the pseudocode of DRM-OnP-SF.
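A compact sketch of the resulting ascent loop is below. Here `drm_of` stands in for the per-iteration DRM estimation step (e.g. (5)) at a given policy parameter; the names, default values, and constant step-size are our own simplifications of Algorithm 1.

```python
import numpy as np

def sf_gradient_ascent(drm_of, theta0, num_iters=100, nu=0.1, alpha=0.05, seed=0):
    """Two-point smoothed-functional (SF) gradient ascent (a sketch).

    `drm_of(theta)` is assumed to return a DRM estimate for the policy
    with parameter `theta`, e.g. computed from episodes via (5).
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    d = theta.size
    for _ in range(num_iters):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)            # uniform direction on the unit sphere
        # Two function measurements at the perturbed parameters theta +/- nu*u.
        rho_plus = drm_of(theta + nu * u)
        rho_minus = drm_of(theta - nu * u)
        grad_est = d * (rho_plus - rho_minus) / (2.0 * nu) * u
        theta = theta + alpha * grad_est  # gradient *ascent* step
    return theta
```

As a sanity check, running this loop with a smooth concave surrogate in place of the DRM estimate drives the iterate toward the surrogate's maximizer.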
III-B DRM optimization in an off-policy RL setting
Every iteration of DRM-OnP-SF needs separate sets of episodes corresponding to the perturbed policies (see Algorithm 1). In some practical applications, it may not be feasible to generate system trajectories corresponding to different perturbed policies. In our second algorithm, DRM-OffP-SF, we overcome this problem by performing off-policy evaluation, i.e., we collect episodes from a behavior policy $b$, and estimate the values of the perturbed target policies from these episodes. Since a single set of episodes from $b$ can be reused to evaluate all the perturbed policies, each iteration of our algorithm requires only a constant number of episodes.
For the analysis, we require the behavior policy to be proper, i.e.,
Also, we require the target policy to be absolutely continuous with respect to the behavior policy, i.e.,
In the following sections, we provide the estimation scheme for the DRM and its gradient.
III-B1 DRM estimation
The cumulative discounted reward is defined by $R = \sum_{t=0}^{\tau-1} \gamma^t\, r(s_t, a_t)$, as in the on-policy case, with the actions $a_t$ now chosen according to the behavior policy $b$.
We generate episodes using the behavior policy $b$ to estimate the CDF using importance sampling. The importance sampling ratio of an episode is defined as
We denote by $R_i$ the cumulative reward, and by $w_i$ the importance sampling ratio, of the $i$-th episode. We form the estimate of the CDF as follows:
In the above, the importance-weighted sample average is an empirical estimate of the CDF. Because of the importance sampling ratio, this average can take values above one. Since we are estimating a CDF, we truncate the estimate to the interval $[0,1]$. The mean squared error of our estimator vanishes as the number of episodes grows; the reader is referred to Lemma 18 in Appendix E for a proof.
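A sketch of this truncated importance-weighted CDF estimate follows. The names are ours, and the per-episode ratio is assumed to be precomputed as the product of target-to-behavior action probabilities along the trajectory.

```python
import numpy as np

def offpolicy_cdf_estimate(x, rewards, is_ratios):
    """Importance-sampling estimate of the target-policy CDF at x.

    `rewards[i]` is the cumulative reward of episode i generated under
    the behavior policy, and `is_ratios[i]` is the product of per-step
    target/behavior action-probability ratios for that episode.
    """
    rewards = np.asarray(rewards, dtype=float)
    is_ratios = np.asarray(is_ratios, dtype=float)
    raw = np.mean(is_ratios * (rewards <= x))
    # The importance weights can push the raw average above 1; truncate,
    # since a CDF takes values in [0, 1].
    return float(np.clip(raw, 0.0, 1.0))
```

When all ratios equal one (target equals behavior policy), this reduces to the ordinary EDF of the on-policy case.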
III-B2 DRM gradient estimation
We use the SF-based gradient estimation scheme of Section III-A2, and an estimate of the gradient is formed as follows:
The update iteration in DRM-OffP-SF is as follows:
The pseudocode of the DRM-OffP-SF algorithm is similar to Algorithm 1, except that in each iteration, we collect episodes from the behavior policy $b$. We then compute DRM estimates using (14), estimate the gradient using (15), and apply the policy parameter update rule (16). The reader is referred to Algorithm 2 in Appendix F.
IV Main results
Our non-asymptotic analysis establishes a bound on the number of iterations of our proposed algorithms required to find an $\epsilon$-stationary point of the DRM, which is defined below.
Definition 1 ($\epsilon$-stationary point)
Let $\theta_R$ be the output of an algorithm. Then, $\theta_R$ is called an $\epsilon$-stationary point of problem (2) if $\mathbb{E}\left[\|\nabla \rho(\theta_R)\|^2\right] \le \epsilon$.
IV-A Non-asymptotic bounds for DRM-OnP-SF
We make the following assumptions to ensure the Lipschitz continuity and smoothness of the DRM.
, , and .
, , and .
An assumption like (A4) is common in the literature on the non-asymptotic analysis of policy gradient algorithms (cf. [26, 25]). Assumption (A5) helps us establish that the distortion function and its derivative are Lipschitz continuous. A few examples of distortion functions that satisfy (A5) are given in Table I. Since a distortion function is bounded by definition, any distortion function whose second derivative is bounded will have a bounded first derivative as well.
Letting $\tau$ denote the (random) episode length under a proper policy, we have from (A1) that $\tau$ is finite a.s. This fact, in conjunction with the boundedness of the single-stage rewards, implies that the cumulative reward is bounded a.s.
The main result that establishes a non-asymptotic bound for DRM-OnP-SF is given below. This result is for a random iterate $\theta_R$, chosen uniformly at random from the policy parameters produced by the algorithm. Such a randomized stochastic gradient algorithm has been studied earlier in a stochastic optimization setting in [27].
The result above bounds, as a function of the number of iterations of (9), the expected stationarity measure of the iterate returned by DRM-OnP-SF. To put it differently, it quantifies the number of iterations of DRM-OnP-SF sufficient to find an $\epsilon$-stationary point of the DRM objective.
The proof uses the following results related to our on-policy estimation scheme:
1) The DRM and its gradient are Lipschitz continuous.
2) The DRM estimation error is bounded in expectation.
3) The bias of the DRM gradient estimate is bounded.
4) The variance of the DRM gradient estimate is bounded.
We now turn to proving the main result. Using the fundamental theorem of calculus, we obtain
Rearranging and taking expectations on both sides of (23), we obtain
Summing up (25) over the iterations, we obtain
Since $\theta_R$ is chosen uniformly at random from the policy iterates, we obtain
where the last inequality follows from the stated choices of the algorithm parameters.
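For readability, the standard shape of this argument (the symbols $L$, $\alpha$, $c$, $\rho^*$ here are generic stand-ins for the paper's constants) is as follows: by $L$-smoothness of $\rho$ and the update $\theta_{k+1} = \theta_k + \alpha\, \hat g_k$,

$$\rho(\theta_{k+1}) \;\ge\; \rho(\theta_k) + \alpha\, \big\langle \nabla\rho(\theta_k),\, \hat g_k \big\rangle - \frac{L \alpha^2}{2}\, \|\hat g_k\|^2.$$

Taking expectations, splitting $\hat g_k$ into $\nabla\rho(\theta_k)$ plus bias and noise terms, rearranging, and summing over $k = 0, \ldots, N-1$ yields

$$\frac{1}{N}\sum_{k=0}^{N-1} \mathbb{E}\,\|\nabla\rho(\theta_k)\|^2 \;\le\; \frac{\rho^* - \rho(\theta_0)}{c\, \alpha N} \;+\; (\text{bias and variance terms}),$$

which, for $\theta_R$ drawn uniformly from $\{\theta_0, \ldots, \theta_{N-1}\}$, bounds $\mathbb{E}\,\|\nabla\rho(\theta_R)\|^2$.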
IV-B Non-asymptotic bounds for DRM-OffP-SF
The main result that establishes a non-asymptotic bound for our algorithm DRM-OffP-SF is given below.
For establishing the main result, we follow the technique employed in the proof of Theorem 1, and use the following results related to our off-policy estimation scheme in place of their on-policy counterparts:
1) The estimation error of the DRM is bounded in expectation.
2) The bias of the DRM gradient estimate is bounded.
3) The variance of the DRM gradient estimate is bounded.
The reader is referred to Appendix E for the detailed proof.
We proposed DRM-based approximate gradient algorithms for risk-sensitive RL control. We employed SF-based gradient estimation schemes in on-policy as well as off-policy RL settings, and provided non-asymptotic bounds that establish convergence to an approximate stationary point of the DRM.
As future work, it would be interesting to study DRM optimization in a risk-sensitive RL setting with feature-based representations, and function approximation.
-  P. Artzner, F. Delbaen, J. Eber, and D. Heath, “Coherent measures of risk,” Mathematical Finance, vol. 9, no. 3, pp. 203–228, 1999.
-  S. Wang, “A risk measure that goes beyond coherence,” 2002.
-  R. T. Rockafellar and S. Uryasev, “Optimization of conditional value-at-risk,” Journal of Risk, vol. 2, pp. 21–42, 2000.
-  A. Tversky and D. Kahneman, “Advances in prospect theory: Cumulative representation of uncertainty,” J. Risk Uncertain., vol. 5, 1992.
-  D. Denneberg, “Distorted probabilities and insurance premiums,” Methods of Operations Research, vol. 63, no. 3, pp. 3–5, 1990.
-  S. S. Wang, V. R. Young, and H. H. Panjer, “Axiomatic characterization of insurance prices,” Insur. Math. Econ., vol. 21, no. 2, 1997.
-  J. Wirch and M. Hardy, “Distortion risk measures: Coherence and stochastic dominance,” Insur. Math. Econ., vol. 32, pp. 168–168, 2003.
-  H. Gzyl and S. Mayoral, “On a relationship between distorted and spectral risk measures,” 2006.
-  B. Jones and R. Zitikis, “Empirical estimation of risk measures and related quantities,” North American Actuarial Journal, vol. 7, 2003.
-  S. Wang, “Premium calculation by transforming the layer premium density,” ASTIN Bulletin, vol. 26, no. 1, pp. 71–92, 1996.
-  V. Katkovnik and Y. Kulchitsky, “Convergence of a class of random search algorithms,” Autom. Remote. Control., vol. 33, 1972.
-  Y. Nesterov and V. Spokoiny, “Random gradient-free minimization of convex functions,” Found. Comut. Math., vol. 17, pp. 527– 566, 2017.
-  S. Bhatnagar, H. Prasad, and L. A. Prashanth, “Stochastic recursive algorithms for optimization: Simultaneous perturbation methods,” Lecture Notes in Control and Inform. Sci., vol. 434, 2013.
-  A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor, “Policy gradient for coherent risk measures,” in Adv. Neural Inf. Process. Syst., 2015.
-  L. A. Prashanth and M. Fu, “Risk-sensitive reinforcement learning,” arXiv preprint arXiv:1810.09126, 2021.
-  P. Glynn, Y. Peng, M. Fu, and J. Hu, “Computing sensitivities for distortion risk measures,” INFORMS J. Comp., 2021.
-  N. Vijayan and L. A. Prashanth, “Policy gradient methods for distortion risk measures,” arXiv preprint arXiv:2107.04422, 2021.
-  L. A. Prashanth, C. Jie, M. Fu, S. Marcus, and C. Szepesvari, “Cumulative prospect theory meets reinforcement learning: Prediction and control,” in ICML, vol. 48, 2016, pp. 1406–1415.
-  K. Dowd and D. Blake, “After VaR: The theory, estimation, and insurance applications of quantile-based risk measures,” The Journal of Risk and Insurance, vol. 73, no. 2, pp. 193–229, 2006.
-  A. Balbás, J. Garrido, and S. Mayoral, “Properties of distortion risk measures,” Methodology and Computing in Applied Probability, vol. 11, no. 3, pp. 385–399, 2009.
-  D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, 1st ed. Athena Scientific, 1996.
-  A. D. Flaxman, A. T. Kalai, and H. B. McMahan, “Online convex optimization in the bandit setting: Gradient descent without a gradient,” in ACM-SIAM Symposium on Discrete Algorithms, 2005, pp. 385–394.
-  R. S. Sutton, H. Maei, and C. Szepesvári, “A convergent o(n) temporal-difference algorithm for off-policy learning with linear function approximation,” in Adv. Neural Inf. Process. Syst., vol. 21, 2009.
-  M. Papini, D. Binaghi, G. Canonaco, M. Pirotta, and M. Restelli, “Stochastic variance-reduced policy gradient,” in ICML, 2018.
-  Z. Shen, A. Ribeiro, H. Hassani, H. Qian, and C. Mi, “Hessian aided policy gradient,” in ICML, 2019, pp. 5729–5738.
-  K. Zhang, A. Koppel, H. Zhu, and T. Basar, “Global convergence of policy gradient methods to (almost) locally optimal policies,” SIAM J. Control. Optim., vol. 58, no. 6, pp. 3586–3612, 2020.
-  S. Ghadimi and G. Lan, “Stochastic first- and zeroth-order methods for nonconvex stochastic programming,” SIAM J. Optim., vol. 23, 2013.
-  J. Kim, “Bias correction for estimated distortion risk measure using the bootstrap,” Insur.: Math. Econ., vol. 47, pp. 198–205, 2010.
-  Y. E. Nesterov, Introductory Lectures on Convex Optimization - A Basic Course, ser. Applied Optimization, 2004, vol. 87.
-  X. Gao, B. Jiang, and S. Zhang, “On the information-adaptive variants of the admm: An iteration complexity perspective,” J. Sci. Comput., vol. 76, no. 1, pp. 327–363, 2018.
-  O. Shamir, “An optimal algorithm for bandit and zero-order convex optimization with two-point feedback,” J. Mach. Learn. Res., vol. 18, no. 1, pp. 1703–1713, 2017.
Appendix A Estimating DRM using order statistics
The following lemma expresses the DRM estimate in an on-policy RL setting in terms of order statistics.
where $R_{(i)}$ is the $i$-th smallest order statistic of the samples $R_1, \ldots, R_m$.
We assume without loss of generality that the samples are arranged in increasing order, and obtain,