# Approximate gradient ascent methods for distortion risk measures

We propose approximate gradient ascent algorithms for the risk-sensitive reinforcement learning control problem, in on-policy as well as off-policy settings. We consider episodic Markov decision processes, and model the risk using the distortion risk measure (DRM) of the cumulative discounted reward. Our algorithms estimate the DRM using order statistics of the cumulative rewards, and compute approximate gradients from the DRM estimates using a smoothed functional-based gradient estimation scheme. We derive non-asymptotic bounds that establish the convergence of our proposed algorithms to an approximate stationary point of the DRM objective.


## I Introduction

The objective in reinforcement learning (RL) is to find a policy which maximizes the mean of the cumulative reward. But risk-sensitive RL goes beyond the mean of the cumulative reward, and considers other aspects of the reward distribution such as variance, tail probabilities, and shape. Such attributes are quantified using a risk measure.

Though there is no scarcity of risk measures in the literature, there is no consensus on an ideal risk measure. A risk measure is said to be coherent if it is translation invariant, sub-additive, positively homogeneous, and monotonic [1]. Coherent risk measures are desirable, as the aforementioned properties help avoid inconsistent decisions. Later, [2] suggests that coherence may not be sufficient, and introduces a new smooth coherent risk measure. Risk measures such as Value-at-Risk (VaR), Conditional Value-at-Risk (CVaR) [3], and cumulative prospect theory (CPT) [4] have been studied in the RL literature. Popular risk measures such as VaR and CVaR have attracted considerable criticism, since these measures overlook information from infrequent, high-severity outcomes. Though risk-neutral RL gives equal focus to all outcomes, it is intuitive to emphasize desirable events, and de-emphasize undesirable events, without ignoring infrequent extreme outcomes altogether.

A family of risk measures called distortion risk measures (DRMs) [5, 6] uses a distortion function to distort the original distribution, and calculates the mean of the rewards with respect to the distorted distribution. A distortion function allows one to vary the emphasis on each possible reward value, and the choice of the distortion function governs the risk measure. Further, choosing a concave distortion function ensures that the DRM is coherent [7]. Spectral risk functions are equivalent to distortion functions [8]. A DRM with an identity distortion function is simply the mean of the rewards. Popular risk measures such as VaR and CVaR can be expressed as DRMs using appropriate distortion functions. But the distortion function is discontinuous for VaR, and, though continuous for CVaR, it is not differentiable at every point. Hence, in [2], the author disfavors such distortion functions and focuses on smooth distortion functions.

In this paper, we consider the family of DRMs with smooth distortion functions. Some examples of smooth distortion functions are the dual-power function, the quadratic function, the square-root function, the exponential function, and the logarithmic function (see [9, 10] for more examples). In risk-neutral RL, occasional extreme events get the same priority as other events. In a DRM, the distortion function operates on the reward distribution without discarding any information. Hence, it is possible to emphasize frequent events, and still account for infrequent high-severity events. As there is no universally ideal risk measure, it is intuitive to choose a risk measure that best fits the problem at hand. For DRMs, this choice reduces to choosing the distortion function.

In this paper, we consider optimizing the DRM in a risk-sensitive RL context. The goal in our formulation is to find a policy that maximizes the DRM of the cumulative reward in an episodic Markov decision process (MDP). We consider this problem in on-policy as well as off-policy settings, and employ the gradient ascent solution approach. Solving a DRM-sensitive MDP is challenging for two reasons. First, DRM is a risk measure that focuses on the entire distribution of the cumulative reward, while the regular value function objective in a risk-neutral RL setting is concerned with only the mean of this distribution. This observation implies a sample average of the total reward across sample episodes would not be sufficient to estimate DRM. Secondly, a gradient ascent algorithm requires an estimate of the gradient of the DRM objective, and such gradient information is not directly available in a typical RL setting. For the risk-neutral case, one has the policy gradient theorem, which leads to a straightforward gradient estimate from sample episodes.

For estimating the DRM from sample episodes, we use the empirical distribution function (EDF) as a proxy for the true distribution. We provide a non-asymptotic bound on the mean squared error of this estimator, which may be of independent interest. Next, to estimate the DRM gradient, we employ the smoothed functional (SF) method [11, 12, 13]. We use a variant of SF that uses two function measurements, corresponding to two perturbed policies. An SF-based estimation scheme may be restrictive for some applications in an on-policy RL setting, since we need separate sets of episodes corresponding to the two perturbed policies. But, in an off-policy RL context, we only need a single set of episodes corresponding to a behavior policy. We provide bounds on the bias and variance of the aforementioned gradient estimates. Using these bounds, we establish that our DRM gradient ascent algorithms require O(1/ε²) iterations to find an ε-stationary point of the DRM objective. To the best of our knowledge, non-asymptotic bounds have not been derived for an SF-based DRM gradient ascent algorithm in the current literature.

Related work. In [14], the authors propose a policy gradient algorithm for an abstract coherent risk measure, and derive a policy gradient theorem using the dual representation of a coherent risk measure. Their estimation scheme requires solving a convex optimization problem, and they establish asymptotic consistency of their proposed gradient estimate. In [15], the authors survey policy gradient algorithms for optimizing different risk measures in constrained as well as unconstrained RL settings. In a non-RL context, the authors in [16] study the sensitivity of the DRM using an estimator based on the generalized likelihood ratio method, and establish a central limit theorem for their gradient estimator. In [17], the authors study the DRM, and derive a policy gradient theorem that caters to the DRM objective. They establish non-asymptotic bounds for their policy gradient algorithms, which use a likelihood ratio (LR) based gradient estimation scheme. In [18], the authors consider a CPT-based objective in an RL setting, employ the simultaneous perturbation stochastic approximation (SPSA) method for gradient estimation, and provide asymptotic convergence guarantees for their algorithm. In comparison to the aforementioned works, we would like to note the following aspects:
(i) For the DRM, we estimate the gradient using an SF-based estimation scheme, while [17] uses an LR-based gradient estimation scheme. Similar to our work, [17] establishes an O(1/√N) convergence rate that implies convergence to a stationary point of the DRM objective; here N denotes the number of iterations of the DRM gradient ascent algorithm. But the algorithms in [17] require fresh batches of episodes per iteration in both on-policy and off-policy RL settings, whereas our algorithm for the off-policy RL setting requires only a constant number of episodes per iteration, though our algorithm for the on-policy RL setting requires episodes for each of the 2n perturbed policies per iteration. The algorithms in [17] directly estimate the gradient using order statistics. Our algorithms use a two-part estimation scheme, where we first estimate the DRM using order statistics, and then estimate its gradient using an SF-based estimation scheme.
(ii) For a general coherent risk measure, [14] uses a gradient estimation scheme that requires solving a convex optimization problem, whereas our algorithms directly estimate the gradient from the samples, without solving any optimization sub-problem.
(iii) In [18], the guarantees for a gradient ascent algorithm based on SPSA are asymptotic in nature, and apply to CPT in an on-policy RL setting. CPT is also based on a distortion function, but the distortion function underlying CPT is neither concave nor convex, and hence CPT is non-coherent.
(iv) In [15], the authors derive a non-asymptotic bound for an abstract smooth risk measure. They use abstract gradient oracles that satisfy certain bias-variance conditions. In contrast, we provide concrete gradient estimation schemes in RL settings, and our bounds feature an improved O(1/√N) rate.

The rest of the paper is organized as follows: Section II describes the DRM-sensitive MDP. Section III introduces our algorithms, namely DRM-OnP-SF and DRM-OffP-SF. Section IV presents the non-asymptotic bounds for our algorithms. Finally, Section V provides the concluding remarks.

## II Problem formulation

### II-A Distortion risk measure (DRM)

The DRM of a random variable X is the expected value of X under a distortion of its cumulative distribution function (CDF) F_X, attained using a given distortion function g : [0,1] → [0,1]. A DRM is defined using a Choquet integral as follows:

    ρ_g(X) = ∫_{−∞}^{0} ( g(1 − F_X(x)) − 1 ) dx + ∫_{0}^{∞} g(1 − F_X(x)) dx.

The distortion function g is non-decreasing, with g(0) = 0 and g(1) = 1. We can see that ρ_g(X) = E[X] if g is the identity function. A few examples of g are given in Table I, and their plots in Figure 1 (cf. [17]).
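As a numerical illustration (ours, not from the paper), the Choquet integral above can be evaluated by discretizing the two integrals; with the identity distortion it recovers the mean, while a concave distortion such as the dual-power function g(u) = 1 − (1 − u)³ re-weights the distribution:

```python
import numpy as np

def trapezoid(y, x):
    """Plain trapezoidal rule (avoids depending on np.trapz vs. np.trapezoid)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def drm_choquet(cdf, g, M=5.0, num=100_001):
    """rho_g(X) = int_{-M}^0 (g(1-F(x)) - 1) dx + int_0^M g(1-F(x)) dx."""
    xs_neg = np.linspace(-M, 0.0, num)
    xs_pos = np.linspace(0.0, M, num)
    neg = trapezoid(g(1.0 - cdf(xs_neg)) - 1.0, xs_neg)
    pos = trapezoid(g(1.0 - cdf(xs_pos)), xs_pos)
    return neg + pos

# Example: X uniform on [-1, 3], so E[X] = 1.
cdf = lambda x: np.clip((x + 1.0) / 4.0, 0.0, 1.0)
identity = lambda u: u
dual_power = lambda u: 1.0 - (1.0 - u) ** 3  # concave, hence a coherent DRM

print(round(drm_choquet(cdf, identity), 3))    # 1.0 (the mean)
print(round(drm_choquet(cdf, dual_power), 3))  # 2.0 (overweights high rewards)
```

Since the concave g satisfies g(u) ≥ u on [0, 1], the distorted survival function dominates the true one, and the resulting DRM exceeds the mean for this reward distribution.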

The DRMs are well studied from an ‘attitude towards risk’ perspective, and we refer the reader to [19, 20] for details. In this paper, we focus on ‘risk-sensitive decision making under uncertainty’, with DRM as the chosen risk measure. We incorporate DRMs into a risk-sensitive RL framework, and the following section describes our problem formulation.

### II-B DRM-sensitive Markov decision process

We consider a Markov decision process (MDP) with a state space S and an action space A. We assume that S and A are finite spaces. Let r : S × A → ℝ be the single-stage scalar reward function, and p(·|s, a) be the transition probability function. We consider episodic problems, where each episode starts at a fixed state s⁰, and terminates at a special zero-reward absorbing state 0. The action selection is based on parameterized stochastic policies {π_θ, θ ∈ ℝ^d}. We assume that the parameterized policies are proper, i.e., satisfying the following assumption:

###### (A1)

For every θ ∈ ℝ^d, the absorbing state 0 is reached with probability one under π_θ, starting from s⁰.

The assumption (A1) is commonly used in the analysis of episodic MDPs (cf. [21]).

We denote by S_t and A_t the state and the action at time t, respectively. The cumulative discounted reward is defined by R^θ = ∑_{t=0}^{T−1} γ^t r(S_t, A_t), where A_t ∼ π_θ(·|S_t), γ ∈ (0,1) is the discount factor, and T is the random length of an episode. Since the rewards are bounded, we can see that |R^θ| ≤ M_r a.s., for a constant M_r depending on the reward bound and γ.

The DRM of R^θ is defined as follows:

    ρ_g(θ) = ∫_{−M_r}^{0} ( g(1 − F_{R^θ}(x)) − 1 ) dx + ∫_{0}^{M_r} g(1 − F_{R^θ}(x)) dx,   (1)

where F_{R^θ} is the CDF of R^θ, and M_r is the a.s. bound on |R^θ|.

Our goal is to find a θ that maximizes the DRM, i.e.,

    θ* ∈ arg max_{θ ∈ ℝ^d} ρ_g(θ).   (2)

## III DRM policy gradient algorithms

The optimization problem in (2) can be solved by a gradient ascent algorithm. But, in a typical RL setting, we have direct measurements of neither ρ_g(θ) nor its gradient ∇ρ_g(θ). In the following sections, we describe our algorithms, which estimate these quantities in on-policy as well as off-policy RL settings.

### III-A DRM optimization in an on-policy RL setting

Our first algorithm, DRM-OnP-SF, solves (2) in an on-policy RL setting. In the following sections, we describe the estimation of the DRM ρ_g(θ), and its gradient ∇ρ_g(θ).

#### III-A1 DRM estimation

We generate m episodes using the policy π_θ, and estimate the CDF F_{R^θ} using sample averages. We denote by R^θ_i the cumulative reward of the episode i. We form the EDF estimate G^m_{R^θ} of F_{R^θ} as follows:

    G^m_{R^θ}(x) = (1/m) ∑_{i=1}^{m} 𝟙{R^θ_i ≤ x}.   (3)

Now, we form an estimate ρ̂^G_g(θ) of ρ_g(θ) as follows:

    ρ̂^G_g(θ) = ∫_{−M_r}^{0} ( g(1 − G^m_{R^θ}(x)) − 1 ) dx + ∫_{0}^{M_r} g(1 − G^m_{R^θ}(x)) dx.   (4)

Comparing (4) with (1), it is apparent that we have used the EDF G^m_{R^θ} in place of the true CDF F_{R^θ}.

We simplify (4) in terms of order statistics as follows:

    ρ̂^G_g(θ) = ∑_{i=1}^{m} R^θ_{(i)} ( g(1 − (i−1)/m) − g(1 − i/m) ),   (5)

where R^θ_{(i)} is the i-th smallest order statistic of the samples R^θ_1, …, R^θ_m. The reader is referred to Lemma 1 in Appendix A for a proof. If we choose the distortion function g as the identity function, then the estimator in (5) is merely the sample mean.
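A minimal sketch of the order-statistics estimator in (5) (our illustration, not the paper's code); the final checks confirm that the identity distortion reduces it to the sample mean, while a concave distortion yields a value at least the mean:

```python
import numpy as np

def drm_estimate(returns, g):
    """Order-statistics DRM estimate: sum_i R_(i) * (g(1-(i-1)/m) - g(1-i/m))."""
    r = np.sort(np.asarray(returns, dtype=float))  # R_(1) <= ... <= R_(m)
    m = len(r)
    i = np.arange(1, m + 1)
    weights = g(1.0 - (i - 1) / m) - g(1.0 - i / m)
    return float(np.dot(r, weights))

rng = np.random.default_rng(0)
sample = rng.normal(loc=1.0, scale=0.5, size=10_000)  # simulated returns
identity = lambda u: u
dual_power = lambda u: 1.0 - (1.0 - u) ** 2  # smooth and concave

print(np.isclose(drm_estimate(sample, identity), sample.mean()))  # True
print(drm_estimate(sample, dual_power) >= sample.mean())          # True
```

With the identity distortion, all weights equal 1/m, recovering the sample mean; a concave g satisfies g(u) ≥ u on [0, 1], which makes the estimate at least the mean.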

The DRM estimator in (4) is biased, since E[ρ̂^G_g(θ)] ≠ ρ_g(θ) in general. However, we can control the bias by increasing the number of episodes m, as the mean squared error of this estimator is O(1/m). The reader is referred to Lemma 13 in Appendix D for a proof.

#### III-A2 DRM gradient estimation

We use an SF approach [12] to estimate ∇ρ_g(θ). SF-based methods form a smoothed version ρ_{g,μ} of ρ_g, and use ∇ρ_{g,μ} as an approximation of ∇ρ_g. The smoothed functional is defined as

    ρ_{g,μ}(θ) = E_u [ ρ_g(θ + μu) ],   (6)

where u is sampled uniformly at random from the unit ball B_d = {u ∈ ℝ^d : ∥u∥ ≤ 1}, and μ > 0 is the smoothing parameter. Here ∥·∥ denotes the d-dimensional Euclidean norm. The gradient of ρ_{g,μ} is

    ∇ρ_{g,μ}(θ) = (d/μ) E_v [ ρ_g(θ + μv) v ],   (7)

where v is sampled uniformly at random from the unit sphere S_d = {v ∈ ℝ^d : ∥v∥ = 1} (cf. [22, Lemma 2.1]).

We estimate the gradient using two randomly perturbed policies, namely π_{θ+μv} and π_{θ−μv}, where v is a random unit vector sampled uniformly from the surface of the unit sphere. The estimate ∇̂_{μ,n} ρ̂^G_g(θ) of ∇ρ_g(θ) is formed as follows:

    ∇̂_{μ,n} ρ̂^G_g(θ) = (d/n) ∑_{i=1}^{n} [ ( ρ̂^G_g(θ + μv_i) − ρ̂^G_g(θ − μv_i) ) / (2μ) ] v_i,   (8)

where each v_i is sampled uniformly at random from S_d, and ρ̂^G_g is as defined in (5). The gradient estimate is averaged over n unit vectors to reduce the variance.

The update iteration in DRM-OnP-SF is as follows:

    θ_{k+1} = θ_k + α ∇̂_{μ,n} ρ̂^G_g(θ_k),   (9)

where θ_0 is set arbitrarily, and α is the step-size. Algorithm 1 presents the pseudocode of DRM-OnP-SF.
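To make the full loop concrete, the following toy sketch (entirely our construction: a synthetic "episode simulator" whose returns are Gaussian around −∥θ − 1∥², standing in for actual rollouts) combines the order-statistics estimate (5), the two-point SF gradient (8), and the ascent step (9):

```python
import numpy as np

rng = np.random.default_rng(1)

def dual_power(u, k=2.0):
    return 1.0 - (1.0 - u) ** k  # a smooth concave distortion function

def drm_estimate(returns, g):
    # order-statistics DRM estimator, as in (5)
    r = np.sort(returns)
    m = len(r)
    i = np.arange(1, m + 1)
    return float(np.dot(r, g(1.0 - (i - 1) / m) - g(1.0 - i / m)))

def sample_returns(theta, m):
    # stand-in for rolling out m episodes under the policy pi_theta
    return -np.sum((theta - 1.0) ** 2) + rng.normal(0.0, 0.1, size=m)

d, n, m, mu, alpha = 3, 4, 64, 0.05, 0.05
theta = np.zeros(d)
for _ in range(400):
    grad = np.zeros(d)
    for _ in range(n):  # average over n random unit directions, as in (8)
        v = rng.normal(size=d)
        v /= np.linalg.norm(v)
        rho_plus = drm_estimate(sample_returns(theta + mu * v, m), dual_power)
        rho_minus = drm_estimate(sample_returns(theta - mu * v, m), dual_power)
        grad += (d / n) * (rho_plus - rho_minus) / (2.0 * mu) * v
    theta = theta + alpha * grad  # gradient ascent step, as in (9)

print(np.round(theta, 2))  # close to the maximizer [1, 1, 1]
```

The iterate drifts toward the DRM maximizer because the two-point difference along random unit directions is, in expectation, proportional to the smoothed gradient.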

### III-B DRM optimization in an off-policy RL setting

Every iteration of DRM-OnP-SF needs 2nm episodes, corresponding to the 2n perturbed policies (see Algorithm 1). In some practical applications, it may not be feasible to generate system trajectories corresponding to different perturbed policies. In our second algorithm, DRM-OffP-SF, we overcome the aforementioned problem by performing off-policy evaluation, i.e., we collect episodes from a behavior policy b, and estimate the values of the perturbed target policies. In the off-policy setting, the number of episodes needed in each iteration of our algorithm can be reduced to m: using the same m episodes, we can calculate values for all 2n perturbed policies, and hence we can use a constant number of episodes in each iteration of our algorithm.

For the analysis, we require the behavior policy b to be proper, i.e.,

###### (A2)

The absorbing state 0 is reached with probability one under the behavior policy b.

Also, we require the target policies π_θ to be absolutely continuous with respect to b, i.e.,

###### (A3)

For every (s, a) ∈ S × A, b(a|s) = 0 implies π_θ(a|s) = 0, ∀ θ ∈ ℝ^d.

An assumption like (A3) is a standard requirement for off-policy evaluation (cf. [23]).

In the following sections, we provide the estimation scheme for the DRM ρ_g(θ), and its gradient ∇ρ_g(θ), in the off-policy setting.

#### III-B1 DRM estimation

The cumulative discounted reward under the behavior policy is defined by R^b = ∑_{t=0}^{T−1} γ^t r(S_t, A_t), where A_t ∼ b(·|S_t), and T is the random episode length.

We generate m episodes using the policy b, and estimate the CDF F_{R^θ} using importance sampling. The importance sampling ratio is defined as

    ψ^θ = ∏_{t=0}^{T−1} π_θ(A_t | S_t) / b(A_t | S_t).   (10)

We denote by R^b_i the cumulative reward, and by ψ^θ_i the importance sampling ratio, of the episode i. We form the estimate H^m_{R^θ} of F_{R^θ} as follows:

    H^m_{R^θ}(x) = min{ Ĥ^m_{R^θ}(x), 1 }, where   (11)
    Ĥ^m_{R^θ}(x) = (1/m) ∑_{i=1}^{m} 𝟙{R^b_i ≤ x} ψ^θ_i.   (12)

In the above, Ĥ^m_{R^θ}(x) is an unbiased empirical estimate of F_{R^θ}(x), as E[𝟙{R^b ≤ x} ψ^θ] = F_{R^θ}(x). Because of the importance sampling ratio, Ĥ^m_{R^θ}(x) can take a value above 1. Since we are estimating a CDF, we restrict H^m_{R^θ}(x) to [0, 1]. The mean squared error of our estimator is O(1/m). The reader is referred to Lemma 18 in Appendix E for a proof.

Now, we form an estimate ρ̂^H_g(θ) of ρ_g(θ) as

    ρ̂^H_g(θ) = ∫_{−M_r}^{0} ( g(1 − H^m_{R^θ}(x)) − 1 ) dx + ∫_{0}^{M_r} g(1 − H^m_{R^θ}(x)) dx.   (13)

We can simplify (13) in terms of order statistics as

    ρ̂^H_g(θ) = R^b_{(1)} + ∑_{i=2}^{m} R^b_{(i)} g( 1 − min{1, (1/m) ∑_{k=1}^{i−1} ψ^θ_{(k)}} )
              − ∑_{i=1}^{m−1} R^b_{(i)} g( 1 − min{1, (1/m) ∑_{k=1}^{i} ψ^θ_{(k)}} ),   (14)

where R^b_{(i)} is the i-th smallest order statistic of the samples R^b_1, …, R^b_m, and ψ^θ_{(i)} is the importance sampling ratio of the episode corresponding to R^b_{(i)}. The reader is referred to Lemma 2 in Appendix A for a proof.
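A small sketch of the importance-weighted estimator (our code, not the paper's; we use the telescoped form ρ̂ = R_(1) + Σ_i (R_(i+1) − R_(i)) g(1 − c_i), with c_i = min{1, (1/m) Σ_{k≤i} ψ_(k)}, which is algebraically equivalent to (14) since g(1) = 1). With unit weights it reduces to the on-policy estimator (5):

```python
import numpy as np

def drm_estimate_offpolicy(returns_b, weights, g):
    """Importance-weighted order-statistics DRM estimate, equivalent to (14)."""
    order = np.argsort(returns_b)
    r = np.asarray(returns_b, dtype=float)[order]  # R^b_(1) <= ... <= R^b_(m)
    w = np.asarray(weights, dtype=float)[order]    # psi_(i), matched to R^b_(i)
    m = len(r)
    c = np.minimum(1.0, np.cumsum(w) / m)          # truncated EDF levels c_1..c_m
    return float(r[0] + np.dot(np.diff(r), g(1.0 - c[:-1])))

identity = lambda u: u
rng = np.random.default_rng(2)
returns = rng.normal(size=500)

# With unit weights and the identity distortion, this is the sample mean:
print(np.isclose(drm_estimate_offpolicy(returns, np.ones(500), identity),
                 returns.mean()))  # True
```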

We use the SF-based gradient estimation scheme as in Section III-A2, and an estimate of the gradient is formed as follows:

    ∇̂_{μ,n} ρ̂^H_g(θ) = (d/n) ∑_{i=1}^{n} [ ( ρ̂^H_g(θ + μv_i) − ρ̂^H_g(θ − μv_i) ) / (2μ) ] v_i.   (15)

The update iteration in DRM-OffP-SF is as follows:

    θ_{k+1} = θ_k + α ∇̂_{μ,n} ρ̂^H_g(θ_k).   (16)

The pseudocode of the DRM-OffP-SF algorithm is similar to Algorithm 1, but in each iteration, we collect m episodes from the behavior policy b. We then compute DRM estimates using (14), estimate the gradient using (15), and update the policy parameter using (16). The reader is referred to Algorithm 2 in Appendix F.

## IV Main results

Our non-asymptotic analysis establishes a bound on the number of iterations of our proposed algorithms to find an ε-stationary point of the DRM, which is defined below.

###### Definition 1 (ϵ-stationary point)

Let θ_R be the output of an algorithm. Then, θ_R is called an ε-stationary point of problem (2) if E[∥∇ρ_g(θ_R)∥²] ≤ ε.

The study of convergence of policy gradient algorithms to an ε-stationary point is common in non-asymptotic analyses in RL, since the objective is non-convex (cf. [24, 25]).

### IV-A Non-asymptotic bounds for DRM-OnP-SF

We make the following assumptions to ensure the Lipschitzness and smoothness of the DRM ρ_g(θ).

###### (A4)

For every (s, a) ∈ S × A, the policy π_θ(a|s) is twice continuously differentiable in θ, with bounded first and second derivatives.

###### (A5)

The distortion function g is twice differentiable, with |g′(u)| ≤ M_{g′} and |g″(u)| ≤ M_{g″}, ∀ u ∈ [0, 1].

An assumption like (A4) is common in the literature for the non-asymptotic analysis of policy gradient algorithms (cf. [26, 25]). The assumption (A5) helps us establish that the distortion function and its derivative are Lipschitz continuous. A few examples of distortion functions that satisfy (A5) are given in Table I. Since g is bounded by definition, we can see that any g whose second derivative is bounded will have a bounded first derivative as well.

Letting T denote the (random) episode length under a proper policy π_θ, (A1) ensures that every episode terminates almost surely. This fact, in conjunction with the finiteness of the state and action spaces, implies

    ∃ M_e > 0 : T ≤ M_e a.s.   (17)

The main result that establishes a non-asymptotic bound for DRM-OnP-SF is given below. This result is for a random iterate θ_R, chosen uniformly at random from the policy parameters θ_0, …, θ_{N−1}. Such a randomized stochastic gradient algorithm has been studied earlier in a stochastic optimization setting in [27].

###### Theorem 1

(DRM-OnP-SF) Assume (A1), (A4)-(A5). Let θ_0, …, θ_{N−1} be the policy parameters generated by DRM-OnP-SF, and let θ_R be chosen uniformly at random from this set. Set α = 1/√N, μ = 1/N^{1/4}, and n = N. Then

    E[∥∇ρ_g(θ_R)∥²] ≤ 2(ρ*_g − ρ_g(θ_0))/√N + d²L²_{ρ′}/√N + 2d²L_{ρ′}L²_ρ/(N√N)
                     + 16d²L_{ρ′}M²_rM²_{g′}/(mN) + 4d²L²_ρ/N + 16d²M²_rM²_{g′}/(m√N).

In the above, ρ*_g = sup_θ ρ_g(θ), and m is the number of episodes used per DRM estimate. The constants L_ρ and L_{ρ′} are the Lipschitz constants of ρ_g and its gradient (see (18)-(19)), and depend on M_e as in (17). The constants M_r and M_{g′} are as in (A4)-(A5).

###### Remark 1

The result above shows that, after N iterations of (9), DRM-OnP-SF returns an iterate θ_R that satisfies E[∥∇ρ_g(θ_R)∥²] = O(1/√N). To put it differently, to find an ε-stationary point of the DRM objective, an order of O(1/ε²) iterations of DRM-OnP-SF is enough.

Proof (Theorem 1): We provide a proof sketch below. For a detailed proof, the reader is referred to Appendix D.

The proof uses the following results related to our on-policy estimation scheme:

1) The DRM and its gradient are Lipschitz, i.e., ∀ θ_1, θ_2 ∈ ℝ^d,

    |ρ_g(θ_1) − ρ_g(θ_2)| ≤ L_ρ ∥θ_1 − θ_2∥,   (18)
    ∥∇ρ_g(θ_1) − ∇ρ_g(θ_2)∥ ≤ L_{ρ′} ∥θ_1 − θ_2∥.   (19)

2) The DRM estimation error satisfies

    E[ |ρ_g(θ) − ρ̂^G_g(θ)|² ] ≤ 16M²_rM²_{g′}/m.   (20)

3) The bias of the DRM gradient estimate satisfies

    E[ ∥∇̂_{μ,n} ρ̂^G_g(θ) − ∇ρ_g(θ)∥² ] ≤ μ²d²L²_{ρ′} + 4d²L²_ρ/n + 16d²M²_rM²_{g′}/(μ²mn).   (21)

4) The variance of the DRM gradient estimate is bounded by

    E[ ∥∇̂_{μ,n} ρ̂^G_g(θ)∥² ] ≤ 2d²L²_ρ/n + 16d²M²_rM²_{g′}/(μ²mn).   (22)

We now turn to proving the main result. Using the fundamental theorem of calculus, we obtain

    ρ_g(θ_k) − ρ_g(θ_{k+1})
    = ⟨∇ρ_g(θ_k), θ_k − θ_{k+1}⟩ + ∫_0^1 ⟨∇ρ_g(θ_{k+1} + τ(θ_k − θ_{k+1})) − ∇ρ_g(θ_k), θ_k − θ_{k+1}⟩ dτ
    ≤ ⟨∇ρ_g(θ_k), θ_k − θ_{k+1}⟩ + (L_{ρ′}/2) ∥θ_k − θ_{k+1}∥²
    = −α ⟨∇ρ_g(θ_k), ∇̂_{μ,n} ρ̂^G_g(θ_k)⟩ + (L_{ρ′}α²/2) ∥∇̂_{μ,n} ρ̂^G_g(θ_k)∥²
    = α ⟨∇ρ_g(θ_k), ∇ρ_g(θ_k) − ∇̂_{μ,n} ρ̂^G_g(θ_k)⟩ − α ∥∇ρ_g(θ_k)∥² + (L_{ρ′}α²/2) ∥∇̂_{μ,n} ρ̂^G_g(θ_k)∥²
    ≤ (α/2) ∥∇ρ_g(θ_k)∥² + (α/2) ∥∇ρ_g(θ_k) − ∇̂_{μ,n} ρ̂^G_g(θ_k)∥² − α ∥∇ρ_g(θ_k)∥² + (L_{ρ′}α²/2) ∥∇̂_{μ,n} ρ̂^G_g(θ_k)∥²
    = −(α/2) ∥∇ρ_g(θ_k)∥² + (α/2) ∥∇ρ_g(θ_k) − ∇̂_{μ,n} ρ̂^G_g(θ_k)∥² + (L_{ρ′}α²/2) ∥∇̂_{μ,n} ρ̂^G_g(θ_k)∥²,   (23)

where the first inequality uses (19), and the second uses ⟨a, b⟩ ≤ (∥a∥² + ∥b∥²)/2.

Rearranging and taking expectations on both sides of (23), we obtain

    α E[∥∇ρ_g(θ_k)∥²] ≤ 2 E[ρ_g(θ_{k+1}) − ρ_g(θ_k)] + L_{ρ′}α² E[∥∇̂_{μ,n} ρ̂^G_g(θ_k)∥²]
                        + α E[∥∇ρ_g(θ_k) − ∇̂_{μ,n} ρ̂^G_g(θ_k)∥²].   (24)

Using (21) and (22), we simplify (24) as

    α E[∥∇ρ_g(θ_k)∥²] ≤ 2 E[ρ_g(θ_{k+1}) − ρ_g(θ_k)] + L_{ρ′}α² ( 2d²L²_ρ/n + 16d²M²_rM²_{g′}/(μ²mn) )
                        + α ( 4d²L²_ρ/n + μ²d²L²_{ρ′} + 16d²M²_rM²_{g′}/(μ²mn) ).   (25)

Summing (25) over k = 0, …, N−1, we obtain

    α ∑_{k=0}^{N−1} E[∥∇ρ_g(θ_k)∥²] ≤ 2 E[ρ_g(θ_N) − ρ_g(θ_0)] + N L_{ρ′}α² ( 2d²L²_ρ/n + 16d²M²_rM²_{g′}/(μ²mn) )
                                      + N α ( 4d²L²_ρ/n + μ²d²L²_{ρ′} + 16d²M²_rM²_{g′}/(μ²mn) ).

Since θ_R is chosen uniformly at random from the policy iterates θ_0, …, θ_{N−1}, we obtain

    E[∥∇ρ_g(θ_R)∥²] = (1/N) ∑_{k=0}^{N−1} E[∥∇ρ_g(θ_k)∥²]
    ≤ 2(ρ*_g − ρ_g(θ_0))/(Nα) + L_{ρ′}α ( 2d²L²_ρ/n + 16d²M²_rM²_{g′}/(μ²mn) )
      + 4d²L²_ρ/n + μ²d²L²_{ρ′} + 16d²M²_rM²_{g′}/(μ²mn)
    ≤ 2(ρ*_g − ρ_g(θ_0))/√N + d²L²_{ρ′}/√N + 2d²L_{ρ′}L²_ρ/(N√N)
      + 16d²L_{ρ′}M²_rM²_{g′}/(mN) + 4d²L²_ρ/N + 16d²M²_rM²_{g′}/(m√N),

where the last inequality follows since α = 1/√N, μ = 1/N^{1/4}, and n = N.

### IV-B Non-asymptotic bounds for DRM-OffP-SF

The main result that establishes a non-asymptotic bound for our algorithm DRM-OffP-SF is given below.

###### Theorem 2

(DRM-OffP-SF) Assume (A1)-(A5). Let θ_0, …, θ_{N−1} be the policy parameters generated by DRM-OffP-SF, and let θ_R be chosen uniformly at random from this set. Set α = 1/√N, μ = 1/N^{1/4}, and n = N. Then

    E[∥∇ρ_g(θ_R)∥²] ≤ 2(ρ*_g − ρ_g(θ_0))/√N + d²L²_{ρ′}/√N + 2d²L_{ρ′}L²_ρ/(N√N)
                     + 16d²L_{ρ′}M²_rM²_{g′}M²_s/(mN) + 4d²L²_ρ/N + 16d²M²_rM²_{g′}M²_s/(m√N).

In the above, ρ*_g = sup_θ ρ_g(θ), and m is the number of episodes used per DRM estimate. The constants L_ρ and L_{ρ′} are the Lipschitz constants of ρ_g and its gradient, and depend on M_e as in (17). The constants M_r and M_{g′} are as in (A4)-(A5), while M_s is a uniform upper bound on the importance sampling ratio ψ^θ.

(Sketch) To establish the main result, we follow the technique employed in the proof of Theorem 1, and use the following results related to our off-policy estimation scheme in place of their on-policy counterparts:
1) The estimation error of the DRM satisfies E[ |ρ_g(θ) − ρ̂^H_g(θ)|² ] ≤ 16M²_rM²_{g′}M²_s/m.
2) The bias of the DRM gradient estimate is bounded by E[ ∥∇̂_{μ,n} ρ̂^H_g(θ) − ∇ρ_g(θ)∥² ] ≤ μ²d²L²_{ρ′} + 4d²L²_ρ/n + 16d²M²_rM²_{g′}M²_s/(μ²mn).
3) The variance of the DRM gradient estimate is bounded by E[ ∥∇̂_{μ,n} ρ̂^H_g(θ)∥² ] ≤ 2d²L²_ρ/n + 16d²M²_rM²_{g′}M²_s/(μ²mn).
The reader is referred to Appendix E for the detailed proof.

## V Conclusions

We proposed DRM-based approximate gradient ascent algorithms for risk-sensitive RL control. We employed SF-based gradient estimation schemes in on-policy as well as off-policy RL settings, and provided non-asymptotic bounds that establish convergence to an approximate stationary point of the DRM.

As future work, it would be interesting to study DRM optimization in a risk-sensitive RL setting with feature-based representations, and function approximation.

## References

• [1] P. Artzner, F. Delbaen, J. Eber, and D. Heath, “Coherent measures of risk,” Mathematical Finance, vol. 9, no. 3, pp. 203–228, 1999.
• [2] S. Wang, “A risk measure that goes beyond coherence,” 2002.
• [3] R. T. Rockafellar and S. Uryasev, “Optimization of conditional value-at-risk,” Journal of risk, vol. 2, pp. 21–42, 2000.
• [4] A. Tversky and D. Kahneman, “Advances in prospect theory: Cumulative representation of uncertainty,” J. Risk Uncertain., vol. 5, 1992.
• [5] D. Denneberg, “Distorted probabilities and insurance premiums,” Methods of Operations Research, vol. 63, no. 3, pp. 3–5, 1990.
• [6] S. S. Wang, V. R. Young, and H. H. Panjer, “Axiomatic characterization of insurance prices,” Insur. Math. Econ., vol. 21, no. 2, 1997.
• [7] J. Wirch and M. Hardy, “Distortion risk measures: Coherence and stochastic dominance,” Insur. Math. Econ., vol. 32, pp. 168–168, 2003.
• [8] H. Gzyl and S. Mayoral, “On a relationship between distorted and spectral risk measures,” 2006.
• [9] B. Jones and R. Zitikis, “Empirical estimation of risk measures and related quantities,” North American Actuarial Journal, vol. 7, 2003.
• [10] S. Wang, “Premium calculation by transforming the layer premium density,” ASTIN Bulletin, vol. 26, no. 1, pp. 71–92, 1996.
• [11] V. Katkovnik and Y. Kulchitsky, “Convergence of a class of random search algorithms,” Autom. Remote. Control., vol. 33, 1972.
• [12] Y. Nesterov and V. Spokoiny, “Random gradient-free minimization of convex functions,” Found. Comut. Math., vol. 17, pp. 527– 566, 2017.
• [13] S. Bhatnagar, H. Prasad, and L. A. Prashanth, “Stochastic recursive algorithms for optimization. simultaneous perturbation methods,” Lecture Notes in Control and Inform. Sci., vol. 434, 2013.
• [14] A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor, “Policy gradient for coherent risk measures,” in Adv. Neural Inf. Process. Syst., 2015.
• [15] L. A. Prashanth and M. Fu, “Risk-sensitive reinforcement learning,” arXiv preprint arXiv:1810.09126, 2021.
• [16] P. Glynn, Y. Peng, M. Fu, and J. Hu, “Computing sensitivities for distortion risk measures,” INFORMS J. Comp., 2021.
• [17] N. Vijayan and L. A. Prashanth, “Policy gradient methods for distortion risk measures,” arXiv preprint arXiv:2107.04422, 2021.
• [18] L. A. Prashanth, C. Jie, M. Fu, S. Marcus, and C. Szepesvari, “Cumulative prospect theory meets reinforcement learning: Prediction and control,” in ICML, vol. 48, 2016, pp. 1406–1415.
• [19] K. Dowd and D. Blake, “After VaR: The theory, estimation, and insurance applications of quantile-based risk measures,” The Journal of Risk and Insurance, vol. 73, no. 2, pp. 193–229, 2006.
• [20] A. Balbás, J. Garrido, and S. Mayoral, “Properties of distortion risk measures,” Methodology and Computing in Applied Probability, vol. 11, no. 3, pp. 385–399, 2009.
• [21] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, 1st ed.   Athena Scientific, 1996.
• [22] A. D. Flaxman, A. T. Kalai, and H. B. McMahan, “Online convex optimization in the bandit setting: Gradient descent without a gradient,” in ACM-SIAM Symposium on Discrete Algorithms, 2005, pp. 385–394.
• [23] R. S. Sutton, H. Maei, and C. Szepesvári, “A convergent o(n) temporal-difference algorithm for off-policy learning with linear function approximation,” in Adv. Neural Inf. Process. Syst., vol. 21, 2009.
• [24] M. Papini, D. Binaghi, G. Canonaco, M. Pirotta, and M. Restelli, “Stochastic variance-reduced policy gradient,” in ICML, 2018.
• [25] Z. Shen, A. Ribeiro, H. Hassani, H. Qian, and C. Mi, “Hessian aided policy gradient,” in ICML, 2019, pp. 5729–5738.
• [26] K. Zhang, A. Koppel, H. Zhu, and T. Basar, “Global convergence of policy gradient methods to (almost) locally optimal policies,” SIAM J. Control. Optim., vol. 58, no. 6, pp. 3586–3612, 2020.
• [27] S. Ghadimi and G. Lan, “Stochastic first- and zeroth-order methods for nonconvex stochastic programming,” SIAM J. Optim., vol. 23, 2013.
• [28] J. Kim, “Bias correction for estimated distortion risk measure using the bootstrap,” Insur.: Math. Econ., vol. 47, pp. 198–205, 2010.
• [29] Y. E. Nesterov, Introductory Lectures on Convex Optimization - A Basic Course, ser. Applied Optimization, 2004, vol. 87.
• [30] X. Gao, B. Jiang, and S. Zhang, “On the information-adaptive variants of the admm: An iteration complexity perspective,” J. Sci. Comput., vol. 76, no. 1, pp. 327–363, 2018.
• [31] O. Shamir, “An optimal algorithm for bandit and zero-order convex optimization with two-point feedback,” J. Mach. Learn. Res., vol. 18, no. 1, pp. 1703–1713, 2017.

## Appendix A Estimating DRM using Order statistics

The following lemma expresses the on-policy DRM estimate in terms of order statistics.

###### Lemma 1

ρ̂^G_g(θ) = ∑_{i=1}^{m} R^θ_{(i)} ( g(1 − (i−1)/m) − g(1 − i/m) ).

Proof: Our proof follows the technique from [28]. We rewrite (3) as

    G^m_{R^θ}(x) = 0 for x < R^θ_{(1)};   G^m_{R^θ}(x) = i/m for R^θ_{(i)} ≤ x < R^θ_{(i+1)}, i = 1, …, m−1;   G^m_{R^θ}(x) = 1 for x ≥ R^θ_{(m)},

where R^θ_{(i)} is the i-th smallest order statistic from the samples R^θ_1, …, R^θ_m.

We assume without loss of generality that R^θ_{(1)} ≤ 0 ≤ R^θ_{(m)}, and obtain

    ρ̂^G_g(θ) = ∫_{−M_r}^{0} ( g(1 − G^m_{R^θ}(x)) − 1 ) dx + ∫_{0}^{M_r} g(1 − G^m_{R^θ}(x)) dx.