# Learning with CVaR-based feedback under potentially heavy tails

We study learning algorithms that seek to minimize the conditional value-at-risk (CVaR), when all the learner knows is that the losses incurred may be heavy-tailed. We begin by studying a general-purpose estimator of CVaR for potentially heavy-tailed random variables, which is easy to implement in practice, and requires nothing more than finite variance and a distribution function that does not change too fast or slow around just the quantile of interest. With this estimator in hand, we then derive a new learning algorithm which robustly chooses among candidates produced by stochastic gradient-driven sub-processes. For this procedure we provide high-probability excess CVaR bounds, and to complement the theory we conduct empirical tests of the underlying CVaR estimator and the learning algorithm derived from it.


## 1 Introduction

In machine learning problems, since we only have access to limited information about the underlying data-generating phenomena or goal of interest, there is significant uncertainty inherent in the learning task. As a result, any meaningful performance guarantee for a learning procedure can only be stated with some degree of confidence (e.g., a high probability “good performance” event), usually with respect to the random draw of the data used for training. Assuming some loss $\ell(w;z)$ depending on a parameter $w \in \mathcal{W}$ and data realization $z \in \mathcal{Z}$, given random data $Z$ distributed as $\mu$, the de facto standard performance metric in machine learning is the risk, or expected loss, defined

$$ R(w) := \mathbb{E}_{\mu}\,\ell(w;Z) = \int_{\mathcal{Z}} \ell(w;z)\,\mu(\mathrm{d}z), \qquad w \in \mathcal{W}. \tag{1} $$

The vast majority of research done on machine learning algorithms provides performance guarantees stated in terms of the risk [14, 12, 1]. This risk-centric paradigm goes beyond the theory and reaches into the typical workflow of any machine learning practitioner, since “off-sample performance” is typically evaluated using the average loss on a separate set of “test data,” an empirical counterpart to the risk studied in theory. While the risk is convenient in terms of probabilistic analysis, it is merely one of countless possible descriptors of the distribution of $\ell(w;Z)$. When using a learning algorithm designed to minimize the risk, one makes an implicit value judgement about how the learner should be penalized for “typical” mistakes versus “atypical” but egregious errors.

As machine learning techniques are applied in increasingly diverse domains, it is important to make this value judgement more explicit, and to offer users more flexibility in controlling the ultimate goal of learning. One of the best-known alternatives to the risk is the conditional value-at-risk (CVaR), which considers the expected loss, conditioned on the event that the loss exceeds a user-specified $(1-\alpha)$-level quantile, here denoted for each $w \in \mathcal{W}$ as

$$ C_\alpha(w) := \frac{1}{\alpha}\,\mathbb{E}_{\mu}\,\ell(w;Z)\,I\{\ell(w;Z) \geq V_\alpha(w)\} = \frac{1}{\alpha}\int_{\ell(w;z) \geq V_\alpha(w)} \ell(w;z)\,\mu(\mathrm{d}z), \tag{2} $$

where $V_\alpha(w) := \inf\{u \in \mathbb{R} : \mu\{\ell(w;Z) \leq u\} \geq 1-\alpha\}$ (called value-at-risk, or VaR). Driven by influential work by Artzner et al. [2] and Rockafellar and Uryasev [32], under known parametric models, the problem of estimating and minimizing the CVaR reliably and efficiently has been rigorously studied, leading to a wide range of applications in finance [22, 26], and even some specialized settings of machine learning tasks [37, 11]. In general machine learning tasks, however, a non-parametric scenario is more typical, where virtually nothing is known about the distribution of $\ell(w;Z)$, adding significant challenges to both the design and analysis of procedures designed to minimize the CVaR with high confidence.
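For orientation, here is the naive empirical counterpart of (2) on a finite sample (our illustration, not from the paper; the estimators studied below replace the plain tail average with robust sub-routines):

```python
import numpy as np

def empirical_var_cvar(losses, alpha):
    """Naive plug-in estimates of VaR and CVaR at level alpha.

    VaR is the empirical (1 - alpha)-quantile; CVaR averages the
    losses in the upper alpha-tail, i.e. those at or above VaR.
    """
    losses = np.sort(np.asarray(losses, dtype=float))
    n = len(losses)
    # smallest order statistic achieving empirical CDF >= 1 - alpha
    idx = int(np.ceil((1.0 - alpha) * n)) - 1
    var = losses[idx]
    cvar = losses[losses >= var].mean()
    return var, cvar

# Losses 1..100 with alpha = 0.05: VaR is the 95th smallest value, and
# CVaR averages the values at or above it (95 through 100).
var, cvar = empirical_var_cvar(np.arange(1, 101), 0.05)
print(var, cvar)  # 95.0 97.5
```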

##### Our contributions

In this work, we consider the case of potentially heavy-tailed losses, namely a learning setup in which all the learner knows is that the distribution of $\ell(w;Z)$ has finite variance. It is unknown in advance whether the losses are statistically congenial in the sub-Gaussian sense, or highly susceptible to outliers with infinite higher-order moments. Our main contributions:

• New error bounds for a large class of estimators of the CVaR for potentially heavy-tailed random variables (Algorithm 1, Theorem 3).

• A general-purpose learning algorithm which runs stochastic GD sub-processes in parallel and uses the new CVaR estimators to robustly validate the strongest candidate (Algorithm 2), which enjoys sharp excess CVaR bounds (Theorem 4).

• An empirical study (section 3) highlighting the potential computational advantages and robustness of the proposed approach to CVaR-based learning.

##### Review of related work

To put the contributions stated above in context, we give an overview of the two key strands of technical literature that are closely related to our work. First, an interesting line of work has recently developed which handles risk-averse learning scenarios where the losses can be heavy-tailed, with key works due to Kolla et al. [20], Prashanth et al. [30], Bhat and Prashanth [4], and Kagrecha et al. [19]. These works all consider some kind of sub-routine for robustly estimating the CVaR, as we do as well. The actual estimation procedures and proof techniques differ, and we provide a detailed comparison of the resulting error bounds in section 2.2.1. Furthermore, the latter three works only consider rather specialized learning algorithms in the context of bandit-like online learning problems, whereas the generic gradient-based procedures we study in section 2.3 have a much wider range of applications. Second, recent works from Cardoso and Xu [8] and Soma and Yoshida [35] also consider tackling the CVaR-based learning problem using general-purpose gradient-based stochastic learning algorithms. However, these works assume a bounded (and thus sub-Gaussian) loss; we discuss differences in technical assumptions in detail in Remark 5, but the most important difference is that their setup precludes the possibility of heavy-tailed losses and is thus more restrictive statistically than ours, which naturally leads to different algorithms, proof techniques, and performance guarantees.

## 2 Theoretical analysis

This section is broken into three sub-sections. First we establish notation and basic technical conditions in section 2.1. We then study pointwise CVaR estimators in section 2.2, and subsequently leverage these results to derive a new learning algorithm with performance guarantees in section 2.3.

### 2.1 Preliminaries

In the context of learning problems, the random variable $Z$ denotes our data, taking values in some measurable space $\mathcal{Z}$, with $\mu$ the probability measure induced by $Z$. The set $\mathcal{W}$ is a parameter set from which the learning algorithm chooses an element. We reinforce the point that the ultimate formal goal of learning here is to minimize $C_\alpha(\cdot)$ defined in (2) over $\mathcal{W}$, where $\alpha \in (0,1)$ is a user-specified risk-level parameter. This is in contrast with the traditional risk-centric setup, which seeks to minimize $R(\cdot)$ defined in (1). For the pointwise estimation problem in section 2.2 to follow, to cut down on excess notation, we simply take $X := \ell(w;Z)$, re-christen $\mu$ as the distribution of $X$, and write the distribution function as $F_\mu(u) := \mu\{X \leq u\}$ for $u \in \mathbb{R}$. Similarly, since the choice of $w$ is not important in section 2.2, there we shall write simply $C_\alpha$ and $V_\alpha$ for the CVaR and VaR of $X$, and return to the $w$-dependent notation $C_\alpha(w)$ and $V_\alpha(w)$ in section 2.3. For any $k \in \mathbb{N}$, we denote by $[k]$ all positive integers less than or equal to $k$. Finally, let $I\{E\}$ denote the indicator function, returning $1$ when event $E$ is true, and $0$ otherwise.

Regarding technical assumptions, we shall henceforth assume that $F_\mu$ is continuous, which in particular implies that $\mu\{X = u\} = 0$ for all $u \in \mathbb{R}$. This setup is entirely traditional; see for example the well-known work of Rockafellar and Uryasev [32]. In general, if $F_\mu$ has flat regions, there may be infinitely many $(1-\alpha)$-level quantiles; here $V_\alpha$ as introduced in section 1 is simply defined to be the smallest one. See Figure 1 for an illustration. The key technical assumption that will be utilized is as follows:

• (A1) There exist values $L, \gamma > 0$ and $\varepsilon_0 > 0$ such that for any $0 < \varepsilon \leq \varepsilon_0$, the distribution function induced by $\mu$ satisfies
  $$ \gamma\,\varepsilon \leq F_\mu(V_\alpha + \varepsilon) - F_\mu(V_\alpha) \leq L\,\varepsilon \quad \text{and} \quad \gamma\,\varepsilon \leq F_\mu(V_\alpha) - F_\mu(V_\alpha - \varepsilon) \leq L\,\varepsilon. $$

Obviously, we are assuming that $V_\alpha \pm \varepsilon_0$ are within the domain of $F_\mu$; this is only for notational simplicity, and $\varepsilon_0$ can be taken arbitrarily small. In words, assumption (A1) is a local assumption of both an $L$-Lipschitz property and a $\gamma$-growth property, local in the sense that it need only hold around the particular point $V_\alpha$ of interest. The former property ensures that $F_\mu$ cannot jump with arbitrary steepness in the region of interest. The latter ensures that $F_\mu$ is not flat in this region. Finally, we remark that the property of $\gamma$-growth is utilized in key recent work done on concentration of CVaR estimators under potentially heavy-tailed data, including Kolla et al. [20, Prop. 2] and Prashanth et al. [30, Lem. 5.1].
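As a worked example of this assumption (ours, not from the paper), consider a unit exponential loss: $F_\mu(u) = 1 - e^{-u}$, so $V_\alpha = \log(1/\alpha)$ and the density at $V_\alpha$ equals $\alpha$. Taking the growth constant slightly below $\alpha$ and the Lipschitz constant slightly above it satisfies both inequalities on a small interval:

```python
import math

alpha, eps0 = 0.05, 0.1
F = lambda u: 1.0 - math.exp(-u)        # Exp(1) distribution function
v_alpha = math.log(1.0 / alpha)         # V_alpha satisfies F(v_alpha) = 1 - alpha

# Candidate local constants (our choice): growth gamma, Lipschitz L.
gamma = alpha * math.exp(-eps0)
L = alpha * math.exp(eps0)

for k in range(1, 101):
    eps = eps0 * k / 100.0
    up = F(v_alpha + eps) - F(v_alpha)      # upper-side increment
    down = F(v_alpha) - F(v_alpha - eps)    # lower-side increment
    assert gamma * eps <= up <= L * eps
    assert gamma * eps <= down <= L * eps
print("local Lipschitz/growth condition holds with gamma=%.4f, L=%.4f" % (gamma, L))
```

The same exercise fails for a distribution function that is flat (or jumps) at the quantile of interest, which is exactly what the assumption rules out.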

### 2.2 Robust estimation of the CVaR criterion

We begin by considering pointwise estimates, assuming that $X$ is a non-negative random variable, and that we have $2n$ independent copies of $X$, denoted $\mathcal{X}_n = (X_1,\dots,X_n)$ for the first half, and $\mathcal{Y}_n = (Y_1,\dots,Y_n)$ for the second half. The latter half will be used to construct an estimator $\hat{V}_\alpha$ of $V_\alpha$. The former half, with $\hat{V}_\alpha$ in hand, will be used to construct an estimator $\hat{C}_\alpha$ of $C_\alpha$. As an initial approach to the problem, note that we can decompose the deviations as

$$ \begin{aligned} \left|\hat{C}_\alpha - C_\alpha\right| &= \frac{1}{\alpha}\left| \alpha\hat{C}_\alpha - \mathbb{E}_{\mu} X I\{X \geq \hat{V}_\alpha\} + \mathbb{E}_{\mu} X I\{X \geq \hat{V}_\alpha\} - \mathbb{E}_{\mu} X I\{X \geq V_\alpha\} \right| \\ &\leq \frac{1}{\alpha}\left( \left|\alpha\hat{C}_\alpha - \mathbb{E}_{\mu} X I\{X \geq \hat{V}_\alpha\}\right| + \left|\mathbb{E}_{\mu} X \left(I\{X \geq \hat{V}_\alpha\} - I\{X \geq V_\alpha\}\right)\right| \right). \end{aligned} \tag{3} $$

This gives us two terms to control. Starting with the left-most term, let us first make the notation a bit easier to manage. Conditioning on $\mathcal{Y}_n$ makes $\hat{V}_\alpha$ a fixed value, and based on this, we define

$$ X' := X\,I\{X \geq \hat{V}_\alpha\}. \tag{4} $$

Since $\hat{V}_\alpha$ is computed based on available data, and $X$ is observable, it follows that $X'$ itself is observable. Denote the corresponding sample by $\mathcal{X}'_n = (X'_1,\dots,X'_n)$, where we set $X'_i := X_i\,I\{X_i \geq \hat{V}_\alpha\}$. The most direct approach to this problem is to simply pass this transformed dataset to a sufficiently robust sub-routine for mean estimation. More precisely, we desire a sub-routine $\hat{m}$ by which, assuming only $\mathbb{E}_{\mu}(X')^2 < \infty$, for any choice of $\delta \in (0,1)$, we can guarantee

$$ \mathbb{P}\left\{ \left| \hat{m}[\mathcal{X}'_n] - \mathbb{E}_{\mu} X' \right| > c\,\sigma' \sqrt{\frac{1 + \log(\delta^{-1})}{n}} \right\} \leq \delta, \tag{5} $$

where $c$ is a constant depending only on the nature of $\hat{m}$, $\sigma'$ is any quantity bounded as $\sqrt{\mathrm{var}_{\mu}(X')} \leq \sigma' < \infty$, and probability is taken with respect to the random draw of $\mathcal{X}_n$. The final estimator of interest, then, using $2n$ observations in total, will simply be defined as

$$ \hat{C}_\alpha := \frac{1}{\alpha}\,\hat{C}'_\alpha[\mathcal{X}_n, \mathcal{Y}_n], \quad \text{where} \quad \hat{C}'_\alpha[\mathcal{X}_n, \mathcal{Y}_n] := \hat{m}[\mathcal{X}'_n]. \tag{6} $$

This general procedure is summarized in Algorithm 1.
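A minimal sketch of this procedure (our rendering, with a simple median-of-means routine standing in for the generic sub-routine, and the VaR estimate taken as the order statistic studied in Lemma 2 below):

```python
import numpy as np

def median_of_means(u, k=5):
    """Median of k block means: one valid choice of the robust sub-routine."""
    blocks = np.array_split(np.asarray(u, dtype=float), k)
    return float(np.median([b.mean() for b in blocks]))

def robust_cvar(x, y, alpha):
    """Algorithm 1 in miniature: estimate VaR from the second half via an
    order statistic, threshold the first half, robust mean, rescale by 1/alpha."""
    y = np.sort(np.asarray(y, dtype=float))
    v_hat = y[int(np.ceil((1.0 - alpha) * len(y))) - 1]
    x = np.asarray(x, dtype=float)
    return median_of_means(x * (x >= v_hat)) / alpha

rng = np.random.default_rng(0)
losses = rng.pareto(3.0, size=2000) + 1.0   # heavy-tailed, non-negative losses
est = robust_cvar(losses[:1000], losses[1000:], alpha=0.05)
print(round(est, 2))
```

The key design point is the sample split: the thresholding step and the mean-estimation step never touch the same observations, which is what allows the conditioning argument above.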

Before proceeding any further, the first question to answer is whether or not such a procedure can be constructed. Fortunately, since $\mathcal{X}_n$ and $\mathcal{Y}_n$ are independent, there are computationally efficient procedures which satisfy the key requirement (5). For concreteness, some well-known and useful examples of $\hat{m}$ applied to arbitrary real values $u_1,\dots,u_n$ are as follows:

$$ \hat{u}_{\text{MoM}} = \mathrm{med}\{\bar{u}_{(1)},\dots,\bar{u}_{(k)}\} \tag{7} $$
$$ \hat{u}_{\text{Cat}} = \arg\min_{v \in \mathbb{R}} \sum_{i=1}^{n} \rho\!\left(\frac{u_i - v}{s}\right) \tag{8} $$
$$ \hat{u}_{\text{LM}} = \frac{1}{n}\sum_{i=1}^{n} u_i\, I\{a \leq u_i \leq b\} \tag{9} $$
$$ \hat{u}_{\text{Hol}} = \frac{s}{n}\sum_{i=1}^{n} \psi\!\left(\frac{u_i}{s}\right) \tag{10} $$

The subscript MoM refers to the classical median-of-means, whereby the set of points is partitioned into $k$ disjoint subsets, with $\bar{u}_{(j)}$ referring to the arithmetic mean computed on the $j$th subset [23, 18]. The estimator marked Cat refers to any M-estimator such that the convex function $\rho$ is differentiable and satisfies the key conditions put forward by Catoni [9], with $s > 0$ being a scaling parameter. The estimator marked LM refers to the truncated mean estimator studied by Lugosi and Mendelson [25, Sec. 2], where the levels $a$ and $b$ are set using quantiles and a sample-splitting procedure. Finally, the estimator marked Hol is the soft truncation estimator studied by Holland [16, Sec. 3], where $s > 0$ is a scaling parameter and $\psi$ is a particular sigmoid function. In the following lemma, we summarize the robust mean estimation performance guarantees available for these estimators.

###### Lemma 1 (Procedures for good $\mathcal{X}_n$ event).

The implementations of $\hat{m}$ given in equations (7)–(10) satisfy (5) at confidence level $1-\delta$, as follows.

• MoM: with a universal constant $c$, and the number of partitions $k$ set in accordance with $\delta$, within the admissible range given in [13, Sec. 4.1] and [18].

• Cat: with constant $c$ and scale $s$ set as prescribed by Catoni [9], whenever $n$ is sufficiently large relative to $\log(\delta^{-1})$.

• LM: with a universal constant $c$ and truncation levels $a$, $b$ set by the sample-splitting procedure of [25], whenever $n$ is sufficiently large relative to $\log(\delta^{-1})$.

• Hol: with constant $c$ and scale $s$ set using $\sigma'$ and $\log(\delta^{-1})$, per [16].

###### Proof of Lemma 1.

All of these estimators require finite second moments, which trivially holds since $\mathbb{E}_{\mu}(X')^2 \leq \mathbb{E}_{\mu} X^2 < \infty$ by our assumptions on $X$. For the median-of-means estimator MoM, see Devroye et al. [13, Sec. 4.1] or Hsu and Sabato [18] for a proof. For the Catoni-type estimator Cat, see Catoni [9, Prop. 2.4] for a proof and characteristics of $c$ and $s$. For the truncated mean estimator LM, see the discussion and proofs from Lugosi and Mendelson [25, Thm. 1] and Lugosi and Mendelson [24, Thm. 6] for settings of $a$ and $b$. For the soft truncation estimator Hol, see Holland [16, Prop. 4] for a proof and required properties of $s$ and $\psi$. ∎
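As one concrete instance, the soft truncation estimator (10) is only a few lines. Here $\psi$ is an illustrative bounded, odd, near-identity-at-zero function of our choosing, not the particular sigmoid of [16], so this is a sketch of the mechanism rather than the exact estimator:

```python
import math

def soft_truncated_mean(u, s):
    """Soft truncation in the spirit of (10): (s/n) * sum psi(u_i / s).

    psi is odd, bounded, and close to the identity near zero, so typical
    values pass through almost untouched while outliers are clipped to O(s).
    """
    psi = lambda x: 2.0 * math.atan(x / 2.0)  # illustrative stand-in for psi
    return (s / len(u)) * sum(psi(ui / s) for ui in u)

data = [1.0] * 9 + [1000.0]                   # one gross outlier
est = soft_truncated_mean(data, s=5.0)
print(round(est, 2), sum(data) / len(data))   # ~2.46 vs sample mean 100.9
```

A single outlier drags the sample mean to 100.9, while the soft-truncated estimate stays near the bulk of the data; the scale $s$ trades bias for outlier resistance.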

The preceding lemma settles any issues regarding the availability of a sufficiently accurate sub-routine under potentially heavy-tailed data. The key problem that remains is the fact that $X'$ depends on $\hat{V}_\alpha$, and thus on the second sample $\mathcal{Y}_n$. To remove this dependence, the following lemma will be useful (proof given in the Appendix).

###### Lemma 2 (Good $\mathcal{Y}_n$ event).

Let the observations $\mathcal{Y}_n$ sorted in increasing order be denoted $Y^*_1 \leq Y^*_2 \leq \cdots \leq Y^*_n$. It follows that, with high probability over the draw of $\mathcal{Y}_n$ (the precise level is given in the Appendix proof), we have

$$ V_{2\alpha} \leq Y^*_{\lceil (1-\alpha)n \rceil} \leq V_{\alpha/2}. $$

Using the preceding lemma and setting $\hat{V}_\alpha := Y^*_{\lceil (1-\alpha)n \rceil}$, we have

$$ \begin{aligned} \mathrm{var}_{\mu}\, X' = \mathrm{var}_{\mu}\, X I\{X \geq \hat{V}_\alpha\} &= \mathbb{E}_{\mu} X^2 I\{X \geq \hat{V}_\alpha\} - \left(\mathbb{E}_{\mu} X I\{X \geq \hat{V}_\alpha\}\right)^2 \\ &\leq \sigma_\alpha^2 := \mathbb{E}_{\mu} X^2 I\{X \geq V_{2\alpha}\} - \left(\mathbb{E}_{\mu} X I\{X \geq V_{\alpha/2}\}\right)^2. \end{aligned} \tag{11} $$

As such, conditioning on $\mathcal{Y}_n$ and assuming that the good event of Lemma 2 holds, then using the variance bound (11) and Lemma 1 for $\hat{C}'_\alpha$ given by (6), writing $\varepsilon(n,\delta) := \sqrt{(1+\log(\delta^{-1}))/n}$ for readability, it follows that

$$ \mathbb{P}\left\{ |\hat{C}'_\alpha - \mathbb{E}_{\mu} X'| > c\,\sigma_\alpha\,\varepsilon(n,\delta) \right\} \leq \mathbb{P}\left\{ |\hat{C}'_\alpha - \mathbb{E}_{\mu} X'| > c\,\sigma'\,\varepsilon(n,\delta) \right\} \leq \delta,
$$

assuming that we use any of the first three methods listed in Lemma 1, since $\sigma' = \sqrt{\mathrm{var}_{\mu}\,X'} \leq \sigma_\alpha$. Otherwise, setting $\sigma' = \sigma_\alpha$ will suffice. The bound (11) is useful since this gives us an upper bound which does not depend on the sample $\mathcal{Y}_n$. Stated more precisely, over the random draw of $\mathcal{X}_n$, we have

$$ \left| \alpha\hat{C}_\alpha - \mathbb{E}_{\mu} X I\{X \geq \hat{V}_\alpha\} \right| = |\hat{C}'_\alpha - \mathbb{E}_{\mu} X'| \leq c\,\sigma_\alpha \sqrt{\frac{1 + \log(\delta^{-1})}{n}} \tag{12} $$

with probability no less than $1-\delta$.

Next, we consider the right-most summand in (3). This amounts to the error that must be incurred for not knowing $V_\alpha$ exactly. To control this term, first observe that

$$ \begin{aligned} \mathbb{E}_{\mu} X \left( I\{X \geq V_\alpha\} - I\{X \geq \hat{V}_\alpha\} \right) &\leq \mathbb{E}_{\mu}\, \hat{V}_\alpha \left( I\{X \geq V_\alpha\} - I\{X \geq \hat{V}_\alpha\} \right) \\ &\leq V_{\alpha/2} \left( \mu\{X \geq V_\alpha\} - \mu\{X \geq \hat{V}_\alpha\} \right) \\ &= V_{\alpha/2} \left( F_\mu(\hat{V}_\alpha) - F_\mu(V_\alpha) \right) \\ &\leq V_{\alpha/2}\, L \left( \hat{V}_\alpha - V_\alpha \right). \end{aligned} $$

The first inequality is immediate from the events attached to the two indicators being subtracted. The second inequality uses the good event of Lemma 2. The final inequality uses the local $L$-Lipschitz property via (A1). The problem has thus been reduced to obtaining two-sided bounds on the deviations $\hat{V}_\alpha - V_\alpha$, which can be done easily using standard concentration properties of the empirical distribution function, as follows. Based on sample $\mathcal{Y}_n$, denote the empirical distribution function by $\hat{F}_n(u) := (1/n)\sum_{i=1}^{n} I\{Y_i \leq u\}$, for $u \in \mathbb{R}$. Considering the running assumption that $F_\mu$ is continuous, note that for any error level $\varepsilon > 0$, if the deviations are $\hat{V}_\alpha - V_\alpha > \varepsilon$, then we must have $\hat{F}_n(V_\alpha + \varepsilon) \leq 1 - \alpha = F_\mu(V_\alpha)$. It then follows that

$$ \begin{aligned} \mathbb{P}\{\hat{V}_\alpha - V_\alpha > \varepsilon\} &\leq \mathbb{P}\{\hat{F}_n(V_\alpha + \varepsilon) \leq F_\mu(V_\alpha)\} \\ &= \mathbb{P}\{F_\mu(V_\alpha + \varepsilon) - F_\mu(V_\alpha) \leq F_\mu(V_\alpha + \varepsilon) - \hat{F}_n(V_\alpha + \varepsilon)\} \\ &\leq \mathbb{P}\{F_\mu(V_\alpha + \varepsilon) - F_\mu(V_\alpha) \leq \sup_{u \in \mathbb{R}}\, [F_\mu(u) - \hat{F}_n(u)]\} \\ &\leq \exp\left(-2n\left(F_\mu(V_\alpha + \varepsilon) - F_\mu(V_\alpha)\right)^2\right) \\ &\leq \exp\left(-2n(\gamma\varepsilon)^2\right). \end{aligned} $$

The first three lines are immediate from the facts just stated. The exponential tail bound is the refined version of the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality, which holds even if $F_\mu$ has at most a countably infinite number of discontinuities [21, Thm. 11.6]. The final inequality is due to the $\gamma$-growth assumption. For lower bounds, note that if $V_\alpha - \hat{V}_\alpha > \varepsilon$, we must have $\hat{F}_n(V_\alpha - \varepsilon) \geq F_\mu(V_\alpha)$, and a perfectly symmetric argument yields identical bounds on the probability of this event. Taking a union bound over these two events, it follows that with probability no less than $1 - 2\exp(-2n(\gamma\varepsilon)^2)$, we have

$$ \left| \mathbb{E}_{\mu} X \left( I\{X \geq V_\alpha\} - I\{X \geq \hat{V}_\alpha\} \right) \right| \leq V_{\alpha/2}\, L\, |\hat{V}_\alpha - V_\alpha| \leq V_{\alpha/2}\, L\, \varepsilon,
$$

for any $0 < \varepsilon \leq \varepsilon_0$. Converting this into a high-probability confidence interval, we have

$$ \left| \mathbb{E}_{\mu} X \left( I\{X \geq V_\alpha\} - I\{X \geq \hat{V}_\alpha\} \right) \right| \leq \frac{V_{\alpha/2}\, L}{\sqrt{2}\,\gamma} \sqrt{\frac{\log(\delta^{-1})}{n}} \tag{13} $$

with probability no less than $1 - 2\delta$, assuming that $n$ is large enough that this deviation level does not exceed $\varepsilon_0$. Taking (12) and (13) together, applied to (3), we have essentially proved the following result.

###### Theorem 3.

For any confidence level $\delta \in (0,1)$ and risk level $\alpha \in (0,1)$, assume that (A1) holds and $\mathbb{E}_{\mu} X^2 < \infty$. Letting $\hat{C}_\alpha$ be the output of Algorithm 1, with $\hat{V}_\alpha = Y^*_{\lceil (1-\alpha)n \rceil}$, with probability no less than $1 - \delta'$, where $\delta'$ is a constant multiple of $\delta$ determined by the union bound in the proof, we have

$$ \left| \hat{C}_\alpha - C_\alpha \right| \leq \frac{1}{\alpha}\left( c\,\sigma_\alpha + \frac{V_{\alpha/2}\, L}{\sqrt{2}\,\gamma} \right) \sqrt{\frac{1 + \log(\delta^{-1})}{n}},
$$

where $c$ depends only on the choice of $\hat{m}$ (specified in Lemma 1).

###### Proof of Theorem 3.

Proving this result simply involves sorting out the key facts presented above. The “good” event in the theorem statement is that in which both (12) and (13) hold together. This condition can fail if even one of the following bad events takes place:

$$ \mathcal{E}_1 := \{\text{inequality (12) fails}\}, \qquad \mathcal{E}_2 := \{\text{the event of Lemma 2 fails}\}, \qquad \mathcal{E}_3 := \left\{ |\hat{V}_\alpha - V_\alpha| > \sqrt{\frac{\log(\delta^{-1})}{2\gamma^2 n}} \right\}. $$

First of all, using Lemma 1 and the deviation bounds given by (5), we have

$$ \mathbb{P}(\mathcal{E}_1) = \mathbb{E}_{\mathcal{Y}_n}\,\mathbb{P}[\mathcal{E}_1 \mid \mathcal{Y}_n] \leq \delta. $$

Next, by Lemma 2, if $\mathcal{E}_2$ does not occur, then the variance bound (11) is valid. Finally, by the two-sided DKW inequality, whenever $\mathcal{E}_3$ does not occur, we have (13). If none of these three bad events take place, the good event holds, i.e., both (12) and (13) are valid. A union bound implies that this holds with probability no less than $1 - (\mathbb{P}(\mathcal{E}_1) + \mathbb{P}(\mathcal{E}_2) + \mathbb{P}(\mathcal{E}_3))$, and via the original decomposition (3), we have

$$ \left| \hat{C}_\alpha - C_\alpha \right| \leq \frac{1}{\alpha}\left( c\,\sigma_\alpha \sqrt{\frac{1 + \log(\delta^{-1})}{n}} + \frac{V_{\alpha/2}\, L}{\sqrt{2}\,\gamma} \sqrt{\frac{\log(\delta^{-1})}{n}} \right),
$$

which implies the desired result. ∎

#### 2.2.1 Comparison of estimation error bounds

From the technical literature on CVaR estimation under potentially heavy-tailed data, the work of Kolla et al. [20], Prashanth et al. [30], and Kagrecha et al. [19] is most closely related to ours, and in this remark we compare our results with theirs. To align our setup with theirs, we assume access to only $n$ data points in total, meaning the two data sets used in Theorem 3 will now be $\mathcal{X}_{n/2}$ and $\mathcal{Y}_{n/2}$, for simplicity assuming that $n$ is even. Furthermore, we convert our high-confidence interval into an exponential tail bound, which is the form taken by the main results in the cited works. First, given just $n$ observations, our Theorem 3 implies that

$$ \mathbb{P}\left\{ \left| \hat{C}_\alpha - C_\alpha \right| > \varepsilon \right\} \leq \exp\left( 1 - n\left(\frac{\alpha\varepsilon}{B_{\text{ours}}}\right)^2 \right), \qquad B_{\text{ours}} := c\,\sigma_\alpha + \frac{\sqrt{2}\,V_{\alpha/2}\,L}{\gamma},
$$

up to the modest constant factor in front coming from the union bound in Theorem 3.

The estimator considered by Prashanth et al. [30, Thm. 4.1], on the other hand, yields bounds of the form

$$ \mathbb{P}\left\{ \left| \hat{C}_\alpha - C_\alpha \right| > \varepsilon \right\} \leq 8\exp\left( -n\left(\alpha\varepsilon/B'\right)^2 \right),
$$

where the factor $B'$ is simply left as a “distribution-dependent factor.” Looking at their proof, in order to obtain concentration of the VaR estimator, they also effectively require a $\gamma$-growth property and have moment dependence. Furthermore, their proof is rather specialized to an estimator borrowed from Bubeck et al. [7], which does random truncation that is rather unintuitive when taken outside the context of online learning problems. Another closely related result published very recently is due to Kagrecha et al. [19]. They consider a more natural estimator, which simply truncates the data to a bounded interval before passing it to the classical empirical CVaR estimator routine. While the truncation level is a user-specified parameter, it must be taken larger than a value which depends on the desired deviation level $\varepsilon$. In particular, since the required truncation level grows as $\varepsilon$ shrinks, when $\varepsilon$ is sufficiently small, one ends up with bounds of the form

$$ \mathbb{P}\left\{ \left| \hat{C}_\alpha - C_\alpha \right| > \varepsilon \right\} \leq 6\exp\left( -\frac{n\,\alpha^3\varepsilon^4}{B''} \right), \qquad B'' := 616\left( \mathbb{E}_{\mu} X^2 \right)^2.
$$

Their results are obtained using very weak assumptions; the finiteness of $\mathbb{E}_{\mu} X^2$ is all that is required. The price paid for this generality is clearly the poor dependence on $\alpha$, $\varepsilon$, and the moments. In contrast, under mild additional assumptions on the behaviour of the distribution function around $V_\alpha$ (namely (A1)), we obtain much stronger results, using a very simple proof strategy, which can be readily applied to a wide collection of estimation routines.

### 2.3 CVaR-driven learning algorithms

We now proceed to our main point of interest, namely learning algorithms which seek to minimize the CVaR of the loss distribution, defined in (2), given only a sample $Z_1,\dots,Z_n$ of independent copies of $Z$. Computationally, it is convenient to introduce

$$ f_\alpha(w, v; Z) := v + \frac{1}{\alpha}\left[ \ell(w;Z) - v \right]_+, \qquad w \in \mathcal{W},\ v \in \mathbb{R}, \tag{14} $$

with expected value denoted by $F_\alpha(w,v) := \mathbb{E}_{\mu}\, f_\alpha(w,v;Z)$, not to be confused with the distribution function $F_\mu$ from the previous section. This expectation has the useful property of being convex and continuously differentiable in $v$, and being related to the quantities $C_\alpha(w)$ and $V_\alpha(w)$ through

$$ \min\{F_\alpha(w,v) : v \in \mathbb{R}\} = F_\alpha(w, V_\alpha(w)) = C_\alpha(w),
$$

which holds for any choice of $w \in \mathcal{W}$ [32, Thm. 1]. This implies that if we have a candidate pair $(\hat{w},\hat{v})$ such that $F_\alpha(\hat{w},\hat{v})$ is nearly minimal, then $C_\alpha(\hat{w}) \leq F_\alpha(\hat{w},\hat{v})$ is nearly minimal as well. Furthermore, solving the joint problem is equivalent to solving the two problems separately [32, Thm. 2], meaning that $F^*_\alpha = C^*_\alpha$, where we denote $F^*_\alpha := \min_{(w,v) \in \mathcal{W} \times \mathbb{R}} F_\alpha(w,v)$ and $C^*_\alpha := \min_{w \in \mathcal{W}} C_\alpha(w)$. When $\ell(\cdot;z)$ is convex in $w$, the function $f_\alpha$ is jointly convex in $(w,v)$, and thus when $\mathcal{W}$ is a convex set, convex optimization techniques can in principle be brought to bear on the problem. Of course in practice, this is a learning problem and the underlying distribution $\mu$ is never known (this is also known as a stochastic convex optimization problem, and there is a rich literature on the subject; see the references given by Rockafellar and Uryasev [32, Sec. 2]). The traditional machine learning approach to this is empirical risk minimization, namely returning any

$$ (\hat{w}_{\text{ERM}}, \hat{v}_{\text{ERM}}) \in \arg\min_{(w,v) \in \mathcal{W} \times \mathbb{R}} \frac{1}{n}\sum_{i=1}^{n} f_\alpha(w, v; Z_i). \tag{15} $$

While the objective function is not differentiable everywhere, sub-gradients can be readily computed, and descent methods using sub-gradients can be applied to implement this optimization [32, Sec. 4]. On the statistical side, however, under potentially heavy-tailed losses, only highly sub-optimal performance guarantees can be given in general for $(\hat{w}_{\text{ERM}}, \hat{v}_{\text{ERM}})$ [6], which motivates the need for providing the learner with “better feedback.”
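As a quick numerical sanity check (ours, not from the paper), the identity $\min_v F_\alpha(w,\cdot) = C_\alpha(w)$ from [32] can be verified on the empirical distribution of a fixed sample, with the sample average standing in for the expectation:

```python
import numpy as np

def f_alpha(v, losses, alpha):
    """Empirical F_alpha(w, v) at a fixed w: v + mean([loss - v]_+) / alpha."""
    return v + np.maximum(losses - v, 0.0).mean() / alpha

losses = np.arange(1.0, 101.0)           # stand-in for {loss(w; Z_i)} at fixed w
alpha = 0.05
grid = np.linspace(0.0, 120.0, 12001)    # candidate values of v
vals = np.array([f_alpha(v, losses, alpha) for v in grid])
v_star, f_star = grid[int(np.argmin(vals))], float(vals.min())
# v_star sits in the empirical (1 - alpha)-quantile region, and f_star
# recovers the empirical CVaR of the sample.
print(v_star, f_star)
```

For this sample the minimum value is 98, attained for any $v$ between 95 and 96, illustrating that the surrogate's minimizer over $v$ recovers the quantile and the minimum recovers the CVaR.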

##### Problems with robust objectives

Recalling the analysis of the previous section 2.2, we constructed a procedure for obtaining sharp estimates of $C_\alpha(w)$, pointwise in $w \in \mathcal{W}$, under potentially heavy-tailed data. To extend the procedure given by Algorithm 1 and defined in (6) to this setting, one could naturally split the sample $\mathcal{Z}_n = (Z_1,\dots,Z_n)$, compute

$$ \hat{C}'_\alpha(w; \mathcal{Z}_n) := \hat{C}'_\alpha\left[ \mathcal{X} = \{\ell(w;Z_i) : i \in [\lfloor n/2 \rfloor]\},\ \mathcal{Y} = \{\ell(w;Z_i) : \lfloor n/2 \rfloor < i \leq n\} \right], \tag{16} $$

and set $\hat{C}_\alpha(w) := \hat{C}'_\alpha(w;\mathcal{Z}_n)/\alpha$. For any candidate $w$, the approximation $\hat{C}_\alpha(w) \approx C_\alpha(w)$ is accurate with high confidence, as formalized in Theorem 3. This can naturally be interpreted as feedback to the learner which is “robust” to potentially heavy-tailed data. The most naive approach to this problem would be to replace the empirical mean with this robust estimator (16), namely any algorithm implementing

$$ \hat{w} \in \arg\min_{w \in \mathcal{W}}\ \hat{C}'_\alpha(w; \mathcal{Z}_n)/\alpha.
$$

The statistical properties of such a $\hat{w}$ are naturally of interest, but the computational task of actually obtaining such a $\hat{w}$ is highly non-trivial; for example, the work of Brownlees et al. [6] considers a similar quantity in the case of traditional risk minimization, but algorithmic considerations are left completely abstract. Indeed, even if $\ell(\cdot;z)$ is convex for all $z$, we have no guarantee that $w \mapsto \hat{C}'_\alpha(w;\mathcal{Z}_n)$ will be. The exact same issues hold if we tackle a robustified version of the joint optimization task, namely

$$ (\hat{w}, \hat{v}) \in \arg\min_{(w,v) \in \mathcal{W} \times \mathbb{R}} \hat{m}\left[ \{ f_\alpha(w, v; Z_i) : i \in [n] \} \right],
$$

where $\hat{m}$ is based on any procedure given in Lemma 1. All the robust estimates given by $\hat{m}$ (or Algorithm 1) are easy to compute for any $w$ or $(w,v)$, but are hard to minimize. It thus seems wiser to use such sub-routines for validation, i.e., to check whether a particular candidate actually gets close to minimizing $C_\alpha$ with sufficiently high confidence.

##### A more practical approach

With this intuition in mind, we present a procedure which utilizes the insights of section 2.2 to obtain strong statistical guarantees, without sacrificing computational efficiency. In words, we consider a simple divide-and-conquer procedure with independent sub-processes running stochastic gradient descent for the joint optimization of $F_\alpha(w,v)$, and a final robust validation step to determine a final candidate. This is summarized in Algorithm 2, and we unpack the notation below.

Most of the steps in Algorithm 2 are transparent; in the core validation step, we pass the sub-routine defined in (16) its own independent sample $\mathcal{Z}'_n$. It just remains to provide a more precise definition of the sequence $(\hat{w}_t, \hat{v}_t)$ referred to in the third line. Given a sequence of observations $Z_1, Z_2, \dots$ of arbitrary length, the core update is traditional projected stochastic sub-gradient descent:

$$ (\hat{w}_t, \hat{v}_t) = \pi_{\mathcal{W} \times [0, \bar{V}_\alpha]}\left[ (\hat{w}_{t-1}, \hat{v}_{t-1}) - \beta_t\, G_\alpha(\hat{w}_{t-1}, \hat{v}_{t-1}; Z_t) \right]. \tag{17} $$

The update direction here is $G_\alpha(w, v; Z) \in \partial f_\alpha(w, v; Z)$, namely any vector from the sub-differential of the map $(w,v) \mapsto f_\alpha(w, v; Z)$. The operator $\pi_{\mathcal{W} \times [0, \bar{V}_\alpha]}$ denotes projection in the $\ell_2$ norm, and $\beta_t > 0$ is a step-size parameter. The recursive definition in (17) bottoms out at $t = 1$, and is initialized by some pre-defined $(\hat{w}_0, \hat{v}_0)$, passed to the algorithm as an input. The sequence referred to in Algorithm 2 is simply the sequence of iterates generated by (17); since all $Z_t$ are independent copies of $Z$, the order does not matter. The key technical assumptions on the data are summarized below:

• (A2) Let (A1) hold for the distribution of $\ell(w;Z)$, for any choice of $w \in \mathcal{W}$. Let $\mathcal{W}$ be convex, with a diameter in $\ell_2$ norm of $\Delta < \infty$. Writing $\bar{V}_\alpha := \sup_{w \in \mathcal{W}} V_\alpha(w)$ (and analogously for other levels), let $\bar{V}_{\alpha/2} < \infty$ and $\bar{\sigma}_\alpha := \sup_{w \in \mathcal{W}} \sigma_\alpha(w) < \infty$. Let $\ell(\cdot;z)$ be a convex, $\lambda$-Lipschitz continuous function of $w$, for all $z \in \mathcal{Z}$.

The preceding assumptions clearly allow for potentially heavy-tailed losses. Note that (A2) extends (A1) from section 2.2 to the case of losses indexed by $w \in \mathcal{W}$. Under this setting, the following performance guarantee holds.

###### Theorem 4.

Under assumption (A2), run Algorithm 2 with the number of sub-processes $k$ and the validation step set as in Lemma 8, for an arbitrary choice of $\delta \in (0,1)$, and fix the step sizes in (17) to

$$ \beta_t = \alpha \sqrt{\frac{\Delta^2 + \bar{V}_\alpha^2}{(\lambda^2 + (1-\alpha)^2)\,|\mathcal{I}_j|}}
$$

for each sub-process, indexed by $j \in [k]$. We have

$$ C_\alpha(\bar{w}^{(\star)}) - C^*_\alpha \leq \frac{2\sqrt{2}}{\alpha}\left( c\,\bar{\sigma}_\alpha + \frac{\bar{V}_{\alpha/2}\,L}{\sqrt{2}\,\gamma} \right) \sqrt{\frac{1 + \log(5\delta^{-1})}{n}} + \frac{\tilde{c}}{\alpha} \sqrt{\frac{k\,(\lambda^2 + (1-\alpha)^2)\,(\Delta^2 + \bar{V}_\alpha^2)}{n}} \tag{18} $$

with probability no less than $1 - \delta$, where the constant $c$ corresponds to those in Lemma 1, and $\tilde{c}$ is the constant appearing in Lemma 8.

###### Remark 5 (Discussion of related technical work).

As far as technical conditions go, the convexity, bounded diameter, and Lipschitz assumptions align with Soma and Yoshida [35, Thm. 3.6]. They run a single averaged SGD process using a surrogate objective, for multiple passes over the data; they assume bounded losses and Lipschitz-continuous gradients, yielding error bounds in expectation. In contrast, we do not require Lipschitz gradients, the losses can be unbounded (and potentially heavy-tailed of course), and we run multiple SGD processes in parallel, each of which takes only a single pass over the subset of data allocated to it. Finally, we remark that since their procedure does not actually make any direct estimates of $C_\alpha$, they do not use an assumption like (A1). Note that it is certainly possible to modify our Algorithm 2 such that this assumption is not needed, by doing the final validation step based on an estimate of $F_\alpha$ instead of $C_\alpha$. This would remove the need for (A1), and instead result in bounds depending on the second moment of $f_\alpha(w,v;Z)$. The formal analysis goes through in a perfectly analogous fashion to our proof of Theorem 4 here. We leave empirical analysis of such an alternative procedure to future work.

Proving the preceding theorem just requires combining a few basic techniques and structural results. To open up the argument, note that for any choice of $w \in \mathcal{W}$ and $v \in \mathbb{R}$, we can control the excess CVaR as

$$ C_\alpha(w) - C^*_\alpha = C_\alpha(w) - F^*_\alpha \leq F_\alpha(w, v) - F^*_\alpha. \tag{19} $$

The equality and inequality follow respectively from Theorems 2 and 1 of Rockafellar and Uryasev [32]. Working on the right-hand side of this inequality, we can focus on (approximate) minimization of the function $F_\alpha$. While in principle this can be done in very sophisticated ways, for clarity of exposition, we adapt a well-known result for averaged stochastic gradient descent to the objective of interest here.

###### Lemma 6 (Convex, Lipschitz case; averaged SGD).

If the function $(w,v) \mapsto f_\alpha(w,v;z)$ is convex and $\lambda_\alpha$-Lipschitz for all $z$, consider running (17) for $m$ iterations, with fixed step size $\beta_t = \sqrt{(\Delta^2 + \bar{V}_\alpha^2)/m}\,/\,\lambda_\alpha$. Then, averaging the iterates as

$$ (\hat{w}_{[m]}, \hat{v}_{[m]}) := \frac{1}{m}\sum_{t=1}^{m} (\hat{w}_{t-1}, \hat{v}_{t-1}),
$$

it follows that, in expectation over the data, we have

$$ \mathbb{E}\left[ F_\alpha(\hat{w}_{[m]}, \hat{v}_{[m]}) - F^*_\alpha \right] \leq \lambda_\alpha \sqrt{\frac{\Delta^2 + \bar{V}_\alpha^2}{m}}.
$$

In order to utilize the preceding lemma, we simply need to confirm the required properties of $f_\alpha$, which we summarize in the following lemma.

###### Lemma 7.

Let $\mathcal{W}$ be a convex set, and let the map $w \mapsto \ell(w;z)$ defined on $\mathcal{W}$ be convex and $\lambda$-Lipschitz, for all values of $z \in \mathcal{Z}$. Then for any $0 < \alpha < 1$, writing

$$ \lambda_\alpha := \max\left\{ 1,\ \frac{\sqrt{\lambda^2 + (1-\alpha)^2}}{\alpha} \right\},
$$

we have that for all $z$, the map $(w,v) \mapsto f_\alpha(w,v;z)$ defined on $\mathcal{W} \times \mathbb{R}$ is convex and $\lambda_\alpha$-Lipschitz.
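The sub-gradients behind Lemma 7 are explicit: the $w$-component is a loss sub-gradient scaled by $1/\alpha$ when the loss exceeds $v$ (else zero), and the $v$-component is $1 - 1/\alpha$ or $1$ accordingly. A toy sketch of one update (17) for a scalar squared loss (our example; the bounds and names are illustrative, not from the paper):

```python
import numpy as np

def f_alpha_subgrad(w, v, z, alpha, loss, loss_grad):
    """A sub-gradient of (w, v) -> f_alpha(w, v; z) = v + [loss(w;z) - v]_+ / alpha."""
    if loss(w, z) > v:
        return loss_grad(w, z) / alpha, 1.0 - 1.0 / alpha
    return 0.0, 1.0

def sgd_cvar_step(w, v, z, alpha, beta, w_max, v_max, loss, loss_grad):
    """One projected step (17), with W = [-w_max, w_max] and v in [0, v_max]."""
    g_w, g_v = f_alpha_subgrad(w, v, z, alpha, loss, loss_grad)
    w, v = w - beta * g_w, v - beta * g_v
    return min(max(w, -w_max), w_max), min(max(v, 0.0), v_max)

# Toy run: scalar squared loss on Gaussian data.
loss = lambda w, z: (w - z) ** 2
loss_grad = lambda w, z: 2.0 * (w - z)
rng = np.random.default_rng(1)
w, v = 3.0, 0.0
for z in rng.normal(size=2000):
    w, v = sgd_cvar_step(w, v, z, alpha=0.1, beta=0.01,
                         w_max=5.0, v_max=50.0, loss=loss, loss_grad=loss_grad)
```

Note that $v$ drifts toward the level where the loss exceeds it an $\alpha$-fraction of the time (the VaR of the loss), since the expected $v$-component of the sub-gradient vanishes exactly there.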

Plugging the content of Lemma 7 into Lemma 6, we have that the sub-processes in Algorithm 2 satisfy

$$ \mathbb{E}\left[ F_\alpha(\bar{w}^{(j)}, \bar{v}^{(j)}) - F^*_\alpha \right] \leq \lambda_\alpha \sqrt{\frac{\Delta^2 + \bar{V}_\alpha^2}{\lfloor n/k \rfloor}}, \qquad j \in [k]. \tag{20} $$

Finally, we use the fact that robust validation of the form studied in section 2.2 lets us boost the confidence of the underlying SGD sub-processes [15, Lemma 2].

###### Lemma 8 (Boosting the confidence under potentially heavy tails).

Assume that we have an arbitrary learning algorithm $\mathcal{L}$ and a validation procedure $\mathcal{V}$ such that for sample size $n$, confidence level $\delta \in (0,1)$, and arbitrary $w \in \mathcal{W}$, given independent samples $\mathcal{Z}_n$ and $\mathcal{Z}'_n$, we have

$$ \mathbb{P}\left\{ C_\alpha(\mathcal{L}[\mathcal{Z}_n]) - C^*_\alpha > \frac{\varepsilon(n)}{\delta} \right\} \leq \delta, \qquad \mathbb{P}\left\{ \left| \mathcal{V}[w; \mathcal{Z}'_n] - C_\alpha(w) \right| > \varepsilon'(n, \delta) \right\} \leq \delta.
$$

Then, if we split the sample $\mathcal{Z}_n$ into $k$ disjoint subsets indexed by $\mathcal{I}_1, \dots, \mathcal{I}_k$, set $\hat{w}^{(j)} := \mathcal{L}[\mathcal{Z}_{\mathcal{I}_j}]$ for each $j \in [k]$, and $\hat{w}^{(\star)} := \arg\min_{j \in [k]} \mathcal{V}[\hat{w}^{(j)}; \mathcal{Z}'_n]$, then for an appropriate choice of $k$ depending on $\delta$, it follows that

$$ C_\alpha(\hat{w}^{(\star)}) - C^*_\alpha \leq 2\varepsilon'(n, \delta) + \tilde{c}\,\varepsilon(\lfloor n/k \rfloor)
$$

with probability no less than $1 - c'\delta$, where $\tilde{c}$ and $c'$ are modest constants arising from the boosting argument.
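The divide-and-conquer scheme of the preceding lemma can be sketched as follows (our stand-ins: a per-shard location estimate plays the role of the base learner, and a miniature version of Algorithm 1 scores each candidate's losses on held-out data):

```python
import numpy as np

def median_of_means(u, k=5):
    return float(np.median([b.mean() for b in np.array_split(np.asarray(u), k)]))

def robust_cvar_score(losses, alpha):
    """Validation score: Algorithm 1 applied to one candidate's loss values."""
    losses = np.asarray(losses, dtype=float)
    half = len(losses) // 2
    x, y = losses[:half], np.sort(losses[half:])
    v_hat = y[int(np.ceil((1.0 - alpha) * len(y))) - 1]
    return median_of_means(x * (x >= v_hat)) / alpha

# Divide: train one candidate per shard. Conquer: keep the candidate whose
# robust CVaR score on an independent validation sample is smallest.
rng = np.random.default_rng(2)
data = rng.standard_t(3, size=3000)            # heavy-tailed observations
train, val = data[:2000], data[2000:]
loss = lambda w, z: (w - z) ** 2
candidates = [shard.mean() for shard in np.array_split(train, 4)]
scores = [robust_cvar_score([loss(w, z) for z in val], 0.05) for w in candidates]
best = candidates[int(np.argmin(scores))]
print(round(best, 3))
```

The point of the construction is that only one shard's learner needs to succeed: the robust validation step identifies a good candidate with confidence that improves geometrically in the number of shards.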

With these facts in hand, it is straightforward to prove the desired theorem.

###### Proof of Theorem 4.

Using inequality (19) to connect $F_\alpha$ and $C_\alpha$, and Markov's inequality to convert the bounds in expectation for the sub-processes given by (20) into high-probability bounds, it immediately follows that the requirement on $\mathcal{L}$ in Lemma 8 is satisfied if we set $\mathcal{L}[\mathcal{Z}_{\mathcal{I}_j}] = \mathrm{Average}(\{(\hat{w}_t, \hat{v}_t) : t \in \mathcal{I}_j\})$, with $\varepsilon(\lfloor n/k \rfloor)$ corresponding to the right-hand side of the inequality (20), and Average simply denoting the arithmetic vector mean. As for the requirement on $\mathcal{V}$ in Lemma 8, this is satisfied by setting