The goal in statistical learning is to learn hypotheses that generalize well, which is typically formalized by seeking to minimize the expected risk associated with a given loss function. In general, a loss function is a map $\ell\colon \mathcal{H} \times \mathcal{X} \to \mathbb{R}_{\geq 0}$, where $\mathcal{X}$ is a feature space and $\mathcal{H}$ is a hypothesis space. In this case, the expected risk associated with a given hypothesis $h \in \mathcal{H}$ is given by $\mathbb{E}[\ell(h, X)]$. Since the data-generating distribution is typically unknown, the expected risk is approximated using observed i.i.d. samples $X_1, \dots, X_n$ of $X$, and a hypothesis is then chosen to minimize the empirical risk $\frac{1}{n}\sum_{i=1}^{n} \ell(h, X_i)$. When choosing a hypothesis $\hat{h}$ based on the empirical risk, one would like to know how close the empirical risk of $\hat{h}$ is to its actual risk; only then can one infer something about the generalization property of the learned hypothesis $\hat{h}$.
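As a small illustration of the gap between empirical and expected risk, the following sketch uses a toy setting that is not taken from the paper: the hypothesis is a single constant, the loss is the squared error, and the data are Gaussian, so the expected risk has a closed form. The names `empirical_risk` and `expected_risk` are hypothetical helpers introduced only for this example.

```python
import random

def empirical_risk(h, samples):
    """Average squared loss of the constant predictor h on the samples."""
    return sum((h - x) ** 2 for x in samples) / len(samples)

def expected_risk(h, mean, var):
    """Closed form of E[(h - X)^2] when X has the given mean and variance."""
    return (h - mean) ** 2 + var

random.seed(0)
mean, std = 1.0, 2.0  # assumed data-generating distribution (toy choice)
samples = [random.gauss(mean, std) for _ in range(5000)]

# For the squared loss, the empirical risk minimizer is the sample mean.
h_hat = sum(samples) / len(samples)

# Difference between the true risk of h_hat and its empirical risk.
gap = expected_risk(h_hat, mean, std ** 2) - empirical_risk(h_hat, samples)
print(f"empirical risk of h_hat: {empirical_risk(h_hat, samples):.3f}")
print(f"expected risk of h_hat:  {expected_risk(h_hat, mean, std ** 2):.3f}")
print(f"generalization gap:      {gap:.3f}")
```

A generalization bound controls the size of `gap` with high probability without access to the true distribution.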
Generalization bounds, in which the expected risk is bounded in terms of its empirical version up to some error, are at the heart of many machine learning problems. The main techniques leading to such bounds comprise uniform convergence arguments (often involving the Rademacher complexity of the set $\mathcal{H}$), algorithmic stability arguments (see e.g. (bousquet2002stability) and more recently the work from (DBLP:conf/alt/Abou-MoustafaS19; DBLP:journals/corr/abs-1910-07833; celisse2016stability)), and the PAC-Bayesian analysis for non-degenerate randomized estimators (mcallester2003). Behind these techniques lie concentration inequalities, such as Chernoff’s inequality (for the PAC-Bayesian analysis) and McDiarmid’s inequality (for algorithmic stability and the uniform convergence analysis), which control the deviation between population and empirical averages (see boucheron2003concentration; boucheron2013concentration; mcdiarmid1998concentration, among others).
Standard concentration inequalities are well suited for learning problems where the goal is to minimize the expected risk. However, the expected risk (the mean performance of an algorithm) might fail to capture the underlying phenomenon of interest. For example, when dealing with medical (responsiveness to a specific drug with grave side effects, etc.), environmental (such as pollution, exposure to toxic compounds, etc.), or sensitive engineering tasks (trajectory evaluation for autonomous vehicles, etc.), the mean performance is not necessarily the best objective to optimize, as it can mask potentially disastrous mistakes (e.g., a few extra centimeters when crossing another vehicle, a slightly too large dose of a lethal compound, etc.) while possibly improving on average. There is thus a growing interest in working with alternative measures of risk (other than the expectation) for which standard concentration inequalities do not apply directly. Of special interest are coherent risk measures (artzner1999coherent), which possess properties that make them desirable in mathematical finance and portfolio optimization (allais1953comportement; ellsberg1961risk; rockafellar2000optimization), with a focus on optimizing for the worst outcomes rather than on average. Coherent risk measures have also been recently connected to fairness, and appear as a promising framework to control the fairness of an algorithm’s solution (williamson2019fairness).
A popular coherent risk measure is the Conditional Value at Risk (CVaR; see Pflug2000); for $\alpha \in (0,1)$ and random variable $Z$, $\mathrm{C}_\alpha[Z]$ measures the expectation of $Z$ conditioned on the event that $Z$ is greater than its $(1-\alpha)$-th quantile. CVaR has been shown to underlie the classical SVM (takeda2008nu), and has in general attracted a large interest in machine learning over the past two decades (huo2017risk; bhat2019concentration; williamson2019fairness; Chen2009; Chow2014; prashanth2013actor; tamar2015optimizing; pinto2017robust; morimura2010nonparametric; Takeda2009).
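To make the tail-average reading of CVaR concrete, here is a minimal sketch under assumptions that are ours rather than the paper's: the random variable is uniform on $[0,1]$, for which the mean of the upper $\alpha$-tail is $1 - \alpha/2$, and `tail_average` is a hypothetical helper that averages the largest $\alpha$-fraction of a sample.

```python
import random

def tail_average(samples, alpha):
    """Mean of the largest alpha-fraction of the samples."""
    k = max(1, round(alpha * len(samples)))
    top = sorted(samples, reverse=True)[:k]
    return sum(top) / k

random.seed(0)
samples = [random.random() for _ in range(100_000)]  # Uniform(0, 1) draws

alpha = 0.1
estimate = tail_average(samples, alpha)
exact = 1 - alpha / 2            # mean of Uniform(1 - alpha, 1)
plain_mean = sum(samples) / len(samples)

print(f"plain mean:        {plain_mean:.4f}")
print(f"tail average:      {estimate:.4f}")
print(f"closed-form value: {exact:.4f}")
```

The plain mean sits near 0.5 while the tail average tracks the worst 10% of outcomes, which is the behavior that motivates CVaR as an objective.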
Various concentration inequalities have been derived for $\mathrm{C}_\alpha[Z]$, under different assumptions on $Z$, which bound the difference between $\mathrm{C}_\alpha[Z]$ and its standard estimator $\widehat{\mathrm{C}}_\alpha[Z]$ with high probability (brown2007; wang2010; prashanth2013actor; kolla2019concentration; bhat2019concentration). However, none of these works extend their results to the statistical learning setting where the goal is to learn a hypothesis from data to minimize the conditional value at risk. In this paper, we fill this gap by presenting a sharp PAC-Bayesian generalization bound when the objective is to minimize the conditional value at risk.
Deviation bounds for CVaR were first presented by brown2007. However, their approach only applies to bounded continuous random variables, and their lower deviation bound has a sub-optimal dependence on the level $\alpha$. wang2010 later refined their analysis to recover the “correct” dependence in $\alpha$, although their technique still requires a two-sided bound on the random variable $Z$. thomas2019 derived new concentration inequalities for CVaR with a very sharp empirical performance, even though the dependence on $\alpha$ in their bound is sub-optimal. Further, they only require a one-sided bound on $Z$, without a continuity assumption.
kolla2019concentration were the first to provide concentration bounds for CVaR when the random variable $Z$ is unbounded, but is either sub-Gaussian or sub-exponential. bhat2019concentration used a bound on the Wasserstein distance between true and empirical cumulative distribution functions to substantially tighten the bounds of kolla2019concentration when $Z$ has finite exponential or $k$th-order moments; they also apply their results to other coherent risk measures. However, when instantiated with bounded random variables, their concentration inequalities have sub-optimal dependence in $\alpha$.
On the statistical learning side, duchi2018learning present generalization bounds for a class of coherent risk measures that technically includes CVaR. However, their bounds are based on uniform convergence arguments which lead to looser bounds compared with ours.
Our main contribution is a PAC-Bayesian generalization bound for the conditional value at risk, where we bound the difference $\mathrm{C}_\alpha[Z] - \widehat{\mathrm{C}}_\alpha[Z]$, for $\alpha \in (0,1)$, by a square-root term in which the empirical value $\widehat{\mathrm{C}}_\alpha[Z]$ multiplies a complexity term which depends on $\mathcal{H}$. Due to the presence of $\widehat{\mathrm{C}}_\alpha[Z]$ inside the square-root, our generalization bound has the desirable property that it becomes small whenever the empirical conditional value at risk is small. For the standard expected risk, only state-of-the-art PAC-Bayesian bounds share this property (see e.g. langford2003pac; catoni2007pac; maurer2004note or more recently DBLP:conf/nips/TolstikhinS13; mhammedi2019pac). We refer to (guedj2019primer) for a recent survey on PAC-Bayes.
As a by-product of our analysis, we derive a new way of obtaining concentration bounds for the conditional value at risk by reducing the problem to estimating expectations using empirical means. This reduction then makes it easy to obtain concentration bounds for $\mathrm{C}_\alpha[Z]$ even when the random variable $Z$ is unbounded ($Z$ may be sub-Gaussian or sub-exponential). Our bounds have explicit constants and are sharper than the existing ones due to kolla2019concentration; bhat2019concentration, which deal with the unbounded case.
In Section 2, we define the conditional value at risk along with its standard estimator. In Section 3, we recall the statistical learning setting and present our PAC-Bayesian bound for CVaR. The proof of our main bound is in Section 4. In Section 5, we present a new way of deriving concentration bounds for CVaR which stems from our analysis in Section 4. Section 6 concludes and suggests future directions.
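Let $(\Omega, \mathcal{F}, P)$ be a probability space. For an exponent $p$, we denote by $L^p(\Omega)$ the space of $p$-integrable functions, and we let $\mathcal{M}_P(\Omega)$ be the set of probability measures on $\Omega$ which are absolutely continuous with respect to $P$. We reserve the notation $\mathbb{E}$ for the expectation under the reference measure $P$, although we sometimes write $\mathbb{E}_P$ for clarity. For random variables $Z_1, \dots, Z_n$, we denote by $\widehat{P}_n$ the empirical distribution, and we let $Z_{1:n} := (Z_1, \dots, Z_n)$. Furthermore, we let $\pi := (1, \dots, 1)^\top / n \in \mathbb{R}^n$ be the uniform distribution on the simplex. Finally, we use the notation $\widetilde{O}$ to hide log-factors in the sample size $n$.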
Coherent Risk Measures (CRM).
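A CRM (artzner1999coherent) is a functional $\mathrm{R}$ on real random variables that is simultaneously positive homogeneous, monotonic, translation equivariant, and sub-additive (see Appendix B for a formal definition). These are precisely the properties which make coherent risk measures excellent candidates in some machine learning applications (see e.g. (williamson2019fairness) for an application to fairness). For $\alpha \in (0,1)$ and a real random variable $Z$, the conditional value at risk $\mathrm{C}_\alpha[Z]$ is a CRM and is defined as the mean of the random variable $Z$ conditioned on the event that $Z$ is greater than its $(1-\alpha)$-th quantile (we use the convention in brown2007; wang2010; prashanth2013actor). This is equivalent to the following expression, which is more convenient for our analysis:
$$\mathrm{C}_\alpha[Z] = \inf_{\mu \in \mathbb{R}} \left\{ \mu + \frac{\mathbb{E}[Z - \mu]_+}{\alpha} \right\}. \qquad (1)$$
Key to our analysis is the dual representation of CRMs. It is known that any CRM $\mathrm{R}$ can be expressed as the support function of some closed convex set $\mathcal{Q} \subseteq L^1(\Omega)$ (rockafellar2013fundamental); that is, for any real random variable $Z$, we have
$$\mathrm{R}[Z] = \sup_{q \in \mathcal{Q}} \mathbb{E}_P[Z q]. \qquad (2)$$
In this case, the set $\mathcal{Q}$ is called the risk envelope associated with the risk measure $\mathrm{R}$. The risk envelope of $\mathrm{C}_\alpha$ is given by
$$\mathcal{Q}_\alpha := \left\{ q \in L^1(\Omega) \,:\, q = \frac{\mathrm{d}Q}{\mathrm{d}P} \leq \frac{1}{\alpha} \text{ for some probability measure } Q \text{ absolutely continuous with respect to } P \right\}, \qquad (3)$$
and so substituting $\mathcal{Q}_\alpha$ for $\mathcal{Q}$ in (2) yields $\mathrm{C}_\alpha[Z]$. Though the overall approach we take in this paper may be generalizable to other popular CRMs (see Appendix B), we focus our attention on CVaR, for which we derive new PAC-Bayesian and concentration bounds in terms of its natural estimator $\widehat{\mathrm{C}}_\alpha[Z]$; given i.i.d. copies $Z_1, \dots, Z_n$ of $Z$, we define
$$\widehat{\mathrm{C}}_\alpha[Z] := \inf_{\mu \in \mathbb{R}} \left\{ \mu + \sum_{i=1}^{n} \frac{[Z_i - \mu]_+}{n \alpha} \right\}. \qquad (4)$$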
From now on, we write and for and , respectively.
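The estimator and the dual representation can be checked against each other numerically. The sketch below assumes the standard variational form of the empirical CVaR (the infimum over $\mu$ of $\mu + \sum_i [Z_i - \mu]_+/(n\alpha)$) and the empirical analogue of the risk envelope (weights $q_i \in [0, 1/\alpha]$ with average 1); the function names `cvar_primal` and `cvar_dual` are hypothetical and introduced only for illustration.

```python
def cvar_primal(samples, alpha):
    """Empirical CVaR as min over mu of mu + sum_i max(z_i - mu, 0) / (n * alpha)."""
    n = len(samples)

    def objective(mu):
        return mu + sum(max(z - mu, 0.0) for z in samples) / (n * alpha)

    # The objective is convex and piecewise linear with kinks at the samples,
    # so it is enough to search over the sample values.
    return min(objective(mu) for mu in samples)


def cvar_dual(samples, alpha):
    """Empirical CVaR as max over weights q with 0 <= q_i <= 1/alpha and mean(q) = 1."""
    n = len(samples)
    budget = float(n)   # the weights must sum to n so that their mean is 1
    cap = 1.0 / alpha   # largest weight allowed by the risk envelope
    total = 0.0
    # Greedy solution of the linear program: load the largest samples first.
    for z in sorted(samples, reverse=True):
        q = min(cap, budget)
        total += q * z
        budget -= q
        if budget <= 0:
            break
    return total / n


data = [1.0, 2.0, 3.0, 4.0]
print(cvar_primal(data, 0.5), cvar_dual(data, 0.5))
print(cvar_primal(data, 0.3), cvar_dual(data, 0.3))
```

When $n\alpha$ is an integer, both functions return the average of the $n\alpha$ largest samples; otherwise the sample at the boundary receives a fractional weight.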
3 PAC-Bayesian Bound for the Conditional Value at Risk
In this section, we briefly describe the statistical learning setting, formulate our goal, and present our main results.
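In the statistical learning setting, $Z$ is a loss random variable which can be written as $Z = \ell(h, X)$, where $\ell$ is a loss function and $\mathcal{X}$ [resp. $\mathcal{H}$] is a feature [resp. hypothesis] space. The aim is to learn a hypothesis $\hat{h} \in \mathcal{H}$, or more generally a distribution $\hat{\rho}$ over $\mathcal{H}$ (also referred to as a randomized estimator), based on i.i.d. samples $X_1, \dots, X_n$ of $X$, which minimizes some measure of risk; typically, this is the expected risk $\mathbb{E}_P[\ell(\hat{\rho}, X)]$, where $\ell(\hat{\rho}, X) := \mathbb{E}_{h \sim \hat{\rho}}[\ell(h, X)]$.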
Our work is motivated by the idea of replacing this expected risk by any coherent risk measure R. In particular, if is the risk envelope associated with R, then our quantity of interest is
Thus, given a consistent estimator of and some prior distribution on , our grand goal (which goes beyond the scope of this paper) is to bound the risk as
with high probability. Based on (4), the consistent estimator we use for is
This is in fact a consistent estimator (see e.g. (duchi2018learning, Proposition 9)). As a first step towards the goal in (6), we derive a sharp PAC-Bayesian bound for the conditional value at risk, which we state now as our main theorem:
Let , , , and . Further, let be any distribution on a hypothesis set , be a loss, and be i.i.d. copies of . Then, for any “posterior” distribution over , , and , we have, with probability at least .
Discussion of the bound.
Although we present the bound for the bounded loss case, our result easily generalizes to the case where $\ell(h, X)$ is sub-Gaussian or sub-exponential, for all $h \in \mathcal{H}$. We discuss this in Section 5. Our second observation is that since $\mathrm{C}_\alpha$ is a coherent risk measure, it is convex in $Z$ (rockafellar2013fundamental), and so we can further bound the term $\mathbb{E}_{h \sim \hat{\rho}}[\mathrm{C}_\alpha[\ell(h, X)]]$ on the LHS of (8) from below by $\mathrm{C}_\alpha[\mathbb{E}_{h \sim \hat{\rho}}[\ell(h, X)]]$. This shows that the type of guarantee we have in (8) is in general tighter than the one in (6).
Although this has not been done explicitly before, a PAC-Bayesian bound of the form (6) can be derived for a risk measure R using an existing technique due to mcallester2003 as soon as, for any fixed hypothesis $h$, the difference between the risk of $\ell(h, X)$ and its estimate is sub-exponential with a sufficiently fast tail decay as a function of $n$ (see the proof of Theorem 1 in (mcallester2003)). While it has been shown that the difference $\mathrm{C}_\alpha[Z] - \widehat{\mathrm{C}}_\alpha[Z]$ also satisfies this condition for bounded i.i.d. random variables (see e.g. brown2007; wang2010), applying the technique of mcallester2003 yields a bound on $\mathbb{E}_{h \sim \hat{\rho}}[\mathrm{C}_\alpha[\ell(h, X)]]$ (i.e. the LHS of (8)) of the form
Such a bound is weaker than ours in two ways: (I) by Jensen’s inequality, the estimator term in our bound (defined in (7)) is always smaller than its counterpart in (9); and (II) unlike in our bound in (8), the complexity term inside the square-root in (9) does not multiply the empirical conditional value at risk. This means that our bound can be much smaller whenever the empirical conditional value at risk is small; this is to be expected in the statistical learning setting since $\hat{\rho}$ will typically be picked by an algorithm to minimize the empirical value. This type of improved PAC-Bayesian bound, where the empirical error appears multiplying the complexity term inside the square-root, has been derived for the expected risk in works such as (Seeger02; langford2003pac; catoni2007pac; maurer2004note); these are arguably the state-of-the-art generalization bounds.
A reduction to the expected risk.
A key step in the proof of Theorem 1 is to show that for a real random variable (not necessarily bounded) and , one can construct a function such that the auxiliary variable satisfies (I)
and (II) for i.i.d. copies of , the i.i.d. random variables satisfy
is sufficient to obtain a concentration bound for CVaR. Since are i.i.d., one can apply standard concentration inequalities, which are available whenever is sub-Gaussian or sub-exponential, to bound the difference in (12). Further, we show that whenever is sub-Gaussian or sub-exponential, then essentially so is . Thus, our method allows us to obtain concentration inequalities for , even when is unbounded. We discuss this in Section 5.
4 Proof Sketch for Theorem 1
In this section, we present the key steps taken to prove the bound in Theorem 1. We organize the proof in three subsections. In Subsection 4.1, we introduce an auxiliary estimator for , , which will be useful in our analysis; in particular, we bound this estimator in terms of (as in (11) above, but with the LHS replaced by ). In Subsection 4.2, we introduce an auxiliary random variable whose expectation equals (as in (10)) and whose empirical mean is bounded from above by the estimator introduced in Subsection 4.1—this enables the reduction described at the end of Section 3. In Subsection 4.3, we conclude the argument by applying the classical Donsker-Varadhan variational formula (donsker1976asymptotic; C75).
4.1 An Auxiliary Estimator for CVaR
In this subsection, we introduce an auxiliary estimator of and show that it is not much larger than . For , , and , define:
Using the set , and given i.i.d. copies of , let
In the next lemma, we give a “variational formulation” of , which will be key in our results:
Let , , and be as in (14). Then, for any ,
The proof of Lemma 2 (which is in Appendix A.1) is similar to that of the generalized Donsker-Varadhan variational formula considered in (Beck2003). The “variational formulation” on the RHS of (15) reveals some similarity between and the standard estimator defined in (4). In fact, thanks to Lemma 2, we have the following relationship between the two:
Let , , and . Further, let be the decreasing order statistics of . Then, for as in (13), we have
and if (not necessarily positive), then
4.2 An Auxiliary Random Variable
In this subsection, we introduce a random variable which satisfies the properties in (10) and (11), where are i.i.d. copies of (this is where we leverage the dual representation in (2)). This allows us to reduce the problem of estimating CVaR to that of estimating an expectation.
Let be an arbitrary set, and be some fixed measurable function (we will later set to a specific function depending on whether we want a new concentration inequality or a PAC-Bayesian bound for CVaR). Given a random variable in (arbitrary for now), we define
and the auxiliary random variable:
Let and be i.i.d. random variables in . Then, (I) the random variable in (19) and , are i.i.d. and satisfy , for all ; and (II) with probability at least ,
and as in (14). We now present a concentration inequality for the random variable in (19); the proof, which can be found in Appendix A, is based on a version of the standard Bernstein’s moment inequality (cesa2006prediction, Lemma A.5):
4.3 Exploiting the Donsker-Varadhan Formula
for any hypothesis . Next, we will need the following result which follows from the classical Donsker-Varadhan variational formula (donsker1976asymptotic; C75):
Let , and be any fixed (prior) distribution over . Further, let be any family of random variables such that , for all . Then, for any (posterior) distribution over , we have
Let , and . Further, let be i.i.d. random variables in . Then, for any randomized estimator over , we have, with ,
with probability at least on the samples , where is defined in (13).
If we could optimize the RHS of (25) over , this would lead to our desired bound in Theorem 1 (after some rearranging). However, this is not directly possible since the optimal depends on the sample through the term . The solution is to apply the result of Theorem 7 with a union bound, so that (25) holds for any estimator taking values in a carefully chosen grid ; to derive our bound, we will use the grid From this point, the proof of Theorem 1 is merely a mechanical exercise of rearranging (25) and optimizing over , and so we postpone the details to Appendix A.
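The change-of-measure step behind the Donsker-Varadhan formula can be illustrated on a finite hypothesis set. The sketch below checks the classical inequality $\mathbb{E}_{\rho}[f] \leq \mathrm{KL}(\rho \| \pi) + \ln \mathbb{E}_{\pi}[e^{f}]$ for an arbitrary posterior, and that the Gibbs posterior $\rho \propto \pi e^{f}$ attains equality. The prior, posterior, and values of $f$ are made-up numbers, and `kl` and `dv_gap` are hypothetical helper names; this is not the specific lemma used in the proof, only the variational fact it relies on.

```python
import math

def kl(rho, pi):
    """Kullback-Leibler divergence KL(rho || pi) between finite distributions."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0)

def dv_gap(rho, pi, f):
    """KL(rho || pi) + log E_pi[exp(f)] - E_rho[f]; nonnegative by Donsker-Varadhan."""
    log_mgf = math.log(sum(p * math.exp(v) for p, v in zip(pi, f)))
    mean_f = sum(r * v for r, v in zip(rho, f))
    return kl(rho, pi) + log_mgf - mean_f

pi = [0.25, 0.25, 0.25, 0.25]   # prior over four hypotheses (toy values)
f = [0.3, -1.2, 2.0, 0.5]       # one real number per hypothesis (toy values)
rho = [0.7, 0.1, 0.1, 0.1]      # an arbitrary posterior

# The Gibbs posterior, proportional to pi * exp(f), makes the inequality tight.
weights = [p * math.exp(v) for p, v in zip(pi, f)]
gibbs = [w / sum(weights) for w in weights]

print(f"gap for arbitrary posterior: {dv_gap(rho, pi, f):.6f}")
print(f"gap for Gibbs posterior:     {dv_gap(gibbs, pi, f):.2e}")
```

Because the inequality holds simultaneously for every posterior, a single exponential-moment bound under the prior transfers to any data-dependent posterior at the price of the KL term.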
5 New Concentration Bounds for CVaR
In this section, we show how some of the results of the previous section can be used to reduce the problem of estimating $\mathrm{C}_\alpha[Z]$ to that of estimating a standard expectation. This will then enable us to easily obtain concentration inequalities for $\widehat{\mathrm{C}}_\alpha[Z]$ even when $Z$ is unbounded. We note that previous works (kolla2019concentration; bhat2019concentration) used sophisticated techniques to deal with the unbounded case (sometimes achieving only sub-optimal rates), whereas we simply invoke existing concentration inequalities for empirical means thanks to our reduction.
Together, these two lemmas imply that, for any , i.i.d. random variables ,
with probability at least , where is as in (13) and are the decreasing order statistics of . Thus, getting a concentration inequality for can be reduced to getting one for the empirical mean of the i.i.d. random variables . Next, we show that whenever is a sub-exponential [resp. sub-Gaussian] random variable, essentially so is . But first we define what this means:
Let , , and be a random variable such that, for some ,
Then, is -sub-exponential [resp. -sub-Gaussian] if [resp. ].
Let and . Let be a zero-mean real random variable and let be as in (26). If is -sub-exponential [resp. -sub-Gaussian], then
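Note that in Lemma 9 we have assumed that $Z$ is a zero-mean random variable, and so we still need to do some work to derive a concentration inequality for $\widehat{\mathrm{C}}_\alpha[Z]$. In particular, we will use the fact that $\mathrm{C}_\alpha[Z - \mathbb{E}[Z]] = \mathrm{C}_\alpha[Z] - \mathbb{E}[Z]$ and $\widehat{\mathrm{C}}_\alpha[Z - \mathbb{E}[Z]] = \widehat{\mathrm{C}}_\alpha[Z] - \mathbb{E}[Z]$, which holds since $\mathrm{C}_\alpha$ and $\widehat{\mathrm{C}}_\alpha$ are coherent risk measures, and thus translation invariant (see Definition 14). We use this in the proof of the next theorem (which is in Appendix A):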
Let , , and be as in (13). If is a -sub-Gaussian random variable, then with and , we have
otherwise, if is a -sub-exponential random variable, then
We note that unlike the recent results due to bhat2019concentration which also deal with the unbounded case, the constants in our concentration inequalities in Theorem 10 are explicit.
A similar inequality holds for the sub-exponential case. We note that the term in (31) can be further bounded from above by
This follows from the triangle inequality and the facts that (see e.g. Lemma 4.1 in brown2007), and (ahmadi2012entropic). The remaining term in (32) which depends on the unknown can be bounded from above using another concentration inequality.
Generalization bounds of the form (6) for unbounded but sub-Gaussian or sub-exponential $\ell(h, X)$, $h \in \mathcal{H}$, can be obtained using the PAC-Bayesian analysis of (mcallester2003, Theorem 1) and our concentration inequalities in Theorem 10. However, due to the fact that $\alpha$ is squared in the argument of the exponentials in these inequalities (which is also the case in the bounds of bhat2019concentration; kolla2019concentration), the generalization bounds obtained this way will have the $\alpha$ outside the square-root “complexity term”, unlike our bound in Theorem 1.
We conjecture that the dependence on $\alpha$ in the concentration bounds of Theorem 10 can be improved by swapping $\alpha^2$ for $\alpha$ in the argument of the exponentials; in the sub-Gaussian case, this would move $\alpha$ inside the square-root on the RHS of (31). We know that this is at least possible for bounded random variables, as shown in brown2007; wang2010. We now recover this fact by presenting a new concentration inequality for $\widehat{\mathrm{C}}_\alpha[Z]$ when $Z$ is bounded, using the reduction described at the beginning of this section.
Let , and be i.i.d. random variables in . Then, with probability at least ,
The proof is in Appendix A. The inequality in (33) essentially replaces the range of the random variable $Z$ typically present under the square-root in other concentration bounds (brown2007; wang2010) by the smaller quantity $\mathrm{C}_\alpha[Z]$. The concentration bound (33) is not immediately useful for computational purposes since its RHS depends on $\mathrm{C}_\alpha[Z]$. However, it is possible to rearrange this bound so that only the empirical quantity $\widehat{\mathrm{C}}_\alpha[Z]$ appears on the RHS of (33) instead of $\mathrm{C}_\alpha[Z]$; we provide the means to do this in Lemma 13 in the appendix.
6 Conclusion and Future Work
In this paper, we derived a first PAC-Bayesian bound for CVaR by reducing the task of estimating CVaR to that of merely estimating an expectation (see Section 4). This reduction then made it easy to obtain concentration inequalities for CVaR (with explicit constants) even when the random variable in question is unbounded (see Section 5).
We note that the only steps in the proof of our main bound in Theorem 1 that are specific to CVaR are Lemmas 2 and 3, and so the question is whether our overall approach can be extended to other coherent risk measures to achieve (6).
In Appendix B, we discuss how our results may be extended to a rich class of coherent risk measures known as $\varphi$-entropic risk measures. These CRMs are often used in the context of robust optimization (namkoong2017), and are natural candidates to consider next in the context of this paper.
Appendix A Proofs
A.1 Proof of Lemma 2
Let , where for a set , if ; and otherwise. From (14), we have that is equal to
where we recall that . The Lagrangian dual of (34) is given by
where (37) is due to , and (38) follows by setting and noting that the in (37) is always attained at a point satisfying , in which case ; this is true because by the positivity of , if , then can always be made smaller while keeping the difference fixed. Finally, since the primal problem is feasible— is a feasible solution—there is no duality gap (see the proof of [Beck2003, Theorem 4.2]), and thus the RHS of (38) is equal to in (34). The proof is concluded by noting that the Fenchel dual of satisfies , for all . ∎
A.2 Proof of Lemma 3