to prove high-probability excess risk upper bounds for empirical risk minimization (ERM) in random design settings even if the magnitude of the noise and the estimates is unbounded. Our result (Theorem 1) covers bounded settings (Bartlett et al., 2005; Koltchinskii, 2011), extends to sub-Gaussian or even subexponential noise (van de Geer, 2000; Györfi and Wegkamp, 2008), and handles hypothesis classes with unbounded magnitude (Lecué and Mendelson, 2013; Mendelson, 2014; Liang et al., 2015). Furthermore, it applies to many loss functions besides the squared loss, and does not need additional statistical assumptions such as the bounded kurtosis of the transformed covariates over the hypothesis class, an assumption that prevents the latest developments from providing tight excess risk bounds for many sub-Gaussian cases (Section 1.2).
To demonstrate the effectiveness of our method for such unbounded settings, we use our general excess risk bound (Theorem 1) to provide a detailed analysis of linear least squares estimators with a quadratic slope constraint or penalty, under sub-Gaussian noise and domain, in the random design, nonrealizable setting (Section 3). Our result for the slope constrained case extends Theorem A of Lecué and Mendelson (2013) and nearly proves the conjecture of Shamir (2015), while our treatment of the penalized case (ridge regression) is comparable to the work of Hsu et al. (2014).
The rest of this section introduces our notation through the formal definition of the regression problem and ERM estimators (Section 1.1), and discusses the limitations of current excess risk upper bounds in the literature (Section 1.2). Then, we provide our main result in Section 2 to upper bound the excess risk of ERM estimators, and discuss its properties for various settings including many loss functions besides the squared loss. Next, Section 3 provides a detailed analysis for linear least squares estimators including the slope constrained case (Section 3.1) and ridge regression (Section 3.2). Finally, Section 4 proves our main result (Theorem 1).
1.1 Empirical risk minimization
For the formal definition of a regression problem, consider a probability distribution over some set with some domain being a separable Hilbert space, a loss function , and a reference class . (Footnote 1: All sets and functions considered are assumed to be measurable as necessary. To simplify the presentation, we omit these conditions, noting that all measurability issues can be overcome using standard techniques because we work with separable Hilbert spaces; see, e.g., Dudley, 1999, Chapter 5.)
The task of a regression estimator is to produce a function based on a training sample of pairs independently sampled from (in short ), such that the prediction error, , is “small” on a new instance with respect to .
The risk of function is defined as and the cost of using a fixed function is measured by the excess risk with respect to :
We also use the notation for any , hence we can write for any . (Footnote 2: A straightforward limiting argument can be used if the minima are not attained for .)
An estimator is a sequence of functions , where maps the data to an estimate . These estimates lie within some hypothesis class , that is , where might depend on the random sample .
Then, for a regression problem specified by , the goal of an estimator is to produce estimates which minimize the excess risk with high-probability or in expectation, where the random event is induced by the random sample and the possible randomness of the estimator .
In this paper, we consider ERM estimators. Formally, is called an -approximate -penalized ERM estimate with respect to the class , in short -ERM(), when and
where is the empirical risk of function , is a penalty function and is an error term. All , , and might depend on the sample . When the penalty function is zero (that is ), we simply write -ERM(). If both and , we say ERM().
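The definition above can be illustrated on a finite class. The following is only a minimal sketch: it assumes the usual reading of the definition, namely that the estimate's penalized empirical risk exceeds the smallest penalized empirical risk over the class by at most the error term. The one-parameter class of slopes, the quadratic penalty weight `lam`, and all function names are hypothetical and chosen purely for illustration.

```python
def squared_loss(y, y_hat):
    """Squared loss between a response y and a prediction y_hat."""
    return (y - y_hat) ** 2


def penalized_objective(a, sample, lam):
    """Empirical risk of x -> a * x plus the (assumed) quadratic penalty lam * a^2."""
    risk = sum(squared_loss(y, a * x) for x, y in sample) / len(sample)
    return risk + lam * a * a


def is_approx_penalized_erm(a_hat, slopes, sample, lam, alpha):
    """Check that a_hat is within alpha of the best penalized empirical risk over `slopes`."""
    best = min(penalized_objective(a, sample, lam) for a in slopes)
    return penalized_objective(a_hat, sample, lam) <= best + alpha


# Toy sample of (x, y) pairs and a finite grid of candidate slopes (illustrative only).
sample = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]
slopes = [k / 10 for k in range(0, 21)]
```

With zero penalty and zero error term the check accepts only exact minimizers over the grid, while a positive error term also admits nearby suboptimal slopes.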
1.2 Limitations of current methods
Now we provide a simple regression problem class for which we are not aware of any technique in the literature that could provide a tight excess risk bound up to logarithmic factors for empirical risk minimization. Consider the following problem set:
where , a.s. stands for almost surely, and, and consider the squared loss defined as for all . Notice that and for all . Then, we discuss various techniques from the literature which aim to bound the “performance” of an estimate ERM().
Because here we have a random design setting, the results of van de Geer (2000, Theorems 9.1 and 9.2) do not apply. Moreover, the regression function cannot be represented by the class , so the methods of Györfi et al. (2002, Theorem 11.3), and Györfi and Wegkamp (2008, Corollary 1) do not provide an excess risk bound.
As the domain is unbounded, so is the range of any nonzero function in . Additionally, the squared loss is neither Lipschitz nor bounded on the range of the response, which is the whole real line . Hence, techniques including Bartlett et al. (2005, Corollary 5.3), Koltchinskii (2011, Theorem 5.1), Mehta and Williamson (2014, Theorem 6), and Grünwald and Mehta (2016, Theorem 14 with Proposition 4) fail to provide any rate for this case.
We also mention the work of van der Vaart and Wellner (2011, Theorem 3.2), which, although it works for this setting, can only provide an rate for sample size , which can be improved to by our result (Lemmas 3 and 1).
Next, denote the kurtosis about the origin by
for some random variable, and consider the recent developments of Lecué and Mendelson (2013, Theorem A), and Liang et al. (2015, Theorem 7). These results need that the kurtosis of the random variables is bounded for any . However, observe that for any function with , which can be arbitrarily large as gets close to zero.
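To see how a kurtosis about the origin can blow up, here is a small sketch. It assumes the usual definition, the fourth moment divided by the squared second moment, and uses an indicator random variable that is not the problem class of this section but a hypothetical stand-in chosen because its moments are available in closed form.

```python
def kurtosis_about_origin(values, probs):
    """E[W^4] / (E[W^2])^2 for a discrete random variable W (assumed definition)."""
    second_moment = sum(p * v ** 2 for v, p in zip(values, probs))
    fourth_moment = sum(p * v ** 4 for v, p in zip(values, probs))
    return fourth_moment / second_moment ** 2


def indicator_kurtosis(p):
    """Kurtosis about the origin of W with P(W = 1) = p and P(W = 0) = 1 - p."""
    return kurtosis_about_origin([0.0, 1.0], [1.0 - p, p])
```

For this indicator variable both moments equal `p`, so the ratio is `1 / p`, which grows without bound as `p` approaches zero.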
1.3 Highlights of our technique
Our excess risk bound builds on the development of inexact oracle inequalities for ERM estimators (e.g., Györfi et al., 2002, Theorem 11.5), which uses the decomposition
with some . Then the random variables for all , having a negative bias, often satisfy a moment condition (5), which we cannot guarantee for . Using this moment condition, we can augment the chaining technique (e.g., Pollard, 1990, Section 3) with an extra initial step, which provides a new term in the bound. This new term can be balanced with the (truncated) entropy integral, thereby tightening the bound significantly in many cases.
By defining as a reference function (instead of regression function), the inexact oracle inequalities become exact when . In fact, the notion of exact and inexact becomes meaningless as long as the approximation error between and is kept under control and incorporated into the bound as it is often done for sieved estimators (e.g., van de Geer, 2000, Section 10.3).
To prove the moment condition (5) for a reference function and a hypothesis class , we use Bernstein’s inequality (Lemma 14) with the Bernstein condition (3). These tools are standard; however, we have to use Bernstein’s inequality for the sub-Gaussian random variable so that appears in the bound. A naive way to do this would require the kurtosis to be bounded for all , which cannot be guaranteed in many cases (Section 1.2). Hence, we use a truncation technique (Lemma 15) that pushes the kurtosis bound under a logarithmic transformation, which can be eliminated by considering functions with excess risk bounded away from zero.
The Bernstein condition (3) has been well studied for strongly convex loss functions (e.g., Bartlett et al., 2006, Lemma 7), by exploiting that strong convexity provides an upper bound to the quadratic function. However, because it is enough for our technique to consider functions with excess risk bounded away from zero, we can use the Bernstein condition for any Lipschitz loss function (Section 2.2.1) by scaling its parameters depending on the sample size and balancing the appropriate terms in the excess risk bound (Theorem 1). In many cases, this provides an alternative way of deriving excess risk bounds for other loss functions without using the entropy integral.
2 Excess risk upper bound
Here we state our excess risk upper bound for ERM estimators.
Our result requires a few conditions to be satisfied by the random variables with , which are related to the excess risk through . Similarly, we use the empirical excess risk defined as , where .
We also use subexponential random variables ( ) and vectors characterized by the -Orlicz norm with defined as , where , , is the Euclidean norm, and . The properties of random vectors with are reviewed in Appendix A.
Furthermore, we need covering numbers and entropies. Let be a nonempty metric space and . The set is called an (internal) -cover of under if the -balls of centers and radius cover : for any , . The -covering number of under , denoted by , is the cardinality of the -cover with the fewest elements:
with . Further, the -entropy of under is defined as the logarithm of the covering number, .
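As a concrete illustration of internal covers, the sketch below builds a cover of a finite point set greedily. The greedy rule is an assumption made here for simplicity: it always returns a valid internal cover, but not necessarily one with the fewest elements, so its size (and the logarithm of its size) only upper-bounds the covering number (and the entropy). The function names and the toy point set are hypothetical.

```python
import math


def greedy_cover(points, eps, dist):
    """Pick centers from `points` so that every point is within eps of some center."""
    centers = []
    for p in points:
        # Add p as a new center only if no existing center already covers it.
        if all(dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers


def is_cover(centers, points, eps, dist):
    """Check the defining property: min over centers of the distance is at most eps."""
    return all(min(dist(p, c) for c in centers) <= eps for p in points)


def abs_dist(a, b):
    """Metric on the real line."""
    return abs(a - b)


points = [float(k) for k in range(11)]          # 0, 1, ..., 10
centers = greedy_cover(points, 1.0, abs_dist)   # greedy internal 1-cover
entropy_upper_bound = math.log(len(centers))    # upper bound on the 1-entropy
```

On this point set the greedy rule selects six centers, whereas the four centers 1, 4, 7, 10 already form a cover, which shows why the greedy size is only an upper bound.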
Finally, our upper bound on the excess risk of ERM estimates is the following:
Theorem 1. Consider a regression problem with an i.i.d. training sample . Let be a hypothesis class which might depend on the data , and let -ERM(). Further, let be two function classes, where might depend on , but might depend on the sample only through its size . Finally, suppose that the following conditions hold for some metric , , , and for some :
the enclosement holds with probability at least ,
there exists such that ,
there exists and such that a.s. for all , and ,
there exists and such that holds for all ,
there exist and such that
Then for all , we have with probability at least that
Furthermore, the result holds without (4), that is using , , and .
We point out that (1) disappears when one sets as it is usually done in the literature. However, this is an implicit assumption that either is small enough to be negligible (i.e., ), or equivalently the estimator knows the value of . By choosing the sets to be slightly different, Theorem 1 covers the practical case when an estimator approximates and by their empirical versions and , respectively, so uses a data-dependent hypothesis class .
Notice that if , then (2) reduces to bounding the penalty and error terms, that is proving with probability at least . When ERM(), which is a usual setting in the literature, (2) is immediately satisfied by . In this case Theorem 1 is an exact oracle inequality (e.g., Lecué and Mendelson, 2013, Eq. 1.1).
Furthermore, observe that Theorem 1 uses metric for the entropy , which is related to the loss function through (3) and (4). This allows us to apply the result to estimates with unbounded magnitude, as long as they can be parametrized by some bounded space. In such a case, is defined on the bounded parameter space, which keeps the entropy finite.
In the following sections we provide a detailed analysis for the moment condition (5), showing that it holds for many practical settings and loss functions besides the squared loss. We note that (5) is very similar to the stochastic mixability condition of Mehta and Williamson (2014, Section 2.1), which is equivalent to (5) with and .
Finally, we mention that if the conditions of Theorem 1 hold for all , we can transform the result to an expected excess risk bound. To see this, suppose that holds for all , some , and some . Then setting for any , we get
where the expectation is taken with respect to the random sample and the potential extra randomness of the estimator producing .
2.1 Bounded losses
We start with a simple case when the loss function is bounded, which implies that holds for some . Now notice that the random variable in the exponent of (5), that is , has a negative expected value . Furthermore, is bounded away from zero as by the definition of . Then, combining these observations with Hoeffding’s lemma, we get the following result:
Suppose that a.s. holds for some . Then satisfies (5) with any and .
Fix any , and set . Then, apply Hoeffding’s lemma to the bounded random variable to get
as by due to the definition of and . ∎
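The key tool in the proof above is Hoeffding's lemma: a random variable W taking values in [a, b] almost surely satisfies E[exp(s(W − E[W]))] ≤ exp(s²(b − a)²/8) for every real s. The sketch below checks this inequality numerically for a two-point distribution; the distribution and the function names are illustrative assumptions, not part of the paper's argument.

```python
import math


def centered_mgf(values, probs, s):
    """E[exp(s * (W - E[W]))] for a discrete random variable W."""
    mean = sum(p * v for v, p in zip(values, probs))
    return sum(p * math.exp(s * (v - mean)) for v, p in zip(values, probs))


def hoeffding_bound(a, b, s):
    """Right-hand side of Hoeffding's lemma for W in [a, b]."""
    return math.exp(s * s * (b - a) ** 2 / 8.0)


# W takes the value -1 with probability 0.7 and 2 with probability 0.3.
values, probs = [-1.0, 2.0], [0.7, 0.3]
checks = [
    centered_mgf(values, probs, s) <= hoeffding_bound(-1.0, 2.0, s)
    for s in (-2.0, -0.5, 0.1, 1.0, 3.0)
]
```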
2.2 Unbounded losses
We show that the moment condition (5) is often implied by the Bernstein condition (e.g., Lecué and Mendelson, 2013, Definition 1.2), which is said to be satisfied by if there exists such that for all , we have
Hence, the conditions of Bernstein’s lemma (Lemma 14) hold for and with any , so by , we obtain
where in the last step we applied the Bernstein condition (3) also implying for all , and used . ∎
Notice that Lemma 3 “splits” the subexponential property of the random variable between and . For the squared loss , using provides the sub-Gaussian setting (Lecué and Mendelson, 2013). Furthermore, when the random variable is bounded, that is a.s., we have for all , hence we can use and to cover the setting of uniformly bounded functions and subexponential noise with the squared loss (van de Geer, 2000, Section 9.2). For bounded problems (e.g., Bartlett and Mendelson, 2006; Koltchinskii, 2011), when is bounded, we can use , and eliminate the term completely.
Lemma 3 also shows that even in the worst case we can set the leading constant in the bound of Theorem 1 as with , which scales logarithmically in the sample size , and depends on the regression parameters only through the Bernstein condition (3) of . In the following sections we investigate this dependence for a few popular regression settings.
2.2.1 Lipschitz losses
Observe that if holds for some , then the Bernstein condition is always satisfied for the function class with any by . To see this, use Lemma 124 with , and for any due to the definition of , to obtain
As and so scale with , this setting is similar to the bounded case (Section 2.1), so we choose to balance the appropriate terms of Theorem 1. For this, here we use which balances with . This way the bound of Theorem 1 scales with , which again cannot be improved in general. (Footnote 3: For example, consider estimating the mean of a standard Gaussian random variable through constant functions using the absolute value loss, and derive the optimal rate by Theorem 1.)
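For the mean-estimation example with the absolute value loss mentioned above, the ERM estimate over constant functions is a sample median, and for a standard Gaussian the median coincides with the mean. The sketch below only illustrates this ERM computation; it does not attempt to verify any rate, and the sample size, seed, and candidate grid are arbitrary assumptions.

```python
import random
import statistics


def empirical_abs_risk(c, ys):
    """Empirical risk of the constant function c under the absolute value loss."""
    return sum(abs(y - c) for y in ys) / len(ys)


def erm_constant_abs_loss(ys):
    """A sample median minimizes the empirical absolute value loss over constants."""
    return statistics.median(ys)


random.seed(0)
ys = [random.gauss(0.0, 1.0) for _ in range(201)]  # standard Gaussian sample
c_hat = erm_constant_abs_loss(ys)

# Compare against a grid of alternative constants (illustrative only).
candidates = [k / 100 for k in range(-300, 301)]
best_on_grid = min(empirical_abs_risk(c, ys) for c in candidates)
```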
2.2.2 Strongly-convex losses
Now consider a loss function , which is -strongly convex in its second argument, that is holds for all and . Then, if is satisfied for all , the Bernstein condition (3) holds with . To see this, proceed similarly to Bartlett et al. (2006, Lemma 7) by using the strong convexity property of to get for all that
Notice that the condition is implied by the definition of if either is a regression function defined by the reference class , or when and is midpoint convex, that is implies .
Then, we need and to satisfy the requirements of Lemma 3. Again, we get the latter for any Lipschitz loss as in Section 2.2.1. However, here the constant of the Bernstein condition (3) does not scale with the sample size , which provides better rates by Theorem 1.
One such example is logistic regression with a.s. using the cross-entropy loss where for , and a hypothesis class with some . Because the function is -Lipschitz and -strongly convex in its second argument over the domain , we get the requirements of Lemma 3 with , , , and by (4). (Footnote 4: To get these values, use and .) Here notice that does not scale with the sample size, unlike in the general Lipschitz case of Section 2.2.1, which allows Theorem 1 to deliver better rates.
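The Lipschitz and strong convexity properties of a logistic-type loss on a bounded domain can be checked numerically. The sketch below assumes one common parametrization, the loss ln(1 + e^t) − y·t with response y in [0, 1] and prediction t in [−B, B]; this form, the bound `B`, and the resulting constants are assumptions that may differ from the paper's exact definitions. Under this form the derivative in t is σ(t) − y, bounded by 1 in absolute value, and the second derivative is σ(t)(1 − σ(t)), which is smallest at the boundary of the domain.

```python
import math


def sigmoid(t):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-t))


def logistic_loss(y, t):
    """Assumed cross-entropy form: ln(1 + e^t) - y * t."""
    return math.log(1.0 + math.exp(t)) - y * t


def first_derivative(y, t):
    """Derivative of the loss in its second argument."""
    return sigmoid(t) - y


def second_derivative(t):
    """Second derivative of the loss in its second argument (independent of y)."""
    s = sigmoid(t)
    return s * (1.0 - s)


B = 2.0  # hypothetical bound on the magnitude of predictions
grid = [-B + 2.0 * B * k / 400 for k in range(401)]

lipschitz_estimate = max(abs(first_derivative(y, t)) for y in (0.0, 0.5, 1.0) for t in grid)
curvature_lower = min(second_derivative(t) for t in grid)
curvature_at_boundary = second_derivative(B)
```

The curvature at the boundary is positive for every finite `B` but decays as `B` grows, so the strong convexity constant obtained this way depends on the bound of the hypothesis class.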
Notice that the squared loss is -strongly convex, however, it is not Lipschitz over the real line. Fortunately, this is not needed for the condition which holds for some when is bounded due to the decomposition .
3 Linear least squares regression
Here we provide an analysis for the linear least squares regression setting, which uses the squared loss and considers ERM estimators over affine hypothesis classes for regression problems with sub-Gaussian distributions defined as
with some sub-Gaussian parameters , and feature space with dimension . Further, we consider affine reference classes , and use least squares estimators (LSEs), that is ERM estimators (1) using the squared loss , over some hypothesis class within affine functions .
First, we derive a general result (Corollary 6) which is specialized later for the slope constrained (Section 3.1) and penalized (Section 3.2) settings. For the general result, we set the reference class to the set of slope-bounded affine functions as , where for some Lipschitz bound .
Here we only consider penalty functions which are independent of the bias term satisfying . Then, we have
hence any estimate -ERM() can be expressed as with some . Moreover, because , we can also write any reference function as with some .
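The role of the bias term can be illustrated in one dimension. The sketch below assumes the standard fact that, when the penalty does not depend on the bias, minimizing the penalized empirical squared loss over the bias gives a fit that passes through the sample means, so the estimate can be written in centered form. The quadratic penalty `lam * a**2`, the toy data, and the function names are hypothetical choices for illustration.

```python
def penalized_objective(a, b, xs, ys, lam):
    """Empirical squared loss of x -> a * x + b plus a penalty on the slope only."""
    n = len(xs)
    risk = sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
    return risk + lam * a * a


def ridge_1d(xs, ys, lam):
    """Closed-form minimizer of the objective above in one dimension."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    var = sum((x - x_bar) ** 2 for x in xs) / n
    a = cov / (var + lam)      # slope shrunk by the penalty
    b = y_bar - a * x_bar      # bias makes the fit pass through the sample means
    return a, b, x_bar, y_bar


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 2.9, 5.1, 7.0]
a_hat, b_hat, x_bar, y_bar = ridge_1d(xs, ys, lam=0.5)
```

Because the bias is unpenalized, the estimate evaluated at the sample mean of the inputs equals the sample mean of the responses, whatever the penalty weight.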
Now introduce the following linear function classes:
for any . Observe that any reference function satisfies for any , and any estimate -ERM() with Lipschitz bound satisfies -ERM() for all .
Because distribution is unknown, estimators cannot be represented by the class , just by its data-dependent approximation