The minimax principle in statistical estimation prescribes procedures (i.e.
, estimators) that minimize the worst-case risk over a large class of distributions generating the data. For a given loss function, the risk is the expectation of the loss of the estimator, where the expectation is taken over the data examined by the estimator. For example, for a large class of loss functions including squared loss, the empirical mean estimator minimizes the worst-case risk over the class of Gaussian distributions with known variance(Wolfowitz, 1950). In fact, Gaussian distributions with the specified variance are essentially the worst-case family of distributions for squared loss, at least up to constants (see, e.g., Catoni, 2012, Proposition 6.1).
In this work, we are interested in estimators whose deviations from expected behavior are controlled with very high probability over the random draw of the data examined by the estimator. Deviations of the behavior of the estimator from its expected behavior are worrisome especially when data come from unbounded and/or heavy-tail distributions, where only very low order moments may be finite. For example, the Pareto distributions with shape parameter are unbounded and have finite moments only up to orders ; these distributions are commonly associated with the modeling of extreme events that manifest in data. Bounds on the expected behavior of an estimator are insufficient in these cases, since the high-probability guarantees that may be derived from such bounds (say, using Markov’s inequality) are rather weak. For example, if the risk (i.e., expected loss) of an estimator is bounded by , then all that we may derive from Markov’s inequality is that the loss is no more than with probability at least . For small values of , the guarantee is not very reassuring, but it may be all one can hope for in these extreme scenarios—see Remark 1 in Section 3.1
for an example where this is tight. Much of the work in statistical learning theory is also primarily concerned with such high probability guarantees, but the bulk of the work makes either boundedness or subgaussian tail assumptions that severely limit the applicability of the results even in settings as simple as linear regression(see, e.g., Srebro et al., 2010; Shamir, 2014).
Recently, it has been shown that it is possible to improve on methods which are optimal for expected behavior but suboptimal when high-probability deviations are concerned (Audibert and Catoni, 2011; Catoni, 2012; Brownlees et al., 2014). These improvements, which are important when dealing with heavy-tailed distributions, suggest that new techniques (e.g., beyond empirical risk minimization) may be able to remove the reliance on boundedness or control of high-order moments.
This work applies and generalizes a technique for controlling large deviations from the expected behavior with high probability, assuming only bounded low-order moments such as variances. We show that the technique is applicable to minimization of smooth and strongly convex losses, and derive specific loss bounds for least squares linear regression, which match existing rates, but without requiring the noise or covariates to be bounded or subgaussian. This contrasts with several recent works (Srebro et al., 2010; Hsu et al., 2014; Shamir, 2014) concerned with (possibly regularized) empirical risk minimizers that require such assumptions. It is notable that in finite dimensions, our result implies that a constant factor approximation to the optimal loss can be achieved with a sample size that is independent of the size of the optimal loss. This improves over the recent work of Mahdavi and Jin (2013), which has a logarithmic dependence on the optimal loss, as well as a suboptimal dependence on specific problem parameters (namely condition numbers). We also provide a new generalization of the basic technique for general metric spaces, which we apply to least squares linear regression with heavy tail covariate and noise distributions, yielding an improvement over the computationally expensive procedure of Audibert and Catoni (2011).
The basic technique, found in the textbook of Nemirovsky and Yudin (1983, p. 243), is very simple, and can be viewed as a generalization of the median-of-means estimator used by Alon et al. (1999) and many others. The idea is to repeat an estimate several times, by splitting the sample into several groups, and then selecting a single estimator out of the resulting list of candidates. If an estimator from one group is good with noticeably better-than-fair chance, then the selected estimator will be good with probability exponentially close to one. This is remininscant of techniques from robust statistics (Huber, 1981)
, although our aim is expressly different in that our aim is good performance on the same probability distribution generating the data, rather than an uncontaminated or otherwise better behaved distribution. Our new technique can be cast as a simple selection problem in general metric spaces that generalizes the scalar median.
We demonstrate the versatility of our technique by giving further examples in sparse linear regression (Tibshirani, 1996) under heavy-tailed noise and low-rank covariance covariance matrix approximation (Koltchinskii et al., 2011) under heavy-tailed covariate distributions. We also show that for prediction problems where there may not be a reasonable metric on the predictors, one can achieve similar high-probability guarantees by using median aggregation in the output space.
The initial version of this article (Hsu and Sabato, 2013, 2014) appeared concurrently with the simultaneous and independent work of Minsker (2013), which develops a different generalization of the median-of-means estimator for Banach and Hilbert spaces. We provide a new analysis and comparison of this technique to ours in Section 7. We have also since become aware of the earlier work by Lerasle and Oliveira (2011), which applies the median-of-means technique to empirical risks in various settings much like the way we do in Algorithm 3, although our metric formulation is more general. Finally, the recent work of Brownlees et al. (2014) vastly generalizes the techniques of Catoni (2012) to apply to much more general settings, although they retain some of the same deficiencies (such as the need to know the noise variance for the optimal bound for least squares regression), and hence their results are not directly comparable to ours.
2 Overview of main results
This section gives an overview of the main results.
Let for any natural number . Let take value if the predicate is true, and otherwise. Assume an example space , and a distribution over the space. Further assume a space of predictors or estimators . We consider learning or estimation algorithms that accept as input an i.i.d. sample of size drawn from and a confidence parameter , and return an estimator (or predictor) . For a (pseudo) metric on , let denote the ball of radius around .
2.2 Robust distance approximation
Consider an estimation problem, where the goal is to estimate an unknown parameter of the distribution, using a random i.i.d. sample from that distribution. We show throughout this work that for many estimation problems, if the sample is split into non-overlapping subsamples, and estimators are obtained independently from each subsample, then with high probability, this generates a set of estimators such that some fraction of them are close, under a meaningful metric, to the true, unknown value of the estimated parameter. Importantly, this can be guaranteed in many cases even under under heavy-tailed distributions.
Having obtained a set of estimators, a fraction of which are close to the estimated parameter, the goal is now to find a single good estimator based on this set. This goal is captured by the following general problem, which we term Robust Distance Approximation. A Robust Distance Approximation procedure is given a set of points in a metric space and returns a single point from the space. This single point should satisfy the following condition: If there is an element in the metric space that a certain fraction of the points in the set are close to, then the output point should also be close to the same element. Formally, let be a metric space. Let be a (multi)set of size and let be a distinguished element in . For and , denote by the minimal number such that . We often omit the subscript and write simply when is known.
We define the following problem: [Robust Distance Approximation] Fix . Given and as input, return such that , for some constant . is the approximation factor of the procedure.
In some cases, learning with heavy-tailed distributions requires using a metric that depends on the distribution. Then, the Robust Distance Estimation procedure has access only to noisy measurements of distances in the metric space, and is required to succeed with high probability. In Section 3 we formalize these notions, and provide simple implementations of Robust Distance Approximation for general metric spaces, with and without direct access to the metric. For the case of direct access to the metric our formulation is similar to that of Nemirovsky and Yudin (1983).
2.3 Convex loss minimization
The general approach to estimation described above has many applications. We give here the general form of our main results for applications, and defer the technical definitions and results to the relevant sections. Detailed discussion of related work for each application is also provided in the appropriate sections.
First, we consider smooth and convex losses. We assume a loss function that assigns a non-negative number to a pair of an example from and a predictor from , and consider the task of finding a predictor that has a small loss in expectation over the distribution of data points. The expected loss of a predictor on the distribution is denoted . Let . Our goal is to find such that is close to . We assume that the parameter space is a Banach space with a norm and a dual norm, where the dual norm is -smooth for some . We further assume that for some and for some sample size , with some probability larger than half, the average loss of a predictor on the points in the sample is -strongly convex as a function of the predictor, with respect to the defined norms. We prove the following result: There exists an algorithm that accepts as input an i.i.d. sample of size drawn from and a confidence parameter , and returns , such that if the following conditions hold:
for some universal constant ;
is -smooth with respect to for all ;
is -smooth with respect to ,
then with probability at least , for another universal constant ,
This gives a constant approximation of the optimal loss with a number of samples that does not depend on the value of the optimal loss. The full results for smooth convex losses are provided in Section 4. Theorem 2.3 is stated in full as Corollary 4, and we further provide a result with more relaxed smoothness requirements. As apparent in the result, the only requirements on the distribution are those that are implied by the strong convexity and smoothness parameters. This allows support for fairly general heavy-tailed distributions, as we show below.
2.4 Least squares linear regression
A concrete application of our analysis of smooth convex losses is linear regression. In linear regression, is a Hilbert space with an inner product , and it is both the data space and the parameter space. The loss is the squared loss
and are defined similarly to and .
Unlike standard high-probability bounds for regression, we give bounds that make no assumption on the range or the tails of the distribution of the response variables, other than a trivial requirement that the optimal squared loss be finite. The assumptions on the distribution of the covariates are also minimal.
Let be the second-moment operator , where is a random data point from the marginal distribution of on . For a finite-dimensional , is simply the (uncentered) covariance matrix . First, consider the finite-dimensional case, where , and assume is not singular. Under only bounded moments of the marginal on (a condition that we specify in full detail in Section 5), we show the following guarantee.
Assume the marginal of has bounded moments. There is a constant and an algorithm that accepts as input a sample of size and a confidence parameter , and returns , such that if , with probability at least ,
Theorem 2.4 can be specialized for specific cases of interest. For instance, suppose is bounded and well-conditioned in the sense that there exists such that , but may still be heavy-tailed. Under this assumption we have the following result.
Assume is not singular. There exists an algorithm an algorithm that accepts as input a sample of size and a confidence parameter , and returns , such that with probability at least , for ,
so . is closely related to a condition number for the distribution of . For instance, if , then . This result is minimax optimal up to logarithmic factors (see, e.g., Nussbaum, 1999). We also remark that the boundedness assumption can be replaced by a subgaussian assumption on , in which case the sample size requirement becomes . We give analogous guarantees for the case of regularized least squares in a possibly finite-dimensional Hilbert space in Theorem 5.1, Section 5.
It is interesting to note that here we achieve a constant factor approximation to with a sample complexity that does not depend on the value of . This contrasts with other parametric learning settings, such as classification, where constant approximation requires
samples, and even active learning can only improve the dependence to(see, e.g., Balcan et al., 2006). We expand on this observation in Section 5.3.
2.5 Other applications, comparisons, and extensions
The general method studied here allows handling heavy tails in other applications as well. We give two examples in Section 6. First, we consider parameter estimation using -regularized linear least squares regression (Lasso) under random subgaussian design. We show that using the above approach, parameter estimation bounds can be guaranteed for general bounded variance noise, including heavy-tailed noise. This contrasts with standard results that assume sub-Gaussian noise. Second, we show that low-rank covariance matrix approximation can be obtained for heavy-tailed distributions, under a bounded moment assumption. These two applications have been analyzed also in the independent and simultaneous work of Minsker (2013).
All the results above are provided using a specific solution to the Robust Distance Approximation problem, which is easy to implement for any metric space. For the case of a fully known metric, in a Banach or a Hilbert space, Minsker (2013) proposed a different solution, which is based on the geometric median. In Section 7, we provide a detailed comparison of the approximation factor achieved by each approach, as well as some general lower bounds. Several interesting open questions remain regarding this general problem.
Lastly, in Section 8, we give a short proof to the intuitive fact that in some prediction problem, one can replace Robust Distance Approximation with taking the median of the predictions of the input estimators. This gives a possible improper-learning algorithm for relevant learning settings.
All of the techniques we have developed in this work are simple enough to implement and empirically evaluate, and indeed in some simulated experiments, we have verified the improvements over standard methods such as the empirical mean when the data follow heavy-tailed distributions. However, at present, the relatively large constant factors in our bounds are real enough to restrict the empirical improvements only to settings where very high confidence (i.e., small values of ) is required. By contrast, with an appropriately determined noise variance, the techniques of Catoni (2012) and Brownlees et al. (2014) may yield improvements more readily. Nevertheless, since our techniques are more general in some respects, it is worth investigating whether they can be made more practical (e.g., with greater sample reuse or overlapping groups), and we plan to do this in future work.
3 The core techniques
In this section we present the core technique used for achieving exponential concentration. We first demonstrate the underlying principle via the median-of-means estimator, and then explain the generalization to arbitrary metric spaces. Finally, we show a new generalization that supports noisy feature measurements.
3.1 Warm-up: median-of-means estimator
We first motivate the estimation procedure by considering the special case of estimating a scalar population mean using a median-of-means estimator, given in Algorithm 1. This estimator, heavily used in the streaming algorithm literature (Alon et al., 1999) (though a similar technique also appears in Nemirovsky and Yudin (1983) as noted in Levin (2005)), partitions a sample into equal-size groups, and returns the median of the sample means of each group. The input parameter should be thought of as a constant determined by the desired confidence level (i.e., for confidence ). The following result is well-known.
be a random variable with meanand variance , and let be a set of independent copies of . Assume divides . With probability at least , the estimate returned by Algorithm 1 on input satisfies . Pick any , and observe that is an i.i.d. sample of size . Therefore, by Chebyshev’s inequality, . For each , let . Note that the
are independent indicator random variables, each with. By Hoeffding’s inequality, . In the event that , at least half of the are within of , which means that the same holds for the median of the .
Using the terminology of Robust Distance Approximation with the metric , the proof shows that with high probability over the choice of , . The result then immediately follows because on the space , the median is a Robust Distance Approximation procedure with .
[Alternative estimators] It is remarkable that the estimator has convergence with exponential probability tails, even though the random variable may have heavy-tails (e.g., no bounded moments beyond the variance). We note that Catoni (2012) also presents estimators with these properties and also asymptotically optimal constants, although the estimators require as a parameter.
[Empirical mean] In Catoni (2012), it is shown that the empirical mean cannot provide a qualitatively similar guarantee. Specifically, for any and , there is a distribution with mean zero and variance such that the empirical average of i.i.d. draws satisfies
Therefore the deviation of the empirical mean necessarily scales with rather than (with probability ).
3.2 Generalization to arbitrary metric spaces
We now consider a simple generalization of the median-of-means estimator for arbitrary metric spaces, first mentioned in Nemirovsky and Yudin (1983). Let be the parameter (solution) space, be a distinguished point in (the target solution), and a metric on (in fact, a pseudometric suffices).
The first abstraction captures the generation of candidate solutions obtained from independent subsamples. We assume there is an oracle which satisfies the following assumptions.
A query to returns a random such that
Note that the could be replaced by another constant larger than half; we have not optimized the constants. The second assumption regards statistical independence.
The random responses of are statistically independent.
The proposed procedure, given in Algorithm 2, generates candidate solutions by querying times, and then selects a single candidate using a generalization of the median. Specifically, for each , the smallest ball centered at that contains more than half of is determined; the with the smallest such ball is returned. This selection method is a Robust Distance Approximation procedure. The proof is given below and illustrated in Figure 1. Nemirovsky and Yudin (1983) proposed a similar technique, however their formulation relies on knowledge of .
Let . Selecting such that is a Robust Distance Approximation procedure with . Assume that . Then . For any , by the triangle inequality, . This implies that , and so . By the pigeonhole principle, . Therefore, by the triangle inequality again, .
Again, the number of candidates determines the resulting confidence level. The following theorem provides a guarantee for Algorithm 2.
Suppose that Assumption 1 and Assumption 2 hold. Then, with probability at least , Algorithm 2 returns satisfying . For each , let . Note that the are independent indicator random variables, each with . By Hoeffding’s inequality, . In the event that , more than half of the are contained in the ball of radius around , that is . The result follows from Proposition 3.2.
3.3 Random distance measurements
In some problems, the most appropriate metric on in which to measure accuracy is not directly computable. For instance, the metric may depend on population quantities which can only be estimated; moreover, the estimates may only be relatively accurate with some constant probability. To capture such cases, we assume access to an oracle, denoted , that provides a random estimate of the distance between a point and the point generated by . The oracle responses should be weakly accurate. Define the random variable
indicates that the oracle provides a good estimate of the distances from . We assume the following.
For any , .
Note that the responses of need not correspond to a metric. We further require the following independence assumption.
The random variables are statistically independent.
We do not require that and be statistically independent.
Assume this event holds, and denote . We have .
Let such that . Then, for any such that , by the triangle inequality . There are at least such indices , therefore for more than of the indices , we have
For such that this holds, by the definition of , It follows that .
Now, let such that . Then, for any such that , by the triangle inequality . As above, for more than of the indices ,
For such that this holds, by the definition of , It follows that .
By Eq. (2), We conclude that with probability at least ,
for all , and
for all .
In this event the with the smallest satisfies .
The properties of the approximation procedure and of are combined to give a guarantee for Algorithm 3. Suppose that Assumptions 1,2,3,4 all hold. With probability at least , Algorithm 3 returns satisfying . For each , let . The are independent indicator random variables, each with . By the definition of and Hoeffding’s inequality, . The result follows from Lemma 3.3 and a union bound. In the following sections we show several applications of these general techniques.
4 Minimizing strongly convex losses
In this section we apply the core techniques to the problem of approximately minimizing strongly convex losses, which includes least squares linear regression as a special case. Suppose is a Banach space, with the metric induced by the norm . We sometimes denote the metric by as well. Denote by the dual norm, so for .
The derivative of a differentiable function at in direction is denoted by . We say is -strongly convex with respect to if
for all ; it is -smooth with respect to if for all
We say is -smooth if is -smooth with respect to . We assume is -smooth for some . Let denote the smallest sample size such that the following holds: With probability over the choice of an i.i.d. sample of size from , for all ,
In other words, the sample induces a loss which is -strongly convex around .222Technically, we only need the sample size to guarantee Eq. (3) for all for some . We assume that for some .
We use the following facts in our analysis. [Srebro et al. (2010)] If a non-negative function is -smooth with respect to , then for all .
[Juditsky and Nemirovski (2008)] Let
be independent copies of a zero-mean random vector, and let be -smooth. Then .
Recall that is a data space, and is a distribution over . Let be a -valued random variable with distribution . Let be a non-negative loss function, and for , let be the expected loss. Also define the empirical loss with respect to a sample from , . To simplify the discussion throughout, we assume is differentiable, which is anyway our primary case of interest. We assume that has a unique minimizer .333This holds, for instance, if is strongly convex. let . Set such that .
To use Algorithm 2, we implement based on loss minimization over sub-samples, as follows: Given a sample , randomly partition into equal-size groups , and let the response to the -th query to be the loss minimizer on , i.e., . We call this implementation subsampled empirical loss minimization. Clearly, if is an i.i.d. sample from , then the queries to are independent, and so Assumption 2 holds. Thus, to apply Proposition 3.2, it is left to show that Assumption 1 holds as well.
The following lemma proves that Assumption 1 holds under these assumptions with
Assume divides , and that is an i.i.d. sample from of size . Then subsampled empirical loss minimization using the sample is a correct implementation of for up to queries. It is clear that are independent by the assumption. Fix some . Observe that , and therefore by Proposition 4. By Markov’s inequality,
Moreover, the assumption that implies that with probability at least , Eq. (3) holds for . By a union bound, both of these events hold simultaneously with probability at least . In the intersection of these events, letting ,
where the last inequality follows from the definition of the dual norm, and the optimality of on . Rearranging and combining with the above probability inequality implies
Combining Lemma 4 and Proposition 3.2 gives the following theorem. Assume divides , and that is an i.i.d. sample from of size . Further, assume Algorithm 3 uses the subsampled empirical loss minimization to implement , where is as in Eq. (4). Then with probability at least , the parameter returned by Algorithm 2 satisfies
is -smooth with respect to for all ;
is -smooth with respect to .
Then with probability at least ,
Corollary 4 implies that for smooth losses, Algorithm 2 provides a constant factor approximation to the optimal loss with a sample size (with probability at least ). In subsequent sections, we exemplify cases where the two arguments of the are roughly of the same order, and thus imply a sample size requirement of . Note that there is no dependence on the optimal loss in the sample size, and the algorithm has no parameters besides .
We can also obtain a variant of Theorem 4 based on Algorithm 3 and Theorem 3.3, in which we assume that there exists some sample size that allows to be correctly implemented using an i.i.d. sample of size at least . Under such an assumption, essentially the same guarantee as in Theorem 4 can be afforded to Algorithm 3 using the subsampled empirical loss minimization to implement (for as in Eq. (4)) and the assumed implementation of . Note that since Theorem 3.3 does not require and to be statistically independent, both can be implemented using the same sample.
Assume divides , and is an i.i.d. sample from of size . Further, assume Algorithm 3 implements using with subsampled empirical loss minimization, where is as in Eq. (4), and implements using as well. Then with probability at least , the parameter returned by Algorithm 3 satisfies
[Mean estimation and empirical risk minimization] The problem of estimating a scalar population mean is a special case of the loss minimization problem, where , and the loss function of interest is the square loss . The minimum population loss in this setting is the variance of , i.e., . Moreover, in this setting, we have , so the estimate returned by Algorithm 2 satisfies, with probability at least ,
with probability at least . Therefore empirical risk minimization cannot provide a qualitatively similar guarantee as Corollary 4. It is easy to check that minimizing a regularized objective also does not work, since any non-trivial regularized objective necessarily provides an estimator with a positive error for some distribution with zero variance.
In the next section we use the analysis for general smooth and convex losses to derive new algorithms and bounds for linear regression.
5 Least squares linear regression
In linear regression, the parameter space is a Hilbert space with inner product , and , where in the finite-dimensional case, for some finite integer . The loss here is the squared loss, denoted by , and defined as
The regularized squared loss, for , is denoted
Note that . We analogously define , , , , etc. as the squared-loss equivalents of . Finally, denote by the identity operator on .
The proposed algorithm for regression (Algorithm 4) is as follows. Set , where is a universal constant. First, draw independent random samples i.i.d. from , and perform linear regression with -regularization on each sample separately to obtain linear regressors. Then, use the same samples to generate estimates of the covariance matrix of the marginal of on the data space. Finally, use the estimated covariances to select a single regressor from among the at hand. The slightly simpler variants of steps 4 and 5 can be used in some cases, as detailed below.
In Section 5.1, the full results for regression, mentioned in Section 2, are listed in full detail, and compared to previous work. The proofs are provided in Section 5.2. We expand on implications for active learning in Section 5.3.
Let be a random vector drawn according to the marginal of on , and let be the second-moment operator . For a finite-dimensional , is simply the (uncentered) covariance matrix . For a sample of independent copies of , denote by