Estimating the mean of a population from a finite sample is arguably the most fundamental statistical estimation problem. Despite the broad applicability and fundamental nature of this problem, an estimator achieving the optimal statistical rate was only discovered recently. Moreover, the optimal computational complexity of such an estimator is not well understood.
In this paper, we are interested in obtaining high-confidence estimates of the mean in the simple setting where only the existence of the covariance of the distribution is assumed. That is, we would like to find the smallest radius $r_\delta$ such that, given $n$ samples $X_1, \dots, X_n$ from a distribution with mean $\mu$, our estimator $\hat{\mu}$ satisfies:
$$\Pr\left(\|\hat{\mu} - \mu\| \ge r_\delta\right) \le \delta.$$
To understand the inherent statistical limit of this problem, let us consider the simplified setting where the covariance is the identity. The most natural estimator for the mean of the population is the sample mean $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i$. From the Central Limit Theorem, the distribution of $\sqrt{n}(\hat{\mu} - \mu)$ converges to $\mathcal{N}(0, I_d)$, and assuming this conclusion holds for any fixed $n$ allows an $r_\delta$ satisfying
$$r_\delta \le \sqrt{\frac{d}{n}} + \sqrt{\frac{2\log(1/\delta)}{n}}.$$
[Cat12] shows that this $r_\delta$ is the optimal statistical performance achievable under such mild assumptions. However, the above confidence interval only holds asymptotically, when the number of samples goes to infinity, or when the distribution is sub-Gaussian. For finite-sample results with a heavy-tailed distribution, applying Chebyshev's inequality to the empirical mean gives only
$$r_\delta \le \sqrt{\frac{d}{n\delta}}.$$
The above bound is weaker than the one obtained from the Central Limit Theorem in two ways: the dependence on the failure probability $\delta$ is polynomial in $1/\delta$ instead of logarithmic, and the term depending on $\delta$ is multiplied by the dimensionality as opposed to appearing in a smaller additive term. Unfortunately, [Cat12] also shows that the above result is tight. That is, for any sample size $n$ and confidence level $\delta$, there exists a distribution for which the bound guaranteed by Chebyshev's inequality is optimal.
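To make the gap between the two regimes concrete, the two radii can be compared numerically. The following sketch is our own illustration, not part of the original analysis; the constant 2 inside the logarithmic term is one common choice for the Gaussian tail bound.

```python
import math

def clt_radius(trace_sigma, op_norm_sigma, n, delta):
    # CLT-style (sub-Gaussian) radius: sqrt(Tr(S)/n) + sqrt(2*||S||*log(1/delta)/n)
    return math.sqrt(trace_sigma / n) + math.sqrt(
        2.0 * op_norm_sigma * math.log(1.0 / delta) / n
    )

def chebyshev_radius(trace_sigma, n, delta):
    # Chebyshev's inequality on the empirical mean:
    # P(||mean - mu|| >= r) <= Tr(S) / (n r^2)  =>  r = sqrt(Tr(S) / (n * delta))
    return math.sqrt(trace_sigma / (n * delta))
```

For an identity covariance in dimension 100 with $n = 10{,}000$ samples and $\delta = 10^{-6}$, the Chebyshev radius is several hundred times larger than the CLT-style radius, entirely because of the $1/\sqrt{\delta}$ factor.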
The poor performance of the empirical mean is due to its sensitivity to large outliers that occur naturally as part of the sample. The median-of-means framework was devised as a means of circumventing such difficulties. It was independently developed in the one-dimensional case by [NY83, JVV86, AMS99] and was later extended to the multivariate case by [HS16, LO11, Min15]. In this framework, the samples are first divided into $k$ batches, and the mean of the samples within each batch is computed to obtain estimates $Z_1, \dots, Z_k$. Each of these has mean $\mu$ and covariance $(k/n)\Sigma$. The empirical mean is simply the mean of these estimates, which is sensitive to outliers. The median-of-means estimator is instead the geometric median of the estimates, which has greater tolerance to outliers. The success of the median-of-means estimator is due to the fact that it relies only on a fraction of the estimates being close to the mean, as opposed to all of them. [Min15] shows this gives an improved value of $r_\delta$:
$$r_\delta \le C\sqrt{\frac{\mathrm{Tr}(\Sigma)\log(1/\delta)}{n}}.$$
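As an illustration of the framework, the construction above can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: the geometric median is computed here with Weiszfeld's fixed-point iteration, and the bucketing simply drops the last few samples when $k$ does not divide $n$.

```python
import numpy as np

def geometric_median(points, n_iter=100, eps=1e-8):
    # Weiszfeld's algorithm: iteratively re-weighted average of the points,
    # with weights inversely proportional to the distance to the current iterate.
    z = points.mean(axis=0)
    for _ in range(n_iter):
        d = np.linalg.norm(points - z, axis=1)
        d = np.maximum(d, eps)                  # avoid division by zero
        w = 1.0 / d
        z_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < eps:
            break
        z = z_new
    return z

def median_of_means(samples, k):
    # Split the n samples into k buckets, average within each bucket to get
    # the Z_1, ..., Z_k, then return their geometric median.
    n, d = samples.shape
    buckets = samples[: (n // k) * k].reshape(k, n // k, d)
    bucket_means = buckets.mean(axis=1)
    return geometric_median(bucket_means)
```

On heavy-tailed data (e.g. multivariate Student-$t$ samples with few moments), this estimator concentrates around the true mean far more reliably than the raw sample mean.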
The confidence interval guaranteed by the median-of-means estimator improves on the one for the empirical mean through its dependence on $\delta$, but it is still poorer than we might expect from the Central Limit Theorem. Subsequent work attempting to bridge this gap achieves better rates than those guaranteed by the median-of-means, but with stronger assumptions on the data-generating distribution (an improved rate is achieved under a fourth-moment assumption on the distribution; [JLO17]). The question of whether it is statistically feasible to obtain confidence intervals of the form guaranteed by the Central Limit Theorem was finally resolved by [LM19]. They devised an improved estimator, based on the median-of-means framework, called the median-of-means tournament, which achieves CLT-like confidence intervals. While the median-of-means estimator relies on the concentration of the number of bucket means $Z_i$ close to the mean in Euclidean norm, the median-of-means tournament relies on the fact that along every direction $v$, the number of $Z_i$ close to the projection of the mean concentrates. The freedom to choose a different set of bucket means for each direction allows one to obtain a much smaller confidence interval than the one for the median-of-means estimator. In subsequent work, following the PAC-Bayesian approach of [Cat12], [CG17a] proposed a soft-truncation-based estimator which obtains CLT-like confidence intervals, provided one has access to estimates of the trace and spectral norm of the covariance matrix.
However, it is not known whether the estimators from [LM19, CG17a] are computationally feasible, as there are no known polynomial-time algorithms to compute them. In contrast, the median-of-means and empirical mean can be computed in nearly linear time ([CLM16]). To alleviate this computational intractability, [CG17b] proposed an efficient polynomial-time estimator which achieves optimal statistical performance up to second-order terms, assuming the existence of higher-order moments. The question of computational tractability was subsequently resolved by [Hop18], who showed that an algorithm based on a sum-of-squares relaxation of the median-of-means tournament estimator achieves the statistically optimal CLT-like confidence intervals. However, the runtime of this algorithm is exorbitantly large, assuming standard runtimes of interior-point methods for semidefinite programming ([Ali95]).
In this paper, we propose a new algorithm with a significantly reduced runtime and a significantly simpler analysis. Our algorithm is a descent-based method that iteratively improves an estimate of the mean. The main challenge of such an approach is to estimate the descent direction. To this end, we crucially leverage the structure of the solutions to semidefinite programming relaxations of polynomial optimization problems designed to test whether an estimate is close to the mean. Our main contributions are twofold. First, we show that exact solutions to the polynomial optimization problem furnish suitable descent directions, and that such descent directions can also be efficiently extracted from relaxations of these problems. Second, we show that these descent directions can be used in a descent-style algorithm for mean estimation. Our paper is organized as follows: in Section 2, we present our main result; in Section 3, as a warm-up, we devise a descent-style algorithm for the case where we are given exact solutions to the polynomial optimization problems mentioned previously, and prove that this algorithm achieves optimal statistical efficiency. This sets the stage for Section 4, where we present our main algorithm based on semidefinite relaxations of the previously defined polynomial optimization problems, leading to computationally efficient sub-Gaussian mean estimation.
2 Main result
Formally, our main result (the constants are explicit, but we believe them to be sub-optimal) is as follows:
We can make the following comments:
Our estimator is both statistically optimal and computationally efficient. It achieves sub-Gaussian performance under minimal conditions on the distribution, and its runtime is polynomial in the sample size and the dimension. See Section 4.2 for details.
The dependence of the number of iterations $T$ on the quality of the initial point can be avoided by initializing the algorithm with the median-of-means estimate. In this case, we can use a smaller number of iterations and obtain the same guarantees, avoiding any dependence on this prior knowledge.
The estimator depends on the confidence level $\delta$. [DLLO16] propose an estimator which works for a whole range of confidence levels, but for a restricted class of distributions.
Our result does not explicitly depend on the dimension, and our algorithm can be extended to a Hilbert space by working within the finite-dimensional subspace containing the data points.
We present in this section a simple descent-based algorithm. This algorithm is computationally inefficient, but it achieves the same guarantees as Theorem 1 with a much simpler analysis, which nevertheless illustrates the main ideas behind the algorithm and the proof of Theorem 1.
We provide some intuition for our procedure, which iteratively improves an estimate of the mean. We first consider the simpler problem of testing whether a given point is close to the mean. We draw our inspiration from the main technical insight of [LM19], who show that along any direction, most of the bucket means $Z_i$ are close to the mean $\mu$. Thus, to test whether a point $x$ is far from the mean, it is sufficient to check whether there exists a direction along which most of the $Z_i$ are far away from $x$. This is formally expressed in the following polynomial optimization problem:
This polynomial problem over the variables $b = (b_1, \dots, b_k)$ and $v$ is parameterized by a radius $r$, the current estimate $x$, and the bucket means $Z_1, \dots, Z_k$. Its polynomial constraints encode the number of $Z_i$ beyond a distance $r$ from $x$ when projected along a direction $v$. Intuitively, this program tries to find a direction so as to maximize the number of $Z_i$ beyond a distance $r$ from $x$ along that direction. Here, we know from [LM19] that, for an appropriate choice of $r$, along all directions $v$ a large fraction of the $Z_i$ are close to the mean (see Corollary 1). Therefore, this optimization problem has a large value when $x$ is far from the mean, and it can be used to certify this.
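Schematically, and up to the exact constants of the formal definition, a program of this kind has the following shape, where the binary variables $b_i$ select the bucket means counted as far from $x$ along $v$ (this display is our reconstruction from the description above):

```latex
\begin{aligned}
\max_{b \in \mathbb{R}^k,\ v \in \mathbb{R}^d} \quad & \sum_{i=1}^{k} b_i \\
\text{s.t.} \quad & b_i \,\langle v,\, Z_i - x \rangle \ \ge\ b_i\, r,
    && i = 1, \dots, k, \\
& b_i^2 = b_i, && i = 1, \dots, k, \\
& \|v\|_2^2 \le 1 .
\end{aligned}
```

The constraint $b_i^2 = b_i$ forces $b_i \in \{0, 1\}$, so the objective counts the bucket means whose projection along $v$ exceeds $r$.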
Strikingly, the direction returned by the solution of the above problem also contains information about the location of the mean when $r$ is chosen appropriately, which enables improvement of the quality of the current estimate. As illustrated in Figure 1, the direction returned by this optimization problem is strongly correlated with the vector joining the current point $x$ to the mean $\mu$.
1: Input: Bucket means $Z_1, \dots, Z_k$, current point $x_t$ 2: $d_t$ = Distance Estimation$(Z_1, \dots, Z_k, x_t)$ 3: $g_t$ = Gradient Estimation$(Z_1, \dots, Z_k, x_t, d_t)$ 4: Return: $x_{t+1} = x_t + \gamma\, d_t\, g_t$
Therefore, moving a small distance along this vector should intuitively take us closer to the mean. Given solutions to the polynomial optimization problem MTE, we may iteratively improve our estimate until no further change is necessary.
In this section we put the intuition provided previously into practice and propose a procedure that estimates the mean in the ideal situation where MTE can be exactly solved (the method is formally described in Algorithm 1):
First, following the median-of-means framework, the samples are divided into $k$ buckets, and the mean of the samples within each bucket is computed to obtain the bucket means $Z_1, \dots, Z_k$.
Second, the estimate of the mean is iteratively updated using a descent approach, based on the solution of MTE. As mentioned in Section 3.1, we need to run MTE with an appropriate choice of $r$ for the solution to be correlated with the direction $\mu - x$. In the Distance Estimation step of our algorithm, we estimate a suitable choice of $r$ (see Algorithm 2). This value of $r$ is subsequently used in the Gradient Estimation step to obtain an appropriate descent direction (see Algorithm 3).
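The resulting iteration can be sketched as follows. This is an illustrative skeleton only: `distance_estimate` and `gradient_estimate` are hypothetical stand-ins for the Distance Estimation and Gradient Estimation subroutines, and the step size of one half is an arbitrary illustrative choice, not the constant of the formal analysis.

```python
import numpy as np

def descent_mean_estimate(bucket_means, distance_estimate, gradient_estimate,
                          x0, n_iter=50, step=0.5):
    """Sketch of the descent scheme: at each step, estimate the distance
    d_t ~ ||mu - x_t||, extract a unit direction g_t correlated with
    (mu - x_t), and move a fraction of d_t along g_t."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        d = distance_estimate(bucket_means, x)      # scalar ~ ||mu - x||
        g = gradient_estimate(bucket_means, x, d)   # unit vector ~ (mu - x)/||mu - x||
        x = x + step * d * g
    return x
```

With exact (oracle) distance and gradient subroutines, the error contracts geometrically; the point of Sections 3 and 4 is that the subroutines extracted from MTE and its relaxation are accurate enough for the same contraction to go through.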
From this point on, we refer to the optimal value and optimizing direction of the polynomial optimization problem MTE simply as its solution.
3.3 Analysis warm-up
In this simplified setting, we provide an analysis of our method and show that it obtains the same guarantees as those presented in Theorem 1. This is formally expressed in the following theorem for Algorithm 1 instantiated with Algorithms 2 and 3.
Let $X_1, \dots, X_n$ be i.i.d. random vectors with mean $\mu$ and covariance $\Sigma$. Then Algorithm 1, instantiated with Algorithms 2 and 3 and run with inputs $X_1, \dots, X_n$, target confidence $\delta$, step size $\gamma$ and number of iterations $T$, returns a vector $x^\star$ satisfying:
$$\|x^\star - \mu\| \le C\left(\sqrt{\frac{\mathrm{Tr}(\Sigma)}{n}} + \sqrt{\frac{\|\Sigma\|\log(1/\delta)}{n}}\right),$$
with probability at least $1 - \delta$.
The main steps involved in the proof are the following:
Distance Estimation: We prove that the Distance Estimation step (Algorithm 2) returns an accurate estimate of the distance from the current point to the mean.
Gradient Estimation: We prove that the Gradient Estimation step (Algorithm 3) returns a direction strongly correlated with the vector from the current point to the mean.
Gradient Descent: Combining the previous two steps, we prove that we eventually converge to a good approximation to the mean.
In the proofs of our lemmas relating to the correctness of the Distance Estimation and the Gradient Estimation steps, we make use of the following assumption:
For the bucket means $Z_1, \dots, Z_k$, we have, for every unit direction $v$, that all but a small constant fraction of the $Z_i$ satisfy $\langle v, Z_i - \mu \rangle \le r$.
The assumption is a formalization of the insight of [LM19], which shows that along all directions $v$, most of the bucket means are within a small radius of the true mean $\mu$, with high probability (this will be made precise in Corollary 1).
First, we prove that the Distance Estimation step defined in Algorithm 2 is correct.
We first prove the lower bound on the optimal value. We may assume that $x$ is far from $\mu$, as the alternative case is trivially true. In this case, we can simply pick the vector $v$ defined as the unit vector in the direction of $\mu - x$. Under Assumption 1, a large fraction of the points $Z_i$ then satisfy the constraints of the program for this choice of $v$. This implies that the lower bound holds in this case.
For the upper bound, suppose, for the sake of contradiction, that there is a choice of $r$ for which the optimal value of the program exceeds the claimed bound, and let $v$ be the corresponding optimal direction. This means that a large fraction of the $Z_i$ are far from $\mu$ along $v$, which contradicts Assumption 1 and proves the upper bound. ∎
Next, we prove the correctness of the Gradient Estimation step from Algorithm 3.
Let $v$ denote the direction returned by the program. From the definition of the optimal value, a large fraction of the $Z_i$ are far from $x$ along $v$. We also have, under Assumption 1, that a large fraction of the $Z_i$ are close to $\mu$ along $v$. From the pigeonhole principle, there exists a $Z_i$ which satisfies both of these inequalities. Therefore, for that $Z_i$, the lower bound from Lemma 1 implies that $v$ is strongly correlated with the direction $\mu - x$. By rearranging the resulting inequality and using the assumption on $r$ in Eq. (1), we get the required conclusion. ∎
Let $X_1, \dots, X_n$ be i.i.d. random vectors with mean $\mu$ and covariance $\Sigma$, and suppose the number of buckets $k$ is chosen appropriately. Then, for all unit vectors $v$, all but a small constant fraction of the bucket means satisfy $\langle v, Z_i - \mu \rangle \le r$, with probability at least $1 - \delta$.
Proof of Theorem 2.
Assume first that Assumption 1 holds. To start with, let us define the set $B$ of points whose distance to the mean is at most the target radius. We prove the theorem in two cases:
Case 1: None of the iterates lie in $B$. In this case, note that by Lemma 1 and the definition of $B$, we have:
Moreover, we have, by the definition of the update rule of $x_t$ in Algorithm 1:
where we have used Lemma 2 for the first inequality and the inequalities in Eq. (2) for the second inequality. By iteratively applying the above inequality, we get the conclusion of the theorem in this case.
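Schematically, the iteration in Case 1 yields a contraction of the following generic shape (the constants $\alpha$ and $\beta$ here are illustrative placeholders, not the ones appearing in the formal proof), which after unrolling over $T$ iterations gives the claimed bound:

```latex
\|x_{t+1} - \mu\| \ \le\ \alpha\,\|x_t - \mu\| + \beta\, r_\delta,
\qquad 0 < \alpha < 1,
\quad\Longrightarrow\quad
\|x_T - \mu\| \ \le\ \alpha^{T}\,\|x_0 - \mu\| + \frac{\beta}{1-\alpha}\, r_\delta .
```

Once $T$ is large enough that the first term is dominated by the second, the iterate is within a constant multiple of the statistically optimal radius $r_\delta$.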
Bearing in mind that the polynomial optimization problem MTE is non-convex, we consider a convex relaxation in the following section.
4 Efficient Algorithm for Mean Estimation
In this section, we define a semi-definite programming relaxation of the polynomial optimization problem MTE. We then design new Distance Estimation and Gradient Estimation algorithms that use the tractable solutions to the relaxation instead of the original polynomial optimization problem. We then use these solutions to update our mean estimate along the same lines as those from Section 3, albeit with some added technical difficulty. Finally, we provide the analysis of the method and prove Theorem 1.
4.1 The Semi-Definite Relaxation of MTE
The relaxation is a semidefinite program over a positive semidefinite matrix $X$, symbolically indexed by $1$ and the variables $v$ and $b_1, \dots, b_k$; we denote by $X_v$ the vector of entries of $X$ coupling the index $1$ with the variables $v$:
Similar to the polynomial optimization problem MTE, this optimization problem is also parameterized by the current estimate $x$, a radius $r$, and the matrix of bucket means. We refer to this program as the relaxation of MTE, with its optimal value and optimal solution denoted accordingly.
As opposed to the polynomial optimization problem, solving the relaxation may not return a single vector but rather a positive semidefinite matrix, which corresponds to a relaxation of the rank-one solution. This matrix may not uniquely determine a direction of improvement. We therefore parse the solution to isolate a provably good direction of improvement and use it to iteratively improve our estimate. It is noteworthy that the singular value decomposition does not provide a sign for this direction. Thankfully, the correct orientation is easily ascertained using the data points.
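A minimal sketch of this parsing step is given below. This is our illustration only: the function name, the use of the $v$-block of the solution, and the median-based sign rule are assumptions standing in for the actual extraction procedure of the algorithm.

```python
import numpy as np

def extract_direction(X_vv, bucket_means, x):
    """Given the (d x d, PSD) block of the SDP solution associated with the
    direction variables, return a unit direction: the top eigenvector,
    oriented using the bucket means, since an eigenvector is only
    determined up to sign."""
    _, V = np.linalg.eigh(X_vv)            # eigenvalues in ascending order
    u = V[:, -1]                           # top eigenvector, unit norm
    proj = (bucket_means - x) @ u
    if np.median(proj) < 0:                # orient toward the bulk of the data
        u = -u
    return u
```

When the solution is close to rank one, the top eigenvector captures essentially all of its mass, and the sign fix recovers the orientation pointing from the current estimate toward the mean.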
To analyze the runtime of Algorithm 1 with Algorithms 4 and 5, we first note that the number of variables of the semidefinite relaxation grows with the ambient dimension. However, by projecting all the data down to a subspace containing the bucket means, we may effectively reduce the number of variables, at the cost of a pre-processing step whose runtime is linear in the size of the data. The runtime of interior-point methods for solving semidefinite programs is polynomial in the number of variables and constraints ([Ali95]). Furthermore, a single call of the Distance Estimation procedure can be efficiently implemented using a logarithmic number of rounds of binary search on the parameter $r$, so a single call to Algorithm 4 costs a logarithmic number of SDP solves; the cost of a call to Algorithm 5 is similar. Since the cost of each iteration is dominated by a single call to each of Algorithms 4 and 5, and we only run a small number of iterations, the total cost of Algorithm 1 instantiated with Algorithms 4 and 5 is polynomial in the sample size and the dimension.
Distance Estimation: We prove that the Distance Estimation step returns an accurate estimate of the distance to the mean. See Section 4.3.1.
Gradient Estimation: We prove that the Gradient Estimation step returns a direction correlated with the vector to the mean. See Section 4.3.2.
Gradient Descent: Combining the previous two steps, we prove that we eventually converge to a good approximation to the mean. See Section 4.3.3.
The following assumption is required to prove the correctness of the Distance Estimation and Gradient Estimation steps:
For the bucket means $Z_1, \dots, Z_k$, let $\mathcal{F}$ denote the set of feasible solutions of the relaxation. Then, the analogue of Assumption 1 holds uniformly over all solutions in $\mathcal{F}$.
4.3.1 Distance Estimation Step
In this subsection, we analyze the Distance Estimation step from Algorithm 4. We show that an accurate estimate of the distance of the current point from the mean can be found. We begin by stating a lemma that shows that an approximately feasible solution can be converted to a feasible solution of the relaxation with only a small reduction in objective value.
Let us assume Assumption 2, and let $X$ be a positive semidefinite matrix, symbolically indexed by $1$ and the variables $v$ and $b_1, \dots, b_k$. Moreover, suppose that $X$ satisfies:
Then, there is a set containing at least a constant fraction of the indices such that, for every index in this set, the corresponding constraint holds approximately, and a set containing at least a constant fraction of the indices on which the corresponding entries of $X$ are bounded below.
We prove the lemma by contradiction. Firstly, note that $X$ is infeasible for the relaxation, as by Assumption 2 the optimal value of the relaxation is smaller than the value achieved by $X$. Note that the only constraints of the relaxation that are violated by $X$ are constraints of the form:
Now, let $S$ denote the set of indices for which the above inequality is violated. We can convert $X$ to a feasible solution of the relaxation by setting to zero the rows and columns corresponding to the indices in $S$. For the matrix obtained by this operation, we have from Assumption 2:
where the last inequality follows from the bound on the removed entries. By rearranging the above inequality, we get the first claim of the lemma.
For the second claim, let the second set be the set of indices satisfying the claimed lower bound. We have:
This establishes the second claim of the lemma. ∎
The following lemma shows that if the distance between the mean and a point is small then the estimate returned by Algorithm 4 is also small.
Suppose, for the sake of contradiction, that the optimal value of the relaxation is greater than the claimed bound, and let $X$ denote its optimal solution. Consider the two sets whose existence is guaranteed by Lemma 3. From their cardinalities, we see that their intersection is not empty. For an index $i$ in this intersection, we have:
where the first inequality follows from the feasibility of $X$ for the relaxation, and the last inequality follows from the inclusion of $i$ in the second set and the Cauchy–Schwarz inequality.
By plugging in the two bounds above, we get:
This yields a contradiction and concludes the proof of the lemma. ∎
The next lemma shows that the distance between the mean and a point $x$ can be accurately estimated, as long as $x$ is sufficiently far from $\mu$.
Let us define the direction $v$ to be the unit vector in the direction of $\mu - x$. From Assumption 1 (which is implied by Assumption 2), the number of bucket means $Z_i$ that are far from $\mu$ along $v$ is small. Therefore, we have that for a large fraction of the points:
For the upper bound, we show that the optimal value of the relaxation at the larger radius is small. For the sake of contradiction, suppose that this optimal value is large, and let $X$ be a feasible solution achieving it. Consider the two sets whose existence is guaranteed by Lemma 3, and let $i$ be an element of their intersection. We have:
where the first inequality follows from the inclusion of $i$ in the first set, and the last inequality follows from its inclusion in the second set and the Cauchy–Schwarz inequality. By rearranging the above inequality, we get:
which is a contradiction. Therefore, we get from the monotonicity of the optimal value in the radius (see Lemma 8) that the returned estimate is accurate, and this concludes the proof of the lemma. ∎
4.3.2 Gradient Estimation Step
In this section, we analyze the Gradient Estimation step of the algorithm. We show that an approximate gradient can be found as long as the current point $x$ is not too close to the mean $\mu$. The following lemma shows that we obtain a non-trivial estimate of the gradient in Algorithm 5.
In the running of Algorithm 5, let $X$ denote the solution of the relaxation. We begin by factorizing the solution as $X = U^\top U$, with the columns of $U$ corresponding to the index $1$ and the variables $v$ and $b_1, \dots, b_k$. We also consider the sub-matrix of $U$ associated with the variables $v$. From the constraints in MT, we have:
Consider the two sets defined in Lemma 3 and let $i$ be an index in their intersection. By noting that the entries of $X$ are inner products of columns of $U$, we have:
where the first inequality follows from the inclusion of $i$ in the first set and the second from its inclusion in the second set. We get, by rearranging the above equation and using our bound from Lemma 5:
By rearranging Eq. (3), using the Cauchy–Schwarz inequality, and the assumption on $r$:
We finally get that:
Now, we have:
Let $u$ be the top singular vector of the sub-matrix defined above. Note that $u$ is a unit vector and is also the top right singular vector of the corresponding factor. We have that:
This means that we have:
Therefore, we have, for a large fraction of the points:
This means that, in the case where the projections of the data points onto $u$ are predominantly positive, we return $u$, which satisfies the required correlation bound, and this implies the lemma in this case. The opposite case is similar, with $-u$ used instead of $u$. This concludes the proof of the lemma. ∎
4.3.3 Gradient Descent Step
Let $X_1, \dots, X_n$ be i.i.d. random vectors with mean $\mu$ and covariance $\Sigma$, and let $\mathcal{F}$ denote the set of feasible solutions of the relaxation. Then, for an appropriate choice of the number of buckets and the radius, the concentration bound required by Assumption 2 holds uniformly over $\mathcal{F}$, with probability at least $1 - \delta$.
In this paper, we proposed a computationally efficient estimator for the mean of a random vector which achieves statistically optimal performance. This estimator has a significantly faster runtime, together with a simpler analysis, than previous works. Our algorithm is based on a descent method, where a current estimate of the mean is iteratively improved.
- [AC11] J.-Y. Audibert and O. Catoni. Robust linear least squares regression. Ann. Statist., 39(5):2766–2794, 10 2011.
- [Ali95] F. Alizadeh. Interior point methods in semidefinite programming with applications to combinatorial optimization. SIAM Journal on Optimization, 5(1):13–51, 1995.
- [AMS99] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
- [BJL15] C. Brownlees, E. Joly, and G. Lugosi. Empirical risk minimization for heavy-tailed losses. Ann. Statist., 43(6):2507–2536, 12 2015.
- [BLM13] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford university press, 2013.
- [Cat12] O. Catoni. Challenging the empirical mean and empirical variance: a deviation study. Ann. Inst. Henri Poincaré Probab. Stat., 48(4):1148–1185, 2012.
- [CG17a] O. Catoni and I. Giulini. Dimension-free PAC-Bayesian bounds for matrices, vectors, and linear least squares regression. arXiv preprint arXiv:1712.02747, 2017.
- [CG17b] O. Catoni and I. Giulini. Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector. NIPS 2017 workshop; (Almost) 50 shades of Bayesian learning: PAC-Bayesian trends and insights, 2017.
- [CLM16] M. B. Cohen, Y. T. Lee, G. Miller, J. Pachocki, and A. Sidford. Geometric median in nearly linear time. In Proceedings of the Forty-eighth Annual ACM Symposium on Theory of Computing, STOC '16, 2016.
- [DLLO16] L. Devroye, M. Lerasle, G. Lugosi, and R. I. Oliveira. Sub-Gaussian mean estimators. Ann. Statist., 44(6):2695–2725, 2016.
- [Hop18] S. B Hopkins. Sub-Gaussian mean estimation in polynomial time. arXiv preprint arXiv:1809.07425, 2018.
- [HS16] D. Hsu and S. Sabato. Loss minimization and parameter estimation with heavy tails. J. Mach. Learn. Res., 17, 2016.
- [JLO17] E. Joly, G. Lugosi, and R. Oliveira. On the estimation of the mean of a random vector. Electron. J. Statist., 11(1):440–451, 2017.
- [JVV86] M. R. Jerrum, L. G. Valiant, and V. V. Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169–188, 1986.
- [LM17] G. Lugosi and S. Mendelson. Risk minimization by median-of-means tournaments. Journal of the European Mathematical Society, to appear, 2017.
- [LM19] G. Lugosi and S. Mendelson. Sub-Gaussian estimators of the mean of a random vector. Ann. Statist., 47(2):783–794, 04 2019.
- [LO11] M. Lerasle and R. Oliveira. Robust empirical mean estimators. arXiv preprint arXiv:1112.3914, 2011.
- [LT91] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer Science & Business Media, 1991.
- [Min15] S. Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308–2335, 2015.
- [Nes98] Y. Nesterov. Semidefinite relaxation and nonconvex quadratic optimization. Optimization methods and software, 9(1-3):141–160, 1998.
- [NY83] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. John Wiley & Sons, 1983.
Appendix A Auxiliary lemma
For any point $x$ and any set of bucket means, the optimal value of the relaxation is monotonically non-increasing in the radius $r$.
The lemma follows trivially from the fact that a feasible solution of the program with radius $r$ is also a feasible solution of the program with radius $r'$ for any $r' \le r$. ∎
Appendix B Proof of Lemma 7
We first show that the optimal value of the semi-definite program MT satisfies a bounded-differences condition with respect to the bucket means $Z_i$.
Let $Z_1, \dots, Z_k$ be any set of vectors in $\mathbb{R}^d$, and consider the same set of vectors with the $i$-th vector replaced by an arbitrary vector. Then the optimal values of the two corresponding programs differ by a uniformly bounded amount.
Firstly, assume that $X$ is a feasible solution to the first program. Now, let us define a modified matrix as follows: it is equal to $X$, except that the row and column corresponding to the replaced index are set to zero. We see that this modified matrix forms a feasible solution to the second program. Therefore, we have that:
where the bound follows from the fact that the sub-matrix of $X$ formed by the rows and columns indexed by $1$ and the replaced index is positive semidefinite, together with the normalization constraints of the program. Since the above series of inequalities holds for all feasible solutions of the first program, we get:
Through a similar argument, we also conclude the reverse inequality. Putting the above two inequalities together, we get the required conclusion. ∎
For the next few lemmas, we are concerned with the case where the current point coincides with the mean. Since we already know that the optimal SDP value satisfies the bounded-differences condition, it remains to verify that its expectation is small. As a first step towards this, we define the 2-to-1 norm of a matrix $A$.
The 2-to-1 norm of $A \in \mathbb{R}^{n \times d}$ is defined as
$$\|A\|_{2 \to 1} = \max_{\|v\|_2 = 1} \|Av\|_1 = \max_{\|v\|_2 = 1,\ \sigma \in \{\pm 1\}^n} \sigma^\top A v.$$
We consider the classical semidefinite programming relaxation of the 2-to-1 norm. To start with, we define a positive semidefinite matrix with rows and columns indexed by the variables $v$ and $\sigma_1, \dots, \sigma_n$. The semidefinite programming relaxation is defined as follows:
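For reference, one standard way of writing a relaxation of this kind (cf. [Nes98]; this display uses our notation, with $a_i$ the rows of $A$ and $X_{\sigma_i v}$ the block of $X$ coupling $\sigma_i$ with the direction variables):

```latex
\|A\|_{2 \to 1}
\ \le\
\max_{X \succeq 0}\ \sum_{i=1}^{n} \langle a_i,\, X_{\sigma_i v} \rangle
\quad \text{s.t.} \quad
X_{\sigma_i \sigma_i} = 1 \ \ (i = 1, \dots, n),
\qquad
\operatorname{Tr}(X_{v v}) = 1 .
```

The constraints mirror $\sigma_i^2 = 1$ and $\|v\|_2^2 = 1$ in the exact formulation, so any rank-one $X = (\sigma, v)(\sigma, v)^\top$ built from a feasible pair recovers the original objective.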