# Fast Mean Estimation with Sub-Gaussian Rates

We propose an estimator for the mean of a random vector in R^d that can be computed in time O(n^4+n^2d) for n i.i.d. samples and that has error bounds matching the sub-Gaussian case. The only assumptions we make about the data distribution are that it has finite mean and covariance; in particular, we make no assumptions about higher-order moments. Like the polynomial time estimator introduced by Hopkins, 2018, which is based on the sum-of-squares hierarchy, our estimator achieves optimal statistical efficiency in this challenging setting, but it has a significantly faster runtime and a simpler analysis.

• 13 publications
• 34 publications
• 58 publications
09/19/2018

### Mean Estimation with Sub-Gaussian Rates in Polynomial Time

We study polynomial time algorithms for estimating the mean of a heavy-t...
08/13/2019

### A Fast Spectral Algorithm for Mean Estimation with Sub-Gaussian Rates

We study the algorithmic problem of estimating the mean of heavy-tailed ...
11/17/2020

### Optimal Sub-Gaussian Mean Estimation in ℝ

We revisit the problem of estimating the mean of a real-valued distribut...
07/12/2020

### A spectral algorithm for robust regression with subgaussian rates

We study a new linear up to quadratic time algorithm for linear regressi...
05/10/2019

### Selectivity Estimation with Deep Likelihood Models

Selectivity estimation has long been grounded in statistical tools for d...
06/02/2020

### Robust and efficient mean estimation: approach based on the properties of self-normalized sums

Let X be a random variable with unknown mean and finite variance. We pre...
02/12/2018

### Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector

In this paper, we present a new estimator of the mean of a random vector...

## 1 Introduction

Estimating the mean of a population given a finite sample is arguably the most fundamental statistical estimation problem. Despite the broad applicability and the fundamental nature of this problem, an estimator achieving the optimal statistical rate has only been discovered recently. However the optimal computational complexity of such an estimator is not well-understood.

In this paper, we are interested in obtaining high confidence estimates of the mean in the simple setting where only the existence of the covariance of the distribution is assumed. That is, we would like to find the smallest such that given samples from a distribution with mean our estimator satisfies:

 P{∥^X−μ∥≥rδ}≤δ.

To understand the inherent statistical limit of this problem, let us consider the simplified setting where the covariance is the identity. The most natural estimator for the mean of the population is the sample mean

. From the Central Limit Theorem, the distribution of

satisfies , and assuming this conclusion holds for any allows an  satisfying

 rδ=O⎛⎝√dn+√log1/δn⎞⎠.

[Cat12] shows that this

is the optimal statistical performance achievable under such mild assumptions. However, the above confidence interval only holds true asymptotically when the number of samples goes to infinity or when the distribution is sub-Gaussian. For finite sample results with a heavy-tailed distribution, applying Chebyshev’s inequality to the empirical mean gives only

 rδ=Ω(√dnδ).

The above bound is weaker than the one obtained by the Central Limit Theorem in two ways, the dependence on the failure probability

is polynomial in instead of logarithmic and the term depending on is multiplied by the dimensionality as opposed to being part of a smaller additive term. Unfortunately, [Cat12] also shows the above result is tight. That is, for any , there exists a distribution for which the bound guaranteed by Chebyshev’s inequality is optimal.

The poor performance of the empirical mean is due to its sensitivity to large outliers that occur naturally as part of the sample. The median-of-means framework was devised as a means of circumventing such difficulties. It was independently developed in the one dimensional case by

[NY83, JVV86, AMS99] and was later extended to the multivariate case by [HS16, LO11, Min15]. As part of this framework, the samples are first divided into batches and the mean of the samples is computed within each batch to obtain estimates . Each of these has mean

and variance

. The empirical mean is simply the mean of these estimates, which is sensitive to outliers. The median-of-means estimator instead is the geometric median of the estimates, which has greater tolerance to outliers. The success of the median-of-means estimator is due to the fact that it relies on only a fraction of estimates being close to the mean as opposed to all the estimates being close. [Min15] shows this gives an improved value of as follows:

 rδ=O⎛⎝√dlog1/δn⎞⎠.

The confidence interval guaranteed by the median-of-means estimator is better than the one for the empirical mean by improving the dependence on , but it is still poorer than we might expect from the Central Limit Theorem. Subsequent work attempting to bridge this gap achieves better rates than those guaranteed by the median-of-means but with stronger assumptions on the data generating distribution111A rate of is achieved under a fourth moment assumption on the distribution. ([JLO17]). The question of whether it was statistically feasible to obtain confidence intervals of the form guaranteed by the Central Limit Theorem was finally resolved by [LM19]. They devised an improved estimator, based on the median-of-means framework, called the median-of-means tournament, which achieves CLT-like confidence intervals. While the median-of-means estimator relies on the concentration of the number of close to the mean in Euclidean norm, the median-of-means tournament relies on the fact that along every direction , the number of close to the projection of the mean concentrates. The freedom to choose a different set of for each direction allow one to obtain a much smaller confidence interval than the one for the median-of-means estimator. In subsequent work, following the PAC-Bayesian approach of [Cat12], [CG17a] proposed a soft-truncation based estimator which obtains CLT-like confidence intervals provided one has access to estimates of the trace and spectral norm of the covariance matrix.

However, it is not known whether the estimators from [LM19, CG17a] are computationally feasible, as there are no known polynomial time algorithms to compute them. In contrast, the median-of-means and empirical mean can be computed in nearly-linear time ([CLM16]). To alleviate this computational intractability, [CG17b] proposed an efficient polynomial time estimator which achieves optimal statistical performance up to second order terms, assuming the existence of higher order moments. The question of computational tractability was subsequently resolved by [Hop18], who showed that an algorithm based on a sum-of-squares relaxation of the median-of-means tournament estimator achieves the statistically optimal CLT-like confidence intervals. However, the runtime of this algorithm is exorbitantly large222Assuming standard runtimes of the Interior Point method for semidefinite programming ([Ali95]) ().

In this paper, we propose a new algorithm with a reduced runtime——and a significantly simpler analysis. Our algorithm is a descent-based method that iteratively improves an estimate of the mean. The main challenge of such an approach is to estimate the descent direction. To this end, we crucially leverage the structure of the solutions to semidefinite programming relaxations of polynomial optimization problems designed to test whether a estimate is close to the mean. Our main contributions are twofold; we first show how exact solutions to the polynomial optimization problem furnish suitable descent directions and that such descent directions can also be efficiently extracted from relaxations of such problems and secondly, we show that these descent directions can be used in a descent style algorithm for mean estimation. Our paper is organized as follow: in Section 2, we present our main result, then in Section 3, as a warm-up, we devise a descent style algorithm for the case where we are given exact solutions to the polynomial optimization problems mentioned previously and prove that this algorithm achieves optimal statistical efficiency. This sets the stage for Section 4, where we present our main algorithm based on semidefinite relaxations of the previously defined polynomial optimization problems, leading to computationally efficient sub-Gaussian mean estimation.

## 2 Main result

Formally, our main result333The constants are explicit but we believe sub-optimal. is as follows:

###### Theorem 1.

Let be i.i.d. random vectors with mean and covariance . Then Algorithm 1 instantiated with Algorithms 4 and 5 and run with inputs , target confidence , stepsize and number of iterations returns a vector satisfying:

 ∥x∗−μ∥≤max⎛⎝ϵ,480000⎛⎝√TrΣn+√∥Σ∥log1/δn⎞⎠⎞⎠,

with probability at least .

We can make the following comments:

• Our estimator is both statistically optimal and computationally efficient. It achieves sub-Gaussian performance under minimal conditions on the distribution, and its runtime is . See Section 4.2 for details.

• The dependence of the number of iterations, , on can be avoided by initializing the algorithm with the median-of-means estimate. In this case, we can instead use and obtain the same guarantees, avoiding any dependence on the knowledge of .

• The estimator depends on the confidence level . [DLLO16] propose an estimator which works for a whole range of but for a restricted class of distributions.

• Our result does not explicitly depend on the dimension and our algorithm can be extended to a Hilbert space by working within the finite dimensional subspace containing the data points.

## 3 Warm-up

We present in this section a simple descent based algorithm. This algorithm is computationally inefficient but achieves the same guarantees of Theorem 1 with a much simpler analysis which nevertheless illustrates the main ideas behind the algorithm and proof of Theorem 1.

### 3.1 Intuition

We provide some intuition for our procedure, which iteratively improves an estimate of the mean. We first consider the simpler problem of testing whether a given point is close to the mean. We draw our inspiration from the main technical insight of [LM19], who show that along any direction, most of the bucket means, , are close to the mean, . Thus, to test whether a point, , is far from the mean, it is sufficient to check whether there exists a direction where most of the are far away from along that direction. This is formally expressed in the following polynomial optimization problem:

 maxk∑i=1bi b2i=bi ∥v∥2=1 bi⟨v,Zi−x⟩≥b2ir∀i∈[k] (MTE)

This polynomial problem over the set of variables and is parameterized by , the current estimate and the bucket means . Its polynomial constraints are encoding the number of beyond a distance from when projected along a direction . Intuitively, this program tries to find a direction so as to maximize the number of beyond a distance from along that direction. Here, we know from ([LM19]) that for an appropriate choice of , along all directions , a large fraction of the are close to the mean. Formally, for all directions , (see Corollary 1 ). Therefore this optimization problem has a large value when is far from the mean and can be used to certify this.

Strikingly, the direction returned by the solution of the above problem also contains information about the location of the mean when is chosen appropriately, which enables improvement of the quality of the current estimate. As illustrated in Figure 1, the direction returned by this optimization problem is strongly correlated with the vector joining the current point to the mean .

Algorithm 2 Distance Estimation 1:  Input: Data Points , Current point 2:   3:  Return:
Algorithm 3 Gradient Estimation 1:  Input: Data Points , Current point 2:   = Distance Estimation 3:   4:  Return:

Therefore, moving a small distance along the vector should intuitively take us closer to the mean. Given solutions to the polynomial optimization problem MTE, we may iteratively improve our estimate until no further change is necessary.

### 3.2 Algorithm

In this section we put the intuition provided previously into practice and propose a procedure that estimates the mean in the ideal situation where MTE can be exactly solved (the method is formally described in Algorithm 1):

1. First, following the median of means framework, the samples are divided into buckets and the mean of the samples within each bucket is computed as .

2. Second, the estimate of the mean is iteratively updated using a descent approach, based on the solution of MTE. As mentioned in Section 3.1, we need to run MTE with an appropriate choice of for the solution to be correlated with the direction . In the Distance Estimation step of our algorithm, we estimate a suitable choice of (see Algorithm 2). This value of is subsequently used in the Gradient Estimation step, to obtain an appropriate descent direction (see Algorithm 3).

From this point on, we refer to the solution of polynomial equations MTE as .

### 3.3 Analysis warm-up

In this simplified setting, we provide an analysis of our method and show that it obtains the same guarantees as those presented in Theorem 1. This is formally expressed in the following theorem for Algorithm 1 instantiated with Algorithms 2 and 3.

###### Theorem 2.

Let be i.i.d. random vectors with mean and covariance . Then Algorithm 1 instantiated with Algorithms 2 and 3 and run with inputs , target confidence , stepsize and number of iterations returns a vector satisfying:

 ∥x∗−μ∥≤max⎛⎝ϵ,108000⎛⎝√TrΣn+√∥Σ∥log1/δn⎞⎠⎞⎠,

with probability at least .

The main steps involved in the proof are the following:

1. Distance Estimation: We show that the Distance Estimation step in Algorithm 2 provides an accurate estimate of the distance of the current point from the mean. See Lemma 1.

2. Gradient Estimation: Next, we show that when is far away from the mean , the vector obtained by solving  MTE in Algorithm 3 is well aligned with the vector joining the current point to the mean . See Lemma 2.

3. Gradient Descent: Combining the previous two steps, we prove that we eventually converge to a good approximation to the mean.

In the proofs of our lemmas relating to the correctness of the Distance Estimation and the Gradient Estimation steps, we make use of the following assumption:

###### Assumption 1.

For the bucket means, , we have:

 ∀v∈Rd,∥v∥=1⇒∣∣{i:⟨Zi−μ,v⟩≥300(√TrΣ/n+√k∥Σ∥/n)}∣∣≤0.05k

The assumption is a formalization of the insight of ([LM19]), which shows that along all directions, , most of the bucket means are within a small radius of the true mean, , with high probability444This will be made precise in Corollary 1..

First, we prove that the Distance Estimation step defined in Algorithm 2 is correct.

###### Lemma 1.

Under Assumption 1, for all in the running of Algorihm 1, satisfies:

 ∣∣dt−∥xt−μ∥∣∣≤300(√TrΣ/n+√∥Σ∥k/n).
###### Proof.

Let . We first prove the lower bound . We may assume that , as the alternate case is trivially true. For , we can simply pick the vector where is the unit vector in the direction of . Under Assumption 1, we have that for at least points:

 ⟨Zi−xt,v⟩=⟨Zi−μ,v⟩+⟨μ−xt,v⟩≥∥xt−μ∥−r∗=r.

This implies the lower bound holds in the case where .

For the upper bound , suppose, for the sake of contradiction, there is a value of for which the optimal value of is greater than . Let be the solution of . This means that for of the , we have:

 ⟨Zi−μ,v⟩=⟨Zi−xt,v⟩+⟨xt−μ,v⟩≥r−∥xt−μ∥>r∗.

This contradicts Assumption 1 and proves the upper bound. ∎

Next, we prove the correctness of the Gradient Estimation step from Algorithm 3.

###### Lemma 2.

In the running of Algorithm 1, let us assume satisfies:

 ∥μ−xt∥≥1200(√TrΣ/n+√∥Σ∥k/n), (1)

and let denote the unit vector in the direction of . Then, under Assumption 1, we have that:

 ⟨gt,Δ⟩≥12.
###### Proof.

Let . We have, from the definition of , that for of the , . We also have, under Assumption 1, that for of the . From the pigeonhole principle, there exists a which satisfies both those inequalities. Therefore, for that , the lower bound from Lemma 1 implies

 ∥μ−xt∥−r∗≤dt≤⟨Zj−xt,gt⟩=⟨Zj−μ,gt⟩+⟨μ−xt,gt⟩≤r∗+∥μ−xt∥⟨Δ,gt⟩.

By rearranging the above inequality and using the assumption on in Eq. (1), we get the required conclusion. ∎

To control the probability that Assumption 1 holds, we assume the correctness of the following corollary of Lemma 7, formalizing the insight of ([LM19]):

###### Corollary 1.

Let be i.i.d. random vectors with mean and covariance . Furthermore, assume . Then we have for all such that :

 ∣∣{i:⟨Yi−μ,v⟩≥300(√TrΛ/k+√∥Λ∥)}∣∣≤0.05k

with probability at least .

By instantiating Corollary 1 with the , we see that Assumption 1 holds with high probability.

Finally, we put the results of Lemma 1, Lemma 2 and Corollary 1 together to prove Theorem 2.

###### Proof of Theorem 2.

Assume first that Assumption 1 holds. Let . To start with, let us define the set . We prove the theorem in two cases:

1. Case 1: None of the iterates lie in . In this case, note that by Lemma 1 and the definition of , we have:

 34∥xt−μ∥≤dt≤54∥xt−μ∥. (2)

Moreover, we have by the definition of the update rule of in Algorithm 1:

 ∥xt+1−μ∥2 =∥xt−μ∥2+12dt⟨xt−μ,gt⟩+d2t16≤∥xt−μ∥2−dt∥xt−μ∥4+d2t16 ≤∥xt−μ∥2−316∥xt−μ∥2+25256∥xt−μ∥2≤2325∥xt−μ∥2,

where we have used Lemma 2 for the first inequality and the inequalities in Eq. (2) for the second inequality. By iteratively applying the above inequality, we get the conclusion of the theorem in this case.

2. Case 2: At least one of the iterates lies in . Therefore, we have from Lemma 1:

 dt≤1500(√TrΣ/n+√∥Σ∥k/n).

We also have at the completion of the algorithm, from another application of Lemma 1:

 ∥x∗−μ∥−300(√TrΣ/n+√∥Σ∥k/n)≤d∗≤dt≤1500(√TrΣ/n+√∥Σ∥k/n).

By re-arranging the above inequality, we get the desired result.

By Corollary 1, Assumption 1 holds with probability at least and therefore, the conclusions from Case 1 and Case 2 hold with probability . ∎

Bearing in mind that the polynomial optimization problem MTE is non-convex, we consider a convex relaxation in the following section.

## 4 Efficient Algorithm for Mean Estimation

In this section, we define a semi-definite programming relaxation of the polynomial optimization problem MTE. We then design new Distance Estimation and Gradient Estimation algorithms that use the tractable solutions to the relaxation instead of the original polynomial optimization problem. We then use these solutions to update our mean estimate along the same lines as those from Section 3, albeit with some added technical difficulty. Finally, we provide the analysis of the method and prove Theorem 1.

### 4.1 The Semi-Definite Relaxation of Mte

Here, we propose a semidefinite programming relaxation of MTE, a variant of the Threshold-SDP from ([Hop18]). We first define a semidefinite matrix symbolically

indexed by , the variables and and denote by the vector :

 maxk∑i=1X1,bi X1,bi=Xbi,bi X1,1=1 d∑j=1Xvj,vj=1 ⟨vbi,Zi−x⟩≥Xbi,bir ∀i∈[k] X≽0 (MT)

Similar to the polynomial optimization MTE, this optimization problem is also parameterized by a vector , and a matrix . We refer to solutions of this program as with denoting the optimal value and denoting the optimal solution.

The main contribution of our paper is in showing that the solutions to the relaxed optimization problem MTE can be used to improve the mean estimate similar to those of MT.

### 4.2 Algorithm

To efficiently estimate the mean, we instantiate Algorithm 1 to use solutions of MT instead of MTE. The new Distance Estimation and Gradient Estimation procedures are stated in Algorithms 4 and 5.

As opposed to the polynomial optimization problem, solutions to the relaxation may not necessarily return a single vector but rather a semidefinite matrix which corresponds to the relaxation of

. This matrix may not uniquely determine a direction of improvement. We, therefore, parse the solution to isolate a provably good direction of improvement and use this to iteratively improve our estimate. It is noteworthy that the singular value decomposition does not provide a sign direction. Thankfully the correct orientation is easily ascertained using the data points.

To analyze the runtime of Algorithm 1 with Algorithms 4 and  5, we first note that the semidefinite relaxation has variables. However, by projecting all the data down to a subspace containing the bucket means, we may effectively reduce the number of variables to with an time pre-processing step. Therefore, we are now left with variables. The runtime of interior point methods for solving semidefinite programs with variables and constraints is ([Ali95]). Furthermore, a single call of the Distance Estimation procedure can be efficiently implemented using rounds of binary search on the parameter . Therefore, the total cost of a single call to Algorithm 4 is . Similarly, the total cost of a call to Algorithm 5 is . Since the cost of each iteration is dominated by a single call of Algorithm 4 and 5, the total cost per iteration is . Since, we only run iterations, the total cost of the Algorithm 1 instantiated with Algorithms 4 and  5 is .

### 4.3 Analysis

We now prove Theorem 1. We follow the same lines as the proof of Theorem 2, but with the added technical difficulties arising from the use of the semi-definite relaxation.

1. Distance Estimation: We show that the Distance Estimation step in Algorithm 4 provides an accurate estimate of the distance of the current point from the mean. See Section 4.3.1.

2. Gradient Estimation: Next, we show that when is far away from the mean , the vector output by Algorithm 5 is well aligned with the vector joining the current point to the mean . See Section 4.3.2.

3. Gradient Descent: Combining the previous two steps, we prove that we eventually converge to a good approximation to the mean. See Section 4.3.3.

The following assumption is required to prove the correctness of the Distance Estimation and Gradient Estimation steps:

###### Assumption 2.

For the bucket means, , let denote the set of feasible solutions for . Then, we have for all ,

 maxX∈Srk∑i=1Xbi,bi≤k20.

The above assumption is a strengthening of Assumption 1 for the case where we use MT instead of MTE. We use the following fact at several points in the subsequent analysis:

###### Remark 1.

Note that Assumption 2 implies Assumption 1.

#### 4.3.1 Distance Estimation Step

In this subsection, we analyze the Distance Estimation step from Algorithm 4. We show that an accurate estimate of the distance of the current point from the mean can be found. We begin by stating a lemma that shows that a feasible solution for can be converted to a feasible solution for with a reduction in optimal value.

###### Lemma 3.

Let us assume Assumption 2. Let be a positive semi-definite matrix, symbolically indexed by and the variables and . Moreover, suppose that satisfies:

 X1,1=1,Xbi,bi=X1,bi,d∑j=1Xvj,vj=1,k∑i=1Xbi,bi≥0.9k.

Then, there is a set of at least indices such that for all :

 ⟨Zi−μ,vbi⟩

and a set of at least indices such that for all , we have .

###### Proof.

Let . We prove the lemma by contradition. Firstly, note that is infeasible for as the optimal value for is less than (Assumption 2). Note that the only constraints of that are violated by are constraints of the form:

 ⟨Zi−μ,vbi⟩

Now, let denote the set of indices for which the above inequality is violated. We can convert to a feasible solution for by setting to the rows and columns corresponding to the indices in . Let be the matrix obtained by the above operation. We have from Assumption 2:

 0.05k≥k∑i=1X′bi,bi=k∑i=1Xbi,bi−∑i∈TXbi,bi≥0.9k−|T|,

where the last inequality follows from the fact that . By rearranging the above inequality, we get the first claim of the lemma.

For the second claim, let denote the set of indices satisfying . We have:

 0.9k≤k∑j=1Xbj,bj=∑j∈RXbj,bj+∑j∉RXbj,bj≤|R|+0.85k−0.85|R|⟹k3≤|R|.

This establishes the second claim of the lemma. ∎

The following lemma shows that if the distance between the mean and a point is small then the estimate returned by Algorithm 4 is also small.

###### Lemma 4.

Suppose a point satisfies . Then, under Assumption 2, Algorithm 4 returns a value satisfying

 d′≤7500(√TrΣ/n+√k∥Σ∥/n).
###### Proof.

Let and . Suppose that the optimal value of is greater than and let its optimal solution be . Let and denote the two sets whose existence is guaranteed by Lemma 3. From, the cardinalities of and , we see that their intersection is not empty. For , we have:

 0.85r′≤⟨Zj−x,vbj⟩=⟨Zj−μ,vbj⟩+⟨μ−x,vbj⟩

where the first inequality follows from the fact that and the fact that is feasible for and the last inequality follows from the inclusion of in and Cauchy-Schwarz.

By plugging in the bounds on and , we get:

 ∥x−μ∥>6075(√TrΣ/n+√k∥Σ∥/n).

This contradicts the assumption on and concludes the proof of the lemma. ∎

The next lemma shows that the distance between the mean and a point can be accurately estimated as long as is sufficiently far from .

###### Lemma 5.

Suppose a point satisfies . Then, under Assumption 2, Algorithm 4 returns a value satisfying:

 0.95~d≤d′≤1.25~d.
###### Proof.

Let us define the direction to be the unit vector in the direction of . From Assumption 1 (which is implied by Assumption 2), the number of satisfying is less than . Therefore, we have that for at least points:

 ⟨Zi−x,−Δ⟩=⟨x−μ+μ−Zi,Δ⟩=∥x−μ∥−300(√TrΣ/n+√k∥Σ∥/n)≥0.95~d.

Along with the monotonicity555See Lemma 8 in Appendix A. of in , this implies the lower bound.

For the upper bound, we show that the optimal value of is less than . For the sake of contradiction, suppose that this optimal value is greater than . Let be a feasible solution of that achieves . Let and be the two sets whose existence is guaranteed by Lemma 3 and be an element in their intersection. We have for :

 0.85(1.25~d) ≤Xbj,bj1.25~d≤⟨Zj−x,vbj⟩=⟨Zj−μ,vbj⟩+⟨μ−x,vbj⟩

where the first inequality follows from the inclusion of in and the last inequality follows from the inclusion of in and Cauchy-Schwarz. By re-arranging the above inequality, we get:

 Xbj,bj>(1.0625~d−~d)(300(√TrΣ/n+√k∥Σ∥/n))−1>1,

which is a contradiction. Therefore, we get from the monotonicity of (see Lemma 8), that and this concludes the proof of the lemma. ∎

In this section, we analyze the Gradient Estimation step of the algorithm. We show that an approximate gradient can be found as long as the current point is not too close to the mean . The following lemma shows that we obtain a non-trivial estimate of the gradient in Algorithm 5.

###### Lemma 6.

Suppose a point satisfies and let be the unit vector along . Then under Assumption 2, Algorithm 5 returns a vector satisfying:

 ⟨g,Δ⟩≥115.
###### Proof.

In the running of Algorithm 5, let denote the solution of . We begin by factorizing the solution into with the rows of denoted by , and . We also define the matrix in . From the constraints in MT, we have:

 Xbi,bi=∥ubi∥2≤1⟹∥ubi∥≤1,d∑j=1Xvj,vj=d∑j=1∥uvj∥2=∥Uv∥2F=1⟹∥Uv∥F=1.

Let and denote the sets defined in Lemma 3. Let . By noting that , we have for :

 0.85d∗≤⟨Zj−μ,vbj⟩+⟨μ−x,vbj⟩≤Xbj,bj300(√TrΣ/n+√k∥Σ∥/n)+u⊤bjUv(μ−x),

where the first inequality follows from the inclusion of in and the second from its inclusion in . We get by rearranging the above equation and using our bound on from Lemma 5:

 0.80∥μ−x∥≤0.85d∗≤Xbj,bj300(√TrΣ/n+√k∥Σ∥/n)+u⊤bjUv(μ−x). (3)

By rearranging Eq. (3), using Cauchy-Schwarz, and the assumption on :

 ∥Uv(μ−x)∥≥u⊤bjUv(μ−x)≥0.75∥μ−x∥.

We finally get that:

 ∥UvΔ∥≥0.75.

Now, we have:

Let be the top singular vector of . Note that and is also the top right singular vector of . We have that:

 0.75≤∥Uvy∥≤∥UvPΔy∥+∥UvP⊥Δy∥≤∥PΔy∥+∥UvP⊥Δ∥F≤∥PΔy∥+0.67.

This means that we have:

 |⟨y,Δ⟩|≥115.

Note that the algorithm returns either or . Firstly, consider the case where . From Assumption 1 (implied by Assumption 2), we have for at least points:

 ⟨Zi−μ,y⟩≤300(√TrΣ/n+√k∥Σ∥/n).

Therefore, we have for points:

 ⟨Zi−x,y⟩ =⟨Zi−μ,y⟩+⟨μ−x,y⟩ ≥−300(√TrΣ/n+√k∥Σ∥/n)+6000(√TrΣ/n+√k∥Σ∥/n)15>0.

This means that in the case where , we return which satisfies . This implies the lemma in this case. The case where is similar with used instead of . This concludes the proof of the lemma. ∎

The following lemma guarantees that Assumption 2 holds with high probability and is used analogously to Corollary 1 in the proof of Theorem 2:

###### Lemma 7.

Let be i.i.d. random vectors with mean and covariance and let denote the set of feasible solutions of . Then, we have for and :

 maxX∈Sk∑i=1Xbi,bi≤k20,

with probability at least .

The proof of the lemma is an application of standard empirical process theory and concentration inequalities ([LM19, Hop18]) and is proven in Appendix B.

The rest of the proof of Theorem 1 follows the same lines as that of Theorem 2 and is postponed to Appendix C.

## 5 Conclusion

In this paper, we proposed a computationally efficient estimator for the mean of a random vector which obtains the statistically optimal performance. This estimator has a significantly faster runtime together with a simpler analysis than previous works. Our algorithm is based on a descent method, where a current estimate of the mean is iteratively improved.

Considering the extension to M-estimation procedures ([BJL15, HS16, LM17]

) is a promising direction for further research, with as first step, the particular example of linear regression with heavy tailed noise and covariates (

[AC11]).

## References

• [AC11] J.-Y. Audibert and O. Catoni. Robust linear least squares regression. Ann. Statist., 39(5):2766–2794, 10 2011.

Interior point methods in semidefinite programming with applications to combinatorial optimization.

SIAM journal on Optimization, 5(1):13–51, 1995.
• [AMS99] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
• [BJL15] C. Brownlees, E. Joly, and G. Lugosi. Empirical risk minimization for heavy-tailed losses. Ann. Statist., 43(6):2507–2536, 12 2015.
• [BLM13] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford university press, 2013.
• [Cat12] O. Catoni. Challenging the empirical mean and empirical variance: a deviation study. Ann. Inst. Henri Poincaré Probab. Stat., 48(4):1148–1185, 2012.
• [CG17a] O. Catoni and I. Giulini. Dimension-free PAC-Bayesian bounds for matrices, vectors, and linear least squares regression. arXiv preprint arXiv:1712.02747, 2017.
• [CG17b] O. Catoni and I. Giulini. Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector. NIPS 2017 workshop; (Almost) 50 shades of Bayesian learning: PAC-Bayesian trends and insights, 2017.
• [CLM16] M. B. Cohen, Y. T. Lee, G. Miller, J. Pachocki, and A. Sidford. Geometric median in nearly linear time. In

Proceedings of the Forty-eighth Annual ACM Symposium on Theory of Computing

, STOC ’16, 2016.
• [DLLO16] L. Devroye, M. Lerasle, G. Lugosi, and R. I. Oliveira. Sub-Gaussian mean estimators. Ann. Statist., 44(6):2695–2725, 2016.
• [Hop18] S. B Hopkins. Sub-Gaussian mean estimation in polynomial time. arXiv preprint arXiv:1809.07425, 2018.
• [HS16] D. Hsu and S. Sabato. Loss minimization and parameter estimation with heavy tails. J. Mach. Learn. Res., 17, 2016.
• [JLO17] E. Joly, G. Lugosi, and R. Oliveira. On the estimation of the mean of a random vector. Electron. J. Statist., 11(1):440–451, 2017.
• [JVV86] M. R. Jerrum, L. G. Valiant, and V. V. Vazirani.

Random generation of combinatorial structures from a uniform distribution.

Theoretical Computer Science, 43:169–188, 1986.
• [LM17] G. Lugosi and S. Mendelson. Risk minimization by median-of-means tournaments. Journal of the European Mathematical Society, to appear, 2017.
• [LM19] G. Lugosi and S. Mendelson. Sub-Gaussian estimators of the mean of a random vector. Ann. Statist., 47(2):783–794, 04 2019.
• [LO11] M. Lerasle and R. Oliveira. Robust empirical mean estimators. arXiv preprint arXiv:1112.3914, 2011.
• [LT91] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer Science & Business Media, 1991.
• [Min15] S. Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308–2335, 2015.
• [Nes98] Y. Nesterov. Semidefinite relaxation and nonconvex quadratic optimization. Optimization methods and software, 9(1-3):141–160, 1998.
• [NY83] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. John Wiley & Sons, 1983.

## Appendix A Auxiliary lemma

###### Lemma 8.

For any and , the optimal value of is monotonically non-increasing in .

###### Proof.

The lemma follows trivially from the fact that a feasible solution of is also a feasible solution for for . ∎

## Appendix B Proof of Lemma 7

We first show that the optimal value of the semi-definite program MT satisfies a bounded-difference condition with respect to the ’s.

###### Lemma 9.

Let be any set of vectors in . Now, let be the same set of vectors with the vector replaced by . If and are the optimal values of and , we have:

 |m−m′|≤1
###### Proof.

Firstly, assume that is a feasible solution to . Now, let us define as:

 X′i,j={Xi,j if i,j≠bi0otherwise

That is is equal to except with the row and column corresponding to being set to . We see that forms a feasible solution to . Therefore, we have that:

 k∑j=1Xbj,bj=k∑j=1,j≠iX′bj,bj+Xbi,bi≤k∑j=1,j≠iX′bj,bj+1≤m′+1

where the bound follows from the fact that the sub-matrix of formed by the rows and columns indexed by and is positive semidefinite and the constraint that . Since the above series of equalities holds for all feasible solutions of , we get:

 m≤m′+1.

Through a similar argument, we also conclude that . Putting the above two inequalities together, we get the required conclusion. ∎

For the next few lemmas, we are concerned with the case where . Since we already know that the optimal SDP value satisfies the bounded differences condition, we need to verify that the expectation is small. As a first step towards this, we define the 2-to-1 norm of a matrix .

###### Definition 1.

The 2-to-1 norm of is defined as

 ∥M∥2→1=max∥v∥=1σi∈{±1}σ⊤Mv=max∥v∥=1∥Mv∥1

We consider the classical semidefinite programming relaxation of the 2-to-1 norm. To start with, we will define a matrix with the rows and columns indexed by and the elements and . The semidefinite programming relaxation is defined as follows: