List-Decodable Robust Mean Estimation and Learning Mixtures of Spherical Gaussians

We study the problem of list-decodable Gaussian mean estimation and the related problem of learning mixtures of separated spherical Gaussians. We develop a set of techniques that yield new efficient algorithms with significantly improved guarantees for these problems.

List-Decodable Mean Estimation. Fix any d ∈ ℤ_+ and 0 < α < 1/2. We design an algorithm with runtime O(poly(n/α))^d that outputs a list of O(1/α) many candidate vectors such that with high probability one of the candidates is within ℓ_2-distance Õ(α^{-1/(2d)}) from the true mean. The only previous algorithm for this problem achieved error Õ(α^{-1/2}) under second moment conditions. For d = O(1/ϵ), our algorithm runs in polynomial time and achieves error O(α^{-ϵ}). We also give a Statistical Query lower bound suggesting that the complexity of our algorithm is qualitatively close to best possible.

Learning Mixtures of Spherical Gaussians. We give a learning algorithm for mixtures of spherical Gaussians that succeeds under significantly weaker separation assumptions compared to prior work. For the prototypical case of a uniform mixture of k identity covariance Gaussians we obtain the following: For any ϵ > 0, if the pairwise separation between the means is at least Ω(k^ϵ + √(log(1/δ))), our algorithm learns the unknown parameters within accuracy δ with sample complexity and running time poly(n, 1/δ, (k/ϵ)^{1/ϵ}). The previously best known polynomial time algorithm required separation at least k^{1/4} polylog(k/δ).

Our main technical contribution is a new technique, using degree-d multivariate polynomials, to remove outliers from high-dimensional datasets where the majority of the points are corrupted.

1 Introduction

1.1 Background

This paper is concerned with the problem of efficiently learning high-dimensional spherical Gaussians in the presence of a large fraction of corrupted data, and with the related problem of parameter estimation for mixtures of high-dimensional spherical Gaussians (henceforth, spherical GMMs). Before we state our main results, we describe and motivate these two fundamental learning problems.

The first problem we study is the following:

Problem 1: List-Decodable Gaussian Mean Estimation. Given a set T of points in ℝ^n and a parameter α ∈ (0, 1/2), with the promise that an α-fraction of the points in T are drawn from N(μ, I) — an unknown mean, identity covariance Gaussian — we want to output a “small” list of candidate vectors μ̂_1, ..., μ̂_m such that at least one of the μ̂_i’s is “close” to the mean μ, in Euclidean distance.

A few remarks are in order: We first note that we make no assumptions on the remaining (1 − α)-fraction of the points in T. These points can be arbitrary and may be chosen by an adversary that is computationally unbounded and is allowed to inspect the set of good points. We will henceforth call such a set of points α-corrupted. Ideally, we would like to output a single hypothesis vector that is close to μ (with high probability). Unfortunately, this goal is information-theoretically impossible when the fraction α of good samples is less than 1/2. For example, if the input distribution is a uniform mixture of 1/α many Gaussians whose means are pairwise far from each other, there are 1/α different valid answers and the list must by definition contain approximations to each of them. It turns out that the information-theoretically best possible size of the candidates list is Θ(1/α). Therefore, the feasible goal is to design an efficient algorithm that minimizes the Euclidean distance between the unknown μ and its closest candidate μ̂_i.

The second problem we consider is the familiar task of learning the parameters of a spherical GMM. Let us denote by N(μ, Σ) the Gaussian with mean μ and covariance Σ. A Gaussian is called spherical if its covariance is a multiple of the identity, i.e., Σ = σ²·I, for some σ > 0. An n-dimensional k-mixture of spherical Gaussians (spherical k-GMM) is a distribution on ℝ^n with density function F(x) = ∑_{i=1}^k w_i N(μ_i, σ_i²·I)(x), where w_i ≥ 0, ∑_{i=1}^k w_i = 1, μ_i ∈ ℝ^n, and σ_i > 0.

Problem 2: Parameter Estimation for Spherical GMMs. Given k ∈ ℤ_+, a specified accuracy δ > 0, and samples from a spherical k-GMM F = ∑_{i=1}^k w_i N(μ_i, σ_i²·I) on ℝ^n, we want to estimate the parameters (w_i, μ_i, σ_i) up to accuracy δ. More specifically, we want to return a list {(w̃_i, μ̃_i, σ̃_i)}_{i=1}^k so that for some permutation π of [k], we have for all i ∈ [k]: |w_i − w̃_{π(i)}| ≤ δ, ‖μ_i − μ̃_{π(i)}‖_2 ≤ δ·σ_i, and |σ_i − σ̃_{π(i)}| ≤ δ·σ_i.

If F̃ is the hypothesis distribution defined by these parameters, the above definition implies that F̃ and F are close in total variation distance. We will also be interested in the robust version of Problem 2. This corresponds to the setting when the input is an ϵ-corrupted set of samples from a k-mixture of spherical Gaussians, where ϵ is a small constant fraction, smaller than the minimum mixing weight.

Before we proceed with a detailed background and motivation, we point out the connection between these two problems. Intuitively, Problem 2 can be reduced to Problem 1, as follows: We can think of the samples drawn from a spherical GMM as a set of corrupted samples from a single Gaussian — where the Gaussian in question can be any of the mixture components. The list-decoding algorithm will then produce a list of hypotheses with the guarantee that every mean vector in the mixture is relatively close to some hypothesis. If in addition the distances between the means and their closest hypotheses are substantially smaller than the distances between the means of different components, this will allow us to reliably cluster our sample points based on which hypothesis they are closest to. We can thus cluster points based on which component they came from, and then we can learn each component independently.
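To make this reduction concrete, the following Python sketch (our own illustration, not code from the paper; the function name and interface are hypothetical) clusters samples by their nearest candidate mean returned by a list-decoding subroutine. Merging hypotheses that attract few samples and handling ties are omitted.

```python
import numpy as np

def cluster_by_nearest_hypothesis(samples, hypotheses):
    """Assign each sample to its nearest candidate mean and group samples
    that share the same nearest hypothesis into one cluster."""
    samples = np.asarray(samples)        # shape (N, n)
    hypotheses = np.asarray(hypotheses)  # shape (m, n), output of list-decoding
    # Pairwise squared distances between samples and candidate means.
    d2 = ((samples[:, None, :] - hypotheses[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)          # index of closest hypothesis per sample
    clusters = {j: samples[nearest == j] for j in np.unique(nearest)}
    return clusters

# Each cluster can then be fed to a standard (robust) single-Gaussian
# estimator to recover the parameters of one component.
```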

1.2 List-Decodable Robust Learning

The vast majority of efficient high-dimensional learning algorithms with provable guarantees make strong assumptions about the input data. In the context of unsupervised learning (which is the focus of this paper), the standard assumption is that the input points are independent samples drawn from a known family of generative models (e.g., a mixture of Gaussians). However, this simplifying assumption is rarely true in practice, and it is important to design estimators that are robust to deviations from their model assumptions.

The field of robust statistics [HRRS86, HR09] traditionally studies the setting where we can make no assumptions about a “small” constant fraction of the data. The term “small” here means that the corrupted fraction ϵ is less than 1/2, hence the input data forms a reasonably accurate representation of the true model. From the information-theoretic standpoint, robust estimation in this “small error regime” is fairly well understood. For example, in the presence of an ϵ-fraction of corrupted data, where ϵ < 1/2, the Tukey median [Tuk75] is a robust estimator of location that approximates the mean of a high-dimensional Gaussian within ℓ_2-error O(ϵ) — a bound which is known to be information-theoretically best possible for any estimator. The catch is that computing the Tukey median can take exponential time (in the dimension). This curse of dimensionality in the running time holds for essentially all known estimators in robust statistics [Ber06].

This phenomenon raised the following question: Can we reconcile computational efficiency and robustness in high dimensions? Recent work in the TCS community made the first algorithmic progress on this front: Two contemporaneous works [DKK16, LRV16] gave the first computationally efficient robust algorithms for learning high-dimensional Gaussians (and many other high-dimensional models) with error close to the information-theoretic optimum. Specifically, for the problem of robustly learning an unknown mean Gaussian N(μ, I) from an ϵ-corrupted set of samples, ϵ < 1/2, we now know a polynomial-time algorithm that achieves the information-theoretically optimal error of O(ϵ) [DKK17b].

The aforementioned literature studies the setting where the fraction of corrupted data is relatively small (smaller than 1/2), therefore the real data is the majority of the input points. A related setting of interest focuses on the regime when the fraction α of real data is small — strictly smaller than 1/2. From a practical standpoint, this “large error regime” is well-motivated by a number of pressing machine learning applications (see, e.g., [CSV17, SVC16, SKL17]). From a theoretical standpoint, understanding this regime is of fundamental interest and merits investigation in its own right. A specific motivation comes from a previously observed connection to learning mixture models: Suppose we are given samples from the mixture α·N(μ, I) + (1 − α)·E, i.e., an α-fraction of the samples are drawn from an unknown Gaussian, while the rest of the data comes from several other populations for which we have limited (or no) information. Can we approximate this “good” Gaussian component, independent of the structure of the remaining components?

More broadly, we would like to understand what type of learning guarantees are possible when the fraction of good data is strictly less than 1/2. While outputting a single accurate hypothesis is information-theoretically impossible, one may be able to efficiently compute a small list of candidate hypotheses with the guarantee that at least one of them is accurate. This is the notion of list-decodable learning, a model introduced by [BBV08]. Very recently, [CSV17] first studied the problem of robust high-dimensional estimation in the list-decodable model. In the context of robust mean estimation, [CSV17] gave an efficient list-decodable learning algorithm with the following performance guarantee: Assuming the true distribution of the data has bounded covariance, their algorithm outputs a list of O(1/α) candidate vectors one of which is guaranteed to achieve ℓ_2-error Õ(α^{-1/2}) from the true mean.

Perhaps surprisingly, several aspects of list-decodable robust mean estimation are poorly understood. For example, is the error bound of the [CSV17] algorithm best possible? If so, can we obtain significantly better error guarantees assuming additional structure about the real data? Notably — and in contrast to the small error regime — even basic information-theoretic aspects of the problem are open. That is, ignoring statistical and computational efficiency considerations, what is the minimum error achievable with O(1/α) (or any dimension-independent number of) candidate hypotheses for a given family of distributions?

The main focus of this work is on the fundamental setting where the good data comes from a Gaussian distribution. Specifically, we ask the following question:

Question 1.1.

What is the best possible error guarantee (information-theoretically) achievable for list-decodable mean estimation, when the true distribution is an unknown N(μ, I)? More importantly, what is the best error guarantee that we can achieve with a computationally efficient algorithm?

As our first main result, we essentially resolve Question 1.1.

1.3 Learning Mixtures of Separated Spherical Gaussians

A mixture of Gaussians or Gaussian mixture model (GMM) is a convex combination of Gaussian distributions, i.e., a distribution on ℝ^n of the form F = ∑_{i=1}^k w_i N(μ_i, Σ_i), where the weights w_i, mean vectors μ_i, and covariance matrices Σ_i are unknown. GMMs are one of the most ubiquitous and extensively studied latent variable models in the literature, starting with the pioneering work of Karl Pearson [Pea94]. In particular, the problem of parameter learning of a GMM from samples has received tremendous attention in statistics and computer science. (See Section 1.5 for a summary of prior work.)

In this paper, we focus on the natural and important case where each of the components is spherical, i.e., each covariance matrix is an unknown multiple of the identity. The majority of prior algorithmic work on this problem studied the setting where there is a minimum separation between the means of the components. (Without any separation assumptions, it is known that the sample complexity of the problem becomes exponential in the number of components [MV10, HP15].) For the simplicity of this discussion, let us consider the case that the mixing weights are uniform (i.e., equal to 1/k, where k is the number of components) and each component has identity covariance. (We emphasize that the positive results of this paper hold for the general case of an arbitrary mixture of high-dimensional spherical Gaussians, and apply even in the presence of a small dimension-independent fraction of corrupted data.) The problem of learning separated spherical GMMs was first studied by Dasgupta [Das99], followed by a long line of works that obtained efficient algorithms under weaker separation assumptions.

The currently best known algorithmic result in this context is the learning algorithm by Vempala and Wang [VW02] from 2002. Vempala and Wang gave a spectral algorithm with the following performance guarantee [VW02]: their algorithm uses poly(n, k, 1/δ) samples and time, and learns a spherical k-GMM in n dimensions within parameter distance δ, as long as the pairwise distance (separation) between the component mean vectors is at least k^{1/4} polylog(k/δ). Obtaining a poly(n, k, 1/δ) time algorithm for this problem that succeeds under weaker separation conditions has been an important open problem since.

Interestingly enough, until very recently, even the information-theoretic aspect of this problem was not understood. Specifically, what is the minimum separation that allows the problem to be solvable with poly(k) samples? Recent work by Regev and Vijayaraghavan [RV17] characterized this aspect of the problem: Specifically, [RV17] showed that the problem of learning spherical k-GMMs (with equal weights and identity covariances) can be solved with poly(n, k) samples if and only if the means are pairwise separated by at least Ω(√(log k)). Unfortunately, the approach of [RV17] is non-constructive in high dimensions. Specifically, they gave a sample-efficient learning algorithm whose running time is exponential in the dimension. This motivates the following question:

Question 1.2.

Is there a poly(n, k, 1/δ) time algorithm for learning spherical k-GMMs with separation k^ϵ, or better polylog(k), for any fixed ϵ > 0? More ambitiously, is there an efficient algorithm that succeeds under the information-theoretically optimal separation?

As our second main result, we make substantial progress towards the resolution of Question 1.2.

1.4 Our Contributions

In this paper, we develop a set of techniques that yield new efficient algorithms with significantly better guarantees for Problems 1 and 2. Our algorithms depend in an essential way on the analysis of high degree multivariate polynomials. We obtain a detailed structural understanding of the behavior of high degree polynomials under the standard multivariate Gaussian distribution, and leverage this understanding to design our learning algorithms. More concretely, our main technical contribution is a new technique, using degree-d multivariate polynomials, to remove outliers from high-dimensional datasets where the majority of the points are corrupted.

List-Decodable Mean Estimation.

Our main result is an efficient algorithm for list-decodable Gaussian mean estimation with a significantly improved error guarantee:

Theorem 1.3 (List-Decodable Gaussian Mean Estimation).

Fix d ∈ ℤ_+ and 0 < α < 1/2. There is an algorithm with the following performance guarantee: Given α, d, and a set T ⊂ ℝ^n of cardinality O(poly(n/α))^d with the promise that an α-fraction of the points in T are independent samples from an unknown N(μ, I), μ ∈ ℝ^n, the algorithm runs in time O(poly(n/α))^d and with high probability outputs a list of O(1/α) vectors one of which is within ℓ_2-distance Õ(α^{-1/(2d)}) of the mean μ.

We note that the Õ(·) notation hides polylogarithmic factors in its argument. See Theorem 3.1 for a more detailed formal statement.

Discussion and Comparison to Prior Work.

As already mentioned in Section 1.2, the only previously known algorithm for list-decodable mean estimation (for α < 1/2) is due to [CSV17] and achieves error Õ(α^{-1/2}) under a bounded covariance assumption for the good data. As we will show later in this section (Theorem 1.5), this error bound is information-theoretically (essentially) best possible under such a second moment condition. Hence, additional assumptions about the good data are necessary to obtain a stronger bound. It should also be noted that the algorithm of [CSV17] does not lead to a better error bound, even for the case that the good distribution is an identity covariance Gaussian. (Intuitively, this holds because the [CSV17] algorithm only uses the first two empirical moments. It can be shown that more moments are necessary to improve on the error bound; see the construction in the proof of Theorem 1.6.)

Our algorithm establishing Theorem 1.3 achieves substantially better error guarantees under stronger assumptions about the good data. The parameter d quantifies the tradeoff between the error guarantee and the sample/computational complexity of our algorithm. Even though it is not stated explicitly in Theorem 1.3, we note that for d = 1 our algorithm straightforwardly extends to all subgaussian distributions (with constant subgaussian parameter), and gives error Õ(α^{-1/2}). We also remark that our algorithm is spectral — in contrast to [CSV17] that relies on semidefinite programming — and it may be practical for small constant values of d.

There are two important parameter regimes we would like to highlight: First, for d = O(1/ϵ), where ϵ > 0 is an arbitrarily small constant, Theorem 1.3 yields a polynomial time algorithm that achieves error of O(α^{-ϵ}). Second, for d = Θ(log(1/α)), Theorem 1.3 yields an algorithm that runs in quasi-polynomial time (n/α)^{O(log(1/α))} and achieves error polylog(1/α). This error bound comes close to the information-theoretic optimum of Θ(√(log(1/α))), established in Theorem 1.5. While we do not prove it in this version of the paper, we believe that an adaptation of our algorithm achieves the optimal error of O(√(log(1/α))).
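As a sanity check on these two regimes, the stated error bound can be instantiated directly (a worked computation of ours, ignoring the polylogarithmic factors hidden in the Õ(·)):
\[
d = \frac{1}{2\epsilon}\colon\;\; \alpha^{-1/(2d)} = \alpha^{-\epsilon},
\qquad\qquad
d = \ln(1/\alpha)\colon\;\; \alpha^{-1/(2d)} = \exp\!\Big(\frac{\ln(1/\alpha)}{2\ln(1/\alpha)}\Big) = e^{1/2} = O(1),
\]
so in the second regime the surviving error is dominated by the hidden polylogarithmic factors, i.e., it is polylog(1/α).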

A natural question is whether there exists a poly(n/α) time list-decodable mean estimation algorithm with error α^{-o(1)}, or even polylog(1/α). In Theorem 1.6, we prove a Statistical Query (SQ) lower bound suggesting that the existence of such an algorithm is unlikely. More specifically, our SQ lower bound gives evidence that the complexity of our algorithm is qualitatively best possible.

High-Level Overview of Technical Contributions.

Let G = N(μ, I) be the unknown mean Gaussian from which the α-fraction of good samples are drawn, and let T be the α-corrupted set of points given as input. We design an algorithm that iteratively detects and removes outliers from T, until we are left with a collection of O(1/α) many subsets T_i of T, one of which is substantially “cleaner” than T. Specifically, the empirical mean of at least one of the T_i’s will be close to the unknown mean μ of G. Our algorithm is “spectral” in the sense that it works by analyzing the eigendecomposition of certain matrices constructed from degree-2d moments of the empirical distribution. Specifically, to achieve error of Õ(α^{-1/(2d)}), the algorithm of Theorem 1.3 works with matrices of dimension n^{O(d)}.

At a very high level, our approach bears a similarity to the “filter” method — a spectral technique to iteratively detect and remove outliers from a dataset — introduced in [DKK16] for efficient robust estimation in the “small error regime” (corresponding to α close to 1). Specifically, our algorithm tries to identify degree-d polynomials p such that the behavior of p on the corrupted set of samples T is significantly different from the expected behavior of p on the good set of samples. One way to achieve this goal [DKK16, DKS16b] is by finding polynomials p with unexpectedly large empirical variance. The hope is that if we find such a polynomial, we can then use it to identify a set of points with a large fraction of corrupted samples and remove it to clean up our data set. This idea was previously used for robust estimation in the small error regime.

A major complication that occurs in the regime of α < 1/2 is that, since fewer than half of our samples are good, the values of such a polynomial might concentrate in several clusters. As a consequence, we will not necessarily be able to identify which cluster contains the good samples. In order to deal with this issue, we need to develop new techniques for outlier removal that handle the setting where the good data is a small fraction of our dataset. Roughly speaking, we achieve this by performing a suitable clustering of points based on the values of p, and returning multiple (potentially overlapping) subsets of our original dataset with the guarantee that at least one of them will be a cleaner version of T. This new paradigm for performing outlier removal in the large error regime may prove useful in other contexts as well.

A crucial technical contribution of our approach is the use of polynomials of degree greater than one for outlier removal in this setting. The intuitive reason for using polynomials of higher degree is this: A small fraction of points that are far from the true mean in some particular direction will have a more pronounced effect on higher degree moments. Therefore, taking advantage of the information contained in higher moments should allow us to discern smaller errors in the distance from the true mean. The difficulty is that it is not clear how to algorithmically exploit the structure of higher degree moments in this setting.

The major obstacle is the following: Since we do not know the mean μ of G — this is exactly the quantity we are trying to approximate! — we are also not able to evaluate the variance Var_{X∼G}[p(X)]. If p were a degree-1 polynomial, this would not be a problem, as the variance would not depend on μ. But for polynomials of degree at least 2, the dependence of Var_{X∼G}[p(X)] on μ becomes a fundamental difficulty. Thus, although we can potentially find polynomials with unexpectedly large empirical variance, we will have no way of knowing whether this is due to corrupted points (on which p is abnormally far from its true mean), or due to errors in our estimation of the mean of G causing us to underestimate the variance Var_{X∼G}[p(X)].
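A simple worked example of this μ-dependence (ours, not from the paper): for a unit vector v and the degree-2 polynomial p(x) = (v·x)², writing Y = v·X ∼ N(v·μ, 1) for X ∼ N(μ, I) gives
\[
\operatorname{Var}_{X \sim \mathcal{N}(\mu, I)}\big[(v \cdot X)^2\big]
= \mathbf{E}[Y^4] - \mathbf{E}[Y^2]^2
= \big((v\cdot\mu)^4 + 6(v\cdot\mu)^2 + 3\big) - \big((v\cdot\mu)^2 + 1\big)^2
= 4 (v\cdot\mu)^2 + 2,
\]
so an error of Δ in our estimate of v·μ changes the expected variance by Θ(Δ²), whereas for the degree-1 polynomial v·x the variance is identically 1.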

In order to circumvent this difficulty, we require a number of new ideas, culminating in an algorithm that allows us to either verify that the variance of p under G is close to what we are expecting, or to find some other polynomial that allows us to remove outliers.

Learning Mixtures of Separated Spherical GMMs.

We leverage the connection between list-decodable learning and learning mixture models to obtain an efficient algorithm for learning spherical GMMs under much weaker separation assumptions. Specifically, by using the algorithm of Theorem 1.3 combined with additional algorithmic ideas, we obtain our second main result:

Theorem 1.4 (Learning Separated Spherical GMMs).

There is an algorithm with the following performance guarantee: Given ϵ > 0, δ > 0, and sample access to a k-mixture of spherical Gaussians F = ∑_{i=1}^k w_i N(μ_i, σ_i²·I) on ℝ^n, with w_i ≥ w_min for all i, and so that the pairwise separation ‖μ_i − μ_j‖_2 is at least C·(σ_i + σ_j)·((k/w_min)^ϵ + √(log(1/(δ·w_min)))) for all i ≠ j, for C > 0 a sufficiently large constant, the algorithm draws poly(n, 1/δ, (k/(w_min·ϵ))^{1/ϵ}) samples from F, runs in time poly(n, 1/δ, (k/(w_min·ϵ))^{1/ϵ}), and with high probability returns a list {(w̃_i, μ̃_i, σ̃_i)}_{i=1}^k such that the following conditions hold (up to a permutation): |w_i − w̃_i| ≤ δ, ‖μ_i − μ̃_i‖_2 ≤ δ·σ_i, and |σ_i − σ̃_i| ≤ δ·σ_i.

The reader is also referred to Proposition 4.3 for a more detailed statement that also allows a small, dimension-independent fraction of adversarial noise in the input samples.

Discussion and High-Level Overview.

To provide a cleaner interpretation of Theorem 1.4, we focus on the prototypical case of a uniform mixture of k identity covariance Gaussians. For this case, Theorem 1.4 reduces to the following statement (see Corollary 4.12): For any ϵ > 0, if the pairwise separation between the means is at least Ω(k^ϵ + √(log(1/δ))), our algorithm learns the parameters up to accuracy δ in time poly(n, 1/δ, (k/ϵ)^{1/ϵ}). Prior to our work, the best known efficient algorithm [VW02] required separation k^{1/4} polylog(k/δ). Also note that by taking ϵ smaller, e.g., ϵ = Θ(log log k/log k), we obtain a learning algorithm with quasi-polynomial (in k) sample complexity and running time that works with separation of polylog(k) + √(log(1/δ)). This separation bound comes close to the information-theoretic minimum of Ω(√(log k)) [RV17]. (We also note that improving the error bound in Theorem 1.3 to O(√(log(1/α))), for d = O(log(1/α)), would directly improve our separation bound to Õ(√(log k)) + √(log(1/δ)).)

We now provide an intuitive explanation of our spherical GMM learning algorithm. First, we note that we can reduce the dimension of the problem from n down to some function of k, independent of n. When the covariance matrices of the components are nearly identical, this can be done with a twist of standard techniques. For the case of arbitrary covariances, we need to employ a few additional ingredients.

When each component has the same covariance matrix, the learning algorithm is quite simple: We start by running our list-decoding algorithm (Theorem 1.3) with appropriate parameters to get a small list of hypothesis means. We then associate each sample with the closest element of our list. At this point, we can cluster the points based on which means they are associated to and use this clustering to accurately learn the correct components.

The general case, when the covariances of the components are arbitrary, is significantly more complex. In this case, we can recover a list of candidate means only after first guessing the radius of the component that we are looking for. Without too much difficulty, we can find a large list of guesses and thereby produce a list of hypotheses of polynomial size. However, clustering based on this list now becomes somewhat more difficult, as we do not know the radius at which to cluster. We address this issue by performing a secondary test to determine whether or not the cluster that we have found contains many points at approximately the correct distance from each other.

Minimax Error Bounds and SQ Lower Bounds.

As mentioned in Section 1.2, even the following information-theoretic aspect of list-decodable mean estimation is open: Ignoring sample complexity and running time, how small a distance from the true mean can be achieved with O(1/α) many hypotheses, or with a number of hypotheses that is only a function of α, i.e., independent of the dimension n?

Theorem 1.3 implies that we can achieve error polylog(1/α) for Gaussians. We show that the optimal error bound (upper and lower bound) for the Gaussian case, and more generally for subgaussian distributions, is in fact Θ(√(log(1/α))). Moreover, under bounded k-th moment assumptions, for even k, the optimal error is Θ(α^{-1/k}).
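To see where the α^{-1/k} rate comes from in the bounded-moment case, here is a one-line calculation (our own illustration, with M_k denoting an assumed bound on the k-th central moment of every one-dimensional projection): by Markov's inequality,
\[
\Pr\big[\,|v \cdot (X - \mu)| > r\,\big] \;\le\; \frac{M_k}{r^k} \;\le\; \frac{\alpha}{2}
\quad \text{whenever} \quad r \;\ge\; (2 M_k / \alpha)^{1/k} = O(\alpha^{-1/k}),
\]
so all but an α/2-fraction of the good mass lies within radius O(α^{-1/k}) of the mean in any fixed direction; this is the scale at which the (inefficient) covering argument behind Theorem 1.5 operates.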

Theorem 1.5 (Minimax Error Bounds).

Let 0 < α < 1/2. There exists an (inefficient) algorithm that, given a set of α-corrupted samples from a distribution D, where (a) D is subgaussian with bounded variance in each direction, or (b) D has bounded first k moments, for even k, outputs a list of O(1/α) vectors one of which is within distance β from the mean of D, where β = O(√(log(1/α))) in case (a) and β = O(α^{-1/k}) in case (b). Moreover, these error bounds are optimal, up to constant factors. Specifically, the error bound of (a) cannot be asymptotically improved even if D = N(μ, I), as long as the list size is poly(1/α). The error bound of (b) cannot be asymptotically improved as long as the list size is only a function of α.

For the detailed statements, the reader is referred to Section 5.

We now turn to our computational lower bounds. Given Theorem 1.5, the following natural question arises: For the case of Gaussians, can we achieve the minimax bound in polynomial time? We provide evidence that this may not be possible, by proving a Statistical Query (SQ) lower bound for this problem. Recall that a Statistical Query (SQ) algorithm [Kea98] relies on an oracle that, given any bounded function of a single domain element, provides an estimate of the expectation of the function on a random sample from the input distribution. This is a restricted but broad class of algorithms, encompassing many algorithmic techniques in machine learning. A recent line of work [FGR13, FPV15, FGV17, Fel17] developed a framework for proving unconditional lower bounds on the complexity of SQ algorithms for search problems over distributions.

By leveraging this framework, using the techniques of our previous work [DKS16b], we show that any SQ algorithm for list-decodable Gaussian mean estimation that guarantees error o(α^{-1/(2d)}), for any d ∈ ℤ_+, requires either queries of very high accuracy or exponentially many queries:

Theorem 1.6 (SQ Lower Bounds).

Fix any d ∈ ℤ_+. Any SQ list-decodable mean estimation algorithm for N(μ, I) that returns a list of sub-exponential size so that some element in the list is within distance o(α^{-1/(2d)}) of the mean of N(μ, I) requires either queries of accuracy n^{-Ω(d)} or 2^{n^{Ω(1)}} many queries.

The reader is referred to Section 5.2 for the formal statement and proof.

1.5 Related Work

Robust Estimation. The field of robust statistics [Tuk60, Hub64, HR09, HRRS86, RL05] studies the design of estimators that are stable to model misspecification. After several decades of investigation, the statistics community has discovered a number of estimators that are provably robust in the sense that they can tolerate a constant (less than 1/2) fraction of corruptions, independent of the dimension. While the information-theoretic aspects of robust estimation have been understood, the central algorithmic question — that of designing robust and computationally efficient estimators in high dimensions — had remained open.

Recent work in computer science [DKK16, LRV16] shed light on this question by providing the first efficient robust learning algorithms for a variety of high-dimensional distributions. Specifically, [DKK16] gave the first robust learning algorithms that can tolerate a constant fraction of corruptions, independent of the dimension. Subsequently, there has been a flurry of research activity on algorithmic robust high-dimensional estimation. This includes robust estimation of graphical models [DKS16a], handling a large fraction of corruptions in the list-decodable model [CSV17, SCV17], developing robust algorithms under sparsity assumptions [BDLS17], obtaining optimal error guarantees [DKK17b], establishing computational lower bounds for robust estimation [DKS16b], establishing connections with robust supervised learning [DKS17], and designing practical algorithms for data analysis applications [DKK17a].

Learning GMMs. A long line of work initiated by Dasgupta [Das99], see, e.g., [AK01, VW02, AM05, KSV08, BV08], provides computationally efficient algorithms for recovering the parameters of a GMM under various separation assumptions between the mixture components. More recently, efficient parameter learning algorithms were obtained [MV10, BS10, HP15] under minimal information-theoretic separation assumptions. Without separation conditions, the sample complexity of parameter estimation is known to scale exponentially with the number of components, even in one dimension [MV10, HP15]. To circumvent this information-theoretic bottleneck of parameter learning, a related line of work has studied parameter learning in a smoothed setting [HK13, GVX14, BCMV14, ABG14, GHK15]. The related problems of density estimation and proper learning for GMMs have also been extensively studied [FOS06, SOAJ14, MV10, HP15, ADLS17, LS17]. In density estimation (resp. proper learning), the goal is to output some hypothesis (resp. GMM) that is close to the unknown mixture in total variation distance.

Most relevant to the current work are the classical work of Vempala and Wang [VW02] and the very recent work by Regev and Vijayaraghavan [RV17]. Specifically, [VW02] gave an efficient algorithm that learns the parameters of spherical GMMs under the weakest separation conditions known to date. On the other hand, [RV17] characterized the separation conditions under which parameter learning for spherical GMMs can be solved with poly(k) samples. Whether such a separation can be achieved with an efficient algorithm was left open in [RV17]. Our work makes substantial progress in this direction.

1.6 Detailed Overview of Techniques

1.6.1 List-Decodable Mean Estimation

Outlier Removal and Challenges of the Large Error Regime.

We start by reviewing the framework of [DKK16] for robust mean estimation in the small error regime, followed by an explanation of the main difficulties that arise in the large error regime of the current paper.

In the small error regime, the “filtering” algorithm of [DKK16] for robust Gaussian mean estimation works by iteratively detecting and removing outliers (corrupted samples) until the empirical variance in every direction is not much larger than expected. If every direction has small empirical variance, then the true mean and the empirical mean are close to each other [DKK16]. Otherwise, the [DKK16] algorithm projects the input points in a direction of maximum variance and throws away those points whose projections lie unexpectedly far from the empirical median in this direction. While this iterative spectral technique for outlier removal is by now well-understood for the small error regime (and has been applied to various settings), there are two major obstacles that arise if one wants to generalize it to the large error regime, i.e., where only a small fraction of samples are good.

The first difficulty is that even the one-dimensional version of the problem in the large error regime is non-trivial. Specifically, consider a direction of large empirical variance. The [DKK16] algorithm exploits the fact that the empirical median is a robust estimator of the mean in the one-dimensional setting. In contrast, in the large error regime, it is not clear how to approximate the true mean of a one-dimensional projection. This holds for the following reason: The input distribution can simulate a mixture of 1/α many Gaussians whose means are far from each other, and the algorithm will have no way of knowing which one is the real one. In order to get around this obstacle, we construct more elaborate outlier-removal algorithms, which we call multifilters. Roughly speaking, a multifilter can return several (potentially overlapping) subsets of the original dataset with the guarantee that at least one of these subsets is substantially “cleaner” than the original.

The second difficulty is somewhat harder to deal with. As already mentioned, the filtering algorithm of [DKK16] iteratively removes outliers by looking for directions in which the empirical distribution has a substantially larger variance than it should. In the low error regime, this approach does a good job of detecting and removing the corrupted points that can move the empirical mean far from the true mean. In the large error regime, the situation is substantially different. In particular, it is entirely possible that the empirical distribution does not have abnormally large variance in any direction, while the empirical mean is still Ω(α^{-1/2})-far from the true mean. That is, considering the variance of one-dimensional projections of our dataset in various directions seems inadequate in order to improve the error bound. This obstacle is inherent: the variance of linear polynomials (projections) is not a sufficiently accurate method of detecting a small fraction of good samples being substantially displaced from the mean of the bad samples. To circumvent this obstacle, we will use higher degree polynomials, which are much more sensitive to a small fraction of points being far away from the others. In particular, our algorithms will search for degree-d polynomials that have abnormally large expectation or variance, and use such polynomials to construct our multifilters.

Overview of List-Decodable Mean Estimation Algorithm.

The basic overview of our algorithm is as follows: We compute the sample mean μ̂ of the α-corrupted set T, and then search for (appropriate) degree-d polynomials whose empirical expectation or variance is too large relative to what it should be, assuming that the good distribution is N(μ̂, I) — an identity covariance Gaussian with mean μ̂. We note that this task can be done efficiently with an eigenvalue computation, by taking advantage of the appropriate orthogonal polynomials. If there are no degree-d polynomials with too large variance, we can show that the sample mean is within distance Õ(α^{-1/(2d)}) from the true mean. On the other hand, if we do find a degree-d polynomial with abnormally large variance, we will be able to produce a multifilter and make progress. This top-level algorithm is described in detail in Section 3.9.

We now sketch how to exploit the existence of a large variance polynomial to construct a multifilter. Intuitively, the existence of such a polynomial suggests that there are many points that are far away from other points, and therefore separating these points into (potentially overlapping) clusters should guarantee that almost all good points are in the same cluster. Unfortunately, for this idea to work, we need to know that the variance of p on the good set of points is not too large. For degree-1 polynomials this condition holds automatically: if S is a sufficiently large set of samples from G = N(μ, I) and p is a normalized linear form, then Var_{X∼_u S}[p(X)] ≈ 1. But if p has degree at least 2, the variance Var_{X∼G}[p(X)] depends on the true mean μ, which unfortunately is unknown. Fortunately, there is a way to circumvent this obstacle by either producing a multifilter or verifying that this variance is not too large.

We do this as follows: Firstly, we show that the variance Var_{X∼G}[p(X)], G = N(μ, I), can be expressed as an average of q_i(μ)² for some explicitly computable, normalized, homogeneous polynomials q_i (see Lemma 3.24). We then need to algorithmically verify that the values q_i(μ) are not too large. This is difficult to do directly, so instead we replace each q_i by the corresponding multilinear polynomial Q_i, and note that q_i(μ) is the expected value of Q_i evaluated at independent copies of X ∼ G. If this expectation is large, then evaluating Q_i at a random tuple of samples will often produce a value of larger than expected size.
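As a toy illustration of this multilinearization step (our own example, not from the paper): for the homogeneous degree-2 polynomial q(μ) = (v·μ)² with v a unit vector, the corresponding multilinear polynomial is Q(x, y) = (v·x)(v·y), and for independent X, Y ∼ N(μ, I),
\[
\mathbf{E}\big[Q(X, Y)\big] = \mathbf{E}[v \cdot X]\,\mathbf{E}[v \cdot Y] = (v \cdot \mu)^2 = q(\mu),
\]
so the unknown quantity q(μ) can be estimated by averaging Q over tuples of (good) samples, even though μ itself is unknown.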

This idea will allow us to produce a multifilter for the following reason: Since each Q_i is multilinear, we can essentially treat it as a composition of linear functions. More rigorously, we use the following iterative process: We iteratively plug in samples for the variables of Q_i, one at a time. If at any step the size of the resulting polynomial jumps substantially, then the fact that this size is not well-concentrated as we try different samples will allow us to produce a multifilter. The details of this argument are given in Lemma 3.27 of Section 3.7.

1.6.2 Learning Spherical GMMs

The Identity Covariance Case.

Since a Gaussian mixture model can simultaneously be thought of as a mixture of any one of its components with some error distribution, applying our list-decoding algorithm to samples from a GMM will return a list of hypotheses so that every mean in the mixture is close to some hypothesis in the list. We can then use this list to cluster our samples by component.

In particular, given samples from a Gaussian N(μ, I) and a list of possible means μ̃_1, ..., μ̃_m, we consider the process of associating a sample x from N(μ, I) with the nearest μ̃_i. We note that x is closer to μ̃_i than to μ̃_j if and only if its projection onto the line between them is. Now if x is substantially closer to μ̃_j than μ is, then this requires that this projection (which is Gaussian distributed) be far from its mean, which happens with tiny probability. Thus, by a union bound, as long as our list contains some μ̃_i that is close to μ, the closest hypothesis to x is with high probability not much further from μ. If the separation between the means in our mixture is much larger than the separation between the means and their closest hypotheses, this implies that almost all samples are associated with one of the hypotheses near the mean of their own component, and this will allow us to cluster samples by component. This idea of clustering points based on which member of a finite set they are close to is an important idea that shows up in several related contexts in this paper.
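The union bound in this argument can be made quantitative as follows (our own sketch of the calculation, with r, Δ, and m as our notation): if μ̃_i is within distance r of μ and μ̃_j is at distance at least Δ from μ̃_i, then for x ∼ N(μ, I) the projection onto the line through μ̃_i and μ̃_j is a one-dimensional Gaussian with unit variance, and
\[
\Pr\big[\,\|x - \tilde{\mu}_j\|_2 \le \|x - \tilde{\mu}_i\|_2\,\big]
\;\le\; \Pr\big[\,\mathcal{N}(0,1) \ge \tfrac{\Delta}{2} - r\,\big]
\;\le\; e^{-(\Delta/2 - r)^2/2}.
\]
A union bound over all m hypotheses thus keeps the total misclassification probability negligible once Δ ≳ r + √(log m), which is why the required separation scales with the square root of the logarithm of the list size.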

The General Case.

The above idea works more or less as stated for mixtures of identity covariance Gaussians, but when dealing with more general mixtures of spherical Gaussians several complications arise. Firstly, in order to run our list-decoding algorithm, we need to know (a good approximation to) the covariance matrix of each component. The other difficulty is that, in order to cluster points, we will take a set of all nearby hypotheses that have reasonable numbers of samples associated with them. The issue is that we no longer know what “nearby” means, as it should depend on the covariance matrix of the associated Gaussian.

To solve the first of these problems, we use a trick that will be reused several times. We note that two samples from the same Gaussian component with standard deviation σ have distance approximately σ√(2n), and that even one sample from that component is unlikely to be much closer than this to samples from different components. Therefore, simply looking at the distance to the closest other sample gives us a constant factor approximation to the standard deviation of the corresponding component. This allows us to write down a polynomial-size list of viable hypothesis standard deviations. Running our list-decoding algorithm for each standard deviation gives us a polynomial-size list of hypothesis means.
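A minimal sketch of this nearest-neighbor scale estimate (our own illustration; the √(2n) normalization follows from E‖X − Y‖² = 2nσ² for independent X, Y ∼ N(μ, σ²I)):

```python
import numpy as np

def per_point_scale_estimates(samples):
    """For each sample, use the distance to its nearest other sample,
    divided by sqrt(2 n), as a rough estimate of the standard deviation
    of the component that generated it."""
    X = np.asarray(samples)                       # shape (N, n)
    N, n = X.shape
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)                  # ignore distance to itself
    nearest = np.sqrt(d2.min(axis=1))             # nearest-neighbor distances
    return nearest / np.sqrt(2 * n)               # constant-factor sigma estimates
```

Rounding these estimates (e.g., to powers of 2) yields the polynomial-size list of candidate standard deviations mentioned above.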

To solve the second problem, we use the above idea to approximate the standard deviations associated to our sample points. When clustering them, we look for collections of sample points with approximately the same standard deviation σ, whose closest hypotheses are within some reasonable multiple of σ of each other. Since we are able to approximate the size of the component that our samples are coming from, we can guarantee that we are not accidentally merging several smaller clusters together by using the wrong radius.

Dimension Reduction.

One slight wrinkle with the above sketched learning algorithm is that, since the number of candidate hypotheses is polynomial in n, the separation between the components will be required to be at least Ω(√(log n)). This bound is suboptimal when n is very large. Another issue is that the overall runtime of the learning algorithm would not be a fixed polynomial in n, but would scale as n^{O(1/ϵ)}. There is a way around both these issues, by reducing to a lower-dimensional problem.

In particular, standard techniques involve looking at the top principal components, which allow one to project onto a subspace of dimension k without losing too much. Unfortunately, these ideas require that all of the Gaussians involved have roughly the same covariance. Fortunately, if n is large, our ability to approximate the covariance associated to a sample by looking at its distances to other samples becomes more accurate. Using a slight modification of this idea, we can actually break our samples into subsets so that each subset is a mixture of Gaussians of approximately the same covariance. By projecting each of these in turn, we can reduce the original problem to one in a number of dimensions depending only on k, and eliminate this extra term.

1.6.3 Minimax Error Bounds

We now explain our approach to pin down the information-theoretically optimal error for the list-decodable mean estimation problem. Concretely, for the identity covariance Gaussian case, we show that there is an (inefficient) algorithm that guarantees that some hypothesis is within O(√(log(1/α))) of the true mean. The basic idea is that the true mean must have the property that there is an α-fraction of samples that are well-concentrated (in the sense of having good tail bounds in every direction) about the point. The goal of our (inefficient) algorithm will be to find a small number of balls of radius O(√(log(1/α))) that covers the set of all such points. We show that such a set exists using covering/packing duality. In particular, we note that if there are a large number of such sets with means far apart, we get a contradiction since the sets must be individually large but their overlaps must be pairwise small (due to concentration bounds).

This approach immediately generalizes to provide a list-decodable mean estimation algorithm for any distribution with known tail bounds, providing an error of O(t), where t is chosen so that only an O(α)-fraction of the probability mass of the distribution lies more than t far from the mean in any direction. This generic statement has a number of implications for various families. In particular, it gives a (tight) error upper bound of O(√(log(1/α))) for subgaussian distributions with bounded variance in each direction. Previously, no upper bound better than O(α^{-1/2}) was known for these families. For distributions whose first k central moments are bounded from above (for even k), we obtain a tight error upper bound of O(α^{-1/k}).

Regarding lower bounds, [CSV17] showed an Ω(√(log(1/α))) error lower bound for N(μ, Σ), where μ is unknown and Σ ⪯ I. We strengthen this result by showing that the same lower bound holds even for Σ = I. We also prove matching lower bounds of Ω(α^{-1/k}) for distributions with bounded k-th moments. Our proofs proceed by exhibiting distributions X, so that X can be written as X = α·D + (1 − α)·E for many different distributions D satisfying the necessary hypotheses. Then any list-decoding algorithm must return a list of hypotheses close to the mean of every such D. If there are many such D’s with means pairwise separated, then the list-decoding algorithm must either return many hypotheses or have large error.

1.6.4 SQ Lower Bounds

Finally, we prove lower bounds for list-decoding algorithms in the Statistical Query (SQ) model. Roughly speaking, we show that any SQ algorithm that achieves error o(α^{-1/(2d)}) must either spend exponential time or use queries of accuracy finer than n^{-Ω(d)} (which require n^{Ω(d)} samples to simulate), suggesting that our list-decoding algorithm is qualitatively tight in its tradeoff between runtime/sample complexity and error.

We prove these bounds using the technology developed in [DKS16b]. This basically reduces to finding a one-dimensional distribution whose first Ω(d) many moments agree with the corresponding moments of a standard Gaussian. In our case, this amounts to constructing a one-dimensional distribution A = α·N(μ, 1) + (1 − α)·E, so that A’s first Ω(d) moments agree with those of a standard Gaussian. This can be done essentially because the N(μ, 1) part of the distribution only contributes at most a constant to any of these low-degree moments, as long as μ is not too large. This allows us to take E approximately Gaussian, but slightly tweaked near the origin in order to fix these first few moments.

We note, however, that if we move this component much further from the origin, its contribution to the relevant moment becomes super-constant and thus impossible to hide. This corresponds to the fact that moments of degree Θ(d) are sufficient (and necessary) in order to detect mean errors of size α^{-1/(2d)}.
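The threshold at which the planted component can no longer be hidden can be read off from a simple moment calculation (ours): the contribution of the α-weighted component N(μ, 1) to the k-th raw moment of the mixture is
\[
\alpha \,\mathbf{E}_{X \sim \mathcal{N}(\mu, 1)}\big[X^k\big] \;=\; \alpha\big(\mu^k + O_k(\mu^{k-2})\big),
\]
which stays O(1) — and can therefore be cancelled by a bounded perturbation of E — only while |μ| ≲ α^{-1/k}; for k ≈ 2d this is exactly the α^{-1/(2d)} error scale appearing in Theorems 1.3 and 1.6.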

1.7 Organization

The structure of the paper is as follows: In Section 2, we provide the necessary definitions and technical facts. In Section 3, we present our list-decoding algorithm. Section 4 gives our algorithm for GMMs. Finally, our minimax error and SQ lower bounds are given in Section 5.

2 Definitions and Preliminaries

2.1 Notation and Basic Definitions

Notation.

For n ∈ ℤ_+, we denote by [n] the set {1, ..., n}. If v is a vector, we let ‖v‖_2 denote its Euclidean norm. If M is a matrix, we let ‖M‖_F denote its Frobenius norm.

Our algorithm and its analysis will make essential use of tensor analysis. For a tensor A, we will denote by ‖A‖_2 the ℓ_2-norm of its entries.

Let S be a finite multiset. We will use X ∼_u S to denote that X is drawn uniformly at random from S. For a function f, we will denote by f(S) the random variable f(X), X ∼_u S.

Our basic objects of study are the Gaussian distribution and finite mixtures of spherical Gaussians:

Definition 2.1.

The n-dimensional Gaussian N(μ, Σ), with mean μ ∈ ℝ^n and covariance matrix Σ, is the distribution with density function N(μ, Σ)(x) = (2π)^{-n/2} det(Σ)^{-1/2} exp(−(1/2)(x − μ)^T Σ^{-1}(x − μ)). A Gaussian is called spherical if its covariance is a multiple of the identity, i.e., Σ = σ²·I, for some σ > 0.

Definition 2.2.

An n-dimensional k-mixture of spherical Gaussians (spherical k-GMM) is a distribution on ℝ^n with density function F(x) = ∑_{i=1}^k w_i N(μ_i, σ_i²·I)(x), where w_i ≥ 0, ∑_{i=1}^k w_i = 1, μ_i ∈ ℝ^n, and σ_i > 0 for all i.

Definition 2.3.

The total variation distance between two distributions with probability density functions P, Q : ℝ^n → ℝ_+ is defined to be d_TV(P, Q) = (1/2)·∫_{ℝ^n} |P(x) − Q(x)| dx. The χ²-divergence of P with respect to Q is χ²(P, Q) = ∫_{ℝ^n} P(x)²/Q(x) dx − 1.

2.2 Formal Problem Definitions

We record here the formal definitions of the problems that we study. Our first problem is robust mean estimation in the list-decodable learning model. We start by defining the list-decodable model:

Definition 2.4 (List-Decodable Learning, [BBV08]).

We say that a learning problem is (m, β)-list decodably solvable if there exists an efficient algorithm that can output a set of at most m hypotheses with the guarantee that at least one of them is accurate to within error β with high probability.

Our notion of robust estimation relies on the following model of corruptions:

Definition 2.5 (Corrupted Set of Samples).

Given 0 < α < 1/2 and a distribution family 𝒟, an α-corrupted set of samples of size N is generated as follows: First, a set S of ⌈αN⌉ many samples are drawn independently from some unknown D ∈ 𝒟. Then an omniscient adversary, that is allowed to inspect the set S, adds an arbitrary set of N − |S| many points to the set S to obtain the set T.

We are now ready to define the problem of list-decodable robust mean estimation:

Definition 2.6 (List-Decodable Robust Mean Estimation).

Fix a family of distributions 𝒟 on ℝ^n. Given a parameter 0 < α < 1/2 and an α-corrupted set of samples from an unknown distribution D ∈ 𝒟, with unknown mean μ_D, we want to output a list of candidate mean vectors μ̂_1, ..., μ̂_m such that with high probability it holds that min_i ‖μ̂_i − μ_D‖_2 ≤ β, for some function β = β(α). We say that β is the error guarantee achieved by the algorithm.

Our main algorithmic result is for the important special case where 𝒟 is the family of unknown mean, known covariance Gaussian distributions. We also establish minimax bounds that apply for more general distribution families.

Our second problem is that of learning mixtures of separated spherical Gaussians:

Definition 2.7 (Parameter Estimation for Spherical GMMs).

Given a positive integer k and samples from a spherical k-GMM F = ∑_{i=1}^k w_i N(μ_i, σ_i²·I) on ℝ^n, we want to estimate the parameters (w_i, μ_i, σ_i) up to a required accuracy δ > 0. More specifically, we would like to return a list {(w̃_i, μ̃_i, σ̃_i)}_{i=1}^k so that with high probability the following holds: For some permutation π of [k] we have that for all i ∈ [k]: |w_i − w̃_{π(i)}| ≤ δ, ‖μ_i − μ̃_{π(i)}‖_2 ≤ δ·σ_i, and |σ_i − σ̃_{π(i)}| ≤ δ·σ_i.

The above approximation of the parameters implies that the hypothesis mixture is close to F in total variation distance. The sample complexity (hence, also the computational complexity) of parameter estimation depends on the smallest weight w_min and the minimum separation between the components.

2.3 Basics of Hermite Analysis and Concentration

We briefly review the basics of Hermite analysis over ℝ^n under the standard n-dimensional Gaussian distribution G = N(0, I). Consider L²(ℝ^n, G), the vector space of all functions f : ℝ^n → ℝ such that E_{X∼G}[f(X)²] < ∞. This is an inner product space under the inner product ⟨f, g⟩ = E_{X∼G}[f(X)·g(X)].

This inner product space has a complete orthogonal basis given by the Hermite polynomials. For univariate degree-i Hermite polynomials, i ∈ ℤ_+, we will use the probabilist’s Hermite polynomials, denoted by He_i(x), which are scaled to be monic, i.e., the lead term of He_i(x) is x^i. For a = (a_1, ..., a_n) ∈ ℤ_+^n, the n-variate Hermite polynomial He_a(x) is of the form ∏_{i=1}^n He_{a_i}(x_i), and has degree |a| = ∑_{i=1}^n a_i. These polynomials form a basis for the vector space of all polynomials which is orthogonal under this inner product. For a polynomial p, its L²-norm is ‖p‖_2 = √(E_{X∼G}[p(X)²]).
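For concreteness, the first few probabilist’s Hermite polynomials and their orthogonality relation under N(0, 1) are (standard facts, stated here for the reader's convenience)
\[
He_0(x) = 1,\quad He_1(x) = x,\quad He_2(x) = x^2 - 1,\quad He_3(x) = x^3 - 3x,
\qquad
\mathbf{E}_{X \sim \mathcal{N}(0,1)}\big[He_i(X)\,He_j(X)\big] = i!\,\delta_{ij},
\]
so, for example, the normalized polynomial He_2(x)/√2 has unit L²-norm.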

We will need the following standard concentration bound for degree-d polynomials over independent Gaussians (see, e.g., [Jan97]):

Fact 2.8 (“degree-d Chernoff bound”).

Let X ∼ N(0, I) on ℝ^n. Let p : ℝ^n → ℝ be a real degree-d polynomial. For any t > 0, we have that Pr[|p(X)| ≥ t·‖p‖_2] = O(exp(−Ω(t^{2/d}))).

3 List-Decodable Robust Mean Estimation Algorithm

In this section, we prove our main algorithmic result on list-decodable mean estimation:

Theorem 3.1 (List-Decodable Mean Estimation).

There exists an algorithm List-Decode-Gaussian that, given d ∈ ℤ_+, α ∈ (0, 1/2), a failure probability τ > 0, and a set of N points in ℝ^n (with N sufficiently large), of which at least an α-fraction are independent samples from a Gaussian N(μ, I), runs in time (Nn/α)^{O(d)}·log(1/τ) and returns a list of O(1/α) points such that, with probability at least 1 − τ, the list contains an μ̂ with ‖μ̂ − μ‖_2 ≤ Õ(α^{-1/(2d)}).

Detailed Structure of Algorithm.

The key procedure behind our algorithm is a subroutine that, given a set of samples, either cleans it up, producing one or two subsets at least one of which has substantially fewer errors than the original, or certifies that the mean of G must be close to the empirical mean (Proposition 3.6). Using this subroutine, our final algorithm can be obtained by repeatedly applying the subroutine recursively to the returned sets until they produce candidate mean vectors. The details of this analysis are in Section 3.3.

Before we can get into the detailed overview of this proof, it is necessary to lay out some technical groundwork. First, we will want to have a deterministic condition under which our algorithm will succeed. To that end, we introduce two important definitions. We say that a set S is representative of G = N(μ, I) if it behaves like a set of independent samples of G, in particular in the sense that it is a PRG against low-degree polynomial threshold functions for G. We also say that a larger set T is good if (roughly speaking) an α-fraction of the elements of T form a representative set for G. For technical reasons, we will also want the points of T to be not too far apart from each other. In Section 3.1, we discuss the definitions of representative and good sets and provide some basic results.

In Section 3.2, we show that given a large set of points that contains an α-fraction of good points drawn from G, one can algorithmically find a small collection of subsets so that with high probability at least one of them is good (and thus can be fed into the rest of our algorithm). This would be immediate were it not for the requirement that the points in a good set be not too far apart. As it stands, this will require that we perform some very basic clustering.

The actual design of our multifilter involves working with several types of “pure” degree-d polynomials and their appropriate tensors. In particular, we need to pay attention to harmonic polynomials (which behave well with respect to L²-norms under the Gaussian), homogeneous polynomials, and multilinear polynomials. In Section 3.5, we introduce these and give several algebraic results relating them that will be required later.

The multifilter at its base level requires a routine that, given a polynomial p where p(T) behaves very differently from p(G), allows us to use the values of p to separate the points coming from G from the errors. The basic idea of the technique is to cluster the values p(x), for x ∈ T, and either throw away points that are too far from any cluster large enough to correspond to the bulk of the values of p(G) (which must be well-concentrated), or to divide T into two subsets with enough overlap to guarantee that any such cluster could be entirely contained in one side or the other. The details of this basic multifilter algorithm are covered in Section 3.4.
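The following Python sketch conveys the shape of this basic routine in one dimension (our own simplified illustration, not the procedure of Section 3.4; the thresholds alpha/2 and 3*width are arbitrary placeholders, as is the cluster-finding rule):

```python
import numpy as np

def basic_multifilter(values, alpha, width):
    """Given the values p(x) for x in T, either prune points far from every
    sufficiently heavy window of the given width, or split T into two
    overlapping halves so that any single heavy window lies in one half."""
    values = np.asarray(values)
    order = np.argsort(values)
    v = values[order]
    N = len(v)
    heavy = max(1, int(alpha * N / 2))          # a heavy window holds >= alpha/2 of T

    # Windows of the prescribed width containing at least `heavy` points.
    starts = [i for i in range(N) if np.searchsorted(v, v[i] + width) - i >= heavy]
    if not starts:
        return [order]                          # nothing to filter on

    lo_val = v[starts[0]]
    hi_val = v[starts[-1]] + width
    if hi_val - lo_val <= 3 * width:
        # All heavy windows are close together: keep only points near them.
        keep = order[(values[order] >= lo_val - width) & (values[order] <= hi_val + width)]
        return [keep]
    # Heavy windows are spread out: return two overlapping subsets, each
    # containing every heavy window that fits entirely on its side.
    mid = (lo_val + hi_val) / 2
    left = order[values[order] <= mid + width]
    right = order[values[order] >= mid - width]
    return [left, right]
```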

Given this basic multifilter, the high-level picture for our main subroutine is as follows: Using spectral methods, we can detect whether there are any degree-d polynomials p for which Var_{X∼_u T}[p(X)] is substantially larger than it should be if T consisted of samples from G. If there are no such polynomials, it is not hard to see that the distance between the sample mean and μ is small, giving us our desired approximation. Otherwise, we would like to apply our basic multifilter algorithm to p to get a refined version of T. The details of this routine can be found in Section 3.9.

Unfortunately, the application of the multifilter in the setting above has a slight catch to it. Our basic multifilter will only apply to p if we can verify that Var_{X∼G}[p(X)] is not too large. This would be easy to verify if we knew the mean of G, but unfortunately we do not, and errors in our approximation may lead to Var_{X∼G}[p(X)] being much larger than anticipated — in fact, potentially too large to apply our filter productively. In order to correct this, we will need new techniques to either prove that Var_{X∼G}[p(X)] is small or to find a filter in the process. Using analytic techniques, in Section 3.5 we show that Var_{X∼G}[p(X)] is a weighted average of squares of q_i(μ) for some normalized, homogeneous polynomials q_i. Thus, it suffices to verify that each q_i(μ) is small.

To deal with this issue, it is actually much easier to work with multilinear polynomials, and so instead we deal with multilinear polynomials Q_i so that Q_i(μ, ..., μ) = q_i(μ). We thus need to verify that the expectation of Q_i over independent copies of G is small. The discussion of the reduction to this problem is in Section 3.8, while the technique for verifying that the Q_i have small mean is in Section 3.7.

In order to handle multilinear polynomials, we treat them as a sequence of linear polynomials. We note that if the mean of Q over independent samples from G is abnormally large, then so is its empirical analogue over tuples of points of T. This means that if we evaluate Q at random elements of T, we are relatively likely to get an abnormally large value. Our goal is to find some linear polynomial L for which the distribution of L(T) has enough discrepancies that we can filter based on L. To do this, consider starting with Q(x_1, ..., x_d), where x_1, ..., x_d are separate n-coordinate variables, and replacing the x_i one at a time with random elements of T. Since there is a decent probability that the final value is large, it is reasonably likely that at some phase of this process, setting one of the variables causes the L²-norm of the resulting polynomial to jump by some substantial amount. In particular, there must be some setting of the earlier variables so that, for a random element of T substituted for the next variable, the resulting polynomial will have substantially larger L²-norm with non-negligible probability. We note that this would only rarely happen if that element were distributed as G, and this will allow us to filter. This argument is covered in Section 3.6.

To make this algorithm work, we note that the squared L^2-norm of the partially substituted polynomial is a degree-2 polynomial in the newly set variable, with bounded trace norm. Therefore, we need an algorithm so that, if p is such a polynomial whose mean over S is too large, we can produce a multifilter. This is done by writing p as an average of squares of linear polynomials. We thus note that there must be some linear polynomial L whose mean square over S is abnormally large. In particular, this implies that L(S) and L(G) have substantially different distributions, which should allow us to apply our basic multifilter. Also, since L is linear, we have a priori bounds on the variance of L(G), which avoids the problem that has been plaguing us for much of this argument. These details are discussed in Section 3.5.
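As an illustration of the "average of squares" step, the sketch below decomposes a quadratic form p(x) = xᵀAx (taken symmetric and positive semidefinite here purely for simplicity of the sketch) into a weighted sum of squares of linear forms via its eigendecomposition, and returns the linear form contributing most to the empirical mean of p over S; this is the candidate to hand to the basic multifilter.

import numpy as np

def quadratic_to_linear_filter(A, S):
    """Write p(x) = x^T A x as a weighted combination of squares of
    linear forms and return the linear form whose weighted empirical
    second moment over S is largest.  Names and the PSD assumption are
    illustrative simplifications.
    """
    A = np.asarray(A, dtype=float)
    S = np.asarray(S, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(A)      # A = sum_i w_i v_i v_i^T
    # p(x) = sum_i w_i (v_i . x)^2, so if the mean of p over S is large,
    # some single term w_i * mean_{x in S}[(v_i . x)^2] must be large too.
    scores = []
    for w, v in zip(eigvals, eigvecs.T):
        if w <= 0:
            continue
        second_moment = np.mean((S @ v) ** 2)
        scores.append((w * second_moment, v))
    if not scores:
        return None
    # L(x) = best_v . x is the linear polynomial passed to the multifilter.
    _, best_v = max(scores, key=lambda t: t[0])
    return best_v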

Overview of this Section.

In summary, the structure of this section is as follows: In Section 3.1, we define the important notions of a representative set and a good set and prove some basic properties. In Section 3.2, we do some basic clustering and show that a good set can be extracted from a set of corrupted samples. In Section 3.3, we present the main subroutine and show how it can be used to produce our final algorithm.

The remainder of Section 3 will be spent building this subroutine. In Section 3.4, we produce our most basic tool for creating multifilters given a single polynomial p for which p(S) and p(G) behave substantially differently. In Section 3.5, we present some basic background on harmonic, homogeneous and multilinear polynomials and their associated tensors. In Section 3.6, we use these to produce routines that find multifilters given degree-1 polynomials, or degree-2 polynomials with bounded trace norm, whose mean over S is too large. In Section 3.7, we leverage these results to produce a similar multifilter for arbitrary multilinear polynomials. In Section 3.8, we use this to get a multifilter for degree-d harmonic polynomials whose L^2-norms over S are substantially larger than expected, and in Section 3.9, we combine this with spectral techniques to get the full version of our filtering procedure, thus finishing our algorithm.

3.1 Representative Sets and Good Sets

Let α be the fraction of good samples. Recall that our model of corruptions works as follows: We draw a sufficiently large set S of independent samples from N(μ, I), where μ ∈ R^n is unknown, and then an adversary arbitrarily adds points to the set to obtain the corrupted set T. The corrupted set T is given as input to our learning algorithm, which is required to produce a list of candidates for the unknown mean vector μ.
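For concreteness, a small Python sketch of this generation process is given below. The "adversary" here is just a placeholder that plants a tight spurious cluster; in the actual model the added points may be completely arbitrary and may depend on the clean samples.

import numpy as np

def corrupted_sample(n, N, alpha, mu, rng=None):
    """Draw floor(alpha*N) clean samples from N(mu, I) and let a
    placeholder adversary add the remaining points."""
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    n_clean = int(np.floor(alpha * N))
    clean = rng.standard_normal((n_clean, n)) + mu
    # Placeholder adversary: a tight cluster centered at ell_2-distance
    # 10*sqrt(n) from the true mean.
    fake_center = mu + 10.0 * np.ones(n)
    outliers = 0.1 * rng.standard_normal((N - n_clean, n)) + fake_center
    T = np.vstack([clean, outliers])
    rng.shuffle(T)   # the algorithm receives the points in arbitrary order
    return T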

Representative Sets.

We define a deterministic condition on the set S of “clean” samples that guarantees that running our algorithm on any corrupted set T, as defined above, will succeed. A set of points satisfying this deterministic condition will be called representative. For our purposes, it will suffice that the representative set approximately gives the correct distributions to all low-degree polynomial threshold functions. We will show that our deterministic condition holds with high probability for a sufficiently large set of independent samples from N(μ, I). This discussion is formalized in the following definition:

Definition 3.2 (Representative Set).

Fix a degree bound, an accuracy parameter, and μ ∈ R^n. We say that a set S ⊂ R^n is representative (with respect to N(μ, I)) if, for every real polynomial p of degree at most the fixed bound, the fraction of points x ∈ S with p(x) ≥ 0 is within the prescribed additive accuracy of Pr_{X ∼ N(μ,I)}[p(X) ≥ 0].

We note that even though the definition of “representativeness” of a set depends on the degree bound and the accuracy parameter, these quantities will be fixed for the representative set throughout the execution of our algorithm, and thus the dependence will be implicit. Note that as the accuracy parameter increases or the degree bound decreases, the representativeness condition becomes weaker. Thus, being representative for a given accuracy and degree bound implies being representative for any larger accuracy parameter and any smaller degree bound.

We start by showing that a sufficiently large set of samples drawn from N(μ, I) is representative with high probability. This fact follows from standard arguments using the VC inequality:

Lemma 3.3.

If S is a sufficiently large set of independent samples from N(μ, I), with size polynomial in the relevant parameters, then S is representative (with respect to N(μ, I)) with high probability.

Proof.

The collection of sets of the form {x ∈ R^n : p(x) ≥ 0}, for multivariate polynomials p on R^n of degree at most the fixed bound, has VC-dimension at most the dimension of the vector space of such polynomials. Thus, by the VC inequality [DL01], the fraction of points of S in each such set is within the required accuracy of the corresponding Gaussian probability, simultaneously for all such polynomials, with high probability, provided that we take a number of samples polynomial in this VC-dimension, the inverse accuracy, and the logarithm of the inverse failure probability.
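As a concrete illustration of the quantities in this argument, the short sketch below computes the dimension of the space of degree-at-most-d polynomials in n variables (which upper-bounds the VC-dimension of their threshold sets) together with a generic VC-type sample bound; the function names and the particular bound are standard illustrative choices, not the paper's precise statement.

from math import comb, log, ceil

def poly_space_dim(n, d):
    """Dimension of the space of real polynomials in n variables of degree
    at most d; this also upper-bounds the VC dimension of the threshold
    sets {x : p(x) >= 0}."""
    return comb(n + d, d)

def vc_sample_bound(n, d, eps, tau):
    """Generic VC-type bound: O((D + log(1/tau)) / eps^2) samples suffice
    for all degree-<=d polynomial thresholds to be eps-accurate with
    probability at least 1 - tau."""
    D = poly_space_dim(n, d)
    return ceil((D + log(1.0 / tau)) / eps ** 2)

# Example: thresholds of degree-4 polynomials in 50 dimensions.
print(poly_space_dim(50, 4), vc_sample_bound(50, 4, 0.1, 0.01))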

Good Sets.

Our list-decodable learning algorithm and its analysis require an appropriate notion of goodness for the corrupted set T. At a high level, throughout its execution, our algorithm will produce several different subsets of the original corrupted set it starts with, in its attempt to remove outliers. Intuitively, we want a good set T' to have the property that a sufficiently large fraction of the points in T' come from a representative set S. However, we also need to account for the possibility that, in the process of removing outliers, our algorithm may also remove a small number of the points of the original representative set S. Moreover, for technical reasons, we will require that all points in a good set are contained in a ball of not-too-large radius. This is formalized in the following definition:

Definition 3.4 (Good Set).

Let β ∈ (0, 1), let μ ∈ R^n, and fix the parameters of the representativeness condition as above. A multiset T' ⊂ R^n is β-good (with respect to N(μ, I)) if it satisfies the following conditions:

  1. All points in T' are within a not-too-large distance (a fixed multiple of √(n log(1/α))) of each other.

  2. There exists a set S which is representative (with respect to N(μ, I)) such that T' contains all but a small fraction of the points of S, and the points of S ∩ T' make up at least a β-fraction of T'.

We note that the definition of “goodness” of a set depends both on the parameter β and on the parameters of the underlying representativeness condition. The parameter β will change multiple times during the execution of the algorithm (it will increase), while the other parameters do not change. This justifies our choice of making β explicit in the definition of “β-good”, while keeping the remaining dependence implicit.

3.2 Naive Clustering

The additional constraint that all points in a good set are contained in a not-too-large ball means that our original set T of corrupted samples may not itself form a good set. To rectify this issue, we start by performing a very basic clustering step as follows: Since an α-fraction of the points in T are concentrated within distance O(√(n log(1/α))) of the true mean μ, by taking a maximal set of non-overlapping, not-too-large balls, each containing a large fraction of the points, we are guaranteed that at least one of them will contain a good set. This is formalized in the following lemma:

Lemma 3.5.

Let T be a sufficiently large set of points in R^n, of which at least an α-fraction are independent samples from N(μ, I). There is an algorithm that, given T and α, returns a list of at most O(1/α) many subsets of T such that, with high probability, at least one of them is good (with goodness parameter comparable to α) with respect to N(μ, I).

Proof.

Let S be the subset of T containing the good samples, i.e., the points that are independent samples from N(μ, I). We have that S ⊆ T and |S| ≥ α|T|. By Lemma 3.3, the set S is representative with high probability. We henceforth condition on this event.

If all points in T are contained in a ball of not-too-large radius, there is nothing to prove. If this is not the case, we will show how to efficiently find a collection of at most O(1/α) many balls of radius O(√(n log(1/α))), so that at least one of the balls contains a large constant fraction of the points in S.

First note that, by the degree-2 Chernoff bound, the probability that a sample from N(μ, I) lies at distance more than C·√(n log(1/α)) from μ is small, for a sufficiently large universal constant C. Since the squared distance from μ is a degree-2 polynomial and S is representative, Definition 3.2 implies that all but a small fraction of the points in S are within distance C·√(n log(1/α)) of μ.

Our clustering scheme works as follows: We consider a maximal set of disjoint balls of radius r = O(√(n log(1/α))), centered at points of T, such that each ball contains at least an Ω(α)-fraction of the points in T. Note that this set is non-empty: since S is representative, the ball of radius r centered at any point x with ‖x − μ‖ ≤ r/2 contains all points of S within distance r/2 of μ, hence a large constant fraction of the points in S, and therefore at least an Ω(α)-fraction of the points in T.

Let B_1, …, B_m be the maximal set of disjoint balls described above. Since each B_i contains an Ω(α)-fraction of the points in T, there are at most O(1/α) many such balls. Let B_i', for i ∈ [m], be the ball with the same center as B_i and radius a constant factor larger. Consider the subsets T_i = T ∩ B_i', for i ∈ [m]. We claim that at least one of the T_i's is good.

The pseudo-code for this clustering algorithm is given below.

Algorithm NaiveClustering
Input: a multiset T ⊂ R^n and a parameter α ∈ (0, 1/2); let r = O(√(n log(1/α))) be the radius from the argument above.
Let C be the empty set. For each x ∈ T, proceed as follows: if at least an Ω(α)-fraction of the points of T lie within distance r of x, and no point y ∈ C has ‖x − y‖ ≤ 2r, then add x to C.
For each x in C, let T_x = {y ∈ T : ‖x − y‖ ≤ 2r}. Return the list of the T_x's.
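A direct Python transcription of this routine is sketched below; the radius r and the specific count and separation constants mirror the illustrative stand-ins used in the pseudocode above rather than necessarily matching the paper's exact values.

import numpy as np

def naive_clustering(T, alpha, r):
    """Return a list of candidate subsets of T, one per well-separated
    center whose radius-r ball contains a large fraction of the points."""
    T = np.asarray(T, dtype=float)
    N = len(T)
    min_count = int(np.ceil(alpha * N / 2))
    # Pairwise distance matrix (fine for a sketch; quadratic memory).
    dists = np.linalg.norm(T[:, None, :] - T[None, :, :], axis=-1)

    centers = []
    for i in range(N):
        heavy = np.sum(dists[i] <= r) >= min_count
        separated = all(dists[i, j] > 2 * r for j in centers)
        if heavy and separated:
            centers.append(i)

    # One candidate subset per center: all points in the enlarged ball.
    return [T[dists[i] <= 2 * r] for i in centers]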


We now prove correctness. By definition, all the points of