 # Learning Entangled Single-Sample Distributions via Iterative Trimming

In the setting of entangled single-sample distributions, the goal is to estimate some common parameter shared by a family of distributions, given one single sample from each distribution. We study mean estimation and linear regression under general conditions, and analyze a simple and computationally efficient method based on iteratively trimming samples and re-estimating the parameter on the trimmed sample set. We show that the method in logarithmic iterations outputs an estimation whose error only depends on the noise level of the α n-th noisiest data point where α is a constant and n is the sample size. This means it can tolerate a constant fraction of high-noise points. These are the first such results under our general conditions with computationally efficient estimators. It also justifies the wide application and empirical success of iterative trimming in practice. Our theoretical results are complemented by experiments on synthetic data.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

This work considers the novel parameter estimation setting called entangled single-sample distributions. Different from the typical i.i.d. setting, here we have data points that are independent, but each is from a different distribution. These distributions are entangled in the sense that they share some common parameter, and our goal is to estimate the common parameter. For example, in the problem of mean estimation for entangled single-sample distributions, we have data points from

different distributions with a common mean but different variances (the mean and all the variances are unknown), and our goal is to estimate the mean.

This setting is motivated for both theoretical and practical reasons. From the theoretical perspective, it goes beyond the typical i.i.d. setting and raises many interesting open questions, even on basic topics like mean estimation for Gaussians. It can also be viewed as a generalization of the traditional mixture modeling, since the number of distinct mixture components can grow with the number of samples. From the practical perspective, many modern applications have various forms of heterogeneity, for which the i.i.d. assumption can lead to bad modeling of their data. The entangled single-sample setting provides potentially better modeling. This is particularly the case for applications where we have no control over the noise levels of the samples. For example, the images taken by self-driving cars can have varying degrees of noise due to changing weather or lighting conditions. Similarly, signals collected from sensors on the Internet of Things can come with interferences from a changing environment.

Though theoretically interesting and practically important, few studies exist in this setting. Chierichetti et al. (2014) considered the mean estimation for entangled Gaussians and showed the existence of a gap between estimation error rates of the best possible estimator in this setting and the maximum likelihood estimator when the variances are known. Pensia et al. (2019) considered means estimation for symmetric, unimodal distributions including the symmetric multivariate case (i.e., the distributions are radially symmetric) with sharpened bounds, and provided extensive discussion on the performance of their estimators in different configurations of the variances. These existing results focus on specific family of distributions or focus on the case where most samples are “high-noise” points.

On the contrary, we focus on the case with a constant fraction of ”high-noise” points, which is more interesting in practice. We study multivariate mean estimation and linear regression under more general conditions and analyze a simple and efficient estimator based on iterative trimming. The iterative trimming idea is simple: the algorithm keeps an iterate and repeatedly refines it; each time it trims a fraction of bad points based on the current iterate and then uses the trimmed sample set to compute the next iterate. It is computationally very efficient and widely used in practice as a heuristic for handling noisy data. It can also be viewed as an alternating-update version of the classic trimmed estimator (e.g.,

Huber (2011)) which typically takes exponential time:

 ^θ=argminθ∈Θ;S⊆[n],|S|=⌈αn⌉∑i∈SLossi(θ)

where is the feasible set for the parameter to be estimated, is the size of the trimmed sample set , and is the loss of on the -th data point (e.g., error for linear regression).

For mean estimation, only assuming the distributions have a common mean and bounded covariances, we show that the iterative trimming method in logarithmic iterations outputs a solution whose error only depends on the noise level of the -th noisiest point for . More precisely, the error only depends on the -th largest value among all the norms of the covariance matrices. This means the method can tolerate a fraction of “high-noise” points. We also provide a similar result for linear regression, under a regularity condition that the explanatory variables are sufficiently spread out in different directions (satisfied by typical distributions like Gaussians). As far as we know, these are the first such results of iterative trimming under our general conditions in the entangled single-sample distributions setting. These results also theoretically justify the wide application and empirical success of the simple iterative trimming method in practice. Experiments on synthetic data provide positive support for our analysis.

## 2 Related Work

Entangled distributions. This setting is first studied by Chierichetti et al. (2014), which considered mean estimation for entangled Gaussians and presented a algorithm combining the -median and the -shortest gap algorithms. It also showed the existence of a gap between the error rates of the best possible estimator in this setting and the maximum likelihood estimator when the variances are known. Pensia et al. (2019) considered a more general class of distributions (unimodal and symmetric) and provided analysis on both individual estimator (-modal interval, -shortest gap, -median estimators) and hybrid estimator, which combines Median estimator with Shortest Gap or Modal Interval estimator. They also discussed slight relaxation of the symmetry assumption and provided extensions to linear regression. Our work considers mean estimation and linear regression under more general conditions and analyzes a simpler estimator. However, our results are not directly comparable to the existing ones above, since those focus on the case where most of the points have high noise or have extra constraints on distributions are assumed. For the constrained distributions, our results are weaker than the existing ones. See the detailed discussion in the remarks after our theorems.

This setting is also closely related to robust estimation, which have been extensively studied in the literature of both classic statistics and machine learning theory.

Robust mean estimation. There are several classes of data distribution models for robust mean estimators. The most commonly addressed is adversarial contamination model, whose origin can be traced back to the malicious noise model by Valiant (1985) and the contamination model by Huber (2011). Under contamination, mean estimation has been investigated in Diakonikolas et al. (2017, 2019a); Cheng et al. (2019). Another related model is the mixture of distributions. There has been steady progress in algorithms for leaning mixtures, in particular, leaning Gaussian mixtures. Starting from Dasgupta (1999), a rich collection of results are provided in many studies, such as Sanjeev and Kannan (2001); Achlioptas and McSherry (2005); Kannan et al. (2005); Belkin and Sinha (2010a, b); Kalai et al. (2010); Moitra and Valiant (2010); Diakonikolas et al. (2018a).

Robust regression.

Robust Least Squares Regression (RLSR) addresses the problem of learning regression coefficients in the presence of corruptions in the response vector. A class of robust regression estimator solving RLSR is Least Trimmed Square (LTS) estimator, which is first introduced by

Rousseeuw (1984) and has high breakdown point. The algorithm solutions of LTS are investigated in Hössjer (1995); Rousseeuw and Van Driessen (2006); Shen et al. (2013) for the linear regression setting. Recently, for robust linear regression in the adversarial setting (i.e., a small fraction of responses are replaced by adversarial values), there is a line of work providing algorithms with theoretical guarantees following the idea of LTS, e.g., Bhatia et al. (2015); Vainsencher et al. (2017); Yang et al. (2018)

for example. For robust linear regression in the adversary setting where both explanatory and response variables can be replaced by adversarial values, a line of work provided algorithms and guarantees, e.g.,

Diakonikolas et al. (2018b); Prasad et al. (2018); Klivans et al. (2018); Shen and Sanghavi (2019), while some others like Chen et al. (2013); Balakrishnan et al. (2017); Liu et al. (2018) considered the high-dimensional scenario.

## 3 Mean Estimation

Suppose we have independent samples , , where the mean vector and the covariance matrix of each distribution exist. Assume ’s have a common mean and denote their covariance matrices as . When , each

degenerate to an univariate random variable

, and we also write as and write as . Our goal is to estimate the common mean .

#### Notations

For an integer , denotes the set is the cardinality of a set . For two sets , is the set of elements in but not in . and

are the minimum and maximum eigenvalues. Denote the order statistics of

as . or denote constants whose values can vary from line to line.

### 3.1 Iterative Trimmed Mean Algorithm

First, recall the general version of the least trimmed loss estimator. Let

be the loss function, given currently learned parameter

. In contrast to minimizing the total loss of all samples, the least trimmed loss estimator of is given by

 ^μ(TL)=argminμ,S:|S|=⌈αn⌉∑i∈Sfμ(xi), (1)

where and is the fraction of samples we want to fit. Finding requires minimizing the trimmed loss over both the set of all subsets with size and the set of all available values of the parameter . However, solving the minimization above is hard in general. Therefore, we attempt to minimize the trimmed loss in an iterative way by alternating between minimizing over and . That is, it follows a natural iterative strategy: in the -th iteration, first select a subset of samples with the least loss on the current parameter , and then update to by minimizing the total loss over .

By taking in (1) as , in each iteration, samples are selected due to their least distance to the current in norm. Besides, note that for a given sample set ,

 argminμ∑i∈S∥xi−μ∥22=1|S|∑i∈Sxi,

that is, the parameter minimizing the total loss over sample set is the empirical mean over . This leads to our method described in Algorithm 1, which we referred to as Iterative Trimmed Mean.

Each update in our algorithm is similar to the trimmed-mean (or truncated-mean) estimator, which is, in univariate case, defined by removing a fraction of the sample consisting of the largest and smallest points and averaging over the rest; see Tukey and McLaughlin (1963); Huber (2011); Bickel and others (1965). A natural generalization to multivariate case is to remove a fraction of the sample consisting of points which have the largest distances to . The difference between our algorithm and the generalized trimmed-mean estimator is that ours select points based on the estimated while trimmed-mean is based on the ground-truth .

### 3.2 Theoretical Guarantees

Firstly, we introduce a lemma giving an upper bound of the sum of distances between points and mean vector over a sample set .

###### Lemma.

Define . Then we have

 ∑i∈S∥xi−μ⋆∥2≤2|S|√λSd (2)

with probability at least

.

###### Proof.

For each , denote the transformed random vector as . Then has mean and identical covariance matrix. Since , we can write

 ∑i∈S∥xi−μ⋆∥2≤√λS∑i∈S∥~xi∥2.

Let , thus are independent and for any , it holds that

 Eξi≤√Eξ2i=√d,Var(ξi)≤Eξ2i=d.

By Chebyshev’s inequality, we have

 P(∑i∈Sξi≥2|S|√d)≤1|S|,

which completes the proof.

We are now ready to prove our key lemma, which shows the progress of each iteration of the algorithm. The key idea is that the selected subset of samples have a large overlap with the subset of the “good” points with smallest covariance. Furthermore, is not that “bad” since by the selection criterion, they have less loss on than the points in . This thus allows to show the progress.

###### Lemma.

Given ITM with fraction , we have

 ∥μt+1−μ⋆∥2≤12∥μt−μ⋆∥2+2√dλ(⌈αn⌉) (3)

with probability at least .

###### Proof.

Define . Without loss of generality, assume is an integer. Then by the algorithm,

 μt+1=1|St|∑i∈Stxi=μ⋆+1αn∑i∈St(xi−μ⋆).

Therefore, the distance between the learned parameter and the ground truth parameter can be bound by:

 ∥μt+1−μ⋆∥2=1αn∥∥ ∥∥∑i∈St(xi−μ⋆)∥∥ ∥∥2 ≤|St∖S⋆|αn⋅∥μt−μ⋆∥2 +1αn⎛⎝∑i∈S⋆∩St∥xi−μ⋆∥2+∑i∈S⋆∖St∥xi−μt∥2⎞⎠ ≤2|St∖S⋆|αnκ⋅∥μt−μ⋆∥2+1αn∑i∈S⋆∥xi−μ⋆∥2,

where the second inequality is guaranteed by line 3 in Algorithm 1. Note that , thus . Due to the choice of , we have the distance of each sample in is less than that of samples in .

By Lemma 3.2, we can bound by with probability at least . Meanwhile, by , it guarantees Thus, when , Combining the inequalities completes the proof.

Based the error bound per-round in Lemma 3.2, it is easy to show that can be upper bounded by after sufficient iterations. This leads to our final guarantee.

###### Theorem.

Given ITM with within iterations, it holds that

 ∥μT−μ⋆∥2≤c√dλ(⌈αn⌉) (4)

with probability at least .

###### Proof.

By Lemma 3.2, we derive

 ∥μT−μ⋆∥2 ≤12∥μT−1−μ⋆∥2+2√dλ(⌈αn⌉) ≤12T∥μ0−μ⋆∥2+2T−1∑i=012i√dλ(⌈αn⌉) ≤12T∥μ0−μ⋆∥2+4√dλ(⌈αn⌉) ≤c√dλ(⌈αn⌉)

with probability at least when .

The theorem shows that the error bound only depends on the order statistics , irrelevant of the larger covariances for . Therefore, using the iterative trimming method, the magnitudes of a fraction of highest noises will not affect the quality of the final solution. However, it is still an open question whether the bound can be made , i.e., the optimal rate given an oracle that knows the covariances.

The method is also computationally very simple and efficient. Since a single iteration of ITM takes time , error can be computed in time, which is nearly linear in the input size.

###### Remark.

With a further assumption that are independently sampled from the Gaussians , the same bound in the theorem holds with probability converging to with exponential rate of . This is because can be guaranteed with a higher probability. The sketch of the proof follows: , by Cauchy-Schwarz inequality. Here

due to the Gaussian distribution of

. By the Lemma 3 in Fan and Lv (2008), with probability at least . Hence, , where can be any constant greater than and .

###### Remark.

For the univariate case with , it is sufficient to regard as one dimensional matrix . Then we obtain that when and iterations: with probability at least .

###### Remark.

It is worth mentioning that there is a trade off between the accuracy and running time in ITM. In particular, the constant can be any constant greater than , since it suffices to guarantee in the proof of Lemma 3.2 is less than . On the other hand, a smaller will slow down the speed for to shrink, although the computational complexity is still in the same order of .

###### Remark.

We now discuss existing results in detail.

In the univariate case, Chierichetti et al. (2014) achieved error in time . Among all estimators studied in Pensia et al. (2019), the superior performance is obtained by the hybrid estimators, which includes version (1): combining -median with -shorth and version (2): combining -median with modal interval estimator. These two versions achieve similar guarantees while version has lower run time . Version of the hybrid estimator outputs such that with probability , where and . Since here is defined as , the error bound giving above varies with specific , while the worst-case error guarantee is . When take , the error can be . However, this result for symmetric and unimodal distributions ’s.

In the multivariate case, Chierichetti et al. (2014) studied the special case where and provided an algorithm with the error bound in time . Pensia et al. (2019) mainly considered the special case where the overall mixture distribution is radically symmetric, and sharpens the bound above, resulting in the worst-case error bound . Pensia et al. (2019) also provided computationally efficient estimator with running time .

These existing results depend on (or alike) at the expense of an additional factor of roughly , which is most relevant when of the points have small noises, i.e., when the samples are dominated by high noises. Our results depend on (or for multivariate) for , which is more relevant when fraction of the points have high noises. We believe our bounds are more applicable for many practical scenarios. Finally, for the multivariante case, our result holds under more general assumptions and the method is significantly simpler and more efficient.

## 4 Linear Regression

Given observations from the linear model

 yi=xTiβ⋆+ϵi,∀1≤i≤n, (5)

our goal is to estimate . Here are independently distributed with expectation and variances , and are independent with and satisfy some regularity conditions described below. Denote the stack matrix as , the noise vector as and the response vector as .

###### Assumption.

Assume for all . Define

 ψ−(k)=minS:|S|=kλmin(XTSXS),

where is the submatrix of consisting of rows indexed by . Assume that for for a constant .

###### Remark.

is assumed without loss of generality, as we can always normalize , without affecting the assumption on ’s. The assumption on states that every large enough subset of is well conditioned, and has been used in previous work, e.g., Bhatia et al. (2015); Shen and Sanghavi (2019). It is worth mentioning that the uniformity over assumed is not for convenience but for necessity since the covariances of samples are unknown. Still, the assumption holds under several common settings in practice. For example, by Theorem 17 in Bhatia et al. (2015), this regularity is guaranteed for close to w.h.p. when the rows of are i.i.d. spherical Gaussian vectors and is sufficiently large.

###### Remark.

In our setting, the noise terms are independent but not necessarily identically distributed. It is also referred to as heteroscedasticity in linear regression by

Rao (1970) and Horn et al. (1975).

### 4.1 Iterative Trimmed Squares Minimization

We now apply iterative trimming to linear regression. The first step is to use the square loss, i.e. let , then the form of the trimmed loss estimator turns to

 ^β(TL)=argminβ,S:|S|=⌈αn⌉∑i∈S(yi−xTiβ)2. (6)

Such is first introduced by Rousseeuw (1984) as least trimmed squares estimator and its statistical efficiency has been studied in previous literature. However, the principal shortcoming is also its high computational complexity; see, e.g., Mount et al. (2014). Hence, we again use iterative trimming. Let

 ^β(LS)S=argminβ∑i∈S(yi−x⊤iβ)2

which is the least square estimator obtained by the sample set . When , we omit the subscript and write it as . The resulting algorithm is called Iterative Trimmed Squares Minimization and described in Algorithm 2.

### 4.2 Theoretical Guarantees

The key idea for the analysis is similar to that for mean estimation: the selected set has sufficiently large overlap with the set of “good” points with smallest noises, while the points in are not that “bad” by the selection criterion. Therefore, the algorithm makes progress in each iteration.

###### Lemma.

Under Assumption 4, given ITSM , with probability at least , we have

 ∥βt+1−β⋆∥2≤12∥βt−β⋆∥2+2c1σ(⌈αn⌉). (7)

###### Proof.

First we introduce some notations. Define

 ψ+(k)=maxS:|S|=kλmax(XTSXS)

where is the submatrix of consisting of rows indexed by . Note that , since for any of size , by , is bounded by

 Tr(XTSXS)=Tr(XSXTS) =∑i∈S∥xi∥22=k.

Denote as the diagonal matrix where if the -th sample is in set , otherwise . Let be a subset of with size and denote as the diagonal matrix w.r.t. .

Under Assumption 4, is nonsingular, so we have where we have used . Then

 βt+1 =β⋆+(XTWtX)−1XTWtϵ =β⋆+(XTWtX)−1XTWt((I−W⋆)ϵ+W⋆ϵ).

Therefore, the error can be bounded by:

 = ∥∥(XTWtX)−1XT(Wt(I−W⋆)ϵ+WtW⋆ϵ)∥∥2 ≤ 1ψ−(⌈αn⌉)⎛⎜ ⎜ ⎜⎝∥∥XTWt(I−W⋆)ϵ∥∥2T1+∥∥XTWtW⋆ϵ∥∥2T2⎞⎟ ⎟ ⎟⎠,

where we use the spectral norm inequality and triangle inequality and the fact that . In the following, we bound the two terms and .

First, can be bounded as:

 T1 =∥∥XTWt(I−W⋆)ϵ∥∥2 ≤∥∥XTWt(I−W⋆)X(βt−β⋆))∥∥2 +∥∥XTWt(I−W⋆)(y−Xβt)∥∥2 ≤ψ+(|St∖S⋆|)∥βt−β⋆∥2 +∥∥XTWt(I−W⋆)(y−Xβt)∥∥2,

since .

By the fact that , there exists a bijection between and . Denote the image of as . Since the loss of sample in is less than that of sample in , we have for any . Hence we can write

 ∥∥XTWt(I−W⋆)(y−Xβt)∥∥2 = ∥∥ ∥∥∑i∈St∖S⋆(yi−xTiβt)xi∥∥ ∥∥2 ≤ ∥∥ ∥∥∑i∈St∖S⋆ni(yki−xTkiβt)xi∥∥ ∥∥2 ≤ ∥∥ ∥∥∑i∈St∖S⋆niϵkixi∥∥ ∥∥2+∥∥ ∥∥∑i∈St∖S⋆nixixkiT(βt−β⋆)∥∥ ∥∥2

where is either or .

By the assumption that ,

 ∥∥ ∥∥∑i∈St∖S⋆niϵkixi∥∥ ∥∥2≤ ∑i∈St∖S⋆|ϵki|=∑i∈S⋆∖St|ϵi|.

Meanwhile,

 ∥∥ ∥∥∑i∈St∖S⋆nixixkiT(βt−β⋆)∥∥ ∥∥2≤smax⋅∥∥βt−β⋆∥∥2.

where

denotes the maximum singular value and is bounded as follows.

.

###### Proof.

The proof is implicit in the proof for Theorem in Shen and Sanghavi (2019). With matrix notations, can be then written as , where is diagonal with diagonal entries in and is some permutation matrix. Then

 smax=max∥u∥2=1,∥v∥=1uTXTWt(I−W⋆)NPXv.

Let and , is bounded by

 ∑i∈St∖S⋆|~uri~vti|≤max{∑i∈St∖S⋆~u2ri,∑i∈St∖S⋆~v2ti}

for some sequences and . So is bounded by the larger of the maximum singular values of and , which is bounded by .

With this claim, we can derive

 T1≤∑i∈S⋆∖St|ϵi|+2ψ+(|St∖S⋆|)∥β⋆−βt∥2.

Next, can be bounded as:

 T2=∥∥ ∥∥∑i∈St∩S⋆ϵixi∥∥ ∥∥2≤∑i∈St∩S⋆|ϵi|.

Combining the inequalities above, we have

 ∥β⋆−βt∥2=1ψ−(⌈αn⌉)(T1+T2) ≤1ψ−(⌈αn⌉)(∑i∈S⋆|ϵi|+2ψ+(|St∖S⋆|)∥β⋆−βt∥2) =2ψ+(|St∖S⋆|)ψ−(⌈αn⌉)κ∥β⋆−βt∥2+∑i∈S⋆|ϵi|ψ−(⌈αn⌉).

By assumption, there exist a constant such that and . Without loss of generality, assume is an integer. Since , when , And is bounded by with probability at least , using Markov’s inequality.

###### Theorem.

Under Assumption 4, given ITSM with and , it holds that

 ∥βT−β⋆∥2≤cc1σ(⌈αn⌉) (8)

with probability at least .

###### Proof.

It follows from Lemma 4.2 by an argument similar to that for Theorem 3.2.

###### Remark.

When ’s are i.i.d. spherical Gaussians, Assumption 4 can be satisfied with close to 1. Then we require and the error bound holds with probability , similar to that for mean estimation. Also, (8) is obtained without extra assumption on noise

except for assuming its second order moment exists. If

, the previous bound holds with a higher probability .

###### Remark.

We now discuss existing results.  Pensia et al. (2019) also adapted its mean estimation methodology to linear regression. When ’s are from a multivariate Gaussian with covariance matrix , w.h.p. it obtained error bound . Again, the result depends on with an additional factor , while ours depends on