Robust k-means Clustering for Distributions with Two Moments

02/06/2020 ∙ by Yegor Klochkov, et al. ∙ Google ∙ University of Cambridge ∙ Higher School of Economics

We consider robust algorithms for the k-means clustering problem where a quantizer is constructed based on N independent observations. Our main results are median-of-means-based non-asymptotic excess distortion bounds that hold under the two bounded moments assumption in a general separable Hilbert space. In particular, our results extend the renowned asymptotic result of Pollard (1981), who showed that the existence of two moments is sufficient for strong consistency of an empirically optimal quantizer in R^d. In the special case of clustering in R^d, under two bounded moments, we prove matching (up to constant factors) non-asymptotic upper and lower bounds on the excess distortion, which depend on the probability mass of the lightest cluster of an optimal quantizer. Our bounds have the sub-Gaussian form, and the proofs are based on versions of uniform bounds for robust mean estimators.


1 Introduction

Statistical (sample-based) $k$-means clustering is the classical form of quantization for probability measures. In this framework, given a distribution $P$ defined on a normed space and an integer $k$, one wants to find a set of $k$ points $Q = (c_1, \dots, c_k)$ (a quantizer) minimizing the distortion

$$W(Q, P) = \mathbb{E}\min_{j \le k}\|X - c_j\|^2, \qquad X \sim P.$$

It is well known that if the space is $\mathbb{R}^d$ with the Euclidean norm and if $\mathbb{E}\|X\|^2 < \infty$, then an optimal quantizer $Q^*$ exists (see e.g., Theorem 1 in (Linder; 2002)), although it is not necessarily unique for $k \ge 2$. The value of the optimal distortion will be denoted by $W^* = W(Q^*, P) = \inf_{Q} W(Q, P)$. In the statistical setup, access to $P$ is achieved via $N$ independent observations $X_1, \dots, X_N$ sampled according to $P$. Consider again the case of $\mathbb{R}^d$ and the Euclidean norm. The following renowned result due to Pollard (1981) states strong consistency of (any) empirically optimal quantizer $Q_N$, which is defined by

$$Q_N \in \operatorname*{arg\,min}_{Q = (c_1, \dots, c_k)}\ \frac{1}{N}\sum_{i=1}^{N}\min_{j \le k}\|X_i - c_j\|^2. \qquad (1)$$
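As a quick numerical illustration of the objective in (1), one can compute the empirical distortion of a candidate quantizer in $\mathbb{R}^d$. The sketch below uses scikit-learn's KMeans as an off-the-shelf heuristic minimizer (the results discussed here concern exact minimizers, which Lloyd-type iterations do not guarantee); all names in the snippet are ours.

    import numpy as np
    from sklearn.cluster import KMeans

    # Empirical distortion of (1): (1/N) * sum_i min_j ||X_i - c_j||^2.
    def empirical_distortion(X, centers):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return d2.min(axis=1).mean()

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    # For the centers found by KMeans this equals km.inertia_ / len(X).
    print(empirical_distortion(X, km.cluster_centers_))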
Theorem 1.1 (Strong consistency of $k$-means (Pollard; 1981)).

For any distribution $P$ such that $\mathbb{E}\|X\|^2 < \infty$ and any integer $k$, it holds that $W(Q_N, P) \to W^*$ almost surely as $N \to \infty$.

This consistency result was extended to the case where the space is a general separable Hilbert space (Biau, Devroye and Lugosi; 2008; Levrard; 2015). Clearly, consistency alone does not provide any information on how many training samples are needed to ensure that the excess distortion is below a certain level. Moreover, it does not allow the underlying distribution to be different for each $N$. Over the last three decades, a lot of effort has been made to prove non-asymptotic results for the excess distortion, where the space is $\mathbb{R}^d$ or a general separable Hilbert space. We refer to various bounds in (Bartlett, Linder and Lugosi; 1998; Linder; 2002; Biau, Devroye and Lugosi; 2008; Graf and Luschgy; 2007; Maurer and Pontil; 2010; Narayanan and Mitter; 2010; Levrard; 2013, 2015; Fefferman, Mitter and Narayanan; 2016; Maurer; 2016) and the references therein. However, almost all of these results were obtained under the strong assumption that the domain is bounded. That is, it is usually assumed that $\|X\| \le R$ almost surely, where $X$ is distributed according to $P$ and $R > 0$ is a constant. This simple setup allows one to use the tools of Empirical Process Theory to prove results of the form (see e.g., Theorem 2.1 by Biau, Devroye and Lugosi (2008), where the space is assumed to be a separable Hilbert space)

$$W(Q_N, P) - W^* \le C\left(\frac{k R^2}{\sqrt{N}} + R^2\sqrt{\frac{\log(1/\delta)}{N}}\right), \qquad (2)$$

holding, with probability at least $1 - \delta$, for the empirically optimal quantizer $Q_N$. The question of general unbounded distributions is more challenging and has been studied less. The case where the vectors $X_i$ have well-behaved exponential moments was analyzed in (Cadre and Paris; 2012). Results under less restrictive assumptions include: the uniform deviation bounds in (Telgarsky and Dasgupta; 2013; Bachem, Lucic, Hassani and Krause; 2017); a sub-Gaussian excess distortion bound in (Brownlees, Joly and Lugosi; 2015) for the so-called $k$-medians problem; and the results for trimmed quantizers in (Brécheteau, Fischer and Levrard; 2018). We will discuss some of these results in more detail in what follows. However, we emphasize that in our particular setup the results we are aware of require the existence of at least four moments (that is, $\mathbb{E}\|X\|^4 < \infty$), compared to the minimal assumption under which the problem makes sense, $\mathbb{E}\|X\|^2 < \infty$, which is what we aim for in this paper (this assumption is required for the distortion to be well defined). The question of whether non-asymptotic results of the form (2) are possible under the minimal assumption $\mathbb{E}\|X\|^2 < \infty$ (as in (Pollard; 1981)) appeared naturally in several papers (see e.g., (Levrard; 2013)) but has not yet been addressed.

Our motivating example is the sub-Gaussian mean estimator introduced in (Lugosi and Mendelson; 2019c). Consider the situation where the space is $\mathbb{R}^d$ with the Euclidean norm, set $\mu = \mathbb{E}X$, and assume that the covariance matrix $\Sigma$ of $X$ exists. If $k = 1$, we obviously have that the optimal quantizer is actually the mean $\mu$. In this case, our problem boils down to the estimation of the mean of a random vector. It was shown by Lugosi and Mendelson that there is an estimator $\widehat{\mu}$ such that, with probability at least $1 - \delta$,

$$\mathbb{E}\|X - \widehat{\mu}\|^2 - \mathbb{E}\|X - \mu\|^2 = \|\widehat{\mu} - \mu\|^2 \lesssim \frac{\operatorname{Tr}(\Sigma)}{N} + \frac{\|\Sigma\|\log(1/\delta)}{N}, \qquad (3)$$

where $\|\Sigma\|$ is the largest eigenvalue of the covariance matrix $\Sigma$, the expectation is taken only with respect to $X$, and $\widehat{\mu}$ is random. It is known that this bound is valid for the sample mean in the case where the underlying distribution is multivariate Gaussian. The bound (3) has some remarkable properties:

  • The dependence on the sample size $N$ is of order $1/N$.

  • It only requires the existence of two moments, that is, $\mathbb{E}\|X\|^2 < \infty$. We note that in $\mathbb{R}^d$, $\operatorname{Tr}(\Sigma) = \mathbb{E}\|X - \mu\|^2$.

  • It has a logarithmic dependence on the confidence, namely the $\log(1/\delta)$ term, which corresponds to sub-Gaussian tails (see e.g., (Vershynin; 2016) for various equivalent definitions of sub-Gaussian distributions).

  • Finally, even in the favorable bounded case where $\|X\| \le R$ almost surely, the bound (3) does not scale as $R^2$ (compare it with the typical $k$-means bound (2)) but as $\operatorname{Tr}(\Sigma)$, which can be much smaller than $R^2$.

Therefore, extending the original question of whether non-asymptotic excess distortion bounds are possible under $\mathbb{E}\|X\|^2 < \infty$, it is natural to ask if one can prove a result of the form (3) for $k \ge 2$. Unfortunately, the fully general picture is much more subtle. In particular, even in the favorable bounded case, for $k \ge 2$ lower bounds of order $1/\sqrt{N}$ are known (see e.g., (Antos; 2005)), making the simple bound (2) sharp with respect to $N$.

Further, if $k = 1$, we observe that the right-hand side of (3) converges to zero as $N$ goes to infinity even if the underlying distribution is different for each $N$. Our only condition is that $\operatorname{Tr}(\Sigma)$ does not grow too fast as $N$ goes to infinity. Surprisingly, Example 1.2 below will show that the same is not true for general $k$.

Risk bounds having the sub-Gaussian form for heavy-tailed distributions have attracted a lot of attention recently. Among these advances are (almost) optimal results on mean estimation in various norms (Minsker; 2018; Lugosi and Mendelson; 2019a); in robust regression (Minsker and Mathieu; 2019; Lugosi and Mendelson; 2019b; Lecué and Lerasle; 2017); in covariance estimation (Mendelson and Zhivotovskiy; 2018); in classification (Lecué, Lerasle and Mathieu; 2018). All the technical results in this area are based on different versions of the so-called median of means estimator, which was first introduced and analyzed by Nemirovsky and Yudin (1983) and independently in (Alon, Matias and Szegedy; 1999). For the sake of completeness, let us recall this basic result.

Assume that $Y_1, \dots, Y_N$ are independent random variables with the same mean $\mu$ and the same variance $\sigma^2$. Fix the confidence level $\delta \in (0, 1)$ and assume that $N$ is such that $N = n m$, where $m$ is an integer. Split the set $\{1, \dots, N\}$ into $n$ blocks $B_1, \dots, B_n$ of equal size such that $|B_j| = m$. Denote the median of means (MOM for short) estimator by

$$\widehat{\mu}_{\mathrm{MOM}} = \operatorname{median}\left(\frac{1}{m}\sum_{i \in B_1} Y_i,\ \dots,\ \frac{1}{m}\sum_{i \in B_n} Y_i\right).$$

For this estimator we have the following sub-Gaussian behaviour: with the number of blocks $n$ of order $\log(1/\delta)$, it holds, with probability at least $1 - \delta$, that

$$\left|\widehat{\mu}_{\mathrm{MOM}} - \mu\right| \le C\,\sigma\sqrt{\frac{\log(1/\delta)}{N}},$$

where $C > 0$ is an absolute constant.
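For illustration, here is a minimal Python sketch of the median-of-means estimator just recalled; the block splitting and the choice of roughly $\log(1/\delta)$ blocks follow the discussion above, while the function name and the heavy-tailed test distribution are ours.

    import math
    import numpy as np

    # Median of means: split the sample into n blocks of (almost) equal size
    # and take the median of the block means.
    def median_of_means(y, n_blocks, seed=0):
        rng = np.random.default_rng(seed)
        y = np.asarray(y, dtype=float)
        idx = rng.permutation(len(y))
        block_means = [y[b].mean() for b in np.array_split(idx, n_blocks)]
        return np.median(block_means)

    delta = 0.01
    n = max(1, math.ceil(8 * math.log(1 / delta)))  # number of blocks of order log(1/delta)
    sample = np.random.default_rng(1).standard_t(df=2.5, size=10_000)  # heavy-tailed, mean 0
    print(median_of_means(sample, n))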

Returning to the question of $k$-means clustering and the inequalities of the form (3) for general $k$, the following simple example, inspired by Bachem, Lucic, Hassani and Krause, highlights some of the obstacles we will have to handle.

Example 1.2.

Let $N$ be the sample size. Consider the real line $\mathbb{R}$, let $k = 2$, and let $P_N$ be the distribution supported on $\{0, \sqrt{N}\}$ such that $P_N(\{\sqrt{N}\}) = 1/N$ and $P_N(\{0\}) = 1 - 1/N$. In this case we have $\mathbb{E}X^2 = 1$ and $W^* = 0$.

One may easily see that, with constant probability, the value $\sqrt{N}$ is not among $X_1, \dots, X_N$. That will obviously force an empirically optimal quantizer $Q_N$ with all its centers at the origin, and

$$W(Q_N, P_N) - W^* = \frac{1}{N}\left(\sqrt{N}\right)^2 = 1,$$

which is not converging to zero as $N$ goes to infinity.
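A quick simulation of this example, using the two-point distribution reconstructed above (mass $1 - 1/N$ at $0$ and $1/N$ at $\sqrt{N}$); the code only illustrates that the remote point is missed with probability roughly $e^{-1}$, in which case the excess distortion equals one.

    import numpy as np

    # Example 1.2 (as reconstructed above): if sqrt(N) does not appear in the sample,
    # an empirically optimal 2-quantizer may place all centers at 0, and its excess
    # distortion equals (1/N) * (sqrt(N))^2 = 1.
    def miss_frequency(N, trials=2000, seed=0):
        rng = np.random.default_rng(seed)
        heavy = np.sqrt(N)
        misses = 0
        for _ in range(trials):
            sample = rng.choice([0.0, heavy], size=N, p=[1 - 1 / N, 1 / N])
            misses += heavy not in sample
        return misses / trials  # close to exp(-1) ~ 0.37 for large N

    for N in (10, 100, 1000):
        print(N, miss_frequency(N))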

In Section 3.1 we will significantly extend this construction. Of course, Example 1.2 does not contradict the strong consistency result of Theorem 1.1. Although $\mathbb{E}X^2 = 1$ for every $N$, the distribution changes with $N$, which is, of course, not allowed in Theorem 1.1. However, in the statistical learning theory literature, the underlying distribution is usually allowed to change with the sample size, and this provides an additional motivation for our study. Surprisingly, our general bounds will provide consistency even for some sequences of distributions changing with the sample size $N$.

On Voronoi cells and clustering.

From now on we assume that $H$ is a separable Hilbert space with the inner product denoted by $\langle\cdot,\cdot\rangle$. Any quantizer $Q = (c_1, \dots, c_k)$ induces a partition of $H$ into the so-called Voronoi cells, where the cell of $c_j$ consists of the points that have $c_j$ as the closest point from $Q$. To avoid the uncertainty at the boundaries, we assume that the elements of each quantizer are ordered, and define the cells, for $j = 1, \dots, k$,

$$V_j(Q) = \left\{x \in H : \|x - c_j\| < \|x - c_i\| \ \text{for all } i < j, \ \text{ and } \ \|x - c_j\| \le \|x - c_i\| \ \text{for all } i > j\right\}.$$

This way, we ensure that the cells are non-intersecting, and each of them is an intersection of open or closed half-spaces. Slightly abusing the notation, we sometimes write $V_j$ instead of $V_j(Q)$ when the quantizer is clear from the context.
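For concreteness, the tie-breaking convention just described can be implemented as follows in $\mathbb{R}^d$ (a small sketch with our own naming: a point goes to the lowest-index center among the nearest ones).

    import numpy as np

    # Assign a point to the Voronoi cell of the first (lowest-index) nearest center,
    # so that the cells are pairwise disjoint, as in the convention above.
    def voronoi_cell_index(x, centers):
        distances = np.array([np.linalg.norm(x - c) for c in centers])
        # np.argmin returns the smallest index attaining the minimum.
        return int(np.argmin(distances))

    centers = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
    print(voronoi_cell_index(np.array([1.0, 0.0]), centers))  # boundary point -> cell 0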

We recall some basic properties of an optimal quantizer under the assumption $\mathbb{E}\|X\|^2 < \infty$.

  1. For any distribution $P$ with $\mathbb{E}\|X\|^2 < \infty$ and any $k$, there exists an optimal $k$-element quantizer (see (Fischer; 2010), Corollary 3.1). Note that an optimal quantizer is not necessarily unique.

  2. For any optimal $Q^* = (c_1, \dots, c_k)$ and any $i \ne j$,

    $$P\left(\|X - c_i\| = \|X - c_j\|\right) = 0,$$

    which means that the measure of the intersection of any two (closed) cells is zero; thus, it does not matter to which cell the boundary points are assigned (see (Graf and Luschgy; 2007), Theorem 4.2).

  3. The centroid condition (Graf and Luschgy; 2007): for $j = 1, \dots, k$,

    $$c_j = \mathbb{E}\left[X \mid X \in V_j(Q^*)\right]. \qquad (4)$$
  4. Once the support of $P$ consists of at least $k$ elements, there is a well-defined real number $R > 0$ such that for any optimal $Q^*$,

    $$\max_{c \in Q^*} \|c\| \le R. \qquad (5)$$

    We refer to the original proof of Pollard or to Lemma 5.1 in (Fischer; 2010).

  5. Due to Theorem 4.1 in (Graf and Luschgy; 2007), provided that the support of $P$ consists of at least $k$ elements, there exists $p_{\min} > 0$ such that for any optimal $Q^*$,

    $$\min_{1 \le j \le k} P\left(V_j(Q^*)\right) \ge p_{\min}. \qquad (6)$$

Observe that the same conclusions hold if we replace $P$ by its empirical counterpart $P_N$. In particular, a version of the centroid condition is also valid for the empirically optimal quantizer defined by (1). However, it is not true for a MOM-based estimator in general.

Structure of the paper

  • Section 2 is devoted to high probability excess distortion bounds that hold in the case where a good guess of the localization radius of the optimal quantizer is available. The result naturally generalizes several known bounds for the empirically optimal quantizer in separable Hilbert spaces.

  • Section 3 contains our main results. We show that there is a consistent median-of-means-based estimator that achieves the sub-Gaussian performance under our minimal moment assumption, provided that a good guess of $p_{\min}$ is given and $N p_{\min}$ is large enough. We also prove a minimax lower bound showing that our dependence on $N$ and $p_{\min}$ is sharp up to constant factors in the special case of $\mathbb{R}^d$.

  • Section 4 is devoted to generalizations of our main results. We show that it is possible to prove a slightly weaker bound using a procedure that does not require the knowledge of the parameters of $P$.

  • Finally, Section 5 is devoted to discussions and some final remarks.

Notation.

For $a, b \in \mathbb{R}$, we set $a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$, and for two real-valued functions $f, g$, we write $f \lesssim g$ iff there is an absolute constant $c > 0$ such that $f \le c\, g$. We set $f \sim g$ if $f \lesssim g$ and $g \lesssim f$. Given a probability measure $P$, let $P^{\otimes N}$ denote the measure which is the $N$-times product of $P$. For the sake of simplicity, we always assume that is equal to . The indicator of an event $A$ is denoted by $\mathbb{1}[A]$. We also use the standard $O(\cdot)$, $o(\cdot)$ notation, as well as $\mathrm{KL}(P_1, P_2)$ and $\mathrm{TV}(P_1, P_2)$ for the Kullback–Leibler divergence and the total variation distance between two measures $P_1$ and $P_2$ (see e.g., (Boucheron et al.; 2013)). The support of a measure $P$ is denoted by $\operatorname{supp}(P)$. For a normed space, let $B(R)$ denote the closed ball of radius $R$ centered at the origin. To avoid measurability issues, we use the standard convention for the supremum of stochastic processes (see Paragraph 2 in (Talagrand; 2014)). Given the sample $X_1, \dots, X_N$ sampled i.i.d. from $P$ and a function $f$, we denote $P_N f = \frac{1}{N}\sum_{i=1}^{N} f(X_i)$. In general, the symbol $P_N$ will denote the empirical measure.

We are interested in the $L_2(P)$ space, and the corresponding covering number of a functional class $\mathcal{F}$ will be denoted by $\mathcal{N}(\mathcal{F}, \varepsilon)$, where $\varepsilon$ is the corresponding radius (see e.g., (Vershynin; 2016) for more details on covering numbers).

2 Simple Case: Known Magnitude of an Optimal Quantizer

In this section we provide our simplest result which can serve as a good illustration of the underlying techniques. In Sections 3 and 4 we will be focusing on sharpening our basic bound as well as weakening some of the assumptions.

We first show a simple bound which holds in situations where a good guess of $R$ is available (recall property 4 and (5)). The result of Theorem 2.2 below can be seen as a strengthening of Theorem 11 in (Brownlees et al.; 2015).

First, we observe that $R$ defined above is not translation invariant. This means that if the distribution of $X$ is changed in a way such that we replace $X$ by $X + a$, where $a$ is a constant vector, the value of $R$ may increase, while the clustering problem remains the same. Therefore, we slightly redefine the quantity. Let $\bar{R}$ be a number such that

$$\max_{c \in Q^*}\|c - \mathbb{E}X\| \le \bar{R}, \qquad (7)$$

where $\mathbb{E}X$ is the mean of $X$ and $Q^*$ is any particular optimal quantizer. Fortunately, due to e.g., (3), there is a very efficient way to estimate $\mathbb{E}X$. One may split the sample of size $N$ into two almost equal parts and estimate $\mathbb{E}X$ based on the first part. Therefore, for the sake of presentation, we assume that $\mathbb{E}X = 0$ in this section.

Remark 2.1.

It is important to note that the boundedness of the vectors in the finite set $Q^*$ has nothing in common with the boundedness of the observations $X_1, \dots, X_N$. The latter can still be unbounded and the distribution $P$ can be heavy-tailed.

We proceed with the main result of this section.

Theorem 2.2.

Fix $\delta \in (0, 1)$. Let some $\bar{R}$ satisfying (7) be known and assume that $\mathbb{E}X = 0$. There is an estimator $\widehat{Q}$ that depends on $\bar{R}$ and $\delta$ such that, with probability at least $1 - \delta$,

Let us now define the estimator that we use in Theorem 2.2. Notice that minimizing $W(Q, P) = \mathbb{E}\min_{j \le k}\|X - c_j\|^2$ with respect to $Q = (c_1, \dots, c_k)$ is equivalent to minimizing

$$\mathbb{E}\min_{j \le k}\left(\|X - c_j\|^2 - \|X\|^2\right) = \mathbb{E}\min_{j \le k}\left(\|c_j\|^2 - 2\langle X, c_j\rangle\right).$$

Fix the number of blocks $n$ and assume without loss of generality that $N/n$ is an integer. Split the set $\{1, \dots, N\}$ into blocks $B_1, \dots, B_n$ of equal size such that $|B_j| = N/n$. For any real-valued function $f$ and random variables $X_1, \dots, X_N$, define

$$\mathrm{MOM}_n[f] = \operatorname{median}\left(\frac{n}{N}\sum_{i \in B_1} f(X_i),\ \dots,\ \frac{n}{N}\sum_{i \in B_n} f(X_i)\right). \qquad (8)$$

Slightly abusing the notation, we set

$$\mathrm{MOM}_n[Q] = \mathrm{MOM}_n\Big[\min_{j \le k}\big(\|\cdot - c_j\|^2 - \|\cdot\|^2\big)\Big] \quad \text{for } Q = (c_1, \dots, c_k). \qquad (9)$$

The estimator of Theorem 2.2. Define

$$\widehat{Q} \in \operatorname*{arg\,min}_{Q \subset B(\bar{R}),\ |Q| \le k}\ \mathrm{MOM}_n[Q],$$

with the number of blocks $n$ of order $\log(1/\delta)$. If there are many minimizers, we may choose any of them.
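A sketch of how the criterion in (8) and (9) can be evaluated and used to compare candidate quantizers; the block splitting follows (8), while the candidate search and all names are ours (the theorem only requires a minimizer over centers in a ball of radius comparable to $\bar{R}$ and does not prescribe an optimization routine).

    import numpy as np

    # Median-of-means criterion from (8)-(9): median over blocks of the block means
    # of min_j (||x - c_j||^2 - ||x||^2). X has shape (N, d), centers has shape (k, d).
    def mom_objective(X, centers, n_blocks, seed=0):
        rng = np.random.default_rng(seed)
        N = len(X)
        idx = rng.permutation(N)[: (N // n_blocks) * n_blocks].reshape(n_blocks, -1)
        per_point = np.min(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        per_point = per_point - (X ** 2).sum(-1)
        return np.median(per_point[idx].mean(axis=1))

    # The estimator of Theorem 2.2 minimizes this criterion over quantizers whose
    # centers lie in a ball; here we only compare a handful of candidate quantizers.
    def best_candidate(X, candidates, n_blocks):
        return min(candidates, key=lambda C: mom_objective(X, C, n_blocks))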

The proof of Theorem 2.2 relies on a uniform bound for the median of means estimator. However, instead of restricting our attention to the medians only, we consider the quantiles of means (QOM). That is, for a given level $q \in (0, 1)$,

$$\mathrm{QOM}_q[f] = \bar{Z}_{(\lceil q n \rceil)}, \qquad \text{where } \bar{Z}_j = \frac{n}{N}\sum_{i \in B_j} f(X_i),$$

and $\bar{Z}_{(1)} \le \dots \le \bar{Z}_{(n)}$ is a non-decreasing rearrangement of the original sequence. For the sake of simplicity, we always assume that $q n$ is non-integer, so that the quantile is uniquely defined and, in particular, coincides with one of the block means. It will usually be enough to assume that $n$ is not even, which can always be achieved by adding at most one extra block. Obviously, $q = 1/2$ corresponds to the median of means.

Lemma 2.3.

Fix $q \in (0, 1)$ and consider a separable class $\mathcal{F}$ of square-integrable real-valued functions. Suppose we have $n$ blocks and $q n$ is a non-integer. It holds that, with probability at least ,

(10)

as well as with probability at least ,

(11)

where $\varepsilon_1, \dots, \varepsilon_N$ are i.i.d. Rademacher signs.

Remark 2.4.

In the case where $q$ is fixed, we can take the number of blocks $n$ of order $\log(1/\delta)$, so that, with probability at least $1 - \delta$,

where the first term represents the expectation of the empirical process, whereas the second term corresponds to a tail with sub-Gaussian behaviour. Compare this inequality with Talagrand's inequality for empirical processes, where the assumption that the functions are bounded almost surely is needed (see Chapter 12 in Boucheron et al. (2013)).

As noticed by Minsker (2018) (see equation (2.7)), an inequality similar to (10) of Lemma 2.3 for $q = 1/2$ follows from the proof of Theorem 2 in (Lecué, Lerasle and Mathieu; 2018). However, to the best of our knowledge, Lemma 2.3 in this form is not presented explicitly in the literature. We provide its proof in the appendix for the sake of completeness.

Proof of Theorem 2.2.

Step 1. First, we provide the high probability part of the analysis. Observe that

where we used since . We have by Lemma 2.3 that, with probability at least ,

where $\varepsilon_1, \dots, \varepsilon_N$ are independent Rademacher signs. Here, we have for each ,

where . Then, since for any , we have

Step 2. Note that $Q$ can consist of fewer than $k$ points. However, in this case we can always add copies of some of them and identify $Q$ with a $k$-element quantizer. This does not change the distortion and preserves the Voronoi partition of the space, since the cells corresponding to the newly added points are empty. Finally, we estimate

(12)

Consider the set of $\mathbb{R}^k$-valued functions such that for $Q = (c_1, \dots, c_k)$ we set

(13)

For let . We obviously have . Following the analysis of Section 3.2 in (Maurer; 2016) we have for any two and from ,

This allows us to use the vector contraction principle to upper bound (12) with a quantity scaling linearly in $k$. To do so, we observe that Maurer's vector contraction inequality (Theorem 3 in (Maurer; 2016)) implies

where $\varepsilon_{ij}$, $i = 1, \dots, N$, $j = 1, \dots, k$, are independent Rademacher signs, and . We further have by Khintchine's inequality,

and also

Finally, taking the expectation with respect to and using Jensen’s inequality we obtain an analog of (2). That is,

(14)

Combining the above bounds we prove the claim. ∎

It is by now well known that in our setup, in the bounded case (e.g., when $\|X\| \le R$ almost surely), the right dependence of the excess distortion on the number of clusters is $\sqrt{k}$ up to logarithmic factors (Fefferman et al.; 2016; Narayanan and Mitter; 2010). It is natural to ask if the same dependence is possible in our Theorem 2.2. First, observe that in the unbounded case there are some complications. In particular, our parameter $\bar{R}$ can also depend on $k$. This means that the overall dependence of the excess distortion on $k$ can be more complicated. Nevertheless, in the next section we show, among other things, that these improvements are possible and, in particular, the $k$-term will be replaced by the $\sqrt{k}$-term.

3 Towards Better Bounds Based on $p_{\min}$

This section is devoted to our main results. We prove almost optimal non-asymptotic bounds for $k$-means. Recall that if $\mathbb{E}\|X\|^2 < \infty$, we have for any optimal quantizer $Q^* = (c_1, \dots, c_k)$

$$p_{\min} = \min_{1 \le j \le k} P\left(V_j(Q^*)\right) > 0,$$

unless the support of $P$ consists of less than $k$ points. Notice that $p_{\min}$ controls the magnitude of the largest vector in $Q^*$. Indeed, using the centroid condition, Jensen's inequality and the Cauchy–Schwarz inequality, we have for any $j$,

$$\|c_j\| = \left\|\mathbb{E}\left[X \mid X \in V_j(Q^*)\right]\right\| \le \sqrt{\mathbb{E}\left[\|X\|^2 \mid X \in V_j(Q^*)\right]} \le \sqrt{\frac{\mathbb{E}\|X\|^2}{p_{\min}}}. \qquad (15)$$

This confirms that the mass of the lightest cluster of an optimal quantizer should affect the quality of any empirical quantizer.

Let us return to Example 1.2. In this case we have $p_{\min} = 1/N$, $\mathbb{E}X^2 = 1$, $\max_{c \in Q^*}|c| = \sqrt{N}$, and the bound (15) is tight. However, the bound of Theorem 2.2 is not tight anymore, as it scales with $\bar{R}$, which is of order $\sqrt{N}$ here. Indeed, Theorem 2.2 implies a bound which does not converge to zero whenever (15) is tight.

The challenging part is to get the optimal dependence on $N$ and $p_{\min}$ in the excess distortion bound. In what follows, we show that the optimal dependence is achievable with respect to these parameters. A result of this form guarantees consistency for sequences of distributions depending on $N$ as long as $N p_{\min} \to \infty$ and the second moment is uniformly bounded. This extends the original asymptotic result of Pollard (1981) to the case where the distribution is allowed to change with $N$.

Suppose that we know the value of the minimal cluster mass for at least one optimal quantizer. Denote this quantity, for short, by $p_{\min}$. Naturally, we want to find a solution such that the corresponding Voronoi cells are of measure at least of order $p_{\min}$, which, due to concentration, translates into a lower bound on the number of sample points per cell. It implies that each cell corresponding to the solution contains enough sample points, which corresponds to the so-called constrained $k$-means clustering. On the algorithmic side, constrained clustering is well studied in the context of optimal transport and has numerous practical applications; see (Ng; 2000; Cuturi and Doucet; 2014; Genevay, Dulac-Arnold and Vert; 2019) and the references therein. We have additional motivation to introduce $p_{\min}$, since this quantity appears naturally in the condition implying the so-called fast rates of the excess distortion in the bounded case (Levrard; 2015). Finally, recalling Example 1.2, we know that in any reasonable clustering problem $N p_{\min}$ should not be too small, which means that the optimal solution has enough observations in each cell. At the same time, we do not have such a natural preliminary guess on $\bar{R}$.

As before, the number of blocks $n$ depends solely on the desired confidence level. The estimator of this section is translation invariant, and the assumption $\mathbb{E}X = 0$ is not needed anymore. Our main result is the following theorem.

Theorem 3.1.

Fix $\delta \in (0, 1)$. Suppose that $p_{\min}$ is known for some optimal quantizer $Q^*$. There is an estimator $\widehat{Q}$ that depends on $p_{\min}$ and $\delta$ such that, with probability at least $1 - \delta$,

Let us now present our estimator.

The estimator of Theorem 3.1. We set

$$\widehat{Q} \in \operatorname*{arg\,min}\Big\{\mathrm{MOM}_n[Q] \;:\; Q = (c_1, \dots, c_k),\ \ P_N\big(V_j(Q)\big) \ge p_{\min}/2 \ \text{ for all } j \le k\Big\}, \qquad (16)$$

with the number of blocks $n$ of order $\log(1/\delta)$.

The idea behind this estimator is quite natural: we guarantee robustness by using the MOM principle and by restricting our attention only to the cells containing enough points. As already mentioned, this is essentially a robust version of the constrained $k$-means quantizer introduced in Ng (2000).
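A sketch of the corresponding selection rule, under the reconstruction of (16) above (minimal empirical cell mass $p_{\min}/2$); the constant, the candidate search, and all names are ours.

    import numpy as np

    # Number of sample points in each Voronoi cell (ties go to the lowest index).
    def cell_counts(X, centers):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        return np.bincount(labels, minlength=len(centers))

    # Median-of-means criterion, as in (8)-(9).
    def mom_criterion(X, centers, n_blocks, seed=0):
        rng = np.random.default_rng(seed)
        N = len(X)
        idx = rng.permutation(N)[: (N // n_blocks) * n_blocks].reshape(n_blocks, -1)
        per_point = np.min(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1) - (X ** 2).sum(-1)
        return np.median(per_point[idx].mean(axis=1))

    # Among candidate quantizers, keep those whose cells all contain at least
    # (p_min / 2) * N sample points, then pick the one with the smallest criterion.
    # Assumes at least one feasible candidate.
    def constrained_mom_kmeans(X, candidates, p_min, n_blocks):
        N = len(X)
        feasible = [C for C in candidates if cell_counts(X, C).min() >= 0.5 * p_min * N]
        return min(feasible, key=lambda C: mom_criterion(X, C, n_blocks))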

We introduce several technical results that together will lead us to the proof of Theorem 3.1 at the end of this section. As before, since the estimator we consider is translation invariant, we can assume in the proof without loss of generality that $\mathbb{E}X = 0$. As previously, our main tool is the concentration of MOM over a suitably chosen class of quantizers. We show that the restriction in (16) implies a convenient bound for the vectors in the resulting empirical quantizer. Let us define the following class of quantizers:

(17)

The following lemma says that, with high probability, all the solutions corresponding to the class (17) are bounded, which is, of course, natural in view of the proof of Theorem 2.2. However, the key technical observation is that we also need to control the smallest norm, by showing that there is at least one center which is relatively close to the actual expectation. Surprisingly, in order to show this we do not have to use any uniform results that hold simultaneously for the entire class. Therefore, we have the following property.

Lemma 3.2.

With probability at least , it holds that simultaneously for all such that ,

Moreover, with probability at least , it holds that

Remark 3.3.

Note that the first statement of the above lemma gives us a prior bound on the excess distortion . Indeed, since , with probability at least , we have , thus

(18)

Before going to the proof of Lemma 3.2, let us state the following trivial result on empirical quantiles. We postpone its proof to the Appendix.

Lemma 3.4.

Let be i.i.d. random values such that   and almost surely. Then for any we have

Proof of Lemma 3.2.

Step 1. First, let us prove the bound on the minimal norm. Consider such that . Then for any (recall that is a ball of radius centered at the origin) it holds that , thus for all ,

and hence

According to Lemma 3.4, with probability at least , it holds that

thus, simultaneously for all satisfying we have

In particular, since is always one of the potential candidates for and , we have

Step 2. Now consider such that there is with . It is easy to see that for any , implies , thus

(19)

Assume for some , then for any . Hence,

At the same time,

Now Chernoff’s bound yields that, with probability at least ,

This implies , which means that none of such quantizers can be chosen by our estimator. By the union bound, we finally get that with probability at least . ∎

The next step is to provide a uniform concentration of MOM over a class of quantizers. First, we estimate the $L_2(P)$-diameter and the covering numbers of the functional class corresponding to this class of quantizers:

(20)
Lemma 3.5.

For any distribution and any finite set it holds that

(21)

and

(22)
Proof.

Fix and let be such that . Then for any and it holds from (19) that . Therefore,

Further, using (21), we easily have