# Robust Bregman Clustering

Using a trimming approach, we investigate a k-means type method based on Bregman divergences for clustering data possibly corrupted with clutter noise. The main interest of Bregman divergences is that the standard Lloyd algorithm adapts to these distortion measures, and they are well-suited for clustering data sampled according to mixture models from exponential families. We prove that there exists an optimal codebook, and that an empirically optimal codebook converges a.s. to an optimal codebook in the distortion sense. Moreover, we obtain the sub-Gaussian rate of convergence for k-means 1 √() n under mild tail assumptions. Also, we derive a Lloyd-type algorithm with a trimming parameter that can be selected from data according to some heuristic, and present some experimental results.

## Authors

• 2 publications
• 5 publications
• 8 publications
• ### The Informativeness of k-Means and Dimensionality Reduction for Learning Mixture Models

The learning of mixture models can be viewed as a clustering problem. In...
03/30/2017 ∙ by Zhaoqiang Liu, et al. ∙ 0

• ### Robust k-means Clustering for Distributions with Two Moments

We consider the robust algorithms for the k-means clustering problem whe...
02/06/2020 ∙ by Yegor Klochkov, et al. ∙ 0

• ### Optimal Clustering in Anisotropic Gaussian Mixture Models

We study the clustering task under anisotropic Gaussian Mixture Models w...
01/14/2021 ∙ by Xin Chen, et al. ∙ 0

• ### Clustering subgaussian mixtures by semidefinite programming

We introduce a model-free relax-and-round algorithm for k-means clusteri...
02/22/2016 ∙ by Dustin G. Mixon, et al. ∙ 0

• ### Identifiability of Nonparametric Mixture Models and Bayes Optimal Clustering

Motivated by problems in data clustering, we establish general condition...
02/12/2018 ∙ by Bryon Aragam, et al. ∙ 0

• ### Uniform Concentration Bounds toward a Unified Framework for Robust Clustering

Recent advances in center-based clustering continue to improve upon the ...
10/27/2021 ∙ by Debolina Paul, et al. ∙ 0

• ### Generalized Dirichlet-process-means for f-separable distortion measures

DP-means clustering was obtained as an extension of K-means clustering. ...
01/31/2019 ∙ by Masahiro Kobayashi, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Clustering is the problem of classifying data in groups of similar points, so that the groups are as homogeneous and at the same time as well separated as possible

DHS

. There are no labels known in advance, so clustering is an unsupervised learning task. To perform clustering, one needs some distance-like function to serve as a proximity measure between points. A very famous clustering method is the standard

-means algorithm (see for instance Llo), based on the Euclidean distance.

Let denote a sample of independent random observations with values in

, with the same distribution as a generic random vector

with distribution . For an integer , in order to group data items in meaningful classes, the -means procedure minimizes the so-called empirical distortion

 Rn(c)=1nn∑i=1minj∈[[1,k]]∥Xi−cj∥2,

over all possible cluster centers or codebooks , with notation for . Here, denotes the empirical measure associated to , defined by

 Pn(A)=1nn∑i=11{Xi∈A},

for every Borel subset of .

It has been shown by Ban05 that the standard -means clustering algorithm (see for instance Llo) can be generalized to general Bregman divergences. Bregman divergences are a broad class of dissimilarity measures indexed by strictly convex functions. Introduced by Br67, they are useful in a wide range of areas, among which statistical learning and data mining (Ban05, CBL), computational geometry Nie07, natural sciences, speech processing and information theory GBG. Squared Euclidean, Mahalanobis, Kullback-Leibler and distances are particular cases of Bregman divergences.

A Bregman divergence is not necessary a true metric, since it may be asymmetric or fail to satisfy the triangle inequality. However, Bregman divergences fulfill an interesting projection property which generalizes the Hilbert projection on a closed convex set, as shown in Br67. Moreover, Ban05 have established that there exists a relation between finite-dimensional Bregman divergences and exponential families. Although they are not true metrics, Bregman divergences satisfy some properties, such as non-negativity and separation, convexity in the first argument and linearity (see Ban05, Nie07). As Bregman divergences represent a natural tool to measure proximity between observations arising from various distributions, we use them here for clustering purpose.

The aim is to find a data-based codebook such that the clustering risk gets close to the optimal risk as the size of the data set grows.

The convergence properties of empirical distortion minimizers are now quite-well understood when the source distribution is assumed to have a finite support Linder02; Fischer10, even in infinite-dimensional cases Biau08; Levrard15

. In real data sets, the source signal is often corrupted by noise, violating in most cases the bounded support assumption. In practice, data are usually pre-processed via an outlier-removal step that requires an important quantity of expertise. From a theoretical point of view, this corruption issue might be tackled winsorizing or trimming classical estimators, or by introducing some new and robust estimators that can adapt heavy-tailed cases. Such estimators can be based on PAC-Bayesian or Median of Means techniques

Catoni18; Joly15; Lecue17

for instance. In a nutshell, these estimator succeed in achieving subGaussian deviation bounds under mild tail condition such as bounded variances and expectations.

In the clustering framework, up to our knowledge, the only theoretically grounded robust procedure is to be found in Cuesta97, where a trimmed -means heuristic is introduced. See also Garcia08

for heteroscedastic trimmed clustering. In some sense, this paper extends this trimming approach to the general framework of clustering with Bregman divergence. Our precise contributions are listed below.

### 1.1 Contribution

• We introduce and provide a robust Bregman clustering algorithm, based on the minimization of a trimmed distortion criterion .

• We give theoretical evidences that such a criterion is adapted to heavy-tailed cases, for instance proving that minimizers exist whenever

has a first-order moment.

• We investigate the finite-sample behavior of trimmed empirical distortion minimizers, and show that our procedure attains state-of-the-art guarantees in terms of convergence towards the optimal. Namely, we derive sub-Gaussian deviation bounds for the excess trimmed distortion under bounded variance condition.

We also give numerical illustrations of the aforementioned results.

## 2 Clustering with trimmed Bregman divergence

### 2.1 Bregman divergences and distortion

A Bregman divergence is defined as follows.

###### Definition 1.

Let be a strictly convex real-valued function defined on a convex set . The Bregman divergence is defined for all by

 dϕ(x,y)=ϕ(x)−ϕ(y)−⟨∇yϕ,x−y⟩.

For every codebook , we set:

 dϕ(x,c):=mini∈[[1,k]]dϕ(x,ci).

Observe that, since is strictly convex, for all , is non-negative and equal to zero if and only if (see (Ro, Theorem 25.1)). Note that by taking , one gets .

Let us present a few other examples.

#### Examples

1. Exponential loss: , from to , leads to .

2. Logistic loss: , from to , leads to .

3. Kullback-Leibler: , from the simplex to , leads to .

Let be a distribution on , and a codebook. The clustering performance of will be measured via its distortion, namely

 R(c)=Pdϕ(u,c),

where, for any function and measure , means integration of with respect to . Whenever only is available, we denote by the corresponding empirical distortion (associated with ). In the case where is a mixture of distributions belonging to an exponential family, there exists a natural choice of Bregman divergence, as detailed in Section 5. Standard Bregman clustering intends to infer a minimizer of via minimizing , and works well in the bounded support case (Fischer10).

### 2.2 Trimmed optimal codebooks

As for classical mean estimation, plain -means is sensitive to outliers. An attempt to address this issue is proposed in Cuesta97; Gordaliza91: for a trim level , both a codebook and a subset of -mass larger than (trimming set) are pursued. This heuristic can be generalized to our framework as follows.

For a measure on , we write (i.e., is a sub-measure of ) if for every Borel set , let denote the set , and . In analogy with Cuesta97 optimal trimming sets and codebooks are designed to achieve the optimal -trimmed -variation,

 Vk,h=inf~P∈P+hinfc∈Ωk~Pdϕ(u,c):=inf~P∈P+hinfc∈ΩkR(~P,c), (1)

that is the best possible -point distortion based on a normalized sub-measure of .

If is a fixed codebook, we denote by (resp ) the open (resp. closed) Bregman ball with radius , (resp. ), and by the smallest radius such that

 P(Bϕ(c,r))≤h≤P(¯Bϕ(c,r)). (2)

We also denote by the radius when the distribution is . Note that is the Bregman divergence to the nearest-neighbor of in . Now, if is defined as the set of measures in that coincides with on , with support included in . An easy result is

###### Lemma 2.

For all , , and ,

 R(~Pc,c)≤R(~P,c).

Equality holds if and only if .

This lemma is a straightforward generalisation of results in (Cuesta97, Lemma 2.1), Gordaliza91 or Merigot. As a consequence, for any codebook we may restrict our attention to sub-measures in .

###### Definition 3.

For , the -trimmed distortion of is defined by

 Rh(c)=hR(~Pc,c),

where .

Note that since does not depend on the whenever , is well-defined. As well, will denote the trimmed distortion corresponding to the distribution . Another simple property of sub-measures can be translated in terms of trimmed distortion.

###### Lemma 4.

Let and . Then

 Rh(c)h≤Rh′(c)h′.

Moreover, equality holds if and only if .

As well, this lemma generalize previous results in Cuesta97; Gordaliza91. Combining Lemma 2 and Lemma 4 ensure that the -trimmed -variation may be achieved via minimizing a -trimmed distortion.

###### Proposition 5.
 Vk,h=1hinfc∈Ω(k)Rh(c).

This proposition is an extension of (Cuesta97, Proposition 2.3). Therefore, a good trimmed codebook in terms of -variation should minimize .

###### Definition 6.

A -trimmed -optimal codebook is any element in .

As exposed below, under mild assumptions on and , such a trimmed -optimal codebook exists.

###### Theorem 7.

Assume that , is and strictly convex and , then the set is not empty.

Note that Theorem 7 only requires . This can be compared with the standard squared Euclidean distance case, where is required for to have minimizers. From now on we denote by a minimizer of , and by a minimizer of the empirical trimmed distortion .

### 2.3 Bregman-Voronoi cells and centroid condition

Similarly to the Euclidean case, the clustering associated with a codebook will be given by a tesselation of the ambient space. To be more precise, for and , the Bregman-Voronoi cell associated with is . Some further results on the geometry of Bregman Voronoi cells might be found in Nie07. Since the ’s do not form a partition, will denote a subset of so that is a partition of (for instance break the ties of the ’s with respect to the lexicographic rule).

###### Proposition 8.

Let and . Assume that for all , , and denote by the codebook that consists in local means of , i.e. . Then

 Rh(c)≥Rh(m),

with equality if and only if for all in , .

Proposition 8 emphasizes the key property that Bregman divergences are minimized by expectations (this is not the case for

distance for instance), and are the only loss functions satisfying that property; see

Ban051. Thus, a straightforward approach to minimize can be based on an iterative scheme, as depicted below.

## 3 Description of the algorithm

The following algorithm is inspired by the trimmed version of the Lloyd’s algorithm; see Cuesta97, but is also a generalization of the Bregman clustering algorithm for uniform finite-supported measures; see (Ban05, Algorithm 1). We assume that we observe , and that the mass parameter equals for some positive integer . We also let denote the subset of corresponding to the -th cluster.

###### Algorithm 1.

Bregman trimmed -means

• Input , ,

• Initialization Sample , ,… from without replacement, .

• Iterations Repeat until stabilization of .

• indices of the smallest values of , .

• For , .

• For , .

• Output , .

As for every EM-type algorithm, initialization is a crucial point that will not be investigated in this paper. Note however that Bregman adaptations of approximate minimization methods such as -means ++ Arthur07 could be an efficient way to address the initialization issue, at least in practice. An easy consequence of Proposition 8 for the empirical measure associated with is the following. For short we denote by the trimmed distortion associated with .

###### Proposition 9.

Algorithm 1 converges to a local minimum of the function .

It is worth mentioning that in full generality the output of Algorithm 1 is not a global minimizer of . However, it is likely that suitable clusterability assumptions as in Kumar10; Tang16; Levrard18 would lead to further guarantees on such an output.

## 4 Convergence of a trimmed empirical distortion minimizer

This section is devoted to investigate the convergence of a minimizer of the empirical trimmed distortion . Throughout this section is assumed to be , and . That is, the closure of the convex hull of the support of is a subset of the interior of .

We begin with a generalization of (Cuesta97, Theorem 3.4) whenever has a unique optimal trimmed codebook .

###### Theorem 10.

Assume that there exists a unique minimizer of . If is continuous and satisfies for some , then:

 limn→+∞D(^cn,c∗h)=0 a% .e.

where and denotes the set of all permutations of . Moreover,

 limn→+∞Rn,h(^cn)=Rh(c∗h) a.e.

Observe that slightly milder conditions are required for the trimmed distortion of to converge towards the optimal at a parametric rate.

###### Theorem 11.

Assume that , where . Then, for

large enough, with probability larger than

, we have

 Rh(^cn)−Rh(c∗h)≤CP√n(1+√x).

Note that Theorem 11 does not require a unique trimmed optimal codebook, and only requires an order moment condition for to achieve a sub-Gaussian rate in terms of trimmed distortion. This condition is in line with the order moment condition required in Joly15 for a robustified estimator of to achieve similar guarantees, as well as the finite-variance condition required in Catoni18 in a mean estimation framework. To derive results in expectation, a technical additional condition is needed.

###### Corollary 12.

If in addition there exist and a convex function such that

 supc∈Bϕ(c0,t)∩F0∥∇cϕ∥≤ψ(t),

with , and , then

 E(Rh(^cn)−Rh(c∗h))≤CP√n.

Note that such a function exists in most of the classical cases. However the moment condition required by Corollary 12 might be quite stronger than the order condition of Theorem 11, as illustrated below.

1. In the plain -means case and , we can choose and . Then the condition of Corollary 12 boils down to .

2. In the case where , , we may also choose , and . The condition of Corollary 12 may be written as , , and .

## 5 Numerical experiments

In this part, we apply Algorithm 1 to mixtures of distributions belonging to some exponential family. As presented in Ban05, a distribution from an exponential family may be associated to a Bregman divergence, by Legendre duality of convex functions. For a particular distribution, the corresponding Bregman divergence is more adapted for the clustering than other divergences.

Recall that an exponential family associated to a proper closed convex function defined on an open parameter space is a family of distributions , such that, for all , , defined on , is absolutely continuous with respect to some distribution , with Radon-Nikodym density defined for all by:

 pψ,θ(x)=exp(⟨x,θ⟩−ψ(θ)).

For this model, the expectation of may be expressed as . We define

 ϕ(μ)=supθ∈Θ{⟨μ,θ⟩−ψ(θ)}.

By Legendre duality, for all such that is defined, we get:

 ϕ(μ)=⟨θ(μ),μ⟩−ψ(θ(μ)),

with . The density of with respect to can be rewritten using the Bregman divergence associated to as follows:

 pψ,θ(x)=exp(−dϕ(x,μ)+ϕ(x)).

In the next experiments, we use Gaussian, Poisson, Binomial and Gamma mixture distributions and the corresponding Bregman divergences. Figure 1 presents the 4 densities together with the functions and , as well as the associated Bregman divergences .

### 5.1 Trimming parameter selection

To assess the good behavior of our procedure with respect to outliers, we propose to perform the same experiment as in Ban05, but with a level of noise. First, we propose in this section a heuristic to select the trimming parameter , that is, the number of points in the sample which are assigned to a cluster and not considered as noise. Our strategy is as follows: we let vary from to the sample size , plot the curve where denotes the optimal empirical distortion at trimming level , and choose by seeking for a cut-point on the curve. Indeed, when the parameter gets large enough, it is likely that the procedure begins to assign outliers to clusters, which dramatically deprecates the empirical distortion.

We consider mixture models of Gaussian, Poisson, Binomial and Gamma distributions. Each of the mixtures consists in three components with equal probabilities

. The means are set to 10, 20 and 40. We set the standard deviation for the Gaussian densities to 5, the number of trials for the Binomial distribution to 100, and the shape parameter for the Gamma distribution to 40. Each time, 100 points are sampled from the mixture distribution and 20 outliers are added, uniformly sampled on

.

We use Algorithm 1

for each of these noisy mixture distributions, using the corresponding divergence, and also make the same experiment with the Cauchy distribution, using the squared Euclidean norm, that is, the Bregman divergence associated with the Gaussian distribution.

We get for the Gaussian mixture, for the Poisson mixture and for the Cauchy mixture.The curves for selecting are depicted in Figure 2 for Gaussian, Poisson and Cauchy mixtures. In Figure 3, we have plotted the clustering results associated to the selected parameters for these distributions.

Note that the selection of the trimming parameter works well in the Cauchy example and the associated clustering seems also good. Thus, the method can adapt to mixture models that are not necessarily from distributions in exponential families.

### 5.2 Comparative performances of Bregman clustering for mixtures with noise

We build 100 120-samples with each time 100 points from the mixture distribution and 20 outliers uniformly sampled on , as in Section 5.1. Most of the time, with these amounts of points, our heuristic leads to a selection of around 103. In the next experiments, we choose . To assess the performance of a partitioning, we compute the normalized mutual information , as defined in SG, between the true clusters and the clusters obtained with Algorithm 1. More specifically, the points corresponding to noise are assigned to one same cluster, and we compute the renormalized mutual information between the true partition, made of the three clusters arising from the mixture and the cluster of noisy points, and the clustering obtained with the trimming procedure with parameter

. Confidence intervals for the normalized mutual information, with level around 5 percent, centered at the mean

, and with length , are derived in Figure 4. Note that, in general, the divergences associated to the distributions provide a better clustering, although it is less marked than for data without noise, as illustrated by the Binomial mixture case.

We conduct the same experiment for mixtures in dimension , with the same parameters. More precisely, is replaced by where the ’s are independent and from the same component of the original unidimensional mixture. We proceed as for dimension 1 with points corresponding to signal and additional points. The Bregman divergences used in the algorithm are the sum of the two coordinates of the corresponding 1-dimensional Bregman divergences. Using the same trimming parameter selection heuristic as above, we choose for the Gaussian, Poisson and Gamma cases, and for the Cauchy distribution and the Binomial case. The corresponding clustering results are depicted in Figure 5 for Gaussian, Poisson and Cauchy mixtures. In Figure 6, we computed the mutual information for the selected trimming parameter.

The heuristic for selecting the trimming set is not as effective as in dimension 1. It may come from the fact that the clutter noise is closer to the signal in dimension 2. The Bregman divergences for the Poisson, Binomial and Gamma distributions take into account the heteroscedasticity of the data, since the variance within a direction or depends on the mean within the direction. In the multidimensional Gaussian setting, Garcia08 proposed an algorithm, implemented in the R package tclust, that adapts to mixtures with different covariances matrices. It may be a future work to adapt our methods to non-Gaussian distributions with non-diagonal covariance matrices.

## 6 Proofs for Section 2.1

### 6.1 Intermediate results

The proof of Theorem 7, Theorem 10 and Theorem 11 make extensive use of the following technical lemmas. The first of them is a global upper bound on the radii , when is in a compact subset of .

###### Lemma 13.

Assume that is and . Then, for every and , there exists such that

 supc∈F0∩¯¯¯¯B(0,K),s≤hrs(c)≤r+.

As a consequence, if is a codebook with a codepoint satisfying and ,

 rs(c)≤r+.
###### Proof of Lemma 13.

Let be such that . Thus, if , . Since , and is , according to the mean value theorem, there exists such that,

 ∀x,y∈B(c,K+K+)∩F0,dϕ(x,y)≤C+∥x−y∥.

Therefore, for every , .

Hence .

At last, if is such that , then . Therefore , hence . ∎

Next, the following Lemma makes connections between the difference of Bregman divergences and distance between codebooks.

###### Lemma 14.

Assume that and is on . Then, for every , there exists such that for every and in , and ,

 ∣∣dϕ(x,c)−dϕ(x,c′)∣∣≤CKD(c,c′)(1+∥x∥),

where we recall that with the set of all permutations of .

###### Proof of Lemma 14.

The set is a convex compact subset of . Let and . Since and are , the mean value theorem yields that

 ∣∣dϕ(x,cj)−dϕ(x,c′j)∣∣ ≤CK∥cj−c′j∥(1+∥x∥),

for some constant . Thus,

 ∣∣dϕ(x,c)−dϕ(x,c′)∣∣≤CK(1+∥x∥)maxj∥cj−c′j∥

and

 ∣∣dϕ(x,c)−dϕ(x,c′)∣∣≤CK(1+∥x∥)D(c,c′).

We will also need a continuity result on the function , stated as the following Lemma.

###### Lemma 15.

Assume that , and is on . Then the map is continuous. Moreover, for every , and , there is such that

###### Proof of Lemma 15.

According to Lemma 2 and Lemma 14, for every ,

 Rh(c)−Rh(c′) ≤hPc′,h|dϕ(u,c)−dϕ(u,c′)| ≤CKD(c,c′)(1+P∥u∥),

for some . As a consequence, when . Now, note that for every ,

 Rh(c)−Rs(c) =Pdϕ(u,c)(1Bϕ(c,rh(c))(u)−1Bϕ(c,rs(c)))(u) ≤rh(c)(h−s)

Moreover, according to Lemma 13, for some , hence the result. ∎

### 6.2 Proof of Lemma 2

For , let denote the

-quantile of the random variable

for . That is,

 F−1c(u)=inf{r≥0∣P(Bϕ(c,r))>u}.

If denotes the -quantile of , for , it holds . Let be a uniform random variable on , then and have the same distribution. Thus, we may write:

 R(~Pc,c) =E~X∗dϕ(~X∗,c) =∫1u=0F−1c(hu)du

Let be a Borel probability measure on such that is a sub-measure of , and let denote the -quantile of for . Since , it holds that . Thus, we may write

 R(~P,c) =∫1u=0~F−1c(u)du ≥R(~Pc,c).

Note that equality holds if and only if for almost all , that is .

### 6.3 Proof of Lemma 4

Set , and recall that denote the -quantile of the random variable for and . Since is non-decreasing, we may write

 Rh(c)h =∫1u=0F−1c(hu)du ≤∫1u=0F−1c(h′u)du =Rh′(c)h′.

Equality holds if and only if for almost all . Since is non-decreasing and right-continuous, entails , for all , or, in other words, . From (2), it follows that . Conversely, equality holds when .

### 6.4 Proof of Theorem 7

In the following, we will use the more concise notation for the optimal risk . For , let denote the operator that maps a codebook onto its local means, that is

 Ts(c)j=Pcsu1Wj(c)(u)Pc,s(Wj(c)),

with and adopting the convention that whenever . The intuition behind the proof of Theorem 7 is that optimal codebooks can be found as iterations of , and that is a bounded operator whenever each Voronoi cell has enough weight. This idea is summarized by the following Lemma, that encompasses Theorem 7.

###### Lemma 16.

For every , if , then

 α:=minj=2,…,kR∗j−1,h−R∗j,h>0.

Moreover there exists