1 Introduction
Clustering is the problem of partitioning data into groups of similar points, so that the groups are as homogeneous and, at the same time, as well separated as possible (DHS). Since no labels are known in advance, clustering is an unsupervised learning task. To perform clustering, one needs a distance-like function to serve as a proximity measure between points. A famous clustering method is the standard k-means algorithm (see for instance Llo), based on the Euclidean distance. Let denote a sample of independent random observations with values in
, with the same distribution as a generic random vector
with distribution . For an integer , in order to group the data into meaningful classes, the k-means procedure minimizes the so-called empirical distortion over all possible cluster centers, or codebooks, , with the notation for . Here, denotes the empirical measure associated with , defined by
for every Borel subset of .
It has been shown by Ban05 that the standard k-means clustering algorithm (see for instance Llo) can be generalized to arbitrary Bregman divergences. Bregman divergences are a broad class of dissimilarity measures indexed by strictly convex functions. Introduced by Br67, they are useful in a wide range of areas, among which statistical learning and data mining (Ban05, CBL), computational geometry (Nie07), the natural sciences, speech processing and information theory (GBG). The squared Euclidean, Mahalanobis, Kullback-Leibler and distances are particular cases of Bregman divergences.
A Bregman divergence is not necessarily a true metric, since it may be asymmetric or fail to satisfy the triangle inequality. However, Bregman divergences fulfill an interesting projection property which generalizes the Hilbert projection onto a closed convex set, as shown in Br67. Moreover, Ban05 established a relation between finite-dimensional Bregman divergences and exponential families. Although they are not true metrics, Bregman divergences satisfy several useful properties, such as nonnegativity and separation, convexity in the first argument, and linearity (see Ban05, Nie07). As Bregman divergences are a natural tool to measure proximity between observations arising from various distributions, we use them here for clustering purposes.
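To make the failure of symmetry concrete, here is a minimal numerical check (the function name `bregman_kl` is ours) using the generalized Kullback-Leibler divergence, namely the Bregman divergence generated by the negative entropy:

```python
import numpy as np

def bregman_kl(x, y):
    """Bregman divergence generated by phi(x) = sum_i x_i log x_i
    (generalized Kullback-Leibler divergence on positive vectors)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(x * np.log(x / y) - x + y))

x = np.array([0.7, 0.3])
y = np.array([0.4, 0.6])

# Nonnegativity and separation hold, but symmetry fails:
print(bregman_kl(x, y), bregman_kl(y, x))  # two different positive values
```

Both directions are nonnegative and vanish only at x = y, but they generally disagree, so this divergence is not a metric.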
The aim is to find a data-based codebook such that the clustering risk gets close to the optimal risk as the size of the data set grows.
The convergence properties of empirical distortion minimizers are now quite well understood when the source distribution is assumed to have finite support (Linder02; Fischer10), even in infinite-dimensional cases (Biau08; Levrard15). In real data sets, however, the source signal is often corrupted by noise, which in most cases violates the bounded support assumption. In practice, data are usually preprocessed via an outlier-removal step that requires a significant amount of expertise. From a theoretical point of view, this corruption issue may be tackled by winsorizing or trimming classical estimators, or by introducing new, robust estimators that can adapt to heavy-tailed situations. Such estimators can be based on PAC-Bayesian or Median-of-Means techniques (Catoni18; Joly15; Lecue17), for instance. In a nutshell, these estimators achieve sub-Gaussian deviation bounds under mild tail conditions such as bounded variances and expectations.
In the clustering framework, to the best of our knowledge, the only theoretically grounded robust procedure is to be found in Cuesta97, where a trimmed k-means heuristic is introduced; see also Garcia08 for heteroscedastic trimmed clustering. In some sense, this paper extends this trimming approach to the general framework of clustering with Bregman divergences. Our precise contributions are listed below.
1.1 Contributions

We introduce a robust Bregman clustering algorithm, based on the minimization of a trimmed distortion criterion .

We give theoretical evidence that such a criterion is adapted to heavy-tailed situations, for instance by proving that minimizers exist whenever
has a first-order moment.

We investigate the finite-sample behavior of trimmed empirical distortion minimizers, and show that our procedure attains state-of-the-art guarantees in terms of convergence towards the optimum. Namely, we derive sub-Gaussian deviation bounds for the excess trimmed distortion under a bounded variance condition.
We also give numerical illustrations of the aforementioned results.
2 Clustering with trimmed Bregman divergence
2.1 Bregman divergences and distortion
A Bregman divergence is defined as follows.
Definition 1.
Let be a strictly convex real-valued function defined on a convex set . The Bregman divergence is defined for all by
For every codebook , we set:
Observe that, since is strictly convex, for all , is nonnegative and equal to zero if and only if (see (Ro, Theorem 25.1)). Note that taking yields .
Let us present a few other examples.
Examples

Exponential loss: , from to , leads to .

Logistic loss: , from to , leads to .

KullbackLeibler: , from the simplex to , leads to .
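Any strictly convex differentiable generator yields such a divergence via the definition above; the sketch below (helper names are ours) builds the divergence from a scalar generator and its derivative, and recovers the squared Euclidean case:

```python
import numpy as np

def bregman(phi, dphi):
    """Bregman divergence d(x, y) = phi(x) - phi(y) - phi'(y)(x - y),
    built from a scalar generator phi and its derivative dphi."""
    return lambda x, y: phi(x) - phi(y) - dphi(y) * (x - y)

# Exponential generator phi(x) = exp(x), as in the first example above.
d_exp = bregman(np.exp, np.exp)
# Squared Euclidean case: phi(x) = x^2 gives d(x, y) = (x - y)^2.
d_sq = bregman(lambda x: x ** 2, lambda x: 2 * x)

print(d_sq(3.0, 1.0))   # (3 - 1)^2 = 4.0
print(d_exp(1.0, 0.0))  # e - 2, strictly positive
```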
Let be a distribution on , and a codebook. The clustering performance of will be measured via its distortion, namely
where, for any function and measure , denotes the integral of with respect to . Whenever only is available, we denote by the corresponding empirical distortion (associated with ). In the case where is a mixture of distributions belonging to an exponential family, there is a natural choice of Bregman divergence, as detailed in Section 5. Standard Bregman clustering aims to infer a minimizer of by minimizing , and works well in the bounded support case (Fischer10).
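A minimal sketch of the empirical distortion of a codebook, in the one-dimensional squared Euclidean case (function names are ours):

```python
import numpy as np

def empirical_distortion(points, codebook, div):
    """Average divergence from each point to its nearest codepoint:
    (1/n) * sum_i min_j div(x_i, c_j)."""
    return float(np.mean([min(div(x, c) for c in codebook) for x in points]))

d_sq = lambda x, y: (x - y) ** 2  # squared Euclidean divergence
pts = np.array([0.0, 1.0, 10.0, 11.0])
print(empirical_distortion(pts, [0.5, 10.5], d_sq))  # every point is 0.5 away: 0.25
```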
2.2 Trimmed optimal codebooks
As with classical mean estimation, plain k-means is sensitive to outliers. An attempt to address this issue is proposed in Cuesta97; Gordaliza91: for a trim level , both a codebook and a subset of mass larger than (a trimming set) are sought. This heuristic generalizes to our framework as follows.
For a measure on , we write (i.e., is a submeasure of ) if for every Borel set . Let denote the set , and . In analogy with Cuesta97, optimal trimming sets and codebooks are designed to achieve the optimal trimmed variation,
(1) 
that is, the best possible point distortion based on a normalized submeasure of .
If is a fixed codebook, we denote by (resp. ) the open (resp. closed) Bregman ball with radius (resp. ), and by the smallest radius such that
(2) 
We also denote by the radius when the distribution is . Note that is the Bregman divergence from to its nearest neighbor in . Now, let be defined as the set of measures in that coincide with on , with support included in . An easy result is the following.
Lemma 2.
For all , , and ,
Equality holds if and only if .
This lemma is a straightforward generalization of results in (Cuesta97, Lemma 2.1), Gordaliza91 or Merigot. As a consequence, for any codebook , we may restrict our attention to submeasures in .
Definition 3.
For , the trimmed distortion of is defined by
where .
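In empirical terms, when the trim level corresponds to keeping h of the n points, the trimmed distortion of a codebook can be sketched as follows (names ours): the optimal submeasure simply keeps the h points with the smallest divergence to the codebook.

```python
import numpy as np

def trimmed_distortion(points, codebook, div, h):
    """Empirical trimmed distortion: average of the h smallest
    point-to-codebook divergences (the n - h farthest points are trimmed)."""
    nearest = np.sort([min(div(x, c) for c in codebook) for x in points])
    return float(np.mean(nearest[:h]))

d_sq = lambda x, y: (x - y) ** 2
pts = np.array([0.0, 1.0, 100.0])  # 100.0 plays the role of an outlier
print(trimmed_distortion(pts, [0.5], d_sq, h=2))  # outlier trimmed: 0.25
```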
Note that, since does not depend on the whenever , is well-defined. Likewise, will denote the trimmed distortion corresponding to the distribution . Another simple property of submeasures translates in terms of trimmed distortion as follows.
Lemma 4.
Let and . Then
Moreover, equality holds if and only if .
Again, this lemma generalizes previous results in Cuesta97; Gordaliza91. Combining Lemma 2 and Lemma 4 ensures that the trimmed variation may be achieved by minimizing a trimmed distortion.
Proposition 5.
This proposition extends (Cuesta97, Proposition 2.3). Therefore, a good trimmed codebook in terms of variation should minimize .
Definition 6.
A trimmed optimal codebook is any element in .
As shown below, under mild assumptions on and , such a trimmed optimal codebook exists.
Theorem 7.
Assume that , is and strictly convex, and ; then the set is not empty.
Note that Theorem 7 only requires . This can be compared with the standard squared Euclidean distance case, where is required for to have minimizers. From now on, we denote by a minimizer of , and by a minimizer of the empirical trimmed distortion .
2.3 BregmanVoronoi cells and centroid condition
Similarly to the Euclidean case, the clustering associated with a codebook is given by a tessellation of the ambient space. More precisely, for and , the Bregman-Voronoi cell associated with is . Further results on the geometry of Bregman-Voronoi cells may be found in Nie07. Since the ’s do not form a partition, will denote a subset of such that is a partition of (for instance, break ties between the ’s according to the lexicographic rule).
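The assignment step induced by these cells can be sketched as follows (names ours); breaking ties toward the smallest index, as `np.argmin` does, plays the role of the lexicographic rule:

```python
import numpy as np

def voronoi_labels(points, codebook, div):
    """Assign each point to the Bregman-Voronoi cell of its closest codepoint.
    np.argmin returns the first minimizer, so ties are broken toward the
    smallest index and the cells form a genuine partition."""
    return np.array([int(np.argmin([div(x, c) for c in codebook])) for x in points])

d_sq = lambda x, y: (x - y) ** 2
labels = voronoi_labels(np.array([0.0, 5.0, 10.0]), [0.0, 10.0], d_sq)
print(labels)  # [0 0 1]: the tied midpoint 5.0 goes to cell 0
```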
Proposition 8.
Let and . Assume that for all , , and denote by the codebook consisting of the local means of , i.e., . Then
with equality if and only if for all in , .
Proposition 8 emphasizes the key property that Bregman divergences are minimized by expectations (this is not the case for the distance, for instance), and that they are the only loss functions satisfying this property; see Ban05. Thus, a straightforward approach to minimizing can be based on an iterative scheme, as depicted below.
3 Description of the algorithm
The following algorithm is inspired by the trimmed version of Lloyd’s algorithm (see Cuesta97), but is also a generalization of the Bregman clustering algorithm for uniform finitely supported measures (see Ban05, Algorithm 1).
We assume that we observe , and that the mass parameter equals for some positive integer . We also let denote the subset of corresponding to the -th cluster.
Algorithm 1.
Bregman trimmed means

Input , ,

Initialization Sample , ,… from without replacement, .

Iterations Repeat until stabilization of .

indices of the smallest values of , .

For , .

For , .


Output , .
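A compact sketch of this iteration for one-dimensional data (names ours; initialization, a crucial step as noted below, is simply supplied by the caller here):

```python
import numpy as np

def trimmed_bregman_kmeans(X, init, h, div, n_iter=100):
    """Sketch of the trimmed Bregman k-means iteration on 1-d data.
    Each pass keeps the h points with smallest divergence to the current
    codebook, partitions them into Bregman-Voronoi cells, and moves each
    codepoint to the mean of its cell (means minimize Bregman distortion)."""
    c = np.array(init, dtype=float)
    for _ in range(n_iter):
        d = np.array([[div(x, cj) for cj in c] for x in X])  # n x k divergences
        keep = np.argsort(d.min(axis=1))[:h]                 # trim the n - h farthest
        labels = d[keep].argmin(axis=1)
        new_c = c.copy()
        for j in range(len(c)):
            cell = X[keep][labels == j]
            if cell.size:
                new_c[j] = cell.mean()
        if np.allclose(new_c, c):  # stabilization of the codebook
            break
        c = new_c
    return np.sort(c)

d_sq = lambda x, y: (x - y) ** 2
X = np.concatenate([np.zeros(10), np.full(10, 10.0), [1000.0, -1000.0]])
# h = 20 discards the two gross outliers; init taken from two sample points.
est = trimmed_bregman_kmeans(X, init=[X[0], X[10]], h=20, div=d_sq)
print(est)  # converges to the two cluster means, 0.0 and 10.0
```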
As for every EM-type algorithm, initialization is a crucial point that will not be investigated in this paper. Note, however, that Bregman adaptations of approximate minimization methods such as k-means++ (Arthur07) could be an efficient way to address the initialization issue, at least in practice. An easy consequence of Proposition 8 for the empirical measure associated with is the following. For short, we denote by the trimmed distortion associated with .
Proposition 9.
Algorithm 1 converges to a local minimum of the function .
It is worth mentioning that, in full generality, the output of Algorithm 1 is not a global minimizer of . However, it is likely that suitable clusterability assumptions, as in Kumar10; Tang16; Levrard18, would lead to further guarantees on such an output.
4 Convergence of a trimmed empirical distortion minimizer
This section is devoted to the convergence of a minimizer of the empirical trimmed distortion . Throughout this section, is assumed to be , and . That is, the closure of the convex hull of the support of is a subset of the interior of .
We begin with a generalization of (Cuesta97, Theorem 3.4) whenever has a unique optimal trimmed codebook .
Theorem 10.
Assume that there exists a unique minimizer of . If is continuous and satisfies for some , then:
where and denotes the set of all permutations of . Moreover,
Observe that slightly milder conditions are required for the trimmed distortion of to converge towards the optimal at a parametric rate.
Theorem 11.
Note that Theorem 11 does not require a unique trimmed optimal codebook, and only requires an order moment condition for to achieve a sub-Gaussian rate in terms of trimmed distortion. This condition is in line with the order moment condition required in Joly15 for a robustified estimator of to achieve similar guarantees, as well as with the finite-variance condition required in Catoni18 in a mean estimation framework. To derive results in expectation, an additional technical condition is needed.
Corollary 12.
If in addition there exist and a convex function such that
with , and , then
5 Numerical experiments
In this section, we apply Algorithm 1 to mixtures of distributions belonging to exponential families. As presented in Ban05, a distribution from an exponential family may be associated with a Bregman divergence through Legendre duality of convex functions. For a given distribution, the corresponding Bregman divergence is better suited for clustering than other divergences.
Recall that an exponential family associated with a proper closed convex function defined on an open parameter space is a family of distributions such that, for all , , defined on , is absolutely continuous with respect to some distribution , with Radon-Nikodym density defined for all by:
For this model, the expectation of may be expressed as . We define
By Legendre duality, for all such that is defined, we get:
with . The density of with respect to can be rewritten using the Bregman divergence associated to as follows:
In the following experiments, we use Gaussian, Poisson, Binomial and Gamma mixture distributions and the corresponding Bregman divergences. Figure 1 presents the four densities together with the functions and , as well as the associated Bregman divergences .
Table: the Gaussian, Poisson, Binomial and Gamma distributions together with their associated generator functions and Bregman divergences (see Figure 1).
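As a standard worked instance of this duality (not taken from the text): for the Gaussian family with unit variance, the construction recovers half the squared Euclidean distance.

```latex
p_\theta(x) \propto \exp\!\Big(\theta x - \frac{\theta^2}{2}\Big)
\quad\Rightarrow\quad
\psi(\theta) = \frac{\theta^2}{2}, \qquad \mu = \psi'(\theta) = \theta,

\phi(\mu) = \sup_\theta \,\{\theta\mu - \psi(\theta)\} = \frac{\mu^2}{2},
\qquad
d_\phi(x,\mu) = \frac{x^2}{2} - \frac{\mu^2}{2} - \mu(x-\mu) = \frac{(x-\mu)^2}{2}.
```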
5.1 Trimming parameter selection
To assess the good behavior of our procedure with respect to outliers, we perform the same experiment as in Ban05, but with an additional level of noise. First, we propose in this section a heuristic to select the trimming parameter , that is, the number of sample points that are assigned to a cluster rather than considered as noise. Our strategy is as follows: we let vary from to the sample size , plot the curve , where denotes the optimal empirical distortion at trimming level , and choose by seeking a cutpoint on the curve. Indeed, once the parameter gets large enough, the procedure is likely to begin assigning outliers to clusters, which dramatically deteriorates the empirical distortion.
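This cutpoint heuristic can be illustrated as follows (a simplified sketch, names ours): here the codebook is held fixed at the true means instead of rerunning the clustering for each trim level, which is enough to reveal the jump in the distortion curve once outliers start being kept.

```python
import numpy as np

def trimmed_curve(X, codebook, div, q_grid):
    """Empirical trimmed distortion as a function of the number q of points
    kept. The codebook is held fixed here for simplicity; the heuristic in
    the text reruns the clustering algorithm for each q."""
    nearest = np.sort([min(div(x, c) for c in codebook) for x in X])
    return np.array([nearest[:q].mean() for q in q_grid])

rng = np.random.default_rng(0)
signal = rng.normal(loc=np.repeat([10.0, 20.0, 40.0], 34), scale=5.0)  # 102 signal points
noise = rng.uniform(-200.0, 200.0, size=20)                            # 20 outliers
X = np.concatenate([signal, noise])

d_sq = lambda x, y: (x - y) ** 2
curve = trimmed_curve(X, [10.0, 20.0, 40.0], d_sq, q_grid=range(80, 123))
# The curve grows slowly while q only counts signal points, then jumps
# sharply once trimmed-in outliers start inflating the distortion.
```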
We consider mixture models of Gaussian, Poisson, Binomial and Gamma distributions. Each mixture consists of three components with equal probabilities . The means are set to 10, 20 and 40. We set the standard deviation of the Gaussian densities to 5, the number of trials of the Binomial distribution to 100, and the shape parameter of the Gamma distribution to 40. Each time, 100 points are sampled from the mixture distribution and 20 outliers are added, uniformly sampled on . We apply Algorithm 1 to each of these noisy mixture distributions, using the corresponding divergence, and we also run the same experiment with the Cauchy distribution, using the squared Euclidean norm, that is, the Bregman divergence associated with the Gaussian distribution.
We get for the Gaussian mixture, for the Poisson mixture and for the Cauchy mixture. The curves for selecting are depicted in Figure 2 for the Gaussian, Poisson and Cauchy mixtures. In Figure 3, we plot the clustering results associated with the selected parameters for these distributions.
Note that the selection of the trimming parameter works well in the Cauchy example, and the associated clustering also looks satisfactory. Thus, the method can adapt to mixture models whose components do not necessarily come from exponential families.
5.2 Comparative performances of Bregman clustering for mixtures with noise
We build 100 samples of size 120, each consisting of 100 points from the mixture distribution and 20 outliers uniformly sampled on , as in Section 5.1. Most of the time, with these numbers of points, our heuristic leads to a selection of around 103. In the following experiments, we choose . To assess the performance of a partition, we compute the normalized mutual information, as defined in SG, between the true clusters and the clusters obtained with Algorithm 1. More specifically, the points corresponding to noise are all assigned to a single cluster, and we compute the normalized mutual information between the true partition, made of the three clusters arising from the mixture together with the cluster of noisy points, and the clustering obtained with the trimming procedure with parameter . Confidence intervals for the normalized mutual information, at level around 5 percent, centered at the mean , and of length , are reported in Figure 4. Note that, in general, the divergences associated with the distributions provide a better clustering, although the effect is less marked than for data without noise, as illustrated by the Binomial mixture case.
Table: normalized mutual information for the Gaussian, Poisson, Binomial, Gamma and Cauchy models (see Figure 4).
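One way to compute the normalized mutual information used above is sketched below (names ours); we use the normalization I(A;B)/sqrt(H(A)H(B)), which we take to be the one of SG. It equals 1 for identical partitions and is invariant to relabeling of the clusters.

```python
import numpy as np

def nmi(a, b):
    """Normalized mutual information NMI(A, B) = I(A; B) / sqrt(H(A) H(B)),
    computed from the contingency of two label vectors."""
    a, b = np.asarray(a), np.asarray(b)
    pa = np.array([np.mean(a == u) for u in np.unique(a)])
    pb = np.array([np.mean(b == v) for v in np.unique(b)])
    ha, hb = -np.sum(pa * np.log(pa)), -np.sum(pb * np.log(pb))
    i = 0.0
    for u, pu in zip(np.unique(a), pa):
        for v, pv in zip(np.unique(b), pb):
            puv = np.mean((a == u) & (b == v))
            if puv > 0:
                i += puv * np.log(puv / (pu * pv))
    return i / np.sqrt(ha * hb)

# True partition: three mixture components plus one shared noise cluster.
truth = np.array([0] * 34 + [1] * 34 + [2] * 34 + [3] * 20)
print(nmi(truth, truth))            # 1.0 for identical partitions
print(nmi(truth, (truth + 1) % 4))  # still 1.0: NMI ignores label names
```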
We conduct the same experiment for mixtures in dimension , with the same parameters. More precisely, is replaced by , where the ’s are independent and drawn from the same component of the original one-dimensional mixture. We proceed as in dimension 1, with points corresponding to signal and additional points. The Bregman divergences used in the algorithm are the sums over the two coordinates of the corresponding one-dimensional Bregman divergences. Using the same trimming parameter selection heuristic as above, we choose for the Gaussian, Poisson and Gamma cases, and for the Cauchy distribution and the Binomial case. The corresponding clustering results are depicted in Figure 5 for the Gaussian, Poisson and Cauchy mixtures. In Figure 6, we report the mutual information for the selected trimming parameter.
The heuristic for selecting the trimming set is not as effective as in dimension 1. This may come from the fact that the clutter noise is closer to the signal in dimension 2. The Bregman divergences for the Poisson, Binomial and Gamma distributions take into account the heteroscedasticity of the data, since the variance along a direction depends on the mean along that direction. In the multidimensional Gaussian setting, Garcia08 proposed an algorithm, implemented in the R package tclust, that adapts to mixtures with different covariance matrices. Adapting our methods to non-Gaussian distributions with non-diagonal covariance matrices is left for future work.
Table: normalized mutual information in dimension 2 for the Gaussian, Poisson, Binomial, Gamma and Cauchy models (see Figure 6).
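The coordinate-wise construction used in these two-dimensional experiments can be sketched as follows (names ours); summing one-dimensional Bregman divergences across coordinates is itself the Bregman divergence of the separable generator:

```python
def product_divergence(div1d):
    """Divergence on R^2 obtained by summing a one-dimensional Bregman
    divergence coordinate by coordinate; this is again a Bregman divergence,
    generated by phi(x1) + phi(x2)."""
    return lambda x, y: sum(div1d(xi, yi) for xi, yi in zip(x, y))

d_sq = lambda x, y: (x - y) ** 2
d2 = product_divergence(d_sq)
print(d2((3.0, 4.0), (0.0, 0.0)))  # 9 + 16 = 25.0, the squared Euclidean norm
```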
6 Proofs for Section 2.1
6.1 Intermediate results
The proofs of Theorem 7, Theorem 10 and Theorem 11 make extensive use of the following technical lemmas. The first is a global upper bound on the radii when lies in a compact subset of .
Lemma 13.
Assume that is and . Then, for every and , there exists such that
As a consequence, if is a codebook with a codepoint satisfying and ,
Proof of Lemma 13.
Let be such that . Thus, if , . Since , and is , according to the mean value theorem, there exists such that,
Therefore, for every , .
Hence .
Finally, if is such that , then . Therefore , hence . ∎
Next, the following lemma connects differences of Bregman divergences with distances between codebooks.
Lemma 14.
Assume that and is on . Then, for every , there exists such that for every and in , and ,
where we recall that with the set of all permutations of .
Proof of Lemma 14.
The set is a convex compact subset of . Let and . Since and are , the mean value theorem yields that
for some constant . Thus,
and
∎
We will also need a continuity result on the function , stated as the following lemma.
Lemma 15.
Assume that , and is on . Then the map is continuous. Moreover, for every , and , there is such that
6.2 Proof of Lemma 2
If denotes the quantile of , for , it holds . Let be a uniform random variable on , then and have the same distribution. Thus, we may write:
Let be a Borel probability measure on such that is a submeasure of , and let denote the quantile of for . Since , it holds that . Thus, we may write
Note that equality holds if and only if for almost all , that is .
6.3 Proof of Lemma 4
Set , and recall that denotes the quantile of the random variable for and . Since is nondecreasing, we may write
Equality holds if and only if for almost all . Since is nondecreasing and right-continuous, entails , for all , or, in other words, . From (2), it follows that . Conversely, equality holds when .
6.4 Proof of Theorem 7
In the following, we will use the more concise notation for the optimal risk . For , let denote the operator that maps a codebook onto its local means, that is
with , and with the convention that whenever . The intuition behind the proof of Theorem 7 is that optimal codebooks can be found as iterations of , and that is a bounded operator whenever each Voronoi cell has enough weight. This idea is summarized by the following lemma, which encompasses Theorem 7.
Lemma 16.
For every , if , then
Moreover there exists